DornerWorks

Machine Learning Systems Made More Accessible with Xilinx DNNDK

Posted on June 3, 2019 by David Norwood

The Deep Neural Network Development Kit streamlines networks to fit on FPGAs and embedded systems without sacrificing performance

 
When it comes to the size of your neural network, bigger isn’t always better.

Machine learning (ML) applications are helping automotive companies develop solutions for vehicle detection and systems that can identify people, objects, and animals, but the sheer size and scope of these systems is often prohibitive for those who need real-time response in a confined space.

There are many excellent devices available for edge-deployed ML solutions. These devices can, however, be hindered by the more rigid architectures they employ when compared to an FPGA. Many solutions also have to contend not only with the ML portion of the design, but also with many other tasks, such as:

  • Interfacing to either standardized or custom ICD-defined interfaces
  • Risk realization; for example, late redefinition of design block APIs
  • Mitigating end-of-life memory component issues
  • Processing of sensor input data
  • Processing user input
  • Driving displays and other annunciators

Another often overlooked challenge of AI/ML designs is how to define or redefine the architectural boundaries of the solution. Xilinx Vivado SDSoC allows software engineers to explore hardware acceleration of key modules with the click of a mouse. The Xilinx DNNDK provides many utilities to explore various ML approaches and optimization techniques.

An FPGA-based approach also allows you to stay on the same platform from prototype to deployment. These reasons and more are why you should explore the overall benefits of a Xilinx UltraScale+ SoC for your ML project.

The Deep Neural Network Development Kit (DNNDK) from Xilinx further lowers the barriers to successful ML development. Initially developed by DeePhi, a Beijing-based ML start-up acquired by Xilinx in 2018, the DNNDK takes in neural network models generated in Caffe, TensorFlow, or MXNet, reduces network complexity by pruning synapses and neurons, and quantizes the weights from 32-bit floating point to 8-bit integers.
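To make the two compression steps concrete, here is a minimal sketch in plain Python. It uses generic magnitude pruning and symmetric linear quantization, which are textbook techniques chosen for illustration; this article does not describe the algorithms the DNNDK actually applies, and the function names and sample weights below are assumptions.

```python
def prune_by_magnitude(weights, keep_ratio):
    """Zero out the smallest-magnitude weights, keeping roughly keep_ratio of them."""
    keep = max(1, round(len(weights) * keep_ratio))
    # The threshold is the magnitude of the keep-th largest weight.
    threshold = sorted((abs(w) for w in weights), reverse=True)[keep - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def quantize_int8(weights):
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the 8-bit integers back to approximate float values."""
    return [v * scale for v in q]


# Hypothetical weights for a single layer (illustrative values only).
weights = [0.82, -0.03, 0.40, 0.007, -0.65, 0.02, -0.29, 0.11]

pruned = prune_by_magnitude(weights, keep_ratio=0.5)
q, scale = quantize_int8(pruned)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(pruned, restored))

print(pruned)     # half of the weights are now exactly zero
print(q)          # [127, 0, 62, 0, -101, 0, -45, 0]
print(max_error)  # bounded by scale / 2
```

In this sketch each surviving weight is stored as one signed byte plus a shared scale factor, and the rounding error per weight is at most half a quantization step.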


Figure 1: Neural Networks are compressed through pruning, and reweighted in an 8-bit model.



Figure 2: Quantization transfers a full precision network to low-bit networks. Pruning transfers a dense model to sparse model.




The accuracy of the model is reduced by about 1 percent, but the resulting network is streamlined enough to fit on an FPGA with minimal porting effort, enabling embedded systems to run ML applications without being held back by computational bottlenecks.
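A rough back-of-the-envelope calculation shows why the compressed network is so much easier to fit on an embedded device. The parameter count and pruning ratio below are hypothetical, and the estimate ignores any bookkeeping overhead for storing a sparse model.

```python
def model_size_mb(num_params, bits_per_weight, kept_fraction=1.0):
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * kept_fraction * bits_per_weight / 8 / 1e6


params = 25_000_000  # hypothetical parameter count

fp32_size = model_size_mb(params, 32)                          # full precision
int8_size = model_size_mb(params, 8)                           # quantized only
int8_pruned_size = model_size_mb(params, 8, kept_fraction=0.5)  # quantized and half pruned

print(fp32_size, int8_size, int8_pruned_size)  # 100.0 25.0 12.5
```

Under these assumptions, quantization alone cuts weight storage by 4x, and pruning half of the weights doubles that saving.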


Figure 3: The DNNC maps the sparse neural network to DPU instructions that can fit on an FPGA.




AI inference models can be successfully implemented using the DNNDK on edge devices like the Xilinx Zynq MPSoC, as well as cloud-based data center systems like Xilinx Alveo accelerator cards. The implementation process looks like this (Figure 4):

Figure 4: DNNDK workflow.
  1. Model compression: Using a small input training set, the network model is compressed and quantized to an INT8 representation.
  2. Model compilation: The model is compiled and assembled into ELF files that run on the deep learning processor unit (DPU), and any unsupported elements are identified. All compilation and building is performed on the host machine.
  3. Build the program with DNNDK APIs: The DPU kernels assist in building the application, which manages inputs and outputs, the kernel life cycle, and tasks.
  4. Hybrid DPU compilation: The hybrid compiler compiles the CPU code and links it with the DPU ELF files to produce the hybrid executable.
  5. Run the hybrid DPU executable.

The DNNDK tools are split between the host machine and the target board.

Tools that run on the host machine:

    • DECENT: Deep compression tool.
    • DNNC: Deep neural network compiler, which runs alongside the deep neural network assembler (DNNAS) to generate ELF files for the DPU.

Tools that run on the target board:

    • DExplorer: DPU reporting tool.
    • DSight: Builds data visualizations with input from Dtracer.
    • N2Cube: DPU runtime engine that handles DNNDK application loading, scheduling, and resource allocation. Contains the DPU driver, DPU loader, and DPU tracer.
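Step 3 of the workflow says the application manages inputs and outputs and the kernel life cycle. The sketch below shows one way to structure that life cycle so the device and kernel are always released, even when inference fails. It does not use the DNNDK API: `FakeDpuRuntime`, `dpu_session`, and the kernel name `example_net` are hypothetical stand-ins that only record calls, and task handling is folded into `run` for brevity.

```python
from contextlib import contextmanager


class FakeDpuRuntime:
    """Hypothetical stand-in for a DPU runtime; records calls instead of touching hardware."""

    def __init__(self):
        self.log = []

    def open(self):
        self.log.append("open")

    def close(self):
        self.log.append("close")

    def load_kernel(self, name):
        self.log.append(f"load_kernel:{name}")
        return name

    def destroy_kernel(self, kernel):
        self.log.append(f"destroy_kernel:{kernel}")

    def run(self, kernel, inputs):
        self.log.append(f"run:{kernel}")
        # Placeholder "inference": return the index of the largest input value.
        return max(range(len(inputs)), key=inputs.__getitem__)


@contextmanager
def dpu_session(runtime, kernel_name):
    """Open the device and load a kernel; always release both, even on error."""
    runtime.open()
    try:
        kernel = runtime.load_kernel(kernel_name)
        try:
            yield kernel
        finally:
            runtime.destroy_kernel(kernel)
    finally:
        runtime.close()


runtime = FakeDpuRuntime()
with dpu_session(runtime, "example_net") as kernel:
    result = runtime.run(kernel, [0.1, 0.7, 0.2])

print(result)       # 1
print(runtime.log)  # open, load, run, destroy, close in that order
```

Wrapping setup and teardown in a context manager keeps the release order the mirror image of the acquisition order, which is the property a real application would need to preserve with whatever runtime calls it uses.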

DeePhi offers example reference designs for the Xilinx ZCU102, ZCU104, and Ultra96 development boards, and DornerWorks can guide your company to a successful AI/ML implementation using any of them. DornerWorks’ embedded engineers David Norwood and Corrin Meyer attended the Xilinx DNNDK workshop in spring 2019 and integrated this solution on MPSoC development boards alongside Xilinx’s own employees and FAEs.

Norwood and Meyer have experience working with neural network models on embedded systems, and are now able to build even lighter-weight systems using the DNNDK.

With AI/ML applications running on the programmable FPGA logic, product developers can leverage higher throughput and lower latency than ever before, with response times of less than 3 ms. In the embedded space, Xilinx’s DNNDK is now being used to help companies develop and deploy products for markets that hold to time-critical standards like the IEEE 802.1 standards for Time-Sensitive Networking (TSN).

Neural network development can be complex, but you don’t have to master it on your own. Schedule a consultation with DornerWorks today and start building ML and AI applications that lead the market.

by David Norwood
Embedded Engineer
David Norwood is an embedded engineer at DornerWorks.