[Computer Science] [2016.08] [Source Code Included] ZynqNet: An FPGA-Accelerated Embedded Convolutional Neural Network

This is a master's thesis by David Gschwend at ETH Zurich (the Swiss Federal Institute of Technology in Zurich), 102 pages long.

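From medical diagnostics to autonomous vehicles, image understanding is becoming an important feature in a growing number of applications. Many of these applications need embedded solutions that can be integrated into existing systems with tight real-time and power constraints. Convolutional neural networks (CNNs) currently achieve record-breaking accuracy in all image understanding benchmarks, but their computational complexity is very high. Embedded CNNs therefore need computing platforms that are small and efficient yet powerful.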

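This master's thesis explores the potential of FPGA-based CNN acceleration and demonstrates a fully functional proof-of-concept CNN implementation on a Zynq System-on-Chip. The ZynqNet Embedded CNN is designed for image classification on ImageNet. It consists of two parts: ZynqNet CNN, an optimized and customized CNN topology, and the ZynqNet FPGA Accelerator, an FPGA-based architecture for evaluating that topology.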

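ZynqNet CNN is a highly efficient CNN topology. Detailed analysis and optimization of prior topologies with the custom-designed Netscope CNN Analyzer yielded a CNN that reaches 84.5% top-5 accuracy at a computational complexity of only 530 million multiply-accumulate operations. The topology is highly regular and consists solely of convolutional layers, ReLU nonlinearities, and one global pooling layer, which makes the CNN well suited to the FPGA accelerator.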

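The ZynqNet FPGA Accelerator enables efficient evaluation of ZynqNet CNN. It accelerates the full network using a nested-loop algorithm that minimizes the number of arithmetic operations and memory accesses. The accelerator was synthesized with High-Level Synthesis for the Xilinx Zynq XC-7Z045 and reaches a clock frequency of 200 MHz at a device utilization of 80% to 90%.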

Image Understanding is becoming a vital feature in ever more applications ranging from medical diagnostics to autonomous vehicles. Many applications demand for embedded solutions that integrate into existing systems with tight real-time and power constraints. Convolutional Neural Networks (CNNs) presently achieve record-breaking accuracies in all image understanding benchmarks, but have a very high computational complexity. Embedded CNNs thus call for small and efficient, yet very powerful computing platforms.

This master thesis explores the potential of FPGA-based CNN acceleration and demonstrates a fully functional proof-of-concept CNN implementation on a Zynq System-on-Chip. The ZynqNet Embedded CNN is designed for image classification on ImageNet and consists of ZynqNet CNN, an optimized and customized CNN topology, and the ZynqNet FPGA Accelerator, an FPGA-based architecture for its evaluation.

ZynqNet CNN is a highly efficient CNN topology. Detailed analysis and optimization of prior topologies using the custom-designed Netscope CNN Analyzer have enabled a CNN with 84.5 % top-5 accuracy at a computational complexity of only 530 million multiply accumulate operations. The topology is highly regular and consists exclusively of convolutional layers, ReLU nonlinearities and one global pooling layer. The CNN fits ideally onto the FPGA accelerator.
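
To make the "multiply accumulate operations" figure more concrete, here is a minimal Python sketch of how the cost of a single convolutional layer is commonly counted: one multiply-accumulate per kernel weight, per input channel, per output channel, per output pixel. The function names and the layer dimensions in the usage line are illustrative assumptions. They are not taken from the thesis and do not describe an actual ZynqNet layer.

```python
def conv_output_size(in_size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution along one dimension."""
    return (in_size + 2 * pad - kernel) // stride + 1


def conv_maccs(in_h, in_w, c_in, c_out, kernel, stride=1, pad=0):
    """Multiply-accumulate count of one convolutional layer.

    Assumes a square kernel and a dense (non-grouped) convolution.
    """
    out_h = conv_output_size(in_h, kernel, stride, pad)
    out_w = conv_output_size(in_w, kernel, stride, pad)
    return out_h * out_w * kernel * kernel * c_in * c_out


# Hypothetical layer: 224x224 RGB input, 64 filters of size 3x3, stride 2, pad 1.
print(conv_maccs(224, 224, 3, 64, kernel=3, stride=2, pad=1))  # 21676032
```

Summing this count over every convolutional layer of a topology gives a network-level figure of the kind quoted in the abstract.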

The ZynqNet FPGA Accelerator allows an efficient evaluation of ZynqNet CNN. It accelerates the full network based on a nested-loop algorithm which minimizes the number of arithmetic operations and memory accesses. The FPGA accelerator has been synthesized using High Level Synthesis for the Xilinx Zynq XC-7Z045, and reaches a clock frequency of 200 MHz with a device utilization of 80 % to 90 %.
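
The abstract does not spell out the nested-loop algorithm, so the following Python sketch only shows the textbook loop nest for a direct convolution followed by ReLU, the two layer types named above. It is not the thesis's loop order or its High Level Synthesis code, and it makes no attempt at the operation and memory-access minimization the abstract describes. The function name `conv2d_relu` and the data layout (`image[channel][y][x]`, `weights[out_channel][in_channel][ky][kx]`) are assumptions made for illustration.

```python
def conv2d_relu(image, weights, bias, stride=1, pad=0):
    """Direct convolution with zero padding, followed by ReLU.

    image:   nested list indexed [c_in][y][x]
    weights: nested list indexed [c_out][c_in][ky][kx] (square kernel)
    bias:    list with one value per output channel
    """
    c_in = len(image)
    in_h, in_w = len(image[0]), len(image[0][0])
    c_out = len(weights)
    k = len(weights[0][0])
    out_h = (in_h + 2 * pad - k) // stride + 1
    out_w = (in_w + 2 * pad - k) // stride + 1

    output = [[[0.0] * out_w for _ in range(out_h)] for _ in range(c_out)]
    for co in range(c_out):                  # output channels
        for oy in range(out_h):              # output rows
            for ox in range(out_w):          # output columns
                acc = bias[co]
                for ci in range(c_in):       # input channels
                    for ky in range(k):      # kernel rows
                        for kx in range(k):  # kernel columns
                            iy = oy * stride + ky - pad
                            ix = ox * stride + kx - pad
                            # Positions outside the image are zero padding.
                            if 0 <= iy < in_h and 0 <= ix < in_w:
                                acc += image[ci][iy][ix] * weights[co][ci][ky][kx]
                output[co][oy][ox] = max(acc, 0.0)  # ReLU
    return output
```

The innermost statement is one multiply-accumulate, so the loop bounds multiply out to the per-layer operation count. A hardware design would typically reorder, tile, or unroll these loops, and the choice of ordering determines how often inputs and weights must be fetched from memory.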

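1 Introduction
2 Background and Concepts
3 Convolutional Neural Network Analysis, Training and Optimization
4 FPGA Accelerator Design and Implementation
5 Evaluation and Results
6 Conclusion
Appendix A Declaration of Originality
Appendix B Task Description
Appendix C Convolutional Neural Network Visualization
Appendix D CNN Training Details and Results
Appendix E FPGA Accelerator Details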
