Literature Notes (6) (2017 ISSCC: 14.2)
This note summarizes DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks.
It also draws in part on the blog posts https://blog.****.net/xbinworld/article/details/55000567
and https://mp.weixin.qq.com/s?__biz=MzI3MDQ2MjA3OA==&mid=2247483716&idx=1&sn=7bdc857a1bb12700a0f48e2a3ab92339&chksm=ead1fc55dda67543c550f03b6eb62a599c984cb5dc3a5bd7de3a64c670e378d2bb589558862e&scene=21#wechat_redirect
1 Abbreviations
RNN: recurrent neural networks
FCL: fully-connected layers
CL: convolution layer
LSTM: long short-term memory
RL: RNN-LSTM layer
CP: convolution layer processor
FRP: FC-RL processor
DNPU: deep neural processing unit
Q-table: quantization table
ID: image division
CD: channel division
MD: mixed division
2 overall architecture
Because CNNs and FC/RNN layers have different computational requirements and characteristics, previous accelerators targeted only one of the two; here a heterogeneous architecture lets two dedicated processors work together (a toy dispatch sketch follows the list).
In this paper, we present an 8.1TOPS/W reconfigurable CNN-RNN processor with the following three key features:
- A reconfigurable heterogeneous architecture with a convolution layer processor (CP) and an FC-RL processor (FRP)
- A LUT-based reconfigurable multiplier optimized for dynamic fixed-point arithmetic
- Quantization-table-based matrix multiplication
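A minimal sketch of the dispatch idea behind the heterogeneous architecture: convolution layers go to the CP, FC and RNN-LSTM layers to the FRP, so each processor serves the access pattern it is specialized for. The class and layer names here are illustrative, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    kind: str  # "CL" = convolution, "FCL"/"RL" = fully-connected / RNN-LSTM

def dispatch(layers: list[Layer]) -> dict[str, list[str]]:
    # Route each layer to the processor specialized for its access pattern.
    plan = {"CP": [], "FRP": []}
    for layer in layers:
        target = "CP" if layer.kind == "CL" else "FRP"
        plan[target].append(layer.name)
    return plan

net = [Layer("conv1", "CL"), Layer("conv2", "CL"),
       Layer("fc1", "FCL"), Layer("lstm1", "RL")]
print(dispatch(net))  # {'CP': ['conv1', 'conv2'], 'FRP': ['fc1', 'lstm1']}
```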
3 mixed division method
There are three possible methods for dividing a layer's computation when it exceeds on-chip memory (a toy traffic model follows the list):
- image division (ID): the weights must be reloaded many times, once per image tile
- channel division (CD): a single divided image cannot produce a final result, so intermediate partial sums must be stored
- mixed division (MD): divides along both axes and performs best
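As a rough illustration of why mixed division wins, here is a toy off-chip traffic model. The tile counts, word counts, and cost formulas are all illustrative assumptions, not figures from the paper.

```python
def traffic(image_tiles: int, channel_tiles: int,
            weight_words: int, psum_words: int) -> int:
    # Weights are refetched once per image tile; partial sums are spilled
    # (write) and reloaded (read) once per extra channel tile, hence the 2x.
    weight_traffic = image_tiles * weight_words
    psum_traffic = 2 * (channel_tiles - 1) * psum_words
    return weight_traffic + psum_traffic

weight_words, psum_words = 1 << 20, 1 << 18
print("image division  :", traffic(8, 1, weight_words, psum_words))  # many weight loads
print("channel division:", traffic(1, 8, weight_words, psum_words))  # large psum spill
print("mixed division  :", traffic(2, 4, weight_words, psum_words))  # balanced, lowest
```

With these (made-up) numbers, ID pays 8 full weight fetches, CD pays 7 rounds of partial-sum spilling, and MD lands in between on both axes, minimizing the total.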
4 dynamic fixed-point
Key idea: each layer's weights use their own word length and fraction length, chosen dynamically from the weight distribution of that layer.
The word length can vary from 4 bit to 16 bit, and the quantized weight width from 4 bit to 7 bit.
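A minimal sketch of how such per-layer quantization could work, assuming a simple max-magnitude rule for splitting the word into integer and fraction bits; the paper only states that the lengths follow each layer's weight distribution, so the rule below is an assumption.

```python
import numpy as np

def choose_fraction_length(weights: np.ndarray, word_length: int) -> int:
    # Give the integer part just enough bits (plus sign) to cover the largest
    # magnitude; every remaining bit goes to the fraction.
    max_abs = float(np.max(np.abs(weights)))
    integer_bits = max(0, int(np.ceil(np.log2(max_abs + 1e-12)))) + 1  # +1 sign
    return word_length - integer_bits

def quantize(weights: np.ndarray, word_length: int):
    fl = choose_fraction_length(weights, word_length)
    scale = 2.0 ** fl
    qmin, qmax = -(2 ** (word_length - 1)), 2 ** (word_length - 1) - 1
    codes = np.clip(np.round(weights * scale), qmin, qmax)
    return codes / scale, fl

w = np.random.randn(256) * 0.1          # a layer with small-valued weights
w_hat, fl = quantize(w, word_length=8)  # gets a long fraction length
print(fl, float(np.abs(w - w_hat).max()))
```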
5 LUT-based multiplication
For convolution layers, the same weight is reused in many multiplications, so a lookup table can store the products of every possible 4-bit input with that weight; each subsequent multiplication becomes a table read.
By looking up the four 4-bit nibbles of a 16-bit input a and shift-accumulating the partial products, the full multiplication of a with a 4-bit weight is realized (a minimal sketch follows).
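A minimal sketch of this shift-accumulate lookup, treating the input as unsigned for simplicity (the hardware would also handle signs); function names are illustrative.

```python
def build_input_lut(weight: int) -> list[int]:
    # One entry per possible unsigned 4-bit input value: LUT[a4] = weight * a4.
    return [weight * a4 for a4 in range(16)]

def lut_multiply_16bit(a: int, lut: list[int]) -> int:
    # Split the 16-bit input into four 4-bit nibbles, look each product up,
    # and combine with shifts, since a = sum(nibble_i << (4 * i)).
    acc = 0
    for i in range(4):
        nibble = (a >> (4 * i)) & 0xF
        acc += lut[nibble] << (4 * i)
    return acc

assert lut_multiply_16bit(0xBEEF, build_input_lut(7)) == 0xBEEF * 7
```

The 16 real multiplies spent building the LUT are amortized over all the input pixels that reuse the same weight.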
For FC/RNN computation, the weights can be quantized to a 4-bit width.
Once the weights are quantized, the products of each input with all 16 possible 4-bit weight values can be precomputed and stored in a lookup table (the Q-table); each actual multiplication is then just a table read (see the sketch below).
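A minimal sketch of the resulting matrix-vector multiply, assuming signed 4-bit weight codes in [-8, 7]; variable names are illustrative.

```python
import numpy as np

def qtable_matvec(w_codes: np.ndarray, x: np.ndarray) -> np.ndarray:
    # w_codes: (out, in) integer weight codes; x: input activation vector.
    out = np.zeros(w_codes.shape[0])
    for j, xj in enumerate(x):
        qtable = xj * np.arange(-8, 8)    # 16 multiplies, shared by all rows
        out += qtable[w_codes[:, j] + 8]  # per weight: one table read + add
    return out

w_codes = np.array([[3, -2], [1, 0]])
x = np.array([0.5, 2.0])
assert np.allclose(qtable_matvec(w_codes, x), w_codes @ x)
```

Here the 16 multiplies for each input element are amortized over every output row that reuses that element, which is what makes the Q-table pay off for large FC/RL matrices.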