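A bank is an important concept in CUDA: it is a way of partitioning memory for access. CPUs do something loosely analogous. To reduce the number of memory transactions, a CPU does not fetch one arbitrary address at a time; it fetches a whole aligned block of neighbouring addresses (a cache line) in one go, much like aligned memory access. This improves memory bandwidth utilization, because the CPU assumes that when a program touches one address, the data near it will very likely be accessed soon afterwards.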

Before looking at banks in CUDA, you need to understand shared memory.

shared memory

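Shared memory is one of the memory spaces in the CUDA memory model. It is on-chip memory, much faster than global memory, and every thread in the same block can access the data stored in it. Compared with local or global memory it offers high bandwidth and low latency; the CUDA documentation states: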

Because it is on-chip, shared memory has much higher bandwidth and much lower latency than local or global memory.

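Placing shared memory on-chip is not the only technique used to make it fast. To raise bandwidth, shared memory is divided into equally sized memory modules called banks. A read or write request made of n addresses that fall in n distinct banks can then be serviced in a single pass, which effectively multiplies the bandwidth.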

To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module.
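The address-to-bank rule can be modelled on the host in a few lines. The sketch below is only an illustration under stated assumptions: each bank is 4 bytes wide and successive 32-bit words are assigned to successive banks. The names `bank_of` and `all_banks_distinct` are hypothetical helpers, not part of the CUDA API.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Assumption: a bank is 4 bytes (one 32-bit word) wide, and successive
// words are assigned to successive banks.
constexpr std::size_t kBankWidthBytes = 4;

// Bank that holds the word containing `byte_offset`
// (an offset from the start of shared memory).
std::size_t bank_of(std::size_t byte_offset, std::size_t num_banks) {
    assert(num_banks > 0);
    return (byte_offset / kBankWidthBytes) % num_banks;
}

// True when every offset of the request lands in a different bank,
// i.e. the n addresses fall in n distinct banks.
bool all_banks_distinct(const std::vector<std::size_t>& byte_offsets,
                        std::size_t num_banks) {
    std::set<std::size_t> banks;
    for (std::size_t offset : byte_offsets) {
        banks.insert(bank_of(offset, num_banks));
    }
    return banks.size() == byte_offsets.size();
}
```

With 32 banks, byte offsets 0, 4, 8 and 12 land in banks 0 to 3 and can be serviced together, whereas offsets 0 and 128 both land in bank 0.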

The mapping is shown in the following figure:

(Figure: mapping of shared memory addresses to banks)

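As the figure above shows, shared memory is mapped to banks column-wise: successive 32-bit words go to successive banks. For example, with warp size = 32 and 16 banks (devices of compute capability 1.x), the data is mapped as follows:

(Figure: data-to-bank mapping with 16 banks)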

For example, consider a 32×32 array of float:

__shared__ float sData[32][32];

On a GPU with warp size = 32 and 32 banks, the bank mapping is:

(Figure: bank mapping of sData[32][32] with 32 banks)

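In this example every column of the array lies in a single bank. A bank can service only one access at a time, whereas different banks can be accessed simultaneously.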
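To see why each column ends up in one bank, the mapping for `sData[32][32]` can be written out as a small host-side function. This is a sketch assuming 32 banks of 4 bytes each and row-major storage; `bank_of_element` is a hypothetical helper name used only for illustration.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kRows = 32;
constexpr std::size_t kCols = 32;      // matches __shared__ float sData[32][32]
constexpr std::size_t kNumBanks = 32;  // assumption: 32 banks, each 4 bytes wide

// Assumption: sizeof(float) equals the bank width, so in row-major order
// sData[row][col] is word number row * kCols + col.
std::size_t bank_of_element(std::size_t row, std::size_t col) {
    assert(row < kRows && col < kCols);
    return (row * kCols + col) % kNumBanks;
}
```

Because `kCols` equals the number of banks, the `row * kCols` term vanishes under the modulo and the bank index is simply `col`: `sData[0][5]`, `sData[7][5]` and `sData[31][5]` all live in bank 5.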

Bank conflicts

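If several threads taking part in the same memory request (in practice, threads of one warp) access different addresses that fall in the same bank, a bank conflict occurs and those accesses are serialized. Bank conflicts deserve close attention during real-world development and debugging.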

However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts.
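The "n-way" terminology can be made concrete with a host-side model that counts how many distinct words of one request fall in the busiest bank. This is a rough sketch, not a description of the hardware: it assumes 4-byte banks, word i living in bank i % num_banks, and threads that touch the same word not conflicting with one another. `conflict_degree` and `strided_request` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Degree n of the bank conflict for one request: the largest number of
// distinct words that fall in a single bank. 1 means conflict-free.
std::size_t conflict_degree(const std::vector<std::size_t>& word_indices,
                            std::size_t num_banks) {
    assert(num_banks > 0);
    std::map<std::size_t, std::set<std::size_t>> words_per_bank;
    for (std::size_t word : word_indices) {
        words_per_bank[word % num_banks].insert(word);
    }
    std::size_t degree = 0;
    for (const auto& entry : words_per_bank) {
        degree = std::max(degree, entry.second.size());
    }
    return degree;
}

// Word indices touched when thread t accesses word t * stride.
std::vector<std::size_t> strided_request(std::size_t stride,
                                         std::size_t num_threads) {
    std::vector<std::size_t> words(num_threads);
    for (std::size_t t = 0; t < num_threads; ++t) {
        words[t] = t * stride;
    }
    return words;
}
```

For a 32-thread warp and 32 banks, a stride of 1 (thread t reads `sData[0][t]`) gives degree 1, while a stride of 32 (thread t reads `sData[t][0]`) puts all 32 words in bank 0 and gives degree 32, i.e. a 32-way bank conflict under this model.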
