Shared Memory
1. Introduction
2. Introducing CUDA Shared Memory
Physically, GPU memory falls into two categories:
1. On-board memory (e.g., global memory, with high latency)
2. On-chip memory (e.g., shared memory, with much lower latency)
Shared memory is commonly used as:
1. An intra-block thread communication channel
2. A program-managed cache for global memory data
3. Scratch pad memory for transforming data to improve global memory access patterns
Shared Memory Allocation
A fixed-size shared memory array is declared statically; the dimensions must be compile-time constants:

__shared__ float tile[size_y][size_x];
If declared inside a kernel, its scope is local to that kernel; if declared outside any kernel, it is visible to all kernels in the file. If the size of the shared memory array is unknown at compile time, declare it with the extern keyword as an unsized 1D array:

extern __shared__ int tile[];
Because the size is unknown at compile time, it must be specified at each kernel launch, as the third execution configuration parameter mentioned earlier:

kernel<<<grid, block, isize * sizeof(int)>>>(...)
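A minimal sketch putting the declaration and the launch together (dynKernel, d_out, and isize are illustrative names, not from the original post):

// the unsized array picks up its size from the launch configuration
extern __shared__ int tile[];

__global__ void dynKernel(int *out) {
    unsigned int idx = threadIdx.x;
    tile[idx] = idx;      // store into the dynamically sized array
    __syncthreads();
    out[idx] = tile[idx]; // read it back out to global memory
}

// host side: the third launch parameter is the size in bytes
int isize = 64;
dynKernel<<<1, isize, isize * sizeof(int)>>>(d_out);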
Note that only 1D arrays can be declared dynamically this way.

Shared Memory Banks and Access Mode
· Parallel access: multiple addresses spread across multiple banks.
· Serial access: multiple addresses fall within the same bank, so the accesses are serialized.
· Broadcast access: a single-address read within one bank, broadcast to all requesting threads.
· Conflict-free broadcast access if threads access the same address within a bank
· Bank conflict access if threads access different addresses within a bank
The bank width depends on compute capability:
· 4 bytes for devices of compute capability 2.x
· 8 bytes for devices of compute capability 3.x
For devices of compute capability 2.x, the bank index is computed as:

bank index = (byte address ÷ 4 bytes/bank) % 32 banks
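For example, the 4-byte word at byte address 132 maps to bank (132 / 4) % 32 = 33 % 32 = 1; the word at byte address 4 also maps to bank 1, 128 bytes (32 words) away.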
In the Fermi address mapping, successive addresses within the same bank are 32 words apart; adjacent words are assigned to different banks so that a warp fetching contiguous memory spreads consecutive addresses across different banks and gets more parallel memory accesses.

Devices of compute capability 3.x support two bank modes:

· 64-bit mode
· 32-bit mode
In 64-bit mode, successive 64-bit words map to successive banks:

bank index = (byte address ÷ 8 bytes/bank) % 32 banks
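For example, the 8-byte word at byte address 264 maps to bank (264 / 8) % 32 = 33 % 32 = 1; the two 4-byte values at byte addresses 256 and 260 share one 64-bit word and therefore fall in the same bank, where a single 64-bit access can serve both.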
With 8-byte banks, two threads that access different sub-words of the same 64-bit word do not cause a bank conflict, because a single 64-bit read (the bank bandwidth is 64 bits per cycle) satisfies both requests. In other words, all else being equal, 64-bit mode generally produces fewer bank conflicts than 32-bit mode.

The current bank size can be queried with:

cudaError_t cudaDeviceGetSharedMemConfig(cudaSharedMemConfig *pConfig);
The result is returned in pConfig and is one of:

cudaSharedMemBankSizeFourByte
cudaSharedMemBankSizeEightByte
On devices with a configurable bank size, it can be set with:

cudaError_t cudaDeviceSetSharedMemConfig(cudaSharedMemConfig config);

where config is one of:
cudaSharedMemBankSizeDefault
cudaSharedMemBankSizeFourByte
cudaSharedMemBankSizeEightByte
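A minimal sketch of querying and changing the bank size on a device that supports it (error checking omitted):

cudaSharedMemConfig cfg;
cudaDeviceGetSharedMemConfig(&cfg);  // query the current bank size
if (cfg != cudaSharedMemBankSizeEightByte) {
    // prefer 8-byte banks, e.g., for kernels dominated by double-precision data
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
}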
Configuring the Amount of Shared Memory
The split of on-chip memory between L1 cache and shared memory can be configured in two ways (both are sketched together below):
· Per-device configuration
· Per-kernel configuration
The per-device form is:

cudaError_t cudaDeviceSetCacheConfig(cudaFuncCache cacheConfig);
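A minimal sketch of both forms; myKernel is a placeholder, and the per-kernel call cudaFuncSetCacheConfig overrides the device-wide preference for that kernel:

__global__ void myKernel(float *data) { /* ... */ }

// per-device: prefer a larger shared memory partition for all kernels
cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

// per-kernel: prefer a larger L1 cache for this particular kernel
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);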
Synchronization
CUDA provides two in-kernel synchronization mechanisms:
· Barriers
· Memory fences
A barrier across all threads of a block is created with:

void __syncthreads();
__syncthreads() must be reached by all threads of the block; calling it inside divergent code, as below, is invalid and can hang the kernel or cause unintended side effects:

if (threadID % 2 == 0)
    __syncthreads();
else
    __syncthreads();

Also note that __syncthreads() only synchronizes within a block. Without synchronization between blocks, thread blocks may execute in any order, in parallel or serially, on any SM. If a CUDA kernel requires global synchronization, the usual approach is to split the kernel at the synchronization point and launch multiple kernels.
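As a sketch of the kernel-splitting approach (stage1 and stage2 are hypothetical kernels; launches issued to the same stream execute in order, which provides the grid-wide synchronization):

__global__ void stage1(float *data) { /* phase 1: every block writes its part */ }
__global__ void stage2(float *data) { /* phase 2: may read any block's phase-1 output */ }

void runPipeline(float *d_data, int n) {
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    stage1<<<grid, block>>>(d_data); // all blocks of stage1 complete...
    stage2<<<grid, block>>>(d_data); // ...before any block of stage2 starts
}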
Memory fences come in three scopes:

void __threadfence_block();
void __threadfence();
void __threadfence_system();

The first is block-scoped: it guarantees that values a thread writes before the fence are visible to the other threads in the same block. Unlike a barrier, a fence does not require all threads to reach it. The second is grid-scoped, and the third is system-scoped, covering the entire system including both device and host.
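A sketch of the classic use of __threadfence(): each block writes a partial result, and the last block to finish combines them. The fence guarantees a block's partial result is visible grid-wide before the counter announces it is done (sumPartials is illustrative; partial is assumed to hold one value per block, computed earlier in the kernel):

__device__ unsigned int count = 0;

__global__ void sumPartials(float *partial, float *result) {
    // ... each block first computes partial[blockIdx.x] ...
    if (threadIdx.x == 0) {
        // make this block's partial result visible to the whole grid
        // before incrementing the "done" counter
        __threadfence();
        unsigned int done = atomicAdd(&count, 1);
        if (done == gridDim.x - 1) {
            // last block to finish: it can safely read every other
            // block's partial result and combine them
            float sum = 0.0f;
            for (unsigned int i = 0; i < gridDim.x; ++i)
                sum += partial[i];
            *result = sum;
        }
    }
}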
3. Checking the Data Layout of Shared Memory
When designing a kernel around shared memory, two mappings matter:
· Mapping data elements across memory banks
· Mapping from thread index to shared memory offset
Square Shared Memory
Consider a square shared memory tile declared statically:

__shared__ int tile[N][N];
Because the tile is square, a 2D thread block can access it so that adjacent threads touch adjacent elements, in either of two ways:

tile[threadIdx.y][threadIdx.x]
tile[threadIdx.x][threadIdx.y]
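With 4-byte banks and a 32x32 block, the first indexing lets the 32 threads of a warp (consecutive threadIdx.x, fixed threadIdx.y) read 32 consecutive words, one per bank, conflict-free. With the second, a warp with threadIdx.y = y reads tile[0][y] through tile[31][y], i.e., word offsets y, y+32, y+64, ..., which all map to the same bank: a 32-way bank conflict.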
#define BDIMX 32
#define BDIMY 32

dim3 block(BDIMX, BDIMY);
dim3 grid(1, 1);

The kernel performs the following two operations:
· write the thread indices into the 2D shared memory array in row-major order;
· read those values back from shared memory and write them to global memory.
__global__ void setRowReadRow(int *out) {
    // declare static 2D shared memory
    __shared__ int tile[BDIMY][BDIMX];
    // mapping from thread index to global memory index (there is only one block)
    unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;
    // shared memory store operation
    tile[threadIdx.y][threadIdx.x] = idx;
    // wait until all threads have stored their values, so that no thread
    // reads shared memory before the row-major stores are complete
    __syncthreads();
    // shared memory load operation
    out[idx] = tile[threadIdx.y][threadIdx.x];
}
__global__ void setColReadCol(int *out) {
    // static shared memory
    __shared__ int tile[BDIMX][BDIMY];
    // mapping from thread index to global memory index
    unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;
    // shared memory store operation
    tile[threadIdx.x][threadIdx.y] = idx;
    // wait for all threads to complete
    __syncthreads();
    // shared memory load operation
    out[idx] = tile[threadIdx.x][threadIdx.y];
}

Compiling and running gives the following results (on a K40 in 4-byte bank mode):
$ nvcc checkSmemSquare.cu -o smemSquare
$ nvprof ./smemSquare
./smemSquare at device 0 of Tesla K40c with Bank Mode:4-byte
<<< grid (1,1) block (32,32)>>>
Time(%) Time Calls Avg Min Max Name
13.25% 2.6880us 1 2.6880us 2.6880us 2.6880us setColReadCol(int*)
11.36% 2.3040us 1 2.3040us 2.3040us 2.3040us setRowReadRow(int*)

The timings show that the row-major kernel performs better. To understand why, compare the following two nvprof metrics:
shared_load_transactions_per_request shared_store_transactions_per_request
The results on a K40 in 8-byte bank mode are shown below: the row-major kernel needs only one transaction per request, while the column-major kernel needs 16 (in 4-byte mode it could take 32):

Kernel: setColReadCol(int*)
1 shared_load_transactions_per_request 16.000000
1 shared_store_transactions_per_request 16.000000
Kernel: setRowReadRow(int*)
1 shared_load_transactions_per_request 1.000000
1 shared_store_transactions_per_request 1.000000
Writing Row-Major and Reading Column-Major
(Figure from the original post: the two memory operations illustrated with a simple five-way-bank shared memory.)

__global__ void setRowReadCol(int *out) {
// static shared memory
__shared__ int tile[BDIMY][BDIMX];
// mapping from thread index to global memory index
unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;
// shared memory store operation
tile[threadIdx.y][threadIdx.x] = idx;
// wait for all threads to complete
__syncthreads();
// shared memory load operation
out[idx] = tile[threadIdx.x][threadIdx.y];
}
Kernel: setRowReadCol(int*)
1 shared_load_transactions_per_request 16.000000
1 shared_store_transactions_per_request 1.000000

As the results show, the store operation is conflict-free, while the load incurs 16 transactions per request.
Dynamic Shared Memory
As mentioned earlier, shared memory can be declared dynamically either at file scope or locally inside the kernel. Note that a dynamic declaration must be an unsized 1D array, so the 2D indices must be recomputed by hand. Because we write in row-major order and read in column-major order, two index values are needed:
· row_idx: the 1D row-major memory offset
· col_idx: the 1D column-major memory offset
The kernel code:
__global__ void setRowReadColDyn(int *out) {
    // dynamic shared memory
    extern __shared__ int tile[];
    // mapping from thread index to global memory index
    unsigned int row_idx = threadIdx.y * blockDim.x + threadIdx.x;
    unsigned int col_idx = threadIdx.x * blockDim.y + threadIdx.y;
    // shared memory store operation
    tile[row_idx] = row_idx;
    // wait for all threads to complete
    __syncthreads();
    // shared memory load operation
    out[row_idx] = tile[col_idx];
}
The shared memory size is specified at kernel launch:
setRowReadColDyn<<<grid, block, BDIMX * BDIMY * sizeof(int)>>>(d_C);
Checking the transactions:
Kernel: setRowReadColDyn(int*)
1 shared_load_transactions_per_request 16.000000
1 shared_store_transactions_per_request 1.000000
The results match the previous example; the only difference is that the shared memory is declared dynamically.
Padding Statically Declared Shared Memory
The kernel pads each row of the tile with IPAD extra columns (e.g., #define IPAD 1), so that elements of the same column no longer fall into the same bank:
__global__ void setRowReadColPad(int *out) {
    // static shared memory, padded by IPAD columns
    __shared__ int tile[BDIMY][BDIMX + IPAD];
    // mapping from thread index to global memory offset
    unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;
    // shared memory store operation
    tile[threadIdx.y][threadIdx.x] = idx;
    // wait for all threads to complete
    __syncthreads();
    // shared memory load operation
    out[idx] = tile[threadIdx.x][threadIdx.y];
}
This kernel is a copy of setRowReadCol with padding added; the results:
Kernel: setRowReadColPad(int*)
1 shared_load_transactions_per_request 1.000000
1 shared_store_transactions_per_request 1.000000
As expected, the bank conflicts on the load are gone. On Fermi, adding a single padding column is enough to remove them, but on Kepler this is not always the case: it depends on the size of the 2D shared memory array, so in 8-byte mode it may take several trials to find the right amount of padding.
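Padding also works with dynamically declared shared memory; a sketch, assuming the same BDIMX/BDIMY/IPAD definitions as above (the row stride grows by IPAD, and the launch must reserve the padded size):

__global__ void setRowReadColDynPad(int *out) {
    extern __shared__ int tile[];
    // padded row stride: blockDim.x + IPAD
    unsigned int row_idx = threadIdx.y * (blockDim.x + IPAD) + threadIdx.x;
    unsigned int col_idx = threadIdx.x * (blockDim.x + IPAD) + threadIdx.y;
    unsigned int g_idx   = threadIdx.y * blockDim.x + threadIdx.x;
    tile[row_idx] = g_idx;       // row-major store
    __syncthreads();
    out[g_idx] = tile[col_idx];  // column-major load, conflict-free with padding
}

// launch with the padded size:
setRowReadColDynPad<<<grid, block, (BDIMX + IPAD) * BDIMY * sizeof(int)>>>(d_C);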
Original post: http://blog.163.com/[email protected]/blog/static/71988399201712735436357/