CUDA Massively Parallel Program Design
Zhao Kaiyong  zhao.kaiyong@gmail.com
Department of Computer Science, Hong Kong Baptist University
GPU high-performance development consultant, Inspur

Outline
• From GPGPU to CUDA
• Parallel program organization
• Parallel execution model
• CUDA fundamentals: memory
• CUDA programming tools
• The new-generation Fermi GPU

Graphics Processing Unit (GPU)
• A dedicated graphics display device for personal computers, workstations, and game consoles
• Discrete graphics cards: NVIDIA and ATI (now AMD) are the main manufacturers; Intel planned to enter this market with Larrabee
• Motherboard-integrated graphics: Intel

The 3D graphics pipeline
• Stages: CPU -> Vertex Processor -> Rasterizer -> Fragment Processor -> Framebuffer, with texture access on the GPU
• A typical frame: 1M triangles, 3M vertices, 25M fragments
• At 30 frames/s: 30M triangles/s, 90M vertices/s, 750M fragments/s

Traditional GPU architecture
Graphics program -> Vertex processors -> Fragment processors -> Pixel operations -> Output image

The GPU's computing power
[Figure: memory bandwidth (GB/s), 2003-2007; GPUs (NV30, NV40, G71, G80, G80 Ultra) vs. CPUs (Northwood, Prescott EE, Woodcrest, Harpertown)]
• Data-level parallelism: uniform computation
• Dedicated memory channels
• Effective hiding of memory latency

General Purpose Computing on GPU (GPGPU)

The core idea of GPGPU
• Describe a general-purpose computation in a graphics language
• Map the data onto the vertex or fragment processors
But:
• Hardware resources are used inefficiently
• Memory access patterns are severely restricted
• Hard to debug and trace errors
• Demands advanced graphics-processing and programming skills

The G80 GPU
[Figure: block diagram. Host and Input Assembler feed vertex, geometry, and pixel thread-issue units and a thread processor; the chip is an array of Streaming Multiprocessors (SM), each built from Streaming Processors (SP) with L1 caches and texture fetch (TF) units, backed by L2 caches and the framebuffer (FB)]

CUDA: Compute Unified Device Architecture
• An integrated CPU + GPU C application
• A general-purpose parallel computing model
• Single-instruction, multiple-data (SIMD) execution
  - All threads execute the same code (1000s of threads on the fly)
  - Massive parallel computing resources process different data
• Hides memory latency
  - Raises the computation/communication ratio
  - Coalesces memory accesses to adjacent addresses
  - Fast thread switching: 1 cycle on the GPU vs. ~1000 cycles on the CPU

Evolution of CUDA-Enabled GPUs
• Compute 1.0: basic CUDA compatibility (G80)
• Compute 1.1: asynchronous memory copies and atomic global operations (G84, G86, G92, G94, G96, and G98)
• Compute 1.2: dramatically improved memory coalescing rules, double the register count, intra-warp voting primitives, atomic shared-memory operations (GT21x)
• Compute 1.3: double precision (GT200)

CUDA success stories

Parallel Program Organization

Dimensions of parallelism
• 1-D: y = a + b   // y, a, b vectors
• 2-D: P = M * N   // P, M, N matrices
• 3-D: CT or MRI imaging

Parallel thread organization
• Thread: the basic unit of parallelism
• Thread block: a group of cooperating threads, a Cooperative Thread Array (CTA)
  - Threads may synchronize with each other
  - Threads exchange data through fast shared memory
  - Organized in 1, 2, or 3 dimensions
  - At most 512 threads per block
• Grid: a set of thread blocks
  - Organized in 1 or 2 dimensions
  - Blocks share global memory
• Kernel: the core program executed on the GPU; one kernel runs as one grid
[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2-D array of blocks, and each block a 2-D array of threads]

Parallel program organization in CUDA
Software-to-hardware mapping: Thread -> SP, Thread block -> SM, Grid -> GPU (an array of TPCs, each holding several SMs)

Parallel thread execution
• Invoking a kernel function requires an execution configuration
• Threads and blocks have IDs
  - threadIdx: 1-D, 2-D, or 3-D
  - blockIdx: 1-D or 2-D
  - These determine which data each thread processes

    __global__ void kernel(...);
    dim3 DimGrid(3, 2);    // 6 thread blocks
    dim3 DimBlock(16, 16); // 256 threads per block
    kernel<<<DimGrid, DimBlock>>>(...);

Example 1: element-wise addition

    // CPU program: sum of two vectors a and b
    void add_cpu(float *a, float *b, int N)
    {
        for (int idx = 0; idx < N; idx++)
            a[idx] += b[idx];
    }

    void main()
    {
        ...
        add_cpu(a, b, N);
    }

    // CUDA program: sum of two vectors a and b
    __global__ void add_gpu(float *a, float *b, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] += b[idx];
    }

    void main()
    {
        ...
        dim3 dimBlock(256);
        dim3 dimGrid(ceil(N / 256.0));
        add_gpu<<<dimGrid, dimBlock>>>(a, b, N);
    }

Parallel Execution Model

CUDA processing flow

Parallel thread execution
• Within an SM, threads execute in parallel in units of warps (a warp = 32 threads)
• All threads in a warp execute the same instruction
• A half-warp is the basic unit for memory operations

Control flow
• Branch statements within the same warp may follow different instruction paths
• Threads on different instruction paths can only execute sequentially
  - Each pass executes one of the warp's possible paths
  - N instruction paths -> 1/N throughput
• Only the same warp matters: different instruction paths in different warps are unrelated
• G80 uses instruction predication to accelerate execution

Control flow (continued)
• The common case: when the branch condition is a function of the thread ID, divergence arises easily
• Example with divergence:
  - if (threadIdx.x < 2) { } creates two instruction paths in the thread block
  - Branch granularity < warp size: threads 0 and 1 follow a different instruction path from the rest of the first warp
• Example without divergence:
  - if (threadIdx.x / WARP_SIZE < 2) { } also creates two instruction paths in the thread block
  - Branch granularity is a whole multiple of the warp size: all threads of any given warp follow the same instruction path

Thread synchronization
• void __syncthreads();
• Barrier synchronization: synchronizes all threads within a thread block
• Avoids RAW/WAR/WAW hazards when accessing shared memory

    __shared__ float scratch[256];
    scratch[threadID] = begin[threadID];
    __syncthreads();   // wait here until all threads have arrived,
                       // then continue with the code below
    int left = scratch[threadID - 1];

Dead-lock with __syncthreads
• Deadlock if some threads have val larger than threshold and others do not:

    __global__ void compute(...)
    {
        // do some computation for val
        if (val > threshold)
            return;
        __syncthreads();
        // work with val & store it
        return;
    }

CUDA Fundamentals: Memory

CUDA language extensions
• Declspecs: global, device, shared, local, constant
• Keywords: threadIdx, blockIdx, blockDim, gridDim
• Intrinsics: __syncthreads
• Runtime API: memory, symbol, and execution management; function launch

    __device__ float filter[N];

    __global__ void convolve(float *image)
    {
        __shared__ float region[M];
        ...
        region[threadIdx] = image[i];
        __syncthreads();
        ...
        image[j] = result;
    }

    // Allocate GPU memory
    void *myimage = cudaMalloc(bytes);

    // 100 blocks, 10 threads per block
    foo<<<100, 10>>>(parameters);

Memory spaces
• R/W per-thread registers: 1-cycle latency
• R/W per-thread local memory: slow (registers spilling to global memory)
• R/W per-block shared memory: 1-cycle latency ("__shared__"), but bank conflicts may drag performance down
• R/W per-grid global memory