Chapter 6  Volume Ray Casting on CUDA

The performance of graphics processors (GPUs) is improving at a rapid rate, almost doubling every year. This evolution has been possible because the GPU is specialized for highly parallel, compute-intensive applications, primarily graphics rendering, and is therefore designed so that more transistors are devoted to computation rather than to caching and branch prediction units. Because compute-intensive applications have high arithmetic intensity (the ratio of arithmetic operations to memory operations), memory latency on GPUs can be hidden with computation instead of caches. In addition, since the same instructions are executed on many data elements in parallel, GPUs do not need sophisticated flow control units such as the branch prediction units found in CPUs. Although the 3D rendering performance achieved by dedicated graphics hardware far exceeds what is achievable with a CPU alone, graphics programmers until recently had to give up programmability in exchange for speed: they were limited to a fixed set of graphics operations. By contrast, images for films and videos are rendered with off-line rendering systems that use general-purpose CPUs to render a single frame in hours, because general-purpose CPUs give graphics programmers a great deal of flexibility to create rich effects. This generality and flexibility are what the GPU has been missing until very recently.

To close the gap, graphics hardware designers have continuously introduced more programmability over several generations of GPUs. Up until 2000, GPUs supported no programmability. In 2001, vertex-level programmability started to appear, and in 2002, pixel-level programmability was also provided on GPUs such as NVIDIA's GeForce FX family and ATI's Radeon 9700 series. This level of programmability gives programmers considerably more configurability by making it possible to specify a sequence of instructions for both the vertex and fragment processors.

However, accessing the computational power of GPUs for non-graphics applications, or for global illumination rendering such as ray tracing, often requires ingenious effort. One reason is that GPUs could only be programmed through a graphics API such as OpenGL, which imposes a significant overhead on non-graphics applications. Programmers had to express their algorithms in terms of inadequate APIs, which sometimes required heroic efforts to use the GPU efficiently. Another reason is the GPU's limited writing capability: a GPU program could gather data elements from any part of memory, but could not scatter data to arbitrary locations, which removes much of the programming flexibility available on the CPU.

To overcome these limitations, NVIDIA developed a new hardware and software architecture, called CUDA (Compute Unified Device Architecture), for issuing and managing computations on the GPU as a data-parallel computing device without mapping instructions to a graphics API [NVI07]. CUDA provides general memory access, so a GPU program is now allowed to read from and write to any location in memory.
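As an illustration of this general memory access, the following minimal sketch (not code from this dissertation; the kernel and variable names are hypothetical) shows a CUDA kernel in which each thread gathers an element from an arbitrary source index and scatters it to an arbitrary destination index, a pattern that could not be expressed through the fixed render-target semantics of a graphics API:

    #include <cuda_runtime.h>

    // Each thread reads from an arbitrary source location (gather) and writes
    // to an arbitrary destination location (scatter); the index arrays are
    // supplied by the host.
    __global__ void gatherScatter(const float *src, float *dst,
                                  const int *srcIdx, const int *dstIdx, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)
            dst[dstIdx[i]] = src[srcIdx[i]];            // scatter: write anywhere in memory
    }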
To harness the power of the CUDA architecture, we need new design strategies and techniques that fully exploit its features. CUDA is tailored for data-parallel computations and thus is not well suited to other types of computation. Moreover, the current version of CUDA requires programmers to understand the specific architecture details in order to achieve the desired performance gains; programs written without careful attention to these details are very likely to perform poorly.

In this chapter, we explore the application of our streaming model, introduced in the previous chapter for the Cell processor, to the CUDA architecture. Since the model is designed for heterogeneous compute resource environments, it is also well suited to the combined CPU and CUDA environment. Our basic strategy in the streaming model is the same as in the case of the Cell processor: we assign worklist generation to the first stage (CPU) and the actual rendering work to the second stage (CUDA), with data movement streamlined through the two stages. The key is that we carefully match the performance of the two stages so that the two processes are completely overlapped and neither stage has to wait for input from the other.

Our scheme features the following. First, we essentially remove the overhead of traversing the hierarchical data structure by overlapping the empty space skipping process with the actual rendering process. Second, our algorithms are carefully tailored to the CUDA architecture's unique details, such as the concept of a warp and the local shared memory, to achieve high performance. Last, the ray casting performance is 1.5 times better than that of the Cell processor, with only a third of the lines of code of the Cell implementation, and 15 times better than that of an Intel Xeon processor.

6.1 The CUDA Architecture Overview

The CUDA (Compute Unified Device Architecture) hardware model has a set of SIMD multiprocessors, as shown in Figure 6.1. Each multiprocessor has a small local shared memory, a constant cache, a texture cache, and a set of processors. At any given clock, every processor in a multiprocessor executes the same instruction. For example, the NVIDIA GeForce 8800 GTX architecture is comprised of 16 multiprocessors, each with 8 streaming processors, for a total of 128 processors.

Figure 6.2 shows the CUDA programming model. CUDA allows programmers to use the C language to program the GPU instead of graphics APIs such as OpenGL and Direct3D. In CUDA, the GPU is a compute device that can execute a very high number of threads in parallel.
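To make the programming model concrete, the following minimal host/device sketch (again illustrative, not code from this dissertation) launches a kernel over a grid of thread blocks; each thread derives its global index from its block and thread indices:

    #include <cuda_runtime.h>

    // Scale every element of an array; one thread per element.
    __global__ void scaleKernel(float *data, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= s;
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc((void **)&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));

        // 256 threads per block; enough blocks to cover all n elements.
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scaleKernel<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);
        cudaDeviceSynchronize();

        cudaFree(d_data);
        return 0;
    }

The <<<blocks, threadsPerBlock>>> execution configuration is what maps the data-parallel work onto the multiprocessors described above.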