您好,欢迎访问三七文档
arXiv:astro-ph/0108412v127Aug2001AnefficientparallelalgorithmforO(N2)directsummationmethodanditsvariationsondistributed-memoryparallelmachinesJunichiroMakinoDepartmentofAstronomy,SchoolofScience,UniversityofTokyo,7-3-1Hongo,Bunkyo-ku,Tokyo113-0033,Japan.AbstractWepresentanovel,highlyefficientalgorithmtoparallelizeO(N2)directsumma-tionmethodforN-bodyproblemswithindividualtimestepsondistributed-memoryparallelmachinessuchasBeowulfclusters.Previouslyknownalgorithms,inwhichallprocessorshavecompletecopiesoftheN-bodysystem,hastheseriousproblemthatthecommunication-computationratioincreasesasweincreasethenumberofprocessors,sincethecommunicationcostisindependentofthenumberofproces-sors.Inthenewalgorithm,pprocessorsareorganizedasa√p×√ptwo-dimensionalarray.EachprocessorhasN/√pparticles,butthedataaredistributedinsuchawaythatcompletesystemispresentedifwelookatanyroworcolumnconsistingof√pprocessors.Inthisalgorithm,thecommunicationcostscalesasN/√p,whilethecalculationcostscalesasN2/p.Thus,wecanuseamuchlargernumberofprocessorswithoutlosingefficiencycomparedtowhatwaspracticalwithpreviouslyknownalgorithms.PACS:02.60.Cb;95.10.Ce;98.10.+zKeywords:Celestialmechanics,stellardynamics;Methods:numerical1IntroductionInthispaperwepresentanovelalgorithmtoparallelizethedirectsum-mationmethodforastrophysicalN-bodyproblems,eitherwithandwith-outtheindividualtimestepalgorithm.TheproposedalgorithmworksalsowiththeAhmad-Cohenneighborscheme(AhmadandCohen1973),orwithGRAPEspecial-purposecomputersforN-bodyproblems(Sugimotoetal.PreprintsubmittedtoElsevierPreprint1February20081990,MakinoandTaiji1998).Ouralgorithmisdesignedtoofferbetterscal-ingofthecommunication-computationratioondistributed-memorymulti-computerssuchasBeowulfPCclusters(Sterlingetal.1999)comparedtotraditionalalgorithms.Thispaperwillbeorganizedasfollows.Insection2wedescribethetraditionalalgorithmstoparallelizedirectsummationmethodondistributed-memoryparallelcomputers,andthescalingofcommunicationtimeandcomputationaltimeasfunctionsofthenumberofparticlesNandnumberofprocessorp.ItwillbeshownthatforpreviouslyknownalgorithmsthecalculationtimescalesasO(N2/p),whilecommunicationtimeisO(N+logp).Thus,evenwithinfinitenumberofprocessorsthetotaltimepertimestepisstillO(N),andwecannotusemorethanO(N)processorswithoutlosingefficiency.O(N)soundslarge,butthecoefficientisrathersmall.Thus,itwasnotpracticaltousemorethan10processorsforsystemswithafewthousandparticles,ontypicalBeowulfclusters.Insection3wedescribethebasicideaofournewalgorithm.ItwillbeshownthatinthisalgorithmthecommunicationtimeisO(N/√p+logp).Thus,wecanuseO(N2)processorswithoutlosingefficiency.Thisimpliesalargegaininspeedforrelativelysmallnumberofparticlessuchasafewthousands.Wealsobrieflydiscusstherelationbetweenournewalgorithmandthehyper-systolicalgorithm(Lippertetal.1998).Inshort,thoughtheideasbehindthetwoalgorithmsareverydifferent,theactualcommunicationpatternsarequitesimilar,andthereforetheperformanceisalsosimilarforthetwoalgorithms.OuralgorithmshowsabetterscalingandalsoismucheasiertoextendtoindividualtimestepandAhmad-Cohenschemes.Insection4wediscussthecombinationofourproposedalgorithmandindivid-ualtimestepalgorithmandtheAhmad-Cohenscheme.Insection5,wepresentexamplesofestimatedperformance.Insection6wediscussthecombinationofouralgorithmwithGRAPEhardwares.Insection7wesumup.2TraditionalapproachesTheparallelizationofthedirectmethodhasbeenregardedsimpleandstraight-forward[see,forexample,(Foxetal.1994)].However,itisonlysoifNpandifweusesimpleshared-timestepmethod.Inthissection,wefirstdiscussthecommunication-calculationratioofpreviouslyknownalgorithmsforthesharedtimestepmethod,andthenthoseforindividualtimestepalgorithmwithandwithouttheAhmad-Cohenscheme.22.1SharedtimestepMostofthetextbooksandpapersdiscusstheringalgorithm.SupposewecalculatetheforceonNparticlesusingpprocessors.Weconnecttheprocessorsinaonedimensionalring,anddistributeNparticlessothateachprocessorhasN/pparticles(figure1).Hereandhereafter,weassumethatNisintegermultipleofp,tosimplifythediscussion.TheringalgorithmcalculatestheforcesonNparticlesinthefollowingsteps.(1)EachprocessorcalculatestheinteractionsbetweenN/pparticleswithinit.CalculationcostofthisstepisCf(N/p)2/2,whereCfisthetimetocalculateinteractionbetweenonepairofparticles.(2)Eachprocessorsendsallofitsparticlestothesamedirection.Herewecallthatdirection“right”.Thusallprocessorssendsitsparticlestotheirrightneighbors.ThecommunicationcostisCcN/p+Cs,whereCcisthetimetosendoneparticletotheneighboringprocessorandCsisthestartuptimeforcommunication.(3)Eachprocessoraccumulatestheforcefromparticlestheyreceivedtoitsownparticles.CalculationcostisCf(N/p)2.Ifforcefromallparticlesisaccumulated,gotostep5.(4)Eachprocessorthensendstheparticlesitreceivedintheprevioussteptoitsrightneighbor,andgoesbacktopreviousstep.(5)Forcecalculationcompleted.ThetimeforactualcalculationisgivenbyTf,ring=CfN2/p,(1)andthecommunicationtimeTc,ring=CcN+Csp.(2)ThetotaltimeperonetimestepofthisalgorithmisTring=Tf,ring+Tc,ring=CfN2/p+CcN+Csp.(3)Here,weneglectsmallcorrectionfactorsoforderO(1/p).For
本文标题:An efficient parallel algorithm for O(N^2) direct
链接地址:https://www.777doc.com/doc-6041700 .html