An efficient parallel algorithm for O(N^2) direct

arXiv:astro-ph/0108412v1 27 Aug 2001

Preprint submitted to Elsevier Preprint, 1 February 2008

An efficient parallel algorithm for O(N^2) direct summation method and its variations on distributed-memory parallel machines

Junichiro Makino

Department of Astronomy, School of Science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan.

Abstract

We present a novel, highly efficient algorithm to parallelize the $O(N^2)$ direct summation method for N-body problems with individual timesteps on distributed-memory parallel machines such as Beowulf clusters. Previously known algorithms, in which all processors have complete copies of the N-body system, have the serious problem that the communication-computation ratio increases as we increase the number of processors, since the communication cost is independent of the number of processors. In the new algorithm, $p$ processors are organized as a $\sqrt{p} \times \sqrt{p}$ two-dimensional array. Each processor has $N/\sqrt{p}$ particles, but the data are distributed in such a way that the complete system is present if we look at any row or column consisting of $\sqrt{p}$ processors. In this algorithm, the communication cost scales as $N/\sqrt{p}$, while the calculation cost scales as $N^2/p$. Thus, we can use a much larger number of processors without losing efficiency compared to what was practical with previously known algorithms.

PACS: 02.60.Cb; 95.10.Ce; 98.10.+z

Keywords: Celestial mechanics, stellar dynamics; Methods: numerical

1 Introduction

In this paper we present a novel algorithm to parallelize the direct summation method for astrophysical N-body problems, either with or without the individual timestep algorithm. The proposed algorithm also works with the Ahmad-Cohen neighbor scheme (Ahmad and Cohen 1973), or with GRAPE special-purpose computers for N-body problems (Sugimoto et al. 1990, Makino and Taiji 1998). Our algorithm is designed to offer better scaling of the communication-computation ratio on distributed-memory multicomputers such as Beowulf PC clusters (Sterling et al. 1999) compared to traditional algorithms.

This paper is organized as follows. In section 2 we describe the traditional algorithms to parallelize the direct summation method on distributed-memory parallel computers, and the scaling of communication time and computational time as functions of the number of particles $N$ and the number of processors $p$. It will be shown that for previously known algorithms the calculation time scales as $O(N^2/p)$, while the communication time is $O(N + \log p)$. Thus, even with an infinite number of processors the total time per timestep is still $O(N)$, and we cannot use more than $O(N)$ processors without losing efficiency. $O(N)$ sounds large, but the coefficient is rather small. Thus, it was not practical to use more than 10 processors for systems with a few thousand particles on typical Beowulf clusters.

In section 3 we describe the basic idea of our new algorithm. It will be shown that in this algorithm the communication time is $O(N/\sqrt{p} + \log p)$. Thus, we can use $O(N^2)$ processors without losing efficiency. This implies a large gain in speed for relatively small numbers of particles, such as a few thousand. We also briefly discuss the relation between our new algorithm and the hyper-systolic algorithm (Lippert et al. 1998). In short, though the ideas behind the two algorithms are very different, the actual communication patterns are quite similar, and therefore the performance is also similar for the two algorithms. Our algorithm shows better scaling and is also much easier to extend to the individual timestep and Ahmad-Cohen schemes.

In section 4 we discuss the combination of our proposed algorithm with the individual timestep algorithm and the Ahmad-Cohen scheme. In section 5 we present examples of estimated performance. In section 6 we discuss the combination of our algorithm with GRAPE hardware. In section 7 we sum up.

2 Traditional approaches

The parallelization of the direct method has been regarded as simple and straightforward [see, for example, (Fox et al. 1994)]. However, it is only so if $N \gg p$ and if we use the simple shared-timestep method. In this section, we first discuss the communication-calculation ratio of previously known algorithms for the shared timestep method, and then those for the individual timestep algorithm with and without the Ahmad-Cohen scheme.

2.1 Shared timestep

Most of the textbooks and papers discuss the ring algorithm. Suppose we calculate the force on $N$ particles using $p$ processors. We connect the processors in a one-dimensional ring, and distribute the $N$ particles so that each processor has $N/p$ particles (figure 1). Here and hereafter, we assume that $N$ is an integer multiple of $p$, to simplify the discussion. The ring algorithm calculates the forces on $N$ particles in the following steps.

(1) Each processor calculates the interactions between the $N/p$ particles within it. The calculation cost of this step is $C_f (N/p)^2/2$, where $C_f$ is the time to calculate the interaction between one pair of particles.
(2) Each processor sends all of its particles in the same direction. Here we call that direction “right”. Thus all processors send their particles to their right neighbors. The communication cost is $C_c N/p + C_s$, where $C_c$ is the time to send one particle to the neighboring processor and $C_s$ is the startup time for communication.
(3) Each processor accumulates the force from the particles it received onto its own particles. The calculation cost is $C_f (N/p)^2$. If the force from all particles has been accumulated, go to step 5.
(4) Each processor then sends the particles it received in the previous step to its right neighbor, and goes back to the previous step.
(5) Force calculation completed.

The time for actual calculation is given by

$$T_{f,\mathrm{ring}} = C_f N^2/p, \tag{1}$$

and the communication time by

$$T_{c,\mathrm{ring}} = C_c N + C_s p. \tag{2}$$

The total time per one timestep of this algorithm is

$$T_{\mathrm{ring}} = T_{f,\mathrm{ring}} + T_{c,\mathrm{ring}} = C_f N^2/p + C_c N + C_s p. \tag{3}$$

Here, we neglect small correction factors of order $O(1/p)$.
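The ring algorithm of steps (1) to (5) and the cost model of equation (3) can be illustrated with a minimal serial sketch in Python. It is not the author's implementation: the p "processors" are emulated as list entries inside one process, the function names (accel_from, ring_forces, ring_time) are hypothetical, and the force law is assumed to be softened Newtonian gravity with G = 1 and an arbitrary softening length eps, neither of which is specified in the text above.

```python
import numpy as np


def accel_from(targets, sources, src_mass, eps=1e-2):
    """Softened gravitational acceleration on each target from all sources.

    Assumption: G = 1 and Plummer softening eps, so a particle paired with
    itself contributes exactly zero (zero separation in the numerator).
    """
    d = sources[None, :, :] - targets[:, None, :]   # shape (n_t, n_s, 3)
    r2 = (d * d).sum(axis=2) + eps * eps            # softened squared distance
    return (src_mass[None, :, None] * d / r2[:, :, None] ** 1.5).sum(axis=1)


def ring_forces(pos, mass, p):
    """Emulate the ring algorithm serially on p 'processors'."""
    assert len(pos) % p == 0, "N is assumed to be an integer multiple of p"
    own_pos = np.split(pos, p)     # each processor holds N/p particles
    own_mass = np.split(mass, p)

    # Step (1): interactions among the particles held by each processor.
    acc = [accel_from(own_pos[k], own_pos[k], own_mass[k]) for k in range(p)]

    # Travelling buffers start out as each processor's own particles.
    buf_pos, buf_mass = list(own_pos), list(own_mass)
    for _ in range(p - 1):
        # Steps (2) and (4): shift buffers to the right, so processor k
        # receives whatever processor k-1 was holding.
        buf_pos = [buf_pos[(k - 1) % p] for k in range(p)]
        buf_mass = [buf_mass[(k - 1) % p] for k in range(p)]
        # Step (3): accumulate the force from the received particles.
        for k in range(p):
            acc[k] += accel_from(own_pos[k], buf_pos[k], buf_mass[k])

    # Step (5): after p-1 shifts every block has visited every processor.
    return np.concatenate(acc)


def ring_time(n, p, c_f, c_c, c_s):
    """Total time per timestep from equation (3)."""
    return c_f * n**2 / p + c_c * n + c_s * p
```

Comparing ring_forces(pos, mass, p) with accel_from(pos, pos, mass) for random inputs is a simple way to check that the shifted accumulation reproduces the direct sum. Evaluating ring_time for increasing p with fixed N shows the behaviour described in the text: the $C_f N^2/p$ term shrinks while the $C_c N$ term stays constant.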

