Public 2009/5/13

Hadoop and Hadoop Distributed File System (HDFS)
Copyright 2009 - Trend Micro Inc.

Outline
• Introduction to Hadoop project
• Introduction to Hadoop Distributed File System
  – Architecture
  – Administration
  – Client Interface
• Reference

What is Hadoop?
• Hadoop is a cloud computing platform for processing and storing vast amounts of data.
• Apache top-level project
• Open Source
• Hadoop Core includes
  – Hadoop Distributed File System (HDFS)
  – MapReduce framework
• Hadoop subprojects
  – HBase, ZooKeeper, …
• Written in Java.
• Client interfaces come in C++/Java/Shell Scripting
• Runs on
  – Linux, Mac OS X, Windows, and Solaris
  – Commodity hardware
[Diagram labels: A Cluster of Machines; Hadoop Distributed File System (HDFS); MapReduce; HBase; Cloud Applications]

A Brief History of Hadoop
• 2003.2
  – First MapReduce library written at Google
• 2003.10
  – Google File System paper published
• 2004.12
  – Google MapReduce paper published
• 2005.7
  – Doug Cutting reports that Nutch now uses a new MapReduce implementation
• 2006.2
  – Hadoop code moves out of Nutch into a new Lucene sub-project
• 2006.11
  – Google Bigtable paper published

A Brief History of Hadoop (continued)
• 2007.2
  – First HBase code drop from Mike Cafarella
• 2007.4
  – Yahoo! running Hadoop on a 1000-node cluster
• 2008.1
  – Hadoop made an Apache Top Level Project

Who uses Hadoop?
• Yahoo!
  – More than 100,000 CPUs in ~20,000 computers running Hadoop
• Google
  – University Initiative to Address Internet-Scale Computing Challenges
• Amazon
  – Amazon builds product search indices using the streaming API and pre-existing C++, Perl, and Python tools.
  – Processes millions of sessions daily for analytics
• IBM
  – Blue Cloud Computing Clusters
• Trend Micro
  – Threat solution research
• More…

• Single Namespace for entire cluster
• Very Large Distributed File System
  – 10K nodes, 100 million files, 10 PB
• Data Coherency
  – Write-once-read-many access model
  – A file, once created, written and closed, need not be changed.
  – Appending writes to existing files (in the future)
• Files are broken up into blocks
  – Typically 128 MB block size
  – Each block replicated on multiple DataNodes

HDFS Design
• Moving computation is cheaper than moving data
  – Data locations are exposed so that computations can move to where the data resides
• File replication
  – Default is 3 copies.
  – Configurable by clients
• Assumes Commodity Hardware
  – Files are replicated to handle hardware failure
  – Detects failures and recovers from them
• Streaming Data Access
  – High throughput of data access rather than low latency of data access
  – Optimized for Batch Processing

NameNode
• Manages File System Namespace
  – Maps a file name to a set of blocks
  – Maps a block to the DataNodes where it resides
• Cluster Configuration Management
• Replication Engine for Blocks

NameNode Metadata
• Meta-data in Memory
  – The entire metadata is in main memory
  – No demand paging of meta-data
• Types of Metadata
  – List of files
  – List of Blocks for each file
  – List of DataNodes for each block
  – File attributes, e.g. creation time, replication factor

NameNode Metadata (cont.)
• A Transaction Log (called EditLog)
  – Records file creations, file deletions, etc.
• FsImage
  – The entire namespace, the mapping of blocks to files, and file system properties are stored in a file called FsImage.
  – The NameNode can be configured to maintain multiple copies of FsImage and EditLog.
• Checkpoint
  – Occurs when the NameNode starts up
  – Reads FsImage and EditLog from disk and applies all transactions from the EditLog to the in-memory representation of the FsImage
  – Then flushes the new version out into a new FsImage

Secondary NameNode
• Copies FsImage and EditLog from the NameNode to a temporary directory
• Merges FsImage and EditLog into a new FsImage in the temporary directory.
• Uploads the new FsImage to the NameNode
  – Transaction Log on the NameNode is purged
[Diagram labels: FsImage; EditLog; FsImage (new)]

NameNode Failure
• A single point of failure
• Transaction Log stored in multiple directories
  – A directory on the local file system
  – A directory on a remote file system (NFS/CIFS)
• Need to develop a real HA solution
[Callout: SPOF!!]

DataNode
• A Block Server
  – Stores data in the local file system (e.g. ext3)
  – Stores meta-data of a block (e.g. CRC)
  – Serves data and meta-data to Clients
• Block Report
  – Periodically sends a report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
  – Forwards data to other specified DataNodes

HDFS - Replication
• Default is 3x replication.
• The block size and replication factor are configured per file.
• Block placement algorithm is rack-aware

Block Placement
• Strategy (v0.19 doc, 2009.2.25, in progress)
  – One replica on the local node in the local rack
  – Second replica on a different node in the local rack
  – Third replica in a remote rack
  – Additional replicas are randomly placed
• Clients read from the nearest replica

Heartbeats
• DataNodes send a heartbeat to the NameNode
  – Once every 3 seconds
• NameNode uses heartbeats to detect DataNode failure

Data Correctness
• Use Checksums to validate data
  – Use CRC32
• File Creation
  – Client computes a checksum per 512 bytes
  – DataNode stores the checksum
• File access
  – Cl
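
The Data Correctness slide says the client computes a CRC32 checksum for every 512 bytes of data and the DataNode stores those checksums. The sketch below illustrates that idea only; it is not Hadoop's code. The function names `chunk_checksums` and `find_corrupt_chunks` are hypothetical, and Python's standard `zlib.crc32` stands in for whatever CRC32 routine HDFS uses.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # chunk size stated on the slide


def chunk_checksums(data: bytes, chunk_size: int = BYTES_PER_CHECKSUM) -> list:
    """Return one CRC32 value per chunk_size-byte chunk of data."""
    return [
        zlib.crc32(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def find_corrupt_chunks(data: bytes, expected: list,
                        chunk_size: int = BYTES_PER_CHECKSUM) -> list:
    """Return the indices of chunks whose CRC32 differs from the stored value."""
    actual = chunk_checksums(data, chunk_size)
    if len(actual) != len(expected):
        raise ValueError("data length does not match the stored checksum count")
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# Hypothetical usage: the writer records checksums, a reader verifies them.
original = bytes(range(256)) * 5            # 1280 bytes -> 3 chunks
stored = chunk_checksums(original)

damaged = bytearray(original)
damaged[600] ^= 0xFF                        # flip bits inside the second chunk
print(find_corrupt_chunks(bytes(damaged), stored))
```

Checksumming small fixed-size chunks instead of the whole block lets a reader identify which part of a block is damaged without re-reading everything.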
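
The Block Placement slide lists the replica strategy as: local node, a different node in the local rack, a node in a remote rack, then random placement. A minimal sketch of that rule, exactly as the slide words it, follows. The `topology` dictionary, the `place_replicas` function, and the node and rack names are all hypothetical, and the real NameNode logic considers more than this (for example, node load and free space).

```python
import random


def place_replicas(topology, writer_rack, writer_node, replication=3, rng=None):
    """Pick DataNodes for one block following the strategy on the slide.

    topology maps a rack name to the list of node names in that rack.
    """
    rng = rng or random.Random()
    chosen = [writer_node]                      # 1st replica: the local node

    def pick(candidates):
        # Choose one unused node from candidates, if any remain.
        unused = [n for n in candidates if n not in chosen]
        if unused and len(chosen) < replication:
            chosen.append(rng.choice(unused))

    pick(topology[writer_rack])                 # 2nd: another node, same rack
    remote = [n for rack, nodes in topology.items()
              if rack != writer_rack for n in nodes]
    pick(remote)                                # 3rd: a node in a remote rack

    everything = [n for nodes in topology.values() for n in nodes]
    while len(chosen) < replication:            # extras: random placement
        before = len(chosen)
        pick(everything)
        if len(chosen) == before:               # cluster has no unused nodes
            break
    return chosen[:replication]


# Hypothetical two-rack cluster.
cluster = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5"]}
print(place_replicas(cluster, "rack1", "dn1", replication=3,
                     rng=random.Random(0)))
```

Keeping two replicas in one rack limits cross-rack write traffic, while the third replica in another rack protects the block against the loss of a whole rack.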
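
The checkpoint described on the NameNode Metadata and Secondary NameNode slides replays the EditLog on top of the FsImage, writes out a new FsImage, and purges the log. A toy model of that merge is sketched below. The in-memory formats are assumptions made for illustration (a dictionary for the FsImage, a list of tuples for the EditLog), and the `checkpoint` function is hypothetical; the real on-disk formats are different.

```python
def checkpoint(fsimage, edit_log):
    """Apply every logged transaction to a copy of the image.

    fsimage:  dict mapping a file path to its attributes (assumed format).
    edit_log: list of ("create", path, replication) or ("delete", path)
              tuples (assumed format).
    Returns (new_fsimage, purged_edit_log).
    """
    namespace = dict(fsimage)                   # load the old image
    for op, path, *args in edit_log:            # replay transactions in order
        if op == "create":
            namespace[path] = {"replication": args[0]}
        elif op == "delete":
            namespace.pop(path, None)
        else:
            raise ValueError(f"unknown transaction type: {op}")
    return namespace, []                        # new image, empty log


# Hypothetical usage.
old_image = {"/logs/a.txt": {"replication": 3}}
log = [("create", "/logs/b.txt", 2), ("delete", "/logs/a.txt")]
new_image, new_log = checkpoint(old_image, log)
print(new_image, new_log)
```

Merging on a copy leaves the old image untouched until the new one is complete, which mirrors the Secondary NameNode doing its work in a temporary directory before uploading the result.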