The Road to a Complete Tweet Index
@yz

Outline
1. Current Scale of Twitter Search
2. The History of Twitter Search Infra
3. Complete Tweet Index
4. Search Engine Applications
5. Outlook

Current Scale of Twitter Search
- More than 2 billion search queries per day.
- Hundreds of millions of Tweets are indexed per day.
- Hundreds of billions of Tweets have been sent since the company's founding in 2006.
- Our Complete Tweet Index is served by thousands of instances, each with 256 GB RAM and 2 TB SSD.
- But… our search infrastructure is currently supported by only a small number of engineers and SREs. We are hiring!

The History of Twitter Search Infra

2010: Realtime Search
Powered by replicated MySQL instances and MySQL text matching.
- Hundreds of Tweets per second.
- A few thousand queries per second.
- Basic text search: no fancy tokenization, no search assistance, slow geo search, etc.
- Many incidents and downtimes. MySQL master/slave dying was particularly problematic.

2011: Launched Lucene-based search engine: Earlybird*
- Lucene API, but custom data structures optimized for in-memory operations and Realtime search.
- Novel concurrent and lock-free memory models: concurrently writing and searching an index segment.
- Contains about 7 days of Tweets.
* ~jimmylin/publications/Busch_etal_ICDE2012.pdf

Earlybird vs Lucene/ElasticSearch

Earlybird                                    | Lucene/ElasticSearch
Optimized for in-memory data structures      | Optimized for disks
Optimized for Realtime indexing and updates  | Relatively slow Realtime indexing and updates
Optimized for Tweets                         | Index general documents
Facet & Term Statistics support              | N/A when we built Earlybird
Highly optimized for JVM garbage collection  | Generates relatively more garbage
Thrift Query/Schema/Doc APIs                 | JSON Query/Schema/Doc APIs

2011
Retired MySQL text matching, but still utilize MySQL to pipe data into Earlybird.
[Diagram labels: Tweet Firehose (JSON); Ingestion (Preprocessing, Analysis, Tokenization, Partitioning, etc.); Replicated MySQL; Earlybird indices]

2012
Eliminated Single Points of Failure via partitioning, decreasing the impact of MySQL master/slave failures.
[Diagram labels: Tweet Firehose (JSON); several Ingestion pipelines (Preprocessing, Tokenization, Analysis, Partitioning, etc.); several Replicated MySQL instances; several Earlybird indices. Hash Partitioning: Tweet ID % number of partitions]

2013-2015
Eliminating the use of MySQL as our data bus.
[Diagram labels: Raw Tweets (JSON); Twitter's Partitioned, Replicated, High-performance Messaging System; several Ingesters (Preprocessing, Tokenization, Partitioning, etc.); DistributedLog (Twitter's Open Source replicated log service); several Earlybird indices. Hash Partitioning: Tweet ID % number of partitions]

Complete Tweet Index

Motivation
Be able to search for any Tweet ever published, not just Tweets from the latest 7 days (approx. 300x scaling).

Existing Architecture Challenges
- Small team: limited number of engineers and SREs.
- The Realtime search in-memory architecture cannot hold hundreds of billions of Tweets in RAM. We simply do not have enough RAM, and even if we did, it would not be cost effective.
- Scaling is non-trivial: the Realtime search architecture has a roughly fixed size (7 days of Tweets), but the Complete Tweet Index needs to grow bigger each day.
- Ingestion parallelism is low and fixed. Parallelism is achieved via partitioning: 20 partitions means 20 parallel ingestion pipelines.

Complete Tweet Index Design Goals
- Index every Tweet ever published.
- Modularity: shared source code and tests between the Realtime and Complete Tweet Index where possible, which created a cleaner system in less time.
- Scalability: expands in place gracefully as more Tweets are added.
- Cost effectiveness: using the same RAM technology for the complete index would have been prohibitively expensive.
- Highly parallel ingestion: ability to fully rebuild the index in a reasonable amount of time.
- Simple interface: we wanted a simple interface that hides the underlying partitions so that internal clients can treat the cluster as a single endpoint.

Complete Tweet Index Design Overview
Batch
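
The slides give the partitioning rule (Tweet ID % number of partitions) and state the goal of a simple interface that hides the partitions behind a single endpoint. The sketch below illustrates those two ideas together. It is a minimal illustration under stated assumptions, not Earlybird's actual implementation: the names `partition_for`, `Partition`, and `SearchCluster` are hypothetical, the in-memory dictionary stands in for a real index segment, and ranking newest-first by Tweet ID is an assumption made only to keep the merge step simple.

```python
from __future__ import annotations

from dataclasses import dataclass, field


def partition_for(tweet_id: int, num_partitions: int) -> int:
    """Hash partitioning as stated on the slides: Tweet ID % number of partitions."""
    return tweet_id % num_partitions


@dataclass
class Partition:
    """Hypothetical stand-in for one index partition (not Earlybird's real API)."""

    docs: dict[int, str] = field(default_factory=dict)

    def index(self, tweet_id: int, text: str) -> None:
        self.docs[tweet_id] = text

    def search(self, term: str) -> list[int]:
        # Naive lowercase whitespace tokenization, for illustration only.
        needle = term.lower()
        return [tid for tid, text in self.docs.items()
                if needle in text.lower().split()]


class SearchCluster:
    """Single endpoint that hides the partitions from clients (hypothetical)."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions = [Partition() for _ in range(num_partitions)]

    def index(self, tweet_id: int, text: str) -> None:
        # Route each Tweet to exactly one partition.
        target = partition_for(tweet_id, len(self.partitions))
        self.partitions[target].index(tweet_id, text)

    def search(self, term: str, limit: int = 10) -> list[int]:
        # Scatter the query to every partition, then gather and merge the hits.
        hits: list[int] = []
        for partition in self.partitions:
            hits.extend(partition.search(term))
        # Assumption: larger Tweet IDs are newer, so sort descending.
        return sorted(hits, reverse=True)[:limit]


cluster = SearchCluster(num_partitions=20)
cluster.index(101, "hello search world")
cluster.index(142, "Hello again")
cluster.index(203, "unrelated tweet")
print(cluster.search("hello"))  # [142, 101]
```

Because each Tweet lands in exactly one partition, ingestion can run one pipeline per partition, which is why the slides note that 20 partitions means 20 parallel ingestion pipelines. The client only ever calls `SearchCluster.search`, so the number of partitions can change without touching client code.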
Title: The Evolution of Twitter's Search Engine Architecture, from Tens of Millions to Trillions of Indexed Documents