The Evolution of Twitter's Search Engine Architecture: From Tens of Millions to Trillions of Index Entries

The Road to a Complete Tweet Index
@yz

Outline
1. Current Scale of Twitter Search
2. The History of Twitter Search Infra
3. Complete Tweet Index
4. Search Engine Applications
5. Outlook

1. Current Scale of Twitter Search

More than 2 billion search queries per day.
Hundreds of millions of Tweets are indexed per day.
Hundreds of billions of Tweets have been sent since the company's founding in 2006.
Our Complete Tweet Index is served by thousands of instances, each with 256 GB RAM and 2 TB SSD.
But… our search infrastructure is currently supported by only a small number of engineers and SREs. We are hiring!

2. The History of Twitter Search Infra

2010: Realtime Search
Powered by replicated MySQL instances and MySQL text matching.
Hundreds of Tweets per second.
A few thousand queries per second.
Basic text search: no fancy tokenization, no search assistance, slow geo search, etc.
Many incidents and downtimes. A MySQL master/slave dying was particularly problematic.

2011: Launched a Lucene-based search engine, Earlybird*
Lucene API, but custom data structures optimized for in-memory operations and Realtime search.
Novel concurrent and lock-free memory models: concurrently writing and searching an index segment.
Contains about 7 days of Tweets.
* ~jimmylin/publications/Busch_etal_ICDE2012.pdf

Earlybird vs Lucene/ElasticSearch

Earlybird                                      | Lucene/ElasticSearch
Optimized for in-memory data structures        | Optimized for disks
Optimized for Realtime indexing and updates    | Relatively slow Realtime indexing and updates
Optimized for Tweets                           | Indexes general documents
Facet & Term Statistics support                | N/A when we built Earlybird
Highly optimized for JVM garbage collection    | Generates relatively more garbage
Thrift Query/Schema/Doc APIs                   | JSON Query/Schema/Doc APIs

2011: Retired MySQL text matching, but still utilized MySQL to pipe data into Earlybird.
[Diagram labels: Tweet Firehose (JSON); Ingestion (Preprocessing, Analysis, Tokenization, Partitioning, etc); Replicated MySQL; Earlybird]

2012: Eliminated Single Points of Failure via partitioning, decreasing the impact of MySQL master/slave failures.
[Diagram labels: Tweet Firehose (JSON); Ingestion (Preprocessing, Tokenization, Analysis, Partitioning, etc); Replicated MySQL; Earlybird Indices; Hash Partitioning: Tweet ID % number of partitions]

2013-2015: Eliminating the use of MySQL as our data bus.
[Diagram labels: Raw Tweets (JSON); Ingester (Preprocessing, Tokenization, Partitioning, etc); Twitter's Partitioned, Replicated, High-performance Messaging System; DistributedLog (Twitter's Open Source replicated log service); Earlybird Indices]

3. Complete Tweet Index

Complete Tweet Index Motivation
Be able to search for any Tweet ever published, not just Tweets from the latest 7 days (approx. 300x scaling).

Existing Architecture Challenges
Small team: limited number of engineers and SREs.
The Realtime search in-memory architecture cannot hold hundreds of billions of Tweets in RAM. We simply do not have enough RAM, and even if we did, it would not be cost effective.
Scaling is non-trivial: the Realtime search architecture has a roughly fixed size (7 days of Tweets), but the Complete Tweet Index needs to grow bigger each day.
Ingestion parallelism is low and fixed. Parallelism is achieved via partitioning: 20 partitions means 20 parallel ingestion pipelines.

Complete Tweet Index Design Goals
Index every Tweet ever published.
Modularity: shared source code and tests between the Realtime and Complete Tweet Index where possible, which created a cleaner system in less time.
Scalability: expands in place gracefully as more Tweets are added.
Cost effectiveness: using the same RAM technology for the complete index would have been prohibitively expensive.
Highly parallel ingestion: ability to fully rebuild the index in a reasonable amount of time.
Simple interface: we wanted a simple interface that hides the underlying partitions so that internal clients can treat the cluster as a single endpoint.

Complete Tweet Index Design Overview
Batch
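
The hash partitioning rule quoted on the slides (Tweet ID % number of partitions) and the design goal of a single endpoint that hides the partitions can be illustrated with a small sketch. This is not Twitter's code: the class `PartitionedIndex`, the function `partition_for`, the whitespace tokenizer, and the newest-first ordering by Tweet ID are all assumptions made for illustration, and the default of 20 partitions only echoes the "20 partitions" example from the slides.

```python
from collections import defaultdict

NUM_PARTITIONS = 20  # illustrative; echoes the "20 partitions" example


def partition_for(tweet_id: int, num_partitions: int = NUM_PARTITIONS) -> int:
    """Hash partitioning as quoted on the slides: Tweet ID % number of partitions."""
    return tweet_id % num_partitions


class PartitionedIndex:
    """Toy in-memory inverted index split into partitions (hypothetical class)."""

    def __init__(self, num_partitions: int = NUM_PARTITIONS):
        self.num_partitions = num_partitions
        # One term -> set-of-Tweet-IDs map per partition.
        self.partitions = [defaultdict(set) for _ in range(num_partitions)]

    def ingest(self, tweet_id: int, text: str) -> None:
        """Route a Tweet to exactly one partition and index its terms there."""
        index = self.partitions[partition_for(tweet_id, self.num_partitions)]
        for term in text.lower().split():  # assumption: naive whitespace tokenizer
            index[term].add(tweet_id)

    def search(self, term: str) -> list[int]:
        """Single entry point: query every partition and merge the results."""
        term = term.lower()
        hits = []
        for index in self.partitions:
            hits.extend(index.get(term, ()))
        # Assumption: larger Tweet IDs are newer, so this returns newest first.
        return sorted(hits, reverse=True)


if __name__ == "__main__":
    idx = PartitionedIndex(num_partitions=4)
    idx.ingest(101, "Hello search")
    idx.ingest(102, "hello world")
    print(partition_for(101, 4), idx.search("hello"))
```

Because each Tweet lands in exactly one partition, every partition can be fed by its own ingestion pipeline, which is why the number of partitions bounds ingestion parallelism in the architecture the slides describe. The caller of `search` never sees the partitions, which mirrors the "simple interface" goal.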
