Text Mining: Techniques and Applications

Dr. Choochart Haruechaiyasak, Ph.D.
Human Language Technology (HLT)
National Electronics and Computer Technology Center (NECTEC)
Rev. 8 March 2007

Lecture Outline
– Overview: Scope and Tasks
– Technique: Information Extraction (IE)
– Application: Tech Mining, the application of text mining to science and technology (S&T) information

Overview: Scope and Tasks

Text Mining is about...
"Sifting through vast collections of unstructured or semistructured data beyond the reach of data mining tools, text mining tracks information sources, links isolated concepts in distant documents, maps relationships between activities, and helps answer questions."
Tapping the Power of Text Mining, Communications of the ACM, Sept. 2006

Humans vs. Computers
Humans: ability to distinguish and apply linguistic patterns to text
– Can overcome language difficulties such as slang, spelling variations, and contextual meaning.
Computers: ability to process text in large volumes at high speed
– Can sift through a large collection of texts to find simple statistics and relationships among terms in an instant.
Text mining requires a combination of both: human linguistic capability (NLP) + the computer's speed and accuracy (Data Mining).

Text Mining Tasks
NLP + Data Mining Tasks
NLP:
– Lexical/Morphological Analysis
– Tagging/Chunking
– Named Entity Recognition (NER)
– Syntactic Analysis (shallow parsing)
– Word Sense Disambiguation
– Semantic Analysis
– Reference Resolution
– Discourse Analysis
Data Mining:
– Classification (supervised learning)
– Clustering (unsupervised learning)
– Association Rule Mining
– Sequential Pattern Analysis
– Regression Analysis
– Dependency Modeling
– Change and Deviation Detection

Text Mining Tasks
Information extraction:
– Analyze unstructured text and identify key phrases and relationships within the text.
Topic detection and tracking:
– Filter and present only documents relevant to the user profile.
Summarization:
– Reduce the content by retaining only its main points and overall meaning.
Categorization:
– Automatically classify documents into predefined categories.
Clustering:
– Group documents based on their similarity.

Text Mining Tasks (cont'd)
Concept Linkage
– Connect related documents by identifying their shared concepts, helping users find information they perhaps wouldn't have found through traditional search methods.
Information Visualization
– Represent documents or information in graphical formats for easy browsing, viewing, or searching.
Question answering (Q&A)
– Search for and extract the best answer to a given question.

Example: Concept Linkage
Biomedicine: Co-occurrence of terms

Example: Concept Linkage
Biomedicine: Entities & Relationship

Example: Search Result Clustering
vivisimo.com

Example: Question & Answering
ask.com

Example: Information Visualization
kartoo.com

Technique: Information Extraction

What is Information Extraction?
First, a note on this misunderstanding: Information Retrieval doesn't retrieve information.
– You have an information need, but what you get back isn't information but documents, which you hope have the information.
Information extraction is one approach to going further for a special case:
– There's some relation you're interested in.
– Your query is for elements of that relation.
– A limited form of natural language understanding.

Information Extraction (IE)
– Identify specific pieces of information (data) in an unstructured or semi-structured textual document.
– Transform unstructured information in a corpus of documents or web pages into a structured database.
– Applied to different types of text:
   – Newspaper articles
   – Web pages
   – Scientific articles
   – Newsgroup messages
   – Classified ads
   – Medical notes

Applications
– Job postings/resumes
– Seminar announcements
– Company information from the web
– Continuing education course info from the web
– University information from the web
– Apartment rental ads
– Molecular biology information from MEDLINE

Extracting Corporate Information
– Data automatically extracted from marketsoft.com
– Source web page. Color highlights indicate type of information (e.g., red = name).
– E.g., information need: Who is the CEO of MarketSoft?
Source: Whizbang! Labs / Andrew McCallum

Shopping
Commercial Information
– Need this price
– Title
– A book, not a toy
Product Information

Difficult Because of Textual Inconsistency
Digital Cameras:
– Image Capture Device: 1.68 million pixel 1/2-inch CCD sensor
– Image Capture Device: Total Pixels Approx. 3.34 million, Effective Pixels Approx. 3.24 million
– Image sensor: Total Pixels: Approx. 2.11 million-pixel
– Imaging sensor: Total Pixels: Approx. 2.11 million 1,688 (H) x 1,248 (V)
– CCD Total Pixels: Approx. 3,340,000 (2,140[H] x 1,560[V])
– Effective Pixels: Approx. 3,240,000 (2,088[H] x 1,550[V])
– Recording Pixels: Approx. 3,145,000 (2,048[H] x 1,536[V])
These all came off the same manufacturer's website!!

Classified Advertisements (Real Estate)
Background: Advertisements are plain text

<ADNUM>2067206v1</ADNUM>
<DATE>March 02, 1998</DATE>
<ADTITLE>MADDINGTON $89,000</ADTITLE>
<ADTEXT>
OPEN 1.00 - 1.45<BR>
U 11 / 10 BERTRAM ST<BR>
NEW TO MARKET Beautiful<BR>
3 brm freestanding<BR>
villa, close to shops & bus<BR>
Owner moved to Melbourne<BR>
ideally suit 1st home buyer,<BR>
investor & 55 and over.<BR>
Brian Hazelden 0418 958 996<BR>
R WHITE LEEMING 9332 3477
</ADTEXT>

What you search for in real estate advertisements:
– Towns: you might think easy, but:
   – Real estate agents: Coldwell Banker, Mosman
   – Phrases: Only 45 minutes from Parramatta
   – Multiple property ads have different towns
– Money: want a range not a textual ma
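
The real estate advertisement above can serve as a small worked example of information extraction. The sketch below is not from the lecture; it is a minimal rule-based illustration, assuming the ad text has already been separated from its markup. The regular expressions, the function names `extract_ad` and `in_price_range`, and the choice to read the town from the ad title are all hypothetical and tuned to this single ad. It shows why a structured record helps with the "Money" point: once the price is a number, a range query becomes a simple comparison.

```python
import re

# Ad title and body copied from the classified ad above (markup removed,
# one line per line break in the ad).
AD_TITLE = "MADDINGTON $89,000"
AD_TEXT = """OPEN 1.00 - 1.45
U 11 / 10 BERTRAM ST
NEW TO MARKET Beautiful
3 brm freestanding
villa, close to shops & bus
Owner moved to Melbourne
ideally suit 1st home buyer,
investor & 55 and over.
Brian Hazelden 0418 958 996
R WHITE LEEMING 9332 3477"""

# Hypothetical patterns, written for this one ad; real ads would need more.
PRICE_RE = re.compile(r"\$(\d{1,3}(?:,\d{3})*)")          # e.g. $89,000
BEDROOMS_RE = re.compile(r"(\d+)\s*brm\b", re.IGNORECASE)  # e.g. 3 brm
PHONE_RE = re.compile(r"\b(?:\d{4} \d{3} \d{3}|\d{4} \d{4})\b")


def extract_ad(title, text):
    """Turn one plain-text ad into a structured record (a dict)."""
    price = PRICE_RE.search(title)
    beds = BEDROOMS_RE.search(text)
    return {
        # Assumption: the town is whatever precedes the price in the title.
        "town": title.split("$")[0].strip(),
        "price": int(price.group(1).replace(",", "")) if price else None,
        "bedrooms": int(beds.group(1)) if beds else None,
        "phones": PHONE_RE.findall(text),
    }


def in_price_range(record, low, high):
    """Range query on the extracted numeric price."""
    return record["price"] is not None and low <= record["price"] <= high


record = extract_ad(AD_TITLE, AD_TEXT)
print(record)
print(in_price_range(record, 80_000, 100_000))
```

A keyword search for a price string would miss this ad unless the query matched the exact characters; with the extracted record, a query for properties between 80,000 and 100,000 matches on the numeric value instead. Hand-written patterns like these break easily on other ads, which is the textual-inconsistency problem the slides describe.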