Document Analysis Technology: Research Status and Trends
Cheng-Lin Liu (刘成林)
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
liucl@nlpr.ia.ac.cn
The 4th National Symposium on Text and Computing (第四届全国文字与计算学术研讨会), October 25-26, 2014, Beijing

Outline
• Documents and document analysis
• Research problems in document analysis
• Brief history of the field
• State of research
  – Main methods
  – Performance status
• Status in China
• Trends and outlook

Character Recognition and Document Analysis
• Character Recognition
  – Converts character images into symbol codes
• Document Analysis (DA)
  – Extracts textual information from document images
  – Includes text segmentation, recognition, contextual processing, semantic information extraction, etc.
• Significance of document analysis
  – Data compression
  – Content understanding / semantic extraction
• Relationship to Text Analysis (NLP)
  – Text Analysis: starts from electronic text
  – DA: starts from images
  – Electronic documents (e.g., PDF): on the boundary; DA makes use of their structural information

Types of Documents
• What is a document
  – Paper, images, or electronic files that carry written symbols
• [Figure: taxonomy of documents]
  – Top level: Physical documents; Synthesized documents
  – Second level: Scanned/camera paper docs; Scene text images; Online handwriting; Synthesized doc images; Structured electronic docs
  – Leaf labels: Printed, Handwritten; On various surfaces; Texts, graphics; Texts/graphics; Captions

Specific Types and Applications
• Paper Documents
  – Books, journals, newspapers, letter/parcel envelopes, certificates, notes, forms, business cards, engineering drawings, musical scores
• Scene Texts
  – Signboards, license plates, street numbers
  – Texts on wood, metal, cloth, oracle bones, etc.
• Online Handwriting
  – Texts, graphics, signatures, mathematics, sketches, gestures
• Synthesized Document Images
  – Web doc images
  – Captions on image/video
• Structured Documents
  – Web pages
  – Word doc, PDF, RTF, etc.

Research Problems in Document Analysis
• From Image to Semantics: Image Processing, Layout Analysis, Contents Recognition, Semantic/Application

Image Processing
• Enhancement
  - Contrast
  - Denoising
• Rectification
  - Illumination
  - Skew
  - Perspective
• Binarization
• Frame removal

Layout Analysis
• Region separation
• Zone classification
• Text localization
• Text line segmentation
• Hand/print separation
• Table/form analysis
• Signature/logo/stamp extraction

Contents Recognition
• Text recognition
  - Character segmentation
  - Normalization
  - Feature extraction
  - Classifier design
  - Sequence model
  - Linguistic/geometric contexts
• Graphics/symbol
  - Diagrams
  - Engineering drawings
  - Musical scores
  - Mathematics
  - Physical/chemical symbols
• Style authentication
  - Font identification
  - Script identification
  - Writer identification
  - Signature verification

Semantic/Application
• Structural understanding
  - Logical relations between zones
  - Reconstruction
• Retrieval
  - Keyword spotting
  - Content-based
  - Structure-based
• Semantic analysis
  - Categorization
  - Summarization
  - Translation
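
As a concrete illustration of the Binarization step listed under Image Processing, here is a minimal sketch in plain Python. It uses Otsu's global thresholding, which is one common choice and is not named in the slides. The function names `otsu_threshold` and `binarize` are hypothetical, and the sketch assumes an 8-bit grayscale image given as a list of rows, with dark ink on a light background.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int], levels: int = 256) -> int:
    """Return the gray level t that maximizes the between-class variance.

    Pixels with value <= t form the dark class, pixels > t the light class.
    """
    # Gray-level histogram.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of gray values in the dark class so far
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break  # light class empty from here on
        mean0 = sum0 / w0
        mean1 = (total_sum - sum0) / w1
        # Between-class variance, up to a constant factor 1 / total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(image: List[List[int]]) -> List[List[int]]:
    """Map dark pixels (assumed to be ink) to 1 and light pixels to 0."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]


if __name__ == "__main__":
    page = [[200, 210, 30],
            [205, 25, 35]]
    print(otsu_threshold([p for row in page for p in row]))
    print(binarize(page))
```

A single global threshold such as this tends to work on evenly lit scans. Unevenly lit camera images are the kind of case that the illumination rectification item in the same column targets, and they usually call for a locally adaptive threshold instead.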