尚学堂科技 (Shangxuetang Technology), 张志宇 (Zhang Zhiyu)
Lucene: Building a Simple Web Search Program

Software used:
  lucene 2.3.2
  tomcat 6.0.16
  je-analysis 1.4.0
  lukeall 0.7.1
  MySQL JDBC driver 3.1.13
  Tidy 04aug2000r7
  MyEclipse 6.0M1_E3.3

Project duration: 3-4 days

Goals:
  Introduction to Lucene
  The concepts of full-text search and the inverted index
  Building an index
  Searching
  Implementing Chinese word segmentation
  Introduction to Nutch

Related topics tied together:
  HTML, CSS, JavaScript, servlet, JSP, MySQL
  Introduce the MVC concept
  Demonstrate how to borrow a mature JavaScript framework (for example, Rico) to achieve special page effects
  Learn to use MyEclipse
  Become familiar with using the MySQL database

When to use Lucene:
  A database holds a large amount of data and its text fields contain a lot of content
  The documents are unstructured

1. Install MyEclipse

  Create a project: Web Project, project name "lucene".
  Configure the Tomcat server (benefit: automatic deployment): Window > Show View > Servers.
  Deploy the web app: click the Deploy button and add the project to Tomcat.
  Web Browser window (Show View > Web Browser): it is better not to use this built-in browser.
  Import the jar packages: under the lucene project folder, create a lib directory and copy the following jars into it:
    lucene-core-2.2.0.jar
    Tidy.jar
    lucene-analyzers-2.2.0.jar (from lucene-2.2.0\lucene-2.2.0\contrib\analyzers)
    je-analysis-1.4.0.jar
    mysql-connector-java-3.1.13-bin.jar
  Show line numbers in the editor.
  Alt+/ is the auto-complete shortcut. Known problem: the shortcut sometimes has no effect.

2. Build an index for one file (English)

  Confirm that lucene-core-2.2.0.jar has been imported.
  Difference between Field.Store.YES and Field.Store.NO: YES keeps the original value in the index so it can be read back from a search hit; NO does not store the value, although the field can still be indexed and searched.
  termVector was added in Lucene 1.4.3. It provides a vector mechanism for fuzzy queries and is rarely used.
  DateTools.timeToString

  IndexHTML.java

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class IndexHTML {
        // Directory where the index is written
        static String index = "D:\\share\\05_Servlet_JSP\\tomcat\\apache-tomcat-5.5.17\\index";
        // File to index
        static String root = "D:\\share\\lucene\\soft\\lucene-2.2.0\\lucene-2.2.0\\docs\\api\\index.html";

        public static void main(String args[]) throws Exception {
            // true = create a new index, replacing any existing one
            IndexWriter writer = new IndexWriter(index, new StandardAnalyzer(), true);
            Document doc = new Document();
            File f = new File(root);
            // Path: stored, indexed as a single untokenized term
            doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
            // Content: tokenized for searching, not stored
            doc.add(new Field("content", "我们是共产主义接班人", Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            writer.optimize();
            writer.close();
        }
    }

3. How do you confirm that the index was built correctly?

    java -jar lukeall-0.7.1.jar

4. Tomcat configuration

  Copy into \WEB-INF\lib\:
    lucene-core-2.2.0.jar
    je-analysis-1.4.0.jar
  Make sure port 8080 is available.
  reloadable: in C:\tomcat\conf\context.xml, set
    <Context reloadable="true">

5. Build an index for files (recursively)

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.FileReader;
    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.CorruptIndexException;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.LockObtainFailedException;

    public class IndexHTML1 {
        static IndexWriter writer;

        public static void main(String[] args) throws Exception {
            String root = "D:\\share\\01_J2SE\\soft\\html_zh_CN\\html\\zh_CN\\api\\java\\lang";
            String index = "D:\\share\\tools\\apache-tomcat-6.0.14\\apache-tomcat-6.0.14\\index_cn";
            writer = new IndexWriter(index, new StandardAnalyzer(), true);
            File f = new File(root);
            indexDocs(f);
            writer.optimize();
            writer.close();
        }

        // Walk the directory tree and index every file found
        private static void indexDocs(File f) throws Exception {
            if (f.isDirectory()) {
                File[] subs = f.listFiles();
                for (int i = 0; i < subs.length; i++) {
                    indexDocs(subs[i]);
                }
            } else {
                indexDoc(f);
            }
        }

        // Add one file to the index
        private static void indexDoc(File f) throws Exception {
            System.out.println(f.getPath());
            Document doc = new Document();
            doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
            // A Reader-valued field is tokenized and indexed but not stored
            doc.add(new Field("content", new FileReader(f)));
            writer.addDocument(doc);
        }
    }

6. Build an index for files (using Tidy)

  Confirm that Tidy.jar has been imported.
  Confirm that je-analysis-1.4.0.jar has been imported.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.text.DecimalFormat;

    import jeasy.analysis.MMAnalyzer;

    import org.apache.lucene.document.DateTools;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;
    import org.w3c.dom.Text;
    import org.w3c.tidy.Tidy;

    public class IndexHTMLTidy {
        // Directory where the index is built
        static String index = "C:\\tomcat\\index_cn";
        // English content
        // static String root =
        // "G:\\lessons\\lucene\\student\\soft\\lucene-2.2.0\\lucene-2.2.0\\docs\\api\\index.html";
        // Chinese content; the pages under java.lang are enough
        static String root = "E:\\app\\develop\\java\\api\\html_zh_CN\\html\\zh_CN\\api\\java\\lang";
        static Document doc = null;
        static IndexWriter writer = null;

        public static void main(String[] args) throws Exception {
            writer = new IndexWriter(index, new MMAnalyzer(), true);
            File f = new File(root);
            indexDocs(f);
            writer.addDocument(doc);
            writer.optimize();
            writer.close();
            System.out.println("ok...");
        }

        // Walk the directory tree and index only .html files
        public static void indexDocs(File f) throws Exception {
            if (f.isDirectory()) {
                String file[] = f.list();
                for (int i = 0; i < file.length; i++) {
                    indexDocs(new File(f, file[i]));
                }
            } else if (f.getName().endsWith(".html")) {
                indexDoc(f);
            }
        }

        public static void indexDoc(File f) throws Exception {
            doc = new Document();
            System.out.println(f.getPath());
            // Path: stored only, not searchable
            doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.NO));
            // Size: zero-padded to ten digits
            String size = new DecimalFormat("0000000000").format(f.length());
            doc.add(new Field("size", size, Field.Store.YES, Field.Index.UN_TOKENIZED));
            // Last-modified date at day resolution
            doc.add(new Field("lastmodified",
                    DateTools.timeToString(f.lastModified(), DateTools.Resolution.DAY),
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
            Tidy tidy = new Tidy();
            tidy.setQuiet(true);
            tidy.setShowWarnings(false);
            // This version produces garbled characters:
            // org.w3c.dom.Document root = tidy.parseDOM(new
            // FileInputStream(f), System.out);
            // Fix for the garbled-character problem
            // jav
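The indexDoc method above writes the file size through DecimalFormat("0000000000") before indexing it as an untokenized term. A likely reason is that index terms are plain strings and are compared character by character, so unpadded numbers would not order numerically. The following is a minimal standalone sketch of that effect using only the JDK; the class name PaddedSizeDemo, the helper pad, and the sample sizes are hypothetical and not part of the original notes.

```java
import java.text.DecimalFormat;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PaddedSizeDemo {
    // Same pattern as the size field in indexDoc: ten digits, left-padded with zeros.
    static String pad(long size) {
        return new DecimalFormat("0000000000").format(size);
    }

    public static void main(String[] args) {
        long[] sizes = {9, 120, 45}; // hypothetical file sizes in bytes

        List<String> raw = new ArrayList<String>();
        List<String> padded = new ArrayList<String>();
        for (long s : sizes) {
            raw.add(Long.toString(s));
            padded.add(pad(s));
        }

        // String sorting compares character by character, as term comparison does.
        Collections.sort(raw);
        Collections.sort(padded);

        System.out.println(raw);    // [120, 45, 9]: not in numeric order
        System.out.println(padded); // [0000000009, 0000000045, 0000000120]
    }
}
```

With padding, string order and numeric order agree, which is what a range comparison over the size field would need. Ten digits covers sizes up to 9,999,999,999 bytes; larger values would need a wider pattern.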
Link: https://www.777doc.com/doc-4825648.html