Lucene's In-Memory Index and On-Disk Index
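1. What kinds of data do we deal with?
Data in everyday life falls into structured data (data of fixed or bounded length, such as database records and metadata) and unstructured data (data whose length or format is not fixed). We sometimes also meet semi-structured data such as HTML and XML.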
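2. Lucene is a Java-based full-text search library. Full-text search covers two things: creating the index and searching the index. Today we focus mainly on index creation.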
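To make "creating the index" and "searching the index" concrete, here is a minimal sketch of an inverted index in plain Java. It is not how Lucene implements things internally; the class TinyInvertedIndex and its methods are hypothetical names made up for this illustration, and the tokenization (lowercase, split on non-word characters) is a deliberately crude assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy class for illustration only; not part of Lucene.
class TinyInvertedIndex {
    // term -> ids of the documents that contain it, in ascending order
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    // Index creation: split the text into lowercase terms and record the doc id under each term.
    int add(String text) {
        int docId = docs.size();
        docs.add(text);
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> ids = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Skip the id if this term already appeared earlier in the same document.
            if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) {
                ids.add(docId);
            }
        }
        return docId;
    }

    // Searching: a single-term lookup is just a map access.
    List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.<Integer>emptyList());
    }

    String doc(int id) {
        return docs.get(id);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("Lucene in Action");
        index.add("Lucene for Dummies");
        index.add("Managing Gigabytes");
        for (int id : index.search("lucene")) {
            System.out.println(id + ": " + index.doc(id));
        }
    }
}
```

The point of the sketch is only the split of work: all the expensive processing happens once in add, so that search stays cheap. A real library adds analyzers, scoring, and a persistent storage format on top of this idea.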
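First, let's look at the class hierarchy of Lucene's index-related classes.
As the class hierarchy above shows, Lucene provides several index implementations. The two most commonly used kinds are the in-memory index and the on-disk index.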
3. On-disk index
The commonly used on-disk index classes are SimpleFSDirectory, NIOFSDirectory and MMapDirectory. The latter two handle multithreaded access better.
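Why the latter two cope better with many threads is easier to see from the underlying JDK calls. As far as I understand the Lucene 4.x sources, SimpleFSDirectory reads through java.io.RandomAccessFile, NIOFSDirectory uses positional reads on a FileChannel, and MMapDirectory maps the file into memory. The sketch below uses only the JDK, not Lucene; the class FileReadStrategies and its method names are hypothetical and exist only to contrast the three styles.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration class; plain JDK I/O, no Lucene involved.
class FileReadStrategies {
    // seek + read share one file pointer, so concurrent readers have to take turns.
    static byte readWithSeek(RandomAccessFile file, long pos) throws IOException {
        synchronized (file) {
            file.seek(pos);
            return file.readByte();
        }
    }

    // Positional read: the offset travels with the call, so no shared pointer and no lock here.
    static byte readPositional(FileChannel channel, long pos) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(1);
        if (channel.read(buf, pos) != 1) {
            throw new IOException("position past end of file: " + pos);
        }
        return buf.get(0);
    }

    // Memory-mapped read: an absolute get on the mapped buffer, served from the OS page cache.
    static byte readMapped(MappedByteBuffer map, int pos) {
        return map.get(pos);
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("segment", ".bin");
        path.toFile().deleteOnExit();
        Files.write(path, new byte[] {10, 20, 30, 40});
        try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "r");
             FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            System.out.println(readWithSeek(raf, 2));
            System.out.println(readPositional(channel, 2));
            System.out.println(readMapped(map, 2));
        }
    }
}
```

All three calls return the same byte at offset 2; what differs is how much coordination concurrent readers need. Only the first style forces every reader through one lock, which is the usual explanation for the weaker multithreaded behaviour of the simple implementation.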
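4. In-memory index
As the name suggests, an in-memory index is built in memory. Its advantage is fast access and querying; its drawback is that, because it lives in memory, the index is gone as soon as the JVM exits.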
5. Demo
package HelloLucene;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;

public class HelloLucene {

    public static void main(String[] args) throws IOException, ParseException {
        // 0. Specify the analyzer for tokenizing text.
        //    The same analyzer should be used for indexing and searching.
        //    Note: the standard analyzer does not handle Chinese text well.
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);

        // 1. Create the index.
        //    For an in-memory index, use this line instead of the SimpleFSDirectory one:
        //    Directory index = new RAMDirectory();
        Directory index = new SimpleFSDirectory(new File("/user/lucene"));
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
        // With the default open mode, re-running against the same directory
        // appends the four documents again instead of replacing them.
        IndexWriter w = new IndexWriter(index, config);
        addDoc(w, "Lucene in Action", "193398817");
        addDoc(w, "Lucene for Dummies", "55320055Z");
        addDoc(w, "Managing Gigabytes", "55063554A");
        addDoc(w, "The Art of Computer Science", "9900333X");
        w.close();

        // 2. Build the query.
        String querystr = args.length > 0 ? args[0] : "Lucene";
        // The "title" arg specifies the default field to use
        // when no field is explicitly specified in the query.
        // A malformed query now fails here with a ParseException
        // instead of leaving q null and failing later in search().
        Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr);

        // 3. Search.
        int hitsPerPage = 10;
        IndexReader reader = DirectoryReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        // 4. Display results.
        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
        }

        // The reader can only be closed when there
        // is no need to access the documents any more.
        reader.close();
    }

    private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
        Document doc = new Document();
        // TextField is tokenized by the analyzer, so individual words in the title are searchable.
        doc.add(new TextField("title", title, Field.Store.YES));
        // Use a StringField for isbn because we don't want it tokenized.
        doc.add(new StringField("isbn", isbn, Field.Store.YES));
        w.addDocument(doc);
    }
}
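6. Index optimization
a. fsIndexWriter.optimize() optimizes the index. Note that optimize() was deprecated in Lucene 3.5 and removed in 4.0, the version the demo above targets; on 4.0, use forceMerge instead.
b. forceMerge(int maxNumSegments) forces the index to be merged down to at most maxNumSegments segments. This is a fairly I/O-intensive operation. It merges the small segment files; documents that another thread adds while the merge is running are not included in it unless a new merge is triggered.
c. There are other approaches as well, such as combining an in-memory index with an on-disk index and keeping the data that is searched most often in memory.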
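The forceMerge behaviour from item b can be pictured with a toy model in plain Java. This is only a sketch under heavy simplification: a "segment" here is just a list of titles, the merge always picks the two smallest segments, and everything runs sequentially. ToySegments, flush and segmentCount are hypothetical names for this illustration; Lucene's real segments and merge policies are far more involved.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical toy model for illustration only; not Lucene's implementation.
class ToySegments {
    // Each segment is modelled as the list of document titles it holds.
    private final List<List<String>> segments = new ArrayList<>();

    // Every flush of newly added documents creates one new small segment.
    void flush(List<String> newDocs) {
        segments.add(new ArrayList<>(newDocs));
    }

    // Merge the two smallest segments repeatedly until at most maxNumSegments remain.
    void forceMerge(int maxNumSegments) {
        if (maxNumSegments < 1) {
            throw new IllegalArgumentException("maxNumSegments must be >= 1");
        }
        while (segments.size() > maxNumSegments) {
            segments.sort(Comparator.comparingInt((List<String> s) -> s.size()));
            List<String> merged = new ArrayList<>(segments.remove(0));
            merged.addAll(segments.remove(0));
            segments.add(merged);
        }
    }

    int segmentCount() {
        return segments.size();
    }

    public static void main(String[] args) {
        ToySegments index = new ToySegments();
        index.flush(List.of("Lucene in Action"));
        index.flush(List.of("Lucene for Dummies", "Managing Gigabytes"));
        index.forceMerge(1);
        System.out.println("after merge: " + index.segmentCount());
        // Documents that arrive after the merge land in a fresh segment
        // and stay separate until another merge is triggered.
        index.flush(List.of("The Art of Computer Science"));
        System.out.println("after new docs: " + index.segmentCount());
    }
}
```

The model shows why every merge is costly (all documents of the merged segments are rewritten) and why late arrivals remain in their own segment until the next merge.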