Lucene In-Memory Indexes and On-Disk Indexes

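1. What kinds of data do we deal with?

Data in everyday life falls into structured data (data with a fixed or bounded length, such as database records and metadata) and unstructured data (data whose length or format is not fixed). We also sometimes meet semi-structured data such as HTML and XML.

2. Lucene is a Java-based full-text search library. Full-text search has two parts: creating the index and searching the index. Today we mainly cover index creation.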
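To make the two parts concrete, here is a minimal sketch of an inverted index in plain Java. It is not how Lucene is implemented; the class `TinyInvertedIndex` and its methods are made up for illustration, and tokenization is assumed to be nothing more than lowercasing and splitting on non-alphanumeric characters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy inverted index: NOT Lucene's implementation, only an illustration
// of the two phases (index creation and index search).
public class TinyInvertedIndex {
    // term -> sorted IDs of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    // Phase 1: index creation. Returns the ID assigned to the document.
    public int add(String text) {
        int docId = docs.size();
        docs.add(text);
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Phase 2: index search. One map lookup instead of scanning every document.
    public List<Integer> search(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? Collections.<Integer>emptyList() : new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("Lucene in Action");
        index.add("Lucene for Dummies");
        index.add("Managing Gigabytes");
        System.out.println(index.search("lucene")); // prints [0, 1]
    }
}
```

The point of the sketch is only the split of work: all the cost of tokenizing is paid once when a document is added, so a search is a lookup by term.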

First, let's look at the class hierarchy of Lucene's index-related classes.

[Figure: class hierarchy of Lucene's index-related classes]

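From the class hierarchy above we can see that Lucene provides several index implementations. The two kinds used most often are in-memory indexes and on-disk indexes.

3. On-disk indexes

The commonly used on-disk index classes are SimpleFSDirectory, NIOFSDirectory, and MMapDirectory. The latter two handle multithreaded access better.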
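The difference in multithreaded behavior comes from the file I/O each class is built on. According to the Lucene javadocs, SimpleFSDirectory reads through java.io.RandomAccessFile and synchronizes when several threads read the same file, NIOFSDirectory uses positional reads on a java.nio FileChannel, and MMapDirectory uses memory-mapped I/O. The sketch below uses only the JDK (no Lucene) to show the three styles side by side; the class `ReadStrategies` and its method names are made up for illustration.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// JDK-only illustration of three ways to read one byte at a given offset.
public class ReadStrategies {

    // RandomAccessFile style: seek() and read() share a single file pointer,
    // so threads sharing the same handle have to take turns under a lock.
    // (Here the file is opened per call, so the lock is never contended.)
    public static byte readWithRandomAccessFile(Path file, long pos) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            synchronized (raf) {
                raf.seek(pos);
                return raf.readByte();
            }
        }
    }

    // FileChannel style: the offset is an argument of read(), no shared
    // file pointer is moved, so no lock is needed around the call.
    public static byte readWithFileChannel(Path file, long pos) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(1);
            if (ch.read(buf, pos) != 1) {
                throw new EOFException("no byte at offset " + pos);
            }
            return buf.get(0);
        }
    }

    // Memory-mapped style: the file appears as a buffer in memory and is
    // read with an absolute get(), again without a shared file pointer.
    public static byte readWithMmap(Path file, long pos) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return map.get((int) pos);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("segment", ".bin");
        file.toFile().deleteOnExit();
        Files.write(file, new byte[] {10, 20, 30, 40});
        System.out.println(readWithRandomAccessFile(file, 2)); // prints 30
        System.out.println(readWithFileChannel(file, 2));      // prints 30
        System.out.println(readWithMmap(file, 2));             // prints 30
    }
}
```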

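4. In-memory indexes

As the name suggests, an in-memory index is built in memory. The advantage is fast access and querying; the drawback is that, because it lives in memory, the index is lost when the JVM exits.

5. Demo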

package HelloLucene;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;

public class HelloLucene {

   public static void main(String[] args) throws IOException {
       // 0. Specify the analyzer for tokenizing text.
       //    The same analyzer should be used for indexing and searching.
       //    Note: the standard analyzer does not handle Chinese text very well.
       StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);

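       // 1. create the index
       //    SimpleFSDirectory keeps the index on disk, so it survives a JVM restart.
       //    For an in-memory index, use the commented-out RAMDirectory line instead.
//       Directory index = new RAMDirectory();
       Directory index = new SimpleFSDirectory(new File("/user/lucene"));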

       IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);

       IndexWriter w = new IndexWriter(index, config);
       addDoc(w, "Lucene in Action", "193398817");
       addDoc(w, "Lucene for Dummies", "55320055Z");
       addDoc(w, "Managing Gigabytes", "55063554A");
       addDoc(w, "The Art of Computer Science", "9900333X");
       w.close();

       // 2. query
       String querystr = args.length > 0 ? args[0] : "Lucene";

       // the "title" arg specifies the default field to use
       // when no field is explicitly specified in the query.
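       Query q;
       try {
           q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr);
       } catch (org.apache.lucene.queryparser.classic.ParseException e) {
           // Fail fast: carrying on with a null query would only fail later in search().
           throw new IllegalArgumentException("Cannot parse query: " + querystr, e);
       }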

       // 3. search
       int hitsPerPage = 10;
       IndexReader reader = DirectoryReader.open(index);
       IndexSearcher searcher = new IndexSearcher(reader);
       TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
       searcher.search(q, collector);
       ScoreDoc[] hits = collector.topDocs().scoreDocs;

       // 4. display results
       System.out.println("Found " + hits.length + " hits.");
       for (int i = 0; i < hits.length; ++i) {
           int docId = hits[i].doc;
           Document d = searcher.doc(docId);
           System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title") );
       }
       // reader can only be closed when there
       // is no need to access the documents any more.
       reader.close();
   }

   private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
       Document doc = new Document();
       doc.add(new TextField("title", title, Field.Store.YES));

       // use a string field for isbn because we don't want it tokenized
       doc.add(new StringField("isbn", isbn, Field.Store.YES));
       w.addDocument(doc);
   }

}

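6. Index optimization

a. fsIndexWriter.optimize() optimizes the index. Note that this method belongs to Lucene 3.x and earlier: it was deprecated in 3.5 in favor of forceMerge and is not available in 4.0, the version used in the demo above.

b. forceMerge(int maxNumSegments) forces the index segments to be merged until at most maxNumSegments remain. This is a fairly I/O-intensive operation. It merges the small segment files; if a thread adds new documents while the merge is running, those documents are not included in the merge unless another merge is triggered.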
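The behavior described in (b) can be pictured with a toy model in plain Java. This is not Lucene's merge code: the class `ToySegments`, the "merge the two smallest segments first" policy, and the `docsRewritten` counter (a rough stand-in for I/O cost) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model of segment merging; NOT Lucene's implementation.
public class ToySegments {
    // Each segment is modeled as a sorted set of document IDs.
    private final List<TreeSet<Integer>> segments = new ArrayList<>();
    private int nextDocId = 0;
    private long docsRewritten = 0; // rough proxy for the I/O a merge costs

    // Every batch of newly added documents ends up in its own small segment.
    public void addSegment(int docCount) {
        TreeSet<Integer> segment = new TreeSet<>();
        for (int i = 0; i < docCount; i++) {
            segment.add(nextDocId++);
        }
        segments.add(segment);
    }

    // Repeatedly merge the two smallest segments until at most
    // maxNumSegments remain. Every merged document is written again.
    public void forceMerge(int maxNumSegments) {
        if (maxNumSegments < 1) {
            throw new IllegalArgumentException("maxNumSegments must be >= 1");
        }
        while (segments.size() > maxNumSegments) {
            segments.sort((a, b) -> Integer.compare(a.size(), b.size()));
            TreeSet<Integer> merged = new TreeSet<>(segments.remove(0));
            merged.addAll(segments.remove(0));
            docsRewritten += merged.size();
            segments.add(merged);
        }
    }

    public int segmentCount() { return segments.size(); }

    public long docsRewritten() { return docsRewritten; }

    public static void main(String[] args) {
        ToySegments index = new ToySegments();
        index.addSegment(2);
        index.addSegment(3);
        index.addSegment(5);
        index.forceMerge(1);
        System.out.println(index.segmentCount());  // prints 1
        System.out.println(index.docsRewritten()); // prints 15
        index.addSegment(1); // arrives after the merge: stays in its own segment
        System.out.println(index.segmentCount());  // prints 2
    }
}
```

Ten documents cost fifteen document writes to merge here, which hints at why the operation is expensive, and the late segment stays separate until the next merge.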

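c. There are other approaches as well, such as combining an in-memory index with an on-disk index: keep the content that is searched most often in memory.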
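One way to picture (c) is a small least-recently-used cache sitting in front of the on-disk lookup. The sketch below is plain Java and does not use any Lucene API: the class `HotTermCache` is hypothetical, the `diskLookup` function merely stands in for a read from the on-disk index, and the class is not thread-safe.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Toy two-tier lookup: hot terms are served from memory, the rest from "disk".
public class HotTermCache {
    private final Function<String, int[]> diskLookup; // stand-in for the on-disk index
    private final Map<String, int[]> hot;
    private int diskReads = 0;

    public HotTermCache(final int capacity, Function<String, int[]> diskLookup) {
        this.diskLookup = diskLookup;
        // accessOrder = true keeps entries ordered from least to most recently used
        this.hot = new LinkedHashMap<String, int[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, int[]> eldest) {
                return size() > capacity; // evict the least recently used term
            }
        };
    }

    // Returns the document IDs for a term, touching "disk" only on a cache miss.
    public int[] postings(String term) {
        int[] cached = hot.get(term);
        if (cached != null) {
            return cached;
        }
        diskReads++;
        int[] fromDisk = diskLookup.apply(term);
        hot.put(term, fromDisk);
        return fromDisk;
    }

    public int diskReads() { return diskReads; }

    public static void main(String[] args) {
        Map<String, int[]> fakeDisk = new HashMap<>();
        fakeDisk.put("lucene", new int[] {0, 1});
        fakeDisk.put("action", new int[] {0});
        fakeDisk.put("gigabytes", new int[] {2});

        HotTermCache cache = new HotTermCache(2, t -> fakeDisk.getOrDefault(t, new int[0]));
        for (String term : new String[] {"lucene", "action", "lucene", "gigabytes", "lucene"}) {
            cache.postings(term);
        }
        System.out.println(cache.diskReads()); // prints 3
    }
}
```

Of the five lookups only three reach the slow tier, because "lucene" stays hot while "action" is evicted.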
