Elasticsearch Completion Suggester in Practice

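The Completion Suggester is one of the suggesters in the Elasticsearch Search APIs. It is mainly used to implement autocomplete for input boxes.

As a hands-on demonstration, I will use a Java program that suggests Chinese idioms from either a character or the initial letters of the pinyin.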

Project repository: https://github.com/tangwanggong/week-project/tree/master/week-1

Idiom data source: https://github.com/pwxcoo/chinese-xinhua

#Idiom structure
{
    "derivation": "语出《法华经·法师功德品》下至阿鼻地狱。”",
    "example": "但也有少数意志薄弱的……逐步上当，终至堕入～。★《上饶集中营·炼狱杂记》",
    "explanation": "阿鼻梵语的译音，意译为无间”，即痛苦无有间断之意。常用来比喻黑暗的社会和严酷的牢狱。又比喻无法摆脱的极其痛苦的境地。",
    "pinyin": "ā bí dì yù",
    "word": "阿鼻地狱",
    "abbreviation": "abdy"
}
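
The abbreviation field holds the first letter of each pinyin syllable (for example "ā bí dì yù" becomes "abdy"). The dataset already ships this field, so the following is only an illustrative sketch of that relationship. The class and method names are hypothetical, and it assumes syllables are separated by whitespace.

```java
import java.text.Normalizer;

final class PinyinAbbreviationSketch {

    // Builds the abbreviation by taking the first letter of every pinyin syllable.
    static String abbreviate(String pinyin) {
        StringBuilder sb = new StringBuilder();
        for (String syllable : pinyin.trim().split("\\s+")) {
            if (syllable.isEmpty()) {
                continue; // happens only for blank input
            }
            // NFD splits a toned vowel such as "ā" into 'a' plus a combining mark,
            // so the first char of the decomposed syllable is the bare letter.
            String decomposed = Normalizer.normalize(syllable, Normalizer.Form.NFD);
            sb.append(Character.toLowerCase(decomposed.charAt(0)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(abbreviate("ā bí dì yù")); // prints "abdy"
    }
}
```

Indexing this value in its own completion field is what lets a user type pinyin initials instead of Chinese characters.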

For this requirement, the fields we need to search are word and abbreviation. The statement below creates the mapping.

#Create the mapping
PUT idiom
{
  "mappings": {
    "doc": {
      "properties": {
        "id": {
          "type": "long",
          "index": false
        },
        "derivation": {
          "type": "keyword",
          "index": false
        },
        "example": {
          "type": "keyword",
          "index": false
        },
        "explanation": {
          "type": "keyword",
          "index": false
        },
        "pinyin": {
          "type": "keyword",
          "index": false
        },
        "word": {
          "type": "completion",
          "analyzer": "simple"      
        },
        "abbreviation": {
          "type": "completion",
          "analyzer": "simple"
        }
      }
    }
  }
}

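First, set the field type to completion. The mapping also supports the following parameters:

analyzer: the index analyzer; defaults to simple.
search_analyzer: the search analyzer; defaults to the value of analyzer.
preserve_separators: whether separators are preserved; defaults to true. If set to false, a search for foof can suggest Foo Fighters.
preserve_position_increments: whether position increments are preserved; defaults to true. If set to false while using a stop-word analyzer, a search for b can suggest The Bee.
max_input_length: the maximum length of a single input; defaults to 50.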

The next step is to import the data into Elasticsearch. The project is built with Spring Boot, and Spring Data makes the import straightforward:

public void save() {
    try {
        // Read the local data file
        String json = FileUtils.readFileToString(new File(filePath), StandardCharsets.UTF_8);
        List<Idiom> list = JSONObject.parseArray(json, Idiom.class);
        log.info("Number of idioms: {}", list.size());
        // Assign sequential ids, starting at 1
        long id = 1;
        for (Idiom idiom : list) {
            idiom.setId(id++);
        }
        // Write to Elasticsearch
        idiomRepository.saveAll(list);
        log.info("Idiom import finished");
    } catch (IOException e) {
        log.error("Failed to import idioms into Elasticsearch: {}", e.getMessage());
    }
}

Because Spring Data does not yet implement suggester queries, I send the request to the REST API with HttpClient. The demo only covers the Completion Suggester on the word field.

#Request
POST ip:port/idiom/doc/_search
{
     "suggest": {
        "my-suggest" : {
            "prefix" : "阿", 
            "completion" : { 
                "field" : "word" 
            }
        }
    }
}
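
The request above could be sent from Java roughly as follows. This is a minimal sketch, not the project's actual code: it assumes JDK 11+ and uses the built-in java.net.http.HttpClient (the project may use a different HttpClient library), and the class name, method names, and the baseUrl parameter (for example http://ip:port) are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

final class CompletionSuggestClientSketch {

    // Escapes backslashes and double quotes so the value stays valid inside a JSON string.
    // Enough for this sketch; control characters are not handled.
    static String escapeJson(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds the same body as the request shown above, for a given prefix and field.
    static String buildSuggestBody(String prefix, String field) {
        return "{\"suggest\":{\"my-suggest\":{\"prefix\":\"" + escapeJson(prefix)
                + "\",\"completion\":{\"field\":\"" + escapeJson(field) + "\"}}}}";
    }

    // Posts the suggest request to the idiom index and returns the raw JSON response.
    static String suggest(String baseUrl, String prefix) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/idiom/doc/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        buildSuggestBody(prefix, "word"), StandardCharsets.UTF_8))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
        return response.body();
    }
}
```

Calling suggest(baseUrl, "阿") against a running cluster should return JSON of the shape shown in the response section; the body is built by hand here only to keep the sketch free of dependencies.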

#Response JSON (truncated because the full response is too long)
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "suggest" : {
    "my-suggest" : [
      {
        "text" : "阿",
        "offset" : 0,
        "length" : 1,
        "options" : [
          {
            "text" : "阿世取容",
            "_index" : "idiom",
            "_type" : "doc",
            "_id" : "16",
            "_score" : 1.0,
            "_source" : {
              "id" : 16,
              "word" : "阿世取容",
              "derivation" : "鲁迅《汉文学史纲要》第六篇至叔孙通，则正以曲学容，非重其能定朝仪，知典礼也。”",
              "example" : "叙述西汉儒学，应该看到多数～的章名小儒，也应该看到少数同情人民的正统儒者。★范文澜蔡美彪等《中国通史》第二编第二章第九节",
              "explanation" : "指迎合世俗，取悦于人。",
              "pinyin" : "ē shì qǔ róng",
              "abbreviation" : "esqr"
            }
          }
        ]
      }
    ]
  }
}

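Once we have the JSON, we can parse it and return the suggestions to the front end for display. The final result is shown below.

[Screenshot: Elasticsearch Completion Suggester in Practice]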

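Finally, let's consider why Elasticsearch suggestions are so fast.

The Completion Suggester does not rely on the inverted index. Instead, the analyzed input is encoded into an FST (finite state transducer) that is stored alongside the index. For an index in the open state, Elasticsearch loads the whole FST into memory, so prefix lookups are extremely fast. However, an FST can only be used for prefix lookups, which is also the main limitation of the Completion Suggester.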
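
To make the prefix-only behavior concrete, here is a small sketch that uses a plain in-memory trie as a simplified stand-in for an FST. It is not Lucene's FST implementation, which is considerably more compact; the class name PrefixTrieSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PrefixTrieSketch {

    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>(); // sorted for stable output
        boolean terminal;                                       // true if a word ends here
    }

    private final Node root = new Node();

    // Adds a word by walking or creating one node per character.
    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.terminal = true;
    }

    // Returns up to `limit` stored words that start with `prefix`.
    List<String> suggest(String prefix, int limit) {
        List<String> out = new ArrayList<>();
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return out; // no stored word starts with this prefix
            }
        }
        collect(node, new StringBuilder(prefix), limit, out);
        return out;
    }

    // Depth-first walk below the prefix node, collecting complete words.
    private static void collect(Node node, StringBuilder path, int limit, List<String> out) {
        if (out.size() >= limit) {
            return;
        }
        if (node.terminal) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            if (out.size() >= limit) {
                return;
            }
            path.append(e.getKey());
            collect(e.getValue(), path, limit, out);
            path.setLength(path.length() - 1);
        }
    }

    public static void main(String[] args) {
        PrefixTrieSketch trie = new PrefixTrieSketch();
        trie.insert("阿鼻地狱");
        trie.insert("阿世取容");
        trie.insert("一心一意");
        System.out.println(trie.suggest("阿", 10));   // prints [阿世取容, 阿鼻地狱]
        System.out.println(trie.suggest("地狱", 10)); // prints [] because "地狱" is not a prefix
    }
}
```

The lookup cost depends on the length of the prefix rather than on the number of stored words, and a query that matches only the middle of a word finds nothing, which mirrors the limitation described above.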
