Indexing in Solr: correct the analyzer to not produce immense terms
I want to index my data crawled by Nutch by running:
bin/nutch index -D solr.server.url="http://localhost:8983/solr/carerate" crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016*
At first it all worked fine: I indexed my data, sent some queries, and got good results back. But then I ran the crawl again so that it fetched more pages, and now when I run the Nutch index command I get:
java.io.IOException: Job failed!
Here is my Hadoop log:
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id http://www.cs.toronto.edu/~frank/About_Me/about_me.html to the index; possible analysis error: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[70, 114, 97, 110, 107, 32, 82, 117, 100, 122, 105, 99, 122, 32, 45, 32, 65, 98, 111, 117, 116, 32, 77, 101, 32, 97, 98, 111, 117, 116]...', original message: bytes can be at most 32766 in length; got 40063. Perhaps the document has an indexed string field (solr.StrField) which is too large
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id http://www.cs.toronto.edu/~frank/About_Me/about_me.html to the index; possible analysis error: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[70, 114, 97, 110, 107, 32, 82, 117, 100, 122, 105, 99, 122, 32, 45, 32, 65, 98, 111, 117, 116, 32, 77, 101, 32, 97, 98, 111, 117, 116]...', original message: bytes can be at most 32766 in length; got 40063. Perhaps the document has an indexed string field (solr.StrField) which is too large
    at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:153)
    at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2016-06-21 13:27:37,994 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
I realized that the mentioned page must contain a very long term. So in schema.xml (in Nutch) and managed-schema (in Solr) I changed "id", "content" and "text" from "strings" to "text_general", but it didn't solve the problem.
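The edit looked roughly like this (a sketch; the stored/indexed attributes here are illustrative and may differ from the actual schemas):

<!-- before: the whole field value is indexed as one huge string token -->
<field name="content" type="strings" stored="true" indexed="true"/>

<!-- after: the value is tokenized by the text_general analyzer chain -->
<field name="content" type="text_general" stored="true" indexed="true"/>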
I'm not an expert, so I don't know how to correct the analyzer without messing up something else. I've read that I could:
1. use a LengthFilterFactory (in the index analyzer) to filter out tokens that don't fall within a requested length range;
2. use a TruncateTokenFilterFactory (in the index analyzer) to cap the maximum length of indexed tokens.
But there are so many analyzers in the schema. Should I change the analyzer's definition? If so, since "content" and other fields are of type text_general, won't they all be affected as well?
Does anyone know how I can fix this? I'd really appreciate any help.
By the way, I'm using Nutch 1.11 and Solr 6.0.0.
Assuming you're using the schema.xml bundled with Nutch as the base schema for your Solr installation, you basically just need to add either of those filters (LengthFilterFactory or TruncateTokenFilterFactory) to the text_general field type. Starting from the initial definition of the text_general fieldType (https://github.com/apache/nutch/blob/master/conf/schema.xml#L108-L123), you need to add the following to the <analyzer type="index"> section:
...
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- remove tokens shorter than 3 or longer than 7 characters -->
  <filter class="solr.LengthFilterFactory" min="3" max="7"/>
  <!-- enablePositionIncrements was removed in Lucene/Solr 5.0, so it is dropped here for Solr 6 -->
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
...
The same syntax can also be applied to the query analyzer. If you want to use the TruncateTokenFilterFactory filter instead, just swap the added line with:
<filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
Also, don't forget to tune each filter's parameters to your needs (min and max for the LengthFilterFactory, prefixLength for the TruncateTokenFilterFactory).
To answer your other question: yes, this will affect all fields of type text_general, but that isn't really a problem, because if another immense term showed up in any other field of that type, the same error would be thrown anyway. If you still want to isolate this change to the content field only, just create a fieldType with a new name (e.g. truncated_text_general: copy & paste the whole fieldType section and change the name attribute), and then change the type of the content field (https://github.com/apache/nutch/blob/master/conf/schema.xml#L339) to match your newly created fieldType.
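A minimal sketch of that isolated variant (truncated_text_general is just the example name from above; the analyzer chain is trimmed for brevity and the field attributes are illustrative):

<fieldType name="truncated_text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- keep only tokens between 3 and 7 characters long -->
    <filter class="solr.LengthFilterFactory" min="3" max="7"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- point the content field at the new type -->
<field name="content" type="truncated_text_general" stored="false" indexed="true"/>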
That said, just choose sensible values for the filters so you don't lose too many terms from your index.
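One practical note the answer doesn't spell out: analyzer changes only apply to documents indexed after the change, so after editing the schema you should reload the core and re-run the indexing job, e.g. (using the carerate core name from the question):

curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=carerate"
bin/nutch index -D solr.server.url="http://localhost:8983/solr/carerate" crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016*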
Thanks for your reply, Jorge. Although you explained very well how it works, as I mentioned in the question body I did already try this, and unfortunately it didn't solve my problem.