Java: Apache Tika: Unexpected RuntimeException while extracting text from a .doc file. The file opens without any errors in MS Word

Problem description:

I am using a TikaParser class to extract plain text from ".DOC" files.

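import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.xml.sax.ContentHandler;

public class TikaParser {
    public static void main(String[] args) throws Exception {
        ContentHandler handler = new ToHTMLContentHandler();
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();

        // Parse the document; try-with-resources closes the stream even if parsing fails.
        try (InputStream content = new FileInputStream("file.doc")) {
            parser.parse(content, handler, metadata, context);
        }
        System.out.println(handler.toString());

        // Print every metadata field Tika extracted.
        for (String name : metadata.names()) {
            System.out.println(name + " : " + metadata.get(name));
        }

        // Write the extracted markup with an explicit charset instead of the platform default.
        try (OutputStream outStream = new FileOutputStream("file.doc.txt")) {
            outStream.write(handler.toString().getBytes(StandardCharsets.UTF_8));
        }
    }
}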

This works for most files, but for one particular file it throws the following exception:

Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from [email protected] 
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) 
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) 
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) 
at com.goarya.app.resumestorage.migration.TikaParser.main(TikaParser.java:29) 
Caused by: java.lang.IllegalArgumentException: The end (7161) must not be before the start (7162) 
at org.apache.poi.hwpf.usermodel.Range.sanityCheckStartEnd(Range.java:208) 
at org.apache.poi.hwpf.usermodel.Range.<init>(Range.java:194) 
at org.apache.poi.hwpf.usermodel.Paragraph.<init>(Paragraph.java:165) 
at org.apache.poi.hwpf.usermodel.Paragraph.newParagraph(Paragraph.java:144) 
at org.apache.poi.hwpf.usermodel.Range.getParagraph(Range.java:766) 
at org.apache.poi.hwpf.extractor.WordExtractor.getParagraphText(WordExtractor.java:168) 
at org.apache.poi.hwpf.extractor.WordExtractor.getMainTextboxText(WordExtractor.java:145) 
at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:183) 
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:169) 
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:130) 
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) 
... 3 more 
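
The trace shows a TikaException wrapping an IllegalArgumentException raised inside POI's Range class. When logging failures like this, it can help to pull out the innermost cause. Below is a minimal sketch using only the JDK; the RootCause class and its of method are hypothetical names for illustration (not part of Tika or POI), and a plain Exception stands in for the TikaException.

```java
public class RootCause {
    // Hypothetical helper (not part of Tika or POI): follows getCause() links
    // until it reaches the innermost exception.
    public static Throwable of(Throwable t) {
        Throwable current = t;
        // Stop on a null cause or a self-referencing cause to avoid looping forever.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in for the wrapped exception in the trace; a real run would
        // catch org.apache.tika.exception.TikaException instead.
        Exception wrapped = new Exception(
                "Unexpected RuntimeException",
                new IllegalArgumentException(
                        "The end (7161) must not be before the start (7162)"));
        Throwable root = of(wrapped);
        // Prints the class name and message of the innermost exception.
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

Run as-is, this prints the IllegalArgumentException and its message, which is the part of the trace that points at the malformed paragraph range.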

The .doc file shows no errors when opened in Microsoft Word.

Also, using Microsoft.Office.Interop.Word from C# returns the plain text.

How can I solve this with Apache Tika?

Edit: added a sample doc for this case.

Which version of Apache Tika are you using? – Gagravarr

I am using Tika version 1.14.

I am using the Tika core 1.2 jar, and my program ran successfully with the following code.

import java.io.FileInputStream; 
import java.io.FileOutputStream; 
import java.io.IOException; 

import org.apache.tika.exception.TikaException; 
import org.apache.tika.metadata.Metadata; 
import org.apache.tika.parser.AutoDetectParser; 
import org.apache.tika.parser.ParseContext; 
import org.apache.tika.sax.ToHTMLContentHandler; 
import org.xml.sax.SAXException; 


public class Exmple2 {
    public static void main(final String[] args) throws IOException, TikaException, SAXException {
        ToHTMLContentHandler handler = new ToHTMLContentHandler();
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();

        FileInputStream content = new FileInputStream("/home/ist/FTRDocuments/taableDis.docx");
        parser.parse(content, handler, metadata, context);
        System.out.println(handler.toString());

        String[] metadataNames = metadata.names();
        for (String name : metadataNames) {
            System.out.println(name + " : " + metadata.get(name));
        }

        FileOutputStream outStream = new FileOutputStream("/home/ist/file.doc.txt");
        outStream.write(handler.toString().getBytes());
        outStream.close();
        content.close();
    }
}

The only difference from your code, with Tika 1.2, is that I declare the handler as ToHTMLContentHandler where you declare it as ContentHandler.
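
For a batch job that must keep going when a single file fails to parse, one option is to try extractors in order and fall back when one throws. This is only a sketch under assumptions: the TextExtractor interface and extractFirst method are hypothetical names (not Tika types), and the lambdas in main are stand-ins. In practice the first extractor would presumably wrap the parser.parse call shown above and the second some alternative extraction route.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class FallbackExtraction {
    /** Hypothetical abstraction over "turn a file into text"; not a Tika type. */
    public interface TextExtractor {
        String extract(String path) throws Exception;
    }

    /** Tries each extractor in order and returns the first non-null result. */
    public static Optional<String> extractFirst(String path, List<TextExtractor> extractors) {
        for (TextExtractor extractor : extractors) {
            try {
                String text = extractor.extract(path);
                if (text != null) {
                    return Optional.of(text);
                }
            } catch (Exception e) {
                // Report the failure and move on to the next extractor.
                System.err.println("Extractor failed for " + path + ": " + e);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in for a parser that rejects the file, as in the question's trace.
        TextExtractor failing = path -> {
            throw new IllegalArgumentException(
                    "The end (7161) must not be before the start (7162)");
        };
        // Stand-in for a second extraction route that succeeds.
        TextExtractor backup = path -> "plain text of " + path;

        Optional<String> text = extractFirst("file.doc", Arrays.asList(failing, backup));
        System.out.println(text.orElse("<no text extracted>"));
    }
}
```

With the stand-ins above, the first extractor's failure is reported on stderr and the second one supplies the text, so one bad file does not abort the whole run.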

It works for most documents, but not for ones like [this](https://drive.google.com/file/d/0B_vwGeTfv6vHRGw0Tk0xc2Z0aEU/view?usp=sharing).

I thought you were facing the problem with all files.