Convert a scanned PDF to text in Python

Problem description:

I have a scanned PDF file and I am trying to extract text from it. I tried to OCR it with pypdfocr, but I get this error:

"could not found ghostscript in the usual place"

After searching, I found this solution: Linking Ghostscript to pypdfocr in Windows Platform. I then downloaded Ghostscript and added it to the environment variables, but I still get the same error.
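One way to check whether the environment variable change took effect is to ask Python which Ghostscript executable it can see on PATH. The following is a minimal sketch for Python 3, not part of pypdfocr: `find_ghostscript` is a hypothetical helper, and the executable names (`gswin64c`, `gswin32c`, `gs`) are the usual console names but are assumed here. It only inspects PATH; whether pypdfocr itself looks there is not established by this check.

```python
import shutil

# Console executable names commonly used by Ghostscript installs (assumed).
GS_CANDIDATES = ("gswin64c", "gswin32c", "gs")


def find_ghostscript(candidates=GS_CANDIDATES, path=None):
    """Return the full path of the first executable found on PATH, or None."""
    for name in candidates:
        found = shutil.which(name, path=path)
        if found:
            return found
    return None


if __name__ == "__main__":
    gs = find_ghostscript()
    print(gs if gs else "Ghostscript not found on PATH")
```

If this prints "Ghostscript not found on PATH" in a freshly opened terminal, the directory added to the environment variable is probably not the one that contains the executable.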

How can I use Python to search for text in my scanned PDF file?

Thanks.

Edit: here is my code sample:

import os 
import sys 
import re 
import json 
import shutil 
import glob 
from pypdfocr import pypdfocr_gs 
from pypdfocr import pypdfocr_tesseract 
from PIL import Image 

path = PATH_TO_MY_SCANNED_PDF 
mainL = [] 
kk = {} 


def new_init(self, kk): 
    self.lang = 'heb' 
    self.binary = "tesseract" 
    self.msgs = { 
      'TS_MISSING': """ 
       Could not execute %s 
       Please make sure you have Tesseract installed correctly 
       """ % self.binary, 
      'TS_VERSION':'Tesseract version is too old', 
      'TS_img_MISSING':'Cannot find specified tiff file', 
      'TS_FAILED': 'Tesseract-OCR execution failed!', 
     } 

pypdfocr_tesseract.PyTesseract.__init__ = new_init 

wow = pypdfocr_gs.PyGs(kk) 
tt = pypdfocr_tesseract.PyTesseract(kk) 


def secFile(filename, oldfilename):
    # Render the PDF pages to images through pypdfocr's Ghostscript wrapper.
    wow.make_img_from_pdf(filename)

    # Convert each JPEG to TIFF before handing it to Tesseract.
    files = glob.glob("X:/e206333106/ocr-114/balagan/" + '*.jpg')
    for file in files:
        im = Image.open(file)
        im.save(file + ".tiff")

    # OCR each TIFF; the next loop expects hOCR (.html) output files.
    files = glob.glob("PATH" + '*.tiff')
    for file in files:
        tt.make_hocr_from_pnm(file)

    # Concatenate the OCR output of all pages, separated by "#".
    pdftxt = ""
    files = glob.glob("PATH" + '*.html')
    for file in files:
        with open(file) as myfile:
            pdftxt = pdftxt + "#" + "".join(line.rstrip() for line in myfile)
    findNum(pdftxt, oldfilename)

    # Delete the intermediate files.
    folder = "PATH"
    for the_file in os.listdir(folder):
        file_path = os.path.join(folder, the_file)
        try:
            if os.path.isfile(file_path):
                os.unlink(file_path)
        except Exception as e:
            print(e)

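def pdf2ocr(filename):
    # Run the pypdfocr command-line tool with Hebrew as the OCR language.
    pdffile = filename
    os.system('pypdfocr -l heb ' + pdffile)

def ocr2txt(filename):
    pdffile = filename

    # Text output goes to PATH; the OCR'd PDF is expected at <name>_ocr.pdf.
    output1 = pdffile.replace(".pdf", "_ocr.txt")
    output1 = "PATH" + os.path.basename(output1)

    input1 = pdffile.replace(".pdf", "_ocr.pdf")

    # Extract the text layer of the OCR'd PDF with pdf2txt.
    os.system("pdf2txt -o " + output1 + " " + input1)

    with open(output1) as myfile:
        pdftxt = "".join(line.rstrip() for line in myfile)
    findNum(pdftxt, filename)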


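def findNum(pdftxt, pdffile):
    # Collect every standalone run of digits in the OCR text.
    l = re.findall(r'\b\d+\b', pdftxt)

    # Write the numbers to a .txt file, each preceded by a comma.
    with open('PATH' + os.path.basename(pdffile) + '.txt', 'w') as output:
        for i in l:
            output.write(",")
            output.write(i)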

def is_ascii(s): 
    return all(ord(c) < 128 for c in s) 

i = 0
files = glob.glob(path + '\\*.pdf')
print(path)
print(files)
for file in files:
    if file.endswith(".pdf"):
        if is_ascii(file):
            # ASCII-only file names go through the pypdfocr command line.
            print(file)
            pdf2ocr(file)
            ocr2txt(file)
        else:
            # Other names are copied to a numbered ASCII name first.
            newname = "PATH" + str(i) + ".pdf"
            shutil.copyfile(file, newname)
            print(newname)
            secFile(newname, file)
        i = i + 1

# Move the generated *_ocr.pdf files out of the input folder.
files = glob.glob(path + '\\' + '*_ocr.pdf')

for file in files:
    print(file)
    shutil.copyfile(file, "PATH" + os.path.basename(file))
    os.remove(file)
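
The number-extraction step in findNum can be tried on its own, without any OCR tooling. This is a small sketch in Python 3; `find_numbers` and `to_csv_line` are hypothetical names, and the sample string is made up for illustration. It uses the same regular expression as the code above, but joins the results so the output has no leading comma.

```python
import re


def find_numbers(text):
    """Return every standalone run of digits, in order of appearance."""
    return re.findall(r"\b\d+\b", text)


def to_csv_line(numbers):
    """Join the numbers with commas, without a leading comma."""
    return ",".join(numbers)


sample = "Invoice 123, total 45.60, ref A77"
numbers = find_numbers(sample)
print(numbers)               # ['123', '45', '60']
print(to_csv_line(numbers))  # 123,45,60
```

Two things to note about the pattern: a decimal such as 45.60 is split into two numbers, and digits attached to letters (A77) are skipped because there is no word boundary between the letter and the digit.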

Can you provide a sample of your code? – Keeper

I have edited it into my question. – Michal

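Take a look at this library: https://pypi.python.org/pypi/pypdfocr. However, a PDF file can also contain images, so you may be able to analyse the page content streams. Some scanners break a single scanned page up into several images, so you will not get the text with Ghostscript.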
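
As a rough first look at whether a PDF is made of page images, one can count image XObject markers in the raw file bytes. The sketch below (Python 3) is only a heuristic and not a PDF parser: `count_image_markers` is a hypothetical helper, and it will miss images whose dictionaries are stored inside compressed object streams.

```python
import re

# Matches the dictionary entry that marks an image XObject, e.g. "/Subtype /Image".
IMAGE_MARKER = re.compile(rb"/Subtype\s*/Image\b")


def count_image_markers(pdf_bytes):
    """Count '/Subtype /Image' entries in raw PDF bytes (heuristic only)."""
    return len(IMAGE_MARKER.findall(pdf_bytes))


# Usage (the file name is a placeholder):
# with open("scanned.pdf", "rb") as f:
#     print(count_image_markers(f.read()))
```

A count at or above the number of pages suggests the pages are stored as images, in which case the text has to come from OCR rather than from the PDF itself.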

Still the same error. I ran **pypdfocr filename.pdf** on the command line and got: **ERROR: Could not find Ghostscript in the usual place; please specify it using your config file** – Michal

Which operating system are you using? – ghovat

I am using 64-bit Windows. – Michal

You can use OpenCV for Python. There are many examples of text detection.

I could not find how to use it with PDF files. – Michal

Print the PDF pages as images (PNG or JPEG format); then you can run OCR on them with OpenCV. –
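
The "pages to images, then OCR" route can also be driven from Python by calling the Ghostscript and Tesseract command-line tools directly. This is a sketch under assumptions, written for Python 3: both tools must be installed, the executable names and file names are placeholders, the `heb` language data must be present for Tesseract, and the helper functions are hypothetical. It swaps in the Tesseract CLI for the OCR step instead of OpenCV.

```python
import subprocess


def build_gs_command(gs_exe, pdf_path, out_pattern, dpi=300):
    """Ghostscript command that renders each PDF page to a PNG file."""
    return [
        gs_exe, "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m", "-r{}".format(dpi),
        "-sOutputFile={}".format(out_pattern),
        pdf_path,
    ]


def build_tesseract_command(image_path, out_base, lang="heb",
                            tesseract_exe="tesseract"):
    """Tesseract command that writes the recognized text to out_base + '.txt'."""
    return [tesseract_exe, image_path, out_base, "-l", lang]


def run(command):
    """Run a command and raise CalledProcessError if it fails."""
    subprocess.run(command, check=True)


# Usage (paths are placeholders; %03d numbers the pages 001, 002, ...):
# run(build_gs_command("gswin64c", "scan.pdf", "page-%03d.png"))
# run(build_tesseract_command("page-001.png", "page-001"))
```

Passing the commands as lists rather than one concatenated string avoids quoting problems with file names that contain spaces.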

I tried looking at OpenCV, but when I do `import numpy` it raises `AttributeError: 'module' object has no attribute 'einsum'`. – Michal