Web scraping: how do I remove a word if it appears in the first 20 characters of a document?

Question:

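I scraped a bunch of speeches from http://www.millercenter.org. The speeches are trimmed and formatted exactly the way I want, except for one small thing. Every document (all 911 of them) has the word 'transcript' at the beginning, and I don't want it there because I'm moving on to some NLP. I haven't been able to remove it, and I've tried both the replace and remove methods. I even tried extending my find call to skip the part of the HTML that appears at the start of every document: <h2>Transcript</h2>.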

Here's a sample of what I'm seeing in the documents:

transcript 
to the senate and house of representatives 
i lay before congress several dispatches from his

and

transcript 
the period for a new election of a citizen to administer the executive government

Here is my code:

import urllib2,sys,os 
from bs4 import BeautifulSoup,NavigableString 
from string import punctuation as p 
from multiprocessing import Pool 
import re, nltk 
import requests 
reload(sys) 

chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752' 
chester_3752 = urllib2.urlopen(chester_url).read() 
chester_3752 = BeautifulSoup(chester_3752) 

# find the speech itself within the HTML 
chester_3752 = chester_3752.find('div',{'id': 'transcript'},{'class': 'displaytext'}) 

# removes extraneous characters (e.g. '<br/>') 
chester_3752 = chester_3752.text.lower() 

# for further text analysis, remove punctuation 
punctuation = re.compile('[{}]+'.format(re.escape(p))) 

chester_3752 = punctuation.sub('', chester_3752) 
chester_3752 = chester_3752.replace('—',' ') 
chester_3752 = chester_3752.replace('transcript','')

Like I said, that last replace call doesn't seem to work. Thoughts?

Does the string always start with 'transcript'? – pelumi

Answer

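I'm not sure what your problem is, but when I ran this with Python 3.4 and bs4, it removed 'transcript' along with a bunch of punctuation. (I took out several of the imports and changed urllib2 to urllib.request.)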

import urllib.request 
import re 
from string import punctuation as p 

from bs4 import BeautifulSoup 

chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752' 
chester_3752 = urllib.request.urlopen(chester_url).read() 
chester_3752 = BeautifulSoup(chester_3752) 

# find the speech itself within the HTML 
chester_3752 = chester_3752.find('div',{'id': 'transcript'},{'class': 'displaytext'}) 

# removes extraneous characters (e.g. '<br/>') 
chester_3752 = chester_3752.text.lower() 

# for further text analysis, remove punctuation 
punctuation = re.compile('[{}]+'.format(re.escape(p))) 

chester_3752 = punctuation.sub('', chester_3752) 
chester_3752 = chester_3752.replace('—',' ') 
chester_3752 = chester_3752.replace('transcript','') 

print(chester_3752)

Could it make a difference that I'm running Python 2.7? – blacksite

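The two versions are different, so it's possible, but it's odd that 'chester_3752 = chester_3752.replace('—', ' ')' works and 'chester_3752 = chester_3752.replace('transcript', '')' doesn't. Another thing you might try is adding one more line after the last one, because it seems strange that only the last line isn't being executed. – dstudeba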

Answer

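I tried your code and it works fine, but there is one slight tweak I'd recommend. Instead of replace, use startswith to make sure the string really does begin with transcript. replace removes every occurrence of transcript from the whole string, whereas what you actually need is to remove it only when it sits at the start of the string.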

import urllib2 
import sys 
import re 
from string import punctuation as p 

from bs4 import BeautifulSoup 

reload(sys) 

chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752' 
chester_3752 = urllib2.urlopen(chester_url).read() 
chester_3752 = BeautifulSoup(chester_3752) 

# find the speech itself within the HTML 
chester_3752 = chester_3752.find('div',{'id': 'transcript'},{'class': 'displaytext'}) 

# removes extraneous characters (e.g. '<br/>') 
chester_3752 = chester_3752.text.lower() 

# for further text analysis, remove punctuation 
punctuation = re.compile('[{}]+'.format(re.escape(p))) 

chester_3752 = punctuation.sub('', chester_3752) 
chester_3752 = chester_3752.replace('-',' ') 
print(chester_3752) 

# Avoid replace() here: it would delete every occurrence of 'transcript' in the string. 
# chester_3752 = chester_3752.replace('transcript','') 

# Remove 'transcript' only when it is at the very start of the string, which is what you want. 
if chester_3752.startswith("transcript"): 
    chester_3752 = chester_3752[len("transcript"):].strip() 
print(chester_3752)

When I run that program, I get an error at the if statement: 'UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 61344: ordinal not in range(128)' – blacksite

I think the line it is complaining about is 'chester_3752 = chester_3752.replace('-',' ')', not the one that removes 'transcript' from the text. – pelumi

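I added '.encode('utf-8')' and that fixed the error. But it still doesn't remove 'transcript' for me. I don't believe there are any other characters before 'transcript', so it's not as if we're missing something in front of the word. – blacksite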
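
One possibility the thread does not rule out is invisible leading whitespace (a newline, or the non-breaking space u'\xa0' mentioned in the error above), which would make a plain startswith check fail. The following is only a sketch under that assumption, written for Python 3; strip_leading_word is a hypothetical helper name, not something from the original code.

```python
def strip_leading_word(text, word="transcript"):
    """Remove `word` only if it is the first non-whitespace token of `text`.

    Hypothetical helper for illustration; assumes `text` is already lowercased.
    """
    # In Python 3, str.lstrip() with no arguments also removes Unicode
    # whitespace such as the non-breaking space '\xa0'.
    stripped = text.lstrip()
    rest = stripped[len(word):]
    # Require that the word is not just a prefix of a longer word
    # (so 'transcripts' is left alone).
    if stripped.startswith(word) and not rest[:1].isalnum():
        return rest.lstrip()
    # No leading match: return the input unchanged.
    return text


sample = "\n\xa0transcript \nto the senate and house of representatives"
print(strip_leading_word(sample))
```

Occurrences of the word later in the speech are left untouched, because only the start of the string is inspected.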
