Web scraping: deleting a word if it appears in the first 20 characters of a document?

Problem description:

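I scraped a bunch of speeches from http://www.millercenter.org. The speeches are clipped and formatted just the way I want them, except for one small piece. Every document (all 911 of them) has the word 'transcript' at the beginning, and I don't want it in the documents, since I'm moving forward with some NLP. I've been unable to remove it, and I've tried the replace and remove methods. I even tried extending my find method past the part of the HTML that says <h2>Transcript</h2> at the beginning of each document.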

Here is a sample of what I see, document-wise:

transcript 
to the senate and house of representatives 
i lay before congress several dispatches from his 

transcript 
the period for a new election of a citizen to administer the executive government 

Here is my code:

import urllib2,sys,os 
from bs4 import BeautifulSoup,NavigableString 
from string import punctuation as p 
from multiprocessing import Pool 
import re, nltk 
import requests 
reload(sys) 

chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752' 
chester_3752 = urllib2.urlopen(chester_url).read() 
chester_3752 = BeautifulSoup(chester_3752) 

# find the speech itself within the HTML 
chester_3752 = chester_3752.find('div',{'id': 'transcript'},{'class': 'displaytext'}) 

# removes extraneous characters (e.g. '<br/>') 
chester_3752 = chester_3752.text.lower() 

# for further text analysis, remove punctuation 
punctuation = re.compile('[{}]+'.format(re.escape(p))) 

chester_3752 = punctuation.sub('', chester_3752) 
chester_3752 = chester_3752.replace('—',' ') 
chester_3752 = chester_3752.replace('transcript','') 

Like I said, that last replace call doesn't seem to be working. Thoughts?

Does the string always start with 'transcript'? – pelumi

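I'm not sure what your problem is, but when I ran it with Python 3.4 and bs4, it removed 'transcript' along with a bunch of punctuation. (I took out a bunch of the imports and changed urllib2 to urllib.request.)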

import re
import urllib.request
from string import punctuation as p

from bs4 import BeautifulSoup

chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752' 
chester_3752 = urllib.request.urlopen(chester_url).read() 
chester_3752 = BeautifulSoup(chester_3752) 

# find the speech itself within the HTML 
chester_3752 = chester_3752.find('div',{'id': 'transcript'},{'class': 'displaytext'}) 

# removes extraneous characters (e.g. '<br/>') 
chester_3752 = chester_3752.text.lower() 

# for further text analysis, remove punctuation 
punctuation = re.compile('[{}]+'.format(re.escape(p))) 

chester_3752 = punctuation.sub('', chester_3752) 
chester_3752 = chester_3752.replace('—',' ') 
chester_3752 = chester_3752.replace('transcript','') 

print(chester_3752) 
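
A quick way to check the cleaning steps without any network access is to run them on a hard-coded string. The sketch below assumes Python 3 and uses only the standard library; the `sample` text and the `clean` helper name are made up for illustration and are not part of the original code.

```python
import re
from string import punctuation as p

# Same pattern as above: one or more ASCII punctuation characters
punctuation = re.compile('[{}]+'.format(re.escape(p)))


def clean(text):
    """Lowercase, strip ASCII punctuation, and drop every 'transcript'."""
    text = text.lower()
    text = punctuation.sub('', text)
    text = text.replace('transcript', '')
    return text.strip()


# Hypothetical sample, not taken from the site
sample = "Transcript\nTo the Senate and House of Representatives:"
print(clean(sample))
```

If this works locally but the scraped documents still contain the word, the difference is more likely in the input text or in which code path actually runs than in `replace` itself.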

Could it make a difference that I'm running Python 2.7? – blacksite

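They are different, so it's possible, but it's strange that 'chester_3752 = chester_3752.replace('—',' ')' works and 'chester_3752 = chester_3752.replace('transcript','')' doesn't. Another thing you might try is putting one more line after that last line, because it seems odd that only the last line isn't being executed. – dstudeba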

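I've tried your code and it works fine, but there is one slight tweak I would recommend. Instead of using replace, use startswith to make sure the string really does begin with transcript. replace removes every occurrence of transcript from the whole string, but what you really need is to remove it only when it sits at the start of the string.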

import urllib2 
import sys 
from string import punctuation as p 
import re 
from bs4 import BeautifulSoup  # needed for the BeautifulSoup(...) call below 

reload(sys) 

chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752' 
chester_3752 = urllib2.urlopen(chester_url).read() 
chester_3752 = BeautifulSoup(chester_3752) 

# find the speech itself within the HTML 
chester_3752 = chester_3752.find('div',{'id': 'transcript'},{'class': 'displaytext'}) 

# removes extraneous characters (e.g. '<br/>') 
chester_3752 = chester_3752.text.lower() 

# for further text analysis, remove punctuation 
punctuation = re.compile('[{}]+'.format(re.escape(p))) 

chester_3752 = punctuation.sub('', chester_3752) 
chester_3752 = chester_3752.replace('-',' ') 
print(chester_3752) 

# chester_3752 = chester_3752.replace('transcript','')  # avoid this: it deletes every instance of transcript in the string

# only strip transcript when it is at the very beginning of the string, which is what you want
if chester_3752.startswith("transcript"):
    chester_3752 = chester_3752[len("transcript"):].strip()
print(chester_3752)
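
The title asks about removing the word only when it appears near the start of the document, which also covers text that begins with a newline or a space, where startswith would not match. One way to sketch that, assuming Python 3 and a hypothetical helper named `drop_leading_word` with an illustrative `window` parameter, is to search only the first 20 characters and cut just that one occurrence:

```python
def drop_leading_word(text, word='transcript', window=20):
    """Remove one occurrence of `word` if it lies within the first `window` characters."""
    pos = text.find(word, 0, window)  # search is limited to text[0:window]
    if pos == -1:
        return text  # word is absent or appears only later; leave the text alone
    return (text[:pos] + text[pos + len(word):]).strip()


# Hypothetical inputs for illustration
doc = "\n transcript \nto the senate and house of representatives"
print(drop_leading_word(doc))

later = "to the senate: the transcript of the proceedings"
print(drop_leading_word(later))
```

Unlike replace, this leaves a later mention of the word (as in `later`) untouched, so a speech that genuinely uses the word keeps it.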

When I run that, I get an error at the if statement: 'UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 61344: ordinal not in range(128)' – blacksite

It looks like what it is complaining about is 'chester_3752 = chester_3752.replace('-',' ')' rather than the line that removes 'transcript' from the text. – pelumi

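I added '.encode('utf-8')', and that solved that problem. But it still doesn't remove 'transcript' for me. I don't believe there are any other characters before 'transcript', so it's not as if we're missing something ahead of the word. – blacksite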
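
The u'\xa0' in the error message quoted above is a non-breaking space, which the ASCII codec cannot represent. As a small sketch (Python 3 syntax; the `normalize_spaces` name and the `raw` string are hypothetical), replacing it with a regular space before printing or matching sidesteps that particular error without changing the words in the text. Whether it also explains the leftover 'transcript' is not something the thread confirms.

```python
def normalize_spaces(text):
    """Replace non-breaking spaces (U+00A0) with ordinary spaces."""
    return text.replace('\xa0', ' ')


# Hypothetical input containing a non-breaking space
raw = 'transcript\xa0to the senate'

try:
    raw.encode('ascii')  # what an ASCII-only output stream would attempt
except UnicodeEncodeError as err:
    print('cannot encode:', err.reason)

cleaned = normalize_spaces(raw)
print(cleaned.encode('ascii'))  # encodes without raising once U+00A0 is gone
```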