Web scraping: how do I remove a word if it appears in the first 20 characters of a document?
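I scraped a bunch of speeches from http://www.millercenter.org. The speeches are trimmed and formatted exactly the way I want, except for one small thing. Every document (all 911 of them) has the word 'transcript' at the very beginning, and I don't want it there because I'm going on to do some NLP. I haven't been able to remove it, and I have tried both the replace and remove methods. I even tried extending my find call to get past the piece of HTML at the start of each document that says <h2>Transcript</h2>.
Here's a sample of what I see, file-wise: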
transcript
to the senate and house of representatives
i lay before congress several dispatches from his
and
transcript
the period for a new election of a citizen to administer the executive government
Here is my code:
import urllib2,sys,os
from bs4 import BeautifulSoup,NavigableString
from string import punctuation as p
from multiprocessing import Pool
import re, nltk
import requests
reload(sys)
chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752'
chester_3752 = urllib2.urlopen(chester_url).read()
chester_3752 = BeautifulSoup(chester_3752)
# find the speech itself within the HTML
chester_3752 = chester_3752.find('div',{'id': 'transcript'},{'class': 'displaytext'})
# removes extraneous characters (e.g. '<br/>')
chester_3752 = chester_3752.text.lower()
# for further text analysis, remove punctuation
punctuation = re.compile('[{}]+'.format(re.escape(p)))
chester_3752 = punctuation.sub('', chester_3752)
chester_3752 = chester_3752.replace('—',' ')
chester_3752 = chester_3752.replace('transcript','')
As I said, that last replace call doesn't seem to work. Thoughts?
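I'm not sure what your problem is, but when I ran this with Python 3.4 and bs4, it removed 'transcript' along with a bunch of punctuation. (I took out a number of the imports and changed urllib2 to urllib.request.)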
import re
import urllib.request
from string import punctuation as p
from bs4 import BeautifulSoup
chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752'
chester_3752 = urllib.request.urlopen(chester_url).read()
chester_3752 = BeautifulSoup(chester_3752)
# find the speech itself within the HTML
chester_3752 = chester_3752.find('div',{'id': 'transcript'},{'class': 'displaytext'})
# removes extraneous characters (e.g. '<br/>')
chester_3752 = chester_3752.text.lower()
# for further text analysis, remove punctuation
punctuation = re.compile('[{}]+'.format(re.escape(p)))
chester_3752 = punctuation.sub('', chester_3752)
chester_3752 = chester_3752.replace('—',' ')
chester_3752 = chester_3752.replace('transcript','')
print(chester_3752)
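I tried your code and it works fine, but there is one slight tweak I would recommend. Instead of replace, use startswith to make sure the string really does begin with 'transcript'. replace removes every occurrence of 'transcript' from the whole string, whereas what you actually need is to remove it only when it sits at the start of the string.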
import urllib2
import sys
from string import punctuation as p
import re
from bs4 import BeautifulSoup  # needed for the BeautifulSoup(...) call below
reload(sys)
chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752'
chester_3752 = urllib2.urlopen(chester_url).read()
chester_3752 = BeautifulSoup(chester_3752)
# find the speech itself within the HTML
chester_3752 = chester_3752.find('div',{'id': 'transcript'},{'class': 'displaytext'})
# removes extraneous characters (e.g. '<br/>')
chester_3752 = chester_3752.text.lower()
# for further text analysis, remove punctuation
punctuation = re.compile('[{}]+'.format(re.escape(p)))
chester_3752 = punctuation.sub('', chester_3752)
chester_3752 = chester_3752.replace('-',' ')
print(chester_3752)
# chester_3752 = chester_3752.replace('transcript','')  # avoid this: it deletes every occurrence of 'transcript' in the string
if chester_3752.startswith("transcript"):  # only strips 'transcript' at the very start of the string, which is what you want
    chester_3752 = chester_3752[10:].strip()  # 10 == len("transcript")
print(chester_3752)
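One caveat with a bare startswith check is that it fails if anything, even a newline, comes before the word. The sample output in the question shows 'transcript' on its own line, so the extracted text may well begin with whitespace, though that is an assumption rather than something confirmed in the thread. A minimal sketch of a more tolerant version follows; the helper name strip_leading_word and its window parameter are hypothetical, chosen to mirror the "first 20 characters" wording of the question.

```python
def strip_leading_word(text, word="transcript", window=20):
    """Remove `word` only when it sits at the start of `text`.

    Hypothetical helper (not from the original answers). The word must fit
    entirely inside the first `window` characters and may be preceded only
    by whitespace; later occurrences are left alone.
    """
    # str.find with bounds only matches if the whole word lies in text[0:window]
    idx = text.find(word, 0, window)
    if idx == -1:
        return text
    # Anything other than whitespace before the word means it is real content
    if text[:idx].strip():
        return text
    # Drop the word and the whitespace that follows it
    return text[idx + len(word):].lstrip()


# Made-up sample shaped like the output shown in the question
sample = "\n\ntranscript\nto the senate and house of representatives\nthis transcript stays"
cleaned = strip_leading_word(sample)
```

Unlike replace('transcript', ''), this leaves any later mention of the word inside the speech untouched, and unlike a plain startswith it does not depend on the word being the very first character.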
When I run that, I get an error at the if statement: 'UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 61344: ordinal not in range(128)' – blacksite
Is the line it is complaining about 'chester_3752 = chester_3752.replace('-',' ')' rather than the one that removes 'transcript' from the text? – pelumi
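I added '.encode('utf-8')' and that fixed the error. But it still doesn't remove 'transcript' for me. I don't believe there are any other characters before 'transcript', so it's not as though we're missing something ahead of the word. – blacksite
Does the string always begin with 'transcript'? – pelumi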
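One way to settle whether anything precedes the word is to look at the first few characters explicitly, since whitespace such as a newline or the u'\xa0' (no-break space) named in the error message is invisible when printed. The following is only a diagnostic sketch written for Python 3; describe_head and the sample string are hypothetical and not taken from the scraped data.

```python
import unicodedata


def describe_head(text, n=20):
    """Return (repr, Unicode name) pairs for the first n characters of text.

    Hypothetical diagnostic helper: repr() exposes invisible characters,
    and unicodedata.name() labels them (control characters have no name).
    """
    return [(repr(ch), unicodedata.name(ch, "<unnamed>")) for ch in text[:n]]


# Made-up sample: a no-break space and a newline hiding before the word
sample = "\xa0\ntranscript\nto the senate and house of representatives"

for shown, name in describe_head(sample, 4):
    print(shown, name)

# In Python 3, str.strip() with no arguments also removes the no-break space
starts_after_strip = sample.strip().startswith("transcript")
```

If the report shows whitespace ahead of 'transcript', stripping the text before calling startswith should be enough for the check to succeed.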