Scraping Asian-language websites with Python

Question:

I'm not very familiar with the Python ecosystem, or with web scraping in general. I'm trying to scrape content from a Chinese website.

from bs4 import BeautifulSoup 
import requests 

r = requests.get("https://www.baidu.com/") 
r.encoding = 'utf-8' 

text = r.text 

soup = BeautifulSoup(text.encode('utf-8','ignore'), 'html.parser') 

print soup.prettify()

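The problem is that this code works for me, but it doesn't work for everyone, and I don't know enough about character encodings or the Python ecosystem to troubleshoot it. I'm running Python 2.7.10, but running this same block of code on another computer with Python 2.7.12 produces the following error: "UnicodeEncodeError: 'ascii' codec can't encode characters in position 369-377: ordinal not in range(128)"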

So I guess my question really is:

What causes this error, and how can I fix this code to make it more portable?

Thanks in advance for any guidance or pointers.

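If you're going to work with Unicode data, please do yourself a favour and use Python 3. Its encoding handling is much better (strings are unicode by default there, not ascii). – MKesper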

I assume one of the machines is definitely Windows? –

Answer

I don't think you need to specify an encoding for the request. r.text has already done the decoding work for you, while r.content is the raw data.

See the documentation:

| text 
|  Content of the response, in unicode. 
|  
|  If Response.encoding is None, encoding will be guessed using 
|  ``chardet``. 
|  
|  The encoding of the response content is determined based solely on HTTP 
|  headers, following RFC 2616 to the letter. If you can take advantage of 
|  non-HTTP knowledge to make a better guess at the encoding, you should 
|  set ``r.encoding`` appropriately before accessing this property.

So you only need to set the encoding of the response, not the encoding of the request.
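To make the difference between r.content and r.text concrete, here is a minimal sketch that needs no network access. The byte string `raw` is a hypothetical stand-in for r.content, and the two decode calls imitate what r.text would return under a wrong and a right r.encoding. The sample text and the choice of ISO-8859-1 as the "wrong" codec are assumptions for illustration only.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for r.content: raw UTF-8 bytes as a server might send them.
raw = u"百度一下".encode("utf-8")

# Imitates r.text when r.encoding is wrong: every byte becomes one garbage character.
wrong = raw.decode("iso-8859-1")

# Imitates r.text after setting r.encoding = "utf-8": the original text comes back.
right = raw.decode("utf-8")

print(len(raw), len(wrong), len(right))
```

The raw bytes are identical in both cases; only the codec used to turn them into text differs, which is why setting r.encoding before reading r.text is enough.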

So the code should look like this:

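print(r.encoding)    # the encoding requests inferred from the HTTP headers
r.encoding = "utf8"  # override it with the page's actual encoding
print(r.text)        # r.text is now decoded as UTF-8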
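On the portability question, the traceback in the question is consistent with Python 2 trying to encode a unicode string for a standard output whose default codec is ASCII, though the thread does not confirm this for the failing machine. Under that assumption, one way to avoid depending on the terminal's codec is to encode to UTF-8 explicitly and write bytes. The helper `write_utf8` below is a hypothetical name, not part of requests or BeautifulSoup.

```python
# -*- coding: utf-8 -*-
import sys


def write_utf8(text, stream=None):
    """Encode text as UTF-8 and write the bytes, bypassing the default output codec."""
    if stream is None:
        # Python 3 exposes the byte stream as sys.stdout.buffer;
        # Python 2's sys.stdout accepts byte strings directly.
        stream = getattr(sys.stdout, "buffer", sys.stdout)
    data = text.encode("utf-8")
    stream.write(data)
    stream.write(b"\n")
    return data


write_utf8(u"百度一下")
```

With the question's code, this would be called as write_utf8(soup.prettify()) in place of the print statement. It assumes whatever reads the output expects UTF-8.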
