How to convert unicode text to normal text

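Question:

I am learning Beautiful Soup in Python. How do I convert unicode text to normal text?

I am trying to parse a simple web page containing a list of books.

For example: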

<a href="https://www.nostarch.com/carhacking">The Car Hacker’s Handbook</a>

I use the following code.

import requests, bs4 
res = requests.get('http://nostarch.com') 
res.raise_for_status() 
nSoup = bs4.BeautifulSoup(res.text,"html.parser") 
elems = nSoup.select('.product-body a') 

#elems[0] gives 
<a href="https://www.nostarch.com/carhacking">The Car Hacker\u2019s Handbook</a>

and

#elems[0].getText() gives 
u'The Car Hacker\u2019s Handbook'

But I want it to give the proper text when printed:

s = elems[0].getText() 
print s 
>>>The Car Hacker’s Handbook

How do I modify my code so that it outputs "The Car Hacker’s Handbook" instead of "u'The Car Hacker\u2019s Handbook'"?

Please help.

There is nothing wrong with the result you are getting. It is a unicode string that contains a fancy character. – Selcuk
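To illustrate the comment above: the \u2019 escape is only how the string is displayed, while the string itself holds a single typographic apostrophe. This is a minimal sketch, assuming Python 3 (the thread uses Python 2, where the interpreter echoes the escaped form for unicode strings). The variable name title is illustrative and stands in for elems[0].getText().

```python
import unicodedata

# Stand-in for elems[0].getText(); the name `title` is an assumption for this sketch.
title = 'The Car Hacker\u2019s Handbook'

# ascii() returns the escaped display form, similar to what Python 2 echoes
# for a unicode string at the interactive prompt.
escaped = ascii(title)

# The escape denotes exactly one character in the string.
apostrophe = title[14]

print(escaped)
print(unicodedata.name(apostrophe))
print(len(title))
```

The string has 25 characters, not 30: the six-character escape sequence counts as one character.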

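Thanks, @Selcuk. But how do I take that string, "u'The Car Hacker's Handbook'", and store it in a file or database? Will it be saved properly? I mean, I tried f.write(elems[0].getText()) and got a UnicodeEncodeError. –

Thanks, @Selcuk. I got it. I use elems[0].getText().encode('utf-8') when saving to a file or database. –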

Answer

Have you tried using the encode method?

elems[0].getText().encode('utf-8')

More information about Unicode and Python can be found at https://docs.python.org/2/howto/unicode.html
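As a rough sketch of how the encode step fits into saving the title to a file: the snippet below shows the bytes that encode('utf-8') produces, and an alternative that lets io.open handle the encoding. It assumes Python 3 (io.open with an encoding argument also exists in Python 2). The file name titles.txt and the temporary directory are hypothetical, chosen only for illustration.

```python
import io
import os
import tempfile

# Stand-in for elems[0].getText().
title = u'The Car Hacker\u2019s Handbook'

# encode() turns the text into UTF-8 bytes; U+2019 becomes three bytes.
encoded = title.encode('utf-8')

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'titles.txt')

# Opening the file with an explicit encoding lets f.write() accept the
# text directly, so no manual encode() call is needed.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(title + u'\n')

# Reading with the same encoding gives back the original text.
with io.open(path, 'r', encoding='utf-8') as f:
    read_back = f.read().rstrip(u'\n')
```

Either approach stores the same bytes on disk. Declaring the encoding when opening the file keeps the rest of the code working with text rather than bytes.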

Also, to find out whether your string really is UTF-8 encoded, you can use chardet and run the following:

>>> import chardet 
>>> chardet.detect(elems[0].getText()) 
{'confidence': 0.5, 'encoding': 'utf-8'}

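Thanks. I tried elems[0].getText().encode('utf-8'). It works. The Python terminal displays it as 'The Car Hacker\xe2\x80\x99s Handbook', but when it is written to a file, the file contains "The Car Hacker’s Handbook". –

Cool. I just edited the answer for correctness. – mschuh

@madhusudan_k Welcome to SO. If you feel this answer solved what you were looking for, please don't forget to accept it by clicking the check mark below the vote count. – Blaszard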

Answer

You can try

import unicodedata 

def normText(unicodeText):
    # Decompose characters (NFKD), then drop anything with no ASCII equivalent.
    return unicodedata.normalize('NFKD', unicodeText).encode('ascii', 'ignore')

This converts the unicode text to plain ASCII text that you can write to a file.

It also removes the "apostrophe", so the book title becomes "The Car Hackers Handbook". – BlackJack
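One possible way around the dropped apostrophe, offered as a sketch and not as part of the original answer, is to replace typographic punctuation with ASCII equivalents before normalizing. The function name norm_text_keep_punctuation and the mapping PUNCTUATION_MAP are made up for this example, the mapping is not exhaustive, and the code assumes Python 3.

```python
import unicodedata

# Illustrative, non-exhaustive mapping from typographic punctuation
# (by code point) to ASCII replacements.
PUNCTUATION_MAP = {
    0x2018: u"'",  # left single quotation mark
    0x2019: u"'",  # right single quotation mark
    0x201C: u'"',  # left double quotation mark
    0x201D: u'"',  # right double quotation mark
    0x2013: u'-',  # en dash
    0x2014: u'-',  # em dash
}


def norm_text_keep_punctuation(unicode_text):
    # Swap known punctuation first so it survives the ASCII step.
    replaced = unicode_text.translate(PUNCTUATION_MAP)
    # Decompose accented characters, then drop whatever is still non-ASCII.
    normalized = unicodedata.normalize('NFKD', replaced)
    # Decode back to text; the answer above returns the encoded bytes instead.
    return normalized.encode('ascii', 'ignore').decode('ascii')


result = norm_text_keep_punctuation(u'The Car Hacker\u2019s Handbook')
print(result)
```

Characters that are neither in the mapping nor decomposable to ASCII are still silently dropped, so this only suits output that must be pure ASCII. Encoding as UTF-8 keeps the text unchanged.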
