Encoding special characters

Question:

I have a piece of code that works well in Python 3:

def encode_test(filepath, char_to_int):
    with open(filepath, "r", encoding="latin-1") as f:
        dat = [line.rstrip() for line in f]
        string_to_int = [[char_to_int[char] if char != 'ó' else char_to_int['ò'] for char in line] for line in dat]

However, when I try to run it in Python 2.7, I first get this error:

SyntaxError: Non-ASCII character '\xc3' in file languageIdentification.py on line 30, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

Then I realized I probably need to add #coding=utf-8 at the top of the code. But after doing so, I ran into another error:

UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  string_to_int = [[char_to_int[char] if char != 'ó' else char_to_int['ò'] for char in line] for line in dat]
Traceback (most recent call last):
  File "languageIdentification.py", line 190, in <module>
    test_string = encode_test(sys.argv[3], char_to_int)
  File "languageIdentification.py", line 32, in encode_test
    string_to_int = [[char_to_int[char] if char != 'ó' else char_to_int['ò'] for char in line] for line in dat]
KeyError: u'\xf3'

So can anyone tell me what I can do to fix this in Python 2.7?

Thanks!

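A Python 3 'str' object is actually equivalent to a Python 2 'unicode' object, and a Python 2 'str' object is equivalent to Python 3 'bytes'. Just convert *everything* in your source code to unicode objects and work with those.

@juanpa.arrivillaga Actually, I can't make changes to the source file. Is there any way I can handle this directly in the program? – Parker

What? Do you mean in your text file? You have to change your code; you cannot expect to reuse Python 3 code in Python 2 when the nature of the 'str' type has changed fundamentally.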

Answer

The problem is that you are trying to compare a unicode string with a byte string:

char != 'ó'

Here char is unicode and 'ó' is a byte string (or just str).

When Python 2 encounters such a comparison, it tries to convert (decode):

byte-string -> unicode

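The conversion uses the default encoding, which is ASCII in Python 2.
Since the byte values of 'ó' are above 127, the decoding fails and the comparison is treated as unequal (UnicodeWarning).

By the way, for literals whose byte values are all in the ASCII range, the comparison succeeds.
Example: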

print u'ó' == 'ó' # UnicodeWarning: ... 
print u'z' == 'z' # True
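
The implicit conversion can also be reproduced explicitly. The sketch below is written for Python 3, where bytes are never decoded implicitly, and it only performs by hand the ASCII decoding that Python 2 attempts during the comparison. The helper name decodes_as_ascii is made up for illustration.

```python
# The UTF-8 encoding of 'ó' is two bytes, both above 127.
O_ACUTE_UTF8 = b'\xc3\xb3'


def decodes_as_ascii(raw):
    """Return True if the byte string can be decoded with the ASCII codec."""
    try:
        raw.decode('ascii')
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as_ascii(b'z'))          # True
print(decodes_as_ascii(O_ACUTE_UTF8))  # False
```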

So, before comparing, you need to convert your byte string to unicode manually.
For example, you can do this with the built-in unicode() function:

u = unicode('ó', 'utf-8') # note, that you can specify encoding

or simply with the u'' literal:

u = u'ó'

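But note: with this option, the conversion is performed using the encoding you declared at the top of the source file.
So your actual source encoding and the declared encoding must match.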
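
To see why the two must match, here is a small sketch (Python 3 syntax, with made-up variable names) that decodes the same two source bytes once with the correct encoding and once with a wrong one. It assumes the source file stores 'ó' as UTF-8.

```python
# Bytes as stored in a UTF-8 encoded source file for the literal 'ó'.
source_bytes = b'\xc3\xb3'

matching = source_bytes.decode('utf-8')      # declaration matches the file
mismatched = source_bytes.decode('latin-1')  # declaration does not match

print(matching == u'\xf3')  # True: a single character, U+00F3
print(len(mismatched))      # 2: two unrelated characters, U+00C3 and U+00B3
```

With a mismatched declaration the literal silently becomes a different string, so a comparison such as char != u'ó' would never match the character read from the file.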

As I can see from the SyntaxError message, 'ó' in your source starts with the byte '\xc3'.
So it should be '\xc3\xb3', which is UTF-8:

print '\xc3\xb3'.decode('utf-8') # ó

So, # coding: utf-8 plus char != u'ó' should solve your problem.

UPD。

As I can see from the UnicodeWarning message, there is a second problem: KeyError.

This error occurs in the expression:

char_to_int[char]

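because u'\xf3' (which is actually u'ó') is not a valid key.

This unicode value comes from decoding your file (with latin-1).
I guess there are no unicode keys in your char_to_int at all.

So try encoding such a key back to its byte value: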

char_to_int[char.encode('latin-1')]
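
The key mismatch can be illustrated with a small sketch. The dictionary below is hypothetical: its keys are latin-1 encoded bytes, standing in for a Python 2 dictionary built from plain (byte) str objects. It is written so that it runs on Python 3, where text and bytes never compare equal.

```python
# Hypothetical mapping keyed by latin-1 bytes (what Python 2 calls str).
char_to_int = {b'a': 0, b'\xf2': 1, b'\xf3': 2}

char = u'\xf3'  # 'ó' as decoded text, as produced by reading the file

print(char in char_to_int)                  # False: the text key is absent
print(char_to_int[char.encode('latin-1')])  # 2: the byte key is found
```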

To summarize, try changing the last line of the code you provided to:

string_to_int = [[char_to_int[char.encode('latin-1')] if char != u'ó' else char_to_int['ò'] for char in line] for line in dat]
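
As an alternative to encoding each character back to bytes, one could keep everything as text. The following is only a sketch under two assumptions that are not in the question: char_to_int is keyed by unicode characters, and the function returns the result. It uses io.open, which accepts an encoding argument on both Python 2.7 and Python 3, and writes the special characters as escapes so the source file encoding no longer matters.

```python
import io
import os
import tempfile


def encode_test(filepath, char_to_int):
    """Map each character of each line to its integer code.

    Assumes char_to_int is keyed by text (unicode) characters.
    """
    with io.open(filepath, 'r', encoding='latin-1') as f:
        dat = [line.rstrip() for line in f]
    # u'\xf3' is 'ó' and u'\xf2' is 'ò'; 'ó' is mapped to the code of 'ò'.
    return [[char_to_int[u'\xf2'] if char == u'\xf3' else char_to_int[char]
             for char in line] for line in dat]


# Usage with a throwaway latin-1 file containing the lines 'aó' and 'òa'.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'a\xf3\n\xf2a\n')
result = encode_test(path, {u'a': 0, u'\xf2': 1})
os.remove(path)
print(result)  # [[0, 1], [1, 0]]
```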

Thanks. This works. – Parker

Answer

If you want to convert characters to their integer values, you can use the ord function, which also works for Unicode.

line = u'some Unicode line with ò and ó'
string_to_int = [ord(char) if char != u'ó' else ord(u'ò') for char in line]
