python004 -- String Handling and Encoding Formats

2.1 Strings

2.1.1 Converting to Strings

>>> a = 123
>>> b = 1.23
>>> type(a)
<type 'int'>
>>> type(b)
<type 'float'>
>>> type(str(a))
<type 'str'>
>>> type(str(b))
<type 'str'>

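Note: first define an integer and a float, check their types, then use str() to convert each object to a string.

This uses the type() function, which reports an object's type. type() will come in handy throughout your learning: beginners often hit runtime errors because an object is not of the expected type, and type() helps track such problems down.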
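As a small sketch of that debugging habit, the helper below reports a value's type name together with its str() form. The name describe is made up for illustration and is not from the original. The code is written so that it also runs on Python 3, where the interpreter shows <class 'int'> rather than <type 'int'>.

```python
def describe(value):
    """Return the type name of value and its str() form."""
    # type(value).__name__ gives a short name such as 'int' or 'float'
    return type(value).__name__, str(value)

a = 123
b = 1.23
print(describe(a))
print(describe(b))
```

Printing the type next to the value like this is a quick way to spot a number that was expected to be a string, or the other way round.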

2.1.2 String Concatenation

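# The plus sign joins strings together
>>> hw = "Hello" + "World!"
>>> print hw
HelloWorld!

# Two adjacent string literals are joined automatically
>>> hw = "Hello""World!"
>>> print hw
HelloWorld!

# If a string contains the same kind of quote that delimits it, escape the quote with \, otherwise an error is raised (see the previous chapter).
>>> hw = "Hello \"World!\""
>>> print hw
Hello "World!"

# Concatenating objects of different types
>>> a = "abc"
>>> b = 1
>>> print a + b
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: cannot concatenate 'str' and 'int' objects

Note: a str and an int cannot be concatenated directly; to join them, use one of the approaches below.

Method 1:
>>> c = "%s%d" %(a,b)
>>> print c
abc1

Method 2:
>>> c = a + str(b)
>>> print c
abc1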
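Two further options are worth knowing: str.format() and str.join(). The sketch below reuses a and b from the example; the names c3, c4 and parts are illustrative only. It runs on both Python 2 and Python 3.

```python
a = "abc"
b = 1

# str.format() converts each argument to text for the {0}, {1} fields
c3 = "{0}{1}".format(a, b)

# Convert every item to str first, then join them with an empty separator
parts = [a, b]
c4 = "".join(str(p) for p in parts)

print(c3)
print(c4)
```

join() is usually the better fit when there are many pieces to combine, since each one only needs to be converted with str() once.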

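2.1.3 Formatted Output

Specifier	Description
%s	String (via str())
%r	String (via repr())
%d	Integer
%f	Float; the precision after the decimal point can be specified

1) Three ways to print a string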

>>> xxoo = "string"
>>> print "%s" %xxoo
string
>>> print "%r" %xxoo
'string'
>>> print `xxoo`
'string'

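Note: %s displays the value using str(), and %r displays it using repr(). repr() and backticks (Python 2 only) both return a string that is a valid Python expression for the object.

2) Controlling the number of decimal places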

>>> '%.1f' %(float(100)/1024)
'0.1'
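To see the specifiers side by side, here is a short sketch that formats a few values and varies the precision of %f. The variable name size_kb is invented for illustration; the value is the same float(100)/1024 used above. It runs on both Python 2 and Python 3.

```python
size_kb = float(100) / 1024   # 0.09765625

print("%s" % "string")     # str() form, no quotes
print("%r" % "string")     # repr() form, with quotes
print("%d" % 42)           # integer
print("%.1f" % size_kb)    # one decimal place
print("%.3f" % size_kb)    # three decimal places
```

The digit after the dot in %.1f or %.3f is the number of decimal places kept, and the value is rounded rather than truncated.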

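2.1.4 String Methods

[Figure: list of string methods, with a group of them boxed in red]

The figure above lists the string methods. Roughly half of those boxed in red are used frequently, so we will illustrate some of the common ones.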

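#!/usr/bin/env python
# -*- coding: utf-8 -*-

xxoo = "Hello world!"
print "Length of string: %s" % len(xxoo)
print "First letter capitalized: %s" % xxoo.capitalize()
print "Occurrences of the letter l: %s" % xxoo.count('l')
print "Ends with an exclamation mark: %s" % xxoo.endswith('!')
print "Starts with the letter w: %s" % xxoo.startswith('w')
print "Index of the letter w: %s" % xxoo.find('w') # or xxoo.index('w'), which raises ValueError if not found
print "Formatted string: Hello{0} world!".format(',')
print "All lowercase: %s" % xxoo.islower()
print "All uppercase: %s" % xxoo.isupper()
print "Converted to lowercase: %s" % xxoo.lower()
print "Converted to uppercase: %s" % xxoo.upper()
print "Exclamation mark replaced with a period: %s" % xxoo.replace('!','.')
print "Split on spaces into a list: %s" % xxoo.split(' ')
print "Split into a list of lines: %s" % xxoo.splitlines()
print "Leading and trailing whitespace stripped: %s" % xxoo.strip()
print "Case swapped: %s" % xxoo.swapcase()
print "Only the Hello part: %s" % xxoo[0:5]
print "Last character removed: %s" % xxoo[0:-1]

# python test.py
Length of string: 12
First letter capitalized: Hello world!
Occurrences of the letter l: 3
Ends with an exclamation mark: True
Starts with the letter w: False
Index of the letter w: 6
Formatted string: Hello, world!
All lowercase: False
All uppercase: False
Converted to lowercase: hello world!
Converted to uppercase: HELLO WORLD!
Exclamation mark replaced with a period: Hello world.
Split on spaces into a list: ['Hello', 'world!']
Split into a list of lines: ['Hello world!']
Leading and trailing whitespace stripped: Hello world!
Case swapped: hELLO WORLD!
Only the Hello part: Hello
Last character removed: Hello world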
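The script mentions index() in a comment next to find(). The two return the same position when the substring exists, but behave differently when it is missing. A minimal sketch of that difference, reusing the xxoo string from the script (it runs on both Python 2 and Python 3):

```python
xxoo = "Hello world!"

# find() returns -1 when the substring is absent
print(xxoo.find('W'))

# index() raises ValueError in the same situation
try:
    xxoo.index('W')
except ValueError as err:
    print("index() failed: %s" % err)

# Both give the same position when the substring exists
print(xxoo.find('w') == xxoo.index('w'))
```

find() suits cases where a missing substring is normal and can be checked with a simple comparison against -1; index() suits cases where a missing substring should stop the program with an error.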

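2.1.5 Colored String Output

Foreground color	Background color	Display mode
30: black	40: black	0: terminal default
31: red	41: red	1: highlighted (bold)
32: green	42: green	4: underline
33: yellow	43: yellow	5: blink
34: blue	44: blue	7: reverse video
35: magenta	45: magenta	8: hidden
36: cyan	46: cyan	
37: white	47: white	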

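Format:

\033[1;31;40m # 1 is the display mode (optional), 31 is the foreground color, 40 is the background color, and the trailing m ends the sequence.

\033[0m # Restores the terminal's default colors, i.e. cancels the color settings.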
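The format above can be wrapped in a small helper. This is only a sketch: the function name colorize and its default arguments are made up for illustration, and it assumes a terminal that understands ANSI escape sequences. It runs on both Python 2 and Python 3.

```python
def colorize(text, fg=31, bg=40, mode=1):
    """Wrap text in an ANSI color sequence and reset the colors afterwards."""
    # \033[<mode>;<fg>;<bg>m switches the style on, \033[0m switches it off
    return "\033[%d;%d;%dm%s\033[0m" % (mode, fg, bg, text)

print(colorize("Error: disk is full"))        # highlighted red on black
print(colorize("OK", fg=32, bg=40, mode=0))   # green on black
```

Always ending with \033[0m keeps the color from leaking into whatever the terminal prints next.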

Blog: http://lizhenliang.blog.51cto.com

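2.2 Encoding

2.2.1 Common Character Encodings

ASCII: American Standard Code for Information Interchange, one of the most widely used character encodings in computing. Each ASCII code is stored in 1 byte; for example, the digit character 0 has the ASCII code 0110000 in binary, which is 48 in decimal.

Unicode: the world has hundreds of languages and each country defined its own standard, so text in different encodings got mixed up, conflicted, and was easily displayed as garbled characters. Unicode was created to solve this. It unifies the characters of all languages into a single code set and defines a standard for each language's characters, which is why it is also called the universal code. Most programming languages support Unicode, and Python supports it internally as well.

GB2312: the standard encoding for Chinese characters, published by China's national standards administration.

GBK: an extension of GB2312, backward compatible with it.

UTF-8: a variable-length character encoding for Unicode. It supports Simplified and Traditional Chinese as well as other languages (such as English, Japanese, and Korean).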
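To make the differences concrete, the sketch below encodes one character in several of these encodings and counts the bytes. The variable name zhong is illustrative; the character is written as a \u escape so the snippet itself stays pure ASCII. It runs on both Python 2 and Python 3.

```python
zhong = u'\u4e2d'   # the Chinese character for "middle", first character of the word for "Chinese"

print(ord('0'))                       # 48, the ASCII code of the digit character 0
print(len('A'.encode('ascii')))       # ASCII: 1 byte per character
print(len(zhong.encode('utf-8')))     # UTF-8: 3 bytes for this character
print(len(zhong.encode('gbk')))       # GBK: 2 bytes
print(len(zhong.encode('gb2312')))    # GB2312: 2 bytes, same bytes as GBK
```

The same character therefore occupies a different number of bytes, and different byte values, depending on the encoding chosen, which is why bytes read with the wrong encoding come out garbled.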

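2.2.3 decode()

decode() decodes a byte string in some other encoding (such as ASCII) into Unicode.

2.2.4 encode()

encode() encodes Unicode into an encoding that the terminal software can recognize, such as UTF-8 or GBK, so that the text displays correctly.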

2.2.5 Encoding Handling in Python

#!/usr/bin/env python
c = "中文"
print c

# python test.py
  File "test.py", line 2
SyntaxError: Non-ASCII character '\xe4' in file test.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

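Note: printing Chinese directly in a script raises a syntax error, because Python 2's default source encoding is ASCII and it cannot handle other encodings.

To print Chinese, declare the encoding as utf-8, as was done in the earlier script: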

#!/usr/bin/env python
# -*- coding: utf-8 -*-
c = "中文"
print c
print type(c)

# python test.py
中文
<type 'str'>

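Chinese now prints correctly. The type is str, which in Python 2 is a byte string: here it holds the utf-8 encoded bytes of the characters.

Being able to print Chinese correctly does not mean everything is settled: if the terminal's encoding is not utf-8, or other software is unsure of the encoding, garbled output can still appear. It is therefore worth understanding the logic of how Python handles encodings, so that encoding problems can be dealt with more effectively.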
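One way to find out which encoding the terminal side expects is to ask Python. The sketch below uses two standard-library lookups, sys.stdout.encoding and locale.getpreferredencoding(); the variable names are illustrative, and the values printed depend entirely on the machine it runs on. It runs on both Python 2 and Python 3.

```python
import sys
import locale

# Encoding Python believes stdout uses (may be None, e.g. when output is piped)
stdout_encoding = getattr(sys.stdout, 'encoding', None)

# Encoding preferred by the current locale settings
preferred = locale.getpreferredencoding()

print("stdout encoding: %s" % stdout_encoding)
print("locale preferred encoding: %s" % preferred)
```

If these report something other than utf-8 (GBK, for example), bytes written as utf-8 are likely to show up garbled.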

Switch to the interactive interpreter:

>>> c = "中文"
>>> c.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

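Converting directly to utf-8 is not allowed: it fails with a UnicodeDecodeError, which roughly says that the ascii codec cannot decode the byte string.

As noted above, encode() encodes Unicode into another encoding. The variable c, however, is not a Unicode object but a byte string, so Python 2 first tries to decode it to Unicode implicitly with the default ascii codec, and that step fails on the non-ASCII bytes.

So encode() cannot be called on it directly; instead, first decode it to Unicode with decode(), then encode it to utf-8 with encode().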
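In Python 2, calling encode() on a byte string first decodes it with the default ascii codec. The sketch below performs that hidden step explicitly, so the failure and the fix can be seen side by side. The names raw and utf8_again are illustrative, and the bytes are written as escapes so that the snippet runs unchanged on both Python 2 and Python 3.

```python
raw = b'\xe4\xb8\xad\xe6\x96\x87'   # the utf-8 bytes of the two-character Chinese string

# The step Python 2 performs implicitly inside str.encode(): decode as ascii
try:
    raw.decode('ascii')
except UnicodeDecodeError as err:
    print("decoding as ascii fails: %s" % err)

# Naming the real source encoding makes the round trip work
utf8_again = raw.decode('utf-8').encode('utf-8')
print(utf8_again == raw)
```

The error message is the same one shown in the traceback above, because byte 0xe4 lies outside the 0 to 127 range that ascii covers.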

>>> c.decode('utf-8')
u'\u4e2d\u6587'       # 4e2d is the Unicode code point of "中", and 6587 is that of "文"
>>> type(c.decode('utf-8'))
<type 'unicode'>
>>> print c.decode('utf-8')
中文
>>> print c.decode('utf-8').encode('utf-8')
中文

A Unicode string can be converted directly to another encoding with encode().

>>> c = u'中文'
>>> c.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'
>>> print c.encode('utf-8')
中文

Now compare a byte string with a unicode string:

>>> c = '中文'
>>> u = u'中文'
>>> c
'\xe4\xb8\xad\xe6\x96\x87'
>>> u
u'\u4e2d\u6587'
>>> len(c)
6
>>> len(u)
2

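The byte string is longer than the unicode string: each of these characters takes 3 bytes in utf-8, giving 6 bytes, whereas the length of the unicode string is simply the number of characters.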
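The same comparison can be reproduced in a form that runs on both Python 2 and Python 3 by building the byte string from the unicode string. The names u and c mirror the session above; the per-character breakdown at the end is an added illustration.

```python
u = u'\u4e2d\u6587'        # the two-character Chinese string as unicode
c = u.encode('utf-8')      # the same text as utf-8 bytes

print(len(u))    # 2, the number of characters
print(len(c))    # 6, the number of bytes

# Bytes needed by each character in utf-8
print([len(ch.encode('utf-8')) for ch in u])
```

When a length check or a slice has to count characters rather than bytes, do it on the unicode string.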

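To summarize, Python's encoding workflow is roughly: byte string (e.g. ascii) --> decode() --> unicode --> encode() --> an encoding the terminal recognizes. unicode serves as the intermediate form that links the two ends.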
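That flow can be sketched as one small function that converts bytes from one encoding to another by passing through unicode. The name transcode and its parameters are hypothetical, chosen only for this illustration. It runs on both Python 2 and Python 3.

```python
def transcode(data, src_encoding, dst_encoding):
    """Convert bytes from one encoding to another via unicode."""
    text = data.decode(src_encoding)     # byte string -> unicode
    return text.encode(dst_encoding)     # unicode -> byte string

utf8_bytes = b'\xe4\xb8\xad\xe6\x96\x87'            # the two-character Chinese string in utf-8
gbk_bytes = transcode(utf8_bytes, 'utf-8', 'gbk')   # for a terminal that expects GBK

print(repr(gbk_bytes))
print(transcode(gbk_bytes, 'gbk', 'utf-8') == utf8_bytes)
```

Going through unicode in the middle means each encoding only needs to convert to and from one common form, rather than to every other encoding directly.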
