Jupyter notebook gives you a different reading experience!
- Input and output typically fall into a few major categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network resources via web APIs.
6.1 Reading and Writing Data in Text Format
- pandas provides a number of functions for reading tabular data as a DataFrame object; of these, read_csv and read_table are the ones you will use most often.

import pandas as pd
df = pd.read_csv('data/ex1.csv')
df
|   | a | b | c | d | message |
|---|---|---|---|---|---------|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
- Even though this is a .csv file, you can read it with read_table as well; you just need to specify the delimiter, ',':
pd.read_table('data/ex1.csv', sep=',')

|   | a | b | c | d | message |
|---|---|---|---|---|---------|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
- Note: if you see a ParserWarning saying the separator "encoded in utf-8 is > 1 char long" and the file comes back as a single column, check that you typed an ASCII comma rather than a full-width ','; a multi-byte separator forces a fallback to the slower 'python' engine and the line is not split.
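The sep argument also accepts regular expressions, which is handy for files whose fields are separated by a variable amount of whitespace. A minimal sketch with invented inline data (regex separators generally fall back to the slower 'python' engine, so it is specified explicitly here):

```python
import pandas as pd
from io import StringIO

# Invented data: fields separated by varying runs of whitespace
raw = """a   b    c
1  2   3
4 5  6"""

# A regex separator; engine='python' avoids the C-engine fallback warning
frame = pd.read_csv(StringIO(raw), sep=r'\s+', engine='python')
print(frame)
```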
- A file will not always have a header row. When reading such a file, you can either let pandas assign default column names or specify the names yourself:
pd.read_csv('data/ex2.csv', header=None)

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
pd.read_csv('data/ex2.csv', names=['A', 'B', 'C', 'D', 'message'])

|   | A | B | C | D | message |
|---|---|---|---|---|---------|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
- To explicitly make the message column the index of the returned DataFrame, pass it to index_col (either its position, 4, or its name, 'message'):
names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv('data/ex2.csv', names=names, index_col='message')

| message | a | b | c | d |
|---------|---|---|---|---|
| hello | 1 | 2 | 3 | 4 |
| world | 5 | 6 | 7 | 8 |
| foo | 9 | 10 | 11 | 12 |
- To form a hierarchical index from multiple columns, just pass a list of column numbers or names:
parsed = pd.read_csv('data/csv_mindex.csv', index_col=['key1', 'key2'])
parsed

| key1 | key2 | value1 | value2 |
|------|------|--------|--------|
| one  | a | 1 | 2 |
|      | b | 3 | 4 |
|      | c | 5 | 6 |
|      | d | 7 | 8 |
| two  | a | 9 | 10 |
|      | b | 11 | 12 |
|      | c | 13 | 14 |
|      | d | 15 | 16 |
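If you later want the hierarchical index back as ordinary columns, reset_index reverses what index_col did; a quick sketch:

```python
# Turn the key1/key2 index levels back into regular columns
flat = parsed.reset_index()
print(flat.head())
```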
- Some files contain rows that you don't need; these can be skipped with the skiprows parameter:
result = pd.read_csv('data/ex4.csv', skiprows=[0, 2, 3])
result

|   | a | b | c | d | message |
|---|---|---|---|---|---------|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
- Missing data is usually either not present (an empty string) or marked by some sentinel value. By default, pandas recognizes a set of commonly occurring sentinels, such as NA and NULL:
result = pd.read_csv('data/ex5.csv')
result

|   | something | a | b | c | d | message |
|---|-----------|---|---|---|---|---------|
| 0 | one | 1 | 2 | 3.0 | 4 | NaN |
| 1 | two | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | foo |
pd.isnull(result)

|   | something | a | b | c | d | message |
|---|-----------|---|---|---|---|---------|
| 0 | False | False | False | False | False | True |
| 1 | False | False | False | True | False | False |
| 2 | False | False | False | False | False | False |
- The na_values option can take a list or set of strings to treat as missing values:
result = pd.read_csv('data/ex5.csv', na_values=['NULL'])
result

|   | something | a | b | c | d | message |
|---|-----------|---|---|---|---|---------|
| 0 | one | 1 | 2 | 3.0 | 4 | NaN |
| 1 | two | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | foo |
- Different NA sentinels can be specified for each column by passing a dict:
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
pd.read_csv('data/ex5.csv', na_values=sentinels)

|   | something | a | b | c | d | message |
|---|-----------|---|---|---|---|---------|
| 0 | one | 1 | 2 | 3.0 | 4 | NaN |
| 1 | NaN | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | NaN |
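If you need full control over what counts as missing, read_csv also takes keep_default_na=False, which disables the built-in sentinel list so that only your own na_values apply. A small sketch with invented inline data (not from the original notes):

```python
import pandas as pd
from io import StringIO

# Invented data: 'NA' should survive as text, 'missing' should become NaN
raw = """something,a
NA,1
missing,2"""

frame = pd.read_csv(StringIO(raw), keep_default_na=False, na_values=['missing'])
print(frame)
```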
- Some commonly used read_csv / read_table arguments include: the file path, sep/delimiter, header, index_col, names, skiprows, na_values, comment, parse_dates, nrows, chunksize, and encoding.
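As a quick illustration of how several of these arguments combine, here is a sketch that parses an invented inline CSV while skipping a comment line, supplying column names, and marking a sentinel as missing:

```python
import pandas as pd
from io import StringIO

raw = """# a comment line that should be skipped
1,2,3,hello
4,5,6,world
7,8,9,foo"""

frame = pd.read_csv(
    StringIO(raw),
    skiprows=[0],                      # drop the comment line
    names=['a', 'b', 'c', 'message'],  # supply our own header
    na_values=['foo'],                 # treat 'foo' as a missing-value sentinel
)
print(frame)
```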
Reading Text Files in Pieces
- When processing very large files, or figuring out the right set of arguments to correctly process a large file, you may only want to read a small piece of the file or iterate through it in smaller chunks.
- Before we look at a large file, let's make the pandas display settings more compact:
pd.options.display.max_rows = 8
result = pd.read_csv('data/examples/ex6.csv')
result
|   | one | two | three | four | key |
|---|-----|-----|-------|------|-----|
| 0 | 0.467976 | -0.038649 | -0.295344 | -1.824726 | L |
| 1 | -0.358893 | 1.404453 | 0.704965 | -0.200638 | B |
| 2 | -0.501840 | 0.659254 | -0.421691 | -0.057688 | G |
| 3 | 0.204886 | 1.074134 | 1.388361 | -0.982404 | R |
| ... | ... | ... | ... | ... | ... |
| 9996 | -0.479893 | -0.650419 | 0.745152 | -0.646038 | E |
| 9997 | 0.523331 | 0.787112 | 0.486066 | 1.093156 | K |
| 9998 | -0.362559 | 0.598894 | -1.843201 | 0.887292 | G |
| 9999 | -0.096376 | -1.012999 | -0.657431 | -0.573315 | 0 |

10000 rows × 5 columns
- If you want to read only a small number of rows, specify that with nrows:
pd.read_csv('data/examples/ex6.csv', nrows=5)

|   | one | two | three | four | key |
|---|-----|-----|-------|------|-----|
| 0 | 0.467976 | -0.038649 | -0.295344 | -1.824726 | L |
| 1 | -0.358893 | 1.404453 | 0.704965 | -0.200638 | B |
| 2 | -0.501840 | 0.659254 | -0.421691 | -0.057688 | G |
| 3 | 0.204886 | 1.074134 | 1.388361 | -0.982404 | R |
| 4 | 0.354628 | -0.133116 | 0.283763 | -0.837063 | Q |
- To read a file in pieces, specify a chunksize as a number of rows:
chunker = pd.read_csv('data/examples/ex6.csv', chunksize=1000)
chunker
<pandas.io.parsers.TextFileReader at 0xed5ed0>
- The TextFileReader object returned by read_csv allows you to iterate over the file in chunks of chunksize rows. For example, we can iterate over ex6.csv, aggregating the value counts in the 'key' column, like so:
tot = pd.Series([], dtype='float64')
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
tot[:10]
E 368.0
X 364.0
L 346.0
O 343.0
...
J 337.0
F 335.0
K 334.0
H 330.0
Length: 10, dtype: float64
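TextFileReader also has a get_chunk method, which reads an arbitrary number of rows on demand instead of iterating; a short sketch (opening a fresh reader, since the one above has been exhausted):

```python
# A fresh reader; get_chunk pulls just the next N rows
chunker = pd.read_csv('data/examples/ex6.csv', chunksize=1000)
piece = chunker.get_chunk(10)
print(piece)
```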
Writing Data to Text Format
data = pd.read_csv('data/examples/ex5.csv')
data
|   | something | a | b | c | d | message |
|---|-----------|---|---|---|---|---------|
| 0 | one | 1 | 2 | 3.0 | 4 | NaN |
| 1 | two | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | foo |
data.to_csv('data/out.csv')
- Missing values appear as empty strings in the output. You may want to denote them by some other sentinel value:
import sys
data.to_csv(sys.stdout, na_rep='NULL')
,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo
- With no other options specified, both the row and column labels are written. Both of these can be disabled:
data.to_csv(sys.stdout, index=False, header=False)
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo
- You can also write only a subset of the columns, in an order of your choosing:
data.to_csv(sys.stdout, index=False, columns=['a', 'c', 'b'])
a,c,b
1,3.0,2
5,,6
9,11.0,10
import numpy as np
# Write a small time series out to CSV...
dates = pd.date_range('1/1/2000', periods=7)
ts = pd.Series(np.arange(7), index=dates)
ts.to_csv('data/tseries.csv')
# ...and read it back; note that Series.to_csv wrote no header row here,
# so read_csv promotes the first data row to the column labels
pd.read_csv('data/tseries.csv')
|   | 2000-01-01 | 0 |
|---|------------|---|
| 0 | 2000-01-02 | 1 |
| 1 | 2000-01-03 | 2 |
| 2 | 2000-01-04 | 3 |
| 3 | 2000-01-05 | 4 |
| 4 | 2000-01-06 | 5 |
| 5 | 2000-01-07 | 6 |
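To recover a usable time series instead, you can tell read_csv that there is no header row and ask it to parse the first column as a DatetimeIndex. A sketch against the file written above (the column names here are made up):

```python
# header=None stops the first data row from being promoted to the header;
# parse_dates=True parses the index column as datetimes
frame = pd.read_csv('data/tseries.csv', header=None,
                    names=['date', 'value'],   # hypothetical names
                    index_col='date', parse_dates=True)
print(frame)
```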
Working with Delimited Formats
- Most forms of tabular data stored on disk can be loaded with pandas.read_csv, but sometimes some manual processing is necessary.
import pandas as pd
import csv

f = open("data/examples/ex7.csv")
reader = csv.reader(f)
for line in reader:
    print(line)
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3']
- To wrangle the data into the form we need, a bit of work is required. First, read the file into a list of lines, then split off the header and build a dict of data columns:
with open("data/examples/ex7.csv") as f:
lines = list(csv.reader(f))
header, values = lines[0], lines[1:]
data_dict = {h:v for h,v in zip(header,zip(*values))}
data_dict
{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
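If you want to continue in pandas from here, the column dict can be handed straight to the DataFrame constructor:

```python
# Each dict key becomes a column; the tuple of strings becomes its values
frame = pd.DataFrame(data_dict)
print(frame)
```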
- CSV files come in many different flavors. To define a new format with its own delimiter, string-quoting convention, or line terminator, define a simple subclass of csv.Dialect:
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = '|'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

with open("data/examples/ex7.csv") as f:
    reader = csv.reader(f, dialect=my_dialect)
    print(reader)
<_csv.reader object at 0x07025DB0>
- Individual CSV dialect parameters can also be given as keyword arguments to csv.reader without having to define a subclass:
with open("data/examples/ex7.csv") as f:
reader = csv.reader(f,delimiter='|')
print(reader)
<_csv.reader object at 0x07025D30>
- To write delimited files manually, use csv.writer. It accepts an open, writable file object and the same dialect and formatting options as csv.reader:
with open('data/mydata.csv', 'w') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))
    writer.writerow(('7', '8', '9'))
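Instead of subclassing csv.Dialect, you can also register the same options under a name with csv.register_dialect and refer to the dialect as a string; a small sketch:

```python
import csv

# Register the same pipe-delimited format under a name
csv.register_dialect('pipes', delimiter='|', quotechar='"',
                     quoting=csv.QUOTE_MINIMAL, lineterminator='\n')

with open('data/mydata.csv', 'w') as f:
    writer = csv.writer(f, dialect='pipes')
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
```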
JSON Data
- JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more free-form data format than a tabular text format like CSV.
obj = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
{"name": "Katie", "age": 38,
"pets": ["Sixes", "Stache", "Cisco"]}]
}
"""
- JSON is very nearly valid Python code, apart from its null value and some other nuances (such as disallowing trailing commas at the end of lists). The basic types are objects (dicts), arrays (lists), strings, numbers, booleans, and nulls; all of the keys in an object must be strings. There are several Python libraries for reading and writing JSON data; I'll use json here, as it is built into the Python standard library. To convert a JSON string to Python form, use json.loads:
import json
result = json.loads(obj)
result
{'name': 'Wes',
'places_lived': ['United States', 'Spain', 'Germany'],
'pet': None,
'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
{'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}
- json.dumps, on the other hand, converts a Python object back to JSON:
asjson = json.dumps(result)
asjson
'{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'
- The simplest and most convenient way to convert a JSON object (or a group of them) to a DataFrame is to pass a list of dicts (which is what the JSON objects were) to the DataFrame constructor and select a subset of the data fields:
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
siblings

|   | name | age |
|---|------|-----|
| 0 | Scott | 30 |
| 1 | Katie | 38 |
- pandas.read_json can automatically convert JSON datasets in particular arrangements into a Series or DataFrame. The default options assume that each object in the JSON array is a row in the table:
data = pd.read_json('data/examples/example.json')
data

|   | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
| 2 | 7 | 8 | 9 |
- To export data from pandas to JSON, use the to_json method:
print(data.to_json())
{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}
print(data.to_json(orient='records'))
[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
XML and HTML: Web Scraping
- Python has many libraries for reading and writing data in the ubiquitous HTML and XML formats, including lxml, Beautiful Soup, and html5lib. lxml is comparatively much faster, while the other libraries can better handle malformed HTML or XML files. pandas has a built-in function, read_html, which uses lxml and Beautiful Soup to automatically parse tables out of HTML files as DataFrame objects.
import pandas as pd
tables = pd.read_html('data/examples/fdic_failed_bank_list.html')
len(tables)
1
failures = tables[0]
failures.head()
|   | Bank Name | City | ST | CERT | Acquiring Institution | Closing Date | Updated Date |
|---|-----------|------|----|------|-----------------------|--------------|--------------|
| 0 | Allied Bank | Mulberry | AR | 91 | Today's Bank | September 23, 2016 | November 17, 2016 |
| 1 | The Woodbury Banking Company | Woodbury | GA | 11297 | United Bank | August 19, 2016 | November 17, 2016 |
| 2 | First CornerStone Bank | King of Prussia | PA | 35312 | First-Citizens Bank & Trust Company | May 6, 2016 | September 6, 2016 |
| 3 | Trust Company Bank | Memphis | TN | 9956 | The Bank of Fayette County | April 29, 2016 | September 6, 2016 |
| 4 | North Milwaukee State Bank | Milwaukee | WI | 20364 | First-Citizens Bank & Trust Company | March 11, 2016 | June 16, 2016 |
- From here we could do some data cleaning and analysis, such as computing the number of bank failures by year:
close_timestamps = pd.to_datetime(failures['Closing Date'])
close_timestamps.dt.year.value_counts()
2010 157
2009 140
2011 92
2012 51
2008 25
2013 24
2014 18
2002 11
2015 8
2016 5
2004 4
2001 4
2007 3
2003 3
2000 2
Name: Closing Date, dtype: int64
Parsing XML with lxml.objectify
- XML (eXtensible Markup Language) is another common structured data format supporting hierarchical, nested data with metadata. We just saw pandas.read_html; XML and HTML are structurally similar, but XML is more general. Here we use lxml.objectify to parse a file of MTA train performance data and build a DataFrame from it:
from lxml import objectify

path = 'data/mta_perf/Performance_MNR.xml'
parsed = objectify.parse(open(path))
root = parsed.getroot()

data = []
skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ', 'DESIRED_CHANGE', 'DECIMAL_PLACES']
# root.INDICATOR yields each <INDICATOR> element; build a dict of tag name
# to value for each record, excluding a few tags we don't need
for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)
# Convert the list of dicts into a DataFrame
perf = pd.DataFrame(data)
perf.head()
|   | AGENCY_NAME | CATEGORY | DESCRIPTION | FREQUENCY | INDICATOR_NAME | INDICATOR_UNIT | MONTHLY_ACTUAL | MONTHLY_TARGET | PERIOD_MONTH | PERIOD_YEAR | YTD_ACTUAL | YTD_TARGET |
|---|-------------|----------|-------------|-----------|----------------|----------------|----------------|----------------|--------------|-------------|------------|------------|
| 0 | Metro-North Railroad | Service Indicators | Percent of commuter trains that arrive at thei... | M | On-Time Performance (West of Hudson) | % | 96.9 | 95 | 1 | 2008 | 96.9 | 95 |
| 1 | Metro-North Railroad | Service Indicators | Percent of commuter trains that arrive at thei... | M | On-Time Performance (West of Hudson) | % | 95 | 95 | 2 | 2008 | 96 | 95 |
| 2 | Metro-North Railroad | Service Indicators | Percent of commuter trains that arrive at thei... | M | On-Time Performance (West of Hudson) | % | 96.9 | 95 | 3 | 2008 | 96.3 | 95 |
| 3 | Metro-North Railroad | Service Indicators | Percent of commuter trains that arrive at thei... | M | On-Time Performance (West of Hudson) | % | 98.3 | 95 | 4 | 2008 | 96.8 | 95 |
| 4 | Metro-North Railroad | Service Indicators | Percent of commuter trains that arrive at thei... | M | On-Time Performance (West of Hudson) | % | 95.8 | 95 | 5 | 2008 | 96.6 | 95 |
- XML data can get much more complicated than this example; each tag can have metadata too. Consider an HTML link tag, which is also valid XML:
from io import StringIO
tag = '<a href="http://www.google.com">Google</a>'
root = objectify.parse(StringIO(tag)).getroot()
root
<Element a at 0xcac8a0>
- You can now access any of the tag's attributes (like href) or the link text:
root.get('href')
'http://www.google.com'
root.text
'Google'