pandas Quick Tutorial
Object Creation
Create a Series by passing a list of values, letting pandas create a default integer index:
In [1]:
import pandas as pd
import numpy as np
s = pd.Series([1,3,5,np.nan,6,8])
s
Out[1]:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
In [4]:
dates = pd.date_range('20130101', periods=6)
dates
Out[4]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D')
In [15]:
df = pd.DataFrame(np.arange(24).reshape(6,4), index=dates, columns=list('ABCD'))
df
Out[15]:
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | 0 | 1 | 2 | 3 |
2013-01-02 | 4 | 5 | 6 | 7 |
2013-01-03 | 8 | 9 | 10 | 11 |
2013-01-04 | 12 | 13 | 14 | 15 |
2013-01-05 | 16 | 17 | 18 | 19 |
2013-01-06 | 20 | 21 | 22 | 23 |
In [16]:
df2 = pd.DataFrame({ 'A' : 1.,
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.array([3] * 4,dtype='int32'),
'E' : pd.Categorical(["test","train","test","train"]),
'F' : 'foo' })
df2
Out[16]:
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
The columns of the resulting DataFrame have different dtypes.
In [17]:
df2.dtypes
Out[17]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
Viewing Data
Here is how to view the top and bottom rows of the frame:
In [18]:
df.head()
Out[18]:
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | 0 | 1 | 2 | 3 |
2013-01-02 | 4 | 5 | 6 | 7 |
2013-01-03 | 8 | 9 | 10 | 11 |
2013-01-04 | 12 | 13 | 14 | 15 |
2013-01-05 | 16 | 17 | 18 | 19 |
In [19]:
df.tail(3)
Out[19]:
A | B | C | D | |
---|---|---|---|---|
2013-01-04 | 12 | 13 | 14 | 15 |
2013-01-05 | 16 | 17 | 18 | 19 |
2013-01-06 | 20 | 21 | 22 | 23 |
Display the index, the columns, and the underlying NumPy data:
In [20]:
df.index
Out[20]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D')
In [21]:
df.columns
Out[21]:
Index(['A', 'B', 'C', 'D'], dtype='object')
In [22]:
df.values
Out[22]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])
describe() shows a quick statistic summary of your data:
In [23]:
df.describe()
Out[23]:
A | B | C | D | |
---|---|---|---|---|
count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
mean | 10.000000 | 11.000000 | 12.000000 | 13.000000 |
std | 7.483315 | 7.483315 | 7.483315 | 7.483315 |
min | 0.000000 | 1.000000 | 2.000000 | 3.000000 |
25% | 5.000000 | 6.000000 | 7.000000 | 8.000000 |
50% | 10.000000 | 11.000000 | 12.000000 | 13.000000 |
75% | 15.000000 | 16.000000 | 17.000000 | 18.000000 |
max | 20.000000 | 21.000000 | 22.000000 | 23.000000 |
Transposing your data:
In [24]:
df.T
Out[24]:
2013-01-01 00:00:00 | 2013-01-02 00:00:00 | 2013-01-03 00:00:00 | 2013-01-04 00:00:00 | 2013-01-05 00:00:00 | 2013-01-06 00:00:00 | |
---|---|---|---|---|---|---|
A | 0 | 4 | 8 | 12 | 16 | 20 |
B | 1 | 5 | 9 | 13 | 17 | 21 |
C | 2 | 6 | 10 | 14 | 18 | 22 |
D | 3 | 7 | 11 | 15 | 19 | 23 |
Sorting by an axis:
In [25]:
df.sort_index(axis=1, ascending=False)
Out[25]:
D | C | B | A | |
---|---|---|---|---|
2013-01-01 | 3 | 2 | 1 | 0 |
2013-01-02 | 7 | 6 | 5 | 4 |
2013-01-03 | 11 | 10 | 9 | 8 |
2013-01-04 | 15 | 14 | 13 | 12 |
2013-01-05 | 19 | 18 | 17 | 16 |
2013-01-06 | 23 | 22 | 21 | 20 |
Sorting by values:
In [27]:
df.sort_values(by='B',ascending=False)
Out[27]:
A | B | C | D | |
---|---|---|---|---|
2013-01-06 | 20 | 21 | 22 | 23 |
2013-01-05 | 16 | 17 | 18 | 19 |
2013-01-04 | 12 | 13 | 14 | 15 |
2013-01-03 | 8 | 9 | 10 | 11 |
2013-01-02 | 4 | 5 | 6 | 7 |
2013-01-01 | 0 | 1 | 2 | 3 |
Selection
Selecting a single column, which yields a Series, equivalent to df.A:
In [28]:
df['A']
Out[28]:
2013-01-01     0
2013-01-02     4
2013-01-03     8
2013-01-04    12
2013-01-05    16
2013-01-06    20
Freq: D, Name: A, dtype: int32
Selecting via [], which slices the rows:
In [29]:
df[0:3]
Out[29]:
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | 0 | 1 | 2 | 3 |
2013-01-02 | 4 | 5 | 6 | 7 |
2013-01-03 | 8 | 9 | 10 | 11 |
In [30]:
df['20130102':'20130104']
Out[30]:
A | B | C | D | |
---|---|---|---|---|
2013-01-02 | 4 | 5 | 6 | 7 |
2013-01-03 | 8 | 9 | 10 | 11 |
2013-01-04 | 12 | 13 | 14 | 15 |
Selection by Label
For getting a cross section using a label:
In [31]:
df.loc[dates[0]]
Out[31]:
A    0
B    1
C    2
D    3
Name: 2013-01-01 00:00:00, dtype: int32
In [32]:
df.loc[:,['A','B']]
Out[32]:
A | B | |
---|---|---|
2013-01-01 | 0 | 1 |
2013-01-02 | 4 | 5 |
2013-01-03 | 8 | 9 |
2013-01-04 | 12 | 13 |
2013-01-05 | 16 | 17 |
2013-01-06 | 20 | 21 |
In [33]:
df.loc['20130102':'20130104',['A','B']]
Out[33]:
A | B | |
---|---|---|
2013-01-02 | 4 | 5 |
2013-01-03 | 8 | 9 |
2013-01-04 | 12 | 13 |
In [34]:
df.loc['20130102',['A','B']]
Out[34]:
A    4
B    5
Name: 2013-01-02 00:00:00, dtype: int32
In [35]:
df.loc[dates[0],'A']
Out[35]:
0
For getting fast access to a scalar (equivalent to the prior method):
In [36]:
df.at[dates[0],'A']
Out[36]:
0
Selection by Position
Select via the position of the passed integers:
In [37]:
df.iloc[3]
Out[37]:
A    12
B    13
C    14
D    15
Name: 2013-01-04 00:00:00, dtype: int32
In [38]:
df.iloc[3:5,0:2]
Out[38]:
A | B | |
---|---|---|
2013-01-04 | 12 | 13 |
2013-01-05 | 16 | 17 |
In [39]:
df.iloc[[1,2,4],[0,2]]
Out[39]:
A | C | |
---|---|---|
2013-01-02 | 4 | 6 |
2013-01-03 | 8 | 10 |
2013-01-05 | 16 | 18 |
In [40]:
df.iloc[1:3,:]
Out[40]:
A | B | C | D | |
---|---|---|---|---|
2013-01-02 | 4 | 5 | 6 | 7 |
2013-01-03 | 8 | 9 | 10 | 11 |
In [41]:
df.iloc[:,1:3]
Out[41]:
B | C | |
---|---|---|
2013-01-01 | 1 | 2 |
2013-01-02 | 5 | 6 |
2013-01-03 | 9 | 10 |
2013-01-04 | 13 | 14 |
2013-01-05 | 17 | 18 |
2013-01-06 | 21 | 22 |
In [42]:
df.iloc[1,1]
Out[42]:
5
For getting fast access to a scalar (equivalent to the prior method):
In [43]:
df.iat[1,1]
Out[43]:
5
Boolean Indexing
Using a single column's values to select data:
In [44]:
df[df.A > 0]
Out[44]:
A | B | C | D | |
---|---|---|---|---|
2013-01-02 | 4 | 5 | 6 | 7 |
2013-01-03 | 8 | 9 | 10 | 11 |
2013-01-04 | 12 | 13 | 14 | 15 |
2013-01-05 | 16 | 17 | 18 | 19 |
2013-01-06 | 20 | 21 | 22 | 23 |
In [45]:
df[df > 0]
Out[45]:
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | NaN | 1 | 2 | 3 |
2013-01-02 | 4.0 | 5 | 6 | 7 |
2013-01-03 | 8.0 | 9 | 10 | 11 |
2013-01-04 | 12.0 | 13 | 14 | 15 |
2013-01-05 | 16.0 | 17 | 18 | 19 |
2013-01-06 | 20.0 | 21 | 22 | 23 |
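Multiple boolean conditions can also be combined with & (and) and | (or); this extra sketch is not part of the cells above, and each condition needs its own parentheses because of operator precedence:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# Parenthesize each condition; plain `and`/`or` raise an error on Series.
subset = df[(df.A > 4) & (df.B < 21)]
```

This keeps only the rows where both conditions hold (2013-01-03 through 2013-01-05 for this data).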
Using the isin() method for filtering:
In [47]:
df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']
df2
Out[47]:
A | B | C | D | E | |
---|---|---|---|---|---|
2013-01-01 | 0 | 1 | 2 | 3 | one |
2013-01-02 | 4 | 5 | 6 | 7 | one |
2013-01-03 | 8 | 9 | 10 | 11 | two |
2013-01-04 | 12 | 13 | 14 | 15 | three |
2013-01-05 | 16 | 17 | 18 | 19 | four |
2013-01-06 | 20 | 21 | 22 | 23 | three |
In [48]:
df2[df2['E'].isin(['two','four'])]
Out[48]:
A | B | C | D | E | |
---|---|---|---|---|---|
2013-01-03 | 8 | 9 | 10 | 11 | two |
2013-01-05 | 16 | 17 | 18 | 19 | four |
Setting
Setting a new column automatically aligns the data by the indexes.
In [49]:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
s1
Out[49]:
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64
In [50]:
df['F'] = s1
df
Out[50]:
A | B | C | D | F | |
---|---|---|---|---|---|
2013-01-01 | 0 | 1 | 2 | 3 | NaN |
2013-01-02 | 4 | 5 | 6 | 7 | 1.0 |
2013-01-03 | 8 | 9 | 10 | 11 | 2.0 |
2013-01-04 | 12 | 13 | 14 | 15 | 3.0 |
2013-01-05 | 16 | 17 | 18 | 19 | 4.0 |
2013-01-06 | 20 | 21 | 22 | 23 | 5.0 |
Setting values by label and by position:
In [51]:
df.at[dates[0],'A'] = 0
df.iat[0,1] = 0
df
Out[51]:
A | B | C | D | F | |
---|---|---|---|---|---|
2013-01-01 | 0 | 0 | 2 | 3 | NaN |
2013-01-02 | 4 | 5 | 6 | 7 | 1.0 |
2013-01-03 | 8 | 9 | 10 | 11 | 2.0 |
2013-01-04 | 12 | 13 | 14 | 15 | 3.0 |
2013-01-05 | 16 | 17 | 18 | 19 | 4.0 |
2013-01-06 | 20 | 21 | 22 | 23 | 5.0 |
Setting by assigning with a NumPy array:
In [52]:
df.loc[:,'D'] = np.array([5] * len(df))
df
Out[52]:
A | B | C | D | F | |
---|---|---|---|---|---|
2013-01-01 | 0 | 0 | 2 | 5 | NaN |
2013-01-02 | 4 | 5 | 6 | 5 | 1.0 |
2013-01-03 | 8 | 9 | 10 | 5 | 2.0 |
2013-01-04 | 12 | 13 | 14 | 5 | 3.0 |
2013-01-05 | 16 | 17 | 18 | 5 | 4.0 |
2013-01-06 | 20 | 21 | 22 | 5 | 5.0 |
A where operation with setting:
In [53]:
df2 = df.copy()
df2[df2 > 0] = -df2
df2
Out[53]:
A | B | C | D | F | |
---|---|---|---|---|---|
2013-01-01 | 0 | 0 | -2 | -5 | NaN |
2013-01-02 | -4 | -5 | -6 | -5 | -1.0 |
2013-01-03 | -8 | -9 | -10 | -5 | -2.0 |
2013-01-04 | -12 | -13 | -14 | -5 | -3.0 |
2013-01-05 | -16 | -17 | -18 | -5 | -4.0 |
2013-01-06 | -20 | -21 | -22 | -5 | -5.0 |
Missing Data
pandas primarily uses the value np.nan to represent missing data; it is by default not included in computations. Reindexing allows you to change/add/delete the index on a specified axis, and returns a copy of the data.
In [54]:
df
Out[54]:
A | B | C | D | F | |
---|---|---|---|---|---|
2013-01-01 | 0 | 0 | 2 | 5 | NaN |
2013-01-02 | 4 | 5 | 6 | 5 | 1.0 |
2013-01-03 | 8 | 9 | 10 | 5 | 2.0 |
2013-01-04 | 12 | 13 | 14 | 5 | 3.0 |
2013-01-05 | 16 | 17 | 18 | 5 | 4.0 |
2013-01-06 | 20 | 21 | 22 | 5 | 5.0 |
In [55]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1
Out[55]:
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2013-01-01 | 0 | 0 | 2 | 5 | NaN | 1.0 |
2013-01-02 | 4 | 5 | 6 | 5 | 1.0 | 1.0 |
2013-01-03 | 8 | 9 | 10 | 5 | 2.0 | NaN |
2013-01-04 | 12 | 13 | 14 | 5 | 3.0 | NaN |
In [56]:
df1.dropna(how='any') #To drop any rows that have missing data.
Out[56]:
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2013-01-02 | 4 | 5 | 6 | 5 | 1.0 | 1.0 |
In [57]:
df1.fillna(value=9) #Filling missing data.
Out[57]:
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2013-01-01 | 0 | 0 | 2 | 5 | 9.0 | 1.0 |
2013-01-02 | 4 | 5 | 6 | 5 | 1.0 | 1.0 |
2013-01-03 | 8 | 9 | 10 | 5 | 2.0 | 9.0 |
2013-01-04 | 12 | 13 | 14 | 5 | 3.0 | 9.0 |
In [58]:
pd.isna(df1) #To get the boolean mask where values are nan.
Out[58]:
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2013-01-01 | False | False | False | False | True | False |
2013-01-02 | False | False | False | False | False | False |
2013-01-03 | False | False | False | False | False | True |
2013-01-04 | False | False | False | False | False | True |
Operations
Operations in general exclude missing data. Performing a descriptive statistic:
In [59]:
df.mean()
Out[59]:
A    10.000000
B    10.833333
C    12.000000
D     5.000000
F     3.000000
dtype: float64
In [60]:
df.mean(1) #Same operation on the other axis:
Out[60]:
2013-01-01     1.75
2013-01-02     4.20
2013-01-03     6.80
2013-01-04     9.40
2013-01-05    12.00
2013-01-06    14.60
Freq: D, dtype: float64
Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension.
In [64]:
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(-2)
s
Out[64]:
2013-01-01    5.0
2013-01-02    NaN
2013-01-03    6.0
2013-01-04    8.0
2013-01-05    NaN
2013-01-06    NaN
Freq: D, dtype: float64
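Such a shifted Series can be subtracted from df: the operation aligns on the row index and broadcasts across the columns. A small sketch rebuilding df as defined earlier (the subtraction itself is not shown in the cells above):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(-2)

# Subtract the Series from every column, aligning on the index (rows).
result = df.sub(s, axis='index')
```

Rows where s is NaN become all-NaN in the result.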
Apply
Applying functions to the data:
In [65]:
df
Out[65]:
A | B | C | D | F | |
---|---|---|---|---|---|
2013-01-01 | 0 | 0 | 2 | 5 | NaN |
2013-01-02 | 4 | 5 | 6 | 5 | 1.0 |
2013-01-03 | 8 | 9 | 10 | 5 | 2.0 |
2013-01-04 | 12 | 13 | 14 | 5 | 3.0 |
2013-01-05 | 16 | 17 | 18 | 5 | 4.0 |
2013-01-06 | 20 | 21 | 22 | 5 | 5.0 |
In [66]:
df.apply(np.cumsum)
Out[66]:
A | B | C | D | F | |
---|---|---|---|---|---|
2013-01-01 | 0 | 0 | 2 | 5 | NaN |
2013-01-02 | 4 | 5 | 8 | 10 | 1.0 |
2013-01-03 | 12 | 14 | 18 | 15 | 3.0 |
2013-01-04 | 24 | 27 | 32 | 20 | 6.0 |
2013-01-05 | 40 | 44 | 50 | 25 | 10.0 |
2013-01-06 | 60 | 65 | 72 | 30 | 15.0 |
In [67]:
df.apply(lambda x: x.max() - x.min())
Out[67]:
A    20.0
B    21.0
C    20.0
D     0.0
F     4.0
dtype: float64
Histogramming
In [68]:
s = pd.Series(np.random.randint(0, 7, size=10))
s
Out[68]:
0    2
1    3
2    5
3    0
4    1
5    1
6    0
7    5
8    2
9    6
dtype: int32
In [69]:
s.value_counts()
Out[69]:
5    2
2    2
1    2
0    2
6    1
3    1
dtype: int64
String Methods
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern matching in str generally uses regular expressions by default.
In [70]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
Out[70]:
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
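Beyond lower(), the str accessor supports regex-based matching. A minimal sketch using str.contains (the pattern and the na= choice here are illustrative, not part of the original cells):

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

# The pattern is a regular expression; na=False makes NaN entries count as no-match.
mask = s.str.contains('^[ab]', case=False, na=False)
matches = s[mask]
```

Here only the strings starting with "a" or "b" (case-insensitive) are kept.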
Merge
pandas provides various facilities for easily combining Series, DataFrame, and Panel objects with various kinds of set logic for the indexes and relational-algebra functionality in the case of join/merge-type operations.
Concatenating pandas objects together with concat():
In [76]:
df = pd.DataFrame(np.arange(10,50).reshape(10,4))
df
Out[76]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 10 | 11 | 12 | 13 |
1 | 14 | 15 | 16 | 17 |
2 | 18 | 19 | 20 | 21 |
3 | 22 | 23 | 24 | 25 |
4 | 26 | 27 | 28 | 29 |
5 | 30 | 31 | 32 | 33 |
6 | 34 | 35 | 36 | 37 |
7 | 38 | 39 | 40 | 41 |
8 | 42 | 43 | 44 | 45 |
9 | 46 | 47 | 48 | 49 |
In [77]:
# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pieces
Out[77]:
[    0   1   2   3
 0  10  11  12  13
 1  14  15  16  17
 2  18  19  20  21,
     0   1   2   3
 3  22  23  24  25
 4  26  27  28  29
 5  30  31  32  33
 6  34  35  36  37,
     0   1   2   3
 7  38  39  40  41
 8  42  43  44  45
 9  46  47  48  49]
In [79]:
len(pieces),type(pieces[0])
Out[79]:
(3, pandas.core.frame.DataFrame)
In [80]:
pd.concat(pieces)
Out[80]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 10 | 11 | 12 | 13 |
1 | 14 | 15 | 16 | 17 |
2 | 18 | 19 | 20 | 21 |
3 | 22 | 23 | 24 | 25 |
4 | 26 | 27 | 28 | 29 |
5 | 30 | 31 | 32 | 33 |
6 | 34 | 35 | 36 | 37 |
7 | 38 | 39 | 40 | 41 |
8 | 42 | 43 | 44 | 45 |
9 | 46 | 47 | 48 | 49 |
Join
SQL-style merges.
In [81]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
left
Out[81]:
key | lval | |
---|---|---|
0 | foo | 1 |
1 | foo | 2 |
In [82]:
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
right
Out[82]:
key | rval | |
---|---|---|
0 | foo | 4 |
1 | foo | 5 |
In [83]:
pd.merge(left, right, on='key')
Out[83]:
key | lval | rval | |
---|---|---|---|
0 | foo | 1 | 4 |
1 | foo | 1 | 5 |
2 | foo | 2 | 4 |
3 | foo | 2 | 5 |
In [84]:
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
left
Out[84]:
key | lval | |
---|---|---|
0 | foo | 1 |
1 | bar | 2 |
In [85]:
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
right
Out[85]:
key | rval | |
---|---|---|
0 | foo | 4 |
1 | bar | 5 |
In [86]:
pd.merge(left, right, on='key')
Out[86]:
key | lval | rval | |
---|---|---|---|
0 | foo | 1 | 4 |
1 | bar | 2 | 5 |
Append
Append rows to a DataFrame. See the Appending section of the docs.
In [87]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df
Out[87]:
A | B | C | D | |
---|---|---|---|---|
0 | 0.476360 | -0.142900 | 1.306915 | 0.886458 |
1 | 0.685598 | 0.673836 | 0.181896 | 0.446095 |
2 | -0.240541 | -1.670222 | -1.086695 | 0.024977 |
3 | 0.433192 | 1.776994 | -0.226416 | -0.376136 |
4 | 0.955323 | 1.655059 | 0.500006 | -1.382097 |
5 | 0.969208 | 1.964340 | -1.249788 | 1.448280 |
6 | -1.078558 | -0.989534 | -1.044546 | 0.529323 |
7 | 1.851316 | 1.235936 | -0.437019 | 0.546799 |
In [88]:
s = df.iloc[3]
df.append(s, ignore_index=True)
Out[88]:
A | B | C | D | |
---|---|---|---|---|
0 | 0.476360 | -0.142900 | 1.306915 | 0.886458 |
1 | 0.685598 | 0.673836 | 0.181896 | 0.446095 |
2 | -0.240541 | -1.670222 | -1.086695 | 0.024977 |
3 | 0.433192 | 1.776994 | -0.226416 | -0.376136 |
4 | 0.955323 | 1.655059 | 0.500006 | -1.382097 |
5 | 0.969208 | 1.964340 | -1.249788 | 1.448280 |
6 | -1.078558 | -0.989534 | -1.044546 | 0.529323 |
7 | 1.851316 | 1.235936 | -0.437019 | 0.546799 |
8 | 0.433192 | 1.776994 | -0.226416 | -0.376136 |
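Note that DataFrame.append was later deprecated and removed in pandas 2.0; pd.concat produces the same result. A small sketch reproducing the cell above with concat (the random values will of course differ):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
s = df.iloc[3]

# Turn the row Series into a one-row frame, then concatenate;
# ignore_index=True renumbers the index 0..8 as append did.
result = pd.concat([df, s.to_frame().T], ignore_index=True)
```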
Grouping
By "group by" we mean a process involving one or more of the following steps:
1. Splitting the data into groups based on some criteria
2. Applying a function to each group independently
3. Combining the results into a data structure
In [89]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
df
Out[89]:
A | B | C | D | |
---|---|---|---|---|
0 | foo | one | -1.026284 | 2.063183 |
1 | bar | one | 0.512460 | -0.363375 |
2 | foo | two | -0.484939 | 0.982710 |
3 | bar | three | 2.073508 | -0.393659 |
4 | foo | two | -0.197030 | 2.044099 |
5 | bar | two | -1.369176 | 2.172274 |
6 | foo | one | 0.445250 | 1.503858 |
7 | foo | three | -1.675851 | 0.207408 |
In [90]:
#Grouping and then applying the sum() function to the resulting groups.
df.groupby('A').sum()
Out[90]:
C | D | |
---|---|---|
A | ||
bar | 1.216791 | 1.415240 |
foo | -2.938853 | 6.801258 |
In [91]:
#Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function.
df.groupby(['A','B']).sum()
Out[91]:
C | D | ||
---|---|---|---|
A | B | ||
bar | one | 0.512460 | -0.363375 |
three | 2.073508 | -0.393659 | |
two | -1.369176 | 2.172274 | |
foo | one | -0.581034 | 3.567041 |
three | -1.675851 | 0.207408 | |
two | -0.681968 | 3.026809 |
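A groupby result need not be a single reduction; agg() can apply several at once. A small sketch (agg and the chosen statistics are illustrative additions, not part of the original cells):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})

# Apply several reductions to column C per group in one call.
stats = df.groupby('A')['C'].agg(['mean', 'count'])
```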
Reshaping
Stack
In [92]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz','foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two','one', 'two', 'one', 'two']]))
tuples
Out[92]:
[('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')]
In [93]:
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index
Out[93]:
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']], labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]], names=['first', 'second'])
In [95]:
df = pd.DataFrame(np.arange(10,26).reshape(8,2), index=index, columns=['A', 'B'])
df
Out[95]:
A | B | ||
---|---|---|---|
first | second | ||
bar | one | 10 | 11 |
two | 12 | 13 | |
baz | one | 14 | 15 |
two | 16 | 17 | |
foo | one | 18 | 19 |
two | 20 | 21 | |
qux | one | 22 | 23 |
two | 24 | 25 |
In [96]:
df2 = df[:4]
df2
Out[96]:
A | B | ||
---|---|---|---|
first | second | ||
bar | one | 10 | 11 |
two | 12 | 13 | |
baz | one | 14 | 15 |
two | 16 | 17 |
In [97]:
#The stack() method “compresses” a level in the DataFrame’s columns.
stacked = df2.stack()
stacked
Out[97]:
first  second
bar    one     A    10
               B    11
       two     A    12
               B    13
baz    one     A    14
               B    15
       two     A    16
               B    17
dtype: int32
With a "stacked" DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:
In [98]:
stacked.unstack()
Out[98]:
A | B | ||
---|---|---|---|
first | second | ||
bar | one | 10 | 11 |
two | 12 | 13 | |
baz | one | 14 | 15 |
two | 16 | 17 |
In [99]:
stacked.unstack(1)
Out[99]:
second | one | two | |
---|---|---|---|
first | |||
bar | A | 10 | 12 |
B | 11 | 13 | |
baz | A | 14 | 16 |
B | 15 | 17 |
In [100]:
stacked.unstack(0)
Out[100]:
first | bar | baz | |
---|---|---|---|
second | |||
one | A | 10 | 14 |
B | 11 | 15 | |
two | A | 12 | 16 |
B | 13 | 17 |
Pivot Tables
In [104]:
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
'B' : ['A', 'B', 'C'] * 4,
'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
'D' : np.arange(12),
'E' : np.arange(10,22)})
df
Out[104]:
A | B | C | D | E | |
---|---|---|---|---|---|
0 | one | A | foo | 0 | 10 |
1 | one | B | foo | 1 | 11 |
2 | two | C | foo | 2 | 12 |
3 | three | A | bar | 3 | 13 |
4 | one | B | bar | 4 | 14 |
5 | one | C | bar | 5 | 15 |
6 | two | A | foo | 6 | 16 |
7 | three | B | foo | 7 | 17 |
8 | one | C | foo | 8 | 18 |
9 | one | A | bar | 9 | 19 |
10 | two | B | bar | 10 | 20 |
11 | three | C | bar | 11 | 21 |
In [105]:
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[105]:
C | bar | foo | |
---|---|---|---|
A | B | ||
one | A | 9.0 | 0.0 |
B | 4.0 | 1.0 | |
C | 5.0 | 8.0 | |
three | A | 3.0 | NaN |
B | NaN | 7.0 | |
C | 11.0 | NaN | |
two | A | NaN | 6.0 |
B | 10.0 | NaN | |
C | NaN | 2.0 |
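pivot_table aggregates with the mean by default; aggfunc selects a different reduction and margins=True adds "All" totals. A sketch rebuilding the same df (these two parameters are illustrative additions beyond the cell above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.arange(12),
                   'E': np.arange(10, 22)})

# Sum D per (A, C) cell and add an 'All' row and column of totals.
table = pd.pivot_table(df, values='D', index=['A'], columns=['C'],
                       aggfunc='sum', margins=True)
```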
Time Series
pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (for example, converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications.
In [107]:
rng = pd.date_range('1/1/2018', periods=10, freq='S')
rng
Out[107]:
DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 00:00:01', '2018-01-01 00:00:02', '2018-01-01 00:00:03', '2018-01-01 00:00:04', '2018-01-01 00:00:05', '2018-01-01 00:00:06', '2018-01-01 00:00:07', '2018-01-01 00:00:08', '2018-01-01 00:00:09'], dtype='datetime64[ns]', freq='S')
In [108]:
ts = pd.Series(np.random.randint(0, 10, len(rng)), index=rng)
ts
Out[108]:
2018-01-01 00:00:00    2
2018-01-01 00:00:01    9
2018-01-01 00:00:02    4
2018-01-01 00:00:03    1
2018-01-01 00:00:04    5
2018-01-01 00:00:05    3
2018-01-01 00:00:06    7
2018-01-01 00:00:07    2
2018-01-01 00:00:08    5
2018-01-01 00:00:09    2
Freq: S, dtype: int32
In [112]:
ts.resample('5S').sum()
Out[112]:
2018-01-01 00:00:00    21
2018-01-01 00:00:05    19
Freq: 5S, dtype: int32
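Time-zone handling pairs naturally with time series. This sketch, not part of the run above, localizes a naive index to UTC and then converts it to another zone:

```python
import numpy as np
import pandas as pd

rng = pd.date_range('3/6/2018 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)

# tz_localize attaches a zone to naive timestamps; tz_convert shifts between zones.
ts_utc = ts.tz_localize('UTC')
ts_est = ts_utc.tz_convert('US/Eastern')
```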
Converting between time span representations:
In [114]:
rng = pd.date_range('1/1/2018', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
Out[114]:
2018-01-31   -0.488761
2018-02-28   -0.407915
2018-03-31    0.428891
2018-04-30   -0.306055
2018-05-31   -0.805451
Freq: M, dtype: float64
In [115]:
ps = ts.to_period()
ps
Out[115]:
2018-01   -0.488761
2018-02   -0.407915
2018-03    0.428891
2018-04   -0.306055
2018-05   -0.805451
Freq: M, dtype: float64
In [116]:
ps.to_timestamp()
Out[116]:
2018-01-01   -0.488761
2018-02-01   -0.407915
2018-03-01    0.428891
2018-04-01   -0.306055
2018-05-01   -0.805451
Freq: MS, dtype: float64
Categoricals
pandas can include categorical data in a DataFrame.
In [117]:
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
df
Out[117]:
id | raw_grade | |
---|---|---|
0 | 1 | a |
1 | 2 | b |
2 | 3 | b |
3 | 4 | a |
4 | 5 | a |
5 | 6 | e |
In [118]:
#Convert the raw grades to a categorical data type.
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
Out[118]:
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
In [119]:
#Rename the categories to more meaningful names (assigning to Series.cat.categories is inplace!).
df["grade"].cat.categories = ["very good", "good", "very bad"]
df
Out[119]:
id | raw_grade | grade | |
---|---|---|---|
0 | 1 | a | very good |
1 | 2 | b | good |
2 | 3 | b | good |
3 | 4 | a | very good |
4 | 5 | a | very good |
5 | 6 | e | very bad |
In [120]:
#Reorder the categories and simultaneously add the missing categories (methods under Series .cat return a new Series by default).
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]
Out[120]:
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
In [121]:
df.sort_values(by="grade")
Out[121]:
id | raw_grade | grade | |
---|---|---|---|
5 | 6 | e | very bad |
1 | 2 | b | good |
2 | 3 | b | good |
0 | 1 | a | very good |
3 | 4 | a | very good |
4 | 5 | a | very good |
In [122]:
#Grouping by a categorical column also shows empty categories.
df.groupby("grade").size()
Out[122]:
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
Plotting
In [126]:
%matplotlib inline
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
Out[126]:
<matplotlib.axes._subplots.AxesSubplot at 0xacef320>
In [127]:
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')
Out[127]:
<matplotlib.legend.Legend at 0xada7160>
<matplotlib.figure.Figure at 0x97fc240>
Getting Data In/Out
CSV
Writing to a CSV file.
In [ ]:
df.to_csv('foo.csv')
In [ ]:
#Reading from a csv file.
pd.read_csv('foo.csv')
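One detail worth knowing: to_csv writes the index as an unnamed first column, so reading the file back without index_col produces an extra "Unnamed: 0" column. A minimal round-trip sketch (using a small throwaway frame, not the df above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])
df.to_csv('foo.csv')

# index_col=0 restores the first column as the index instead of data.
df2 = pd.read_csv('foo.csv', index_col=0)
```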
Excel
Reading and writing to Excel.
In [ ]:
df.to_excel('foo.xlsx', sheet_name='Sheet1')
In [ ]:
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])