pandas Quick Tutorial

 


对象创建

Creating a Series by passing a list of values, letting pandas create a default integer index:

In [1]:

 

import pandas as pd
import numpy as np
s = pd.Series([1,3,5,np.nan,6,8])
s

Out[1]:

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
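
The default index runs from 0 to n-1; an explicit index can also be passed instead. A small sketch beyond the original session:

```python
import pandas as pd
import numpy as np

# Default integer index: 0 through 5
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# The same data with an explicit string index
s2 = pd.Series([1, 3, 5, np.nan, 6, 8], index=list('abcdef'))
```

Values can then be looked up by label, e.g. s2['a'].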

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

In [4]:

 

dates = pd.date_range('20130101', periods=6)
dates

Out[4]:

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [15]:

 

df = pd.DataFrame(np.arange(24).reshape(6,4), index=dates, columns=list('ABCD'))
df

Out[15]:

  A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23

In [16]:

 

df2 = pd.DataFrame({ 'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D' : np.array([3] * 4,dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : 'foo' })
df2

Out[16]:

  A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo

The columns of the resulting DataFrame have different dtypes.

In [17]:

 

df2.dtypes

Out[17]:

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
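
A column can be cast to another dtype with astype(), which returns a new object. A minimal sketch using two columns like the ones above:

```python
import pandas as pd
import numpy as np

# Two columns with different dtypes, as in df2 above
df2 = pd.DataFrame({'A': 1.,
                    'D': np.array([3] * 4, dtype='int32')})

# Cast column D from int32 to float64; the original column is unchanged
converted = df2['D'].astype('float64')
```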

Viewing Data

Here is how to view the top and bottom rows of the frame:

In [18]:

 

df.head()

Out[18]:

  A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19

In [19]:

 

df.tail(3)

Out[19]:

  A B C D
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23

Display the index, columns, and the underlying NumPy data:

In [20]:

 

df.index

Out[20]:

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [21]:

 

df.columns

Out[21]:

Index(['A', 'B', 'C', 'D'], dtype='object')

In [22]:

 

df.values

Out[22]:

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])

A quick statistic summary of your data:

In [23]:

 

df.describe()

Out[23]:

  A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 10.000000 11.000000 12.000000 13.000000
std 7.483315 7.483315 7.483315 7.483315
min 0.000000 1.000000 2.000000 3.000000
25% 5.000000 6.000000 7.000000 8.000000
50% 10.000000 11.000000 12.000000 13.000000
75% 15.000000 16.000000 17.000000 18.000000
max 20.000000 21.000000 22.000000 23.000000

Transposing your data:

In [24]:

 

df.T

Out[24]:

  2013-01-01 00:00:00 2013-01-02 00:00:00 2013-01-03 00:00:00 2013-01-04 00:00:00 2013-01-05 00:00:00 2013-01-06 00:00:00
A 0 4 8 12 16 20
B 1 5 9 13 17 21
C 2 6 10 14 18 22
D 3 7 11 15 19 23

Sorting by an axis:

In [25]:

 

df.sort_index(axis=1, ascending=False)

Out[25]:

  D C B A
2013-01-01 3 2 1 0
2013-01-02 7 6 5 4
2013-01-03 11 10 9 8
2013-01-04 15 14 13 12
2013-01-05 19 18 17 16
2013-01-06 23 22 21 20

Sorting by values:

In [27]:

 

df.sort_values(by='B',ascending=False)

Out[27]:

  A B C D
2013-01-06 20 21 22 23
2013-01-05 16 17 18 19
2013-01-04 12 13 14 15
2013-01-03 8 9 10 11
2013-01-02 4 5 6 7
2013-01-01 0 1 2 3

Selection

Selecting a single column, which yields a Series, equivalent to df.A:

In [28]:

 

df['A']

Out[28]:

2013-01-01     0
2013-01-02     4
2013-01-03     8
2013-01-04    12
2013-01-05    16
2013-01-06    20
Freq: D, Name: A, dtype: int32
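
The equivalence between bracket and attribute access can be checked directly; a small sketch rebuilding the same frame:

```python
import pandas as pd
import numpy as np

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# Column selection by key and by attribute yield the same Series
by_key = df['A']
by_attr = df.A
```

Attribute access only works for column names that are valid Python identifiers and do not clash with DataFrame methods; bracket access always works.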

In [29]:

 

df[0:3]

Out[29]:

  A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11

In [30]:

 

df['20130102':'20130104']

Out[30]:

  A B C D
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15

Selection by Label

Getting a cross section using a label:

In [31]:

 

df.loc[dates[0]]

Out[31]:

A    0
B    1
C    2
D    3
Name: 2013-01-01 00:00:00, dtype: int32

In [32]:

 

df.loc[:,['A','B']]

Out[32]:

  A B
2013-01-01 0 1
2013-01-02 4 5
2013-01-03 8 9
2013-01-04 12 13
2013-01-05 16 17
2013-01-06 20 21

In [33]:

 

df.loc['20130102':'20130104',['A','B']]

Out[33]:

  A B
2013-01-02 4 5
2013-01-03 8 9
2013-01-04 12 13

In [34]:

 

df.loc['20130102',['A','B']]

Out[34]:

A    4
B    5
Name: 2013-01-02 00:00:00, dtype: int32

In [35]:

 

df.loc[dates[0],'A']

Out[35]:

0

For getting fast access to a scalar (equivalent to the prior method):

In [36]:

 

df.at[dates[0],'A']

Out[36]:

0

Selection by Position

Select via the position of the passed integers:

In [37]:

 

df.iloc[3]

Out[37]:

A    12
B    13
C    14
D    15
Name: 2013-01-04 00:00:00, dtype: int32

In [38]:

 

df.iloc[3:5,0:2]

Out[38]:

  A B
2013-01-04 12 13
2013-01-05 16 17

In [39]:

 

df.iloc[[1,2,4],[0,2]]

Out[39]:

  A C
2013-01-02 4 6
2013-01-03 8 10
2013-01-05 16 18

In [40]:

 

df.iloc[1:3,:]

Out[40]:

  A B C D
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11

In [41]:

 

df.iloc[:,1:3]

Out[41]:

  B C
2013-01-01 1 2
2013-01-02 5 6
2013-01-03 9 10
2013-01-04 13 14
2013-01-05 17 18
2013-01-06 21 22

In [42]:

 

df.iloc[1,1]

Out[42]:

5

For getting fast access to a scalar (equivalent to the prior method):

In [43]:

 

df.iat[1,1]

Out[43]:

5

Boolean Indexing

Using a single column's values to select data.

In [44]:

 

df[df.A > 0]

Out[44]:

  A B C D
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23

In [45]:

 

df[df > 0]

Out[45]:

  A B C D
2013-01-01 NaN 1 2 3
2013-01-02 4.0 5 6 7
2013-01-03 8.0 9 10 11
2013-01-04 12.0 13 14 15
2013-01-05 16.0 17 18 19
2013-01-06 20.0 21 22 23

Using the isin() method for filtering:

In [47]:

 

df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']
df2

Out[47]:

  A B C D E
2013-01-01 0 1 2 3 one
2013-01-02 4 5 6 7 one
2013-01-03 8 9 10 11 two
2013-01-04 12 13 14 15 three
2013-01-05 16 17 18 19 four
2013-01-06 20 21 22 23 three

In [48]:

 

df2[df2['E'].isin(['two','four'])]

Out[48]:

  A B C D E
2013-01-03 8 9 10 11 two
2013-01-05 16 17 18 19 four

Setting

Setting a new column automatically aligns the data by the indexes.

In [49]:

 

s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
s1

Out[49]:

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In [50]:

 

df['F'] = s1
df

Out[50]:

  A B C D F
2013-01-01 0 1 2 3 NaN
2013-01-02 4 5 6 7 1.0
2013-01-03 8 9 10 11 2.0
2013-01-04 12 13 14 15 3.0
2013-01-05 16 17 18 19 4.0
2013-01-06 20 21 22 23 5.0

In [51]:

 

df.at[dates[0],'A'] = 0
df.iat[0,1] = 0
df

Out[51]:

  A B C D F
2013-01-01 0 0 2 3 NaN
2013-01-02 4 5 6 7 1.0
2013-01-03 8 9 10 11 2.0
2013-01-04 12 13 14 15 3.0
2013-01-05 16 17 18 19 4.0
2013-01-06 20 21 22 23 5.0

In [52]:

 

df.loc[:,'D'] = np.array([5] * len(df))
df

Out[52]:

  A B C D F
2013-01-01 0 0 2 5 NaN
2013-01-02 4 5 6 5 1.0
2013-01-03 8 9 10 5 2.0
2013-01-04 12 13 14 5 3.0
2013-01-05 16 17 18 5 4.0
2013-01-06 20 21 22 5 5.0

In [53]:

 

df2 = df.copy()
df2[df2 > 0] = -df2
df2

Out[53]:

  A B C D F
2013-01-01 0 0 -2 -5 NaN
2013-01-02 -4 -5 -6 -5 -1.0
2013-01-03 -8 -9 -10 -5 -2.0
2013-01-04 -12 -13 -14 -5 -3.0
2013-01-05 -16 -17 -18 -5 -4.0
2013-01-06 -20 -21 -22 -5 -5.0

Missing Data

pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. Reindexing allows you to change/add/delete the index on a specified axis; this returns a copy of the data.

In [54]:

 

df

Out[54]:

  A B C D F
2013-01-01 0 0 2 5 NaN
2013-01-02 4 5 6 5 1.0
2013-01-03 8 9 10 5 2.0
2013-01-04 12 13 14 5 3.0
2013-01-05 16 17 18 5 4.0
2013-01-06 20 21 22 5 5.0

In [55]:

 

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1

Out[55]:

  A B C D F E
2013-01-01 0 0 2 5 NaN 1.0
2013-01-02 4 5 6 5 1.0 1.0
2013-01-03 8 9 10 5 2.0 NaN
2013-01-04 12 13 14 5 3.0 NaN

In [56]:

 

df1.dropna(how='any')             #To drop any rows that have missing data.

Out[56]:

  A B C D F E
2013-01-02 4 5 6 5 1.0 1.0

In [57]:

 

df1.fillna(value=9)               #Filling missing data.

Out[57]:

  A B C D F E
2013-01-01 0 0 2 5 9.0 1.0
2013-01-02 4 5 6 5 1.0 1.0
2013-01-03 8 9 10 5 2.0 9.0
2013-01-04 12 13 14 5 3.0 9.0

In [58]:

 

pd.isna(df1)                     #To get the boolean mask where values are nan.

Out[58]:

  A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True

Operations

Operations in general exclude missing data. Performing a descriptive statistic:

In [59]:

 

df.mean()

Out[59]:

A    10.000000
B    10.833333
C    12.000000
D     5.000000
F     3.000000
dtype: float64

In [60]:

 

df.mean(1)                    #Same operation on the other axis:

Out[60]:

2013-01-01     1.75
2013-01-02     4.20
2013-01-03     6.80
2013-01-04     9.40
2013-01-05    12.00
2013-01-06    14.60
Freq: D, dtype: float64

Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension.

In [64]:

 

s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(-2)
s

Out[64]:

2013-01-01    5.0
2013-01-02    NaN
2013-01-03    6.0
2013-01-04    8.0
2013-01-05    NaN
2013-01-06    NaN
Freq: D, dtype: float64
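
The shifted Series can then be combined with the frame; a sketch of the alignment using sub() (dates at which s is NaN yield NaN rows):

```python
import pandas as pd
import numpy as np

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(-2)

# Subtract s row-wise: each row of df minus the value of s at that date.
# Rows whose date maps to NaN in s become all-NaN.
result = df.sub(s, axis='index')
```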

Apply

Applying functions to the data:

In [65]:

 

df

Out[65]:

  A B C D F
2013-01-01 0 0 2 5 NaN
2013-01-02 4 5 6 5 1.0
2013-01-03 8 9 10 5 2.0
2013-01-04 12 13 14 5 3.0
2013-01-05 16 17 18 5 4.0
2013-01-06 20 21 22 5 5.0

In [66]:

 

df.apply(np.cumsum)

Out[66]:

  A B C D F
2013-01-01 0 0 2 5 NaN
2013-01-02 4 5 8 10 1.0
2013-01-03 12 14 18 15 3.0
2013-01-04 24 27 32 20 6.0
2013-01-05 40 44 50 25 10.0
2013-01-06 60 65 72 30 15.0

In [67]:

 

df.apply(lambda x: x.max() - x.min())

Out[67]:

A    20.0
B    21.0
C    20.0
D     0.0
F     4.0
dtype: float64

Histogramming

In [68]:

 

s = pd.Series(np.random.randint(0, 7, size=10))
s

Out[68]:

0    2
1    3
2    5
3    0
4    1
5    1
6    0
7    5
8    2
9    6
dtype: int32

In [69]:

 

s.value_counts()

Out[69]:

5    2
2    2
1    2
0    2
6    1
3    1
dtype: int64

String Methods

Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the snippet below. Note that pattern matching in str generally uses regular expressions by default.

In [70]:

 

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()

Out[70]:

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
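
Because patterns are regular expressions by default, literal matching must be requested explicitly; a small sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

# By default the pattern is a regex: '^a' matches a leading 'a'
starts_with_a = s.str.lower().str.contains('^a')

# regex=False treats the pattern as a literal string instead
has_dot = s.str.contains('.', regex=False)
```

Missing values (the np.nan element) propagate through the string methods as NaN.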

Merge

pandas provides various facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational-algebra functionality in the case of join/merge-type operations.
Concatenating pandas objects together with concat():

In [76]:

 

df = pd.DataFrame(np.arange(10,50).reshape(10,4))
df

Out[76]:

  0 1 2 3
0 10 11 12 13
1 14 15 16 17
2 18 19 20 21
3 22 23 24 25
4 26 27 28 29
5 30 31 32 33
6 34 35 36 37
7 38 39 40 41
8 42 43 44 45
9 46 47 48 49

In [77]:

 

# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pieces

Out[77]:

[    0   1   2   3
 0  10  11  12  13
 1  14  15  16  17
 2  18  19  20  21,     0   1   2   3
 3  22  23  24  25
 4  26  27  28  29
 5  30  31  32  33
 6  34  35  36  37,     0   1   2   3
 7  38  39  40  41
 8  42  43  44  45
 9  46  47  48  49]

In [79]:

 

len(pieces),type(pieces[0])

Out[79]:

(3, pandas.core.frame.DataFrame)

In [80]:

 

pd.concat(pieces)

Out[80]:

  0 1 2 3
0 10 11 12 13
1 14 15 16 17
2 18 19 20 21
3 22 23 24 25
4 26 27 28 29
5 30 31 32 33
6 34 35 36 37
7 38 39 40 41
8 42 43 44 45
9 46 47 48 49

Join

SQL style merges.

In [81]:

 

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
left

Out[81]:

  key lval
0 foo 1
1 foo 2

In [82]:

 

right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
right

Out[82]:

  key rval
0 foo 4
1 foo 5

In [83]:

 

pd.merge(left, right, on='key')

Out[83]:

  key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5

In [84]:

 

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
left

Out[84]:

  key lval
0 foo 1
1 bar 2

In [85]:

 

right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
right

Out[85]:

  key rval
0 foo 4
1 bar 5

In [86]:

 

pd.merge(left, right, on='key')

Out[86]:

  key lval rval
0 foo 1 4
1 bar 2 5
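
merge() also takes a how parameter controlling the join type; a sketch with only partially overlapping keys (the 'baz' key is invented for illustration):

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

# Inner join (the default) keeps only keys present on both sides
inner = pd.merge(left, right, on='key')

# An outer join keeps all keys, filling the missing side with NaN
outer = pd.merge(left, right, on='key', how='outer')
```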

Append

Appending rows to a DataFrame. See the Appending section.

In [87]:

 

df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df

Out[87]:

  A B C D
0 0.476360 -0.142900 1.306915 0.886458
1 0.685598 0.673836 0.181896 0.446095
2 -0.240541 -1.670222 -1.086695 0.024977
3 0.433192 1.776994 -0.226416 -0.376136
4 0.955323 1.655059 0.500006 -1.382097
5 0.969208 1.964340 -1.249788 1.448280
6 -1.078558 -0.989534 -1.044546 0.529323
7 1.851316 1.235936 -0.437019 0.546799

In [88]:

 

s = df.iloc[3]
df.append(s, ignore_index=True)

Out[88]:

  A B C D
0 0.476360 -0.142900 1.306915 0.886458
1 0.685598 0.673836 0.181896 0.446095
2 -0.240541 -1.670222 -1.086695 0.024977
3 0.433192 1.776994 -0.226416 -0.376136
4 0.955323 1.655059 0.500006 -1.382097
5 0.969208 1.964340 -1.249788 1.448280
6 -1.078558 -0.989534 -1.044546 0.529323
7 1.851316 1.235936 -0.437019 0.546799
8 0.433192 1.776994 -0.226416 -0.376136
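
Note: DataFrame.append() was deprecated and removed in pandas 2.0. Assuming a newer pandas, the same row append can be written with concat() (a sketch):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
s = df.iloc[3]

# Equivalent to the removed df.append(s, ignore_index=True):
# wrap the row Series as a one-row frame, then concatenate
appended = pd.concat([df, s.to_frame().T], ignore_index=True)
```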

Grouping

By "group by" we are referring to a process involving one or more of the following steps:
1. Splitting the data into groups based on some criteria
2. Applying a function to each group independently
3. Combining the results into a data structure

In [89]:

 

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df

Out[89]:

  A B C D
0 foo one -1.026284 2.063183
1 bar one 0.512460 -0.363375
2 foo two -0.484939 0.982710
3 bar three 2.073508 -0.393659
4 foo two -0.197030 2.044099
5 bar two -1.369176 2.172274
6 foo one 0.445250 1.503858
7 foo three -1.675851 0.207408

In [90]:

 

#Grouping and then applying the sum() function to the resulting groups.
df.groupby('A').sum()

Out[90]:

  C D
A    
bar 1.216791 1.415240
foo -2.938853 6.801258

In [91]:

 

#Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function.
df.groupby(['A','B']).sum()

Out[91]:

                  C         D
A   B
bar one    0.512460 -0.363375
    three  2.073508 -0.393659
    two   -1.369176  2.172274
foo one   -0.581034  3.567041
    three -1.675851  0.207408
    two   -0.681968  3.026809
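
groupby() is not limited to sum(); several aggregations can be applied at once with agg(). A small sketch on a reduced frame:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# Apply several aggregations to one column in a single pass
stats = df.groupby('A')['C'].agg(['sum', 'mean'])
```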

Reshaping

In [92]:

 

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz','foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two','one', 'two', 'one', 'two']]))
tuples

Out[92]:

[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [93]:

 

index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index

Out[93]:

MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])

In [95]:

 

df = pd.DataFrame(np.arange(10,26).reshape(8,2), index=index, columns=['A', 'B'])
df

Out[95]:

               A   B
first second
bar   one     10  11
      two     12  13
baz   one     14  15
      two     16  17
foo   one     18  19
      two     20  21
qux   one     22  23
      two     24  25

In [96]:

 

df2 = df[:4]
df2

Out[96]:

               A   B
first second
bar   one     10  11
      two     12  13
baz   one     14  15
      two     16  17

In [97]:

 

#The stack() method “compresses” a level in the DataFrame’s columns.
stacked = df2.stack()
stacked

Out[97]:

first  second   
bar    one     A    10
               B    11
       two     A    12
               B    13
baz    one     A    14
               B    15
       two     A    16
               B    17
dtype: int32

With a "stacked" DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:

In [98]:

 

stacked.unstack()

Out[98]:

               A   B
first second
bar   one     10  11
      two     12  13
baz   one     14  15
      two     16  17

In [99]:

 

stacked.unstack(1)

Out[99]:

second    one  two
first
bar    A   10   12
       B   11   13
baz    A   14   16
       B   15   17

In [100]:

 

stacked.unstack(0)

Out[100]:

first     bar  baz
second
one    A   10   14
       B   11   15
two    A   12   16
       B   13   17

Pivot Tables

In [104]:

 

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.arange(12),
                   'E' : np.arange(10,22)})
df

Out[104]:

  A B C D E
0 one A foo 0 10
1 one B foo 1 11
2 two C foo 2 12
3 three A bar 3 13
4 one B bar 4 14
5 one C bar 5 15
6 two A foo 6 16
7 three B foo 7 17
8 one C foo 8 18
9 one A bar 9 19
10 two B bar 10 20
11 three C bar 11 21

In [105]:

 

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

Out[105]:

C           bar   foo
A     B
one   A     9.0   0.0
      B     4.0   1.0
      C     5.0   8.0
three A     3.0   NaN
      B     NaN   7.0
      C    11.0   NaN
two   A     NaN   6.0
      B    10.0   NaN
      C     NaN   2.0
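
pivot_table() also accepts an aggregation function and a fill value for empty cells; a sketch on the same data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.arange(12)})

# Sum instead of the default mean, and fill empty cells with 0 instead of NaN
table = pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'],
                       aggfunc='sum', fill_value=0)
```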

Time Series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications.

In [107]:

 

rng = pd.date_range('1/1/2018', periods=10, freq='S')
rng

Out[107]:

DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 00:00:01',
               '2018-01-01 00:00:02', '2018-01-01 00:00:03',
               '2018-01-01 00:00:04', '2018-01-01 00:00:05',
               '2018-01-01 00:00:06', '2018-01-01 00:00:07',
               '2018-01-01 00:00:08', '2018-01-01 00:00:09'],
              dtype='datetime64[ns]', freq='S')

In [108]:

 

ts = pd.Series(np.random.randint(0, 10, len(rng)), index=rng)
ts

Out[108]:

2018-01-01 00:00:00    2
2018-01-01 00:00:01    9
2018-01-01 00:00:02    4
2018-01-01 00:00:03    1
2018-01-01 00:00:04    5
2018-01-01 00:00:05    3
2018-01-01 00:00:06    7
2018-01-01 00:00:07    2
2018-01-01 00:00:08    5
2018-01-01 00:00:09    2
Freq: S, dtype: int32

In [112]:

 

ts.resample('5S').sum()

Out[112]:

2018-01-01 00:00:00    21
2018-01-01 00:00:05    19
Freq: 5S, dtype: int32

Converting between time span representations:

In [114]:

 

rng = pd.date_range('1/1/2018', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

Out[114]:

2018-01-31   -0.488761
2018-02-28   -0.407915
2018-03-31    0.428891
2018-04-30   -0.306055
2018-05-31   -0.805451
Freq: M, dtype: float64

In [115]:

 

ps = ts.to_period()
ps

Out[115]:

2018-01   -0.488761
2018-02   -0.407915
2018-03    0.428891
2018-04   -0.306055
2018-05   -0.805451
Freq: M, dtype: float64

In [116]:

 

ps.to_timestamp()

Out[116]:

2018-01-01   -0.488761
2018-02-01   -0.407915
2018-03-01    0.428891
2018-04-01   -0.306055
2018-05-01   -0.805451
Freq: MS, dtype: float64

Categoricals

pandas can include categorical data in a DataFrame.

In [117]:

 

df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
df

Out[117]:

  id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 a
5 6 e

In [118]:

 

#Convert the raw grades to a categorical data type.
df["grade"] = df["raw_grade"].astype("category")
df["grade"]

Out[118]:

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

In [119]:

 

#Rename the categories to more meaningful names (assigning to Series.cat.categories is inplace!).
df["grade"].cat.categories = ["very good", "good", "very bad"]
df

Out[119]:

  id raw_grade grade
0 1 a very good
1 2 b good
2 3 b good
3 4 a very good
4 5 a very good
5 6 e very bad

In [120]:

 

#Reorder the categories and simultaneously add the missing categories (methods under Series .cat return a new Series by default).
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]

Out[120]:

0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

In [121]:

 

df.sort_values(by="grade")

Out[121]:

  id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good

In [122]:

 

#Grouping by a categorical column also shows empty categories.
df.groupby("grade").size()

Out[122]:

grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

Plotting

In [126]:

 

%matplotlib inline
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()

Out[126]:

<matplotlib.axes._subplots.AxesSubplot at 0xacef320>

[figure: line plot of the cumulative random time series]

In [127]:

 

import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')

Out[127]:

<matplotlib.legend.Legend at 0xada7160>
<matplotlib.figure.Figure at 0x97fc240>

[figure: line plot of the four cumulative columns, with a legend]

Getting Data In/Out

Writing to a CSV file.

In [ ]:

 

df.to_csv('foo.csv')

In [ ]:

 

#Reading from a csv file.
pd.read_csv('foo.csv')
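
Reading the file back brings the old index in as an unnamed column unless index_col is given; a round-trip sketch using an in-memory buffer instead of a file:

```python
import io
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(8).reshape(4, 2), columns=['A', 'B'])

# Write to an in-memory text buffer (stand-in for foo.csv), then rewind
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)

# index_col=0 restores the written index instead of adding an 'Unnamed: 0' column
restored = pd.read_csv(buf, index_col=0)
```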

Excel

Reading and writing to Excel.

In [ ]:

 

df.to_excel('foo.xlsx', sheet_name='Sheet1')

In [ ]:

 

pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])