import pandas as pd
import numpy as np
7.1 Handling Missing Values
- For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data. We call this a sentinel value, which can easily be detected:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
string_data.isnull()
0 False
1 False
2 True
3 False
dtype: bool
string_data[0] = None
string_data.isnull()
0 True
1 False
2 True
3 False
dtype: bool
- Table 7-1 lists some functions related to missing-data handling.
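As a minimal sketch of the four functions that table covers (isnull, notnull, dropna, fillna; the sample Series is my own):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

# isnull / notnull return boolean masks marking (non-)missing values
print(s.isnull().sum())      # 2 missing values
print(s.notnull().sum())     # 3 non-missing values

# dropna filters out missing values; fillna replaces them
print(s.dropna().tolist())   # [1.0, 3.5, 7.0]
print(s.fillna(0).tolist())  # [1.0, 0.0, 3.5, 0.0, 7.0]
```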

Filtering Out Missing Data: dropna()
string_data.dropna()
1 artichoke
3 avocado
dtype: object
string_data[string_data.notnull()]
1 artichoke
3 avocado
dtype: object
- With DataFrame objects, things are a bit more complex. You may want to drop rows or columns that are all NA, or only those containing any NAs. dropna by default drops any row containing a missing value:
from numpy import nan as NA
data = pd.DataFrame([[1,6.5,3],[1,NA,NA],[NA,NA,NA],[NA,6.5,3]])
data
|   | 0   | 1   | 2   |
|---|-----|-----|-----|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |
data.dropna()
- To drop only rows that are all NA, pass how='all':
data.dropna(how='all')
|   | 0   | 1   | 2   |
|---|-----|-----|-----|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |
- To drop columns that are all NA instead, also pass axis=1:
data[4] = NA
data
|   | 0   | 1   | 2   | 4   |
|---|-----|-----|-----|-----|
| 0 | 1.0 | 6.5 | 3.0 | NaN |
| 1 | 1.0 | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 | NaN |
data.dropna(how='all',axis=1)
|   | 0   | 1   | 2   |
|---|-----|-----|-----|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |
data
|   | 0   | 1   | 2   | 4   |
|---|-----|-----|-----|-----|
| 0 | 1.0 | 6.5 | 3.0 | NaN |
| 1 | 1.0 | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 | NaN |
- Another way to filter DataFrame rows tends to concern time series data. Suppose you want to keep only rows containing a certain number of observations; you can do this with the thresh argument:
df = pd.DataFrame(np.random.randn(7,3))
df.iloc[:4,1] = NA
df.iloc[:2,2] = NA
df
|   | 0         | 1         | 2         |
|---|-----------|-----------|-----------|
| 0 | 1.736760  | NaN       | NaN       |
| 1 | -2.129930 | NaN       | NaN       |
| 2 | 1.437452  | NaN       | -1.604753 |
| 3 | 0.826910  | NaN       | 2.472819  |
| 4 | 1.298570  | -0.334513 | 1.318870  |
| 5 | -0.640984 | -1.344805 | 0.497007  |
| 6 | -0.878085 | -1.819722 | 0.341758  |
df.dropna()
|   | 0         | 1         | 2         |
|---|-----------|-----------|-----------|
| 4 | 1.298570  | -0.334513 | 1.318870  |
| 5 | -0.640984 | -1.344805 | 0.497007  |
| 6 | -0.878085 | -1.819722 | 0.341758  |
df.dropna(thresh=2)
|   | 0         | 1         | 2         |
|---|-----------|-----------|-----------|
| 2 | 1.437452  | NaN       | -1.604753 |
| 3 | 0.826910  | NaN       | 2.472819  |
| 4 | 1.298570  | -0.334513 | 1.318870  |
| 5 | -0.640984 | -1.344805 | 0.497007  |
| 6 | -0.878085 | -1.819722 | 0.341758  |
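The thresh argument keeps each row that has at least that many non-NA values. As a sketch (the small DataFrame is my own), thresh=2 is equivalent to filtering on a per-row non-NA count:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([[1.0, np.nan, np.nan],
                   [2.0, np.nan, 3.0],
                   [4.0, 5.0, 6.0]])

# thresh=2: keep rows containing at least 2 non-NA values
kept = df.dropna(thresh=2)

# equivalent boolean filter: count non-NA values per row
manual = df[df.notnull().sum(axis=1) >= 2]

print(kept.equals(manual))  # True
```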
Filling In Missing Data: fillna()
df.fillna(0)
|   | 0         | 1         | 2         |
|---|-----------|-----------|-----------|
| 0 | 1.736760  | 0.000000  | 0.000000  |
| 1 | -2.129930 | 0.000000  | 0.000000  |
| 2 | 1.437452  | 0.000000  | -1.604753 |
| 3 | 0.826910  | 0.000000  | 2.472819  |
| 4 | 1.298570  | -0.334513 | 1.318870  |
| 5 | -0.640984 | -1.344805 | 0.497007  |
| 6 | -0.878085 | -1.819722 | 0.341758  |
df.fillna({1:0.05, 2:0.01})
|   | 0         | 1         | 2         |
|---|-----------|-----------|-----------|
| 0 | 1.736760  | 0.050000  | 0.010000  |
| 1 | -2.129930 | 0.050000  | 0.010000  |
| 2 | 1.437452  | 0.050000  | -1.604753 |
| 3 | 0.826910  | 0.050000  | 2.472819  |
| 4 | 1.298570  | -0.334513 | 1.318870  |
| 5 | -0.640984 | -1.344805 | 0.497007  |
| 6 | -0.878085 | -1.819722 | 0.341758  |
- All of the methods above return a new object, but you can pass inplace=True to modify the existing object in place:
df.fillna(0,inplace=True)
df
|   | 0         | 1         | 2         |
|---|-----------|-----------|-----------|
| 0 | 1.736760  | 0.000000  | 0.000000  |
| 1 | -2.129930 | 0.000000  | 0.000000  |
| 2 | 1.437452  | 0.000000  | -1.604753 |
| 3 | 0.826910  | 0.000000  | 2.472819  |
| 4 | 1.298570  | -0.334513 | 1.318870  |
| 5 | -0.640984 | -1.344805 | 0.497007  |
| 6 | -0.878085 | -1.819722 | 0.341758  |
df = pd.DataFrame(np.random.randn(6,3))
df.iloc[2:,1] = NA
df.iloc[4:,2] = NA
df
|   | 0         | 1         | 2         |
|---|-----------|-----------|-----------|
| 0 | 0.773045  | -1.022993 | 0.657842  |
| 1 | 0.541367  | -0.003357 | -2.031984 |
| 2 | -0.041549 | NaN       | -0.678426 |
| 3 | 0.586369  | NaN       | -1.340845 |
| 4 | -0.367919 | NaN       | NaN       |
| 5 | -0.917897 | NaN       | NaN       |
df.fillna(method='ffill')
|   | 0         | 1         | 2         |
|---|-----------|-----------|-----------|
| 0 | 0.773045  | -1.022993 | 0.657842  |
| 1 | 0.541367  | -0.003357 | -2.031984 |
| 2 | -0.041549 | -0.003357 | -0.678426 |
| 3 | 0.586369  | -0.003357 | -1.340845 |
| 4 | -0.367919 | -0.003357 | -1.340845 |
| 5 | -0.917897 | -0.003357 | -1.340845 |
df.fillna(method='ffill',limit=1)
|   | 0         | 1         | 2         |
|---|-----------|-----------|-----------|
| 0 | 0.773045  | -1.022993 | 0.657842  |
| 1 | 0.541367  | -0.003357 | -2.031984 |
| 2 | -0.041549 | -0.003357 | -0.678426 |
| 3 | 0.586369  | NaN       | -1.340845 |
| 4 | -0.367919 | NaN       | -1.340845 |
| 5 | -0.917897 | NaN       | NaN       |
data = pd.Series([1,NA,3.5,7])
data
0 1.0
1 NaN
2 3.5
3 7.0
dtype: float64
data.fillna(data.mean())
0 1.000000
1 3.833333
2 3.500000
3 7.000000
dtype: float64
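The same idea extends to a DataFrame: passing df.mean() to fillna fills each column with that column's mean, since fillna aligns the Series of means on the column labels (a small sketch; the sample frame is my own):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 10.0, 20.0]})

# df.mean() is a Series indexed by column name; fillna aligns on columns
filled = df.fillna(df.mean())
print(filled['a'].tolist())  # [1.0, 2.0, 3.0]
print(filled['b'].tolist())  # [15.0, 10.0, 20.0]
```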
- fillna has several commonly used arguments.
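The reference table is not reproduced here, but the arguments most often used (value, method, limit, inplace) can be sketched on a small Series of my own:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# value: the scalar (or dict) to substitute for NA
print(s.fillna(value=0).tolist())                  # [1.0, 0.0, 0.0, 4.0]

# method: interpolation style, e.g. forward fill
print(s.fillna(method='ffill').tolist())           # [1.0, 1.0, 1.0, 4.0]

# limit: maximum number of consecutive fills
print(s.fillna(method='ffill', limit=1).tolist())  # [1.0, 1.0, nan, 4.0]

# inplace: modify the object instead of returning a copy
s.fillna(0, inplace=True)
print(s.tolist())                                  # [1.0, 0.0, 0.0, 4.0]
```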

7.2 Data Transformation
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'], 'k2': [1, 1, 2, 3, 3, 4, 4]})
data
|   | k1  | k2 |
|---|-----|----|
| 0 | one | 1  |
| 1 | two | 1  |
| 2 | one | 2  |
| 3 | two | 3  |
| 4 | one | 3  |
| 5 | two | 4  |
| 6 | two | 4  |
- The duplicated method returns a boolean Series indicating whether each row is a duplicate (i.e., identical to a row seen earlier):
data.duplicated()
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
data.drop_duplicates()
|   | k1  | k2 |
|---|-----|----|
| 0 | one | 1  |
| 1 | two | 1  |
| 2 | one | 2  |
| 3 | two | 3  |
| 4 | one | 3  |
| 5 | two | 4  |
- Both of these methods consider all of the columns by default; alternatively, you can specify a subset of columns to detect duplicates on. Suppose we had an additional column of values and wanted to filter duplicates based only on the k1 column:
data['v'] = range(7)
data
|   | k1  | k2 | v |
|---|-----|----|---|
| 0 | one | 1  | 0 |
| 1 | two | 1  | 1 |
| 2 | one | 2  | 2 |
| 3 | two | 3  | 3 |
| 4 | one | 3  | 4 |
| 5 | two | 4  | 5 |
| 6 | two | 4  | 6 |
data.drop_duplicates(['k1'])
|   | k1  | k2 | v |
|---|-----|----|---|
| 0 | one | 1  | 0 |
| 1 | two | 1  | 1 |
- duplicated and drop_duplicates by default keep the first observed value combination. Passing keep='last' keeps the last one instead:
data.drop_duplicates(['k1','k2'],keep='last')
|   | k1  | k2 | v |
|---|-----|----|---|
| 0 | one | 1  | 0 |
| 1 | two | 1  | 1 |
| 2 | one | 2  | 2 |
| 3 | two | 3  | 3 |
| 4 | one | 3  | 4 |
| 6 | two | 4  | 6 |
Transforming Data Using a Function or Mapping
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
|   | food        | ounces |
|---|-------------|--------|
| 0 | bacon       | 4.0    |
| 1 | pulled pork | 3.0    |
| 2 | bacon       | 12.0   |
| 3 | Pastrami    | 6.0    |
| 4 | corned beef | 7.5    |
| 5 | Bacon       | 8.0    |
| 6 | pastrami    | 3.0    |
| 7 | honey ham   | 5.0    |
| 8 | nova lox    | 6.0    |
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}
'''The map method on a Series accepts a function or a dict-like object containing a mapping.
Here some of the meat names are capitalized while others are not, so we also need the
Series str.lower method to convert each value to lowercase first:'''
lowercased = data['food'].str.lower()
lowercased
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
data['animal'] = lowercased.map(meat_to_animal)
data
|   | food        | ounces | animal |
|---|-------------|--------|--------|
| 0 | bacon       | 4.0    | pig    |
| 1 | pulled pork | 3.0    | pig    |
| 2 | bacon       | 12.0   | pig    |
| 3 | Pastrami    | 6.0    | cow    |
| 4 | corned beef | 7.5    | cow    |
| 5 | Bacon       | 8.0    | pig    |
| 6 | pastrami    | 3.0    | cow    |
| 7 | honey ham   | 5.0    | pig    |
| 8 | nova lox    | 6.0    | salmon |
data['food'].map(lambda x: meat_to_animal[x.lower()])
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
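Note that the lambda above raises KeyError if a food is missing from the mapping. A defensive variant (my own sketch, with 'unknown' as an arbitrary default) uses dict.get instead:

```python
import pandas as pd

meat_to_animal = {'bacon': 'pig', 'nova lox': 'salmon'}
food = pd.Series(['Bacon', 'nova lox', 'tofu'])

# dict.get returns a default for unmapped values instead of raising KeyError
animal = food.map(lambda x: meat_to_animal.get(x.lower(), 'unknown'))
print(animal.tolist())  # ['pig', 'salmon', 'unknown']
```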
Replacing Values
- Filling in missing data with fillna can be seen as a special case of value replacement. map can be used to modify a subset of values in an object, but replace provides a simpler, more flexible way to do so. Consider this Series:
data = pd.Series([1,-999,2.,-999.,-1000.,3.])
data
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
data.replace(-999,np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
data.replace([-999,-1000],np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
data.replace([-999,-1000],[np.nan,0])
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
data.replace({-999:np.nan, -1000:0})
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
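replace works the same way on a DataFrame, applying the dict of sentinel-to-replacement pairs to every column (a sketch; the sample frame is my own):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, -999, 3], 'b': [-999, -1000, 6]})

# dict form: each sentinel value gets its own replacement
cleaned = df.replace({-999: np.nan, -1000: 0})
print(cleaned)
```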
Renaming Axis Indexes
- Like values in a Series, axis labels can be transformed by a function or mapping to produce new, differently labeled objects. The axes can also be modified in place without creating a new data structure.
data = pd.DataFrame(np.arange(12).reshape((3,4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one','two','three','four'])
data
|          | one | two | three | four |
|----------|-----|-----|-------|------|
| Ohio     | 0   | 1   | 2     | 3    |
| Colorado | 4   | 5   | 6     | 7    |
| New York | 8   | 9   | 10    | 11   |
transform = lambda x: x[:4].upper()
data.index.map(transform)
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
data.index = data.index.map(transform)
data
|      | one | two | three | four |
|------|-----|-----|-------|------|
| OHIO | 0   | 1   | 2     | 3    |
| COLO | 4   | 5   | 6     | 7    |
| NEW  | 8   | 9   | 10    | 11   |
- If you want to create a transformed version of the dataset without modifying the original, a useful method is rename:
data.rename(index=str.title, columns=str.upper)
|      | ONE | TWO | THREE | FOUR |
|------|-----|-----|-------|------|
| Ohio | 0   | 1   | 2     | 3    |
| Colo | 4   | 5   | 6     | 7    |
| New  | 8   | 9   | 10    | 11   |
- rename can be used with a dict-like object to update a subset of the axis labels; to modify the data in place, pass inplace=True:
data.rename(index={'OHIO':'INDIANA'},columns={'three':'peekaboo'})
|         | one | two | peekaboo | four |
|---------|-----|-----|----------|------|
| INDIANA | 0   | 1   | 2        | 3    |
| COLO    | 4   | 5   | 6        | 7    |
| NEW     | 8   | 9   | 10       | 11   |
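The inplace=True variant mentioned above is not shown in the original cells; a minimal sketch:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['OHIO', 'COLO', 'NEW '],
                    columns=['one', 'two', 'three', 'four'])

# rename returns a new object unless inplace=True, which mutates data directly
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
print(list(data.index))  # ['INDIANA', 'COLO', 'NEW ']
```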