Data visualization techniques before feature engineering
The Titanic passenger survival dataset is used as the example.
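import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
After loading the data into a DataFrame named data: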
data.head()
data.describe()
data.info()
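These give a rough overview of the data: the first rows, summary statistics, and column types with non-null counts.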
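One thing worth reading off the overview is which columns have missing values, since that decides what needs filling before plotting. A minimal sketch, assuming a tiny stand-in table (the values in `toy` are made up; only the column names follow the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic table; values are invented.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "Embarked": ["S", "C", "S", None, "Q"],
})

# Count missing values per column, largest first.
missing = toy.isnull().sum().sort_values(ascending=False)
print(missing)
```

On the real data the same call would be `data.isnull().sum()`.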
plt.figure(figsize=(16,8))
sns.countplot(x='Survived',data=data)
plt.title('Survived')
Sex:
f,ax=plt.subplots(1,2,figsize=(18,8))
data[['Sex','Survived']].groupby(['Sex']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Sex vs Survived')
sns.countplot(x='Sex',hue='Survived',data=data,ax=ax[1])
ax[1].set_title('Sex:Survived or dead')
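The left-hand bar chart is just the mean of the 0/1 `Survived` column per group, so the numbers behind it can be printed directly. A small sketch on made-up rows (the `toy` frame is hypothetical), which also shows the group size so a rate is not read without its sample count:

```python
import pandas as pd

# Hypothetical rows; only the column names match the Titanic data.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the survival rate; count is the group size.
summary = toy.groupby("Sex")["Survived"].agg(["mean", "count"])
print(summary)
```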
Pclass:
f,ax=plt.subplots(1,2,figsize=(18,8))
data['Pclass'].value_counts().plot.bar(color='black',ax=ax[0])
ax[0].set_title('Number of passengers by Pclass')
ax[0].set_ylabel('Count')
sns.countplot(x='Pclass',hue='Survived',data=data,ax=ax[1],palette='Set2')
ax[1].set_title('Pclass:Survived or dead')
Combined effect of passenger class and sex on survival:
sns.catplot(x='Pclass',y='Survived',hue='Sex',data=data,kind='point',palette='Set2')
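The point plot above draws the survival rate for each (Pclass, Sex) cell. The same table can be computed with `pd.crosstab`; this is a sketch on invented rows, not the real figures:

```python
import pandas as pd

# Hypothetical rows; only the column names match the Titanic data.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex": ["female", "male", "female", "male",
            "female", "male", "male", "female"],
    "Survived": [1, 0, 1, 0, 1, 0, 0, 0],
})

# Rows: Pclass, columns: Sex, cells: mean of Survived (the survival rate).
rates = pd.crosstab(toy["Pclass"], toy["Sex"],
                    values=toy["Survived"], aggfunc="mean")
print(rates)
```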
Age (effect of a continuous feature on survival):
f,ax=plt.subplots(1,2,figsize=(18,8))
sns.violinplot(x='Pclass',y='Age',hue='Survived',split=True,data=data,ax=ax[0],palette='Set2')
ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0,110,10))
sns.violinplot(x='Sex',y='Age',hue='Survived',split=True,data=data,ax=ax[1],palette='Set2')
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0,110,10))
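Age has missing values in the Titanic data, and the histograms that follow are drawn after filling them. These notes do not say how the filling was done. One common option, shown here purely as an assumption, is to fill each missing age with the median of the passenger's group (grouped by `Sex` in this sketch; the `toy` values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names match the Titanic data.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [20.0, 40.0, np.nan, 30.0, np.nan, 50.0],
})

# transform("median") returns each row's group median, aligned to the rows.
group_median = toy.groupby("Sex")["Age"].transform("median")

# Fill only the missing entries; known ages are left untouched.
toy["Age"] = toy["Age"].fillna(group_median)
print(toy)
```

Grouping by a more informative key (for example class, or a title parsed from the name) follows the same pattern.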
Age distribution (after filling missing values):
f,ax=plt.subplots(1,2,figsize=(18,8))
data[data['Survived']==0].Age.plot.hist(bins=20,ax=ax[0],edgecolor='black',color='red')
ax[0].set_title('Survived=0')
x1=list(range(0,85,5))
ax[0].set_xticks(x1)
data[data['Survived']==1].Age.plot.hist(bins=20,ax=ax[1],edgecolor='black',color='blue')
ax[1].set_title('Survived=1')
x2=list(range(0,85,5))
ax[1].set_xticks(x2)
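Two separate histograms show counts, which makes it hard to compare survival rates across ages. A possible complement is to cut Age into bands and take the survival rate per band. The sketch below reuses the 5-year edges from `range(0, 85, 5)`; the `AgeBand` column name and the `toy` rows are hypothetical:

```python
import pandas as pd

# Hypothetical rows; only the column names match the Titanic data.
toy = pd.DataFrame({
    "Age": [2.0, 4.0, 22.0, 24.0, 23.0, 61.0],
    "Survived": [1, 1, 0, 1, 0, 0],
})

# 5-year band edges: 0, 5, ..., 80. right=False makes bands like [20, 25).
bins = list(range(0, 85, 5))
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, right=False)

# observed=True keeps only the bands that actually contain passengers.
rate = toy.groupby("AgeBand", observed=True)["Survived"].mean()
print(rate)
```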
Embarked:
sns.catplot(x='Embarked',y='Survived',data=data,kind='point')