Kaggle Titanic: Reading Notes

This post summarizes what I learned from a kernel written for the Kaggle Titanic competition:
https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python

1. Load the data (data)

2. Inspect the data

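data.head()       # first few rows
data.dtypes       # column types (an attribute, not a method)
data.shape        # (rows, columns) (an attribute, not a method)
data.describe()   # summary statistics for the numeric columns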

The data turns out to contain string columns (such as sex and name) as well as missing values.
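The inspection step can be sketched on a small made-up table. The rows below are hypothetical and only the column names mirror the Titanic data; `isnull().sum()` is one common way to count the missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; only the column names mirror the Titanic data
data = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
})

print(data.dtypes)  # attribute, no parentheses
print(data.shape)   # attribute, no parentheses

# Count missing values per column
missing = data.isnull().sum()
print(missing)

# Columns that are not numeric (here, the text columns)
non_numeric = [c for c in data.columns
               if not pd.api.types.is_numeric_dtype(data[c])]
print(non_numeric)
```

The two lists printed at the end are the columns that need attention in the cleaning step: the ones with gaps and the ones that still hold text.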

3. Clean the data: convert strings to numbers and fill in missing values

train['Has_Cabin'] = train["Cabin"].apply(lambda x: 0 if type(x) == float else 1)

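When working with a DataFrame, train['Has_Cabin'] selects (or creates) a column by its key.
apply: applies the given function to every value in the column. A missing Cabin is stored as NaN, which is a float, so the lambda returns 0 for missing cabins and 1 otherwise.
(): parentheses are for calling functions; keys do not go inside them.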
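A minimal sketch of the same idea on a hypothetical column, assuming the column is stored as Python objects so that missing entries are the float NaN. The `notnull()` variant is an alternative I added for comparison; it is not from the kernel.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; dtype=object keeps missing entries as float NaN
cabin = pd.Series(["C85", np.nan, "E46", np.nan], dtype=object)
train = pd.DataFrame({"Cabin": cabin})

# NaN is a float, so type(x) == float is True exactly for the missing entries
train["Has_Cabin"] = train["Cabin"].apply(lambda x: 0 if type(x) == float else 1)

# Alternative that does not rely on the type of NaN (not from the kernel)
train["Has_Cabin_alt"] = train["Cabin"].notnull().astype(int)

print(train)
```

Both columns should agree; the second form states the intent ("is a cabin recorded?") more directly.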

dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1

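This assigns ages above 16 and up to 32 to age group 1.
loc: label-based selection.
Here it takes a boolean condition that picks the rows, followed by the name of the column to assign to.
In Python, [] is used for indexing and for list literals (dict literals use {}); () is used for calling functions.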
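The binning can be sketched end to end on a few hypothetical ages. Only the 16 to 32 rule comes from the line above; groups 0 and 2 and their boundaries are assumptions added so the example is complete.

```python
import pandas as pd

# Hypothetical ages for illustration
dataset = pd.DataFrame({"Age": [10.0, 16.0, 20.0, 32.0, 40.0]})

# Keep a copy of the raw ages so later conditions are not
# affected by values that were already overwritten
age = dataset["Age"].copy()

dataset.loc[age <= 16, "Age"] = 0                  # assumed group 0
dataset.loc[(age > 16) & (age <= 32), "Age"] = 1   # rule from the text
dataset.loc[age > 32, "Age"] = 2                   # assumed group 2

print(dataset)
```

Each comparison must be wrapped in parentheses before combining with `&`, because `&` binds more tightly than `>` and `<=` in Python.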

import matplotlib.pyplot as plt
import seaborn as sns

colormap = plt.cm.RdBu
plt.figure(figsize=(14,12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(train.astype(float).corr(), linewidths=0.1, vmax=1.0,
            square=True, cmap=colormap, linecolor='white', annot=True)

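This produces a heatmap of the relationships between the features.
Here, train.astype(float).corr() computes the pairwise Pearson correlation between all features automatically (really convenient).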
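To see what `corr()` returns without plotting anything, here is a small sketch on hypothetical columns: `b` rises with `a` and `c` falls with `a`, so the correlations should be at the two extremes.

```python
import pandas as pd

# Hypothetical numeric features: b rises with a, c falls with a
train = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 3, 2, 1],
})

# Pearson correlation is the default method of DataFrame.corr()
corr = train.astype(float).corr()
print(corr)
```

The result is a square DataFrame indexed by feature name on both axes, which is exactly the shape `sns.heatmap` expects.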

4. Learning and training

Several models are trained and compared here. For convenience, a helper class is defined that wraps training and prediction.

class SklearnHelper(object):
    def __init__(self, clf, seed=0, params=None):
        # Copy the dict so the caller's params are not modified,
        # and tolerate params=None (the default)
        params = dict(params or {})
        params['random_state'] = seed
        self.clf = clf(**params)

    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        return self.clf.predict(x)

    def fit(self, x, y):
        return self.clf.fit(x, y)

    def feature_importances(self, x, y):
        # Refits the model, then prints the importances
        print(self.clf.fit(x, y).feature_importances_)
    

The parameters for each model are stored in a dictionary, which is passed in directly when the model is created.

rf_params = {
    'n_jobs': -1,
    'n_estimators': 500,
    'warm_start': True,
    #'max_features': 0.2,
    'max_depth': 6,
    'min_samples_leaf': 2,
    'max_features': 'sqrt',
    'verbose': 0
}
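A sketch of how the helper class and a parameter dictionary fit together. It uses synthetic data from `make_classification` instead of the Titanic features, a condensed copy of the class (with a guard for `params=None`), and a smaller parameter set than the one above so that it runs quickly; all of those are my assumptions, not the kernel's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


class SklearnHelper(object):
    """Condensed copy of the helper: build, train, predict."""

    def __init__(self, clf, seed=0, params=None):
        params = dict(params or {})  # copy; also handles params=None
        params['random_state'] = seed
        self.clf = clf(**params)

    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        return self.clf.predict(x)


# Synthetic stand-in for the Titanic features (not the real data)
x, y = make_classification(n_samples=200, n_features=5, random_state=0)
x_train, y_train = x[:150], y[:150]
x_test = x[150:]

# Smaller than the rf_params above so the sketch runs quickly
small_rf_params = {'n_estimators': 50, 'max_depth': 6, 'min_samples_leaf': 2}

# The class itself is passed in; the helper instantiates it with the params
rf = SklearnHelper(clf=RandomForestClassifier, seed=0, params=small_rf_params)
rf.train(x_train, y_train)
predictions = rf.predict(x_test)
print(predictions[:10])
```

Because every model goes through the same `train` and `predict` interface, swapping in another classifier only means passing a different class and a different parameter dictionary.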

5. Results

The results are collected into a dictionary and turned into a DataFrame, which makes them easy to save and plot.

feature_dataframe = pd.DataFrame({
    'features': cols,
    'Random Forest feature importances': rf_features,
    'Extra Trees  feature importances': et_features,
    'AdaBoost feature importances': ada_features,
    'Gradient Boost feature importances': gb_features
})
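A small sketch of why this layout is convenient, using hypothetical feature names and importance values and only two of the models. Averaging across models is one possible follow-up that I added for illustration; `to_csv` without a path returns the CSV as a string.

```python
import pandas as pd

# Hypothetical feature names and importance values, for illustration only
cols = ['Pclass', 'Sex', 'Age']
rf_features = [0.2, 0.5, 0.3]
et_features = [0.3, 0.4, 0.3]

feature_dataframe = pd.DataFrame({
    'features': cols,
    'Random Forest feature importances': rf_features,
    'Extra Trees feature importances': et_features,
})

# Average the importances across models, one value per feature
numeric = feature_dataframe.drop(columns='features')
feature_dataframe['mean'] = numeric.mean(axis=1)

# Serialize to CSV text; pass a file path instead to write to disk
csv_text = feature_dataframe.to_csv(index=False)
print(csv_text)
```

With one row per feature and one column per model, each column can be passed directly to a plotting call as the y values, with the `features` column as the x values.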

Finally, a scatter plot of these feature importances is generated.