Stacking

Basic Idea

The basic idea behind stacked generalization is to use a pool of base classifiers, then use another classifier to combine their predictions, with the aim of reducing the generalization error.

Algorithms

Standard Stacking

  • Split the training set $(\text{labels}, F)$ into $k$ folds $f_1, f_2, \dots, f_k$. Fit learner $L_i$ on each set of $k-1$ folds and use it to predict the remaining fold, giving predictions on $f_1, f_2, \dots, f_k$, denoted $V_{i1}, V_{i2}, \dots, V_{ik}$, one fold at a time.
  • Meanwhile, each time we obtain a fitted $L_i$, use it to make predictions on the whole test set, so that we obtain $k$ sets of predictions $P_{i1}, P_{i2}, \dots, P_{ik}$.
  • $V_i = (V_{i1}; V_{i2}; \dots; V_{ik})$
    $P_i = \frac{1}{k}\sum_{m=1}^{k} P_{im}$ or another averaging method
  • Repeat the steps above for $i = 1$ to $n$ with different learners, which we collectively call Model 1; we have
    $V = (V_1, V_2, \dots, V_n)$
    $P = (P_1, P_2, \dots, P_n)$
  • Consider $(\text{labels}, V)$ and $P$ as the new training set and test set respectively; train and make predictions with Model 2 (usually logistic regression; popular non-linear algorithms for stacking are GBM, KNN, NN, RF and ET (extra trees)) to get the final results. A minimal sketch of the whole procedure follows this list.
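
A minimal sketch of the procedure above, assuming a binary classification task with NumPy arrays `X_train`, `y_train`, `X_test` already in scope; the particular base learners and variable names are illustrative assumptions, not a prescribed setup.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

base_learners = [                       # Model 1 (illustrative choices)
    RandomForestClassifier(n_estimators=200, random_state=0),
    ExtraTreesClassifier(n_estimators=200, random_state=0),
    KNeighborsClassifier(n_neighbors=10),
]
kf = KFold(n_splits=5, shuffle=True, random_state=42)      # k = 5 folds

V = np.zeros((len(X_train), len(base_learners)))   # out-of-fold predictions
P = np.zeros((len(X_test), len(base_learners)))    # averaged test predictions

for i, learner in enumerate(base_learners):
    P_folds = np.zeros((len(X_test), kf.get_n_splits()))
    for m, (fit_idx, oof_idx) in enumerate(kf.split(X_train)):
        learner.fit(X_train[fit_idx], y_train[fit_idx])     # fit on k-1 folds
        V[oof_idx, i] = learner.predict_proba(X_train[oof_idx])[:, 1]  # V_{im}
        P_folds[:, m] = learner.predict_proba(X_test)[:, 1]            # P_{im}
    P[:, i] = P_folds.mean(axis=1)      # P_i = average over the k fold models

meta_model = LogisticRegression()       # Model 2
meta_model.fit(V, y_train)              # train on (labels, V)
final_pred = meta_model.predict_proba(P)[:, 1]   # final test-set predictions
```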

In a word, it models

$$y = \sum_{i=1}^{n} w_i g_i$$

where $w_i$ is the weight of the $i$-th learner and $g_i$ is the corresponding prediction. Some details deserve attention:

  • Averaging works for regression problems, or for classification problems where the Model 1 learners output probabilities. Under other circumstances, voting could be better than averaging.
  • In fact, you can also get $P_i$ by simply training $L_i$ on the whole training set and making predictions on the test set (see the sketch after this list), which may consume more computing resources but slightly lowers the coding complexity.
  • The fold partition of the training set must be the same for the $n$ different estimators, especially when you are working in a team; otherwise it will lead to information leakage and therefore over-fitting.
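
A hedged sketch of the alternative way of obtaining $P_i$ mentioned in the second bullet, reusing the assumed names (`base_learners`, `X_train`, `y_train`, `X_test`, `P`) from the earlier sketch:

```python
# Alternative P_i: refit each base learner once on the full training set and
# predict the test set directly, instead of averaging the k fold models.
for i, learner in enumerate(base_learners):
    learner.fit(X_train, y_train)                    # one extra fit per learner
    P[:, i] = learner.predict_proba(X_test)[:, 1]    # P_i without fold averaging
```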

Feature-Weighted Linear Stacking

Replace $w_i$ with $\sum_k v_{ik} x_k$, where $x_k$ represents the $k$-th feature of a sample and $v_{ik}$ is the corresponding weight, so that the blending weight of each learner becomes a linear function of the features.
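
A minimal sketch of this idea, assuming the out-of-fold predictions `V` and labels `y_train` from the earlier sketch plus a hypothetical meta-feature matrix `X_meta`; the products $x_k g_i$ are formed explicitly so a plain linear model learns the $v_{ik}$ (the use of Ridge here is an assumption, not the only choice).

```python
import numpy as np
from sklearn.linear_model import Ridge

n_train = V.shape[0]
# Build all products x_k * g_i; the learned coefficient of each product
# column corresponds to a v_{ik} in the formula above.
Z = np.einsum('nk,ni->nki', X_meta, V).reshape(n_train, -1)

fwls = Ridge(alpha=1.0)       # linear model over the products
fwls.fit(Z, y_train)          # its coefficients are the v_{ik} weights
```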

Further

We can also append the predictions of Model 1 to the original features to obtain an expanded feature set; note that the predictions and the original features live on different scales, so normalization is necessary.
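
A small sketch of this feature expansion, again reusing the assumed `X_train` and `V` from the earlier sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Append the Model 1 predictions to the original features, then normalize,
# since the two blocks of columns are on different scales.
X_expanded = np.hstack([X_train, V])
X_expanded = StandardScaler().fit_transform(X_expanded)
```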
