Stacking

Basic Idea

The basic idea behind stacked generalization is to use a pool of base classifiers, then use another classifier to combine their predictions, with the aim of reducing the generalization error.

Algorithms

Standard Stacking

  • Split the training set $(\text{labels}, F)$ into $k$ folds $f_1, f_2, \dots, f_k$. Fit learner $L_i$ on each set of $k-1$ folds and use it to predict the remaining fold, giving predictions on $f_1, f_2, \dots, f_k$, denoted $V_{i1}, V_{i2}, \dots, V_{ik}$, one fold at a time.
  • Meanwhile, each time we obtain a fitted $L_i$, use it to make predictions on the whole test set, so that we obtain $k$ sets of predictions $P_{i1}, P_{i2}, \dots, P_{ik}$.
  • $V_i = (V_{i1}; V_{i2}; \dots; V_{ik})$
    $P_i = \frac{1}{k}\sum_{m=1}^{k} P_{im}$ or another averaging method
  • Repeat the steps above for $i = 1$ to $n$ with different learners, which we collectively call Model 1; we have
    $V = (V_1, V_2, \dots, V_n)$
    $P = (P_1, P_2, \dots, P_n)$
  • Consider $(\text{labels}, V)$ and $P$ as the new training set and test set respectively; train and make predictions with Model 2 (usually logistic regression; popular non-linear algorithms for stacking are GBM, KNN, NN, RF and ET (extra trees)) to get the final results. A minimal sketch of the whole procedure follows this list.
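
A minimal sketch of the procedure above, assuming a binary classification task with NumPy arrays `X_train`, `y_train`, `X_test` already in scope; the particular base learners and variable names are illustrative assumptions, not a prescribed setup.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

base_learners = [                       # Model 1 (illustrative choices)
    RandomForestClassifier(n_estimators=200, random_state=0),
    ExtraTreesClassifier(n_estimators=200, random_state=0),
    KNeighborsClassifier(n_neighbors=10),
]
kf = KFold(n_splits=5, shuffle=True, random_state=42)      # k = 5 folds

V = np.zeros((len(X_train), len(base_learners)))   # out-of-fold predictions
P = np.zeros((len(X_test), len(base_learners)))    # averaged test predictions

for i, learner in enumerate(base_learners):
    P_folds = np.zeros((len(X_test), kf.get_n_splits()))
    for m, (fit_idx, oof_idx) in enumerate(kf.split(X_train)):
        learner.fit(X_train[fit_idx], y_train[fit_idx])     # fit on k-1 folds
        V[oof_idx, i] = learner.predict_proba(X_train[oof_idx])[:, 1]  # V_{im}
        P_folds[:, m] = learner.predict_proba(X_test)[:, 1]            # P_{im}
    P[:, i] = P_folds.mean(axis=1)      # P_i = average over the k fold models

meta_model = LogisticRegression()       # Model 2
meta_model.fit(V, y_train)              # train on (labels, V)
final_pred = meta_model.predict_proba(P)[:, 1]   # final test-set predictions
```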

In a word, it models

$$y = \sum_{i=1}^{n} w_i g_i$$

where $w_i$ is the weight of the $i$-th learner and $g_i$ is the corresponding prediction. Some details deserve attention:

  • Averaging works for regression problems, or for classification problems where the Model 1 learners output probabilities. Under other circumstances, voting could be better than averaging.
  • In fact, you can also get $P_i$ by simply training $L_i$ on the whole training set and making predictions on the test set (see the sketch after this list), which may consume more computing resources but slightly lowers the coding complexity.
  • The fold partition of the training set must be the same for the $n$ different estimators, especially when you are working in a team; otherwise it will lead to information leakage and therefore over-fitting.
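
A hedged sketch of the alternative way of obtaining $P_i$ mentioned in the second bullet, reusing the assumed names (`base_learners`, `X_train`, `y_train`, `X_test`, `P`) from the earlier sketch:

```python
# Alternative P_i: refit each base learner once on the full training set and
# predict the test set directly, instead of averaging the k fold models.
for i, learner in enumerate(base_learners):
    learner.fit(X_train, y_train)                    # one extra fit per learner
    P[:, i] = learner.predict_proba(X_test)[:, 1]    # P_i without fold averaging
```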

Feature-Weighted Linear Stacking

Replace $w_i$ with $\sum_k v_{ik} x_k$, where $x_k$ represents the $k$-th feature of a sample and $v_{ik}$ is the corresponding weight, so that the blending weight of each learner becomes a linear function of the features.
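
A minimal sketch of this idea, assuming the out-of-fold predictions `V` and labels `y_train` from the earlier sketch plus a hypothetical meta-feature matrix `X_meta`; the products $x_k g_i$ are formed explicitly so a plain linear model learns the $v_{ik}$ (the use of Ridge here is an assumption, not the only choice).

```python
import numpy as np
from sklearn.linear_model import Ridge

n_train = V.shape[0]
# Build all products x_k * g_i; the learned coefficient of each product
# column corresponds to a v_{ik} in the formula above.
Z = np.einsum('nk,ni->nki', X_meta, V).reshape(n_train, -1)

fwls = Ridge(alpha=1.0)       # linear model over the products
fwls.fit(Z, y_train)          # its coefficients are the v_{ik} weights
```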

Further

We can also append the predictions of Model 1 to the original features to obtain an expanded feature set; note that the predictions and the original features live on different scales, so normalization is necessary.
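
A small sketch of this feature expansion, again reusing the assumed `X_train` and `V` from the earlier sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Append the Model 1 predictions to the original features, then normalize,
# since the two blocks of columns are on different scales.
X_expanded = np.hstack([X_train, V])
X_expanded = StandardScaler().fit_transform(X_expanded)
```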
