Pattern Recognition Course Notes - Sparse Classifiers

For personal note-taking use only.


Modern pattern recognition

background

  • due to the success of pattern recognition, many people try it on their own applications
  • traditionally: a few informative features; apply density estimation
  • currently (around 2015): many weakly informative features; apply an SVM
  • many features

features

  • modern PR should work in (very) high-dimensional data (the classifier should generalize)
  • modern PR should show which features are important (the classifier should select the features)
  • modern PR should explain the classification (the classifier should show the decision rule) (around 2015)
  • partial solutions: feature selection, decision trees, sparse classifiers

feature selection vs. extraction


dimensionality reduction

• Use of dimensionality reduction:

  1. Fewer parameters: faster, easier to estimate
    This might actually improve results,
    even though you are throwing away information!
  2. Explain which measurements are useful
    and which are not (reduce redundancy)
  3. Visualisation

• Think of dimensionality reduction as
finding a mapping y for every x;
so we need a model of what makes a good mapping
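
To make the idea of "a mapping y for every x" concrete, here is a minimal sketch (my own illustration, not from the lecture) using scikit-learn's PCA as one such mapping:

```python
# Minimal sketch: PCA as one possible mapping y = f(x) that reduces
# dimensionality while reporting how much variance (information) it keeps.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 samples, 10 features

pca = PCA(n_components=2)             # map every x to a 2-dimensional y
Y = pca.fit_transform(X)

print(Y.shape)                        # (100, 2): fewer parameters, easy to visualise
print(pca.explained_variance_ratio_)  # fraction of variance kept per component
```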

Note: use cross-validation when you are short of data.
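
As a reminder of what that means in practice, here is a minimal sketch (my own example, not from the lecture) of k-fold cross-validation, which reuses every sample for both training and validation when data is scarce:

```python
# Minimal sketch: 5-fold cross-validation so every sample serves for both
# training and validation; useful when there is too little data to hold out
# a separate test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores.mean(), scores.std())  # average accuracy and its spread over the folds
```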

Decision Tree

Top down algorithm

  1. If all training samples in the current node are from the same class, label the
    node as a leaf node of the respective class.
    ⇒ done
  2. Otherwise, score all possible splits and choose the best-scoring split
  3. Create as many child nodes as split outcomes. Label edges between
    parent and child with outcomes and partition training data according
    to split into child nodes
  4. Repeat procedure recursively for all child nodes
    ⇒ go to 1.

In other words, you first split the data into clusters as well as possible, then focus on each cluster and keep splitting the classes within it recursively until every node is pure (see the sketch below).
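
A rough sketch of this top-down procedure (my own simplified version with binary threshold splits and Gini impurity as the split score; all names are hypothetical):

```python
# Simplified sketch of the top-down tree-building algorithm above:
# binary threshold splits, scored by weighted Gini impurity.
import numpy as np

def gini(y):
    # Gini impurity of a set of class labels: 1 - sum_k p_k^2
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def build_tree(X, y):
    # Step 1: all samples in this node share one class -> leaf node
    if len(np.unique(y)) == 1:
        return {"leaf": y[0]}
    # Step 2: score all possible (feature, threshold) splits, keep the best
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue
            score = (left.sum() * gini(y[left])
                     + (~left).sum() * gini(y[~left])) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t, left)
    if best is None:  # no split separates the data -> majority-class leaf
        vals, counts = np.unique(y, return_counts=True)
        return {"leaf": vals[np.argmax(counts)]}
    _, j, t, left = best
    # Steps 3-4: partition the data into child nodes and recurse on each
    return {"feature": j, "threshold": t,
            "left": build_tree(X[left], y[left]),
            "right": build_tree(X[~left], y[~left])}
```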


Additional Advantages

  • categorical attributes easily dealt with
  • missing feature values can be dealt with naturally
    • excluded in attribute selection
    • training data with missing value for selected attribute is copied to all (both) subtrees
  • interpretation of derived rule set
    • desirable/required in many domains (forensics, medicine, etc.)

Bagging and random forests

  • trees are very unstable (if you modify even a few samples in the dataset, the tree may change completely)
  • combine trees that are trained on
    • different training sets (bagging)
    • different feature subsets
  • typically, trees are combined using a majority vote (see the sketch below)
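
The sketch below (my own example using scikit-learn, not part of the lecture) shows both combinations; the predictions of the individual trees are combined by majority vote:

```python
# Minimal sketch: many trees trained on bootstrap samples (bagging) or on
# bootstrap samples plus random feature subsets (random forest), combined by
# majority vote.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)  # also subsamples features per split

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(forest, X, y, cv=5).mean())
```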

Sparse classifier

  • sparse classifier: something is sparse, typically the set of features used
  • linear classifier:
    f(x)=\mathrm{sign}(w^Tx+w_0)=\mathrm{sign}(w_1x_1+\ldots+w_px_p+w_0)
  • when you force some weights to be zero, the corresponding features are removed
    example:
    f(x)=\mathrm{sign}(w_2x_2+w_7x_7+w_{15}x_{15}+w_0)
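
One common way to obtain such zero weights is an L1-penalised linear model. A minimal sketch (my own example; scikit-learn and all parameter values are assumptions, not part of the lecture):

```python
# Minimal sketch: an L1-penalised linear classifier drives most weights to
# exactly zero, so only a few features remain in f(x) = sign(w^T x + w0).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=3,
                           random_state=0)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
w = clf.coef_.ravel()

print("non-zero weights (selected features):", np.flatnonzero(w))
print("w0:", clf.intercept_[0])
```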

Regularization

• To prevent overfitting to training data
• Reducing model complexity
• Typical example: ridge regression
- linear regression model: \min_{w,w_0}\sum_{i=1}^N (y_i-w^Tx_i-w_0)^2
- regularized linear (ridge) regression: \min_{w,w_0}\sum_{i=1}^N (y_i-w^Tx_i-w_0)^2+C\Vert w\Vert_2^2
- Lasso: \min_{w,w_0}\sum_{i=1}^N (y_i-w^Tx_i-w_0)^2+C\Vert w\Vert_1

  • Norms of a vector
    • \Vert w\Vert_2^2 = w_1^2+w_2^2+\ldots+w_p^2
    • \Vert w\Vert_1 = |w_1|+|w_2|+\ldots+|w_p|
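
A minimal sketch contrasting the two penalties (my own example with scikit-learn, whose `alpha` plays the role of C in the formulas above): ridge shrinks all weights, while lasso sets many of them exactly to zero:

```python
# Minimal sketch: ridge (L2 penalty) shrinks weights but rarely zeroes them;
# lasso (L1 penalty) sets many weights exactly to zero -> sparse model.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("ridge zero weights:", int(np.sum(ridge.coef_ == 0)))  # typically 0
print("lasso zero weights:", int(np.sum(lasso.coef_ == 0)))  # many exact zeros
```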

L1 versus L2 regularization
