Pattern Recognition Course Notes - Sparse Classifiers

For personal note-taking use only.


Modern pattern recognition

background

  • due to the success of pattern recognition, many people try it on their own applications
  • traditionally: a few informative features; apply density estimation
  • currently (around 2015): many weakly informative features; apply an SVM
  • many features

features

  • modern PR should work in (very) high-dimensional data (the classifier should generalize)
  • modern PR should show which features are important (the classifier should select the features)
  • modern PR should explain the classification (the classifier should show the decision rule) (around 2015)
  • partial solutions: feature selection, decision trees, sparse classifiers

feature selection vs. extraction


dimensionality reduction

• Use of dimensionality reduction:

  1. Fewer parameters: faster, easier to estimate
    This might actually improve results,
    even though you are throwing away information!
  2. Explain which measurements are useful
    and which are not (reduce redundancy)
  3. Visualisation

• Think of dimensionality reduction as
finding a mapping y for every x;
so we need a model of what makes a good mapping
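
To make the idea of "a mapping y for every x" concrete, here is a minimal sketch (my own illustration, not from the lecture) using scikit-learn's PCA as one such mapping:

```python
# Minimal sketch: PCA as one possible mapping y = f(x) that reduces
# dimensionality while reporting how much variance (information) it keeps.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 samples, 10 features

pca = PCA(n_components=2)             # map every x to a 2-dimensional y
Y = pca.fit_transform(X)

print(Y.shape)                        # (100, 2): fewer parameters, easy to visualise
print(pca.explained_variance_ratio_)  # fraction of variance kept per component
```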

Note: use cross-validation when you are short of data.
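
As a reminder of what that means in practice, here is a minimal sketch (my own example, not from the lecture) of k-fold cross-validation, which reuses every sample for both training and validation when data is scarce:

```python
# Minimal sketch: 5-fold cross-validation so every sample serves for both
# training and validation; useful when there is too little data to hold out
# a separate test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores.mean(), scores.std())  # average accuracy and its spread over the folds
```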

Decision Tree

Top down algorithm

  1. If all training samples in the current node are from the same class, label the
    node as a leaf node of the respective class.
    ⇒ done
  2. Otherwise, score all possible splits and choose the best-scoring split
  3. Create as many child nodes as split outcomes. Label edges between
    parent and child with outcomes and partition training data according
    to split into child nodes
  4. Repeat procedure recursively for all child nodes
    ⇒ go to 1.

In other words, you first split the data into clusters as well as possible, then focus on each cluster and keep splitting the classes within it recursively until every node is pure (see the sketch below).
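
A rough sketch of this top-down procedure (my own simplified version with binary threshold splits and Gini impurity as the split score; all names are hypothetical):

```python
# Simplified sketch of the top-down tree-building algorithm above:
# binary threshold splits, scored by weighted Gini impurity.
import numpy as np

def gini(y):
    # Gini impurity of a set of class labels: 1 - sum_k p_k^2
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def build_tree(X, y):
    # Step 1: all samples in this node share one class -> leaf node
    if len(np.unique(y)) == 1:
        return {"leaf": y[0]}
    # Step 2: score all possible (feature, threshold) splits, keep the best
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue
            score = (left.sum() * gini(y[left])
                     + (~left).sum() * gini(y[~left])) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t, left)
    if best is None:  # no split separates the data -> majority-class leaf
        vals, counts = np.unique(y, return_counts=True)
        return {"leaf": vals[np.argmax(counts)]}
    _, j, t, left = best
    # Steps 3-4: partition the data into child nodes and recurse on each
    return {"feature": j, "threshold": t,
            "left": build_tree(X[left], y[left]),
            "right": build_tree(X[~left], y[~left])}
```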


Additional Advantages

  • categorical attributes easily dealt with
  • missing feature values can be dealt with naturally
    • excluded in attribute selection
    • training data with missing value for selected attribute is copied to all (both) subtrees
  • interpretation of derived rule set
    • desirable/required in many domains (forensics, medicine, etc.)

Bagging and random forests

  • trees are very unstable (if you modify even a few samples in the dataset, the tree may change completely)
  • combine trees that are trained on
    • different training sets (bagging)
    • different feature subsets
  • typically, trees are combined using a majority vote (see the sketch below)
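
The sketch below (my own example using scikit-learn, not part of the lecture) shows both combinations; the predictions of the individual trees are combined by majority vote:

```python
# Minimal sketch: many trees trained on bootstrap samples (bagging) or on
# bootstrap samples plus random feature subsets (random forest), combined by
# majority vote.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)  # also subsamples features per split

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(forest, X, y, cv=5).mean())
```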

Sparse classifier

  • sparse classifier: something is sparse, typically the set of features used
  • linear classifier:
    f(x)=\mathrm{sign}(w^Tx+w_0)=\mathrm{sign}(w_1x_1+\ldots+w_px_p+w_0)
  • when you force some weights to be zero, the corresponding features are removed
    example:
    f(x)=\mathrm{sign}(w_2x_2+w_7x_7+w_{15}x_{15}+w_0)
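
One common way to obtain such zero weights is an L1-penalised linear model. A minimal sketch (my own example; scikit-learn and all parameter values are assumptions, not part of the lecture):

```python
# Minimal sketch: an L1-penalised linear classifier drives most weights to
# exactly zero, so only a few features remain in f(x) = sign(w^T x + w0).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=3,
                           random_state=0)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
w = clf.coef_.ravel()

print("non-zero weights (selected features):", np.flatnonzero(w))
print("w0:", clf.intercept_[0])
```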

Regularization

• To prevent overfitting to training data
• Reducing model complexity
• Typical example: ridge regression
- linear regression model: \min_{w,w_0}\sum_{i=1}^N (y_i-w^Tx_i-w_0)^2
- regularized linear (ridge) regression: \min_{w,w_0}\sum_{i=1}^N (y_i-w^Tx_i-w_0)^2+C\Vert w\Vert_2^2
- Lasso: \min_{w,w_0}\sum_{i=1}^N (y_i-w^Tx_i-w_0)^2+C\Vert w\Vert_1

  • Norms of a vector
    • \Vert w\Vert_2^2 = w_1^2+w_2^2+\ldots+w_p^2
    • \Vert w\Vert_1 = |w_1|+|w_2|+\ldots+|w_p|
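
A minimal sketch contrasting the two penalties (my own example with scikit-learn, whose `alpha` plays the role of C in the formulas above): ridge shrinks all weights, while lasso sets many of them exactly to zero:

```python
# Minimal sketch: ridge (L2 penalty) shrinks weights but rarely zeroes them;
# lasso (L1 penalty) sets many weights exactly to zero -> sparse model.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("ridge zero weights:", int(np.sum(ridge.coef_ == 0)))  # typically 0
print("lasso zero weights:", int(np.sum(lasso.coef_ == 0)))  # many exact zeros
```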

L1 versus L2 regularization
