Pattern Recognition course notes - Sparse Classifiers
(Personal notes, for my own use only.)
Modern pattern recognition
background
- due to the success of pattern recognition, many people try it on their own applications
- traditionally: few informative features, then apply density estimation
- currently (around 2015): many weakly informative features, then apply an SVM
- many features
features
- modern PR should work on (very) high-dimensional data (the classifier should generalize)
- modern PR should show which features are important (the classifier should select features)
- modern PR should explain the classification (the classifier should show the decision rule) (as of around 2015)
- partial solutions: feature selection, decision trees, sparse classifiers
feature selection vs. extraction
dimensionality reduction
• Use of dimensionality reduction:
  - Fewer parameters: faster, easier to estimate. This might actually improve results, even though you are throwing away information!
  - Explain which measurements are useful and which are not (reduce redundancy)
  - Visualisation
• Think of dimensionality reduction as finding a mapping y for every x, so we need a model for what makes a good mapping (see the PCA sketch below).
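A minimal sketch of such a mapping, using PCA from scikit-learn (my own example, not from the slides); PCA's notion of a "good mapping" is retaining as much variance as possible:

```python
# Dimensionality reduction as a mapping y for every x, via PCA.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)      # 150 samples, 4 features
pca = PCA(n_components=2)              # "good mapping" = max retained variance
Y = pca.fit_transform(X)               # a 2-D y for every x

print(Y.shape)                         # (150, 2)
print(pca.explained_variance_ratio_)   # fraction of variance kept per axis
```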
note:
use cross-validation when short of data (see the sketch below)
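A small example, assuming scikit-learn: with k-fold cross-validation every sample is used for both training and validation across the folds, which is exactly what you want when data is scarce:

```python
# 5-fold cross-validation: each sample is validated on exactly once.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
print(scores.mean(), scores.std())     # average accuracy and its spread
```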
Decision Tree
Top-down algorithm
1. If all training samples in the current node are from the same class, label the node as a leaf node of that class ⇒ done
2. Otherwise, score all possible splits and choose the best-scored split
3. Create as many child nodes as there are split outcomes. Label the edges between parent and child with the outcomes and partition the training data according to the split into the child nodes
4. Repeat the procedure recursively for all child nodes ⇒ go to 1.
In other words: first split the whole training set as well as possible, then focus on each resulting subset and keep splitting recursively until every leaf is pure (a sketch follows below).
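A minimal sketch of this top-down procedure, assuming numeric features, binary thresholds, and Gini impurity as the split score (my choices; the notes don't fix a scoring function):

```python
# Top-down decision tree induction with Gini impurity as the split score.
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def build(X, y):
    # Step 1: all samples in this node share a class -> leaf, done.
    if len(np.unique(y)) == 1:
        return {"leaf": y[0]}
    # Step 2: score all (feature, threshold) splits; keep the best score.
    best = None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f])[:-1]:   # thresholds keep both sides non-empty
            left = X[:, f] <= t
            score = (left.sum() * gini(y[left])
                     + (~left).sum() * gini(y[~left])) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, left)
    if best is None:                        # no valid split left: majority leaf
        vals, counts = np.unique(y, return_counts=True)
        return {"leaf": vals[np.argmax(counts)]}
    # Steps 3-4: partition the data into children and recurse.
    _, f, t, left = best
    return {"feature": f, "threshold": t,
            "left": build(X[left], y[left]),
            "right": build(X[~left], y[~left])}
```

`build(X, y)` returns the tree as a nested dict; a pure node becomes a `{"leaf": class}` entry, mirroring step 1.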
Additional Advantages
- categorical attributes are easily dealt with
- missing feature values can be dealt with naturally:
  - excluded in attribute selection
  - training data with a missing value for the selected attribute is copied to all (both) subtrees
- the derived rule set can be interpreted
  - desirable/required in many domains (forensics, medical, etc.)
Bagging and random forests
- trees are very unstable (modify or remove a few samples from the dataset and the tree may change completely)
- combine trees that are trained on:
  - different training sets (bagging)
  - different feature subsets
- typically, the trees are combined by majority vote (a sketch follows below)
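A minimal sketch of both ensembles, assuming scikit-learn; bagging resamples the training set per tree, while a random forest additionally considers a random feature subset at every split:

```python
# Two tree ensembles, both combined by majority vote.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)
forest = RandomForestClassifier(n_estimators=100)  # bagging + feature subsets

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(forest, X, y, cv=5).mean())
```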
Sparse classifier
- what does "sparse classifier" mean? something in it is sparse: typically the feature weights
- linear classifier: g(x) = w^T x + w_0
- when you force some of the weights to be zero, the corresponding features are removed
example:
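A minimal sketch, assuming scikit-learn (the dataset is synthetic): an L1-penalised logistic regression is a linear classifier whose penalty drives many weights to exactly zero, so those features drop out of the decision rule:

```python
# L1-penalised (sparse) linear classifier: many weights become exactly zero.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=3, random_state=0)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

w = clf.coef_.ravel()
print("zero weights:", int(np.sum(w == 0)), "of", w.size)
print("selected features:", np.nonzero(w)[0])   # the surviving features
```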
Regularization
• To prevent overfitting to the training data
• Reduces model complexity
• Typical example: ridge regression
  - linear regression model
  - regularized linear (ridge) regression
- Lasso
- Norms of a vector
(the standard formulas behind these bullets are reconstructed below)
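The formulas were not copied into these notes; these are the standard forms, reconstructed for reference (notation mine):

```latex
% Linear regression: fit weights by least squares
\min_{\mathbf{w}} \|\mathbf{y} - X\mathbf{w}\|_2^2

% Ridge regression: L2 penalty shrinks weights, but rarely to exactly zero
\min_{\mathbf{w}} \|\mathbf{y} - X\mathbf{w}\|_2^2 + \lambda \|\mathbf{w}\|_2^2

% Lasso: L1 penalty drives some weights to exactly zero => a sparse model
\min_{\mathbf{w}} \|\mathbf{y} - X\mathbf{w}\|_2^2 + \lambda \|\mathbf{w}\|_1

% p-norm of a vector, the family both penalties come from
\|\mathbf{w}\|_p = \Big( \sum_i |w_i|^p \Big)^{1/p}
```

The L1 penalty is what produced the zero weights in the sparse-classifier sketch above.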