Paper reading (57): MetaNN: accurate classification of host phenotypes from metagenomic data using NN
Paper title: MetaNN: accurate classification of host phenotypes from metagenomic data using neural networks
Google Scholar citations: 2
Pages: 14
Published: 2019.06
Journal: BMC Bioinformatics
Authors: Chieh Lo & Radu Marculescu
Abstract:
Background
Microbiome profiles in the human body and environment niches have become publicly available due to recent advances in high-throughput sequencing technologies. Indeed, recent studies have already identified different microbiome profiles in healthy and sick individuals for a variety of diseases; this suggests that the microbiome profile can be used as a diagnostic tool in identifying the disease states of an individual. However, the high-dimensional nature of metagenomic data poses a significant challenge to existing machine learning models. Consequently, to enable personalized treatments, an efficient framework that can accurately and robustly differentiate between healthy and sick microbiome profiles is needed.
Results
In this paper, we propose MetaNN (i.e., classification of host phenotypes from Metagenomic data using Neural Networks), a neural network framework which utilizes a new data augmentation technique to mitigate the effects of data over-fitting.
Conclusions
We show that MetaNN outperforms existing state-of-the-art models in terms of classification accuracy for both synthetic and real metagenomic data. These results pave the way towards developing personalized treatments for microbiome related diseases.
Paper outline:
1. Background
1.1 Review of ML methods
1.1.1 Support vector machines (SVMs)
1.1.2 Regularized logistic regression (LR)
1.1.3 Gradient boosting (GB)
1.1.4 Random forests (RF)
1.1.5 Multinomial naïve bayes (MNB)
2. Methods
2.1 Acquisition and preprocessing of metagenomic data
2.2 Modeling the microbiome profile
2.3 Synthetic data generation
2.4 MetaNN framework
2.4.1 Multilayer perceptron (MLP)
2.4.2 Convolutional neural network (CNN)
2.4.3 Data augmentation
2.4.4 Dropout
3. Results
3.1 Experiments with synthetic data
3.1.1 Experimental setup
3.1.2 Classification performance metrics
3.1.3 Classification performance comparisons
3.2 Experiments on real data
3.2.1 Performance of ML models on real data
3.2.2 Classification of body sites
3.2.3 Classification of subjects
3.2.4 Classification of disease states
3.2.5 Classification performance comparisons
3.2.6 Neural network visualization
4. Discussion
5. Conclusion
Excerpts from the main text:
1. Biological Problem: What biological problems have been solved in this paper?
- Classification of host phenotypes from metagenomic data.
2. Main discoveries: What are the main discoveries in this paper?
- MetaNN outperforms existing state-of-the-art models in terms of classification accuracy for both synthetic and real metagenomic data.
- For the synthetic experiments, we have evaluated several combinations of measurement errors to demonstrate the applicability of MetaNN to different conditions.
- For real datasets, our MetaNN has average gains of 7% and 5% in terms of F1-macro and F1-micro scores, respectively.
- three contributions:
- we propose two NN models (i.e., MLP and CNN) for metagenomic data classification based on a new data augmentation method.
- we propose a new simulation method to generate synthetic data that considers several sources of measurement errors; synthetic data we develop can be freely used by the research community to benchmark classification performance of different ML models.
- our proposed MetaNN outperforms other models with significant average gains of 7% and 5% in terms of F1-macro and F1-micro scores, respectively.
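The F1-macro and F1-micro scores quoted above weight classes differently, which matters for imbalanced phenotype labels. The following is a minimal sketch of that difference; the function name `f1_scores` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (F1-macro, F1-micro) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, so every sample counts equally.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy, imbalanced example (hypothetical labels): 8 "gut" and 2 "skin" samples.
y_true = ["gut"] * 8 + ["skin"] * 2
y_pred = ["gut"] * 8 + ["gut", "skin"]
macro, micro = f1_scores(y_true, y_pred)
print(round(macro, 4), round(micro, 4))  # 0.8039 0.9
```

In this toy case the single error falls on the minority class, so F1-macro drops well below F1-micro, which for single-label problems equals plain accuracy.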
3. ML(Machine Learning) Methods: What are the ML methods applied in this paper?
- MetaNN: Neural Networks
- a neural network framework which utilizes a new data augmentation technique to mitigate the effects of data over-fitting.
- NNs usually require a large amount of training instances to obtain a reasonable classification accuracy and prevent over-fitting of training data.
- we are the first to propose NN models that can be used to classify metagenomic data with small (e.g., in the order of hundreds) microbial sample datasets; this is a challenging problem as the low count of samples can cause data over-fitting, hence degradation of the classification accuracy.
- To overcome the problem of data over-fitting, we first consider two different NN models, namely, a multilayer perceptron (MLP) and a convolutional neural network (CNN), with design restrictions on the number of hidden layers and hidden units. Second, we propose to model the microbiome profiles with a negative binomial (NB) distribution and then sample the fitted NB distribution to generate an augmented dataset of training samples. Additionally, we adopt the dropout technique to randomly drop units along with their connections from NNs during training [9]. Data augmentation and dropout can effectively mitigate data over-fitting as we demonstrate in our experiments and analyses.
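The NB-based augmentation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the method-of-moments fit, the per-class fitting, the clipping bounds, and the names `fit_nb_moments`, `augment_class`, and `augment_dataset` are all hypothetical choices made for this example.

```python
import numpy as np


def fit_nb_moments(counts):
    """Method-of-moments NB fit per OTU (column); returns (mean, dispersion r).

    Assumption: the paper may use a different estimator.
    """
    mean = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)
    # NB variance is mean + mean^2 / r, hence r = mean^2 / (var - mean).
    excess = np.maximum(var - mean, 1e-8)  # guard for under-dispersed OTUs
    r = np.clip(mean ** 2 / excess, 1e-3, 1e6)
    return mean, r


def augment_class(counts, n_new, rng):
    """Draw n_new synthetic samples from the NB fitted to one class."""
    mean, r = fit_nb_moments(counts)
    p = r / (r + mean)  # NumPy's (n, p) parameterization of the NB
    return rng.negative_binomial(r, p, size=(n_new, counts.shape[1]))


def augment_dataset(X, y, n_new_per_class, rng):
    """Append synthetic samples for every class label (assumed per-class fit)."""
    X_parts, y_parts = [X], [y]
    for label in np.unique(y):
        X_parts.append(augment_class(X[y == label], n_new_per_class, rng))
        y_parts.append(np.full(n_new_per_class, label))
    return np.vstack(X_parts), np.concatenate(y_parts)


rng = np.random.default_rng(0)
# Toy OTU table (hypothetical): 2 classes x 30 samples x 5 OTUs.
X = np.vstack([rng.negative_binomial(5, 0.2, size=(30, 5)),
               rng.negative_binomial(5, 0.5, size=(30, 5))])
y = np.repeat([0, 1], 30)
X_aug, y_aug = augment_dataset(X, y, n_new_per_class=100, rng=rng)
print(X_aug.shape, y_aug.shape)  # (260, 5) (260,)
```

Each OTU is fitted independently here, so any correlation between OTUs is ignored; that simplification is part of the sketch and may not match the paper.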
4. ML Advantages: Why are these ML methods better than the traditional methods in these biological problems?
- Challenges of host phenotype classification stem from the high dimensionality of metagenomic data. For instance, a typical dataset may contain a few hundred samples but thousands of OTUs (i.e., features); this large number of features can greatly challenge the classification accuracy of any method and compound the problem of choosing the important features to focus on.
- the high-dimensional nature and limited number of microbial samples can make such models easily over-fit the training set and thus result in poor classification of new samples.
- To remedy the data over-fitting problem, we have proposed data augmentation and dropout during training.
- NNs are well-suited for automatic feature selection and engineering; this makes NNs better than other ML models for classifying metagenomic data.
- augmenting the metagenomic data and using the dropout technique during training
- MLP is able to better deal with the sparse features since NNs can extract higher level features by utilizing hidden units in hidden layers.
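To illustrate how dropout interacts with a small MLP of the kind discussed above, here is a minimal forward-pass sketch using inverted dropout. The layer sizes, ReLU activation, softmax output, dropout rate of 0.5, and the function names are assumptions made for this example and are not the MetaNN architecture.

```python
import numpy as np


def dropout(h, rate, rng, training):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return h  # at inference time all units are kept unchanged
    keep = rng.random(h.shape) >= rate
    return h * keep / (1.0 - rate)


def mlp_forward(x, W1, b1, W2, b2, rate, rng, training):
    """One hidden layer (ReLU), dropout on hidden units, softmax output."""
    h = np.maximum(x @ W1 + b1, 0.0)
    h = dropout(h, rate, rng, training)
    logits = h @ W2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
# Hypothetical sizes: 4 samples, 50 OTUs, 16 hidden units, 3 classes.
x = rng.poisson(2.0, size=(4, 50)).astype(float)
W1 = rng.normal(0.0, 0.1, size=(50, 16))
b1 = np.zeros(16)
W2 = rng.normal(0.0, 0.1, size=(16, 3))
b2 = np.zeros(3)

probs_train = mlp_forward(x, W1, b1, W2, b2, rate=0.5, rng=rng, training=True)
probs_eval = mlp_forward(x, W1, b1, W2, b2, rate=0.5, rng=rng, training=False)
print(probs_eval.shape)  # (4, 3)
```

Rescaling the kept units by 1 / (1 - rate) during training keeps the expected activation unchanged, so no extra scaling is needed at inference.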
5. Biological Significance: What is the biological significance of these ML methods’ results?
- to assess the performance of different ML models, we propose a new simulation method that can generate synthetic microbial samples based on NB distributions which are commonly used to model the microbial count data
- the classification performance for the MLP classifier gets significantly better compared to all other existing methods for seven (out of eight) real datasets for two performance metrics commonly used to evaluate classification models: Area under the receiver operating characteristics (ROC) curve (AUC), and F1 score of class label predictions
6. Prospect: What are the potential applications of these machine learning methods in biological science?
- MetaNN has shown very promising results and better performance compared to existing ML methods.
7. My Questions (Optional)
- The RF algorithm possesses a number of appealing properties making it well-suited for classification of metagenomic data:
- (i) it is applicable when there are more predictors (features) than observations;
- (ii) it performs embedded feature selection and it is relatively insensitive to the large number of irrelevant features;
- (iii) it incorporates interactions between predictors;
- (iv) it is based on the theory of ensemble learning that allows the algorithm to learn accurately both simple and complex classification functions;
- (v) it is applicable for both binary and multicategory classification tasks; and
- (vi) according to its inventors, it does not require much fine tuning of hyperparameters and the default parameterization often leads to excellent classification accuracy.
- No wonder RF shows up so often...
- Is the dropout rate simply set to 0.5?