机器学习从蛋白序列预测蛋白分类(一)
一,背景与目标:
随着测序技术的快速发展,GenBank等数据库中存储了大量基因、蛋白序列信息,其中大部分尚无标注,如何充分利用GenBank等数据库现有数据资源,挖掘数据信息,为精准医疗、药物研发等生物大健康领域提供有价值的信息具有重要意义。目前这方面已有BLAST等生物信息技术可用,这里希望尝试机器学习技术在这方面的应用。
二,数据清洗与处理
数据来源于Kaggle竞赛(https://www.kaggle.com/abharg16/predicting-protein-classification)。
1,载入数据
导入需要用到的模块pandas、Numpy及sklearn的相关模块
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
pd.set_option('display.max_columns', None) #输出所有列
读入数据
#1) import datasets
df_seq=pd.read_csv('pdb_data_seq.csv')
df_char=pd.read_csv('pdb_data_no_dups.csv')
print('Datasets have been loaded...')
2,查看数据的基本信息
(1)df.head()默认显示数据集的前5行信息
#display data info
print(df_seq.head())
输出如下:
df_seq数据集包含‘StructureId’,‘chainId’,‘sequence’,‘residueCount’,'macromoleculeType’共5列
print(df_char.head())
输出如下:
df_char包含‘structureId’,‘classification’,‘experimentTechnique’,‘macromoleculeType’,‘residueCount’,
‘resolution’,‘structureMoleculeWeight’,‘crystallizatonMethod’,‘crystallizationTempK’,‘densityMattews’,
‘densityPercentSol’,‘pdbxDetails’,‘phValue’,‘publicationYear’’'共14列信息
从‘macromoleculeType’这一列的数据可以看到,数据集包含非蛋白的序列
(2)用df.info()查看数据类型、数据缺失情况
print(df_seq.info())
输出如下:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 467304 entries, 0 to 467303
Data columns (total 5 columns):
structureId 467304 non-null object
chainId 467294 non-null object
sequence 467276 non-null object
residueCount 467304 non-null int64
macromoleculeType 432487 non-null object
dtypes: int64(1), object(4)
memory usage: 17.8+ MB
None
这里可以看到有几点信息:
首先,df_seq数据集有467304条数据,每条数据有5个维度的信息;
第二,只有‘residueCount’一列为数值型数据(int64);
第三,只有‘structureId’和‘residueCount’两列无数据缺失(non-null)
print(df_char.info())
输出:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 141401 entries, 0 to 141400
Data columns (total 14 columns):
structureId 141401 non-null object
classification 141399 non-null object
experimentalTechnique 141401 non-null object
macromoleculeType 137636 non-null object
residueCount 141401 non-null int64
resolution 128589 non-null float64
structureMolecularWeight 141401 non-null float64
crystallizationMethod 96242 non-null object
crystallizationTempK 97039 non-null float64
densityMatthews 124724 non-null float64
densityPercentSol 124749 non-null float64
pdbxDetails 118534 non-null object
phValue 105110 non-null float64
publicationYear 117602 non-null float64
dtypes: float64(7), int64(1), object(6)
memory usage: 15.1+ MB
None
df_char数据集有141401条数据,每条数据有14个维度的信息
3,数据提取与合并
由于df_char数据集中包含非蛋白序列,本项目仅考察蛋白序列的分类,因此需要对数据做抽提与合并
(1)从两个文件(df_seq和df_char)抽提出蛋白序列
#filter for only protein macromoleculeType
protein_chair=df_char[df_char.macromoleculeType=='Protein']
protein_seq=df_seq[df_seq.macromoleculeType=='Protein']
print(protein_seq.head())
输出看一下:
structureId chainId sequence \
4 101M A MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...
7 102L A MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSE...
8 102M A MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...
11 103L A MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAK...
12 103M A MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...
residueCount macromoleculeType
4 154 Protein
7 165 Protein
8 154 Protein
11 167 Protein
12 154 Protein
(2)从两个表格里筛选需要的列进行数据合并
protein_seq文件和protein_char文件的‘structureId’是唯一的,所以可以作为合并的连接键。
因为我们需要sequence和对应的classification信息,而这两个信息分别处于protein_seq和protein_char两个文件中,所以需要将这两列合并到一个表格,属于表格横向合并,用pd.mergy()函数或者pd.join()函数,先从protein_seq和protein_char中提取出对应的sequence和classification
#select only necessary variables to join
protein_char=protein_char[['structureId','classification']]
protein_seq=protein_seq[['structureId','sequence']]
print(protein_seq.head(2))
print(protein_char.head(2))
输出:
structureId sequence
4 101M MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...
7 102L MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSE..
structureId classification
2 101M OXYGEN TRANSPORT
4 102L HYDROLASE(O-GLYCOSYL)
以‘structureId’为连接键用pd.mergy()合并数据
或者先将‘structureId’设为index,再用join()合并
# Join two datasets on structureId using pd.mergy
model_df=protein_seq.set_index('structureId').join(protein_char.set_index('structureId'))
print(model_df.head(2))
输出:
sequence \
structureId
101M MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...
102L MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSE...
classification
structureId
101M OXYGEN TRANSPORT
102L HYDROLASE(O-GLYCOSYL)
print(model_df.info())
输出看一下,合并后的表格共有346325条数据,每条数据2个维度,sequence和classification独有数据缺失
<class 'pandas.core.frame.DataFrame'>
Int64Index: 346325 entries, 0 to 346324
Data columns (total 2 columns):
structureId 346325 non-null object
sequence 346322 non-null object
classification 346324 non-null objec
dtypes: object(3)
memory usage: 10.6+ MB
None
(3)查看合并后数据缺失情况,并处理数据缺失:
#check NA
print(model_df.isnull().sum())
输出:
structureId 0
sequence 3
classification 1
dtype: int64
删除缺失数据所在行
model_df=model_df.dropna()
print(model_df.shape[0]) #查看删除后还剩多少条数据
输出:346321,原来有346325条数据,删除4条数据
(4)统计classification分类信息
counts=model_df['classification'].value_counts()
print(counts)
输出
HYDROLASE 46336
TRANSFERASE 36424
OXIDOREDUCTASE 34321
IMMUNE SYSTEM 15615
...
REPLICATION,OXIDOREDUCTASE 1
Virulence Factor 1
Name: classification, Length: 4468, dtype: int64
选择数量超过1000的数据构建模型
print(counts[(counts > 1000)])
输出的是超过1000的classification类型:
HYDROLASE 46336
TRANSFERASE 36424
...
TRANSLATION 1007
BIOSYNTHETIC PROTEIN 1006
Name: classification, dtype: int64
找出classification中统计的数量超过1000的分类名称
print(counts[(counts > 1000)].index)
输出:
Index(['HYDROLASE', 'TRANSFERASE', 'OXIDOREDUCTASE', 'IMMUNE SYSTEM', 'LYASE',
'HYDROLASE/HYDROLASE INHIBITOR', 'TRANSCRIPTION', 'VIRAL PROTEIN',
'TRANSPORT PROTEIN', 'VIRUS', 'SIGNALING PROTEIN', 'ISOMERASE'
数量超过1000的分类名称存储在types中,且数据类型转换为ndarray(即numpy数组类型)
types=np.asarray(counts[(counts > 1000)].index)
查看types中的分类名称是否在‘classification’所在列,isin()返回对应index的True或False
print(model_df.classification.isin(types))
输出:
structureId
101M False
102L False
...
117E True
117E True
118L False
从model_df中抽取满足条件的数据(数量超1000)
data=model_df[model_df.classification.isin(types)]
print(data.head(2))
输出看一下是什么样的
sequence \
structureId
10GS PPYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKAS...
10GS PPYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKAS...
classification
structureId
10GS TRANSFERASE/TRANSFERASE INHIBITOR
10GS TRANSFERASE/TRANSFERASE INHIBITOR
至此,最终用于模型构建的数据集已经抽提结束,看一下一共多少数据
print(data.shape[0])
输出:278866
未完待续…
以下内容见机器学习从蛋白序列预测蛋白分类(二)
三,特征提取
四,模型构建
五,模型评估