敏感性、特异性、假阳性、假阴性(sensitivity and specificity)
医学、机器学习等等,在统计结果时时长会用到这两个指标来说明数据的特性。
定义
敏感性:在金标准判断有病(阳性)人群中,检测出阳性的几率。真阳性。(检测出确实有病的能力)
特异性:在金标准判断无病(阴性)人群中,检测出阴性的几率。真阴性。(检测出确实没病的能力)
假阳性率:得到了阳性结果,但这个阳性结果是假的。即在金标准判断无病(阴性)人群中,检测出为阳性的几率。(没病,但却检测结果说有病),为误诊率。
假阴性率:得到了阴性结果,但这个阴性结果是假的。即在金标准判断有病(阳性)人群中,检测出为阴性的几率。(有病,但却检测结果说没病),为漏诊率。
计算方法
Sensitivity and specificity:完整定义
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
|
True Positive (真正, TP)被模型预测为正的正样本;可以称作判断为真的正确率 True Negative(真负 , TN)被模型预测为负的负样本 ;可以称作判断为假的正确率 False Positive (假正, FP)被模型预测为正的负样本;可以称作误报率 False Negative(假负 , FN)被模型预测为负的正样本;可以称作漏报率 True Positive Rate(真正率 , TPR)或灵敏度(sensitivity) TPR = TP /(TP + FN) 正样本预测结果数 / 正样本实际数 True Negative Rate(真负率 , TNR)或特指度(specificity) TNR = TN /(TN + FP) 负样本预测结果数 / 负样本实际数 False Positive Rate (假正率, FPR) FPR = FP /(FP + TN) 被预测为正的负样本结果数 /负样本实际数 False Negative Rate(假负率 , FNR) FNR = FN /(TP + FN) 被预测为负的正样本结果数 / 正样本实际数 |
假阳性率=假阳性人数÷金标准阴性人数
即: 假阳性率=b÷(b+d)
金标准 | 金标准 | |||
阳性(+) | 阴性(-) | 合计 | ||
某筛检方法 | 阳性(+) | a | b | a+b |
某筛检方法 | 阴性(-) | c | d | c+d |
合计 | a+c | b+d | N |
公式为:假阳性率=b/(b+d)×100%
(b:筛选为阳性,而标准分类为阴性的例数;d:阴性一致例数)
假阴性率=假阴性人数÷金标准阳性人数
即: β=c÷(a+c)
终于要用到这个玩意了,很激动,主要统计假阴假阳性率。
我的任务:
1. 评估Pacbio MHC variation calling 结果(CCS/non-CCS)与Hiseq数据结果的一致性。
2. 分别在不同深度梯度的区域完成以上评估,推断PB MHC做variation calling的最低深度。
这里要将一个位点分为SNP、REF 和 LowQual,然后只去 SNP 和 REF 进行统计。
这是我一下午写出来的统计代码:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
|
#!/usr/bin/env python # Author: LI ZHIXIN import sys
import pysam
from collections import OrderedDict
def classify_DP(depth):
if depth > 101 :
return 21
return ((depth - 1 ) / / 5 + 1 )
def parse_rec(rec):
sample = list (rec.samples)[ 0 ]
# filter the Invalid line
if not ( 'GQ' or 'GT' or 'DP' ) in rec.samples[sample].keys() or len (rec.alleles) < = 1 :
# continue
return 1 , "LowQual" , rec.pos
# filter the LowQual
if rec.samples[sample][ 'GQ' ] < 30 :
return rec.samples[sample][ 'DP' ], "LowQual" , rec.pos
# filter the indel
flag = 0
for one in rec.alleles:
if len (one) ! = len (rec.ref):
flag = 1
if flag = = 1 :
return rec.samples[sample][ 'DP' ], "LowQual" , rec.pos
if rec.samples[sample][ 'GT' ] ! = ( 0 , 0 ): # rec.qual > 30
# variation_dict[rec.pos] = ["snp", rec.alleles]
return rec.samples[sample][ 'DP' ], "snp" , rec.pos
elif rec.samples[sample][ 'GT' ] = = ( 0 , 0 ):
# variation_dict[rec.pos] = ["ref", rec.alleles]
return rec.samples[sample][ 'DP' ], "ref" , rec.pos
def read_gvcf(gvcf_file_path):
variation_dict = OrderedDict()
for i in range ( 1 , 22 ):
variation_dict[i] = {}
for j in ( 'LowQual' , 'snp' , 'ref' ):
variation_dict[i][j] = []
# pos_list = []
gvcf_file = pysam.VariantFile(gvcf_file_path)
for rec in gvcf_file.fetch( 'chr6' , 28477796 , 33448354 ):
DP, pos_type, pos = parse_rec(rec)
if DP < 1 or DP > 20 :
continue
# DP = classify_DP(DP)
variation_dict[DP][pos_type].append(pos)
# print(pos, DP, pos_type)
gvcf_file.close()
# return variation_dict, pos_list
return variation_dict
def read_hiseq_gvcf(gvcf_file_path):
variation_dict = OrderedDict()
# for i in range(1,22):
# variation_dict[i] = {}
for j in ( 'LowQual' , 'snp' , 'ref' ):
variation_dict[j] = []
# pos_list = []
gvcf_file = pysam.VariantFile(gvcf_file_path)
for rec in gvcf_file.fetch( 'chr6' , 28477796 , 33448354 ):
DP, pos_type, pos = parse_rec(rec)
DP = classify_DP(DP)
variation_dict[pos_type].append(pos)
# print(pos, DP, pos_type)
gvcf_file.close()
# return variation_dict, pos_list
return variation_dict
def show_dict_diff_DP(Hiseq_unified_variation_dict, PB_non_CCS_variation_dict, outf, outf2):
for DP in range ( 1 , 21 ):
Hiseq_snp = set (Hiseq_unified_variation_dict[ 'snp' ])
Hiseq_ref = set (Hiseq_unified_variation_dict[ 'ref' ])
Hiseq_lowqual = set (Hiseq_unified_variation_dict[ 'LowQual' ])
PB_snp = PB_non_CCS_variation_dict[DP][ 'snp' ]
PB_ref = PB_non_CCS_variation_dict[DP][ 'ref' ]
PB_lowqual = PB_non_CCS_variation_dict[DP][ 'LowQual' ]
total = set (PB_snp + PB_ref + PB_lowqual)
Hiseq_snp = total & Hiseq_snp
Hiseq_ref = total & Hiseq_ref
Hiseq_lowqual = total & Hiseq_lowqual
PB_snp = set (PB_snp)
PB_ref = set (PB_ref)
PB_lowqual = set (PB_lowqual)
a = len (Hiseq_snp & PB_snp)
b = len (Hiseq_ref & PB_snp)
c = len (Hiseq_lowqual & PB_snp)
d = len (Hiseq_snp & PB_ref)
e = len (Hiseq_ref & PB_ref)
f = len (Hiseq_lowqual & PB_ref)
g = len (Hiseq_snp & PB_lowqual)
h = len (Hiseq_ref & PB_lowqual)
i = len (Hiseq_lowqual & PB_lowqual)
Low_total = (g + h + i) / (a + b + c + d + e + f + g + h + i)
if (a + b) = = 0 :
PPV = "NA"
else :
PPV = a / (a + b)
PPV = "%.4f" % (PPV)
if (a + d) = = 0 :
TPR = "NA"
else :
TPR = a / (a + d)
TPR = "%.4f" % (TPR)
print ( str (DP) + " :\n" , a,b,c, "\n" ,d,e,f, "\n" ,g,h,i, "\n" , file = outf2, sep = '\t' , end = '\n' )
print (DP, TPR, PPV, "%.4f" % Low_total, file = outf, sep = '\t' , end = '\n' )
with open ( "./depth_stat.txt" , "w" ) as outf:
print ( "Depth" , "TPR" , "PPV" , "Low_total" , file = outf, sep = '\t' , end = '\n' )
outf2 = open ( "raw.txt" , "w" )
Hiseq_unified_variation_dict = read_hiseq_gvcf( "./hiseq_call_gvcf/MHC_Hiseq.unified.gvcf.gz" )
PB_non_CCS_variation_dict = read_gvcf( "./non_CCS_PB_call_gvcf/MHC_non_CCS.unified.gvcf.gz" )
show_dict_diff_DP(Hiseq_unified_variation_dict, PB_non_CCS_variation_dict, outf, outf2)
outf2
|
又碰到一个高级python语法:在双层循环中如何退出外层循环? 我用了一个手动的flag,有其他好方法吗?
如何统计下机数据的覆盖度和深度?当然要比对之后才能统计,而且还要对比对做一些处理。
在计算一个位点是否是SNP、indel、Ref时,不仅要考虑ref、alts、qual、GQ,而且必须要把GT、DP考虑在内,所以说还是比较复杂的。
最后如何分析第二个问题,call variation的最低深度?
统计不同深度下的假阴假阳性率,看在什么深度下其达到饱和。