Classification Datasets: Medical Data

Introduction

The purpose of this project is to combine the principles of data science and medicine to develop a model that can predict heart disease. The advantage of such a model is that it is easily interpretable and consistent with medical literature, unlike many machine learning models whose results are hard to interpret. This approach helped me build a model that, by flagging just 34% of the population, identifies heart disease with an 84% hit rate.

According to WHO, heart diseases (broadly known as cardiovascular diseases or CVDs) claim an estimated 17.9 million lives each year, which is about 31% of all deaths worldwide. This ranks CVDs as the number one cause of death globally. [1]

Now, what if we could build a meaningful model that could predict the likelihood of heart disease in a patient, just based on a few parameters? The word ‘meaningful’ here is very important. We don’t necessarily want a model that will give us the highest accuracy rate, but rather one which incorporates significant features and can be explained from a medical point of view. For this project, I used Google Colab to develop my models.

Dataset

I worked with the ‘Heart Disease Cleveland UCI’ dataset from Kaggle (the dataset was originally posted under the title ‘Heart Disease Data Set’ by UCI in their ML repository). The Kaggle dataset has data on 297 patients, with 13 features and 1 binary target variable called ‘condition’ (0 = heart disease absent, 1 = heart disease present). The detailed description of all 14 attributes has been included here.

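As a rough illustration of the layout described above, the sketch below reads a CSV with 13 feature columns and the ‘condition’ target into plain Python lists. The column names are the ones used in the Kaggle file; the helper name `load_heart_rows` and the two sample rows are made up for this example, and the real file would be opened with `open(...)` instead of the in-memory stand-in.

```python
import csv
import io

# Column names as published with the Kaggle "Heart Disease Cleveland UCI" file.
FEATURES = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
            "thalach", "exang", "oldpeak", "slope", "ca", "thal"]
TARGET = "condition"

def load_heart_rows(handle):
    """Read the CSV from an open text handle and return (X, y) as lists."""
    reader = csv.DictReader(handle)
    X, y = [], []
    for row in reader:
        X.append([float(row[name]) for name in FEATURES])
        y.append(int(row[TARGET]))
    return X, y

# Tiny in-memory stand-in for the real file (the values are made up).
sample = io.StringIO(
    ",".join(FEATURES + [TARGET]) + "\n"
    "69,1,0,160,234,1,2,131,0,0.1,1,1,0,0\n"
    "60,1,3,130,206,0,2,132,1,2.4,1,2,2,1\n"
)
X, y = load_heart_rows(sample)
print(len(X), len(X[0]), y)
```

With the full dataset, `len(X)` should come out as 297 and every row should hold 13 feature values.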

Step 1: What does Medical Literature have to say?

Medical research identifies 5 factors as the most influential in predicting heart disease.

Age: Increasing age adds to the risk of developing heart disease. [2]

Sex: Males are at a higher risk of heart disease than pre-menopausal females. The risk is comparable between males and post-menopausal females. [3]

Serum cholesterol levels: Increased serum cholesterol levels contribute to the development of heart disease. [4]

Blood pressure: Hypertension (high blood pressure) is a major risk factor for the development of heart disease. [5]

Chest pain: Approximately 25–50% of patients with heart disease suffer from silent myocardial ischemia (SMI), which means that they do not feel any chest discomfort. Hence, even an absence of chest pain can indicate the presence of heart disease. [6]

Luckily, all 5 factors above are included as variables in our dataset! Let’s take a quick look at how they are distributed.

Observations

The ages of the patients are widely distributed, with an average of about 55 years.
Overall, there are more males (class = 1) in this study than females (class = 0). From this plot, it is clear that males are more prone to heart disease than females. There are more males who have heart disease than those who don’t. In contrast, there are far fewer females who have heart disease than those who don’t.
Serum cholesterol levels are widely distributed, with an average of about 250 mg/dl.
The distribution of resting blood pressure is rather irregular, with an average of about 130 mm Hg.
Among those who suffer from heart disease, most patients are asymptomatic (class = 3). Hence, the data supports medical literature. Chest pain in the form of typical angina (class = 0), atypical angina (class = 1), or non-anginal pain (class = 2) is mostly reported by patients who do not have heart disease.

Step 2: Running a Logistic Regression Model with all 13 features

My first instinct was to run a Logistic Regression model with all 13 features to check whether the most inclusive model is also the most medically meaningful one. I chose logistic regression since it can be easily explained. The model gave me an overall 86.7% accuracy on the test data and correctly predicted the presence of heart disease (predicted class = 1) in 82.9% of the patients who had it. Pretty good, right? Well, I was pretty disappointed when I took a look at the coefficients of each variable.

The first coefficient (-0.03121) corresponds to the age variable. Just a few moments ago, we talked about how medical literature says that increasing age adds to the risk of developing heart disease. In that case, shouldn’t age have a positive coefficient? Similarly, the coefficient for ‘fbs’, or fasting blood sugar, is negative (-0.41832). According to medical literature, the risk of heart disease greatly increases when fbs > 100 mg/dL [7]. In this dataset, fbs is encoded as class = 1 if fbs > 120 mg/dL and class = 0 otherwise. Hence, the sign of the coefficient should ideally be positive.

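To see why the sign matters, it helps to convert a logistic regression coefficient into an odds ratio: exponentiating the coefficient gives the factor by which the odds of heart disease are multiplied when the feature rises by one unit. The sketch below does this for the two coefficients quoted above, assuming the features were not rescaled before fitting (the text does not say either way); the helper name `odds_ratio` is made up for illustration.

```python
import math

# Coefficients quoted in the text for the 13-feature model.
coef_age = -0.03121   # per additional year of age (assuming unscaled age)
coef_fbs = -0.41832   # fbs class 1 (> 120 mg/dL) versus class 0

def odds_ratio(coefficient, delta=1.0):
    """Multiplicative change in the odds when the feature rises by `delta`."""
    return math.exp(coefficient * delta)

print(round(odds_ratio(coef_age), 3))       # one extra year of age
print(round(odds_ratio(coef_age, 10), 3))   # ten extra years of age
print(round(odds_ratio(coef_fbs), 3))       # high versus normal fasting blood sugar
```

Every value printed is below 1, so under this assumption the model would be saying that both older age and high fasting blood sugar lower the odds of heart disease, which is the opposite of what the literature reports.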

From a medical perspective, it would be incorrect to accept a model that gets the signs of its coefficients wrong, even if it is highly accurate.

Step 3: Running a New Logistic Regression Model in accordance with Medical Literature

Now that I had built a model based purely on machine learning, I decided to try a model that would be meaningful instead. I wanted to construct a model on a subset of the 13 features, keeping only those features that ended up with signs consistent with medical literature. As a result, I ended up with a logistic regression model with the 5 most important features according to medical literature: age, sex, serum cholesterol levels (chol), resting blood pressure (trestbps), and chest pain (cp).

The model gave me an overall 74.4% accuracy on the test data and correctly predicted the presence of heart disease (predicted class = 1) in 75.6% of the patients who had it. The coefficients of the features and intercept of the model are shown below.

All the coefficients are positive, just as we expected! Although this model may not be the most accurate, it is meaningful and can be easily explained by any medical practitioner.

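The sign check described in this step can be sketched with scikit-learn as follows. This is not the author's code: the data are synthetic stand-ins generated so that risk rises with each of the five features, the 90-patient test split is taken from the figures quoted later in the article, and everything else (random seed, `max_iter`, the `signs_ok` dictionary) is an assumption made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

FIVE = ["age", "sex", "chol", "trestbps", "cp"]

# Synthetic stand-in data (NOT the Cleveland data): risk rises with every feature.
rng = np.random.default_rng(0)
n = 297
X = np.column_stack([
    rng.normal(55, 9, n),      # age in years
    rng.integers(0, 2, n),     # sex (1 = male)
    rng.normal(250, 50, n),    # serum cholesterol, mg/dl
    rng.normal(130, 18, n),    # resting blood pressure, mm Hg
    rng.integers(0, 4, n),     # chest pain class 0..3
])
logit = (0.05 * (X[:, 0] - 55) + 1.0 * X[:, 1] + 0.005 * (X[:, 2] - 250)
         + 0.02 * (X[:, 3] - 130) + 0.8 * (X[:, 4] - 1.5) - 0.5)
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Hold out 90 patients, the test-set size implied by the article's figures.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=90, random_state=0, stratify=y)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Report each coefficient and whether its sign matches the expected direction.
signs_ok = {name: bool(c > 0) for name, c in zip(FIVE, model.coef_[0])}
for name, c in zip(FIVE, model.coef_[0]):
    print(f"{name:9s} {c:+.4f}  positive={signs_ok[name]}")
print("intercept:", round(float(model.intercept_[0]), 4))
print("test accuracy:", round(model.score(X_test, y_test), 3))
```

On the real data, `X` would hold the five columns from the dataset, and any feature whose entry in `signs_ok` is `False` would be a candidate for the kind of review described above.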

Step 4: How can we make the New Model more Reliable?

The new logistic regression model with a subset of variables is meaningful but falls short of the prediction power of the 13-feature model (82.9% vs 75.6%).

To overcome these shortcomings, we must complement it with another ML model that incorporates all 13 features. The combination will do a better job of predicting the presence of heart disease in patients.

I developed 5 ML models with all 13 features and their performance is summarized in the table below.

Although Logistic Regression happens to be the most accurate model out of the five, I consciously ignored it based on reasons outlined above (non-intuitive signs of variables). Our best bet is to choose the Random Forest model, which gives an overall 83.33% accuracy on the test data and correctly predicts the presence of heart disease (predicted class = 1) in 78.05% of patients. The Random Forest (RF) feature importance chart is shown below.

The interesting thing is that the top 4 variables according to RF are maximum heart rate achieved (thalach), chest pain (cp), the number of major vessels colored by fluoroscopy (ca), and heart defects (thal), while important features from our logistic model (age, sex, cholesterol, resting blood pressure) are towards the bottom. This tells us that the Random Forest model should not be trusted on its own, but should rather be used in conjunction with our 5-feature logistic regression model for good prediction results.

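A feature importance ranking like the one discussed above can be produced with scikit-learn's `feature_importances_` attribute. The sketch below is only an illustration on synthetic data in which, by construction, just `cp` and `thalach` carry signal; the number of trees and the random seed are assumptions, not the author's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
            "thalach", "exang", "oldpeak", "slope", "ca", "thal"]

# Synthetic stand-in data (NOT the Cleveland data): only cp and thalach matter.
rng = np.random.default_rng(0)
n = 297
X = rng.normal(size=(n, len(FEATURES)))
signal = 1.5 * X[:, FEATURES.index("cp")] - 1.5 * X[:, FEATURES.index("thalach")]
y = (signal + rng.normal(scale=0.5, size=n) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank features by impurity-based importance (the values sum to 1).
ranking = sorted(zip(FEATURES, forest.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, importance in ranking:
    print(f"{name:9s} {importance:.3f}")
```

Note that these importances are unsigned: they say how much a feature is used by the trees, not whether it raises or lowers risk, so they cannot be checked against the direction reported in medical literature the way logistic coefficients can.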

Step 5: Combining the Two Models and Final Recommendations

This section describes how I combined the predictions of both models to come up with the best prediction. I created a table comparing the model predictions and the actual conditions for all patients truly afflicted with heart disease in our test dataset. In the notation used in this section, LH and LL mean that the logistic model predicts heart disease or no heart disease, and RH and RL mean the same for the Random Forest model.

Interesting! If we relied solely on our logistic model to predict heart disease (LH), we would get a hit rate of 31/44 or 70.5%. On the other hand, if we relied only on our Random Forest model (RH), we would get a hit rate of 31/36 or 86.11%. The RF figure is high because every patient in class LLRH had heart disease (100%), which is probably a result of the small sample size in this class, i.e., only 5 patients.

If we rely on a combination of the two models (LHRH), we end up selecting only 31 patients (31/90 = 34% of our test sample), among whom the occurrence of heart disease is 26/31 or 84%. Hence, we have been able to improve the hit rate from 50% (a 50–50 chance of being diagnosed with heart disease) to 84%, while staying consistent with medical literature. This model is efficient and does a good job of prediction.

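The grouping behind these figures can be sketched in a few lines: label each patient by what the two models predict, then compute the share of each group that actually has heart disease. The function names and the eight toy patients below are made up for illustration; only the final line uses counts reported in the text.

```python
from collections import Counter

def group_label(logit_pred, rf_pred):
    """LH/LL = logistic predicts disease / no disease; RH/RL = same for RF."""
    return ("LH" if logit_pred == 1 else "LL") + ("RH" if rf_pred == 1 else "RL")

def hit_rates(logit_preds, rf_preds, actual):
    """Return {group: (n_patients, n_with_disease, hit_rate)} per group."""
    sizes, hits = Counter(), Counter()
    for lp, rp, truth in zip(logit_preds, rf_preds, actual):
        g = group_label(lp, rp)
        sizes[g] += 1
        hits[g] += truth
    return {g: (sizes[g], hits[g], hits[g] / sizes[g]) for g in sizes}

# Toy predictions for eight patients (made up, only to exercise the function).
logit_preds = [1, 1, 1, 0, 0, 1, 0, 0]
rf_preds    = [1, 1, 0, 1, 0, 1, 0, 0]
actual      = [1, 1, 0, 1, 0, 0, 0, 1]
table = hit_rates(logit_preds, rf_preds, actual)
for group, (n, k, rate) in sorted(table.items()):
    print(group, n, k, round(rate, 2))

# The figures quoted in the text, recomputed from the reported counts.
print(round(31 / 90, 3), round(26 / 31, 3))
```

Screening only the LHRH group trades coverage for precision: fewer patients are flagged, but a larger share of those flagged actually have the disease.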

As for the classes where the logistic model and the RF model give conflicting predictions (LHRL and LLRH), more research is needed.

Conclusion

In the medical field, the most accurate model may not be the most meaningful, and vice versa. Models like these are an example of how we can combine principles of data science, in the form of machine learning models, with medical literature to get the best possible result. The advantage of this model is that it is easily interpretable and consistent with medical literature. As for predicting heart disease, it improves the hit rate from 50% to 84% while flagging just 34% of the population. This model can be leveraged for telemedicine, particularly in underdeveloped countries where access to cardiologists is limited. Future work can involve collaborating with cardiologists to test the model on other medical datasets and check its validity.

The code for this project can be found here.

Translated from: https://towardsdatascience.com/combining-medicine-and-data-science-to-predict-heart-disease-f2e0ad92485f