- Open Access
- Authors : Hana H. Alalawi , Manal S. Alsuwat
- Paper ID : IJERTV10IS070091
- Volume & Issue : Volume 10, Issue 07 (July 2021)
- Published (First Online): 14-07-2021
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Detection of Cardiovascular Disease using Machine Learning Classification Models
Hana H. Alalawi
College of Computer Science and Information System Umm Al-Qura University
Makkah, Saudi Arabia
Manal S. Alsuwat
College of Computer Science and Information System Umm Al-Qura University
Makkah, Saudi Arabia
Abstract: Cardiovascular disease (CVD) is an all-encompassing term for conditions affecting the heart or blood vessels. It is commonly associated with an accumulation of fatty deposits within the arteries (atherosclerosis) and an increased risk of developing blood clots. Cardiovascular disease is considered one of the largest causes of morbidity and mortality in the world's population. Predicting and diagnosing the disease is a critical challenge in clinical data analysis, and it helps health care providers prevent people from developing the disease and save lives. Healthcare industries collect massive amounts of data that contain information related to heart disease diagnosis, which is useful in making effective decisions. Furthermore, AI algorithms and deep neural networks can be used to analyze and diagnose heart disease. This project aims to automatically detect cardiovascular disease on two datasets using a deep learning network and a variety of machine learning classification models. Performance is evaluated based on the accuracy, precision, recall, and F1-score of each model. The Random Forest model achieved the highest performance on the Heart Disease Dataset at 94% accuracy, while the Gradient Boosting model achieved the highest performance on the Cardiovascular Disease Dataset at 73% accuracy, 73% recall, 73% F1-score, and 74% precision.
Keywords: Cardiovascular disease, Artificial Neural Network, Classification algorithms.
-
INTRODUCTION
Cardiovascular disease (CVD) is a family of disorders that includes coronary heart disease, cerebrovascular disease, peripheral arterial disease, and congenital heart disease. Four out of five CVD deaths are due to heart attacks and strokes. According to the World Health Organization (WHO) [1], millions of patients have died from heart diseases around the world, which makes cardiovascular diseases (CVDs) the number one cause of death globally. Cardiovascular diseases are a group of diseases that cause failure of the human heart and blood vessels. They have many symptoms, such as chest pain, fast heart rate, difficulty breathing, and dizziness.
CVDs are as complex as the human body, and they are caused by different risk factors such as obesity, high blood pressure, and high cholesterol. Diagnosing heart disease requires specialized cardiologists and complicated procedures and tests to determine an accurate and efficient treatment. Cardiovascular disease can be prevented by early diagnosis, followed by healthy eating, exercising, and avoiding alcohol consumption. In less developed countries, patients suffering from cardiovascular disease are at times diagnosed with a severe delay, or they are transported over long distances unnecessarily, and the resulting increase in the cost of travel and treatment is a burden on them.
To determine whether or not a patient has a heart disorder, AI has been combined with other sciences to solve real-life problems automatically. Nowadays, AI algorithms are used in disease diagnosis and detection [2]. AI in medicine helps doctors make decisions in some complicated cases and predict which patients are at high risk of developing a disease such as heart disease or cardiovascular disease [3].
Various studies have demonstrated automatic detection and diagnosis of the severity of heart diseases using different machine learning techniques, such as integrating multiple classification algorithms and boosting algorithms to build strong automated prediction systems. In this research, seven models are used with the cardiovascular disease (CVD) dataset: Support Vector Machine, K-Nearest Neighbor, Logistic Regression, Decision Tree, Naïve Bayes, Random Forest, and Artificial Neural Network, along with some ensemble techniques. Diagnosis by means of artificial intelligence is useful where medical personnel are lacking; it is also cheap and fast, and it can help doctors carry out the diagnosis process more efficiently and save patients' lives in early stages [4], [5], [6], [7].
Section II reviews the papers related to the topic. Section III describes the datasets and their pre-processing. Section IV discusses the research methodology and approaches, and Section V presents the evaluation metrics. Section VI presents the results of the experiments, and Section VII concludes the research.
-
-
RELATED WORK
Mohammad et al. [8] developed a model to predict cardiovascular disease utilizing several combinations of features. The dataset came from the Cleveland database in the UCI Machine Learning Repository. Seven classification models were used: Decision Tree, Logistic Regression (LR), Support Vector Machine (SVM), Neural Network, Vote, Naïve Bayes, and k-NN. These models were applied to a different combination of features each time. The results show that the proposed model achieves an accuracy of 87.4% in heart disease prediction.
On the other hand, Princy et al. [9] proposed a solution for predicting cardiovascular diseases using six machine learning classification algorithms: Decision Tree, K-Nearest Neighbor, Logistic Regression, Naïve Bayes, Random Forest, and Support Vector Machine. The models were applied to the cardiovascular disease dataset [10]. The research reports that the decision tree provides the highest accuracy compared to the other classifiers; the DT model achieved 78% accuracy in predicting the disease based on the patient's medical records.
Forecast accuracy still needs to be improved. Consequently, the research of Bashir et al. [11] focuses on feature selection to improve accuracy, using multiple heart disease datasets in the experimentation. The study used the RapidMiner tool, and the classifiers were Decision Tree, Logistic Regression, Naïve Bayes, Logistic Regression SVM, and Random Forest. The results showed an improvement in accuracy. In another study, Sabab et al. [12] proposed a method that uses data mining techniques to optimize the diagnosis of heart diseases, in addition to applying feature selection methods to improve the accuracy of the classification models. The authors used SVM, C4.5 DT, and Naïve Bayes classifiers to detect cardiovascular disease, and the models achieved accuracies of 87.8%, 86.80%, and 79.9%, respectively.
Zunaidi et al. [13] used the Heart Disease Dataset to find the better prediction method for heart disease among five classification methods: SMO, Decision Tree, Multilayer Perceptron, Decision Stump, and Random Forest. The results show that the Multilayer Perceptron neural network is the most suited for early prediction of heart diseases, with an accuracy of 85.03%. Similarly, on the same dataset, Khemphila et al. [14] used a Multi-Layer Perceptron (MLP) to diagnose heart disease, and the accuracy was 80.17%.
Kavitha et al. [15] used three machine learning models to predict heart disease: random forest, decision tree, and a hybrid of random forest and decision tree. The dataset was the Cleveland heart disease dataset. Experimental outcomes show that the best model was the hybrid model, with an accuracy of 88.7%. Princy et al. [16] classified the risk level of heart disease using data mining methods like Naïve Bayes, ID3, and Neural Network while choosing different attributes. The authors found that the accuracy depends on the number of features, with accuracy increasing as the number of features increased. This research achieved an accuracy of 80.6%. Likewise, Gawali et al. [17] detected the risk rate of heart disease using two data mining methods, Naïve Bayes and K-Means, and reported an accuracy of 84%.
Likewise, Maiga et al. [18] demonstrated four ML algorithms for CVD detection based on factors such as cholesterol level, BMI, and blood pressure. The proposed experiment implements Logistic Regression, Random Forest, KNN, and Naïve Bayes on a UCI health record dataset. The Random Forest classifier achieved the most accurate prediction outcomes: 73% accuracy, 65% sensitivity, and 80% specificity. Mohan et al. [19] used machine learning techniques to find significant features for heart disease prediction in the dataset. Their approach combines the characteristics of the Linear Method (LM) and Random Forest (RF), a combination called the hybrid HRFLM. The result was quite accurate, with an accuracy of 88.7%. Haq et al. [20] developed a machine-learning-based system to diagnose heart disease using the heart disease dataset. The research used Relief, mRMR, and LASSO as feature selection algorithms and common machine learning classifiers: logistic regression, ANN, SVM, K-NN, DT, and NB. The proposed system achieved its best accuracy with the logistic regression model, at 89% with 10-fold cross-validation.
Furthermore, Arabasadi et al. [21] introduced an algorithm for diagnosing and detecting coronary artery disease using an improved neural network on the Z-Alizadeh Sani dataset. The proposed method uses genetic algorithms to initialize the network's weights, which enhanced the performance of the network by 10%. As a result, the model reached an accuracy of 93.58%.
On the other hand, Latha et al. [22] introduced a new technique to improve the performance of classification models by implementing ensemble methods to provide precise prediction outcomes. Implementing ensemble techniques improved the accuracy of the weak classifiers by a further 7%. The proposed approach uses bagging, boosting, stacking, and majority voting with MLP, Naïve Bayes, Random Forest, and C4.5 classifiers. The best improvement in accuracy was obtained with majority voting. Meanwhile, Miao et al. [4] developed a method to predict coronary diseases in early stages using advanced ensemble techniques. Four different heart datasets were used to train and test the performance of the models. The proposed model showed significant improvement, with an average accuracy of 85.27%.
-
DATASETS
This research uses two datasets, [23] and [10], for heart disease detection and diagnosis with nine classification algorithms.
-
Cardiovascular Disease Dataset
In this research, the cardiovascular disease dataset is used to diagnose and detect heart disease and to compare the results with previous research. It contains information and medical records for a large number of patients. The dataset was obtained from Kaggle and draws on three different sources: Objective, where the data are collected as facts; Examination, the results of medical tests; and Subjective, the information provided by the patient. The dataset is used for the training, testing, and validation sets. The data are publicly available on the website [10].
The shape of the cardiovascular disease dataset is (68783, 12); it is a cleaned version of the CVD dataset, as follows:
Table I: Cardiovascular disease dataset

| Feature | Information |
| --- | --- |
| Age | The patient's age in years |
| Gender | Binary categorical value (1: male, 2: female) |
| Height | Height of the patient |
| Weight | Weight of the patient |
| AP_HIGH | Systolic blood pressure |
| AP_LOW | Diastolic blood pressure |
| CHOLESTEROL | Cholesterol level in the blood (1: normal, 2: above normal, 3: well above normal) |
| GLUCOSE | Categorical value of the blood sugar level (1: normal, 2: above normal, 3: well above normal) |
| SMOKE | Smoking (0: No, 1: Yes) |
| ALCOHOL | Alcohol intake (0: No, 1: Yes) |
| PHYSICAL_ACTIVITY | Physical activity type |
| CARDIO_DISEASE | Target value indicating the presence or absence of cardiovascular disease |
Table II: Class distribution of the cardiovascular disease dataset

| Class | Counts |
| --- | --- |
| 0 | 34742 |
| 1 | 34041 |
-
Heart Disease Dataset
The second dataset in this research is the Heart Disease Dataset [23]. It is a combination of the most popular datasets for heart disease prediction: Cleveland, Hungarian, Switzerland, Long Beach VA, and the Statlog (Heart) Data Set. It contains 1190 records, 10 features, and one target. The dataset was gathered by Anu Siddhartha in 2020. The Heart Disease Dataset is a public dataset available on the IEEE DataPort website [24].
Table III: Heart disease dataset

| Feature | Information |
| --- | --- |
| Age | The patient's age in years |
| Sex | Gender of the patient (1: Male, 0: Female) |
| Chest Pain Type | Type of chest pain experienced by the patient, categorized into 4 classes |
| Resting bp s | Level of blood pressure |
| CHOLESTEROL | Serum cholesterol in mg/dl |
| Fasting blood sugar | Blood sugar level on fasting |
| Resting ecg | Result of electrocardiogram (0: Normal, 1: Abnormality in ST-T wave, 2: Left ventricular hypertrophy) |
| Max heart rate | Maximum heart rate |
| Exercise angina | Angina induced by exercise |
| Oldpeak | Exercise-induced ST depression in comparison with the state of rest |
| ST slope | ST segment measured in terms of slope during peak exercise (0: Normal, 1: Upsloping, 2: Flat, 3: Downsloping) |
| Target | Binary value (1: Patient, 0: Normal) |
Table IV: Class distribution of the heart disease dataset

| Class | Counts |
| --- | --- |
| 0 | 561 |
| 1 | 629 |
-
RESEARCH METHODOLOGY
This paper employs different classification techniques: Support Vector Machine (SVM), Decision Tree Classifier (DT), Logistic Regression, K-Nearest Neighbor (KNN), and Naïve Bayes (NB). In addition, a deep network is implemented, and the impact of various optimization learning methods is measured, in order to detect cardiovascular diseases and find the best classifier. The steps carried out in this research are described in Figure 1.
Figure 1: Research steps
-
Data Preprocessing
To improve the models' performance, feature selection is one of the main steps: only the most correlated features are kept and used to train the model, and unnecessary data are removed to reduce training time. Based on the data analysis, the dataset is split 90:10 into a training set and a testing set, respectively, using cross-validation.
In the data processing phase, the Body Mass Index (BMI) is calculated to check the patient's health condition, where a normal BMI is between 18 and 25 and an abnormal value is over 25 or under 18. The BMI values are calculated using the formula below. After that, the weight and height features are removed.
$$\mathrm{BMI} = \frac{\text{Weight (kg)}}{\left(\text{Height (cm)} / 100\right)^{2}}$$
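The BMI step can be sketched as follows. This is a minimal illustration, assuming weight in kilograms and height in centimetres; the function names and the two patient records are hypothetical and are not taken from the authors' code.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body Mass Index in kg/m^2, with the height converted from cm to m."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def bmi_is_normal(value: float, low: float = 18.0, high: float = 25.0) -> bool:
    """True when the BMI lies in the 18 to 25 range the text treats as normal."""
    return low <= value <= high


# Hypothetical patient records: (weight in kg, height in cm).
patients = [(62.0, 168.0), (85.0, 156.0)]
for weight, height in patients:
    value = bmi(weight, height)
    print(f"BMI={value:.1f} normal={bmi_is_normal(value)}")
```

Once the BMI column is derived this way, the raw weight and height columns carry no additional information for the normal/abnormal flag, which is consistent with removing them afterwards.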
According to the American Heart Association (AHA) [25], blood pressure is measured by the systolic and diastolic readings and is categorized into five classes of severity. The blood pressure level is determined for each record from these two readings.
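A possible way to derive the blood pressure level is sketched below. The paper does not list the cutoffs it used, so the thresholds here are an assumption based on the commonly published AHA chart (normal, elevated, hypertension stage 1, hypertension stage 2, hypertensive crisis); the function name is hypothetical.

```python
def bp_category(systolic: int, diastolic: int) -> str:
    """Map a systolic/diastolic pair (mm Hg) to one of five severity classes.

    The thresholds are assumed from the AHA chart, not taken from the paper.
    Checks run from most to least severe so the worst reading decides.
    """
    if systolic > 180 or diastolic > 120:
        return "hypertensive crisis"
    if systolic >= 140 or diastolic >= 90:
        return "hypertension stage 2"
    if systolic >= 130 or diastolic >= 80:
        return "hypertension stage 1"
    if systolic >= 120:
        return "elevated"
    return "normal"


# Hypothetical readings: (AP_HIGH, AP_LOW).
for reading in [(118, 76), (125, 78), (135, 85), (150, 95), (190, 100)]:
    print(reading, bp_category(*reading))
```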
-
Classification Algorithms
Classification algorithms are among the most widely used techniques in the machine learning field. In general, machine learning models are divided into three approaches based on their learning style: supervised learning, unsupervised learning, and reinforcement learning. Classification is a form of learning in which the machine learns how to assign a class label to each data instance. Several models are known as classification algorithms, such as logistic regression, Naïve Bayes, random forest, decision tree, gradient-boosted tree, and linear perceptron. This research reviews the implementation of seven classification algorithms on the datasets, along with some ensemble classifiers.
-
Artificial Neural Network ANN
An ANN is a set of algorithms developed to process information by learning in a way modeled on the human brain. The ANN is built from a set of neurons and weighted connections to recognize patterns and solve AI problems. Multilayer networks can approximate any function, from complex prediction to classification tasks. In this paper, a neural network is used to diagnose cardiovascular diseases. First comes the input layer, sometimes named the visible layer, which is responsible for entering the primary data; its size equals the number of features in each dataset. Then come three hidden layers, where the calculations are performed: the first hidden layer has 18 neurons with activation=relu, the second has 21 neurons with activation=relu, and the third has 8 nodes with activation=relu. Finally, the output layer, which is responsible for giving the expected final result for the case, contains one node with activation='sigmoid'. The architecture is shown in Figure 2.
Figure 2: Neural Network Architecture
The model parameter settings are a learning rate of 0.001 and 150 epochs. To further improve the performance of the network, the Adam optimizer is used.
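To make the layer layout concrete, the sketch below runs a forward pass through an 18-21-8-1 network with ReLU hidden layers and a sigmoid output, using only the Python standard library. It is an illustration, not the authors' implementation: the weights are random and untrained, the 11 input features are an assumed example size, and all function names are hypothetical.

```python
import math
import random

HIDDEN_AND_OUTPUT = [18, 21, 8, 1]  # three hidden layers, then one output node


def init_network(n_features, seed=0):
    """Create one (weights, biases) pair per layer with small random weights."""
    rng = random.Random(seed)
    sizes = [n_features] + HIDDEN_AND_OUTPUT
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(layers, features):
    """Return the sigmoid output: ReLU on hidden layers, sigmoid on the last."""
    activations = list(features)
    for index, (weights, biases) in enumerate(layers):
        act = sigmoid if index == len(layers) - 1 else relu
        activations = [
            act(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations[0]


def count_parameters(layers):
    """Total number of trainable weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


network = init_network(n_features=11)          # 11 inputs is an assumed example
probability = forward(network, [0.5] * 11)     # untrained output in (0, 1)
print(count_parameters(network), probability)
```

Because the output node is a sigmoid, the result can be read as a probability of disease and thresholded at 0.5 for a binary prediction once the weights have been trained.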
-
Adam
In 2014, Kingma and Ba introduced Adam as a new optimization method to provide faster convergence and accurate results. Adam stands for adaptive moment estimation; it combines root mean squared propagation (RMSProp) and momentum. It works by dividing the learning rate by the square root of the exponential average of the squared gradient, and the gradient itself is replaced by its exponentially weighted average [26]. The equations below represent the Adam calculation steps.
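$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^{2}$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$
where $g_t$ is the gradient at step $t$, $\eta$ is the learning rate, and $\hat{m}_t = m_t / (1 - \beta_1^{t})$ and $\hat{v}_t = v_t / (1 - \beta_2^{t})$ are the bias-corrected averages.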
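The update rule can be illustrated with a small standard-library sketch that minimizes a one-dimensional quadratic. It follows the standard Adam formulation with bias correction; the default values of `beta1`, `beta2`, and `eps` are the commonly used ones and, like the toy objective and the function name, are assumptions rather than settings reported in this paper.

```python
import math


def adam_minimize(grad, theta, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=150):
    """Run Adam on a single scalar parameter and return its final value."""
    m = 0.0  # exponentially weighted average of the gradient (momentum term)
    v = 0.0  # exponentially weighted average of the squared gradient
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Toy objective f(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
result = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0, lr=0.1, steps=500)
print(result)
```

The larger learning rate in the toy call is chosen only so that the example converges in a few hundred steps; the network described above uses 0.001.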
-
-
Support Vector Machine (SVM)
The SVM classifier became very popular after its inception due to its impressive performance, which is comparable to that of advanced NNs trained on complex tasks with computationally expensive algorithms. SVMs were originally designed as effective approaches for pattern recognition and classification [27]. Immediately after their introduction, researchers used these algorithms for classification problems and applications such as speech recognition [28] and computer vision and image processing [29].
The SVM algorithm is primarily used for classification. SVM generates a hyperplane that separates the data into classes, and it can solve both linear and non-linear problems. The main objective of SVM is to find a hyperplane in N-dimensional space (where N is the number of features) that clearly classifies the data points. The accuracy of the outcome is directly related to the hyperplane that is selected: the algorithm should find the plane with the maximum distance to the data points of both classes. The hyperplane is illustrated graphically as a line that divides one class from another, and data points located on different sides of the hyperplane are assigned to different classes. The dimension of the hyperplane depends on the number of features: if the input has two features, the hyperplane is a line, and if the input has three features, it is a 2D plane. SVM is also used in medical diagnosis to detect anomalies, in air quality management systems, and in financial analysis.
-
Decision Tree Classifier (DT)
The decision tree is a classification algorithm that builds a tree-like structure that can be used for selecting between different actions. A decision tree can solve both regression and classification problems, and it is one of the most popular machine learning algorithms. Decision trees often mimic human-level thinking, so it is easy to understand the data and obtain good interpretations. A decision tree is a tree in which each node represents a feature (attribute), each link (branch) represents a decision (rule), and each leaf represents a result (a categorical or continuous value) [30].
-
Logistic Regression
Logistic regression is a widely used machine learning algorithm for predicting the probability of a response variable given a set of explanatory independent variables; it is a supervised binary classification algorithm [31]. The response variable is coded as a binary value, 0 or 1, and the data are classified based on the value of this binary target variable. The types of logistic regression are Binary (or Binomial), Multinomial, and Ordinal. In Binary LR, the target variable has two possible values, 0 and 1. Multinomial logistic regression applies when the response variable has three or more unordered possible values, such as class types (A, B, C, D). Ordinal LR applies when the response variable has three or more ordered values, such as student scores (High, Mid, Low). The LR algorithm is sensitive to outliers.
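The binary case can be sketched in a few lines: a linear combination of the features is passed through the sigmoid function to give a probability, which is then thresholded at 0.5. The coefficients below are made-up illustrative values, not fitted parameters from this study, and the function names are hypothetical.

```python
import math


def predict_proba(weights, bias, features):
    """Probability of class 1 under a binary logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features, threshold=0.5):
    """Binary label obtained by thresholding the predicted probability."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0


# Hypothetical coefficients for two scaled features (not fitted values).
weights, bias = [1.2, 0.8], -1.0
for sample in [[0.2, 0.1], [1.5, 0.9]]:
    print(sample, round(predict_proba(weights, bias, sample), 3), predict(weights, bias, sample))
```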
-
K-Nearest neighbor (KNN)
K-Nearest Neighbor is one of the supervised machine learning classification algorithms. KNN assumes that similar items lie near each other; in other words, similar objects are close to each other. KNN searches for the nearest neighbors of a data point and uses their classes to predict the target label. Nearest neighbor is a simple algorithm that stores all available cases and classifies new cases according to a similarity measure calculated by a distance function (Manhattan, Euclidean, Minkowski, or Hamming). Nearest neighbor techniques fall into two categories: (1) structure-less NN techniques, in which the entire data is classified into training data, and (2) structure-based NN techniques, which are based on data structures such as the orthogonal structure tree (OST), axis tree, k-d tree, and ball tree. In banking, K-nearest neighbor is used to detect credit card usage patterns. Additionally, KNN algorithms analyze both data and site patterns that suggest suspicious behavior.
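The structure-less variant can be sketched directly from this description: store the training cases, rank them by distance to the query, and take a majority vote among the k closest. The four distance functions are the ones named above; the toy points, the labels, and the function names are hypothetical.

```python
from collections import Counter


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def minkowski(a, b, p=3):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def hamming(a, b):
    """Number of positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(points, labels, query, k=3, distance=euclidean):
    """Majority label among the k stored points closest to the query."""
    ranked = sorted(zip(points, labels), key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set with two well-separated classes.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_points, train_labels, (2, 2)))
print(knn_predict(train_points, train_labels, (6, 5), distance=manhattan))
```

Structure-based variants such as k-d trees or ball trees return the same neighbors but avoid scanning every stored point for each query.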
-
Naïve Bayes (NB)
NB is a classification algorithm based on Bayes' rule with the assumption that all the features used to predict the target value are independent. The Naïve Bayes model predicts the class of a given data point based on the probability that the data point belongs to a given class.
-
Random Forest (RF)
Random forest creates a number of decision trees at training time and gives the output that receives the most votes. This technique is used for classification and regression tasks.
-
Voting Classifier
The Voting Classifier is an ensemble machine learning model, also called a majority voting ensemble. It combines the predictions of various trained models, and the final result depends on the majority vote of the participating models. This technique can be used to improve model performance. The voting classifier supports two kinds of voting. First, hard voting predicts the class based on the largest sum of votes from the models. Second, soft voting predicts the class based on the largest sum of the predicted probabilities from the models [31].
-
Gradient Boosting Classifier
Gradient Boosting is a common machine learning technique used in classification problems. Gradient Boosting builds the model in a stage-wise fashion and generalizes it by allowing optimization of an arbitrary differentiable loss function; each predictor corrects the error of its predecessor. The technique is based on building decision trees. Initially, a decision tree is built and trained on the dataset, and then it is evaluated (its classification error is computed). Based on this evaluation, a second decision tree is constructed that improves on the first. Tree 3 is then built based on Tree 1 + Tree 2. This process is repeated for a specific number of iterations. Consequently, the final model is the weighted sum of the predictions of the previous tree models [32].
-
-
EVALUATION METRICS
In classification problems, selecting the best metrics for evaluating the performance of a particular classifier on a given dataset depends on a number of considerations, including class balance and expected outcomes. One performance metric may evaluate a classifier from a single perspective that the others fail to measure, and vice versa. Hence, there is no standardized (unified) metric for defining the generalized performance of a classifier. In this paper, several metrics are chosen to measure how well the models perform: Accuracy, Precision, Recall, and F1-score.
These metrics are drawn from the following four categories:
True Positives (TP): a case where the true class of the instance was 1 (True) and the model prediction is also 1 (True).
False Positives (FP): a case where the true class of the instance was 0 (False) and the model prediction is 1 (True).
True Negatives (TN): a case where the true class of the instance was 0 (False) and the model prediction is also 0 (False).
False Negatives (FN): a case where the true class of the instance was 1 (True) and the model prediction is 0 (False).
Accuracy: the accuracy measure is the proportion of correct predictions among all predictions. However, it is not robust for unbalanced datasets.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision: also called positive predictive value, it measures the capability of a model to identify the correct instances for each class. This is a strong metric for multi-class classification and unbalanced datasets.
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall: this metric measures a model's performance in recognizing the true positives out of the total actual positive cases.
$$\text{Recall} = \frac{TP}{TP + FN}$$
F-score: also called the balanced F-score or F-measure, it can be defined as a weighted average (the harmonic mean) of precision and recall.
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
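The four metrics follow directly from the TP, FP, TN, and FN counts, as the short sketch below shows. The counts in the example are invented for illustration and are not results from this study; the function name is hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented counts for a 100-sample test fold.
scores = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this example the precision (0.800) and recall (0.889) differ, so the F1-score (0.842) lies between them and closer to the smaller value, which is the behaviour expected of a harmonic mean.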
-
RESULTS AND DISCUSSION
Based on experiments implementing deep learning networks and a variety of machine learning classification algorithms to automatically detect cardiovascular disease using the cardiovascular disease dataset, this section explains the results obtained with 10-fold cross-validation, in which 90% of the data is used for training and 10% for testing in each fold.
Table V shows the results obtained from the classifiers after data processing and feature selection. The Gradient Boosting model showed the highest performance in predicting the disease at 70% accuracy, followed by the ANN at 68% accuracy. The lowest model was DT at 62% accuracy.
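The 90/10 split per fold can be sketched with a small standard-library generator. This is an assumed, simplified version of 10-fold cross-validation (shuffled, not stratified); the function name and the seed are hypothetical, and the sample count of 1190 is borrowed from the Heart Disease Dataset only to make the fold sizes concrete.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for shuffled k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test


# With 1190 records, each fold trains on 1071 rows and tests on 119.
for fold_number, (train, test) in enumerate(k_fold_indices(1190), start=1):
    print(fold_number, len(train), len(test))
```

Each record is used for testing exactly once, and the reported score for a model is typically the average of its metric over the ten folds.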
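Table V: Result of the Classifiers (Selected Features)

| Models | Precision | Recall | F1-score | Accuracy |
| --- | --- | --- | --- | --- |
| KNN | 66% | 66% | 66% | 66% |
| SVM | 66% | 64% | 64% | 65% |
| DT | 63% | 62% | 62% | 62% |
| ANN | 69% | 68% | 68% | 68% |
| NB | 65% | 65% | 64% | 65% |
| RF | 63% | 63% | 63% | 63% |
| LR | 64% | 64% | 64% | 64% |
| Voting Classifier | 67% | 67% | 67% | 67% |
| Gradient Boosting | 70% | 70% | 70% | 70% |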
Table VI displays the results obtained from the classifiers with all features. The Gradient Boosting model again gave the highest performance in predicting the disease, at 73% accuracy, 73% recall, 73% F1-score, and 74% precision. The precision is higher than that reported in [9]. The lowest model was DT at 63% accuracy.
Table VI: Result of the Classifiers (All Features)

| Models | Precision | Recall | F1-score | Accuracy |
| --- | --- | --- | --- | --- |
| KNN | 70% | 70% | 70% | 70% |
| SVM | 73% | 71% | 71% | 72% |
| DT | 63% | 63% | 63% | 63% |
| ANN | 73% | 73% | 73% | 73% |
| NB | 71% | 71% | 71% | 71% |
| RF | 70% | 70% | 70% | 70% |
| LR | 73% | 73% | 73% | 73% |
| Voting Classifier | 73% | 73% | 73% | 73% |
| Gradient Boosting | 74% | 73% | 73% | 73% |
As is evident from Tables V and VI, the results of the models with all features are better than those with feature selection. This may be because the selection and merging of some features affected how well the data represented the problem.
On the other hand, Table VII illustrates the prediction results of the classifiers on the heart disease dataset [23]. The Random Forest classifier achieved the best performance with an accuracy of 94%, followed by Decision Tree with 89%. As shown in the table, KNN achieved 71% on all metrics, which is the lowest on this dataset.
Table VII: Result of the Classifiers (Heart Disease Dataset)
A comparison of cardiovascular disease prediction using different classification and data mining algorithms is provided in Table VIII. It is evident from the table that KNN in this research achieved a better result than [9] on the same dataset, at 70%. In addition, the NB and LR models of this research give higher accuracy than [9], at 71% and 73%, respectively, on the same dataset. Compared with [9], this research implemented three additional algorithms, namely a deep ANN, a Voting Classifier, and Gradient Boosting, and also used the heart disease dataset [23]. The proposed models predict the disease successfully on the [23] dataset; the Random Forest classifier achieved the best result at 94%.
Table VIII: Comparison of Classifier Accuracy (%) with Previous Work

| Work | KNN | RF | SVM | DT | ANN | NB | LR | VC | GB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Prediction of cardiac disease using supervised machine learning algorithms [9] | 66 | 71 | 72 | 73 | – | 60 | 72 | – | – |
| Research models with CVD | 70 | 70 | 72 | 63 | 73 | 71 | 73 | 73 | 73 |
| Research models with heart diseases dataset | 71 | 94 | 72 | 89 | 77 | 83 | 85 | 86 | 88 |
-
CONCLUSION
Currently, heart disease is one of the dominant diseases that cause death. In this research, the cardiovascular disease and heart disease datasets were used to diagnose heart disease using different machine learning algorithms, namely Artificial Neural Network (ANN), Support Vector Machine (SVM), Decision Tree Classifier (DT), Logistic Regression (LR), K-Nearest Neighbor (KNN), Random Forest (RF), Voting Classifier (VC), Gradient Boosting Classifier (GB), and Naïve Bayes (NB). The Gradient Boosting Classifier performed best on the cardiovascular disease dataset with all features, while on the heart disease dataset, Random Forest was the best classifier. The results can be used as recommendations to help enhance and improve the diagnosis of heart disease. Hence, this could assist doctors in making quicker and more efficient decisions in cardiovascular disease diagnosis.
REFERENCES
-
W. H. Organization, Cardiovascular Diseases. 2020.
-
B. Ali, L. Gurbeta, and A. Badnjevi, Machine learning techniques for classification of diabetes and cardiovascular diseases, in 2017 6th Mediterranean Conference on Embedded Computing (MECO), 2017, pp. 14.
-
S. Mezzatesta, C. Torino, P. De Meo, G. Fiumara, and A. Vilasi, A machine learning-based approach for predicting the outbreak of cardiovascular diseases in patients on dialysis, Comput. Methods Programs Biomed., vol. 177, pp. 915, 2019.
-
K. H. Miao, J. H. Miao, and G. J. Miao, Diagnosing coronary heart disease using ensemble machine learning, Int J Adv Comput Sci Appl IJACSA, 2016.
-
J. Wang et al., Detecting cardiovascular disease from mammograms with deep learning, IEEE Trans. Med. Imaging, vol. 36, no. 5, pp. 11721181, 2017.
-
A. M. Alaa, T. Bolton, E. Di Angelantonio, J. H. Rudd, and M. van der Schaar, Cardiovascular disease risk prediction using automated machine learning: A prospective study of 423,604 UK Biobank participants, PLoS One, vol. 14, no. 5, p. e0213653, 2019.
-
C. Krittanawong, H. Zhang, Z. Wang, M. Aydar, and T. Kitai, Artificial intelligence in precision cardiovascular medicine, J. Am. Coll. Cardiol., vol. 69, no. 21, pp. 2657–2664, 2017.
-
M. S. Amin, Y. K. Chiam, and K. D. Varathan, Identification of significant features and data mining techniques in predicting heart disease, Telemat. Inform., vol. 36, pp. 82–93, 2019.
-
R. J. P. Princy, S. Parthasarathy, P. S. H. Jose, A. R. Lakshminarayanan, and S. Jeganathan, Prediction of Cardiac Disease using Supervised Machine Learning Algorithms, in 2020 4th International Conference on Intelligent Computing and Control Systems (ICICCS), 2020, pp. 570–575.
-
Kaggle, Cardiovascular Disease dataset. 2020.
-
S. Bashir, Z. S. Khan, F. H. Khan, A. Anjum, and K. Bashir, Improving Heart Disease Prediction Using Feature Selection Approaches, in 2019 16th International Bhurban Conference on Applied Sciences and Technology (IBCAST), 2019, pp. 619–623. doi: 10.1109/IBCAST.2019.8667106.
-
S. A. Sabab, M. A. R. Munshi, A. I. Pritom, and others, Cardiovascular disease prognosis using effective classification and feature selection technique, in 2016 International Conference on Medical Engineering, Health Informatics and Technology (MediTec), 2016, pp. 1–6.
-
G. Biau et al., Performances Analysis of Heart Disease Dataset using Different Data Mining Classifications, Test, vol. 8, no. 6, pp. 2677–2682.
-
A. Khemphila and V. Boonjing, Heart disease classification using neural network and feature selection, in 2011 21st International Conference on Systems Engineering, 2011, pp. 406–409.
-
M. Kavitha, G. Gnaneswar, R. Dinesh, Y. R. Sai, and R. S. Suraj, Heart Disease Prediction using Hybrid Machine Learning Model, in 2021 6th International Conference on Inventive Computation Technologies (ICICT), 2021, pp. 1329–1333.
-
J. Thomas and R. T. Princy, Human heart disease prediction system using data mining techniques, in 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), 2016, pp. 1–5.
-
M. Gawali, N. Shirwalkar, and A. Kalshetti, Heart disease prediction system using data mining techniques, Int. J. Pure Appl. Math., vol. 120, no. 6, pp. 499–506, 2018.
-
J. Maiga, G. G. Hungilo, and others, Comparison of Machine Learning Models in Prediction of Cardiovascular Disease Using Health Record Data, in 2019 International Conference on Informatics, Multimedia, Cyber and Information System (ICIMCIS), 2019, pp. 45–48.
-
S. Mohan, C. Thirumalai, and G. Srivastava, Effective heart disease prediction using hybrid machine learning techniques, IEEE Access, vol. 7, pp. 81542–81554, 2019.
-
A. U. Haq, J. P. Li, M. H. Memon, S. Nazir, and R. Sun, A hybrid intelligent system framework for the prediction of heart disease using machine learning algorithms, Mob. Inf. Syst., vol. 2018, 2018.
-
Z. Arabasadi, R. Alizadehsani, M. Roshanzamir, H. Moosaei, and A. Yarifard, Computer aided decision making for heart disease detection using hybrid neural network-Genetic algorithm, Comput. Methods Programs Biomed., vol. 141, pp. 19–26, 2017.
-
C. B. C. Latha and S. C. Jeeva, Improving the accuracy of prediction of heart disease risk based on ensemble classification techniques, Inform. Med. Unlocked, vol. 16, p. 100203, 2019.
-
Heart disease dataset. 2021.
-
M. Siddhartha, Heart Disease Dataset (Comprehensive). IEEE, Nov. 05, 2020. Accessed: Jul. 08, 2021. [Online]. Available: https://ieee-dataport.org/open-access/heart-disease-dataset-comprehensive
-
American Heart Association, High blood pressure. 2021.
-
D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, 2014.
-
C. Cortes and V. Vapnik, Support-vector networks, Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
-
K. Aida-zade, A. Xocayev, and S. Rustamov, Speech recognition using Support Vector Machines, in 2016 IEEE 10th International Conference on Application of Information and Communication Technologies (AICT), 2016, pp. 1–4.
-
Y. Tian, E. Li, L. Yang, and Z. Liang, An image processing method for green apple lesion detection in natural environment based on GA-BPNN and SVM, in 2018 IEEE International Conference on Mechatronics and Automation (ICMA), 2018, pp. 1210–1215.
-
Chapter 4: Decision Trees Algorithms | by Madhu Sanjeevi (Mady) | Deep Math Machine learning.ai | Medium. https://medium.com/deep-math-machine-learning-ai/chapter-4-decision-trees-algorithms-b93975f7a1f1 (accessed Jul. 08, 2021).
-
J. Brownlee, How to Develop Voting Ensembles With Python, Machine Learning Mastery, Apr. 16, 2020. https://machinelearningmastery.com/voting-ensembles-with-python/ (accessed Jul. 08, 2021).
-
ML – Gradient Boosting, GeeksforGeeks, Aug. 25, 2020. https://www.geeksforgeeks.org/ml-gradient-boosting/ (accessed Jul. 08, 2021).