- Open Access
- Authors : Pabitra Kumar Bhunia, Arijit Debnath, Poulami Mondal, Monalisa De, Kankana Ganguly, Pranati Rakshit
- Paper ID : IJERTCONV9IS11064
- Volume & Issue : NCETER – 2021 (Volume 09 – Issue 11)
- Published (First Online): 16-07-2021
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Heart Disease Prediction using Machine Learning
Pabitra Kumar Bhunia
Department of Computer Science and Engineering JIS College of Engineering
Kalyani, Nadia, West Bengal, India
Poulami Mondal
Department of Computer Science and Engineering JIS College of Engineering
Kalyani, Nadia, West Bengal, India poulamimondal164@gmail.com
Kankana Ganguly
Department of Computer Science and Engineering JIS College of Engineering
Kalyani, Nadia, West Bengal, India
Arijit Debnath
Department of Computer Science and Engineering JIS College of Engineering
Kalyani, Nadia, West Bengal, India
Monalisa De
Department of Computer Science and Engineering JIS College of Engineering
Kalyani, Nadia, West Bengal, India
Pranati Rakshit
Department of Computer Science and Engineering JIS College of Engineering
Kalyani, Nadia, West Bengal, India
Abstract Heart disease (HD) is one of the most common diseases today, and health-care providers need tools that help them look after their patients' health and save lives. In this paper, several classifiers are compared by performance on a heart disease dataset, with the aim of classifying cases correctly and predicting heart disease from a minimal set of attributes.
Healthcare industries collect large amounts of data that contain hidden information. This data is useful for making effective decisions, and advanced data mining techniques are used to extract reliable results from it. In this work, a Heart Disease Prediction System (HDPS) is developed using Logistic Regression, K-Nearest Neighbor, Decision Tree, Random Forest Classifier, and Support Vector Machine algorithms to predict the heart disease risk level.
The results reveal that the Random Forest Classifier and Support Vector Machine obtained the highest accuracy of 90.32%, whereas logistic regression, the KNN classifier, and the decision tree obtained accuracy scores of 87.09%, 70.96%, and 83.87%, respectively.
Keywords Machine learning, Logistic regression, Heart disease, Support vector machine, accuracy
-
INTRODUCTION
Data mining is the process of finding previously unknown structures, patterns, and trends in databases and using that information to build predictive models. It combines statistical analysis, machine learning algorithms, and database management technology to uncover hidden patterns and relationships in large databases.
World Health Statistics 2012 highlights that one in three adults is prone to high blood pressure, a condition that results in half of the deaths from heart problems and strokes. Heart-related disease, also known as cardiovascular disease (CVD), covers a range of conditions that affect the heart rather than a single disease. It proves fatal for one person every 34 seconds in the United States.
Coronary artery disease, cardiomyopathy, and other cardiovascular conditions are subcategories of heart disease that affect how blood is pumped and circulated through the body. Diagnosis is an important task that must be performed efficiently, and it is mainly done under a doctor's guidance. This can lead to unsatisfactory results and excessive treatment costs for patients. An automated medical diagnosis and prediction system would therefore be highly beneficial.
-
LITERATURE REVIEW
Numerous studies have focused on the diagnosis of heart disease. They applied different data mining techniques for diagnosis and reported different levels of accuracy for different methods.
One system evaluates patient parameters using data mining classification techniques. The datasets are evaluated in Python using two machine learning algorithms, the Decision Tree algorithm and the Naive Bayes algorithm, and the study reports which of the two predicts heart disease more accurately [1].
Aditi Gavhane et al. predicted heart attacks for early diagnosis in order to reduce the number of deaths. Machine learning plays a major role in their approach, which aims to identify people at risk before their condition becomes dangerous. They use the KNN algorithm and the Random Forest algorithm to predict heart attacks in advance [2].
Senthil Kumar et al. introduced a prediction model with different combinations of features and several known classification techniques. Their prediction model for heart disease, a Hybrid Random Forest with Linear Model (HRFM), achieved an enhanced accuracy of 88.7% [3].
Himanshu Sharma et al. argued that machine learning and deep learning algorithms open new opportunities for precise prediction of heart attacks. Their paper surveys state-of-the-art methods in machine learning and deep learning and provides an analytical comparison to help new researchers working in this field [4].
M. Nikhil Kumar et al. worked with 8 algorithms, including Decision Tree, the J48 algorithm, the Logistic model tree algorithm, the Random Forest algorithm, Naïve Bayes, KNN, Support Vector Machine, and Nearest Neighbor, to predict heart diseases. Prediction accuracy is higher when more attributes are used [5].
Amandeep Kaur et al. stated that Data mining is an important stage of the KDD process that can be used for disease management, diagnosis, and prediction in healthcare organizations. This paper discusses reviews on different methods and approaches in data mining that have been used to predict heart disease [6].
Pahulpreet Singh Kohli developed an Enhanced New Dynamic Data Processing (ENDDP) Algorithm to predict the early stages of heart disease. The results prove the performance of the proposed system [7].
-
DATA SET INFORMATION
The dataset is named heart.csv. It contains 303 instances, each representing a person who either has heart disease or is healthy. Of the 303 cases, 165 (54.46%) are people with heart disease and 138 (45.54%) are people without heart disease. There are 14 attributes, and the dataset contains no missing or null values.
The features are age, sex, chest-pain type, resting blood pressure, cholesterol, blood sugar level, ECG result, maximum heart rate achieved, exercise-induced angina, ST depression, the slope of the peak exercise ST segment, number of major vessels, and heart defect status (3 = normal, 6 = fixed defect, 7 = reversible defect). The bar graph (Fig. 1) shows the positive and negative cases (1 = positive, 0 = negative), and the scatter plot (Fig. 2) shows the positive and negative cases by age.
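The class balance quoted above can be re-derived from the two counts alone. The following is a minimal sketch that uses only the figures stated in the text; the function name `class_shares` is illustrative and not part of the original work.

```python
# Class counts reported for heart.csv in the text.
counts = {"heart disease": 165, "no heart disease": 138}
total = sum(counts.values())


def class_shares(counts):
    """Return each class's share of the dataset as a percentage."""
    n_total = sum(counts.values())
    return {label: 100.0 * n / n_total for label, n in counts.items()}


shares = class_shares(counts)
for label, pct in shares.items():
    print(f"{label}: {counts[label]} of {total} ({pct:.2f}%)")
```

The two classes are close to balanced, which is one reason plain accuracy is a reasonable first metric for this dataset.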
Fig.1 Positive and negative cases
Fig.2 Positive and negative cases depending on age
Fig.3 Data set description
-
METHODOLOGY
-
Data set information
The main objective of this research is to develop a heart disease prediction system. The system can discover and extract hidden knowledge associated with the disease from a historical heart disease data set. It aims to apply data mining techniques to medical data to assist in the prediction of heart disease.
-
Training and testing
The dataset is divided into two parts, a training set and a testing set. The training set, with its features (independent variables) and target (dependent variable), is used to fit each model, and the testing set is used to determine how well the fitted model predicts unseen cases. We split the dataset into 90% training and 10% testing data, with the random state set to 1. The random state parameter initializes the internal random number generator that decides how the data are split into train and test indices; setting it to a fixed value guarantees that the same sequence of random numbers, and therefore the same split, is generated each time the code is run. We then scaled the data using StandardScaler and transformed the training and testing data using fit_transform.
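The split and scaling step might look like the sketch below, assuming scikit-learn and NumPy are available. The data are a synthetic stand-in with the same shape as heart.csv (303 rows, 13 features), not the real dataset. The sketch fits the scaler on the training rows only and reuses it for the test rows; this is a common convention and an assumption here, since the text only says that fit_transform was used.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for heart.csv (assumption): 303 rows, 13 features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(303, 13))
y = rng.integers(0, 2, size=303)

# 90% / 10% split with a fixed seed, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=1
)

# Learn mean and standard deviation from the training rows,
# then apply the same transformation to the test rows.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Repeating the split with the same seed reproduces the same partition.
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.1, random_state=1)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_train, X_train_again))
```

With 303 rows, a 10% test fraction is rounded up by `train_test_split`, so the test set holds 31 rows and the training set 272.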
-
Classification used
-
Logistic regression
Logistic regression is a statistical modeling technique used to analyze a dataset in which one or more independent variables determine an outcome. LogisticRegression was instantiated with a random state of 0, and the model was then fitted on the training data. The testing accuracy was 87.09%.
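To make the mapping from independent variables to an outcome concrete, the sketch below shows the prediction rule of a logistic model in plain Python. The weights and bias are hypothetical values chosen for illustration; they are not fitted to heart.csv, and the function names are not taken from the original work.

```python
import math


def predict_proba(features, weights, bias):
    """Logistic model: sigmoid of a weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, weights, bias, threshold=0.5):
    """Return 1 (disease) when the estimated probability reaches the threshold."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


# Hypothetical weights for two standardized features (illustration only).
weights = [1.2, -0.8]
bias = 0.1

p = predict_proba([0.5, -1.0], weights, bias)
label = predict([0.5, -1.0], weights, bias)
print(f"probability={p:.3f}, predicted class={label}")
```

Fitting the model means choosing the weights and bias so that these probabilities agree with the training labels as closely as possible.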
-
KNN Classifier
The K-nearest neighbor (K-NN) algorithm is used for classification and pattern recognition, and it is widely used in predictive analysis. When new data arrives, the K-NN algorithm [8] identifies the existing data points that are nearest to it. From sklearn.neighbors, KNeighborsClassifier was imported with n_neighbors = 1. The model was then fitted on the training data. The testing accuracy was 70.96%.
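With n_neighbors = 1, the prediction for a new case is simply the label of its single closest training case. The sketch below illustrates that rule in plain Python with Euclidean distance; the toy points and the helper name `nearest_neighbor_label` are illustrative assumptions, and this is not the scikit-learn implementation.

```python
import math


def nearest_neighbor_label(query, train_points, train_labels):
    """Return the label of the training point closest to `query` (k = 1)."""
    distances = [math.dist(query, point) for point in train_points]
    best = min(range(len(train_points)), key=distances.__getitem__)
    return train_labels[best]


# Toy training data (illustration only): two features, two classes.
train_points = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0), (5.0, 5.0)]
train_labels = [0, 0, 1, 1]

print(nearest_neighbor_label((0.4, 0.2), train_points, train_labels))
print(nearest_neighbor_label((4.6, 4.9), train_points, train_labels))
```

Because the decision rests on one neighbor, a single noisy or atypical training case can flip the prediction, which makes k = 1 sensitive to noise.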
-
Support vector machine
Support Vector Machine (SVM) is one of the popular supervised learning algorithms in machine learning. Its benefit is that it finds the best-fitting line or decision boundary that separates an n-dimensional space into classes, so that newly added data points can easily be placed in the correct category. From sklearn, svm was imported, with the kernel set to linear, gamma set to auto, and C = 2. The model was then fitted on the training data. The testing accuracy was 90.32%.
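For a linear kernel, the decision boundary is a hyperplane, and classifying a new point reduces to checking which side of that hyperplane it falls on. The sketch below shows only this prediction step; the hyperplane coefficients are hypothetical, and finding them (the actual training, controlled by C) is not shown.

```python
def decision_value(x, w, b):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(x, w, b):
    """Assign class 1 on the non-negative side of the hyperplane, else class 0."""
    return 1 if decision_value(x, w, b) >= 0 else 0


# Hypothetical separating hyperplane x1 + x2 - 3 = 0 (illustration only).
w = [1.0, 1.0]
b = -3.0

print(classify([2.0, 2.0], w, b))
print(classify([0.5, 1.0], w, b))
```

In scikit-learn's SVC, gamma is the coefficient of the rbf, poly, and sigmoid kernels, so with a linear kernel the gamma setting has no effect on the fitted model.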
-
Random forest
The random forest (RF) classifier is a powerful supervised classification tool. RF generates a forest of classification trees from a given dataset, rather than a single classification tree. Each of these trees produces a classification for a given set of attributes. From sklearn.ensemble, RandomForestClassifier was imported, with n_estimators set to 10 and the random state set to 0. The model was then fitted on the training data. The testing accuracy was 90.32%.
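The way a forest turns many per-tree classifications into one answer can be sketched as a majority vote. The votes below are hypothetical outputs of a 10-tree forest for one patient, and `majority_vote` is an illustrative helper rather than a scikit-learn function.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Combine per-tree class predictions into a single forest prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes from a forest of 10 trees for one patient (illustration only).
votes = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
forest_prediction = majority_vote(votes)
print(forest_prediction)
```

This is the textbook form of the rule. scikit-learn's RandomForestClassifier averages the class probabilities predicted by the trees instead of counting hard votes, which usually, but not always, gives the same class.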
-
Decision Tree
A decision tree is a tree-shaped diagram in which each internal node represents a test on an attribute, each branch denotes an outcome of the test, and each leaf node denotes a class label. The decision tree classifier was imported with the random state set to 0, and the model was then fitted on the training data. The testing accuracy was 83.87%.
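The five classifier configurations described in this section can be sketched together as follows, assuming scikit-learn is installed. The hyperparameters are the ones stated in the text; everything else is left at library defaults, which is an assumption. The data come from `make_classification` as a synthetic stand-in for heart.csv, so the printed accuracies will not match the reported ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for heart.csv (assumption): 303 samples, 13 features.
X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=1
)

# Standardize using statistics learned from the training rows only.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Hyperparameters as stated in the text; other settings are library defaults.
models = {
    "Logistic regression": LogisticRegression(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=1),
    "SVM": SVC(kernel="linear", gamma="auto", C=2),
    "Random forest": RandomForestClassifier(n_estimators=10, random_state=0),
    "Decision tree": DecisionTreeClassifier(random_state=0),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns the fraction of correctly classified test samples.
    accuracies[name] = model.score(X_test, y_test)
    print(f"{name}: {accuracies[name]:.4f}")
```

To run the same comparison on the real data, the synthetic `X` and `y` would be replaced by the 13 feature columns and the target column of heart.csv.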
-
RESULT
Amongst all classification techniques (Table 1), testing accuracy was best for the random forest and SVM approaches, each with an accuracy of 90.32%.
Table 1. Comparison of the performance of different classifiers
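The reported scores are consistent with a test set of 31 cases (10% of 303, rounded up) and with percentages truncated to two decimals. This is an inference from the numbers, not something stated in the paper, and the counts of correct predictions below are hypothetical values chosen to reproduce the reported scores.

```python
def truncated_percent(correct, total):
    """Accuracy as a percentage, truncated (not rounded) to two decimals."""
    return (correct * 10000 // total) / 100


TEST_SIZE = 31  # assumption: 10% of 303 rows, rounded up

# Hypothetical counts of correct test predictions matching the reported scores.
correct_counts = {
    "Random forest": 28,
    "SVM": 28,
    "Logistic regression": 27,
    "Decision tree": 26,
    "KNN": 22,
}

reported = {
    name: truncated_percent(n, TEST_SIZE) for name, n in correct_counts.items()
}
for name, pct in reported.items():
    print(f"{name}: {correct_counts[name]}/{TEST_SIZE} = {pct:.2f}%")
```

On a test set this small, each additional correct prediction moves the accuracy by about 3.2 percentage points, so the gap between the best classifiers and logistic regression corresponds to a single test case.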
-
CONCLUSION
This heart disease prediction model, with an accuracy of 90.32%, will help people, especially medical professionals, to assess different scenarios. They will gain a better understanding of a person's health and of age-related health risks, and can therefore warn a patient in advance. Patients, in turn, can consult a doctor early, undergo a checkup, and thereby help prevent the occurrence of heart disease. In this way, the model helps to build trust and a sense of security among people.
-
REFERENCES
-
"Prediction of Heart Diseases Using Data Mining and Machine Learning Algorithms and Tools", M. Nikhil Kumar, K. V. S. Koushik and K. Deepak, International Journal of Scientific Research in Computer Science Engineering and Information Technology (IJSRCSEIT) | Year: 2019 | Conference Paper | Publisher: IEEE.
-
"Heart Diseases Prediction using Data Mining Techniques: A survey", Amandeep Kaur and Jyoti Arora, International Journal of Advanced Research in Computer Science (IJARCS) | Year: 2019 | Conference Paper | Publisher: IEEE.
-
"Application of Machine Learning in Diseases Prediction", Pahulpreet Singh Kohli and Shriya Arora, 4th International Conference on Computing Communication and Automation (ICCCA) | Year: 2018 | Conference Paper | Publisher: IEEE.
-
"Classification of Heart Disease Using K-Nearest Neighbor and Genetic Algorithm", M. Akhil, B. L. Deekshatulu and P. Chandra, Procedia Technology, vol. 10, pp. 85-94 | Year: 2013 | Conference Paper | Publisher: IEEE.