A Machine Learning Approach to Diabetes Prediction with Feature Selection

Rashmik Manchiraju; Vandana Bhattacharjee

doi:https://doi.org/10.5281/zenodo.18412814

Volume 11, Issue 07 (July 2022)

Open Access
Authors : Rashmik Manchiraju , Vandana Bhattacharjee
Paper ID : IJERTV11IS070140
Volume & Issue : Volume 11, Issue 07 (July 2022)
Published (First Online): 27-07-2022
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

Rashmik Manchiraju

Birla Institute of Technology, Mesra Ranchi

Vandana Bhattacharjee

Birla Institute of Technology, Mesra Ranchi

Abstract:- Diabetes is a chronic disease that occurs when the pancreas fails to produce sufficient insulin or when the body cannot use the insulin it produces effectively. Insulin is a hormone that helps keep blood glucose levels under control. Uncontrolled diabetes causes hyperglycemia, or high blood glucose, which over time seriously damages many of the body's systems, including the nerves and blood vessels. Machine learning techniques have been applied to many health care problems with good results. The goal of this paper is to analyse different classification algorithms, namely Support Vector Machines, Decision Tree, K-Nearest Neighbour, Naïve Bayes and Random Forest, for identifying diabetes at an early stage.

INTRODUCTION

Diabetes is one of the world's most common diseases. If a person leads a stressful life or is obese and carries additional weight around the abdomen, insulin activity is hampered, resulting in diabetes. According to the World Health Organization (WHO), as cited in [1], around 422 million people suffer from diabetes, particularly in low- and middle-income countries, and this number could rise to 490 million by 2030. Genetic factors are among the causes of diabetes mentioned in [2]: it is caused by at least two mutant genes on chromosome 6, the chromosome that influences the body's response to various antigens. Viral infection may also influence the occurrence of type 1 and type 2 diabetes. As stated in [3], diabetes is affected by factors such as height, weight, heredity and insulin, but sugar concentration is considered the major factor among them. Early identification is the only way to avoid complications.

Machine learning algorithms have been applied to a range of medical data sets, including diabetes data sets [4-5]. Machine Learning (ML) helps predict the behaviour of a system by training on data; it is concerned with learning structure from the data provided. In recent years ML has become a developing, reliable and supportive tool in the medical domain. Automated learning has attracted growing interest there because it shortens detection time and reduces the interaction needed with the patient, freeing time for patient care. The health care sector holds large volumes of data [7], which may be structured, semi-structured or unstructured. Big Data Analytics is the process of examining such large data sets to uncover hidden information. There is also a growing need for Internet of Things (IoT)-based mobile health care applications that help to predict diseases [8].

METHODS

Table 1. Confusion matrix

	Predicted: True	Predicted: False
Actual: True	True positive	False negative
Actual: False	False positive	True negative

The evaluation parameters used in this research work are precision, recall, F-measure and accuracy.

Precision is the fraction of positive-class predictions that actually belong to the positive class.

Precision (P) = TruePositives / (TruePositives + FalsePositives)

Recall is the fraction of all positive examples in the dataset that are predicted as positive.

Recall (R) = TruePositives / (TruePositives + FalseNegatives)

F-Measure provides a single score that balances precision and recall in one number.

F-Measure (FM) = (2 * Precision * Recall) / (Precision + Recall)

Accuracy is the total number of correct predictions divided by the total number of predictions made for a dataset.

Accuracy (A) = (TP+TN)/(TP+FP+FN+TN)

Where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.
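
The four formulas above can be sketched as a small Python function. The function name and the counts passed to it are illustrative assumptions: the paper does not report raw confusion-matrix counts, and the values below were simply chosen to land close to the six-feature Random Forest row reported in the results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F-measure and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Hypothetical counts for illustration only (not taken from the paper).
metrics = classification_metrics(tp=35, fp=16, fn=12, tn=91)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The sketch does not guard against empty denominators; a classifier that never predicts the positive class would need that case handled separately.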

EXPERIMENTS & RESULTS

The Pima Indian Diabetes Database is a well-known and commonly used data set for diabetes prediction. It consists of 768 rows and 9 columns. The attributes include glucose, pregnancies, skin thickness, blood pressure, BMI, insulin, age, and outcome. The outcome variable indicates whether the patient is diabetic-positive or diabetic-negative. The pandas library is used to read the data set, which is stored as a CSV file in Excel format [6]. Tables 1-5 present the results of experiments with different classifiers.
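
An experiment of this shape could be sketched with pandas and scikit-learn as follows. This is not the authors' code: the column names (taken from the commonly distributed CSV header), the 80/20 stratified split, the random seed and the default hyperparameters are all assumptions, since the paper does not state them.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# The three feature sets from the result tables; column names are assumed.
FEATURE_SETS = [
    ["Glucose", "BMI"],
    ["Glucose", "BMI", "Pregnancies", "Age"],
    ["Glucose", "BMI", "Pregnancies", "Age", "SkinThickness", "Insulin"],
]


def make_classifiers(seed=0):
    # Default hyperparameters: the settings used in the paper are not given.
    return {
        "KNN": KNeighborsClassifier(),
        "SVM": SVC(),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Naive Bayes": GaussianNB(),
        "Random Forest": RandomForestClassifier(random_state=seed),
    }


def evaluate(df, target="Outcome", test_size=0.2, seed=0):
    """Train every classifier on every feature set and collect the four metrics."""
    rows = []
    for features in FEATURE_SETS:
        X_train, X_test, y_train, y_test = train_test_split(
            df[features], df[target],
            test_size=test_size, random_state=seed, stratify=df[target],
        )
        for name, clf in make_classifiers(seed).items():
            y_pred = clf.fit(X_train, y_train).predict(X_test)
            rows.append({
                "classifier": name,
                "features": ", ".join(features),
                "precision": precision_score(y_test, y_pred, zero_division=0),
                "recall": recall_score(y_test, y_pred, zero_division=0),
                "f_measure": f1_score(y_test, y_pred, zero_division=0),
                "accuracy": accuracy_score(y_test, y_pred),
            })
    return pd.DataFrame(rows)
```

With the data set saved locally, a call such as `evaluate(pd.read_csv("diabetes.csv"))` would return one row per classifier and feature set; the file name here is hypothetical. Exact numbers will differ from the tables depending on the split and the hyperparameters.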

KNN

Table 1. Performance of KNN classifier with feature sets

Feature set	Precision	Recall	F-Measure	Accuracy
Glucose, BMI	0.6279	0.5744	0.6000	0.7622
Glucose, BMI, Pregnancies, Age	0.6888	0.6595	0.6739	0.8051
Glucose, BMI, Pregnancies, Age, Skin Thickness, Insulin	0.6956	0.6808	0.6881	0.8116

SVM

Table 2. Performance of SVM classifier with feature sets

Feature set	Precision	Recall	F-Measure	Accuracy
Glucose, BMI	0.7222	0.5531	0.6265	0.7987
Glucose, BMI, Pregnancies, Age	0.6923	0.5744	0.6279	0.7922
Glucose, BMI, Pregnancies, Age, Skin Thickness, Insulin	0.7222	0.5531	0.6265	0.7987

DECISION TREE

Table 3. Performance of Decision Tree classifier with feature sets

Feature set	Precision	Recall	F-Measure	Accuracy
Glucose, BMI	0.4897	0.5106	0.5000	0.6883
Glucose, BMI, Pregnancies, Age	0.5454	0.6382	0.5882	0.7272
Glucose, BMI, Pregnancies, Age, Skin Thickness, Insulin	0.5918	0.6170	0.6041	0.7532

NAIVE BAYES

Table 4. Performance of Naïve Bayes classifier with feature sets

Feature set	Precision	Recall	F-Measure	Accuracy
Glucose, BMI	0.6829	0.5957	0.6363	0.7922
Glucose, BMI, Pregnancies, Age	0.6250	0.6382	0.6315	0.7727
Glucose, BMI, Pregnancies, Age, Skin Thickness, Insulin	0.6458	0.6595	0.6526	0.7857

RANDOM FOREST

Table 5. Performance of Random Forest classifier with feature sets

Feature set	Precision	Recall	F-Measure	Accuracy
Glucose, BMI	0.5510	0.5744	0.5625	0.7272
Glucose, BMI, Pregnancies, Age	0.6530	0.6808	0.6666	0.7922
Glucose, BMI, Pregnancies, Age, Skin Thickness, Insulin	0.6862	0.7446	0.7142	0.8181

Figure 1. Bar Graph visualization for KNN classifier performance

Figure 2. Bar Graph visualization for SVM classifier performance

Figure 3. Bar Graph visualization for Decision Tree classifier performance

Figure 4. Bar Graph visualization for Naïve Bayes classifier performance

Figure 5. Bar Graph visualization for Random Forest classifier performance
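
The comparison across the five result tables can also be checked mechanically. The sketch below transcribes the reported values into a dictionary keyed by classifier and number of features, then finds the best and worst entry for each metric; the names `RESULTS` and `extremes` are introduced here for illustration only.

```python
# (classifier, number of features) -> (precision, recall, F-measure, accuracy),
# transcribed from the result tables for the five classifiers.
RESULTS = {
    ("KNN", 2): (0.6279, 0.5744, 0.6, 0.7622),
    ("KNN", 4): (0.6888, 0.6595, 0.6739, 0.8051),
    ("KNN", 6): (0.6956, 0.6808, 0.6881, 0.8116),
    ("SVM", 2): (0.7222, 0.5531, 0.6265, 0.7987),
    ("SVM", 4): (0.6923, 0.5744, 0.6279, 0.7922),
    ("SVM", 6): (0.7222, 0.5531, 0.6265, 0.7987),
    ("Decision Tree", 2): (0.4897, 0.5106, 0.5, 0.6883),
    ("Decision Tree", 4): (0.5454, 0.6382, 0.5882, 0.7272),
    ("Decision Tree", 6): (0.5918, 0.6170, 0.6041, 0.7532),
    ("Naive Bayes", 2): (0.6829, 0.5957, 0.6363, 0.7922),
    ("Naive Bayes", 4): (0.625, 0.6382, 0.6315, 0.7727),
    ("Naive Bayes", 6): (0.6458, 0.6595, 0.6526, 0.7857),
    ("Random Forest", 2): (0.5510, 0.5744, 0.5625, 0.7272),
    ("Random Forest", 4): (0.6530, 0.6808, 0.6666, 0.7922),
    ("Random Forest", 6): (0.6862, 0.7446, 0.7142, 0.8181),
}

METRICS = ("precision", "recall", "f_measure", "accuracy")


def extremes(results):
    """Return, per metric, the (best, worst) keys; ties keep the first entry."""
    summary = {}
    for i, metric in enumerate(METRICS):
        best = max(results, key=lambda k: results[k][i])
        worst = min(results, key=lambda k: results[k][i])
        summary[metric] = (best, worst)
    return summary


for metric, (best, worst) in extremes(RESULTS).items():
    print(f"{metric}: best {best} = {RESULTS[best][METRICS.index(metric)]}, "
          f"worst {worst} = {RESULTS[worst][METRICS.index(metric)]}")
```

Because ties resolve to the first entry, the SVM precision of 0.7222 is reported for the two-feature set even though the six-feature set reaches the same value.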

CONCLUSION

Various machine learning classifiers have been applied to the diabetes dataset. Four evaluation parameters were taken into consideration: precision, recall, F-measure and accuracy. From Figures 1-5 it can be seen that precision was highest with the SVM classifier (72.22%) and lowest with Decision Tree (48.97%). Recall and F-measure were highest with Random Forest, at 74.46% and 71.42% respectively, and lowest with Decision Tree, at 51.06% and 50% respectively.

Accuracy was highest for Random Forest at 81.81%, slightly higher than the KNN value of 81.16%, and lowest for Decision Tree at 68.83%. For the SVM and Naïve Bayes classifiers the feature set {Glucose, BMI} gave the highest precision and accuracy, whereas for the other classifiers the feature set {Glucose, BMI, Pregnancies, Age, Skin Thickness, Insulin} gave the highest precision, F-measure and accuracy. It is important to select the right classifier with the right feature set in order to obtain accurate solutions to real-life problems.

REFERENCES

[1] Mitushi Soni, Dr. Sunita Varma, 2020, Diabetes Prediction using Machine Learning Techniques, INTERNATIONAL JOURNAL OF ENGINEERING RESEARCH & TECHNOLOGY (IJERT) Volume 09, Issue 09 (September 2020).

[2] Rani, KM. (2020). Diabetes Prediction Using Machine Learning. International Journal of Scientific Research in Computer Science, Engineering and Information Technology. 294-305. 10.32628/CSEIT206463.

[3] Deepti Sisodia, Dilip Singh Sisodia, Prediction of Diabetes using Classification Algorithms, Procedia Computer Science, Volume 132, 2018, Pages 1578-1585.

[4] Saru, S. and Subashree, S., Analysis and Prediction of Diabetes Using Machine Learning (April 2, 2019). International Journal of Emerging Technology and Innovative Engineering, Volume 5, Issue 4, April 2019.

[5] Aishwarya R., Gayathri P., Jaisankar N., A Method for Classification Using Machine Learning Technique for Diabetes, International Journal of Engineering and Technology (IJET), 5 (2013), pp. 2903-2908.

[6] Raja Krishnamoorthi, Shubham Joshi, Hatim Z Almarzouki, Piyush Kumar Shukla, Ali Rizwan, C. Kalpana, Basant Tiwari, "A Novel Diabetes Healthcare Disease Prediction Framework Using Machine Learning Techniques", Journal of Healthcare Engineering, vol. 2022, Article ID 1684017, 10 pages, 2022. https://doi.org/10.1155/2022/1684017

[7] Mujumdar, Aishwarya & Vaidehi, V. (2019). Diabetes Prediction using Machine Learning Algorithms. Procedia Computer Science. 165. 292-299. 10.1016/j.procs.2020.01.047.

[8] Sasmita Padhy, Sachikanta Dash, Sidheswar Routray, Sultan Ahmad, Jabeen Nazeer, Afroj Alam, "IoT-Based Hybrid Ensemble Machine Learning Model for Efficient Diabetes Mellitus Prediction", Computational Intelligence and Neuroscience, vol. 2022, Article ID 2389636, 11 pages, 2022. https://doi.org/10.1155/2022/2389636