- Open Access
- Authors : Bhumika M, Divya Shivakumar, Meghana B, Meghana M, Rajashekar M B
- Paper ID : IJERTCONV7IS10031
- Volume & Issue : NCRACES – 2019 (Volume 7, Issue 10)
- Published (First Online): 20-06-2019
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Predictive Analysis for Chronic Kidney Disease Using Machine Learning Techniques
Bhumika M, Divya Shivakumar, Meghana B, Meghana M, Student, Department of Computer Science & Engineering, GSSSIETW, Mysuru
Rajashekar MB
Assistant Professor, Department of Computer Science & Engineering, GSSSIETW, Mysuru
Abstract— In this study, the dataset named "Chronic Kidney Disease", obtained from the UCI database, is used. The dataset contains information on 400 individuals and consists of 25 features. Using the WEKA software, this dataset is classified as chronic kidney disease or not with the Naïve Bayes (NB) and C4.5 classifiers used in data mining. Accuracy, precision, sensitivity, and F-measure values are used to compare the performance of the performed classifications. According to the obtained results, the Naïve Bayes algorithm produced the more successful results, with 99% accuracy.
Keywords— chronic kidney disease; WEKA; performance comparison; Naive Bayes; C4.5
I. INTRODUCTION
The amount of data emerging along with the development of information technologies shows a rapid increase. It is estimated that the data stored in the world's databases doubles every 20 months [1]. With the amount of data increasing each passing day, processing this data has become a challenge, and various data mining algorithms have been developed to address it. In the literature, there are many algorithms used in data mining and many studies in different fields comparing these algorithms. The health sector is one of these areas, where data mining is used to obtain more accurate results in diagnosis and treatment and to prevent human errors [2]. In our study, the dataset named "Chronic Kidney Disease", obtained from the UCI database [3], is used; chronic kidney disease affects human life negatively and is one of the most common public health problems in the world [4]. According to Tanrıverdi and his colleagues, chronic kidney disease [5] has been described as "a chronic and progressive impairment of the kidney's glomerular filtration value, of its regulation of the fluid-solute balance, and of its metabolic and endocrine functions". Chronic kidney disease occurs when there is a major reduction in kidney function [6]. According to the data of the Turkish Nephrology Association, the prevalence of chronic kidney disease in Turkey is 390 per million population. Compared to other countries, this rate is quite low; the most important cause of this is the difficulty experienced in data collection [5]. The purpose of the study is to compare the performances of some classifiers on the extracted dataset in order to develop more effective software for this area.
The rest of this work is organized as follows. The material and method are presented in Section II, the dataset used and the performance criteria used for the classification algorithms are discussed in Section III, the experimental results are provided in Section IV, and finally the conclusions and recommendations are offered in Section V.
II. MATERIAL AND METHOD
A. Data Extraction
In this study, the dataset named "Chronic Kidney Disease", extracted from the UCI database, was used. The dataset consists of a total of 25 attributes: 24 attributes plus the class attribute (11 numeric, 14 nominal). In addition, there are a total of 400 records for individuals whose ages range from 2 to 90. Table I contains the attribute descriptions and the value ranges of the dataset used.
TABLE I. USED DATA SET

| Attribute Name | Value Range | Description |
| --- | --- | --- |
| age | 2, ..., 90 | age |
| bp | 50, ..., 180 | blood pressure |
| sg | 1.005, 1.010, 1.015, 1.020, 1.025 | specific gravity |
| al | 0, 1, 2, 3, 4, 5 | albumin |
| su | 0, 1, 2, 3, 4, 5 | sugar |
| rbc | 2.1, ..., 8 | red blood cells |
| pc | normal, abnormal | pus cell |
| pcc | present, notpresent | pus cell clumps |
| ba | present, notpresent | bacteria |
| bgr | 22, ..., 490 | blood glucose random |
| bu | 1.5, ..., 391 | blood urea |
| sc | 0.4, ..., 76 | serum creatinine |
| sod | 4.5, ..., 163 | sodium |
| pot | 2.5, ..., 47 | potassium |
| hemo | 3.1, ..., 17.8 | hemoglobin |
| pcv | 9, ..., 54 | packed cell volume |
| wc | 2200, ..., 26400 | white blood cell count |
| rc | 2.1, ..., 8 | red blood cell count |
| htn | yes, no | hypertension |
| dm | yes, no | diabetes mellitus |
| cad | yes, no | coronary artery disease |
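For readers who wish to examine the data outside WEKA, the following Python sketch shows one way the UCI data could be loaded. It is an illustration only: the local file name `chronic_kidney_disease.csv`, the column name `class`, and the use of pandas are assumptions, not part of the original WEKA-based study.

```python
# Illustrative only: the authors used WEKA. This sketch assumes the UCI data has
# been exported to a CSV file named "chronic_kidney_disease.csv" (hypothetical),
# with "?" marking missing values as in the original ARFF file.
import pandas as pd

df = pd.read_csv("chronic_kidney_disease.csv", na_values=["?"])

print(df.shape)                     # expected: 400 records, 24 attributes + class
print(df["class"].value_counts())   # distribution of ckd / notckd labels (column name assumed)
```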
B. Used Classifiers
The dataset used in this study is deployed for performance comparison using the Naive Bayes (NB) and C4.5 classifiers. The following information is given about these classifiers.
Naive Bayes Classifier: This classifier is based on the Bayes theorem; it is one of the simplest and most widely used methods and can handle any number of attributes or classes. Although the model is simple, it performs well on some problems [7]. The classifier assumes that the training data is already labelled. When a new record arrives, it calculates the probability that this record belongs to each of the classes. When calculating these probability values, it is assumed that each feature is independent of the others and that each feature has the same degree of importance. To find out which class a new record belongs to, the probability of belonging to each class is calculated using the formula in equation (1):

P(Si|X) = P(X|Si) P(Si) / P(X)    (1)

where
P(Si|X): the probability of occurrence of the Si event when the X event occurs,
P(X|Si): the probability of occurrence of the X event when the Si event occurs, and
P(Si), P(X): the prior probabilities of the Si and X events.

The class having the highest probability among these calculated values is regarded as the class to which the record belongs [2].
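To make the use of equation (1) concrete, the following Python sketch scores a hypothetical new record against the two classes. The class priors 250/400 and 150/400 reflect the ckd/notckd split visible in Tables III and IV, while the conditional probabilities and the feature values `htn=yes` and `pc=abnormal` are invented for illustration; this is not the paper's WEKA model.

```python
# Toy illustration of equation (1) with hypothetical counts (not the paper's data).
# Classes: "ckd" / "notckd"; two nominal features assumed independent given the class.
priors = {"ckd": 250 / 400, "notckd": 150 / 400}          # P(Si): class priors

# P(feature value | class), estimated from hypothetical training counts.
likelihoods = {
    "ckd":    {"htn=yes": 0.60, "pc=abnormal": 0.30},
    "notckd": {"htn=yes": 0.05, "pc=abnormal": 0.02},
}

new_record = ["htn=yes", "pc=abnormal"]

# Numerator of equation (1) for each class: P(X|Si) * P(Si),
# with P(X|Si) factored as a product over features (independence assumption).
scores = {}
for cls, prior in priors.items():
    p_x_given_cls = 1.0
    for feature in new_record:
        p_x_given_cls *= likelihoods[cls][feature]
    scores[cls] = p_x_given_cls * prior

# P(X) is the same for every class, so the class with the largest numerator
# is also the class with the highest posterior probability.
prediction = max(scores, key=scores.get)
print(scores, "->", prediction)
```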
C4.5 Classifier: This classifier builds a decision tree from the training data as follows.
Step 1: Scan the dataset (storage servers).
Step 2: For each attribute a, calculate the gain [number of occurrences].
Step 3: Let a_best be the attribute with the highest gain [highest count].
Step 4: Create a decision node based on a_best and retrieve the nodes [patients] whose attribute values match a_best.
Step 5: Recur on the sub-lists [lists of patients] and calculate the count of outcomes [stages] for the resulting sub-nodes. Based on the highest count, the new node is classified.

Sample example:
Attributes (features): F1, F2, F3 [m = 3]
Subject (outcome): CKD, NOT CKD [p = 1/2 = 0.5]

TABLE II. TRAINING DATASET
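Step 2 above hinges on the gain computation. The following sketch shows how the information gain of a single nominal attribute could be computed for a binary CKD / NOT CKD outcome; the attribute name F1 and all counts are toy values added here for illustration, not taken from Table II or from the paper's dataset.

```python
# Toy sketch of Step 2: information gain of one nominal attribute for a binary
# CKD / NOT CKD outcome. Counts are hypothetical.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# 14 hypothetical training records: 9 CKD, 5 NOT CKD overall.
parent = entropy([9, 5])

# Splitting on attribute F1 produces three subsets with these (CKD, NOT CKD) counts.
subsets = [(2, 3), (4, 0), (3, 2)]
n = sum(a + b for a, b in subsets)

weighted_child = sum(((a + b) / n) * entropy([a, b]) for a, b in subsets)
gain = parent - weighted_child
print(f"information gain of F1: {gain:.3f}")  # a_best = attribute with the largest gain
```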
Fig. 1. System Architecture

The optimal hyperplane, given in Figure 1, can be defined as a linear decision function with the maximum distance between the vectors of the two classes [8].

III. PERFORMANCE MEASURES USED FOR THE CLASSIFICATION ALGORITHMS

There are many classifiers used in data mining. Comparing and analyzing these classifiers is a rather complex process, because there are various evaluation dimensions that must be taken into consideration [14]. Two different classification algorithms (NB and C4.5) are used in this study. In data mining applications, the confusion matrix is frequently used to measure the classification performance of algorithms.
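As an illustration of how a confusion matrix of the kind shown in Tables III and IV is produced: the study itself relied on WEKA's built-in evaluation, so the use of scikit-learn below and the toy label lists are assumptions added for clarity.

```python
# Sketch only: y_true / y_pred stand in for the actual and predicted class labels.
from sklearn.metrics import confusion_matrix

y_true = ["ckd", "ckd", "ckd", "notckd", "notckd", "ckd"]      # hypothetical actual labels
y_pred = ["ckd", "notckd", "ckd", "notckd", "notckd", "ckd"]   # hypothetical predictions

# Rows = actual class, columns = predicted class, as in Tables III and IV.
cm = confusion_matrix(y_true, y_pred, labels=["ckd", "notckd"])
print(cm)
```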
A. Accuracy
Accuracy is the ratio of correctly classified instances to the total number of classifications [14]:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
B. Precision
Precision reflects the classifier's ability to identify the positive instances among everything it classified as positive. As seen in Equation (9), it is obtained by dividing the number of correctly classified positive values by all values classified as positive.

Precision = TP / (TP + FP)    (9)
C. Sensitivity
Sensitivity describes how well the classifier recognizes the actual class labels; in other words, it is the ratio of correctly classified positive values to the sum of correctly classified positive values and falsely classified negative values [14].

Sensitivity = TP / (TP + FN)    (10)
D. F-Measure
Precision and sensitivity alone may not be sufficient for performance comparisons. In this case, it is necessary to look at the F-measure, which is obtained by taking the weighted (harmonic) average of the precision and sensitivity values [14]:

F-Measure = 2 × (Precision × Sensitivity) / (Precision + Sensitivity)
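The four measures above can be reproduced directly from a confusion matrix. The following sketch uses the C4.5 matrix of Table IV with "ckd" treated as the positive class; this is an added illustration, and small rounding differences from the values reported in Table V are to be expected.

```python
# Values taken from Table IV (C4.5), with "ckd" as the positive class.
TP, FN = 241, 9     # actual ckd predicted as ckd / as notckd
FP, TN = 0, 150     # actual notckd predicted as ckd / as notckd

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
sensitivity = TP / (TP + FN)
f_measure   = 2 * precision * sensitivity / (precision + sensitivity)

print(f"accuracy:    {accuracy:.4f}")     # 0.9775
print(f"precision:   {precision:.2f}")    # 1.00
print(f"sensitivity: {sensitivity:.2f}")  # 0.96
print(f"f-measure:   {f_measure:.2f}")    # 0.98
```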
IV. EXPERIMENTAL RESULTS
The following tables show the confusion matrices of the classifiers used. The expression "ckd" in the tables means "chronic kidney disease" and "notckd" means "non-chronic kidney disease".
As can be seen from Table V, the highest accuracy was obtained from the NB classifier with 99%, while C4.5 achieved 97.75%. Sensitivity was 0.99 for NB and 0.96 for C4.5. Finally, the F-measure was 0.98 for NB and 0.97 for C4.5. Judging from the overall results, it can be said that the Naïve Bayes classifier performs better, with 99% accuracy.
V. CONCLUSIONS AND RECOMMENDATIONS

Currently, kidney disease is a major problem, because so many people suffer from it. Kidney disease is very dangerous if not treated in time and may be fatal. If doctors have a good tool that can identify in advance the patients who are likely to have kidney disease, they can treat those patients in time. In this study, chronic kidney disease has been predicted and diagnosed using the data mining classifiers Naive Bayes and C4.5. In future studies, datasets with more data can be used while the performances of the algorithms are compared. Moreover, other classifiers can also be used on the same dataset.
TABLE III. NAIVE BAYES CONFUSION MATRIX

| Actual \ Prediction | ckd | notckd |
| --- | --- | --- |
| ckd | 230 | 20 |
| notckd | 0 | 150 |
TABLE IV. C4.5 CONFUSION MATRIX

| Actual \ Prediction | ckd | notckd |
| --- | --- | --- |
| ckd | 241 | 9 |
| notckd | 0 | 150 |
TABLE V. COMPARISON OF PERFORMANCE MEASURES OF THE USED ALGORITHMS

| Used Algorithms | Accuracy (%) | Precision | Sensitivity | F-Measure |
| --- | --- | --- | --- | --- |
| C4.5 | 97.75 | 1.0 | 0.96 | 0.97 |
| NB | 99.00 | 0.98 | 0.99 | 0.98 |
REFERENCES
- I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., San Francisco, USA, 2005.
- M. Karakoyun and M. Hacıbeyoğlu, "Biyomedikal veri kümeleri kullanarak makine öğrenmesi sınıflandırma algoritmalarının karşılaştırılması," Akıllı Sistemlerde Yenilikler ve Uygulamaları (ASYU) Sempozyumu, İzmir, Turkey, October 9-10, 2014.
- G. Süleymanlar, "Kronik Böbrek Hastalığı ve Yetmezliği: Tanımı, Evreleri ve Epidemiyolojisi," Turkiye Klinikleri J Int Med Sci, vol. 3, no. 38, 2007, pp. 1-7.
- https://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Disease (Access Date: 7 February 2018).
- M. H. Tanrıverdi, A. Karadağ and E. Hatipoğlu, "Kronik Böbrek Yetmezliği," Konuralp Tıp Dergisi, vol. 2, no. 2, 2010, pp. 27-32.
- D. Soria, J. M. Garibaldi, F. Ambrogi, E. M. Biganzoli and I. O. Ellis, "A non-parametric version of the naive Bayes classifier," Knowledge-Based Systems, vol. 24, 2011, pp. 775-784.
- C. Cortes and V. Vapnik, "Support-Vector Networks," Machine Learning, vol. 20, 1995, pp. 273-297.
- R. Panja and N. R. Pal, "MS-SVM: Minimally Spanned Support Vector Machine," Applied Soft Computing, vol. 64, 2018, pp. 356-365.
- A. Bahri, V. Sugumaran and S. B. Devasenapati, "Misfire Detection in IC Engine using Kstar Algorithm," CoRR, abs/1310.3717, 2013.
- S. Piramuthu and R. T. Sikora, "Iterative feature construction for improving inductive learning algorithms," Expert Systems, vol. 36, 2009, pp. 3401-3406.
- C. K. Madhusudana, H. Kumar and S. Narendranath, "Condition monitoring of face milling tool using K-star algorithm and histogram features of vibration signal," Engineering Science and Technology, an International Journal, vol. 19, 2016, pp. 1543-1551.
- T. Kavzoğlu and İ. Çölkesen, "Karar Ağaçları ile Uydu Görüntülerinin Sınıflandırılması: Kocaeli Örneği," Harita Teknolojileri Elektronik Dergisi, vol. 2, no. 1, 2010, pp. 36-45.
- N. Mehdiyev, D. Enke, P. Fettke and P. Loos, "Evaluating Forecasting Methods by Considering Different Accuracy Measures," Procedia Computer Science, vol. 95, 2016, pp. 264-271.
- http://www.datascience.istanbul/2017/07/02/hata-matrisini-confusion-matrix-yorumlama/ (Access Date: 7 February 2018).
- C. Coşkun and A. Baykal, "Veri Madenciliğinde Sınıflandırma Algoritmalarının Bir Örnek Üzerinde Karşılaştırılması," XIII. Akademik Bilişim Konferansı, Malatya, Turkey, February 2-4, 2011.