- Paper ID : IJERTV6IS050219
- Volume & Issue : Volume 06, Issue 05 (May 2017)
- DOI : http://dx.doi.org/10.17577/IJERTV6IS050219
- Published (First Online): 08-05-2017
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Measuring Performance of Selected Algorithms used for Classification and Regression when Applied Against a Standard Dataset
Vinay S Bharadwaj, M.Tech Student, Dept. of ISE, MSRIT, Bangalore, India
Mrs. Sunitha R S, Assistant Professor, Dept. of ISE, MSRIT, Bangalore, India
Mr. Shashidhara H S, Associate Professor, Dept. of ISE, MSRIT, Bangalore, India
Abstract—Data mining algorithms are often used to extract useful information from datasets, giving us deeper insight into the data we are examining. This paper measures the performance of four selected algorithms, namely Bayesian Generalized Linear Model, Generalized Linear Model, k-Nearest Neighbours and Partial Least Squares. The performance parameters measured include sensitivity, specificity, root mean squared error (RMSE) and R squared. The dataset used is Pima Indians Diabetes from the mlbench machine learning library available in RStudio.
Keywords— BGLM, GLM, kNN, PLS, mlbench, caret, e1071, trainControl, ROC, Kappa
I. INTRODUCTION
With hundreds of data mining algorithms available today, it is often difficult to decide which algorithm should be used for a particular data model. The focus here is on algorithms used for classification and regression in supervised learning. Before investigating algorithm performance, we need to understand the basics of classification and regression and why they are used. Classification in machine learning means placing a new observation into one of a set of categories pre-defined by the training dataset [3]; the output variable is thus grouped into discrete classes. Regression in machine learning means predicting output values based on the training data values. By applying these algorithms to a dataset, we can extract a large amount of meaningful information. The model used here also emphasizes the use of regression techniques for training the data. The four algorithms under consideration are used for both classification and regression.
II. OVERVIEW OF ALGORITHMS UNDER CONSIDERATION
It is essential to briefly understand the algorithms under consideration in this paper: what they are, why they are used and their importance in machine learning [4].
The paper is organized as follows: we start with an overview of the four algorithms under consideration, then discuss the implementation details, followed by result analysis and a conclusion on which of the four algorithms best suits the dataset under investigation. We also include the future scope of the paper, which is to experiment with the various other algorithms available.
A. Bayesian Generalized Linear Model (BGLM)
In this approach we specify a conditional distribution, and the data is additionally supplemented with a prior probability distribution. The prior can take a form specified based on the data. Closed-form posteriors do not exist for every prior, so we take conjugate priors and compute their posterior probabilities; only these posteriors are considered in this model. The model is chiefly used to avoid overfitting when applied to large datasets. Finally we compute the model evidence, which tells us how well the model has predicted the behaviour; Laplace approximations are generally used to obtain the final results.
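To make this concrete, the following is a minimal sketch of fitting such a model in R. It assumes the arm package, whose bayesglm() places weakly informative priors on the coefficients; the formula and dataset follow the paper's setup, but the call itself is an illustrative reconstruction rather than the paper's exact code.

```r
# Minimal sketch: Bayesian GLM via arm::bayesglm (assumed package).
# bayesglm() augments maximum-likelihood fitting with weakly informative
# priors on the coefficients, which helps avoid overfitting and separation.
library(arm)       # provides bayesglm()
library(mlbench)   # provides the PimaIndiansDiabetes dataset
data(PimaIndiansDiabetes)

fit_bglm <- bayesglm(diabetes ~ ., family = binomial(link = "logit"),
                     data = PimaIndiansDiabetes)
summary(fit_bglm)  # posterior-mode coefficient estimates
```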
B. Generalized Linear Model
The model consists of three components: a probability distribution from the exponential family, a linear predictor and a link function. The distribution can be any member of the family, such as the gamma or Poisson distribution, and acts as the error model. The expected value of the outcome depends on the linear predictor through the link function; the model thus captures the relationship between a response variable and one or more predictors. The coefficients are estimated using maximum likelihood or Bayesian techniques.
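As an illustration of the three components, here is a minimal sketch using base R's glm(); the choice of predictors is an arbitrary example, not the subset used in the paper.

```r
# Minimal sketch: the three GLM components in base R.
# Distribution: binomial (exponential family); linear predictor: the
# right-hand side of the formula; link function: the logit.
library(mlbench)
data(PimaIndiansDiabetes)

fit_glm <- glm(diabetes ~ glucose + mass + age,
               family = binomial(link = "logit"),
               data = PimaIndiansDiabetes)
summary(fit_glm)   # coefficients estimated by maximum likelihood
```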
C. k-Nearest Neighbours
Classification is performed based on the Euclidean distance between a new observation and the training objects, with neighbours optionally weighted by distance. The choice of the input parameter k depends on the data, and the output depends heavily on the input clusters and the boundaries between them. The algorithm is used for both classification and regression; in regression, an inverse-distance weighted average of the k nearest multivariate neighbours is computed.
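A minimal sketch of Euclidean-distance kNN classification follows, using class::knn; the 80/20 split, the standardisation step and k = 5 are illustrative assumptions, not values taken from the paper.

```r
# Minimal sketch: kNN classification with class::knn (assumed package).
library(class)     # provides knn()
library(mlbench)
data(PimaIndiansDiabetes)

set.seed(7)
idx   <- sample(nrow(PimaIndiansDiabetes),
                round(0.8 * nrow(PimaIndiansDiabetes)))
# Standardise predictors so no attribute dominates the Euclidean distance
train <- scale(PimaIndiansDiabetes[idx, -9])
test  <- scale(PimaIndiansDiabetes[-idx, -9],
               center = attr(train, "scaled:center"),
               scale  = attr(train, "scaled:scale"))
pred  <- knn(train, test, cl = PimaIndiansDiabetes$diabetes[idx], k = 5)
table(pred, PimaIndiansDiabetes$diabetes[-idx])   # confusion matrix
```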
D. Partial Least Squares
Here both the predicted output variables and the observable variables are projected into a new space in order to find a linear regression between the two. The more input variables the dataset has, the more accurate the predicted values will be. We try to maximise the covariance between the input values and the predicted values to obtain better results.
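A minimal sketch using the pls package follows; the numeric 0/1 recoding of the class variable and ncomp = 3 are illustrative assumptions, not choices taken from the paper.

```r
# Minimal sketch: partial least squares regression with pls::plsr
# (assumed package). PLS projects predictors and response into a shared
# latent space so that the covariance between the projections is maximised.
library(pls)
library(mlbench)
data(PimaIndiansDiabetes)

pima <- PimaIndiansDiabetes
pima$diabetes <- as.numeric(pima$diabetes == "pos")  # pos -> 1, neg -> 0

fit_pls <- plsr(diabetes ~ ., ncomp = 3, data = pima, validation = "CV")
summary(fit_pls)   # cross-validated RMSEP per number of latent components
```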
III. IMPLEMENTATION DETAILS
We use R, one of the foremost open source languages for data mining, with RStudio as the IDE. Four R libraries are used: mlbench, caret, e1071 and RWeka. The mlbench library contains built-in datasets, and this paper uses Pima Indians Diabetes, which consists of 9 attributes. We train the models using the repeated cross-validation (repeatedcv) method with 10 folds. The user, system and elapsed times, which together represent the execution time, are measured for each of the four algorithms. Out of the 9 attributes present in the dataset, we use selected attributes to estimate the performance of the algorithms one at a time. We resample the obtained results and display a summary giving performance parameters such as sensitivity, specificity, root mean squared error (RMSE) and R squared for each algorithm. Following is a glimpse of the dataset under consideration [11].
Name of the Database: Pima Indians Diabetes (from the mlbench library)
Number of Attributes: 9
Number of Observations: 768
Attribute Information:
- Number of times pregnant
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- Diastolic blood pressure (mm Hg)
- Triceps skin fold thickness (mm)
- 2-hour serum insulin (mu U/ml)
- Body mass index (weight in kg / (height in m)^2)
- Diabetes pedigree function
- Age (years)
- Class variable (0 or 1)
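A minimal sketch of the training pipeline described in this section follows. It assumes the caret and mlbench packages named above; the seed and the choice of 3 repeats (10 folds x 3 repeats = the 30 resamples reported in the result analysis) are illustrative reconstructions, not the paper's exact code.

```r
# Minimal sketch: repeated 10-fold cross-validation with caret.
library(caret)     # train(), trainControl(), resamples()
library(mlbench)
data(PimaIndiansDiabetes)

# Repeated cross-validation: 10 folds, 3 repeats -> 30 resamples
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

set.seed(7)
fit_bayesglm <- train(diabetes ~ ., data = PimaIndiansDiabetes,
                      method = "bayesglm", trControl = ctrl)
set.seed(7)
fit_glm <- train(diabetes ~ ., data = PimaIndiansDiabetes,
                 method = "glm", trControl = ctrl)
set.seed(7)
fit_knn <- train(diabetes ~ ., data = PimaIndiansDiabetes,
                 method = "knn", trControl = ctrl)
set.seed(7)
fit_pls <- train(diabetes ~ ., data = PimaIndiansDiabetes,
                 method = "pls", trControl = ctrl)

# Pool the cross-validation results and summarise Accuracy and Kappa
results <- resamples(list(BAYESGLM = fit_bayesglm, GLM = fit_glm,
                          KNN = fit_knn, PLS = fit_pls))
summary(results)
```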
IV. RESULT ANALYSIS
From the console output in RStudio, we analyze what the obtained result set means and how to interpret it. For result analysis we use both tabulated results and visual graph plots for better understandability.
Following are the results obtained by running the program with accuracy as the metric.
Metric: Accuracy
Models: BAYESGLM, GLM, KNN, PLS
Number of resamples: 30

Accuracy:
            Min.    1st Qu.  Median  Mean    3rd Qu.  Max.    NA's
BAYESGLM    0.7013  0.7403   0.7778  0.7730  0.8019   0.8442  0
GLM         0.6883  0.7403   0.7792  0.7778  0.8182   0.8961  0
KNN         0.6364  0.6916   0.7386  0.7370  0.7785   0.8442  0
PLS         0.6883  0.7273   0.7532  0.7609  0.7915   0.8442  0
Table 1) Accuracy Result Tabulation

Kappa:
            Min.    1st Qu.  Median  Mean    3rd Qu.  Max.    NA's
BAYESGLM    0.3025  0.4003   0.4797  0.4718  0.5488   0.6393  0
GLM         0.2787  0.4040   0.4913  0.4865  0.5742   0.7552  0
KNN         0.1876  0.3076   0.4042  0.3968  0.5003   0.6393  0
PLS         0.2655  0.3731   0.4278  0.4409  0.5243   0.6457  0
Table 2) Kappa Result Tabulation
We have used five different types of graphs for presenting the results: box and whisker plots, density plots, dot plots, parallel plots and pairwise scatter plots. Following are the graph plots available for visual interpretation of the results for the accuracy metric:
Fig 1) Box and Whisker plot for Accuracy metric
Fig 2) Density plot for accuracy metric
Fig 3) Dot plot for Accuracy metric
Fig 4) Parallel plot for Accuracy metric
Fig 5) Pairwise scatter plot for Accuracy metric
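Assuming the resamples object (results) from the implementation sketch above, all five plot types can be produced with caret's lattice methods; a minimal sketch:

```r
# Minimal sketch: the five resampling plots, given a resamples object.
bwplot(results)        # box and whisker plot
densityplot(results)   # density plot
dotplot(results)       # dot plot
parallelplot(results)  # parallel plot
splom(results)         # pairwise scatter plot matrix
```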
Measuring execution time is also an important performance parameter when training the model with the accuracy metric. We record the user, system and elapsed times (in seconds) reported by R as the measure of execution time.
          User  System  Elapsed
BAYESGLM  1.13  0.03    1.15
GLM       0.94  0.00    0.94
KNN       1.12  0.02    1.14
PLS       1.11  0.00    1.11
Table 3) Execution time for Accuracy metric measurement
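The execution times above can be reproduced with base R's system.time(), which reports user, system and elapsed times in seconds; a minimal sketch, reusing ctrl and the dataset from the implementation sketch:

```r
# Minimal sketch: timing one training run with base R's system.time().
timing <- system.time(
  train(diabetes ~ ., data = PimaIndiansDiabetes,
        method = "glm", trControl = ctrl)
)
print(timing)   # user / system / elapsed, in seconds
```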
Following are the results obtained by running the program for the ROC, sensitivity and specificity metrics:
ROC:
            Min.    1st Qu.  Median  Mean    3rd Qu.  Max.    NA's
BAYESGLM    0.6756  0.7989   0.8319  0.8310  0.8639   0.9504  0
GLM         0.7496  0.8026   0.8245  0.8328  0.8665   0.9081  0
KNN         0.6656  0.7341   0.7804  0.7775  0.8269   0.8765  0
PLS         0.6908  0.7822   0.8078  0.8103  0.8339   0.9148  0
Table 4) ROC Result Tabulation
Sensitivity:
            Min.   1st Qu.  Median  Mean    3rd Qu.  Max.   NA's
BAYESGLM    0.72   0.860    0.89    0.8833  0.92     0.94   0
GLM         0.78   0.860    0.88    0.8813  0.90     0.98   0
KNN         0.72   0.800    0.84    0.8380  0.88     0.98   0
PLS         0.72   0.865    0.88    0.8807  0.92     0.98   0
Table 5) Sensitivity Result Tabulation
Specificity:
            Min.    1st Qu.  Median  Mean    3rd Qu.  Max.    NA's
BAYESGLM    0.3333  0.5185   0.5556  0.5731  0.6296   0.7778  0
GLM         0.4074  0.5185   0.5556  0.5748  0.6296   0.7692  0
KNN         0.3333  0.4815   0.5556  0.5511  0.6296   0.7308  0
PLS         0.2963  0.4815   0.5185  0.5326  0.6097   0.7407  0
Table 6) Specificity Result Tabulation
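In caret, obtaining ROC, sensitivity and specificity summaries of this kind requires class probabilities and a two-class summary function; a minimal sketch under that assumption, shown for GLM only:

```r
# Minimal sketch: training with ROC as the selection metric in caret.
ctrl_roc <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary)

set.seed(7)
fit_glm_roc <- train(diabetes ~ ., data = PimaIndiansDiabetes,
                     method = "glm", metric = "ROC", trControl = ctrl_roc)
fit_glm_roc$results   # ROC, Sens and Spec columns
```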
Visual graph plots for better result analysis of the ROC metric are as follows:
Fig 6) Box and Whisker plot for ROC metric
Fig 7) Density plot for ROC metric
Fig 8) Dot plot for ROC metric
Fig 9) Parallel plot for ROC metric
Fig 10) Pairwise scatter plot for ROC metric
Measuring execution time as a performance parameter is equally important when using ROC as the metric during training.
          User  System  Elapsed
BAYESGLM  1.28  0.00    1.33
GLM       0.94  0.00    0.97
KNN       1.21  0.00    1.20
PLS       1.12  0.00    1.13
Table 7) Execution time for ROC metric measurement
V. CONCLUSION
From examining the result set we come to the following conclusion regarding which of the four chosen algorithms performs best when applied against the Pima Indians Diabetes dataset. As we can infer from Table 3 and Table 7, the average execution times in increasing order are: GLM < PLS < KNN < BAYESGLM. We consider the mean values from the tables for comparison. The mean accuracy of the algorithms from Table 1, in decreasing order, is: GLM > BAYESGLM > PLS > KNN. The mean Kappa values from Table 2, in decreasing order, are: GLM > BAYESGLM > PLS > KNN; the ordering is the same for both the accuracy and Kappa statistics. The mean ROC of the algorithms from Table 4, in decreasing order, is: GLM > BAYESGLM > PLS > KNN. The mean sensitivity from Table 5, in decreasing order, is: BAYESGLM > GLM > PLS > KNN. The mean specificity from Table 6, in decreasing order, is: GLM > BAYESGLM > KNN > PLS. In this paper we do not rank all four algorithms; we only identify the best among them. Based on the tabulated results and comparison statistics, we conclude that for the Pima Indians Diabetes dataset, with diabetes as the target, the best suited algorithm among the four is the Generalised Linear Model (GLM).

VI. FUTURE SCOPE
This paper presents only four algorithms, used particularly for classification and regression purposes. Many other algorithms can similarly be used to measure comparative performance when applied against standard datasets from reputable data sources. Effort can also be made towards other types of machine learning algorithms, such as clustering and predictive algorithms. Open source tools like WEKA or RapidMiner can be used in place of RStudio to obtain results rapidly and avoid coding, although not much flexibility is available without programming.
REFERENCES
[1] Barath Narayanan Narayanan, Ouboti Djaneye-Boundjou and Temesguen M. Kebede, "Performance Analysis of Machine Learning and Pattern Recognition Algorithms for Malware Classification", 2016
[2] M. H. Hesamian, S. Mashohor, M. I. Saripan, W. A. Wan Adnan, B. Hesamian and M. M. Hooshyari, "Performance of Various Training Algorithms on Scene Illumination Classification", 2015 IEEE Student Conference on Research and Development (SCOReD)
[3] Alisa A. Vorobeva, "Examining the Performance of Classification Algorithms for Imbalanced Data Sets in Web Author Identification", Proceedings of the 18th Conference of FRUCT Association
[4] A. Swarupa Rani and S. Jyothi, "Performance Analysis of Classification Algorithms under Different Datasets", 2016 International Conference on Computing for Sustainable Global Development (INDIACom)
[5] Neelam Singhal and Mohd. Ashraf, "Performance Enhancement of Classification Scheme in Data Mining using Hybrid Algorithm", International Conference on Computing, Communication and Automation (ICCCA 2015)
[6] K. Dharmaraajan and M. A. Dorairangaswamy, "Analysis of FP-Growth and Apriori Algorithms on Pattern Discovery from Weblog Data", 2016 IEEE International Conference on Advances in Computer Applications (ICACA)
[7] Shubhangi D. Patill, Dr. Ratnadeep R. Deshmukh and D. K. Kirange, "Adaptive Apriori Algorithm for Frequent Itemset Mining", Proceedings of SMART-2016 (IEEE Conference ID: 39669), 5th International Conference on System Modeling & Advancement in Research Trends
[8] Da-Qi Ren, Da Zheng, Guowei Huang, Shujie Zhang and Zane Wei, "Parallel Set Determination and K-means Clustering for Data Mining on Telecommunication Networks", 2013 IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing
[9] Amirah Mohamed Shahiria, Wahidah Husaina and Nuraini Abdul Rashida, "A Review on Predicting Students Performance using Data Mining Techniques", The Third Information Systems International Conference, 2015
[10] Muhammad Arif, Khubaib Amjad Alam and Mehdi Hussain, "Application of Data Mining Using Artificial Neural Network: Survey", International Journal of Database Theory and Application, Vol. 8, No. 1 (2015), pp. 245-270
[11] Mr. Chintan Shah and Dr. Anjali G. Jivani, "Comparison of Data Mining Classification Algorithms for Breast Cancer Prediction", 4th ICCCNT
[12] Brijesh Kumar Bhardwaj and Saurabh Pal, "Data Mining: A Prediction for Performance Improvement using Classification", (IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 4
[13] Da-Qi Ren, Da Zheng, Guowei Huang, Shujie Zhang and Zane Wei, "Parallel Set Determination and K-means Clustering for Data Mining on Telecommunication Networks", 2013 IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing
[14] Gurpreet Singh, Jaskaranjit Kaur and MD. Yusuf Mulge, "Performance Evaluation of Enhanced Hierarchical and Partitioning Based Clustering Algorithm (EPBCA) in Data Mining", 2015 International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT)