Exploring Data Mining Classification Techniques

Manish Kumar Shrivastava; Praveen Chouksey; Rohit Miri

doi:10.17577/IJERTV2IS60878

Volume 02, Issue 06 (June 2013)


Open Access
Paper ID : IJERTV2IS60878
Published (First Online): 24-06-2013
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


Manish Kumar Shrivastava

M.Tech. C.S.E. Scholar

Department of Computer Science & Engineering Dr. C.V. Raman Institute of Science & Technology Bilaspur, India

Praveen Chouksey

Assistant Professor

Department of Computer Science & Engineering Dr. C.V. Raman Institute of Science & Technology Bilaspur, India

Rohit Miri

Assistant Professor

Department of Computer Science & Engineering Dr. C.V. Raman Institute of Science & Technology Bilaspur, India

Abstract

Data mining is an analytical process of discovering interesting patterns in large amounts of data. Classification is one of its major tasks: it maps data into predefined groups or classes, which is why it is often referred to as supervised learning. This paper discusses a few data mining classification techniques and algorithms. Three classification techniques, artificial neural network (ANN), support vector machine (SVM) and decision tree (DT), are applied to three datasets obtained from the UCI repository: the Vote dataset, the Breast-cancer (Wisconsin) dataset and the KDD dataset (intrusion detection).

Introduction

Data mining is the process of extracting useful information and patterns from huge volumes of data. It is also called knowledge discovery, knowledge mining from data, knowledge extraction or data/pattern analysis. Data mining is a logical process used to search through large amounts of data in order to find useful data. The goal is to find patterns that were previously unknown. Once found, these patterns can be used by organizations to make decisions for the development of their businesses [1].

Classification in data mining is a form of data analysis that can be used to extract models describing important data classes or to predict future data trends (Han & Kamber, 2006). The classification process has two phases. In the first phase, learning, the training data are analyzed by the classification algorithm, and the learned model or classifier is represented in the form of classification rules. In the second phase, classification, test data are used to estimate the accuracy of the model. If the accuracy is considered acceptable, the rules can be applied to the classification of new data.

The classification techniques used in this research work are described below.

Multilayer Perceptron

Multilayer Perceptron (MLP) network models are popular network architectures used in research applications in medicine, engineering, mathematical modeling, etc. In an MLP, the weighted sum of the inputs and the bias term is passed through a transfer function to produce the output, and the units are arranged in a layered feed-forward topology called a Feed Forward Neural Network (FFNN). The schematic representation of an FFNN with n inputs, m hidden units and one output unit, along with the bias terms of the input unit and hidden unit, is given in Figure 1 [5].

Figure 1. Feed forward neural network.
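
The forward pass described above can be illustrated with a minimal Python sketch. It assumes a logistic (sigmoid) transfer function, which the text does not specify, and all weights and biases below are hypothetical values chosen only for illustration.

```python
import math

def sigmoid(z):
    """Logistic transfer function (an assumption; the paper names no specific function)."""
    return 1.0 / (1.0 + math.exp(-z))

def ffnn_forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a feed-forward network with one hidden layer and one output unit.

    x              : list of n input values
    hidden_weights : m lists of n weights, one list per hidden unit
    hidden_biases  : m bias terms, one per hidden unit
    output_weights : m weights connecting the hidden units to the output unit
    output_bias    : bias term of the output unit
    """
    # Each hidden unit applies the transfer function to its weighted sum plus bias.
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # The output unit does the same with the hidden activations as its inputs.
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)

# Hypothetical weights for n = 2 inputs and m = 2 hidden units
y = ffnn_forward(
    x=[1.0, 0.0],
    hidden_weights=[[0.5, -0.5], [0.25, 0.75]],
    hidden_biases=[0.0, 0.1],
    output_weights=[1.0, -1.0],
    output_bias=0.0,
)
print(round(y, 4))
```

Training (for example by backpropagation) would adjust these weights; the sketch covers only how an already-trained network maps inputs to an output.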

Decision Trees (DTs)

A decision tree is a tree in which each non-terminal node represents a test or decision on the data item under consideration. The choice of branch depends on the outcome of the test. To classify a particular data item, we start at the root node and follow the assertions down until we reach a terminal node (or leaf), at which point a decision is made. Decision trees can also be interpreted as a special form of rule set, characterized by the hierarchical organization of the rules. The J48 decision tree in WEKA is based on the C4.5 decision tree algorithm, which supports multi-way splits. C4.5 yields a binary split if the selected attribute is numerical; if the attribute is categorical, the node is split into C nodes, where C is the number of categories of that attribute.
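
The traversal described above can be sketched in Python as follows. The tree, its attribute names and its labels are hypothetical and hand-written for illustration; they are not learned by C4.5 or J48, and the dictionary layout is only one possible representation.

```python
def classify(node, item):
    """Follow tests from the root until a leaf (terminal node) is reached."""
    while "label" not in node:
        value = item[node["attribute"]]
        if "threshold" in node:
            # Numerical attribute: binary split on a threshold
            branch = "le" if value <= node["threshold"] else "gt"
        else:
            # Categorical attribute: one branch per category
            branch = value
        node = node["branches"][branch]
    return node["label"]

# Hypothetical tree: a three-way categorical split at the root,
# followed by a binary numerical split under one of its branches.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "threshold": 75,
            "branches": {"le": {"label": "play"}, "gt": {"label": "stay"}},
        },
        "overcast": {"label": "play"},
        "rainy": {"label": "stay"},
    },
}

print(classify(tree, {"outlook": "sunny", "humidity": 70}))  # prints "play"
```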

Support Vector Machine (SVM)

Support vector machine (SVM) is an algorithm that attempts to find a linear separator (hyper-plane) between the data points of two classes in multidimensional space. SVMs are well suited to dealing with interactions among features and redundant features.
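
As a small illustration of how a linear separator assigns classes, the Python sketch below evaluates which side of a hyperplane a point lies on. The weight vector and bias are hypothetical; finding them by margin maximization, which is the actual SVM training step, is not shown.

```python
def svm_decision(x, w, b):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def svm_predict(x, w, b):
    """Assign class +1 on or above the hyperplane, -1 below it."""
    return 1 if svm_decision(x, w, b) >= 0 else -1

# Hypothetical hyperplane x1 + x2 - 3 = 0 in two dimensions
w, b = [1.0, 1.0], -3.0
print(svm_predict([2.0, 2.0], w, b))  # prints 1
print(svm_predict([0.5, 1.0], w, b))  # prints -1
```
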
Related Work

Many researchers have worked in different domains to design and develop classification models using data mining techniques.
1. Soltani Sarvestani et al. [2,3] compared the capabilities of various neural networks, namely Multilayer Perceptron (MLP), Self Organizing Map (SOM), Radial Basis Function (RBF) and Probabilistic Neural Network (PNN), for classifying WBC and NHBCD data. The performance of these neural network structures was investigated for the breast cancer diagnosis problem.
2. Dr. Medhat Mohamed Ahmed Abdelaal et al. [2,4] investigated the capability of classification SVM with Tree Boost and Tree Forest in analyzing the DDSM dataset, extracting the mammographic mass features that, along with age, discriminate true and false cases.
3. J. Padmavati [5] performed a comparative study on the WBC dataset for breast cancer prediction using RBF and MLP along with logistic regression. Logistic regression was performed using the SPSS package, and the MLP and RBF models were constructed using MATLAB. The neural networks took slightly more time than logistic regression, but the sensitivity and specificity of both neural network models showed better predictive power than logistic regression. Comparing the RBF and MLP neural network models, RBF had good predictive capabilities and also took less time than MLP.
4. Heba Ezzat Ibrahim et al. [6,7] proposed a multi-layer intrusion detection model. Their experimental results showed that the proposed multi-layer model using the C5 decision tree, with feature selection by Gain Ratio, achieves higher classification accuracy and a lower false alarm rate than MLP and naïve Bayes. Using Gain Ratio significantly enhances the accuracy on U2R and R2L for all three machine learning techniques (C5, MLP and naïve Bayes).

System Implementation

The proposed research work introduces a framework for developing a classifier based on data mining techniques. Another objective is to perform cross validation of the frameworks designed for different categories of data. In this framework, a dataset is passed to a pre-processing stage and then classified by the selected classifier. The machine learning tool WEKA is used to analyze performance on the datasets. The approach involves three major steps:

Data Pre-processing:
- Data preparation (load data), e.g.
  - Vote data set
  - Breast-cancer data set
  - KDD data set (for intrusion detection)
- Feature reduction (attribute analysis) if needed
Data Mining: Classify datasets
- Select classifier, e.g.
  - MLP
  - SVM
  - DT (J48)
Data Post-processing:
- Result interpretation

System Architecture

FIG [A]: Classification in Data Mining

FIG [B]: External Architecture of Cross Validation

Experimental Methodology

The experimental methodology followed in this research covers the data sets and the classification techniques, which are described below.

Data Description

The data sets required for the experiments were collected from the following source:

- UCI data repository
  - Vote dataset
  - KDD dataset
  - Breast cancer dataset

The data sets used in the experiments were downloaded from the University of California, Irvine (UCI) repository (web source http://www.archive.ics.uci.edu/ml/datasets.html). The three data sets belong to different domains. The Vote data set has 435 instances, of which 236 belong to the no category and 187 to the yes category, with 17 features (attributes). The Breast Cancer data set has 699 instances, of which 458 are benign and 241 malignant, with 11 features. The KDD data set (for intrusion detection) has 2519 instances, of which 1338 are normal and 1181 anomaly, with 42 features. The details of the data sets are shown in Table 1.

Table 1: Datasets Description
Data Set Name	No. of Instances	No. of Class	Name of Classes
Breast-Cancer (Wisconsin)	699	2	Benign, Malignant
Vote	435	2	N(no), Y(yes)
KDD data set	2519	2	Normal, Anomaly

Weka as a Data Miner Tool

In this paper we have used WEKA, a data mining tool, to apply classification techniques and find interesting patterns in the selected datasets. The software provides the required data mining functions and methodologies. Data prepared in MS Excel or ARFF format can be used with WEKA. WEKA, which stands for the Waikato Environment for Knowledge Analysis, was developed at the University of Waikato in New Zealand. The system is written in Java, an object-oriented programming language that is widely available for all major computer platforms, and WEKA has been tested under Linux, Windows, and Macintosh operating systems. Java allows WEKA to provide a uniform interface to many different learning algorithms, along with methods for pre- and post-processing and for evaluating the results of learning schemes on any given dataset. WEKA expects the data fed into it to be in ARFF (Attribute-Relation File Format) [8].
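
To make the ARFF layout concrete, the following Python sketch renders a tiny data set as ARFF text: a relation name, one attribute declaration per column, and comma-separated data rows. The helper `to_arff`, the relation name and the attribute names are hypothetical and serve only as an illustration; this is not how the paper's data sets were prepared.

```python
def to_arff(relation, attributes, rows):
    """Render a data set as ARFF text.

    attributes : list of (name, type) pairs, where type is "numeric"
                 or a list of nominal values
    rows       : list of value lists, in attribute order
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, list):
            # Nominal attributes list their allowed values in braces.
            kind = "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines) + "\n"

# Hypothetical two-attribute data set with a nominal class attribute
arff_text = to_arff(
    "toy",
    [("clump_thickness", "numeric"), ("class", ["benign", "malignant"])],
    [[5, "benign"], [8, "malignant"]],
)
print(arff_text)
```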

Classification in WEKA

Classification in WEKA is based on supervised algorithms applied to the input data. The Classify tab lists the available machine learning tools and shows exactly how the data are classified. Some of these tools wrap a classification algorithm and run it multiple times, manipulating algorithm parameters or input data weights to increase the accuracy of the classifier. WEKA includes two ways of evaluating learning performance: the first simply splits a dataset into training and test data, while the second performs cross validation using folds. Evaluation is usually described by accuracy. The run information is also displayed for quick inspection of how well a classifier works.
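
Cross validation using folds can be sketched in Python as below. This is a plain, unstratified split written for illustration, not WEKA's own implementation; the choice of 10 folds and the function name `k_fold_indices` are assumptions, since the paper does not state the number of folds used.

```python
import random

def k_fold_indices(n_instances, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        # Fold i is held out for testing; the other k - 1 folds form the training set.
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# Assumed 10 folds over 699 instances (the size of the Breast-Cancer data set)
splits = list(k_fold_indices(699, 10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each instance appears in exactly one test fold, so every instance is used for testing once and for training k - 1 times; the reported accuracy is then aggregated over the k runs.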

Learning Algorithms

This paper uses three supervised machine learning algorithms available in the WEKA data mining tool:

MLP
SVM
J48 (C4.5)

Model Evaluation [9]

Based on the data mining techniques explained above, all the developed models are evaluated in terms of the following error measures.

Accuracy: the percentage of samples that are classified correctly. It is calculated as follows:

Accuracy = (TP + TN) / (P + N)    (1)

Sensitivity: also known as the true positive rate (TPR). It is calculated as follows:

Sensitivity = TP / (TP + FN)    (2)

Specificity: also known as the true negative rate (TNR). It is calculated as follows:

Specificity = TN / (TN + FP)    (3)

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives respectively, P = TP + FN is the number of actual positive samples and N = TN + FP is the number of actual negative samples.
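
Equations (1) to (3) translate directly into code. The Python sketch below uses hypothetical confusion-matrix counts chosen only for illustration; they are not taken from the experiments in this paper.

```python
def error_measures(tp, fn, fp, tn):
    """Return accuracy, sensitivity and specificity as percentages."""
    p = tp + fn  # all actual positives
    n = tn + fp  # all actual negatives
    accuracy = 100.0 * (tp + tn) / (p + n)    # equation (1)
    sensitivity = 100.0 * tp / (tp + fn)      # equation (2)
    specificity = 100.0 * tn / (tn + fp)      # equation (3)
    return accuracy, sensitivity, specificity

# Hypothetical counts, chosen only for illustration
acc, sens, spec = error_measures(tp=40, fn=10, fp=5, tn=45)
print(f"{acc:.1f} {sens:.1f} {spec:.1f}")  # prints "85.0 80.0 90.0"
```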

Results

Breast Cancer diagnosis Experiment result:

Table 2 : Confusion Matrix for various predictive models
Predictive Model	Target class	Classified as Benign	Classified as Malignant
Multilayer Perceptron	Benign	440	18
Multilayer Perceptron	Malignant	15	226
Support Vector Machine	Benign	446	12
Support Vector Machine	Malignant	9	232
Decision Tree	Benign	438	19
Decision Tree	Malignant	17	224

Table 3 : Error measures of various predictive models
Predictive Model	Accuracy	Sensitivity	Specificity
Multilayer Perceptron	95.3	96.7	92.6
Support Vector Machine	97.0	98.0	95.1
Decision Tree	94.6	96.2	92.2

Figure: Performance Evaluation of Models applied on Breast-Cancer data set (bar chart of Accuracy, Sensitivity and Specificity for MLP, SVM and C4.5)

Vote Prediction Experiment result:

Table 4 : Confusion Matrix for various predictive models
Predictive Model	Target class	Classified as n	Classified as y
Multilayer Perceptron	n	254	13
Multilayer Perceptron	y	10	158
Support Vector Machine	n	257	10
Support Vector Machine	y	7	161
Decision Tree	n	259	8
Decision Tree	y	8	160

Figure: Performance Evaluation of Models applied on Vote Prediction data set (bar chart of Accuracy, Sensitivity and Specificity for MLP, SVM and C4.5)

Intrusion Detection (KDD dataset) Experiment result:

Table 6 : Confusion Matrix for various predictive models
Predictive Model	Target class	Classified as Normal	Classified as Anomaly
Multilayer Perceptron	Normal	1319	19
Multilayer Perceptron	Anomaly	38	1143
Support Vector Machine	Normal	1318	20
Support Vector Machine	Anomaly	58	1123
Decision Tree	Normal	1323	15
Decision Tree	Anomaly	16	1165

Table 7 : Error measures of various predictive models
Predictive Model	Accuracy	Sensitivity	Specificity
Multilayer Perceptron	97.7	97.2	98.4
Support Vector Machine	96.	95.8	98.3
Decision Tree	98.8	98.8	98.7

Figure: Performance Evaluation of Models applied on Intrusion Detection data set (KDD data set) (bar chart of Accuracy, Sensitivity and Specificity for MLP, SVM and C4.5)

Table 5 : Error measures of various predictive models (Vote Prediction data set)
Predictive Model	Accuracy	Sensitivity	Specificity
Multilayer Perceptron	94.7	96.2	92.4
Support Vector Machine	96.1	97.3	94.2
Decision Tree	96.3	97.0	95.2

Conclusion

Data mining is important for finding patterns, forecasting and discovering knowledge in different business domains. The predictive models were evaluated as discussed in the model evaluation section using equations 1, 2 and 3, and the calculated results are presented in Tables 3, 5 and 7 in terms of accuracy, sensitivity and specificity. Table 3 shows that SVM achieves the highest accuracy of the three techniques in breast cancer diagnosis, whereas Tables 5 and 7 show that the decision tree technique (J48) achieves the highest accuracy on the Vote and KDD data sets. The comparative bar charts show the error measures of all classifiers on each data set.

References

[1] Bharati M. Ramageri, Indian Journal of Computer Science and Engineering, Vol. 1, No. 4, pp. 301-305.
[2] Shelly Gupta et al., Indian Journal of Computer Science and Engineering (IJCSE).
[3] Sarvestan Soltani A., Safavi A. A., Parandeh M. N. and Salehi M., Predicting Breast Cancer Survivability using data mining techniques, Software Technology and Engineering (ICSTE), 2nd International Conference, 2010, vol. 2.
[4] Abdelaal Ahmed Mohamed Medhat and Farouq Wael Muhamed, Using data mining for assessing diagnosis of breast cancer, in Proc. International Multiconference on Computer Science and Information Technology, 2010, pp. 11-17.
[5] Padmavati J., A Comparative study on Breast Cancer Prediction Using RBF and MLP, International Journal of Scientific & Engineering Research, vol. 2, Jan. 2011.
[6] Poonam Gupta, International Journal of Engineering Research & Technology (IJERT), Vol. 2, Issue 5, May 2013, ISSN: 2278-0181.
[7] Heba Ezzat Ibrahim, Sherif M. Badr, Mohamed A. Shaheen, Adaptive Layered Approach using Machine Learning Techniques with Gain Ratio for Intrusion Detection Systems, International Journal of Computer Applications (0975-8887), Volume 56, No. 7, October 2012.
[8] T. Balasubramanian, European Journal of Scientific Research, ISSN 1450-216X, Vol. 78, No. 3 (2012), pp. 384-394.
[9] H.S. Hota, International Journal of Emerging Science and Engineering (IJESE), ISSN: 2319-6378, Volume-1, Issue-3, January 2013.


Manish Kumar Shrivastava is pursuing an M.Tech (CSE) in the Department of CSE at Dr. C.V. Raman University, Bilaspur. He received his M.C.A. from Gurughasidas University, Bilaspur, Chhattisgarh. His areas of interest include Data Mining and Neural Networks.

Praveen Chouksey is currently an Assistant Professor in the Department of CSE, Dr. C.V. Raman University, Bilaspur. His areas of interest include Data Mining and Neural Networks.

Rohit Miri is currently an Assistant Professor in the Department of CSE and is pursuing a Ph.D. at Dr. C.V. Raman University, Bilaspur, Chhattisgarh, India. He received his M.Tech (CSE) from GEC Pune and his BE (CSE) from GEC, Raipur. His area of interest includes Application of Soft Computing.


