- Open Access
- Total Downloads : 650
- Authors : B. Subashini, Dr. K. Chitra
- Paper ID : IJERTV2IS80393
- Volume & Issue : Volume 02, Issue 08 (August 2013)
- Published (First Online): 16-08-2013
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Enhanced System for Revealing Fraudulence in Credit Card Approval
B. Subashini 1, Dr. K. Chitra 2
1Assistant Professor, Department of Computer Science, V.V.Vanniaperumal College for Women, Virudhunagar. 2Assistant Professor, Department of Computer Science, Government Arts College, Melur, Madurai.
Abstract
In developing countries like India, Bankers face more problems with the fraudsters. Data mining techniques are more useful to build a successful predictive model and visualize the report into meaningful information to the user. This research paper aims to enhance and evaluate the fraudulence in credit card approval process using the classification models based on decision trees (C5.0 & CART), Support Vector Machine (SVM) using SMO, BayesNet and Logistic Regression. Five methods to detect fraud are presented. Automatic credit card approval is the most significant process in the banking sector and financial institutions. This enhanced system prevents the fraud which is going to happen. So this paper proposes a good solution to the credit card approval using the above methods.
Keywords:- Credit card approval, Fraud, Data Mining, Classification, SVM, Logistic Regression
-
Introduction
Credit card fraud falls broadly into two categories: behavioral fraud and application fraud. Application fraud occurs when individuals obtain new credit cards from issuing companies using false personal information and then spend as much as possible in a short span of time [5]. In a move to curtail rising credit card frauds, the Reserve Bank of India has asked banks to bar international usage of debit and credit cards unless customers specifically ask for this feature. Banks have also been asked to enable blocking of cards through a text message request [6]. So nowadays, credit approval is the tremendous problem in the banking sector. Automatic credit approval is the process of granting credits or loans to customers. Prevention is better than cure. Fraud prevention is the proactive mechanism with the goal of disabling the occurrence of fraud.
This paper enhances the credit card approval process in the banking sector by using and comparing the classification methods such as decision trees, Support Vector Machine (SVM) and Logistic Regression. It depends on the performance metrics such as performance and accuracy.
The rest of the paper is organized as follows: Section 2 describes the classification methods which are used to apply in the credit card approval dataset and performance metrics. Section 3 presents experiment setup and results from various classification methods. Section 4 analyzes the performance of the classification methods results. Section 5 concludes this work.
-
Data Mining
Data mining is the analysis step of knowledge discovery in databases. Data mining is a powerful new technology with great potential to help companies focus on the most important information in the data they have collected about the behavior of their customers and potential customers. Data mining derives its name from the similarities between searching for valuable information in a large database and mining a mountain for a vein of valuable ore. Specific uses of data mining include: Market segmentation, Customer churn, Fraud detection, Direct Marketing, Interactive marketing, Market basket analysis, Trend analysis [1] [2] [3] [4]. Three steps involved in the data mining process are Exploration, Pattern identification, Deployment. Various algorithms and techniques like Classification, Clustering, Regression, Artificial Intelligence, Neural Networks, Association Rules, Decision Trees, Genetic Algorithm, Nearest Neighbor method etc., are used for knowledge discovery from databases [7]. In this paper, we enhance the credit card approval process to prevent fraud in the banking sector using classification methods.
-
Classification Methods
Classification is perhaps the most familiar and most popular data mining technique. Estimation and prediction may be viewed as types of classification. There are more classification methods such as statistical based, distance based, decision tree based, neural network based, rule based [8]. In this paper, we detect fraud using the classification algorithms C5.0, BayesNet, Classification Via Regression Trees (CART), Support Vector Machine using Sequential Minimal Optimization (SMO), Logistic Regression and we analyze their performance in fraud detection in the banking sector. The performance analysis is done on the basis of the following performance metrics:
-
Classified Instances – The importance performance measure is correctly classified instances and incorrectly classified instances.
-
ROC ROC (Relative operating characteristic) curve shows the relationship between false and true positive.
-
Confusion Matrix The confusion matrix illustrates the accuracy of the solution to a classification problem. The rows represent the actual classification and the
columns the predicted classification. In this matrix Good classification denotes the Legitimate customer and Bad classification denotes the Fraud customer.
-
-
-
Experiments and Results
For the experimental work in this paper, a dataset of credit card applications and approval decisions, Credit Card Approval, from UCI Repository of Machine Learning Databases and Domain Theories, was used. The dataset was used to detect fraudulent customer during their credit card approval process and to induce classification models for assessing credit card applications. The dataset has 20 Attributes (7 numerical, 13 categorical) plus the class label attribute. The dataset is interesting because there is a good mix of attributes: numerical, categorical with meaningful values. There are 1000 instances in this dataset.
The tests were made using the software Weka (Waikato Environment for Knowledge Analysis) which contains lot of classification algorithms. Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand [11].
-
C5.0
C5.0 builds decision trees from a set of training data in the same way as ID3, using the concept of Information entropy. The training data is a set S=S1,S2,.. of already classified samples. Each sample Si consists of a p-dimensional vector (x1,i,x2,i, , xp,i), where the xj represent attributes or features of the sample, as well as the class in which si falls. At each node of the tree, C5.0 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy). The attribute with the highest normalized information gain is chosen to make the decision. The C5.0 algorithm then recurses on the smaller sublists. Gain is computed to estimate the gain produced by a split over an attribute. The gain of information is used to create small decision trees that can identify the answers with a few questions [9].
The following figure.1 illustrates the knowledge flow analysis using cross validation fold 10, and uses J48 algorithm for classification.
Figure 1. Knowledge Flow Analysis using J48 Algorithm
For the application in a decision tree, the algorithm used was J48, which is Java implementation of C5.0. Here, the training algorithm took only 0.02 seconds to classify 1000 instances. The classifier output for C5.0 is follows.
=== Evaluation result === Scheme: J48
Otions: -C 0.25 -M 2
Correctly Classified Instances 724 72.4 %
Incorrectly Classified Instances 276 27.6 %
Kappa statistic 0.2988 Mean absolute error 0.3389 Root mean squared error 0.4306
Relative absolute error 80.5279 % Root relative squared error 93.8224 % Coverage of cases (0.95 level) 99.1 % Mean rel. region size (0.95 level) 98.2 % Total Number of Instances 1000
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area
Class
0.850
0.570
0.777
0.850
0.812
0.303
0.700
0.794
good
0.430
0.150
0.551
0.430
0.483
0.303
0.700
0.536
bad
Weighted
Avg.
0.724
0.444
0.709
0.724
0.713
0.303
0.700
0.717
=== Confusion Matrix === a b <– classified as
595 105 | a = good
171 129 | b = bad
-
CART
A CART tree is a binary decision tree that is constructed by splitting a node into two child nodes repeatedly, beginning with the root node that contains the whole learning sample. Used by the CART (classification and regression tree) algorithm, Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. Gini impurity can be computed by summing the probability of each item being chosen times the probability of a mistake in categorizing that item. It reaches its minimum (zero) when all cases in the node fall into a single target category.
=== Evaluation result ===
Scheme: ClassificationViaRegression
Correctly Classified Instances 741 74.1 %
Incorrectly Classified Instances 259 25.9 %
Kappa statistic 0.297
Mean absolute error 0.3447 Root mean squared error 0.4186
Relative absolute error 81.8997 % Root relative squared error 91.2062 % Coverage of cases (0.95 level) 99.1 % Mean rel. region size (0.95 level) 94.35 %
Total Number of Instances 1000
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area
Class
0.907
0.647
0.766
0.907
0.831
0.317
0.752
0.866
good
0.353
0.093
0.620
0.353
0.450
0.317
0.752
0.563
bad
Weighted
Avg.
0.741
0.481
0.722
0.741
0.716
0.317
0.752
0.775
=== Confusion Matrix === a b <– classified as
635 65 | a = good
194 106 | b = bad
-
Support Vector Machine(SVM)
In machine learning, the polynomial kernel is a kernel function commonly used with support vector machines (SVMs) and other kernelized models, that represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables. For degree-d polynomials, the polynomial kernel is defined as K(x,y)= (xT y+c)d where x and y
are vectors in the input space, i.e. vectors of features computed from training or test samples, c >0 is a constant trading off the influence of higher-order versus lower-order terms in the polynomial. In this paper SVM is implemented using Sequential Minimal Optimization (SMO).
=== Evaluation result === Scheme: SMO
Correctly Classified Instances 724 72.4 %
Incorrectly Classified Instances 276 27.6 %
Kappa statistic
0.2032
Mean absolute error
0.276
Root mean squared error
0.5254
Relative absolute error
65.5855 %
Root relative squared error 114.4562 % Coverage of cases (0.95 level) 72.4 % Mean rel. region size (0.95 level) 50 % Total Number of Instances 1000
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area
Class
0.934
0.767
0.740
0.934
0.826
0.240
0.584
0.737
good
0.233
0.066
0.603
0.233
0.337
0.240
0.584
0.371
bad
Weighted
Avg.
0.724
0.556
0.699
0.724
0.679
0.240
0.584
0.627
=== Confusion Matrix === a b <– classified as
654 46 | a = good
230 70 | b = bad
-
Logistic Regression
Logistic regression or logit regression is a type of regression analysis used for predicting the outcome of a categorical dependent variable based on one or more predictor variables.
Instead of fitting the data to a straight line, logistic regression uses a logistic curve. The formula for a univariate logistic curve is
To perform the logarithmic function can be applied to obtain the logistic function
+
Logistic regression is simple, easy to implement, and provide good performance on a wide variety of problems [10].
=== Evaluation result === Scheme: Logistic
Correctly Classified Instances 731 73.1 %
Incorrectly Classified Instances 269 26.9 %
Kappa statistic 0.2807 Mean absolute error 0.3406 Root mean squared error 0.4178
Relative absolute error 80.9379 % Root relative squared error 91.0334 % Coverage of cases (0.95 level) 99.9 % Mean rel. region size (0.95 level) 97.55 % Total Number of Instances 1000
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.890
0.640
0.764
0.890 0.822
0.295
0.755
0.870
good
0.360
0.110
0.584
0.360 0.445
0.295
0.755
0.564
bad
Weighted
0.731
0.481 0.710
0.731 0.709
0.295
0.755
0.778
Avg.
=== Confusion Matrix === a b <– classified as
623 77 | a = good
192 108 | b = bad
-
BayesNet
Bayes Nets or Bayesian networks are graphical representation for probabilistic relationships among a set of random variables. Given a finite set X = { X1, , Xn} of discrete random variables where each variable Xi may take values from a finite set, denoted by Val(Xi). A Bayesian network is an annotated directed acyclic graph (DAG) G that encodes a joint probability distribution over X. The nodes of the graph correspond to the random variables X1, , Xn. The links of the graph correspond to the direct influence from one
variable to the other. If there is a directed link from vaiable Xi to variable Xj, variable Xi will be a parent of variable Xj. Each node is annotated with a conditional probability distribution (CPD) that represents p(Xi | Pa(Xi)), where Pa(Xi) denotes the parents of Xi in G. The pair (G, CPD) encodes the joint distribution p(X1, , Xn). A unique joint probability distribution over X from G is factorized as:
=== Evaluation result ===
Scheme: BayesNet
Correctly Classified Instances 724
72.4
%
Incorrectly Classified Instances 276
27.6
%
Kappa statistic 0.2857
Mean absolute error 0.3386
Root mean squared error 0.424
Relative absolute error 80.4581 % Root relative squared error 92.3654 % Coverage of cases (0.95 level) 99.4 % Mean rel. region size (0.95 level) 94.75 % Total Number of Instances 1000
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area
Class
0.863
0.600
0.770
0.863
0.814
0.293
0.748
0.868
good
0.400
0.137
0.556
0.400
0.465
0.293
0.748
0.535
bad
Weighted
Avg.
0.724
0.461
0.706
0.724
0.709
0.293
0.748
0.768
=== Confusion Matrix ===
a b <– classified as 604 96 | a = good
180 120 | b = bad
-
-
Performance Analysis
The chosen methods to build classifier models are C5.0, CART from decision tree methods, SVMs using SMO algorithm with kernels of polynomial functions, Logistic Regression and BayesNet. All these methods are used to detect fraud in the banking sector using Credit Card Fraud data set. The Table 1 summarizes the classification values and success rate for all five classification methods C5.0, BayesNet, Classification via Regression Trees (CART), Support Vector Machine using Sequential Minimal Optimization (SMO), Logistic Regression and we analyze their performance in fraud detection in the banking sector. Here, Good denotes the Legitimate User and Bad denotes the Fraud User.
C5.0 using J48, SVM using SMO, BayesNet are giving the success rate of 72.4% whereas the Bad Good classification is more in SVM using SMO. Because classifying a Bad customer as Good is worse than classifying a Good customer as Bad. Logistic Regression method is providing 73.1% success rate and CART gets the highest success rate 74.1%. Hence it is shown here that depending upon the success rate CART outperforms the other models whereas considering the Bad Good classification J48 shows better performance. Any financial institution needs to retain its customers. A legitimate customer must not be classified as Fraud or a Fraud must not be classified as Legitimate. Hence It is understood from this work that while making decisions on classifying customers combination of different classification models need to be used to make correct decision about a customer.
Table 1. Performance Analysis of Classification Methods
J48 |
Logistic |
SMO |
BayesNet |
CART |
|
Good-->Good |
595 |
623 |
654 |
604 |
635 |
Bad-->Bad |
129 |
108 |
70 |
120 |
106 |
Good-->Bad |
105 |
77 |
46 |
96 |
65 |
Bad-->Good |
171 |
192 |
230 |
180 |
194 |
Success Rate |
724 |
731 |
724 |
724 |
741 |
Failure Rate |
276 |
269 |
276 |
276 |
259 |
Success % |
72.4 |
73.1 |
72.4 |
72.4 |
74.1 |
Failure % |
27.6 |
26.9 |
27.6 |
27.6 |
25.9 |
700
600
Legitimate User
Legitimate User
500
400
300
200
100
0
Legitimate User Classification
J48 Logistic SMO BayesNet CART
Classification Methods
Figure 2. Legitimate User Classification
Figure 2 shows the classification of legitimate users using C5.0 using J48, Logistic Regression, SVM using SMO, BayesNet and CART. With a given credit card data set, SMO has classified 654 legitimate users as legitimate users. Next higher classification is given by the CART.
140
120
100
Fraud User
Fraud User
80
60
40
20
0
Fraud User Classification
J48 Logistic SMO BayesNet CART
Classification Methods
Figure 3. Fraud User Classification
Figure 3 shows the classification of fraud users using C5.0 using J48, Logistic Regression, SVM using SMO, BayesNet and CART. With a given credit card data set, J48
gives lesser classification of Bad Good (Fraud user has been classified as Legitimate user). Next lesser classification is given by the BayesNet.
750
740
Success Rate
Success Rate
730
720
710
700
Success Rate of Classification Methods
J48 Logistic SMO BayesNet CART
Classification Methods
Figure 4. Success Rate of Classification Methods
Logistic Regression method is providing 73.1% success rate and CART gets the highest success rate 74.1%. Hence it is shown here that depending upon the success rate CART outperforms the other models whereas considering the Bad Good classification J48 shows better performance.
300
280
Failure Rate
Failure Rate
260
240
220
200
Failure Rate of Classification Methods
J48 Logistic SMO BayesNet CART
Classification Methods
Figure 5. Failure Rate of Classification Methods
-
Conclusion
Revealing fraudulence in credit card approval is important for the efficient processing of credit applications. To improve security of the credit card approval systems in an automatic and effective way, building an accurate and efficient credit card approval system is one of the key tasks for the financial institutions. In this paper, five classification methods were used to detect fraud in credit card approval process in the banking sector. This work demonstrates the advantages of applying the data mining techniques including decision trees (C5.0 & CART), SVM using SMO, Logistic Regression and BayesNet to reveal fraudulence in credit card approval process. It reduces the financial institutions risk. Any financial institution needs to retain its customers. A legitimate customer must not be classified as Fraud or a Fraud must not be classified as Legitimate. Hence it is shown here that depending upon the success rate CART outperforms the other models whereas considering the Bad Good classification J48 shows better performance. Hence It is concluded from this work that while making decisions on classifying customers combinaion of different classification models need to be used to make correct decision about a customer.
References
-
K. Chitra, B.Subashini, Customer Retention in Banking Sector using Predictive Data Mining Technique, International Conference on Information Technology, Alzaytoonah University, Amman, Jordan, www.zuj.edu.jo/conferences/icit11/paperlist/Papers/
-
K. Chitra, B.Subashini, Automatic Credit Approval using Classification Method, International Journal of Scientific & Engineering Research (IJSER), Volume 4, Issue 7, July-2013 2027 ISSN 2229-5518.
-
K. Chitra, B.Subashini, Fraud Detection in the Banking Sector, Proceedings of National Level Seminar on Globalization and its Emerging Trends, December 2012.
-
K. Chitra, B.Subashini, An Efficient Algorithm for Fraud Detecting Credit Card Frauds, Proceedings of State Level Seminar on Emerging Trends in Banking Industry, March 2013.
-
Richard J. Bolton and David J. Hand, Unsupervised Profiling Methods for Fraud Detection, Technical Report (Department of Mathematics, Imperial College, London), 2002.
-
Mayur Shetty, RBI moves to check credit card frauds, The Times of India, March 1st 2013.
-
Bharati M. Ramageri, Data Mining Techniques and Applications, Indian Journal of Computer Science and Engineering, Vol. 1 No. 4 301-305.
-
Margaret H. Dunham, Data Mining Introductory and Advanced Topics, Pearson Education, Sixth Impression, 2009.
-
B. C. da Rocha and R. T. de Sousa, Identifying Bank Frauds using Crisp-Dm and Decision Trees, International journal of computer science & information Technology (IJCSIT) Vol.2, No.5, October 2010.
-
Classification: Naive Bayes vs Logistic Regression, John Halloran, University of Hawaii at Manoa EE 645, Fall 2009
-
Weka, Machine Learning Group at the University of Waikato.
-