- Open Access
- Authors : Zheng Yu-Wei, Ding Zhi-Jun, Xu Xiao-Feng
- Paper ID : IJERTCONV3IS16147
- Volume & Issue : TITCON – 2015 (Volume 3 – Issue 16)
- Published (First Online): 30-07-2018
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Approach to Filter Transaction Risk Based on Distance Discriminant
Zheng Yu-Wei 1,2
Department of Electronics and Information Engineering, Tongji University, Shanghai 201804, China

Ding Zhi-Jun 1,2,+
The Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai 200092, China

Xu Xiao-Feng 3
Department of Economic Information Engineering, Southwest University of Finance and Economics, Chengdu, Sichuan 611130, China
Abstract: With the growth of e-commerce payment requirements, electronic payment has developed rapidly, but it also faces transaction risk brought by fraud and other causes. It is therefore necessary to establish a transaction risk control system. Many fraud detection methods are based on data mining models, but the imbalanced-sample problem easily leads to misclassification, which disturbs normal user transactions and results in lost transactions and other problems. This paper presents a method to build a risk filter model for electronic transactions. The method filters normal transactions before the strict model validation, reducing the burden of subsequent validation and raising both the success rate and the response speed of normal transactions. The paper provides a distance discrimination method to judge the filter conditions and conducts an experiment on massive real transaction data. The experimental results show that the method filters normal transactions effectively.
Key words: Distance discrimination; transaction risk; risk filter; fraud detection; electronic payment security
1 INTRODUCTION
With the online payment demand of e-commerce, electronic payments have developed rapidly. Because of the open Internet environment, online payments face transaction risk brought by fraud and other security problems. At present, the more effective industrial solutions to the transaction risk problem are based on rules and on data mining models. Rules depend heavily on domain knowledge, so their fraud recognition accuracy is very high, but rules are usually hard-coded and the types of fraud they can detect are limited[1]. Data mining models are more flexible; models such as the C4.5 decision tree and logistic regression have been widely used[2]. However, the volume of transaction logs an e-commerce company produces every day is very large, and fraud transactions are only a tiny part of it. Applying data mining models to transaction risk detection therefore faces a serious imbalanced-sample problem[3], which often leads to poor classification performance. In order to reduce financial losses, online payment companies tend to adopt a conservative strategy, making the model more stringent so that it detects more fraud at the cost of a high false positive rate.
Currently the stringent risk authentication of transactions tends to cause a high false positive rate and a worse user experience; given the large transaction volume, this directly produces many falsely flagged transactions. When a transaction is identified as risky, the system sends a confirmation by SMS or e-mail, or uses manual methods such as a telephone call, which requires the company to pay substantial additional communication and labor costs. Furthermore, the additional confirmation makes users worry about the security of electronic payment, disturbs their normal payment process, and easily leads to failed payment operations. Therefore, how to reduce interference to normal users under the existing risk control system is an essential research question.
This paper proposes a transaction filtering method that verifies normal electronic transactions before the strict model verification, to overcome the drawbacks of simple model validation. If a transaction is verified as normal, it is directly passed by the risk control system; otherwise it is rechecked by the strict model. Note that this method only measures normal transactions: the transactions that cannot be passed by the filtering method still contain normal transactions and need to be classified by the stringent model. The transaction filtering method and the stringent model together form a two-level risk control system, which achieves a high abnormal-transaction detection rate and a low interference rate for normal transactions. The architecture of the risk control system based on two-level risk control is shown in Figure 1.
[Figure 1: Architecture diagram of the risk control system based on the transaction risk filter system. An incoming transaction first enters Level I, the transaction risk filter system, which directly passes normal transactions; transactions that cannot pass the filter are rechecked by Level II, the rules-based and model-based validation systems, which pass or reject them.]
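To make the two-level flow concrete, here is a minimal sketch (ours, not the paper's code) of the decision logic in Figure 1; `risk_filter_passes`, `rules_reject`, and `model_rejects` are hypothetical stand-ins for the Level I filter and the Level II validators:

```python
def handle_transaction(txn, risk_filter_passes, rules_reject, model_rejects):
    """Two-level risk control: Level I filters obviously normal transactions,
    Level II rechecks everything the filter could not pass."""
    # Level I: the transaction risk filter directly passes normal transactions.
    if risk_filter_passes(txn):
        return "passed"
    # Level II: rules-based and model-based validation recheck the rest.
    if rules_reject(txn) or model_rejects(txn):
        return "rejected"
    return "passed"
```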
The rest of this paper is organized as follows: Section 2 reviews current research on risk control of online transactions. Section 3 details the transaction risk filtering model and describes the distance-discrimination method used to judge whether a transaction meets the filter conditions. Section 4 presents the experimental results on real transaction data together with a short discussion, and Section 5 concludes the study.
2 RELATED WORK
Research on transaction risk control currently focuses on fraud detection, and most studies apply classification models to separate normal and abnormal transactions, such as neural networks[4,5], decision trees[5], logistic regression[6], and hidden Markov models[7]. Halvaiee applied an artificial immune system to detect fraud transactions[8]. But as Phua[2] points out, the experimental data sets of most fraud detection research are not as large as those in the actual environment, and the imbalanced-sample problem is not obvious in these data sets, while the proportion of fraud transactions is much lower in the real environment. Solving fraud detection with a classification model therefore faces an extreme imbalanced-sample problem on real transaction data. Depending on the selected threshold, the classification model suffers either a high false positive rate or a low recall rate. So applying a classification model alone cannot properly solve the fraud detection problem in the e-commerce environment.

In this paper, in order to filter normal transactions, we want to build a method to measure the reliability of transactions. Some studies establish trust mechanisms for users and transactions in electronic environments. Liqin Tian[9] builds time windows from user logs and evaluates the reliability of users within these windows. Lijie Fan[10] proposed a recommendation trust evaluation model based on ratings in the mobile e-commerce environment. Shao Zhang[11] builds a trust model for both users and merchants to ensure that they can trust each other. But there is still no research on building an evaluation method for normal transactions.

3 TRANSACTION RISK FILTER MODEL

The purpose of the proposed transaction risk filter model is to pass as many normal transactions as possible before the stringent model, so the primary problem is how to measure whether a transaction is normal. From the analysis of the distribution pattern of abnormal transactions by clustering, we use the distance between a transaction and the centroid of the main cluster of abnormal transactions to judge whether the transaction can be filtered. In this section we give the details of the transaction risk filtering model. The data set used in this paper is described in 3.1. Section 3.2 covers the analysis and selection of features and discusses feature weighting. Section 3.3 analyzes the abnormal transactions with the DBSCAN clustering method, and 3.4 gives the specifics of the transaction risk filtering model and the details of the distance discrimination process.

3.1 Data Sets

The data sets in this paper are online credit card payment transaction logs provided by an electronic payment company in China. The transaction samples were produced between February and March 2014. Due to the large amount of data, we randomly picked about 20 million transaction logs, which contain about 4,700 abnormal transactions, so the imbalanced-sample ratio is close to 0.02%. We use the February logs as training data and the March logs as validation data in the experiments. The original transaction log data items in this study are shown in Table 1.

Table 1: Transaction log data item details

Variable       Type    Definition
trans_no       char    transaction number
client_id      long    user client id
mac            char    MAC address of the transaction platform
ip_addr        char    IP address of the transaction platform
trans_date     char    transaction time
trans_amount   double  transaction amount
card_index     long    hash value of card number
card_country   long    country of card
shipto_adrr    char    shipping address
merchant_type  char    merchant type
case_flag      int     class type of transaction

3.2 Feature Selection and Feature Weighting

Let the transaction class be Y, where Y has only two types, normal and abnormal. We first need to extract features, represented by X, from the transaction logs. Based on research on fraud detection features for credit cards[12], we extract and derive 40 features. We also select features by applying IV (Information Value). IV is a very effective feature selection method with a wide range of applications[13]. To apply IV we first need to calculate the WOE (Weight of Evidence) of each feature value from the transaction logs, computed by formula (1), where $x_i$ is a specific value of a feature (in this paper $x_i \in X_j$) and $y_k$ stands for the normal transaction class:

$\mathrm{WOE}(x_i) = \log \frac{f(y = y_k \mid x = x_i)}{f(y \neq y_k \mid x = x_i)}$    (1)

WOE only indicates the importance of a specific feature value to the target class; to measure the importance of an entire feature, the IV calculation is necessary. IV is computed by formula (2):

$\mathrm{IV}(X_j) = \sum_{x_i \in X_j} \left[ f(y = y_k \mid x = x_i) - f(y \neq y_k \mid x = x_i) \right] \cdot \mathrm{WOE}(x_i)$    (2)

Studies have shown that an IV above 0.1 indicates a feature of moderate importance for classification, and an IV greater than 0.3 indicates a feature with an obvious effect on classification. This paper chose 0.1 as the selection criterion, which reduces the effect of noise features. Table 2 lists some of the screened features with their definitions and IV. So we can construct a transaction sample $e_i \in E$, $e_i = (x_1, x_2, \ldots, x_n, y_k)$, where $x_i \in X_i$ and $y_k \in Y$.

Because of the important differences between features, we need to assign a certain weight to each feature. As discussed above, IV reflects a feature's importance for discriminating between normal and abnormal transactions, so we can calculate feature weights from IV. Let the IV of feature $X_i$ be $v_i$; then the weight $w_i$ of feature $X_i$ is the proportion of $v_i$ in the sum of all feature IVs, calculated by formula (3):

$w_i = v_i \big/ \sum_{X_j \in E} v_j$    (3)

Table 2: Feature details after feature selection by IV

Variable  Definition                                                      IV
X1        days from the earliest transaction of the card to now           1.41
X2        square root of the sum paid by this card in 15 days             1.06
X3        number of devices that have been used to pay with the card      1.02
X4        total number of transactions paid by this card in 15 days       0.72
X5        distance between billing address and login IP address           0.67
X6        square root of the sum paid by this card in 30 days             0.54
X7        total number of transactions paid by this card in 30 days       0.49
X8        total number of transactions in this hour paid by this card     0.33
X9        total number of times this shipping address was set using
          this card                                                       0.29
X10       total number of times this card was used to buy this
          merchant type                                                   0.22
X11       distance between shipping address and login IP address          0.21
X12       total number of times this card was used on this device         0.14
X13       distance between shipping address and billing address           0.14
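As a concrete illustration of formulas (1)-(3), the following sketch (our own, not the authors' code) computes per-value WOE, the feature's IV, and IV-proportional weights; the `eps` smoothing term and the `normal_label` convention are our assumptions, since the paper does not state how zero frequencies are handled:

```python
import math
from collections import Counter, defaultdict

def woe_iv(feature_values, labels, normal_label=0, eps=1e-9):
    """WOE(x_i) and IV(X_j) per formulas (1)-(2); eps avoids log(0)."""
    counts = defaultdict(Counter)
    for x, y in zip(feature_values, labels):
        counts[x][y == normal_label] += 1
    woe, iv = {}, 0.0
    for x, c in counts.items():
        total = c[True] + c[False]
        f_norm = c[True] / total   # f(y = y_k | x = x_i)
        f_abn = c[False] / total   # f(y != y_k | x = x_i)
        w = math.log((f_norm + eps) / (f_abn + eps))
        woe[x] = w
        iv += (f_norm - f_abn) * w
    return woe, iv

def iv_weights(ivs):
    """Formula (3): w_i = v_i / sum of IVs over the selected features."""
    total = sum(ivs.values())
    return {name: v / total for name, v in ivs.items()}
```

Continuous features such as X2 or X5 would need to be discretized (binned) before this per-value computation; the paper does not describe its binning scheme, so that step is omitted here.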
3.3 Analysis of Abnormal Transactions Based on the DBSCAN Clustering Method

In this paper we analyze the distribution pattern of abnormal transaction samples by clustering. K-Means is commonly used as a simple clustering algorithm with fast convergence, but the number of clusters K must be specified before clustering, and the selection of K directly affects the quality of the clustering results. Here we cannot specify in advance how many clusters the abnormal transactions will form, so we chose a spatial-density-based clustering algorithm called DBSCAN (Density-Based Spatial Clustering of Applications with Noise), proposed by Ester et al.[15]. The DBSCAN algorithm does not need the number of clusters to be specified and can detect noise samples. DBSCAN requires that, for each sample in a cluster, the number of other cluster samples within a given radius Eps be larger than a given threshold MinPts, so we need to specify Eps and MinPts for this clustering analysis.
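For concreteness, the main-cluster analysis could be run with scikit-learn's DBSCAN as sketched below (our assumption; the paper does not name an implementation). `X_abnormal` holds the (optionally IV-weighted) feature vectors of the abnormal transactions, and MinPts corresponds to `min_samples`, whose value here is purely illustrative since the paper does not report it:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def main_cluster_stats(X_abnormal, eps=86.0, min_samples=5):
    """Cluster abnormal transactions and report the largest (main) cluster."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_abnormal)
    ids, sizes = np.unique(labels[labels != -1], return_counts=True)  # drop noise (-1)
    main = ids[np.argmax(sizes)]
    proportion = sizes.max() / len(labels)              # share of abnormal samples in main cluster
    centroid = X_abnormal[labels == main].mean(axis=0)  # E^T used in Section 3.4
    return proportion, centroid
```

Sweeping `eps` and plotting the returned proportion reproduces the kind of curve shown in Figure 2.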
We observe the distribution of abnormal transactions by applying DBSCAN clustering analysis to the February transaction logs. We conduct the clustering experiment with both the original features and the IV-weighted features, and find that most abnormal transactions gather into one main cluster. Figure 2 shows that with a radius of 86, the abnormal transactions in the main cluster account for 96.31%. Figure 2 also shows that the weighted features make the proportion of abnormal transactions in the main cluster converge faster and make outliers easier to exclude.
[Figure 2: Relationship between the proportion of abnormal transactions in the main cluster and the radius Eps, for weighted and unweighted features.]
The clustering experiments show that the abnormal transactions outside the main cluster account for only a small proportion of all abnormal transactions, so we treat these abnormal transactions as outliers. Following the statistical definition of small-probability events, we set a 0.05 threshold criterion and use an Eps that places more than 95% of the abnormal transactions in the main cluster. In this data set, when Eps is set to 86, the abnormal transactions in the main cluster account for 96.31%, so we set Eps = 86 as the parameter for further experiments.

3.4 Filter Condition Based on Distance Discrimination

The purpose of this paper is to pass normal transactions, so the next issue is how to measure whether a transaction is normal. Here we define a normal transaction measure function T(E) as a measure method for normal transactions and call its output the transaction reliability value. Once T(E) is determined, we can use a transaction reliability threshold to check whether a transaction can be filtered.

Definition 1: The Normal Transaction Measure Function T(E) is a function mapping a transaction feature vector E to a transaction reliability value.

Definition 2: A transaction E can be filtered if and only if $T(E) \ge \delta$, where $\delta$ is the transaction reliability threshold.

Based on the DBSCAN clustering analysis in 3.3, most abnormal transactions lie in the main cluster. Therefore we use the distance between a transaction sample and the centroid of the abnormal transactions as the measure of the difference between normal and abnormal transactions, i.e., as the transaction reliability value.

Let the centroid be $E^T = (X_1^T, X_2^T, \ldots, X_n^T)$; we obtain it by calculating the center of the main cluster after DBSCAN clustering. We use the Euclidean distance as the measure function, so T(E) is calculated by formula (4). In the further experiments, we take the center of the main cluster obtained with the specific Eps selected by the criterion described in 3.3. We also apply the weighted features to the calculation of T(E), as shown in formula (5).

$T(E) = \sqrt{\sum_{X_i \in E} \left( X_i - X_i^T \right)^2}$    (4)

$T(E) = \sqrt{\sum_{X_i \in E} \left( w_i X_i - w_i X_i^T \right)^2}$    (5)

Application of the transaction filter model based on the distance discrimination method is divided into an offline training phase and an online validation phase. The offline training phase calculates the feature vectors of all transactions in the offline training data and the corresponding transaction reliability values, and selects a transaction reliability threshold as the criterion for filtering transactions online. In the online validation phase, we calculate the transaction reliability value of each online transaction using the center obtained in the training phase and filter normal transactions based on the threshold.

The selection of the transaction reliability threshold therefore determines the performance of the transaction risk filter model. To evaluate the performance of our filtering model, we introduce two evaluation measures: the pass rate and the leak rate.

Definition 3: Pass Rate, the proportion of normal transactions filtered by the model among all normal transactions, calculated by formula (6):

$E_1 = \frac{TP}{TP + FN}$    (6)

Definition 4: Leak Rate, the proportion of abnormal transactions filtered by the model among all abnormal transactions, calculated by formula (7):

$E_2 = \frac{FP}{TN + FP}$    (7)

Here TP (true positive) is the number of positive samples correctly classified as positive, FP (false positive) is the number of negative samples incorrectly classified as positive, TN (true negative) is the number of negative samples correctly classified as negative, and FN (false negative) is the number of positive samples incorrectly classified as negative.

For the risk filtering model we expect a higher pass rate and a lower leak rate at the same time. Obviously, when the threshold is set to 0, both the pass rate and the leak rate are 100%. As the threshold increases, both values decrease. Therefore, in the training phase we should select the threshold by balancing the pass rate against the leak rate. In an industrial environment, the threshold can be obtained by fixing the leak rate according to business requirements.
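Putting formulas (4)-(7) together, a minimal sketch of the reliability computation and its evaluation might look as follows (our illustration; the array shapes and the boolean label encoding are assumptions):

```python
import numpy as np

def reliability(X, centroid, weights=None):
    """T(E): (weighted) Euclidean distance to the abnormal centroid,
    formulas (4) and (5). X has shape (n_samples, n_features)."""
    if weights is not None:
        X, centroid = X * weights, centroid * weights
    return np.sqrt(((X - centroid) ** 2).sum(axis=1))

def evaluate(T, is_normal, threshold):
    """Pass rate E1 = TP/(TP+FN) and leak rate E2 = FP/(TN+FP),
    formulas (6)-(7); a transaction is filtered when T(E) >= threshold."""
    passed = T >= threshold
    tp = np.sum(passed & is_normal)    # normal transactions filtered
    fn = np.sum(~passed & is_normal)   # normal transactions held for Level II
    fp = np.sum(passed & ~is_normal)   # abnormal transactions leaked
    tn = np.sum(~passed & ~is_normal)  # abnormal transactions held back
    return tp / (tp + fn), fp / (tn + fp)
```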
4 EXPERIMENTAL RESULTS AND DISCUSSION
According to formulas (1) to (5), we calculate the transaction reliability values of the February transaction samples using both weighted and unweighted Euclidean distance, and compute the pass rate and leak rate at different transaction reliability thresholds. The experimental results are shown in Figures 3 and 4, where the abscissa is the threshold. Figures 3 and 4 show that as the threshold increases, both the pass rate and the leak rate decrease, but the leak rate decreases faster than the pass rate, indicating that abnormal transactions are excluded faster than normal ones as the threshold rises. Figures 3 and 4 also show that with the weighted features the leak rate decreases faster than with the original features, which means that at the same leak rate, the weighted Euclidean distance achieves a higher pass rate.
[Figure 3: Pass rate and leak rate of ordinary Euclidean distance on the February data set, against the transaction reliability threshold (thousands).]

[Figure 4: Pass rate and leak rate of weighted Euclidean distance on the February data set, against the transaction reliability threshold (thousands).]

This conclusion can be seen more clearly in the ROC curve. We also plot the ROC curve in Figure 5, where the abscissa is the leak rate and the ordinate is the corresponding pass rate. Figure 5 shows that with the weighted Euclidean distance, the pass rate grows very quickly while the leak rate is below 5%, finally reaching 39.69%, while at the same point the pass rate of the Euclidean distance with the original features is only 11.73%. The ROC area of the weighted Euclidean distance is also obviously larger than that of the unweighted Euclidean distance. Therefore, the weighted Euclidean distance method significantly improves the performance of normal transaction filtering.

[Figure 5: ROC curve (pass rate vs. leak rate) of the original and weighted Euclidean distance on the February data set.]
The selection of the transaction reliability threshold is mainly based on the specific requirements of the application scenario, but we can also take hints from the ROC curve, i.e., select the most efficient threshold possible at a given pass rate or leak rate. According to Figure 5, we select the point with the most obvious slope change as the threshold $\delta$, with a leak rate of 3.10% and a pass rate of 37.50%.
In order to verify that this threshold also works on future transactions, we calculate the transaction reliability values of the March transaction samples using the weighted Euclidean distance and analyze the pass rate and leak rate at different thresholds. The experimental result is shown in Figure 6. When the threshold is set to $\delta$, 39.20% of the normal transactions are filtered and only 0.68% of the abnormal transactions are incorrectly filtered.
[Figure 6: Pass rate and leak rate of weighted Euclidean distance on the March data set, against the transaction reliability threshold (thousands).]
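The threshold search described above can be sketched as a simple sweep, reusing the `evaluate` helper from the sketch in Section 3.4: trace the pass/leak pairs behind Figures 3-6 and return the smallest threshold whose leak rate meets a business target (the 5% default is illustrative, not from the paper):

```python
import numpy as np

def choose_threshold(T_train, is_normal, max_leak=0.05, n_grid=500):
    """Return the smallest threshold with leak rate <= max_leak; since the
    pass rate is non-increasing in the threshold, this maximizes it."""
    for thr in np.linspace(T_train.min(), T_train.max(), n_grid):
        pass_rate, leak_rate = evaluate(T_train, is_normal, thr)
        if leak_rate <= max_leak:
            return thr, pass_rate, leak_rate
    return float(T_train.max()), 0.0, 0.0
```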
The experimental results show that the transaction risk filtering model based on the weighted Euclidean distance discrimination method is very effective at filtering normal transactions. At a given leak rate, the model achieves a strong filtering effect. Weighting the features by IV improves the filtering effect compared with the original features. Finally, the transaction reliability threshold selected in the training phase also filters well in the validation phase, which means that a threshold computed on offline data remains effective for filtering future online data.
5 CONCLUSION
This paper proposes a transaction risk filtering model that filters normal transactions before the stringent model validation. Through this model, an e-payment company can reduce the burden and risk of the additional strict model validation and reduce interference to normal users. The transaction risk filtering model can be integrated into an existing risk control system. The paper presents a method to filter normal transactions based on weighted Euclidean distance discrimination and applies the method and model to a real transaction data set. The experimental results show that the model filters normal transactions very effectively while keeping a low leak rate.
The features and model used in this paper filter normal transactions well, but the lack of theoretical and statistical analysis limits their generality, so in future work we will analyze the features and the method to increase their universality. In addition, the current selection of the threshold is mostly based on experience and application requirements; we need a more principled method to choose a suitable threshold. We will also do further research on the normal transaction measure function to find a function that achieves an even better filtering effect on normal transactions.
ACKNOWLEDGMENT
This work is supported in part by the National Natural Science Foundation of China under Grant No. 61173014 and 61173016, Hong Kong, Macao and Taiwan Science & Technology Cooperation Program of China under Grant No. 2013DFM10100, Shanghai Science & Technology Research Plan under Grant No. 11JC1412800.
REFERENCES:
[1] Sherman E. Fighting web fraud. Newsweek, 2002, 10.
[2] Phua C, Lee V, Smith K, et al. A comprehensive survey of data mining-based fraud detection research. arXiv preprint arXiv:1009.6119, 2010.
[3] Phua C, Alahakoon D, Lee V. Minority report in fraud detection: classification of skewed data. ACM SIGKDD Explorations Newsletter, 2004, 6(1): 50-59.
[4] KolaliKhormuji M, Bazrafkan M, Sharifian M, et al. Credit card fraud detection with a cascade artificial neural network and imperialist competitive algorithm. International Journal of Computer Applications, 2014, 96(25): 1-9.
[5] Chen F H, Chi D J, Zhu J Y. Application of random forest, rough set theory, decision tree and neural network to detect financial statement fraud, taking corporate governance into consideration. In: Intelligent Computing Theory. Springer International Publishing, 2014: 221-234.
[6] Shen A, Tong R, Deng Y. Application of classification models on credit card fraud detection. In: 2007 International Conference on Service Systems and Service Management. IEEE, 2007: 1-4.
[7] Kumari N, Kannan S, Muthukumaravel A. Credit card fraud detection using hidden Markov model: a survey. Middle-East Journal of Scientific Research, 2014, 19(6): 821-825.
[8] Halvaiee N S, Akbari M K. A novel model for credit card fraud detection using artificial immune systems. Applied Soft Computing, 2014, 24: 40-49.
[9] Tian Li-Qin, Lin Chuang. Evaluation mechanism for user behavior trust based on DSW. Journal of Tsinghua University (Science and Technology), 2010(5): 763-767.
[10] Fan Li-Jie, Wang Su-Zhen, Liu Wei. Evaluation method based on human trust mechanism for mobile e-commerce trust. Computer Science, 2012, 39(1): 190-192.
[11] Zhang S, Lu X, Wang B. A trust evaluation model based on behaviors in the electricity market. In: Third International Conference on Electric Utility Deregulation and Restructuring and Power Technologies (DRPT 2008). IEEE, 2008: 561-566.
[12] Wu Ting. Research on the applications of data mining in credit card fraud detection. Southeast University, 2006.
[13] Wang Y, Wong A K C. From association to classification: inference using weight of evidence. IEEE Transactions on Knowledge and Data Engineering, 2003, 15(3): 764-767.
[14] Moez H, Alec Y C, Ray F. Variable selection in the credit card industry. NESUG, 2006: 61-65.
[15] Ester M, Kriegel H P, Sander J, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, 1996, 96: 226-231.