Anomaly Extraction Using Improved FP-Growth Algorithm Based on Compound Single Linked List

DOI : 10.17577/IJERTV3IS10256

Download Full-Text PDF Cite this Publication

Text Only Version

Anomaly Extraction Using Improved FP-Growth Algorithm Based on Compound Single Linked List

Pravin Gaikwad1 and Jyoti Kulkarni2

1ME (CN) and 2Assistant Professor

Department of Computer Engineering, Sinhgad College of Engineering, Pune-41

Abstract

Identifying network anomalies is essential in enterprise and provider networks for diagnosing events, like attacks or failures that severely impact performance, security, and Service Level Agreements (SLAs). We introduce an anomaly extraction method which refers to automatically finding, in a large set of flows observed during an anomalous time interval, the flow associated with the anomalous events. It is important for root-cause analysis, attack mitigation, network forensics and anomaly modelling. In this paper, we use meta- data provided by several histogram-based detectors to identify suspicious flows, and then apply association rule like Improved FP-Growth Algorithm based on Single Linked List to find and summarize anomalous flows. Using rich traffic data from a network, we show that our technique effectively finds the flows associated with anomalous events in all studied cases. In addition our algorithm shows a very small number of false positive rates and executes fast than Apriori and FP-Growth algorithm for large data sets.

Keywords: Computer network, data mining, detection algorithm, association rules, Improved FP algorithm.

1. INTRODUCTION

  1. Motivation

    Detecting and identifying network anomalies are becoming an important factor to consider when taking care of network security, uptime and performance of ISPs and large scale networks. An anomaly is defined as a "Deviation or departure from the normal or common order, form, or rule". Anomaly detection techniques are last line of defence when other approach fails to detect security threats or other problem. They have pose number of interesting research problems, involving statistics, modelling, and efficient data structures. Nevertheless, they have not yet gained widespread adaptation, as a number of challenges, like reducing number of false positive rate and calibration, remain to be solved.

    In this paper, we are interested in the problem of identifying the traffic flows associated with an anomaly during a time interval with an alarm. We call finding these flows the anomalous flow extraction or anomaly extraction. Anomaly extraction reflects the goal of gaining more information about an anomaly alarm. Identified anomalous flows can be used for a number of applications, like root-cause analysis of the event causing an anomaly, network forensics, improving anomaly detection accuracy, and anomaly modelling.

  2. Anomaly Extraction

    Fig 1 High level goal of anomaly extraction is to filter and summarize the set of anomalous flows that coincide with the flows caused by a network event such as denial- of-service (DoS) attack.

    In Fig. 1, we present a high level goal of anomaly extraction. In the bottom of the figure, events with network level footprint, like attack or failure, trigger event flow, which after analysis by an anomaly detector, may raise an alarm. Ideally, we would like to extract exactly all triggered event flows. The goal of anomaly extraction process is to find a set of anomalous flows coinciding with event flows.

    An anomaly detection system may provide a meta- data relevant to an alarm that helps to find the set of candidate anomalous flows. For example, anomaly detection systems analyzing histograms may indicate histogram bins than an anomaly affected, e.g., a range of IP addresses and Port numbers. Such meta-data can be used to restrict the candidate anomalous flows to these that have a affected IP address and port numbers.

    To extract an anomalous flow from network traffic, one could build a model describing normal flow characteristic and use the model to identify deviating flows. To build such model is very challenging issue due to wide variability of flows characteristic. Similarly, one could compare flows during an interval with flow from normal or past interval to the new flows or flow with significant

    increase/decrease in their volume to search for changes or to identify new flows that were not previously observed [6] [7]. Such approaches essentially perform anomaly detection at the level of individual flows and could be used to identify anomalous flows.

  3. Contributions

    In this paper, we take an alternative approach to identify anomalous flows that combines and consolidate information from multiple histogram- based detectors and an Improved FP-growth algorithm is applied to speed up the system rate and decrease the false positive rate as compared to Apriori and FP-growth algorithm [2]. Compared to other possible approaches, our method does not rely on past data for normal interval or normal models. Intuitively, each histogram-based detectors provide and additional view of network traffic.

    A detector may provide a set of candidate anomalous flows. The main reason of applying an association rule is that Anomalies typically result in many flows with similar characteristic, e.g., common IP address or port numbers, since they have a common root-cause, like network failure or a scripted denial-of-service (DoS) attack.

    Fig. 2. Overview of our approach to the anomaly extraction problem. The fig illustrate how meta-data for filtering flows is consolidate from n traffic features by taking the union and how suspicious flows are pre- filtered and anomalous flows are summarized in item- sets by association rule mining

  4. Outline

The rest of this paper is structured as follows. Section II describes our technique for extracting anomalous traffic and a new algorithm called Improved FP-growth algorithm to overcome the disadvantages of Apriori and FP-growth algorithm. Finally, Section III concludes our paper.

  1. METHODOLOGY

    An overview of our approach to the anomaly extraction is given in fig.2. An n number of histogram-based anomaly detectors monitors the network traffic and detect anomalies in an online fashion [1]. Upon detecting anomalies, we use a union set of meta-data provided by n histograms- based detectors to prefilter a set of suspicious flows. This pre-filtering process is necessary in order to eliminate a large part of normal flows. Then association rules are applied in order to generate frequent item-sets in the set of suspicious flows [1]. The basic assumption behind applying association rules is that frequent item-sets in the pre-filtered data are often related to anomalous event. The association algorithm like Improved FP- growth algorithm is applied to overcome the disadvantages of Apriori and FP-growth. Improved FP-growth improves the algorithm both in runtime and main memory consumption, which does its work without any complex data structure or prefix tree. The entire anomaly extraction process is automated and can take place both in online and offline fashion.

    1. Flow Pre-filtering

      Pre-filtering process usually removes a large part of the normal traffic. This process is desirable for two reasons. 1] It generates a substantially smaller data dataset that results in faster in the following steps. 2] It improves the accuracy of association rule mining by removing flows that could result in false-positive item-sets. Assume a time interval t

      with an anomaly. Pre-filtering selects all flows that match the union of the meta-data provided by n histogram-based detectors.

      An important detail of our approach is that we keep flows matching any of the meta-data instead of flow matching all the meta-data provided by n detectors. We take union of the flows matching meta-data rather than intersection of the flows matching meta-data. Taking the union is important because identified meta-data can be flow-disjoint, meaning that they appear in different flows, in which case the intersection is empty. The intersection of the flows matching the meta-data would be empty, whereas the union would include anomalous flows. It shows that the union results in fewer false positive that the intersection which may miss entire anomaly.

    2. Frequent Item-Sets Mining

      Association rule plays an important part in the field of data mining, which is used to mine the association rules in a given data set and is put forward by R.Agraawal [4] first. It divides the mining process into two steps: (1) find all frequent itemsets. (2) generate strong association rules from the frequent itemsets. The main purpose of applying association rules is that to find frequent item-sets to extract anomalous flows from large data-sets in time interval. Our assumption for applying frequent item-set mining to anomaly extraction problem is that anomalies typically result in large number of flows with similar characteristic. e.g., IP address, port numbers, since they have a common root cause analysis like network failure or scripted DoS attack.

      The transaction width is defined as the number of items present in a transaction. Each transaction has a width of five since each flow record has five associated feature corresponding to its srcIP, dstIP, srcPort, dstPort, #packets. For example the item i1

      ={srcPort : 80} refers to a source port number

      equal to 80, while item i2 = { dstPort : 80} refers to destination port number 80, a transaction cannot have to items same feature type.

      Improved FP-Growth Algorithm: The standard algorithm for generating frequent item- sets is Apriori and FP-Growth Algorithms [4] [5]. The main disadvantage of Apriori and FP-growth is that number of candidate item-sets and a complex data

      Improved FP-growth algorithm based on compound single linked list. The algorithm introduces the compound single linked list to improve the structure of the FP-tree. The improved FP-growth is mined in one direction, using the header table in the former FP-tree, storing them in a sequence table, ordering the frequent itemsets in descending sequence according to the min_sup, then a compound single linked list is formed. Through traversing each transactions frequent itemsets stored in its single linked list, mining the frequent patterns directly without generating conditional FP tree. Through the comparison between the improved algorithm and the former FP-tree, it shows that the new one improves the algorithm both in runtime and the main memory consumption.

      1: The steps to construct the compound single linked list

      1. The first scan of database is the same as the FP- tree. The scan of the database derives the set of frequent items (1-itemsets) and their support counts (frequencies). The set of frequent items is sorted in the order of descending support count. This resulting set or list is denoted L and an item header Table is built.

      2. The second scan of database is different from the FP growth. It is processing the items in each transaction in L order, and then inserting the items in each transaction into the single linked list

        recursively. The items order in each single linked list is according to L order.

        2: The realization of the improved algorithm

        The storing data structure is consist of three parts- I] Header Table

        Item-name, support count, node link II] Frequent Itemsets

        Frequent item, Pointer. III] Single Linked List Item, Count, Pointer.

        3: Methods

        1. The first scan of the database derives the set of frequent items (1-itemsets) and their support counts (frequencies). The set of frequent items is sorted in the order of descending support count. This resulting set or list is denoted L and an item header table is built.

        2. Scan database a second time. The items in each transaction are reserved and processed in L order, then stored in the single linked table frequent itemsets.

        (3)// i is the sequence of each transaction, p is the pointer pointing to one of the frequent itemsets in each transaction, n is the total number of the transaction.

        (4) Generating frequent patterns.

        4: An example to examine the improved algorithms feasibility

        For the same environment of the database, the article uses the database as article [17].

        The first step is the same, so we dont mention it repeatedly; we mainly introduce the second step. Its process is shown as Fig 3-6.

        For the first transaction<f,c,a,m,p>, constructing its Compound Single Linked List like fig3, taking f as item header-node, insert<c,a,m> after f, then taking c as item header-node, inserting<a,m,p> after c; When inserting the second transaction,

        because b is after a, before m, then inserting b after a and before m.

        Fig.3 inserting the first transaction

        Fig.4 inserting the second transaction

        Fig.5 inserting the third transaction

    3. Histogram-Based Detectors

      Histogram-based anomaly detectors [8] [9] [10] [11] have been work well for detecting anomalous behaviour and changes in traffic distribution. Our histogram-based detector uses the Kullback-Leibler (KL) distance to detect anomalies. The KL distance has been successfully applied for anomaly detection in previous work [12] [11]. Each histogram-based detector monitors a flow feature distribution, like distribution of IP address and Port numbers. We assume that for n histogram detectors of n traffic features m histogram bins are constructed.

      During a time interval t, an anomaly detector module constructs histogram for number of flows per traffic feature. At the end of interval, it computes for each histogram the KL distance between the distribution of current interval and reference interval distribution. The KL distance of given discrete distribution q to reference distribution p be

      =0

      D (p//q) = log( / ).

      Fig.6 inserting the last transaction

      The generating frequent patterns is as table1. (min_sup=3).

      Table1 generating frequent patterns

      itemid

      Generating frequent pattern

      f

      c a b m p

      (f:4), (fc:3), (fa:3), (fm:3), (fca:3),

      (fcm:3),(fam:3), (fcam:3)

      (c:4), (ca:3), (cm:3), (cp:3), (cam:3) (a:3), (am:3)

      (b:3)

      (m:3)

      (p:3)

    4. Histogram Cloning and Voting

    Histogram cloning is a promising technique applied to Histogram-Based Detection. The motivation behind it is to maintain multiple randomized histograms (of the same feature), hence it will obtain additional views of network traffic. This technique is realized in [13] by creating n histogram-based detectors corresponding to n different traffic features. For each of the n features there are m bins per time interval, and by applying a hash function to each of the clones, one makes sure that each feature value is placed randomly into one of the m bins. This differs from classical binning, which tends to place adjacent feature values (e.g. source ports) next to each other in a histogram.

  2. CONCLUSION

Anomaly extraction takes as input a large set of flows and aims at finding the flows associated with the event that triggered an observed anomaly. It is very useful for finding the root cause of detecting anomalies, which helps in attack mitigation, anomaly modelling and network forensics. Optimizing the scalability and efficiency of frequent item-set mining for dealing with big network data including stream processing is one open problem.

To avoid such problems of big network data, we introduced Improved-FP algorithm, The improved FP-growth ismined in one direction, using the header table in the original FP-tree, and a compound single linked list is constructed. The result shows that the improved algorithm is better than the FP-growth. By comparing these frequent item-set mining algorithms apriori and fp-growth and Improved FP, Improved FP is faster and requires less memory.

REFERENCES

  1. D. Brauckhoff, X. Dimitropoulos, A. Wagner, and K. Salamatian, Anomaly extraction in backbone networks using association rules,in Proc. IEEE ACM TRANSACTION ON NETWORKING, VOL.20. NO 6, DECEMBER 2012.

  2. Ding Zhenguo, Wei Qinqin, Ding Xianhua An Improved FP-growth Algorithm Based on Compound Single Linked List. In Proc. IEEE 2009

  3. W. A. Kosters,W. Pijls, and V. Popova, Complexity analysis of depth first and FP-growth implementations of APRIORI, in Proc. MLDM, 2003, pp. 284292.

  4. R. Agrawal and R. Srikant, Fast algorithms for mining association rules in large databases, in Proc. 20th VLDB, Santiago de Chile, Chile, Sep. 1215, 1994, pp. 487499.

  5. C. Borgelt, An implementation of the FP-growth Algorithm, Otto-von-Guericke-University of Magdeburg.

  6. B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen, Sketch-based change detection: Methods, evaluation, and applications, in Proc. 3rd ACM SIGCOMM IMC, 2003, pp. 234247.

  7. G. Cormode and S. Muthukrishnan, Whats new: Finding significant differences in network data streams, IEEE/ACM Trans. Netw., vol. 13, no. 6, pp. 12191232, Dec. 2005.

  8. A. Kind, M. P. Stoecklin, and X. Dimitropoulos, Histogram-based traffic anomaly detection, IEEE Trans. Netw. Service Manage., vol. 6, no. 2, pp. 110 121, Jun. 2009.

  9. M. P. Stoecklin, J.-Y. L. Boudec, and A. Kind, A two-layered anomaly detection technique based on multi- modal flow behavior

    models, in Proc. 9th PAM, 2008, Lecture Notes in Computer Science, pp. 212221.

  10. X. Li, F. Bian, M. Crovella, C. Diot, R. Govindan,

    G. Iannaccone, and A. Lakhina, Detection and identification of network anomalies using sketch subspaces, in Proc. 6th ACM SIGCOMM IMC, 2006, pp. 147152.

  11. K. H. Ramah, K. Salamatian, and F. Kamoun, Scan surveillance inInternet networks, in Proc. Netw., 2009, pp. 614625.

  12. Y. Gu, A. McCallum, and D. Towsley, Detecting anomalies in network traffic using maximum entropy estimation, in Proc. 5th ACM SIGCOMM IMC, 2005, pp. 3232.

  13. F. Silveira and C. Diot, URCA: Pulling out anomalies by their root causes, in Proc. IEEE INFOCOM, Mar. 2010, pp. 19.

  14. M. Zaki, Scalable algorithms for association mining, IEEE Trans. Knowl. Data Eng., vol. 12, no. 3, pp. 372390, MayJun. 2000.

  15. A. Wagner and B. Plattner, Entropy based worm and anomaly detection in fast IP networks, in Proc. 14th IEEE WETICE, 2005, pp. 172177.

  16. M. V. Mahoney and P. K. Chan, Learning rules for anomaly detection of hostile network traffic, in Proc. 3rd IEEE ICDM, 2003, pp.601604.

  17. Han,J.,Pei J. ,Yin,Y (1999).Mining Frequent Paterns Without Candidate Generation. Technical Report CMPT99-12, Schoolo f Computing Science, Simon Fraser University.

Leave a Reply