Horizontal partitioning ID3 algorithm A new approach of detecting network anomalies using decision tree

DOI : 10.17577/IJERTV1IS7241

Download Full-Text PDF Cite this Publication

Text Only Version

Horizontal partitioning ID3 algorithm A new approach of detecting network anomalies using decision tree

Sonika Tiwari*, Prof. Roopali Soni**

*(Department of Computer Science, OCT Bhopal

** (Department of Computer Science, OCT Bhopal

ABSTRACT

In this paper we are proposing a new approach of Detecting Network Anomalies using improved ID3 with horizontal portioning based decision tree. Here we first apply different clustering algorithms and after that we apply horizontal partioning decision tree and then check the network anomalies from the decision tree. Here in this paper we find the comparative analysis of different clustering algorithms and existing id3 based decision tree.

  1. INTRODUCTION

    It is important to keep our computer systems secure because economical activities rely on it. Despite the existence of attack prevention mechanisms such as firewalls, most company computer networks are still the victim of attacks. According to the statistics of CERT [1], the number of reported incidents against computer networks has increased from 252 in 1990 to 21756 in 2000 and to 137529 in 2003. This happened because of misconfiguration of firewalls or because malicious activities are generally cleverly designed to circumvent the firewall policies. It is therefore crucial to have another line of defence in order to detect and stop malicious activities. This line of defence is intrusion detection systems (IDS).

    During the last decades, different approaches to intrusion detection have been explored. The two most common approaches are misuse detection and anomaly detection. In misuse detection, attacks are detected by matching the current traffic pattern with the signature of known attacks. Anomaly detection keeps a profile of normal system behavior and interprets any significant deviation from this normal profile as malicious activity. One of the strengths of anomaly detection is the ability to detect new attacks. Anomaly detections most serious weakness is that it generates too many false alarms. Anomaly detection falls into two categories: supervised anomaly detection and unsupervised anomaly detection. In supervised anomaly detection, the instances of the data set used for training the system are labelled either as normal or as specific attack type. The problem with this approach is that labeling the data is time consuming. Unsupervised anomaly detection, on the other hand, operates on unlabeled data. The advantage of using unlabeled data is that the unlabeled data is easy and inexpensive to obtain. The main challenge in

    performing unsupervised anomaly detection is distinguishing the normal data patterns from attack data patterns.

    Recently, clustering has been investigated as one approach to solving this problem. As attack data patterns are assumed to differ from normal data patterns, clustering can be used to distinguish attack data patterns from normal data patterns. Clustering network traffic data is difficult because:

    1. of high data volume

    2. of high data dimension

    3. the distribution of attack and normal classes is skewed

    4. the data is a mixture of categorical and continuous data

    5. of the pre-processing of the data required.

      Network anomaly detection

      As we explained earlier, detectors need models or rules for detecting intrusions. These models can be built off-line on the basis of earlier network traffic data gathered by agents. Once the model has been built, the task of detecting and stopping intrusions can be performed online. One of the weaknesses of this approach is that it is not adaptive. This is because small changes in traffic affect the model globally. Some approaches to anomaly detection perform the model construction and anomaly detection simultaneously on-line. In some of these approaches clustering has been used. One of the advantages of online modeling is that it is less time consuming because it does not require a separate training phase. Furthermore, the model reflects the current nature of network traffic. The problem with this approach is that it can lead to inaccurate models. This happens because this approach fails to detect attacks performed systematically over a long period of time. These types of attacks can only be detected by analyzing network traffic gathered over a long period of time. The clusters obtained by clustering network traffic data off-line can be used for either anomaly detection or misuse detection. For anomaly detection, it is the clusters formed by the normal data that are relevant for model construction. For misuse detection, it is the different attack clusters that are used for model construction.

      Clustering is a division of data into groups of similar objects. Each group, called cluster, consists of objects that are similar amongst them and dissimilar compared to objects of other groups. Representing data by fewer clusters necessarily loses certain fine details, but achieves simplification. It represents many data objects by few clusters, and hence, it models data

      by its clusters.

      Cluster analysis is the organization of a collection of patterns (usually represented as a vector of measurements, or a point in a multidimensional space) into clusters based on similarity. Patterns within a valid cluster are more similar to each other than they are to a pattern belonging to a different cluster. It is important to understand the difference between clustering (unsupervised classification) and discriminate analysis (supervised classification). In supervised classification, we are provided with a collection of labelled (reclassified) patterns; the problem is to label a newly encountered, yet unlabeled, pattern. Typically, the given labeled (training) patterns are used to learn the descriptions of classes which in turn are used to label a new pattern. In the case of clustering, the problem is to group a given collection of unlabeled patterns into meaningful clusters. In a sense, labels are associated with clusters also, but these category labels are data driven; that is, they are obtained solely from the data [2,3,4].

      ID3 Algorithm

      The ID3 algorithm (Inducing Decision Trees) was originally introduced by Quinlan in [11] and is described below in Algorithm 1. Here we briefly recall the steps involved in the algorithm. For a thorough discussion of the algorithm we refer the interested reader to [10].

      Require: R, a set of attributes. Require: C, the class attribute. Require: S, data set of tuples. 1: if R is empty then

      2: Return the leaf having the most frequent value in data set S.

      3: else if all tuples in S have the same class value then 4: Return a leaf with that specific class value.

      5: else

      6: Determine attribute A with the highest information gain in S.

      7: Partition S in m parts S(a1), …, S(am) such that a1, …, am are the different values of A.

      8: Return a tree with root A and m branches labeled a1…am, such that branch i contains ID3(R {A} ,C, S(ai)).

      9: end if

  2. RELATED WORK

    In [5] Network anomaly detection aims at detecting malicious activities in computer network traffic data. In this approach, the normal profile of the network traffic is modeled and any significant deviation from this normal profile is interpreted as malicious. While supervised anomaly detection models the normal traffic behavior on the basis of an attack free data set, unsupervised anomaly detection works on a data set which contains both normal and attack data. Clustering has recently been investigated as one way of approaching the issues of unsupevised anomaly detection.

    Our contribution: The main goal of the paper has been to investigate the efficiency of different classical clustering algorithms in clustering network traffic data for unsupervised anomaly detection. The clusters obtained by clustering the network traffic data set are intended to be used by a security expert for manual labeling. A second goal has been to study some possible ways of combining these algorithms in order to improve their performance.

    In [6] Clustering is a division of data into group s of similar objects. Each group, called a cluster, consists of objects that are similar between themselves and dissimilar compared to objects of other groups. This paper is intended to study and compare different data clustering algorithms. The algorithms under investigation are: k-means algorithm, hierarchical clustering algorithm, self-organizing maps algorithm, and expectation maximization clustering algorithm. All these algorithms are compared according to the following factors: size of dataset, number of clusters, type of dataset and type of software used.

    Our contribution: The main conclusion that can be concluded is the performance comparison of different clustering algorithm.

    In [7] This consider privacy preserving decision tree induction via ID3 in the case where the training data is horizontally or vertically distributed. Furthermore, we consider the same problem in the case where the data is both horizontally and vertically distributed, a situation we refer to as grid partitioned data. We give an algorithm for privacy preserving ID3 over horizontally partitioned data involving more than two parties. For grid partitioned data, we discuss two different evaluation methods for preserving privacy ID3, namely, first merging horizontally and developing vertically or first merging vertically and next developing horizontally. Next to introducing privacy preserving data mining over grid-partitioned data, the main contribution of this paper is that we show, by means of a complexity analysis that the former evaluation method is the more efficient.

    Our contribution: Here the datasets when partitioned horizontally, vertically and after that the clustering algorithm is applied performs better performance than on the whole datasets.

    In [8] This paper presents a novel host-based combinatorial method based on k-Means clustering and ID3 decision tree learning algorithms for unsupervised classification of anomalous and normal activities in computer network ARP traffic. The k-Means clustering method is first applied to the normal training instances to partition it into k clusters using Euclidean distance similarity. An ID3 decision tree is constructed on each cluster. Anomaly scores from the k- Means clustering algorithm and decisions of the ID3

    decision trees are extracted. A special algorithm is used to combine results of the two algorithms and obtain final anomaly score values. The threshold rule is applied for making decision on the test instance normality or abnormality.

    Our contribution: The proposed method is compared with the individual k-Means and ID3 methods and the other proposed approaches based on markovian chains and stochastic learning automata in terms of the overall classification performance defined over five different performance measures. Results on real evaluation test bed network data sets show that: the proposed method outperforms the individual k- Means and the ID3 compared to the other approaches.

    In [9] Traditionally, research on graph theory focused on studying graphs that are static. However, almost all real networks are dynamic in nature and large in size. Quite recently, research areas for studying the topology, evolution, applications of complex evolving networks and processes occurring in them and governing them attracted attention from researchers. In this work, we review the significant contributions in the literature on complex evolving networks; metrics used from degree distribution to spectral graph analysis, real world applications from biology to social sciences, problem domains from anomaly detection, dynamic graph clustering to community detection.

    Our contribution: Many real world complex systems can be represented as graphs. The entities in these system represent the nodes or vertices and links or edges connect a pair or more of the nodes. We encounter such networks in almost any application domain i.e. computer science, sociology, chemistry, biology, anthropology, psychology, geography, history, engineering.

  3. Proposed ID3 algorithm

      • Define P1, P2, ., Pn Parties.(Horizontally partitioned).

      • Each Party contains R set of attributes A1, A2, ., AR.

      • C the class attributes contains c class values C1, C2,

        ., Cc.

      • For party Pi where i = 1 to n do

      • If R is Empty Then

      • Return a leaf node with class value

      • Else If all transaction in T(Pi) have the same class Then

      • Return a leaf node with the class value

      • Else

      • Calculate Expected Information classify the given sample for each party Pi individually.

      • Calculate Entropy for each attribute (A1, A2, ., AR) of each party Pi.

      • Calculate Information Gain for each attribute (A1, A2,., AR) of each party Pi

      • Calculate Total Information Gain for each attribute of all parties (TotalInformationGain( )).

      • ABestAttribute MaxInformationGain( )

      • Let V1, V2, ., Vm be the value of attributes. ABestAttribute partitioned P1, P2,., Pn parties into m parties

      • P1(V1), P1(V2), ., P1(Vm)

      • P2(V1), P2(V2), ., P2(Vm)

      • . .

      • . .

      • Pn(V1), Pn(V2), ., Pn(Vm)

      • Return the Tree whose Root is labelled ABestAttribute and has m edges labelled V1, V2, ., Vm. Such that for every i the edge Vi goes to the Tree

      • NPPID3(R ABestAttribute, C, (P1(Vi), P2(Vi), ., Pn(Vi)))

      • End.

  4. RESULTANALYSIS

    As shown in Fig 1. is the time needed for the decision of any dataset? It was observerd that the existing id3takes more time as compared our proposed work.

    Where,

    HP is the proposed horizontal partioned based ID3.

    Fig. 1

    Relative absolute error Relative error gives an indication of how good a measurement is relative to the size of the thing being measured.

    relative error = absolute error / value of thing measured

    As shown in Fig 1. is the time needed for the decision of any dataset? It was observed that the existing id3 takes more time as compared our proposed work. Where,

    HP is the proposed horizontal partioned based ID3.

    Relative absolute error can be calculated as: (|p1-a1|+.+|pn-an |) / -a1|+..|-an|)

    As shown in the Table(a) is the comparative analysis of time complexity of the existing id3 based decision tree and the horizontal portioning based decision tree. It was found that our proposed algorithm takes very much less time in making of a tree.

    number_of_instances

    id3 time(ms)

    HP

    time(ms)

    14

    78

    15

    25

    93

    15

    50

    110

    16

    100

    125

    31

    200

    150

    32

    In Fig 3. It was observed that the proposed algorithm has less absolute error than the existing algorithm.

    Table (a)

    Mean squared error The mean squared error (MSE) of an estimator is one of many ways to quantify the difference between values implied by an estimator and the true values of the quantity being estimated. MSE is a risk function, corresponding to the expected value of the squared error loss or quadratic loss. MSE measures the average of he squares of the "errors." The error is the amount by which the value implied by the estimator differs from the quantity to be estimated. The difference occurs because of randomness or because the estimator doesn't account for information that could produce a more accurate estimate.It can be calculated

    as:((p1-a1)2+…+(pn-an )2) / (( -a1)2+…( -an)2)

    with Actual target values: a1 a2 an Predicted target values: p1 p2 pn

    Fig. 2

    In Fig 2. Our proposed work performs less means square error as compared to the existing algorithm.

    Fig. 3

    Table 1.

    Clustering with proposed id3

    Time (ms)

    Mean

    absolute error

    Mean

    absolute error

    K-mean with proposed id3

    47

    0.0714

    14.2105 %

    Hierarchical with proposed id3

    31

    0.0357

    36.5854 %

    EM with proposed

    id3

    43

    0.0238

    5.4119 %

    As shown in the table 1 is the comparative study of different clustering algorithm with our proposed algorithm.

    Clustering with existing id3

    Time (ms)

    Mean absolute

    error

    Mean absolute

    error

    K-mean with existing id3

    65

    0.0914

    20.2105 %

    Hierarchical with existing id3

    50

    0.0557

    45.5854 %

    EM with existing

    id3

    60

    0.0438

    7.4119 %

    Table 2.

    As shown in the Table(b) is the comparative analysis of the mean absolute error of the existing id3 based decision tree and the horizontal portioned base decision tree. Although the difference between the existing and the proposed algorithm is less, but having more absolute error will reduce the efficiency of the algorithm.

    number_of_instances

    ID3 Mean absolute error

    HP Mean absolute error

    14

    0.2857

    NP

    25

    0.24

    0.237

    50

    0.24

    0.24

    100

    0.23

    0.22

    200

    0.235

    0.23

    Table(b)

  5. CONCLUSION

The clustering algorithms are used to divide any datasets into a number of clusters, this time clustering algorithms are combined with ID3 algorithm to detect the network anomaly detection and the performance is compared with the other clustering algorithms. The proposed algorithm implemented here provides a way of classifying and provides better leaning of the network anomalies and normal activities in computer network ARP traffic.

REFERENCES

  1. A comparative Study of Anomaly Detection Schemes in Network Intrusion Detection, A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur, J. Srivastava.

  2. Keogh E., Chakrabarti K., Pazzani M., and Mehrotra S., Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases, Knowledge and Information Systems, vol. 3, pp. 263-286, 2001.

  3. Lepere R. and Trystram D., A New Clustering Algorithm for Large Communication Delays, in Proceedings of 16th IEEE-ACM Annual International Parallel and Distributed Processing Symposium (IPDPS02), Fort Lauderdale, USA, 2002.

  4. Li C. and Biswas G., Unsupervised Learning with Mixed Numeric and Nominal Data, IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 4, pp. 673-690, 2002.

  5. A comparison of clustering method for unsupervised anomaly detection in network traffic, Koffi Bruno Yao.

  6. Comparisons between Data Clustering Algorithms,Osama Abu Abbas Computer Science Department, Yarmouk University, Jordan

  7. Privacy Preserving ID3 over Horizontally, Vertically and Grid Partitioned Data,Bart Kuijpers, Vanessa Lemmens, Bart Moelans Theoretical Computer Science, Hasselt University & Transnational University Limburg, Belgium.

  8. A Novel Unsupervised Classification Approach for Network Anomaly Detection by K Means Clustering and ID3 Decision Tree Learning Methods,Yasser Yasami, Saadat Pour Mozaffari, Computer Engineering Department Amirkabir University of Technology (AUT) Tehran, Iran.

  9. Dynamic Network Evolution: Models, Clustering, Anomaly Detection,Cemal Cagatay Bilgin and B¨ulent Yener Rensselaer Polytechnic Institute, Troy NY, 12180.,

  10. Wenke Lee and S. J. Stolfo. Data Mining Approaches for Intrusion Detection, 1998.

  11. Stefano Zanero and Sergio M. Savaresi. Unsupervised learning techniques for an intrusion detection system, ACM March 2004.

  12. Martin Ester, Hans-Peter Kriegel,Jorg Sander,Xiaowei Xu. A density-based clustering algorithm for discovering Clusters in Large Spatial databases with noise. Proceedings of 2nd international Conference on Knowledge Discovery and Data Mining, 1996.

  13. A. Wespi, G. Vigna and L.Deri. Recent Advances in Intrusion Detection. 5th International Symposium, Raid 2002 Zurich, Switzerland, October 2002 Proceedings. Springer.

  14. G. Qu, S. Hariri, and M. Yousif, A New Dependency and Correlation Analysis for Features, IEEE Trans. Knowledge and Data Eng., vol. 17, no. 9, pp. 1199-1207, Sept. 2005.

  15. J. Kittler, M. Hatef, R.P.W. Duin, and J. Matas, On Combining Classifiers, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226-239, Mar. 1998.

Leave a Reply