A Review on Online Anomaly Detection Techniques

DOI : 10.17577/IJERTV3IS11103


S. S. Raut1, D. S. Randhave2, K. V. Sonkamble3, S. N. Deshmukh4

1,2,3,4 Department of Computer Science and IT, Dr. B. A. M. University, Aurangabad – 431004.

Abstract

Anomaly detection has been a crucial research topic in data mining and machine learning. Many real-world applications, such as intrusion or credit card fraud detection, need an effective and efficient framework to identify deviated data instances. A good anomaly detection method must be able to accurately identify many types of anomalies, be robust, require comparatively few resources, and perform detection in real time. This paper provides an overview of the major technological perspectives and basic progress of online anomaly detection, and surveys the techniques developed at every stage of online anomaly detection. It is intended to help in selecting a technique, together with its relative merits and demerits.

  1. Introduction

Anomaly (or outlier) detection aims to identify a small group of instances that deviate remarkably from the existing data. A renowned definition of an outlier is given in [1]: "an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism", which conveys the general idea of an outlier. Anomaly detection is important because anomalies in data translate to significant, and often critical, actionable information in a wide variety of application domains. Anomaly detection is found in applications such as homeland security, credit card fraud detection, intrusion and insider threat detection in cyber-security, fault detection, and malignancy diagnosis. However, since only a limited amount of labeled data is available in such real-world applications, the problem of determining the anomaly of unseen data (or events) attracts attention from researchers in the data mining and machine learning communities.

    1. Types of Anomaly

      1. Point Anomalies

If an individual data instance can be considered anomalous with respect to the rest of the data, then the instance is termed a point anomaly. This is the simplest type of anomaly and is the focus of the majority of research on anomaly detection.

      2. Contextual Anomalies

If a data instance is anomalous in a specific context, but not otherwise, then it is termed a contextual anomaly (also referred to as a conditional anomaly [2]). The notion of a context is induced by the structure in the data set and has to be specified as a part of the problem formulation.

      3. Collective Anomalies

If a group of related data instances is anomalous with respect to the whole data set, it is termed a collective anomaly. The individual data instances in a collective anomaly may not be anomalies by themselves; however, their occurrence together as a group is anomalous.

The techniques used for detecting collective anomalies are very different from the point and contextual anomaly detection techniques.

  2. Anomaly Detection Techniques

1. Unsupervised Anomaly Detection

      Techniques that operate in unsupervised mode do not need training data, and are therefore the most widely applicable. The techniques in this class make the implicit assumption that normal instances are far more frequent than anomalies in the test data. If this assumption is not true, such techniques suffer from a high false alarm rate. Many semi-supervised techniques can be adapted to operate in an unsupervised mode by using a sample of the unlabeled data set as training data. Such adaptation assumes that the test data contains only a few anomalies and that the model learned during training is robust to those few anomalies.

    2. Supervised Anomaly Detection

Techniques trained in supervised mode assume the availability of a training data set that has labeled instances for the normal as well as the anomaly class. A typical approach in such cases is to build a predictive model for normal vs. anomaly classes. Any unseen data instance is compared against the model to determine which class it belongs to. Two major issues arise in supervised anomaly detection. First, the anomalous instances are far fewer than the normal instances in the training data. Issues arising from such imbalanced class distributions have been addressed in the data mining and machine learning literature [3][4]. Second, obtaining accurate and representative labels, especially for the anomaly class, is usually challenging. A number of techniques have been proposed that inject artificial anomalies into a normal data set to obtain a labeled training data set [5]. Apart from these two issues, the supervised anomaly detection problem is similar to building predictive models.

3. Semi-Supervised Anomaly Detection

      Techniques that operate in a semi-supervised mode assume that the training data set has labeled instances only for the normal class. Since they do not need labels for the anomaly class, they are more widely applicable than supervised techniques. For example, in spacecraft fault detection [6], an anomaly scenario would signify an accident, which is not easy to model. The typical approach used in such techniques is to build a model for the class corresponding to normal behavior, and to use the model to identify anomalies in the test data. A limited set of anomaly detection techniques exists that assumes availability of only the anomaly instances for training [7]. Such techniques are not commonly used, primarily because it is difficult to obtain a training data set that covers every possible anomalous behavior that can occur in the data.

  3. Online Anomaly Detection Techniques

1. SCAN Statistics

      SCAN statistics is a robust methodology for detecting unusually high rates of events, also referred to as anomalies. Scanning for bursts of events has applications in numerous fields such as telecommunications, medicine, biology, astronomy, quality control, and reliability [8]. In the monitoring and management of communication networks, scan statistics can be used to monitor the occurrence of events in time, a point process, such as status messages, alarms, and faults. The method does not look for outliers, but rather for unusual bursts of events. In an online application for events occurring in time, the number of events that have occurred in the scanning window [t - w, t], where t is the current time and w is the scanning window size, is compared with the number of events expected to have occurred in that window under normal conditions. If that number of events is large compared to what is expected, then an alert of an anomaly can be given. Scan statistics can be used to compute the distribution of events under normal conditions (the null hypothesis, H_0) to determine what constitutes a significantly large number (the critical value) in the scanning window, while properly controlling the false positive rate (FPR), that is, the probability of exceeding the critical value for any scanning window of size w within the larger time interval [0, T) under H_0. A key advantage of scan statistics is that it permits a computationally simple implementation; therefore, it is possible to monitor many processes at once with a small processing burden.

      In the usual treatment of scan statistics, the times of events occurring in the interval [0, T) are assumed to be generated by a Poisson process under H_0. In a Poisson process with rate lambda over [0, T), the number of events N is distributed as Poisson(lambda T). The inter-arrival times are iid distributed according to Exponential(lambda). Conditional on there being N events, the times of the events are distributed uniformly in [0, T). Assuming that N is known, the scan statistic is defined as follows. Let t_1, t_2, ..., t_N denote the ordered times of the events occurring in the interval [0, T) and let Y_w(t) be the number of points in the interval [t, t + w]. (Extensions of scan statistics exist in discrete time and on circular data, such as time of year, but we do not focus on them here.) The scan statistic S_w is then defined as the maximum number of points to be found in any subinterval of [0, T) of length w [9]. That is,

      S_w = max_{1 <= i <= N} Y_w(t_i) = max_{1 <= i <= N} |{j : t_i <= t_j <= t_i + w}|. (1)

      A related statistic is W_k, the length of the minimum subinterval of [0, T) containing k points:

      W_k = min_{1 <= i <= N - k + 1} (t_{i+k-1} - t_i). (2)

      The distributions of these statistics are related by P(S_w >= k) = P(W_k <= w). One should also note the edge cases: P(S_w >= 1) = 1 whenever N >= 1, W_1 = 0, and S_w = N for w >= T. The key trick in scan statistics is controlling the FPR by accounting for the overlapping multiple comparisons that result from the rolling scan window, while maintaining more power than a simple Bonferroni correction.

      For a Poisson process with mean rate lambda per unit time over the interval [0, T), [10] gives the following approximation for the distribution P(k; psi, L), where psi = lambda w and L = T/w (P(k; psi, L) is also equal to P(S_w >= k)). Let p(k; psi) be the probability of exactly k events occurring for a Poisson distribution with mean psi, and F(k; psi) the cumulative distribution function (CDF) of that Poisson; then

      P(k; psi, L) ~ 1 - Q_2 (Q_3 / Q_2)^(L-2), (3)

      where, following [10],

      Q_2 = F(k-1; psi)^2 - p(k; psi) [ (k-1) F(k-2; psi) - psi F(k-3; psi) ], (4)

      Q_3 = F(k-1; psi)^3 - A_1 + A_2 + A_3 - A_4, (5)

      with

      A_1 = 2 p(k; psi) F(k-1; psi) [ (k-1) F(k-2; psi) - psi F(k-3; psi) ],
      A_2 = 0.5 p(k; psi)^2 [ (k-1)(k-2) F(k-3; psi) - 2(k-2) psi F(k-4; psi) + psi^2 F(k-5; psi) ],
      A_3 = sum_{i=1}^{k-1} p(2k - i; psi) F(i-1; psi)^2,
      A_4 = sum_{i=2}^{k-1} p(2k - i; psi) p(i; psi) [ (i-1) F(i-2; psi) - psi F(i-3; psi) ].

      To test the null hypothesis H_0 that the background rate is lambda = lambda_0 at the significance level alpha, find the smallest k, which we call k_alpha, such that

      P(k; psi_0, L) <= alpha, (6)

      where psi_0 = lambda_0 w. For an online test with fixed w, if at time t (the current time) the number of points occurring in the time interval of length w ending at t, [t - w, t], is at least k_alpha, then the null hypothesis is rejected at significance level alpha, and an alert may be given indicating that the rate of events has likely increased. An equivalent alternative test is to determine the length of time W = t_N - t_{N-k_alpha+1} separating the most recent k_alpha points [11]. If W <= w, then an alert may be given.
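As an illustration, the two statistics in Eqs. (1) and (2) can be computed directly from a sorted list of event times. The following sketch is a naive implementation (function and variable names are our own, not from [9]):

```python
from bisect import bisect_right

def scan_statistic(times, w):
    # S_w of Eq. (1): the maximum number of events falling in any
    # subinterval of length w. `times` must be sorted event times.
    return max(
        (bisect_right(times, t + w) - i for i, t in enumerate(times)),
        default=0,
    )

def min_window(times, k):
    # W_k of Eq. (2): the length of the smallest subinterval containing
    # k events. `times` must be sorted event times.
    if k <= 1:
        return 0.0
    if k > len(times):
        return float("inf")
    return min(times[i + k - 1] - times[i] for i in range(len(times) - k + 1))
```

The duality P(S_w >= k) = P(W_k <= w) can be checked on small examples: scan_statistic(times, w) >= k holds exactly when min_window(times, k) <= w.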

2. Incremental Local Outlier Factor

      The incremental LOF technique provides detection performance equivalent to the iterated static LOF technique (applied after insertion of every data record), while requiring significantly less computation time. In addition, the incremental LOF technique dynamically updates the profiles of data points. This is a very important property, since data profiles may change over time. The incremental LOF algorithm is computationally efficient, while at the same time successfully detecting outliers and changes of distributional behavior in various data stream applications. The Local Outlier Factor algorithm [11] has been successfully applied in many domains for outlier detection in a batch mode [12, 13]. The incremental LOF algorithm is applicable for detection of outliers in data streams and is the first incremental outlier detection algorithm. It provides detection performance equivalent to the static LOF technique, and has O(n log n) time complexity, where n is the total number of data points.

      In designing the incremental LOF algorithm, two requirements are in focus. First, the results of the incremental algorithm should be the same as the results of the static algorithm every time a new point is inserted into the data set. Thus, there would be no difference between applying incremental LOF and the periodic static LOF once all data records up to a time instant are available. Second, the asymptotic time complexity of the incremental LOF algorithm should be comparable to that of the static LOF algorithm. In order to have a feasible incremental algorithm, it is essential that, at any time moment, insertion or deletion of a data record results in a limited (preferably small) number of updates of algorithm parameters. Specifically, the number of updates per insertion/deletion must not depend on the current number of records in the data set; otherwise, the performance of the incremental LOF algorithm would be O(n^2), where n is the final size of the data set.

      The incremental LOF algorithm computes the LOF value for every data record inserted into the data set and immediately determines whether the inserted data record is an outlier. In addition, LOF values for existing data records are updated if required. The algorithm hence works in two phases: insertion and deletion.
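For reference, the quantities that incremental LOF maintains (k-distance, reachability distance, lrd and LOF, cf. Eqs. (8) and (9)) can be computed directly in batch mode. The following is a naive O(n^2) static illustration, not the incremental algorithm of [14]; the names are ours:

```python
import math

def dist(a, b):
    # Euclidean distance between two points given as coordinate tuples
    return math.dist(a, b)

def knn(data, p, k):
    # indices of the k nearest neighbours of data[p] (excluding p itself)
    order = sorted((dist(data[p], data[q]), q) for q in range(len(data)) if q != p)
    return [q for _, q in order[:k]]

def k_distance(data, p, k):
    # distance from data[p] to its k-th nearest neighbour
    return sorted(dist(data[p], data[q]) for q in range(len(data)) if q != p)[k - 1]

def reach_dist(data, p, q, k):
    # reachability distance of p with respect to q
    return max(dist(data[p], data[q]), k_distance(data, q, k))

def lrd(data, p, k):
    # local reachability density, Eq. (8)
    nbrs = knn(data, p, k)
    return len(nbrs) / sum(reach_dist(data, p, q, k) for q in nbrs)

def lof(data, p, k):
    # local outlier factor, Eq. (9); values well above 1 indicate outliers
    nbrs = knn(data, p, k)
    return sum(lrd(data, q, k) for q in nbrs) / (len(nbrs) * lrd(data, p, k))
```

On a tight cluster with one distant point, the cluster members score close to 1 while the distant point scores far above 1; the incremental algorithm reproduces exactly these values while touching only the affected records per update.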

1. Insertion

        In the insertion phase, the algorithm performs two steps: a) insertion of the new record p_c, when it computes the reach-dist, lrd and LOF values of the new point; b) maintenance, when it updates k-distances, reach-dist, lrd and LOF values for the affected existing points. The correctness of the insertion phase rests on the following Theorems 1-4 [14].

        Theorem 1. The insertion of point p_c affects the k-distance only at points p_j that have the point p_c in their k-nearest neighborhood, i.e., p_j in kRNN(p_c). The new k-distance of the affected points is updated as follows:

        k-distance_new(p_j) = d(p_j, p_c), if p_c becomes the new k-th nearest neighbor of p_j; (k-1)-distance_old(p_j), otherwise. (7)

        Proof. During insertion, the k-distance of an existing point p_j changes only when the new point p_c enters the k-nearest neighborhood of p_j, since only in this case does the k-th nearest neighbor of p_j change. If the new point p_c is the new k-th nearest neighbor of p_j, its distance from p_j becomes the new k-distance(p_j). Otherwise, the old (k-1)-th neighbor of p_j becomes the new k-th nearest neighbor of p_j (see Fig. 1).

        Fig. 1. Update of k-nearest neighbor distance upon insertion of a new record (k = 3). a) The new record p_c is not among the 3-NN of record p_j, so 3-distance(p_j) does not change; b) the new record p_c is among the 3-NN of p_j, so 3-distance(p_j) decreases. Cyan dashed lines denote updates of reachability distances between point p_j and two old points.

        Corollary 1. During insertion, the k-distance cannot increase, i.e., k-distance_new(p_j) <= k-distance_old(p_j).

        Theorem 2. A change of k-distance(p_j) may affect reach-dist_k(p_i, p_j) for points p_i that are k-neighbors of p_j.

        Proof. By definition, reach-dist_k(p_i, p_j) = max(d(p_i, p_j), k-distance(p_j)), where d(p_i, p_j) is the Euclidean distance from p_i to p_j. If d(p_i, p_j) > k-distance_old(p_j), then reach-dist_k(p_i, p_j) = d(p_i, p_j); since, according to Corollary 1, k-distance(p_j) cannot increase, the reachability distance does not change. If d(p_i, p_j) <= k-distance_old(p_j), the reachability distance may decrease to max(d(p_i, p_j), k-distance_new(p_j)).

        Theorem 3. The lrd value needs to be updated for every record whose k-distance changes or whose reachability distance to one of its k-nearest neighbors changes. Hence, after each update of reach-dist_k(p_i, p_j) we have to update lrd(p_i) if p_j is among kNN(p_i). Also, lrd is updated for all points whose k-distance was updated.

        Proof. A change of k-distance(p_j) affects the scope of the sum in Eq. (8) computed over all k-nearest neighbors of p_j. A change of the reachability distance between p_j and one of its k-nearest neighbors affects the corresponding term in the denominator of Eq. (8).

        lrd(p) = 1 / ( sum_{q in kNN(p)} reach-dist_k(p, q) / |kNN(p)| ). (8)

        Theorem 4. The LOF value needs to be updated for all data records whose lrd has been updated (since lrd(p) is a denominator in Eq. (9)) and for those records that have such records in their k-nearest neighborhoods. Hence, the set of data records where LOF needs to be updated (according to Eq. (9)) corresponds to the union of the updated records and their kRNN.

        Proof. Similar to the proof of Theorem 3, using Eq. (9).

        LOF(p) = ( sum_{q in kNN(p)} lrd(q) / lrd(p) ) / |kNN(p)|. (9)
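Theorem 1 is what bounds the maintenance work: only the reverse k-nearest neighbors of the inserted record need a k-distance update. A naive sketch of locating that affected set (our own illustrative names; the algorithm of [14] uses index structures rather than a linear scan):

```python
import math

def k_distance(pts, j, k):
    # distance from pts[j] to its k-th nearest neighbour among the other points
    return sorted(math.dist(pts[j], pts[q]) for q in range(len(pts)) if q != j)[k - 1]

def affected_by_insertion(pts, new_pt, k):
    # Theorem 1: the k-distance changes exactly for records that admit the
    # new point into their k-nearest neighbourhood, i.e. for kRNN(new_pt)
    return [j for j in range(len(pts))
            if math.dist(pts[j], new_pt) < k_distance(pts, j, k)]
```

The size of this set does not depend on the total number of records, which is what keeps the per-insertion cost bounded.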

2. Deletion

        The general framework for deleting a block of data records from the data set S is given in [14]. The deletion of each record p_c from the data set influences the k-distance of its reverse k-nearest neighbors: k-distance increases for each data record p_j that is in kRNN(p_c). For such records, k-distance_new(p_j) becomes equal to the distance from p_j to its new k-th nearest neighbor. The reachability distances from the (k-1) nearest neighbors of p_j to p_j need to be updated. Observe that the reachability distance from the k-th neighbor of p_j to record p_j is already equal to their Euclidean distance and does not need to be updated (Fig. 2).

        Fig. 2. Update of k-nearest neighbor distance upon deletion of record p_c (k = 3). a) Prior to deletion, data record p_c is among the 3-NN of record p_j; b) after deletion of p_c, 3-distance(p_j) increases and the reachability distances from the two nearest neighbors of p_j (denoted by cyan dashed lines) are updated to the new 3-distance(p_j).

        Analogous to insertion, the lrd value needs to be updated for all points p_j whose k-distance is updated. In addition, the lrd value needs to be updated for points p_i such that p_j is in kNN(p_i) and p_i is in kNN(p_j). Finally, the LOF value is updated for all points on which the lrd value is updated, as well as for their kRNN. The correctness of the deletion algorithm can be proven analogously to the correctness of the insertion algorithm.

3. Kernel Estimation Based Anomaly Detection Technique

      Large backbone networks are regularly affected by a range of anomalies. This technique performs online anomaly detection based on the mathematical machinery of Kernel Density Estimates (KDE) [15]. It sequentially and adaptively learns the definition of normality in the online application, assumes no prior knowledge regarding the underlying distributions, and then detects anomalies. The anomaly detection threshold is mathematically linked to a user-specified tolerance level for false alarms. The technique is built around two problem statements:

      Problem 1. Given a sequence of data points {X_i}_{i=t-L}^{t+L}, determine whether X_t is a realization of probability distribution P_t or Q_t. The points {X_i}_{i=t-L}^{t+L} are considered independent observations from the mixture distribution A_t:

      A_t = (1 - nu) P_t + nu Q_t, (10)

      where nu is the mixing fraction. Here P_t is slowly time-varying, while Q_t is time-invariant, and the test must achieve an operator-specified probability of false discovery, alpha.

      Problem 2. Make a preliminary decision about the underlying distribution of X_t at time t, and a final decision at time t + L. Thus the preliminary decision is based on the data sequence {X_i}_{i=t-L}^{t} and the final decision on {X_i}_{i=t-L}^{t+L}.

      As stated in Problem 2, the aim is to make a preliminary decision using {X_i}_{i=t-L}^{t} as the sample from P_t, and eventually a final decision using {X_i}_{i=t-L}^{t+L}. The following method of obtaining an estimated detection statistic T_t is proposed; T_t is subsequently used as the online detection statistic that approximates the probability of false discovery achieved with the algorithm of [15] for Problem 1.

      At each timestep t, the Kernel Estimation based Anomaly Detection (KEAD) algorithm first evaluates the mean squared error e_t in representing X_t using a relatively small dictionary D_t = {X~_j}_{j=1}^{m_t} of approximately linearly independent elements in the feature space defined by the kernel function. The error e_t may be derived to be [16]:

      e_t = min_a { || sum_{j=1}^{m_{t-1}} a_j phi(X~_j) - phi(X_t) ||^2 } = min_a { a^T K~_{t-1} a - 2 a^T k~_{t-1}(X_t) + k(X_t, X_t) }, (11)

      where [K~_{t-1}]_{i,j} = k(X~_i, X~_j) and [k~_{t-1}(X_t)]_i = k(X~_i, X_t) for i, j = 1, ..., m_{t-1}. Note that {X~_j}_{j=1}^{m_{t-1}} are those selections from {X_i}_{i=1}^{t-1} that have been entered into the dictionary up to time t - 1. The optimal sparsification coefficient vector a_t that minimizes e_t at time t is then:

      a_t = K~_{t-1}^{-1} k~_{t-1}(X_t). (12)

      The expression for the error then simplifies to:

      e_t = k(X_t, X_t) - k~_{t-1}(X_t)^T a_t. (13)

      The algorithm also maintains a sliding window of the optimal sparsification coefficient vectors for the past L timesteps. One may then use the dictionary D_{t-1} and the matrix of past optimal sparsification coefficient vectors to obtain the online detection statistic T_t:

      T_t = (1/L) sum_{i=t-L}^{t-1} sum_{j=1}^{m_{t-1}} a_{i,j} k(X~_j, X_t). (14)

      T_t is an approximation in two respects. First, the (sparse) dictionary is used as a sample of representative input vectors from the interval {X_i}_{i=t-L}^{t-1}; the error introduced on this count is governed by the sparsification parameter. Second, the interval {X_i}_{i=t-L}^{t-1} is used as representative of the interval of interest {X_i}_{i=t-L}^{t+L}; the error introduced on this count is governed by the window length L.
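Stripped of the sparse-dictionary machinery of Eqs. (11)-(14), the core idea is a sliding-window kernel density estimate of the current point: normal points fall where the recent window places high density. A minimal sketch (Gaussian kernel; the names and the fixed bandwidth are our own illustrative assumptions, not from [15]):

```python
import math

def gauss_kernel(a, b, sigma=1.0):
    # Gaussian kernel between two points given as coordinate tuples
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq / (2.0 * sigma ** 2))

def kde_score(window, x, sigma=1.0):
    # kernel density estimate of x from a sliding window of recent points;
    # a low score suggests x deviates from the learned notion of normality
    return sum(gauss_kernel(p, x, sigma) for p in window) / len(window)

def is_anomaly(window, x, threshold, sigma=1.0):
    # flag x when its estimated density falls below an operator-set threshold
    return kde_score(window, x, sigma) < threshold
```

KEAD's contribution over this naive form is summing over a small dictionary with precomputed coefficients instead of over every point in the window, and calibrating the threshold to the operator's false-discovery tolerance.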

4. Online Oversampling Principal Component Analysis

      The online oversampling principal component analysis (osPCA) algorithm [17] is proposed to address the anomaly detection problem; it aims to detect the presence of outliers from a large amount of data via an online updating technique. Unlike prior principal component analysis (PCA)-based approaches, it does not store the entire data matrix or covariance matrix, and thus this approach is especially of interest in online or large-scale problems. The task is accomplished by oversampling the target instance and extracting the principal direction of the data: osPCA determines the anomaly of the target instance according to the variation of the resulting dominant eigenvector.

      osPCA uses the leave-one-out (LOO) strategy along with PCA. The LOO strategy exploits the fact that adding (or removing) an abnormal data point affects the principal direction much more than adding or removing a normal data point. Using this strategy, the principal direction of the data set is calculated without considering the target instance, and again with it. The outlierness of the target instance can then be determined by the variation of the resulting principal directions; more precisely, the difference between these two eigenvectors indicates the anomaly of the target instance. By ranking the dissimilarity scores of all data points, one can identify outlier instances by a predefined threshold or a preset portion of the data. While this works well for applications with moderate data set sizes, the variation of principal directions may not be significant when the size of the data set is large.

      However, this LOO anomaly detection procedure with an oversampling strategy markedly increases the computational load: for every target instance, one has to build a dense covariance matrix and solve the associated PCA problem, which rules the framework out for real-world large-scale applications. Although the standard power method can be used to approximate PCA solutions, it requires storage of the covariance matrix and cannot easily be extended to applications with streaming data or online settings. Therefore, an online updating technique is used along with osPCA. This updating technique permits efficient calculation of the approximated dominant eigenvector without performing eigen-analysis or storing the data covariance matrix. Compared to the power method or other popular anomaly detection algorithms, the required computational costs and memory requirements are significantly reduced, so this technique is highly preferred in online, streaming-data, or large-scale problems.

      The osPCA technique duplicates the target instance multiple times; the idea is to amplify the effect of an outlier rather than that of normal data. While it might not be adequate to perform anomaly detection based solely on the most dominant eigenvector while ignoring the remaining ones, online osPCA aims to efficiently determine the anomaly of each target instance without sacrificing computation and memory efficiency. More specifically, if the target instance is an outlier, this oversampling scheme allows its effect on the most dominant eigenvector to be exaggerated, so osPCA focuses on extracting and approximating the dominant principal direction in an online fashion rather than computing multiple eigenvectors exactly. In the detailed formulation, suppose the target instance x_t is oversampled n~ times; the associated PCA can be formulated as follows:

      Sigma_A~ u = lambda u, (15)

      where A~ = [A^T, x_t, ..., x_t]^T in R^((n+n~) x p) is the data matrix A augmented with n~ copies of x_t. The mean of A~ is mu~ = (n mu + n~ x_t) / (n + n~), and thus

      Sigma_A~ = (1/(n+n~)) sum_{i=1}^{n} (x_i - mu~)(x_i - mu~)^T + (n~/(n+n~)) (x_t - mu~)(x_t - mu~)^T. (16)

      In the osPCA framework, the target instance is duplicated n~ times (e.g., 10 percent of the size of the original data set), and the score of outlierness s_t of the target instance is computed from the dominant eigenvectors u (without oversampling) and u~ (with oversampling) as

      s_t = 1 - |<u, u~>| / (||u|| ||u~||). (17)

      If this score is above some predetermined threshold, the target instance is considered an outlier. With this oversampling strategy, if the target instance is normal data, negligible changes are observed in the principal directions and the mean of the data. It is worth noting that osPCA not only determines outliers in the present data; it can also be applied to anomaly detection problems with streaming data or online requirements.

4. Comparison

    The preceding sections describe anomaly detection techniques that have been developed for online applications or data streams. In Section 3 we briefly described the various online anomaly detection techniques. The studied techniques are compared with respect to computational complexity and memory requirement. Table 1 shows the basis on which we have compared these different online anomaly detection techniques.

    Technique              Computational Complexity    Memory Requirement
    SCAN Statistics [9]    O(n)                        O(n)
    Incremental LOF [14]   O(n log n)                  O(np)
    KEAD [15]              O(np^2 + n^2)               O(n^3)
    osPCA [17]             O(p)                        O(p)

    Table 1. Comparison of SCAN Statistics, Incremental LOF, KEAD and osPCA for online anomaly detection in terms of computational complexity and memory requirement. Note that n and p are the size and dimensionality of the data, respectively.

5. Conclusion

    In this paper we have studied a number of anomaly detection techniques developed for online applications or data streams. We have found that the bulk of the work has been done in offline mode, and comparatively little work has been done, or is ongoing, for online applications. The techniques that have been developed are at a preliminary stage and do not cope well with large data sets, and the accuracy of these techniques is lower than that of offline applications.

    This study also briefly describes the main features of several anomaly detection techniques that are currently available. The presented information constitutes an important starting point for addressing research and development in the field of online anomaly detection. Countermeasures that are faster and more effective are required to cope with data of higher dimensionality. We find that the majority of surveyed works do not meet these requirements. On the whole, the findings confirm a common trend in the experimental technology.

  6. References

  1. Hawkins, D.M. 1980. Identification of Outliers. Chapman and Hall Publication.

2. Song, X., Wu, M., Jermaine, C., and Ranka, S. 2007. Conditional Anomaly Detection. IEEE Transactions on Knowledge and Data Engineering, 631-645.

  3. Joshi, M. V., Agarwal, R. C., and Kumar, V. 2001. Mining Needle in a Haystack: Classifying Rare Classes via Two-Phase Rule Induction. International Conference on Management of Data. ACM Press 2001, 631-645.

4. Joshi, M. V., Agarwal, R. C., and Kumar, V. 2002. Predicting Rare Classes: Can Boosting Make Any Weak Learner Strong? International Conference on Knowledge Discovery and Data Mining. ACM 2002, 297-306.

5. Theiler, J. and Cai, D. M. 2003. Resampling Approach for Anomaly Detection in Multispectral Images. SPIE. vol. 5093, 230-240.

6. Fujimaki, R., Yairi, T., and Machida, K. 2005. An Approach to Spacecraft Anomaly Detection Problem Using Kernel Feature Space. International Conference on Knowledge Discovery in Data Mining, 401-410.

7. Dasgupta, D. and Nino, F. 2000. A Comparison of Negative and Positive Selection Algorithms in Novel Pattern Detection. IEEE International Conference on Systems, Man, and Cybernetics. vol. 1 (Oct 2000), 125-130.

8. Neill, D. B. and Cooper, G. F. 2010. A Multivariate Bayesian Scan Statistic for Early Event Detection and Characterization. Machine Learning, vol. 79 (June 2010), 261-282.

9. Turner, R., Ghahramani, Z. and Bottone, S. 2010. Fast Online Anomaly Detection Using Scan Statistics. IEEE International Workshop on Machine Learning for Signal Processing, 385-390.

10. Naus, J. I. 1982. Approximations for Distributions of Scan Statistics. Journal of the American Statistical Association, vol. 77, no. 377 (June 1982), 177-183.

  11. Breunig, M. M., Kriegel, H. P., Ng, R. T. and Sander, J. 2000. LOF: Identifying Density Based Local Outliers. ACM SIGMOD Conference, Dallas, TX, 93-104.

12. Lazarevic, A., Ertoz, L., Kumar, V., Ozgur, A. and Srivastava, J. 2003. A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection. Third SIAM International Conference on Data Mining, San Francisco, CA (May 2003), 25-36.

13. Lazarevic, A. and Kumar, V. 2005. Feature Bagging for Outlier Detection. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL (Aug 2005), 157-166.

14. Pokrajac, D., Lazarevic, A. and Latecki, L. J. 2007. Incremental Local Outlier Detection for Data Streams. IEEE Symposium on Computational Intelligence and Data Mining (CIDM), 504-515.

  15. Ahmed, T. 2009. Online Anomaly Detection using KDE. IEEE Global Telecommunications Conference (GLOBECOM), 1-8.

  16. Engel, Y., Mannor, S. and Meir, R. 2004. The Kernel Recursive Least-Squares Algorithm. IEEE Transactions on Signal Processing, vol. 52, no. 8, 2275-2285.

  17. Lee, Y. J., Yeh, Y. R. and Wang, Y. C. F. 2013. Anomaly Detection via Online Oversampling Principal Component Analysis. IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 7, 1460-1470.
