Detecting Outliers from Dataset by using Distributed Methods

DOI : 10.17577/IJERTV3IS050725


Priyanka R. Agwan

M.E. Information Technology (2nd year), AVCOE, Sangamner,

    Suvarna E. Pawar

Head of Department,

    Dept. of Information Technology, AVCOE, Sangamner

Abstract— We introduce a module for a distributed technique for detecting distance-based outliers in very large data sets. Our approach relies on the concept of an outlier detection solving set [5], which is a small subset of the data set that can be effectively used for predicting novel outliers. The method exploits parallel computation in order to obtain large time savings. Indeed, besides preserving the correctness of the result, the proposed schema exhibits excellent performance. From the theoretical point of view, for common settings, the temporal cost of our algorithm is expected to be at least three orders of magnitude smaller than that of the classical nested-loop-like approach to detect outliers. Experimental results show that the algorithm is efficient and that its running time scales quite well with an increasing number of nodes. We also discuss a variant of the basic strategy that reduces the amount of data to be transferred in order to improve both the communication cost and the overall runtime. Significantly, the solving set computed by our approach in a distributed setting has the same quality as that produced by the corresponding centralized method.

Index Terms— Anomaly Detection, Data Mining, Parallel and Distributed Data Mining.

    1. INTRODUCTION

OUTLIER detection is the data mining task whose goal is to isolate the observations that are significantly dissimilar from the remaining data [10]. This task has practical applications in many domains such as fraud detection, intrusion detection, data cleaning, medical diagnosis, and many others. Unsupervised approaches to outlier detection are able to discriminate each datum as normal or exceptional when no training examples are available. Among the unsupervised approaches, distance-based methods distinguish an object as an outlier on the basis of the distances to its nearest neighbors [1], [2], [4], [7]. These approaches differ in the way the distance measure is defined, but in general, given a data set of objects, each object is associated with a weight or score which is, intuitively, a function of its k nearest neighbor distances, quantifying the dissimilarity of the object from its neighbors. In this work, we follow the definition given in [4]: a top-n distance-based outlier in a very large data set is an object having weight not smaller than the nth largest weight, where the weight of a data set object is computed as the sum of the distances from the object to its k nearest neighbors.
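To make this definition concrete, the following minimal Python sketch computes the weight of an object as the sum of the distances to its k nearest neighbors and ranks the top-n outliers with the naive nested-loop baseline that the distributed method improves upon. The function names and the use of Euclidean distance are our own illustrative choices, not code from the paper.

    import heapq
    import math

    def weight(obj, dataset, k, dist=math.dist):
        """Sum of the distances from obj to its k nearest neighbors in dataset."""
        dists = [dist(obj, other) for other in dataset if other is not obj]
        return sum(heapq.nsmallest(k, dists))

    def top_n_outliers(dataset, k, n):
        """Naive O(|D|^2) baseline: the n objects with the largest weights."""
        return heapq.nlargest(n, dataset, key=lambda p: weight(p, dataset, k))

    # Example: the isolated point (5, 5) is the single top outlier.
    points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
    print(top_n_outliers(points, k=2, n=1))   # [(5.0, 5.0)]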

    2. MODULES

As our system detects outliers in a large data set in several stages at various levels, we need to provide a framework that supports the development of the system. In this section we introduce two algorithms: the Distributed Solving Set algorithm and the Lazy Distributed Solving Set algorithm [5].

In this system we target the intrusion detection domain, feeding the standard KDD Cup 1999 [7] data set to the algorithms. The same data set is used with the SVM [8] algorithm, and the results will be compared.

      1. SolvingSet Algorithm

At each iteration (let us denote by j the generic iteration number), the SolvingSet algorithm compares all data set objects with a selected small subset of the data set, called Cj (for candidate objects), and stores their k nearest neighbors with respect to the set C1 ∪ ... ∪ Cj. From these stored neighbors, an upper bound to the true weight of each data set object can thus be obtained [5]. Moreover, since the candidate objects are compared with all the data set objects, their true weights are known. The objects whose weight upper bound is below the nth greatest weight associated with a candidate object are called nonactive (since these objects cannot belong to the top-n outliers), while the others are called active. At the beginning, C1 contains randomly selected objects from D, while, at each subsequent iteration j, Cj is built by selecting, among the active objects of the data set not already inserted in C1, ..., Cj−1 during the previous iterations, the objects having the maximum current weight upper bounds. During the computation, if an object becomes nonactive, it is no longer considered for insertion into the set of candidates, because it cannot be an outlier. As the algorithm processes new objects, more accurate weights are computed and the number of nonactive objects increases. The algorithm stops when no more objects need to be examined, i.e., when all the objects not yet selected as candidates are nonactive, and so Cj becomes empty. The solving set is the union of the sets Cj computed during each iteration [3]. A compact sketch of this iteration scheme follows.
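The following Python sketch mirrors the iteration scheme above under simplifying assumptions of ours (in-memory data, Euclidean distance, no tie handling); it is an illustration of the scheme, not the authors' implementation.

    import heapq
    import math
    import random

    def solving_set(D, k, n, m, dist=math.dist):
        """D: list of points; returns (solving set indices, top-n [(weight, index)])."""
        nn = {i: [] for i in range(len(D))}   # per object: negated k smallest distances seen so far

        def update(i, d):                     # keep only the k smallest distances for object i
            if len(nn[i]) < k:
                heapq.heappush(nn[i], -d)
            elif d < -nn[i][0]:
                heapq.heapreplace(nn[i], -d)

        def ub(i):                            # upper bound on the true weight of object i
            return float("inf") if len(nn[i]) < k else -sum(nn[i])

        remaining = set(range(len(D)))
        C = set(random.sample(sorted(remaining), m))        # C1: random candidates
        DSS, true_w = set(), {}
        while C:
            DSS |= C
            remaining -= C
            for c in C:                       # candidates meet every object, so their weights are exact
                dists = sorted(dist(D[c], D[i]) for i in range(len(D)) if i != c)
                true_w[c] = sum(dists[:k])
            for i in remaining:               # non-candidates only meet the candidates seen so far
                for c in C:
                    update(i, dist(D[i], D[c]))
            top = heapq.nlargest(n, true_w.values())
            min_out = top[-1] if len(top) == n else 0.0     # weight of the current nth outlier
            active = [i for i in remaining if ub(i) >= min_out]
            C = set(heapq.nlargest(m, active, key=ub))      # next candidates: largest upper bounds
        return DSS, heapq.nlargest(n, ((w, c) for c, w in true_w.items()))

Since m >= k, after the first iteration every remaining object has met at least k candidates, so its weight upper bound is finite and nonactive objects can start to be pruned.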

      2. Distributed Solving Set Algorithm

The DistributedSolvingSet algorithm adopts the same strategy as the SolvingSet algorithm. It consists of a main cycle executed by a supervisor node, which iteratively schedules the following two tasks: 1) the core computation, which is simultaneously carried out by all the other nodes; and 2) the synchronization of the partial results returned by each node after completing its job. The computation is driven by the estimate of the outlier weight of each datum and of a global lower bound for the weight, below which points are guaranteed to be non-outliers. The above estimates are iteratively refined by alternately considering local and global information.

Fig. 1. The DistributedSolvingSet algorithm.

It is worth observing that many mining algorithms deal with distributed data sets by computing local models that are aggregated into a general model as a final step at the supervisor node. The DistributedSolvingSet algorithm is different, since it computes the true global model through iterations in which only selected global information and all the local information are involved [5].

The core computation executed at each node consists of the following steps (a sketch in Python follows the list):

1. receiving the current solving set objects along with the current lower bound for the weight of the nth top outlier,

2. comparing them with the local objects,

3. extracting a new set of local candidate objects (the objects with the highest weights, according to the current estimate) together with the list of local nearest neighbors with respect to the solving set and, finally,

4. determining the number of local active objects, that is, the objects having weight not smaller than the current lower bound.
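A per-node sketch of these four steps, under illustrative assumptions of ours (plain lists instead of the paper's heaps, and weight estimates computed from the current candidate set only, whereas the real algorithm accumulates them across iterations):

    import heapq
    import math

    def node_comp(local_data, C, min_out, k, m_i, dist=math.dist):
        """One node's core computation: (1) receive candidates C and min_out,
        (2) compare them with local objects, (3) propose m_i local candidates,
        (4) count local active objects."""
        # 2) local k nearest neighbor distances of every received candidate
        lnnc = {j: heapq.nsmallest(k, [dist(q, p) for p in local_data if p is not q])
                for j, q in enumerate(C)}

        def ub(p):  # weight upper bound from the distances to the current candidates
            return sum(heapq.nsmallest(k, [dist(p, q) for q in C if q is not p]))

        # 4) active objects: weight upper bound not smaller than the current bound
        act_i = sum(1 for p in local_data if ub(p) >= min_out)
        # 3) propose the m_i local objects with the largest current estimates
        lc_i = heapq.nlargest(m_i, local_data, key=ub)
        return lnnc, act_i, lc_i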

Table 1. Variables, data structures, and functions.

The comparison is performed in several distinct cycles, in order to avoid redundant computations. The above information is used in the synchronization step by the supervisor node to generate a new set of global candidates to be used in the following iteration and, for each of them, the true list of distances from its nearest neighbors, in order to compute the new (increased) lower bound for the weight.

The DistributedSolvingSet algorithm is shown in Fig. 1. Table 1 summarizes the variables, data structures, and functions employed by the algorithm. The algorithm receives in input the number l of local nodes, the values di representing the sizes of the local data sets Di, a distance function dist on the objects in D, the number k of neighbors to consider for the weight calculation, the number n of top outliers to find, and an integer m >= k, representing the number of objects to be added to the solving set at each iteration. It outputs the solving set DSS and the heap OUT containing the top-n outliers in D. At the beginning of the execution of the algorithm, DSS and OUT are initialized to the empty set (lines 1-2), while the set of candidates C is initialized by picking at random m objects from the whole data set D (lines 3-6; refer to the procedure NodeInit for details). The main cycle (lines 9-22) stops when the set C becomes empty. The points currently belonging to C are added to the solving set DSS (line 10). At the beginning of each iteration, the set of candidates C is sent to the procedures NodeCompi running at each local node (the instance NodeCompi runs at node Ni, for i = 1, 2, ..., l), together with the value minOUT representing a lower bound to the weight of the nth top outlier, and the total number act of active objects. Each NodeCompi procedure returns:

• the data structure LNNCi containing the k distances to the nearest neighbors, within the local data set Di, of the candidate objects in C;

• the updated number acti of active objects in the local data set Di;

• the data structure LCi containing mi objects coming from the local data set Di, to be used to build the set of candidates C for the next iteration. The number mi is proportional to the share of active objects in Di, and is defined as mi = m · acti/act (note that when the structures LCi are returned to the supervisor node by the local nodes, these data structures no longer include the weights associated with the objects therein stored; see Table 1 for details).

After having collected all the results of the procedures NodeCompi, the true weights associated with the candidate objects in the set C can be computed (lines 14-17). The k nearest neighbor distances, in the whole data set, of each candidate object q are obtained from the distances stored in the data structures LNNCi[q] (line 15); indeed, the k smallest distances in the union of all the LNNCi[q] sets represent the distances separating q from its k nearest neighbors in the global data set. Then, the heap OUT, containing the current top-n outliers, is updated (line 16). To conclude the description of the main iteration of the algorithm, the lower bound minOUT to the weight of the nth top outlier is updated (line 18), and the new set of candidate objects C is built (lines 19-21). We next give details of the procedures NodeInit and NodeComp [9].
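A sketch of this synchronization step: merging the per-node lists and keeping the k smallest distances yields each candidate's exact weight, from which the top-n heap and minOUT are refreshed. The structure names echo the text (LNNCi, OUT, minOUT), but the code is our simplified reconstruction, not the paper's.

    import heapq

    def synchronize(lnnc_per_node, k, n, out_heap):
        """lnnc_per_node: one dict per node mapping candidate -> its local
        k-NN distances. out_heap: min-heap of (weight, candidate), size <= n."""
        for q in lnnc_per_node[0]:
            merged = [d for lnnc in lnnc_per_node for d in lnnc[q]]
            true_weight = sum(heapq.nsmallest(k, merged))  # k global nearest neighbors
            if len(out_heap) < n:
                heapq.heappush(out_heap, (true_weight, q))
            elif true_weight > out_heap[0][0]:
                heapq.heapreplace(out_heap, (true_weight, q))
        min_out = out_heap[0][0] if len(out_heap) == n else 0.0
        return min_out  # new lower bound, broadcast with the next candidate set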

NodeInit procedure – The procedure NodeIniti (see Fig. 2, lines 1-3) runs at local node Ni. It receives in input an integer value mi and returns a randomly chosen set Ci of mi data points belonging to the local data set. The variable acti, which is the number of active data points in the local data set, is set to the local data set size. Finally, both the variable acti and the set Ci are stored in the local node memory.

Fig. 2. The procedures employed in the DistributedSolvingSet algorithm.

NodeComp procedure – The procedure NodeCompi, shown in Fig. 2, runs at local node Ni. First of all, the value acti and the set of local candidates Ci (computed either by NodeIniti or during the previous execution of NodeCompi) are retrieved from the local memory (line 4). Then, the objects in Ci are removed from the local data set (line 5) and the number acti of local active objects is updated (line 6). Before starting the comparison of the local objects with the current candidate objects, the heap LCi is initialized by means of the procedure init in order to accommodate mi objects (line 7). Moreover, the heaps LNNCi[p] associated with the local candidate objects p are initialized to the corresponding heaps NNi (lines 8-9). The heaps LNNCi[p], for p not in Ci, are initially empty. Thus, only the local node that generated the candidate object p is aware of the nearest neighbor distances of p with respect to the previous sets of candidates (distances that are actually stored in the heap NNi[p] kept on the local node of the candidate object p). By adopting this strategy, we are able to achieve communication savings. The supervisor node will then take care of selecting the true nearest neighbor distances of p among the distances stored in all the heaps LNNCi[p] (i = 1, ..., l). At this point, the weights of the objects in Ci are computed by comparing each object in Ci with each object in the local data set. This operation is split into three steps (corresponding to three different nested loops) in order to avoid duplicate distance computations (lines 10-33). Thus, the first double cycle (lines 10-15) compares each object in Ci with all other objects in Ci and updates the associated heaps. The second double cycle (lines 16-20) compares the objects of Ci with the other objects of C. Finally, the third double cycle (lines 21-33) compares the objects of Di with the objects of C. In particular, the objects p and q, with p ∈ Di and q ∈ C, are compared only if at least one of the two objects could be an outlier (that is, if the maximum between their weight upper bounds Sum(NNi[p]) and Sum(LNNCi[q]) is greater than the lower bound minOUT). During the last double cycle, if the weight upper bound of the local object p is not smaller than minOUT, then the number acti of local active objects is increased by one (note that acti is set to zero at the beginning of the third double cycle; see line 21) and the heap LCi is updated with p (lines 29-32). Finally, Ci is populated with the objects in LCi (line 34) and both acti and Ci are stored in local memory (line 35) in order to be exploited during the next call of NodeCompi.
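In isolation, the pruning rule of the third double cycle reduces to the following test (a sketch; sum_nn_p and sum_lnnc_q are our names for the weight upper bounds Sum(NNi[p]) and Sum(LNNCi[q]) of the text):

    def must_compare(sum_nn_p, sum_lnnc_q, min_out):
        """p and q are compared only if at least one of them could still be an
        outlier, i.e., if the larger of their weight upper bounds reaches min_out."""
        return max(sum_nn_p, sum_lnnc_q) >= min_out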

      3. LazyDistributedSolvingSet Algorithm

From the analysis carried out in the preceding section, it follows that the total amount TD of data transferred increases linearly with the number l of nodes used. Although in some scenarios the linear dependence on l of the amount of data transferred may have little impact on the execution time, on the speedup of the method and, also, on the communication channel load, this kind of dependence is in general undesirable, since in other scenarios relative performance may noticeably deteriorate as the number of nodes increases [6]. In order to remove this dependency, we describe in this section a variant of the basic Distributed Solving Set algorithm previously introduced. The variant, named the LazyDistributedSolvingSet algorithm, employs a more sophisticated strategy that results in the transmission of a reduced number of distances for each node, say kd, thus replacing the term lk in the expression TD of the data transferred with the smaller term lkd, such that lkd is O(k). This strategy, thus, mitigates the dependency on l of the amount of data transferred, so that the relative amount of data transferred is approximated by TD% = qk/a.
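As a back-of-the-envelope illustration with numbers of our own choosing, the snippet below contrasts the distances shipped per candidate in the first round by the basic algorithm (every node sends k distances) against the lazy variant (every node sends only k0 = ⌊k/l⌋ + 1, with further small batches requested only on demand):

    l, k = 10, 50     # nodes and neighbors (illustrative values)
    basic = l * k     # DistributedSolvingSet: 500 distances per candidate
    k0 = k // l + 1   # lazy variant first batch size: floor(k/l) + 1 = 6
    lazy = l * k0     # 60 distances; note that l * k0 > k still holds
    print(basic, lazy)  # 500 60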

Moreover, the first term in (2), representing the temporal cost pertaining to the supervisor node, is replaced in LazyDistributedSolvingSet by O(q|D|(k log k + log n)), thus relieving the temporal cost from the direct dependency on the parameter l.

Fig. 3. The LazyDistributedSolvingSet algorithm.

Fig. 3 reports the LazyDistributedSolvingSet algorithm. This algorithm differs from the preceding one in the policy adopted to collect the nearest neighbor distances of each candidate object q computed by each node. With this aim, an incremental procedure is pursued, accomplished over several iterations. At each iteration, the supervisor node collects the additional distances, puts them together with the previously received ones, and checks whether further distances are required in order to determine the true weights associated with the candidate objects. If this is the case, an additional iteration is performed; otherwise, the incremental procedure stops [1].

Fig. 4. The procedures employed in the LazyDistributedSolvingSet algorithm.

First of all, the procedure NodeCompi has to be modified (see Fig. 4 and the associated description reported next) so that it receives in input the additional parameter k0 < k, representing the number of smallest nearest neighbor distances to be returned for each candidate object. Thus, the data structures LNNCi returned by the procedure NodeCompi (see line 12 of Fig. 3) include only the k0 = ⌊k/l⌋ + 1 smallest distances to the nearest neighbors, in the local data set Di, of the candidate objects in C. Lines 14-28 implement the incremental strategy depicted above. For each candidate object q, the entry NNC[q] containing its k nearest neighbor distances is updated with the distances stored in the entries LNNCi[q] sent by the local nodes during the last iteration (line 16). Note that during the first iteration, the total number of distances sent by the nodes is lk0 > k. If all the distances stored in the entry NNC[q] are smaller than the greatest distances last(LNNCi[q]) stored in the entries LNNCi[q], for each i = 1, ..., l, then NNC[q] stores the true k nearest neighbor distances of q in the whole data set. Indeed, since nodes send distances in increasing order, the distances not already sent by the generic node Ni are greater than last(LNNCi[q]). Otherwise, consider the smallest of these distances, representing the best worst-case distance, distmin = mini{last(LNNCi[q])}. Then, distmin occurs in the data structure NNC[q] in some position, say the jth one. This check is accomplished in line 19 of the algorithm. In such a case, the first j distances stored in NNC[q] are exactly those separating q from its j first nearest neighbors, while the remaining k − j distances stored in NNC[q] represent an upper bound to the true distances separating q from its (j+1)th to kth nearest neighbors. The number k − j of unknown distances occurring in NNC[q] is then stored in the entry u_NNC[q] of the array u_NNC (see line 20). Moreover, the upper bound NNC[q][k] to the distance from q to its kth nearest neighbor is stored in the entry cur_last[q] of the array cur_last (see line 21). The entry nodes[q] of the array nodes stores the identifiers of the nodes that might provide at least one of the true nearest neighbor distances still unknown, that is, the nodes Ni such that the greatest distance last(LNNCi[q]) sent in the last iteration occurs in NNC[q] (see line 22). After having processed all the distances received from the local nodes, for each candidate q and for each node stored in the entry nodes[q], the master requests an additional bunch of distances by means of the procedure NodeReq (see lines 25-26). However, if all the entries nodes[q] are empty, the true nearest neighbors of the candidate objects have been collected and the iterations terminate (line 28).
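The supervisor's per-candidate check can be sketched as follows (our reconstruction with illustrative names: nnc_q is NNC[q] sorted in ascending order, and last_sent[i] is last(LNNCi[q])):

    import bisect

    def check_candidate(nnc_q, last_sent):
        """Return (done, unknown_count, nodes_to_ask) for one candidate q."""
        dist_min = min(last_sent)                 # best worst-case distance distmin
        if nnc_q[-1] <= dist_min:                 # every node's unsent distances are larger
            return True, 0, []                    # true k-NN distances already collected
        j = bisect.bisect_right(nnc_q, dist_min)  # the first j entries are exact
        unknown = len(nnc_q) - j                  # remaining entries are only upper bounds
        cur_last = nnc_q[-1]                      # upper bound to the kth NN distance
        nodes = [i for i, last in enumerate(last_sent) if last < cur_last]
        return False, unknown, nodes              # ask these nodes for more distances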

NodeReq procedure – The procedure NodeReq is shown in Fig. 4, which also reports the modifications to the procedure NodeCompi. As for the latter procedure, lines 33-34 of Fig. 2 (the last two lines) are replaced by lines 33-36 of Fig. 4, while the rest of the procedure remains unchanged. The procedure NodeCompi sorts the k nearest neighbor distances of the candidate objects (line 33), inserts into the data structure rLNNCi the k − k0 greatest distances in LNNCi (line 34), to be possibly returned in subsequent calls of the procedure NodeReq, leaves in LNNCi only the k0 smallest distances contained therein (line 35), and stores in the local memory the number of active objects, the candidates coming from the local node, and the data structures rLNNCi (line 36). For each candidate object q such that this node is stored in the associated entry nodes[q] (line 38), the procedure NodeReq copies into the entry LNNCi[q] another bunch of distances by executing the function get_f (line 39). In particular, only the distances smaller than cur_last[q], the upper bound to the kth nearest neighbor distance of q, are returned. The distances included in LNNCi[q] are then removed from rLNNCi[q] by executing the function get_l (line 40) and stored in the local memory (line 42). This terminates the description of the procedure NodeReqi.
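On the node side, the reply assembled by NodeReq can be sketched like this (our illustrative reconstruction of the get_f/get_l pair; r_lnnc_q holds the reserved distances for candidate q, sorted in ascending order):

    def node_req(r_lnnc_q, cur_last_q, batch):
        """Return the next batch of reserved distances below cur_last_q (get_f)
        and remove them from the reserve (get_l)."""
        send = [d for d in r_lnnc_q[:batch] if d < cur_last_q]
        del r_lnnc_q[:len(send)]  # the kept distances form a prefix (list is sorted)
        return send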

    3. CONCLUSION

In this paper we implement modules for detecting outliers from a data set by using distributed methods, namely the DistributedSolvingSet and LazyDistributedSolvingSet algorithms. This approach is efficient because it exploits parallel and distributed data mining, and hence computation time is saved. The DistributedSolvingSet algorithm runs on each local node; the results of the local nodes are collected and given to the supervisor node, where the synchronization happens and the top-10 outliers are computed. Experimental results will be reported in a follow-up paper.

REFERENCES

1. F. Angiulli, S. Basta, S. Lodi, and C. Sartori, "Distributed Strategies for Mining Outliers in Large Data Sets," IEEE Transactions on Knowledge and Data Engineering, 2013.

2. A. Koufakou and M. Georgiopoulos, "A Fast Outlier Detection Strategy for Distributed High-Dimensional Data Sets with Mixed Attributes," Data Mining and Knowledge Discovery, 2009.

3. M.J. Zaki and C.-T. Ho, Large-Scale Parallel Data Mining, vol. 1759 of LNCS, Springer, 2000.

4. M.J. Zaki and C.-T. Ho, Large-Scale Parallel Data Mining, vol. 1759 of LNCS, Springer, 2000.

5. F. Angiulli, S. Basta, and C. Pizzuti, "Distance-Based Detection and Prediction of Outliers," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 2, pp. 145-160, Feb. 2006.

6. S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient Algorithms for Mining Outliers from Large Data Sets," in SIGMOD, pp. 427-438, 2000.

  7. A. Asuncion and D. Newman, UCI Machine Learning Repository, 2007.

8. S.V.M. Vishwanathan and M. Narasimha Murty, "SSVM: A Simple SVM Algorithm."

  9. E. Hung and D.W. Cheung, Parallel Mining of Outliers in Large Database, Distributed and Parallel Databases, vol. 12, no. 1, pp. 5-26, 2002.

10. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.


Ms. Agwan Priyanka R. received her B.E. in Computer Engineering from Pune University in 2012. She is currently pursuing her M.E. in Information Technology at the University of Pune.

Prof. Pawar Suvarna is the Head of the Department of Information Technology, Amrutvahini College of Engineering, Sangamner. Her research interests include databases, information systems, and data mining.
