An Overview of Efficient Data Mining Techniques

DOI : 10.17577/IJERTV3IS090727




Sandeep Dhawan

Director of Technology IT Department OTTE New York

USA

Abstract:- Data mining is the process of discovering associations within huge data sets: finding data patterns, anomalies, changes and significant statistical structures in the data. Conventional data analysis techniques involve formulating a hypothesis and then validating it against the dataset. Data mining techniques, on the other hand, automatically detect significant patterns in the data, and these patterns can be used to formulate algorithms. An important consideration in mining huge data sets is that the result or the pattern identified should be valid, understandable, useful and novel [1]. It goes without saying that data warehousing and the maintenance of large databases also rely principally on the efficiency of robust, intelligent and at times novel data mining techniques. Today data mining techniques are employed in nearly every sector of industry. From the music and film industries to medicine and sports, there is hardly any field of life without an input from and integration of these data mining techniques. This paper focuses on presenting an overview of some of the most commonly used data mining techniques along with their applications. The techniques presented in this paper include sequence mining, clustering, classification, K nearest neighbors and association rule mining. Additionally, there is a sample example in each case to help understand the basic working of each technique. Underlying branches, algorithms and processes for each of these techniques are also given. Pseudo code for the algorithms is mentioned where required, together with the relevant graphs, to ensure the reader's understanding. The paper also gives a brief overview of some of the pre- and post-processing data mining techniques.

Key Words: Data mining, KNN, Clustering, Classification, Association rule mining, Sequence mining.

  1. DATA MINING PROCESS

Data mining is a modular process and it is completed in multiple stages. This section throws light upon the stages involved in carrying out data mining research on a dataset [2][3].

      1. Explore Data Domain

Before delving into algorithm design and the data accumulation process, a profound understanding of the application domain in which the research is to be carried out must be developed. Having a grasp of the application domain helps in accumulating better data sets and in deciding which data mining technique should be applied to achieve the expected results.

      2. Collect Data

All data mining algorithms are implemented on some data set. Therefore a dataset sufficient to satisfy all the requirements of the algorithm being implemented must be gathered.

      3. Refine and Transform the data

Datasets contain noise, outliers, missing values and other inconsistencies which need to be removed before the data can be further processed for analysis and pattern extraction. A study of swine MHC data records [4] examined 173 records and found that 36 of them were faulty or erroneous. It is therefore extremely important to refine and transform the data [5].

      4. Feature Selection

A dataset might consist of thousands of features; however, for a particular problem, only a handful of features are required. Therefore, after a refined dataset has been obtained which is not contaminated by inconsistencies and noise, relevant features are selected for further processing. Feature selection is done via techniques like principal component analysis [7], the Wilcoxon rank sum test [6], entropy analysis and the Fisher criterion [8].

      5. Apply Relevant Algorithm

After the data has been acquired and cleaned and features have been selected, an algorithm is selected that will process the data and produce results [1]. Some of the most commonly used algorithms are clustering, association rule mining, decision trees, sequence mining, etc. The details of these algorithms are explained in the later sections.

      6. Observe, Analyze and Evaluate

The final stage of the data mining process is to observe, evaluate and find patterns in the results produced by the algorithm. The final conclusion is made on the basis of these evaluations [1].

  2. DATA MINING TECHNIQUES

There are many data mining techniques currently in use by data scientists, and they differ in their efficiency, precision and the type of data upon which they operate [2]. Here, however, we will only discuss five of them: Association rule mining, Classification, Clustering, KNN and Sequence mining.

  3. ASSOCIATION RULE MINING [11][12]

Proposed by Agrawal et al. in 1993 [11], association rule mining is today extensively used by the data mining fraternity. It assumes the data to be categorical, and hence is not a popular option for numerical data analysis. It was initially used in Market Basket Analysis [12], an analysis to find relations between the items customers shop for. In the explanation below, we will use the same example along with a detailed description of association rule mining.

      1. Example Case: Market Basket Analysis.

        1. Data:

Consider Items I = {i1, i2, ..., ix}: a set of articles/items in the store, e.g. butter.

Transaction t: t is a set of items bought by a particular person, with t ⊆ I.

Transaction Database T: a set of transactions from the store's database, T = {t1, t2, ..., tn}.

Hence a transaction would be something like:

t1: {bread, butter, milk}
t2: {apple, eggs, salt, sugar}
...
tn: {biscuit, eggs, milk}

        2. Rules:

1. A transaction t from the transaction database T contains X, a set of items (itemset) in I, if X ⊆ t.

2. An association rule is a simple implication of the form X → Y, where X, Y ⊆ I and X ∩ Y = ∅.

3. An itemset is a set of items, e.g. X = {milk, butter, cereal} is an itemset.

4. A k-itemset is an itemset with k items, e.g. {milk, bread, cereal} is a 3-itemset and {eggs, butter} is a 2-itemset.

        3. Measuring Standards:

Support (sup): the rule holds with support sup in T (the transaction data set) if sup% of transactions contain X ∪ Y,

i.e. sup = Pr(X ∪ Y).

Confidence (conf): the rule holds in T (the transaction data set) with confidence conf if conf% of the transactions that contain X also contain Y,

i.e. conf = Pr(Y | X).

Hence an association rule is a simple pattern that states that whenever X occurs, Y occurs with a certain probability.

        4. Measuring Support & Confidence:

Support count: X.count, the support count of an itemset X in a dataset T, is the number of transactions in T that contain X. Assuming there are n transactions in T, then:

support = (X ∪ Y).count / n

Confidence count: the confidence of a rule X → Y is the number of transactions in T that contain X ∪ Y divided by the number of transactions that contain X, i.e.

confidence = (X ∪ Y).count / X.count

Table -1: Example of Support Measure.

TID    Items
1      Chicken, Milk
2      Beef, Cheese
3      Chicken, Clothes, Cheese, Milk
4      Beef, Chicken, Cheese
5      Chicken, Milk

Support = Occurrence / Total transactions, with total transactions = 5:
Support[Beef, Cheese] = 2/5 = 40%
Support[Chicken, Milk] = 3/5 = 60%
Support[Beef, Chicken, Cheese] = 1/5 = 20%

Table -2: Example of Confidence Measure.

TID    Items
1      Apple, Butter, Cheese
2      Apple, Butter, Detergents
3      Butter, Cheese
4      Apple, Cheese
5      Butter, Cheese, Detergents

Given a rule X → Y: Confidence = Occurrence[X ∪ Y] / Occurrence[X]:
Confidence{Apple → Butter} = 2/3 ≈ 66%
Confidence{Butter → Cheese} = 3/4 = 75%
Confidence{Apple, Butter → Cheese} = 1/2 = 50%

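The two measures can be checked mechanically. The short Python sketch below recomputes the figures from Tables 1 and 2; the helper names (support, confidence) are our own and not from any particular library.

# Minimal sketch of the support and confidence measures from Tables 1 and 2.

def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, X, Y):
    """Confidence of the rule X -> Y: support(X u Y) / support(X)."""
    return support(transactions, set(X) | set(Y)) / support(transactions, X)

# Table 1 data
T1 = [{"Chicken", "Milk"}, {"Beef", "Cheese"},
      {"Chicken", "Clothes", "Cheese", "Milk"},
      {"Beef", "Chicken", "Cheese"}, {"Chicken", "Milk"}]
print(support(T1, {"Chicken", "Milk"}))                  # 0.6  (3/5)
print(support(T1, {"Beef", "Cheese"}))                   # 0.4  (2/5)

# Table 2 data
T2 = [{"Apple", "Butter", "Cheese"}, {"Apple", "Butter", "Detergents"},
      {"Butter", "Cheese"}, {"Apple", "Cheese"},
      {"Butter", "Cheese", "Detergents"}]
print(confidence(T2, {"Apple"}, {"Butter"}))             # 0.666... (2/3)
print(confidence(T2, {"Butter"}, {"Cheese"}))            # 0.75     (3/4)
print(confidence(T2, {"Apple", "Butter"}, {"Cheese"}))   # 0.5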
    3.2. Summary Association Rule Mining:

There are currently many algorithms that implement association-based mining; however, the Apriori algorithm (a two-step iterative process) is the one most widely associated with association mining [12]. Another popular algorithm is the DLG algorithm. It is important to note here that the space of all association rules is exponential, O(2^n), where n is the number of items in I. Additionally, associative mining exploits sparseness of the data, and high minimum support (minsup) and high minimum confidence (minconf) values [12]. A rough sketch of the Apriori idea is given below.

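As an illustration of the level-wise idea behind Apriori, the following sketch generates frequent itemsets for the Table 1 transactions. It is a simplified sketch (candidate join without the full pruning step and without rule generation), not the complete algorithm; all names are our own.

# Simplified level-wise Apriori-style sketch for frequent itemsets.

def apriori(transactions, minsup):
    """Return all itemsets whose support count is >= minsup (absolute count)."""
    transactions = [set(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    # Level 1: frequent single items
    current = [frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= minsup]
    frequent = list(current)
    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets, then prune by support
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = [c for c in candidates
                   if sum(c <= t for t in transactions) >= minsup]
        frequent += current
        k += 1
    return frequent

T1 = [{"Chicken", "Milk"}, {"Beef", "Cheese"},
      {"Chicken", "Clothes", "Cheese", "Milk"},
      {"Beef", "Chicken", "Cheese"}, {"Chicken", "Milk"}]
print(apriori(T1, minsup=2))   # e.g. {Chicken}, {Milk}, {Cheese}, {Chicken, Milk}, {Beef, Cheese}, ...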
  4. SEQUENCE MINING[13][18]

Sequence mining is closely related to association rule mining [2]. However, the principal difference between the two lies in the input dataset. In association mining each row in the input dataset acts as a single data sample (as in the transactions t1, t2, ..., tn above), whereas in sequence mining a data sample (a sequence) is split across multiple consecutive rows in the input data, each row representing only one event, based on some predefined criteria. Hence each event acts as a transaction.

    Sequential mining techniques are used for medical treatments, natural disasters, stocks and markets, DNA sequencing and gene structures [13].

      1. Related Terms:

        1. Subsequence vs. Super sequence:

Generally, a sequence is an ordered list of events, denoted <e1 e2 ... el>. Given two sequences A = <a1 a2 ... an> and B = <b1 b2 ... bm>, A is called a subsequence of B, denoted A ⊑ B, if there exist integers 1 ≤ j1 < j2 < ... < jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, ..., an ⊆ bjn.

In that case, B is a super sequence of A.

E.g. for A = <(ab), d> and B = <(abc), (de)>, A is a subsequence of B.

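A small sketch of this subsequence test, with sequences represented as lists of itemsets, is given below; the function name is our own.

# Checks the subsequence relation defined above; a sequence is a list of itemsets.

def is_subsequence(A, B):
    """True if sequence A is a subsequence of sequence B: each element of A is a
    subset of a strictly later element of B, preserving order."""
    j = 0
    for a in A:
        # advance through B until an element containing `a` is found
        while j < len(B) and not set(a) <= set(B[j]):
            j += 1
        if j == len(B):
            return False
        j += 1  # the next element of A must match a later element of B
    return True

A = [{"a", "b"}, {"d"}]            # <(ab), d>
B = [{"a", "b", "c"}, {"d", "e"}]  # <(abc), (de)>
print(is_subsequence(A, B))        # True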
      2. Sequential Pattern Mining(Sample Case):

        Given a set of sequences (Activities) with a support threshold, we can find the complete set of frequent subsequences:

Sequence Database: Given a dataset, a sequence such as <(ef)(ab)(df)cb> is an ordered list of elements, where an element may contain a set of items; items within an element are unordered and we list them alphabetically. Hence <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.

Given the support threshold min_sup = 2, <(ab)c> is a sequential pattern, as the sample database below confirms.

SID    Sequence
10     <(ad)c(bc)(ae)>
20     <a(abc)(ac)d(cf)>
30     <eg(af)cbc>
40     <(ef)(ab)(df)cb>

Table -3: Sample Sequence Data.[13]

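The support of <(ab)c> in this database can be counted directly with the subsequence test sketched earlier; the snippet below (illustrative only, names are our own) confirms a support of 2.

# Counting the support of the pattern <(ab)c> in Table 3's sequence database.

def is_subsequence(A, B):
    j = 0
    for a in A:
        while j < len(B) and not set(a) <= set(B[j]):
            j += 1
        if j == len(B):
            return False
        j += 1
    return True

seq_db = {
    10: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    20: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    30: [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
    40: [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
}
pattern = [{"a", "b"}, {"c"}]      # <(ab)c>
sup = sum(is_subsequence(pattern, s) for s in seq_db.values())
print(sup)   # 2 -> <(ab)c> is frequent for min_sup = 2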
      3. Common methods of Sequence based mining:

        1. Apriori-based Approaches: GSP, SPADE etc.

        2. Pattern-Growth-based Approaches: FreeSpan, PrefixSpan etc.

Graph -1: Performance Comparison of Approaches on Data Set C10T8S8I8 [13].

Graph -2: Performance Comparison of Approaches on Data Set Gazelle [13].

Here we will only discuss GSP, given its high performance, evident from the charts above.

      4. GSP:

Also called Generalized Sequential Pattern mining, GSP is an Apriori-based data mining algorithm. It works as follows:

Pseudo Code:

Initially, every item in the database is a candidate sequence of length 1.

for each level (i.e., sequences of length k) do
    scan the database and collect the support count for every candidate sequence
    generate candidate length-(k+1) sequences from the frequent length-k sequences using the Apriori property
repeat until no frequent sequences or no candidates can be found.

A simplified illustrative sketch is given after the example below.

        Example:

        Example-1: Sequencing using GSP.[18]

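The sketch below is a heavily simplified, GSP-style level-wise loop. To stay short it assumes every element is a single item (no itemsets inside elements) and grows candidates by appending one item, rather than using GSP's full join-and-prune step, so treat it as an illustration of the idea only; all names are our own.

# Simplified GSP-style level-wise sketch for sequences of single items.

def contains(seq, pat):
    """True if `pat` occurs in `seq` as an ordered (possibly gapped) subsequence."""
    i = 0
    for item in seq:
        if i < len(pat) and item == pat[i]:
            i += 1
    return i == len(pat)

def support(seq_db, pat):
    return sum(contains(s, pat) for s in seq_db)

def simple_gsp(seq_db, min_sup):
    items = sorted({x for s in seq_db for x in s})
    level = [(i,) for i in items if support(seq_db, (i,)) >= min_sup]  # length-1
    frequent = list(level)
    while level:
        # candidate length-(k+1) sequences from frequent length-k sequences
        candidates = {p + (i,) for p in level for i in items}
        level = [c for c in candidates if support(seq_db, c) >= min_sup]
        frequent += level
    return frequent

db = ["abcd", "acbd", "abd", "cdb"]   # each string is one sequence of single items
print(simple_gsp(db, min_sup=3))      # e.g. ('a',), ('b',), ('a','b'), ('b','d'), ('a','b','d'), ...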
      5. Constraints on Sequencing:

Sequential data mining faces the following constraints:

1. Item constraint (e.g. web log patterns only about e-stores)

2. Length constraint (e.g. patterns having at least 40 cloth items)

3. Super-pattern constraint (e.g. super-patterns of a Canon digital camera)

4. Aggregate constraint (e.g. patterns such that the avg. price of items is below $100)

  5. CLASSIFICATION[10][14][16]

Classification-based data mining exists as the backbone of machine learning in artificial intelligence. Classification generally consists of assigning a class label to a set of non-labeled cases [16]. The process of assigning a class or a label to unspecified data can be achieved in either of two ways: supervised classification and unsupervised classification [10].

Supervised Classification: Given labeled data, predict the output, i.e. the set of possible classes is known/given via the sample data.

Unsupervised Classification: Unlike supervised classification, the sample data for unsupervised classification lacks predefined class labels. Hence it is the user's responsibility to explore the data to find some intrinsic structures (common features for classification) in them. Clustering is the most widely used unsupervised classification technique; KNN and neural networks are others.

    Diagram-1: Process Flow of Supervised vs. Unsupervised Classification [10]

      1. Supervised Classification

        1. Process:

1. The given sample data, also called the training set, consists of multiple entries, each with multiple characteristics or features.

          2. Each record is given a class label.

          3. The purpose of classification is to analyze the sample data and to develop an accurate understanding or model for each class using the attributes present in the data.

          4. This model is then used to classify/label test data.

            Diagram -2: Linear Overview of steps involved in Supervised Classification.[16]

        2. Example:

We are given here an excerpt from a university database as training data, with attributes including the teacher's name, rank and years of experience, and the class label "tenured" with classes yes or no. Our purpose here is to analyze this data, learn patterns (using the rows) and determine the class for test data.

Test Data: IF rank = professor OR years > 6 THEN tenured = ? Looking at the sample data, tenured would be yes (see the small sketch after Table-4).

          Table-4: Sample Label Data of Professors.[16]

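Since Table-4 itself does not survive in this text version, the sketch below uses hypothetical records in its spirit and simply encodes the rule stated above; it illustrates rule-based supervised classification rather than the paper's exact data.

# Hypothetical stand-in for Table-4; the classifier encodes the learned rule:
# IF rank = professor OR years > 6 THEN tenured = yes.

def predict_tenured(record):
    return "yes" if record["rank"] == "professor" or record["years"] > 6 else "no"

training = [  # hypothetical labeled examples in the spirit of Table-4
    {"name": "Mike", "rank": "assistant prof", "years": 3, "tenured": "no"},
    {"name": "Mary", "rank": "assistant prof", "years": 7, "tenured": "yes"},
    {"name": "Bill", "rank": "professor",      "years": 2, "tenured": "yes"},
]
test = {"name": "Jeff", "rank": "professor", "years": 4}
print(predict_tenured(test))   # yes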
      2. Common Techniques of Supervised Classification: Bayesian Classification, Naïve Bayesian Classification, Robust Bayesian Classifier, Decision tree learning etc.

      3. Unsupervised Classification:

In unsupervised classification, the objective is to group together sparse, visibly different response patterns into multiple clusters that are statistically separable. Thus, given 50 data points, a small range of digital numbers (DNs), let us say 5 bands, can form one cluster that is different from another cluster of, say, 10 bands, and so on. The separation or distance between clusters will depend on the parameters (features) we choose to differentiate them [10].

      1. Common Techniques of Unsupervised Classification: Clustering, K-Means Classification, KNN Classification, Decision tree learning etc.

        Here we will discuss K-Means classification only.

      2. K-Means Classification:

It is a non-hierarchical unsupervised classification technique in which each instance is placed in exactly one of K non-overlapping clusters [16]. Since only one set of clusters is output, it is up to the user to input the desired number of clusters, K, beforehand.

        1. Algorithm K-Means:

1. Decide the value of K.

2. Initialize the K cluster centers (randomly, if necessary).

3. Given the N records/data points, decide the class memberships of the N objects by assigning them to the nearest cluster center.

4. Re-calculate the K cluster centers, assuming that the memberships determined above are correct.

5. If none of the N objects changed membership in the last iteration, exit. Otherwise go to step 3.

Example: In the graph below, raw data for students of Grade IX is given; based on K = 3 (where a diamond in a little blue circle marks a cluster center), three clusters are formed. A minimal code sketch of these steps follows the graph.

Graph -3: Clustering using K = 3 [14].

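A minimal K-Means sketch following steps 1-5 above is given below; it uses plain NumPy, random initialization and Euclidean assignment, and the data points are arbitrary toy values.

# Minimal K-Means sketch (steps 1-5 above); toy data, illustrative only.
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step 2: initialize centers
    labels = np.zeros(len(X), dtype=int)
    for it in range(max_iter):
        # step 3: assign every point to its nearest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # step 5: stop when no point changed membership
        if it > 0 and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # step 4: re-calculate the k cluster centers
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
labels, centers = k_means(X, k=3)   # K decided beforehand (step 1)
print(labels)
print(centers)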
            More on K-Means:

1. Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters and t the number of iterations; normally k, t << n.

2. Often terminates at a local optimum.

3. Applicable only when a mean is defined, i.e. not very effective for categorical data.

4. K, the number of clusters, also has to be defined at the start.

5. Struggles with noisy data and outliers; hence other means should be used to minimize noise before using K-Means.

6. The value of K directly affects the objective function; hence an optimum value of K must be used.

Graph -4: Objective function for different values of K [17].

The objective function is directly related to the squared error:

J = Σ_{k=1..K} Σ_{i=1..m} || x_i^(k) − c_k ||²

where m is the number of data points in one cluster, K is the total number of clusters, and c_k is the center of cluster k.

  6. CLUSTERING[9][17]

Clustering, an example of unsupervised learning, is defined as the grouping of similar data points from raw, unlabeled data based on some common features. This is often achieved by comparing the similarity of items in the sample data. Clustering is at times considered largely subjective.

Hard vs. Soft Clustering:

Clustering is considered hard if a single data item can belong to only one cluster, whereas in soft clustering a single data point is allowed to be part of two (or more) clusters simultaneously [9].

Examples:

Hard Clustering: K-Means clustering, K-Means ++

Soft Clustering: Gaussian mixture model

    1. Types of Clustering:

Partitional Algorithms: Construct various partitions and then evaluate them by some criterion (K-Means, K-Means++).

Hierarchical Algorithms: Create a hierarchical decomposition of the set of objects using some criterion (agglomerative, single link).

Here we will discuss both Agglomerative (also called bottom-up) clustering and Divisive (also called top-down) clustering from the hierarchical family:

    2. Agglomerative Clustering:

Starting with each item in its own cluster, the purpose is to find the best pair of clusters to merge into a new cluster at each level, repeating until all clusters are fused together.

      Diagram -3: Step Wise merging of data points using Agglomerative Clustering[17].

3. Divisive Clustering (Top-Down Clustering): Starting with all data points in one cluster, the cluster is then split up at each step using a flat clustering algorithm.

The process is applied recursively until every data point is in its own singleton cluster.

Diagram -4: Step-wise splitting of data points using Divisive (Top-Down) Clustering [17].

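As a brief illustration of the agglomerative (bottom-up) variant described above, the snippet below uses SciPy's hierarchical clustering utilities on a few toy points; the single-link criterion, the data and the cut into three clusters are our own choices, not from the paper.

# Agglomerative (bottom-up) clustering sketch with SciPy; toy data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# Each point starts in its own cluster; the closest pair is merged at every step
# (single-link distance) until everything is fused into one cluster.
Z = linkage(X, method="single", metric="euclidean")

# Cut the resulting hierarchy to obtain, say, 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)   # e.g. [1 1 2 2 3]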
4. Gaussian Mixture Model (Soft Clustering)[9]: The Gaussian mixture model finds large-scale usage in image processing. The Gaussian mixture architecture estimates probability density functions (PDFs) for each given class, and then performs classification based on Bayes' rule:

P(Ci | X) = P(X | Ci) P(Ci) / P(X)

where P(X | Ci) is the PDF of class i, evaluated at X, P(Ci) is the prior probability of class i, and P(X) is the overall PDF, evaluated at X.

Graph -5: Gaussian mixture model of Class 1 [9].

The model parameters are tuned using an iterative procedure called estimate-maximize (EM). It is simply an algorithm which aims at maximizing the likelihood of the training set generated by the estimated PDF. The likelihood function L for each class j can be defined as:

L_j = Π_{i=1..Ntrain} P(X_i | C_j),   i.e.   ln(L_j) = Σ_{i=1..Ntrain} ln P(X_i | C_j)

Unlike the normal or uni-modal Gaussian architecture, which assumes P(X | Cj) to be in the form of a single Gaussian, the Gaussian mixture model uses P(X | Cj) as a weighted average of multiple Gaussians:

P(X | Cj) = Σ_{k=1..Nc} w_k G_k

where w_k is the weight of the k-th Gaussian G_k and all weights sum to one. Hence one such PDF model is produced for each class. Each Gaussian component is defined as:

G_k = 1 / ((2π)^(n/2) |V_k|^(1/2)) · exp[ −1/2 (X − M_k)^T V_k^{-1} (X − M_k) ]

where M_k is the mean of the Gaussian and V_k is the covariance matrix of the Gaussian. The free parameters of the Gaussian mixture model consist of the means and covariance matrices of the Gaussian components and the weights indicating the contribution of each Gaussian to the approximation of P(X | Cj). The EM (estimate-maximize) algorithm is used to approximate these variables (the means μ_i, covariances V_i and weights w_k).

1. Gaussian Mixture Training Flow Chart

1. Using the K-Means clustering algorithm, initialize the Gaussian means μ_i, i = 1, ..., G.

2. Initialize the covariance matrices V_i to the distance to the nearest cluster.

3. Initialize the weights α_i = 1/G so that all Gaussians are equally likely.

4. Present each pattern X of the training set and model each of the classes as a weighted sum of Gaussians:

p(X | s) = Σ_{i=1..G} α_i p(X | G_i)

where G is the number of Gaussians, the α_i are the weights, V_i is the covariance matrix, and

p(X | G_i) = 1 / ((2π)^(d/2) |V_i|^(1/2)) · exp[ −1/2 (X − μ_i)^T V_i^{-1} (X − μ_i) ]

5. For each training pattern X_p compute:

α_ip = P(G_i | X_p) = α_i p(X_p | G_i, C_k) / Σ_{j=1..G} α_j p(X_p | G_j, C_k)

6. Iteratively update the weights, means and covariances:

α_i(t+1) = (1 / N_c) Σ_{p=1..Nc} α_ip(t)   for the weights,

μ_i(t+1) = (1 / (N_c α_i(t))) Σ_{p=1..Nc} α_ip(t) X_p   for the means, and

V_i(t+1) = (1 / (N_c α_i(t))) Σ_{p=1..Nc} α_ip(t) (X_p − μ_i(t)) (X_p − μ_i(t))^T   for the covariances.

7. Recalculate the α_ip using the new weights, means and covariances, and quit training if

|α_ip(t+1) − α_ip(t)| ≤ threshold

or the number of epochs reaches the specified value. Otherwise, continue the iterative process.

8. Finally, present each input pattern X and compute the confidence for each class j:

P(C_j) P(X | C_j)

where P(C_j) = N_cj / N is the prior probability of class C_j, estimated by counting the number of training patterns of that class, and classify pattern X as the class with the highest confidence.

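A compact sketch of this per-class mixture classifier is shown below. Rather than hand-coding the EM loop, it leans on scikit-learn's GaussianMixture (which runs EM internally); the toy data, the choice of two components per class and the equal priors are our own assumptions, not values from the paper.

# Per-class Gaussian mixture classifier sketch using scikit-learn's EM-based GMM.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))    # class 0 samples (toy)
X1 = rng.normal(loc=[4, 4], scale=1.0, size=(100, 2))    # class 1 samples (toy)

# Fit one mixture P(X | Cj) per class, as in the flow chart above.
gmm0 = GaussianMixture(n_components=2, random_state=0).fit(X0)
gmm1 = GaussianMixture(n_components=2, random_state=0).fit(X1)
priors = np.array([0.5, 0.5])                             # P(Cj) from class counts

def classify(x):
    x = np.atleast_2d(x)
    # score_samples returns ln P(x | Cj); add ln P(Cj) and pick the largest
    scores = [gmm0.score_samples(x)[0] + np.log(priors[0]),
              gmm1.score_samples(x)[0] + np.log(priors[1])]
    return int(np.argmax(scores))

print(classify([0.2, -0.1]))   # expected 0
print(classify([3.8, 4.3]))    # expected 1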
    5. Summary:

1. Time complexity: O(n²), where n is the number of total objects.

      2. Interpretation of results is (very) subjective.

7. KNN[14][17]

KNN, also called K-Nearest Neighbors, is a popular instance-based data mining technique. Given training data (X(1), D(1)), (X(2), D(2)), ..., (X(N), D(N)), we first define a distance metric to measure the distance between points in the input space. The most commonly used distance measure is the Euclidean distance:

D(i, j) = sqrt( Σ_{k=1..n} (x_k^(i) − x_k^(j))² )

7.1. Working:

Given a test point X, our purpose is to find the K nearest training inputs to X. Mark these points as (X(1), D(1)), (X(2), D(2)), ..., (X(K), D(K)). Classification of point X is then carried out via: Y = most common class in the set {D(1), D(2), ..., D(K)}, and X → D.

7.2. Example:

Classify whether a customer will respond to a survey question using a 3-Nearest Neighbor classifier.

Table-5: Sample Customer Data[17].

Using Euclidean distance, calculating all customers' distances from David gives us John, Rachel and Neile as the 3 nearest neighbours, at distances of (15.16, No), (15, Yes) and (15.74, Yes) respectively. Hence the Response value for David is predicted to be Yes, based on majority.

7.3. Pseudo Code for KNN:

Training Algorithm:

1. Store all training examples <x, f(x)>.

2. Find the best value for K.

Classification Algorithm:

Given a query instance x_q to be classified,

let x_1, ..., x_k denote the k instances from the training examples that are nearest to x_q.

Return f̂(x_q) = argmax_c Σ_{i=1..k} δ(c, f(x_i))

where δ(a, b) = 1 if a = b and δ(a, b) = 0 otherwise (c ranges over the classes).

7.4. Complexity:

O(Nd) for both storage and query time, where N is the number of training examples and d is the dimension of each sample.

7.5. Variations of KNN:

K-Means, K-Means++ and the Matching Matrix are a few widely used variations of KNN.

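A from-scratch 3-NN sketch matching the pseudo code above follows; it uses Euclidean distance and a majority vote. The customer-like records are hypothetical stand-ins that only loosely echo the distances quoted for David, since Table-5 itself is not reproduced here.

# From-scratch k-NN: Euclidean distance plus majority vote; hypothetical records.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                     # indices of the k nearest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]                   # majority class

X_train = np.array([[35, 35], [22, 50], [63, 200], [59, 170], [25, 40]])  # e.g. age, income
y_train = ["No", "Yes", "No", "No", "Yes"]
print(knn_predict(X_train, y_train, np.array([37, 50]), k=3))   # "Yes" by majority vote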
8. CONCLUSIONS

Today data mining is widely used in diverse areas. There is hardly any walk of life where data mining does not find application and usage. A number of commercial data mining systems are available today, yet there are many challenges in this field. Broadly, we can classify data mining applications based on their usage into: Financial Data Analysis, Retail Industry, Telecommunication Industry, Biological Data Analysis, Other Scientific Applications and Intrusion Detection [15].

However, the choice of which data mining technique to use depends largely on the type of data, its dimensions and its existing state or stage, as well as the state of the in-house data.

From the discussion above it is obvious that association-based techniques provide patterns of associated values of variables and the frequencies of their appearance. Similarly, classification provides a means to use data to predict future responses. Clustering, in turn, provides a grouping of homogeneous objects, based on a certain hypothesis about the number of classes to be found, as in KNN; its results are directly understandable [1], although such methods normally do not work well with very large data sets.

Another important consideration is the algorithm one applies within a particular mining technique, since it is the working engine that defines how well the car runs.

REFERENCES

[1]. K. Gibert, M. Sanchez & V. Codian. Choosing the Right Data Mining Technique: Classification of Methods and Intelligent Recommendation. http://www.iemss.org/iemss2010/papers/S23/S.23.03.Choosing%20the%20Right%20Data%20Mining%20Technique%20%20Classification%20of%20Methods%20and%20Intelligent%20Recommendation%20-%20MIQUEL%20SANCHEZ-MARRE.pdf

[2]. P. Adriaans and D. Zantinge. Data Mining. Addison Wesley Longman, Harlow, England, 1996.

[3]. J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, CA, 2000.

[4]. C. Schoenbach, P. K. Saunders, and V. Brusic. Data warehousing in molecular biology. Briefings in Bioinformatics, 1:190-198, 2000.

[5]. O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17:520-525, 2001.

[6]. R. Sandy. Statistics for Business and Economics. McGraw-Hill, 1989.

[7]. I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, Berlin, 1986.

[8]. U. Fayyad and K. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 1022-1029, 1993.

[9]. K. Hsien (June 2005). K-Means Clustering, Nearest Cluster and Gaussian Mixture. http://www.ibms.sinica.edu.tw/~pan/classification/documents/Gaussian%20Mixture,%20Nearest%20Cluster%20and%20K%20means.ppt

[10]. Classification. http://ces.iisc.ernet.in/hpg/envis/Remote/section27.htm

[11]. N. Cerpa. Support vs. Confidence in Association Rule Algorithms. http://www.academia.edu/648890/Support_vs_Confidence_in_Association_Rule_Algorithms

[12]. CS583, Bing Liu, UIC. Mining Association Rules. http://www.cs.uic.edu/~liub/teach/cs583-fall-05/CS583-association-rules.ppt

[13]. J. Pei, H. Pinto & J. Han. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. http://dm.kaist.ac.kr/kse525/resources/papers/icde2001prefixspan.pdf

[14]. J. McCaffrey. Detecting Abnormal Data Using k-Means Clustering. http://msdn.microsoft.com/en-us/magazine/jj891054.aspx

[15]. Data Mining - Applications & Trends. http://www.tutorialspoint.com/data_mining/dm_applications_trends.htm

[16]. A brief summary of classification techniques. http://www.cs.zju.edu.cn/people/xucf/course/DM2007/Classification.ppt

[17]. Unsupervised Learning, Data Mining and Knowledge Discovery. http://www.public.iastate.edu/~olafsson/unsupervised_learning.ppt

[18]. Data Mining. http://www.rainasolutions.org/p/data-mining_7438.html

[19]. http://ces.iisc.ernet.in/hpg/envis/Remote/section27_files/Unsup-SupClassif.jpeg
