Improved kNN Algorithm by Optimizing Cross-validation

DOI: 10.17577/IJERTV1IS3186


Ms. Soniya S. Dadhania, M.E. Computer Science and Engineering

Prof. J. S. Dhobi, Head of Computer Department, GEC, Modasa

Abstract

Nowadays, web applications based on short text are increasing rapidly. The classification algorithms commonly applied to short text data are the Support Vector Machine, k-Nearest Neighbors (kNN) and Naive Bayes algorithms. The kNN algorithm depends on the distance function and on the value of k, the number of nearest neighbors. The traditional kNN algorithm can select the best value of k using cross-validation, but this involves unnecessary processing of the dataset for all possible values of k. The proposed kNN algorithm is an optimized form of the traditional kNN that reduces the time and space needed to evaluate the algorithm. Experiments are performed in the developer version of WEKA 3.7.5, and the proposed kNN algorithm is compared with the traditional kNN algorithm, Support Vector Machine and Naive Bayes. The proposed algorithm is more promising than the traditional kNN algorithm because the time taken to process and the space used for cross-validation in classification are reduced.

  1. Introduction

    The increasingly important role played by short texts in modern means of Web communication and publishing, such as Twitter messages, blogs, chat messages, book and movie summaries, forums, news feeds, and customer reviews, opens new application avenues for text mining techniques, but it also raises new scientific challenges. Although text classification and clustering are well established techniques, they are not successful in dealing with short and sparse data, because standard text similarity measures require substantial word co-occurrence or shared context.

    Text classification is a learning task in which pre-defined category labels are assigned to documents based on the likelihood suggested by a training set of labeled documents. Text categorization methods proposed in the literature are difficult to compare: the datasets used in the experiments are rarely the same in different studies, and even when they are, different studies usually use different portions of the datasets or split them into training and test sets differently. In this work, classification is performed using Support Vector Machines, k-Nearest Neighbors and Naive Bayes, and precision, recall and F-measure are used for the analysis and comparison of the results.

    kNN is a typical example of lazy learning. Lazy learning simply stores the training data at training time and delays learning until classification time; in contrast, eager learning generates an explicit model at training time. The kNN algorithm classifies a test document based on its k nearest neighbours. The training examples can be considered as vectors in a multidimensional feature space, which is partitioned into regions by the locations and labels of the training samples. A point in the space is assigned to the class to which most of the training points among its k nearest training samples belong. Usually Euclidean distance or cosine similarity is used. During the classification phase, the test sample (whose class needs to be identified) is also represented as a vector in the feature space. Distances or similarities from the test vector to all training vectors are computed and the k nearest training samples are selected. There are a number of ways to assign the test vector to a specific class; the classical kNN algorithm chooses the class with the majority of votes from the k nearest neighbours [2].
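    As an illustration of the voting scheme just described, the following minimal Java sketch classifies a test vector by majority vote among its k nearest training vectors under Euclidean distance. The class and method names are illustrative assumptions; this is not the paper's WEKA implementation.

        import java.util.*;

        class SimpleKnnSketch {
            // Classify a test vector by majority vote among its k nearest training vectors.
            static String classify(double[][] train, String[] labels, double[] test, int k) {
                Integer[] idx = new Integer[train.length];
                for (int i = 0; i < train.length; i++) idx[i] = i;
                // Sort training instances by Euclidean distance to the test vector.
                Arrays.sort(idx, Comparator.comparingDouble(i -> euclidean(train[i], test)));
                Map<String, Integer> votes = new HashMap<>();
                for (int n = 0; n < k && n < train.length; n++)
                    votes.merge(labels[idx[n]], 1, Integer::sum);
                // The class with the most votes among the k nearest neighbours wins.
                return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
            }

            // Standard Euclidean distance between two feature vectors.
            static double euclidean(double[] a, double[] b) {
                double sum = 0;
                for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
                return Math.sqrt(sum);
            }
        }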

  2. Scope of improvement in kNN

    Although kNN has been widely used for decades due to its simplicity, effectiveness, and robustness, it can be improved according to the application. Improvement can be made on the following parameters.

    1. Distance Function: The distance function for measuring the difference or similarity between two instances is the standard Euclidean distance.

    2. Selection of Value k: The neighborhood size is artificially assigned as an input parameter.

    3. Calculating Class Probability: The class probability estimation is based on simple voting.

  3. Proposed kNN algorithm

    Distance function

    The kNN algorithm depends on the distance function used for calculating the distance between the input test object and the objects in the training set, so the choice of distance function is important. The most commonly used function is the Euclidean distance (Euclid), which measures two input vectors (one typically being a stored instance and the other the input vector to be classified). One weakness of the Euclidean distance function is that if one of the input attributes has a relatively large range, it can overpower the other attributes. Therefore, distances are often normalized by dividing the distance for each attribute by the range (i.e., maximum minus minimum) of that attribute. The cosine similarity is commonly used in text classification [15].

    In the proposed algorithm the cosine similarity function is applied instead of the Euclidean distance, but the results are found to be similar for both distance functions.

    Cosine Similarity

    Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitude as

    similarity = cos(θ) = (A · B) / (||A|| ||B||)

    For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. The cosine similarity can be seen as a method of normalizing document length during comparison. In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative; the angle between two term frequency vectors cannot be greater than 90°.
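    A straightforward implementation of this similarity for two term-frequency vectors is sketched below in Java; it is an illustrative helper, not the paper's code, and it treats an all-zero vector as having similarity 0.

        class CosineSimilaritySketch {
            // cos(theta) = (A . B) / (||A|| * ||B||); with non-negative tf-idf weights
            // the result lies between 0 and 1.
            static double cosine(double[] a, double[] b) {
                double dot = 0, normA = 0, normB = 0;
                for (int i = 0; i < a.length; i++) {
                    dot   += a[i] * b[i];
                    normA += a[i] * a[i];
                    normB += b[i] * b[i];
                }
                if (normA == 0 || normB == 0) return 0;   // empty document: define similarity as 0
                return dot / (Math.sqrt(normA) * Math.sqrt(normB));
            }
        }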

    Selection of value k

    In the traditional kNN algorithm, the value of k is fixed beforehand. If k is too large, big classes will overwhelm small ones; if k is too small, the advantage of the kNN algorithm, which is to make use of many experts, will not be exhibited. To find the best value of k for a given data set, the traditional kNN algorithm uses cross-validation. The CROSSVALIDATION option specifies the settings for performing V-fold cross-validation to determine the best number of neighbors.

    • V-fold cross-validation divides the data into V folds. Then, for a fixed k, it applies nearest neighbor analysis to make predictions on the v-th fold (using the other V−1 folds as the training sample) and evaluates the error. This process is applied successively to all possible choices of v, and at the end of the V folds the computed errors are averaged. These steps are repeated for various values of k, and the value achieving the lowest average error is selected as the optimal value of k (a sketch of this selection loop is given after this list).

    • If multiple values of k are tied on the lowest average error, the smallest k among those that are tied is selected [17].
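    The traditional selection loop described in the list above can be pictured with the following minimal Java sketch. The class name and the foldError callback are assumptions made for illustration: foldError(k, fold) stands for any routine that trains a kNN model with the given k on the other V−1 folds and returns its error on the held-out fold. This is not WEKA's actual implementation.

        import java.util.function.BiFunction;

        class TraditionalKSelectionSketch {
            static int selectK(int maxK, int folds, BiFunction<Integer, Integer, Double> foldError) {
                int bestK = 1;
                double bestError = Double.MAX_VALUE;
                for (int k = 1; k <= maxK; k++) {               // every candidate k is evaluated
                    double avgError = 0;
                    for (int fold = 0; fold < folds; fold++)
                        avgError += foldError.apply(k, fold);   // train on V-1 folds, test on one
                    avgError /= folds;
                    if (avgError < bestError) {                 // strict '<' keeps the smallest tied k
                        bestError = avgError;
                        bestK = k;
                    }
                }
                return bestK;
            }
        }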

    The cross-validation process of the traditional kNN algorithm works efficiently when the number of instances in the data set is small, but as the size of the data set increases it takes much longer to cross-validate for every value of k up to the user-specified maximum value of k.

    Example A, for a better understanding of cross-validation in traditional kNN. For some data set DS1:

    No. of attributes: 21
    No. of instances: 12000
    Maximum value of k: 10
    Best value of k: 5

    The cross-validation process will be repeated for:

    (Max k − 1) * No. of instances = (10 − 1) * 12000 = 9 * 12000 = 108000

    This is a large number and takes a long time to process.

    To overcome the problem of unnecessary iterations in cross-validation for finding the best value of k, a new algorithm is proposed.

    The CROSSVALIDATION option in the proposed kNN algorithm also specifies the settings for performing V-fold cross-validation, but to determine the best number of neighbors the cross-validation process is not applied to every candidate value; it stops as soon as the best value is found. It is observed from the results on the effect of the value of k in kNN that classification accuracy decreases on either side of the best value of k. The cross-validation process starts from the maximum value of k specified as input and proceeds down to k = 1.

    In the proposed algorithm, the performance at each step is compared with that of the previous step; if the performance has decreased, the value of k used in the previous step is selected as the best value of k. The newly proposed kNN algorithm therefore reduces the number of iterations needed to find the best value of k, and because of this decrease in the number of iterations, the time and space needed to find the best k are also reduced.
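    A minimal Java sketch of this early-stopping selection follows, assuming the same kind of caller-supplied evaluation routine as in the earlier sketch, here returning cross-validated accuracy rather than error. The names are illustrative assumptions, not the authors' WEKA code: k starts at the user-specified maximum and is decreased; since the observed accuracy rises while k approaches the best value and falls once it passes below it, the loop stops as soon as the averaged accuracy drops below that of the previously tried k and returns that previous k.

        import java.util.function.BiFunction;

        class ProposedKSelectionSketch {
            static int selectK(int maxK, int folds, BiFunction<Integer, Integer, Double> foldAccuracy) {
                double prevAccuracy = -1.0;
                int prevK = maxK;
                for (int k = maxK; k >= 1; k--) {                // start from the maximum k and decrease
                    double avgAccuracy = 0;
                    for (int fold = 0; fold < folds; fold++)
                        avgAccuracy += foldAccuracy.apply(k, fold);
                    avgAccuracy /= folds;
                    if (avgAccuracy < prevAccuracy)              // accuracy started to fall:
                        return prevK;                            // the previous k was the best one
                    prevAccuracy = avgAccuracy;
                    prevK = k;
                }
                return prevK;                                    // accuracy never dropped, so k = 1 is best
            }
        }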

    Applying the proposed kNN to Example A, the cross-validation process will be repeated for:

    (Max k − 1 − best k) * No. of instances = (10 − 1 − 5) * 12000 = 4 * 12000 = 48000

    In the proposed kNN algorithm the number of iterations of the loop for finding the best value of k is thus reduced from 108000 to 48000.

  4. Experimental results

    WEKA 3.7.5 is used for performing the experiments. The proposed algorithm is coded in Java. Data sets are taken from http://kavita-ganesan.com/opinosis-opinion-dataset and http://boston.lti.cs.cmu.edu/classes/95-865/HW/HW2/

    P = precision, R = recall, F = F-measure (the harmonic mean of precision and recall, F = 2PR / (P + R)).
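    For reference, these per-class measures are computed from the true-positive, false-positive and false-negative counts in the usual way; the small Java helper below shows the standard formulas (an illustrative helper, not part of WEKA).

        class MetricsSketch {
            // Standard per-class evaluation measures.
            static double precision(int tp, int fp) { return tp + fp == 0 ? 0 : (double) tp / (tp + fp); }
            static double recall(int tp, int fn)    { return tp + fn == 0 ? 0 : (double) tp / (tp + fn); }
            static double fMeasure(double p, double r) { return p + r == 0 ? 0 : 2 * p * r / (p + r); }
        }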

    Comparing kNN, Naïve Bayes and SVM for short text classification

    Data set 1: Review of notebook
    No. of attributes: 7
    No. of instances: 19
    Cross-validation: 10 folds

    Category No. | k-NN (P / R / F) | Naïve Bayes (P / R / F) | SVM (P / R / F)
    1 | 1 / 1 / 1 | 1 / 1 / 1 | 1 / 1 / 1
    2 | 0.5 / 1 / 0.66 | 0.55 / 1 / 0.71 | 0 / 0 / 0
    3 | 1 / 1 / 1 | 0.83 / 1 / 0.9 | 0.66 / 0.8 / 0.72
    4 | 1 / 1 / 1 | 1 / 0.75 / 0.85 | 1 / 0.75 / 0.85
    5 | 0 / 0 / 0 | 0 / 0 / 0 | 0 / 0 / 0
    Avg | 0.68 / 0.79 / 0.72 | 0.66 / 0.75 / 0.68 | 0.51 / 0.5 / 0.50

    Table 1: Comparison in Data set 1

    Data set 2: Review of Swiss hotel
    No. of attributes: 6
    No. of instances: 18
    Cross-validation: 10 folds

    Category No. | k-NN (P / R / F) | Naïve Bayes (P / R / F) | SVM (P / R / F)
    1 | 0.8 / 0.8 / 0.8 | 0.75 / 0.6 / 0.66 | 0.8 / 0.8 / 0.8
    2 | 1 / 1 / 1 | 1 / 1 / 1 | 1 / 1 / 1
    3 | 0.6 / 0.75 / 0.66 | 0.5 / 0.75 / 0.6 | 0.42 / 0.75 / 0.54
    4 | 1 / 0.8 / 0.88 | 1 / 0.8 / 0.88 | 1 / 0.4 / 0.57
    Avg | 0.85 / 0.83 / 0.84 | 0.82 / 0.78 / 0.79 | 0.88 / 0.72 / 0.72

    Table 2: Comparison in Data set 2

    Data set 3: Auto
    No. of attributes: 21
    No. of instances: 12000
    Classifier: kNN
    Cross-validation: 10 folds

    Category No. | k-NN (P / R / F) | Naïve Bayes (P / R / F) | SVM (P / R / F)
    1 | 0.99 / 0.99 / 0.99 | 0.99 / 0.99 / 0.99 | 1 / 0.99 / 0.99
    2 | 0.99 / 0.99 / 0.99 | 0.99 / 1 / 0.99 | 0.99 / 1 / 0.99
    Avg | 0.99 / 0.99 / 0.98 | 0.99 / 0.99 / 0.99 | 0.99 / 0.99 / 0.99

    Table 3: Comparison in Data set 3

    Data set 4: Ford
    No. of attributes: 11
    No. of instances: 6000
    Classifier: kNN
    Cross-validation: 10 folds

    Category No. | k-NN (P / R / F) | Naïve Bayes (P / R / F) | SVM (P / R / F)
    1 | 0.87 / 0.87 / 0.87 | 0.68 / 0.96 / 0.80 | 0.73 / 0.94 / 0.82
    2 | 0.87 / 0.87 / 0.87 | 0.93 / 0.56 / 0.70 | 0.92 / 0.65 / 0.76
    Avg | 0.87 / 0.87 / 0.87 | 0.81 / 0.76 / 0.75 | 0.82 / 0.80 / 0.79

    Table 4: Comparison in Data set 4

    Algorithm | Precision | Recall | F-measure
    kNN | 0.688 | 0.792 | 0.722
    Naïve Bayes | 0.664 | 0.75 | 0.689
    SVM | 0.514 | 0.5 | 0.503

    Table 5: Average result of Data set 1

    Algorithm | Precision | Recall | F-measure
    kNN | 0.856 | 0.833 | 0.84
    Naïve Bayes | 0.819 | 0.778 | 0.788
    SVM | 0.817 | 0.722 | 0.724

    Table 6: Average result of Data set 2

    Algorithm | Precision | Recall | F-measure
    kNN | 0.998 | 0.998 | 0.998
    Naïve Bayes | 0.997 | 0.997 | 0.997
    SVM | 0.995 | 0.995 | 0.995

    Table 7: Average result of Data set 3

    Algorithm | Precision | Recall | F-measure
    kNN | 0.878 | 0.878 | 0.877
    Naïve Bayes | 0.814 | 0.764 | 0.754
    SVM | 0.829 | 0.802 | 0.798

    Table 8: Average result of Data set 4

    Figure 1: Comparison between kNN, NB and SVM

    Examining effect of k value in kNN

    Data set: 1
    Detail of data set: review of iPod
    No. of attributes: 68
    No. of instances: 19
    Classifier: kNN
    Cross-validation: 10 folds

    Selected value of k | Correctly classified instances
    1 | 63.16%
    3 | 84.21%
    5 | 78.94%
    8 | 73.68%

    Table 9: Effect of k value in Data set 1

    Data set: 2
    No. of attributes: 21
    No. of instances: 12000
    Classifier: kNN
    Cross-validation: 10 folds

    Selected value of k | Correctly classified instances
    1 | 99.81%
    3 | 99.79%
    5 | 99.7%
    7 | 99.67%

    Table 10: Effect of k value in Data set 2

    Data set: 3
    No. of attributes: 21
    No. of instances: 6000
    Classifier: kNN
    Cross-validation: 10 folds

    Selected value of k | Correctly classified instances
    1 | 87.33%
    3 | 88.78%
    5 | 89.23%
    7 | 89.41%
    9 | 89.3%
    11 | 89.09%

    Table 11: Effect of k value in Data set 3

    Data set: 4
    No. of attributes: 21
    No. of instances: 1382
    Classifier: kNN
    Cross-validation: 10 folds

    Selected value of k | Correctly classified instances
    1 | 62.44%
    3 | 66.20%
    5 | 68.16%
    7 | 68.23%
    9 | 68.37%
    11 | 69.17%
    13 | 69.60%
    15 | 69.97%
    17 | 70.47%
    19 | 69.75%
    21 | 69.68%

    Table 12: Effect of k value in Data set 4

    From the above tables it can be concluded that if the value of k is appropriate to the data, the increase in the efficiency of kNN is noticeable. Here, in data set 1, k = 3 is the best value of k, giving 84.21% correctly classified instances. If the value of k is smaller or larger than the best k value, the performance of kNN is affected.

    The observations also make it clear that accuracy increases up to the best value of k, after which it starts to decrease.

    Comparison of traditional kNN and proposed kNN

    To find the best value of k for a given data set, the traditional kNN algorithm uses cross-validation.

    The CROSSVALIDATION option in the proposed kNN algorithm also specifies the settings for performing V-fold cross-validation, but to determine the best number of neighbors the process is not applied to every candidate value; it stops when the best value is found. It can be observed from the results on the effect of the value of k in kNN that classification accuracy decreases on either side of the best value of k. The newly proposed kNN algorithm reduces the number of iterations needed to find the best value of k, and because of this decrease in the number of iterations, the time and space needed to find the best k are also reduced.

    The data sets used to examine the effect of the value of k are also used in the comparison of the traditional and proposed algorithms.

    Data set | Maximum value of k | % of correct classification | No. of iterations in traditional kNN | No. of iterations in proposed kNN | Reduced no. of iterations
    Data set-1 | 10 | 84.21 | 9 | 7 | 2
    Data set-2 | 10 | 99.81 | 9 | 9 | 0
    Data set-3 | 10 | 89.41 | 9 | 4 | 5
    Data set-4 | 20 | 70.47 | 19 | 4 | 15

    Table 13: Traditional kNN v/s Proposed kNN

    Figure 2: Traditional kNN v/s Proposed kNN

  5. Conclusion

    For short text classification, the kNN, Naïve Bayes and SVM algorithms can be used. From the results in section 4 it is concluded that kNN gives better accuracy than the other two algorithms. Also, when the kNN algorithm is used with attribute selection, its classification accuracy increases further. The kNN algorithm depends on the distance function and on the value of k, the number of nearest neighbors; the traditional kNN algorithm finds the best value of k using cross-validation, but the time and space it uses are large because of unnecessary processing for each and every value of k from the maximum k down to 1. In the proposed kNN algorithm this unnecessary cross-validation processing is reduced, and consequently the time and space used for classification are also reduced. Table 13 shows the reduced number of iterations in the proposed algorithm. The proposed kNN is more promising than the traditional kNN because larger data sets can be used for classification and the time for evaluating and building the model for a large data set is reduced.

  6. References

  1. Arzucan Özgür, Levent Özgür, and Tunga Güngör, "Text Categorization with Class-Based and Corpus-Based Keyword Selection", in Proceedings of the 20th International Conference on Computer and Information Sciences, Springer-Verlag, October 2005.
  2. Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand and Dan Steinberg, "Top 10 Algorithms in Data Mining", Knowledge and Information Systems, Volume 14, Issue 1, December 2007.
  3. Xuan-Hieu Phan, Le-Minh Nguyen, Susumu Horiguchi, "Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-Scale Data Collections", in Proceedings of the 17th International Conference on World Wide Web, ACM, New York, NY, USA, 2008.
  4. Alfonso Marin, "Comparison of Automatic Classifiers' Performances using Word-based Feature Extraction Techniques in an E-government Setting", University essay from KTH/Skolan för informations- och kommunikationsteknik (ICT), January 2011.
  5. Victoria Bobicev, Marina Sokolova, "An Effective and Robust Method for Short Text Classification", in Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, Volume 3, July 2008.
  6. Hitesh Sajnani, Sara Javanmardi, David W. McDonald, Cristina V. Lopes, "Multi-Label Classification of Short Text: A Study on Wikipedia Barnstars", paper from the AAAI-11 Workshop on Analyzing Microtext, 2011.
  7. Bengel, J., Gauch, S., Mittur, E. and Vijayaraghavan, R., "Chat Room Topic Detection Using Classification", in Proceedings of the 2nd Symposium on Intelligence and Security Informatics, 2004.
  8. Yiming Yang, "A Study of Thresholding Strategies for Text Categorization", in Proceedings of the 24th International ACM SIGIR Conference, 2001.
  9. Yang, Y. and Liu, X., "A Re-examination of Text Categorization Methods", in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999.
  10. Wei Wang, Sujian Li, Chen Wang, "An Improved KNN Algorithm for Text Categorization", in Proceedings of the NTCIR-7 Workshop Meeting, 2008.
  11. Ali-Mustafa Qamar, Eric Gaussier, Jean-Pierre Chevallet, Joo Hwee Lim, "Similarity Learning for Nearest Neighbor Classification", in Proceedings of the 2008 Eighth IEEE International Conference on Data Mining (ICDM '08), 2008.
  12. Wen-tau Yih and Christopher Meek, "Improving Similarity Measures for Short Segments of Text", in Proceedings of the 22nd National Conference on Artificial Intelligence.
  13. Jasmine Kalathipparambil Sudhakaran, Ramaswamy Vasantha, "A Mixed Method Approach for Efficient Component Retrieval from a Component Repository", Journal of Software Engineering and Applications, 2011.
  14. Gautam Bhattacharya, Koushik Ghosh, Ananda S. Chowdhury, "An Affinity-Based New Local Distance Function and Similarity Measure for kNN Algorithm", Pattern Recognition Letters, Volume 33, Issue 3, Elsevier Science Inc., 2012.
  15. Muhammed Miah, "Improved k-NN Algorithm for Text Classification", Department of Computer Science and Engineering, University of Texas at Arlington, TX, USA, 2003.
  16. Z. Xie, W. Hsu, Z. Liu and M. Lee, "SNNB: A Selective Neighborhood Based Naive Bayes for Lazy Learning", in Proceedings of the 6th Pacific-Asia Conference on KDD, Taipei, Taiwan, 2002.
  17. Cross-validation subcommand in kNN, http://publib.boulder.ibm.com/infocenter/spssstat/v20r0m0/index.jsp
