Use Of Probabilistic Neural Network In The Classification Of Articles Of Ambiguous Authorship

DOI : 10.17577/IJERTV1IS7096


R. Chandrasekaran and G. Manimannan

Department of Statistics, Madras Christian College, Tambaram, Chennai, INDIA 600 059.

Abstract

Assigning authorship to writings of ambiguous authorship is a special type of problem in the field of stylometry. During the pre-independence period, three contemporary Tamil scholars, namely Mahakavi Bharathiar (MB), Subramaniya Iyer (SI), and T. V. Kalyanasundaram (TVK), wrote a number of articles on India's Freedom Movement in the magazine India. Initially, all three writers signed their articles. Later, because of the oppressive attitude of the then British regime, all three patriots wrote articles on the same theme anonymously.

In this paper, the assignment of articles of ambiguous authorship to Mahakavi Bharathiar (MB), Subramaniya Iyer (SI), or T. V. Kalyanasundaram (TVK) is discussed. The application of Artificial Neural Network models has increased considerably in pattern classification and recognition problems, particularly in the field of stylometry. The Probabilistic Neural Network is applied to the problem of authorship attribution, assigning articles of ambiguous authorship to one of the contemporary writers of the same period. Two different sets of variables, namely morphological variables and function words, are used for classification. The results for the 23 writings of ambiguous authorship support the views of many scholars.

    Keywords: Stylometry, Authorship attribution, Probabilistic Neural Networks.

    1. Introduction

      Stylometry relates to the problem of authorship attribution. It is the study of the quantifiable features of human language, or the statistical analysis of literary style (Holmes, 1995; Holmes and Forsyth, 1995).

      This involves attempting to formally capture the creative, unconscious elements of language particular to individual writers and speakers. Although researchers have studied writing for centuries, the discipline of stylometry is fairly recent, and while its origins date back to the late 19th century, the field as it is now began with work on the Federalist Papers in 1968 (Mosteller and Wallace, 1968).

      The study of stylometry mainly concerns itself with authorship attribution, although chronological studies on the dating of works within the corpus of an author have also been undertaken. Writing in a forensic context, Bailey (1979) proposed three rules to define the situation necessary for authorship attribution:

      1. The number of putative authors should constitute a well-defined set.

      2. The lengths of the writings should be sufficient to reflect the linguistic behavior of the author of the disputed text and also those of the candidates.

      3. The texts used for comparison should be commensurate with the disputed writing.

        A computational stylistic study of ambiguous authorship should compare the disputed text with works by each of the possible candidate authors, using suitable statistical tools on quantifiable features of the text, that is, features which reflect the style of the writing as defined above.

        The Artificial Neural Network (ANN) is one of the modern additions to the tools available for computational stylometry. ANNs are computational methods closely based on the concept of the biological neuron, the idea being that simple, trained processing elements will produce much more complex behavior when used in combination.

        In recent years, many scholars have successfully demonstrated that this machine learning technique can be applied to authorship attribution. Merriam and Mathews (1993, 1994) trained a multilayer perceptron network to distinguish the works of Shakespeare and Marlowe. Tweedie et al. (1996) provided a useful review of the applications of ANNs in computational stylometry and used this machine-learning approach for a reanalysis of the Federalist Papers. Kjell (1994) took up an authorship study using letter-pair frequency features with neural network classification.

        Recently, the authors of this paper have also addressed the authorship identification problem using the Radial Basis Function Network (RBFN) (Chandrasekaran and Manimannan, 2008a) and the Generalized Regression Neural Network (GRNN) (Chandrasekaran and Manimannan, 2008b) as neural network classification tools. The present study uses the Probabilistic Neural Network (PNN) as another appropriate neural network classification model.

    2. Artificial Neural Network

      An Artificial Neural Network (ANN) is defined as a mathematical model represented by interlinked simple computational elements, called neurons, that can compute, learn, remember and optimize in the way a human brain works (Bishop, 2003; Haykin, 2001; Wasserman, 1989, 1993). The neurons in an ANN are called nodes. The interconnected nodes are arranged into several layers, namely the input, intermediate (hidden) and output layers. Depending on the signals (data) transmitted by the various nodes, a set of outputs is computed by the nodes that receive signals from the other nodes.

      Initially, the network is subjected to a learning process. The network must learn decision surfaces from a set of training patterns so that these training patterns are classified correctly (Gose, Johnsonbaugh and Jost, 1997). After training, the network must also be able to generalize, that is, to correctly classify test patterns it has never seen before. A useful neural network must therefore both learn well and generalize well.

      The general procedure is to have the network learn the appropriate weights from a representative set of training data. In all but the simplest cases, however, direct computation of the weights is difficult. Instead, learning starts off with random initial weights and adjusts them in successive iterations, until the required outputs are produced.

      ANNs adopt both supervised and unsupervised learning methodologies. In supervised learning, the objective is to predict one or more output variables from one or more input variables (Bishop, 2003; Ripley, 1996). In unsupervised learning, there are no target variables. The network trains itself to extract features from the input variables, from which the input variables themselves can be predicted.

      Tweedie et al. (1996) discussed the application of neural networks in stylometry and found them useful for a number of reasons:

      1. Neural networks can learn from the data themselves. Implementing a rule-based system in linguistic computing may become complex as the number of distinguishing variables increases and even the most complex rules may still not be good enough to completely characterize the training data. In essence, neural networks are more adaptive.

      2. Neural networks can generalize. This ability is particularly required in the literary field, as only limited data may be available.

      3. Neural networks can capture non-linear interactions between input variables.

      4. Neural networks are capable of fault tolerance. Hence a particular work that is not in line with the usual writing style of an author will not affect the network to a considerable extent.

        Thus neural networks appear to promise much for the field of stylometry, and their application appears to be worthy of investigation.

        The pioneering work in the application of neural nets in stylometry was undertaken by Merriam and Mathews (1993). In their paper, a very small set of function word frequencies is used as input to a multilayer perceptron (a neural net having a hidden layer) to examine four plays that have been attributed both to Shakespeare and John Fletcher.

        In the present context, an attempt is made to use the Probabilistic Neural Network for the attribution of articles of ambiguous authorship.

    3. Probabilistic Neural Network

      Probabilistic Neural Networks are used for classification problems. The Probabilistic Neural Network is a feed-forward neural network with a three-layer architecture consisting of the input, radial basis function and competitive layers (Powell, 1992; Wasserman, 1989, 1993). When an input is presented, the first layer computes distances from the input vector to the training input vectors and produces a vector whose elements indicate how close the input is to each training input. The second layer sums these contributions for each class of inputs to produce, as its net output, a vector of probabilities. Finally, a compete transfer function on the output of the second layer picks the maximum of these probabilities and produces a 1 for that class and a 0 for the other classes. The architecture of this system is shown in Figure 1.
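The three layers described above can be summarized in symbols. This is only a sketch under assumed notation that the paper does not itself use: $p_j$ is the $j$-th training vector, $c(j)$ its class, $b$ a common bias (scale) applied to every distance, and $s_c(x)$ the class score of an input $x$.

$$
s_c(x) = \sum_{j \,:\, c(j) = c} \exp\!\left(-\left(b\,\lVert x - p_j \rVert\right)^{2}\right),
\qquad
\hat{c}(x) = \arg\max_{c}\; s_c(x)
$$

The sum over each class corresponds to the second layer and the $\arg\max$ to the compete transfer function. Dividing each $s_c(x)$ by $\sum_{c'} s_{c'}(x)$ would give values that sum to one, which is one possible reading of the "vector of probabilities"; the winning class is the same either way.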

    4. Database

      The present study deals with the literary works of three contemporary Tamil scholars, namely Mahakavi Bharathiar (MB), T. V. Kalyanasundaram (TVK), and Subramaniya Iyer (SI). In the pre-independence period, these three scholars wrote a number of articles on India's Freedom Movement in the magazine India. Initially, all three scholars signed their articles. The oppressive attitude of the then British regime later made all three writers write on the same topic anonymously in the same magazine. For this quantitative attribution study, all attributed articles by these three scholars on India's Freedom Movement written in the year 1906 are considered. Our study is based on nineteen articles of MB, seven of TVK and six of SI, and twenty-three unattributed articles. Twenty-four function words and eighteen morphological variables are used to quantify each sentence. Samples of these variables, with their meanings, are listed in Table 1 and Table 2.

      Figure 1. Probabilistic Neural Network Model (Input Layer, Radial Basis Function Layer, Competitive Layer)

      Table 1. List of Sample of Function Words

      Function Word    Translation
      Um               Also
      Aakiyaal         As
      Entraal          For
      Entru            For
      Pearil           On
      Ul               Inside
      Ai               Unmarked
      Nodu             With
      Lall             With
      Aall             Unmarked
      Ukku             To
      Ill              In

      Table 2. List of Sample of Morphological Variables with Abbreviations

      Abbreviation     Variable Name
      P_NOUN           Nouns
      P_INT            Introductory
      P_INF            Intensifiers
      P_PRO            Pronouns
      P_NUME           Numerals
      P_CASE           Case Markers
      P_ADVERB         Adverbs

      A chi-square analysis of the nineteen articles of MB establishes that these articles do not differ from one another in the frequency distribution of these stylistic features. Similar results were obtained for the other two scholars (Manimannan and Bagavandas, 2001). Each article is therefore converted into a raw data matrix, and these raw data matrices form the basis of the data description. The nineteen articles of Mahakavi Bharathiar (MB) consist of three hundred and fifty-three sentences, the seven articles of Subramaniya Iyer (SI) consist of three hundred and fifteen sentences, and the six articles of T. V. Kalyanasundaram (TVK) consist of three hundred and eighty-two sentences. The twenty-three ambiguous articles together consist of three hundred and seventy-five sentences.
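The kind of chi-square check mentioned above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `chi_square_homogeneity` and the count tables are hypothetical, and the sketch assumes a Pearson chi-square test of homogeneity on an articles-by-features table of counts with no empty row or column.

```python
def chi_square_homogeneity(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts.

    Rows are articles, columns are stylistic features (hypothetical layout).
    Assumes every row total and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if all articles share one feature distribution
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical counts: three articles with identical feature proportions
same_style = [[30, 12, 8], [45, 18, 12], [15, 6, 4]]
stat_same, dof_same = chi_square_homogeneity(same_style)

# Hypothetical counts: two articles with clearly different proportions
different_style = [[10, 20], [20, 10]]
stat_diff, dof_diff = chi_square_homogeneity(different_style)

print(stat_same, dof_same)
print(stat_diff, dof_diff)
```

A small statistic relative to the chi-square distribution with the reported degrees of freedom is consistent with the articles sharing one frequency distribution; comparing against a critical value or computing a p-value is left out of this sketch.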

    5. Results and Discussion

      Nineteen articles written by Mahakavi Bharathiar (MB), seven articles by Subramaniya Iyer (SI) and six articles by T. V. Kalyanasundaram (TVK) are considered for the analysis, each article being represented by the averages over its sampled sentences. The resulting 32 data points, the mean vectors of the sample writings of the three authors, are used for training. The data matrix (P) consists of the 18 normalized morphological variables computed from the chosen articles. The target matrix (T) has three rows, one for each author, each row consisting of zeroes and ones. The matrix element $t_{ij}$ is defined as

      $$t_{ij} = \begin{cases} 1 & \text{if sample } j \text{ in the data matrix } P \text{ corresponds to author } i, \\ 0 & \text{otherwise.} \end{cases}$$
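The construction of the data and target matrices can be sketched as below. The helper names `article_mean` and `target_matrix`, the toy sentence vectors and the label list are all hypothetical; the normalization step is omitted because the paper does not state which normalization was used.

```python
def article_mean(sentence_vectors):
    """Average the per-sentence feature vectors of one article."""
    n = len(sentence_vectors)
    return [sum(column) / n for column in zip(*sentence_vectors)]


def target_matrix(labels, authors):
    """One row per author, one column per sample: 1 where sample j is by author i."""
    return [[1 if label == author else 0 for label in labels] for author in authors]


# Hypothetical article with two sentences and two features
mean_vector = article_mean([[1.0, 2.0], [3.0, 4.0]])

# Hypothetical labels for four training samples
authors = ["MB", "SI", "TVK"]
labels = ["MB", "MB", "SI", "TVK"]
T = target_matrix(labels, authors)

print(mean_vector)
for row in T:
    print(row)
```

Each column of `T` contains exactly one 1, so every training sample is associated with exactly one author.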

      The radial basis function network is created with appropriate parameters and the data matrices. The articles of unknown authorship are presented in the form of a test matrix (X) consisting of the 18 normalized morphological variables, arranged as column vectors, each column corresponding to one sample article of unknown authorship. The entire analysis was also performed on the 24 normalized function-word (functional) variables.

      The Neural Network Toolbox available with the MATLAB software is used for the analysis. It is assumed that there are Q input vector/target vector pairs. Each target vector has K elements. One of these elements is 1 and the rest are 0. Thus, each input vector is associated with one of K classes.

      The first-layer input weights (InputWeight) are set to the transpose of the matrix formed from the Q training pairs, $P^{T}$. When an input is presented, the distance function produces a vector whose elements indicate how close the input is to the vectors of the training set. These elements are multiplied, element by element, by the bias and sent to the radial basis transfer function. An input vector close to a training vector is represented by a number close to 1 in the output vector $a_t$. If an input is close to several training vectors of a single class, it is represented by several elements of $a_t$ that are close to 1.

      The second-layer weights (LayerWeight) are set to the matrix $T$ of target vectors. Each vector has a 1 only in the row associated with that particular class of input, and 0's elsewhere. The multiplication $T a_t$ sums the elements of $a_t$ due to each of the K input classes. Finally, the second-layer transfer function, compete, produces a 1 corresponding to the largest element and 0's elsewhere. Thus, the network classifies the input vector into a specific class C because that class has the maximum probability of being correct.
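The two layers just described can be sketched in plain Python. This is an illustrative re-implementation under stated assumptions, not the MATLAB toolbox code used in the study: the function name `pnn_classify` and the toy training data are hypothetical, distances are Euclidean, and the bias is taken as sqrt(ln 2)/SPREAD, which is approximately the 0.8326/SPREAD value given in the text.

```python
import math


def pnn_classify(x, train_vectors, train_classes, n_classes, spread):
    """Classify x with a probabilistic neural network (illustrative sketch).

    train_classes[j] is the integer class index of train_vectors[j].
    Returns the one-hot compete output and the unnormalized class scores.
    """
    bias = math.sqrt(math.log(2)) / spread  # about 0.8326 / spread
    scores = [0.0] * n_classes
    for p, c in zip(train_vectors, train_classes):
        # First layer: distance to a training vector, scaled by the bias,
        # passed through the radial basis function exp(-k^2)
        distance = math.sqrt(sum((xi - pi) ** 2 for xi, pi in zip(x, p)))
        activation = math.exp(-(distance * bias) ** 2)
        # Second layer: sum the activations belonging to each class
        scores[c] += activation
    # Compete transfer function: 1 for the largest score, 0 elsewhere
    winner = max(range(n_classes), key=scores.__getitem__)
    one_hot = [1 if k == winner else 0 for k in range(n_classes)]
    return one_hot, scores


# Hypothetical two-feature training set with three classes
train_vectors = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.0, 1.0]]
train_classes = [0, 0, 1, 2]

result_a, _ = pnn_classify([0.05, 0.0], train_vectors, train_classes, 3, spread=0.5)
result_b, _ = pnn_classify([0.9, 1.0], train_vectors, train_classes, 3, spread=0.5)
print(result_a)
print(result_b)
```

The scores here are plain sums of activations; they are not divided by class size or normalized to sum to one, so they rank the classes but should not be read as calibrated probabilities.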

      The radial basis layer has as many neurons as there are input vectors in P. Each neuron's weighted input is the distance between the input vector and its weight vector, and each neuron's net input is the product of its weighted input and its bias. The neuron's output is its net input passed through the radial basis transfer function $g(k) = \exp(-k^{2})$, where $k = \lVert w - x \rVert \, b$, with $w$ the neuron's weight vector, $x$ the input vector and $b$ the bias. Every bias in the first layer is set to 0.8326/SPREAD, so that the radial basis transfer function crosses 0.5 at weighted inputs of +/- SPREAD. The larger the SPREAD, the smoother the approximation. In this study, SPREAD was varied from 0.1 to 1.0.
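The role of the bias 0.8326/SPREAD can be checked numerically. The sketch below assumes that 0.8326 is the rounded value of sqrt(ln 2); the function names `radbas` and `neuron_output` are illustrative and are defined here rather than taken from any library.

```python
import math


def radbas(k):
    """Radial basis transfer function g(k) = exp(-k^2)."""
    return math.exp(-k * k)


def neuron_output(distance, spread):
    """Output of one radial basis neuron for a given distance and SPREAD."""
    bias = math.sqrt(math.log(2)) / spread  # approximately 0.8326 / spread
    return radbas(distance * bias)


# At a distance equal to SPREAD the output is one half, whatever SPREAD is
for spread in (0.1, 0.5, 1.0):
    print(spread, neuron_output(spread, spread))

# A larger SPREAD gives a wider, smoother response at the same distance
narrow = neuron_output(0.3, 0.1)
wide = neuron_output(0.3, 1.0)
print(narrow, wide)
```

With a small SPREAD each neuron responds only to inputs very near its training vector; with a large SPREAD many training vectors contribute to every class sum.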

      All the 23 writings of ambiguous authorship are assigned by the program to Mahakavi Bharathiar (MB) for morphological as well as functional variables. This result supported the claims made by many scholars that these 23 articles could have been written by Mahakavi Bharathiar (MB).

    6. Conclusion

The present research takes up the problem of assigning articles of ambiguous authorship to one of three contemporary Tamil scholars of the same period, namely Mahakavi Bharathiar (MB), Subramaniya Iyer (SI), and T. V. Kalyanasundaram (TVK). To begin with, all three writers signed their articles. The oppressive attitude of the then British regime later compelled all three patriots to write articles on the same theme anonymously.

Application of Neural Network models has increased considerably in areas of pattern recognition and classification problems in the field of Stylometry over the last decade. The authorship attribution problem is attempted using a Probabilistic Neural network for attributing the 23 articles of unknown authorship to one of the contemporary writers of the same period, first using morphological variables and then functional variables. All the articles of ambiguous authorship are attributed to Mahakavi Bharathiar (MB). This result supported the claims made by many scholars that these 23 articles could have been written by Mahakavi Bharathiar (MB).

References

  1. Bailey, R. W. (1979), The Future of Computational Stylistics, Association for Literary and Linguistic Computing Bulletin, Vol. 7, 4-11. England.

  2. Bishop, C. M (2003), Neural Networks for Pattern Recognition (First Indian Edition), Oxford University Press, New Delhi.

  3. Chandrasekaran, R. and Manimannan, G. (2008a), Neural Network Classification and Authorship Attribution of Articles of Unknown Authorship Using Radial Basis Function, Proceedings of the National Conference on Artificial Intelligence and Neural Networks, Department of Computer Applications, SRM University, Kattankulathur, Tamil Nadu, pp.246-256.

  4. Chandrasekaran, R. and Manimannan, G. (2008b), Use of Generalized Regression Neural Network in Authorship Attribution, Proceedings of the National Conference on Research Areas in Computer Science, Department of Computer Applications, SRM University, Ramapuram, Tamil Nadu.

  5. Kjell, B. (1994), Authorship Determination Using Letter-pair Frequency Features with Neural Network Classifiers, Literary and Linguistic Computing, Vol.9, 119-124, England.

  6. Gose, E., Johnsonbaugh, R, and Jost, S (1997), Pattern Recognition and Image Analysis, Prentice Hall Inc., New Jersey.

  7. Haykin, S. (2001), Neural Networks : A Comprehensive Foundation (Second Edition), Pearson Education (Singapore), New Delhi.

  8. Holmes, D. I. (1995), The Analysis of Literary Style : A Review, Journal of Royal Statistical Society, Series A, Vol. 148, 328-334, England.

  9. Holmes D.I. and Forsyth, R.S. (1995), The Federalist Revisited: New Directions in Authorship Attribution, Literary and Linguistic Computing, 10, 111-127, England.

  10. Manimannan, G. and Bagavandas, M. (2001), Authorship Attribution : The case of Bharathiar, National Conference on Mathematical and Applied Statistics, Department of Statistics, Nagpur University, Nagpur.

  11. Merriam, T. and Mathews, R. (1993), Neural Computation in Stylometry I: An Application to the Works of Shakespeare and Fletcher, Literary and Linguistic Computing, Vol.8, 203-209, England.

  12. Merriam, T. and Mathews, R. (1994), Neural Computation in Stylometry II: An Application to the Works of Shakespeare and Marlowe, Literary and Linguistic Computing, Vol.9, 1-6, England.

  13. Mosteller, F. and Wallace, D. L. (1968), Inference and Disputed Authorship: The Federalist Papers, Addison-Wesley, Massachusetts.

  14. Powell, M. J. D. (1992), The Theory of Radial Basis Functions Approximation, in Advances of Numerical Analysis, pp. 105-210, Clarendon Press, Oxford.

  15. Ripley, B.D. (1996), Pattern Recognition and Neural Networks, Cambridge University Press.

  16. Tweedie, F. J., Singh, S., and Holmes, D. I. (1996), Neural Network Applications in Stylometry: The Federalist Papers, Computers and the Humanities, 30(1), 1-10.

  17. Wasserman, P. D. (1989), Neural Computing : Theory and Practice, Van Nostrand Reinhold, New York.

  18. Wasserman, P. D. (1993), Advanced Methods in Neural Computing, Van Nostrand Reinhold, New York.
