Use Of Probabilistic Neural Network In The Classification Of Articles Of  Ambiguous Authorship

R. Chandrasekaran; G. Manimannan

doi:10.17577/IJERTV1IS7096

Volume 01, Issue 07 (September 2012)


Open Access
Paper ID : IJERTV1IS7096
Published (First Online): 25-09-2012
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


Department of Statistics, Madras Christian College, Tambaram, Chennai, INDIA 600 059.

Abstract

Assigning authorship to writings of ambiguous authorship is a special type of problem in the field of Stylometry. During the pre-independence period, three contemporary Tamil scholars, namely Mahakavi Bharathiar (MB), Subramaniya Iyer (SI), and T. V. Kalyanasundaram (TVK), wrote a number of articles on India's Freedom Movement in the magazine called India. Initially, all three writers published their articles under their own names. Later, owing to the oppressive attitude of the then British regime, all three patriots wrote articles on the same theme anonymously.

In this paper, the assignment of articles of ambiguous authorship to Mahakavi Bharathiar (MB), Subramaniya Iyer (SI), or T. V. Kalyanasundaram (TVK) is discussed. The application of Artificial Neural Network models has increased considerably in pattern classification and recognition problems, particularly in the field of Stylometry. The Probabilistic Neural Network is applied here to the problem of authorship attribution, assigning articles of ambiguous authorship to one of the contemporary writers of the same period. Two different sets of variables, namely morphological variables and function words, are used for classification. The results for the 23 writings of ambiguous authorship support the views of many scholars.

Keywords: Stylometry, Authorship attribution, Probabilistic Neural Networks.

Introduction

Stylometry relates to the problem of authorship attribution. It is the study of the quantifiable features of human language, or the statistical analysis of literary style (Holmes, 1995; Holmes and Forsyth, 1995).

This involves attempting to formally capture the creative, unconscious elements of language particular to individual writers and speakers. Although researchers have studied writing for centuries, the discipline of stylometry is fairly recent, and while its origins date back to the late 19th century, the field as it is now began with work on the Federalist Papers in 1968 (Mosteller and Wallace, 1968).

The study of stylometry mainly concerns itself with authorship attribution, although chronological studies on the dating of works within the corpus of an author have also been undertaken. Writing in a forensic context, Bailey (1979) proposed three rules defining the conditions necessary for authorship attribution:
1. The number of putative authors should constitute a well-defined set.
2. The lengths of the writings should be sufficient to reflect the linguistic behavior of the author of the disputed text and also those of the candidates.
3. The texts used for comparison should be commensurate with the disputed writing.
  
A computational stylistic study of ambiguous authorship should compare the disputed text with works by each of the possible candidate authors, applying suitable statistical tools to quantifiable features of the text, that is, features which reflect the style of the writing as defined above.

The Artificial Neural Network (ANN) is one of the modern additions to the tools available for computational Stylometry. ANNs are computational methods closely based on the concept of the biological neuron, the idea being that simple, trained processing elements will produce much more complex behavior when used in combination.

In recent years, many scholars have successfully demonstrated that this machine learning technique can be applied to authorship attribution. Merriam and Mathews (1993, 1994) trained a multilayer perceptron network to distinguish the works of Shakespeare and Marlowe. Tweedie et al. (1996) provided a useful review of the applications of ANNs in computational stylometry and used this machine-learning approach for a reanalysis of the Federalist Papers. Kjell (1994) took up an authorship study using letter-pair frequency features with neural network classification.

Recently, the authors of this paper have also attempted the authorship identification problem using the Radial Basis Function Network (RBFN) (Chandrasekaran and Manimannan, 2008a) and the Generalized Regression Neural Network (GRNN) (Chandrasekaran and Manimannan, 2008b) as neural network classification tools. The present study uses the Probabilistic Neural Network (PNN) as another appropriate neural network classification model.

Artificial Neural Network

An Artificial Neural Network (ANN) is defined as a mathematical model represented by interlinked simple computational elements, called neurons, that can compute, learn, remember and optimize in the way a human brain works (Bishop, 2003; Haykin, 2001; Wasserman, 1989, 1993). The neurons in an ANN are called nodes. The interconnected nodes (neurons) are arranged into several layers, namely the input, intermediate (hidden) and output layers. Depending on the signals (data) transmitted by the various nodes, a set of outputs is computed by the nodes that receive the signals from the other nodes.

Initially, the network should be subjected to a learning process. The network must learn decision surfaces from a set of training patterns so that these training patterns are classified correctly (Gose, Johnsonbaugh and Jost, 1997). After training, the network must also be able to generalize, that is, correctly classify test patterns it has never seen before. In general, a neural network should be able both to learn well and to generalize well.

The general procedure is to have the network learn the appropriate weights from a representative set of training data. In all but the simplest cases, however, direct computation of the weights is difficult. Instead, learning starts off with random initial weights and adjusts them in successive iterations, until the required outputs are produced.

Both supervised and unsupervised learning methodologies are adopted by ANNs. In supervised learning, the objective is to predict one or more output variables from one or more input variables (Bishop, 2003; Ripley, 1996). In unsupervised learning, there are no target variables. The network trains itself to extract features from the input variables, from which the input variables themselves can be predicted.

Tweedie et al. (1996) discussed the application of neural networks in stylometry and argued that they are useful for a number of reasons:
1. Neural networks can learn from the data themselves. Implementing a rule-based system in linguistic computing may become complex as the number of distinguishing variables increases and even the most complex rules may still not be good enough to completely characterize the training data. In essence, neural networks are more adaptive.
2. Neural networks can generalize. This ability is particularly required in the literary field, as only limited data may be available.
3. Neural networks can capture nonlinear interactions between input variables.
4. Neural networks are capable of fault tolerance. Hence a particular work that is not in line with the usual writing style of an author will not affect the network to a considerable extent.

Thus neural networks appear to promise much for the field of stylometry, and their application appears worthy of investigation.

The pioneering work in the application of neural nets in Stylometry was undertaken by Merriam and Mathews (1993). In their paper, a very small set of function word frequencies is used as input to a multilayer perceptron (a neural net having a hidden layer) to examine four plays that have been attributed to both Shakespeare and John Fletcher.

In the present context, an attempt is made to use the Probabilistic Neural Network for the attribution of articles of ambiguous authorship.

Probabilistic Neural Network

Probabilistic Neural Networks are used for classification problems. The Probabilistic Neural Network is a feed-forward neural network with an architecture of three layers, namely the input, radial basis function and competitive layers (Powell, 1992; Wasserman, 1989, 1993). When an input is presented, the first layer computes distances from the input vector to the training input vectors and produces a vector whose elements indicate how close the input is to a training input. The second layer sums these contributions for each class of inputs to produce as its net output a vector of probabilities. Finally, a compete transfer function on the output of the second layer picks the maximum of these probabilities, and produces a 1 for that class and a 0 for the other classes. The architecture for this system is shown in Figure 1.

Database

The present study deals with the literary works of three contemporary Tamil scholars, namely Mahakavi Bharathi (MB), T. V. Kalyanasundaranar (TVK), and Subramaniya Iyer (SI). In the pre-independence period, these three scholars wrote a number of articles on India's Freedom Movement in the magazine called India. Initially, all three scholars published articles under their own names. The oppressive attitude of the then British regime later made all three writers write articles on the same topic anonymously in the same magazine. For this quantitative attribution study, all attributed articles of these three scholars written on India's Freedom Movement in the year 1906 are considered. Our study is based on nineteen articles of MB, seven of TVK and six of SI and twenty-three unattributed articles. Twenty-four function words and eighteen morphological variables have been used to quantify each sentence. Samples of these variables, with their meanings, are listed in Table 1 and Table 2.

Figure 1. Probabilistic Neural Network Model (input layer, radial basis function layer, competitive layer)

Table 1. List of Sample of Function Words

Function Words	Translation
Um	Also
Aakiyaal	As
Entraal	For
Entru	For
Pearil	On
Ul	Inside
Ai	Unmarked
Nodu	With
Lall	With
Aall	Unmarked
Ukku	To
Ill	In

Table 2. List of Sample of Morphological Variables with Abbreviations

Abbreviation	Variable Name
P_NOUN	Nouns
P_INT	Introductory
P_INF	Intensifiers
P_PRO	Pronouns
P_NUME	Numerals
P_CASE	Case Markers
P_ADVERB	Adverbs

A chi-square analysis of the nineteen articles of MB establishes that these articles do not differ from one another in terms of the frequency distribution of occurrence of these stylistic features. Similar results were obtained for the other two scholars (Manimannan and Bagavandas, 2001). Thus, each article is converted into a raw data matrix, and these raw data matrices form the basis of the data description. The nineteen articles of Mahakavi Bharathiar (MB) consist of three hundred and fifty-three sentences, the seven articles of Subramaniya Iyer (SI) consist of three hundred and fifteen sentences, and the six articles of T. V. Kalyanasundaram (TVK) consist of three hundred and eighty-two sentences. The twenty-three ambiguous articles together consist of three hundred and seventy-five sentences.
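As a rough illustration of the kind of chi-square homogeneity check mentioned above, the following Python sketch computes the Pearson chi-square statistic for a table whose rows are articles and whose columns are stylistic features. The function name `chi_square_homogeneity` and the counts are hypothetical and chosen only for illustration; the exact test procedure used in the original analysis is not specified in this paper.

```python
def chi_square_homogeneity(table):
    """Pearson chi-square statistic and degrees of freedom for a
    rows-by-columns table of observed counts (rows: articles,
    columns: stylistic features)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if every article shared one feature distribution.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts of three features in three articles (not real data).
toy_counts = [
    [12, 30, 8],
    [10, 28, 9],
    [14, 33, 7],
]
statistic, dof = chi_square_homogeneity(toy_counts)
print(f"chi-square = {statistic:.3f} on {dof} degrees of freedom")
```

A statistic that is small relative to the chi-square critical value for the given degrees of freedom would be consistent with the articles sharing a common frequency distribution.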

Results and Discussion

Nineteen articles written by Mahakavi Bharathiar (MB), seven articles by Subramaniya Iyer (SI) and six articles by T. V. Kalyanasundaram (TVK) have been considered for the analysis, each article being represented by the averages of its sampled sentences. The resulting 32 data points, representing the means of the corresponding sample writings of the three authors, are used. The Data matrix (P) consists of the 18 normalized morphological variables computed from the chosen articles. The Target matrix (T) has three rows, one for each author, each row consisting of zeroes and ones. The matrix element $t_{ij}$ is defined as

$$t_{ij} = \begin{cases} 1 & \text{if sample } j \text{ in the data matrix } P \text{ corresponds to author } i, \\ 0 & \text{otherwise.} \end{cases}$$

The radial basis function network is created with appropriate parameters and the data matrices. The articles of unknown authorship are also presented in the form of a Test matrix (X) consisting of the 18 normalized morphological variables, arranged as column vectors, each column corresponding to one sample article of unknown authorship. The entire analysis was also performed on the 24 normalized functional variables.
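The construction of the data points and the target matrix can be sketched in Python as follows. This is a minimal illustration, not the original MATLAB code: the helper names `article_mean` and `target_matrix` are hypothetical, the three-feature toy article is invented, and only the article counts (19, 7 and 6) are taken from this section.

```python
def article_mean(sentence_rows):
    """Average the per-sentence feature vectors of one article, column by column."""
    n = len(sentence_rows)
    return [sum(column) / n for column in zip(*sentence_rows)]


def target_matrix(labels, authors):
    """Build T with one row per author: T[i][j] is 1 if sample j was
    written by authors[i], and 0 otherwise."""
    return [[1 if label == author else 0 for label in labels]
            for author in authors]


authors = ["MB", "SI", "TVK"]
# One label per attributed article, using the counts given in this section.
labels = ["MB"] * 19 + ["SI"] * 7 + ["TVK"] * 6
T = target_matrix(labels, authors)

# Toy article with two sentences and three features (hypothetical values).
toy_article = [[2.0, 0.0, 1.0],
               [4.0, 2.0, 1.0]]
print(article_mean(toy_article))
print(len(T), "rows x", len(T[0]), "columns")
```

Each column of T contains exactly one 1, so every training sample is tied to a single author.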

The Neural Network Toolbox available with the MATLAB software is used for the analysis. It is assumed that there are Q input vector/target vector pairs. Each target vector has K elements. One of these elements is 1 and the rest are 0. Thus, each input vector is associated with one of K classes.

The first-layer input weights, InputWeight, are set to the transpose of the matrix formed from the Q training pairs, $P^{T}$. When an input is presented, the distance function produces a vector whose elements indicate how close the input is to the vectors of the training set. These elements are multiplied, element by element, by the bias and sent to the radial basis transfer function. An input vector close to a training vector is represented by a number close to 1 in the output vector $a_t$. If an input is close to several training vectors of a single class, it is represented by several elements of $a_t$ that are close to 1.

The second-layer weights, LayerWeight, are set to the matrix T of target vectors. Each vector has a 1 only in the row associated with that particular class of input, and 0s elsewhere. The multiplication $T a_t$ sums the elements of $a_t$ due to each of the K input classes. Finally, the second-layer transfer function, compete, produces a 1 corresponding to the largest element and 0s elsewhere. Thus, the network classifies the input vector into a specific class C because that class has the maximum probability of being correct.
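The two layers described above can be sketched in plain Python as follows. This is an illustrative reimplementation under stated assumptions, not the MATLAB toolbox code used in the study: the function name `pnn_forward`, the toy two-dimensional training vectors, and the bias value of 1.0 are all hypothetical.

```python
import math


def pnn_forward(x, train_vectors, T, bias):
    """Forward pass of a probabilistic neural network.

    x             -- input vector to classify
    train_vectors -- Q training vectors (the first-layer weights)
    T             -- K x Q target matrix of zeroes and ones
    bias          -- first-layer bias shared by all neurons
    Returns (one_hot, class_sums)."""
    # Layer 1: distance to every training vector, scaled by the bias,
    # then passed through the radial basis function exp(-k^2).
    activations = []
    for w in train_vectors:
        distance = math.sqrt(sum((wi - xi) ** 2 for wi, xi in zip(w, x)))
        activations.append(math.exp(-(distance * bias) ** 2))

    # Layer 2: multiplying by T sums the activations belonging to each class.
    class_sums = [sum(t * a for t, a in zip(row, activations)) for row in T]

    # Compete: 1 for the class with the largest sum, 0 elsewhere.
    winner = max(range(len(class_sums)), key=lambda i: class_sums[i])
    one_hot = [1 if i == winner else 0 for i in range(len(class_sums))]
    return one_hot, class_sums


# Toy data: two samples of class 0 and one sample of class 1 (hypothetical).
train_vectors = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0)]
T = [[1, 1, 0],
     [0, 0, 1]]
one_hot, class_sums = pnn_forward((0.05, 0.0), train_vectors, T, bias=1.0)
print(one_hot, class_sums)
```

Note that the class sums in this sketch are not normalized, so they rank the classes but do not add up to 1.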

The radial basis layer has as many neurons as there are input vectors in P. Each neuron's weighted input is the distance between the input vector and its weight vector, and each neuron's net input is the element-by-element product of its weighted input with its bias. The neuron's output is its net input passed through the radial basis transfer function $g(k) = \exp(-k^{2})$, where $k = \lVert \text{weight vector} - \text{input vector} \rVert \times \text{bias}$. Every bias in the first layer is set to 0.8326/SPREAD, which gives a radial basis transfer function that crosses 0.5 at weighted inputs of +/- SPREAD, where SPREAD lies between 0 and 1. The larger the SPREAD, the smoother the approximation. SPREAD has been varied from 0.1 to 1.0.
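The constant 0.8326 is the square root of ln 2, which is what makes exp(-k^2) equal 0.5 when the distance equals SPREAD. The short Python check below illustrates this for SPREAD values from 0.1 to 1.0; the function names `radbas` and `first_layer_bias` are hypothetical helpers written for this sketch, not toolbox calls.

```python
import math


def radbas(k):
    """Radial basis transfer function g(k) = exp(-k^2)."""
    return math.exp(-k * k)


def first_layer_bias(spread):
    """Bias that makes radbas(distance * bias) equal 0.5 at distance == spread."""
    return math.sqrt(math.log(2.0)) / spread


# SPREAD values 0.1, 0.2, ..., 1.0, as varied in the analysis.
spreads = [round(0.1 * i, 1) for i in range(1, 11)]
for spread in spreads:
    bias = first_layer_bias(spread)
    value_at_spread = radbas(spread * bias)
    print(f"SPREAD={spread:.1f}  bias={bias:.4f}  g at SPREAD={value_at_spread:.4f}")
```

A smaller SPREAD gives a larger bias, so each neuron responds only to inputs very close to its training vector, whereas a larger SPREAD lets more distant training vectors contribute to the class sums.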

All 23 writings of ambiguous authorship are assigned by the program to Mahakavi Bharathiar (MB), for the morphological as well as the functional variables. This result supports the claims made by many scholars that these 23 articles could have been written by Mahakavi Bharathiar (MB).

Conclusion

The present research takes up the problem of assigning articles of ambiguous authorship to one of three contemporary Tamil scholars of the same period, namely Mahakavi Bharathiar (MB), Subramaniya Iyer (SI), and T. V. Kalyanasundaram (TVK). To begin with, all three writers published their articles under their own names. The oppressive attitude of the then British regime later compelled all three patriots to write articles on the same theme anonymously.

The application of Neural Network models has increased considerably in pattern recognition and classification problems in the field of Stylometry over the last decade. Here the authorship attribution problem is addressed using a Probabilistic Neural Network to attribute the 23 articles of unknown authorship to one of the contemporary writers of the same period, first using morphological variables and then functional variables. All the articles of ambiguous authorship are attributed to Mahakavi Bharathiar (MB). This result supports the claims made by many scholars that these 23 articles could have been written by Mahakavi Bharathiar (MB).

References

Bailey, R. W. (1979), The Future of Computational Stylistics, Association for Literary and Linguistic Computing Bulletin, Vol. 7, 4-11, England.
Bishop, C. M. (2003), Neural Networks for Pattern Recognition (First Indian Edition), Oxford University Press, New Delhi.
Chandrasekaran, R. and Manimannan, G. (2008a), Neural Network Classification and Authorship Attribution of Articles of Unknown Authorship Using Radial Basis Function, Proceedings of the National Conference on Artificial Intelligence and Neural Networks, Department of Computer Applications, SRM University, Kattankulathur, Tamil Nadu, pp. 246-256.
Chandrasekaran, R. and Manimannan, G. (2008b), Use of Generalized Regression Neural Network in Authorship Attribution, Proceedings of the National Conference on Research Areas in Computer Science, Department of Computer Applications, SRM University, Ramapuram, Tamil Nadu.
Kjell, B. (1994), Authorship Determination Using Letter-pair Frequency Features with Neural Network Classifiers, Literary and Linguistic Computing, Vol. 9, 119-124, England.
Gose, E., Johnsonbaugh, R., and Jost, S. (1997), Pattern Recognition and Image Analysis, Prentice Hall Inc., New Jersey.
Haykin, S. (2001), Neural Networks: A Comprehensive Foundation (Second Edition), Pearson Education (Singapore), New Delhi.
Holmes, D. I. (1995), The Analysis of Literary Style: A Review, Journal of the Royal Statistical Society, Series A, Vol. 148, 328-334, England.
Holmes, D. I. and Forsyth, R. S. (1995), The Federalist Revisited: New Directions in Authorship Attribution, Literary and Linguistic Computing, Vol. 10, 111-127, England.
Manimannan, G. and Bagavandas, M. (2001), Authorship Attribution: The Case of Bharathiar, National Conference on Mathematical and Applied Statistics, Department of Statistics, Nagpur University, Nagpur.
Merriam, T. and Mathews, R. (1993), Neural Computation in Stylometry I: An Application to the Works of Shakespeare and Fletcher, Literary and Linguistic Computing, Vol. 8, 203-209, England.
Merriam, T. and Mathews, R. (1994), Neural Computation in Stylometry II: An Application to the Works of Shakespeare and Marlowe, Literary and Linguistic Computing, Vol. 9, 1-6, England.
Mosteller, F. and Wallace, D. L. (1968), Inference and Disputed Authorship: The Federalist Papers, Addison-Wesley, Massachusetts.
Powell, M. J. D. (1992), The Theory of Radial Basis Functions Approximation, in Advances of Numerical Analysis, pp. 105-210, Clarendon Press, Oxford.
Ripley, B. D. (1996), Pattern Recognition and Neural Networks, Cambridge University Press.
Tweedie, F. J., Singh, S., and Holmes, D. I. (1996), Neural Network Applications in Stylometry: The Federalist Papers, Computers and the Humanities, 39(1), 1-10.
Wasserman, P. D. (1989), Neural Computing: Theory and Practice, Van Nostrand Reinhold, New York.
Wasserman, P. D. (1993), Advanced Methods in Neural Computing, Van Nostrand Reinhold, New York.
