Author(s): Indhuja K, Indu M, Sreejith C, P. C. Reghu Raj
Published in: International Journal of Engineering Research & Technology
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Volume/Issue: Vol. 3, Issue 4 (April 2014)
Text-based language identification is the task of automatically recognizing the language of a given text document. It is more difficult to discriminate between languages within a language family than between languages from different families. In this paper, we investigate the performance of statistical measures for text-based language identification, with an emphasis on five languages used in India that are written in the Devanagari script: Hindi, Sanskrit, Marathi, Nepali and Bhojpuri. The proposed system uses n-grams as features for classification. Language identification is an important pre-processing step in many Natural Language Processing (NLP) tasks. In a multilingual society like India, there is wide scope for automatic language identification, since it would be a vital step in bridging the digital divide between the Indian masses and the world.
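To make the idea of n-gram features concrete, the following is a minimal sketch of character n-gram language identification. It is not the system described in the paper: the cosine-similarity comparison, the function names (`char_ngrams`, `profile`, `cosine`, `identify`) and the two one-sentence training samples are all assumptions chosen for illustration, and a real system would train on much larger corpora for all five languages.

```python
from collections import Counter
import math


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def profile(text, n_values=(1, 2, 3)):
    """Count character n-grams of several orders into one frequency profile."""
    counts = Counter()
    for n in n_values:
        counts.update(char_ngrams(text, n))
    return counts


def cosine(p, q):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def identify(text, profiles):
    """Return the language whose profile is most similar to `text`."""
    query = profile(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Hypothetical toy training data (assumption: one short sentence per
# language, far too little for real use).
training = {
    "hindi": "यह एक किताब है और वह मेरा घर है",
    "marathi": "हे एक पुस्तक आहे आणि ते माझे घर आहे",
}
profiles = {lang: profile(text) for lang, text in training.items()}

print(identify("माझे पुस्तक आहे", profiles))
```

Because closely related languages in the same script share most of their characters, the higher-order n-grams (bigrams and trigrams here) carry most of the discriminating signal in a sketch like this.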