Hybrid approach for spell checking of Tamil language

Jananie, S.Sarveswaran, K.2025-12-092025-12-092014-07-04Proceedings Peradeniya University International Research Sessions (iPURSE) - 2014, University of Peradeniya, P 401978 955 589 180 613914111https://ir.lib.pdn.ac.lk/handle/20.500.14444/7134The spell checkers are specialised application programs that flags words in a document that may be misspelled. Though there are several spell checkers available for languages like English, no fully functional application is available for the Tamil language. The existing systems either find the misspelled words from an existing list of words stored in those systems or 𝘊𝘢𝘯𝘵𝘪 mistakes. Omission of a required letter or inclusion of an inappropriate letter between two adjoined words is called 𝘊𝘢𝘯𝘵𝘪 mistake. Further, several issues have been also identified in these systems. A new approach for Tamil spell checker has been proposed in this research by integrating existing approaches and new approaches such as rule-base, crowd sourcing and suggestions generation using character level 𝘯-𝘨𝘳𝘢𝘮. According to the proposed approach, each word is checked whether it exists in the dictionary using a Levenshtein distance finding algorithm. If it does not exist, then the 𝘯-𝘨𝘳𝘢𝘮 based technique is used to generate possible suggestions for the given word. And required rules are written to get the appropriate suggestions by considering 𝘊𝘢𝘯𝘵𝘪 check as well to identify the appropriate joining letter of two adjoined words. A list of 250,000 unique and error-free words are included in the dictionary. These words have been collected from various sources, including websites. It is very difficult to gather all the words in Tamil language. Therefore, add to dictionary option has been introduced to collect new words from users and add to the existing dictionary after the moderation. To reduce the search space, the dictionary has been divided into different files based on the first letter of the word. Due to the complex nature of Tamil script compared to English, stacks and lists have been used during the processing of words. These rules have been written in such a way that it can be extended further in future. All these processing is being done without Romanising the Tamil text, while in most of the other approaches Tamil language is processed in Romanised form. The proposed system gives better accuracy than the existing systems; 85.77% accuracy was noted when considering the suggestions generation. This result had been calculated by analysing the suggestions generated by the system for the words that are not in the dictionary. Hence the proposed approach, which has dictionary check with 𝘓𝘦𝘷𝘦𝘯𝘴𝘩𝘵𝘦𝘪𝘯 algorithm, suggestions generation with 𝘯-𝘨𝘳𝘢𝘮, 𝘊𝘢𝘯𝘵𝘪 check with a rule-base and crowd sourcing, is a complete solution for Tamil spell checking.enITMathematics and StatisticsTamil languageSpell checkersHybrid approachLevenshtein algorithmHybrid approach for spell checking of Tamil languageArticle