Persian Word Sense Disambiguation Corpus Extraction Based on Web Crawler Method
Abstract
Finding an appropriate dataset is one of the main challenges for researchers in natural language processing. The problem is more acute for non-Latin languages, and for Persian in particular. Access to an appropriate dataset for developing practical language processing programs helps to validate results and makes it possible to compare and analyze research studies in the field precisely. This paper presents a procedure for extracting a standard dataset for Persian. The dataset is intended solely for research on word sense disambiguation in Persian. Documents containing the ambiguous words of interest are collected by a web crawler; the words are then processed and registered in a Persian dataset of ambiguous words. In this research, three common ambiguous Persian words are used to extract appropriate phrases that contain them. Finally, a framework is presented for creating a configuration suitable for use in word sense disambiguation problems. This method offers a solution to the lack of a suitable word sense disambiguation corpus for Persian.
Keywords
Natural language processing; Word sense disambiguation; Information extraction; Corpus; Machine learning