Extracting Persian-English Parallel Sentences from Document Level Aligned Comparable Corpus using Bi-Directional Translation
Abstract
Bilingual parallel corpora are very important in various fields of natural language processing (NLP). The quality of a Statistical Machine Translation (SMT) system depends strongly on the amount of training data. For low-resource language pairs such as Persian-English, there are not enough parallel sentences to build an accurate SMT system. This paper describes a new approach that uses Wikipedia as a comparable corpus to extract Persian-English parallel sentences and ultimately improve SMT system performance. The approach is also applicable to other low-resource language pairs. To calculate the similarity score between two sentences, a novel bi-directional translation-based information retrieval system is proposed. A length penalty score is introduced to increase the accuracy of the extracted corpus. Using the extracted parallel sentences, the performance of an existing Persian-English SMT system is improved drastically.
Keywords
comparable corpus; bi-directional translation; statistical machine translation; Wikipedia; information retrieval
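
As a rough illustration of the idea summarized in the abstract, the sketch below scores a candidate sentence pair in both translation directions and scales the result by a length penalty. The abstract does not give the actual formulas, so everything here is an assumption made for illustration: the bag-of-words cosine similarity, the equal weighting of the two directions, the min/max length ratio used as the penalty, and the toy word-for-word dictionary standing in for a real translation component. The names `make_translator`, `cosine`, `length_penalty`, and `bidirectional_score` are hypothetical.

```python
import math
from collections import Counter


def make_translator(dictionary):
    """Return a toy word-for-word translator (assumed stand-in for an SMT system).

    Words missing from the dictionary are passed through unchanged.
    """
    def translate(tokens):
        return [dictionary.get(tok, tok) for tok in tokens]
    return translate


def cosine(a_tokens, b_tokens):
    """Bag-of-words cosine similarity between two token lists."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def length_penalty(n_src, n_tgt):
    """Assumed penalty: ratio of the shorter length to the longer one."""
    if n_src == 0 or n_tgt == 0:
        return 0.0
    return min(n_src, n_tgt) / max(n_src, n_tgt)


def bidirectional_score(fa_tokens, en_tokens, fa_to_en, en_to_fa):
    """Average the similarity in both translation directions, then apply the penalty."""
    forward = cosine(fa_to_en(fa_tokens), en_tokens)   # Persian -> English side
    backward = cosine(en_to_fa(en_tokens), fa_tokens)  # English -> Persian side
    return 0.5 * (forward + backward) * length_penalty(len(fa_tokens), len(en_tokens))


# Toy transliterated vocabulary, for illustration only.
FA_EN = {"ketab": "book", "khub": "good"}
EN_FA = {en: fa for fa, en in FA_EN.items()}
fa_to_en = make_translator(FA_EN)
en_to_fa = make_translator(EN_FA)

matched = bidirectional_score(["ketab", "khub"], ["good", "book"], fa_to_en, en_to_fa)
padded = bidirectional_score(
    ["ketab", "khub"], ["good", "book", "very", "much"], fa_to_en, en_to_fa
)
print(matched, padded)
```

In this toy setting the pair with matching content and equal length scores higher than the pair where the English side carries extra words, which is the behavior a length penalty is meant to encourage. A real system would replace the dictionary lookup with translation models and would tune the threshold on the final score.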