Extracting Protein Names from Biological Literature
Abstract
Named entity recognition is an essential task in extracting biological knowledge. In a biological corpus, protein names and other terminology are mixed into natural-language sentences. Whether an abbreviation is a protein name sometimes depends on the context. Protein names are often composed of gene names, cell names, or even drug names, and the number of newly coined protein names keeps increasing. Even with the assistance of a dictionary, it is still hard to automatically and correctly identify all protein names in a biological corpus. We modify a hierarchical model of protein name tokens. On the one hand, we use a rule-based method to improve the accuracy of protein name recognition. On the other hand, we use an N-gram language model to determine the boundaries of protein names. Numerous studies have noted that the hardest part is identifying abbreviations and words beginning with an uppercase letter. To enhance recognition performance, we use a dictionary to strengthen the recognition of abbreviations and words beginning with an uppercase letter. Experimental results show a performance increase of about 10%. We use the YAPEX and GENIA corpora for our experiments. Our method achieves an F-score of 0.697 on the YAPEX corpus and 0.691 on the GENIA corpus. Finally, when abbreviation recognition is strengthened with the UniProt dictionary database, the F-score reaches 0.797 on the YAPEX corpus and 0.806 on the GENIA corpus.
Keywords
Named Entity Recognition; Protein Name Recognition; N-gram Language Model
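
The dictionary step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the candidate test in `is_candidate`, the function `tag_with_dictionary`, and the four-entry `TOY_DICTIONARY` are all hypothetical and stand in for the real rules and for a full resource such as UniProt. The idea shown is only that tokens that look like abbreviations or begin with an uppercase letter are looked up in a dictionary before being accepted as protein names.

```python
# Toy stand-in for a protein-name dictionary; entries are illustrative only
# and are stored in lowercase so that lookup is case-insensitive.
TOY_DICTIONARY = {"p53", "stat3", "il-2", "insulin"}


def is_candidate(token):
    """Assumed heuristic: a token is a candidate if it starts with an
    uppercase letter, has two or more uppercase letters, or mixes letters
    and digits (a rough proxy for an abbreviation)."""
    upper_count = sum(ch.isupper() for ch in token)
    has_digit = any(ch.isdigit() for ch in token)
    has_alpha = any(ch.isalpha() for ch in token)
    return token[:1].isupper() or upper_count >= 2 or (has_digit and has_alpha)


def tag_with_dictionary(tokens, dictionary):
    """Pair each token with True if it is a candidate found in the dictionary."""
    return [(tok, is_candidate(tok) and tok.lower() in dictionary) for tok in tokens]


if __name__ == "__main__":
    sentence = "The p53 protein activates STAT3".split()
    for token, is_protein in tag_with_dictionary(sentence, TOY_DICTIONARY):
        print(token, is_protein)
```

In this sketch a capitalized ordinary word such as "The" passes the candidate test but is rejected by the lookup, which is the kind of ambiguity the dictionary is meant to resolve.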