Building an efficient curation workflow for the Arabidopsis literature corpus

Table 1

Data elements extracted from literature and controlled vocabularies used in curation

Data elements in literature	Controlled vocabularies	Examples and links to CV (if available)
Gene	TAIR locus identifiers	AT5G46330
Gene function	GO	GO:0009908 flower development http://amigo.geneontology.org/cgi-bin/amigo/browse.cgi
Gene expression pattern	PO	PO:0009046 flower http://www.plantontology.org/
Polymorphism	TAIR-controlled vocabulary	Insertion, substitution
Phenotype	Free text transitioning to plant-specific Phenotype Ontology	Female sterile; altered ovule development; integuments fail to cover nucellus; reduced plant height; reduced length in inflorescence internodes; reduced levels of pollen, smaller leaves; late flowering, slow growth.

CV, controlled vocabularies

Gene function data presented in the literature are captured using GO vocabularies consisting of molecular function, biological process and cellular component terms. We use PO vocabularies for plant anatomy and plant growth and developmental stages to annotate gene expression. GO evidence codes and references are attached to each annotation to provide provenance, as well as a pointer to the original published result. Examples of GO and PO annotations are shown in Table 2.

Table 2

Examples of GO and PO annotation in TAIR

	Gene	Relationship type	Term	Evidence type	Reference	Annotated by and date
GO annotation	AT5G46330	involved in	defense response to bacterium GO:0042742	inferred from mutant phenotype: analysis of physiological response	Xiang et al. (2007)	The Arabidopsis Information Resource/ 2008-01-15
PO annotation	AT5G46330	expressed in	root tip PO:0000025	inferred from direct assay: localization of GFP/YFP fusion protein	Robatzek et al. (2006)	The Arabidopsis Information Resource/ 2007-04-04

Adapted from TAIR locus detail page:

http://www.arabidopsis.org/servlets/TairObject?accession=locus:2170483
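
As a rough illustration of the columns in Table 2, each annotation can be modelled as a small record. The sketch below is not TAIR's database schema; the class name, field names and the `ontology` helper are assumptions chosen only to mirror the table columns, and the two records are transcribed from Table 2.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Annotation:
    """One ontology annotation; field names are illustrative, not TAIR's schema."""
    gene: str           # TAIR locus identifier
    relationship: str   # e.g. "involved in", "expressed in"
    term_name: str      # human-readable ontology term
    term_id: str        # GO or PO identifier
    evidence: str       # evidence type and its description
    reference: str      # pointer to the published result
    annotated_by: str
    annotated_on: date

    @property
    def ontology(self) -> str:
        """Ontology prefix of the term identifier, e.g. 'GO' or 'PO'."""
        return self.term_id.split(":", 1)[0]


# The two rows of Table 2, transcribed as records.
TABLE_2 = [
    Annotation(
        "AT5G46330", "involved in", "defense response to bacterium", "GO:0042742",
        "inferred from mutant phenotype: analysis of physiological response",
        "Xiang et al. (2007)", "The Arabidopsis Information Resource", date(2008, 1, 15),
    ),
    Annotation(
        "AT5G46330", "expressed in", "root tip", "PO:0000025",
        "inferred from direct assay: localization of GFP/YFP fusion protein",
        "Robatzek et al. (2006)", "The Arabidopsis Information Resource", date(2007, 4, 4),
    ),
]
```

Keeping the evidence type and reference on every record reflects the provenance requirement described above: each annotation carries its own pointer back to the published result.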

We have developed sets of controlled vocabulary terms to capture polymorphism information (e.g. polymorphism type, mutagen) from the literature, and we also allow a curator-created free text description to be attached to a polymorphism. Germplasm refers to a strain with a unique genotype. We have likewise developed controlled vocabulary terms to capture germplasm information such as the species variant (typically a lab strain or natural variant, known as an ecotype in the Arabidopsis community) and the alleles known to be present in the germplasm; a free text description can also be attached to a germplasm. Phenotypes are currently described using free text. We are working with the community to supplement these phenotype descriptions with ontology-based phenotype annotations using PO, GO, ChEBI (Chemical Entities of Biological Interest) (14, 15) and PATO (Phenotypic Quality Ontology) (16). As with the GO and PO annotations, references are provided for the phenotypes so that users can refer back to the original published result. Table 3 shows how polymorphism and phenotype data are represented within TAIR.

Table 3

Representation of polymorphism and phenotype in TAIR

Polymorphism name	Gene ID	Polymorphism type	Polymorphism site	Inheritance	Germplasm	Phenotype	Reference
fls2-17	AT5G46330.1	substitution	exon	recessive	FLS2-17	Mutant seedlings treated with 10-μM flg22 peptide (strong growth inhibitor) display shoot and root growth similar to that of wild-type Ler.	Gomez-Gomez et al., 2000

Adapted from TAIR polymorphism page:

http://www.arabidopsis.org/servlets/TairObject?id=500245330&type=polyallele

Use of text mining in curation

In our current literature curation workflow, we use an Aho–Corasick keyword-searching algorithm to extract gene names and other keywords from article titles and abstracts and create associations between the terms and the articles. A manual verification step follows to confirm that the gene–article link is valid. Manual extraction of gene names and other keywords from literature followed by adding the associations to the database is a tedious process; the use of this algorithm for generating the putative links therefore greatly improves efficiency.

During the curation process, curators frequently must consult additional articles, e.g. to disambiguate a gene symbol or to track down specific information such as the mutation sites in certain alleles. In collaboration with our team, the WormBase team at Caltech has applied the Textpresso text mining tool (17) to the TAIR Arabidopsis literature corpus to produce Textpresso for Arabidopsis (http://www.textpresso.org/arabidopsis/). This tool, housed at Caltech, is available to both TAIR curators and general users and allows users to search over 43 151 abstracts (including some conference abstracts) and 33 955 full-text publications related to Arabidopsis (numbers as of July 2012). Users can search using specific keyword categories including A. thaliana gene names, GO and PO terms or combinations of keywords to narrow their search results. Sentences that contain matching keywords are retrieved together with bibliographic information so that users can quickly confirm the usefulness of a particular article and link directly to the full text, if they have the appropriate subscriptions to the journals in question.

Recently, in collaboration with the same Textpresso team at WormBase, we have developed a semi-automated curation process to identify articles with cellular component information and create annotations from them. In this approach (18), the entire available Arabidopsis full-text literature corpus is processed by Textpresso; sentences that contain Arabidopsis gene names, protein subcellular localization data and assay-related words are extracted, and GO annotations are suggested. A curator then manually validates each suggested annotation. Validated annotations are exported as a flat file consisting of the required fields (gene identifiers, GO term identifiers, evidence codes, references, annotation date). A curator then reformats the file before loading it into TAIR. Details of this approach are reported in a separate publication (Van Auken, K. et al., submitted for publication).
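
The two mechanical steps of this process, selecting candidate sentences and exporting validated annotations, might look roughly like the sketch below. It is not Textpresso's implementation: the three tiny lexicons, the function names, the column names and the `validated` flag are all assumptions made for illustration. The sample record reuses the GO annotation from Table 2 (a biological process term, with `IMP` as the GO code for 'inferred from mutant phenotype') purely to show the file layout.

```python
import csv
import io
import re

# Hypothetical category lexicons; real category lists would be far larger.
GENE_NAMES = {"FLS2"}
LOCALIZATION_TERMS = {"plasma membrane", "nucleus", "chloroplast"}
ASSAY_TERMS = {"gfp", "yfp", "immunolocalization"}

# Assumed column names for the five required fields.
FIELDS = ["gene_id", "go_id", "evidence_code", "reference", "annotation_date"]


def candidate_sentences(text):
    """Keep sentences that mention a gene, a compartment and an assay word."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for sentence in sentences:
        lower = sentence.lower()
        if (any(g.lower() in lower for g in GENE_NAMES)
                and any(t in lower for t in LOCALIZATION_TERMS)
                and any(a in lower for a in ASSAY_TERMS)):
            hits.append(sentence)
    return hits


def export_flat_file(annotations):
    """Write only curator-validated annotations as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for annotation in annotations:
        if annotation["validated"]:
            writer.writerow({key: annotation[key] for key in FIELDS})
    return buffer.getvalue()


# Invented text, used only to exercise the sentence filter.
sample = "FLS2-GFP signal was observed at the plasma membrane. Plants flowered late."
suggested = [
    {"gene_id": "AT5G46330", "go_id": "GO:0042742", "evidence_code": "IMP",
     "reference": "Xiang et al. (2007)", "annotation_date": "2008-01-15",
     "validated": True},
    {"gene_id": "AT5G46330", "go_id": "GO:0042742", "evidence_code": "IMP",
     "reference": "Xiang et al. (2007)", "annotation_date": "2008-01-15",
     "validated": False},  # rejected by the curator, so not exported
]
flat_file = export_flat_file(suggested)
```

Writing the export in the layout that the database loader expects would remove the manual reformatting step mentioned above; whether that is feasible depends on the loader, which is not described here.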

Summary and future directions

Our workflow efficiently prioritizes a manageable set of articles for curation by a small and well-trained curation team. Though we currently focus on extracting data from recently published articles about genes that have not previously been characterized, our workflow is flexible enough to adapt to other prioritization schemes. The workflow can be easily reverted to the journal impact factor-based prioritization or adapted to a new priority, e.g. articles about a specific set of genes or a gene family. In the latter case, we would simply retrieve the set of papers associated with that specific gene set and mark them as ‘first’ priority.
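
Re-prioritizing for a gene set can be pictured as a simple lookup over the verified gene–article links. The sketch below assumes those links are available as a mapping from article ID to locus identifiers; the function name, the article IDs, the `unranked` label and the placeholder locus `AT0G00000` are hypothetical, and only the 'first' label comes from the workflow described above.

```python
def mark_priority(article_genes, target_genes, default="unranked"):
    """Label an article 'first' if it is linked to any gene in the target set."""
    targets = set(target_genes)
    return {
        article_id: "first" if targets & set(genes) else default
        for article_id, genes in article_genes.items()
    }


# Hypothetical gene-article links; AT0G00000 is a placeholder, not a real locus.
article_genes = {
    "article-1": {"AT5G46330"},
    "article-2": {"AT0G00000"},
    "article-3": set(),
}
priorities = mark_priority(article_genes, {"AT5G46330"})
```

Because the priority is derived from the links rather than stored with the article, switching to a different gene family only requires passing a different target set.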

Manual literature curation by professional curators produces accurate and consistent annotations. Yet manual extraction of gene function information from the literature and conversion into ontology-based annotations is a labor-intensive process. On average, a curator spends about 4 min per abstract (Wei, C.H. et al., submitted for publication) for article selection (verification of gene mentions and article priority ranking). Time spent on full-text curation varies widely from 0.5 h to >4 h per article depending on the complexity of the paper. This very time-consuming, detail-oriented, manual process of interpreting the text to extract and convert gene functions into ontology-based annotations has been the major bottleneck in our current workflow. In recent years, TAIR’s curation team has been able to curate only a fraction of newly published Arabidopsis articles (∼30%).

There is a strong incentive to develop a more effective text mining system to automate the extraction and conversion of gene function information from the research literature into annotations based on standard, community-developed ontologies of biological concepts. This translates into an urgent need for tools capable of publication retrieval, entity recognition (including gene name, gene function, expression pattern, polymorphism and phenotype) and mapping of free text descriptions of gene function to ontology terms, with the last of these representing the most severe bottleneck in our workflow. The ongoing BioCreative workshops are demonstrating increasing sophistication in more straightforward tasks such as entity recognition (19), protein–protein interaction (20) and basic document triage. The BioCreative tasks relating to the user interface (21) and to ontology term mining are just getting underway and early results are very promising. Nevertheless, the holy grail, full automation of the literature curation process, is not visible on the horizon.

To improve effectiveness and productivity in the curation process at TAIR in the relatively near future, it will be necessary to develop a fully integrated environment that assists a curator at all levels of the process described here. This vision requires four factors: first, a focus on incremental improvements that will assist the curator (rather than the more distant and possibly unreachable goal of fully automating the process); second, involving the scientific community in the curation process; third, providing tools that are easily integrated into common programming frameworks through web services or programming APIs; and fourth, building frameworks and interfaces that use the latest human–computer interface theory and technology to improve community and curator productivity at the same time as introducing machine learning and automation.

Funding

The National Science Foundation (grant DBI-0850219); the National Institutes of Health National Human Genome Research Institute (NHGRI) (grant 5P41HG002273-09 for gene function curation, partial funding). Additional support for gene function curation comes from the TAIR sponsorship program (see http://arabidopsis.org/doc/about/tair_sponsors/413 for a complete list of sponsors). Funding for open access publication is provided by the National Science Foundation.

Conflict of interest. None declared.

Acknowledgements

We would like to thank the BioCreative 2012 Workshop Steering Committee for the opportunity to participate in the workshop.

References

1. Howe D, Costanzo M, Fey P, et al. Big data: The future of biocuration, Nature, 2008, vol. 455 (pg. 47-50)

2. Swarbreck D, Wilks C, Lamesch P, et al. The Arabidopsis Information Resource (TAIR): Gene structure and function annotation, Nucleic Acids Res., 2008, vol. 36 (pg. D1009-D1014)

3. Lamesch P, Berardini TZ, Li D, et al. The Arabidopsis Information Resource (TAIR): Improved gene annotation and new tools, Nucleic Acids Res., 2012, vol. 40 (pg. D1202-D1210)

4. Yoo D, Xu I, Berardini TZ, et al. PubSearch and PubFetch: a simple management system for semiautomated retrieval and annotation of biological information from the literature, Curr. Protoc. Bioinformatics, 2006, Chapter 9, Unit 9.7

5. Ort DR, Grennan AK. Plant physiology and TAIR partnership, Plant Physiol., 2008, vol. 146 (pg. 1022-1023)

6. Berardini TZ, Li D, Muller R, Chetty R, et al. Assessment of community-submitted ontology annotations from a novel database-journal partnership, Database, 2012, DOI: 10.1093/database/bas030

7. Arighi CN, Lu Z, Krallinger M, et al. Overview of the BioCreative III workshop, BMC Bioinformatics, 2011, vol. 12 (pg. S1)

8. Hirschman L, Burns GA, Krallinger M, et al. Text mining for the biocuration workflow, Database, 2012, DOI: 10.1093/database/bas020

9. Stark C, Breitkreutz BJ, Chatr-Aryamontri A, et al. The BioGRID Interaction Database: 2011 update, Nucleic Acids Res., 2011, vol. 39 (pg. D698-D704)

10. Sayers EW, Barrett T, Benson DA. Database resources of the National Center for Biotechnology Information, Nucleic Acids Res., 2012, vol. 40 (pg. D13-D25)

11. Aho AV, Corasick MJ. Efficient string matching: an aid to bibliographic search, Commun. ACM, 1975, vol. 18 (pg. 333-340)

12. The Gene Ontology Consortium. The Gene Ontology in 2010: Extensions and refinements, Nucleic Acids Res., 2010, vol. 38 (pg. D331-D335)

13. Jaiswal P, Avraham S, Ilic K, et al. Plant Ontology (PO): a controlled vocabulary of plant structures and growth stages, Comp. Funct. Genomics, 2005, vol. 6 (pg. 388-397)

14. Degtyarenko K, de Matos P, Ennis M, et al. ChEBI: A database and ontology for chemical entities of biological interest, Nucleic Acids Res., 2008, vol. 36 (pg. D344-D350)

15. de Matos P, Alcántara R, Dekker A, et al. Chemical Entities of Biological Interest: An update, Nucleic Acids Res., 2010, vol. 38 (pg. D249-D254)

16. Gkoutos GV, Mungall C, Dolken S, et al. Entity/quality-based logical definitions for the human skeletal phenome using PATO, Conf. Proc. IEEE Eng. Med. Biol. Soc., 2009, vol. 2009 (pg. 7069-7072)