BioCreative V CDR task corpus: a resource for chemical disease relation extraction

Li, Jiao; Sun, Yueping; Johnson, Robin J.; Sciaky, Daniela; Wei, Chih-Hsuan; Leaman, Robert; Davis, Allan Peter; Mattingly, Carolyn J.; Wiegers, Thomas C.; Lu, Zhiyong

doi:10.1093/database/baw068

Abstract

Community-run, formal evaluations and manually annotated text corpora are critically important for advancing biomedical text-mining research. Recently in BioCreative V, a new challenge was organized for the tasks of disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction. Given the nature of both tasks, a test collection is required to contain both disease/chemical annotations and relation annotations in the same set of articles. Despite previous efforts in biomedical corpus construction, none was found to be sufficient for the task. Thus, we developed our own corpus called BC5CDR during the challenge by inviting a team of Medical Subject Headings (MeSH) indexers for disease/chemical entity annotation and Comparative Toxicogenomics Database (CTD) curators for CID relation annotation. To ensure high annotation quality and productivity, detailed annotation guidelines and automatic annotation tools were provided. The resulting BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions. Each entity annotation includes both the mention text spans and normalized concept identifiers, using MeSH as the controlled vocabulary. To ensure accuracy, the entities were first captured independently by two annotators followed by a consensus annotation: The average inter-annotator agreement (IAA) scores were 87.49% and 96.05% for the disease and chemicals, respectively, in the test set according to the Jaccard similarity coefficient. Our corpus was successfully used for the BioCreative V challenge tasks and should serve as a valuable resource for the text-mining research community.

Database URL : http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/

Introduction

Relations between chemicals and diseases (Chemical-Disease Relations or CDRs) play critical roles in drug discovery, biocuration, drug safety, etc. ( 1 ). Because of their critical significance, CDRs are being manually curated by resources such as the Comparative Toxicogenomic Database (CTD; http://ctdbase.org ) ( 2 , 3 ). Due to the high cost of manual curation and rapid growth of the biomedical literature, several attempts have been made to assist curation using text-mining systems ( 4 , 5 ) including the automatic extraction of CDRs ( 6 ). These attempts have met with limited success, however, due in part to the lack of a large-scale training corpus. Through BioCreative V in 2015, one of the major formal evaluations for text-mining research ( 7 ), a new challenge was organized to advance the state-of-the-art in extraction of CDRs ( 8 ). The challenge included two subtasks: disease named entity recognition (DNER) task and chemical-induced disease (CID) relation extraction task.

To support both tasks, a text corpus of PubMed abstracts containing annotations of both chemical/diseases and their interactions is desirable. Despite the existence of many biomedical corpora (see ( 9 ) for a brief review) including a few specifically targeting diseases ( 10–12 ) and chemicals ( 13 ), there were none that fulfilled the following content criteria: (i) inclusion of instances of chemical-disease relation annotations that are asserted from both within and across sentence boundaries; (ii) abstracts containing complete chemical, disease and relation annotations; (iii) chemical/disease annotations grounded in concept identifiers via a controlled vocabulary. Thus, we proposed building a new corpus that satisfies these three requirements.

The proposed corpus is related to some previous efforts in corpus annotation for biomedical information extraction research, such as protein–protein interaction ( 14 ) and drug-drug interaction ( 15 ). It is also significantly different from the previously constructed corpora for mining adverse drug reaction/effects in terms of the annotation scope (CID relations), requirements (see above) and size (1500 articles). As shown in Table 1 , the EU-ADR corpus contains a total of 300 PubMed articles with 739 drugs, 812 diseases and 300 drug-disease associated relations at sentence level ( 16 ). The ADE corpus consists of 2972 PubMed articles with sentence-level statements of 5776 adverse effects related to 5063 drugs ( 17 ). The corpus developed by ( 18 ) served for disease and adverse effect named entity recognition tasks rather than relation extraction.

Table 1.

Comparison with the previous chemical disease relation corpora

Corpus	Annotation scope	Size	Entity annotation—Mention	Entity annotation—Concept	Relation annotation
BC5CDR	Abstract	1500	Yes	Yes	Yes
EU-ADR (16)	Sentence	300	Yes	Yes	Yes
ADE (17)	Sentence	2972	Yes	No	Yes
Corpus (18)	Abstract	400	Yes	Yes	No

Corpus	Annotation scope	Size	Entity annotation—Mention	Entity annotation—Concept	Relation annotation
BC5CDR	Abstract	1500	Yes	Yes	Yes
EU-ADR (16)	Sentence	300	Yes	Yes	Yes
ADE (17)	Sentence	2972	Yes	No	Yes
Corpus (18)	Abstract	400	Yes	Yes	No

Open in new tab

Table 1.

Comparison with the previous chemical disease relation corpora

Corpus	Annotation scope	Size	Entity annotation—Mention	Entity annotation—Concept	Relation annotation
BC5CDR	Abstract	1500	Yes	Yes	Yes
EU-ADR (16)	Sentence	300	Yes	Yes	Yes
ADE (17)	Sentence	2972	Yes	No	Yes
Corpus (18)	Abstract	400	Yes	Yes	No

Corpus	Annotation scope	Size	Entity annotation—Mention	Entity annotation—Concept	Relation annotation
BC5CDR	Abstract	1500	Yes	Yes	Yes
EU-ADR (16)	Sentence	300	Yes	Yes	Yes
ADE (17)	Sentence	2972	Yes	No	Yes
Corpus (18)	Abstract	400	Yes	Yes	No

Open in new tab

Methods and materials

Article selection

We selected a total of 1500 articles for the CDR task, split into three subsets: 500 each for the training, development and test sets. The training, development and most (400) of the test set were randomly selected from the CTD-Pfizer corpus, which was generated via a previous collaboration between CTD and Pfizer, and comprises over 150 000 chemical-disease relations from 88 000 articles ( 19 ).

To ensure we have some unseen data for the task participants, the remaining 100 articles of the test set were annotated during the challenge (i.e. not selected from the previous CTD-Pfizer corpus) and their curation was not made public until the BioCreative V challenge was complete. We used the following method to select the 100 articles to ensure they would have a similar distribution of words as the training and development sets. For each of the 1000 articles in the training and development sets, we retrieved the list of related articles using PubMed E-utilities. We removed from consideration any articles that did not meet our selection criteria. Specifically, the target article must be in English, contain an abstract, and be published in 2014 or later. For each new article, we computed an overall score by summing the similarity scores ( 20 , 21 ) between the target article and each article in the training and development sets. We also determined an overall similarity score for each article in the training and development sets with a similarity score calculated using all other articles in the training and development sets. We then selected the final set by sampling with replacement from the similarity distribution of the training and development sets: we randomly selected an article from the training or development sets, obtained its similarity score, and then selected the new article with the closest similarity score. The resulting articles are approximately as well related to the articles in the training and development sets, in terms of similarity scoring, as the articles in the training and development sets are related to one another.

Annotation tasks

We performed manual annotation of all chemicals and diseases mentioned in the 1500 articles. For each entity occurrence, we not only annotated its text span but also assigned a relevant concept identifier from MeSH ( 22 ). As shown in Figure 1 , three diseases mentioned in the abstract were highlighted by our automated tool for potential consideration by the MeSH annotators, along with three occurrences of the same chemical (Lidocaine).

Figure 1.

Open in new tab Download slide

Annotation example shown in our annotation tool, PubTator.

As indicated above, we largely leveraged the previous annotation of chemical-disease relationships from the CTD-Pfizer dataset for 1400 of the 1500 articles with few changes: (i) we removed relations that required entities not found in abstracts; (ii) we removed relations that were not disease specific (e.g. ‘Drug-Related Side Effects and Adverse Reactions’ (D064420)); and (iii) we updated a few CTD relations due to the MeSH vocabulary changes (the CTD-Pfizer project was conducted in years 2011/12, and the MeSH vocabulary has changed since then).

We performed new manual annotation of chemical–disease relations for the remaining 100 articles in the test set. For the BioCreative V challenge task, the CID relations refer to two types of relationships between a chemical and a disease in CTD:

Putative mechanistic relation between a chemical and a disease indicates that the chemical may play a role in the etiology of the disease (e.g. exposure to chemical X causes lung cancer). Figure 1 shows an example of such a CTD curated relationship between Lidocaine and Heart Arrest (disease term for the synonym ‘asystole’ used by the authors in the abstract).
Biomarker relation between a chemical and a disease indicates that the chemical correlates with the disease (e.g. increased abundance in the brain of chemical X correlates with Alzheimer disease).

CTD curators used their standard curation process for CDR curation ( 23 ). Curation was limited to the title and abstract except in cases where reference to the full text was required for clarification; abstracts that required full-text curation were removed from the corpus. In addition to CDR curation, all observed interactions and relationships applicable to CTD were curated for each abstract. CTD triaged and/or curated 143 articles in conjunction with BioCreative V; the final 100 selected for inclusion in the Test Dataset represented abstract-only curation for CDRs.

Annotators

For entity annotation, we recruited four MeSH indexers, all of whom had a medical training background and curation experience. Each article was annotated independently by two annotators (i.e. double-annotation). Differences were resolved by a third and senior annotator (YS). Three CTD annotators curated the relationships between chemicals and diseases.

Annotation guidelines

The task organizers followed the usual practice of biomedical corpus annotation for entity annotation: the MeSH annotators were asked to follow an initial set of guidelines when annotating the first 100 sample articles. Annotation discrepancies and questions were discussed and settled by the senior annotator; the annotation guidelines were revised accordingly. Detailed guidelines are available on the task website. For CID relation annotation, the standard CTD curation protocol was followed ( 23 ).

Annotation tools

Manual annotation of disease and chemical entities was performed using PubTator ( 4 , 5 ) (see Figure 1 ). To accelerate manual annotation ( 24 ), text-mined disease and chemical results were pre-computed using DNorm ( 25 ) and tmChem ( 26 ) and displayed to the annotators. When necessary, the annotators added new annotations, and deleted or edited the automatic annotations based on their judgment. The annotators were permitted to use public resources such as UMLS or Wikipedia to facilitate the annotation process. CTD’s in-house Curation Tool ( 23 ) was used for all relation curation.

Annotation data formats

All annotated data were made available to participants in both PubTator and BioC formats. The PubTator format consists of a straightforward tab-delimited text file. Figure 2 shows the tab-delimited file for the article (PMID: 354896) in training set. The BioC ( 27 ) format is an XML standard recently proposed for biomedical text-mining and data output. For the same article, its BioC format is shown in Figure 3 .

Figure 2.

Open in new tab Download slide

PubTator format annotation (PMID: 354896).

Figure 3.

Open in new tab Download slide

BioC format annotation (PMID: 354896).

Inter-annotator agreement (IAA) analysis

To assess the consistency of the disease and chemical annotation, we measured pairwise agreement of duplicate annotations using the Jaccard score ( 28 ). As shown below, if we defined

A

as the set of mentions of team A ,

B

as the set of mentions of team B, then the Jaccard agreement score,

S_{A, B}

, could be calculated by counting the number of agreements and disagreements. Mentions with the same PMID, start and end point and concept identifier were counted as a case of agreement. For example, if one annotator annotated ‘tardive dystonia’ with concept ID of D004421, another annotated ‘dystonia’ with concept ID of D004421, then that would count as two cases of disagreement and no case of agreement as different mentions were annotated.

S_{A, B} = \frac{| A \cap B |}{| A \cup B |}

IAA for CTD relation curation was previously described in ( 29 ); no further experiments were conducted in this study.

Results and discussion

Corpus overview

The corpus consists of three separate sets of articles with diseases, chemicals and their relations annotated. The training (500 articles) and development (500 articles) sets were released to task participants in advance to support text-mining method development. The test set (500 articles) was used for final system performance evaluation. As shown in Table 2 , the three data sets have similar distributions of chemical mentions, disease mentions and CID relations, which makes the corpus more useful for training models. The table also shows that while there are more chemical mentions than disease mentions in the corpus, there are more disease entities (IDs) than chemical entities (IDs).

Table 2.

The overall corpus statistics

Task dataset	Articles	Disease		Chemical		CID relation
Task dataset	Articles	Mention	ID	Mention	ID	CID relation
Training	500	4182	1965	5203	1467	1038
Development	500	4244	1865	5347	1507	1012
Test	500	4424	1988	5385	1435	1066

Task dataset	Articles	Disease		Chemical		CID relation
Task dataset	Articles	Mention	ID	Mention	ID	CID relation
Training	500	4182	1965	5203	1467	1038
Development	500	4244	1865	5347	1507	1012
Test	500	4424	1988	5385	1435	1066

Open in new tab

Table 2.

The overall corpus statistics

Task dataset	Articles	Disease		Chemical		CID relation
Task dataset	Articles	Mention	ID	Mention	ID	CID relation
Training	500	4182	1965	5203	1467	1038
Development	500	4244	1865	5347	1507	1012
Test	500	4424	1988	5385	1435	1066

Task dataset	Articles	Disease		Chemical		CID relation
Task dataset	Articles	Mention	ID	Mention	ID	CID relation
Training	500	4182	1965	5203	1467	1038
Development	500	4244	1865	5347	1507	1012
Test	500	4424	1988	5385	1435	1066

Open in new tab

Inter-annotator agreement for mention annotation

Table 3 shows the inter-annotator agreement (IAA) scores of three separate subsets for both disease and chemical annotations. The IAA scores over the entire corpus are 87.49% (diseases) and 96.05% (chemicals), which suggests higher agreement on chemical than disease mentions. Additionally, the IAA scores are slightly higher on the test set than the training and test sets.

Table 3.

Inter-annotator agreement (IAA) scores of the three sets

Task dataset	Disease	Chemical
Training	0.8600	0.9523
Development	0.8742	0.9577
Test	0.8875	0.9630
All	0.8749	0.9605

Open in new tab

Table 3.

Inter-annotator agreement (IAA) scores of the three sets

Task dataset	Disease	Chemical
Training	0.8600	0.9523
Development	0.8742	0.9577
Test	0.8875	0.9630
All	0.8749	0.9605

Open in new tab

Analyzing the disagreements, we found that many were caused by missed annotation, boundary disagreement and inconsistent identifier assignment. Figure 4 shows the proportion of disease and chemical annotation disagreements in the training, development and test sets, respectively. The most of disagreements (∼50%) were due to missed annotation, where one annotator failed to identify the disease/chemical mentions recognized by the other. 28% of disease annotation disagreements were related to the boundary issue. For example, in the article (PMID 20466178) entitled ‘Rosaceiform dermatitis associated with topical tacrolimus treatment’, it was difficult to judge whether annotate ‘rosaceiform dermatitis’ as ‘rosacea’ (MeSH ID: D012393) or simply annotate ‘dermatitis’ (MeSH ID: D003872).

Figure 4..

Open in new tab Download slide

Disagreements of disease and chemical annotations.

There were also many cases of disagreement over the concept identifier of diseases, especially for the mentions where the text did not exactly match any MeSH term. In some cases, it was hard to judge whether to assign an unknown concept identifier of ‘-1’ or an ancestor concept identifier. For example, in the article with PMID of 12093990, one annotator selected ‘infection with hemorrhagic fever viruses’ as ‘-1’, while the other selected ‘D006482’ (Hemorrhagic Fevers, Viral). In this case the adjudicating annotator chose the latter term.

Disease mention distribution

On average, the corpus contains 8.57 non-distinct disease mentions per PubMed abstract. Figure 5(a) shows the breakdown of the number of disease mentions per article in the training, development and test sets, respectively. The three data sets have similar disease mention distribution. In addition, we compared the overlap of unique disease mentions in the three data sets as shown in Figure 5(b) . It can be seen that 718 disease mentions out of 1337 in the test set never appear in the training set or the development set. The similar distribution and differential disease mentions in the three sets make the corpus more useful for training models.

Figure 5.

Open in new tab Download slide

Distribution of disease mentions in the corpus.

Chemical mention distribution

Compared with diseases, the corpus contains more non-distinct chemical mentions (10.62) per PubMed abstract on average. The chemical mention distribution in the three sets ( Figure 6(a) ) and the overlap of unique chemical mentions in the three sets ( Figure 6(b) ) demonstrate that, like the disease mentions, the corpus has excellent characteristics to support model training for chemical entity recognition. Here, we used MeSH identifiers to normalize the chemical mentions because CID relations were previously annotated already in MeSH which is designed for literature indexing and has been used in similar annotation projects ( 26 , 30 ). The CTD chemical vocabulary ( http://ctdbase.org/downloads/#allchems ) can facilitate mapping the MeSH identifiers to other chemical resource accessions for further chemical-related studies.

Figure 6.

Open in new tab Download slide

Distribution of chemical mentions in the corpus.

CID relation distribution

For CID relationships, the average number of relations per PubMed abstract is 2.08. About 50% of articles in the corpus have only one CID relation per article, and 86.8% of articles have no more than three relations ( Figure 7(a) ). Unlike the disease and chemical mentions, the overlap of unique CID relations in the three sets is as low as 61 relations; 79.17% (745 out of 941) of the relations in the test set have never appeared in the training or development set ( Figure 7(b) ). This makes relation extraction task more challenging.

Figure 7.

Open in new tab Download slide

Distribution of chemical-induced disease relations in the corpus.

Corpus usage in BioCreative V

The corpus successfully supported the BioCreative V Chemical Disease Relation (CDR) task ( 8 , 31 ). A total of 34 teams worldwide participated in the task: 16 teams participated in the in the DNER task, and 18 teams participated in the CID task. As reported in the BioCreative V, the best system performance F-score was 86.46% for the DNER task ( 32 ) and 57.03% for the CID task ( 33 ). This corpus provides a benchmark set to facilitate further improvement for biomedical text-mining method development, especially as it relates to semantic relationship extraction.

Conclusions

We developed a corpus for both named entity recognition and chemical-disease relations in the literature. A total of 1500 articles have been annotated with automated assistance from PubTator. Jaccard agreement results and corpus statistics verified the reliability of the corpus. Furthermore, our annotated data includes the CDR relations that are asserted across sentence boundaries (i.e. not in the same sentences). We believe this data set will be invaluable for advancing text-mining techniques for relation extraction tasks.

Acknowledgment

We would like to thank Yifan Peng for his help on the manuscript.

Funding

The research was supported by the National Population and Health Scientific Data Sharing Program of China, the China Knowledge Centre for Engineering Sciences and Technology (Medical Centre), the Fundamental Research Funds for the Central Universities (Grant No. 13R0101), the National Institute of Environmental Health Sciences (ES014065 and ES019604). Funding for open access charge: The National Institutes of Health Intramural Research Program.

Conflict of interest . None declared.

References

1

Islamaj Dogan

R.

Murray

G.C.

Neveol

A

. et al. . (

2009

)

Understanding PubMed user search behavior through log analysis

.

Database (Oxford)

,

2009

,

bap018.

2

Davis

A.P.

Murphy

C.G.

Saraceni-Richards

C.A

. et al. . (

2009

)

Comparative Toxicogenomics Database: a knowledgebase and discovery tool for chemical–gene–disease networks

.

Nucleic Acids Res

.,

37

,

D786

–

D792

.

3

Davis

A.P.

Grondin

C.J.

Lennon-Hopkins

K

. et al. . (

2015

)

The Comparative Toxicogenomics Database's 10th year anniversary: update 2015

.

Nucleic Acids Res

.,

43

,

D914

–

D920

.

4

Wei

C.H.

Kao

H.Y.

Lu

Z.

(

2013

)

PubTator: a web-based text mining tool for assisting biocuration

.

Nucleic Acids Res

.,

41

,

W518

–

W522

.

5

Wei

C.H.

Harris

B.R.

Li

D

. et al. . (

2012

)

Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts

.

Database (Oxford)

,

2012

,

bas041.

6

Wiegers

T.C.

Davis

A.P.

Mattingly

C.J.

(

2014

)

Web services-based text-mining demonstrates broad impacts for interoperability and process simplification

.

Database (Oxford)

,

2014

, bau050.

Google Scholar

OpenURL Placeholder Text

WorldCat

7

Huang

C.C.

Lu

Z.

(

2015

)

Community challenges in biomedical text mining over 10 years: success, failure and the future

.

Brief. Bioinf

.,

17

,

132

–

144

.

Google Scholar

Crossref

WorldCat

8

Wei

C.H.

Pan

Y.

Leaman

R

. et al. . (

2015

) Overview of the BioCreative V Chemical Disease Relation (CDR) Task. In: Proceedings of the fifth BioCreative challenge evaluation workshop . pp.

154

–

166

.

9

Neves

M.

(

2014

)

An analysis on the entity annotations in biological corpora

.

F1000Res

,

3

,

96.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

10

Leaman

R.

Miller

C.

Gonzalez

G.

(

2009

) Enabling recognition of diseases in biomedical text with machine learning: corpus and benchmark. In: The 2009 Symposium on Languages in Biology and Medicine , Jeju Island, South Korea, pp.

82

–

89

.

11

Doğan

R.I.

Leaman

R.

Lu

Z.

(

2014

)

NCBI disease corpus: a resource for disease name recognition and concept normalization

.

J. Biomed. Inf

.,

47

,

1

–

10

.

Google Scholar

Crossref

WorldCat

12

Dogan

R.I.

Lu

Z.

(

2012

) An improved corpus of disease mentions in PubMed citations. In: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing (BioNLP 2012), Montreal, Canada, pp.

91

–

99

.

13

Krallinger

M.

Rabal

O.

Leitner

F

. et al. . (

2015

)

The CHEMDNER corpus of chemicals and drugs and its annotation principles

.

J. Cheminf

.,

7

,

S2.

Google Scholar

Crossref

WorldCat

14

Krallinger

M.

Vazquez

M.

Leitner

F

. et al. . (

2011

)

The Protein–Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text

.

BMC Bioinformatics

,

12

,

S3.

15

Herrero-Zazo

M.

Isabel

S.-B.

Martínez

P

. et al. . (

2013

)

The DDI corpus: an annotated corpus with pharmacological substances and drug–drug interactions

.

J. Biomed. Inf

.,

46

,

914

–

920

.

Google Scholar

Crossref

WorldCat

16

Mulligen

E.M.V.

Fourrier-Reglat

A.

Gurwitz

D

. et al. . (

2012

)

The EU-ADR corpus: annotated drugs, diseases, targets, and their relationships

.

J. Biomed. Inf

.,

45

,

879

–

884

.

Google Scholar

Crossref

WorldCat

17

Gurulingappa

H.

Rajput

A.M.

Roberts

A

. et al. . (

2012

)

Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports

.

J. Biomed. Inf

.,

45

,

885

–

892

.

Google Scholar

Crossref

WorldCat

18

Gurulingappa

H.

Klinger

R.

Hofmann-Apitius

M

. et al. . (

2010

) An Empirical Evaluation of Resources for the Identification of Diseases and Adverse Effects in Biomedical Literature . In:

The 2nd Workshop on Building and evaluating resources for biomedical text mining

.

Valetta, Malta

.

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

19

Davis

A.P.

Wiegers

T.C.

Roberts

P.M

. et al. . (

2013

)

A CTD-Pfizer collaboration: manual curation of 88,000 scientific articles text mined for drug-disease and drug-phenotype interactions

.

Database (Oxford)

,

2013

,

bat080.

20

PubMed Help [Internet]. Bethesda (MD): National Center for Biotechnology Information (US);

2005

–. PubMed Help. [Updated 2015 Aug 7]. http://www.ncbi.nlm.nih.gov/books/NBK3827/#pubmedhelp.Computation_of_Similar_Articl

21

Lin

J.

Wilbur

W.J.

(

2007

)

PubMed related articles: a probabilistic topic-based model for content similarity

.

BMC Bioinformatics

,

8

,

423.

(2007)

22

Lipscomb

C.E.

(

2000

)

Medical subject headings (MeSH)

.

Bull. Med. Library Assoc

.,

88

,

265.

Google Scholar

OpenURL Placeholder Text

WorldCat

23

Davis

A.P.

Wiegers

T.C.

Rosenstein

M.C

. et al. . (

2011

)

The curation paradigm and application tool used for manual curation of the scientific literature at the Comparative Toxicogenomics Database

.

Database (Oxford)

,

2011

,

bar034.

24

Névéol

A.

Doğan

R.I.

Lu

Z.

(

2011

)

Semi-automatic semantic annotation of PubMed queries: a study on quality, efficiency, satisfaction

.

J. Biomed. Inf

.,

44

,

310

–

318

.

Google Scholar

Crossref

WorldCat

25

Leaman

R.

Doğan

R.I.

Lu

Z.

(

2013

)

DNorm: disease name normalization with pairwise learning to rank

.

Bioinformatics

,

29

,

2909

–

2917

.

26

Leaman

R.

Wei

C.H.

Lu

Z.

(

2015

)

tmChem: a high performance approach for chemical named entity recognition and normalization

.

J. Cheminf

.,

7

,

S3.

Google Scholar

Crossref

WorldCat

27

Comeau

D.C.

Doğan

R.I.

Ciccarese

P

. et al. . (

2013

)

BioC: a minimalist approach to interoperability for biomedical text processing

.

Database (Oxford)

,

2013

,

bat064.

28

Levandowsky

M.

Winter

D.

(

1971

)

Distance between sets

.

Nature

,

234

,

34

–

35

.

Google Scholar

Crossref

WorldCat

29

Wiegers

T.C.

Davis

A.P

Cohen

K.B

. et al. . (

2009

)

Text mining and manual curation of chemical-gene-disease networks for the Comparative Toxicogenomics Database (CTD)

.

BMC Bioinformatics

,

10

,

326.

30

Huang

M.

Neveol

A.

Lu

Z.

(

2011

)

Recommending MeSH terms for annotating biomedical articles

.

J. Am. Med. Inf. Assoc

.,

18

,

660

–

667

.

Google Scholar

Crossref

WorldCat

31

Li

J.

Sun

Y.

Johnson

R.J.

et al. . (

2015

) Annotating chemicals, diseases and their interactions in biomedical literature. In: Proceedings of the Fifth BioCreative Challenge Evaluation Workshop . pp.

173

-

182

32

Lee

H.C.

Hsu

Y.Y.

Kao

H.Y.

(

2015

) An enhanced CRF-based system for disease name entity recognition and normalization on BioCreative V DNER Task. In: Proceedings of the Fifth BioCreative Challenge Evaluation Workshop , pp.

226

-

233

.

33

Xu

J.

Wu

Y.

Zhang

Y

. et al. . (

2015

) UTH-CCB@BioCreative V CDR Task: Identifying Chemical-induced Disease Relations in Biomedical Text. In: Proceedings of the Fifth BioCreative Challenge Evaluation Workshop , pp.

254

-

259

.

Author notes

Citation details: Li,J., Sun,Y., Johnson,R.J. et al. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database (2016) Vol. 2016: article ID baw068; doi:10.1093/database/baw068

Published by Oxford University Press 2016. This work is written by US Government employees and is in the public domain in the United States.

Download all slides

Month:	Total Views:
December 2016	5
January 2017	5
February 2017	12
March 2017	22
April 2017	10
May 2017	10
June 2017	26
July 2017	10
August 2017	18
September 2017	10
October 2017	5
November 2017	12
December 2017	27
January 2018	24
February 2018	53
March 2018	47
April 2018	29
May 2018	36
June 2018	33
July 2018	28
August 2018	31
September 2018	60
October 2018	31
November 2018	45
December 2018	36
January 2019	52
February 2019	52
March 2019	58
April 2019	79
May 2019	71
June 2019	37
July 2019	53
August 2019	50
September 2019	131
October 2019	139
November 2019	73
December 2019	81
January 2020	64
February 2020	60
March 2020	81
April 2020	46
May 2020	101
June 2020	149
July 2020	149
August 2020	87
September 2020	113
October 2020	104
November 2020	135
December 2020	109
January 2021	102
February 2021	125
March 2021	182
April 2021	139
May 2021	157
June 2021	172
July 2021	145
August 2021	161
September 2021	169
October 2021	194
November 2021	197
December 2021	173
January 2022	214
February 2022	153
March 2022	208
April 2022	213
May 2022	261
June 2022	187
July 2022	172
August 2022	212
September 2022	142
October 2022	238
November 2022	150
December 2022	135
January 2023	159
February 2023	233
March 2023	190
April 2023	238
May 2023	218
June 2023	150
July 2023	189
August 2023	202
September 2023	146
October 2023	242
November 2023	307
December 2023	241
January 2024	243
February 2024	225
March 2024	266
April 2024	145

Article Contents

BioCreative V CDR task corpus: a resource for chemical disease relation extraction

Abstract

Introduction

Methods and materials

Article selection

Annotation tasks

Annotators

Annotation guidelines

Annotation tools

Annotation data formats

Inter-annotator agreement (IAA) analysis

Results and discussion

Corpus overview

Inter-annotator agreement for mention annotation

Disease mention distribution

Chemical mention distribution

CID relation distribution

Corpus usage in BioCreative V

Conclusions

Acknowledgment

Funding

References

Author notes

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Article Contents

BioCreative V CDR task corpus: a resource for chemical disease relation extraction

Abstract

Introduction

Methods and materials

Article selection

Annotation tasks

Annotators

Annotation guidelines

Annotation tools

Annotation data formats

Inter-annotator agreement (IAA) analysis

Results and discussion

Corpus overview

Inter-annotator agreement for mention annotation

Disease mention distribution

Chemical mention distribution

CID relation distribution

Corpus usage in BioCreative V

Conclusions

Acknowledgment

Funding

References

Author notes

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only