Finding and sharing: new approaches to registries of databases and services for the biomedical sciences

Smedley, Damian; Schofield, Paul; Chen, Chao-Kung; Aidinis, Vassilis; Ainali, Chrysanthi; Bard, Jonathan; Balling, Rudi; Birney, Ewan; Blake, Andrew; Bongcam-Rudloff, Erik; Brookes, Anthony J.; Cesareni, Gianni; Chandras, Christina; Eppig, Janan; Flicek, Paul; Gkoutos, Georgios; Greenaway, Simon; Gruenberger, Michael; Hériché, Jean-Karim; Lyall, Andrew; Mallon, Ann-Marie; Muddyman, Dawn; Reisinger, Florian; Ringwald, Martin; Rosenthal, Nadia; Schughart, Klaus; Swertz, Morris; Thorisson, Gudmundur A.; Zouberakis, Michael; Hancock, John M.

doi:10.1093/database/baq014

Abstract

The recent explosion of biological data and the concomitant proliferation of distributed databases make it challenging for biologists and bioinformaticians to discover the best data resources for their needs, and the most efficient way to access and use them. Despite a rapid acceleration in uptake of syntactic and semantic standards for interoperability, it is still difficult for users to find which databases support the standards and interfaces that they need. To solve these problems, several groups are developing registries of databases that capture key metadata describing the biological scope, utility, accessibility, ease-of-use and existence of web services allowing interoperability between resources. Here, we describe some of these initiatives, including a novel formalism, the Database Description Framework, for describing database operations and functionality and encouraging good database practice. We expect such approaches will result in improved discovery, uptake and utilization of data resources.

Database URL: http://www.casimir.org.uk/casimir_ddf

Biologists currently face a daunting challenge when trying to discover which of the multitude of computational and data resources to use in analysing their results and developing their hypotheses. The basic task of identifying appropriate online resources in a research field is non-trivial and typically involves ad hoc Internet trawling, recommendations from colleagues or literature searching. This is then followed by the more complex task of establishing whether the resource is relevant, reliable, well curated and maintained. If programmatic access is required, discovering whether this exists and how to utilize it is another challenge. As time is short, most researchers end up using familiar resources, which are not always the best or most relevant, while the time and money invested by the developers and funders of under-utilized but valuable resources are essentially wasted. What is required is a solution that helps to maximize the usefulness of each resource to the overall community. At present, approaches are being developed to construct two types of registry. One type, ‘databases of databases’, deals with describing the contents and other metadata about databases. The other type, web service registries, deals with the explicit description of services available at particular sites (not always databases). We present the two areas separately, but ultimately we expect solutions to arrive that merge the two approaches.

Registries of databases

Comprehensive, top-level registries of biological resources are currently provided by the Nucleic Acids Research Molecular Biology Database Collection,¹ the BioMedCentral Catalog of Databases on the Web (http://databases.biomedcentral.com) and the Bioinformatics.ca Links Directory.² However, they do not collect extensive metadata beyond a brief description of the resource and URL, can only be browsed by the category each registry has assigned or searched by the resource name, and lack much of the detailed information that the community requires. A number of projects [e.g. CASIMIR (Coordination and Sustainability of International Mouse Informatics Resources)³ and ENFIN⁴] have identified this problem and are producing ‘databases of databases’ (registries) for their fields of expertise.⁵

A registry of resources needs to be more than just a list of databases and textual descriptions to be useful to the biological and bioinformatics communities. To achieve its aim of helping scientists find the most relevant resource for their needs, it needs to provide at the very least browsing and searching by the type of data contained in each resource, i.e. the biological scope of the resource. A typical approach, as used by all the registries described above and the MRB (Mouse Resource Browser)⁶ registry developed by a number of the authors of this article, is for a community to define a list of categories (a controlled vocabulary) that covers their scientific domain and then to tag each resource with one or more of these terms. Use of existing and newly developed ontologies for these tags would certainly facilitate future interoperability of the various registries being developed.
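The tagging approach described above can be illustrated with a minimal Python sketch. It is not taken from MRB or any of the registries mentioned; the vocabulary terms, the `Resource` class and the example resources are all hypothetical and serve only to show how a controlled vocabulary constrains tags and supports browsing by scope.

```python
from dataclasses import dataclass, field

# Hypothetical controlled vocabulary; a real registry would define its own
# list or reuse terms from an existing ontology.
VOCABULARY = {"genotype", "phenotype", "expression", "pathway", "strain"}


@dataclass
class Resource:
    name: str
    url: str
    tags: set = field(default_factory=set)

    def tag(self, term: str) -> None:
        """Attach a scope term, rejecting anything outside the vocabulary."""
        if term not in VOCABULARY:
            raise ValueError(f"{term!r} is not in the controlled vocabulary")
        self.tags.add(term)


def browse(resources, term):
    """Return the names of resources tagged with the given scope term."""
    return sorted(r.name for r in resources if term in r.tags)


# Made-up resources, for illustration only.
db_a = Resource("ExampleDB-A", "http://example.org/a")
db_b = Resource("ExampleDB-B", "http://example.org/b")
db_a.tag("phenotype")
db_a.tag("strain")
db_b.tag("phenotype")
registry = [db_a, db_b]

print(browse(registry, "phenotype"))
```

Rejecting terms outside the vocabulary at tagging time is one possible design choice; it keeps browsing categories consistent across curators.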

While we were developing MRB, user feedback suggested that it would be helpful if users could go beyond simple categorization of the scope of resources to discover metadata describing database operations and functionality. We therefore set out to capture the utility, accessibility and ease of use of a resource, along with its potential interoperability with other tools and databases. The questions that we wanted to be able to answer from this metadata included: does the resource use automated or manual curation; how often is it updated, and is there a way to track back to different versions; does it provide good technical documentation and user support; does it use recognized standards to record and structure its data; and, finally, does it go beyond simple web browsing to allow programmatic access and output in standard formats?

Data are always easier to capture and search if a consistent standard is used and we therefore developed a Database Description Framework (DDF; Table 1) as part of the CASIMIR project. Although produced for the MRB, the DDF is generically applicable to any biological database and can be adapted for the requirements of any biological community. For each heading or category, there is a three-tier assessment criterion, a number chosen for simplicity and ease of use. The aim of the DDF is not to make ‘value judgements’ about a resource, but to summarise what it does and what functionalities it supports, with the categories simply reflecting the degree of complexity or sophistication of the database. What is useful or relevant for some databases need not be so for others, and each needs to be assessed in terms of its own remit and user community. The DDF is also intended to be helpful in disseminating and supporting good database practice, in providing backing for resources aspiring to improve the levels of their service, and in giving objective criteria that can be used by external assessors to measure a resource’s progress towards its stated goals.

Table 1.

The CASIMIR Database Description Framework (DDF)

Category	Level 1	Level 2	Level 3
Quality and Consistency	No explicit process for assuring consistency	Process for assuring consistency, automatic curation only	Process for assuring consistency with manual curation
Currency	Closed legacy database or last update more than a year ago	Updates or versions more than once a year	Updates or versions more than once a month
Accessibility	Access via browser only	Access via browser and database reports or database dumps	Access via browser and programmatic access (well defined API, SQL access or web services)
Output formats	HTML or similar to browser only	HTML or similar to browser and sparse standard file formats, e.g. FASTA	HTML or similar to browser and rich standard file formats, e.g. XML, SBML (Systems Biology Markup Language)
Technical documentation	Written text only	Written text and formal structured description, e.g. automatically generated API docs (JavaDoc), DDL (Data Description Language), DTD (Document Type Definition), UML (Unified Modelling Language), etc.	Written text and formal structured description and tutorials or demonstrations on how to use them
Data representation standards	Data coded by local formalism only	Some data coded by a recognised controlled vocabulary, ontology or use of minimal information standards (MIBBI)	General use of both recognised vocabularies or ontologies, and minimal information standards (MIBBI)
Data structure standards	Data structured with local model only	Data structured with formal model, e.g. an XML schema	Use of recognised standard model, e.g. FUGE
User support	User documentation only	User documentation and Email/web form help desk function	User documentation as well as a personal contact help desk function/training
Versioning	No provision	Previous version of database available but no tracking of entities between versions	Previous version of database available and tracking of entities between versions
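
As an illustration of how the DDF in Table 1 might be held in software, the following Python sketch stores one level per category for each resource and supports a search by category and level. The nine categories come from Table 1; the snake_case keys, the `DDFRecord` class, the helper functions and the example resources are assumptions made for this sketch and do not describe the CASIMIR implementation.

```python
from dataclasses import dataclass

# Category names follow Table 1; the snake_case spelling is an assumption.
DDF_CATEGORIES = (
    "quality_and_consistency",
    "currency",
    "accessibility",
    "output_formats",
    "technical_documentation",
    "data_representation_standards",
    "data_structure_standards",
    "user_support",
    "versioning",
)


@dataclass
class DDFRecord:
    resource: str
    levels: dict  # maps each DDF category to a level of 1, 2 or 3

    def __post_init__(self):
        if set(self.levels) != set(DDF_CATEGORIES):
            raise ValueError("every DDF category needs exactly one level")
        if any(level not in (1, 2, 3) for level in self.levels.values()):
            raise ValueError("levels must be 1, 2 or 3")


def search(records, category, level):
    """Names of resources recorded at the given level in one DDF category."""
    return [r.resource for r in records if r.levels[category] == level]


def make_record(name, default_level, **overrides):
    """Hypothetical helper: build a record from a default level plus overrides."""
    levels = dict.fromkeys(DDF_CATEGORIES, default_level)
    levels.update(overrides)
    return DDFRecord(name, levels)


# Made-up resources, for illustration only.
records = [
    make_record("ExampleDB-A", 1, accessibility=3, versioning=2),
    make_record("ExampleDB-B", 2),
]

print(search(records, "accessibility", 3))
```

The search matches an exact level rather than a minimum, since the levels describe what a resource does and are not intended as a ranking.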

caBIG, the NCI cancer Biomedical Informatics Grid,⁷ has produced a similar framework for capturing resource metadata but with a stronger focus on the technical assessment of the resources that wish to participate in the project. As caBIG has a well-defined set of tasks and a user community tied to a specific vision and funding, its categories and levels are less generic than those in the DDF and more focused on assessing whether databases reach the level of interoperability required to interact with the other components of this particular project.

Registries of web services

As well as capturing the scope and database practices of resources, registries need to be explicit about the modes of programmatic access that databases provide (e.g. web services) as these are increasingly used to build database networks and cyberinfrastructure.⁸⁻¹⁰ This technical information is often hard to find in publications or even on database web sites, but can radically change the strategy adopted by bioinformaticians needing to access the database, for example by allowing integration into automated or semi-automated workflows using Taverna¹¹ such as that developed by CASIMIR.¹² Unfortunately, traditional web-service description languages such as WSDL do not provide the required detail on the biological context of the inputs and outputs of each service to allow automated data and service integration. Biocatalogue¹³ and its predecessor, the EMBRACE service registry,¹⁴ address this lack of semantics by providing sites for the registration, curation, discovery and monitoring of web services for the whole biological community. Curation of information about web services is open to anyone and uses a combination of free text, tags, ontology terms and example values to describe what each service does, the type of web service (REST, SOAP, soaplab) and in particular the inputs and outputs in terms of what type of biological data and data formats are expected. Biocatalogue clearly addresses a vital requirement of the community and already some 1173 services have been annotated, despite the project having run for just over a year. Having a single, well-designed solution rather than multiple competing efforts is likely to improve further uptake, and we propose that all registries of databases utilize Biocatalogue to annotate the services provided by their resources rather than separately performing this task.
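
To show why annotating inputs and outputs with biological data types matters for service integration, here is a minimal Python sketch. It does not use or reproduce the Biocatalogue data model or API; the `ServiceAnnotation` fields, the data type labels and the service names are hypothetical. Given such annotations, services whose output type matches another service's input type can be paired automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceAnnotation:
    name: str
    kind: str           # type of web service, e.g. "REST", "SOAP" or "soaplab"
    input_type: str     # biological data type the service consumes
    output_type: str    # biological data type the service produces
    output_format: str  # data format of the output, e.g. "FASTA" or "XML"


def two_step_chains(services, start_type, goal_type):
    """Pairs (a, b) where a accepts start_type and b turns a's output into goal_type."""
    return [
        (a.name, b.name)
        for a in services
        if a.input_type == start_type
        for b in services
        if b.input_type == a.output_type and b.output_type == goal_type
    ]


# Made-up annotations, for illustration only.
services = [
    ServiceAnnotation("symbol2protein", "REST", "gene_symbol", "protein_sequence", "FASTA"),
    ServiceAnnotation("protein2domains", "SOAP", "protein_sequence", "domain_annotation", "XML"),
    ServiceAnnotation("symbol2strain", "REST", "gene_symbol", "mouse_strain", "XML"),
]

print(two_step_chains(services, "gene_symbol", "domain_annotation"))
```

A plain interface description without such type labels would give a workflow tool no basis for deciding that the first service's output is a valid input to the second.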

Dissemination issues and solutions

Capturing metadata as described for the DDF or the Biocatalogue project is not easy. Our initial DDF metadata for over 220 resources was captured as part of a detailed MRB questionnaire sent to each resource, and active manual curation had to be used to fill in the gaps in responses. This is expensive and time-consuming and, after the first pass, the captured data must be kept up to date, a requirement that is not easily met.

To eliminate the cost of a central curation effort, it would be much better if each resource curated its own metadata and made it accessible to the wider scientific community. As an example of this, we produced a DDF extension to the Drupal content management system (http://drupal.org), which allows curators to log in and categorize their databases in terms of DDF categories and levels using a simple web form. The resulting metadata is then browsable and searchable either through a web interface or programmatically through RESTful web services. An example deployment is viewable at www.casimir.org.uk/casimir_ddf (Figure 1) and is currently populated with the metadata for the MRB project. We encourage interested readers to visit our site and maintainers of resources to curate their metadata using it. The Drupal framework is easily extensible to allow curation of other data associated with each resource, allowing the production of a customisable community registry. The system is expected to be of great value to communities developing registry resources or to individual informaticians wanting to establish quickly which features a database provides. The software is freely available under an open-source license. The REST web services allow a central DDF portal to be established that collects and shares data from individual database registries and avoids redundancy in curation efforts.
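
The idea of a central portal collecting metadata from individual registries can be sketched as follows. This Python example makes no claim about the actual DDF web services: the JSON layout (a list of objects with `name`, `url`, `updated` and `ddf` fields) is an assumption, and the payloads are supplied as strings so that no particular endpoint is implied. Entries describing the same resource URL are merged, keeping the most recently updated one.

```python
import json


def merge_registries(payloads):
    """Merge JSON payloads from several registries, one entry per resource URL."""
    merged = {}
    for payload in payloads:
        for entry in json.loads(payload):
            # Crude URL normalisation (an assumption): ignore case and trailing slash.
            key = entry["url"].rstrip("/").lower()
            current = merged.get(key)
            # ISO 8601 dates (YYYY-MM-DD) compare correctly as plain strings.
            if current is None or entry["updated"] > current["updated"]:
                merged[key] = entry
    return sorted(merged.values(), key=lambda e: e["name"])


# Made-up payloads from two hypothetical registries.
registry_a = json.dumps([
    {"name": "ExampleDB-A", "url": "http://example.org/a/",
     "updated": "2009-11-02", "ddf": {"currency": 2}},
])
registry_b = json.dumps([
    {"name": "ExampleDB-A", "url": "http://EXAMPLE.org/a",
     "updated": "2010-03-15", "ddf": {"currency": 3}},
    {"name": "ExampleDB-B", "url": "http://example.org/b",
     "updated": "2010-01-20", "ddf": {"currency": 1}},
])

central = merge_registries([registry_a, registry_b])
print([entry["name"] for entry in central])
```

Keeping only the newest entry per resource is one simple way to avoid redundant curation; a real portal might instead flag conflicting entries for review.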

Figure 1.

The DDF query and annotation tool. This tool allows any user to browse a set of resources that have been annotated using the DDF categories. Searches for resources by DDF category and level are also possible. In addition, resource maintainers can log in and edit their existing annotations or annotate a new resource using a simple web form. This tool is freely available and easy to install for other communities that wish to create their own registry of resources.

Biocatalogue have used a combination of central and community curation from the outset to capture data on web services, and the large number of services already described is testament to the value of such an approach. Again, the provision of easy-to-use web tools that suggest particular tags and ontology terms to use in the annotation increases the likelihood of achieving a high level of community engagement and annotation quality.

Community curation requires pro-active participation. Communities need to acknowledge that (i) a central site where they can find relevant resources would be useful, and (ii) the only practical means of achieving this is for each database to self-curate its entry using a clearly articulated and standardized set of benchmarks and tools such as those provided by the DDF and Biocatalogue solutions. Individual resources would also benefit from this small amount of curation effort, as the central registry will direct users to them who might not previously have known about the resource. Although the creators and maintainers of a resource are best placed to describe the associated metadata, a self-curation approach can raise data quality issues. These should be minimized if the annotation tools are well designed, i.e. fast and easy to use, with clear descriptions of what is being asked for, and responses presented as lists of terms rather than free text. However, even with a well-designed annotation tool, registries are still likely to require some central curation for validating submitted data (e.g. the DDF tool allows administrator-level access to check new submissions).
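
The combination of constrained responses and a central check on submissions could look something like the Python sketch below. It is an illustration only and does not describe how the DDF tool validates data; the function name, the category keys and the example submission are hypothetical. Values that fall outside the fixed list of levels are not silently accepted but collected as problems for an administrator to review.

```python
ALLOWED_LEVELS = (1, 2, 3)


def review_submission(submission, known_categories):
    """Split a self-curated submission into accepted values and problems to review."""
    accepted, problems = {}, []
    for category, value in submission.items():
        if category not in known_categories:
            problems.append(f"unknown category: {category}")
        elif value not in ALLOWED_LEVELS:
            problems.append(f"{category}: expected one of {ALLOWED_LEVELS}, got {value!r}")
        else:
            accepted[category] = value
    for category in known_categories:
        if category not in submission:
            problems.append(f"missing category: {category}")
    return accepted, problems


# Made-up submission: one valid answer, one free-text answer, one stray field.
known = ["currency", "versioning", "user_support"]
submission = {"currency": 3, "versioning": "often", "colour": 1}

accepted, problems = review_submission(submission, known)
print(accepted)
for problem in problems:
    print(problem)
```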

In summary, there is now a clear need for registries to be built that address biological categorization of databases and services, annotate any services provided and capture metadata on database best practices. Considerable progress has been made on standardizing the capture of each of these by such approaches as the DDF and Biocatalogue, but the community would benefit from coordination to produce full registries combining all these approaches. However, the value of a standard is dependent on its uptake by the community as can be seen, for example, in the MIBBI family of minimal information standards.¹⁵ Uptake of a standard is, of course, as much a social issue as one of producing the right technologies for the community. Here, support from funding agencies and journals will be vital in establishing the practice of publishing database and services metadata. All curators can enhance the value of their databases by posting a minimal amount of information about their resource on a community site. The task has minimal cost, but will provide considerable value to investigators, database developers, informaticians and funding agencies.

Funding

The work described above was supported by the Seventh Framework Programme of the European Commission contracts to CASIMIR (LSHG-CT-2006-037811) and ENFIN (LSHG-CT-2005-518254). Funding for open access charge: CASIMIR LSHG-CT-2006-037811.

Conflict of interest. None declared.

References

1. Cochrane GR, Galperin MY. The 2010 Nucleic Acids Research Database issue and online database collection: a community of data resources, Nucleic Acids Res., 2010, vol. 38 (pg. D1-D4)

2. Brazas MD, Yamada JT, Ouellette BF. Evolution in bioinformatic resources: 2009 update on the Bioinformatics Links Directory, Nucleic Acids Res., 2009, vol. 37 (pg. W3-W5)

3. Hancock JM, Schofield PN, Chandras C, et al. CASIMIR: Coordination and Sustainability of International Mouse Informatics Resources, Proceedings of the 8th IEEE International Conference on Bioinformatics and Bioengineering, 2008, doi:10.1109/BIBE.2008.4696712

4. Reisinger F, Corpas M, Hancock J, et al. In: Bairoch A, Cohen-Boulakia S, Froidevaux C (eds). Data Integration in the Life Sciences: 5th International Workshop, DILS 2008, Evry, France, June 25–27, 2008. Springer, Berlin, 2008, pp. 132–143

5. Babu PA, Udyama J, Kumar RK, et al. DoD2007: 1082 molecular biology databases, Bioinformation, 2007, vol. 2 (pg. 64-67)

6. Zouberakis M, Chandras C, Swertz M, et al. Mouse Resource Browser—a database of mouse databases, Database, 2010, doi:10.1093/database/baq010

7. Saltz J, Oster S, Hastings S, et al. caGrid: design and implementation of the core architecture of the cancer biomedical informatics grid, Bioinformatics, 2006, vol. 22 (pg. 1910-1916)

8. Stein L. Towards a cyberinfrastructure for the biological sciences: progress, visions and challenges, Nat. Rev. Genet., 2008, vol. 9 (pg. 678-688)

9. Foster I. Service-oriented science, Science, 2005, vol. 308 (pg. 814-817)

10. Hey T, Trefethen AE. Cyberinfrastructure for e-Science, Science, 2005, vol. 308 (pg. 817-821)

11. Hull D, Wolstencroft K, Stevens R, et al. Taverna: a tool for building and running workflows of services, Nucleic Acids Res., 2006, vol. 34 (pg. W729-W732)

12. Smedley D, Swertz MA, Wolstencroft K, et al. Solutions for data integration in functional genomics: a critical assessment and case study, Brief. Bioinformatics, 2008, vol. 9 (pg. 532-544)

13. Bhagat J, Tanoh F, Nzuobontane E, et al. BioCatalogue: a universal catalogue of web services for the life sciences, Nucleic Acids Res., 2010, vol. 38 (pg. W689-W694)

14. Pettifer S, Thorne D, McDermott P, et al. An active registry for bioinformatics web services, Bioinformatics, 2009, vol. 25 (pg. 2090-2091)

15. Taylor CF, Field D, Sansone SA, et al. Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project, Nat. Biotechnol., 2008, vol. 26 (pg. 889-896)

Author notes

Reference 13 in an earlier version of this paper was incorrect. The author apologizes for this error.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
