Database (2009) Vol. 2009:bap014;
doi:10.1093/database/bap014
published on October 12, 2009
© The Author(s) 2009. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
ORION-VIRCAT: a tool for mapping ICTV and NCBI taxonomies
Willy Valdivia-Granda* and Francis Larson
Orion Integrated Biosciences Inc., New Rochelle, NY, 10805, USA
*Corresponding author: Tel: 800 283 0169; Fax: 888 299 4171. Email: willy.valdivia{at}orionbiosciences.com
 |
Abstract
|
|---|
Viruses, viroids and prions are the smallest infectious biological
entities that depend on their host for replication. The number
of pathogenic viruses is large and their impact
on global human health is well documented. Currently, the International
Committee on the Taxonomy of Viruses (ICTV) has classified 4379
virus species while the National Center for Biotechnology Information
Viral Genomes Resource (NCBI-VGR) database has mapped 617 705
proteins to eight large taxonomic groups. Despite these efforts,
an automated approach for mapping the ICTV master list and its
officially accepted virus naming to the NCBI-VGR's taxonomical
classification is not available. Due to metagenomic sequencing,
it is likely that the discovery and naming of new viral species
will increase at least 10-fold. Unfortunately, existing
viral databases are not adequately prepared to scale, maintain
and annotate automatically ultra-high throughput sequences and
place this information into specific taxonomic categories. ORION-VIRCAT
is a scalable and interoperable object-relational database designed
to serve as a resource for the integration and verification
of taxonomical classifications generated by the ICTV and NCBI-VGR.
The current release (v1.0) of ORION-VIRCAT is implemented in
PostgreSQL and has been extended to ORACLE, MySQL and Sybase.
ORION-VIRCAT automatically mapped and joined 617 705 entries
from the NCBI-VGR to the viral naming of the ICTV. This detailed
analysis revealed that 399 095 entries from the NCBI-VGR can
be mapped to the ICTV classification and that one Order, 10
families, 35 genera and 503 species listed in the ICTV disagree
with the NCBI-VGR classification schema. Nevertheless, we
were able to correct several discrepancies, mapping 234 000
additional entries.
Database URL: http://www.orionbiosciences.com/research/orion-vircat.html
Received April 20, 2009; Revised September 6, 2009; Accepted September 7, 2009
 |
Introduction
|
|---|
Viruses, viroids and prions are the smallest infectious biological
entities that depend on their host for replication. Because
many species represent a significant threat to global health
and can be used as bioweapons, there has been a considerable
effort to gain a better understanding of their host range and
the molecular forces shaping their adaptation and pathogenesis.
Periodically the International Committee on the Taxonomy of
Viruses (ICTV) generates a master list, which currently recognizes
about 4379 virus species divided into nine Orders, 98 assigned
Families, 26 unassigned Families, 18 assigned Sub-families,
5 unassigned Sub-families, 459 assigned genera and 57 unassigned
genera.
The National Center for Biotechnology Information Viral Genomes Resource (NCBI-VGR) (1,2) is a database that uses the Baltimore nomenclature (3) to map 1 million protein records to eight large taxonomic groups (excluding unclassified viruses and unclassified bacteriophages): one Deltavirus species, 96 species of Retro-transcribing viruses, 129 satellites, 601 dsDNA viruses with no RNA stage, 107 species of dsRNA viruses, 353 species of ssDNA viruses, 123 species of ssRNA negative-strand viruses and 580 species of ssRNA positive-strand viruses with no DNA stage. It also lists five unclassified archaeal viruses, 35 unclassified phages and nine unclassified viruses.
The ICTVdb is a viral information repository which uses the DELTA system to generate taxonomical reports in HTML format from the ICTV master list (4,5). The ICTVdb uses an eight-position decimal code, with up to three digits per position, similar to the scheme used for enzyme classes, to represent order, family, subfamily, genus, species, subspecies, serotype or subtype, and strain or isolate (4,5). This detailed information is linked to approximately 8000 representative sequences from the NCBI database.
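As an illustration of the eight-position decimal code described above, the following minimal Python sketch splits such a code into its eight ranks. The dot separator, the rank labels, the function name `parse_decimal_code` and the example code string are assumptions made for illustration only; they are not taken from the ICTVdb itself.

```python
from typing import Dict

# The eight positions, in the order given in the text (labels are hypothetical).
RANKS = (
    "order", "family", "subfamily", "genus",
    "species", "subspecies", "serotype_or_subtype", "strain_or_isolate",
)


def parse_decimal_code(code: str) -> Dict[str, int]:
    """Split a dot-separated eight-position code into named integer fields.

    Assumes positions are separated by dots and hold one to three digits.
    """
    parts = code.strip().strip(".").split(".")
    if len(parts) != len(RANKS):
        raise ValueError(f"expected {len(RANKS)} positions, got {len(parts)}")
    for part in parts:
        if not (part.isascii() and part.isdigit() and 1 <= len(part) <= 3):
            raise ValueError(f"invalid position: {part!r}")
    return {rank: int(part) for rank, part in zip(RANKS, parts)}


if __name__ == "__main__":
    # A made-up code, used only to exercise the parser.
    print(parse_decimal_code("01.002.0.03.004.00.001.005"))
```

Validating the number of positions and digits at parse time is one way to catch the kind of numerical-assignment errors discussed in this section before they reach a database.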
In addition to the NCBI-VGR and ICTVdb, several databases covering specific categories of viruses have been implemented. Most are modeled using relational database management systems (RDBMS) and provide standard interfaces like JDBC and ODBC for data and metadata annotation. Some databases support data curation, genome and proteome comparisons (6–8) and have become specialized sources of information for Bunyavirus (9), Flavivirus (10,11), Herpesvirus (12), Coronavirus (12,13), Influenza (14–16), Hepatitis (17–20), HIV (21–23), vaccines (24), ssRNA viruses (25), virulence factors (26), capsid structures (27), siRNA targets (28) and immunogenesis (17,29,30).
Despite this progress, a comprehensive and automated approach for mapping the ICTV master list and its officially accepted virus naming to the NCBI-VGR is not available. This situation not only limits the development of additional specialized viral databases but also makes cross-validation across them very difficult. As biological databases grow, it is increasingly difficult to maintain their integrity. In many cases, data entry errors, including errors in virus naming and numerical assignment, go undetected, and errors at the higher levels of taxonomy (e.g. family) are propagated to lower levels (e.g. species) and to external databases. Furthermore, in their current format, available databases cannot scale seamlessly to handle metagenomic sampling. This is particularly relevant because metagenomic datasets will increase the discovery rate and naming of new viral species by at least 10-fold (31).
To address several of the above challenges we report here the implementation of a series of bioinformatic applications and an enterprise database management system to (i) automatically assign each entry of the ICTV master list to the NCBI-VGR and determine the level of discrepancy between these two databases and (ii) implement an object-relational genomic catalog that stores viral genome information and corrects existing discrepancies. Our work empowers virologists to develop specialized databases and is one of the first steps towards the development of a viral ontology.
 |
Methods
|
|---|
Data monitoring, retrieval and integration
This layer of tools is managed by monitor and adapter modules.
The monitor periodically checks the ICTV master list and the
NCBI-VGR taxonomical records. When either changes, the monitor
module triggers a Perl script named ICTVml_parser.pl, which uploads
the new taxonomical classification and species naming from the ICTV.
At the same time, a BioPerlDB script named load_sqdatabase.pl
retrieves and parses new GenBank records. Once these processes
are completed, NCBI_ICTV_integrator.pl maps the ICTV species
naming to the NCBI-VGR taxonID and Baltimore classification
schemes (3) (Figure 1).
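The change-detection step performed by the monitor can be sketched as follows. This is a minimal Python illustration, not the ORION-VIRCAT implementation (which the text describes as Perl scripts): the `Monitor` class, the content-hash comparison and the handler callbacks are all assumptions introduced here, and fetching the ICTV or NCBI-VGR data is left out.

```python
import hashlib
from typing import Callable, Dict


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 digest used to detect whether a source has changed."""
    return hashlib.sha256(content).hexdigest()


class Monitor:
    """Remember the last digest per source and call a handler on change."""

    def __init__(self) -> None:
        self._last_seen: Dict[str, str] = {}
        self._handlers: Dict[str, Callable[[bytes], None]] = {}

    def register(self, source: str, handler: Callable[[bytes], None]) -> None:
        # The handler stands in for a parser such as ICTVml_parser.pl.
        self._handlers[source] = handler

    def check(self, source: str, content: bytes) -> bool:
        """Run the handler if the content differs from the last check.

        The first check of a source always counts as a change.
        """
        digest = fingerprint(content)
        if self._last_seen.get(source) == digest:
            return False
        self._last_seen[source] = digest
        self._handlers[source](content)
        return True


if __name__ == "__main__":
    monitor = Monitor()
    monitor.register("ictv_master_list", lambda data: print("re-parse", len(data), "bytes"))
    monitor.check("ictv_master_list", b"release A")  # handler runs
    monitor.check("ictv_master_list", b"release A")  # unchanged, handler skipped
```

Comparing digests rather than timestamps is one possible design choice; it avoids re-parsing when a source is republished without any change in content.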
Order, family, sub-family, genus and species naming from the NCBI-VGR
are flagged and renamed using the ICTV master list and a
virus_synonym table that maintains alternative naming of a virus
or strain. When synonyms exist, precedence of the ICTV master list
determines the selection of virus names that should be included
within a taxonomical category (Figure 2).
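The renaming rule described above, in which the ICTV master list takes precedence and a synonym table supplies alternative names, can be sketched in Python as follows. The function name `resolve_name` and the in-memory set and dictionary are hypothetical stand-ins for the ICTV master list and the virus_synonym table; the example names are the Omicronpapillomavirus pair reported in the Results.

```python
from typing import Dict, Optional, Set


def resolve_name(
    ncbi_name: str,
    ictv_names: Set[str],
    synonyms: Dict[str, str],
) -> Optional[str]:
    """Return the ICTV-accepted name for an NCBI-VGR name, or None.

    ictv_names: names accepted in the ICTV master list.
    synonyms:   alternative name -> ICTV-accepted name.
    """
    # An exact match with the master list needs no renaming.
    if ncbi_name in ictv_names:
        return ncbi_name
    # Otherwise consult the synonym table; the ICTV name takes precedence.
    candidate = synonyms.get(ncbi_name)
    if candidate is not None and candidate in ictv_names:
        return candidate
    # No mapping found: the entry stays flagged for review.
    return None


if __name__ == "__main__":
    ictv = {"Omicronpapillomavirus"}
    syn = {"Omikronpapillomavirus": "Omicronpapillomavirus"}
    print(resolve_name("Omikronpapillomavirus", ictv, syn))
```

Returning `None` for unmapped names, rather than guessing, keeps unresolved entries visible so that they can be reported as discrepancies.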
Viral genomic catalog
This object-RDBMS stores the metadata and virus genomic sequence
information collected by the monitor and adapter modules and
joined by NCBI_ICTV_integrator.pl. The ORION-VIRCAT genomic
catalog reuses the attributes of the BioSQL seqfeatures, annotation,
taxon and ontology tables and is implemented in PostgreSQL.
In addition, we extended the BioSQL database schema to include
virus morphology description, geographical information, clinical
characteristics, isolation location and year, culture passage
cycle, and controlled vocabularies. To avoid vendor-specific
operations we have extended the genomic catalog to ORACLE, MySQL
and DB2 and data formats.
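To make the idea of extending a taxon table with virus-specific metadata concrete, the sketch below builds a small in-memory catalog. It is an illustration under several assumptions: SQLite (via Python's standard `sqlite3` module) stands in for PostgreSQL, the `taxon` table is a heavily simplified stand-in for the BioSQL taxon table, and the `virus_metadata` table and all column names are hypothetical. Only the table name virus_synonym comes from the text.

```python
import sqlite3
from typing import Optional

# Simplified, hypothetical schema; not the actual BioSQL or ORION-VIRCAT DDL.
SCHEMA = """
CREATE TABLE taxon (
    taxon_id      INTEGER PRIMARY KEY,
    ncbi_taxon_id INTEGER UNIQUE,
    node_rank     TEXT
);
CREATE TABLE virus_synonym (
    synonym   TEXT PRIMARY KEY,
    ictv_name TEXT NOT NULL,
    taxon_id  INTEGER NOT NULL REFERENCES taxon(taxon_id)
);
CREATE TABLE virus_metadata (
    taxon_id           INTEGER PRIMARY KEY REFERENCES taxon(taxon_id),
    morphology         TEXT,
    isolation_location TEXT,
    isolation_year     INTEGER,
    passage_cycle      INTEGER
);
"""


def build_catalog() -> sqlite3.Connection:
    """Create an empty in-memory catalog with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def ictv_name_for(conn: sqlite3.Connection, synonym: str) -> Optional[str]:
    """Look up the ICTV-accepted name recorded for an alternative name."""
    row = conn.execute(
        "SELECT ictv_name FROM virus_synonym WHERE synonym = ?", (synonym,)
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = build_catalog()
    # 999999 is a placeholder, not a real NCBI taxon identifier.
    conn.execute("INSERT INTO taxon VALUES (1, 999999, 'genus')")
    conn.execute(
        "INSERT INTO virus_synonym VALUES (?, ?, ?)",
        ("Omikronpapillomavirus", "Omicronpapillomavirus", 1),
    )
    print(ictv_name_for(conn, "Omikronpapillomavirus"))
```

Keeping the extra attributes in separate tables keyed on `taxon_id`, and using only plain SQL types, is one way to keep such a schema portable across database vendors.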
 |
Results
|
|---|
The current release (v1.0) of ORION-VIRCAT automatically mapped
and joined 617 705 entries from the NCBI-VGR to the viral naming
of the ICTV. This detailed analysis revealed that 399 095 entries
from the NCBI-VGR can be mapped to the ICTV classification and
that one Order, 10 families, 35 genera and 503 species listed
in the ICTV disagree with the NCBI-VGR classification schema
(Supplementary Data). Our analysis also found four main types
of discrepancies between the ICTV master list and the NCBI-VGR
entries. The first level consisted of minor differences in the
capitalization between the naming conventions or changes in
one letter. For example, the ICTV listed PhiH-like viruses,
while the NCBI-VGR listed phiH-like viruses. In a similar case,
the ICTV listed Omicronpapillomavirus while the NCBI-VGR listed
Omikronpapillomavirus. The second level of discrepancies included
15 genera remaining unclassified within a particular family
in the NCBI-VGR. However, recent updates of the ICTV master
list gave these viral groups a genus name. The third level of
discrepancy consisted of species belonging to one of four different
genera that have been reassigned to a new genus. The fourth
level of discrepancy included species listed only in the ICTV
master list and classified within a particular taxonomy according
to morphological observations but without sequence entries available
in the NCBI-VGR.
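The first level of discrepancy (capitalization differences or a change in one letter) lends itself to a simple automated check. The Python sketch below is one possible way to detect it and is not the method used by ORION-VIRCAT; the function name and the returned labels are assumptions. It only recognizes single-letter substitutions, not insertions or deletions.

```python
def classify_name_difference(ictv_name: str, ncbi_name: str) -> str:
    """Label the difference between an ICTV name and an NCBI-VGR name.

    Returns "identical", "capitalization", "one-letter change" or "other".
    """
    if ictv_name == ncbi_name:
        return "identical"
    a, b = ictv_name.lower(), ncbi_name.lower()
    # Names that match once case is ignored differ only in capitalization.
    if a == b:
        return "capitalization"
    # Same length with exactly one mismatching position: one substituted letter.
    if len(a) == len(b):
        mismatches = sum(1 for x, y in zip(a, b) if x != y)
        if mismatches == 1:
            return "one-letter change"
    return "other"


if __name__ == "__main__":
    print(classify_name_difference("PhiH-like viruses", "phiH-like viruses"))
    print(classify_name_difference("Omicronpapillomavirus", "Omikronpapillomavirus"))
```

Applied to the two examples given above, the sketch labels the PhiH-like viruses pair as a capitalization difference and the Omicronpapillomavirus pair as a one-letter change.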
 |
Discussion
|
|---|
With the advent of genomics several taxonomical classifications
have been proposed and have led to the development of several
specialized viral databases. However, for the most part, these
implementations remain isolated sources of information and lack
interoperability and scalability. Here, we report the implementation
of ORION-VIRCAT as a progressive step towards the standardization
of genomic information about viruses and the development of
a scalable system to store viral information at the metagenomic
scale. The development of this approach has several implications
for the development of viral databases. First, we comprehensively
assessed the level of discrepancy between the official naming
and taxonomical classification generated by the ICTV master
list and the NCBI-VGR. Second, ORION-VIRCAT reconstructed in
an object-relational format a genomic catalog mapping all the
sequences from NCBI-VGR to the officially accepted naming developed
by the ICTV. By using the ICTV we promote the use of officially
accepted taxon names developed by the research community and
the correct mapping to the information of a particular sequence
stored in the NCBI. At the same time, we uncovered genera and
species names that need to be revised and updated. Therefore,
ORION-VIRCAT promotes nomenclatural clarity through explicit
definitions where each taxon has only one accepted name.
By reusing BioPerl and BioSQL, ORION-VIRCAT adopts widely accepted standards and pseudo-standards that facilitate interoperability with third-party applications. This not only saves considerable time and resources, but also allows the implementation of a robust support system for the future development of specialized viral databases. The schema of the genomic catalog is flexible enough to allow the addition of new sources of information [e.g. Pathogen Information Markup Language (32)]. As a result, ORION-VIRCAT empowers researchers interested in a particular viral taxonomy to download specific sets of information, implement their own databases and extend them with advanced and specific analysis tools. As the ICTV master list introduces new species names, they are added to the table; in this way we ensure that every group has the most up-to-date naming convention. Since curators often dedicate much effort to manually annotating group names, we are now developing an annotation tool for data clarification to generate reports to be considered by the ICTV (Table 1).
Towards a viral ontology
In order to be able to exchange the semantics of information
in a database on viruses one first needs to agree on how to
explicitly model a virus ontology architecture. Through the use
of ontologies it is possible to develop a mechanism for representing
in a formal form the shared descriptions about viruses including
taxonomy nomenclature, phylogenetics, molecular and functional
biology. We propose starting with the development of a conceptual
discussion to define the scope and range of a viral ontology.
We believe that the viral ontology should be divided into four
parts within two core layers. The first core layer should be
a static ontology describing only essential and passive concepts
about viruses. The extended layer should describe concepts actively
evolving and related to viral naming, taxonomy, phylogenetics,
genetics, genomics, biology, host–parasite relations,
ecology, morphology and experiments involving viruses. The extended
layer should include as a rule, a minimum set of description
categories in order to define a species. Representations of
the same data by different biologists will likely be different
(even when using the same system). Hence, mechanisms for aligning
different biological schemas or different versions of schemas
should be supported.
Since the extended layer is subject to constant changes as biological knowledge on viruses evolves, it is necessary to implement different numerical identifiers for each of the attributes and their concepts. This will allow building a complex concept of cardinality and inheritance for terms while formalizing and verifying their correctness and properties. These behavior constraints can be viewed as temporal logic assertions expressing the evolution of a particular term. At the same time, the extended layer should inherit the ontology terms related to viruses (e.g. Pathogen Transmission Ontology, Diseases Ontology, Phage Ontology, Vaccine Ontology, etc.) from other biomedical ontologies.
 |
Conclusions
|
|---|
With the advent of genomic and metagenomic scale virus genome
sampling, using conventional taxonomic criteria based on morphological
and developmental properties is considered impractical. The
bioinformatics strategy presented here lends support for future
collaborative efforts for a comprehensive, large-scale viral
genome analysis system. These systems should allow intelligent
software agents and advanced text-mining algorithms to analyze
information about viruses and present it in new ways that can
not only advance our understanding of viruses, but redefine
their classification.
 |
Supplementary data
|
|---|
Supplementary Data are available at Database Online.
 |
Funding
|
|---|
The development of ORION-VIRCAT is partially supported by the
Defense Threat Reduction Agency under the contract W81XWH-0720029.
Funding for open access charge: Contract W81XWH-0720029.
Conflict of interest statement. None declared.
 |
Acknowledgements
|
|---|
The authors would like to thank Dr Sofi Ibrahim at the US Army
Institute for Infectious Diseases (USAMRIID) and Dr Carmenza
Spadafora at the Instituto de Investigaciones Científicas
y Servicios de Alta Tecnología (INDICASAT) for the helpful
discussions and suggestions.
 |
References
|
|---|
1. Wheeler D.L., Barrett T., Benson D.A., et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. (2008) 36:D13–D21.
2. Bao Y., Federhen S., Leipe D., et al. National center for biotechnology information viral genomes project. J. Virol. (2004) 78:7291–7298.
3. Baltimore D. Expression of animal virus genomes. Bacteriol. Rev. (1971) 35:235–241.
4. Buchen-Osmond C. Further progress in ICTVdB, a universal virus database. Arch. Virol. (1997) 142:1734–1739.
5. Buechen-Osmond C., Dallwitz M. Towards a universal virus database—progress in the ICTVdB. Arch. Virol. (1996) 141:392–399.
6. Kulkarni-Kale U., Bhosle S., Manjari G.S., et al. VirGen: a comprehensive viral genome resource. Nucleic Acids Res. (2004) 32:D289–D292.
7. Lefkowitz E.J., Upton C., Changayil S.S., et al. Poxvirus Bioinformatics Resource Center: a comprehensive Poxviridae informational and analytical resource. Nucleic Acids Res. (2005) 33:D311–D316.
8. Yan Q. Bioinformatics databases and tools in virology research: an overview. In Silico Biol. (2008) 8:71–85.
9. Fourment M., Gibbs M.J. The VirusBanker database uses a Java program to allow flexible searching through Bunyaviridae sequences. BMC Bioinformatics (2008) 9:83.
10. Misra M., Schein C.H. Flavitrack: an annotated database of flavivirus sequences. Bioinformatics (2007) 23:2645–2647.
11. Schreiber M.J., Ong S.H., Holland R.C., et al. DengueInfo: a web portal to dengue information resources. Infect. Genet. Evol. (2007) 7:540–541.
12. Alba M.M., Lee D., Pearl F.M., et al. VIDA: a virus database system for the organization of animal virus genome open reading frames. Nucleic Acids Res. (2001) 29:133–136.
13. Huang Y., Lau S.K., Woo P.C., et al. CoVDB: a comprehensive database for comparative analysis of coronavirus genes and genomes. Nucleic Acids Res. (2008) 36:D504–D511.
14. Chang S., Zhang J., Liao X., et al. Influenza Virus Database (IVDB): an integrated information resource and analysis platform for influenza virus research. Nucleic Acids Res. (2007) 35:D376–D380.
15. Lu G., Rowley T., Garten R., et al. FluGenome: a web tool for genotyping influenza A virus. Nucleic Acids Res. (2007) 35:W275–W279.
16. Squires B., Macken C., Garcia-Sastre A., et al. BioHealthBase: informatics support in the elucidation of influenza virus host pathogen interactions and virulence. Nucleic Acids Res. 36:D497–D503.
17. Hraber P.T., Leach R.W., Reilly L.P., et al. Los Alamos hepatitis C virus sequence and human immunology databases: an expanding resource for antiviral research. Antivir. Chem. Chemother. (2007) 18:113–123.
18. Panjaworayan N., Roessner S.K., Firth A.E., et al. HBVRegDB: annotation, comparison, detection and visualization of regulatory elements in hepatitis B virus sequences. Virol. J. (2007) 4:136.
19. Combet C., Penin F., Geourjon C., et al. HCVDB: hepatitis C virus sequences database. Appl. Bioinformatics (2004) 3:237–240.
20. Kuiken C., Hraber P., Thurmond J., et al. The hepatitis C sequence database in Los Alamos. Nucleic Acids Res. (2008) 36:D512–D516.
21. Pan C., Kim J., Chen L., et al. The HIV positive selection mutation database. Nucleic Acids Res. (2007) 35:D371–D375.
22. Araujo L.V., Soares M.A., Oliveira S.M., et al. DBCollHIV: a database system for collaborative HIV analysis in Brazil. Genet. Mol. Res. (2006) 5:203–215.
23. Doherty R.S., De Oliveira T., Seebregts C., et al. BioAfrica's HIV-1 proteomics resource: combining protein data with bioinformatics tools. Retrovirology (2005) 2:18.
24. Xiang Z., Todd T., Ku K.P., et al. VIOLIN: vaccine investigation and online information network. Nucleic Acids Res. (2008) 36:D923–D928.
25. Snyder E.E., Kampanya N., Lu J., et al. PATRIC: the VBI PathoSystems Resource Integration Center. Nucleic Acids Res. (2007) 35:D401–D406.
26. Zhou C.E., Smith J., Lam M., et al. MvirDB—a microbial database of protein toxins, virulence factors and antibiotic resistance genes for bio-defence applications. Nucleic Acids Res. 35:D391–D394.
27. Shepherd C.M., Borelli I.A., Lander G., et al. VIPERdb: a relational database for structural virology. Nucleic Acids Res. (2006) 34:D386–D389.
28. Naito Y., Ui-Tei K., Nishikawa T., et al. siVirus: web-based antiviral siRNA design software for highly divergent viral sequences. Nucleic Acids Res. (2006) 34:W448–W450.
29. Lundegaard C., Lamberth K., Harndahl M., et al. NetMHC-3.0: accurate web accessible predictions of human, mouse and monkey MHC class I affinities for peptides of length 8-11. Nucleic Acids Res. (2008) 36:W509–W512.
30. Yusim K., Richardson R., Tao N., et al. Los Alamos hepatitis C immunology database. Appl. Bioinformatics (2005) 4:217–225.
31. Valdivia-Granda W. The next meta-challenge for Bioinformatics. Bioinformation (2008) 2:358–362.
32. He Y., Vines R.R., Wattam A.R., et al. PIML: the Pathogen Information Markup Language. Bioinformatics (2005) 21:116–121.
