Abstract

In recent years, thousands of Saccharomyces cerevisiae genomes have been sequenced to varying degrees of completion. The Saccharomyces Genome Database (SGD) has long been the keeper of the original eukaryotic reference genome sequence, which was derived primarily from S. cerevisiae strain S288C. Because new technologies are pushing S. cerevisiae annotation past the limits of any system based exclusively on a single reference sequence, SGD is actively working to expand the original S. cerevisiae systematic reference sequence from a single genome to a multi-genome reference panel. We first commissioned the sequencing of additional genomes and their automated analysis using the AGAPE pipeline. Here we describe our curation strategy to produce manually reviewed high-quality genome annotations in order to elevate 11 of these additional genomes to Reference status.

Database URL : http://www.yeastgenome.org/

Introduction

Recent advances in sequence technology have led to an explosion of available sequence data, and thousands of Saccharomyces cerevisiae genomes have been sequenced to varying degrees of completion in just the past few years ( 1 ). These genomes come from a variety of laboratory and commercial strains, as well as from clinical and environmental isolates. More than 100 of these genomes have been assembled to the level of scaffold or chromosome (see the Genome database at the National Center for Biotechnology Information (NCBI); http://www.ncbi.nlm.nih.gov/genome/genomes/15 ), and have been deposited in the primary sequence databases (GenBank, ENA, DDBJ) that make up the International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org ).

The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/ ) has long been the keeper of the first eukaryotic genome sequence, which was derived primarily from S. cerevisiae strain S288C and published in 1996 as the output of several years of collaborative work of an international consortium of researchers ( 2 ). This ‘reference genome’ was produced to serve as a single consensus representative S. cerevisiae genome sequence against which all other sequences could be measured. Since that time, SGD has made that original eukaryotic reference sequence and its annotation freely available to researchers around the world, who have thoroughly studied the sequence and put it to use in many great scientific discoveries and breakthroughs ( 3 ).

We are now firmly in the modern era of yeast genomics, in which the rich variety of available genomic sequences has changed the way we study genomes. Comparative genomics is providing clear pictures of the full constituent parts of species’ genomes, which vary not only in nucleotide sequence, but also in gene complements and chromosome architecture. Although much work in yeast genomics has focused on S288C ( 4 ) and its derivatives ( 5 ), a number of different strains are more informative for distinct areas of study [e.g. W303 for aging ( 6 ), SK1 for meiosis and sporulation ( 7 )], and are popularly used in research because their distinct phenotypes.

New technologies have been quickly pushing S. cerevisiae annotation well past the limits of any system based exclusively on a single reference sequence. Therefore, SGD is actively expanding its sequence curation efforts to provide a 12-genome reference panel that allows us to more comprehensively annotate the genetic background studied in experiments for which a strain of known provenance is reported. The 11 additional high-profile yeast genomes were selected because they are widely studied, and have sizable amounts of published experimental and phenotypic data available. When combined with high quality sequence information, these detailed phenotype and functional annotations will allow researchers to fully understand the specific genomic basis of phenotypic variation. Here we present a description of the curation strategy currently in use to expand the original S. cerevisiae systematic reference sequence from a single highly-studied genome to an expertly curated multi-genome reference panel, as shown in Figure 1 .

Figure 1.

Curation strategy currently in use at SGD to expand the original Saccharomyces cerevisiae systematic reference sequence from a single highly-curated genome to an expertly curated multi-genome reference panel.

Materials and Methods

We recently updated the S288C reference sequence ( 8 ), and then commissioned the sequencing of additional genomes and their automated analysis using the Automated Genome Analysis PipelinE (AGAPE) pipeline (9; NCBI BioProject Accession PRJNA260311). AGAPE accepts raw sequence reads as input, generates de novo assembly scaffolds and contigs, then integrates the steps of open reading frame (ORF) annotation and sequence variation calling. The original AGAPE output is available at http://downloads.yeastgenome.org/published_datasets/Song_2015_PMID_25781462/ . We are now applying the following curation strategy to produce manually reviewed high-quality genome annotations in order to elevate 11 of these additional genomes to Reference status. The 11 strains from which these genome sequences were derived have recently been described elsewhere ( 1 , 9 ). These strains were selected because they have a substantial history of use and experimental results, and also because they are the genomes for which we have the most curated phenotype data and for which we aim to curate specific functional information.

The curation strategy

Step 1: boundary differences. We first identified ORF calls for which the predicted translation start and/or stop differed from the S288C reference, then reviewed them manually and corrected as necessary. These annotation changes and ORF revisions were performed in the same manner that we have used in the past to update the S288C systematic reference sequence, and which we have described previously ( 10 ). For the 11 strains, a total of 1009 such differences were identified ( Table 1 ). Boundary differences fell into three main categories: contig artifacts, intron errors and valid variation. Contig artifacts included ORFs running off the end of contigs and/or ambiguous sequence. In these cases, the pipeline called the closest available in-frame start or stop. Errors due to introns were overwhelmingly in genes that code for ribosomal proteins, especially those in which the leading coding portion is ∼20 nucleotides or less (e.g. RPL2A /YFR031C-A and RPL2B /YIL018W). The remainder of the boundary differences appear valid upon viewing the sequence; we know that some start and stop codons differ between strains (e.g. AQY1 /YPR192W as described in Song et al. , ( 15 )). As further experimental data become available, we will refine these annotations appropriately.

Table 1.

Numbers of automated ORF annotations for 11 different Saccharomyces strains for which the predicted translation start and/or stop generated by the AGAPE sequence analysis pipeline ( 9 ) differed from the S288C reference

Automated ORFs calls ORF boundary differences relative to strain S288C
StrainProvenanceAccessionStartStopBothTotal
CEN.PKLab strainJRIV00000000537935193993
D273-10BLab strainJRIY00000000538337184095
FL100Lab strainJRIT00000000536629213484
JK9-3dLab strainJRIZ00000000538540113586
RM11-1aVineyardJRIP00000000532336173083
SEY6210Lab strainJRIW00000000540044232693
Sigma1278bLab strainJRIQ00000000535831202879
SK1Lab strainJRIH00000000535038223292
W303Lab strainJRIU000000005397542433111
X2180-1ALab strainJRIX00000000538737243596
Y55Lab strainJRIF00000000535939263297
Total4202253641009
Automated ORFs calls ORF boundary differences relative to strain S288C
StrainProvenanceAccessionStartStopBothTotal
CEN.PKLab strainJRIV00000000537935193993
D273-10BLab strainJRIY00000000538337184095
FL100Lab strainJRIT00000000536629213484
JK9-3dLab strainJRIZ00000000538540113586
RM11-1aVineyardJRIP00000000532336173083
SEY6210Lab strainJRIW00000000540044232693
Sigma1278bLab strainJRIQ00000000535831202879
SK1Lab strainJRIH00000000535038223292
W303Lab strainJRIU000000005397542433111
X2180-1ALab strainJRIX00000000538737243596
Y55Lab strainJRIF00000000535939263297
Total4202253641009
Table 1.

Numbers of automated ORF annotations for 11 different Saccharomyces strains for which the predicted translation start and/or stop generated by the AGAPE sequence analysis pipeline ( 9 ) differed from the S288C reference

Automated ORFs calls ORF boundary differences relative to strain S288C
StrainProvenanceAccessionStartStopBothTotal
CEN.PKLab strainJRIV00000000537935193993
D273-10BLab strainJRIY00000000538337184095
FL100Lab strainJRIT00000000536629213484
JK9-3dLab strainJRIZ00000000538540113586
RM11-1aVineyardJRIP00000000532336173083
SEY6210Lab strainJRIW00000000540044232693
Sigma1278bLab strainJRIQ00000000535831202879
SK1Lab strainJRIH00000000535038223292
W303Lab strainJRIU000000005397542433111
X2180-1ALab strainJRIX00000000538737243596
Y55Lab strainJRIF00000000535939263297
Total4202253641009
Automated ORFs calls ORF boundary differences relative to strain S288C
StrainProvenanceAccessionStartStopBothTotal
CEN.PKLab strainJRIV00000000537935193993
D273-10BLab strainJRIY00000000538337184095
FL100Lab strainJRIT00000000536629213484
JK9-3dLab strainJRIZ00000000538540113586
RM11-1aVineyardJRIP00000000532336173083
SEY6210Lab strainJRIW00000000540044232693
Sigma1278bLab strainJRIQ00000000535831202879
SK1Lab strainJRIH00000000535038223292
W303Lab strainJRIU000000005397542433111
X2180-1ALab strainJRIX00000000538737243596
Y55Lab strainJRIF00000000535939263297
Total4202253641009

Step 2: multiple calls and paralogs. We examined ORFs that had been called on more than one contig ( Table 2 ). Through manual review we selected the best call for each ORF in this set based on contig size and sequence quality, then discarded the unneeded duplicate, triplicate or quadruplicate annotations. We also looked at calls for ORFs that are members of paralogous pairs, which accounted for many of the duplicates. The S. cerevisiae genome contains over 500 sets of paralogs ( 11–13 ). Often the automated pipeline annotated both paralogous ORFs as the same paralog. These annotation errors are easily identifiable based on up- and downstream neighbors. The majority of ORFs identified erroneously during the automated annotation were those coding for ribosomal proteins (e.g. RPL33A /YPL143W and RPL33B /YOR234C), chaperones (e.g. SSB1 /YDL229W and SSB2 /YNL209W) and transporters (e.g. GEX1 /YCL073C and GEX2 /YKR106W; VBA3 /YCL069W and VBA5 /YKR105C).

Table 2.

Numbers of ORFs in 11 different S. cerevisiae strains that the AGAPE sequence analysis pipeline ( 9 ) called on more than one contig

ORFs called on multiple contigs
StrainTwo contigsThree contigsFour contigsTotal
CEN.PK153422
D273-10B146323
FL100124319
JK9-3d83314
RM11-1a123318
SEY6210193426
Sigma1278b142319
SK1152522
W303361239
X2180-1A152320
Y55252532
Total1853138254
ORFs called on multiple contigs
StrainTwo contigsThree contigsFour contigsTotal
CEN.PK153422
D273-10B146323
FL100124319
JK9-3d83314
RM11-1a123318
SEY6210193426
Sigma1278b142319
SK1152522
W303361239
X2180-1A152320
Y55252532
Total1853138254
Table 2.

Numbers of ORFs in 11 different S. cerevisiae strains that the AGAPE sequence analysis pipeline ( 9 ) called on more than one contig

ORFs called on multiple contigs
StrainTwo contigsThree contigsFour contigsTotal
CEN.PK153422
D273-10B146323
FL100124319
JK9-3d83314
RM11-1a123318
SEY6210193426
Sigma1278b142319
SK1152522
W303361239
X2180-1A152320
Y55252532
Total1853138254
ORFs called on multiple contigs
StrainTwo contigsThree contigsFour contigsTotal
CEN.PK153422
D273-10B146323
FL100124319
JK9-3d83314
RM11-1a123318
SEY6210193426
Sigma1278b142319
SK1152522
W303361239
X2180-1A152320
Y55252532
Total1853138254

Step 3: superfluous contigs. Due to the nature of high-throughput sequencing, a number of redundant contigs were generated for each of the 11 genomes. Because they unnecessarily complicate the genome annotation, they have been removed from the sequence files, reducing the number of contigs by more than half ( Table 3 ). These contigs included those on which no genes were called, most often due to short overall length (e.g. JRIP01000320.1 from strain RM11-1A) or ambiguous sequence (e.g., JRIT01000109.1 in strain FL100).

Table 3.

Numbers of contigs for 11 different S. cerevisiae strains in the original automated output from the AGAPE sequence analysis pipeline ( 9 ) and in the curated contig set after manual review

Contig set
StrainOriginalCurated
CEN.PK389189
D273-10B403203
FL100402174
JK9-3d431197
RM11-1a325169
SEY6210366183
Sigma1278b451206
SK1389214
W303415236
X2180-1A409212
Y55413198
Total43932181
Contig set
StrainOriginalCurated
CEN.PK389189
D273-10B403203
FL100402174
JK9-3d431197
RM11-1a325169
SEY6210366183
Sigma1278b451206
SK1389214
W303415236
X2180-1A409212
Y55413198
Total43932181
Table 3.

Numbers of contigs for 11 different S. cerevisiae strains in the original automated output from the AGAPE sequence analysis pipeline ( 9 ) and in the curated contig set after manual review

Contig set
StrainOriginalCurated
CEN.PK389189
D273-10B403203
FL100402174
JK9-3d431197
RM11-1a325169
SEY6210366183
Sigma1278b451206
SK1389214
W303415236
X2180-1A409212
Y55413198
Total43932181
Contig set
StrainOriginalCurated
CEN.PK389189
D273-10B403203
FL100402174
JK9-3d431197
RM11-1a325169
SEY6210366183
Sigma1278b451206
SK1389214
W303415236
X2180-1A409212
Y55413198
Total43932181

Step 4: RNAs and chromosomal elements. One limitation of the automated AGAPE annotation is that it focuses solely on protein-coding genes. Many other types of genes and chromosomal elements can be identified through BLAST, then verified and refined through manual curation. We have identified all 16 centromeres in all 11 strains, and are currently working to identify the many RNA genes (e.g. snRNAs, snoRNAs, tRNAs) and replication origins. This curation work will be ongoing over the coming months.

Step 5: the undefined. We are currently working to identify the 1699 ORFs that were marked ‘unidentifiable’ through the AGAPE pipeline ( Table 4 ). Some of these are indeed identifiable but need annotation updates to account for missed introns and coding segments (e.g. TDA5 /YLR426W). We expect that the majority of the undefined are actually known ORFs that will ultimately be deleted from the annotation due to ambiguous sequence (e.g. ALD2 /YMR170C) or because they are truncated at the ends of contigs, often in most or all 11 strains (e.g. SSA1 /YAL005C and COS7 /YDL248W). At least some of the undefined represent bona fide novel ORFs (e.g. YER065W-A in strain JK9-3d).

Table 4.

Numbers of ORFs 11 different S. cerevisiae strains that were marked as ‘unidentifiable’ in the original automated output from the AGAPE sequence analysis pipeline ( 9 ). These ORFs are currently undergoing manual review

StrainUndefined ORFs
CEN.PK169
D273-10B254
FL100128
JK9-3d121
RM11-1a344
SEY621069
Sigma1278b106
SK1124
W303158
X2180-1A78
Y55148
Total1699
StrainUndefined ORFs
CEN.PK169
D273-10B254
FL100128
JK9-3d121
RM11-1a344
SEY621069
Sigma1278b106
SK1124
W303158
X2180-1A78
Y55148
Total1699
Table 4.

Numbers of ORFs 11 different S. cerevisiae strains that were marked as ‘unidentifiable’ in the original automated output from the AGAPE sequence analysis pipeline ( 9 ). These ORFs are currently undergoing manual review

StrainUndefined ORFs
CEN.PK169
D273-10B254
FL100128
JK9-3d121
RM11-1a344
SEY621069
Sigma1278b106
SK1124
W303158
X2180-1A78
Y55148
Total1699
StrainUndefined ORFs
CEN.PK169
D273-10B254
FL100128
JK9-3d121
RM11-1a344
SEY621069
Sigma1278b106
SK1124
W303158
X2180-1A78
Y55148
Total1699

Steps 6 and 7: omissions and supercontigs. Curation work in the coming year will focus on annotating essential, conserved protein-coding genes that we expect are present in the other genomes but escaped automated annotation undetected. We will also assess several contig pairs for the possibility of combining them into supercontigs (e.g. contigs JRIU01000255.1 and JRIU01000122.1 from strain W303).

Future Directions

The expansion in SGD from a single reference genome to a multi-genome reference panel furthers yeast genomics research by providing easy access to alternative alleles and sequence variants ( 14 ). The automated annotations for the additional genomes, as produced by the AGAPE pipeline, have already been incorporated into SGD sequence and alignment pages, analysis tools such as BLAST, PatMatch, and the Variant Viewer, and are also available for download ( Table 5 ). The application of manual curation to automated output improves the quality and increases the depth and granularity of genome annotation. Curated sequence files and annotation will be incorporated into SGD and also submitted to NCBI’s GenBank primary sequence repository within the coming year. As further research by the scientific community provides updated information, we will incorporate improved annotations into future genome releases. Consideration will also be given in the future to the possibility of expanding the reference panel to accommodate emerging or underserved areas of study, as part of our continuing efforts to educate students, enable bench researchers and facilitate scientific discovery.

Funding

This work was supported by a U41 grant from the National Human Genome Research Institute at the US National Institutes of Health (HG001315) to the Saccharomyces Genome Database project. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Human Genome Research Institute or the National Institutes of Health.

Conflict of interest . None declared.

References

1

Engel
S.R.
Cherry
J.M.
(
2013
)
The new modern era of yeast genomics: community sequencing and the resulting annotation of multiple Saccharomyces cerevisiae strains at the Saccharomyces Genome Database
.
Database
,
2013
, doi: 10.1093/ database/bat012

2

Goffeau
A.
Barrell
B.G.
Bussey
H
. et al.  . (
1996
)
Life with 6000 genes
.
Science
,
274
,
546
567
.

3

Botstein
D.
Fink
G.R.
(
2011
)
Yeast: An experimental organism for 21st century biology
.
Genetics
,
189
,
695
704
.

4

Mortimer
R.K.
Johnston
J.R.
(
1986
)
Genealogy of principal strains of the yeast genetic stock center
.
Genetics
,
113
,
35
43
.

5

Winzeler
E.A.
Shoemaker
D.D.
Astromoff
A
. et al.  . (
1999
)
Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis
.
Science
,
285
,
901
906
.

6

Ralser
M.
Kuhl
H.
Ralser
M
. et al.  . (
2012
)
The Saccharomyces cerevisiae W303-K6001 cross-platform genome sequence: insights into ancestry and physiology of a laboratory mutt
.
Open Biol
.,
2
,
120093

7

Stuart
D.
(
2008
)
The meiotic differentiation program uncouples S-phase from cell size control in Saccharomyces cerevisiae.
Cell Cycle
,
7
,
777
786
.

8

Engel
S.R.
Dietrich
F.S.
Fisk
D.G
. et al.  . (
2014
)
The reference genome sequence of Saccharomyces cerevisiae : then and now
.
G3
,
4
,
389
398
.

9

Song
G.
Dickins
B.J.A.
Demeter
J
. et al.  . (
2015
)
AGAPE (Automated Genome Analysis PipelinE) for Pan-Genome Analysis of Saccharomyces cerevisiae.
PLoS One
,
10
,
e0120671

10

Fisk
D.G.
Ball
C.A.
Dolinski
K
. et al.  . (
2006
)
Saccharomyces cerevisiae S288C genome annotation: a working hypothesis
.
Yeast
,
23
,
857
865
.

11

Byrne
K.P.
Wolfe
K.H.
(
2005
)
The Yeast Gene Order Browser: combining curated homology and syntenic context reveals gene fate in polyploid species
.
Genome Res
.,
15
,
1456
1461
.

12

Koszul
R.
Dujon
B.
Fischer
G.
(
2006
)
Stability of large segmental duplications in the yeast genome
.
Genetics
,
172
,
2211
2222
.

13

Payen
C.
Di Rienzi
S.C.
Ong
G.T
. et al.  . (
2014
)
The dynamics of diverse segmental amplifications in populations of Saccharomyces cerevisiae adapting to strong selection
.
G3
,
4
,
399
409
.

14

Sheppard
T.K.
Hitz
B.C.
Engel
S.R
. et al.  . (
2016
)
The Saccharomyces Genome Database Variant Viewer
.
Nucleic Acids Res
,
44
,
D698
702
.

15

Song
G.
Balakrishnan
R.
Binkley
G
. et al.  . (
2016
)
Integration of new alternative reference strain genome sequences into the Saccharomyces Genome Database
.
Database
, in press.

Author notes

Citation details: Engel,S.R., Weng,S., Binkley,G., et al. From one to many: expanding the Saccharomyces cerevisiae reference genome panel. Database (2016) Vol. 2016: article ID baw020; doi:10.1093/database/baw020

This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.