Discovery and revision of Arabidopsis genes by proteogenomics


Contributed by Steven P. Briggs, November 11, 2008

1N.E.C., S.H.P., and Z.S. contributed equally to this work. (received for review June 19, 2008)


Abstract

Gene annotation underpins genome science. Most often, protein-coding sequence is inferred from the genome based on transcript evidence and computational predictions. While generally correct, gene models suffer from errors in reading frame, exon border definition, and exon identification. To ascertain the error rate of Arabidopsis thaliana gene models, we isolated proteins from a sample of Arabidopsis tissues and determined the amino acid sequences of 144,079 distinct peptides by tandem mass spectrometry. The peptides corresponded to 1 or more of 3 different translations of the genome: a 6-frame translation, an exon splice-graph, and the currently annotated proteome. The majority of the peptides (126,055) resided in existing gene models (12,769 confirmed proteins), comprising 40% of annotated genes. Surprisingly, 18,024 novel peptides were found that do not correspond to annotated genes. Using the gene finding program AUGUSTUS and 5,426 novel peptides that occurred in clusters, we discovered 778 new protein-coding genes and refined the annotation of an additional 695 gene models. The remaining 13,449 novel peptides provide high quality annotation (>99% correct) for thousands of additional genes. Our observation that 18,024 of 144,079 peptides did not match current gene models suggests that 13% of the Arabidopsis proteome was incomplete due to approximately equal numbers of missing and incorrect gene models.

annotation, genomics, proteomics

A fundamental goal of genome projects is to generate a protein-coding catalog. Much of modern biological research depends on a complete and accurate proteome. Extensive proteomic catalogs have been developed through the integration of gene prediction algorithms, cDNA sequences, and comparative genomics (1, 2). As emerging research is incorporated into annotation pipelines and manual curation efforts, gene models continue to improve. High-throughput gene annotation pipelines use a variety of information sources, and benefit most significantly when new data contain information that is orthogonal to the kinds currently available (3).

Recent advances in chemistry and algorithms for peptide mass spectrometry have enabled the production of large proteomics datasets with broad coverage of the proteome (4–6). Proteogenomics (using proteomic information to annotate the genome) complements nucleotide-based annotation in that it unambiguously determines reading frame, translation start and stop sites, splice boundaries, and the validity of short ORFs. By combining DNA-based annotation with proteogenomics, an accurate and more complete protein-coding catalog can be obtained (6–10). With its clear potential for improving genome annotation, proteogenomics could be integrated with genome projects.

A recent publication by Baerenfaller et al. (4) demonstrated the ability of extensive resampling to provide good coverage of the Arabidopsis proteome. From 1,354 LC runs the authors identified 86,456 distinct peptides covering 13,029 proteins. In addition to providing an organ-specific proteome catalog, they demonstrated the ability of proteomics to refine plant genome annotation by presenting evidence for 57 new gene models, including 7 from intergenic regions not suspected to contain genes.

We reported a proteogenomic study of humans that described an exon splice graph that enabled efficient searches of potential coding sequences, including peptides that span splice junctions (6). We reasoned that we could extend the observations of Baerenfaller et al. deeper into the unmapped proteome by building an exon splice graph of Arabidopsis and obtaining a novel set of peptides. We used two strategies to obtain novel peptides. First, we used a nested 3D LC strategy to obtain much greater peptide separation, permitting a deeper sampling of the proteome. This is reflected by our yield of 144,079 distinct peptides from only 45 LC runs, with a false-discovery rate <1%. Second, we used TiO2 to enrich for phosphopeptides. Phosphorylated proteins are less abundant and are mostly missing from profiles of whole proteomes. Considering only cases in which we observed 2 or more previously non-annotated peptides mapping within 1 kb of each other, we discovered 1,473 new or revised genes; a model was generated for each using the gene finder AUGUSTUS (11). Two hundred eighty genes were previously unrecognized, 498 were previously annotated as pseudogenes, and 695 were revisions of known genes that were annotated in the wrong reading frame, with missing exons, or with incomplete exons. Extrapolating from our sample, we estimate that 13% of Arabidopsis protein-coding genes were either not yet identified or contained significant errors in their exon definitions. We have remedied ≈40% of these deficiencies.

Results

Coverage.

To achieve broad coverage of the proteome, we acquired 21 million mass spectra from protein extracts of 4 Arabidopsis organs (leaf, root, flower, and silique) and a cell culture (MM2d). In addition, phosphopeptides were enriched from MM2d proteins. Inspect was used to search spectra against 3 reference databases: TAIR7; a 6-frame translation of the genome; and an exon splice-graph that compactly encodes putative splicing events (6) (see Fig. 1 for an overview of the method). The data were filtered to a 1% cumulative false-discovery rate (FDR) at the spectrum level (Fig. S1). We required at least two peptides per protein for identification, so our 1% spectrum-level FDR provided an empirical, protein-level FDR of 0.6%.
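To make the 6-frame translation concrete, the following is a minimal Python sketch that translates a DNA string in the three forward and three reverse-complement frames using the standard genetic code. It is an illustration only, not the pipeline used in this study; the function names and the choice to render unknown codons as 'X' are assumptions made for the example.

```python
from itertools import product

# Standard genetic code, built from the canonical TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate one frame; codons not in the table (e.g. containing N) become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Map (strand, offset) to the translated sequence for all 6 frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for key, protein in six_frame_translation("ATGGCCTAA").items():
        print(key, protein)
```

A search database built this way contains every possible reading frame, which is why it can reveal peptides from unannotated loci, although it cannot represent peptides that span splice junctions; that is the role of the exon splice-graph.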

Fig. 1. Work flow. All mass spectra are compared with three databases using Inspect. Spectra are filtered to a 1% false discovery rate and grouped into peptides. Novel peptides are separated from those that appear in TAIR7 and clustered. It is important to note that only a subset of the novel peptides appear in a peptide cluster. Novel peptide clusters are then segregated based on genome location. Those that overlap a current gene model (intragenic) are further classified by how they refine the model. Peptides that do not overlap a gene model (intergenic) are classified by whether they overlap a pseudogene. The peptide clusters, along with evidence from cDNA and current gene annotations, are given to the gene predictor AUGUSTUS to produce new gene models. Not all peptides in the peptide clusters are included in the final AUGUSTUS models.

A total of 144,079 distinct peptides were mapped to at least 1 of our 3 Arabidopsis protein databases. Most (126,055 peptides) resided in TAIR7 gene models (12,769 proteins confirmed). We mapped 18,024 peptides not present in the TAIR7 annotation, including 4,018 peptides (22%) that were derived from mRNA splicing. Of these, 16,348 peptides mapped to single loci (i.e., “uniquely-located”) in the genome, whereas the rest were shared between 2 or more related proteins. The 6-frame translation and the spliced-exon databases contributed equally to the discovery of novel peptides and their contributions had little overlap; only 5% were found in both databases. This indicates that both types of database should be used for proteogenomic studies because they provide complementary novelty. Every reported peptide can be uploaded as a track in TAIR8. These files are available at http://peptide.ucsd.edu. The AUGUSTUS model building was restricted to nuclear genes and they encompass 2,873 novel peptides. These models can be accessed through Table S1 and from http://peptide.ucsd.edu.
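The peptide counts in this paragraph can be cross-checked with a few lines of arithmetic. The short Python sketch below uses only numbers quoted in the text; the variable names are illustrative, and reading "of these" as referring to the 18,024 novel peptides is an assumption.

```python
# Counts quoted in the text above.
total_peptides = 144_079
annotated_peptides = 126_055          # found in TAIR7 gene models
novel_peptides = 18_024               # not present in TAIR7
spliced_novel_peptides = 4_018        # novel peptides derived from mRNA splicing
uniquely_located_novel = 16_348       # assumed to be a subset of the novel peptides

# The annotated and novel sets partition the full peptide set.
assert annotated_peptides + novel_peptides == total_peptides

novel_fraction = novel_peptides / total_peptides
spliced_fraction = spliced_novel_peptides / novel_peptides
shared_novel = novel_peptides - uniquely_located_novel

print(f"novel peptides: {novel_fraction:.1%} of all peptides")
print(f"spliced novel peptides: {spliced_fraction:.1%} of novel peptides")
print(f"novel peptides shared between loci: {shared_novel}")
```

The first ratio is the basis for the roughly 13% estimate of incomplete or incorrect annotation given in the abstract.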

Novel Genes.

Using the protein identification standard of 2 peptides per protein, we focused on 1,765 novel peptide clusters containing 5,426 novel peptides, 4,575 of which are uniquely located (see SI Materials and Methods). An additional 6,361 novel peptides were observed outside of clusters but with a unique genomic location and a local FDR <0.05. These were not analyzed in detail. We classified novel peptide clusters according to their position relative to annotated protein-coding models. We defined intragenic clusters as those falling within the boundaries of a known protein-coding gene and intergenic clusters as those falling in the intergenic space (i.e., these indicate novel genes). Some of the novel clusters overlapped loci that had been annotated as non-coding pseudogenes (31% of the peptides or 1,420 peptides derived from 561 clusters) or genes that had not been recognized at all by gene finding programs or annotators (20% of the peptides or 905 peptides derived from 331 clusters).
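The clustering and classification step can be sketched as follows. This is a simplified Python illustration, not the authors' implementation: it groups peptides on the same chromosome when the gap to the preceding peptides is at most 1 kb (the distance criterion stated in the introduction), keeps groups of 2 or more, and labels a cluster intragenic if it overlaps any annotated gene interval. The `Peptide` tuple, function names, and the example coordinates are all hypothetical, and strand is ignored for brevity.

```python
from collections import namedtuple

# Hypothetical minimal record: chromosome name and 1-based genomic coordinates.
Peptide = namedtuple("Peptide", "chrom start end")


def cluster_peptides(peptides, max_gap=1000, min_size=2):
    """Group peptides lying within max_gap bp of each other on one chromosome."""
    clusters, current = [], []
    for pep in sorted(peptides):
        same_chrom = current and pep.chrom == current[-1].chrom
        if same_chrom and pep.start - max(p.end for p in current) <= max_gap:
            current.append(pep)
        else:
            if len(current) >= min_size:
                clusters.append(current)
            current = [pep]
    if len(current) >= min_size:
        clusters.append(current)
    return clusters


def classify(cluster, genes):
    """Return 'intragenic' if the cluster overlaps a gene (chrom, start, end), else 'intergenic'."""
    chrom = cluster[0].chrom
    lo = min(p.start for p in cluster)
    hi = max(p.end for p in cluster)
    for g_chrom, g_start, g_end in genes:
        if g_chrom == chrom and lo <= g_end and g_start <= hi:
            return "intragenic"
    return "intergenic"


if __name__ == "__main__":
    peps = [Peptide("Chr1", 100, 130), Peptide("Chr1", 900, 930),
            Peptide("Chr1", 5000, 5030), Peptide("Chr2", 100, 130),
            Peptide("Chr2", 400, 430)]
    genes = [("Chr1", 50, 2000)]  # made-up annotation
    for cluster in cluster_peptides(peps):
        print(classify(cluster, genes), cluster)
```

In this toy input the isolated peptide at Chr1:5000 is dropped as a singleton, mirroring the way singleton peptides are set aside from the cluster analysis.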

With our novel intergenic peptides, we defined 778 new genes consisting of 930 transcripts using the gene finder AUGUSTUS. Evidence from peptides plus EST alignments, and genomic conservation with rice, poplar, and Medicago, were given as “hints” to AUGUSTUS, which derives gene models that are in agreement with the hints and that have high likelihood in an ab initio probabilistic gene structure model. Resulting gene models include alternative splice variants, if suggested by the evidence. Of the 778 novel genes, 55 have EST and homology support, in addition to peptides; 455 genes have support by the peptides and ESTs; and 70 genes are supported by the peptides and homology only. The remaining 198 genes have no other support than the peptides. As an independent validation of our discoveries, 52 of the 778 loci have now been incorporated in the newest Arabidopsis genome release (TAIR8).

To discover homology with the novel genes, we excised the surrounding nucleotide sequence and searched against the nonredundant database of proteins (National Center for Biotechnology Information nr version 03/26/08) (Table S2). For 539 of the loci, the underlying sequence revealed a close homolog (e value <1 × 10^−10), providing additional validation and functional assignments for the new genes. Although many of the novel genes we discover are homologous to genes of unknown function, we highlight a novel gene involved in photosynthesis. Our predicted protein, supported by 13 novel and uniquely located peptides, aligns with proteins targeted to the chloroplast thylakoid lumen (e value 1 × 10^−75). It also contains the PsbP Pfam domain characteristic of photosystem II (Fig. 2). A second novel locus containing 4 uniquely located peptides on chromosome 4 shows strong similarity (e value 1 × 10^−85) with a heat-shock protein (AT4G12770).

Fig. 2. Novel gene discovered by proteogenomics. (A) A cluster of 13 uniquely located peptides that do not overlap a current gene model (Chr3). The prediction track shows the single exon gene model produced by AUGUSTUS. (B) The predicted sequence shows strong homology to a thylakoid lumen family protein (sp|P82658|TL19_ARATH). It also shows strong similarity to proteins in both grapevine (emb|CAO40861.1, a hypothetical gene) and rice (Os08g0504500, a cDNA-derived gene).

We also note several interesting structural features of the intergenic clusters. First, a significant fraction (64%) of intergenic clusters overlap annotated pseudogenes or transposons. An example of a translated pseudogene is at locus AT2G15040, ATRLP18: Receptor like protein 18, which has high homology to disease resistance proteins in both Arabidopsis and other plants. We identify 5 peptides, 3 of which are uniquely located at this locus, confirming translation. It is presumed that pseudogenes do not produce proteins, but transposons (which like pseudogenes are not typically included in the proteome) can contain active protein-coding genes. We find evidence in transposons of translated proteins that are unrelated to transposon activity. For example, we identified 3 peptides within the locus AT4G07947 (Fig. 3A). Although annotated as a pseudogene in TAIR7, it has been reclassified as a transposable element gene in TAIR8. The genomic region containing these peptides has high similarity to the ubiquitin-like protease (Ulp1) family in Arabidopsis (Fig. 3B), suggesting this may be a gene traveling as “cargo” with the transposable element (12).

Fig. 3. Peptides overlapping a predicted transposable element gene. (A) Five peptides, 4 unique, overlap locus AT4G07947, which is annotated as a transposable element gene. (B) Sequence alignment to an Arabidopsis Ulp1 (ubiquitin-like protease) showing strong conservation (56% identity, e value 0.0). Observed peptides are highlighted.

Since the release of TAIR7, there have been several community annotation efforts, including publication of short ORFs (13). We compared the novel peptides for overlap with this set. Hanada et al. (13) reported that 7,442 non-annotated small ORFs in Arabidopsis are transcribed. Our peptides confirm the translation of 155 of these predicted ORFs. An additional 85 ORFs overlap at least 1 of our novel peptides, but the peptides indicate that the frame of the ORF may be incorrect.

Refined Gene Models.

In addition to the novel genes, we discovered peptide clusters overlapping annotated gene models, suggesting refinement of the existing annotation, e.g., a new exon, exon boundary change, exon skipping, or modified translation boundaries. The refinement events can be classified according to their type, location, and the transcript being modified (Table S3). A majority (521) of the events are novel exons, of which 314 are located within introns and 207 are in untranslated regions (UTRs) of TAIR7 gene models. Of the 314 instances of novel coding sequence predicted between 2 exons of a gene, 26 are observed in the same frame as both adjacent exons and may indicate a single exon, a portion of which may be spliced out in some isoforms. Exon boundary changes were also prevalent, with typical instances including 5′ extension of the first exon and alternative donor/acceptor splice sites. We find evidence for 180 instances of exon extension and 191 instances of exon shortening. In 5 transcripts, peptide evidence supports an exon skipping event. Some intragenic loci indicate gene extension beyond the borders of the annotated gene model; 323 of these gene extension events were discovered. Using AUGUSTUS to refine existing models with the new peptide evidence, we predict 964 new or altered transcripts in 695 genes.

It is difficult to determine whether a new transcript predicted by AUGUSTUS is a refinement of a gene model or an additional isoform of an alternatively spliced gene. To better distinguish between these two cases, we compared each new transcript and the TAIR7 transcripts to all available cDNA evidence. For 122 genes, EST evidence, in addition to the peptide evidence, was found in support of the new transcript; no ESTs were found in support of the TAIR transcript. For an additional 130 genes, EST evidence was found to support both the new transcript and the TAIR7 gene models, suggesting that the peptides are produced by a newly discovered isoform.

To provide additional support to our gene refinements, we excised the predicted amino acid sequence surrounding a novel cluster and searched for homology to the nonredundant database of proteins (National Center for Biotechnology Information nr version 03/26/08) (Table S2). For 348 loci, we found a close homolog (e value <1 × 10^−10). Several genes that have been extensively studied are included among the refined gene models. For example, we found an additional 200-aa exon in the 5′ UTR of MAPK phosphatase (AT3G55270; Fig. S2). Also, we identified 8 peptides corresponding to 4 missing or mispredicted exons at locus AT1G79920 (heat shock protein 70). The new sequence completes the canonical HSP70 Pfam domain. A final example is the gene PMI1 (AT1G42550), which, when mutated, results in impaired plastid movement and localization (14). We found 6 peptides upstream of the annotated start codon, providing at least 130 aa of additional sequence.

We identified 70 cases in which the annotated reading frame is different from the observed peptides (Table S4). Assignment of reading frame is particularly difficult for nucleotide-based genome annotation (e.g., cDNA). However, proteomic evidence unambiguously defines the frame of translation. The 70 frame corrections are supported by multiple peptides and extensive homology to other proteins (see SI Materials and Methods). We will use two proteins to illustrate: first, a whole gene frame correction; and second, a partial gene correction. Locus AT3G22240 is a 51-aa protein with no discernible homologs. Four of our peptides indicate translation in a different frame than has been annotated. Translation in the new reading frame yields a protein with high sequence identity to PCC1, pathogen and clock controlled protein. The second example is AT1G63500, a protein kinase, which has 4 novel peptides in the annotated 5′ UTR. These peptides point to a large expansion of the gene and a misprediction of the current first exon (Fig. 4).
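One simple way to see how peptides expose a frame error is to compare the genomic frame of each peptide with that of the annotated coding exon. The Python sketch below assumes forward-strand, 1-based coordinates and a peptide lying in the same contiguous stretch as the annotated exon (no intervening intron); the function names and the coordinates are made up for illustration and do not come from the study.

```python
def forward_frame(start):
    """Frame (1, 2, or 3) of a forward-strand codon that begins at 1-based position start."""
    return (start - 1) % 3 + 1


def frame_conflicts(peptide_starts, exon_cds_start):
    """Return the peptide start positions whose frame differs from the annotated exon's frame."""
    exon_frame = forward_frame(exon_cds_start)
    return [s for s in peptide_starts if forward_frame(s) != exon_frame]


if __name__ == "__main__":
    # Hypothetical coordinates: the annotated exon starts in frame 3,
    # while two of the three peptides fall in frame 2.
    annotated_start = 1002
    peptides = [1005, 1010, 1100]
    print("exon frame:", forward_frame(annotated_start))
    print("conflicting peptides:", frame_conflicts(peptides, annotated_start))
```

A real analysis would also need to handle the reverse strand and exon phase across introns, which this sketch deliberately omits.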

Fig. 4. Refined Gene Model. TAIR locus AT1G63500 encodes a protein kinase. (A) Four novel peptides map within the 5′ UTR and the first exon. (B) Zoom of the region shows that the current first exon (frame 3) is out of frame with the peptides (frame 2). (C) Sequence alignment with Arabidopsis and grapevine proteins supports translation in the frame supported by peptides (observed peptides highlighted in alignment).

In addition to the peptides that are described above, 3,534 uniquely located singleton peptides with high confidence (lFDR <0.05) overlap genes and indicate refinement events. These peptides (see Table S5) likely indicate corrections to gene models and are a starting point for further investigation.

Similarly, 2,827 singleton peptides (also uniquely located and with lFDR <0.05) are found in intergenic regions. Some of the peptides may be mis-annotations; however, subsequent work has indicated that many are correct: 665 peptides are contained in ORFs with strong sequence similarity to known proteins (BLAST e value <1 × 10^−10). Spectral counts are also an indication of the strength of an annotation; 291 peptides have higher spectral counts than 50% of all peptides identified in this study. The intergenic peptides indicate novel coding regions that may have produced a single detectable peptide for several reasons, including protein composition or protein length.

Validated Gene Models.

In addition to discovering new protein-coding loci, we identified 126,055 distinct peptides (1.72 million amino acids) that confirmed annotated gene models for 12,769 proteins (40% of the TAIR7 genes). Our claims of coverage are conservative. We count only proteins covered by at least two peptides, one of which must uniquely map to the designated locus. A total of 11,801 peptides were lone supporters of proteins or shared peptides, and therefore were not counted toward the confirmed proteins. Of the sequenced peptides, 87% map to a unique genomic location, unambiguously identifying 10,692 proteins. In addition, we observed proteins from highly homologous gene groups that could not be attributed to a single locus (see SI Materials and Methods). The Arabidopsis genome has high rates of tandem and segmental duplication, and many loci contain multiple gene predictions that differ only in the non-translated regions (15). We observed peptides from 913 groups of indistinguishable proteins (2,077 proteins), bringing the total confirmed gene models to 12,769.
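The counting rule described above (at least two peptides per protein, at least one of them unique) can be expressed as a short Python sketch. This is an approximation for illustration: uniqueness is judged here by protein identifier rather than by genomic locus, the handling of indistinguishable protein groups is omitted, and the function name and toy data are hypothetical.

```python
from collections import defaultdict


def confirmed_proteins(peptide_to_proteins, min_peptides=2):
    """Return proteins with >= min_peptides peptides, at least one mapping to that protein only.

    peptide_to_proteins maps a peptide sequence to the set of protein IDs containing it.
    """
    total = defaultdict(int)    # all peptides per protein
    unique = defaultdict(int)   # peptides that map to exactly one protein
    for proteins in peptide_to_proteins.values():
        for prot in proteins:
            total[prot] += 1
        if len(proteins) == 1:
            unique[next(iter(proteins))] += 1
    return {p for p in total if total[p] >= min_peptides and unique[p] >= 1}


if __name__ == "__main__":
    hits = {
        "PEPA": {"P1"},          # unique to P1
        "PEPB": {"P1", "P2"},    # shared
        "PEPC": {"P2", "P3"},    # shared
        "PEPD": {"P4"},          # lone supporter of P4
    }
    print(confirmed_proteins(hits))

    # Totals quoted in the text: uniquely identified proteins plus grouped proteins.
    print(10_692 + 2_077)
```

In the toy data only P1 is confirmed: P2 has two peptides but none unique, and P4 has a single lone peptide, which corresponds to the peptides the text excludes from the confirmed count.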

Splicing.

It is difficult to estimate the true extent of alternative splicing, given that the alternative splice forms are often not as highly expressed, and might not be sampled. However, our deep proteogenomic sampling revealed a total of 47 genes in which multiple splice forms were observed (see Figure S3 for observed splice types). We estimate (see SI Materials and Methods) that with high probability, the number of genes with alternative splice forms is between 6,718 and 8,983. This is considerably higher than the number of alternatively spliced genes in TAIR7 (3,799) and the number recently predicted by cDNA and ESTs (4,707 at the transcript level) (16).

Discussion

In tandem mass spectrometry, a peptide (from an enzymatic digestion of a protein mixture) is fragmented, usually through collisions. While the physics of the fragmentation is incompletely understood, the fragmentation pattern is consistent, and the collection of fragments (the spectrum) can be used to “fingerprint” the peptide. Recent advances in mass resolution and the availability of software tools to analyze spectra make mass spectrometry the tool of choice for proteomics. Nevertheless, technological limitations create many challenges for the approach.

First, the sampled peptides are biased toward the more abundant proteins in the cell. To comprehensively sample the proteome, a diversity of samples must be assayed (see SI Materials and Methods). Second, incomplete fragmentation patterns and spectral noise “smudge” the fingerprint and introduce errors in peptide identification. Additionally, identification is typically based upon looking up a database of known peptides to pick the most likely candidate. If the true peptide is not in the database, it will not be identified. Finally, posttranslational modifications change the mass and pattern of the fragments, making identification harder. Our study addresses each of these issues. Broad sampling of the proteome was achieved through assaying multiple plant organs and phosphopeptide-enriched peptides. We address identification error rates through the introduction of a local false discovery rate. The genome is explicitly and thoroughly queried for potential protein-coding sequences. Finally, we use a phosphopeptide spectra-specific algorithm for sensitive and efficient annotation of phosphorylated peptides.

The database search tool we used, Inspect, contributed significantly to our ability to extensively annotate spectra. Inspect's Bayesian scoring function is more sensitive than that of SEQUEST, annotating more spectra at a given false-discovery rate. The exon splice-graph database allowed us to detect peptides that span splice boundaries. Our experimental techniques enabled a sampling of the phosphoproteome, which typically contains low abundance proteins. We used 3D LC, which provides much greater resolution and renders unnecessary the extensive resampling that is typical of LC ESI MS/MS experiments. To illustrate, we identified 67% more total peptides using only 3% as many LC runs compared with a study based on resampling (4). We used our novel peptides and an automated gene prediction pipeline to derive 1,473 new and revised gene models. The technical advances reported here dramatically reduce the time and cost required to obtain deep proteome coverage.
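The comparison with the resampling-based study can be reproduced from figures given earlier in the text (86,456 peptides from 1,354 LC runs in ref. 4, versus 144,079 peptides from 45 LC runs here). The Python sketch below is simple arithmetic on those quoted numbers; the variable names are illustrative.

```python
# Figures quoted in the text for the two studies.
this_study_peptides, this_study_runs = 144_079, 45
resampling_peptides, resampling_runs = 86_456, 1_354

# Relative gain in distinct peptides and relative number of LC runs, in percent.
extra_peptides_pct = (this_study_peptides / resampling_peptides - 1) * 100
run_fraction_pct = this_study_runs / resampling_runs * 100

print(f"additional peptides: {extra_peptides_pct:.0f}%")
print(f"LC runs relative to resampling study: {run_fraction_pct:.0f}%")
print(f"peptides per run, this study: {this_study_peptides / this_study_runs:.0f}")
print(f"peptides per run, resampling study: {resampling_peptides / resampling_runs:.0f}")
```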

Historically, the proteomic and genomic communities have operated independently, with the genomic community in charge of annotation efforts. The predicted proteome is then passed over to the proteomics community for validation and identification of posttranslational events. We assert that much is to be gained by joining forces and incorporating proteomic evidence upfront into the genomics pipelines. Proteogenomics provides an orthogonal data source to predict gene models, with levels of sensitivity that are complementary to cDNA sequencing. By investing in proteogenomics to complement more traditional cDNA and EST data at the onset of genome annotation, a more complete and accurate proteome can be achieved even in the early releases. Here, we provide proteomic evidence for 778 new genes and refine 695 current gene models, using the reference annotation from TAIR7. Recently, TAIR has released the next revision of the genome/proteome, TAIR8. Only a small number of our novel peptides (3%) appear in the TAIR8 release, indicating that the proteogenomic approach is complementary to computer-based annotation.

Materials and Methods

Sample Chemistry.

In total, 21,170,989 MS/MS spectra were collected from 45 LC-MS/MS profiles of Arabidopsis organ samples (leaf, root, flower, and silique) and MM2d cells. Frozen organs were ground in 50 mL of cold (−20 °C) methanol containing 0.2 mM Na3VO4, incubated, and then spun down at 4,000 × g for 5 min. Two more methanol washes were followed by 3 acetone washes. Sample pellets were dried and proteins were extracted by dissolving in 1 mL of 0.2% RapiGest (Waters) with 0.2 mM Na3VO4. Contaminants were spun down and discarded. Cysteines were reduced and alkylated and then proteins were digested with trypsin. Peptides were separated using 2D-LC and charged with electrospray ionization. Spectra were acquired using LTQ linear ion trap tandem mass spectrometers. The data associated with this manuscript may be downloaded from Tranche (http://tranche.proteomecommons.org) using the following hash: eTyqbeZEgF7KOZNqcE0OAbFGAmrIzV1xKx4OCC0CJN9A1MwZmuP2drhEsT+7XohMx8FM8wtckHv7, mqSnWHLhVuGmrsYAAAAAAASfeg==. The Tranche hash can also be used to verify that files have not changed since publication.

Database Construction and Use.

For gene model confirmation, we used the TAIR7 release of the Arabidopsis proteome (www.arabidopsis.org). For proteomic discovery, we constructed a 6-frame translation of the Arabidopsis genome and a spliced-exon graph containing ab initio gene predictions from AUGUSTUS. For the MS/MS searches, all three databases were combined with decoy sequences formed by shuffling each target sequence and then searched using Inspect. All results are filtered to a 1% spectrum-level false discovery rate using the decoy database strategy. We report proteins with two or more peptides, and at least 1 uniquely mapped peptide. For protein groups that have exactly identical coding sequences, we report the group of proteins, because they share all peptides and do not have any uniquely mapped peptides. Our 1% spectrum-level FDR translated to an empirical 0.6% protein-level FDR. The source code for Inspect is available at our lab website, http://peptide.ucsd.edu.
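The decoy database strategy can be illustrated with a minimal Python sketch: decoys are made by shuffling each target sequence, and a score cutoff is chosen so that the estimated FDR among accepted matches stays at or below the desired level. The estimator used here (decoy matches divided by target matches above the cutoff) is one common convention and is an assumption, as are the function names; this is not Inspect's scoring or filtering code, and it assumes distinct scores.

```python
import random


def shuffled_decoy(sequence, rng):
    """Return a decoy with the same residue composition as the target sequence."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def score_threshold_at_fdr(matches, max_fdr=0.01):
    """Return the lowest score cutoff with estimated FDR <= max_fdr, or None.

    matches is an iterable of (score, is_decoy) pairs, one per spectrum match.
    FDR is estimated as (#decoy matches) / (#target matches) at or above the cutoff.
    """
    best = None
    targets = decoys = 0
    for score, is_decoy in sorted(matches, key=lambda m: m[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            best = score
    return best


if __name__ == "__main__":
    rng = random.Random(0)
    print(shuffled_decoy("MKTAYIAKQR", rng))  # made-up target sequence

    # Toy matches: (score, is_decoy)
    toy = [(10, False), (9, False), (8, True), (7, False), (6, False)]
    print(score_threshold_at_fdr(toy, max_fdr=0.5))
```

Because shuffling preserves amino acid composition and length, decoy matches approximate the rate at which random sequences score well, which is what makes the ratio a usable error estimate.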

For further details, please refer to SI Materials and Methods.

Acknowledgments

We thank Anand Patel for his efforts in homology searches and alignment visualization. This work was supported by National Science Foundation IGERT Plant Systems Biology Training Grant DGE-0504645 (to S.H.P.); National Science Foundation Grant IBN 0619411 (to Z.S. and S.P.B.); National Institutes of Health Grants R01-RR16522 and 1P41RR024851-01 (to V.B. and N.E.C.); and Deutsche Forschungsgemeinschaft Grant STA 1009/4-1 (to M.S.).

Footnotes

2To whom correspondence may be addressed. E-mail: vbafna{at}cs.ucsd.edu or sbriggs{at}ucsd.edu

Author contributions: S.P.B. designed research; N.E.C., S.H.P., Z.S., and M.S. performed research; N.E.C., S.H.P., Z.S., and V.B. contributed new reagents/analytic tools; N.E.C., S.H.P., Z.S., M.S., V.B., and S.P.B. analyzed data; and N.E.C., S.H.P., Z.S., and S.P.B. wrote the paper.

The authors declare no conflict of interest.

Data deposition: The spectra reported in this article have been deposited in the Tranche database, http://tranche.proteomecommons.org (hash eTyqbeZEgF7KOZNqcE0OAbFGAmrIzV1xKx4OCC0CJN9A1MwZmuP2drhEsT+7XohMx8FM8wtckHv7, mqSnWHLhVuGmrsYAAAAAAASfeg==).

This article contains supporting information online at www.pnas.org/cgi/content/full/0811066106/DCSupplemental.

Freely available online through the PNAS open access option.

© 2008 by The National Academy of Sciences of the USA

References

1. Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES (2003) Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423:241–254.
2. Lin MF, et al. (2007) Revisiting the protein-coding gene catalog of Drosophila melanogaster using 12 fly genomes. Genome Res 17:1823–1836.
3. Brent MR (2008) Steady progress and recent breakthroughs in the accuracy of automated genome annotation. Nat Rev Genet 9:62–73.
4. Baerenfaller K, et al. (2008) Genome-scale proteomics reveals Arabidopsis thaliana gene models and proteome dynamics. Science 320:938–941.
5. Brunner E, et al. (2007) A high-quality catalog of the Drosophila melanogaster proteome. Nat Biotechnol 25:576–583.
6. Tanner S, et al. (2007) Improving gene annotation using peptide mass spectrometry. Genome Res 17:231–239.
7. Savidor A, et al. (2006) Expressed peptide tags: An additional layer of data for genome annotation. J Proteome Res 5:3048–3058.
8. Gupta N, et al. (2007) Whole proteome analysis of post-translational modifications: Applications of mass-spectrometry for proteogenomic annotation. Genome Res 17:1362–1377.
9. Fermin D, et al. (2006) Novel gene and gene model detection using a whole genome open reading frame analysis in proteomics. Genome Biol 7:R35.
10. Desiere F, et al. (2005) Integration with the human genome of peptide sequences obtained by high-throughput mass spectrometry. Genome Biol 6:R9.
11. Stanke M, et al. (2006) AUGUSTUS: Ab initio prediction of alternative transcripts. Nucleic Acids Res 34:W435–W439.
12. Jiang B, Bao Z, Zhang X, Eddy SR, Wessler SR (2004) Pack-MULE transposable elements mediate gene evolution in plants. Nature 431:163–167.
13. Hanada K, Zhang X, Borevitz JO, Li WH, Shiu SH (2007) A large number of novel coding small open reading frames in the intergenic regions of the Arabidopsis thaliana genome are transcribed and/or under purifying selection. Genome Res 17:632–640.
14. DeBlasio SL, Leusse DL, Hangarter RP (2005) A plant-specific protein essential for blue-light-induced chloroplast movements. Plant Physiol 139:101–114.
15. Cannon SB, Mitra A, Baumgarten A, Young ND, May G (2004) The roles of segmental and tandem gene duplication in the evolution of large gene families in Arabidopsis thaliana. BMC Plant Biol 4:10.
16. Wang B, Brendel V (2006) Genomewide comparative analysis of alternative splicing in plants. PNAS 103:7175–7180.