A gene atlas of the mouse and human protein-encoding transcriptomes


Edited by Peter K. Vogt, The Scripps Research Institute, La Jolla, CA (received for review February 3, 2004)



The tissue-specific pattern of mRNA expression can indicate important clues about gene function. High-density oligonucleotide arrays offer the opportunity to examine patterns of gene expression on a genome scale. Toward this end, we have designed custom arrays that interrogate the expression of the vast majority of protein-encoding human and mouse genes and have used them to profile a panel of 79 human and 61 mouse tissues. The resulting data set provides the expression patterns for thousands of predicted genes, as well as known and poorly characterized genes, from mice and humans. We have explored this data set for global trends in gene expression, evaluated commonly used lines of evidence in gene prediction methodologies, and investigated patterns indicative of chromosomal organization of transcription. We describe hundreds of regions of correlated transcription and show that some are subject to both tissue and parental allele-specific expression, suggesting a link between spatial expression and imprinting.

The completion of the human and mouse genome sequences opened a historic era in mammalian biology. One common conclusion from these projects was the determination that mammals have only ≈30,000 protein-encoding genes (1, 2). Yet, despite the apparent tractability of this figure (earlier estimates were much higher), to date all existing research has determined the function of only a fraction of these genes. Currently, only ≈15,000 human and ≈10,000 mouse genes are described in the literature (Medline, www.ncbi.nih.gov/Pubmed). The challenge and opportunity for genomics strategies and techniques are to accelerate the functional annotation of novel genes from the uncharted genome.

High-throughput technologies for biological annotation have the capacity to partially address the discrepancy between the identification of genes and the understanding of their function. For example, proteins have well defined molecular roles encoded in their primary amino acid sequence as domains. Using sequence informatics, these domains can be used as a tool to search the entire genome to find protein family members that likely function in an analogous manner. Gene expression arrays have also been a useful tool for genome-wide studies where changes in gene expression can be associated with physiological or pathophysiological states (3). Recently, other high-throughput techniques such as RNA interference (4) and cDNA overexpression (5) have been developed, further accelerating functional genome annotation. The integration of these diverse strategies is critical to annotation efforts and remains a significant challenge.

Previously, we generated a preliminary description of the human and mouse transcriptome using oligonucleotide arrays that interrogate the expression of ≈10,000 human and ≈7,000 mouse target genes (6). We explored this data set for insights into gene function, transcriptional regulation, disease etiology, and comparative genomics. However, this data set was based on commercially available gene expression arrays and therefore was biased toward previously characterized genes. In this report, we significantly extend this earlier work by determining the expression patterns of previously uncharacterized protein-encoding genes and de novo gene predictions from the mouse and human genome projects. Using custom-designed whole-genome gene expression arrays that target 44,775 human and 36,182 mouse transcripts, we have built a more extensive gene atlas using a panel of RNAs derived from 79 human and 61 mouse tissues. This data set constitutes one of the largest quantitative evaluations of gene expression of the protein-encoding transcriptome to date.

Building on our previous analyses, these expression patterns were examined for global trends in gene expression. We also provide experimental validation of thousands of gene predictions and use these data to determine which of the commonly used types of evidence for gene prediction most accurately correlates with expressed genes. In addition, we used this data set to search for chromosomal regions of correlated transcription (RCTs), which may indicate higher-order mechanisms of transcriptional regulation. Furthermore, we show that some of these tissue-specific coregulated genes are subject to another form of regulation, parental imprinting, and thus that several of these regions are under the control of both tissue- and parental allele-specific expression. Finally, we have made these data publicly available for searching and visualization by keyword, accession number, sequence, expression pattern, and coregulation at our web site (http://symatlas.gnf.org).

Materials and Methods

Microarray Chip Design. We identified a nonredundant set of target sequences for the human and mouse using the following sources: RefSeq (15,491 human and 12,029 mouse sequences) (7); Celera (49,859 human and 29,331 mouse sequences) (8); Ensembl (33,698 human sequences); and RIKEN (46,299 mouse sequences) (9). First, all sequences were screened with RepeatMasker (www.repeatmasker.org) to remove repetitive elements. Next, sequence identity between individual sequences was established by using pairwise BLAT (10) or BLAST (11) and sim4 (12). The results from single-linkage clustering were further triaged to produce a final target set of 44,775 human and 36,182 mouse targets with the highest degree of confidence of computational prediction [biasing toward sequences containing InterPro domains (13) and away from noncoding RNAs]. Finally, the human sequence set was pruned of all targets already represented on the Affymetrix (Santa Clara, CA) commercially available HG-U133A array, leaving 22,645 target sequences for our custom array. One hundred target sequences from the HG-U133A chip were also included in the GNF1H design for the normalization procedure (see below). The final human and mouse target sets were submitted to the Affymetrix chip design pipeline for fabrication of the GNF1H and GNF1M arrays, respectively.

Tissue Preparation. Human tissue samples were obtained from several sources: Clinomics Biosciences (Pittsfield, MA), Clontech, AllCells (Berkeley, CA), Clonetics/BioWhittaker (Walkersville, MD), AMS Biotechnology (Abingdon, Oxfordshire, U.K.), and the University of California at San Diego. When samples from four or more subjects were available, equal numbers of male and female subjects were used to make two independent pools; when fewer than four samples were available, RNA samples were pooled, and duplicate amplifications were performed for each pool (detailed annotation for human samples is on our web site and in Table 1, which is published as supporting information on the PNAS web site). Adult (10- to 12-week-old) mouse tissue samples were independently generated from two groups of four male and three female C57BL/6 mice by dissection, and tissues were subsequently quickly frozen on dry ice. Tissues were pulverized while frozen, and total RNA was extracted with Trizol (Invitrogen, Carlsbad, CA) by using ≈100 mg of tissue, then further processed by using the RNeasy miniprep kit according to the manufacturer's protocols (Qiagen, Chatsworth, CA). The quality of all samples was determined with an Agilent Bioanalyzer (Palo Alto, CA).

Microarray Procedure. Microarray analysis was performed essentially as described (14). Briefly, 5 μg of total RNA was used to synthesize cDNA that was then used as a template to generate biotinylated cRNA. cRNA was fragmented and hybridized to Affymetrix custom or commercially available gene expression arrays. The arrays were then washed and scanned with a laser scanner, and images were analyzed by using the MAS5 algorithm (15). Arrays were normalized by using global median scaling. The human HG-U133A and GNF1H chips, which were hybridized to the same biological sample, were first paired and prenormalized by using the common targets. The condensed data files are available from our web site (http://symatlas.gnf.org) and Gene Expression Omnibus (www.ncbi.nih.gov/geo) (16). Raw CEL files will be provided upon request (http://symatlas.gnf.org).
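
As an illustration of the normalization step, global median scaling can be sketched as follows. This is a minimal sketch, not the pipeline used in this study: the function name `median_scale`, the input layout, and the choice of target value are assumptions made for the example, and the pairing and prenormalization of the HG-U133A and GNF1H chips is not shown.

```python
from statistics import median


def median_scale(arrays, target=None):
    """Rescale each array so that its median signal equals a common target.

    arrays: dict mapping an array name to a list of probe-set signals.
    target: desired median. If omitted, the median of the per-array medians
            is used (an assumption for this sketch, not a value from the text).
    """
    medians = {name: median(values) for name, values in arrays.items()}
    if target is None:
        target = median(medians.values())
    return {
        name: [value * target / medians[name] for value in values]
        for name, values in arrays.items()
    }


# Hypothetical signals for two arrays with different overall brightness.
arrays = {
    "liver": [10.0, 20.0, 30.0],
    "brain": [20.0, 40.0, 60.0],
}
scaled = median_scale(arrays, target=100.0)
```

After scaling, both hypothetical arrays share the same median, so differences between them reflect relative rather than absolute signal.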

Identification of RCTs. All target genes were mapped to their corresponding genome assembly (human to National Center for Biotechnology Information Hs34 assembly, mouse to February 2003 Mm30 assembly) by using BLAT (10). To account for multiple probes interrogating a single gene, target sequences were also compared to UniGene (www.ncbi.nih.gov/UniGene) by using BLAST. Target sequences that map within 25 kb of each other and to a common UniGene cluster were pooled, and their expression values were averaged and treated as a single target in the RCT analysis. Next, each chromosome was scanned in window sizes of 3–10 adjacent genes. Windows where >50% of all pairwise comparisons of expression pattern showed a Pearson correlation coefficient >0.6 were identified as RCTs. Randomization studies of gene order confirmed the significance of both the overall number of RCTs and the average pairwise correlation of each individual RCT (P < 0.005, correcting for multiple testing). Pairwise sequence similarity within each RCT was assessed by using TBLASTX (11), where a similarity value is the product of the alignment similarity and the percentage of total sequence length aligned. Synteny between the human and mouse genome assemblies was derived from a published analysis of syntenic anchors (17). For the analysis of evolutionarily conserved RCTs, only the 32 tissues profiled in common between the mouse and human data sets were used. All analyses and visualizations were performed by using R (www.r-project.org).
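
The window scan described above can be sketched in a few lines. The study's analyses were performed in R; the Python below is only an illustrative reimplementation of the stated rule (windows of 3–10 adjacent genes in which >50% of pairwise Pearson correlations exceed 0.6). The function names, the handling of flat profiles, and the toy data are assumptions, and the pooling of targets, merging of overlapping windows, and randomization test are not shown.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return sxy / sqrt(sxx * syy)


def find_rcts(profiles, min_size=3, max_size=10, r_cutoff=0.6, frac_cutoff=0.5):
    """Return (start, end) index pairs of windows that satisfy the RCT rule.

    profiles: expression profiles (one list per gene), ordered by position
              along a single chromosome. `end` is exclusive.
    """
    hits = []
    for size in range(min_size, max_size + 1):
        for start in range(len(profiles) - size + 1):
            pairs = list(combinations(profiles[start:start + size], 2))
            n_high = sum(1 for a, b in pairs if pearson(a, b) > r_cutoff)
            if n_high / len(pairs) > frac_cutoff:
                hits.append((start, start + size))
    return hits


# Toy chromosome: the first three genes share a pattern, the last two do not.
profiles = [
    [1, 2, 3, 4],
    [2, 4, 6, 8],
    [1, 2, 3, 5],
    [4, 3, 2, 1],
    [1, 3, 1, 3],
]
rcts = find_rcts(profiles)
```

In this toy example only the window covering the first three genes qualifies; the four-gene window starting at the same position has exactly half of its pairs above the cutoff and is therefore rejected by the strict >50% rule.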

Imprinting Analysis. Allele-specific probe expression analysis was used to identify genes with an imprinted expression pattern. Two distinct mouse strains, C57BL/6J (B6) and Mus musculus castaneus (CAST/Ei), were bred to produce four independent mouse crosses (male::female): B6::B6, B6::CAST/Ei, CAST/Ei::B6, and CAST/Ei::CAST/Ei. Each litter of embryonic day 14–16 embryos was pooled, and RNA from four to five separate litters was labeled and hybridized to GNF1M arrays. A probe-level analysis was performed to detect naturally occurring polymorphisms between the two strains. Individual probes (but not entire probe sets) showing a significantly different signal between the two homozygous groups were identified as putative polymorphisms in the target gene. Next, the hybridization signal from the two reciprocal crosses was examined for statistically significant differences in signal based on the paternal or maternal allele, as assessed by t test (P < 0.001), indicating a pattern of male or female imprinting.
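
The comparison of reciprocal crosses for a single polymorphic probe might be sketched as below. This is a rough illustration under stated assumptions, not the authors' procedure: it assumes the probe hybridizes preferentially to the B6 allele, it uses Welch's t statistic with a hypothetical cutoff (`t_cutoff`) in place of the P < 0.001 threshold because no P value is computed here, and all names and signal values are invented for the example.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (no P value computed)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def expressed_allele(b6_father, b6_mother, t_cutoff):
    """Guess which parental allele is expressed for a B6-specific probe.

    b6_father: probe signals from the B6::CAST/Ei cross (male::female).
    b6_mother: probe signals from the CAST/Ei::B6 cross (male::female).
    t_cutoff:  hypothetical threshold on |t|, standing in for a P-value cutoff.
    """
    t = welch_t(b6_mother, b6_father)
    if t > t_cutoff:
        return "maternal"  # signal appears only when the mother carries B6
    if t < -t_cutoff:
        return "paternal"  # signal appears only when the father carries B6
    return "undetermined"


# Hypothetical probe signals from four pooled litters per reciprocal cross.
call = expressed_allele(
    b6_father=[100.0, 110.0, 90.0, 105.0],
    b6_mother=[900.0, 950.0, 1000.0, 920.0],
    t_cutoff=10.0,
)
```

A real analysis would convert the statistic to a P value and correct for the number of probes tested; the sketch only shows the direction of the comparison.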

Results and Discussion

The tissue-specific RNA expression pattern of a gene can indicate important clues to its physiological function. To build an extensive atlas of tissue-specific gene expression, we created custom arrays that interrogate the expression of known and predicted protein-encoding genes from the mouse and human genomes. The design process used a nonredundant set of known genes and gene predictions compiled from RefSeq, Celera, Ensembl (for human), and RIKEN (for mouse). For our GNF1H custom human array, we further removed gene targets that were already represented on the commercially available HG-U133A array from Affymetrix. Finally, we biased the final selection toward gene predictions with likely protein-coding regions. In total, the U133A/GNF1H chipset interrogates 44,775 probe sets, and our custom-designed GNF1M mouse array interrogates 36,182 probe sets. As of the most current annotation in January 2004, these correspond to 33,698 and 33,825 unique human and mouse genes, respectively, after accounting for multiple probe sets interrogating single genes and split transcripts.

Using these whole-genome gene expression arrays, we measured the expression of an extensive set of transcripts and transcript predictions on a single technology platform across a diverse panel of 79 human and 61 mouse tissues. This gene atlas represents the normal transcriptome and allowed us to examine global trends in gene expression. Classical reassociation kinetics (Rot) has been used to assess global trends in gene expression at a population level (18). The analysis of our data set expands this knowledge by examining transcript expression across a large number of tissues at the individual transcript level. We find that 52% (16,454) and 59% (17,924) of target genes are detected in at least one tissue in the human and mouse, respectively (Fig. 4A, which is published as supporting information on the PNAS web site). The average number of transcripts expressed in a single tissue was ≈8,200 (mouse). These observations generally concur with previous findings derived from Rot analyses, which indicate that ≈10,000–15,000 mRNAs are expressed in a given tissue at ≈1–10 copies per cell, and that 90% of these are common between two tissues (19). However, although Rot analysis suggests that the majority of transcripts are present in many or all tissues, our data show that <1% of human target sequences are ubiquitously expressed. Approximately 3% of mouse target sequences are detected in all samples profiled, although this number will certainly decline as the number of samples increases. Not surprisingly, the expression of these ubiquitously expressed housekeeping genes is ≈30-fold higher than for all genes in the data set (Fig. 4B).
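
The two summary fractions quoted above (targets detected in at least one tissue, and targets detected in every tissue) can be computed from a table of per-tissue detection calls. The sketch below assumes such calls already exist as booleans; how a transcript is called "detected" is not specified here, and the function name and toy data are hypothetical.

```python
def detection_summary(calls):
    """Summarize per-tissue detection calls.

    calls: dict mapping a target name to a list of booleans, one per tissue
           (True means the transcript was called detected in that tissue).
    Returns the fraction of targets detected in any tissue and in all tissues.
    """
    total = len(calls)
    in_any = sum(1 for tissue_calls in calls.values() if any(tissue_calls))
    in_all = sum(1 for tissue_calls in calls.values() if all(tissue_calls))
    return {"any_tissue": in_any / total, "all_tissues": in_all / total}


# Hypothetical calls for four targets across three tissues.
calls = {
    "geneA": [True, True, True],     # ubiquitous
    "geneB": [False, True, False],   # tissue-restricted
    "geneC": [False, False, False],  # never detected
    "geneD": [True, False, False],   # tissue-restricted
}
summary = detection_summary(calls)
```

Because the "all tissues" fraction can only shrink as tissues are added, this layout also makes it easy to see why the ubiquitous fraction is expected to decline with a larger sample panel.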

Another valuable use of this data set is characterization of novel predicted genes derived from the mouse and human genome projects (1, 2). Many of these exist solely as in silico predictions, and therefore evidence of their expression can serve as validation of these predictions. Furthermore, determining the expression pattern of an uncharacterized gene can indicate the appropriate tissue(s) from which the transcript can be cloned, as well as provide a base layer of physiological annotation. Gene prediction is an inexact art, where distinct methods and researchers often produce largely nonoverlapping sets of gene predictions (20). For the human data, we subdivided the transcripts into four classes based on annotation information at the time of design: known genes found in RefSeq, genes predicted independently by two groups (Celera and Ensembl), singleton predictions found by the Ensembl group only, and singleton predictions found by the Celera group only. As expected, the set of known genes (RefSeq) has the highest rate of detection in our data set, because 79% have detectable expression in at least one sample (Fig. 1). Because all RefSeq genes are known to be expressed, this suggests that our methodologies and current tissue libraries have a minimum false-negative rate of ≈21% in detection of expression. This can certainly be improved with the profiling of additional tissues and cell types. Consensus gene predictions have a higher rate of detectable expression (53%) than either singleton gene predictions offered by Ensembl or Celera only (30% and 24%, respectively) (Fig. 1). Although the Ensembl-only group had a slightly higher rate of detection, a greater total number of Celera-only predictions was detected (2,918 Celera vs. 618 Ensembl predictions). Analogous results are seen in the mouse data set, in which RefSeq genes had a higher rate of detection than gene predictions by Celera (79% vs. 46%).

The differences among these three classes are also reflected in the quantitative measures of gene expression. On average, human RefSeq genes are expressed at a level 2-fold higher than consensus predictions, which in turn are 66% higher than singleton predictions (P << 0.001; data not shown). This observation likely reflects a historical bias in the biology of studying highly abundant proteins. In total, we find evidence of expression for 5,641 (31.2%) human and 2,629 (46.2%) mouse gene predictions through detection of their transcribed mRNA product in at least one tissue. In addition, we describe the expression pattern for 9,708 mouse RIKEN-derived genes, many of which lack significant expression annotation. It is important to note that the gene predictions for which we do not observe detectable expression are not necessarily incorrect, because the appropriate tissue(s) for a given gene may have not been profiled, the gene may be present in a small number of copies (e.g., in a small subset of cells within a tissue), or the probe set may not accurately interrogate the expression of the gene (e.g., UTRs, split transcripts, or missing or mistaken terminal exons). Despite these caveats, this data set provides the expression pattern of thousands of gene predictions and poorly characterized transcripts from the mouse and human genome projects, offering the opportunity to study the function of these genes in their most relevant tissues.

Fig. 1. Validation of gene predictions in humans. Gene targets on the GNF1H array were divided into four categories: contained in RefSeq, predicted by both gene prediction efforts considered (“Consensus”), and predicted by only one group (“Ensembl-only” and “Celera-only”). On the left axis (solid bars), rates of validation are shown, where detectable expression in at least one tissue is taken as evidence of the validity of a gene prediction. The right axis (blue line) indicates the total number of validated genes per group.

Given the differing methods and subsequent results from gene prediction efforts, we next investigated which characteristics of a predicted transcript were better indicators of its detectable expression. In the methodology used by Celera, the following lines of evidence were considered in their gene prediction algorithm: “conservation between mouse and human genomic DNA, similarity to human [and] rodent transcripts (ESTs and cDNAs), and similarity of the translation of human genomic DNA to known proteins” (1). Using the detectable expression of a gene product as validation of the prediction, we created receiver operating characteristic curves for each line of evidence that plot the true positive rate as a function of the false positive rate. The area under the curve (AUC) measures the strength of the predictor; a perfect predictor would have AUC = 1, and a random factor would have AUC = 0.5. When comparing the predictor strength among the three lines of evidence above in the human data set, we find that although no single line of evidence is universally predictive of expression, EST evidence has the most predictive value (AUC = 0.77) (Fig. 5, which is published as supporting information on the PNAS web site), an observation likely linked to the fact that highly expressed genes are more likely to be represented in EST databases. Protein homology support and sequence similarity between human and mouse genomic sequences both had a lesser impact on the validation of gene predictions (AUC of 0.66 and 0.65, respectively). The availability of additional mammalian genome sequences should increase the power of sequence conservation in gene prediction. Somewhat surprisingly, simply the length of the transcript prediction was also a reasonable predictor of detection in our data set (AUC = 0.68), suggesting that incomplete transcript predictions were significant factors in the nonobservation of many gene targets.
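
For readers unfamiliar with the AUC, it can be computed without drawing the curve by using its rank-based interpretation: the probability that a randomly chosen detected prediction has a higher evidence score than a randomly chosen undetected one. The sketch below is a generic illustration of that identity, not the code used in this study; the function name and the example scores are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores: evidence score for each gene prediction (higher = more support).
    labels: truthy if the prediction had detectable expression, else falsy.
    A tie between a positive and a negative counts as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical evidence scores and detection outcomes for five predictions.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]
auc = roc_auc(scores, labels)
```

The pairwise loop is quadratic in the number of predictions, which is acceptable for a sketch; a sort-based rank-sum calculation gives the same value for large data sets.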

We and others have used gene expression information, genome sequence, and de novo motif discovery tools to search for enhancer elements that direct tissue-specific gene expression (21, 22). In contrast to enhancers that generally direct the expression of a single gene, locus control regions (LCRs) are characterized by their ability to promote the expression of multiple genes at a single locus. To date, only a handful of LCRs have been reported (23). Recently, Spellman and Rubin (24) used Drosophila gene expression arrays to identify ≈200 clusters of adjacent and similarly expressed genes and suggest that these patterns are most consistent with regulation of chromatin structure. Others (25–27) have also performed similar analyses in humans, Caenorhabditis elegans, and yeast on more limited sets of experimental conditions.

To identify potential loci in our data set, the expression of which may be controlled in a locus-dependent manner, we mapped the transcripts represented on our gene expression arrays to genome assemblies and scanned each chromosome for windows of genes with correlated expression patterns. We called these sites RCTs as a general term encompassing LCRs and correlated expression achieved through gene duplication. It is important to note that detection of these RCTs is heavily influenced by comparison algorithms, normalization procedures, and underlying data. In particular, the inclusion of several purified immune cell populations in our human sample set skewed the normalization procedure and led to an increase in RCTs whose expression is enriched in these samples. In total, we identified 156 and 108 RCTs in human and mouse, respectively (descriptions of all RCTs are available for download from http://symatlas.gnf.org). Tissues with very specific clusters of genes such as those in the immune system, liver, testis, and placenta had more RCTs than other tissues in both the mouse and human data sets. Mechanistically, expression of these RCTs is likely mediated through either common promoter elements (resulting from gene duplication) or through higher-order gene regulation such as site-specific chromatin remodeling. To separate these two possibilities, we identified likely paralogs using TBLASTX, a local six-frame translated nucleotide-to-nucleotide alignment algorithm (11). RCTs whose genes share significant sequence similarity in their coding sequences are likely to be products of gene duplication, whereas dissimilar genes may result from an LCR or other higher-order transcriptional regulation.

As expected, we found RCTs with both related and unrelated genes. Fig. 2A illustrates an example of an RCT driven by gene duplication. This cluster of genes on mouse chromosome 9 represents a family of 11 uncharacterized F-box and WD40 repeat containing proteins that are specifically expressed in fertilized eggs and oocytes. Because of their high degree of sequence similarity, we hypothesize that their correlated expression pattern is a result of duplicated regulatory elements present in their structural genes, and that these genes may play an important role in the specialized protein expression of oocytes. In contrast, we also note a cluster of three genes with no apparent sequence similarity on human chromosome 13 that are highly enriched in samples derived from brain tissues, particularly the fetal brain sample (Fig. 2B). The genes in this cluster are neurobeachin, an uncharacterized mRNA, and doublecortin- and calmodulin kinase-like 1 protein (DCAMKL1). It is appealing to hypothesize that the correlated expression patterns of these genes and their colocalization at a chromosomal locus indicate a common role in a neurological process or network. Because these genes do not share sequence similarity, this region may also contain a previously unrecognized LCR or strong regional enhancer. Overall, 97 (62%) and 78 (72%) of the human and mouse RCTs identified have an average pairwise sequence similarity of <20% and do not encode related genes.

Fig. 2. RCT. (A) An RCT was identified on mouse chromosome 9, consisting of 11 genes that share a highly conserved expression pattern. (Upper) The y axis is average normalized expression value, the x axis contains the 61 different tissues, and red bars are fertilized egg and oocytes. The correlation plot (Lower Left) visualizes the pairwise correlation coefficients. Each row represents a gene, ordered vertically according to its position on the chromosome. The center yellow vertical strip represents autocorrelation (R = 1); positions to the right of center represent correlation of the gene to its downstream neighbors, whereas positions to the left represent correlation to the upstream neighbors. Yellow indicates high correlation; blue indicates low correlation (scale at bottom). The sequence similarity plot (using TBLASTX, Lower Right) has the same structure as the correlation plot, except pairwise sequence similarity is shown. In this RCT with high expression levels in fertilized eggs and oocytes, the genes share a high degree of sequence similarity, likely indicating they are all members of a single gene family and the result of one or more gene duplication events. (B) An example RCT is identified on human chromosome 13, which contains three genes with highly correlated expression (red bars are brain regions, green bar is fetal brain). In contrast to the first example, these genes share very little pairwise sequence similarity. (C) An evolutionarily conserved RCT is shown from human chromosome 2 (Left) and the syntenic region on mouse chromosome 6 (Right). These RCTs share a pancreas-enriched expression pattern (red bar), as well as significant sequence similarity.

We next examined both the mouse and human data for RCTs that were identified in both data sets and are likely evolutionarily conserved. The majority of the RCTs were not found in both human and mouse, in many cases because the orthologs or syntenic regions have not yet been defined or the patterns were not conserved. However, in some cases, the apparent lack of conservation likely reflects physiological differences between the two organisms. For example, we observed RCTs with expression enriched in the olfactory bulb present in the mouse but not the human data set. Nevertheless, several RCTs were conserved, including a cluster of pancreas-specific genes mapping to human chromosome 2 and its syntenic region on mouse chromosome 6 (Fig. 2C). The human cluster comprises five genes, including pancreatitis-associated proteins (PAP), three regenerating islet-derived proteins (REG1A, REG1B, and REGL), and one protein of unknown function (LPPM429). The mouse cluster contains the ortholog to PAP, four isoforms of regenerating islet-derived proteins, and islet neogenesis-associated protein-related protein. The conservation of this RCT in human and mouse suggests that these genes perform analogous and important roles in both of these mammals.

After mapping all target genes to their respective genome assemblies, we noted a region of mouse chromosome 7 (130 Mb) that contained several genes previously shown to be imprinted (28–30), three of which (H19, Igf2, and Cdkn1c) shared a pattern of enriched expression in placenta, umbilical cord, and embryonic tissues. We also noted another pair of adjacent genes (Zim1 and Peg3) elsewhere on chromosome 7 (6 Mb) that shared this tissue-specific expression pattern, and whose expression has been shown to be imprinted (31). Prompted by these observations, we examined our set of RCTs for other imprinted genes that were clustered in a single locus. On mouse chromosome 12 (103 Mb), we observed an RCT that consists of six adjacent genes, all with enriched expression in brain regions and umbilical cord (Fig. 3 A and B). Recently, several groups showed that two genes in this locus, Dlk1 and Gtl2, are imprinted (reviewed in ref. 32). Later, it was also shown that another gene at this locus, Rian, and several adjacent tandemly repeated C/D small nucleolar RNA genes are also imprinted (33, 34). Furthermore, although we do not have a probe set on our array that reliably detects its expression, Dio3 is located proximal to this locus and has also been shown to exhibit genomic imprinting (35). The imprinting status of the three remaining RIKEN clones at this locus (1110006E14Rik, 5330411G14Rik, and C130007E11Rik) is not known, although they share the brain- and umbilical cord-enriched expression characteristic of all of the genes in the RCT.

Fig. 3. Six genes on mouse chromosome 12 share a distinctive pattern of expression. (A) A genomic view of this region (not to scale). Locations of the genes on the mouse genome assembly: Dlk1 (103.508 Mb), Gtl2 (103.593 Mb), 1110006E14Rik (103.646 Mb), Rian (103.696 Mb), 5330411G14Rik (103.788 Mb), C130007E11Rik (103.798 Mb), and Dio3 (104.328 Mb). (B) These genes share enriched expression in brain regions (green bars) and umbilical cord (red bar). The y axes indicate normalized expression values, whereas each bar along the x axis indicates a sample profiled in our data set. (C) Three of these genes (Dlk1, Gtl2, and Rian) have been previously reported to be imprinted. Using our allele-specific probe expression analysis approach (see text), we confirmed the imprinted regulation of Gtl2 and Rian and report two previously undescribed imprinted transcripts at this locus (5330411G14Rik and C130007E11Rik). The y axes indicate the normalized signal intensity for individual probes on the array, and each bar represents a pooled sample from a cross indicated by color (see key).

To investigate whether these three genes were also imprinted, we used two distinct mouse strains, C57BL/6J (B6) and M. m. castaneus (CAST/Ei), to set up four independent mouse crosses (male::female): B6::B6, B6::CAST/Ei, CAST/Ei::B6, and CAST/Ei::CAST/Ei. Four independent litters of pooled embryonic day 14–16 embryos were dissected, and RNA expression was analyzed by allele-specific probe expression analysis, which allows us to determine whether the transcript is expressed exclusively or preferentially from either the paternal or maternal allele. This analysis reconfirmed the imprinted expression of Gtl2 and Rian (Fig. 3C). Because no probes could distinguish between the B6 and CAST/Ei forms of Dlk1, we were unable to reconfirm its imprinted expression. Two of the uncharacterized RIKEN genes at this locus, 5330411G14Rik and C130007E11Rik, showed expression from the maternal allele only, further expanding the number of known imprinted genes at this locus (Fig. 3C). Because these cDNAs are within 10 kb of one another, it is possible they are derived from the same structural gene. The third gene (1110006E14Rik), like Dlk1, did not contain a probe capable of ascertaining its imprinting status. During the preparation of this manuscript, another gene in this locus sharing the 3′-end of C130007E11Rik was also shown to be imprinted (36). In sum, the allele-specific probe expression analysis method has identified another two imprinted transcripts at this locus. Furthermore, based on the observation that well-characterized imprinted loci on mouse chromosomes 7 and 12 share a common pattern of gene expression in our data, we speculate that the LCR machinery that regulates the parental expression of these genes may also influence their tissue-specific expression pattern.


Here we report an extensive compendium of gene expression of the protein-encoding transcriptomes of the mouse and humans. Further augmentation by additional samples, including region-specific dissections using laser capture microdissection or even cell type-specific gene expression, will undoubtedly increase the utility of these resources. We have investigated this data set for global signatures in tissue-specific gene regulation, expression characteristics of de novo predicted transcripts, and chromosomal RCTs. The identification of several known imprinted loci in our tissue-specific RCT list suggests that these regulatory mechanisms that direct tissue- or parental allele-specific expression may be intertwined. Consistent with this observation, we were able to identify two previously undescribed transcripts that were imprinted on mouse chromosome 12 based on the observation that they share a tissue-specific expression pattern with their neighbors.

With the sequencing phase of the human and mouse genome projects nearly complete, and with the rapid progress in the sequencing of other mammalian genomes, we are now poised to develop and exploit a variety of methods to ascertain the function of the thousands of recently described genes. In this regard, the genome-scale RNA expression data described herein provide a framework for the functional annotation process. By making the underlying data available on our web site (http://symatlas.gnf.org) and through the Gene Expression Omnibus (www.ncbi.nih.gov/geo), we anticipate that this study will aid researchers throughout the global research community to reap the harvests of the human and mouse genome projects.


We thank the following individuals for providing human RNA samples: Gino Van Heeke, Novartis (bronchial epithelial cells); Graeme Bilbe, Novartis (fetal thyroid); Clifford Shults, University of California at San Diego (whole blood); Bill Sugden, University of Wisconsin, Madison (721 B-lymphoblasts); and Joseph D. Buxbaum, Mt. Sinai School of Medicine, New York (prefrontal cortex). We also thank Ines Hoffmann and Satchin Panda for preparation of mouse embryonic samples and Peter Dimitrov, Christian Zmasek, and Michael Heuer for technical expertise. This work was supported by the Novartis Research Foundation.


¶ To whom correspondence should be addressed. E-mail: hogenesch{at}gnf.org.

† A.I.S., T.W., and S.B. contributed equally to this work.

‡ Present address: Center for Biological and Computational Learning, Massachusetts Institute of Technology, MIT E25-201, Cambridge, MA 02142.

This paper was submitted directly (Track II) to the PNAS office.

Abbreviations: RCT, regions of correlated transcription; AUC, area under the curve; LCR, locus control regions.

Copyright © 2004, The National Academy of Sciences


1. Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., et al. (2001) Science 291, 1304-1351. pmid:11181995
2. Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. (2001) Nature 409, 860-921. pmid:11237011
3. Su, A. I., Welsh, J. B., Sapinoso, L. M., Kern, S. G., Dimitrov, P., Lapp, H., Schultz, P. G., Powell, S. M., Moskaluk, C. A., Frierson, H. F., Jr., et al. (2001) Cancer Res. 61, 7388-7393. pmid:11606367
4. Aza-Blanc, P., Cooper, C. L., Wagner, K., Batalov, S., Deveraux, Q. L. & Cooke, M. P. (2003) Mol. Cell 12, 627-637. pmid:14527409
5. Chanda, S. K., White, S., Orth, A. P., Reisdorph, R., Miraglia, L., Thomas, R. S., DeJesus, P., Mason, D. E., Huang, Q., Vega, R., et al. (2003) Proc. Natl. Acad. Sci. USA 100, 12153-12158. pmid:14514886
6. Su, A. I., Cooke, M. P., Ching, K. A., Hakak, Y., Walker, J. R., Wiltshire, T., Orth, A. P., Vega, R. G., Sapinoso, L. M., Moqrich, A., et al. (2002) Proc. Natl. Acad. Sci. USA 99, 4465-4470. pmid:11904358
7. Pruitt, K. D. & Maglott, D. R. (2001) Nucleic Acids Res. 29, 137-140. pmid:11125071
8. Kerlavage, A., Bonazzi, V., di Tommaso, M., Lawrence, C., Li, P., Mayberry, F., Mural, R., Nodell, M., Yandell, M., Zhang, J., et al. (2002) Nucleic Acids Res. 30, 129-136. pmid:11752274
9. Okazaki, Y., Furuno, M., Kasukawa, T., Adachi, J., Bono, H., Kondo, S., Nikaido, I., Osato, N., Saito, R., Suzuki, H., et al. (2002) Nature 420, 563-573. pmid:12466851
10. Kent, W. J. (2002) Genome Res. 12, 656-664. pmid:11932250
11. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997) Nucleic Acids Res. 25, 3389-3402. pmid:9254694
12. Florea, L., Hartzell, G., Zhang, Z., Rubin, G. M. & Miller, W. (1998) Genome Res. 8, 967-974. pmid:9750195
13. Kanapin, A., Batalov, S., Davis, M. J., Gough, J., Grimmond, S., Kawaji, H., Magrane, M., Matsuda, H., Schonbach, C., Teasdale, R. D., et al. (2003) Genome Res. 13, 1335-1344. pmid:12819131
14. Lockhart, D. J., Dong, H., Byrne, M. C., Follettie, M. T., Gallo, M. V., Chee, M. S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H., et al. (1996) Nat. Biotechnol. 14, 1675-1680. pmid:9634850
15. Hubbell, E., Liu, W. M. & Mei, R. (2002) Bioinformatics 18, 1585-1592. pmid:12490442
16. Edgar, R., Domrachev, M. & Lash, A. E. (2002) Nucleic Acids Res. 30, 207-210. pmid:11752295
17. Kent, W. J., Baertsch, R., Hinrichs, A., Miller, W. & Haussler, D. (2003) Proc. Natl. Acad. Sci. USA 100, 11484-11489. pmid:14500911
18. Bishop, J. O., Morton, J. G., Rosbash, M. & Richardson, M. (1974) Nature 250, 199-204. pmid:4855195
19. Hastie, N. D. & Bishop, J. O. (1976) Cell 9, 761-774. pmid:1017013
20. Hogenesch, J. B., Ching, K. A., Batalov, S., Su, A. I., Walker, J. R., Zhou, Y., Kay, S. A., Schultz, P. G. & Cooke, M. P. (2001) Cell 106, 413-415. pmid:11534548
21. Harmer, S. L., Hogenesch, J. B., Straume, M., Chang, H. S., Han, B., Zhu, T., Wang, X., Kreps, J. A. & Kay, S. A. (2000) Science 290, 2110-2113. pmid:11118138
22. DeRisi, J. L., Iyer, V. R. & Brown, P. O. (1997) Science 278, 680-686. pmid:9381177
23. Li, Q., Peterson, K. R., Fang, X. & Stamatoyannopoulos, G. (2002) Blood 100, 3077-3086. pmid:12384402
24. Spellman, P. T. & Rubin, G. M. (2002) J. Biol. 1, 5. pmid:12144710
25. Caron, H., van Schaik, B., van der Mee, M., Baas, F., Riggins, G., van Sluis, P., Hermus, M. C., van Asperen, R., Boon, K., Voute, P. A., et al. (2001) Science 291, 1289-1292. pmid:11181992
26. Roy, P. J., Stuart, J. M., Lund, J. & Kim, S. K. (2002) Nature 418, 975-979. pmid:12214599
27. Cohen, B. A., Mitra, R. D., Hughes, J. D. & Church, G. M. (2000) Nat. Genet. 26, 183-186. pmid:11017073
28. Bell, A. C. & Felsenfeld, G. (2000) Nature 405, 482-485. pmid:10839546
29. Hark, A. T., Schoenherr, C. J., Katz, D. J., Ingram, R. S., Levorse, J. M. & Tilghman, S. M. (2000) Nature 405, 486-489. pmid:10839547
30. Thorvaldsen, J. L., Duran, K. L. & Bartolomei, M. S. (1998) Genes Dev. 12, 3693-3702. pmid:9851976
31. Kim, J., Lu, X. & Stubbs, L. (1999) Hum. Mol. Genet. 8, 847-854. pmid:10196374
32. Georges, M., Charlier, C. & Cockett, N. (2003) Trends Genet. 19, 248-252. pmid:12711215
33. Hatada, I., Morita, S., Obata, Y., Sotomaru, Y., Shimoda, M. & Kono, T. (2001) J. Biochem. 130, 187-190. pmid:11481034
34. Cavaille, J., Seitz, H., Paulsen, M., Ferguson-Smith, A. C. & Bachellerie, J. P. (2002) Hum. Mol. Genet. 11, 1527-1538. pmid:12045206
35. Yevtodiyenko, A., Carr, M. S., Patel, N. & Schmidt, J. V. (2002) Mamm. Genome 13, 633-638. pmid:12461649
36. Seitz, H., Youngson, N., Lin, S. P., Dalbert, S., Paulsen, M., Bachellerie, J. P., Ferguson-Smith, A. C. & Cavaille, J. (2003) Nat. Genet. 34, 261-262. pmid:12796779