Arabidopsis thaliana: Characteristics and Annotation of a Model Genome


Arabidopsis thaliana is an annual plant of the Brassicaceae family and is commonly found in temperate regions of the world. Its suitability for molecular and genetic experiments has made it one of the most widely studied plants today. Arabidopsis was the first plant to have its genome sequenced, and that genome remains the most completely sequenced of any eukaryote to date. Approximately 11,000 researchers around the world are currently engaged in unraveling the functions of this genome. Lessons learned from this reference plant will facilitate more systematic and targeted approaches for manipulating and managing plants that impact humans and the environment.


The genome of Arabidopsis, one of the smallest among angiosperms, has been estimated at approximately 146 Mb and is densely packed with genes. To date, 117.3 Mb of nonredundant sequence has been completed. The remaining gaps are in the centromeres and other highly repetitive regions (The Arabidopsis Information Resource, TAIR). Approximately 15–20% of each chromosome is composed of heterochromatin around the centromere and, additionally, in two heterochromatic knobs on chromosomes 4 and 5. In the euchromatic regions, gene density averages one gene per 5 kb, with 50% of the euchromatic sequence allotted to genes.

Most characteristics, such as gene density, distribution of repetitive DNA, and guanine/cytosine (GC) content, are constant within and among all 10 euchromatic chromosome arms. This organization is quite different from that of most crop plant genomes, where the gene-rich tracts are clustered and separated by huge stretches of repetitive DNA. The average GC content over the five chromosomes is 34.9%, and about 4–6% of the cytosine residues in the Arabidopsis genome are methylated, compared to 30–33% and 22% cytosine methylation in tobacco and wheat, respectively. The repetitive fraction of the genome is more highly methylated than single- and low-copy genes.
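GC content of the kind quoted above is a simple base count. The following Python sketch shows one way it might be computed, overall and in sliding windows along a sequence. The function names and the toy sequence are illustrative assumptions and do not come from TAIR or TIGR tooling.

```python
def gc_content(sequence: str) -> float:
    """Return the GC content of a DNA sequence as a percentage."""
    seq = sequence.upper()
    # Count only unambiguous bases so that N or gap characters
    # do not dilute the estimate.
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / counted


def windowed_gc(sequence: str, window: int, step: int) -> list[float]:
    """GC percentage in consecutive windows of `window` bases, `step` apart."""
    return [
        gc_content(sequence[start:start + window])
        for start in range(0, len(sequence) - window + 1, step)
    ]


# Made-up 20-base sequence, for illustration only.
toy = "ATATGCATTAATCGATATTA"
print(gc_content(toy))
print(windowed_gc(toy, window=10, step=5))
```

A roughly flat windowed profile along a chromosome arm would be consistent with the uniform GC content described above. Window and step sizes are free parameters and would need tuning for real chromosome-scale data.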

The most recent genome reannotation by The Institute for Genomic Research (TIGR) includes approximately 29,000 genes containing an average of five exons per gene, with a mean unprocessed transcript length of 2085 bp (mode 1584 bp) and a mean protein length of 425 amino acid residues (mode 221). Both size distributions are extremely right-skewed, with over 90% of the genes being smaller than 4 kb. Approximately a third of the genes identified by computer prediction have been verified with full-length cDNA information, and about 64% of the genes have full- or partial-length cDNA sequences associated with them (TAIR).

Preliminary results from analyzing hypothetical genes (i.e., genes with no transcript sequence or sequence similarity information) indicate that approximately 80% of them have detectable transcripts. This suggests that approximately 95% of the identified genes (approximately 27,500 transcripts) are part of the transcriptome. In addition, a comparison of 5000 full-length cDNAs to the genome reveals that about 2% of the genes are alternatively transcribed, suggesting that the transcriptome size is likely to increase as more experimental data are generated. Taking into account post-translational modifications like phosphorylation and glycosylation, the Arabidopsis proteome is likely to be even larger than its transcriptome.

Unlike the human genome and many plant genomes, the Arabidopsis genome contains little repetitive DNA: only about 10% of the total. This fraction consists largely of 5S rRNA arrays, 18S-5.8S-25S rRNA arrays, centromere-associated repeat sequences, nucleolar organizers, telomeres, and transposons. Arabidopsis contains a rich diversity of most known transposons, as well as some that are structurally unique. Most repetitive sequences are found in the centromeres and telomeres. Genetically defined centromeres contain a central region composed of 180 bp repeat microsatellites and Athila transposable elements, flanked by sequences containing a number of additional microsatellites, transposable elements, 5S rDNA, and unique sequences containing expressed genes. The unique sequences in the five centromeres are not similar to each other.

The only conserved elements appear to be those in the central domain, suggesting that these structural aspects may be sufficient for centromere function. Telomeres consist of tandemly repeated blocks of CCCTAAA, similar to the DNA patterns found in lower eukaryotes. Unlike animals, Arabidopsis can tolerate a severe reduction of telomeric DNA for up to ten generations. It is unknown how this tolerance is achieved, what additional factors contribute to the maintenance of chromosomal integrity, and whether this phenomenon occurs in other plants.
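To make the notion of tandemly repeated CCCTAAA blocks concrete, the Python sketch below counts the longest uninterrupted run of the repeat unit in a sequence. The helper names and the test sequence are hypothetical. Real telomere analysis would also need to tolerate degenerate repeat copies, which this exact-match sketch does not.

```python
import re

TELOMERE_UNIT = "CCCTAAA"  # repeat unit named in the text


def reverse_complement(seq: str) -> str:
    """Return the reverse complement, for reading the repeat on the other strand."""
    complement = str.maketrans("ACGT", "TGCA")
    return seq.upper().translate(complement)[::-1]


def longest_tandem_run(sequence: str, unit: str = TELOMERE_UNIT) -> int:
    """Return the largest number of back-to-back exact copies of `unit`."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [
        len(match.group(0)) // len(unit)
        for match in pattern.finditer(sequence.upper())
    ]
    return max(runs, default=0)


# Made-up sequence: three tandem copies, a short spacer, then one more copy.
toy = "GATTACA" + TELOMERE_UNIT * 3 + "GG" + TELOMERE_UNIT
print(longest_tandem_run(toy))
print(reverse_complement(TELOMERE_UNIT))
```

Tracking such a run length across generations is one simple way the reduction of telomeric DNA mentioned above could be quantified from sequence data, assuming reads that span the chromosome ends are available.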

Sequence analysis of the Arabidopsis genome has revealed a history of genome rearrangements, most notably duplications followed by gene loss. Approximately 60% of the genome has been duplicated, with at least two rounds of large segmental duplications occurring approximately 112 million years ago. The estimated timing of the duplication events suggests that the ancient duplications occurred before the divergence of the tomato and Arabidopsis lineages.

In addition to these large segmental duplications, the genome has undergone many smaller gene duplication events. Approximately 17% of the genes are tandemly repeated and approximately 4000 genes belong to about 1500 gene families with more than five members each.

A very recent rearrangement in the Arabidopsis genome involves one of its organellar genomes. Approximately 620 kb of mitochondrial DNA has been inserted near the centromeric region of chromosome 2 in the Columbia ecotype. This mitochondrial DNA is not found in other ecotypes such as Landsberg erecta, indicating that the rearrangement occurred very recently.

Functional Composition

The Arabidopsis genome is being functionally annotated by TAIR and TIGR based on data from the literature and from sequence comparisons. Approximately 20,100 genes have been annotated with terms from the controlled vocabularies developed by the Gene Ontology Consortium, which describe a gene product’s molecular function, biological process, and subcellular localization. Fig. 1 illustrates the current distribution of annotations of these genes. About 2020 of the 29,000 genes have been described in the literature (TAIR analysis). Many of these genes are involved in agronomically important processes such as responses to drought, cold, light, and disease, as well as processes not found in animal systems, such as secondary metabolism.

The challenge of characterizing the remaining 26,980 genes is being addressed using multifaceted approaches to predict gene function, followed by in planta assays. Based on conserved domains, there are approximately 11,000 gene families, similar to the number found in other sequenced multicellular eukaryotes such as Drosophila and Caenorhabditis elegans. An ongoing analysis of the Arabidopsis proteome with respect to known metabolic pathways (AraCyc) has so far resulted in the assignment of about 1000 genes to 174 pathways, including plant-specific pathways like those involved in secondary metabolism. Characterization of the Arabidopsis transcriptome is facilitated by the extensive use of microarrays to analyze gene expression patterns. Available technologies include targeted cDNA arrays as well as high-density oligonucleotide arrays representing 25,000 genes. Results from over 570 Arabidopsis cDNA arrays and multiple high-density oligonucleotide arrays are available to the research community.

Cluster analysis of these experimental results can reveal patterns of possible co-regulation between uncharacterized genes and experimentally characterized genes, and can shed light on the composition and characteristics of the Arabidopsis transcriptome. Finally, the functions of genes are being elucidated using forward and reverse genetic approaches. Functional analysis of specific genes is facilitated by the availability of a suite of insertion lines. The ability to obtain insertions in specific genes allows the construction of appropriate double or triple mutant lines to uncover potentially redundant functions among members of gene families. Functions of redundant genes may also be inferred from the phenotypes of engineered dominant mutations created by activation tagging, and from the characterization and eventual cloning of loci defined by mutation and of quantitative trait loci assessed in natural populations.
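The co-regulation idea behind cluster analysis of expression data can be illustrated with a pairwise correlation screen. The Python sketch below is a deliberately simplified stand-in for full clustering: it uses Pearson correlation to find characterized genes whose expression profiles track that of an uncharacterized gene. The gene names, expression values, and the 0.9 threshold are all invented for illustration and are not taken from any Arabidopsis data set.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def coexpressed_partners(
    profiles: dict[str, list[float]], query: str, threshold: float = 0.9
) -> list[str]:
    """Names of genes whose profile correlates with `query` at or above `threshold`."""
    target = profiles[query]
    return sorted(
        name
        for name, values in profiles.items()
        if name != query and pearson(target, values) >= threshold
    )


# Hypothetical expression values across four array experiments.
profiles = {
    "unknown_gene": [1.0, 2.0, 3.0, 4.0],
    "known_cold_response": [2.0, 4.0, 6.0, 8.0],
    "known_light_response": [4.0, 3.0, 2.0, 1.0],
}
print(coexpressed_partners(profiles, "unknown_gene"))
```

A high correlation only suggests possible co-regulation. As the text notes, such predictions still need to be followed up experimentally, for example with insertion lines.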


The sequencing of the Arabidopsis genome opened unprecedented opportunities for plant biology and agronomy. It confirmed previous hypotheses about the genome and simultaneously raised many new questions. Understanding the organization and evolutionary history of the Arabidopsis genome is important for accurately assessing its relatedness to other plant species and for transferring knowledge gained from studying this plant to other plants.

Using the sequenced genome and a set of rapidly developing genomic analysis tools, researchers are now able to address biological questions in ways that were impossible just a few years ago. These new investigative methods and the results they generate will undoubtedly change the way research is conducted and published. Furthermore, advances made in this small weed serve both as a reference for understanding processes shared with agronomically important plants and as a tool for understanding the genetic basis of diversity.
