Genome completeness: A novel approach using core genes
Ian Korf Lab. Genome Center. UCDavis


   

We can help you!

As a free service, we will run our CEGMA pipeline software against any (eukaryotic) genome sequence to:
  • 1) assess how many genes are likely to be present/absent in that assembly and
  • 2) accurately predict the structures of a set of conserved genes that we believe are present in all eukaryotes.
Please email ifkorf@ucdavis.edu for more information.

Abstract

 
Genome sequencing projects have now been completed (or are underway) for a wide range of species. However, a completed project does not guarantee a completed genomic sequence, and for many published genomes it is not clear how much of the genome remains to be sequenced. The finish line for genome projects usually depends on an estimate of genome size, and such estimates may be erroneous. To tackle this important but much neglected issue of genome completeness, we have developed a set of core eukaryotic genes (CEGs) that are extremely highly conserved and that we believe are present in virtually all eukaryotes with few paralogs. For a large number of diverse eukaryotes with genome sequence data, we assess the proportion of CEGs that can be mapped in each genome. Using assemblies at different levels of coverage, built from artificially reassembled genome data for the nematode Caenorhabditis briggsae and H.sapiens, and multiple real assemblies from T.gondii, we show the effects of increasing sequence coverage on the ability to find CEGs within a species.


Core eukaryotic genes dataset

Based on the 458 core eukaryotic genes (Parra et al. 2007), we generated a subset of 248 CEGs that are likely to be found with a low number of inparalogs in a wide range of species. We selected the CEGs that have only one ortholog in at least four of the six species. This reduced the paralogy (the number of CEGs with more than one inparalog per species) by about 10% in the six species. The cutoffs file lists, for each CEG, the conservation group it belongs to (1-4; see the main paper for details), the cutoff for the hmmsearch alignments, and the length of the protein.

                              Proteins   Alignment   Profiles
                              (fasta)    (clustal)   (hmmer)

Core eukaryotic genes (248)   300K       360K        5M         cutoffs
 
This table shows the file sizes of the files in each category.
Click on file size numbers to retrieve the corresponding file.
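
The selection rule described above (only one ortholog in at least four of the six species) can be sketched in Python. This is only an illustration under assumptions: the function name, the input layout (copy counts per species for each CEG) and the toy identifiers are hypothetical and are not taken from the CEGMA code.

```python
from typing import Dict, List


def select_low_paralogy_cegs(
    copy_counts: Dict[str, Dict[str, int]],
    min_single_copy_species: int = 4,
) -> List[str]:
    """Return the ids of CEGs that are single-copy in enough species.

    copy_counts maps a CEG id to {species: number of orthologs found}.
    This layout is an assumption made for the example.
    """
    selected = []
    for ceg_id, per_species in copy_counts.items():
        # Count the species in which exactly one ortholog was found.
        single_copy = sum(1 for n in per_species.values() if n == 1)
        if single_copy >= min_single_copy_species:
            selected.append(ceg_id)
    return sorted(selected)


# Hypothetical toy data: two CEGs scored in six placeholder species.
toy_counts = {
    "CEG_A": {"sp1": 1, "sp2": 1, "sp3": 1, "sp4": 1, "sp5": 2, "sp6": 3},
    "CEG_B": {"sp1": 1, "sp2": 2, "sp3": 2, "sp4": 1, "sp5": 2, "sp6": 1},
}

# CEG_A is single-copy in four species and is kept; CEG_B only in three.
print(select_low_paralogy_cegs(toy_counts))
```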

Virtual assemblies

Generating C. briggsae re-assemblies

The published genome of C. briggsae is a 12x WGS assembly (Stein et al. 2003) that was produced using the Phusion assembler. We randomly sampled the original sequencing reads from this assembly to produce virtual assemblies at defined levels of sequence coverage (2x, 4x, 6x, 8x, and 10x). The 2x assembly derives from 400,000 sequence reads, and each subsequent assembly adds another 400,000 reads. The current version of Phusion was used to produce both contigs and scaffolds for each assembly. N50 is calculated by first ordering all contig (or scaffold) sizes and then adding the lengths (starting from the longest contig) until the summed length exceeds 50% of the total length of all contigs; the length of the contig at which this happens is the N50. N50 is reported in Kb. "Report" files contain the summary of the genes mapped by CEGMA as well as the ids of the missing genes.
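
As a rough illustration of the read sampling described above, the sketch below draws nested random subsets, so that each coverage level contains all reads of the previous level plus a fixed number of new ones. The function name, the seed parameter and the toy read ids are assumptions made for the example; the page does not state how the sampling was implemented.

```python
import random
from typing import Dict, List, Sequence


def nested_read_subsets(
    read_ids: Sequence[str],
    reads_per_step: int,
    coverage_labels: Sequence[str],
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Return one read subset per coverage label, each extending the last."""
    needed = reads_per_step * len(coverage_labels)
    if needed > len(read_ids):
        raise ValueError("not enough reads for the requested coverage levels")
    rng = random.Random(seed)
    # Sample without replacement once, then take growing prefixes.
    shuffled = rng.sample(list(read_ids), needed)
    return {
        label: shuffled[: reads_per_step * (i + 1)]
        for i, label in enumerate(coverage_labels)
    }


# Toy run: 100 hypothetical reads, 10 reads added per coverage level
# (the page uses 400,000 reads per level).
toy_reads = [f"read{i}" for i in range(100)]
subsets = nested_read_subsets(toy_reads, 10, ["2x", "4x", "6x", "8x", "10x"])
for label, reads in subsets.items():
    print(label, len(reads))
```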

           Contigs                              Scaffolds
Assembly   N        N50 (Kb)   Seqs   Report    N        N50 (Kb)   Seqs   Report
2x         34,456   2.3        20M    44.3%     10,297   14.2       20M    53.2%
4x         20,421   7.4        27M    80.6%     2,268    16.4       27M    90.2%
6x         11,399   16.4       29M    91.5%     1,028    465        29M    95.9%
8x         7,363    28.9       29M    93.1%     971      983        29M    97.2%
10x        5,614    37.4       29M    97.9%     675      1,032      29M    98.8%
This table contains the contig and scaffold files for C.briggsae.
It shows the file sizes of the files in each category. Click on file size numbers to retrieve the corresponding file.
Click here to get the tar.gz file with all the data.
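
The N50 definition used for the tables on this page can be written as a short Python function. This is a minimal sketch that follows the wording of the page (the running sum must exceed half of the total length); some tools use "reaches or exceeds" instead, which can give a different value in edge cases. The function name and the toy lengths are made up for the example.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig (or scaffold) lengths, in bases."""
    ordered = sorted(lengths, reverse=True)  # longest first
    if not ordered or ordered[-1] <= 0:
        raise ValueError("lengths must be a non-empty set of positive values")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig that pushes the sum past 50% of the total.
        if running > half_total:
            return length
    raise AssertionError("unreachable for positive lengths")


# Toy contig lengths in bases: total 290, half 145; 80 + 70 = 150 > 145.
toy_lengths = [30, 80, 20, 70, 40, 50]
print(n50(toy_lengths))         # N50 in bases
print(n50(toy_lengths) / 1000)  # N50 in Kb, the unit used in the tables
```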


Generating virtual human genome assemblies

We generated seven virtual human genome assemblies by using the distribution of known contig sizes from the draft genome assemblies of guinea pig (1.9x), cow (3x, 6x and 7.1x), chimpanzee (4.2x and 6.6x) and the rhesus macaque (5.3x). Estimates of genome size for these species, as measured by the C-value, are all in a narrow range between 3.43 and 3.59 pg of DNA (Gregory et al. 2007). For each virtual assembly we worked through the list of contig sizes and extracted an equal length of sequence from the published human genome sequence. In doing so we effectively sampled random sites from across the genome and ensured that none of the extracted sequences overlapped. The "gff coord" files give the coordinates used to generate the contigs and scaffolds from the human genome (ncbi36), based on the length distribution of the reference draft genomes. N50 is calculated by first ordering all contig (or scaffold) sizes and then adding the lengths (starting from the longest contig) until the summed length exceeds 50% of the total length of all contigs; the length of the contig at which this happens is the N50. N50 is reported in Kb. "Report" files contain the summary of the genes mapped by CEGMA as well as the ids of the missing genes.

                       Contigs                               Scaffolds
Assembly   gff coord   N         N50 (Kb)   Seqs   Report    N         N50 (Kb)   Seqs   Report
1.9x       10M         590,603   3.1        202M   21.0%     130,283   51.9       276M   42.3%
3x         12M         795,203   4.1        360M   35.5%     449,727   13.5       364M   50.4%
4.2x       7M          435,593   13.1       438M   57.2%     81,459    2,425      433M   90.7%
5.3x       6M          368,201   14.7       459M   59.7%     28,863    692        455M   91.9%
6x         4M          296,517   19.1       414M   60.1%     62,471    436        411M   85.4%
6.6x       4M          292,555   28.8       471M   72.2%     77,769    8,217      464M   95.9%
7.1x       1M          131,620   44.3       440M   79.8%     16,098    1,042      436M   92.7%
This table contains the contig and scaffold files for H.sapiens.
It shows the file sizes of the files in each category. Click on file size numbers to retrieve the corresponding file.
Click here to get the tar.gz file with all the data.
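
The extraction of non-overlapping pieces of the human genome, one per contig size, can be illustrated as follows. This sketch uses simple rejection sampling on a single sequence of known length; that technique, the function name and all parameters are assumptions made for the example, and the page does not describe how the actual coordinates were chosen.

```python
import random
from typing import List, Sequence, Tuple


def sample_non_overlapping(
    genome_length: int,
    contig_sizes: Sequence[int],
    seed: int = 0,
    max_tries: int = 10_000,
) -> List[Tuple[int, int]]:
    """Place one half-open interval [start, end) per contig size, no overlaps."""
    rng = random.Random(seed)
    taken: List[Tuple[int, int]] = []
    # Placing the largest pieces first makes it easier to find free space.
    for size in sorted(contig_sizes, reverse=True):
        if size <= 0 or size > genome_length:
            raise ValueError(f"cannot place a piece of size {size}")
        for _ in range(max_tries):
            start = rng.randrange(0, genome_length - size + 1)
            end = start + size
            # Accept the site only if it does not overlap any earlier piece.
            if all(end <= s or start >= e for s, e in taken):
                taken.append((start, end))
                break
        else:
            raise RuntimeError(f"no free site found for size {size}")
    return sorted(taken)


# Toy run: four hypothetical contig sizes on a 10,000 base sequence.
intervals = sample_non_overlapping(10_000, [500, 300, 200, 100], seed=3)
for start, end in intervals:
    print(start, end, end - start)
```

For a real genome the same idea would have to be applied per chromosome, and the resulting coordinates could then be written out in a format such as GFF.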


Core genes in new species

The following data correspond to the genes mapped by CEGMA in some of the available sequenced genomes. The "Mapping complete" column lists the percentage of the 248 CEGs that were mapped in the genome of each species. "Mapping partial" gives the percentage of CEGs that were at least partially mapped. Coverage refers to approximate values of sequence coverage, for WGS genomes only. "Report" files contain the summary of the genes mapped by CEGMA as well as the ids of the missing genes.

Species          Release      Coverage   Mapping    Mapping   Genomic   Coord   Protein   Report
                                         complete   partial   (fasta)   (gff)   (fasta)   (txt)
  Mammals
C.familiaris     CanFam2.0    7.5x       98.0%      99.2%     X         X       X         X
B.taurus         v3.1         7.1x       98.4%      99.2%     X         X       X         X
P.troglodytes    v2.1         6.6x       96.8%      98.8%     X         X       X         X
M.musculus       NCBI m36     6.3x       99.2%      100%      X         X       X         X
M.mulatta        v1.0         5.3x       96.0%      98.0%     X         X       X         X
F.catus          FelCat3      2x         58.1%      75.8%     X         X       X         X
L.africana       LoxAfr 1.0   2x         46.0%      68.5%     X         X       X         X
C.porcellus      CavPor2      1.9x       46.0%      68.1%     X         X       X         X
  Vertebrates
O.anatinus       v5.0         6x         74.6%      77.0%     X         X       X         X
G.gallus         v2.1         6.6x       83.9%      84.3%     X         X       X         X
X.tropicalis     v4.1         7.7x       95.6%      96.8%     X         X       X         X
T.rubripes       v4           8.7x       98.0%      99.2%     X         X       X         X
C.intestinalis   v1.95        11x        96.4%      97.8%     X         X       X         X
  Insects
A.gambiae        AgamP3       10.2x      98.8%      99.6%     X         X       X         X
A.mellifera      Amel4.0      7.5x       91.9%      94.3%     X         X       X         X
  Nematodes
C.briggsae       Cb2          12x        99.2%      99.2%     X         X       X         X
C.brenneri       v4.0         9.5x       98.8%      100%      X         X       X         X
C.remanei        v1.0         9x         96.0%      98.8%     X         X       X         X
T.spiralis       v1           >30x       94.0%      96.0%     X         X       X         X
  Plants
P.trichocarpa    v1.0         7.5x       98.4%      99.2%     X         X       X         X
O.sativa         Build 4      -          98.4%      98.4%     X         X       X         X
C.reinhardtii    v3.1         12.8x      93.1%      93.1%     X         X       X         X
  Fungi
N.crassa         v7           >10x       98.8%      98.8%     X         X       X         X
M.grisea         v5           7x         97.9%      98.8%     X         X       X         X
  Protozoan
P.falciparum     v5.2         -          75.0%      75.0%     X         X       X         X
G.lamblia        giardia14    11x        46.4%      46.4%     X         X       X         X
Click on the X to retrieve the corresponding file.
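
Because every percentage in the table is a fraction of the 248 CEGs, it can be converted back to an approximate gene count. The helper names below are hypothetical, and the counts are inferred from rounded percentages, so they may be off by one; the "Report" files remain the authoritative source for the mapped and missing genes.

```python
TOTAL_CEGS = 248  # size of the core gene set used on this page


def percent_mapped(n_mapped: int, total: int = TOTAL_CEGS) -> float:
    """Percentage of CEGs mapped, rounded to one decimal place."""
    return round(100 * n_mapped / total, 1)


def implied_count(percent: float, total: int = TOTAL_CEGS) -> int:
    """Approximate number of CEGs implied by a (rounded) percentage."""
    return round(percent / 100 * total)


# 98.0% of 248 CEGs corresponds to about 243 genes, i.e. about 5 missing.
mapped = implied_count(98.0)
print(mapped, TOTAL_CEGS - mapped, percent_mapped(mapped))
```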



New set of species not included in the paper (soon to be completely filled)

Species          Release      Coverage   Mapping    Mapping   Paralogy   Genomic   Coord   Protein   Report
                                         complete   partial              (fasta)   (gff)   (fasta)   (txt)
  Mammals
P.pygmaeus       v2.02        6x         90.73%     99.19%    33.78%     X         X       X         X
C.jacchus        v2.02        6x         95.56%     100%      39.66%     X         X       X         X
M.domestica      MonDom5      6.8x       94.35%     96.77%    30.34%     X         X       X         X
R.norvegicus     v3.4         7x         97.18%     100%      30.71%     X         X       X         X
  Reptile
A.carolinensis   AnoCar1.0    6.3x       87.10%     93.15%    16.20%     X         X       X         X
  Fish
P.marinus        v3.0         5.9x       56.05%     69.35%    18.71%     X         X       X         X
O.latipes        v1           6x         97.58%     99.19%    21.49%     X         X       X         X
G.aculeatus      gasAcu1      9x         98.39%     98.79%    21.31%     X         X       X         X
D.rerio          Zv7          10x        96.37%     98.79%    27.62%     X         X       X         X
Click on the X to retrieve the corresponding file.


Source code distribution

 
The latest CEGMA distribution can be found on the CEGMA web page.


References
  • G. Parra, K. Bradnam and I. Korf.
    "CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes."
    Bioinformatics, 23: 1061-1067 (2007)