We can help you!
As a free service, we will run our CEGMA pipeline software against any (eukaryotic) genome sequence
to:
- 1) assess how many genes are likely to be present/absent in that assembly and
- 2) accurately predict the structures of a set of conserved genes that we believe are present in all eukaryotes.
Please email ifkorf@ucdavis.edu for more information.
This web page contains
Genome sequencing projects have now been completed (or are underway)
for a wide range of species. However, a completed project does not
guarantee a complete genomic sequence, and for many published genomes
it is not clear how much of the genome remains to be
sequenced. The finish line for genome projects usually depends on an
estimate of genome size, and such estimates may be erroneous. To
tackle this important, but much neglected, issue of genome
completeness we have developed a set of core eukaryotic genes (CEGs)
that are extremely highly conserved and which we believe are present
in virtually all eukaryotes with a low number of paralogs. For a
large number of diverse eukaryotes with genome sequence data we assess
the proportion of CEGs that can be mapped into each genome. Using
assemblies at different coverage levels, built from artificially
reassembled genome data for the nematode Caenorhabditis briggsae and
H. sapiens, and multiple real assemblies from T. gondii, we show the
effects of increasing sequence coverage on the ability to find CEGs
within a species.
Based on the 458 core eukaryotic genes (Parra et al. 2007), we
generated a subset of 248 core eukaryotic genes that are likely to be
found with a low number of inparalogs in a wide range of species. We
selected the cases with only one ortholog in at least four of the six
species. This reduced the paralogy (the number of CEGs with more than
one inparalog per species) by about 10% in the six species. The
cutoff file includes, for each CEG, the conservation group it
belongs to (1-4, see the main paper for details), the cutoff for the
hmmsearch alignments and the length of the protein.
| | Proteins (fasta) | Alignment (clustal) | Profiles (hmmer) | Cutoffs |
|---|---|---|---|---|
| Core eukaryotic genes (248) | 300K | 360K | 5M | cutoffs |
This table shows the file sizes of the files in each category.
Click on file size numbers to retrieve the corresponding file.
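The selection rule for the 248-gene subset (only one ortholog in at least four of the six species) can be illustrated with a simple filter. This is a minimal sketch, not the code used to build the set: the function name `select_low_paralogy`, the input layout (a list of per-species copy counts for each group) and the toy group names are all assumptions made for illustration.

```python
from typing import Dict, List


def select_low_paralogy(copy_counts: Dict[str, List[int]],
                        min_single: int = 4) -> List[str]:
    """Return the IDs of groups that have exactly one ortholog
    in at least `min_single` of the species.

    `copy_counts` maps a (hypothetical) group ID to the number of
    gene copies found in each species, one entry per species.
    """
    selected = []
    for group_id, counts in copy_counts.items():
        # Count the species in which the group is single-copy.
        single_copy_species = sum(1 for c in counts if c == 1)
        if single_copy_species >= min_single:
            selected.append(group_id)
    return selected


# Toy data for six species (invented values, not from the CEG set).
toy = {
    "groupA": [1, 1, 1, 1, 2, 3],  # single-copy in 4 species: kept
    "groupB": [1, 2, 2, 1, 3, 1],  # single-copy in 3 species: dropped
    "groupC": [1, 1, 1, 1, 1, 1],  # single-copy in 6 species: kept
}
print(select_low_paralogy(toy))
```

With the toy data, `groupA` and `groupC` pass the filter and `groupB` does not.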
Generating C. briggsae re-assemblies
The published genome of C. briggsae is a 12x WGS assembly (Stein et
al. 2003) and was produced using the Phusion assembler. We used the
original sequencing reads from this assembly and randomly sampled them
to produce virtual assemblies at defined levels of sequence coverage
(2x, 4x, 6x, 8x, and 10x). The 2x assembly derives from 400,000
sequence reads and each subsequent assembly adds another 400,000
reads. The current version of Phusion was used to produce both contigs
and scaffolds for each assembly. N50 is calculated by first ordering
all contig (or scaffold) sizes and then adding the lengths (starting
from the longest contig) until the summed length exceeds 50% of the
total length of all contigs. N50 is reported in Kb. "Report" files
contain a summary of the genes mapped by CEGMA as well as
the IDs of the missing genes.
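The read-sampling scheme described above (a 2x set of 400,000 reads, with each further coverage level adding another 400,000) can be sketched as follows. This is only an illustration under stated assumptions: it assumes the subsets are nested, so that each level contains all reads of the level below it, and the function name `nested_read_subsets`, its parameters and the read IDs are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def nested_read_subsets(read_ids: Sequence[str],
                        reads_per_step: int,
                        coverage_labels: Sequence[str],
                        seed: int = 0) -> Dict[str, List[str]]:
    """Randomly sample reads into nested subsets, one per coverage level.

    The first label gets `reads_per_step` reads; every following label
    gets the previous subset plus another `reads_per_step` reads.
    """
    needed = reads_per_step * len(coverage_labels)
    if needed > len(read_ids):
        raise ValueError("not enough reads for the requested levels")
    rng = random.Random(seed)
    # Draw all required reads once, without replacement, in random order.
    shuffled = rng.sample(list(read_ids), needed)
    return {
        label: shuffled[: reads_per_step * (i + 1)]
        for i, label in enumerate(coverage_labels)
    }


# Small-scale demo: 10 reads per step instead of 400,000.
demo_reads = [f"read{i}" for i in range(100)]
subsets = nested_read_subsets(demo_reads, 10, ["2x", "4x", "6x", "8x", "10x"])
for label, reads in subsets.items():
    print(label, len(reads))
```

Each subset would then be passed to the assembler separately; that step is not shown here.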
This table contains the contig and scaffold files for
C. briggsae. It shows the file sizes of the files in each
category. Click on file size numbers to retrieve the
corresponding file. Click here to
get the tar.gz file with all the data.
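The N50 definition given above translates directly into a few lines of code. The sketch below follows the wording used here (the running sum must exceed 50% of the total) and assumes that the value reported is the length of the contig at which the threshold is crossed. The function name `n50` and the example lengths are illustrative only.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig (or scaffold) lengths.

    Lengths are sorted from longest to shortest and summed until the
    running total exceeds half of the total length; the length of the
    contig that crosses the threshold is returned.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # Some tools use "reaches or exceeds" (>=) here instead.
        if running > half_total:
            return length
    raise ValueError("lengths must contain at least one positive value")


# Invented example lengths in bp: total 300,000, half is 150,000.
example = [100_000, 80_000, 60_000, 40_000, 20_000]
print(n50(example) / 1000, "Kb")
```

In the example the running sum passes 150,000 bp at the second contig, so the N50 is 80 Kb.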
Generating virtual human genome assemblies
We generated seven virtual human genome assemblies by using the
distribution of known contig sizes from the draft genome assemblies of
guinea pig (1.9x), cow (3x, 6x and 7.1x), chimpanzee (4.2x and 6.6x)
and the rhesus macaque (5.3x). Estimates of genome size for these
species, as measured by the C-value, are all in a narrow range between
3.43 and 3.59 pg of DNA (Gregory et al. 2007). For each virtual
assembly we worked through the list of contig sizes and extracted an
equal length of sequence from the published human genome sequence. In
doing so we effectively sampled random sites from across the genome
and ensured that the extracted sequences did not overlap. The
"gff coord" files give the coordinates used to generate the contigs
and scaffolds from the human genome (ncbi36), based on the length
distribution of the reference draft genomes. N50 is calculated by first
ordering all contig (or scaffold) sizes and then adding the lengths
(starting from the longest contig) until the summed length exceeds 50%
of the total length of all contigs. N50 is reported in Kb. "Report"
files contain a summary of the genes mapped by CEGMA as
well as the IDs of the missing genes.
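One way to picture the extraction step described above is rejection sampling of non-overlapping intervals. The sketch below is a simplification under stated assumptions: it treats the genome as a single sequence of known length rather than a set of chromosomes, uses 0-based half-open coordinates, and simply retries a random start position whenever a candidate interval overlaps one already placed. The function name `sample_non_overlapping` and its parameters are hypothetical, and this is not necessarily the procedure used to produce the "gff coord" files.

```python
import random
from typing import List, Sequence, Tuple


def sample_non_overlapping(genome_length: int,
                           contig_lengths: Sequence[int],
                           seed: int = 0,
                           max_tries: int = 10_000) -> List[Tuple[int, int]]:
    """Place one random interval per contig length so that no two overlap.

    Returns (start, end) pairs, 0-based and half-open, in the same
    order as `contig_lengths`.
    """
    rng = random.Random(seed)
    placed: List[Tuple[int, int]] = []
    for length in contig_lengths:
        if length > genome_length:
            raise ValueError("contig longer than the genome")
        for _ in range(max_tries):
            start = rng.randrange(0, genome_length - length + 1)
            end = start + length
            # Accept only if the candidate is clear of every placed interval.
            if all(end <= s or start >= e for s, e in placed):
                placed.append((start, end))
                break
        else:
            raise RuntimeError("could not place an interval without overlap")
    return placed


# Toy genome of 10,000 bp and four contig sizes (invented values).
intervals = sample_non_overlapping(10_000, [500, 300, 200, 100])
for start, end in intervals:
    print(start, end, end - start)
```

A linear scan over the placed intervals is adequate for a toy example; for hundreds of thousands of contigs an interval tree or a sorted list of free regions would be a more practical choice.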
| Assembly details | gff coord | Contigs: N | Contigs: N50 (Kb) | Contigs: Seqs | Contigs: Report | Scaffolds: N | Scaffolds: N50 (Kb) | Scaffolds: Seqs | Scaffolds: Report |
|---|---|---|---|---|---|---|---|---|---|
| 1.9x | 10M | 590,603 | 3.1 | 202M | 21.0% | 130,283 | 51.9 | 276M | 42.3% |
| 3x | 12M | 795,203 | 4.1 | 360M | 35.5% | 449,727 | 13.5 | 364M | 50.4% |
| 4.2x | 7M | 435,593 | 13.1 | 438M | 57.2% | 81,459 | 2,425 | 433M | 90.7% |
| 5.3x | 6M | 368,201 | 14.7 | 459M | 59.7% | 28,863 | 692 | 455M | 91.9% |
| 6x | 4M | 296,517 | 19.1 | 414M | 60.1% | 62,471 | 436 | 411M | 85.4% |
| 6.6x | 4M | 292,555 | 28.8 | 471M | 72.2% | 77,769 | 8,217 | 464M | 95.9% |
| 7.1x | 1M | 131,620 | 44.3 | 440M | 79.8% | 16,098 | 1,042 | 436M | 92.7% |
This table contains the contig and scaffold files for
H. sapiens. It shows the file sizes of the files in each
category. Click on file size numbers to retrieve the
corresponding file. Click here to
get the tar.gz file with all the data.
The following data correspond to the genes mapped by CEGMA in
some of the available sequenced genomes. The "Mapping complete" column
lists the percentage of the 248 CEGs that were mapped in the genome of
each species. "Mapping partial" gives the corresponding value for CEGs
that are partially mapped. Coverage refers to approximate values of
sequence coverage, for WGS genomes only. "Report" files contain a
summary of the genes mapped by CEGMA as well as the IDs of the missing
genes.
| Species | Release | Coverage | Mapping complete | Mapping partial | Genomic (fasta) | Coord (gff) | Protein (fasta) | Report (txt) |
|---|---|---|---|---|---|---|---|---|
| Mammals | | | | | | | | |
| C.familiaris | CanFam2.0 | 7.5x | 98.0% | 99.2% | X | X | X | X |
| B.taurus | v3.1 | 7.1x | 98.4% | 99.2% | X | X | X | X |
| P.troglodytes | v2.1 | 6.6x | 96.8% | 98.8% | X | X | X | X |
| M.musculus | NCBI m36 | 6.3x | 99.2% | 100% | X | X | X | X |
| M.mulatta | v1.0 | 5.3x | 96.0% | 98.0% | X | X | X | X |
| F.catus | FelCat3 | 2x | 58.1% | 75.8% | X | X | X | X |
| L.africana | LoxAfr 1.0 | 2x | 46.0% | 68.5% | X | X | X | X |
| C.porcellus | CavPor2 | 1.9x | 46.0% | 68.1% | X | X | X | X |
| Vertebrates | | | | | | | | |
| O.anatinus | v5.0 | 6x | 74.6% | 77.0% | X | X | X | X |
| G.gallus | v2.1 | 6.6x | 83.9% | 84.3% | X | X | X | X |
| X.tropicalis | v4.1 | 7.7x | 95.6% | 96.8% | X | X | X | X |
| T.rubripes | v4 | 8.7x | 98.0% | 99.2% | X | X | X | X |
| C.intestinalis | v1.95 | 11x | 96.4% | 97.8% | X | X | X | X |
| Insects | | | | | | | | |
| A.gambiae | AgamP3 | 10.2x | 98.8% | 99.6% | X | X | X | X |
| A.mellifera | Amel4.0 | 7.5x | 91.9% | 94.3% | X | X | X | X |
| Nematodes | | | | | | | | |
| C.briggsae | Cb2 | 12x | 99.2% | 99.2% | X | X | X | X |
| C.brenneri | v4.0 | 9.5x | 98.8% | 100% | X | X | X | X |
| C.remanei | v1.0 | 9x | 96.0% | 98.8% | X | X | X | X |
| T.spiralis | v1 | >30x | 94.0% | 96.0% | X | X | X | X |
| Plants | | | | | | | | |
| P.trichocarpa | v1.0 | 7.5x | 98.4% | 99.2% | X | X | X | X |
| O.sativa | Build 4 | - | 98.4% | 98.4% | X | X | X | X |
| C.reinhardtii | v3.1 | 12.8x | 93.1% | 93.1% | X | X | X | X |
| Fungi | | | | | | | | |
| N.crassa | v7 | >10x | 98.8% | 98.8% | X | X | X | X |
| M.grisea | v5 | 7x | 97.9% | 98.8% | X | X | X | X |
| Protozoan | | | | | | | | |
| P.falciparum | v5.2 | - | 75.0% | 75.0% | X | X | X | X |
| G.lamblia | giardia14 | 11x | 46.4% | 46.4% | X | X | X | X |
Click on the X to retrieve the corresponding file.
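Because every mapping percentage is expressed relative to the 248 CEGs, a count of mapped genes converts to a percentage with simple arithmetic. The helper below is an illustrative sketch with hypothetical names (`percent_mapped`, `missing_count`); the counts used in the example are chosen for the arithmetic and are not taken from any report file.

```python
TOTAL_CEGS = 248  # size of the CEG set described on this page


def percent_mapped(n_mapped: int, total: int = TOTAL_CEGS,
                   decimals: int = 1) -> float:
    """Percentage of CEGs mapped, rounded to `decimals` places."""
    if not 0 <= n_mapped <= total:
        raise ValueError("n_mapped must lie between 0 and total")
    return round(100 * n_mapped / total, decimals)


def missing_count(n_mapped: int, total: int = TOTAL_CEGS) -> int:
    """Number of CEGs that were not mapped."""
    return total - n_mapped


# Invented counts, for illustration only.
for n in (248, 243, 186):
    print(n, percent_mapped(n), missing_count(n))
```

For instance, 243 of 248 genes gives 98.0% with 5 genes missing, and 186 of 248 gives exactly 75.0%.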
New set of species not included in the paper (soon to be completely filled)
| Species | Release | Coverage | Mapping complete | Mapping partial | Paralogy | Genomic (fasta) | Coord (gff) | Protein (fasta) | Report (txt) |
|---|---|---|---|---|---|---|---|---|---|
| Mammals | | | | | | | | | |
| P.pygmaeus | v2.02 | 6x | 90.73% | 99.19% | 33.78% | X | X | X | X |
| C.jacchus | v2.02 | 6x | 95.56% | 100% | 39.66% | X | X | X | X |
| M.domestica | MonDom5 | 6.8x | 94.35% | 96.77% | 30.34% | X | X | X | X |
| R.norvegicus | v3.4 | 7x | 97.18% | 100% | 30.71% | X | X | X | X |
| Reptile | | | | | | | | | |
| A.carolinensis | AnoCar1.0 | 6.3x | 87.10% | 93.15% | 16.20% | X | X | X | X |
| Fish | | | | | | | | | |
| P.marinus | v3.0 | 5.9x | 56.05% | 69.35% | 18.71% | X | X | X | X |
| O.latipes | v1 | 6x | 97.58% | 99.19% | 21.49% | X | X | X | X |
| G.aculeatus | gasAcu1 | 9x | 98.39% | 98.79% | 21.31% | X | X | X | X |
| D.rerio | Zv7 | 10x | 96.37% | 98.79% | 27.62% | X | X | X | X |
Click on the X to retrieve the corresponding file.
The latest CEGMA distribution can be found on the CEGMA web
page.
- G. Parra, K. Bradnam and I. Korf.
"CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes."
Bioinformatics, 23: 1061-1067 (2007)