Core eukaryotic genes dataset

Genome sequencing projects have now been completed (or are underway) for a wide range of species. However, a completed project does not guarantee a completed genomic sequence, and for many published genomes it is not clear how much of the genome remains to be sequenced. The finish line for genome projects usually depends on an estimate of genome size, and such estimates may be erroneous. To tackle this important, but much neglected, issue of genome completeness we have developed a set of core eukaryotic genes (CEGs) that are extremely highly conserved and which we believe are present in virtually all eukaryotes with a low number of paralogs. For a large number of diverse eukaryotes with genome sequence data we assess the proportion of CEGs that can be mapped in each genome. Using assemblies at different coverage levels, built from artificially reassembled genome data for the nematode Caenorhabditis briggsae and H. sapiens, and multiple real assemblies from T. gondii, we show the effects of increasing sequence coverage on the ability to find CEGs within a species.

Based on the 458 core eukaryotic genes (Parra et al. 2007), we generated a subset of 248 core eukaryotic genes that are likely to be found with a low number of inparalogs in a wide range of species. We selected the CEGs that have only one ortholog in at least four of the six species. This reduced the paralogy (the number of CEGs with more than one inparalog per species) by about 10% in the six species. The cutoff files include, for each CEG, the conservation group it belongs to (1-4, see the main paper for details), the cutoff for the hmmsearch alignments, and the length of the protein.
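The selection rule above (only one ortholog in at least four of the six species) can be sketched in a few lines of Python. This is only an illustration and not the code used to build the dataset: the function name, the input layout (a dictionary of ortholog counts per species) and the toy identifiers are all assumptions.

```python
from typing import Dict, List


def select_low_paralogy_cegs(
    ortholog_counts: Dict[str, Dict[str, int]],
    min_single_copy_species: int = 4,
) -> List[str]:
    """Return the CEG ids that have exactly one ortholog in at least
    `min_single_copy_species` of the species listed for that CEG."""
    selected = []
    for ceg_id, counts in ortholog_counts.items():
        # Count the species in which this CEG is single copy.
        single_copy = sum(1 for n in counts.values() if n == 1)
        if single_copy >= min_single_copy_species:
            selected.append(ceg_id)
    return selected


# Hypothetical toy input: ortholog counts for three made-up CEGs
# in six made-up species (sp1..sp6).
toy_counts = {
    "CEG_A": {"sp1": 1, "sp2": 1, "sp3": 1, "sp4": 1, "sp5": 2, "sp6": 3},
    "CEG_B": {"sp1": 2, "sp2": 2, "sp3": 1, "sp4": 1, "sp5": 1, "sp6": 3},
    "CEG_C": {"sp1": 1, "sp2": 1, "sp3": 1, "sp4": 1, "sp5": 1, "sp6": 1},
}

kept = select_low_paralogy_cegs(toy_counts)
print(kept)
```

In this toy input CEG_B is single copy in only three species, so it is dropped, while CEG_A and CEG_C are kept.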

Generating C. briggsae re-assemblies

The published genome of C. briggsae is a 12x WGS assembly (Stein et al. 2003) and was produced using the Phusion assembler. We used the original sequencing reads from this assembly and randomly sampled them to produce virtual assemblies at defined levels of sequence coverage (2x, 4x, 6x, 8x, and 10x). The 2x assembly derives from 400,000 sequence reads and each subsequent assembly adds another 400,000 reads. The current version of Phusion was used to produce both contigs and scaffolds for each assembly. N50 is calculated by first ordering all contig (or scaffold) sizes and then adding the lengths (starting from the longest contig) until the summed length exceeds 50% of the total length of all contigs; the N50 is the length of the contig at which this happens. N50 is reported in Kb. "Report" files contain the summary of the genes mapped by cegma as well as the IDs of the missing genes.
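As an illustration of the read sampling described above, the sketch below shuffles a list of read identifiers once and takes successively longer prefixes, so that each coverage level adds a fixed number of reads to the previous one. Treating the subsets as nested is an assumption drawn from the phrase "adds another 400,000 reads"; the function name, the seed and the toy read names are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def nested_read_subsets(
    read_ids: Sequence[str],
    reads_per_step: int,
    coverage_labels: Sequence[str],
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Randomly order the reads once, then give each coverage level the
    first k * reads_per_step reads (k = 1, 2, ...), so that every level
    contains all reads of the previous one."""
    needed = reads_per_step * len(coverage_labels)
    if needed > len(read_ids):
        raise ValueError("not enough reads for the requested coverage levels")
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    shuffled = list(read_ids)
    rng.shuffle(shuffled)
    return {
        label: shuffled[: reads_per_step * (i + 1)]
        for i, label in enumerate(coverage_labels)
    }


# Toy example: 10 made-up reads, 2 reads per step instead of 400,000.
toy_reads = [f"read{i}" for i in range(10)]
subsets = nested_read_subsets(toy_reads, 2, ["2x", "4x", "6x", "8x", "10x"])
for label, reads in subsets.items():
    print(label, len(reads))
```

With the real data, `reads_per_step` would be 400,000 and each subset would then be passed to the assembler.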

	Contigs				Scaffolds
Assembly details	N	N50	Seqs	Report	N	N50	Seqs	Report
2x	34,456	2.3	20M	44.3%	10,297	14.2	20M	53.2%
4x	20,421	7.4	27M	80.6%	2,268	16.4	27M	90.2%
6x	11,399	16.4	29M	91.5%	1,028	465	29M	95.9%
8x	7,363	28.9	29M	93.1%	971	983	29M	97.2%
10x	5,614	37.4	29M	97.9%	675	1,032	29M	98.8%

This table contains the contig and scaffold files for C.briggsae.
It shows the file sizes of the files in each category. Click on file size numbers to retrieve the corresponding file.
Click here to get the tar.gz file with all the data.
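The N50 calculation described above can be written as a short Python function. This is a minimal sketch rather than the script used to produce the tables: the function name is hypothetical, lengths are assumed to be in base pairs, and the strict "exceeds 50%" test follows the wording on this page (some tools use "reaches or exceeds" instead).

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig (or scaffold) lengths."""
    ordered = sorted(lengths, reverse=True)  # longest first
    total = sum(ordered)
    if total <= 0:
        raise ValueError("need at least one sequence of positive length")
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig that pushes the running sum
        # past half of the total assembly length.
        if running > total / 2:
            return length
    raise AssertionError("unreachable: the full sum always exceeds half")


# Toy example with made-up contig lengths in bp.
contigs = [2000, 10000, 4000, 8000, 6000]
print(n50(contigs) / 1000, "Kb")
```

For the toy lengths the total is 30,000 bp; the running sum passes 15,000 bp at the second-longest contig, so the N50 is 8,000 bp (8.0 Kb).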

Generating virtual human genome assemblies

We generated seven virtual human genome assemblies by using the distribution of known contig sizes from the draft genome assemblies of guinea pig (1.9x), cow (3x, 6x and 7.1x), chimpanzee (4.2x and 6.6x) and the rhesus macaque (5.3x). Estimates of genome size for these species, as measured by the C-value, are all in a narrow range between 3.43 and 3.59 pg of DNA (Gregory et al. 2007). For each virtual assembly we worked through the list of contig sizes and extracted an equal length of sequence from the published human genome sequence. In doing so we effectively sampled random sites from across the genome and ensured that none of the extracted sequences overlapped. The "gff coord" files give the coordinates used to generate the contigs and scaffolds from the human genome (NCBI36), based on the length distribution of the reference draft genomes. N50 is calculated by first ordering all contig (or scaffold) sizes and then adding the lengths (starting from the longest contig) until the summed length exceeds 50% of the total length of all contigs; the N50 is the length of the contig at which this happens. N50 is reported in Kb. "Report" files contain the summary of the genes mapped by cegma as well as the IDs of the missing genes.

		Contigs				Scaffolds
Assembly details	gff coord	N	N50	Seqs	Report	N	N50	Seqs	Report
1.9x	10M	590,603	3.1	202M	21.0%	130,283	51.9	276M	42.3%
3x	12M	795,203	4.1	360M	35.5%	449,727	13.5	364M	50.4%
4.2x	7M	435,593	13.1	438M	57.2%	81,459	2,425	433M	90.7%
5.3x	6M	368,201	14.7	459M	59.7%	28,863	692	455M	91.9%
6x	4M	296,517	19.1	414M	60.1%	62,471	436	411M	85.4%
6.6x	4M	292,555	28.8	471M	72.2%	77,769	8,217	464M	95.9%
7.1x	1M	131,620	44.3	440M	79.8%	16,098	1,042	436M	92.7%

This table contains the contig and scaffold files for H.sapiens.
It shows the file sizes of the files in each category. Click on file size numbers to retrieve the corresponding file.
Click here to get the tar.gz file with all the data.
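The extraction of non-overlapping fragments that follow a given list of contig sizes can be sketched as follows. The page does not say how overlaps were avoided, so the rejection sampling used here is an assumption, as are the function name, the seed and the simplification to a single reference sequence (a real genome has many chromosomes).

```python
import random
from typing import List, Sequence, Tuple


def sample_non_overlapping(
    genome_length: int,
    contig_sizes: Sequence[int],
    seed: int = 0,
    max_tries: int = 1000,
) -> List[Tuple[int, int]]:
    """For each size, pick a random half-open interval [start, end) of that
    length on a sequence of `genome_length` bases, rejecting any candidate
    that overlaps an interval already taken."""
    rng = random.Random(seed)
    taken: List[Tuple[int, int]] = []
    for size in contig_sizes:
        if size <= 0 or size > genome_length:
            raise ValueError(f"cannot place a fragment of size {size}")
        for _ in range(max_tries):
            start = rng.randrange(0, genome_length - size + 1)
            end = start + size
            # Two half-open intervals are disjoint when one ends
            # before (or exactly where) the other starts.
            if all(end <= s or start >= e for s, e in taken):
                taken.append((start, end))
                break
        else:
            raise RuntimeError(f"no free site found for size {size}")
    return taken


# Toy example: four made-up contig sizes on a 1,000 bp sequence.
sizes = [100, 50, 200, 25]
intervals = sample_non_overlapping(1000, sizes)
for start, end in intervals:
    print(start, end, end - start)
```

The pairwise overlap check is quadratic in the number of fragments, which is fine for a sketch but would need an interval index for the hundreds of thousands of contigs listed in the table above.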

The following data correspond to the genes mapped by cegma in some of the available sequenced genomes. The "Mapping complete" column lists the percentage of the 248 CEGs that were completely mapped in the genome of each species. "Mapping partial" lists the percentage of CEGs that were at least partially mapped. Coverage refers to approximate values of sequence coverage for WGS genomes only. "Report" files contain the summary of the genes mapped by cegma as well as the IDs of the missing genes.

	Release	Coverage	Mapping	Mapping	Genomic	Coord	Protein	Report
			complete	partial	(fasta)	(gff)	(fasta)	(txt)
Mammals
C.familiaris	CanFam2.0	7.5x	98.0%	99.2%	X	X	X	X
B.taurus	v3.1	7.1x	98.4%	99.2%	X	X	X	X
P.troglodytes	v2.1	6.6x	96.8%	98.8%	X	X	X	X
M.musculus	NCBI m36	6.3x	99.2%	100%	X	X	X	X
M.mulatta	v1.0	5.3x	96.0%	98.0%	X	X	X	X
F.catus	FelCat3	2x	58.1%	75.8%	X	X	X	X
L.africana	LoxAfr 1.0	2x	46.0%	68.5%	X	X	X	X
C.porcellus	CavPor2	1.9x	46.0%	68.1%	X	X	X	X
Vertebrates
O.anatinus	v5.0	6x	74.6%	77.0%	X	X	X	X
G.gallus	v2.1	6.6x	83.9%	84.3%	X	X	X	X
X.tropicalis	v4.1	7.7x	95.6%	96.8%	X	X	X	X
T.rubripes	v4	8.7x	98.0%	99.2%	X	X	X	X
C.intestinalis	v1.95	11x	96.4%	97.8%	X	X	X	X
Insects
A.gambiae	AgamP3	10.2x	98.8%	99.6%	X	X	X	X
A.mellifera	Amel4.0	7.5x	91.9%	94.3%	X	X	X	X
Nematodes
C.briggsae	Cb2	12x	99.2%	99.2%	X	X	X	X
C.brenneri	v4.0	9.5x	98.8%	100%	X	X	X	X
C.remanei	v1.0	9x	96.0%	98.8%	X	X	X	X
T.spiralis	v1	>30x	94.0%	96.0%	X	X	X	X
Plants
P.trichocarpa	v1.0	7.5x	98.4%	99.2%	X	X	X	X
O.sativa	Build 4	-	98.4%	98.4%	X	X	X	X
C.reinhardtii	v3.1	12.8x	93.1%	93.1%	X	X	X	X
Fungi
N.crassa	v7	>10x	98.8%	98.8%	X	X	X	X
M.grisea	v5	7x	97.9%	98.8%	X	X	X	X
Protozoa
P.falciparum	v5.2	-	75.0%	75.0%	X	X	X	X
G.lamblia	giardia14	11x	46.4%	46.4%	X	X	X	X

Click on the X to retrieve the corresponding file.

New set of species not included in the paper (soon to be completely filled)

	Release	Coverage	Mapping	Mapping	Paralogy	Genomic	Coord	Protein	Report
			complete	partial		(fasta)	(gff)	(fasta)	(txt)
Mammals
P.pygmaeus	v2.02	6x	90.73%	99.19%	33.78%	X	X	X	X
C.jacchus	v2.02	6x	95.56%	100%	39.66%	X	X	X	X
M.domestica	MonDom5	6.8x	94.35%	96.77%	30.34%	X	X	X	X
R.norvegicus	v3.4	7x	97.18%	100%	30.71%	X	X	X	X
Reptiles
A.carolinensis	AnoCar1.0	6.3x	87.10%	93.15%	16.20%	X	X	X	X
Fish
P.marinus	v3.0	5.9x	56.05%	69.35%	18.71%	X	X	X	X
O.latipes	v1	6x	97.58%	99.19%	21.49%	X	X	X	X
G.aculeatus	gasAcu1	9x	98.39%	98.79%	21.31%	X	X	X	X
D.rerio	Zv7	10x	96.37%	98.79%	27.62%	X	X	X	X

Click on the X to retrieve the corresponding file.
