Citrus sinensis genome v1.0 (JGI)

Overview
Analysis Name: Citrus sinensis genome v1.0 (JGI)
Method: Performed by JGI (v1.0)
Source: JGI Citrus sinensis assembly/annotation v1.0 (154)
Date performed: 2011-02-01

Note: The following text comes from phytozome.org:

Genome Size / Loci
This version (v.1) of the assembly is 319 Mb spread over 12,574 scaffolds. Half the genome is accounted for by 236 scaffolds 251 kb or longer. The current gene set (orange1.1) integrates 3.8 million ESTs with homology and ab initio-based gene predictions (see below). 25,376 protein-coding loci have been predicted, each with a primary transcript. An additional 20,771 alternative transcripts have been predicted, generating a total of 46,147 transcripts. 16,318 primary transcripts have EST support over at least 50% of their length. Two-fifths of the primary transcripts (10,813) have EST support over 100% of their length.
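
The statement that half the genome is accounted for by 236 scaffolds of 251 kb or longer corresponds to the usual N50/L50 statistics. As a minimal sketch of how such values are typically computed from a list of scaffold lengths (the function name and the toy lengths are hypothetical and are not taken from the JGI pipeline):

```python
def n50_and_l50(lengths):
    """Return (N50, L50) for a list of scaffold lengths in bp.

    N50 is the length of the scaffold at which the running total,
    taken from longest to shortest, first reaches half the assembly size.
    L50 is the number of scaffolds needed to reach that point.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy lengths for illustration only (not from the assembly).
toy = [100, 80, 60, 40, 20]
print(n50_and_l50(toy))  # (80, 2)
```

Applied to the real scaffold lengths, this kind of calculation should give values close to the N50 and scaffold count reported here, although the exact convention used by JGI is not stated.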

Sequencing Method
Genomic sequence was generated using a whole genome shotgun approach with 2 Gb sequence coming from GS FLX Titanium; 2.4 Gb from FLX Standard; 440 Mb from Sanger paired-end libraries; 2.0 Gb from 454 paired-end libraries.

Assembly Method
The 25.5 million 454 reads and 623k Sanger sequence reads were generated by a collaborative effort by 454 Life Sciences, University of Florida and JGI. The assembly was generated by Brian Desany at 454 Life Sciences using the Newbler assembler.

Identification of Repeats
A de novo repeat library was made by running RepeatModeler (Arian Smit, Robert Hubley) on the genome to produce a library of repeat sequences. Sequences with Pfam domains associated with non-TE functions were removed from the library of repeat sequences and the library was then used to mask 31% of the genome with RepeatMasker.

EST Alignments
We aligned the sweet orange EST sequences using Brian Haas's PASA pipeline, which aligns ESTs to the best place in the genome via GMAP and then filters hits to ensure proper splice boundaries.

Assembly metrics

Assembly size: 319 Mb
Number of scaffolds: 12,574
N50: 250,548 bp
Predicted transcripts: 46,147
Annotated genes:
Assembly BUSCO score (embryophyta_odb10): 92.2%
Annotation BUSCO score (embryophyta_odb10): 87.5%

Downloads

All assembly and annotation files are available for download by selecting the desired data type in the right-hand "Resources" sidebar. Each data type page provides a description of the available files and links to download them. Alternatively, you can browse all available files on the CGD data repository.

Assembly

The following text comes from phytozome.org:

Genomic sequence was generated using a whole genome shotgun approach with 2 Gb sequence coming from GS FLX Titanium; 2.4 Gb from FLX Standard; 440 Mb from Sanger paired-end libraries; 2.0 Gb from 454 paired-end libraries. The 25.5 million 454 reads and 623k Sanger sequence reads were generated by a collaborative effort by 454 Life Sciences, University of Florida and JGI. The assembly was generated by Brian Desany at 454 Life Sciences using the Newbler assembler.

Please note: if you download and use the JGI whole genome assembly and annotation please abide by the requirements for this data as specified on phytozome.org's Citrus sinensis download page.  

Downloads

Scaffolds (FASTA file, 83 Mb compressed) Csinensis_v1.0_scaffolds.fa.gz
Scaffolds w/ masked repeats (FASTA file, 83 Mb compressed) Csinensis_v1.0_scaffolds_RM.fa.gz
Scaffolds (GFF3 file, 78 Mb compressed) Csinensis_v1.0_scaffolds.gff3.gz
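
The scaffold files listed above are gzip-compressed FASTA. As a rough sketch, assuming a standard FASTA layout (header lines starting with ">" followed by sequence lines), the per-scaffold lengths can be read with the Python standard library alone. The function name is hypothetical.

```python
import gzip


def scaffold_lengths(path):
    """Map each FASTA record ID to its sequence length in bp.

    Works on plain or gzip-compressed FASTA; the record ID is taken
    as the first whitespace-delimited token of the header line.
    """
    lengths = {}
    name = None
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                parts = line[1:].split()
                name = parts[0] if parts else ""
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths
```

Calling `scaffold_lengths("Csinensis_v1.0_scaffolds.fa.gz")` on a downloaded copy should return one entry per scaffold. The sum of the values can then be compared with the reported assembly size.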

 

Gene Predictions

The following text comes from phytozome.org:

The current gene set (orange1.1) integrates 3.8 million ESTs with homology and ab initio-based gene predictions (see below). 25,376 protein-coding loci have been predicted, each with a primary transcript. An additional 20,771 alternative transcripts have been predicted, generating a total of 46,147 transcripts. 16,318 primary transcripts have EST support over at least 50% of their length. Two-fifths of the primary transcripts (10,813) have EST support over 100% of their length.
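
The counts quoted above can be cross-checked with simple arithmetic. The short sketch below uses only the figures given in the text (the variable names are illustrative) and shows that the "two-fifths" figure is a rounding of roughly 42.6%.

```python
# Figures as quoted in the text above.
primary = 25_376          # protein-coding loci, one primary transcript each
alternative = 20_771      # additional alternative transcripts
half_supported = 16_318   # primary transcripts with EST support over >= 50% of length
fully_supported = 10_813  # primary transcripts with EST support over 100% of length

total = primary + alternative
pct_half = round(100 * half_supported / primary, 1)
pct_full = round(100 * fully_supported / primary, 1)

print(total)     # 46147
print(pct_half)  # 64.3
print(pct_full)  # 42.6
```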

Please note: if you download and use the JGI whole genome assembly and annotation please abide by the requirements for this data as specified on phytozome.org's Citrus sinensis download page.  

Downloads

Coding sequences--CDS (FASTA file, 11 Mb compressed) Csinensis_v1.0_cds.fa.gz
Transcript sequences--mRNA (FASTA file, 15 Mb compressed) Csinensis_v1.0_transcript.fa.gz
Protein sequences (FASTA file, 7 Mb compressed) Csinensis_v1.0_peptide.fa.gz
Gene models (GFF3 file, 4 Mb compressed) Csinensis_v1.0_gene.gff3.gz
Alternate genes (GFF3 file, 3.5 Mb compressed) Csinensis_v1.0_alt_gene.gff3.gz
RepeatMasker repeats (GFF3 file, 6.3 Mb compressed) Csinensis_v1.0_repeats.gff3.gz
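
The GFF3 files in this list are tab-delimited with nine columns, the third of which holds the feature type (for example gene, mRNA, CDS). As a minimal sketch, a tally of feature types gives a quick sanity check against the locus and transcript counts quoted above. The function name is hypothetical, and the exact feature types present in these files are not confirmed here.

```python
import gzip
from collections import Counter


def count_feature_types(path):
    """Count GFF3 features by type (column 3) in a plain or gzipped file."""
    counts = Counter()
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            # Skip directives, comments, and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            # Ignore anything that is not a nine-column feature line
            # (for example an embedded FASTA section).
            if len(fields) != 9:
                continue
            counts[fields[2]] += 1
    return counts
```

For instance, `count_feature_types("Csinensis_v1.0_gene.gff3.gz")["gene"]` would be expected to be close to the number of predicted loci if the file uses "gene" as its top-level feature type.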

 

Protein Homology

The protein homology analysis presented here was performed by the Main Bioinformatics Lab at WSU. Proteins from the C. sinensis v1.0 assembly were mapped against proteins from other genomes and databases using blastp with an e-value cutoff of 1e-6. Only the best 10 matches were kept. The available files are in Excel 2007 format.
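
The filtering described above (an e-value cutoff of 1e-6, keeping the 10 best matches) can be sketched as follows. This is an illustration only: it assumes BLAST tabular output with the standard 12 columns, where columns 1, 2, 11 and 12 are the query ID, subject ID, e-value and bit score. The ranking rule (lowest e-value first, then highest bit score) is an assumption, as the lab's actual procedure is not described. The function name is hypothetical.

```python
from collections import defaultdict


def top_hits(rows, max_evalue=1e-6, keep=10):
    """Return {query: [subject, ...]} with at most `keep` hits per query.

    `rows` is an iterable of 12-field BLAST tabular records (lists of strings).
    Hits with an e-value above `max_evalue` are discarded; the rest are ranked
    by ascending e-value, then descending bit score (an assumed tie-break).
    """
    by_query = defaultdict(list)
    for row in rows:
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue <= max_evalue:
            by_query[query].append((evalue, -bitscore, subject))
    return {
        query: [subject for _, _, subject in sorted(hits)[:keep]]
        for query, hits in by_query.items()
    }


# Toy records for illustration only (identifiers are made up).
toy_rows = [
    ["q1", "a", "90", "100", "0", "0", "1", "100", "1", "100", "1e-50", "200"],
    ["q1", "b", "40", "100", "0", "0", "1", "100", "1", "100", "1e-3", "40"],
    ["q1", "c", "70", "100", "0", "0", "1", "100", "1", "100", "1e-10", "80"],
]
print(top_hits(toy_rows))  # {'q1': ['a', 'c']}
```

Note that this sketch ranks alignment rows, so a subject with several alignments to the same query can appear more than once.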

Downloads

ExPASy SwissProt Csinensis_v1.0_vs_sprot.xls
Malus x domestica (apple) v1.0 proteins Csinensis_v1.0_vs_apple.xls
TAIR10 (arabidopsis) proteins Csinensis_v1.0_vs_arabidopsis.xls
Prunus persica (peach) v1.0 proteins Csinensis_v1.0_vs_peach.xls
Vitis vinifera (grape) proteins Csinensis_v1.0_vs_grape.xls
Populus trichocarpa (poplar) v2.0 proteins Csinensis_v1.0_vs_poplar.xls

 

Repeats

The following text comes from phytozome.org:

A de novo repeat library was made by running RepeatModeler (Arian Smit, Robert Hubley) on the genome to produce a library of repeat sequences. Sequences with Pfam domains associated with non-TE functions were removed from the library of repeat sequences and the library was then used to mask 31% of the genome with RepeatMasker.
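
A masked fraction such as the 31% quoted above can be estimated from a masked sequence by counting masked positions. The sketch below assumes that masking is represented either as lowercase bases (soft masking) or as N/X characters (hard masking). Which convention the repeat-masked scaffold file uses is not stated here, and the function name is hypothetical.

```python
def masked_fraction(sequence):
    """Fraction of positions that are lowercase (soft-masked) or N/X (hard-masked)."""
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base.islower() or base in "NX")
    return masked / len(sequence)


# Toy sequence for illustration only.
print(masked_fraction("ACGTacgtNN"))  # 0.6
```

With hard masking, N characters that represent scaffold gaps are also counted, so the result may overestimate the repeat content unless gaps are subtracted separately.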

Please note: if you download and use the JGI whole genome assembly and annotation please abide by the requirements for this data as specified on phytozome.org's Citrus sinensis download page.

Downloads

RepeatMasker repeats (GFF3 file, 6.3 Mb compressed) Csinensis_v1.0_repeats.gff3.gz