The John Innes Centre, in collaboration with JCVI and Cogenics, has developed a Brassica EST microarray intended to be an open community resource. A gene expression service using the Agilent platform is now available from Cogenics on commercial terms, but the EST assembly data are freely available so that investigators can implement the array on other platforms of their choice.
The Brassica EST assemblies used to populate the array were generated by our collaborators, Foo Cheung and Chris Town at JCVI. The assemblies were computed from a total of 810,254 raw Brassica ESTs publicly available at the time of the experiment (August 2007), using an overlap identity criterion of 95%. After the ESTs were cleaned and sequences shorter than 100 bp were removed, 803,326 sequences remained. The resulting clustered assembly set comprised 94,558 sequences, of which 42,642 are assemblies (mean length 874 nt) and 51,916 are singletons (mean length 535 nt). Of these, 3,694 sequences (330 assemblies and 3,364 singletons) are represented in both orientations because no significant UniProt hit or other strand-orienting information was recovered for them. Of the 94,558 sequences, 72,148 have a significant (p ≤ 1.0E-5) BLASTX UniProt hit associated with them. This non-redundant representation of a pan-Brassica transcriptome is ~63 Mb in length. The new assembly identifiers are prefixed with JCVI (e.g. JCVI_992), whereas the singletons retain their original GenBank/EMBL identifiers (e.g. AM061211).
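As a quick consistency check, the counts quoted above can be tallied in a few lines of Python. This is only an illustrative sketch: the variable names are our own, and the figures are copied directly from the paragraph above.

```python
# Figures quoted in the text above (variable names are illustrative only).
total_sequences = 94_558
assemblies, singletons = 42_642, 51_916
both_orientations = 3_694
both_assemblies, both_singletons = 330, 3_364
with_uniprot_hit = 72_148

# Assemblies and singletons should add up to the full clustered set.
assert assemblies + singletons == total_sequences

# The sequences kept in both orientations split into the two classes.
assert both_assemblies + both_singletons == both_orientations

# Proportion of sequences with a significant BLASTX UniProt hit.
hit_fraction = with_uniprot_hit / total_sequences
print(f"Sequences with a UniProt hit: {hit_fraction:.1%}")
```

Roughly three quarters of the sequences on the array therefore carry a UniProt annotation.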
On the Brassica Gateway page, the EST array search box can be used to search our database. The user can supply an EST identifier (this searches both the EST identifier and Assembly components fields), a UniProt identifier (e.g. Q9FVM1), a descriptive term (e.g. 'kinase'), a Gene Name (e.g. DIN1) or the organism source (e.g. Arabidopsis thaliana) for a given UniProt hit. Results are returned in HTML format, together with a link to a plain text version. We also provide a facility for running BLAST searches against the 95k EST assemblies. To download a copy of the 95k EST array assemblies (FASTA format), please go to our FTP site.
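Once the FASTA file has been downloaded, it can be summarised with a short script. The following is a minimal Python sketch under stated assumptions: the function names are hypothetical, the header layout of the real file is not described here (we assume the identifier is the first token after '>'), and assemblies are recognised by a "JCVI_" prefix as in the example identifier JCVI_992. The sequences in the demonstration are made-up placeholders, not real array data.

```python
import io
from statistics import mean
from typing import Dict, Iterator, List, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from a FASTA-formatted text stream."""
    seq_id = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Assumption: the identifier is the first token after '>'.
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def summarise(handle: TextIO) -> Dict[str, Tuple[int, float]]:
    """Return {class: (count, mean length)} for assemblies and singletons."""
    lengths: Dict[str, List[int]] = {"assembly": [], "singleton": []}
    for seq_id, seq in read_fasta(handle):
        # Assumption: assembly identifiers start with "JCVI_"; anything else
        # is treated as a singleton with its original GenBank/EMBL identifier.
        kind = "assembly" if seq_id.startswith("JCVI_") else "singleton"
        lengths[kind].append(len(seq))
    return {k: (len(v), mean(v) if v else 0.0) for k, v in lengths.items()}


# Demonstration on an in-memory example with placeholder sequences.
example = io.StringIO(
    ">JCVI_992 placeholder description\nACGTACGT\nACGT\n>AM061211\nACGTAC\n"
)
summary = summarise(example)
print(summary)
```

To run this against the downloaded file, replace the in-memory example with an opened file handle, e.g. `with open(path) as fh: summarise(fh)`, where `path` is wherever the file was saved locally.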
Chart A displays the principal ontological groups associated with the array, based on the 72,148 sequences that had a protein hit against UniProt; the largest group, [unclassified], contains the sequences with no associated gene ontologies. Chart B displays the species to which the protein hits correspond; as expected, the largest share of hits (77%) comes from Arabidopsis thaliana.