The UK Crop Plant Bioinformatics Network

BrassicaDB Process

EMBL Release 62


The following notes are specific to EMBL release 62, and deal with verifying the results of the nucleotide process against the contents of BrassicaDB.


Before we commit the data to the database, we check it against the current contents of BrassicaDB, in case any sequences have been lost.

Read these ace files into a scratch database and export a list of the sequence objects. Then count the number of differences (including sub-sequences):

% diff brassdb_seq.ace temp_seq.ace | grep -v "_" | grep -c "> "
1906
% diff brassdb_seq.ace temp_seq.ace | grep -v "_" | grep -c "< "
92
% diff brassdb_dna.ace temp_dna.ace | grep -c "> "
1907
% diff brassdb_dna.ace temp_dna.ace | grep -c "< "
10
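
The counts above say how many objects differ, but not which ones. A minimal sketch for listing the names that appear only in the BrassicaDB export is given below. It assumes each exported file holds one object per line; the function name lost_objects is hypothetical and not part of the original process.

```shell
# List lines that appear only in the first file, i.e. objects present in
# BrassicaDB but missing from the scratch database.
# Assumption: each file holds one object per line.
lost_objects() {
    old_sorted=$(mktemp)
    new_sorted=$(mktemp)
    # comm requires sorted input; -u also drops duplicate lines
    sort -u "$1" > "$old_sorted"
    sort -u "$2" > "$new_sorted"
    # -2 suppresses lines unique to the second file, -3 suppresses common lines
    comm -23 "$old_sorted" "$new_sorted"
    rm -f "$old_sorted" "$new_sorted"
}
```

For example, lost_objects brassdb_dna.ace temp_dna.ace would print the candidates to look up in SRS, one per line.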

Looking in SRS to check for these sequences gives the following results:

EM:H07586 ?
EM:H07609 ?
EM:H07709 ?
EM:H07731 ?
EM:H07752 ?
EM:H07850 ?
EM:H07853 ?
EM:L34288 -> AF052241
EM:S71338 ?
EM:X83692 -> Brassica napus (tournefortii)

Interestingly, only two of these exist in the current version of EMBL (release 62). However, for the others I do have the original EMBL records (release 44), so I can reparse these and read them into the database. The disappearance of these records is rather strange, so we inform the EMBL curators and ask for an explanation.

The following table shows the number of objects in the files generated by the parsing process.

                              Paper file   Sequence file
Brassica_ESTs_EMBL_r62             3,799          10,824
Brassica_EMBL_r62                  1,623           4,287
Extra sequences (EMBL r44)             3              30
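
Per-file object counts of this kind can be cross-checked from the shell. The sketch below is one way to do it, not necessarily how the figures above were produced. It assumes that objects in an .ace file are separated by blank lines and that each object starts with its class name followed by whitespace; the function name count_ace_objects is hypothetical.

```shell
# Count the objects of each class in an .ace file.
# Assumptions: objects are separated by blank lines, and the first word
# of each object is its class name (e.g.  Sequence : "EM:X83692").
count_ace_objects() {
    awk 'BEGIN { RS = "" }       # paragraph mode: one record per object
         { count[$1]++ }         # $1 is the first word, taken as the class
         END { for (cls in count) print cls, count[cls] }' "$1" | sort
}
```

Running count_ace_objects on one of the generated files prints one line per class with its object count.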

The next table shows the results of asking each database for the number of objects in each class. The scratch database contains the data as parsed from EMBL release 62.

            Scratch database   BrassicaDB
DNA         4683               2777
Paper       621                668
Protein     140                580
Sequence    4683 (5795)        2777 (5057)

As you can see, there is a large difference in the number of sequences. Most of this difference will be due to the introduction of Brassica rapa sequences into this build of BrassicaDB. However, this increase is not paralleled by the introduction of a large number of new sequence objects. This appears to be caused by a decrease in the number of sub-sequences.

            Scratch database   BrassicaDB
CDS         858                865
misc_RNA    10                 10
mRNA        213                1,291
rRNA        22                 22
tRNA        9                  9

Next, check whether any sequences are still missing:

% diff brassdb_dna.ace temp_dna.ace | grep -c "< "
1
% diff brassdb_dna.ace temp_dna.ace | grep -c "> "
1907

Doing the same thing for the sub-sequences:

% diff brassdb_sub_seq.ace temp_sub_seq.ace | grep -c "< "
1090
% diff brassdb_sub_seq.ace temp_sub_seq.ace | grep -c "> "
5

The difference in the sub-sequences seems to be accounted for by a change in EST annotation.


Once we have all the sequence data (both nucleotide and peptide), it is time to consider how this will be read into the database. The options are either to read it in over the current data, or to remove the current data and reload the classes.

In this case, large changes have occurred in the way the data is annotated (and we have made a few model modifications), so it was decided to remove the old sequence data before loading the new.