How to Use ESTs to Determine Intron/Exon Boundaries
- Look at the 5' end of CG33521 in contig 10 of Dgri in the GEP's UCSC browser mirror (ca. position 23100).
- Open CG33521 in the Gene Record Finder. Command-click the "View in GBrowse" link for any of the four isoforms in the "mRNA Details" panel to see the structures of the Dmel isoforms of this gene.
- All isoforms share the same first coding exon; exon 14_1001 (the second exon of isoforms A and B in the Gene Record Finder) matches contig 10 well from positions 23001 to 22912. Note that the coordinates decrease from 5' to 3' because the gene lies on the minus strand of the contig. Examining this region in the UCSC browser shows that 23001 is the right position to begin the exon (just downstream of a splice acceptor AG), and that the exon actually ends at 22909 or 22911 (just upstream of a splice donor GT). Exon 3_1001, the second exon in isoforms C and D, starts at the same place based on homology with Dmel.
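The splice-site check described above can also be sketched in Python. This is only a minimal illustration on a made-up toy contig, not the real Dgri sequence, and the function names are hypothetical. It assumes 1-based contig coordinates and a minus-strand gene, so an exon's start coordinate is larger than its end coordinate.

```python
# Toy sketch: check the two bases flanking each end of a minus-strand exon.
# All sequences and names here are illustrative assumptions, not real data.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def minus_strand_slice(contig, start, end):
    """Read contig positions start..end (1-based, start >= end) on the minus strand."""
    if start < end:
        raise ValueError("minus-strand features need start >= end")
    return revcomp(contig[end - 1:start])


def check_exon_boundaries(contig, start, end):
    """Report the dinucleotides just upstream and just downstream of the exon."""
    acceptor = minus_strand_slice(contig, start + 2, start + 1).upper()
    donor = minus_strand_slice(contig, end - 1, end - 2).upper()
    return {
        "acceptor": acceptor,
        "acceptor_ok": acceptor == "AG",
        "donor": donor,
        "donor_ok": donor == "GT",
    }


# Build a toy contig whose minus strand reads: intron...AG | exon | GT...intron
sense = "CCCTTTAG" + "ATGGCCAAA" + "GTAAGTCC"
toy_contig = revcomp(sense)

# On this 25-base toy contig the exon occupies contig positions 17 down to 9.
print(minus_strand_slice(toy_contig, 17, 9))
print(check_exon_boundaries(toy_contig, 17, 9))
```

With a real contig you would load the FASTA sequence into `toy_contig` and pass in the candidate exon coordinates; the browser remains the better tool for actually looking at the reading frames.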
- So far so good, but what about the first exon, #13_1001 according to the GRF? Its sequence in Dmel is
It's 39 amino acid residues long, but a tBLASTn search against the entire contig only shows homology for the last 9, from positions 23092 to 23066. Cranking up the E-value threshold to 1,000,000 pulls out a few more, even weaker, hits, but these don't extend the homologous region at the 3' end.
- Between the strong homology and the adjacent splice donor GT, the 3' end position is almost certainly correct, but what about the 5' end? The tBLASTn search says only 9 amino acids are homologous, whereas the BLAST homology results in the UCSC browser show the homology extending to position 23125. Either way, there is no initiating Met in frame between here and the first upstream stop codon at 23372-23374.
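The search for an in-frame Met can be sketched as a codon-by-codon walk upstream from a codon known to be in frame, stopping at the first stop codon. This is a minimal sketch on toy sequences; the function name and inputs are hypothetical, and it assumes the sequence has already been converted to the sense (mRNA-like) strand.

```python
# Toy sketch: walk upstream in frame from an anchor codon until a stop codon,
# collecting any ATG codons seen on the way. Names and data are illustrative.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def upstream_in_frame_starts(sense, anchor):
    """Return (list of ATG indices, index of the first upstream stop or None).

    sense  -- sense-strand DNA, 5' to 3'
    anchor -- 0-based index of the first base of a codon known to be in frame
    """
    starts = []
    pos = anchor - 3
    while pos >= 0:
        codon = sense[pos:pos + 3].upper()
        if codon in STOP_CODONS:
            return starts, pos
        if codon == "ATG":
            starts.append(pos)
        pos -= 3
    return starts, None


# Case 1: an ATG lies between the upstream stop and the anchor codon (GAA).
with_met = "TAA" + "GCT" + "ATG" + "GCC" + "GAA" + "TAC"
print(upstream_in_frame_starts(with_met, 12))

# Case 2: no ATG before the upstream stop, as in the situation described above.
without_met = "TGA" + "GCT" + "GTG" + "GAA"
print(upstream_in_frame_starts(without_met, 9))
```

An empty list together with a stop position corresponds to the case in the text: the open frame runs back to a stop codon without ever passing a Met.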
- Perhaps there is a non-canonical translation start site for this gene, such as the Val at positions 23135-23137. This is extremely rare, and there's no way to confirm this without wet lab experimentation anyway, so let's ignore this possibility.
- Perhaps there was a sequencing error which shifted frames and the Met at positions 23131-23133 is the translation start site. This is unlikely given the quality of the Dgri sequence, but it's still a possibility. Go to the NCBI BLAST page and click on "Search trace archives" in the "Specialized BLAST" section. On the new page, use the drop-down menu under "Choose Search Set" to select "Drosophila grimshawi - WGS" [Whole Genome Sequence] from the alphabetized (but very long) list. Paste the DNA sequence of your contig into the Query Sequence box (or Browse and select it). I'm going to search the database with only the region I'm interested in, from positions 23000 to 23200. Enter these values in the "Query subrange" boxes and BLAST.
- Scroll down to the first alignment (744167572) and note that the segment we're interested in, between positions 23066 and 23133 in the contig, corresponds to positions 434 to 367 in the read. Click on the accession number, and on the resulting page under "Retrieve" change "FastA" to "Trace" and click "Show." Enter "400" in the "Base #" box and click "Go" and you'll see that the read quality is pretty good, so a sequencing error seems unlikely as an explanation.
- Dang! How can we determine the 5' end of this exon, when there isn't any homology with Dmel to go by? Enter the EST database. As before, go to the NCBI BLAST page and click on "Search trace archives" in the "Specialized BLAST" section. On the new page, use the drop-down menu under "Choose Search Set" to select "Drosophila grimshawi - EST" (NOT the WGS item this time). Again paste the DNA sequence of your contig into the Query Sequence box (or Browse and select it) and BLAST away.
- The search results show all of the homologous regions between the contig and RNAs that have been sequenced from D. grimshawi. You can click on the homolog you're interested in to jump down the page to it, but for clarity I'm going to go back and search the database with only the region I'm interested in, from positions 22,000 to 25,000. Enter these values in the "Query subrange" boxes and re-BLAST.
- There are three regions of homology, running from positions 6 to 91, 90 to 123, and 122 to 416 for the Subject (the RNA), corresponding to positions 24231 to 24145, 23465 to 23432, and 23358 to 23066 for the contig. This tells us that there are 3 exons in this region, which are spliced together so that they're adjacent in the Dgri mRNA that was sequenced.
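As a rough bookkeeping aid, the alignment coordinates just listed can be turned into the raw gap on the contig between neighboring hits and the number of EST bases claimed by both hits. This is a sketch with hypothetical names; the coordinates are the ones quoted above, and it assumes the hits are listed in 5' to 3' order on a minus-strand gene.

```python
# Sketch: summarize the junctions between adjacent EST-vs-contig alignments.
# Tuples are (subject_start, subject_end, contig_start, contig_end), taken
# from the three regions of homology quoted in the text.

hsps = [
    (6, 91, 24231, 24145),
    (90, 123, 23465, 23432),
    (122, 416, 23358, 23066),
]


def junctions(alignments):
    """For each pair of neighboring alignments, report overlap and gap sizes."""
    summary = []
    for left, right in zip(alignments, alignments[1:]):
        _, left_s_end, _, left_c_end = left
        right_s_start, _, right_c_start, _ = right
        summary.append({
            # EST bases that appear in both alignments
            "est_overlap": left_s_end - right_s_start + 1,
            # contig bases lying between the two alignments (minus strand,
            # so contig coordinates decrease from one hit to the next)
            "contig_gap": left_c_end - right_c_start - 1,
        })
    return summary


for junction in junctions(hsps):
    print(junction)
```

A non-zero EST overlap means BLAST has extended one or both alignments a base or two past the true junction by chance, so the raw gap is only a starting point. The exact exon ends still have to be settled by looking for the splice donor and acceptor in the browser.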
- Let's annotate the most 3' one first, running from positions 23358 to 23066 in the contig. Examining the sequence at 23358 shows us that there's a splice acceptor (AG) at 23357-58, so the exon begins at 23356. The next runs from 23463 to 23432, and the next from 24236-ish (no splice acceptor nearby, but this is the transcription start so that's not surprising) to 24145.
- Splicing these three exons together in silico yields an open reading frame in the correct (for this gene) direction, ending with the short homology to the melanogaster exon 3' end (EYDNQNISV vs EYDNQNINM). The entire open reading frame for the grimshawi protein in these three exons is 102 residues long:
- BLASTing this polypeptide sequence against the Dmel protein database shows only some weak homologies (e = 1.3 at best) so the N-terminal region of the Dgri protein is not very similar to any Dmel proteins. BLASTing the spliced 3-exon DNA sequence against the Dgri EST database yields a single, very strong (e = 4e-179) hit, running in the opposite direction, as expected.
- So it looks like in Dmel all 4 isoforms of CG33521 have an initial 5' exon that is entirely UTR and a second exon coding for the initial 39-residue portion of the protein product with different numbers of untranslated nucleotides at the 5' end depending on the isoform.
- In contrast, Dgri has split that 2nd exon with a 76-nucleotide intron near its upstream end and has a much longer N-terminal sequence relative to Dmel.
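The in silico splicing and translation step used in this walkthrough can be sketched in Python as follows. This is a minimal illustration on a made-up toy contig with three small exons, not the real Dgri sequence; the function names are hypothetical, the standard genetic code is assumed, and exon coordinates are assumed to be 1-based minus-strand pairs (start >= end) listed in 5' to 3' order.

```python
# Toy sketch: splice minus-strand exons together and translate the result.
# Sequences, coordinates, and names are illustrative assumptions only.

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def splice_minus_strand(contig, exons):
    """Join exons given as (start, end) 1-based pairs with start >= end."""
    return "".join(revcomp(contig[end - 1:start]) for start, end in exons)


def translate(mrna):
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE.get(mrna[i:i + 3].upper(), "X")  # X = unknown
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


# Toy gene on the minus strand: exon1 | intron | exon2 | intron | exon3
sense = "ATGGC" + "GTAAGTTTTCAG" + "TGAATAC" + "GTAAAAAG" + "GACTAA"
toy_contig = revcomp(sense)
toy_exons = [(38, 34), (21, 15), (6, 1)]

spliced = splice_minus_strand(toy_contig, toy_exons)
print(spliced)
print(translate(spliced))
```

Note that exon 2 of the toy gene contains an in-frame-looking TGA only if the introns are left in or the exon ends are off by a base, which is a quick way to catch a wrong boundary: an unexpected early stop in the translation usually means one of the coordinates needs another look.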