# Find a Splice Site in a DNA Sequence

A DNA sequence consists of base letters A, C, G, and T. Suppose there is a sequence that begins in an exon, contains a splice site, and ends in an intron. If the exons have a uniform base composition, the introns are deficient in C and G, and the splice site consensus nucleotide is G with probability 0.95, the frequency distributions are as follows.

 In[1]:= exon = {0.25, 0.25, 0.25, 0.25}; intron = {0.4, 0.1, 0.1, 0.4}; splice = {0.05, 0, 0.95, 0};
 In[2]:= dnaSeq = Characters["CTTCATGTGAAAGCAGACGTAAGTCA"] /. {"A" -> 1, "C" -> 2, "G" -> 3, "T" -> 4};

The state machine has states for exon (1), splice (2), intron (3), and end (4), with the following transition probabilities between states.

 In[3]:= tm = {{0.9, 0.1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0.9, 0.1}, {0, 0, 0, 1}};
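A quick sanity check, offered as a sketch rather than part of the original example: each row of a transition matrix should sum to 1, because a state must move somewhere (possibly to itself) at every step. The block repeats the definition of `tm` so that it runs on its own.

```mathematica
(* Transition matrix repeated from above so this block is self-contained *)
tm = {{0.9, 0.1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0.9, 0.1}, {0, 0, 0, 1}};

(* Sum each row; every total should equal 1 for a valid transition matrix *)
Total /@ tm
```

The last row, `{0, 0, 0, 1}`, makes the end state absorbing: once the chain reaches state 4 it stays there.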

The emissions are nucleotides A (1), C (2), G (3), T (4), or end (5).

 In[4]:= em = PadRight[{exon, splice, intron, UnitVector[5, 5]}];
 In[5]:= hmm = HiddenMarkovProcess[1, tm, em];
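To see what `PadRight` contributes here, the following sketch inspects the emission matrix. The three nucleotide distributions have four entries each, while `UnitVector[5, 5]` has five, so `PadRight` fills the shorter rows with a trailing 0 (the probability of emitting the end symbol). The definitions are repeated so the block is self-contained.

```mathematica
(* Emission distributions repeated from above *)
exon = {0.25, 0.25, 0.25, 0.25};
intron = {0.4, 0.1, 0.1, 0.4};
splice = {0.05, 0, 0.95, 0};

(* Pad the ragged list of rows with zeros to obtain a full matrix *)
em = PadRight[{exon, splice, intron, UnitVector[5, 5]}];

(* One row per state, one column per emission symbol *)
Dimensions[em]

(* Each row is a probability distribution, so each total should equal 1 *)
Total /@ em
```

Rows follow the state order (exon, splice, intron, end) and columns follow the emission order (A, C, G, T, end).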

Find the most probable sequence of hidden states (exon, splice, intron, or end) for the nucleotide sequence.

 In[6]:= sites = FindHiddenMarkovStates[Append[dnaSeq, 5], hmm]
 Out[6]=
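One way to read the decoded states is to locate the position labeled as the splice state and to pair every symbol with its state name. The sketch below assumes that `FindHiddenMarkovStates` returns a plain list of state indices, as the later use of `sites` suggests. The names `seq`, `splicePos`, and `labels` are introduced here for illustration and are not part of the original example.

```mathematica
(* Definitions repeated from the steps above so the block runs on its own *)
exon = {0.25, 0.25, 0.25, 0.25};
intron = {0.4, 0.1, 0.1, 0.4};
splice = {0.05, 0, 0.95, 0};
seq = "CTTCATGTGAAAGCAGACGTAAGTCA";  (* hypothetical helper name *)
dnaSeq = Characters[seq] /. {"A" -> 1, "C" -> 2, "G" -> 3, "T" -> 4};
tm = {{0.9, 0.1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0.9, 0.1}, {0, 0, 0, 1}};
em = PadRight[{exon, splice, intron, UnitVector[5, 5]}];
hmm = HiddenMarkovProcess[1, tm, em];
sites = FindHiddenMarkovStates[Append[dnaSeq, 5], hmm];

(* Positions in the sequence that were decoded as the splice state (2) *)
splicePos = Flatten[Position[sites, 2]]

(* Pair each emitted symbol with the name of its decoded state *)
labels = sites /. {1 -> "exon", 2 -> "splice", 3 -> "intron", 4 -> "end"};
Transpose[{Append[Characters[seq], "end"], labels}]
```

Because the transition matrix forces exactly one pass through state 2 on the way from exon to intron, `splicePos` is expected to contain a single position.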

Find the joint probability of the preceding state sequence and the DNA sequence.

 In[7]:= Fold[Times, Likelihood[DiscreteMarkovProcess[hmm], sites], MapThread[Part[em, #1, #2] &, {sites, Append[dnaSeq, 5]}]]
 Out[7]=
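The product computed above can be written out as a formula. The notation is introduced here for illustration: $\pi_t$ is the decoded state at position $t$ (the entries of `sites`), $x_t$ is the emitted symbol including the final end symbol, $a_{ij}$ is an entry of `tm`, $e_i(k)$ is an entry of `em`, and $n$ is the length of the extended sequence. It assumes the chain starts in state 1 with probability 1, as specified by the first argument of `HiddenMarkovProcess`.

```latex
P(x, \pi) \;=\;
  \underbrace{\prod_{t=1}^{n-1} a_{\pi_t \pi_{t+1}}}_{\text{state path}}
  \;\times\;
  \underbrace{\prod_{t=1}^{n} e_{\pi_t}(x_t)}_{\text{emissions}},
\qquad \pi_1 = 1 .
```

On this reading, the `Likelihood` term supplies the state-path factor, and `MapThread` supplies the list of emission probabilities that `Fold` multiplies in one at a time.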
