Perform Typo Correction without a Dictionary 

Modeling the sequence of typed characters as observations and the sequence of the correct characters as hidden states, this example uses two different models for the evolution of correct letters to correct for typos. The first model uses the first-order Markov process to encode frequencies of what follows the given character. The second model uses the second-order Markov states to encode frequencies of what follows the given last two characters.

Take Darwins The Origin of Species and convert all non-letter symbols to spaces.

In[1]:=
Click for copyable input
X
In[2]:=
Click for copyable input
X

Introduce typos, at a 20% rate, by randomly replacing a character with one of its immediate neighbors on the standard QWERTY keyboard. The space character is never mistyped.

show complete Wolfram Language input
In[4]:=
Click for copyable input
X
In[5]:=
Click for copyable input
X
show complete Wolfram Language input
Out[6]=

Take the first 80,000 characters to be a test sequence, and the remainder of the text as a training sequence. The test sequence has 16.5% mistyped characters, less than 20% because the space character was never mistyped.

In[7]:=
Click for copyable input
X
Out[7]=

Model the stream of typed characters using hidden Markov process. Correct characters are hidden states, while the actual typed characters are observations.

In[8]:=
Click for copyable input
X

Estimate the hidden Markov process using training data.

In[9]:=
Click for copyable input
X
Out[9]=
In[10]:=
Click for copyable input
X

Use posterior decoding to correct typos in the test portion of the mistyped text.

In[11]:=
Click for copyable input
X
Out[11]=

Corrected text still has 11% typos, 5.5% fewer than the original typed text.

show complete Wolfram Language input
Out[12]=
In[13]:=
Click for copyable input
X
Out[13]=

Model evolution of the hidden states using a second-order Markov process; i.e. assume the probability of the character depends only on the two preceding characters. The second-order Markov process is modeled as a first-order Markov process with enlarged state space of pairs of consecutive characters.

In[14]:=
Click for copyable input
X
In[15]:=
Click for copyable input
X

The probability of the actual typed character depends only on the underlying true character.

In[16]:=
Click for copyable input
X

Repeat posterior decoding with the higher-order hidden Markov model.

In[17]:=
Click for copyable input
X
Out[17]=

The percentage of typos has now been reduced to under 6%.

In[18]:=
Click for copyable input
X
Out[18]=
show complete Wolfram Language input
Out[19]=
de es ja pt-br zh