Coding TypeGym: A Rust Terminal Typing Trainer | Part 5: Markov Chain Text

Published on

In Part 5 we extend TypeGym’s content system with a new option: Markov Chain text generation.

Instead of picking random words from a list, we build a small statistical model from a text file (such as a book) and generate new practice text that resembles the original source but isn’t a direct copy.

This is a perfect fit for a typing trainer: the text stays varied but still feels “language-like”.


What We Covered


Design Insights

Markov Chains are “procedural text” without the AI baggage

A Markov chain is simple:

In our implementation we use bigrams (2-word keys), which is the sweet spot for a tutorial:

Sentence splitting matters more than you’d think

If you blindly build transitions across punctuation, the chain drifts into nonsense faster and your starters end up low quality.

By splitting into sentences first we:


Implementation Highlights

1) Data structures: starters + transitions

We represent each state as a pair of words:

use std::collections::HashMap;

type Key = (String, String);

pub struct MarkovChain {
    pub transitions: HashMap<Key, Vec<String>>,
    pub starters: Vec<Key>,
}

We deliberately store a Vec of choices rather than a frequency map, because repeated values naturally encode weighting.
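To make that weighting concrete, here is a minimal std-only sketch (the words and counts are illustrative, not from the project):

```rust
use std::collections::HashMap;

type Key = (String, String);

// Build a tiny transition table where "brown" is inserted twice and "red"
// once: a uniform pick over the Vec then yields "brown" 2/3 of the time,
// so duplicates act as the weights.
fn demo_transitions() -> HashMap<Key, Vec<String>> {
    let mut transitions: HashMap<Key, Vec<String>> = HashMap::new();
    let key = ("the".to_string(), "quick".to_string());
    for next in ["brown", "brown", "red"] {
        transitions.entry(key.clone()).or_default().push(next.to_string());
    }
    transitions
}

fn main() {
    let transitions = demo_transitions();
    let choices = &transitions[&("the".to_string(), "quick".to_string())];
    assert_eq!(choices.len(), 3);
    assert_eq!(choices.iter().filter(|w| w.as_str() == "brown").count(), 2);
}
```

A frequency map would need an extra weighted-sampling step at generation time; with a plain Vec, a uniform random index already respects the observed frequencies.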

2) Building the chain from word triplets

For each sentence:

  1. split into words
  2. clean up punctuation and odd quotes
  3. keep only non-empty words
  4. push the first two words as a starter
  5. slide a 3-word window and insert transitions

Conceptually:
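A minimal sketch of that build loop, assuming the MarkovChain type from above (add_sentence and the cleanup closure are illustrative, not the project's exact code):

```rust
use std::collections::HashMap;

type Key = (String, String);

#[derive(Default)]
struct MarkovChain {
    transitions: HashMap<Key, Vec<String>>,
    starters: Vec<Key>,
}

// Illustrative build step: one sentence in, starters + transitions out.
fn add_sentence(chain: &mut MarkovChain, sentence: &str) {
    let words: Vec<String> = sentence
        .split_whitespace()
        // crude cleanup stand-in: strip non-alphanumeric edges
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_string())
        .filter(|w| !w.is_empty())
        .collect();

    if words.len() < 3 {
        return; // need at least one full (prev, cur) -> next triple
    }

    // First two words seed a starter state.
    chain.starters.push((words[0].clone(), words[1].clone()));

    // Slide a 3-word window and record (prev, cur) -> next.
    for window in words.windows(3) {
        let key = (window[0].clone(), window[1].clone());
        chain.transitions.entry(key).or_default().push(window[2].clone());
    }
}

fn main() {
    let mut chain = MarkovChain::default();
    add_sentence(&mut chain, "the quick brown fox jumps");
    assert_eq!(chain.starters.len(), 1);
    assert_eq!(
        chain.transitions[&("the".to_string(), "quick".to_string())],
        vec!["brown".to_string()]
    );
}
```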

3) Generating text by walking transitions

Generation is:

  1. pick a random starter (prev, cur)
  2. repeatedly sample next from transitions[(prev, cur)]
  3. shift the window (prev = cur, cur = next)
  4. stop when there is no transition
  5. repeat until we have enough words
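The walk above can be sketched like this; the `pick` closure stands in for a real random source (the project would use something like the rand crate), and all names here are illustrative:

```rust
use std::collections::HashMap;

type Key = (String, String);

struct MarkovChain {
    transitions: HashMap<Key, Vec<String>>,
    starters: Vec<Key>,
}

// Walks the chain until `target` words are collected. `pick(n)` returns an
// index in 0..n and stands in for real randomness.
fn generate(
    chain: &MarkovChain,
    target: usize,
    mut pick: impl FnMut(usize) -> usize,
) -> Vec<String> {
    let mut out = Vec::new();
    while out.len() < target {
        // 1. pick a starter (prev, cur)
        let (mut prev, mut cur) = chain.starters[pick(chain.starters.len())].clone();
        out.push(prev.clone());
        out.push(cur.clone());
        // 2-4. sample next, shift the window, stop at a dead end
        while out.len() < target {
            match chain.transitions.get(&(prev.clone(), cur.clone())) {
                Some(nexts) => {
                    let next = nexts[pick(nexts.len())].clone();
                    out.push(next.clone());
                    prev = cur;
                    cur = next;
                }
                None => break, // no transition: restart from a fresh starter
            }
        }
    }
    out.truncate(target);
    out
}

fn main() {
    let mut transitions = HashMap::new();
    transitions.insert(("a".to_string(), "b".to_string()), vec!["c".to_string()]);
    let chain = MarkovChain {
        transitions,
        starters: vec![("a".to_string(), "b".to_string())],
    };
    // Deterministic "always pick first" stand-in for randomness.
    let words = generate(&chain, 5, |_| 0);
    assert_eq!(words, vec!["a", "b", "c", "a", "b"]);
}
```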

Then we format it for terminal typing by chunking:

This keeps the output readable and consistent for the UI.
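A chunking helper might look like this (the words-per-line value and the function name are illustrative assumptions, not the project's actual code):

```rust
// Hypothetical chunking helper: join generated words into fixed-size lines
// so the terminal UI gets consistent, readable rows.
fn chunk_words(words: &[String], per_line: usize) -> Vec<String> {
    words
        .chunks(per_line)
        .map(|chunk| chunk.join(" "))
        .collect()
}

fn main() {
    let words: Vec<String> = ["one", "two", "three", "four", "five"]
        .iter()
        .map(|w| w.to_string())
        .collect();
    let lines = chunk_words(&words, 2);
    assert_eq!(lines, vec!["one two", "three four", "five"]);
}
```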

4) Cleaning up quotes/dashes for terminal output

We convert a few common Unicode punctuation characters into ASCII equivalents:

Then we trim common wrappers (" ' _ () []) around tokens.

This prevents “invisible weird Unicode punctuation” from showing up in practice text and keeps the experience consistent across terminals.
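Both steps can be sketched as follows; the exact character sets are assumptions, not necessarily the project's full list:

```rust
// Illustrative cleanup: map a few common Unicode punctuation marks to ASCII,
// then trim wrapper characters from a token.
fn normalize(text: &str) -> String {
    text.chars()
        .map(|c| match c {
            '\u{2018}' | '\u{2019}' => '\'', // curly single quotes
            '\u{201C}' | '\u{201D}' => '"',  // curly double quotes
            '\u{2013}' | '\u{2014}' => '-',  // en/em dashes
            other => other,
        })
        .collect()
}

fn trim_wrappers(token: &str) -> &str {
    token.trim_matches(|c: char| matches!(c, '"' | '\'' | '_' | '(' | ')' | '[' | ']'))
}

fn main() {
    assert_eq!(normalize("\u{201C}hello\u{201D}"), "\"hello\"");
    assert_eq!(trim_wrappers("\"hello\""), "hello");
    assert_eq!(trim_wrappers("(word)"), "word");
}
```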

5) Wiring it into TextSource

Finally, we connect it all to the existing text pipeline:

TextSource::MarkovChain(ref path) => generate_markov_chain(path),

Where generate_markov_chain reads the input file, builds the chain, and returns the generated text.
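A compact, self-contained sketch of that function's shape; for determinism it always takes the first successor (the real generator samples starters and successors at random), and the word cap and error handling are illustrative assumptions:

```rust
use std::collections::HashMap;
use std::fs;
use std::path::Path;

// Read the input file, build bigram transitions, walk them, join the words.
fn generate_markov_chain(path: &Path) -> String {
    let source = fs::read_to_string(path).unwrap_or_default();
    let words: Vec<&str> = source.split_whitespace().collect();
    if words.len() < 3 {
        return source.trim().to_string(); // too short to build a chain
    }

    let mut transitions: HashMap<(&str, &str), Vec<&str>> = HashMap::new();
    for w in words.windows(3) {
        transitions.entry((w[0], w[1])).or_default().push(w[2]);
    }

    // Deterministic walk from the first starter.
    let (mut prev, mut cur) = (words[0], words[1]);
    let mut out = vec![prev.to_string(), cur.to_string()];
    while let Some(nexts) = transitions.get(&(prev, cur)) {
        let next = nexts[0];
        out.push(next.to_string());
        prev = cur;
        cur = next;
        if out.len() >= 8 {
            break; // illustrative cap on output length
        }
    }
    out.join(" ")
}

fn main() {
    let path = std::env::temp_dir().join("typegym_markov_demo.txt");
    fs::write(&path, "the quick brown fox").unwrap();
    assert_eq!(generate_markov_chain(&path), "the quick brown fox");
    fs::remove_file(&path).ok();
}
```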


External Resources


What’s Next?

Now that TypeGym can generate natural-ish text, we have just one more step to implement.


Project Code

You will find the complete source code here: typegym