In Part 5 we extend TypeGym’s content system with a new option: Markov Chain text generation.
Instead of picking random words from a list, we build a small statistical model from a text file (e.g. a book) and generate new practice text that resembles the original source but isn’t a direct copy.
This is a perfect fit for a typing trainer: the text stays varied but still feels “language-like”.
What We Covered
- Added a new `TextSource::MarkovChain(path)` variant
- Implemented a `markov` module with a `MarkovChain` struct:
  - a transition table: `(word1, word2) -> [next_words...]`
  - a list of valid starter pairs
- Built the chain by splitting the text into sentences and extracting word triplets
- Generated new text by walking transitions, chunking output into terminal-friendly lines
- Added a lightweight cleanup step so quotes/dashes behave nicely in ASCII terminals
- Wired everything into `get_text()` so the app can generate Markov text each session
Design Insights
Markov Chains are “procedural text” without the AI baggage
A Markov chain is simple:
- look at the previous N words
- randomly pick the next word based on what you saw in the training data
In our implementation we use bigrams (2-word keys), which is the sweet spot for a tutorial:
- still easy to implement and explain
- produces surprisingly coherent local structure
- doesn’t require any external ML libraries
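As a toy illustration (names and corpus are made up, this is not TypeGym code), here is what a bigram table looks like for a two-sentence corpus:

```rust
use std::collections::HashMap;

fn main() {
    // Toy bigram table for the corpus "the cat sat. the cat ran."
    let mut transitions: HashMap<(&str, &str), Vec<&str>> = HashMap::new();
    transitions.entry(("the", "cat")).or_default().push("sat");
    transitions.entry(("the", "cat")).or_default().push("ran");

    // After "the cat", the next word is a uniform pick over ["sat", "ran"].
    // If "sat" had appeared twice in training, it would appear twice in
    // the Vec and be picked twice as often.
    println!("{:?}", transitions[&("the", "cat")]); // ["sat", "ran"]
}
```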
Sentence splitting matters more than you’d think
If you build transitions across punctuation blindly, the chain tends to drift into nonsense faster and your starters become low quality.
By splitting into sentences first we:
- get better starter pairs
- avoid weird cross-sentence transitions
- make the generated output feel more like real phrases
Implementation Highlights
1) Data structures: starters + transitions
We represent each state as a pair of words:
```rust
type Key = (String, String);
```

- `starters` stores the first two words of each training sentence
- `transitions` stores all possible “next word” choices for a given `(prev, current)` pair
We deliberately store a `Vec` of choices, not a frequency map, because repeated values naturally encode weighting.
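A minimal sketch of the struct built around that key (field names follow the bullets above; the exact definitions in the repo may differ):

```rust
use std::collections::HashMap;

type Key = (String, String);

struct MarkovChain {
    /// First two words of each training sentence, used to seed generation.
    starters: Vec<Key>,
    /// Every "next word" observed after a (prev, current) pair.
    /// Duplicates are kept on purpose: a uniform random pick over the
    /// Vec is then automatically frequency-weighted.
    transitions: HashMap<Key, Vec<String>>,
}
```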
2) Building the chain from word triplets
For each sentence:
- split into words
- clean up punctuation / weird quotes
- keep only non-empty words
- push the first two as a starter
- slide a window of 3 words and insert transitions
Conceptually:
- `(w0, w1) -> w2`
- `(w1, w2) -> w3`
- `(w2, w3) -> w4`
- …and so on.
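A sketch of that build step, continuing the struct above (the sentence splitting and the inline cleanup here are simplified assumptions; the fuller cleanup is covered in section 4):

```rust
impl MarkovChain {
    fn build(text: &str) -> Self {
        let mut chain = MarkovChain {
            starters: Vec::new(),
            transitions: HashMap::new(),
        };
        // Naive sentence split on terminal punctuation.
        for sentence in text.split(|c: char| matches!(c, '.' | '!' | '?')) {
            let words: Vec<String> = sentence
                .split_whitespace()
                // Simplified cleanup; see section 4 for the real version.
                .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_string())
                .filter(|w| !w.is_empty())
                .collect();
            if words.len() < 3 {
                continue; // need at least one full triplet
            }
            chain.starters.push((words[0].clone(), words[1].clone()));
            // Slide a 3-word window: (w[i], w[i+1]) -> w[i+2]
            for w in words.windows(3) {
                chain
                    .transitions
                    .entry((w[0].clone(), w[1].clone()))
                    .or_default()
                    .push(w[2].clone());
            }
        }
        chain
    }
}
```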
3) Generating text by walking transitions
Generation is:
- pick a random starter `(prev, cur)`
- repeatedly sample `next` from `transitions[(prev, cur)]`
- shift the window: `prev = cur, cur = next`
- stop when there is no transition
- repeat until we have enough words
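A sketch of that loop, continuing the sketches above and using the `rand` crate’s 0.8-style API (which RNG the project actually uses is an assumption):

```rust
use rand::seq::SliceRandom;

impl MarkovChain {
    fn generate(&self, target_words: usize) -> Vec<String> {
        let mut rng = rand::thread_rng();
        let mut out: Vec<String> = Vec::new();
        while out.len() < target_words {
            // (Re)seed the walk from a random starter pair.
            let Some((mut prev, mut cur)) = self.starters.choose(&mut rng).cloned() else {
                break; // empty chain: nothing to generate
            };
            out.push(prev.clone());
            out.push(cur.clone());
            // Walk transitions until a dead end or until we have enough words.
            while out.len() < target_words {
                let Some(next) = self
                    .transitions
                    .get(&(prev.clone(), cur.clone()))
                    .and_then(|choices| choices.choose(&mut rng))
                else {
                    break; // no transition: restart from a new starter
                };
                out.push(next.clone());
                prev = cur;
                cur = next.clone();
            }
        }
        out
    }
}
```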
Then we format it for terminal typing by chunking:
- 10 words per line
- newline between lines
This keeps the output readable and consistent for the UI.
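The chunking itself is a small helper (the function name is mine):

```rust
/// Join generated words into lines of `words_per_line` words each,
/// e.g. chunk_lines(&words, 10).
fn chunk_lines(words: &[String], words_per_line: usize) -> String {
    words
        .chunks(words_per_line)
        .map(|line| line.join(" "))
        .collect::<Vec<_>>()
        .join("\n")
}
```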
4) Cleaning up quotes/dashes for terminal output
We convert a few common Unicode punctuation characters into ASCII equivalents:
- curly quotes → `"` / `'`
- em dash → `-`
Then we trim common wrappers (`"`, `'`, `_`, `(`, `)`, `[`, `]`) around tokens.
This prevents “invisible weird Unicode punctuation” from showing up in practice text and keeps the experience consistent across terminals.
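A sketch of that cleanup (the function name and exact character set are assumptions based on the list above):

```rust
/// Normalize common Unicode punctuation to ASCII, then strip wrapper
/// characters, so practice text renders consistently in any terminal.
fn clean_word(word: &str) -> String {
    let ascii: String = word
        .chars()
        .map(|c| match c {
            '\u{2018}' | '\u{2019}' => '\'', // curly single quotes -> '
            '\u{201C}' | '\u{201D}' => '"',  // curly double quotes -> "
            '\u{2013}' | '\u{2014}' => '-',  // en/em dash -> -
            other => other,
        })
        .collect();
    // Trim common wrappers: " ' _ ( ) [ ]
    ascii
        .trim_matches(|c: char| matches!(c, '"' | '\'' | '_' | '(' | ')' | '[' | ']'))
        .to_string()
}
```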
5) Wiring it into `TextSource`
Finally, we connect it all to the existing text pipeline:
```rust
MarkovChain(path) => generate_markov_chain(path),
```
Where `generate_markov_chain` reads the input file, builds the chain, and returns the generated text.
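Putting it together with the sketches above (the enum here is trimmed to the new variant, the real `TextSource` has more, and the 100-word / 10-per-line counts are placeholders):

```rust
use std::path::{Path, PathBuf};

enum TextSource {
    MarkovChain(PathBuf),
    // ...other variants elided
}

fn get_text(source: &TextSource) -> String {
    match source {
        TextSource::MarkovChain(path) => generate_markov_chain(path),
    }
}

/// Read the training file, build the chain, and return formatted text.
fn generate_markov_chain(path: &Path) -> String {
    let text = std::fs::read_to_string(path).expect("failed to read training text");
    let chain = MarkovChain::build(&text);
    chunk_lines(&chain.generate(100), 10)
}
```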
External Resources
- Markov Chain - the Wikipedia entry on Markov chains
- Markov Chain Text Generation - an article explaining the approach; we used its transition-diagram image as a visual aid
What’s Next?
Now that TypeGym can generate natural-ish text, there is just one more step to implement:
- add CLI options to configure input files for the different modes and to shape the output (height / width / word count / etc.)
Project Code
You will find the complete source code here: typegym