Next word prediction with Markov chains

A Markov chain is a way to represent how a system moves between different states over time. The key idea behind a Markov chain is that the likelihood of moving to the next state depends only on the current state, not on the history of previous states. This is known as the Markov property or "memoryless" property.

One common application of Markov chains is in language modeling, such as predicting the most likely next word in a sentence. This process involves analyzing a vast collection of text to determine how many times each word follows another. From these counts, we can estimate the probability of each word appearing next given the current word.

For example, let's consider the probability distribution for the words in the following sentence:

Current state	Next State		Next State
Current state	Word	Number of Ocurrences	Probability
I	want	1	1
want	the	1	1
the	best	2	0.6667
the	field	1	0.3333
best	of	1	0.5
best	in	1	0.5
of	the	1	1
in	the	1	1
field	.	1	1

To predict the word that comes after "the", we look at how many times different words follow "the" in this sentence. We see that "best" follows "the" twice, while "field" follows it once. Therefore, the estimated probability of "best" appearing after "the" is 2/3 or about 66.67%, while the probability for "field" is 1/3 or about 33.33%. With a sufficiently large training corpus, we can construct a comprehensive probability table that covers a wide vocabulary. This enables the language model to suggest the most probable next word for any given context.

Strategies for Selecting the Next Word

Once we have our probability distributions, we can employ different strategies to select the next word. We'll explore three primary methods: maximum likelihood, random weighted selection, and pure random selection. These are fancy words for fairly simple ideas.

Maximum Likelihood

The simplest method is to always choose the word with the highest probability of following the current word. This approach, called maximum likelihood, ensures that we select the most statistically probable word each time. While this method is straightforward and effective, it can sometimes produce repetitive and predictable sequences, making the text feel monotonous.

Random Weighted Selection

For more variety and less predictability, we can use random weighted selection. In this strategy, we consider the probabilities as weights, giving each word a chance of being selected based on its likelihood. This method strikes a balance between predictability and randomness, resulting in text that feels more natural and engaging. It prevents the text from becoming too repetitive while still maintaining coherence.

Pure Random Selection

For the most unpredictable results, we can employ pure random selection. Here, we ignore probabilities altogether and give each possible next word an equal chance of being chosen. This method guarantees maximum unpredictability but can lead to less coherent and sometimes nonsensical text. It's a high-risk, high-reward strategy that can produce both creative and chaotic outcomes.

Improving next-word predictions

Up until now, our use of Markov chains has been fairly basic. We'll now explore various steps we can take to enhance our Markov chains in order elevate the quality of our text predictions.

Adding more context

When we try to predict the next word in a sentence, relying solely on the word right before it can be quite restrictive. This approach often misses out on the full richness and natural flow of language by ignoring the broader context that could be incredibly valuable.

To make our predictions more accurate, we can look beyond just the last word. By expanding our focus to include two, three, or even more previous words, we can achieve better results. In Markov chain terminology, models that use only the current state are called first-order chains, whereas those that incorporate multiple prior words are known as higher-order chains. Utilizing these higher-order chains allows us to make predictions that are richer, more nuanced, and more reflective of natural language.

Take the phrase "I want the." It's more likely that someone would say "I want the best" rather than "I want the field," right? To understand how much more likely, we can refer to real-world data from the Corpus of Contemporary American English (COCA), a comprehensive collection of diverse texts containing over a billion English words.

If we look at just one word of context ("the") there are 226,176 occurrences of the phrase "the best" and 53,750 occurrences of the phrase "the field" in this vast corpus. Thus, "the best" is about 4.2 times more likely than "the field." However, if we expand our context to two words ("want the") we find 674 occurrences of "want the best" and only 4 occurrences of "want the field." This means "want the best" is 168.5 times more likely than "want the field."

Clearly, using two words as context provides a more accurate representation of real-world language usage than using just one word. It shows that to "want the field" is quite rare. Therefore, higher-order Markov chains, which consider multiple preceding words, generally offer better predictive power than first-order chains.

Here's the second-order Markov chain probability distributions of our favourite sentence.

Current state	Next State		Next State
Current state	Word	Number of Ocurrences	Probability
I want	the	1	1
want the	best	1	1
the best	of	1	0.5
the best	in	1	0.5
best of	the	1	1
of the	best	1	1
best in	the	1	1
in the	field	1	1
the field	.	1	1

Challenges of too much context

While higher-order Markov chains can enhance our predictions by considering more context, using very high-order chains could have some significant drawbacks. Let's discuss some of these drawbacks as well as possible solutions.

1. More training data is required

As the order increases, the number of possible word combinations grows rapidly. This means a much larger and more varied dataset is necessary to accurately predict each combination. Without enough data, the model may become unreliable and produce less accurate results. To address this, we can fallback to lower-order chains when a combination doesn't exist in the current corpus.

For example, the phrase "I want the field" doesn't exist in the billion-word COCA corpus. If we'd used this corpus to train our Markov chain, we'd fall back to the phrase "want the field" as context, which appears 4 times. This ensures that we still make reasonable predictions, even with limited data.

2. More compute and storage is required

Very high-order chains can be quite demanding in terms of computational resources. The increased complexity can slow down processing times and require more memory and power, making them less practical for real-time applications or systems with limited resources. One way to manage this is by pre-computing as many predictions as possible and, alternatively, performing predictions on a remote machine with greater computational capacity.

2. Overfitting may occur

There's a risk of overfitting with very high-order chains. This means the model may become too tailored to the specific sequences it has seen and struggle to generalize to new or unseen text, reducing its overall effectiveness in predicting natural language.

To mitigate this, it's important to experiment manually to find the optimal order for your corpus. A good starting point is using a third-order chain, as it strikes a common balance between context and computational feasibility.

Where can I actually use Markov chains?

Markov chains are incredibly effective for predicting short snippets of text. This makes them ideal for applications like autocomplete suggestions that need to be quick and contextual.

They have significant limitations when it comes to longer pieces of writing. The primary reason for this is that they focus too heavily on the most recent word or a few preceding words, rather than understanding the entire context of the text. This limited view causes them to lose coherence and relevance over more extended passages, as they lack a deeper comprehension of the overall message or theme.

For instance, if you use a Markov chain to generate a paragraph, it might start strong, but it will likely drift off-topic or become nonsensical as it continues. This is because it doesn't retain a broad perspective on the content, and instead hones in on short-term patterns.

In contrast, more sophisticated models, such as neural networks, especially those utilizing deep learning and transformers, can manage and generate coherent long-form text by understanding and maintaining context over much longer spans.

Conclusion

When I started my journey with Markov chains, it was out of necessity. I needed a simple yet effective text generation algorithm for a typing test I was creating. Markov chains proved to be the right solution because generating meaningful sentences wasn't the goal. I simply needed well-distributed, probable texts.

The alternative was deploying a neural network or an LLM like GPT-2. This felt like overkill, given the complexity of a multi-language application and the fact that Python package management is a mess. The simplicity of writing my Markov chain in JavaScript was quite appealing. You might find that you don't need an LLM after all.