There's no easy way to say it. You've probably been playing Hangman wrong your entire life.
In its purest form, Hangman is a word game played between two people. One person selects a secret word, and the other tries to determine the word by guessing it letter-by-letter. The player with the secret writes a series of dashes, one representing each letter in the solution. Initially, no information is known about the target word, other than its length.
The solver calls out letters, one-by-one. If a called letter appears in the solution, all occurrences in the solution are filled in.
If the letter does not appear in the solution, the secret writer adds one element to a drawing of a gallows (complete with a stick man).
A complete rendering takes 11 moves.
If the hangman drawing gets completed (11 incorrect letters), then the secret writer has won. If all letters of the word are revealed before this happens, then the solver wins. As a young person, when you first started to play the game, you probably called out random letters. Once you got a hit of a couple of letters, it helped you narrow down the solution.
Next, you probably graduated to calling vowels first, having learned that (just about*) all words contain at least one vowel (or the letter 'Y').
* The very complete word dictionary I'm using for this exercise contains 172,806 words. Only 20 of these words do not contain any vowel, or the letter 'Y' eg CWM, TSKTSK, PSST, PHPHT and BRRR.
(I counted just 121 words with that do not contain any of the letters 'AEIOU', so the number that use just 'Y' as a vowel is 101).
Enlightenment: Guessing The First Letter
Next, you probably graduated to learning that not all letters are used equally. It's rare that the letter 'Q' appears in a word, whereas 'T' is used a lot more often.
Once you get just a couple of letters of in a Hangman puzzle, the game becomes easier. The solution set is drastically reduced, and skills like pattern matching and word knowledge become important. It's crucial to get that first letter in the puzzle as soon as possible. Which letter should you guess first?
Code, Cyphers And Secret Writing
Growing up you probably invented or used your own substitution cypher (where each letter is replaced by a different letter or symbol). A classic example is the Pigpen Cipher. Messages encoded in a simple cipher are pretty easy to crack because the same letter always is always represented by the same symbol. If you solved a lot of puzzle cyphers, then you probably learned and used the letter ordering below.
Ordering of letter frequency in English language:
ETAOIN SHRDLU CMFWYP VBGKQJ XZ
The sequence above represents the usage order of letters in the English language, with the letter 'E' being the most common letter, followed by the letter ‘T', all the way down to the letter 'Z', the least commonly used.
So, the first letter we should guess when trying to solve a hangman is the letter 'E', right?
Since 'E' is the most popular letter in English text, it will have the highest probability of being in our word, right?
Yes, the ordering above is an accurate portrayal of the frequency of usage of letters in English text, and if we were examining English text it is what we should be using.. but, we're not looking at pages of text, we're looking at isolated words.
English text is full of words that are used very frequently: THE, OF, AND, A, TO, IN, IS, YOU, THAT, IT...
A frequency table of letter usage based on English text is biased because of the substantial presence of these common words.
(About one-third of all printed English material is made up of the top 25 occurring words. The most popular 100 words make up approximately one-half of all printed English!).
Because we're trying to guess a naked word in isolation the above frequency distribution is not appropriate. It's distorted.
Instead, what we need to look for is the incidence of letters in the words in our dictionary, not the incidence of letters in all English text.
This will give a much better probability estimate for the frequency of letters because it will be unbiased by the frequency of common words.
We can further refine this strategy and do a little better. Since we're happy if we hit one, or many, letters in our target word we do not want to double count frequency if there is more than one of the same letter in a word. Instead of counting the occurrences of all the letters, we count the number of times a letter is present (one or many) times in each word. Essentially giving a count of, if we select a letter, the number of words that this letter is present in.
We can then sort this list based on the probabilities (count of the number of words that letter is present in). Here are the results:
ESIARN TOLCDU PMGHBY FVKWZX QJ
There's a noticeable difference. Here, again, is the distribution based on frequency in English text (for comparison).
ETAOIN SHRDLU CMFWYP VBGKQJ XZ
Whilst 'E' is still the most popular letter, the next most popular (based on number of words in the dictionary that contain it), is 'S' and not 'T'. 'T' has been relogated to seventh ordinal position (60.13 per cent of all words in my dictionary have a letter 'S' in them, but only 48.23% of them have a letter 'T').
Next in popularity come two more vowels 'I' and 'A' ('O' having moved further back). 'R' occurs significantly more often in isolated words than it does when biased by the frequency in everyday text.
The ordering of vowels is now 'E I A O U' instead of 'E A O I U'
Interestingly, the least likely letter is now 'J' instead of 'Z'. (There are just 2,463 words in the dictionary that contain the letter 'J' cf. 4,592 with the letter 'X' and 7,028 containing the letter 'Z').
Now that we know the chances of a letter being in any word we can use this new table to select our guesses, right?
Don't Forget About The Length!
The above distribution has been calculated for all the words in the dictionary. But remember, when playing Hangman, we know the length of the word we are trying to guess. This allows us to further refine our searching.
Below is a table showing the popularity of letters in dictionary words grouped by the length of those words. The most popular letters are at the top of the table, and the the least popular letters at the bottom. To the left are the shorter word lengths, and to the right are the longer ones.
There are so many fascinating things to point out about this table that I don't know where to start!
- There are only two words with one letter!
- There are no two letter words containing the letter C, Q, V or Z.
- From one to four letter words, the most popular letter is A
- For five letter words it changes to S, then from six to twelve it is the letter E. From thirteen letters onwards, the most likely letter to be in a word is the letter I.
- The letter A starts off as the most popular vowel, but by the time words grow to 15 letter long, it has been relegated to fourth most common vowel.
- There is no word in the English language that is 18 letters long and contains the letter J. Similarly, there is no 20 letter word which contains the letter W.
- T is the most popular consonant in three letter words, and falls in popularity in mid-length words before regaining its popularity at fourteen.
- Z is never the least popular letter.
- O falls in popularity in mid-length words.
Plus much more...
Here is the same table again with a splash of colour highlighting the vowels.
OK, so our strategy should be to find the column corresponding to the number of letters in the target word, and start calling down the letters from the top until we get a hit, right?
Wrong! (Though now we're a lot closer to an optimal strategy!)
What went wrong? How come the above search strategy is not the optimal?
Well, the above table gives us the distribution of letters based on independent events. They show the relative probability of a letter being in a word. But, if we guess a letter and it is not in the solution, this reduces the set of possible words. Are you with me?
Let me give an example: If we have a six letter word, our first letter to guess should be 'E'. If the letter 'E' is not in the solution, we should not necessarily try the letter 'S' next (which is what the above table implies)!
Yes, 'S' is the second most likely letter for all words, but we already know that our target word does not contain a letter 'E', so we need to recalculate the probabilities for the next most likely letter based on six letter words that do not contain this letter. (In fact, if there is no 'E' in a six letter word, the next letter to suggest should be an 'A', not an 'S').
We can continue this recurssion. Each time, if we don't get a match, we remove all possible words that don't contain that letter (and all previous suggested letters), and then look for the most popular letter in the remaining set.
Here are the final results of these calculations. These charts tell you what order to call letters, based on length of the word, to maximise your chances of getting your first hit.
There are some interesting take aways from these results:
- The most challenging (least deterministically obvious) words to guess are three letter words. It can take up to 10 guesses before getting a letter on the board!
- With less than three letters, it gets easier (there are less possible words), and with more than three letters it becomes less likely there will be any words that you cannot find a letter for quickly.
- For five letter words, the best first guess is the letter S. This is the only time a consonant is the most likely first guess letter.
- For four letter words, the first non-vowel guess is an S, followed by B and then F (remember, these are only called if all preceeding letters have failed to hit).
- No row contains more than 10 guesses, and since a Hangman game takes 11 fails to lose, it is impossible to come up with any English letter word that will fail at Hangman without a single letter appearing on the board (assuming the optimal search strategy above is followed).
- A should only be your first guess if the word length is four or less letters. If five letters, go for S first. Between six and twelve letters try E and above that you should call I.
The above analysis (finding our first letter) is easy to render in table form because there are only two choices: We either miss, or we hit. If we miss, we simply try again. Once we've hit a letter or two, however, things get too complex to display in table format. eg "Show me the next best letter to guess for eight letter words that have do not have an 'E' or 'I', but have an 'A' and a 'T'!" We'd have a stack of tables reaching up to the ceiling for all combinations of letters present or not, and their positions!
Computers are far better at filtering and sifting through databases. Once a first letter has been found, this knowledge (letters not present, letter found and the position of this letter), massively reduces the solution set of possible words. Tools like SQL and regular expressions can be quickly applied to find all possible words that match the comb filter built up.
Pre-computed tables are only fine up to a point, after that, they become unmanageable. To paraphrase a famous quote:
"Battle plans are excellent up until the first shot is fired!"
A Better Strategy for Hangman [DataGenetics Blog]
Nick Berry is UK native living in Seattle. Educated as a rocket scientist and aircraft designer, he went on to found and sell a software company, work for Microsoft's Casual Gaming division and head up customer analytics for RealNetworks' GameHouse.