Scrabble bingo & Wordensity
I thought I could work out the probability of getting a
7-letter word in Scrabble, which is apparently known as a Scrabble bingo.
I could not do it,
because the answer depends on how many tiles there are to start with and how many
are left at a given stage of the game, not to mention the number of tiles of
each letter at the beginning and the number remaining at a given point in the
game.
Finally, I found a
thread which discussed the probability of finding a 7-letter word on the starting
draw in Scrabble. The estimated numbers are higher than what I obtained
(for reasons beyond my ken). But a fair amount of searching online got me some
pointers, which I quote later in this blog.
Since I found Scrabble probability too tough, I tried to get
an upper limit by calculating the word density: namely, the number of
actual English words out of the total number of possible combinations of
letters.
But first some background (in other words: time-wasting
tactics) [1]:
“Most adult native test-takers range from 20,000–35,000
words. Average native test-takers of age 8 already know 10,000 words. Average
native test-takers of age 4 already know 5,000 words. Adult native test-takers
learn almost 1 new word a day until middle age.”
The total number of words in English [2a] according to particular criteria:
“There are an estimated 171,476 words currently in use in the
English language, according to the Oxford English Dictionary, not to mention
47,156 obsolete words.”
That is, most adult native speakers of English know roughly 10–20%
of the total.
However, another website [2b], dated 8th July 2008, quotes the
OED as saying that there are 600,000 words, and predicts that this number will go
up to 1 million by 2009. So the number should be much higher by now… The
prediction is by Paul Payack, founder of Global Language Monitor and
yourdictionary.com.
In other languages [3]:
Language | Words in the Dictionary
Korean | 1,100,373
Japanese | 500,000
Italian | 260,000
English | 171,476
Russian | 150,000
Spanish | 93,000
Chinese | 85,568
a) According to [4a], the number of 7-letter words in English is 32,909,
while the total number of 7-letter combinations in English (sets of 7 different
letters chosen from 26, ignoring order) is:
C(26,7) = (26!)/[(19!)(7!)] = 657,800
Assuming perfect knowledge and recall,
the maximum probability of getting a 7-letter word in Scrabble is:
P(7) = 32,909/657,800 = 0.0500 i.e. 5%.
b) The number of 8-letter words [4a] is 40,161.
The total number of 8-letter combinations in English is:
C(26,8) = (26!)/[(18!)(8!)] = 1,562,275
The binomial coefficients can be
calculated using an online calculator, for example [5].
Assuming perfect knowledge and recall,
the maximum probability of getting an 8-letter word in Scrabble is:
P(8) = 40,161/1,562,275 = 0.0257 i.e. 2.57%.
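For anyone who would rather not use an online calculator, the two ratios can be checked with a few lines of Python. This is only a minimal sketch of the arithmetic above: the function name word_density and its parameters are my own labels, and the word counts are simply the ones quoted from [4a].

```python
from math import comb


def word_density(n_words, word_length, alphabet_size=26):
    """Return (number of letter combinations, words / combinations).

    Combinations are sets of `word_length` different letters chosen from
    the alphabet, ignoring order, i.e. the binomial coefficient C(26, k).
    """
    combos = comb(alphabet_size, word_length)
    return combos, n_words / combos


# Word counts for 7- and 8-letter words as quoted from [4a].
for length, n_words in [(7, 32_909), (8, 40_161)]:
    combos, ratio = word_density(n_words, length)
    print(f"{length} letters: {n_words:,} / {combos:,} = {ratio:.4f}")
```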
In general for n letters, where n goes
from 3 to 20:
No. of letters in word | No. of words | Total number of combinations | Ratio (%)
3 | 1,292 | 2,600 | 49.69
4 | 5,454 | 14,950 | 36.48
5 | 12,478 | 65,780 | 18.97
6 | 22,157 | 230,230 | 9.62
7 | 32,909 | 657,800 | 5.003
8 | 40,161 | 1,562,275 | 2.571
9 | 40,727 | 3,124,550 | 1.303
10 | 35,529 | 5,311,735 | 0.669
11 | 27,893 | 7,726,160 | 0.361
12 | 20,297 | 9,657,700 | 0.210
13 | 13,857 | 10,400,600 | 0.133
14 | 9,116 | 9,657,700 | 0.0943
15 | 5,757 | 7,726,160 | 0.0745
16* | 783 | 5,311,735 | 0.0147
17* | 407 | 3,124,550 | 0.0130
18* | 70 | 1,562,275 | 0.00448
19* | 84 | 657,800 | 0.0128
20* | 49 | 230,230 | 0.0213
N.B. Ref. [4a] stops at 15-letter words, so the
number of words with 16 or more letters (indicated by *) is from [4b].
Note that the binomial coefficient in column 3 goes through
a maximum at n = 13, as expected, and is symmetric about 13.
The total number of words (up to and including 12
letters) in column 2 of the above table is 238,897. That may be compared with the OED number [2a]: 171,476 + 47,156 = 218,632,
which is roughly comparable.
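The table and the three remarks after it can be reproduced with a short Python sketch. The word counts are copied from column 2 (3 to 15 letters from [4a], 16 to 20 from [4b]); the variable names are my own, and the script only assumes that column 3 is the binomial coefficient C(26, n).

```python
from math import comb

# Number of words per word length, copied from column 2 of the table.
WORD_COUNTS = {
    3: 1292, 4: 5454, 5: 12478, 6: 22157, 7: 32909, 8: 40161,
    9: 40727, 10: 35529, 11: 27893, 12: 20297, 13: 13857, 14: 9116,
    15: 5757, 16: 783, 17: 407, 18: 70, 19: 84, 20: 49,
}

# Rebuild columns 3 and 4: C(26, n) and the ratio in percent.
rows = [(n, w, comb(26, n), 100 * w / comb(26, n)) for n, w in WORD_COUNTS.items()]
for n, w, c, pct in rows:
    print(f"{n:>2} | {w:>6,} | {c:>10,} | {pct:.3g}")

# C(26, n) peaks at n = 13 and is symmetric about it.
peak_length = max(range(27), key=lambda n: comb(26, n))
is_symmetric = all(comb(26, n) == comb(26, 26 - n) for n in range(27))

# Total number of words up to and including 12 letters.
total_up_to_12 = sum(w for n, w in WORD_COUNTS.items() if n <= 12)

print(peak_length, is_symmetric, total_up_to_12)
```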
The maximum number of
words in the wordlist seems to be at about 9 letters.
Is this due to our
cognitive limitations – or is it that we just do not need more words?
The average person can
only remember 7-digit numbers reliably, but it is possible to do much better
using mnemonic techniques [6]. That
is similar, but chunking letters is a lot easier than remembering blocks of
numbers, so the comparison is, at best, suggestive.
The word density
(ratio) exhibits roughly an exponential decrease as the number of letters
increases.
Getting back to
Scrabble, I found a math thread which gives details of a correct
Scrabble calculation [7]. The answer
is: “There are 26,514 unique sets of 7-letters that generate a legitimate
Scrabble word, and 3,199,724 possible Scrabble racks of 7 letters, so the
probability that a starting draw is a valid 7-letter word is 0.0082.” (i.e. 0.82%).
The number of 7-letter
sets is a bit lower than the number of 7-letter words tabulated above [4a] (32,909), but that is okay. The problem is the 3-million
number! It is larger by a factor of 4.86 than my count of 657,800 combinations. This calculation
takes into account the number of tiles of each letter in the Scrabble starting draw:
“My program assumes a rack size of 7 and these
tile counts:
A=>9, B=>2, C=>2, D=>4, E=>12,
F=>2, G=>3, H=>2, I=>9, J=>1, K=>1, L=>4, M=>2,
N=>6, O=>8, P=>2, Q=>1, R=>6, S=>4, T=>6, U=>4,
V=>2, W=>2, X=>1, Y=>2, Z=>1, '#'=>2.”
As far as I can tell, the
tile counts are what fix the number of distinct racks at 3,199,724, but the quoted
0.82% is simply 26,514/3,199,724, which treats every distinct rack as equally likely.
It does not weight each rack (out of the
26,514 or 32,909) by the available number of tiles (100) and the
relative distribution of letters listed above. That weighting is what the next post
[8] does (see below).
In addition, the
contributors of the thread used SOWPODS (the Scrabble tournament word
list used in most countries other than the US, Canada & Thailand) and Perl (a
general-purpose programming language).
Needless to say, about such matters (among others), I am completely clueless.
However, another (more
recent) post on Scrabble bingo probability [8]
explains the calculation as follows: “there
are C(100,7) or 16 billion (actually: 16,007,560,800) equally likely ways to
draw a rack of 7 tiles from the 100 tiles in the North American version of the
game. But since some tiles are duplicated, there are only 3,199,724
distinct possible racks (not necessarily equally likely)” – the same number
quoted by [7].
He adds:
“According to the 2014 Official Tournament and Club Word List (the
latest for which an electronic version is accessible), there are 25,257
playable words with 7 letters.”
After that, though, the post goes into lots of
details about the frequency of words in the English language, rank ordering the
most frequently used on top, and the least used at the bottom of the list – on
the argument that most words in the list of 25,257 would not be either known or
recalled.
Finally he concludes:
“if we include the entire official word list,
the probability of drawing a playable 7-letter word is 21226189/160075608, or
about 0.132601.”
This (13.3%) is even higher than my
estimate of 5%! He says he used Mathematica in his calculations, but
where he gets this number from… I haven’t a clue.
He adds:
“A coarse inspection of the list suggests that
I confidently recognize only about 8 or 9 thousand– roughly a third– of the
available words, meaning that my probability
of playing all 7 of my tiles on the first turn is only about 0.07.”
However, interestingly, the title of the blog
is: “Possibly wrong.”
Naturally, a lot hinges on this number 3,199,724.
Where did it come from? I reproduce here an argument from another blog [9] which explains this (full
disclosure: I do not understand it):
There are:
· 4 letters with ≥7 tiles;
· 3 letters with 6 tiles;
· 4 letters with 4 tiles;
· 1 letter with 3 tiles;
· 10 letters with 2 tiles; and
· 5 letters with 1 tile.
Checking this against the tile counts quoted from [7] above:
4 letters with ≥7 tiles: A, E, I & O
3 letters with 6 tiles: N, R & T
4 letters with 4 tiles: D, L, S & U
1 letter with 3 tiles: G
10 letters with 2 tiles: B, C, F, H, M, P, V, W, Y & blank
5 letters with 1 tile: J, K, Q, X & Z
A comparison with the list of letters shows that the
polynomials do follow what is described… but what these equations are for, and
how Wolfram Alpha spits out ‘3,199,724’ [7b], is another matter:
1 + 27x + 373x^2 + 3509x^3 + 25254x^4 + 148150x^5 + 737311x^6 + 3199724x^7 + 12353822x^8 + 43088473x^9 + 137412392x^10 + ... + x^100
The thread [7b]
also contains an ab initio calculation of how to get the number 3,199,724… but
it will take me a while to figure it out (like, t = ∞).
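For what it is worth, here is a minimal Python sketch of one way to read the polynomial argument. It assumes that each letter with n tiles contributes a factor 1 + x + ... + x^n (take 0, 1, ..., n copies of that letter) and that the coefficient of x^k in the product counts the distinct racks of k tiles. That reading is my assumption, not something spelled out in [7b] or [9], and the names TILE_COUNTS and rack_count_polynomial are made up for illustration.

```python
from math import comb

# Tile counts as quoted from [7]; '#' stands for the blank.
TILE_COUNTS = {
    "A": 9, "B": 2, "C": 2, "D": 4, "E": 12, "F": 2, "G": 3, "H": 2,
    "I": 9, "J": 1, "K": 1, "L": 4, "M": 2, "N": 6, "O": 8, "P": 2,
    "Q": 1, "R": 6, "S": 4, "T": 6, "U": 4, "V": 2, "W": 2, "X": 1,
    "Y": 2, "Z": 1, "#": 2,
}


def rack_count_polynomial(tile_counts):
    """Multiply the factors (1 + x + ... + x^n), one per letter with n tiles.

    Returns a list whose k-th entry is the coefficient of x^k.
    """
    poly = [1]
    for n in tile_counts.values():
        new = [0] * (len(poly) + n)
        for i, coefficient in enumerate(poly):
            for j in range(n + 1):  # take j copies of this letter
                new[i + j] += coefficient
        poly = new
    return poly


coeffs = rack_count_polynomial(TILE_COUNTS)
print(coeffs[:8])    # coefficients of x^0 .. x^7
print(coeffs[7])     # distinct 7-tile racks
print(comb(100, 7))  # equally likely ways to draw 7 of the 100 tiles
```

The first few coefficients can be compared directly with the expansion quoted from [7b]; the last line is the C(100,7) figure quoted from [8].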
Anyway, another post by Derek, the Word Buff [10a], confidently states that the
probability of a Scrabble bingo by the first player from the first rack is about
15% (no details of the calculation given). He assumes that the Official
Scrabble Dictionary is being used and that the players have perfect memory and
perfect ‘anagramming’ skills.
But others also give similar estimates [10b]:
“If we instead only require
that the 7 tiles be rearranged to form a 7-letter word, then it's much easier:
of the C(100,7)=16,007,560,800 ways to draw 7 tiles from the bag, 2,068,621,350
of those may be arranged to spell a word, for a probability of about 0.129228.
If we prohibit blanks, then there are only 1,075,220,956, or about half as many
ways to draw 7 "spellable" tiles.”
And one more [10c]:
“There are 24,029 valid tournament 7 letter
words in English. Treating the fresh bag of tiles as a multivariate
hypergeometric distribution with parameters 7 and the number of each type of
tile, and getting the PMF for the distinct combinations of letters and blanks
that result in at least one of the valid scrabble words results in probability
13790809/106717072, or ~12.9% chance you pull the letters/blanks to form one of
the valid words.
N.B.: I use English word
list and the American English tile distribution. The results will differ for
different language tournament lists/tiles.”
And in the blog [10d]:
“I ran it on
/usr/share/dict/words, as I say, which on this machine is an edition of the
well-known online word list claimed to have been legally derived from Webster's
New International Dictionary, 2nd Edition. The complete list is 235,882 words.
And the results were:
7-letter words in
dictionary: 20552
2017799913/16007560800 =
12.61%
However, there are two
major sources of error here. First, the
word list includes a large number of obscure words, which in
practice few people would know.”
According to Mark Spahn [11], the process of determining
the probability of getting the word ‘boot’ in the very first draw consists of 4
steps:
1. specify the problem
2. create a dictionary of words
3. compute the non-words
4. compute the chances
The probability of obtaining the word
‘boot’ in the very first draw is calculated as 0.71%.
According to Mark Spahn [11]
the probability of obtaining the word “MINIMAL” at the very first draw is
calculated as follows:
“There is C(2,2)=1 subset
of 2 M's from the 2 M's in the bag.
There are C(9,2)=36 subsets of 2 I's from the 9 I's in the bag.
There are C(6,1)=6 subsets of 1 N from the 6 N's in the bag.
There are C(9,1)=9 subsets of 1 A from the 9 A's in the bag.
There are C(4,1)=4 subsets of 1 L from the 4 L's in the bag.
Thus there are
1*36*6*9*4=7776 distinct subsets of 2 M's, 2 I's, and 1 each N, A, L. The 7
distinct tiles can be arranged in 7! ways. Thus there are 7776*7! distinct
permutations of tiles that can be rearranged to spell MINIMAL.
The probability of drawing such a set of tiles from the bag is this number of
permutations, divided by the number of all permutations of 7 tiles drawn
(without repetition) from the bagful of 100 tiles, which is P(100,7) =
100*99*98*97*96*95*94.
So P{MINIMAL} =
7776*7!/P(100,7) = 7776/(P(100,7)/7!)
= 7776/C(100,7).
C(100,7) =
2^5*3*5^2*7*11*19*47*97 = 16,007,560,800.
P{MINIMAL} works out to
81/166,745,425.”
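Spahn's arithmetic for MINIMAL can be checked step by step in Python. This is a minimal sketch that only uses the tile counts he quotes (2 M's, 9 I's, 6 N's, 9 A's, 4 L's in a bag of 100); the variable names are mine. It also confirms that the ordered (permutation) and unordered (combination) versions of the calculation give the same probability.

```python
from fractions import Fraction
from math import comb, factorial, perm

# Ways to pick the tiles: 2 of 2 M's, 2 of 9 I's, 1 of 6 N's, 1 of 9 A's, 1 of 4 L's.
subsets = comb(2, 2) * comb(9, 2) * comb(6, 1) * comb(9, 1) * comb(4, 1)

# Ordered version: arrangements of the 7 chosen tiles over ordered draws of 7 from 100.
p_ordered = Fraction(subsets * factorial(7), perm(100, 7))

# Unordered version: tile subsets over all 7-tile subsets of the bag.
p_unordered = Fraction(subsets, comb(100, 7))

print(subsets)
print(p_ordered == p_unordered)
print(p_unordered)
```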
Spahn [11] also quotes from Albert Weissman [12] the probability of AEINORT (the most likely rack, but,
sorry, no corresponding word) as:
9*12*9*6*8*6*6/C(100,7) = 17,496/166,745,425 =
1/9530.488
and the least likely
combination of letters as BBJKQXZ with a
probability of 1/C(100,7). That is, 1/(16 billion).
The most likely actual
Scrabble bingo according to Weissman [12],
is for the words TRAINEE and RETINAE, with a probability of 1/13,870. According
to Spahn [11], it is slightly
different:
“I get P{AEEINRT} =
P{EEAINRT} = C(12,2)*9*9*6*6*6/C(100,7)
= 2,187/30,317,350 = 1/13,862.53.”
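The same recipe works for any rack that does not rely on blanks: multiply C(tiles available, tiles needed) over the letters and divide by C(100,7). Below is a minimal sketch under that assumption; rack_probability and TILE_COUNTS are hypothetical names of my own, and the tile counts are the ones quoted from [7].

```python
from collections import Counter
from fractions import Fraction
from math import comb, prod

# Tile counts as quoted from [7]; '#' stands for the blank.
TILE_COUNTS = {
    "A": 9, "B": 2, "C": 2, "D": 4, "E": 12, "F": 2, "G": 3, "H": 2,
    "I": 9, "J": 1, "K": 1, "L": 4, "M": 2, "N": 6, "O": 8, "P": 2,
    "Q": 1, "R": 6, "S": 4, "T": 6, "U": 4, "V": 2, "W": 2, "X": 1,
    "Y": 2, "Z": 1, "#": 2,
}


def rack_probability(rack, tile_counts=TILE_COUNTS):
    """Probability that the opening draw is exactly this multiset of tiles."""
    needed = Counter(rack)
    # Number of ways to pick the required copies of each letter from the bag.
    ways = prod(comb(tile_counts[tile], k) for tile, k in needed.items())
    return Fraction(ways, comb(sum(tile_counts.values()), len(rack)))


for rack in ("AEINORT", "AEEINRT", "BBJKQXZ"):
    p = rack_probability(rack)
    print(rack, p, f"about 1 in {float(1 / p):,.2f}")
```

Run on these three racks, the sketch lets one compare Weissman's and Spahn's figures side by side.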
--------------------------------------------------------------------------------------------------------------------------------
So what is the probability of Scrabble bingo? The
numbers 3,199,724 and 16,007,560,800 occur everywhere. The first estimate gives
a Scrabble bingo probability of 0.82% [7].
But all the others give higher values:
13.26% [8],
15% [10a], 12.92% [10b], ~12.9% [10c] and 12.61% [10d].
Clearly, nobody believes the estimate of 0.82%
- most likely because it does not take into account the fact that the
probability of each word is different (as explained in detail by Mark Spahn
[11]). Am I sure of this? No! Because most of this combinatorics is, to me, gibberish!
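To see how much the weighting matters, here is a toy Python sketch that computes both kinds of estimate for a tiny, made-up word list. It is not the calculation from [7], [8] or [10b–d]: it ignores blanks, the three-word list is purely illustrative, and the function names are hypothetical. The "unweighted" value divides the number of distinct word-forming racks by 3,199,724; the "weighted" value adds up the actual probability of drawing each of those racks.

```python
from collections import Counter
from fractions import Fraction
from math import comb, prod

# Tile counts as quoted from [7]; blanks are ignored in this sketch.
TILE_COUNTS = {
    "A": 9, "B": 2, "C": 2, "D": 4, "E": 12, "F": 2, "G": 3, "H": 2,
    "I": 9, "J": 1, "K": 1, "L": 4, "M": 2, "N": 6, "O": 8, "P": 2,
    "Q": 1, "R": 6, "S": 4, "T": 6, "U": 4, "V": 2, "W": 2, "X": 1,
    "Y": 2, "Z": 1,
}
DISTINCT_RACKS = 3_199_724  # number of distinct 7-tile racks quoted in [7] and [8]
ALL_DRAWS = comb(100, 7)    # equally likely 7-tile draws from the full bag


def rack_probability(rack):
    """Probability of drawing exactly this multiset of letter tiles."""
    ways = prod(comb(TILE_COUNTS[t], k) for t, k in Counter(rack).items())
    return Fraction(ways, ALL_DRAWS)


def bingo_estimates(words):
    """Return (unweighted, weighted) estimates for a list of 7-letter words."""
    # Anagrams such as TRAINEE and RETINAE share a single rack.
    racks = {"".join(sorted(word.upper())) for word in words}
    unweighted = Fraction(len(racks), DISTINCT_RACKS)
    weighted = sum((rack_probability(r) for r in racks), Fraction(0))
    return unweighted, weighted


TOY_WORDS = ["TRAINEE", "RETINAE", "MINIMAL"]  # illustrative only
unweighted, weighted = bingo_estimates(TOY_WORDS)
print(float(unweighted), float(weighted))
```

For this toy list the weighted value comes out much larger than the unweighted one, because these racks are made of common letters and so are far more likely than a typical distinct rack.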
Note that two writers in the thread [7b]
give almost the same value (12.9%) as [10b & 10c].
Well, 12.9% is popular, for sure…
It is interesting that the Scrabble bingo
probability (12.9%) is higher than the word density (5.0%). It all boils down
to the fact that in Scrabble there are 100 tiles (with different numbers of
each letter, plus two blanks), while in the word density there are only 26 letters
(with equal probability). And add to that: these geeks computed the
probability of 26,514 words from the Scrabble dictionary using tools that I
had never heard of (Perl, a general-purpose programming language originally
developed for text manipulation, and the SOWPODS word list) as well as some I had heard of
(Mathematica & R), plus Monte Carlo simulation...
I will just stick to my pay-grade and my crude
estimate and plot of wordensity…
References:
1. https://www.economist.com/johnson/2013/05/29/lexical-facts
3. https://blog.ititranslates.com/2018/03/07/which-language-is-richest-in-words/
4a. www.bestwordlist.com
4b. https://www.bestwordlist.com/8letterwords.htm
4c. www.yougowords.com
5. https://miniwebtool.com/binomial-coefficient-calculator/?n=26&k=9
6. https://humanbenchmark.com/tests/number-memory
7a. https://www.reddit.com/r/AskStatistics/comments/47a3z0/what_are_the_odds_of_being_able_to_spell_a_7/
8. https://possiblywrong.wordpress.com/2017/01/20/probability-of-a-scrabble-bingo/
10c. https://www.reddit.com/r/theydidthemath/comments/40t22p/request_how_many_ifferent_combination_of/
10d. https://groups.google.com/g/rec.puzzles/c/3dFsrDa9_oE?pli=1
12. Albert Weissman, Scrabble Players Newspaper (Feb. 1980)