Saturday, November 14, 2020

Scrabble Bingo and Word Density

 

Scrabble bingo & Wordensity.

I thought I could work out the probability of getting a 7-letter word in Scrabble. Apparently this is known as Scrabble bingo.

I could not do it, because it depends on how many letters there are in total to start with and how many are left at a given stage of the game, not to mention how many of each letter there are at the beginning and how many of each remain at a given point in the game.

Finally, I found a thread which discussed the probability of finding a 7-letter word on the starting draw in Scrabble. The estimated numbers are higher than what I obtained (for reasons beyond my ken). But a fair amount of searching online got me some pointers, which I quote later in this blog.

Since I found Scrabble probability too tough, I tried to get an upper limit by calculating the word density: namely, the number of actual English words out of the total number of possible combinations of letters.

But first some background (in other words: time-wasting tactics) [1]:

“Most adult native test-takers range from 20,000–35,000 words. Average native test-takers of age 8 already know 10,000 words. Average native test-takers of age 4 already know 5,000 words. Adult native test-takers learn almost 1 new word a day until middle age.”

The total number of words in English [2a] according to particular criteria:

“There are an estimated 171,146 words currently in use in the English language, according to the Oxford English Dictionary, not to mention 47,156 obsolete words.”

That is, most adult native speakers of English know somewhere between a tenth and a fifth of the total.

However, another website [2b], dated 8th July 2008, also quotes the OED, this time as saying that there are 600,000 words, and reports a prediction that the number would go up to 1 million by 2009. So the number should be much higher by now… The prediction is by Paul Payack, founder of Global Language Monitor and yourdictionary.com.

In other languages [3]:

| Language | Words in the Dictionary |
|----------|-------------------------|
| Korean   | 1,100,373 |
| Japanese | 500,000 |
| Italian  | 260,000 |
| English  | 171,476 |
| Russian  | 150,000 |
| Spanish  | 93,000 |
| Chinese  | 85,568 |

 

 

a) According to [4a], the number of 7-letter words in English is 32,909.

The total number of 7-letter combinations in English (sets of 7 different letters chosen from 26, ignoring order) is:

C(26,7) = (26!)/[(19!)(7!)] = 657,800

Assuming perfect knowledge and recall, the maximum probability of getting a 7-letter word in Scrabble is:

P(7) = 32,909/657,800 = 0.0500, i.e. 5%.

 

b) The number of 8-letter words [4a] is: 40,161.

The total number of 8-letter combinations in English is:

C(26,8) = (26!)/[(18!)(8!)] = 1,562,275

The binomial coefficients can be calculated using an online calculator, for example [5].

Assuming perfect knowledge and recall, the maximum probability of getting an 8-letter word in Scrabble is:

P(8) = 40,161/1,562,275 = 0.0257  i.e. 2.57%.

In general for n letters, where n goes from 3 to 20:

 

| No. of letters in word | No. of words | Total number of combinations | Ratio (%) |
|------|--------|------------|---------|
| 3    | 1,292  | 2,600      | 49.69   |
| 4    | 5,454  | 14,950     | 36.48   |
| 5    | 12,478 | 65,780     | 18.97   |
| 6    | 22,157 | 230,230    | 9.62    |
| 7    | 32,909 | 657,800    | 5.003   |
| 8    | 40,161 | 1,562,275  | 2.571   |
| 9    | 40,727 | 3,124,550  | 1.303   |
| 10   | 35,529 | 5,311,735  | 0.669   |
| 11   | 27,893 | 7,726,160  | 0.361   |
| 12   | 20,297 | 9,657,700  | 0.210   |
| 13   | 13,857 | 10,400,600 | 0.133   |
| 14   | 9,116  | 9,657,700  | 0.0944  |
| 15   | 5,757  | 7,726,160  | 0.0745  |
| 16*  | 783    | 5,311,735  | 0.0147  |
| 17*  | 407    | 3,124,550  | 0.0130  |
| 18*  | 70     | 1,562,275  | 0.00448 |
| 19*  | 84     | 657,800    | 0.0128  |
| 20*  | 49     | 230,230    | 0.0213  |

N.B. Ref. [4a] stops at 15-letter words, so the numbers of words with 16 or more letters (indicated by *) are from [4b].
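The combinations and ratios in the table can be recomputed with a few lines of Python. This is only a minimal sketch: the word counts are copied from column 2 of the table, and the names `word_counts` and `density_percent` are illustrative choices, not anything taken from the references.

```python
from math import comb

# Word counts per word length, copied from column 2 of the table above.
word_counts = {
    3: 1292, 4: 5454, 5: 12478, 6: 22157, 7: 32909, 8: 40161,
    9: 40727, 10: 35529, 11: 27893, 12: 20297, 13: 13857, 14: 9116,
    15: 5757, 16: 783, 17: 407, 18: 70, 19: 84, 20: 49,
}

def density_percent(n, words):
    """Ratio (in %) of actual n-letter words to C(26, n) letter combinations."""
    return 100 * words / comb(26, n)

# Print one row per word length: n, words, C(26, n), ratio in %.
for n, words in word_counts.items():
    print(f"{n:2d}  {words:6d}  {comb(26, n):10d}  {density_percent(n, words):.4g}")
```

`math.comb` needs Python 3.8 or later; on older versions the factorial formula C(26,n) = 26!/[(26-n)! n!] can be used instead.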

Note that the binomial coefficient term in Col.3 goes through a maximum at n =13 as expected, and is symmetric about 13.

The total number of words (up to and including 12 letters) in column 2 of the above table is 238,897. That may be compared with the OED number [2]: 171,476 + 47,156 = 218,632, which is roughly comparable.


The maximum number of words in the wordlist seems to be at about 9 letters. 

Is this due to our cognitive limitations – or is it that we just do not need more words?

The average person can only remember 7 digit numbers reliably, but it's possible to do much better using mnemonic techniques [6]. That is similar – but chunking letters is a lot easier than remembering blocks of numbers – so the comparison is, at best, suggestive.

The word density (ratio) exhibits roughly an exponential decrease as the number of letters increases.

 

Getting back to Scrabble, I found a math thread which gives details of a correct Scrabble calculation [7]. The answer is: “There are 26,514 unique sets of 7-letters that generate a legitimate Scrabble word, and 3,199,724 possible Scrabble racks of 7 letters, so the probability that a starting draw is a valid 7-letter word is 0.0082.” (i.e. 0.82%).

The figure of 26,514 is a bit lower than the number tabulated above [4a] (32,909), but that is okay: it counts distinct sets of letters, and anagrams share a set. The problem is the 3-million number! It is larger by a factor of 4.86 than my estimate (657,800). This calculation takes into account the number of each letter in the Scrabble starting bag:

“My program assumes a rack size of 7 and these tile counts:
A=>9, B=>2, C=>2, D=>4, E=>12, F=>2, G=>3, H=>2, I=>9, J=>1, K=>1, L=>4, M=>2, N=>6, O=>8, P=>2, Q=>1, R=>6, S=>4, T=>6, U=>4, V=>2, W=>2, X=>1, Y=>2, Z=>1, '#'=>2.”

It is possible that the calculation takes into account the probability of getting each word (out of the 26,514 or 32,909) considering the available number of letters (100) and the relative distribution of letters listed above. This is indeed what the next post [8] states (see below).

In addition, the contributors to the thread used SOWPODS (the Scrabble tournament word list used in most countries other than the US, Canada and Thailand) and Perl. Needless to say, about such matters (among others), I am completely clueless.

However, another (more recent) post on Scrabble bingo probability [8] explains the calculation as follows:  “there are C(100,7) or 16 billion (actually: 16,007,560,800) equally likely ways to draw a rack of 7 tiles from the 100 tiles in the North American version of the game.  But since some tiles are duplicated, there are only 3,199,724 distinct possible racks (not necessarily equally likely)” – the same number quoted by [7].

He adds:

“According to the 2014 Official Tournament and Club Word List (the latest for which an electronic version is accessible), there are 25,257 playable words with 7 letters.”

After that, though, the post goes into lots of details about the frequency of words in the English language, rank ordering the most frequently used on top, and the least used at the bottom of the list – on the argument that most words in the list of 25,257 would not be either known or recalled.

Finally he concludes:

“if we include the entire official word list, the probability of drawing a playable 7-letter word is 21226189/160075608, or about 0.132601.”

This (13.26%) is even higher than my estimate of 5%! He says he used Mathematica in his calculations, but where he gets this number from… I haven’t a clue.

He goes on:

“A coarse inspection of the list suggests that I confidently recognize only about 8 or 9 thousand– roughly a third– of the available words, meaning that my probability of playing all 7 of my tiles on the first turn is only about 0.07.”

However, interestingly, the title of the blog is: “Possibly wrong.”

 

Naturally, a lot hinges on this number 3,199,724. Where did it come from? I reproduce here an argument from another blog [9] which explains this (full disclosure: I do not understand it):

There are:

- 4 letters with at least 7 tiles;

- 3 letters with 6 tiles;

- 4 letters with 4 tiles;

- 1 letter with 3 tiles;

- 10 letters with 2 tiles; and

- 5 letters with 1 tile.


For comparison, the tile counts again:

A=>9, B=>2, C=>2, D=>4, E=>12, F=>2, G=>3, H=>2, I=>9, J=>1, K=>1, L=>4, M=>2, N=>6, O=>8, P=>2, Q=>1, R=>6, S=>4, T=>6, U=>4, V=>2, W=>2, X=>1, Y=>2, Z=>1, '#'=>2

 

4 letters with ≥7 tiles: A, E, I & O.

3 letters with 6 tiles: N, R & T.

4 letters with 4 tiles: D, L, S & U.

1 letter with 3 tiles: G.

10 letters with 2 tiles: B, C, F, H, M, P, V, W, Y & blank.

5 letters with 1 tile: J, K, Q, X & Z.

A comparison with the list of letters shows that the polynomials do follow what is described… but what these equations are for, and how Wolfram Alpha spits out ‘3,199,724’ [7b], is beyond me:

1 + 27x + 373x^2 + 3509x^3 + 25254x^4 + 148150x^5 + 737311x^6 + 3199724x^7 + 12353822x^8 + 43088473x^9 + 137412392x^10 + ... + x^100
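One way to read the expansion, as a minimal sketch: a tile type with c copies in the bag contributes a factor (1 + x + x^2 + … + x^c), one term for each number of copies that can end up on a rack, and the coefficient of x^k in the product of all 27 factors counts the distinct racks of k tiles. The Python below multiplies the factors out using the tile counts listed earlier; the names `TILE_COUNTS` and `rack_counts` are illustrative and not from the thread.

```python
# Tile counts from the list above ('#' is the blank).
TILE_COUNTS = {
    'A': 9, 'B': 2, 'C': 2, 'D': 4, 'E': 12, 'F': 2, 'G': 3, 'H': 2, 'I': 9,
    'J': 1, 'K': 1, 'L': 4, 'M': 2, 'N': 6, 'O': 8, 'P': 2, 'Q': 1, 'R': 6,
    'S': 4, 'T': 6, 'U': 4, 'V': 2, 'W': 2, 'X': 1, 'Y': 2, 'Z': 1, '#': 2,
}

def rack_counts(tile_counts):
    """Coefficients of the product of (1 + x + ... + x^c) over all tile types.

    Entry k of the returned list is the number of distinct k-tile racks.
    """
    coeffs = [1]  # start from the constant polynomial 1
    for c in tile_counts.values():
        new = [0] * (len(coeffs) + c)
        for i, a in enumerate(coeffs):
            for j in range(c + 1):  # take j copies of this tile type
                new[i + j] += a
        coeffs = new
    return coeffs

coeffs = rack_counts(TILE_COUNTS)
print(coeffs[:8])  # coefficients of x^0 .. x^7
```

The entry at index 7 should reproduce the 3,199,724 quoted above, and the last entry (index 100) is 1, since there is only one way to take the whole bag.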

The thread [7b] also contains an ab initio calculation of how to get the number 3,199,724 … but it will take me a while to figure it out (like, t = ∞).

Anyway, another post by Derek, the Word Buff [10a], confidently states that the probability of a Scrabble bingo by the first player from the first rack is about 15% (no details of the calculation given). He assumes that the Official Scrabble Dictionary is being used and that the players have perfect memory and perfect ‘anagramming’ skills.

But others also give similar estimates [10b]:

“If we instead only require that the 7 tiles be rearranged to form a 7-letter word, then it's much easier: of the C(100,7)=16,007,560,800 ways to draw 7 tiles from the bag, 2,068,621,350 of those may be arranged to spell a word, for a probability of about 0.129228. If we prohibit blanks, then there are only 1,075,220,956, or about half as many ways to draw 7 "spellable" tiles.” 

And one more [10c]:

 “There are 24,029 valid tournament 7 letter words in English. Treating the fresh bag of tiles as a multivariate hypergeometric distribution with parameters 7 and the number of each type of tile, and getting the PMF for the distinct combinations of letters and blanks that result in at least one of the valid scrabble words results in probability 13790809/106717072, or ~12.9% chance you pull the letters/blanks to form one of the valid words.

N.B.: I use English word list and the American English tile distribution. The results will differ for different language tournament lists/tiles.”

And in the blog [10d]:

“I ran it on /usr/share/dict/words, as I say, which on this machine is an edition of the well-known online word list claimed to have been legally derived from Webster's New International Dictionary, 2nd Edition. The complete list is 235,882 words. And the results were:

7-letter words in dictionary: 20552
2017799913/16007560800 = 12.61%

However, there are two major sources of error here. First, the
word list includes a large number of obscure words, which in
practice few people would know.”

According to Mark Spahn [11], determining the probability of getting the word ‘boot’ in the very first draw takes 4 steps:

1. specify the problem

2. create a dictionary of words

3. compute the non-words

4. compute the chances

The probability of obtaining the word ‘boot’ in the very first draw is calculated as 0.71%.

According to Mark Spahn [11] the probability of obtaining the word “MINIMAL” at the very first draw is calculated as follows:

“There is C(2,2)=1 subset of 2 M's from the 2 M's in the bag.
There are C(9,2)=36 subsets of 2 I's from the 9 I's in the bag.
There are C(6,1)=6 subsets of 1 N from the 6 N's in the bag.
There are C(9,1)=9 subsets of 1 A from the 9 A's in the bag.
There are C(4,1)=4 subsets of 1 L from the 4 L's in the bag.

Thus there are 1*36*6*9*4=7776 distinct subsets of 2 M's, 2 I's, and 1 each N, A, L. The 7 distinct tiles can be arranged in 7! ways. Thus there are 7776*7! distinct permutations of tiles that can be rearranged to spell MINIMAL.
The probability of drawing such a set of tiles from the bag is this number of permutations, divided by the number of all permutations of 7 tiles drawn (without repetition) from the bagful of 100 tiles, which is P(100,7) = 100*99*98*97*96*95*94.

So P{MINIMAL} = 7776*7!/P(100,7) = 7776/(P(100,7)/7!)
= 7776/C(100,7).

C(100,7) = 2^5*3*5^2*7*11*19*47*97 = 16,007,560,800.

P{MINIMAL} works out to 81/166,745,425.”

Spahn [11] also quotes from Albert Weissman [12] the probability of AEINORT (the most likely rack, but, sorry, no corresponding word) as:

 9*12*9*6*8*6*6/C(100,7) = 17,496/166,745,425 = 1/9530.488

and the least likely combination of letters as  BBJKQXZ with a probability of 1/C(100,7). That is, 1/(16 billion).

The most likely actual Scrabble bingo according to Weissman [12], is for the words TRAINEE and RETINAE, with a probability of 1/13,870. According to Spahn [11], it is slightly different:

“I get P{AEEINRT} = P{EEAINRT} = C(12,2)*9*9*6*6*6/C(100,7)
= 2,187/30,317,350 = 1/13,862.53.”
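The MINIMAL, AEINORT and AEEINRT figures all follow the same recipe: multiply C(copies in the bag, copies on the rack) over the letters of the rack, then divide by C(100,7). Here is a minimal sketch in Python, assuming no blanks are used and the tile counts listed earlier; `rack_probability` is an illustrative name, not something from the posts.

```python
from collections import Counter
from fractions import Fraction
from math import comb

# Tile counts from the list quoted earlier ('#' is the blank).
TILE_COUNTS = {
    'A': 9, 'B': 2, 'C': 2, 'D': 4, 'E': 12, 'F': 2, 'G': 3, 'H': 2, 'I': 9,
    'J': 1, 'K': 1, 'L': 4, 'M': 2, 'N': 6, 'O': 8, 'P': 2, 'Q': 1, 'R': 6,
    'S': 4, 'T': 6, 'U': 4, 'V': 2, 'W': 2, 'X': 1, 'Y': 2, 'Z': 1, '#': 2,
}

def rack_probability(letters):
    """Exact probability that the opening draw is exactly this multiset of letters."""
    ways = 1
    for letter, k in Counter(letters).items():
        ways *= comb(TILE_COUNTS[letter], k)  # choose k of the available copies
    total = comb(sum(TILE_COUNTS.values()), len(letters))  # C(100, 7) for a full rack
    return Fraction(ways, total)

for rack in ("MINIMAL", "AEINORT", "TRAINEE"):
    p = rack_probability(rack)
    print(rack, p, f"about 1 in {float(1 / p):,.2f}")
```

Using `Fraction` keeps the result exact, so it can be compared directly with the reduced fractions quoted above.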


 

--------------------------------------------------------------------------------------------------------------------------------

 

So what is the probability of Scrabble bingo? The numbers 3,199,724 and 16,007,560,800 occur everywhere. The first estimate gives a Scrabble bingo probability of 0.82% [7]. But all the others give higher values:

13.26% [8], 15% [10a], 12.92% [10b], ~12.9% [10c] and 12.61% [10d].

Clearly, nobody believes the estimate of 0.82% - most likely because it does not take into account the fact that the probability of each word is different (as explained in detail by Mark Spahn [11]). Am I sure of this? No! Because most of this combinatorics is, to me, gibberish!
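The gap between the two kinds of estimate can be seen on a toy example. The sketch below uses a made-up six-tile bag, a rack size of 2 and a one-word "dictionary", all hypothetical and chosen only to keep the enumeration small. Counting distinct racks treats every rack as equally likely, while counting draws weights each rack by the number of ways it can be pulled from the bag.

```python
from fractions import Fraction
from itertools import combinations

bag = "AAATTQ"   # hypothetical toy bag: 3 A's, 2 T's, 1 Q
words = {"AT"}   # hypothetical one-word dictionary
rack_size = 2

# A rack forms a word if its sorted letters match some word's sorted letters.
valid = {"".join(sorted(w)) for w in words}

# Every physical draw of rack_size tiles; identical letters are separate tiles.
draws = ["".join(sorted(c)) for c in combinations(bag, rack_size)]
distinct = set(draws)

# Each distinct rack counted once (the style of the 0.82% estimate).
unweighted = Fraction(len(distinct & valid), len(distinct))
# Each rack weighted by how many draws produce it (the style of the ~12.9% estimates).
weighted = Fraction(sum(d in valid for d in draws), len(draws))

print(unweighted, weighted)
```

In this toy case the word-forming rack is also one of the easiest to draw, so the weighted figure comes out higher than the unweighted one.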

Note that two writers in the thread  [7b] give almost the same value (12.9%) as [10b & 10c]. Well, 12.9% is popular, for sure…

It is interesting that the Scrabble bingo probability (12.9%) is higher than the word density (5.0%). It all boils down to the fact that in Scrabble there are 100 tiles (with different numbers of each letter, plus two blanks), while in the word density there are only 26 letters (each with equal probability). And add to that: these geeks computed the probability over 26,514 word-forming letter sets from the Scrabble dictionary using tools that I had never heard of (Perl and SOWPODS) as well as some I had heard of (Mathematica and R), plus Monte Carlo simulation. (Perl is a general-purpose programming language originally developed for text manipulation.)

 

I will just stick to my pay-grade and my crude estimate and plot of wordensity…

References:

1. https://www.economist.com/johnson/2013/05/29/lexical-facts

2a. https://www.bbc.com/news/world-44569277#:~:text=We%20considered%20dusting%20off%20the,to%20mention%2047%2C156%20obsolete%20words.

2b. https://blogs.illinois.edu/view/25/4641#:~:text=With%20more%20than%20326%20million,it%20was%20400%20years%20ago.&text=Payack%20also%20predicts%20that%20some,millionth%20English%20word%20will%20appear.

3. https://blog.ititranslates.com/2018/03/07/which-language-is-richest-in-words/

4a. www.bestwordlist.com

4b. https://www.bestwordlist.com/8letterwords.htm

4c. www.yougowords.com

5. https://miniwebtool.com/binomial-coefficient-calculator/?n=26&k=9

6. https://humanbenchmark.com/tests/number-memory

7a. https://www.reddit.com/r/AskStatistics/comments/47a3z0/what_are_the_odds_of_being_able_to_spell_a_7/

7b. http://godplaysdice.blogspot.com/2007/08/how-many-scrabble-racks-are-there.html?showComment=1228611720000#c2396124566679894906

8. https://possiblywrong.wordpress.com/2017/01/20/probability-of-a-scrabble-bingo/

9. https://math.stackexchange.com/questions/243685/how-many-possible-scrabble-racks-are-there-at-the-beginning-of-the-game

10a. http://www.word-buff.com/what-is-the-probability-of-a-scrabble-player-making-two-seven-letter-words-on-their-first-two-turns.html

10b. https://www.reddit.com/r/theydidthemath/comments/33noxv/request_i_pull_seven_scrabble_tiles_out_of_the/

10c. https://www.reddit.com/r/theydidthemath/comments/40t22p/request_how_many_ifferent_combination_of/

10d. https://groups.google.com/g/rec.puzzles/c/3dFsrDa9_oE?pli=1

11. https://stats.stackexchange.com/questions/74468/probability-of-drawing-a-given-word-from-a-bag-of-letters-in-scrabble

12. Albert Weissman, Scrabble Players Newspaper (Feb. 1980)