This is pretty much what I did, but I mixed in a regexp to hold the location restrictions, plus a penalty for using the same letter multiple times (e.g. guessing “added” is worse than “aspen” for “a..e.”).
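For anyone curious, here is a rough sketch of how a regexp plus a repeat penalty could fit together. This is not the actual code: the tiny word list, the `score` helper, and the halving discount (`repeat_penalty=0.5`) are all assumptions made up for illustration.

```python
import re
from collections import Counter

def score(word, letter_freq, repeat_penalty=0.5):
    """Sum letter frequencies, discounting each repeated use of a letter.

    The geometric discount is an assumed penalty scheme, not a known one.
    """
    total = 0.0
    seen = Counter()
    for ch in word:
        total += letter_freq.get(ch, 0) * (repeat_penalty ** seen[ch])
        seen[ch] += 1
    return total

# Hypothetical mini dictionary, just enough to show the effect.
words = ["added", "aspen", "ashen", "other"]

# Location restrictions held in a regexp: 'a' in slot 1, 'e' in slot 4.
pattern = re.compile(r"a..e.")
candidates = [w for w in words if pattern.fullmatch(w)]

# Count how many candidate words contain each letter.
letter_freq = Counter(ch for w in candidates for ch in set(w))

best = max(candidates, key=lambda w: score(w, letter_freq))
```

With this toy list, "added" scores lower than "aspen" because its second and third `d` contribute less each time, which is the behaviour described above.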
I do wonder if looking at how a letter splits the space of letters and words would be interesting.
Yeah, I wasn't sure how I wanted to deal with duplicates, so I mostly ignored them. I track letter positions directly (just a bunch of tuples), but don't actually do anything with them other than restricting candidate words.
I think if I work on this some more I'd try to factor in letter positioning when deciding what to guess. My hunch is that it won't make too much of a difference though.
So I tried an experiment using 15,918 five-letter English words. My basic strategy scores a word by summing the frequencies of its candidate letters across the candidate words, where the candidates are determined by a regexp of included and excluded letters (e.g. with `.aves`, `waves` scores 1 but `saves` scores 0, since `s` is already included).
Variations included adding in the frequency of the letter at a particular position, and adding in the frequency of two-letter combinations.
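A minimal sketch of the single-letter scoring with the optional positional term, in case it helps to see it spelled out. The function names (`build_tables`, `score`), the `known` set, and the two-word candidate list are assumptions for illustration, and the two-letter (ngram=2) variant is left out.

```python
from collections import Counter

def build_tables(candidates, known):
    """Count letter and (position, letter) frequencies over the candidates,
    skipping letters that are already known to be in the solution."""
    letter_freq = Counter()
    pos_freq = Counter()
    for word in candidates:
        for ch in set(word) - known:
            letter_freq[ch] += 1
        for i, ch in enumerate(word):
            if ch not in known:
                pos_freq[i, ch] += 1
    return letter_freq, pos_freq

def score(word, letter_freq, pos_freq, posfreq=True):
    """Sum letter frequencies; optionally add the positional frequencies."""
    total = sum(letter_freq[ch] for ch in set(word))
    if posfreq:
        total += sum(pos_freq[i, ch] for i, ch in enumerate(word))
    return total

# The `.aves` example: a, v, e, s are already known.
candidates = ["waves", "saves"]
known = set("aves")
letter_freq, pos_freq = build_tables(candidates, known)
```

With `posfreq=False` this gives `waves` a score of 1 and `saves` a score of 0, matching the `.aves` example; turning `posfreq` on adds one more point to `waves` for having `w` in the first slot.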
Interestingly enough, the winning strategy was using single letters and figuring in the position. Second best was using two letters and position.
ngram=1 posfreq=True mean attempts: 4.34 WinPct 91.280%
ngram=2 posfreq=True mean attempts: 4.35 WinPct 91.186%
ngram=2 posfreq=False mean attempts: 4.37 WinPct 90.074%
ngram=1 posfreq=False mean attempts: 4.38 WinPct 90.445%
Since my base dictionary is way bigger than the Wordle one, I also mixed in a smaller 1,382-word dictionary (google-10000-english.txt) and then combined them by either just sorting by score, or normalizing the scores and then sorting. Normalizing the scores was strictly worse.
normalize=False ngram=1 posfreq=True mean attempts: 4.34 WinPct 91.280%
normalize=True ngram=1 posfreq=True mean attempts: 4.43 WinPct 90.281%
FWIW, the absolute worst one was:
normalize=True ngram=1 posfreq=False mean attempts: 4.43 WinPct 89.835%
There's probably more tuning I can do for the algo, but roughly:
- I took all the words from the site's js as the dictionary.
- From remaining eligible words, compute the letter distribution (ignoring letters you already know are in the solution).
- Find the words that use as many of the most frequent letters as possible.
- Use one of those as a guess.
The goal is essentially to greedily reduce the remaining candidate words as much as possible per guess.
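The steps above could be sketched roughly like this. It is not the actual solver: `feedback`, `next_guess`, `solve`, the `known` set, and the five-word list are hypothetical stand-ins, and filtering candidates by replaying the feedback is just one assumed way to decide which words stay eligible.

```python
from collections import Counter

def feedback(guess, answer):
    """Wordle-style marks: 'g' right spot, 'y' wrong spot, '.' absent."""
    marks = ["."] * len(guess)
    unused = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            marks[i] = "g"
        else:
            unused[a] += 1
    for i, g in enumerate(guess):
        if marks[i] == "." and unused[g] > 0:
            marks[i] = "y"
            unused[g] -= 1
    return "".join(marks)

def next_guess(candidates, known):
    """Pick the candidate covering the most frequent not-yet-known letters."""
    freq = Counter(ch for w in candidates for ch in set(w) - known)
    return max(candidates, key=lambda w: sum(freq[ch] for ch in set(w)))

def solve(answer, words, max_attempts=6):
    candidates = list(words)
    known = set()  # letters already known to be in the solution
    guesses = []
    for _ in range(max_attempts):
        guess = next_guess(candidates, known)
        guesses.append(guess)
        if guess == answer:
            break
        marks = feedback(guess, answer)
        known |= {g for g, m in zip(guess, marks) if m != "."}
        # Keep only words that would have produced the same marks.
        candidates = [w for w in candidates if feedback(guess, w) == marks]
    return guesses

# Hypothetical stand-in for the site's word list.
words = ["crane", "slate", "shale", "stale", "least"]
guesses = solve("stale", words)
```

Each guess here always comes from the remaining eligible words, so the list shrinks by at least one word per attempt and the answer is never filtered out.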