Beginning the Great Wordlist Project

Historically, I haven’t really believed in New Year’s Resolutions (I’m terrified of “saving” potential activities/improvement plans for the new year and then discovering that the passion has disappeared).

But that’s the old me! That’s right, I’m newly Resolute™. However, instead of embarking on something like the 75 Hard Challenge or Dry January, which could potentially improve my entire life, I decided on a less impactful and even more painful idea: Fixing my crossword wordlist.

The Reason

Ever since crossword construction became a computer-based venture, one major differentiator has been wordlists. In Matthew Gaffney’s 2006 book Gridlock, he describes a time of fierce possessiveness when it came to wordlists – including boasts about how there’s a 5-letter word you’ve “collected” that your rival constructor doesn’t have. More than a decade later, in Adrienne Raphael’s 2021 Thinking Inside the Box, she writes that “seasoned constructors often load custom word-hoards into the [puzzle constructing] program, which they guard as jealously as dragons protecting treasure.”

Luckily, I think we now live in a more generous time for constructing crosswords, with a lot more resources. Even without considering paid options (the most popular of which is likely Jeff Chen’s list on XWordInfo), you’ve got options: Most constructing software comes with its own wordlists; plus there are incredible resources like Spread the Word(list) (STWL) and Peter Broda’s Wordlist, as well as more niche wordlist supplements like the Expanded Crossword Name Directory.

However, there are some hurdles to clear. Firstly, there’s a standardization problem if you try to combine lists. Wordlists score each word – you set a minimum “word score” when you try to fill a crossword. But Spread the Word(list), for example, is on a scale of 0 to 50; Broda’s list is 0 to 100; and the ECND has everything set at 64. So a 60 in Broda’s list would be a mid-tier entry by his scale, but it’s higher than anything in Spread the Word(list).
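To make the mismatch concrete, here is a minimal sketch of one way to put two lists on a common footing: a plain linear rescale. The function name `rescale` and the linear mapping itself are assumptions for illustration; none of the lists mentioned above prescribes this, and a straight rescale ignores the fact that different list-makers may distribute their scores differently.

```python
def rescale(score: int, src_max: int, dst_max: int = 50) -> int:
    """Linearly map a score from a 0..src_max scale onto a 0..dst_max scale.

    Hypothetical helper: the linear mapping is an assumption, not a
    convention defined by any particular wordlist.
    """
    if src_max <= 0:
        raise ValueError("src_max must be positive")
    return round(score * dst_max / src_max)


# A 60 on a 0-to-100 list becomes a 30 on a 0-to-50 scale.
print(rescale(60, 100))
```

Under this mapping, that “mid-tier” 60 no longer outranks every entry in a 0-to-50 list.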

Perhaps more importantly – these lists aren’t perfect. Because wordlists often utilize some sort of automated data-scrubbing process (based on usage popularity, both in the world and in crosswords), certain words get incorrectly up-scored or down-scored. And these errors aren’t rare: I’ve been constructing with pre-made wordlists for years, and inevitably I get offered seemingly “high-scoring” words that I’d never actually include in a crossword.

There’s also an issue of taste. Even if I knew that a human had looked at every single word in a wordlist, how could I ensure their judgement was the same as mine? Maybe the wordlist-maker hates all references to sports, or thinks that obscure airport codes are common knowledge, or has a particular fetish for Roman numerals. How can I be sure that my values are the same as the wordlist-maker’s?

The answer: There’s no way to be sure – not unless you manually comb through the entries yourself.

So I decided to manually comb through the entries myself.

Initial Decisions

I’ve previously (and unsuccessfully) tried to do this rescoring project in the .txt file that my wordlist lives in. It was a hateful process; so for my second try, I used the “Word Browser” window of my construction software of choice, Ingrid.

The window in question, where you can right-click words for a “rescore” option. At the top, you can narrow down the list of words to only include those of a specific length, which comes in handy as you’ll see.

My three priorities for rescoring were as follows:

Standardization – How can I ensure that at the end of my endeavor, all of the words are on the same scale?
Continuity – How can I do this in such a way that I can keep making crosswords even if I’m in the middle of rescoring?
High Value Effort – How do I optimize so that my crosswords improve in the least amount of time?

For standardization, I ultimately used a 0 to 50ish scale to be most compatible with Spread the Word(list) (STWL), which has been my primary list for the past several years. It’s the list I’m most familiar with, and I also think it’s generally the “cleanest” list – I know that entries at 50 are generally good, entries at 40 are a mixed bag, and entries at 30 and lower are more like throwing darts blindfolded.

This decision also helped with continuity: Because I was making my list based on my main list (rather than creating, say, a new 0 to 1000 point scale), I knew that I could always make crosswords using 1) my personal in-progress wordlist as the “priority” list, and 2) Spread the Word(list) as a supplemental list. (In constructing software, you can use multiple wordlists and also “rank” them; the software will score a word based on what the highest-ranking list with that word scores it.) Therefore, I knew that the resulting list – made from the combination of those two ranked lists – would always be better attuned to my tastes than Spread the Word(list) on its own.
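The ranking behavior described above can be sketched in a few lines. This is only an illustration of the idea, not how Ingrid or any other program implements it; the function name `resolve_score` and the sample words and scores are made up.

```python
from typing import Optional


def resolve_score(word: str, ranked_lists: list[dict[str, int]]) -> Optional[int]:
    """Return the word's score from the highest-ranked list that contains it.

    `ranked_lists` is ordered from highest priority to lowest.
    Returns None if no list contains the word.
    """
    for wordlist in ranked_lists:
        if word in wordlist:
            return wordlist[word]
    return None


# Made-up sample data: a personal in-progress list ranked above a supplemental list.
personal = {"ORE": 51, "XII": 39}
supplemental = {"ORE": 50, "XII": 50, "CAT": 50}
ranked = [personal, supplemental]

print(resolve_score("XII", ranked))  # 39: the personal list wins
print(resolve_score("CAT", ranked))  # 50: falls through to the supplemental list
```

Words I have already rescored take my score, and everything else falls back to the supplemental list, which is why the combination is usable at any point during the project.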

*A look at Ingrid’s options for wordlist “rankings”*

As for high value effort, I decided to start by rescoring all of the 3-letter words. I think improving this word pool is a lot of bang for buck – 3-letter words show up very often, and it’s also a pretty small population of words (*only* about 5,000 words total).

From there, it was pretty simple – just going through all of the 3-letter words in multiple lists, essentially doing mind-numbing data entry by choice for hours spread across multiple weeks. I worked from both the front and back of the alphabet in different sessions, which somehow helped me feel like I was sufficiently changing things up.

Standards of Care

As for evaluating and scoring words, there are some things I focused on (which I’ll probably write about in more depth at a later date). But the main dilemma at hand was this: Is it more important for a wordlist to have only “good” entries, or for it to have a wider range of quality in order to allow for more fill options? It’s definitely a balance, and one that I decided and re-decided throughout the process.

Initially, I tried to create a “grading rubric” for words – unit abbreviations were scored A if they fit B criteria, sports team acronyms were always scored C unless D was true. As a result of this method, I have about a 3-page grading rubric that I was using, with guidelines for “partial phrases” and “airport codes” and “cities with more than 200k people.”

But two things happened. First, the rubric approach was exhausting – I felt like I was constantly referring to my grading sheet and also adding new categories whenever they appeared. (How am I evaluating country abbreviations and compass directions?)

Second, I realized that the process didn’t really sacrifice any quality by taking a more “vibes-based” approach, where I judged words more on how they made me feel rather than their detailed categorization. It was also a faster method – I trusted my intuition to be mostly correct, and as a result I didn’t have to “look up” the word’s category in my rubric document.

As a result, my “guide” became a lot shorter. Here’s the scoring categorization that I ultimately decided on as a cheat-sheet (with the assumption that I am usually constructing with a minimum score of 50):

55: Would fit in any puzzle
51: Legitimate fill but 2nd-tier. Or alternatively: entries I won’t say yes to unconditionally, but which I want to at least consider as options when I’m filling a grid
45: Somewhat of a concession – either legitimate but niche & joyless entry, or unnaturalness like in rarer abbreviations & clear crosswordese
39: Bottom of the barrel concessions such as Roman numerals
31: Maybe could consider using in the far off future, but for now is too obscure
21: Probably would not use unless an entirely new entity has this name
11: Stuff with Arabic numerals (e.g. C3PO)
1: Offensive/avoid at all costs

You’ll notice that all of these scores end with non-zero digits, and that’s by design. I wanted to easily be able to differentiate between rescored entries and automatic-wordlist entries as I was constructing. Since STWL’s scores are all multiples of 10, I now know that any score not ending in 0 is a word that I’ve manually scored.
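As a rough sketch, the last-digit convention makes manually scored entries easy to pick out with a script. The `WORD;score` line format below is an assumption about how the .txt file is laid out, and the helper names and sample entries are hypothetical.

```python
def parse_entry(line: str) -> tuple[str, int]:
    """Split one 'WORD;score' line into (WORD, score). The format is assumed."""
    word, score = line.strip().split(";")
    return word.upper(), int(score)


def is_manually_scored(score: int) -> bool:
    """Scores that are not multiples of 10 mark entries rescored by hand."""
    return score % 10 != 0


# Made-up sample lines for illustration.
lines = ["ore;51", "cat;50", "xii;39"]
manual = [word for word, score in map(parse_entry, lines) if is_manually_scored(score)]
print(manual)  # ['ORE', 'XII']
```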

I also like how this system gives me some flexibility to trade fillability for cleanliness. If I’m in somewhat dire straits, I’ll lower the minimum score to 45 rather than 50 in order to consider some of those “somewhat of a concession” entries. Going to 39 is more like a “Break Glass In Case of Disaster.” Plus, if I want to tweak the numerical values in large groups, I can (for example) just find all of my entries scored 45 and change them in bulk to a new number.
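Both ideas, lowering the minimum score and retuning a whole tier at once, are simple to sketch. The function names and sample entries below are hypothetical, and construction software does the filtering for you; this just shows the logic.

```python
def fill_candidates(wordlist: dict[str, int], min_score: int) -> list[str]:
    """Return the words scoring at or above min_score, alphabetically."""
    return sorted(word for word, score in wordlist.items() if score >= min_score)


def bulk_rescore(wordlist: dict[str, int], old: int, new: int) -> dict[str, int]:
    """Return a copy of the list with every `old` score changed to `new`."""
    return {word: (new if score == old else score) for word, score in wordlist.items()}


# Made-up sample entries using the tiers from the cheat-sheet.
words = {"CAT": 55, "ORE": 51, "ENE": 45, "XII": 39}

print(fill_candidates(words, 50))  # ['CAT', 'ORE']
print(fill_candidates(words, 45))  # ['CAT', 'ENE', 'ORE']

retuned = bulk_rescore(words, old=45, new=47)
print(retuned["ENE"])  # 47
```

Choosing a new value that still doesn’t end in 0 (47 here) keeps the retuned tier recognizable as manually scored.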

You might be wondering: What’s the use of rescoring words to lower numbers – especially entries with Arabic numerals (which I basically never use), offensive words, and entries that don’t really have a meaningful definition – rather than just deleting them from the wordlist altogether? The answer differs: For the entries with numerals, I figure that at least they’re all in one place so that if I need a list of number entries I can just find my list of 11-scored words. The “offensive words” list helps me confirm that I’ve actively evaluated and rejected an entry rather than mistakenly omitted it – useful for words that you might see used sometimes but have etymology/connotations that I’m uncomfortable including in a puzzle (e.g. GYP). For the nonsense entries, the reason is that I have hoarder tendencies and it’s difficult to throw out garbage that might someday (see: never) be useful.

Current Status & Lessons Learned

Finally, after about a month, I’ve finished rescoring all ~5,000 3-letter words!

Did it help? I took the newly scored list out for a test drive, and I do think it improved my quality of constructing life. I felt like I could trust my 3-letter words a lot more; in the past, there have been moments where I’ve completed a grid draft, only to learn that one horrible small entry has snuck in and the entire puzzle needs to be dug up.

It’s also an interesting exercise in introspection and taste. I have a better sense of the types of words I’m drawn to and not – for instance, I was definitely more lenient towards internet/texting abbreviations than towards acronyms for organizations & government programs. I’d like to remain aware of these sorts of biases when I’m constructing, which might affect the way I construct for different audiences/solver groups.

Importantly (and a fact I had to make peace with while rescoring), the goal isn’t to “automatize” my 3-letter words. I’m still doing critical thinking, like deciding that certain words don’t work in a spot because of the context, crossings, etc. (For example: I try to avoid proper name- or abbreviation-heavy sections of the grid, even when all the entries are theoretically “good” entries.) But the decisions I have to make are definitely fewer and also generally more impactful, which feels more efficient in terms of my mental energy.

For next steps, I’m looking at the 4-letter words. It’s a much bigger project (my current count is at around 15,000 words – 3x the 3-letter words). But I think having both the 3- and 4-letter words in a pretty good place will provide huge returns for my construction quality as a whole. After that, the path is a little unclear. I could continue the pattern with 5- and 6-letter words, which I think would be great for filling themed puzzles and themelesses; but I could also switch things up and try to cull through/add a lot of exciting longer entries, which would be extremely useful for themelesses and maybe in fringe theme-finding cases for themed puzzles.

Ultimately, I’m glad I’ve started rescoring, though I’m not necessarily mad that it’s taken this long. I had put off The Great Wordlist Project for many years, because it seemed extremely mind-numbing and I wanted to prioritize the fun stuff, i.e. making puzzles. I still don’t think a wordlist overhaul is necessary for most hobbyist constructors to make banger puzzles – the resources that exist get the job done pretty well already. I’d also argue that, by creating crosswords without my own wordlist for such a long time, I’ve drastically improved and solidified my word-evaluating judgement to the point where I can do this task adequately and somewhat quickly.

That’s all I’ve got! Let me know what you think – how you view your own wordlist, whether you have your own hacks/strategies for cleaning/maintaining. And look out for a future post describing the (many) mistakes I made during this process!
