July 3, 2024
"No word begins with a(n) _ and ends with a(n) _" is a claim I occasionally see in memes and short-form videos. Typically, the claim is false and a counterexample can be found in the top comments. In most cases, the creator makes the false claim on purpose to increase engagement, because that's how the internet works.
In this article, I will analyze an English dictionary and corpus data to see which combinations of letters never, or very rarely, start and end words. The data from this analysis is available later on this page.
I'm inviting everyone to adapt this into a short-form video as long as you credit me and follow the guidelines of fair use. It'd be funny.
The data below is derived from all of English usage. Depending on the section, it may contain swear words, racial slurs, and other socially unacceptable words. None of the data is censored, so don't be surprised if you see something weird.
All the analysis below uses data from two sources: the dwyl/english-words word list and the Trillion Word Corpus.
I will refer to the dwyl word list as the dictionary and the trillion-word dataset as the corpus.
Keep in mind that all the data I'm using is pretty old. The Trillion Word Corpus is 18 years old and the dictionary is 10 years old, so some contemporary words aren't in there. For example, "rizz", "goated", and "incel" appear in neither the dictionary nor the corpus.
Whatever is in the dataset.
I'm neither a linguist nor an aristocrat (more of an engineering major living off a Pell Grant), so I'm going for a descriptivist approach here. There are some nuances that I will address later with what is basically an arbitrary decision, but generally I'll go with the dataset unless an entry is true gibberish (arbitrary line).
Since this is a natural language corpus, it contains misspelled words; this shouldn't affect the data too much.
They're not free. There is a version of Webster's on Project Gutenberg, but it'd've taken some time to write a parsing script.
I originally wanted to also use the NASPA Scrabble word list, but it is copyrighted.
I'm not gonna go into detail here because it's incredibly boring, even for me. The analysis is done in a Node.js script that outputs a big glob of JSON, which is then parsed by your browser on this page.
If you choose to run it yourself, you'll need to download the data too. Run it with node and it should output a JSON file. You may also uncomment some of the console.table calls for a console readout.
Also, here is the JSON of the corpus data and the JSON of the dictionary, in case anyone wants them to avoid a parsing headache. Both are GitHub Gist links, and they're a couple of megabytes, so be careful.
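This isn't the actual script, but the core idea fits in a few lines of Node.js. Here is a minimal sketch, assuming the word list is already loaded as an array of strings. The function name `countPairs` is made up for illustration, and so are two of its choices: skipping anything that isn't purely a to z, and counting a one-letter word as both its own start and end. The real script may handle those cases differently.

```javascript
// Hypothetical helper: count words by (first letter, last letter).
// Returns a nested object so that table[first][last] is the word count.
function countPairs(words) {
  const letters = "abcdefghijklmnopqrstuvwxyz";
  const table = {};
  for (const first of letters) {
    table[first] = {};
    for (const last of letters) table[first][last] = 0;
  }
  for (const raw of words) {
    const word = raw.trim().toLowerCase();
    // Assumption: ignore blanks and entries with anything other than a-z.
    if (!/^[a-z]+$/.test(word)) continue;
    table[word[0]][word[word.length - 1]] += 1;
  }
  return table;
}

const sampleWords = ["cobra", "camera", "quiz", "a", "Zebra"];
const sampleTable = countPairs(sampleWords);
console.log(sampleTable.c.a); // 2 ("cobra", "camera")
console.log(sampleTable.q.z); // 1 ("quiz")
```

To try it on a real list, you could read a newline-separated file with `fs.readFileSync(path, "utf8").split(/\r?\n/)` and pass the result in.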
In summary
Anyways, here are the analysis results
For example, in the first table below, the value 1731 at row c, column a means that 1731 words start with c and end with a.
In the ranking tables, the same combination is written as "ca".
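To make the two notations concrete, here is a small sketch of how a row/column table can be flattened into ranked "ca"-style entries. The name `rankPairs` and the tiny demo table are made up for illustration. Only the 1731 comes from the real data; the other numbers are placeholders.

```javascript
// Hypothetical helper: turn table[first][last] into a sorted ranking.
function rankPairs(table) {
  const entries = [];
  for (const [first, row] of Object.entries(table)) {
    for (const [last, count] of Object.entries(row)) {
      entries.push({ pair: first + last, count });
    }
  }
  // Most common first; ties are broken alphabetically so the order is stable.
  entries.sort((x, y) => y.count - x.count || x.pair.localeCompare(y.pair));
  return entries;
}

// Placeholder table: only the c/a value (1731) is from the real dictionary data.
const demoTable = { c: { a: 1731, j: 0 }, q: { a: 12, j: 0 } };
const ranked = rankPairs(demoTable);
console.log(ranked[0].pair);                      // "ca"
console.log(ranked.slice(-2).map((e) => e.pair)); // [ 'cj', 'qj' ]
```

The top of the list corresponds to the most common combinations and the bottom to the rarest ones.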
This page loads about 260 KB of JSON, so it may take a while on a slow connection. This page does not work on Internet Explorer or older browsers.
Based on dwyl/english-words on commit 94dbea5 on June 16, 2024. Number of entries: 370104.
Top and bottom values of above table. Click on the letters, not numbers, for the additional info dialog.
Based on the aforementioned corpus. 333333 words appearing a total of 588124220187 times; truncated from a total of 1.025 trillion words.
Rounded to two decimal places.
Top and bottom values of above table. Click on the letters, not numbers, for the additional info dialog.
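The difference with the corpus is that each word counts as many times as it appears, not just once. Here is a rough sketch of that weighting, assuming the corpus is available as `[word, count]` pairs. The name `weightedShares` is made up, and expressing each combination as a percentage of all occurrences, rounded to two decimal places, is my assumption for the example and not necessarily the unit used in the table.

```javascript
// Hypothetical helper: share of total occurrences per (first, last) pair,
// as a percentage rounded to two decimal places.
function weightedShares(entries) {
  const totals = new Map();
  let grandTotal = 0;
  for (const [word, count] of entries) {
    const w = word.toLowerCase();
    // Assumption: ignore entries with anything other than a-z.
    if (!/^[a-z]+$/.test(w)) continue;
    const pair = w[0] + w[w.length - 1];
    totals.set(pair, (totals.get(pair) || 0) + count);
    grandTotal += count;
  }
  const shares = {};
  for (const [pair, total] of totals) {
    shares[pair] = Math.round((total / grandTotal) * 10000) / 100;
  }
  return shares;
}

// Made-up counts, just to show the shape of the result.
const demoShares = weightedShares([["the", 60], ["tree", 15], ["cobra", 20], ["qj", 5]]);
console.log(demoShares); // { te: 75, ca: 20, qj: 5 }
```

Plain JavaScript numbers are fine here: the corpus total of 588124220187 is well below `Number.MAX_SAFE_INTEGER`, so the sums stay exact.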
Again, only the top 333333 unique words
Top and bottom values of above table. Click on the letters, not numbers, for the additional info dialog.
There are several rare combinations of beginning and ending letters. Most of the less common ones are occupied by phonetic transliterations from other languages, typically non-Germanic ones. In the end, it seems like Q...J is the least used combination, with no entries in the dictionary and only one in the corpus: "qj", which seems to come from an abbreviation of Qianjiang Motorcycle, a Chinese motorcycle company. If the only word is an abbreviation of a proper name in a foreign language, I feel like I can crown Q...J as the least used starting and ending letter combination, earning the title... of this page, I guess.
This was a pretty cool small project. The results probably have no value outside of awe. You're welcome to check back every time you see a meme or short-form video with these claims. I'd encourage you to download the code and explore the data if this is something that interests you. I found some pretty interesting concepts while researching these unusual words.
Special thanks to all the providers of the data.
Write an email to me if you have any questions, comments, or concerns. Look for a link to the address on my site homepage.