Text Coverage & DCC’s Top 1000

**Updated 2.25.21 with details from this post**

The DCC frequency list is often consulted for choosing which words to use when writing Latin for students. It certainly makes sense to use ones they might encounter over and over again vs. those they might not, but *how* frequent are these frequent words? In particular, I was curious what a student could probably read having acquired the Top 1000 words on DCC’s list. Here’s some quick background…

Text Coverage
In vocab studies, there’s this idea of coverage. My takeaway quote has been “the Top 1000 words in any language should get you 80% text coverage.” That certainly sounds nice, but it turns out that 80% text coverage is real wonky. Still, of the languages analyzed, that figure holds up pretty well. Paul Nation is a big name in vocab. He wrote in 2013 that for English learners with a vocabulary of fewer than 2000 words, reading unsimplified texts is way too difficult. Now, that’s already twice as many words as are on the DCC list, and let’s not even compare English to the problem we have with Distinguished level Latin. Back to Nation: in 2006, he found that each subsequent 1000 words brings sharply diminishing returns. So, if you know 1000 words, you’re gonna get around 80% text coverage. If you know 2000, though, your coverage goes up only slightly to something like 89%. Tack on another grand for a total vocabulary of 3000 words, and you likely have 90-95% coverage, which is still not the ideal 98% needed for reading. Of course, if we apply this to a high school language student, it would mean they’d have to learn 1,000 words each year in order to start reading the highest level texts their senior year. That figure is unquestionably high for language learners.
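To put rough numbers on those diminishing returns, here is a quick sketch using the coverage figures quoted above. The 92.5% value for 3000 words is only the midpoint of the 90-95% range, assumed here for illustration, and the function name is made up for this example.

```python
# Approximate cumulative text coverage (%) by vocabulary size, as quoted above.
# ASSUMPTION: 92.5 is just the midpoint of the quoted 90-95% range.
coverage_by_vocab = {1000: 80.0, 2000: 89.0, 3000: 92.5}


def marginal_gains(coverage):
    """Return the coverage points gained by each successive band of words."""
    gains = {}
    previous = 0.0
    for size in sorted(coverage):
        gains[size] = coverage[size] - previous
        previous = coverage[size]
    return gains


gains = marginal_gains(coverage_by_vocab)
for size, gain in gains.items():
    print(f"words up to {size}: +{gain:.1f} points of coverage")
```

On these figures, the first 1000 words do most of the work, and each later band of 1000 adds only a few points.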

Students’ Latin Vocabulary
Reports vary significantly on how many words to reasonably expect a Latin student to…”learn” or “know”…each year. There are a lot of factors that go into learning and knowing words (9 aspects according to Nation et al.). Whatever that reality is, and whatever can be meant by the terms, teachers estimate between 175 and 500 words per year. That is, the lowest estimates would mean that a student will know well under 1000 words by the end of high school, and the highest estimates would mean that students will know something like 2000. I have my doubts about those highest figures, for sure, but let’s pretend that’s true. A student starting their fourth year of Latin will know 50% of the vocab from what’s expected to be read for AP Latin, which is impossibly low according to all the research, as well as just common sense. If you don’t know what every other word means, you’re gonna have problems. Still, let’s toss all that data and logic right out the window, and just look at what a student who knows the DCC Top 1000 Latin words can be expected to read:

OK, so the student who knows all the Latin on DCC’s Top 1000 list will have a text coverage of 58% when trying to read Ennius. After redoing the analysis to include total tokens and account for words repeating, the text coverage didn’t go up much. That’s because while there were quite a few words on the DCC list that did repeat in Ennius, there were also words outside of the DCC list that repeated as well. Now, had the work been written with a LOT of repeated words from the DCC list, the coverage would have been much higher. So, 58% text coverage. That’s not good. It just so happens that Ennius’ fragments amount to about 1000 total words of Latin. What are the chances all the DCC list vocab is found in there, and/or repeats often? Low, right? Sure, now let’s go straight to the AP syllabus because those texts are much longer and have more vocab. If the DCC frequency list was generated from texts that include Caesar, Virgil, and lots of other authors, then numbers should be higher. Instead of text coverage now, I’m just showing what vocab from the DCC list is found in Caesar. Given the Ennius experiment, I’m not hopeful an analysis would show much higher text coverage numbers to warrant the effort. Besides, if 100% text coverage doesn’t guarantee high levels of comprehension anyway, no need to focus on that metric to see how it’s mission insurmountable for reading these authors.
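For anyone who wants to try this kind of count themselves, here is a minimal sketch of the two measures mentioned above: coverage by distinct words, and coverage by total tokens with repeats counted. It is not the method behind the numbers in this post. The word list and text are hypothetical toy data, and the sketch assumes the text has already been reduced to dictionary headwords, which for an inflected language like Latin is a big step in its own right.

```python
from collections import Counter


def coverage(tokens, known_words):
    """Return (type_coverage, token_coverage) as fractions between 0 and 1.

    type coverage:  share of distinct words in the text that are known.
    token coverage: share of running words (repeats counted) that are known.
    """
    counts = Counter(tokens)
    if not counts:
        return 0.0, 0.0  # empty text: nothing to cover
    known_types = [word for word in counts if word in known_words]
    type_cov = len(known_types) / len(counts)
    token_cov = sum(counts[word] for word in known_types) / sum(counts.values())
    return type_cov, token_cov


# HYPOTHETICAL toy data: already-lemmatized tokens and a tiny "known" list.
known = {"et", "sum", "in", "rex"}
text = ["et", "rex", "et", "musa", "in", "silva", "et", "silva"]

type_cov, token_cov = coverage(text, known)
print(f"type coverage:  {type_cov:.1%}")
print(f"token coverage: {token_cov:.1%}")
```

In the toy text, a known word repeats but so does an unknown one, so the token figure ends up only slightly above the type figure, which is the same effect described for Ennius.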

Well, not so great. The student who knows all the Latin on DCC’s Top 1000 list will know 45% of the words in Caesar. Maybe there are more in Virgil:

Yikes! The student who knows all Top 1000 words has just 38% of the words in the Virgil AP selections. That’s horribly low. What I find interesting is that Virgil uses a third more vocab than Caesar (roughly 1500 to 1100 words), yet the number of Top 1000 words remains pretty close between the two: 512 in Caesar, and 584 in Virgil. Can the Top 1000 words get students much more than 50% of the vocab in any one single unadapted ancient Latin text?! It looks like “no way,” at least with these three samples. A sample size that small is hardly enough to make a sweeping claim, I know, but I’m not writing a dissertation or anything. I do have a job and it’s not counting up a bunch of Latin words. Still, it’s very odd that Ennius, an author almost no one prioritizes, includes a higher percentage of DCC’s Top 1000 than two of the most commonly and highly prioritized authors, Caesar and Virgil. It’s been suggested that could be from so many more one-offs in the latter texts. Probably.
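As a rough sanity check, the counts and percentages above can be turned back into approximate vocabulary sizes for each selection. This assumes the 45% and 38% figures are shares of distinct words (not running words) and have been rounded, so the results are ballpark only. The function name is made up for this sketch.

```python
def implied_total(known_count, known_share):
    """Estimate total distinct words from a known-word count and its share."""
    return known_count / known_share


# Figures quoted above: Top 1000 words found, and the share they represent.
caesar_total = implied_total(512, 0.45)
virgil_total = implied_total(584, 0.38)
ratio = virgil_total / caesar_total

print(f"Caesar: about {caesar_total:.0f} distinct words")
print(f"Virgil: about {virgil_total:.0f} distinct words")
print(f"Virgil / Caesar: {ratio:.2f}")
```

On these assumptions Virgil comes out at about a third more vocabulary than Caesar, while the count of Top 1000 words differs by only 72.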

High Frequency: The Concept
Of course, high frequency is a concept based on words used often in a particular context. The DCC list gets students far less than 80% text coverage to read Ennius, and the contexts of Caesar and Virgil are so different that lower percentages of vocab are found in each. Nation studied English, although the figures are close with other languages. Besides, 80% is far too low for reading as it is. I was asked how things might play out using the 1425 words included in “Essential Latin Vocabulary” by Mark A.E. Williams. I have no idea since that’d have to be purchased, but got a tip that it’s basically most of the words on this list. I do know that 1425 is a LOT more than 1000. The description of Mark Williams’s book includes a lot of claims that really ought to be checked, like how the list of 1425 “allows a student to comprehend about 95 percent of all the vocabulary they will ever see in an actual Latin text.” We should also consider what the claims mean, if true. Consider the “learn 25 words and have a text coverage of 29%” claim. That’s just laughable knowing how hard reading gets at 80%. So, if that list of 1425 words really does “allow a student to comprehend,” I wonder if there’s a qualifier in there, too. If not, and the work holds up completely, I’d say it should replace the DCC list when choosing vocab.
