  • Hey, Vsauce. Michael here. About 6 percent of everything you say and read and write is

  • the...

  • "the" - is the most used word in the English language. About one out of every 16 words we encounter on a daily basis is "the."

  • The top 20 most common English words

  • in order are "the," "of," "and," "to," "a," "in," "is," "I," "that," "it," "for," "you,"

  • "was," "with," "on," "as," "have," "but," "be," "they."

  • That's a fun fact. A piece of trivia but it's also more. You see, whether the most commonly used words are ranked across an entire language,

  • or in just one book or article, almost every time a bizarre pattern emerges.

  • The second most used word will appear about half as often as the most used.

  • The third one third as often. The fourth one fourth as often. The fifth one fifth as often.

  • The sixth one sixth as often, and so on all the way down.

  • Seriously. For some reason, the number of times a word is used is just proportional to one over its rank.

  • Word frequency and ranking on a log log graph follow a nice straight line.

  • A power law. This phenomenon is called Zipf's law

  • and it doesn't only apply to English. It also applies to other languages, like, well, all of them.

  • Even ancient languages we haven't been able to translate yet.
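
As a rough sketch of the one-over-rank rule described above: if the word at rank r shows up top_count / r times, then log(count) plotted against log(rank) is a straight line with slope -1. The names `zipf_frequencies` and `loglog_slope` and the numbers 1000 and 50 are made up for illustration; real word counts only approximate this ideal line.

```python
import math

def zipf_frequencies(top_count, n_ranks):
    """Ideal Zipf counts: the word at rank r appears top_count / r times."""
    return [top_count / r for r in range(1, n_ranks + 1)]

def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank)."""
    xs = [math.log(r) for r in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Hypothetical corpus: top word seen 1000 times, 50 ranked words.
counts = zipf_frequencies(top_count=1000.0, n_ranks=50)
slope = loglog_slope(counts)
print(counts[:4])        # second word half as often, third a third as often, ...
print(round(slope, 6))   # slope of the log-log line
```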

  • And here's the thing. We have no idea why. It's surprising that something as complex as reality

  • should be conveyed by something as creative as language in such a predictable way.

  • How predictable? Well, watch this. According to WordCount.org,

  • which ranks words as found in the British National Corpus, "sauce" is the

  • 5,555th most common English word. Now, here is a list of how many times

  • every word on Wikipedia and in the entire Gutenberg Corpus of tens of

  • thousands of public domain books shows up. The most used word, "the", shows up about

  • 181 million times. Knowing these two things, we can estimate that the word

  • "sauce" should appear about thirty thousand times on Wikipedia and

  • Gutenberg combined. And it pretty much does.
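
The estimate above is just the top word's count divided by the rank. Here is a minimal sketch using the two figures quoted in the transcript; the function name `zipf_estimate` is invented for illustration, and the calculation assumes an ideal one-over-rank relationship.

```python
def zipf_estimate(top_count, rank):
    """Expected count of the word at `rank` if counts fall off as 1 / rank."""
    return top_count / rank

THE_COUNT = 181_000_000   # occurrences of "the" (figure quoted above)
SAUCE_RANK = 5_555        # rank of "sauce" (figure quoted above)

estimate = zipf_estimate(THE_COUNT, SAUCE_RANK)
print(f"{estimate:,.0f}")  # on the order of thirty thousand
```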

  • What gives? The world is chaotic. Things are distributed in a myriad of ways, not just

  • power laws. And language is personal,

  • intentional, idiosyncratic. What about the world and ourselves could cause such

  • complex activities and behaviors to follow such a basic rule? We literally

  • don't know. More than a century of research has yet to close the case.

  • Moreover, Zipf's law doesn't just mysteriously describe word use. It's

  • also found in city populations, solar flare intensities, protein sequences and

  • immune receptors, the amount of traffic websites get, earthquake magnitudes, the

  • number of times academic papers are cited, last names, the firing patterns of

  • neural networks, ingredients used in cookbooks, the number of phone calls

  • people received, the diameter of Moon craters, the number of people that die

  • in wars, the popularity of opening chess moves, even the rate at which we forget.

  • There are plenty of theories about why language is 'zipf-y,' but no firm conclusions

  • and this video doesn't contain a definite explanation either. Sorry, I know

  • that's a bummer, since we appear to like knowing more than mystery. But that said,

  • we also ask more than we answer. So let's dive into Zipf's ramifications, some

  • related patterns, some possible explanations and the depth of the

  • mystery itself. Zipf's law was popularized by George Zipf,

  • a linguist at Harvard University. It is a discrete form of the continuous Pareto

  • distribution from which we get the Pareto Principle. Because so many

  • real-world processes behave this way, the Pareto Principle tells us that, as a rule

  • of thumb, it's worth assuming that 20% of the causes are responsible for 80% of

  • the outcome,

  • like in language, where the most frequently used 18 percent of words

  • account for over 80% of word occurrences. In 1896, Vilfredo Pareto showed that

  • approximately 80% of the land in Italy was owned by just twenty percent of the

  • population. It is said that he later noticed in his garden 20 percent of his

  • pea pods contained eighty percent of the peas. He and other researchers looked at

  • other datasets and found that this 80-20 imbalance comes up a lot in the world.
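
The language figure mentioned a moment ago (the top 18 percent of words covering over 80% of occurrences) can be sanity-checked against an ideal Zipf distribution. This is only a sketch: the vocabulary size of 100,000 is an assumption chosen for illustration, `top_share` is a made-up helper, and the answer shifts somewhat with the vocabulary size you pick.

```python
def top_share(vocab_size, top_fraction):
    """Share of all word occurrences taken by the most frequent words,
    assuming the word at rank r has weight 1 / r (ideal Zipf)."""
    weights = [1.0 / r for r in range(1, vocab_size + 1)]
    cutoff = round(vocab_size * top_fraction)
    return sum(weights[:cutoff]) / sum(weights)

# Assumed vocabulary of 100,000 distinct words.
share = top_share(vocab_size=100_000, top_fraction=0.18)
print(f"{share:.1%}")
```

Under these assumptions the share lands in the mid-80s percent, which is at least in line with the "over 80%" quoted for language.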

  • The richest 20% of humans have 82.7% of the world's income. In the US, 20% of

  • patients use eighty percent of health care resources. In 2002, Microsoft

  • reported that 80% of the errors and crashes in Windows and Office are caused

  • by 20% of the bugs detected. A common rule of thumb in the business world

  • states that 20% of your customers are responsible for 80% of your profits and

  • eighty percent of the complaints you receive will come from 20% of your

  • customers. A book titled "The 80/20 Principle" even says that in a home or

  • office,

  • 20% of the carpet receives 80 percent of the wear. Oh, and as Woody Allen famously

  • said, "eighty percent of success is just showing up." The Pareto Principle is

  • everywhere, which is good.

  • By focusing on just 20 percent of what's wrong, you can often expect to solve

  • eighty percent of the problems. A variety of different unrelated factors cause

  • this to be true from case to case, but if we can get to the bottom of what causes

  • some of them,

  • maybe we'll find that one or more of those mechanisms is responsible for

  • Zipf's law in language. George Zipf himself thought language's interesting rank

  • frequency distribution was a consequence of the Principle of Least Effort.

  • The tendency for life and things to follow the path of least resistance. Zipf believed

  • it drove much of human behavior and hypothesized that as language developed

  • in our species, speakers naturally preferred drawing from as few words as

  • possible to get their thoughts out there. It was easier. But in order to understand

  • what was being said,

  • listeners preferred larger vocabularies that gave more specificity, so that they

  • had to do less work. The compromise between listening and speaking, Zipf felt,

  • led to the current state of language. A few words are used often and many many

  • many words are used rarely.

  • Recent papers have suggested that having a few short, often used, predictable words

  • helps dissipate information load density on listeners, spacing out important vocab

  • so that the information rate is more constant. This makes sense and much has

  • been learned by applying the least effort principle to other behaviors, but

  • later researchers argued that for language, the explanation was even

  • simpler. Just a few years after Zipf's seminal paper, Benoit Mandelbrot showed

  • that there may be nothing mysterious about Zipf's law at all, because even if you

  • just randomly type on a keyboard you will produce words distributed according

  • to Zipf's law. It's a pretty cool point and this is why it happens. There are

  • exponentially more different long words than short words. For instance, the English

  • alphabet can be used to make 26 one letter words, but 26 squared 2 letter

  • words. Also, in random typing, whenever the space bar is pressed a word terminates.

  • Since there's always a certain chance that the space bar will be pressed, longer

  • stretches of time before it happens

  • are exponentially less likely than shorter ones. The combination of these

  • exponentials is pretty 'Zipfy.' For example, if all 26 letters and the

  • spacebar are equally likely to be typed, after a letter is typed and a word has

  • begun, the probability that the next input will be a space, thus creating a

  • one letter word, is just one in 27. And sure enough, if you randomly generate

  • characters or hire a proverbial typing monkey, about one out of every 27 or 3.7

  • percent of the stuff between spaces, will be single letters. Two letter words

  • appear when after beginning a word any character but the space bar is hit - a 26

  • in 27 chance and then the space bar.
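
If you don't have a typing monkey handy, the one-in-27 claim can be checked with a quick simulation. This is a sketch under the same assumption stated above (26 letters and the space bar all equally likely); `random_typing_words` is a hypothetical helper, and runs of consecutive spaces are simply ignored so that only words that have actually begun are counted.

```python
import random
import string

def random_typing_words(n_chars, seed=0):
    """Type n_chars random keys and return the words between spaces."""
    rng = random.Random(seed)
    keys = string.ascii_lowercase + " "   # 26 letters plus the space bar
    text = "".join(rng.choices(keys, k=n_chars))
    return text.split()                   # split() drops empty strings from repeated spaces

words = random_typing_words(500_000, seed=1)
single = sum(1 for w in words if len(w) == 1) / len(words)
double = sum(1 for w in words if len(w) == 2) / len(words)

print(f"one-letter words: {single:.3f} (theory {1 / 27:.3f})")
print(f"two-letter words: {double:.3f} (theory {26 / 27 / 27:.3f})")
```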

  • A three-letter word is the probability of a letter, another letter and then a

  • space. If we divide by the number of unique words of each length there can be,

  • we get the frequency of occurrence expected for any particular word given

  • its length. For example, the letter V will make up about 0.142 percent of

  • random typing. The word "Vsauce," about 0.00000000993 percent. Longer words are

  • less likely, but watch this. Let's spread these frequencies out according to the

  • ranks they'd take up on a most often used list. There are 26 possible one

  • letter words, so each of the top 26 ranked words is expected to occur

  • about this often. The next 676 ranks will be taken up

  • by two letter words that show up about this often. If we extend each frequency

  • according to how many members it has, we get Zipf. Subsequent researchers have

  • detailed how changing up the initial conditions can smooth the steps out.

  • Our mysterious distribution has been created out of nothing but the inevitabilities

  • of math.
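
The staircase just described can be written down directly. This is a sketch assuming 26 equally likely letters plus an equally likely space bar; the helper names are invented for illustration. A word of length n is typed with probability (26/27)^(n-1) × (1/27) once a word has begun, and that is shared among 26^n possible words. For "v" this gives about 0.142 percent, and a six-letter word works out to roughly 9.9e-9 percent under this model. Since the frequency falls like 27^(-n) while the rank grows like 26^n, the steps follow a power law with exponent ln 27 / ln 26, which is very close to 1.

```python
import math

N_LETTERS = 26  # assumption: all letters and the space bar are equally likely

def word_length_probability(length):
    """Chance that a word, once begun, ends up exactly `length` letters long."""
    p_space = 1 / (N_LETTERS + 1)
    p_letter = N_LETTERS / (N_LETTERS + 1)
    return p_letter ** (length - 1) * p_space

def specific_word_probability(word):
    """Expected share of random-typing words equal to one particular word."""
    return word_length_probability(len(word)) / N_LETTERS ** len(word)

def rank_block(length):
    """First and last rank taken up by all words of this length."""
    first = sum(N_LETTERS ** k for k in range(1, length)) + 1
    last = first + N_LETTERS ** length - 1
    return first, last

for length in range(1, 5):
    first, last = rank_block(length)
    freq = specific_word_probability("a" * length)
    print(f"ranks {first}-{last}: each word is about {freq:.3e} of all words")

exponent = math.log(N_LETTERS + 1) / math.log(N_LETTERS)
print(f"power-law exponent of the staircase: about {exponent:.3f}")
print(f"'v': {specific_word_probability('v'):.3%}")
```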

  • So maybe there is no mystery. Maybe words are just the result of humans randomly

  • segmenting the observable world and the mental world into labels and Zipf's law

  • describes what naturally happens when you do that. Case closed.

  • And as always,

  • thanks for... wait a minute!

  • Actual language is very different from random typing. Communication is

  • deterministic to a certain extent. Utterances and topics arrive based on

  • what was said before. And the vocabulary we have to work with certainly isn't the

  • result of purely random naming. For example, the monkey typing model can't

  • explain why even the names of the elements, the planets and the days of the

  • week are used in language according to Zipf's law. Sets like these are constrained

  • by the natural world and they're not the result of us randomly segmenting the

  • world into labels. Furthermore, when given a list of novel words, words they've

  • never heard or used before, like when prompted to write a story about alien

  • creatures with strange names, people will naturally tend to use the name of one

  • alien twice as often as another, three times as often as another... Zipf's law appears to

  • be built into our brains. Perhaps there is something about the way thoughts and

  • topics of discussion ebb and flow that contributes to Zipf's law.

  • Another way 'Zipf-ian' distributions occur is via processes that change

  • according to how they've previously operated. These are called preferential

  • attachment processes. They occur when something - money, views,

  • attention, variation, friends, jobs, anything really is given out according

  • to how much is already possessed. To go back to the carpet example, if most

  • people walk from the living room to the kitchen across a certain path, furniture

  • will be placed elsewhere, making that path even more popular. The more views

  • a video or image or post has, the more likely it is to get recommended

  • automatically or make the news for having so many views, both of which give

  • it more views.

  • It's like a snowball rolling down a snowy hill. The more snow it accumulates, the

  • bigger its surface area becomes for collecting more and the faster it grows.

  • There doesn't have to be a deliberate choice driving a preferential attachment

  • process. It can happen naturally. Try this. Take a bunch of paper clips and grab any

  • two at random.

  • Link them together and then throw them back in the pile. Now, repeat over and

  • over again. If you grab paper clips that are already part of a chain, link them anyway.

  • More often than not after a while you will have a distribution that looks

  • 'Zipf-ian.' A small number of chains contain a disproportionate amount of the

  • total paperclip count. This is simply because the longer a chain gets, the

  • greater proportion of the whole it contains, which gives it a better chance

  • of being picked up in the future and consequently made even longer. The rich

  • get richer, the big get bigger, the popular get popular-er. It's just math.
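
The paper clip experiment is easy to try without paper clips. This is a minimal sketch, not a careful model: `link_paperclips` is a made-up function, the counts of 1000 clips and 500 links are arbitrary, and clips are grabbed uniformly at random, so a chain's chance of being picked is proportional to how many clips it already holds.

```python
import random
from collections import Counter

def link_paperclips(n_clips, n_links, seed=0):
    """Repeatedly grab two random clips and join the chains they belong to.
    Returns the chain sizes, largest first."""
    rng = random.Random(seed)
    chain_of = list(range(n_clips))        # every clip starts as its own chain
    for _ in range(n_links):
        a, b = rng.sample(range(n_clips), 2)   # two distinct clips
        keep, drop = chain_of[a], chain_of[b]
        if keep != drop:
            # Link them anyway: merge the second chain into the first.
            chain_of = [keep if c == drop else c for c in chain_of]
    return sorted(Counter(chain_of).values(), reverse=True)

sizes = link_paperclips(n_clips=1000, n_links=500)
print("largest chains:", sizes[:5])
print("number of chains:", len(sizes))
print("median chain size:", sizes[len(sizes) // 2])
```

In a typical run a handful of chains hold a large share of the clips while most chains are still a single clip or a pair.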

  • Perhaps language's Zipf mystery is, if not caused by it, at least strengthened by

  • preferential attachment. Once a word is used, it's more likely to be used again soon.

  • Critical points may play a role as well.

  • Writing and conversation often stick to a topic until a critical point is reached

  • and the subject is changed and the vocabulary shifts. Processes like these

  • are known to result in power laws. So, in the end, it seems tenable that all these

  • mechanisms might collude to make Zipf's law the most natural way for language to

  • be. Perhaps some of our vocabulary and grammar was developed randomly, according

  • to Mandelbrot's theory. And the natural way conversation and discussion follow

  • preferential attachment and criticality, coupled with the principle of least

  • effort when speaking and listening are all responsible for the relationship

  • between word rank and frequency.

  • It's a shame that the answer isn't simpler, but it's fascinating because of

  • the consequences it has on what communication is made of. Roughly

  • speaking, and this is mind-blowing, nearly half of any book, conversation or article

  • will be nothing but the same 50 to 100 words. And nearly the other half will be

  • words that appear in that selection only once. That's not so surprising when you

  • consider the fact that one word accounts for 6 percent of what we say. The top 25

  • most used words make up about a third of everything we say and the top 100 about

  • half. Seriously. I mean, whether it's all the words in "Wet Hot American Summer," or all

  • the words in Plato's "Complete Works" or in the complete works of Edgar Allan Poe

  • or the Bible itself, only about 100 words are used for nearly half of everything

  • written or said. In Alice's Adventures in Wonderland, 44%, and in Tom Sawyer, 49.8%,

  • of the unique words used appear only once in the book. A word that is used

  • only once in a given selection of words is called a 'hapax legomenon.'
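
Both numbers are easy to measure for any text you have on hand. This is a sketch with a deliberately crude tokenizer (lowercase runs of letters and apostrophes), which is an assumption; real corpus studies handle punctuation, hyphens and capitalization more carefully. `word_stats` is a hypothetical helper and expects a text with at least one word.

```python
import re
from collections import Counter

def word_stats(text, top_k=100):
    """Share of all words covered by the top_k most common words, and the
    share of distinct words that appear exactly once (hapax legomena)."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    top_total = sum(c for _, c in counts.most_common(top_k))
    hapaxes = [w for w, c in counts.items() if c == 1]
    return {
        "top_share": top_total / len(words),
        "hapax_share": len(hapaxes) / len(counts),
    }

# Tiny made-up sample; swap in the text of a whole book for a real test.
sample = "the cat saw the dog and the dog saw the bird"
stats = word_stats(sample, top_k=2)
print(stats)
```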

  • Hapax legomena are vitally important to understanding languages. If a word has

  • only been found once in the entire known collection of an ancient language, it can

  • be very difficult to figure out what it means. Now, there is no corpus of

  • everything ever said or written in English, but there are very very large

  • collections and it's fun to find hapax legomena in them. For instance, and this

  • probably won't be the case after I mention it, but the word "quizzaciously"

  • is in the Oxford English Dictionary, but appears nowhere on Wikipedia or in the

  • Gutenberg corpus or in the British National Corpus or the American National

  • Corpus, but it does appear when searched in just one result on Google. Fittingly, in a

  • book titled "ElderSpeak" that lists it as a 'rare word.' Quizzaciously, by the way,

  • means "in a mocking manner," as in "The parodist rattled off quizzaciously,

  • 'Hey, Vsauce. Michael here. But who is Michael and how much does here

  • weigh?'" It's a little sad that quizzaciously has been used so infrequently. It's a

  • fun word, but that's the way things go in a 'Zipf-ian' system. Some things get all the

  • love, some get little. Most of what you experience on a day-to-day basis is

  • forgotten, forgettable. The Dictionary of Obscure Sorrows, as it often does, has a

  • word for this - Olēka - the awareness of how few days are memorable.

  • I've been alive for almost 11,000 days but I couldn't tell you something about

  • each one of them. I mean, not even close.

  • Most of what we do and see and think and say and hear and feel is forgotten

  • at a rate quite similar to Zipf's law, which makes sense. If a number of factors

  • naturally selected for thinking and talking about the world with tools in

  • a 'Zipf-ian' way, it makes sense we'd remember it that way too. Some things

  • really well, most things hardly at all. But it bums me out sometimes because it

  • means that so much is forgotten, even things that at the time you thought you

  • could never forget. My locker number -

  • senior year - its combination, the jokes I liked when I saw a comedian on stage,

  • the names of people I saw every day 10 years ago. So many memories are gone. When

  • I look at all the books I've read and realize that I can't remember every

  • detail from them, it's a little disappointing. I mean, why even bother if

  • the Pareto Principle dictates that my 'Zipf-ian' mind will consciously remember

  • pretty much only the titles and a few basic reactions years later?

  • Ralph Waldo Emerson makes me feel better. He once said, "I cannot remember the books

  • I've read any more than the meals I have eaten. Even so, they have made me."

  • And as always,

  • thanks for watching.
