Hey, Vsauce. Michael here. About 6 percent of everything you say and read and write is
"the" - the most used word in the English language. About one out of every 16 words we encounter on a daily basis is "the."
The top 20 most common English words
in order are "the," "of," "and," "to," "a," "in," "is," "I," "that," "it," "for," "you,"
"was," "with," "on," "as," "have," "but," "be," "they."
That's a fun fact. A piece of trivia but it's also more. You see, whether the most commonly used words are ranked across an entire language,
or in just one book or article, almost every time a bizarre pattern emerges.
The second most used word will appear about half as often as the most used.
The third one third as often. The fourth one fourth as often. The fifth one fifth as often.
The sixth one sixth as often, and so on all the way down.
Seriously, for some reason, the amount of times a word is used is just proportional to one over its rank.
Word frequency and ranking on a log log graph follow a nice straight line.
A power-law. This phenomenon is called Zipf's Law
and it doesn't only apply to English. It also applies to other languages, like, well, all of them.
Even ancient languages we haven't been able to translate yet.
And here's the thing. We have no idea why. It's surprising that something as complex as reality
should be conveyed by something as creative as language in such a predictable way.
How predictable? Well, watch this. According to WordCount.org,
which ranks words as found in the British National Corpus, "sauce" is the
5,555th most common English word. Now, here is a list of how many times
every word on Wikipedia and in the entire Gutenberg Corpus of tens of
thousands of public domain books shows up. The most used word, 'the,' shows up about
181 million times. Knowing these two things, we can estimate that the word
"sauce" should appear about thirty thousand times on Wikipedia and
Gutenberg combined. And it pretty much does.
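The estimate above can be reproduced with one division. This is a minimal sketch, assuming the idealized form of Zipf's law (count proportional to one over rank); the function name `zipf_expected_count` is hypothetical, and the two input figures are the ones quoted in the transcript.

```python
def zipf_expected_count(top_count: float, rank: int) -> float:
    """Expected count of the word at `rank` if counts fall off as 1/rank."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return top_count / rank


# Figures quoted in the transcript.
the_count = 181_000_000  # approximate count of "the", the rank-1 word
sauce_rank = 5_555       # rank of "sauce" according to WordCount.org

estimate = zipf_expected_count(the_count, sauce_rank)
print(f"Estimated occurrences of 'sauce': {estimate:,.0f}")
```

The result lands a little above thirty thousand, in line with the spoken estimate. Note that the rank and the count come from different corpora, so this is only a rough check.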
What gives? The world is chaotic. Things are distributed in myriad ways, not just
power laws. And language is personal,
intentional, idiosyncratic. What about the world and ourselves could cause such
complex activities and behaviors to follow such a basic rule? We literally
don't know. More than a century of research has yet to close the case.
Moreover, Zipf's law doesn't just mysteriously describe word use. It's
also found in city populations, solar flare intensities, protein sequences and
immune receptors, the amount of traffic websites get, earthquake magnitudes, the
number of times academic papers are cited, last names, the firing patterns of
neural networks, ingredients used in cookbooks, the number of phone calls
people received, the diameter of Moon craters, the number of people that die
in wars, the popularity of opening chess moves, even the rate at which we forget.
There are plenty of theories about why language is 'zipf-y,' but no firm conclusions
and this video doesn't contain a definite explanation either. Sorry, I know
that's a bummer, since we appear to like knowing more than mystery. But that said,
we also ask more than we answer. So let's dive into Zipf's ramifications, some
related patterns, some possible explanations and the depth of the
mystery itself. Zipf's law was popularized by George Zipf,
a linguist at Harvard University. It is a discrete form of the continuous Pareto
distribution from which we get the Pareto Principle. Because so many
real-world processes behave this way, the Pareto Principle tells us that, as a rule
of thumb, it's worth assuming that 20% of the causes are responsible for 80% of
the outcome,
like in language, where the most frequently used 18 percent of words
account for over 80% of word occurrences. In 1896, Vilfredo Pareto showed that
approximately 80% of the land in Italy was owned by just twenty percent of the
population. It is said that he later noticed in his garden 20 percent of his
pea pods contained eighty percent of the peas. He and other researchers looked at
other datasets and found that this 80-20 imbalance comes up a lot in the world.
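The language example mentioned earlier (a small share of words covering most occurrences) falls out of a pure 1/rank distribution. The sketch below assumes an idealized Zipf distribution and a hypothetical vocabulary of 10,000 distinct words; `harmonic` and `top_share` are illustrative names, not taken from any library.

```python
def harmonic(n: int) -> float:
    """Sum of 1/r for r = 1..n (the total weight of the top n ranks)."""
    return sum(1.0 / r for r in range(1, n + 1))


def top_share(vocab_size: int, top_fraction: float) -> float:
    """Share of all occurrences covered by the top `top_fraction` of ranks."""
    top_k = max(1, round(vocab_size * top_fraction))
    return harmonic(top_k) / harmonic(vocab_size)


vocab_size = 10_000  # hypothetical vocabulary size, chosen for illustration
share = top_share(vocab_size, 0.20)
print(f"Top 20% of {vocab_size:,} words cover {share:.1%} of occurrences")
```

The share depends on the assumed vocabulary size, so the 80-20 split is a rule of thumb rather than an exact outcome of the model.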
The richest 20% of humans have 82.7% of the world's income. In the US, 20% of
patients use eighty percent of health care resources. In 2002, Microsoft
reported that 80% of the errors and crashes in Windows and Office are caused
by 20% of the bugs detected. A common rule of thumb in the business world
states that 20% of your customers are responsible for 80% of your profits and
eighty percent of the complaints you receive will come from 20% of your
customers. A book titled "The 80/20 Principle" even says that in a home or office,
20% of the carpet receives 80 percent of the wear. Oh, and as Woody Allen famously
said, "eighty percent of success is just showing up." The Pareto Principle is
everywhere, which is good.
By focusing on just 20 percent of what's wrong, you can often expect to solve
eighty percent of the problems. A variety of different unrelated factors cause
this to be true from case to case, but if we can get to the bottom of what causes
some of them,
maybe we'll find that one or more of those mechanisms is responsible for
Zipf's law in language. George Zipf himself thought language's interesting rank
frequency distribution was a consequence of the Principle of Least Effort.
The tendency for life and things to follow the path of least resistance. Zipf believed
it drove much of human behavior and hypothesized that as language developed
in our species, speakers naturally preferred drawing from as few words as
possible to get their thoughts out there. It was easier. But in order to understand
what was being said,
listeners preferred larger vocabularies that gave more specificity, so that they
had to do less work. The compromise between listening and speaking, Zipf felt,
led to the current state of language. A few words are used often and many many
many words are used rarely.
Recent papers have suggested that having a few short, often used, predictable words
helps dissipate information load density on listeners, spacing out important vocab
so that the information rate is more constant. This makes sense and much has
been learned by applying the least effort principle to other behaviors, but
later researchers argued that for language, the explanation was even more
simple. Just a few years after Zipf's seminal paper, Benoit Mandelbrot showed
that there may be nothing mysterious about Zipf's law at all, because even if you
just randomly type on a keyboard you will produce words distributed according
to Zipf's law. It's a pretty cool point and this is why it happens. There are
exponentially more different long words than short words. For instance, the English
alphabet can be used to make 26 one letter words, but 26 squared 2 letter
words. Also, in random typing, whenever the space bar is pressed a word terminates.
Since there's always a certain chance that the space bar will be pressed, longer
stretches of time before it happens
are exponentially less likely than shorter ones. The combination of these
exponentials is pretty 'Zipfy.' For example, if all 26 letters and the
spacebar are equally likely to be typed, after a letter is typed and a word has
begun, the probability that the next input will be a space, thus creating a
one letter word, is just one in 27. And sure enough, if you randomly generate
characters or hire a proverbial typing monkey, about one out of every 27 or 3.7
percent of the stuff between spaces, will be single letters. Two letter words
appear when after beginning a word any character but the space bar is hit - a 26
in 27 chance and then the space bar.
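The word-length probabilities described above can be checked both by formula and by simulation. This is a rough sketch, assuming 26 letters plus a space bar, all equally likely, as in the transcript; the function names and the number of keystrokes are hypothetical choices.

```python
import random
import string

KEYS = string.ascii_lowercase + " "  # 26 letters plus the space bar


def length_probability(n: int) -> float:
    """P(a word has exactly n letters), given that a word has just begun."""
    p_space = 1 / len(KEYS)
    # n - 1 further letters, then a space to end the word.
    return (1 - p_space) ** (n - 1) * p_space


def simulate_length_shares(num_keystrokes: int, seed: int = 0) -> dict:
    """Type random keys and return the share of words at each length."""
    rng = random.Random(seed)
    text = "".join(rng.choice(KEYS) for _ in range(num_keystrokes))
    words = text.split()  # runs of spaces produce no empty words
    counts = {}
    for word in words:
        counts[len(word)] = counts.get(len(word), 0) + 1
    return {n: c / len(words) for n, c in sorted(counts.items())}


shares = simulate_length_shares(200_000, seed=1)
for n in (1, 2, 3):
    print(f"{n}-letter words: formula {length_probability(n):.4f}, "
          f"simulated {shares.get(n, 0.0):.4f}")
```

With enough keystrokes the simulated share of one-letter words should settle near 1 in 27, about 3.7 percent.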
A three-letter word is the probability of a letter, another letter and then a
space. If we divide by the number of unique words of each length there can be,
we get the frequency of occurrence expected for any particular word given
its length. For example, the letter V will make up about 0.142 percent of
random typing. The word "Vsauce" 0.00000000993 percent. Longer words are
less likely, but watch this. Let's spread these frequencies out according to the
ranks they'd take up on a most often used list. There are 26 possible one
letter words, so each of the top 26 ranked words are expected to occur
about this often. The next 676 ranks will be taken up
by two letter words that show up about this often. If we extend each frequency
according to how many members it has, we get Zipf. Subsequent researchers have
detailed how changing up the initial conditions can smooth the steps out.
Our mysterious distribution has been created out of nothing but the inevitabilities
of math.
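The staircase described above can be tabulated directly. This is a minimal sketch under the same assumptions as the transcript's model (26 letters and a space bar, all equally likely); `word_frequency` and `rank_range` are hypothetical helper names.

```python
NUM_LETTERS = 26
P_SPACE = 1 / (NUM_LETTERS + 1)  # chance that any keystroke is the space bar


def word_frequency(n: int) -> float:
    """Frequency of one particular n-letter word among all words typed."""
    p_length = (1 - P_SPACE) ** (n - 1) * P_SPACE  # P(word has n letters)
    return p_length / NUM_LETTERS ** n             # split over all n-letter words


def rank_range(n: int) -> tuple:
    """First and last rank occupied by the n-letter words."""
    first = sum(NUM_LETTERS ** k for k in range(1, n)) + 1
    last = first + NUM_LETTERS ** n - 1
    return first, last


for n in range(1, 5):
    first, last = rank_range(n)
    print(f"{n}-letter words: ranks {first:,} to {last:,}, "
          f"each about {word_frequency(n) * 100:.3g} percent")
```

In this model each extra letter divides a word's frequency by 27 while multiplying the number of ranks at that length by 26, which is why the steps track a line with slope close to -1 on a log-log plot. Under this reading, a specific six-letter word such as "Vsauce" comes out at roughly 9.9e-9 percent.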
So maybe there is no mystery. Maybe words are just the result of humans randomly
segmenting the observable world and the mental world into labels and Zipf's law
describes what naturally happens when you do that. Case closed.
And as always,
thanks for... wait a minute!
Actual language is very different from random typing. Communication is
deterministic to a certain extent. Utterances and topics arrive based on
what was said before. And the vocabulary we have to work with certainly isn't the
result of purely random naming. For example, the monkey typing model can't
explain why even the names of the elements, the planets and the days of the
week are used in language according to Zipf's law. Sets like these are constrained
by the natural world and they're not the result of us randomly segmenting the
world into labels. Furthermore, when given a list of novel words, words they've
never heard or used before, like when prompted to write a story about alien
creatures with strange names, people will naturally tend to use the name of one
alien twice as often as another, three times as often as another... Zipf's law appears to
be built into our brains. Perhaps there is something about the way thoughts and
topics of discussion ebb and flow that contributes to Zipf's law.
Another way 'Zipf-ian' distributions occur is via processes that change
according to how they've previously operated. These are called preferential
attachment processes. They occur when something - money, views,
attention, variation, friends, jobs, anything really - is given out according
to how much is already possessed. To go back to the carpet example, if most
people walk from the living room to the kitchen across a certain path, furniture
will be placed elsewhere, making that path even more popular. The more views
a video or image or post has, the more likely it is to get recommended
automatically or make the news for having so many views, both of which give
it more views.
It's like a snowball rolling down a snowy hill. The more snow it accumulates, the
bigger its surface area becomes for collecting more and the faster it grows.
There doesn't have to be a deliberate choice driving a preferential attachment
process. It can happen naturally. Try this. Take a bunch of paper clips and grab any
two at random.
Link them together and then throw them back in the pile. Now, repeat over and
over again. If you grab paper clips that are already part of a chain, link them anyway.
More often than not after a while you will have a distribution that looks
'Zipf-ian.' A small number of chains contain a disproportionate amount of the
total paperclip count. This is simply because the longer a chain gets, the
greater proportion of the whole it contains, which gives it a better chance
of being picked up in the future and consequently made even longer. The rich
get richer, the big get bigger, the popular get popular-er. It's just math.
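The paper clip experiment is easy to simulate. The sketch below is one possible reading of the procedure: picking a clip at random means a chain is chosen with probability proportional to its length, and a draw where both clips already belong to the same chain is simply skipped (that last rule is an assumption, as are the function name and the number of links).

```python
import random


def link_paperclips(num_clips: int, num_links: int, seed: int = 0) -> list:
    """Return chain sizes, largest first, after linking random pairs of clips."""
    rng = random.Random(seed)
    chain_of = list(range(num_clips))              # chain id of each clip
    members = {i: [i] for i in range(num_clips)}   # clips in each chain

    for _ in range(num_links):
        a, b = rng.sample(range(num_clips), 2)     # two distinct clips
        chain_a, chain_b = chain_of[a], chain_of[b]
        if chain_a == chain_b:
            continue  # assumption: same chain already, nothing new to link
        # Merge the shorter chain into the longer one.
        if len(members[chain_a]) < len(members[chain_b]):
            chain_a, chain_b = chain_b, chain_a
        for clip in members[chain_b]:
            chain_of[clip] = chain_a
        members[chain_a].extend(members.pop(chain_b))

    return sorted((len(m) for m in members.values()), reverse=True)


sizes = link_paperclips(num_clips=1_000, num_links=500, seed=1)
print("Largest chains:", sizes[:10])
print("Single clips left:", sizes.count(1))
```

How long "a while" should be is a judgment call: with too few links almost nothing is chained, and with too many everything ends up in one chain, so the number of links here is only a starting point to experiment with.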
Perhaps language's Zipf mystery is, if not caused by it, at least strengthened by
preferential attachment. Once a word is used, it's more likely to be used again soon.
Critical points may play a role as well.
Writing and conversation often stick to a topic until a critical point is reached
and the subject is changed and the vocabulary shifts. Processes like these
are known to result in power laws. So, in the end, it seems tenable that all these
mechanisms might collude to make Zipf's law the most natural way for language to
be. Perhaps some of our vocabulary and grammar was developed randomly, according
to Mandelbrot's theory. And the natural way conversation and discussion follow
preferential attachment and criticality, coupled with the principle of least
effort when speaking and listening, are all responsible for the relationship
between word rank and frequency.
It's a shame that the answer isn't simpler, but it's fascinating because of
the consequences it has on what communication is made of. Roughly
speaking, and this is mind-blowing, nearly half of any book, conversation or article
will be nothing but the same 50 to 100 words. And nearly the other half will be
words that appear in that selection only once. That's not so surprising when you
consider the fact that one word accounts for 6 percent of what we say. The top 25
most used words make up about a third of everything we say and the top 100 about
half. Seriously. I mean, whether it's all the words in "Wet Hot American Summer," or all
the words in Plato's "Complete Works" or in the complete works of Edgar Allan Poe
or the Bible itself, only about 100 words are used for nearly half of everything
written or said. In Alice's Adventures in Wonderland 44% and in Tom Sawyer 49.8%
of the unique words used appear only once in the book. A word that is used
only once in a given selection of words is called a 'hapax legomenon.'
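Both figures discussed above, the share of text covered by the most common words and the words that appear only once, can be measured on any text. This is a small sketch, assuming a deliberately simple tokenizer (lowercase letters and apostrophes only); the function names and the sample sentence are made up for illustration.

```python
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count words using a simple lowercase tokenizer (an assumption)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def hapax_legomena(counts: Counter) -> list:
    """Words that occur exactly once in the selection."""
    return sorted(word for word, count in counts.items() if count == 1)


def top_k_coverage(counts: Counter, k: int) -> float:
    """Fraction of all word occurrences covered by the k most common words."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in counts.most_common(k)) / total


sample = "the cat saw the dog and the dog saw the bird"
counts = word_counts(sample)
print("Hapax legomena:", hapax_legomena(counts))
print(f"Coverage of the top word: {top_k_coverage(counts, 1):.1%}")
```

To try it on a real book, read the file into a string and pass it to `word_counts`, then compare `top_k_coverage(counts, 100)` with the roughly one-half figure quoted above.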
Hapax legomena are vitally important to understanding languages. If a word has
only been found once in the entire known collection of an ancient language, it can
be very difficult to figure out what it means. Now, there is no corpus of
everything ever said or written in English, but there are very very large
collections and it's fun to find hapax legomena in them. For instance, and this
probably won't be the case after I mention it, but the word "quizzaciously"
is in the Oxford English Dictionary, but appears nowhere on Wikipedia or in the
Gutenberg corpus or in the British National Corpus or the American National
Corpus, but it does appear when searched in just one result on Google. Fittingly, in a
book titled "ElderSpeak" that lists it as a 'rare word.' Quizzaciously, by the way,
means "in a mocking manner," as in "The parodist rattled off quizzaciously,
'Hey, Vsauce. Michael here. But who is Michael and how much does here
weigh?'" It's a little sad that quizzaciously has been used so infrequently. It's a
fun word, but that's the way things go in a 'Zipf-ian' system. Some things get all the
love, some get little. Most of what you experience on a day-to-day basis is
forgotten, forgettable. The Dictionary of Obscure Sorrows, as it often does, has a
word for this - Olēka - the awareness of how few days are memorable.
I've been alive for almost 11,000 days but I couldn't tell you something about
each one of them. I mean, not even close.
Most of what we do and see and think and say and hear and feel is forgotten
at a rate quite similar to Zipf's law, which makes sense. If a number of factors
naturally selected for thinking and talking about the world with tools in
a 'Zipf-ian' way, it makes sense we'd remember it that way too. Some things
really well, most things hardly at all. But it bums me out sometimes because it
means that so much is forgotten, even things that at the time you thought you
could never forget. My locker number -
senior year - its combination, the jokes I liked when I saw a comedian on stage,
the names of people I saw every day 10 years ago. So many memories are gone. When
I look at all the books I've read and realize that I can't remember every
detail from them, it's a little disappointing. I mean, why even bother if
the Pareto Principle dictates that my 'Zipf-ian' mind will consciously remember
pretty much only the titles and a few basic reactions years later?
Ralph Waldo Emerson makes me feel better. He once said, "I cannot remember the books
I've read any more than the meals I have eaten. Even so, they have made me."
And as always,
thanks for watching.


The Zipf Mystery

張馨文 published on October 3, 2015    Chris Shao translated    Mandy Lin reviewed