Information entropy | Journey into information theory | Computer Science | Khan Academy

Voiceover: Imagine two machines. They both output messages from an alphabet of A, B, C, or D. Machine One generates each symbol randomly: they all occur 25% of the time. Machine Two generates symbols according to the following probabilities: A 50%, B 12.5%, C 12.5%, and D 25%. Which machine is producing more information?

Claude Shannon cleverly rephrased the question: if you had to predict the next symbol from each machine, what is the minimum number of yes-or-no questions you would expect to ask? Let's look at Machine One. The most efficient way is to pose a question which divides the possibilities in half. For example, our first question could cover any two symbols, such as "Is it A or B?", since there is a 50% chance of A or B and a 50% chance of C or D. After getting the answer, we can eliminate half of the possibilities, and we will be left with two symbols, both equally likely. So we simply pick one, such as "Is it A?", and after this second question we will have correctly identified the symbol. We can say the uncertainty of Machine One is two questions per symbol.

What about Machine Two? As with Machine One, we could ask two questions to determine the next symbol. However, this time the probability of each symbol is different, so we can ask our questions differently. Here A has a 50% chance of occurring, and all other letters add to 50%. We could start by asking "Is it A?" If it is A, we are done: only one question in this case. Otherwise, we are left with two equally likely outcomes: D, or B and C. We could ask, "Is it D?" If yes, we are done with two questions. Otherwise, we have to ask a third question to identify which of the last two symbols it is. On average, how many questions do you expect to ask to determine a symbol from Machine Two?

This can be explained nicely with an analogy. Let's assume instead we want to build Machine One and Machine Two, and we can generate symbols by bouncing a disc off a peg in one of two equally likely directions. Based on which way it falls, we can generate a symbol. With Machine One, we need to add a second level, or a second bounce, so that we have two bounces, which lead to four equally likely outcomes. Based on where the disc lands, we output A, B, C, or D. Now Machine Two. In this case, the first bounce leads either to an A, which occurs 50% of the time, or to a second bounce, which can then either output a D, which occurs 25% of the time, or lead to a third bounce, which then leads to either B or C, each 12.5% of the time.

Now we just take a weighted average as follows. The expected number of bounces is the probability of symbol A times one bounce, plus the probability of B times three bounces, plus the probability of C times three bounces, plus the probability of D times two bounces. This works out to 1.75 bounces. Notice the connection between yes-or-no questions and fair bounces. The expected number of questions is equal to the expected number of bounces. So Machine One requires two bounces to generate a symbol, while guessing an unknown symbol requires two questions. Machine Two requires 1.75 bounces, and we need to ask 1.75 questions on average, meaning if we need to guess a hundred symbols from both machines, we can expect to ask 200 questions for Machine One and 175 questions for Machine Two. This means that Machine Two is producing less information, because there is less uncertainty, or surprise, about its output. And that's it.

Claude Shannon calls this measure of average uncertainty "entropy", and he uses the letter H to represent it. The unit of entropy Shannon chooses is based on the uncertainty of a fair coin flip, and he calls this "the bit", which is equivalent to a fair bounce. We can arrive at the same result using our bounce analogy. Entropy, or H, is the summation, for each symbol, of the probability of that symbol times the number of bounces. The difference is: how do we express the number of bounces in a more general way? As we've seen, the number of bounces depends on how far down the tree we are. We can simplify this by saying that the number of bounces equals the logarithm base two of the number of outcomes at that level. The number of outcomes at a level is also based on the probability: the number of outcomes at a level equals one divided by the probability of that outcome. The number of bounces therefore equals the logarithm base two of one over the probability of that symbol, which gives us our final equation. Entropy, or H, is the summation, for each symbol, of the probability of that symbol times the logarithm base two of one over the probability of that symbol. Shannon writes this slightly differently, inverting the expression inside the logarithm, which causes us to add a negative sign, though both formulas give the same result.

Let's summarize. Entropy is maximum when all outcomes are equally likely. Any time you move away from equally likely outcomes, or introduce predictability, the entropy must go down. The fundamental idea is that if the entropy of an information source drops, we can ask fewer questions to guess the outcome. Thanks to Shannon, the bit, which is the unit of entropy, is adopted as our quantitative measure of information, or measure of surprise.

48 comments on “Information entropy | Journey into information theory | Computer Science | Khan Academy”

  1. ewerlopes says:

    Perfect explanation! 🙂

  2. mohamad reza moohebat says:

    It was perfect. thx

  3. itsRAWRtime007 says:

    good video. i like the way it shows the intuition behind the concept, that is the reason why the concepts actually exists rather than plainly defining it and then showing its properties.

  4. Leo Cheung says:

    Could anyone help explain why less uncertainty means less information (Machine 2)? Isn't it the other way round? Many thanks.

  5. Francesco De Toni says:

    Isn't there a mistake at 3:50? Shouldn't it be 0.25 x 2 instead of 0.25 x 4?

  6. BetBola says:

[Translated from Portuguese] How do you calculate the entropy of a text? And what can we do with that?

  7. Raul Tellegen says:

    Amazing video. Seldom seen a better explanation of anything. Thanks!

  8. Diss & Dad says:

    Good explanation! If I wanted to calculate the entropy with log2, which calculator can do this? Is there an online calculator for this? What would be the best approach?

  9. Valentin says:

    this is wonderful. thank you

  10. surendran murugesan says:

    If Machine two provides 100 data in 175 questions, isn't it more informative? Shouldn't it be the other way around as opposed to the example?

  11. Mansi Gupta says:

    An excellent excellent excellent video. I finally get it.

  12. Kevin Francetti says:

    How to destroy this theorem: Get a string ABC.
    P(A) = 1/3
    P(B) = 1/3
    P(C) = 1/3

    Nbounces for A = 1
    Nbounces for B = 2
    Nbounces for C = 2

    formula 1: H = P(A)*1+P(B)*2+P(C)*2 = 1,66
    formula 2: H = -1 * [ P(A)* log(P(A)) + P(B)*log(P(B)) + P(C)*log(P(C)) ] = 1,58

    Yupp… sure… 1,66 = 1,58.. fuck this shit, this is why math is a lie for retards, try to proof me wrong!!

  13. varada rajan says:

    Why are they using log base 2

  14. Crispin Semmens says:


  15. Yevon says:

    This is exactly what I need. And I wonder why such explanation never takes place on a lecture, and the lecturers just assume everyone in the college to know the intuition behind.

  16. David Trojko says:

    In 6:33 letters M and W are mixed up.
    Thank you for the explanation! I understand it now clearly. 🙂

  17. Gottfried Leibniz says:

    Pffffffff…it can always be two questions for the four symbols case.

  18. Ahmed KMN Moustafa says:

    great explanation bro 🙂

  19. HIRAK MONDAL says:

    Why outcome is 1/p?

  20. Roberto I says:


  21. 桂瑋仁 says:

    at 1:24, i would argue the 3rd question (ie, the question on the right of 2nd hierarchy) should be
    "Is it C?" (or "Is it D?")
    rather than "Is it B?"
    (i think this is so because, as the 1st machine answered "No" to the 1st question ["Is it AB?"], it essentially rules-out both A and B, leaving only C(or D) as the possible outcome; hence no role for "B" anymore)

  22. SalRite says:

    What a Beautiful Explanation!!!

  23. Austin Abeyta says:

    Did anyone else notice the DEATH NOTE music at 4:40.

  24. Su Liu says:

    Great video! I can follow it but I have trouble understanding the problem statement. Why "the most efficient way is to pose a question which divides the possibility by half"?

  25. somesh shaarma says:

    Awesome explanation with a very intuitive example.Thanks a lot…

  26. Joshua Ronis says:

    I don't understand this.
    Why is it that everytime you move away entropy must go down?
    Let's say the probabilities are:

    P(A) –> 0.27
    P(B) –> 0.26
    P(C) –> 0.24
    P(D) –> 0.23

    If you use the same encoding he used, you end up getting:

    0.27(1) + 0.26(2) +0.24(3) + 0.23(3) = 2.2 Bits on average

    That is more than 2 bits on average. How did encoding the information by probability help?

  27. Jeff Motsinger says:

    But what use is it? Science seeks not truth but usefulness. And it only works if one knows the outcome distribution in advance. If the outcome distribution is not known in advance, then no information is generated?? Why call it 'information', that word is already in use….maybe because it is easier to deceive that way?

  28. Souvik Majumdar says:

    Thank You

  29. Shelendra Sharma says:

    Best explaination , salute ….

  30. Zes says:

    wrong, no such thing as info or possible states about it or efficient qx or not, any can be of infinite any info. cpux, ask/say, can ask say any no matter what and any can be perfx

  31. ThatGuy Man says:

    What if we used ternary ?

  32. TheOMGPoPCorn says:

    Is this VSauce?

  33. Ahmed El sharkawy says:

    just awesome

  34. H Rivera says:

    Fifteen people by error push thumb down option.

  35. Guerri Filippo says:


  36. Sholi says:

    can someone explain in #bounces=p(a)x1+p(b)x3+p(c)x3+p(d)x2 at 3:44 , how numbers 1,3,3 & 2 came?

  37. Pablo Biedma says:

    So if I recall correctly, the one with the highest entropy is the least informative one, then the, if a machine generates symbols, and apply the formula for each symbol, which symbol provides the most information? the one with the least amount of bits? how does that make sense, isn't it the one with the highest amount of bits? calculated by p log( 1/p)

  38. David Deleon says:

    Huh??? Why am I finding out that information entropy was a concept. MIND BLOWN!!!

  39. Samuel Gelman says:

    The 'M' and 'W' are switched and upside down while the 'Z' is just a sideways 'N'…my vote is intentional 6:32

  40. 姜昊 says:

    You should update this video. The mistakes are absurd and misleading.

  41. Sanad says:

    can't we just ask one question?
    is it abc or d ?
    edit: nevermind, i just figured that 1 bit removes uncertainty of 1/2

  42. Асылбек Умурзак says:

    Cool beat

  43. 冯小康 says:

    still confused why #outcome=1/pi

  44. Alex Hsia says:

    What do they mean by number of outcomes? Can someone give me an example using the ABCD examples they used?

  45. Chandragupta says:

    awesome video!

  46. Sirius Starlight says:

    Intuition always wins 🤣

  47. Sanjay Rakshit says:

    Heckin Shanon

  48. E says:

I can't understand what is being said. It's too bloody complicated. One of the worst lectures of Khan Academy.
