Computer Vision: Crash Course Computer Science #35

Hi, I’m Carrie Anne, and welcome to Crash
Course Computer Science! Today, let’s start by thinking about how
important vision can be. Most people rely on it to prepare food, walk
around obstacles, read street signs, watch videos like this, and do hundreds of other
tasks. Vision is the highest bandwidth sense, and
it provides a firehose of information about the state of the world and how to act on it. For this reason, computer scientists have
been trying to give computers vision for half a century, birthing the sub-field of computer
vision. Its goal is to give computers the ability
to extract high-level understanding from digital images and videos. As everyone with a digital camera or smartphone
knows, computers are already really good at capturing photos with incredible fidelity
and detail – much better than humans in fact. But as computer vision professor Fei-Fei Li
recently said, “Just like to hear is not the same as to listen. To take pictures is not the same as to see.” INTRO As a refresher, images on computers are most
often stored as big grids of pixels. Each pixel is defined by a color, stored as
a combination of three additive primary colors: red, green and blue. By combining different intensities of these
three colors, what’s called an RGB value, we can represent any color. Perhaps the simplest computer vision algorithm
– and a good place to start – is to track a colored object, like a bright pink ball. The first thing we need to do is record the
ball’s color. For that, we’ll take the RGB value of the
centermost pixel. With that value saved, we can give a computer
program an image, and ask it to find the pixel with the closest color match. An algorithm like this might start in the
upper left corner, and check each pixel, one at a time, calculating the difference from
our target color. Now, having looked at every pixel, the best
match is very likely a pixel from our ball. We’re not limited to running this algorithm
on a single photo; we can do it for every frame in a video, allowing us to track the
ball over time. Of course, due to variations in lighting,
shadows, and other effects, the ball on the field is almost certainly not going to be
the exact same RGB value as our target color, but merely the closest match. In more extreme cases, like at a game at night,
the tracking might be poor. And if one of the team’s jerseys used the
same color as the ball, our algorithm would get totally confused. For these reasons, color marker tracking and
similar algorithms are rarely used, unless the environment can be tightly controlled. This color tracking example was able to search
pixel-by-pixel, because colors are stored inside of single pixels. But this approach doesn’t work for features
larger than a single pixel, like edges of objects, which are inherently made up of many
pixels. To identify these types of features in images,
computer vision algorithms have to consider small regions of pixels, called patches. As an example, let’s talk about an algorithm
that finds vertical edges in a scene, let’s say to help a drone navigate safely through
a field of obstacles. To keep things simple, we’re going to convert
our image into grayscale, although most algorithms can handle color. Now let’s zoom into one of these poles to
see what an edge looks like up close. We can easily see where the left edge of the
pole starts, because there’s a change in color that persists across many pixels vertically. We can define this behavior more formally
by creating a rule that says the likelihood of a pixel being a vertical edge is the magnitude
of the difference in color between some pixels to its left and some pixels to its right. The bigger the color difference between these
two sets of pixels, the more likely the pixel is on an edge. If the color difference is small, it’s probably
not an edge at all. The mathematical notation for this operation
looks like this – it’s called a kernel or filter. It contains the values for a pixel-wise multiplication, the sum of which is saved into the center pixel. Let’s see how this works for our example
pixel. I’ve gone ahead and labeled all of the pixels
with their grayscale values. Now, we take our kernel, and center it over
our pixel of interest. This specifies what each pixel value underneath
should be multiplied by. Then, we just add up all those numbers. In this example, that gives us 147. That becomes our new pixel value. This operation, of applying a kernel to a
patch of pixels, is called a convolution. Now let’s apply our kernel to another pixel. In this case, the result is 1. Just 1. In other words, it’s a very small color
difference, and not an edge. If we apply our kernel to every pixel in the
photo, the result looks like this, where the highest pixel values are where there are strong
vertical edges. Note that horizontal edges, like those platforms
in the background, are almost invisible. If we wanted to highlight those features,
we’d have to use a different kernel – one that’s sensitive to horizontal edges. Both of these edge enhancing kernels are called
Prewitt Operators, named after their inventor. These are just two examples of a huge variety
of kernels, able to perform many different image transformations. For example, here’s a kernel that sharpens
images. And here’s a kernel that blurs them. Kernels can also be used like little image
cookie cutters that match only certain shapes. So, our edge kernels looked for image patches
with strong differences from right to left or up and down. But we could also make kernels that are good
at finding lines, with edges on both sides. And even islands of pixels surrounded by contrasting
colors. These types of kernels can begin to characterize
simple shapes. For example, on faces, the bridge of the nose
tends to be brighter than the sides of the nose, resulting in higher values for line-sensitive
kernels. Eyes are also distinctive – a dark circle
surrounded by lighter pixels – a pattern other kernels are sensitive to. When a computer scans through an image, most
often by sliding around a search window, it can look for combinations of features indicative
of a human face. Although each kernel is a weak face detector
by itself, combined, they can be quite accurate. It’s unlikely that a bunch of face-like
features will cluster together if they’re not a face. This was the basis of an early and influential
algorithm called Viola-Jones Face Detection. Today, the hot new algorithms on the block
are Convolutional Neural Networks. We talked about neural nets last episode,
if you need a primer. In short, an artificial neuron – which is
the building block of a neural network – takes a series of inputs, and multiplies each by
a specified weight, and then sums those values all together. This should sound vaguely familiar, because
it’s a lot like a convolution. In fact, if we pass a neuron 2D pixel data,
rather than a one-dimensional list of inputs, it’s exactly like a convolution. The input weights are equivalent to kernel
values, but unlike a predefined kernel, neural networks can learn their own useful kernels
that are able to recognize interesting features in images. Convolutional Neural Networks use banks of
these neurons to process image data, each outputting a new image, essentially digested
by different learned kernels. These outputs are then processed by subsequent
layers of neurons, allowing for convolutions on convolutions on convolutions. The very first convolutional layer might find
things like edges, as that’s what a single convolution can recognize, as we’ve already
discussed. The next layer might have neurons that convolve
on those edge features to recognize simple shapes, comprised of edges, like corners. A layer beyond that might convolve on those
corner features, and contain neurons that can recognize simple objects, like mouths
and eyebrows. And this keeps going, building up in complexity,
until there’s a layer that does a convolution that puts it together: eyes, ears, mouth,
nose, the whole nine yards, and says “ah ha, it’s a face!” Convolutional neural networks aren’t required
to be many layers deep, but they usually are, in order to recognize complex objects and
scenes. That’s why the technique is considered deep
learning. Both Viola-Jones and Convolutional Neural
Networks can be applied to many image recognition problems, beyond faces, like recognizing handwritten
text, spotting tumors in CT scans and monitoring traffic flow on roads. But we’re going to stick with faces. Regardless of what algorithm was used, once
we’ve isolated a face in a photo, we can apply more specialized computer vision algorithms
to pinpoint facial landmarks, like the tip of the nose and corners of the mouth. This data can be used for determining things
like if the eyes are open, which is pretty easy once you have the landmarks – it’s
just the distance between points. We can also track the position of the eyebrows;
their relative position to the eyes can be an indicator of surprise, or delight. Smiles are also pretty straightforward to
detect based on the shape of mouth landmarks. All of this information can be interpreted
by emotion recognition algorithms, giving computers the ability to infer when you’re
happy, sad, frustrated, confused and so on. In turn, that could allow computers to intelligently
adapt their behavior… maybe offer tips when you’re confused, and not ask to install
updates when you’re frustrated. This is just one example of how vision can
give computers the ability to be context sensitive, that is, aware of their surroundings. And not just the physical surroundings – like
if you’re at work or on a train – but also your social surroundings – like if you’re
in a formal business meeting versus a friend’s birthday party. You behave differently in those surroundings, and so should computing devices, if they’re smart. Facial landmarks also capture the geometry
of your face, like the distance between your eyes and the height of your forehead. This is one form of biometric data, and it
allows computers with cameras to recognize you. Whether it’s your smartphone automatically
unlocking itself when it sees you, or governments tracking people using CCTV cameras, the applications
of face recognition seem limitless. There have also been recent breakthroughs
in landmark tracking for hands and whole bodies, giving computers the ability to interpret
a user’s body language, and what hand gestures they’re frantically waving at their internet
connected microwave. As we’ve talked about many times in this
series, abstraction is the key to building complex systems, and the same is true in computer
vision. At the hardware level, you have engineers
building better and better cameras, giving computers improved sight with each passing
year, which I can’t say for myself. Using that camera data, you have computer
vision algorithms crunching pixels to find things like faces and hands. And then, using output from those algorithms,
you have even more specialized algorithms for interpreting things like user facial expression
and hand gestures. On top of that, there are people building
novel interactive experiences, like smart TVs and intelligent tutoring systems, that
respond to hand gestures and emotion. Each of these levels is an active area of research,
with breakthroughs happening every year. And that’s just the tip of the iceberg. Today, computer vision is everywhere – whether
it’s barcodes being scanned at stores, self-driving cars waiting at red lights, or Snapchat filters
superimposing mustaches. And, the most exciting thing is that computer
scientists are really just getting started, enabled by recent advances in computing, like
super fast GPUs. Computers with human-like ability to see are
going to totally change how we interact with them. Of course, it’d also be nice if they could
hear and speak, which we’ll discuss next week. I’ll see you then.
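
The vertical-edge kernel described in this episode can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the video: the image values are made up, the function names (apply_kernel, convolve) are hypothetical, and border pixels are simply left at 0. Like the episode, it multiplies the kernel against the patch directly, without the kernel flip that a strict mathematical convolution would include.

```python
# Prewitt-style kernel for vertical edges: negative weights on the left
# column, positive weights on the right column, zeros in the middle.
PREWITT_VERTICAL = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]


def apply_kernel(image, kernel, row, col):
    """Multiply the 3x3 patch centered on (row, col) by the kernel,
    pixel by pixel, and return the sum of the products."""
    total = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            total += image[row + dr][col + dc] * kernel[dr + 1][dc + 1]
    return total


def convolve(image, kernel):
    """Apply the kernel to every pixel that has a full 3x3 neighborhood.

    Assumption for this sketch: border pixels stay at 0, and the magnitude
    (absolute value) of each sum is stored as the edge strength.
    """
    height, width = len(image), len(image[0])
    output = [[0] * width for _ in range(height)]
    for row in range(1, height - 1):
        for col in range(1, width - 1):
            output[row][col] = abs(apply_kernel(image, kernel, row, col))
    return output


# Hypothetical grayscale patch: dark background on the left,
# a bright pole on the right.
image = [
    [50, 50, 50, 200, 200],
    [50, 50, 50, 200, 200],
    [50, 50, 50, 200, 200],
    [50, 50, 50, 200, 200],
]

edges = convolve(image, PREWITT_VERTICAL)
for line in edges:
    print(line)
```

Pixels in the flat dark region come out as 0, while the pixels on either side of the dark-to-bright boundary get large values, which is the "strong vertical edge" response described above. Swapping in the transpose of the kernel would respond to horizontal edges instead.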

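The color-tracking idea from the start of the episode can be sketched the same way. This is only an illustration under stated assumptions: the target color, the tiny frame, and the function names (color_distance, find_closest_pixel) are all hypothetical, and squared RGB difference is just one reasonable way to measure how close two colors are.

```python
def color_distance(a, b):
    """Squared difference between two (R, G, B) tuples.
    Smaller means the colors are more alike."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def find_closest_pixel(image, target):
    """Check every pixel, one at a time, and return the (row, col)
    of the pixel whose color is closest to the target."""
    best_pos = None
    best_dist = None
    for row, line in enumerate(image):
        for col, pixel in enumerate(line):
            dist = color_distance(pixel, target)
            if best_dist is None or dist < best_dist:
                best_pos, best_dist = (row, col), dist
    return best_pos


# Hypothetical target: the RGB value sampled from a bright pink ball.
TARGET_PINK = (255, 20, 147)

# Hypothetical 2x3 frame: mostly green grass, with one pinkish pixel.
frame = [
    [(30, 120, 40), (32, 118, 41), (35, 121, 39)],
    [(31, 119, 42), (240, 35, 150), (33, 120, 40)],
]

position = find_closest_pixel(frame, TARGET_PINK)
print(position)
```

Note that the pinkish pixel is not an exact match for the target, only the closest one, which is why lighting changes or a same-colored jersey can throw this kind of tracker off. Running the same function on every frame of a video would give the ball's position over time.
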
99 comments on “Computer Vision: Crash Course Computer Science #35”

  1. Garrett B. Settles says:


  2. Giorgos Ioak says:

    Just wondering where have you been 😊 Happy to see you again

  3. microbuilder says:

    I totally understood all of this. Yeah, thats it…

  4. Neoshaman Fulgurant says:

    YOLO (you only look once)

  5. Guilherme Moresco says:

    what a sweet world would be one that has computers capable of awareness of their surroundings

  6. SoN says:

    The best online program, don't stop doin it!

  7. brocksprogramming says:

    Way to go Carrie Anne!

  8. Tehcookie vanilla says:


  9. Dan says:

    Seems like a convoluted way to process images.

  10. oldcowbb says:

    thats really convoluted

  11. Daniel Kohwalter says:

    This is by far the greatest course that I had on my entire life about computers. I work with full flight simulators for pilot training and many things that I learnt here became so clear for me… We see many systems in a very superficial way due to those abstraction levels and with those classes I can see what's behind the scene, what's going on in a deeper way.

    Thank you, guys. Thank you very much for sharing all this knowledge and in a way so simple and easy to understand. You're the best!!!

    And I'm recommending the channel for everybody I know that likes computer science on any level of understanding!

  12. Bright Future says:

    Plz leave a link to The Origin of Everything, would love to check it out.

  13. Cubinator73 says:

    A computing device should never change behavior depending on highly subjective factors, it should only do what it is explicitly told to do.

  14. X-Raym says:

    Apart from face recognition, OCR is another nice field of research for 'teaching computers' to see !

  15. Mitchel Paulin says:

    I always thought that face recognition would require some complicated mapping of vectors. Interesting to see it can be accomplished by something as simple as a bit mask

  16. Aaron Fox says:

    Our university's robotics team is currently using OpenCV so our autonomous drone can see and navigate the world. Lots of theory, documentation reading, and pulled hairs come along with computer vision, that's for sure.

  17. Constantinos Patsalos says:

    Make a video on Mercury cycle! Please

  18. Blood Bath and Beyond - Pop Goes Metal Covers says:

    eh, i still would…

  19. Yousif Tareq says:

    Big brother 😎

  20. Mj Goat23 says:

    wheres john green

  21. Amy G says:

    Is the guy in the middle the secret brother Dave?

  22. Michael Gainey says:

    Yay Fei-Fei Li! Watch her TED talk too.

  23. cikif says:

    The computer in the thumbnail looks like the one in Don't Hug Me I'm Scared Part 4. Which makes the topic even scarier.

  24. IceMetalPunk says:

    For anyone who's interested, there's a (relatively) recent system called YOLO: You Only Look Once. Version 2 came out less than a year ago, if I remember right, and basically it uses computer vision techniques to classify many different objects in a scene in real-time video. As in, it's fast enough to fairly accurately detect and label many different objects in an arbitrary scene 24 times per second (24fps is a standard video frame rate). It's super interesting! 😀

  25. Subri Subrika says:

    You guys rock!!!!

  26. Sudo Hyde says:

    I found the narrator very pleasant to listen to. Also the video was very good.

  27. Gregg rulz ok says:

    ….are internet connected microwaves a real thing?

  28. Niko Nissinen says:

    My PC is already quite aware of its surroundings. Usually there's me and there will be a hammer if computer starts to misbehave.

  29. frufi uni says:

    Really interesting !

  30. arjun mohan says:

    On our path to Judgement day.. hehe

  31. BobEckert56 says:

    Machine vision will match ours when we can shrink 1000s of processors each capable of 1000s of petaflops to the size of an eyeball connected to the equivalent of the human brain's vision center.

  32. asdaffewwerqa asafdaqwrad says:

    Your computer will detect when you are happy and start a forced 10GB update to swipe off the smile on your face.

  33. phienixfire says:

    No edge!

  34. Ravindu Mirihana says:

    This is awesome

  35. ka hei chan says:

    First couple seconds of the video, wait a second, that looks familiar, then I realise it’s footage of my hometown.

  36. Gian Luca De Lillo says:

    wonderfully explained

  37. Huntracony says:

    Self driving cars often (also) use LIDAR, which has the great advantage of knowing distances, so the car is able to see in 3d. The (biggest) exception to this is Tesla, which decided that normal cameras work just fine, to which I say sure, but why not make it even better?

  38. RyuImperator says:

    Thanks for the great video!

  39. Shayan Shamsi says:

    Will you guys be uploading after 2 weeks from now on as you did with this video ?

  40. Teresa White says:

    Great lesson. I can't wait 'til next week. Thanks!

  41. Zane Karl says:

    Does anyone know the titles of all the books in the background of the videos? The only ones I can make out are "Ghost in the Wires" and "Linear Systems and Signals".

  42. J M says:

    So the government is watching me through my webcam?

  43. Tahsin Loqman says:

    Great video! I'm taking a Computational Vision course right now. It was nice to know what you were talking about.

  44. Jason Nelson says:

    The clip of the tracking of the fingers, arms, and face of the guy reading from the book makes me think that some day soon there will be a presentation or something where they show a computer detecting sleight of hand in a magic trick. Would be a pretty neat way to show off the accuracy, anyway.

  45. Justin Scott says:

    I've used Photoshop for years, it's really cool to take a look under the hood of image processing.

  46. JimPlaysGames says:

    That Macintosh in the back needs some serious retrobright treatment.

  47. Glove says:

    I think that Fei Fei Le is the best name

  48. 拉瓦 says:

    I want subtitles. Who stole the subtitles?

  49. c K says:

    She spoiled the next video of 3blue1brown! He's literally in the middle of the image recognition by deep learning subject

  50. Gernuts says:

    When my Windows laptop will be able to recognize I'm not in the mood for an update, only then I'll pull that sticky tape off my webcam. That also means I'll never get updates 🙁

  51. Baxtexx says:

    Lol I just imagined this in the next patch of Windows:
    Not that they would ever do that though…

  52. TheFloatingSheep says:

    Thanks god, the cleavage is gone.

  53. انفجارات عنيفة 激しい爆発 says:

    According to this, I should never be asked to update…

  54. Lightbeing says:

    More useful than my whole semester CV course…

  55. 振兴李 says:

    very like this video

  56. José González Núñez says:

    Would you share a link for further reading?

  57. nagalakshmi duvvuri says:

    thank you, this was helpful

  58. Z Z says:

    Carrie Anne you look so cute with your glasses on, you should keep them on for all your videos

  59. cesar brown says:

    This could be where Quantum computers shine. It can analyze all that data all at once basically seeing the bigger picture.

  60. David Kvashenkin says:

    How did you get 147 ? I can't understand.
    -185-186-186+233+233+233 = 142

  61. Ernscht1987 says:

    That's super cool^^ Thank you!!!

  62. Pllutus says:

    Where can i find the sources for this video???

  63. Anthony Osnacz says:

    KinaTrax uses computer vision to record kinematic data on baseball pitchers. Biomarkers are no longer a requirement and data can be tracked accurately in game. Computer vision is revolutionizing the game!

  64. Swati Jain says:

    Mam very nice video,
    Mam please also made full course videos also with very easy explanation & cover only those maths which require for that course.
    Because your explanation is very simple

  65. Anthenor Jr says:

    I would trade all my privacy just so Windows do not ask to install updates when I'm mad!

  66. Edmond A. says:

    Can anybody recommend a minimum hardware requirements for computer vision/object detection?

  67. Håkan Ahlström says:

    isn't it upper left corner?

  68. 1000Marcopeters says:

    "Abstraction is the key to build complex systems"

  69. walexkino papy says:

    Good video Anne.. i need your insight on something… am working on recognizing partial occluded license plate. can you contribute to my research. thanks

  70. Thomas Walder says:

    Hey, i know that place! Sydney Olympic park!

  71. Tueem Syhu says:


  72. ittybittypluto says:

    3:49 how did you get 147? i got 142

  73. Timothy Bernard says:

    This is amazing! Thanks for sharing

  74. Fredo FPV says:

    Convolution just happened to pop out from nowhere. In case you are wondering, convolution is the operation that maps a set of values (also called N-tuple where N stands for the quantity of elements) to another set of values.
    Very simple example:
    1,2,3,4 is a 4-tuple
    +1,+1,+2,+2 is a simple convolution
    2,3,5,6 is a 4-tuple as result of applying the above convolution

  75. Edson Silva says:

    This is probably the best explanation of computer vision I've ever seen in my life.

  76. Roma says:

    *connects a function generator to an oscilloscope in the background for some fun sciency atmosphere *

  77. Games TV says:

    Love to see the passion this woman have for her job.
    I lost my passion somewhere along the way.

  78. Splank Splank says:

    Wow, you did a great job of making something difficult easy to understand! This video was a great help!

  79. Salem Amer says:

    Great !!

  80. IAMTHEO says:

    At 5:52 you forgot to mention the bias value.

  81. Vamsi Mohana says:

    So…. How do you play sudoku

  82. Abdul Roshan says:

    Best videos series ever about computer science,.,, Thank you..

  83. Bongo Cat says:

    Amazon Go is an example

  84. bnfgh123 says:

    When I started watching this video, I did not expect it would actually help me with my physiology course. I finally understand receptive fields 😀

  85. Thomas Yorkshire IV says:

    Paused because I noticed the Ghost in The Wires book on your shelf. Bought this book after a Kevin Mitnick conference I saw last year 🙂

  86. GiveMeCoffee says:

    I really love this show, it's a great way to introduce concepts before having a full lecture at a college class, or to have a wide general idea of what the career path will include.

  87. Marco A Acevedo Z says:

    I was 100% in until 80% of the video. Then, it was like…

  88. Hasan Mondol says:

    Very excellent explanation. Thanks for your videos. Please upload videos on machine learning and artificial intelligence.

  89. Bob Bobety says:

    Awesome video! How exactly are these image processing softwares implemented – would it be a low-level programming language like C, a high-level like Python or would it even be at the hardware level?

  90. Some Guy says:

    1:37 upper left not right

  91. Waleed says:

    I suppose these are the same kernels used in Photoshop

  92. Vignesh says:

    Thanks a lot! It was a great introductory video to computer vision.

  93. Sas Na says:

    very fast please slow for non English speaker

  94. Float Circuit says:

    You're an absolutely brilliant communicator! I'm doing a computer vision specialization on Coursera with the University of Buffalo and your high level intuition just gave me oodles of excitement. I dream of one day developing my own algorithm for real time navigation for data constrained systems. Thanks, really, this was a fabulous primer video, and certainly one I'll show my best friends. ☺️

  95. Samuel Griffin says:

    Great video , very informative.

  96. Abdul Kareem says:

    you are so cute and intelligent too

  97. Melozzo says:

    Designer is a Liverpool FC fan I see.

  98. Zenchiassassin says:

    I love convolutional neural networks

  99. Shovon Saha says:

