Introduction to Computer Vision
What is Computer Vision
Computer vision is the study of how to build machines that interpret images: extracting information and knowledge rather than merely processing pixels. The goal is image and video understanding, such as labeling the objects in a scene and tracking them as they move. This goes beyond image processing (images in → images out); computer vision produces meaning from images.
Applications include analyzing street scenes for autonomous vehicles, interpreting medical images such as X-rays, and recognizing actions in video.
Computer Vision vs Computational Photography
Computational photography focuses on capturing light from a scene to produce photographs or novel visual artifacts. Here image analysis serves capture and display, so the field is partly about building new kinds of cameras and software.
Computer vision focuses on interpreting and analyzing scenes: identifying who and what is present, and what is happening. The two fields overlap, especially in foundational image-processing modules.
Why Study Computer Vision
Images and video are ubiquitous in modern technology. Key application domains:
OCR: license plate readers, handwritten check amounts, postal ZIP code recognition. Once considered a hard problem, OCR is now standard and built into scanners and Adobe Acrobat.
Face detection and recognition: consumer cameras detect faces, detect blinks (Fuji), and trigger the shutter on smiles (Sony Smile Shutter); face recognition also enables camera-based login.
Object recognition: retail loss prevention (Evolution Robotics LaneHawk), augmented reality on mobile devices, and Google Glass.
Special effects and 3D modeling: face scanning, motion capture (e.g., Pirates of the Caribbean), and structure from motion for aerial 3D reconstruction (Google Earth, Microsoft Virtual Earth).
Autonomous vehicles: Mobileye pedestrian detection, sign recognition, and automatic braking. Stanford’s Stanley won the 2005 DARPA Grand Challenge; Google’s self-driving car program followed.
Sports: the Sportvision first-down line in American football uses player/background segmentation to overlay graphics on the field without occluding players.
Vision-based interaction: the Nintendo Wii uses IR camera tracking; the Microsoft Kinect depth sensor performs real-time skeletal estimation via machine learning, enabling gesture-based UIs and robot interaction.
Surveillance: crowd safety monitoring and port security (Siemens).
Medical imaging: registering 3D models (MRI/CT) with live surgical views for augmented reality during surgery.
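The sports item in the list above rests on one idea: classify each pixel as player or background, then draw the graphic only on background pixels. The following is a toy sketch of that idea, not Sportvision’s actual method. It assumes the field has a known, nearly uniform color; the color values, the threshold, and the function names are all hypothetical.

```python
# Toy sketch of drawing a graphic "under" the players.
# Assumptions (not the real broadcast system): the field is a known,
# nearly uniform color, and any pixel far from that color is a player.

FIELD_COLOR = (40, 140, 60)   # hypothetical grass color (R, G, B)
LINE_COLOR = (255, 220, 0)    # hypothetical overlay color
THRESHOLD = 60                # hypothetical color-distance cutoff


def is_field(pixel):
    """Classify a pixel as background if it is close to the field color."""
    distance_sq = sum((p - f) ** 2 for p, f in zip(pixel, FIELD_COLOR))
    return distance_sq <= THRESHOLD ** 2


def draw_line(image, column):
    """Paint a vertical line at `column`, skipping pixels classified as player."""
    output = [list(row) for row in image]
    for r, row in enumerate(image):
        if is_field(row[column]):
            output[r][column] = LINE_COLOR
    return output


# A 4x5 frame of grass with a two-pixel "player" (white jersey) on column 2.
grass, jersey = (42, 138, 61), (240, 240, 240)
frame = [[grass] * 5 for _ in range(4)]
frame[1][2] = jersey
frame[2][2] = jersey

result = draw_line(frame, column=2)
print([row[2] == LINE_COLOR for row in result])  # [True, False, False, True]
```

The line is painted on rows 0 and 3 and skipped where the jersey covers the column, so the player appears to stand on top of the graphic. A real system would need a far more robust segmentation than a single color threshold.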
Why Computer Vision is Hard
Seeing is not the same as measuring pixel values. The visual system constructs a percept — an interpretation that goes beyond raw measurements.
Adelson’s Checker Shadow Illusion
In Adelson’s checker shadow illusion, square A (in direct light) and square B (in shadow) send the same amount of light to the eye, yet humans perceive A as dark and B as light. The brain infers a checkerboard pattern under a cast shadow and “corrects” perceived lightness accordingly: a square that looks this bright while in shadow must be a light-colored surface. A photometer measures equal intensity; the human visual system overrides that measurement with scene understanding.
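The inference can be made concrete with a common simplified model in which the light leaving a matte surface is the product of its reflectance and the illumination falling on it. That model is an assumption introduced here for illustration, and the numbers below are made up rather than measured from the illusion.

```python
# Simplified model (an assumption, not stated in the lecture):
#   luminance = reflectance * illumination
# All numbers are illustrative, not measurements from the actual illusion.

def luminance(reflectance, illumination):
    """Light reaching the eye or a photometer from a matte surface."""
    return reflectance * illumination


def inferred_reflectance(measured_luminance, assumed_illumination):
    """Recover the surface 'paint' by dividing out the assumed illumination."""
    return measured_luminance / assumed_illumination


# Square A: dark paint in direct light. Square B: light paint in shadow.
lum_a = luminance(reflectance=0.25, illumination=1.0)
lum_b = luminance(reflectance=0.5, illumination=0.5)

# A photometer sees no difference between the two squares.
print(lum_a == lum_b)  # True

# A viewer who accounts for the shadow recovers two different surfaces.
print(inferred_reflectance(lum_a, assumed_illumination=1.0))  # 0.25
print(inferred_reflectance(lum_b, assumed_illumination=0.5))  # 0.5
```

The measurement alone cannot separate paint from lighting. Only an assumption about the illumination, here supplied by hand, turns equal luminance into different surface colors.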
Perception is Active Construction
Shadow and motion: in two animations a ball follows the same image trajectory. It appears to recede into the background when its shadow moves with it, but appears to rise when the shadow moves sideways. The ball’s image motion is identical in both cases; only the shadow differs.
Implied 3D from shadows: a static green rectangle appears to lift off a checkerboard when dark pixels (interpreted as cast shadows) shift beneath it — even though no green pixels move at all.
These demonstrations show that human vision resolves ambiguity by constructing scene descriptions. Computer vision must similarly build descriptions from measurements, going beyond pixel-level image processing.
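The ambiguity in the ball demonstration can be sketched with a pinhole camera, under which every 3D point on the same viewing ray lands on the same image point. The pinhole model, the focal length, and the coordinates below are assumptions chosen for illustration; they are not taken from the demonstrations themselves.

```python
# Pinhole projection: a 3D point (X, Y, Z) in camera coordinates maps to
# image coordinates (f*X/Z, f*Y/Z). The focal length and all coordinates
# are illustrative values, not taken from the demonstrations.

FOCAL_LENGTH = 1.0


def project(point):
    """Perspective-project a 3D point onto the image plane."""
    x, y, z = point
    return (FOCAL_LENGTH * x / z, FOCAL_LENGTH * y / z)


def move_to_depth(point, depth):
    """Slide a point along its viewing ray until it sits at the given depth."""
    x, y, z = point
    scale = depth / z
    return (x * scale, y * scale, depth)


# Interpretation 1: the ball rolls away along a floor at height Y = -1.
receding = [(1.0, -1.0, 4.0), (3.0, -1.0, 8.0), (8.0, -1.0, 16.0)]

# Interpretation 2: the ball stays at depth 4 and rises off the floor.
rising = [move_to_depth(p, 4.0) for p in receding]

# Each pair of image points coincides even though the 3D paths differ.
for far, near in zip(receding, rising):
    print(project(far), project(near))

# Heights of the ball under interpretation 2.
print([p[1] for p in rising])  # [-1.0, -0.5, -0.25]
```

Both trajectories produce the same image motion, so the image of the ball alone cannot tell them apart. An additional cue, such as the shadow in the demonstration, is what selects one interpretation.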
Course Structure
The course is organized around three interrelated perspectives:
Computational model: the mathematical theory explaining why a computation is possible (e.g., epipolar geometry for stereo depth).
Algorithm: a concrete procedure implementing the theory under specific assumptions (e.g., matching patches along epipolar lines by minimizing the sum of squared differences, SSD).
Real images: applying algorithms to actual imagery, where practical issues (noise, occlusion, lack of texture) require experimentation and tuning.
Problem sets bridge theory and practice: you implement algorithms, apply them to real images with ground truth, and evaluate their accuracy.
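The stereo example used above for the algorithm perspective can be sketched in a few lines. This is a minimal illustration rather than the course’s reference implementation. It assumes a rectified stereo pair, so the epipolar line of a left-image pixel is the same row in the right image. Images are plain lists of rows of grayscale values, and the function names and parameters are hypothetical.

```python
# Minimal SSD block matching along one scanline.
# Assumptions (not from the lecture): the stereo pair is rectified, so the
# epipolar line of a left-image pixel is the same row in the right image;
# images are lists of rows of grayscale values; all names are hypothetical.

def ssd(left, right, row, col_left, col_right, half):
    """Sum of squared differences between two (2*half+1)-square patches."""
    total = 0
    for dr in range(-half, half + 1):
        for dc in range(-half, half + 1):
            diff = left[row + dr][col_left + dc] - right[row + dr][col_right + dc]
            total += diff * diff
    return total


def best_disparity(left, right, row, col, half=1, max_disparity=4):
    """Return the disparity d minimizing SSD between the left patch at
    (row, col) and the right patch at (row, col - d)."""
    best_d, best_cost = 0, None
    for d in range(max_disparity + 1):
        if col - d - half < 0:  # candidate patch would leave the image
            break
        cost = ssd(left, right, row, col, col - d, half)
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


# Synthetic check: the left image is the right image shifted 2 pixels to
# the right, so the true disparity is 2 wherever the patch is textured.
right = [[(7 * r + 13 * c * c) % 31 for c in range(12)] for r in range(5)]
left = [[row[max(c - 2, 0)] for c in range(12)] for row in right]

print(best_disparity(left, right, row=2, col=6))  # prints 2
```

Plain lists keep the sketch free of dependencies; array libraries are the usual choice in practice. On real images the minimum is rarely this clean, because noise, occlusion, and untextured regions all flatten or corrupt the cost curve.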
Topic Outline
Image processing fundamentals
Camera geometry and models
Multiple views and their relationships
Features — computing and matching across images
Image formation, lightness, and brightness
Motion in images (optical flow)
Object tracking
Classification and recognition (basic pattern recognition; deeper methods require additional machine learning background)
Miscellaneous practical topics
Human vision system