Computer Vision for Robotics

Computer vision gives robots the ability to understand their environment from camera images: detect obstacles, estimate depth, track motion, and recognize objects. It bridges raw pixel data and actionable spatial information.

Why It Matters

Cameras are cheap, information-dense sensors. A single RGB frame typically carries more raw data than an entire LiDAR scan. Visual odometry, object detection, and depth estimation enable autonomous navigation, manipulation, and human-robot interaction — all from commodity hardware.

Pinhole Camera Model

The foundation of geometric vision. A 3D point (X, Y, Z) projects onto the image plane at (u, v):

u = fx · X/Z + cx
v = fy · Y/Z + cy

Camera matrix K = [fx  0  cx]
                  [ 0 fy  cy]
                  [ 0  0   1]

fx, fy = focal length in pixels
cx, cy = principal point (image center)

import numpy as np
 
K = np.array([[500, 0, 320],   # focal 500px, image 640×480
              [0, 500, 240],
              [0,   0,   1]])
 
# Project 3D point to pixel
point_3d = np.array([1.0, 0.5, 3.0])  # X, Y, Z in camera frame
pixel = K @ point_3d / point_3d[2]
print(f"Projects to pixel ({pixel[0]:.0f}, {pixel[1]:.0f})")

Lens distortion (radial and tangential) must be corrected for accurate geometry. OpenCV’s undistort() handles this using calibration parameters.

Depth Estimation

Stereo Vision

Two cameras separated by a known baseline. The horizontal pixel difference (disparity) between matched points gives depth:

depth = focal_length × baseline / disparity

Example: f = 500 pixels, B = 0.12 m, disparity = 50 pixels
depth = 500 × 0.12 / 50 = 1.2 m

# OpenCV stereo block matching
import cv2
import numpy as np
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
raw = stereo.compute(left_gray, right_gray)    # int16, fixed-point (disparity × 16)
disparity = raw.astype(np.float32) / 16.0
valid = disparity > 0                          # unmatched pixels come back <= 0
depth_map = np.zeros_like(disparity)
depth_map[valid] = f * baseline / disparity[valid]

Other Depth Methods

Method           | How                                    | Pros                   | Cons
Stereo           | Two cameras, triangulation             | Passive, outdoor-ready | Needs texture, baseline limits range
Structured light | Project pattern, measure distortion    | Very accurate indoors  | Fails in sunlight (IR interference)
Time-of-flight   | Measure light round-trip time          | Per-pixel depth, fast  | Short range (5–10 m), noisy edges
Monocular CNN    | Neural net estimates depth, one image  | Only needs one camera  | Less accurate, scale-ambiguous

Feature Detection and Matching

Detect distinctive points (corners, blobs) that can be tracked across frames or matched between images.

import cv2
 
# ORB: fast, rotation-invariant, binary descriptor
orb = cv2.ORB.create(nfeatures=500)
kp1, desc1 = orb.detectAndCompute(img1, None)
kp2, desc2 = orb.detectAndCompute(img2, None)
 
# Match descriptors
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = bf.match(desc1, desc2)
matches = sorted(matches, key=lambda m: m.distance)[:50]

Detector   | Speed        | Invariance                  | Use Case
ORB        | Very fast    | Rotation + scale            | Real-time SLAM, embedded
SIFT       | Slow         | Rotation + scale + lighting | Offline matching, panoramas
FAST       | Very fast    | Minimal                     | Real-time corner detection
SuperPoint | Medium (CNN) | Learned invariance          | State-of-the-art visual SLAM

Visual Odometry

Estimate camera/robot motion from sequential images:

1. Detect features in frame N (ORB, FAST)
2. Track/match features to frame N+1
3. Compute Essential matrix E from matched points
4. Decompose E into rotation R and translation t
5. Scale translation using known world measurement or other sensor
6. Accumulate: T_total = T_total × [R|t]

Scale ambiguity: monocular VO can only recover motion up to an unknown scale factor. Solutions: known object size, IMU integration, or stereo camera.

ArUco Markers

Square fiducial markers with known size → precise 6-DOF pose estimation:

import cv2
aruco_dict = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
detector = cv2.aruco.ArucoDetector(aruco_dict)
corners, ids, _ = detector.detectMarkers(image)

# Estimate pose (rotation + translation relative to camera)
# (deprecated in OpenCV >= 4.7 in favor of cv2.solvePnP, but still widely used)
rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(
    corners, marker_size_meters, camera_matrix, dist_coeffs)

Used for: drone landing pads, robot calibration, ground truth for testing, indoor localization.

Optical Flow

Estimate per-pixel motion between frames. Dense flow gives a motion vector for every pixel; sparse flow tracks specific features.

# Lucas-Kanade sparse optical flow
p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200, qualityLevel=0.01, minDistance=7)
p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, p0, None)
# status == 1 marks points tracked successfully from prev frame to current
# velocity = (p1 - p0) / dt

Used for: ground speed estimation (downward-facing camera on drones), motion segmentation, obstacle detection via motion divergence.
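The ground-speed case follows directly from the pinhole model: at altitude Z, a ground point moving v m/s shifts f·v·dt/Z pixels between frames. A sketch with assumed numbers (all values illustrative):

```python
# Metric ground speed from pixel flow, downward-facing camera
f = 500.0          # focal length in pixels
altitude = 2.0     # height above ground in meters (e.g. from a rangefinder)
dt = 1.0 / 30.0    # frame interval at 30 fps
flow_px = 6.0      # mean feature displacement between frames, in pixels

# Invert the projection: v = flow * Z / (f * dt)
ground_speed = flow_px * altitude / (f * dt)
print(f"Ground speed: {ground_speed:.2f} m/s")
```

In practice the mean flow would come from the tracked points above, with outliers rejected first.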

Related Topics

  • SLAM — visual SLAM uses features, odometry, and loop closure
  • Path Planning — vision provides obstacle maps for planning
  • Sensor Fusion — visual odometry fused with IMU for robust estimation
  • IMU and Sensors — camera complements inertial data
  • Kinematics — visual servoing uses vision to guide robot arms