Computer Vision for Robotics

Computer vision gives robots the ability to understand their environment from camera images: detect obstacles, estimate depth, track motion, and recognize objects. It bridges raw pixel data and actionable spatial information.

Why It Matters

Cameras are cheap, information-dense sensors. A single RGB frame typically carries more raw data than an entire LiDAR scan. Visual odometry, object detection, and depth estimation enable autonomous navigation, manipulation, and human-robot interaction — all from commodity hardware.

Pinhole Camera Model

The foundation of geometric vision. A 3D point (X, Y, Z) projects onto the image plane at (u, v):

u = fx · X/Z + cx
v = fy · Y/Z + cy

Camera matrix K = [fx  0  cx]
                  [ 0 fy  cy]
                  [ 0  0   1]

fx, fy = focal length in pixels
cx, cy = principal point (image center)

import numpy as np
 
K = np.array([[500, 0, 320],   # focal 500px, image 640×480
              [0, 500, 240],
              [0,   0,   1]])
 
# Project 3D point to pixel
point_3d = np.array([1.0, 0.5, 3.0])  # X, Y, Z in camera frame
pixel = K @ point_3d / point_3d[2]
print(f"Projects to pixel ({pixel[0]:.0f}, {pixel[1]:.0f})")

Lens distortion (radial and tangential) must be corrected for accurate geometry. OpenCV’s undistort() handles this using calibration parameters.

Depth Estimation

Stereo Vision

Two cameras separated by a known baseline. The horizontal pixel difference (disparity) between matched points gives depth:

depth = focal_length × baseline / disparity

Example: f = 500 pixels, B = 0.12 m, disparity = 50 pixels
depth = 500 × 0.12 / 50 = 1.2 m

# OpenCV stereo block matching
import cv2
import numpy as np
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
raw = stereo.compute(left_gray, right_gray)    # int16, fixed-point (disparity × 16)
disparity = raw.astype(np.float32) / 16.0
valid = disparity > 0                          # unmatched pixels come back <= 0
depth_map = np.zeros_like(disparity)
depth_map[valid] = f * baseline / disparity[valid]

Other Depth Methods

Method           | How                                    | Pros                   | Cons
Stereo           | Two cameras, triangulation             | Passive, outdoor-ready | Needs texture, baseline limits range
Structured light | Project pattern, measure distortion    | Very accurate indoors  | Fails in sunlight (IR interference)
Time-of-flight   | Measure light round-trip time          | Per-pixel depth, fast  | Short range (5–10 m), noisy edges
Monocular CNN    | Neural net estimates depth, one image  | Only needs one camera  | Less accurate, scale-ambiguous

Feature Detection and Matching

Detect distinctive points (corners, blobs) that can be tracked across frames or matched between images.

import cv2
 
# ORB: fast, rotation-invariant, binary descriptor
orb = cv2.ORB.create(nfeatures=500)
kp1, desc1 = orb.detectAndCompute(img1, None)
kp2, desc2 = orb.detectAndCompute(img2, None)
 
# Match descriptors
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = bf.match(desc1, desc2)
matches = sorted(matches, key=lambda m: m.distance)[:50]

Detector   | Speed        | Invariance                  | Use Case
ORB        | Very fast    | Rotation + scale            | Real-time SLAM, embedded
SIFT       | Slow         | Rotation + scale + lighting | Offline matching, panoramas
FAST       | Very fast    | Minimal                     | Real-time corner detection
SuperPoint | Medium (CNN) | Learned invariance          | State-of-the-art visual SLAM

Visual Odometry

Estimate camera/robot motion from sequential images:

1. Detect features in frame N (ORB, FAST)
2. Track/match features to frame N+1
3. Compute Essential matrix E from matched points
4. Decompose E into rotation R and translation t
5. Scale translation using known world measurement or other sensor
6. Accumulate: T_total = T_total × [R|t]

Scale ambiguity: monocular VO can only recover motion up to an unknown scale factor. Solutions: known object size, IMU integration, or stereo camera.

ArUco Markers

Square fiducial markers with known size → precise 6-DOF pose estimation:

import cv2
aruco_dict = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
detector = cv2.aruco.ArucoDetector(aruco_dict)
corners, ids, _ = detector.detectMarkers(image)

# Estimate pose (rotation + translation relative to camera)
# (deprecated in OpenCV >= 4.7 in favor of cv2.solvePnP, but still widely used)
rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(
    corners, marker_size_meters, camera_matrix, dist_coeffs)

Used for: drone landing pads, robot calibration, ground truth for testing, indoor localization.

Optical Flow

Estimate per-pixel motion between frames. Dense flow gives a motion vector for every pixel; sparse flow tracks specific features.

# Lucas-Kanade sparse optical flow
p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200, qualityLevel=0.01, minDistance=7)
p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, p0, None)
# status == 1 marks points tracked successfully from prev frame to current
# velocity = (p1 - p0) / dt

Used for: ground speed estimation (downward-facing camera on drones), motion segmentation, obstacle detection via motion divergence.
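The ground-speed case follows directly from the pinhole model: at altitude Z, a ground point moving v m/s shifts f·v·dt/Z pixels between frames. A sketch with assumed numbers (all values illustrative):

```python
# Metric ground speed from pixel flow, downward-facing camera
f = 500.0          # focal length in pixels
altitude = 2.0     # height above ground in meters (e.g. from a rangefinder)
dt = 1.0 / 30.0    # frame interval at 30 fps
flow_px = 6.0      # mean feature displacement between frames, in pixels

# Invert the projection: v = flow * Z / (f * dt)
ground_speed = flow_px * altitude / (f * dt)
print(f"Ground speed: {ground_speed:.2f} m/s")
```

In practice the mean flow would come from the tracked points above, with outliers rejected first.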

Related Topics

  • SLAM — visual SLAM uses features, odometry, and loop closure
  • Path Planning — vision provides obstacle maps for planning
  • Sensor Fusion — visual odometry fused with IMU for robust estimation
  • IMU and Sensors — camera complements inertial data
  • Kinematics — visual servoing uses vision to guide robot arms