
Computer Vision vs. Sensor Fusion: Who Wins the Self-Driving Car Race?
The central question in autonomous vehicle perception is how much of the sensing work to assign to cameras and learned models, and how much to delegate to active sensors that measure the world directly. It is framed as a debate between two camps, but it is really a question about where the failure modes of each approach fall and whether those failure modes are acceptable for the deployment context.
A camera-based system fails when image quality degrades: low light, rain, direct sun, sensor fouling. A LiDAR and radar-based system fails differently: LiDAR returns are degraded by fog and heavy precipitation, while radar handles those conditions well but has poor angular resolution and cannot resolve semantic content. Neither approach is free of failure modes across all scenarios. The question is which failure mode distribution aligns with the operational design domain and whether the gap can be closed with enough data and compute.
The Vision-Only Approach
A vision-only autonomous driving stack uses cameras as the primary sensor and derives all environmental representation from their output. In a multi-camera system with overlapping fields of view, each camera produces a 2D image stream. The perception stack’s task is to extract 3D environmental structure, detect and classify objects, estimate their velocities, predict their future positions, and produce an occupancy representation sufficient to plan a safe trajectory.
The depth estimation problem is the core difficulty. A single camera cannot directly observe depth; it observes the projection of 3D points onto a 2D plane. Depth must be inferred from monocular depth cues (texture gradient, object size priors, vertical position in the image), from stereo parallax when two cameras observe the same scene point, or from temporal parallax accumulated across frames of a moving camera. All three methods work in favorable conditions. All three degrade in edge cases: a flat road surface with no texture provides no monocular cues; a narrow baseline stereo pair cannot resolve depth for distant objects; temporal parallax fails when the camera is stationary or moving slowly.
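The stereo case can be made concrete. With focal length f (in pixels), baseline B, and disparity d, depth is Z = fB/d, and for a fixed disparity matching error the depth uncertainty grows quadratically with range. A minimal sketch with illustrative numbers (not any specific vehicle's camera geometry):

```python
def stereo_depth(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth from stereo disparity: Z = f * B / d."""
    return focal_px * baseline_m / disparity_px

def depth_error(focal_px: float, baseline_m: float, depth_m: float,
                disparity_err_px: float = 0.5) -> float:
    """Depth uncertainty for a disparity matching error e: dZ ~ Z^2 * e / (f * B)."""
    return depth_m ** 2 * disparity_err_px / (focal_px * baseline_m)

# Illustrative: 1000 px focal length, 12 cm baseline, half-pixel matching error.
f, B = 1000.0, 0.12
print(round(depth_error(f, B, 10.0), 2))   # ~0.42 m uncertainty at 10 m
print(round(depth_error(f, B, 100.0), 1))  # ~41.7 m at 100 m
```

The quadratic growth in the second line is the narrow-baseline problem stated above: a tenfold increase in range costs a hundredfold increase in depth uncertainty.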
Tesla’s approach constructs a volumetric occupancy representation from camera images using a transformer architecture trained on large-scale data. Rather than detecting individual objects and representing them as bounding boxes with associated attributes, the occupancy network predicts which voxels in a 3D grid around the vehicle are occupied, by what category, and with what velocity. This representation is richer than a bounding box output because it handles arbitrary object shapes, partial occlusions, and objects that do not belong to any training class, all of which would be represented as occupied space regardless of category.
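As a rough illustration of the representation itself (not of the learned network that produces it), a sparse voxel occupancy grid can be sketched in a few lines; the voxel size and points below are made up for illustration:

```python
from collections import defaultdict

VOXEL_SIZE = 0.5  # meters per voxel edge (illustrative)

def voxelize(points):
    """Map 3D points to voxel indices. Any voxel containing a point counts as
    occupied regardless of object category -- the property that lets an
    occupancy representation handle objects outside the training classes."""
    occupied = defaultdict(int)
    for x, y, z in points:
        idx = (int(x // VOXEL_SIZE), int(y // VOXEL_SIZE), int(z // VOXEL_SIZE))
        occupied[idx] += 1
    return occupied

# Two nearby returns fall in one voxel; a third lands elsewhere.
grid = voxelize([(1.0, 2.0, 0.1), (1.1, 2.0, 0.1), (5.0, -3.0, 0.5)])
print(len(grid))  # 2 occupied voxels
```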
The specific claimed advantage of this architecture over explicit depth estimation is that the learned representation can use all available cues simultaneously. A transformer that sees all camera views jointly can learn that a shadow pattern at a certain scale and orientation implies a pedestrian at a certain distance, without the system designer having explicitly coded that relationship. This generalization is the source of the case for vision-only: if the training distribution is large enough and diverse enough, the network learns to interpret visual evidence that a rule-based system could not anticipate.
The limitation is that this generalization is statistical. The network performs well on inputs similar to its training distribution and degrades on inputs outside it. The performance distribution is not well characterized for rare events, and the failures are not always graceful. A radar unit or a laser rangefinder fails in ways that are physically predictable and can be designed around. A neural network fails in ways that are difficult to predict from first principles and that require empirical characterization over enormous numbers of miles in diverse conditions.
The hardware case for vision-only is straightforward. Cameras are inexpensive, compact, and mass-produced. A vehicle equipped with eight cameras and a custom inference chip is significantly cheaper to manufacture than a vehicle equipped with LiDAR units, which at automotive volumes remain expensive even after significant price reductions. This cost argument has direct consequences for deployment scale: a lower-cost vehicle can be deployed in larger numbers, generating more real-world data, which feeds back into the training pipeline.
Sensor Fusion
A sensor fusion stack integrates measurements from cameras, LiDAR, radar, IMU, and GPS into a unified environmental model. The motivation is redundancy: each sensor modality has a distinct failure mode, and a system that uses multiple modalities can maintain correct function when any single modality is degraded, provided the remaining modalities provide sufficient information for the task.
LiDAR measures distances to surfaces directly by timing the return of emitted laser pulses. The output is a point cloud, a set of 3D points each with known range and angular position, sampled at rates of 100,000 to several million points per second for rotating scanners. LiDAR point clouds support accurate 3D object detection, fine-grained geometric reconstruction of the environment, and precise localization by matching observed point clouds against a precomputed HD map. The angular resolution of a 64-beam LiDAR at 100-meter range resolves object geometry well enough to classify pedestrians, cyclists, and vehicles reliably, with detection performance that is largely independent of lighting.
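The geometry behind that resolution claim can be checked directly. Assuming an illustrative 0.1-degree horizontal step between returns (the actual figure is sensor-specific), the lateral spacing of points at range R is approximately R times the step angle in radians:

```python
import math

def point_spacing(range_m: float, angular_step_deg: float) -> float:
    """Approximate lateral distance between adjacent returns at a given range
    (small-angle approximation: spacing ~ R * theta)."""
    return range_m * math.radians(angular_step_deg)

# Illustrative 0.1-degree horizontal step; real sensors vary.
spacing = point_spacing(100.0, 0.1)
returns_on_pedestrian = 0.5 / spacing  # roughly 0.5 m wide target
print(round(spacing, 3), round(returns_on_pedestrian, 1))  # 0.175 2.9
```

A handful of returns per scan line, multiplied across the sensor's vertical beams, is what makes a silhouette classifiable; halving the angular step doubles the return count.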
The LiDAR failure modes are weather-related. Water droplets in fog scatter and absorb the emitted pulses before they reach the target, reducing effective range from hundreds of meters to tens of meters in dense fog. Heavy rain causes similar degradation. Wet road surfaces produce specular reflections that confuse the return timing. These failures are well-characterized and detectable: the sensor reports a degraded point cloud with obvious artifacts rather than incorrect readings that look normal.
Radar measures the range of reflective objects from the round-trip delay of the emitted signal and their radial velocity from the Doppler shift of the return. Modern automotive radar operates at 77 GHz and achieves accurate velocity measurements at ranges up to 200 meters with minimal degradation in rain, fog, or darkness. The limitation is angular resolution: a typical automotive radar has a horizontal beamwidth of a few degrees, which is insufficient to resolve the precise lateral positions of nearby objects or to separate objects at similar ranges. Radar alone cannot classify objects reliably. Radar combined with camera or LiDAR provides accurate velocity and range information that neither of those modalities provides as well alone.
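The Doppler relationship is simple enough to state exactly: a target closing at radial velocity v shifts a carrier f_c by f_d = 2 v f_c / c. A sketch, assuming the 77 GHz carrier mentioned above:

```python
C = 299_792_458.0  # speed of light, m/s
F_CARRIER = 77e9   # automotive radar carrier, Hz

def doppler_shift(velocity_mps: float) -> float:
    """Doppler shift produced by a target closing at v m/s: f_d = 2 v f_c / c."""
    return 2.0 * velocity_mps * F_CARRIER / C

def radial_velocity(doppler_shift_hz: float) -> float:
    """Inverse: radial velocity from a measured Doppler shift."""
    return doppler_shift_hz * C / (2.0 * F_CARRIER)

# A car closing at 30 m/s (~108 km/h) shifts the return by about 15.4 kHz.
print(round(doppler_shift(30.0)))  # 15411
```

Because the shift scales with carrier frequency, a higher carrier gives a larger Doppler signature per m/s, which is part of why automotive radar moved toward 77 GHz.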
IMU and GPS handle ego-motion estimation. IMU integration provides pose estimates at high frequency (hundreds of Hz) that bridge the gaps between lower-frequency sensor updates. GPS provides absolute position when available. HD map localization, which matches observed sensor data against a precomputed map, provides centimeter-level accuracy that GPS alone cannot achieve, particularly in environments with tall buildings that degrade GPS signal quality.
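The bridging role of the IMU amounts to dead reckoning between absolute fixes. A deliberately simplified 1D sketch (real systems run full 3D strapdown integration inside an error-state filter; the numbers are illustrative):

```python
def dead_reckon(pos: float, vel: float, accel_samples, dt: float):
    """Euler-integrate IMU accelerations between absolute position fixes.
    Error grows with integration time, which is why the next GPS or
    map-match update must correct the estimate."""
    for a in accel_samples:
        vel += a * dt
        pos += vel * dt
    return pos, vel

# 200 Hz IMU bridging a one-second gap between absolute fixes,
# starting at 10 m/s with a constant 0.5 m/s^2 acceleration.
dt = 1.0 / 200.0
pos, vel = dead_reckon(0.0, 10.0, [0.5] * 200, dt)
print(round(pos, 3), round(vel, 2))  # 10.251 10.5
```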
The fusion architecture combines these modalities in several ways. Kalman filter-based fusion tracks objects across modalities by maintaining a state estimate and uncertainty for each tracked object, predicting its state forward using a motion model, and updating the estimate when new sensor observations are available. The information matrix formulation allows the filter to weight each modality’s contribution by its uncertainty, so a high-confidence LiDAR detection contributes more to the estimate than a low-confidence radar return. Learned fusion architectures, which process the raw outputs of each sensor through a neural network that learns the fusion weights from data, have largely displaced hand-designed filter fusion for object detection, though classical filtering remains important for state estimation and localization.
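In the scalar case, that uncertainty weighting reduces to an inverse-variance weighted average: the fused estimate leans toward the lower-variance sensor. A minimal sketch with made-up numbers (a real tracker maintains full state vectors and covariance matrices):

```python
def fuse_scalar(z1: float, var1: float, z2: float, var2: float):
    """Inverse-variance fusion of two measurements of the same quantity.
    The fused variance is smaller than either input's: agreement between
    sensors increases confidence."""
    w1, w2 = 1.0 / var1, 1.0 / var2
    fused = (w1 * z1 + w2 * z2) / (w1 + w2)
    return fused, 1.0 / (w1 + w2)

# High-confidence LiDAR range (sigma 5 cm) vs. coarse radar range (sigma 50 cm).
est, var = fuse_scalar(50.00, 0.05 ** 2, 50.80, 0.50 ** 2)
print(round(est, 3))  # 50.008 -- pulled only slightly toward the radar reading
```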
The complexity of a fusion stack scales with the number of modalities. Each sensor requires a calibration procedure that establishes its extrinsic pose relative to the vehicle frame and its intrinsic model. Temporal synchronization across sensors that sample at different rates and have different latencies requires careful timestamping. Failure detection for each sensor modality requires monitoring that is specific to that modality’s characteristics. The integration test matrix for a system with five sensor modalities is substantially larger than for a single-modality system.
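Temporal synchronization in practice often means resampling: a camera frame timestamped between two high-rate pose estimates gets an interpolated pose. A minimal scalar sketch (real systems interpolate full 6-DoF poses, with quaternion slerp for rotation; the timestamps are illustrative):

```python
def interpolate_pose(t: float, t0: float, pose0: float,
                     t1: float, pose1: float) -> float:
    """Linearly interpolate one pose component to a sensor timestamp t
    that falls between two pose estimates at t0 and t1."""
    alpha = (t - t0) / (t1 - t0)
    return pose0 + alpha * (pose1 - pose0)

# Camera frame at t = 0.105 s between IMU pose updates at 0.100 s and 0.110 s.
print(round(interpolate_pose(0.105, 0.100, 1.00, 0.110, 1.20), 3))  # 1.1
```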
Localization: Where the Difference Is Largest
Localization quality, the accuracy with which the vehicle knows its own pose in the world, is where the two approaches diverge most significantly for current systems.
Camera-based localization uses visual odometry and, in systems that support it, matching against a visual HD map. Visual odometry accumulates relative pose estimates from frame to frame by tracking feature correspondences, which produces good short-term accuracy but drifts over long distances without absolute corrections. Visual map matching is accurate when the visual appearance of the environment is stable, but visual appearance changes substantially across seasons, lighting conditions, and scene dynamics.
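The drift behavior is easy to quantify in the simplest case: a systematic per-frame translation error compounds linearly with time. A sketch with an illustrative 1 mm-per-frame bias at 30 fps (unbiased noise grows more slowly, roughly with the square root of the frame count):

```python
FPS = 30  # illustrative camera frame rate

def accumulated_drift(bias_per_frame_m: float, seconds: float) -> float:
    """Systematic per-frame odometry error accumulated over time, with no
    absolute corrections (no GPS fix, no map match, no loop closure)."""
    return bias_per_frame_m * FPS * seconds

print(round(accumulated_drift(0.001, 600), 1))  # 18.0 m of drift after 10 minutes
```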
LiDAR-based localization by scan matching against a precomputed HD map provides centimeter-level accuracy in a wide range of conditions, because geometric structure changes less than visual appearance. The HD map is built from LiDAR data and represents the permanent geometric features of the environment: building facades, lane boundaries, guardrails. Matching the current scan against the map is robust to temporary occlusions and to the presence of dynamic objects, because those objects do not appear in the static map and their returns can be rejected as outliers during matching. The prerequisite is that an HD map must exist for the operating area, which constrains the operational design domain to mapped regions.
A vision-only system can operate in unmapped territory; a LiDAR-based system that depends on HD map localization cannot. This is a real operational difference. A vehicle that can navigate reliably in unmapped territory is fundamentally more flexible than one that requires prior mapping, and that flexibility has practical value for scaling to new geographies.
Safety Validation
The validation challenge for vision-only stacks is substantially harder than for modular sensor fusion stacks. A modular stack with independent sensor processing pipelines can be validated at the component level: the LiDAR detection algorithm can be tested against a dataset of labeled LiDAR scans, and its false positive and false negative rates can be characterized. The camera detection algorithm can be validated independently in the same way. The fusion algorithm can then be validated by examining how it combines the independently validated inputs. Each component's failure modes are characterized, and the system's behavior when a component fails can be analyzed by testing with that component's output held at its failure state.
An end-to-end learned system that takes raw camera inputs and outputs driving commands does not decompose this way. The safety properties of the system are properties of the entire input-to-output mapping, not of individually testable components. Demonstrating that the system fails safely in all relevant edge cases requires demonstrating coverage of the input space, which for a high-dimensional camera input is impractical by exhaustive testing. Statistical characterization from fleet data is the practical approach, but it requires enormous fleet scale to accumulate meaningful statistics for rare events, and rare events are precisely the ones that matter most for safety.
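The scale requirement can be made concrete with the "rule of three" from statistics: after N failure-free trials, the 95% upper confidence bound on the per-trial failure rate is approximately 3/N. Applied per mile (the choice of exposure unit here is illustrative):

```python
import math

def miles_required(target_rate_per_mile: float, confidence: float = 0.95) -> float:
    """Failure-free miles needed before the upper confidence bound on the
    failure rate drops below the target.  With zero observed failures the
    bound is N = -ln(1 - confidence) / rate; -ln(0.05) ~ 3 gives the
    'rule of three'."""
    return -math.log(1.0 - confidence) / target_rate_per_mile

# Bounding the rate below one failure per 10^8 miles at 95% confidence:
print(f"{miles_required(1e-8):.2e}")  # 3.00e+08 failure-free miles
```

And that is the bound with zero observed failures; every observed failure pushes the requirement higher, which is why rare-event validation dominates the fleet-scale argument.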
The ISO 26262 functional safety framework and the SOTIF (Safety of the Intended Functionality) standard both assume modular systems where components can be independently analyzed for failure modes and effects. End-to-end neural network stacks fit neither framework cleanly. This creates a certification challenge that is distinct from the technical performance question: even if a vision-only system performs well by empirical metrics, certifying it under existing safety standards requires either demonstrating sufficient statistical evidence from fleet data or developing new validation methodologies.
Where the Two Approaches Stand
Both approaches are in production deployment. Tesla has millions of vehicles in the field generating data for its vision-based system. Waymo operates a robotaxi fleet using a full sensor fusion stack with LiDAR, radar, and cameras in geofenced areas of several cities. Neither has achieved unrestricted Level 4 autonomy at scale.
The vision-only approach benefits from scale: more vehicles means more data, which feeds back into model improvement. Its current limitations are in conditions and scenarios that degrade image quality or require precise geometric understanding that monocular vision cannot provide reliably. The sensor fusion approach achieves more robust perception in adverse conditions and supports the HD map localization that enables reliable autonomous driving in mapped areas, at higher sensor and system cost.
The long-term trajectory depends partly on whether neural network generalization improves to the point where the failure modes of vision-only systems become manageable for unrestricted autonomy, and partly on whether LiDAR costs decrease enough to make sensor fusion economically viable in consumer vehicles rather than only in commercial robotaxi fleets. Neither question has a settled answer.
The practical engineering choice in any specific system design depends on the operational design domain, the acceptable failure mode distribution, the cost and form factor constraints, and the validation methodology available. There is no architecture that is universally correct; there is only the architecture that best fits the actual deployment requirements.
