Jinghang Li*,1, Shichao Li*,2, Qing Lian2, Peiliang Li2, Xiaozhi Chen2, Yi Zhou†,1
1 Neuromorphic Automation and Intelligence Lab (NAIL), Hunan University
2 Zhuoyu Technology
* Equal contribution † Corresponding author: Yi Zhou
Abstract
Recent visual autonomous perception systems achieve remarkable performance with deep representation learning. However, they fail in scenarios with challenging illumination. While event cameras can mitigate this problem, large-scale datasets for developing event-enhanced deep visual perception models in autonomous driving scenes are lacking. To address this gap, we present the eAP (event-enhanced Autonomous Perception) dataset, the largest dataset with event cameras for autonomous perception. We demonstrate how eAP can facilitate the study of different autonomous perception tasks, including 3D vehicle detection and object time-to-contact (TTC) estimation, through deep representation learning. Based on eAP, we demonstrate the first successful use of events to improve a popular 3D vehicle detection network in challenging illumination scenarios. eAP also enables a dedicated study of the representation learning problem of object TTC estimation. We show how a geometry-aware representation learning framework leads to the best event-based object TTC estimation network, which operates at 200 FPS. The dataset, code, and pre-trained models will be made publicly available for future research.
At a Glance
Data Modalities
All sensors are hardware-synchronized at sub-microsecond precision via the IEEE 1588 Precision Time Protocol (PTP) and a 10 Hz trigger signal.
FLIR Blackfly S at 1920×1200, 10 Hz. Auto-exposure with exposure time priority (1–20 ms) for minimal motion blur.
Prophesee EVK4 at 1280×720. Microsecond temporal resolution with high dynamic range, ideal for challenging illumination.
3× Livox TELE-15 (320 m range) + 1× Livox Mid-360 for dense point clouds used in 3D annotation.
u-blox ZED-F9K with RTK (0.2 m position accuracy). Ego-pose via tightly-coupled GNSS-Visual-Inertial fusion (GVINS).
Sensor configuration: the event camera and RGB camera are rigidly mounted with a narrow 3 cm baseline.
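Because the sensors share a common PTP clock and a 10 Hz trigger, the event stream can be sliced into frame-aligned windows purely by timestamp. A minimal sketch of this alignment step (function and argument names are illustrative, not the dataset API):

```python
import numpy as np

def slice_events_by_frame(ev_t, frame_t):
    """Group event indices into windows ending at each frame trigger.

    ev_t    : (N,) sorted event timestamps in seconds
    frame_t : (M,) sorted frame trigger timestamps in seconds
    Returns a list of M index arrays, one per frame, each covering the
    half-open interval (previous trigger, current trigger].
    """
    edges = np.concatenate(([-np.inf], frame_t))
    windows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = np.nonzero((ev_t > lo) & (ev_t <= hi))[0]
        windows.append(idx)
    return windows
```

With a 10 Hz trigger, each window collects roughly 100 ms of events preceding the corresponding RGB exposure, which is a common choice for building per-frame event representations.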
Dataset Details
Diverse driving scenarios spanning highways, urban roads, and low-light conditions across different times of day and weather.
| Region | Distance | Illumination | Sequences (Train/Test) | Duration (Train/Test) |
|---|---|---|---|---|
| Highways | 178.44 km | Sunny | 13 / 3 | 65 / 15 min |
| | | Cloudy | 10 / 1 | 50 / 5 min |
| | | Twilight | 7 / 2 | 35 / 10 min |
| Urban | 52.61 km | Sunny | 5 / 1 | 25 / 5 min |
| | | Cloudy | 4 / 1 | 20 / 5 min |
| | | Twilight | 1 / 1 | 5 / 5 min |
| | | Night | 5 / 2 | 25 / 10 min |
| Low-light | 5.01 km | Night | 1 / 1 | 5 / 5 min |
| Total | 236.06 km | — | 46 / 12 | 230 / 60 min |
Annotations
Pre-labeled via BEVFusion trained on 500k in-house frames, tracked by a 3D Kalman filter, then manually verified and corrected by human annotators.
7-DoF cuboid annotations (x, y, z, l, w, h, yaw) in ego-vehicle coordinate system with front-camera projection.
Consistent tracking IDs across frames with 11-dimensional state vectors (location, orientation, size, velocity, angular velocity).
Per-object ego-relative speed vectors and time-to-contact (TTC) ground truth: τ = min(Z) / v_rel, where min(Z) is the closest annotated object depth and v_rel the ego-relative closing speed.
Full intrinsic/extrinsic calibration via Kalibr and Calib-Anything. Sub-microsecond temporal alignment. Narrow-baseline (<5 px disparity) RGB-event mapping.
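The TTC ground truth defined above (τ = min(Z) / v_rel) can be computed directly from the annotated object depths and relative speed. A minimal sketch with illustrative names (not the dataset API); the sign convention assumed here is that positive v_rel means the object is approaching:

```python
def ttc_ground_truth(z_points, v_rel):
    """Time-to-contact from the closest object depth and closing speed.

    z_points : iterable of per-point object depths (m) in the ego frame
    v_rel    : ego-relative closing speed (m/s); positive = approaching
    Returns tau in seconds; negative tau means the object is receding.
    """
    z_min = min(z_points)
    if v_rel == 0.0:
        return float("inf")  # no relative motion along the depth axis
    return z_min / v_rel
```

Negative τ values produced this way correspond to the "negative (<0s)" TTC range evaluated in the benchmark below.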
3D cuboid annotations projected on RGB and event views across diverse driving scenarios and illumination conditions.
BEV of LiDAR point cloud with 3D boxes and velocity curves of object trajectories.
Projected LiDAR point cloud on RGB image demonstrating calibration quality.
Benchmark
We evaluate on two tasks: object TTC estimation and 3D vehicle detection. Our Garl-TTC achieves state-of-the-art performance at 200 FPS.
| Method | Type | Modality | MiDc (0–3s) | MiDs (3–6s) | MiDl (6–10s) | MiDn (<0s) |
|---|---|---|---|---|---|---|
| FAITH | Model | E | 606.8 | 490.8 | 319.0 | 376.4 |
| ETTCMscaling | Model | E | 402.2 | 279.5 | 263.9 | 207.5 |
| ETTCM6-dof | Model | E | 226.1 | 326.2 | 321.6 | 223.8 |
| CMax | Model | E | 632.8 | 1583.6 | 1528.0 | 1187.0 |
| STRTTC | Model | E | 237.2 | 532.9 | 954.3 | 348.9 |
| Garl-TTC (Ours) | Learning | E+V | 53.1 | 37.6 | 40.6 | 31.3 |
MiD = Motion-in-Depth error (lower is better). TTC ranges: critical (0–3s), small (3–6s), large (6–10s), negative (<0s).
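The per-range columns in the table can be reproduced by binning ground-truth TTC values with the edges given in the caption. A trivial sketch (bin names and edges taken from the caption; boundary handling is an assumption):

```python
def ttc_range(tau):
    """Assign a ground-truth TTC value (seconds) to its evaluation bin."""
    if tau < 0.0:
        return "negative"   # receding objects
    if tau <= 3.0:
        return "critical"
    if tau <= 6.0:
        return "small"
    if tau <= 10.0:
        return "large"
    return None             # beyond 10 s: not evaluated
```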
| Method | FCWD1 RTE↓ | FCWD2 RTE↓ | FCWD3 RTE↓ | eAP MiD↓ |
|---|---|---|---|---|
| ETTCM6-dof (Event) | 15.5 | 18.4 | 19.0 | 265.4 |
| STRTTC (Event) | 9.8 | 11.5 | 14.0 | 408.7 |
| DeepScale (Frame) | 25.4 | 19.6 | 21.7 | 81.9 |
| Garl-TTC (Ours) | 5.2 | 6.1 | 5.4 | 45.0 |
RTE = Relative TTC Error (%, lower is better). FCWD results without fine-tuning demonstrate strong generalization.
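RTE as reported above is the relative error between predicted and ground-truth TTC. A minimal sketch of the standard per-sample definition (the paper's exact averaging over frames and objects may differ):

```python
def relative_ttc_error(tau_pred, tau_gt):
    """Relative TTC error in percent: |tau_pred - tau_gt| / |tau_gt| * 100."""
    return abs(tau_pred - tau_gt) / abs(tau_gt) * 100.0
```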
| Method | Modality | Frames | Driving AP↑ / ATE↓ | HDR AP↑ / ATE↓ |
|---|---|---|---|---|
| Visual-only | V | 1 | 0.510 / 0.497 | 0.403 / 0.584 |
| Event-only | E | 1 | 0.200 / 0.794 | 0.138 / 0.872 |
| Fusion | V+E | 1 | 0.515 / 0.482 | 0.460 / 0.503 |
| Fusion-temporal | V+E | 3 | 0.531 / 0.400 | 0.558 / 0.363 |
AP = Average Precision (higher is better), ATE = Average Translation Error (lower is better). Event fusion notably improves HDR scene performance (Fusion-temporal: +38.5% AP over visual-only).
TTC estimation results: Garl-TTC produces accurate and temporally smooth TTC predictions across diverse scenarios.
3D detection BEV results: event-enhanced fusion improves detection in challenging illumination (HDR) scenarios.
Citation
```bibtex
@misc{li2026eap,
  title         = {Toward Deep Representation Learning for Event-Enhanced
                   Visual Autonomous Perception: the eAP Dataset},
  author        = {Li, Jinghang and Li, Shichao and Lian, Qing
                   and Li, Peiliang and Chen, Xiaozhi and Zhou, Yi},
  year          = {2026},
  eprint        = {2603.16303},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2603.16303},
}
```