Jinghang Li*,1, Shichao Li*,2, Qing Lian2, Peiliang Li2, Xiaozhi Chen2, Yi Zhou†,1
1 Neuromorphic Automation and Intelligence Lab (NAIL), Hunan University
2 Zhuoyu Technology
* Equal contribution † Corresponding author: Yi Zhou
Abstract
Recent visual autonomous perception systems achieve remarkable performance with deep representation learning. However, they fail in scenarios with challenging illumination. While event cameras can mitigate this problem, large-scale datasets for developing event-enhanced deep visual perception models in autonomous driving scenes are lacking. To address this gap, we present the eAP (event-enhanced Autonomous Perception) dataset, the largest dataset with event cameras for autonomous perception. We demonstrate how eAP can facilitate the study of different autonomous perception tasks, including 3D vehicle detection and object time-to-contact (TTC) estimation, through deep representation learning. Based on eAP, we demonstrate the first successful use of events to improve a popular 3D vehicle detection network in challenging illumination scenarios. eAP also enables a dedicated study of the representation learning problem of object TTC estimation. We show how a geometry-aware representation learning framework leads to the best event-based object TTC estimation network, which operates at 200 FPS. The dataset, code, and pre-trained models will be made publicly available for future research.
At a Glance
Data Modalities
All sensors are hardware-synchronized at sub-microsecond precision via the IEEE 1588 Precision Time Protocol (PTP) and a 10 Hz trigger signal.
FLIR Blackfly S at 1920×1200, 10 Hz. Auto-exposure with exposure time priority (1–20 ms) for minimal motion blur.
Prophesee EVK4 at 1280×720. Microsecond temporal resolution with high dynamic range, ideal for challenging illumination.
3× Livox TELE-15 (320 m range) + 1× Livox Mid-360 for dense point clouds used in 3D annotation.
u-blox ZED-F9K with RTK (0.2 m position accuracy). Ego-pose via tightly-coupled GNSS-Visual-Inertial fusion (GVINS).
Sensor configuration: the event camera and RGB camera are rigidly mounted with a narrow 3 cm baseline.
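Because the sensors share a common PTP clock and a 10 Hz trigger, the event stream can be sliced into frame-aligned windows purely by timestamp. A minimal sketch of this alignment step (function and argument names are illustrative, not the dataset API):

```python
import numpy as np

def slice_events_by_frame(ev_t, frame_t):
    """Group event indices into windows ending at each frame trigger.

    ev_t    : (N,) sorted event timestamps in seconds
    frame_t : (M,) sorted frame trigger timestamps in seconds
    Returns a list of M index arrays, one per frame, each covering the
    half-open interval (previous trigger, current trigger].
    """
    edges = np.concatenate(([-np.inf], frame_t))
    windows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = np.nonzero((ev_t > lo) & (ev_t <= hi))[0]
        windows.append(idx)
    return windows
```

With a 10 Hz trigger, each window collects roughly 100 ms of events preceding the corresponding RGB exposure, which is a common choice for building per-frame event representations.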
Dataset Details
Diverse driving scenarios spanning highways, urban roads, and low-light conditions across different times of day and weather.
| Region | Distance | Illumination | Sequences (Train/Test) | Duration (Train/Test) |
|---|---|---|---|---|
| Highways | 178.44 km | Sunny | 13 / 3 | 65 / 15 min |
| | | Cloudy | 10 / 1 | 50 / 5 min |
| | | Twilight | 7 / 2 | 35 / 10 min |
| Urban | 52.61 km | Sunny | 5 / 1 | 25 / 5 min |
| | | Cloudy | 4 / 1 | 20 / 5 min |
| | | Twilight | 1 / 1 | 5 / 5 min |
| | | Night | 5 / 2 | 25 / 10 min |
| Low-light | 5.01 km | Night | 1 / 1 | 5 / 5 min |
| Total | 236.06 km | — | 46 / 12 | 230 / 60 min |
Annotations
Pre-labeled via BEVFusion trained on 500k in-house frames, tracked by a 3D Kalman filter, then manually verified and corrected by human annotators.
7-DoF cuboid annotations (x, y, z, l, w, h, yaw) in ego-vehicle coordinate system with front-camera projection.
Consistent tracking IDs across frames with 11-dimensional state vectors (location, orientation, size, velocity, angular velocity).
Per-object ego-relative speed vectors and time-to-contact (TTC) ground truth: τ = min(Z) / v_rel, where min(Z) is the closest annotated object depth and v_rel the ego-relative closing speed.
Full intrinsic/extrinsic calibration via Kalibr and Calib-Anything. Sub-microsecond temporal alignment. Narrow-baseline (<5 px disparity) RGB-event mapping.
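The TTC ground truth defined above (τ = min(Z) / v_rel) can be computed directly from the annotated object depths and relative speed. A minimal sketch with illustrative names (not the dataset API); the sign convention assumed here is that positive v_rel means the object is approaching:

```python
def ttc_ground_truth(z_points, v_rel):
    """Time-to-contact from the closest object depth and closing speed.

    z_points : iterable of per-point object depths (m) in the ego frame
    v_rel    : ego-relative closing speed (m/s); positive = approaching
    Returns tau in seconds; negative tau means the object is receding.
    """
    z_min = min(z_points)
    if v_rel == 0.0:
        return float("inf")  # no relative motion along the depth axis
    return z_min / v_rel
```

Negative τ values produced this way correspond to the "negative (<0s)" TTC range evaluated in the benchmark below.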
3D cuboid annotations projected on RGB and event views across diverse driving scenarios and illumination conditions.
BEV of LiDAR point cloud with 3D boxes and velocity curves of object trajectories.
Projected LiDAR point cloud on RGB image demonstrating calibration quality.
Benchmark
We evaluate on two tasks: object TTC estimation and 3D vehicle detection. Our Garl-TTC achieves state-of-the-art performance at 200 FPS.
| Method | Type | Modality | MiDc (0–3s) | MiDs (3–6s) | MiDl (6–10s) | MiDn (<0s) |
|---|---|---|---|---|---|---|
| FAITH | Model | E | 606.8 | 490.8 | 319.0 | 376.4 |
| ETTCMscaling | Model | E | 402.2 | 279.5 | 263.9 | 207.5 |
| ETTCM6-dof | Model | E | 226.1 | 326.2 | 321.6 | 223.8 |
| CMax | Model | E | 632.8 | 1583.6 | 1528.0 | 1187.0 |
| STRTTC | Model | E | 237.2 | 532.9 | 954.3 | 348.9 |
| Garl-TTC (Ours) | Learning | E+V | 53.1 | 37.6 | 40.6 | 31.3 |
MiD = Motion-in-Depth error (lower is better). TTC ranges: critical (0–3s), small (3–6s), large (6–10s), negative (<0s).
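The per-range columns in the table can be reproduced by binning ground-truth TTC values with the edges given in the caption. A trivial sketch (bin names and edges taken from the caption; boundary handling is an assumption):

```python
def ttc_range(tau):
    """Assign a ground-truth TTC value (seconds) to its evaluation bin."""
    if tau < 0.0:
        return "negative"   # receding objects
    if tau <= 3.0:
        return "critical"
    if tau <= 6.0:
        return "small"
    if tau <= 10.0:
        return "large"
    return None             # beyond 10 s: not evaluated
```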
| Method | FCWD1 RTE↓ | FCWD2 RTE↓ | FCWD3 RTE↓ | eAP MiD↓ |
|---|---|---|---|---|
| ETTCM6-dof (Event) | 15.5 | 18.4 | 19.0 | 265.4 |
| STRTTC (Event) | 9.8 | 11.5 | 14.0 | 408.7 |
| DeepScale (Frame) | 25.4 | 19.6 | 21.7 | 81.9 |
| Garl-TTC (Ours) | 5.2 | 6.1 | 5.4 | 45.0 |
RTE = Relative TTC Error (%, lower is better). FCWD results without fine-tuning demonstrate strong generalization.
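RTE as reported above is the relative error between predicted and ground-truth TTC. A minimal sketch of the standard per-sample definition (the paper's exact averaging over frames and objects may differ):

```python
def relative_ttc_error(tau_pred, tau_gt):
    """Relative TTC error in percent: |tau_pred - tau_gt| / |tau_gt| * 100."""
    return abs(tau_pred - tau_gt) / abs(tau_gt) * 100.0
```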
| Method | Modality | Frames | Driving AP↑ / ATE↓ | HDR AP↑ / ATE↓ |
|---|---|---|---|---|
| Visual-only | V | 1 | 0.510 / 0.497 | 0.403 / 0.584 |
| Event-only | E | 1 | 0.200 / 0.794 | 0.138 / 0.872 |
| Fusion | V+E | 1 | 0.515 / 0.482 | 0.460 / 0.503 |
| Fusion-temporal | V+E | 3 | 0.531 / 0.400 | 0.558 / 0.363 |
AP = Average Precision (higher is better), ATE = Average Translation Error (lower is better). Event fusion notably improves HDR scene performance (Fusion-temporal: +38.5% AP over visual-only).
TTC estimation results: Garl-TTC produces accurate and temporally smooth TTC predictions across diverse scenarios.
3D detection BEV results: event-enhanced fusion improves detection in challenging illumination (HDR) scenarios.
Citation
```bibtex
@misc{li2026eap,
  title         = {Toward Deep Representation Learning for Event-Enhanced
                   Visual Autonomous Perception: the eAP Dataset},
  author        = {Li, Jinghang and Li, Shichao and Lian, Qing
                   and Li, Peiliang and Chen, Xiaozhi and Zhou, Yi},
  year          = {2026},
  eprint        = {2603.16303},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2603.16303},
}
```