IEEE Transactions on Robotics (T-RO) 2026

Toward Deep Representation Learning for Event-enhanced Visual Autonomous Perception:
the eAP Dataset

Jinghang Li*,1, Shichao Li*,2, Qing Lian2, Peiliang Li2, Xiaozhi Chen2, Yi Zhou†,1

1 Neuromorphic Automation and Intelligence Lab (NAIL), Hunan University

2 Zhuoyu Technology

* Equal contribution    † Corresponding author


Abstract

Recent visual autonomous perception systems achieve remarkable performance with deep representation learning. However, they degrade severely in scenarios with challenging illumination. While event cameras can mitigate this problem, no large-scale event-camera dataset exists for developing event-enhanced deep visual perception models in autonomous driving scenes. To address this gap, we present the eAP (event-enhanced Autonomous Perception) dataset, the largest event-camera dataset for autonomous perception. We demonstrate how eAP facilitates the study of different autonomous perception tasks, including 3D vehicle detection and object time-to-contact (TTC) estimation, through deep representation learning. Based on eAP, we demonstrate the first successful use of events to improve a popular 3D vehicle detection network in challenging illumination scenarios. eAP also enables a dedicated study of the representation learning problem of object TTC estimation. We show how a geometry-aware representation learning framework leads to the best event-based object TTC estimation network to date, operating at 200 FPS. The dataset, code, and pre-trained models will be made publicly available for future research.

At a Glance

Dataset Statistics

58 sequences (46 train / 12 test)
532k+ 3D bounding boxes (7-DoF cuboids)
4.8 h driving duration (290 minutes total)
236 km distance covered (highway + urban)
174k annotated frames (138k train / 36k test)
8 object classes (vehicle + VRU)

Data Modalities

Multi-sensor Synchronized Capture

All sensors are hardware-synchronized at sub-microsecond precision using the IEEE 1588 Precision Time Protocol (PTP) and a shared 10 Hz trigger signal.
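As a minimal sketch of how the shared trigger can be consumed downstream, the snippet below slices an event stream into per-frame windows using the 10 Hz trigger timestamps. The array names (event_ts_us, trigger_ts_us) are hypothetical and do not reflect the released file format.

import numpy as np

def slice_events_per_frame(event_ts_us: np.ndarray,
                           trigger_ts_us: np.ndarray) -> list[slice]:
    """Split a monotonically increasing event timestamp array into
    per-frame slices using the trigger timestamps (microseconds).

    Frame i receives the events in [trigger[i], trigger[i+1]).
    """
    # searchsorted gives, for each trigger, the first event at or after it.
    bounds = np.searchsorted(event_ts_us, trigger_ts_us, side="left")
    return [slice(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]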

RGB Camera

FLIR Blackfly S at 1920×1200, 10 Hz. Auto-exposure with exposure-time priority (1–20 ms) to minimize motion blur.

Event Camera

Prophesee EVK4 at 1280×720. Microsecond temporal resolution with high dynamic range, ideal for challenging illumination.

LiDAR

3× Livox TELE-15 (320 m range) + 1× Livox Mid-360 for dense point clouds used in 3D annotation.

GNSS-IMU

u-blox ZED-F9K with RTK (0.2 m position accuracy). Ego-pose via tightly-coupled GNSS-Visual-Inertial fusion (GVINS).

Sensor configuration: the event camera and RGB camera are rigidly mounted with a narrow 3 cm baseline.

Dataset Details

Comprehensive Driving Coverage

Diverse driving scenarios spanning highways, urban roads, and low-light conditions across different times of day and weather.

Region      Distance     Illumination   Sequences (Train/Test)   Duration (Train/Test)
Highways    178.44 km    Sunny          13 / 3                   65 / 15 min
                         Cloudy         10 / 1                   50 / 5 min
                         Twilight       7 / 2                    35 / 10 min
Urban       52.61 km     Sunny          5 / 1                    25 / 5 min
                         Cloudy         4 / 1                    20 / 5 min
                         Twilight       1 / 1                    5 / 5 min
                         Night          5 / 2                    25 / 10 min
Low-light   5.01 km      Night          1 / 1                    5 / 5 min
Total       236.06 km                   58                       290 min
Train: 46 sequences (138k frames) · Test: 12 sequences (36k frames) · Split at sequence level for fair evaluation.

Annotations

Rich, Multi-dimensional Labels

Pre-labeled with a BEVFusion model trained on 500k in-house frames, tracked by a 3D Kalman filter, then manually verified and corrected by human annotators.

3D Bounding Boxes

7-DoF cuboid annotations (x, y, z, l, w, h, yaw) in ego-vehicle coordinate system with front-camera projection.
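To make the 7-DoF convention concrete, here is a sketch that expands a cuboid into its 8 corners. The axis convention (yaw about the vertical z-axis, l/w/h along the box's local x/y/z axes) is an assumption for illustration and may differ from the released annotation spec.

import numpy as np

def cuboid_corners(x, y, z, l, w, h, yaw):
    """Return the 8 corners (8x3) of a 7-DoF cuboid.

    Assumed convention: (x, y, z) is the box center in the ego frame,
    (l, w, h) extend along the box's local x/y/z axes, and yaw rotates
    about the vertical (z) axis.
    """
    # Corner offsets in the box's local frame.
    dx, dy, dz = l / 2, w / 2, h / 2
    local = np.array([[sx * dx, sy * dy, sz * dz]
                      for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])  # rotation about z
    return local @ R.T + np.array([x, y, z])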

Object Tracking

Consistent tracking IDs across frames with 11-dimensional state vectors (location, orientation, size, velocity, angular velocity).
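The sketch below shows one plausible layout of such an 11-D state (3D location, yaw, 3D size, 3D velocity, angular velocity) and the constant-velocity prediction step a 3D Kalman tracker would use. The exact state ordering and process model of the annotation pipeline are not specified here, so treat both as assumptions.

import numpy as np

# Hypothetical layout of the 11-D track state:
# [x, y, z, yaw, l, w, h, vx, vy, vz, vyaw]
def predict(state: np.ndarray, dt: float) -> np.ndarray:
    """Constant-velocity prediction step (process model only).

    A full tracker would also propagate the covariance and run a
    Kalman update against each new detection; this is just the motion
    model implied by a state with linear and angular velocity.
    """
    F = np.eye(11)
    F[0, 7] = F[1, 8] = F[2, 9] = dt   # position += velocity * dt
    F[3, 10] = dt                      # yaw += angular velocity * dt
    return F @ state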

Velocity & TTC

Per-object ego-relative velocity vectors and time-to-contact (TTC) ground truth: τ = min(Z) / v_rel.
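The stated rule translates directly into code. A minimal sketch, with the sign convention (positive closing speed for approaching objects, hence negative TTC for receding ones, matching the MiDn (<0s) bucket in the benchmark tables) assumed for illustration:

def time_to_contact(min_depth_m: float, closing_speed_mps: float,
                    eps: float = 1e-6) -> float:
    """TTC from the ground-truth rule: tau = min(Z) / v_rel.

    min_depth_m: nearest longitudinal distance to the object (min Z).
    closing_speed_mps: ego-relative closing speed; positive means
    approaching, so a receding object yields a negative TTC.
    """
    if abs(closing_speed_mps) < eps:
        return float("inf")  # effectively no relative motion in depth
    return min_depth_m / closing_speed_mps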

Calibration & Synchronization

Full intrinsic/extrinsic calibration via Kalibr and Calib-Anything. Sub-microsecond temporal alignment. Narrow-baseline (<5 px disparity) RGB-event mapping.
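Because the 3 cm baseline induces less than 5 px of disparity at typical driving depths, the RGB-event mapping can be approximated by a rotation-only (plane-at-infinity) homography. A sketch under that assumption, with the intrinsics K_ev, K_rgb and rotation R_rgb_ev taken from calibration (variable names are ours, not the released toolkit's):

import numpy as np

def event_to_rgb_homography(K_ev: np.ndarray, K_rgb: np.ndarray,
                            R_rgb_ev: np.ndarray) -> np.ndarray:
    """Rotation-only homography mapping event pixels into the RGB image.

    Valid as an approximation because the narrow baseline makes the
    translation-induced disparity negligible (<5 px).
    """
    return K_rgb @ R_rgb_ev @ np.linalg.inv(K_ev)

def warp_points(H: np.ndarray, uv: np.ndarray) -> np.ndarray:
    """Apply a 3x3 homography to Nx2 pixel coordinates."""
    uvh = np.hstack([uv, np.ones((len(uv), 1))]) @ H.T
    return uvh[:, :2] / uvh[:, 2:3]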

Supported Object Classes

Car
Bus
Truck
SUV
Pedestrian
Motorcycle
Bicycle
Tricycle

Future Annotation Extensions

Dense TTC maps
Optical flow
Depth maps
Segmentation masks

3D cuboid annotations projected on RGB and event views across diverse driving scenarios and illumination conditions.

BEV of LiDAR point cloud with 3D boxes and velocity curves of object trajectories.

Projected LiDAR point cloud on RGB image demonstrating calibration quality.

Benchmark

Benchmark Results

We evaluate on two tasks: object TTC estimation and 3D vehicle detection. Our Garl-TTC achieves state-of-the-art accuracy while running at ~200 FPS on embedded hardware.

TTC Estimation on eAP (Motion-in-Depth Error ↓)

Method            Type       Modality   MiDc (0–3s)   MiDs (3–6s)   MiDl (6–10s)   MiDn (<0s)
FAITH             Model      E          606.8         490.8         319.0          376.4
ETTCM (scaling)   Model      E          402.2         279.5         263.9          207.5
ETTCM (6-DoF)     Model      E          226.1         326.2         321.6          223.8
CMax              Model      E          632.8         1583.6        1528.0         1187.0
STRTTC            Model      E          237.2         532.9         954.3          348.9
Garl-TTC (Ours)   Learning   E+V        53.1          37.6          40.6           31.3

MiD = Motion-in-Depth error (lower is better). TTC ranges: critical (0–3s), small (3–6s), large (6–10s), negative (<0s).

Cross-dataset TTC Evaluation (FCWD Benchmark + eAP)

Method            Modality   FCWD1 RTE↓   FCWD2 RTE↓   FCWD3 RTE↓   eAP MiD↓
ETTCM (6-DoF)     Event      15.5         18.4         19.0         265.4
STRTTC            Event      9.8          11.5         14.0         408.7
DeepScale         Frame      25.4         19.6         21.7         81.9
Garl-TTC (Ours)   E+V        5.2          6.1          5.4          45.0

RTE = Relative TTC Error (%, lower is better). FCWD results without fine-tuning demonstrate strong generalization.
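For reference, a minimal implementation of the RTE metric as defined in the footnote; the benchmark's exact aggregation (per-object vs. per-frame averaging, any clipping of small ground-truth TTCs) may differ:

import numpy as np

def relative_ttc_error(tau_pred: np.ndarray, tau_gt: np.ndarray) -> float:
    """Mean relative TTC error in percent: mean(|pred - gt| / |gt|) * 100."""
    return float(np.mean(np.abs(tau_pred - tau_gt) / np.abs(tau_gt)) * 100.0)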

3D Vehicle Detection on eAP

Method            Modality   Frames   Driving AP↑ / ATE↓   HDR AP↑ / ATE↓
Visual-only       V          1        0.510 / 0.497        0.403 / 0.584
Event-only        E          1        0.200 / 0.794        0.138 / 0.872
Fusion            V+E        1        0.515 / 0.482        0.460 / 0.503
Fusion-temporal   V+E        3        0.531 / 0.400        0.558 / 0.363

AP = Average Precision (higher is better), ATE = Average Translation Error (lower is better). Event fusion notably improves HDR-scene performance: Fusion-temporal reaches 0.558 AP vs. 0.403 for visual-only, a +38.5% relative gain.

Runtime Performance

Garl-TTC

RGB Encoder: 7.11 ms
Event Encoder: 7.08 ms
Height Head: 0.15 ms
Total (A100): 21.05 ms
Orin NX 16G (ONNX): 4.55 ms (~200 FPS)

3D Detection

RGB Encoder: 27.48 ms
Event Encoder: 23.35 ms
BEV Encoder: 34.26 ms
Other modules: 10.42 ms
Total (A100): 125.07 ms
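A minimal latency-measurement sketch with ONNX Runtime, in the spirit of the numbers above. The model path and input construction are hypothetical; the released Garl-TTC export (which takes both RGB and event inputs) may use different names, shapes, and precision.

import time
import numpy as np
import onnxruntime as ort

# Hypothetical model path; feed random data into every declared input.
sess = ort.InferenceSession("garl_ttc.onnx",
                            providers=["CUDAExecutionProvider",
                                       "CPUExecutionProvider"])
feed = {i.name: np.random.rand(*[d if isinstance(d, int) else 1
                                 for d in i.shape]).astype(np.float32)
        for i in sess.get_inputs()}

for _ in range(10):                       # warm-up runs
    sess.run(None, feed)

n = 100
t0 = time.perf_counter()
for _ in range(n):
    sess.run(None, feed)
ms = (time.perf_counter() - t0) / n * 1e3
print(f"mean latency: {ms:.2f} ms (~{1000.0 / ms:.0f} FPS)")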

Qualitative Results

TTC estimation results: Garl-TTC produces accurate and temporally smooth TTC predictions across diverse scenarios.

3D detection BEV results: event-enhanced fusion improves detection in challenging illumination (HDR) scenarios.

Team

Authors

Jinghang Li*
Ph.D. Student
Hunan University

Shichao Li*
Senior Research Engineer
ByteDance

Qing Lian
Researcher
Zhuoyu Technology

Peiliang Li
Lead, E2E Self-Driving & Next-Gen Algorithms
Zhuoyu Technology

Xiaozhi Chen
Director of AI Research
Zhuoyu Technology

Yi Zhou
Professor
Hunan University

Citation

Cite Our Work

@misc{li2026eap,
  title         = {Toward Deep Representation Learning for Event-Enhanced
                   Visual Autonomous Perception: the eAP Dataset},
  author        = {Li, Jinghang and Li, Shichao and Lian, Qing
                   and Li, Peiliang and Chen, Xiaozhi and Zhou, Yi},
  year          = {2026},
  eprint        = {2603.16303},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2603.16303},
}