Abstract
Unmanned aerial vehicle (UAV) visual grounding in real-world applications requires localizing a target referred to by language while both the target and the UAV move. Existing UAV grounding datasets mainly focus on images, while the few video-based benchmarks are still dominated by appearance and spatial cues. As a result, they do not adequately capture two key challenges: motion-centric grounding and drastic cross-view appearance changes caused by UAV ego-motion. To address this gap, we introduce MoRe-UAV, a large-scale benchmark for motion-aware visual grounding in UAV videos. MoRe-UAV contains 22,225 video-expression pairs and 7,415,622 annotated frames, covering diverse aerial scenes with moving targets and substantial viewpoint changes. We build the dataset through a scalable human-in-the-loop pipeline that enables efficient annotation with quality control. We establish an initial benchmark on MoRe-UAV with spatio-temporal video grounding methods, multimodal large language models (MLLMs), and hybrid MLLM+tracking pipelines. We further provide a stronger baseline with a Motion-aware Prefix Adapter and a Multi-view Alignment Adapter to enhance motion reasoning and cross-view alignment. Experiments show that existing methods struggle on MoRe-UAV and remain far below human performance, highlighting substantial room for future research on motion-aware and multi-view grounding in UAV videos.