MoRe-UAV: A Large-Scale Benchmark for Motion-Aware Visual Grounding in UAV Videos

Zhipeng Zhang1, Yiheng Zhang1, Wei Suo1, Le Liu1, Ji Wang1, Peng Wang1

1 School of Computer Science, Northwestern Polytechnical University, Xi'an, China

Overview figure of the MoRe-UAV benchmark.

Overview of MoRe-UAV. Top: representative UAV clips with moving targets, camera motion, and large viewpoint changes. Bottom: a typical task example. Unlike conventional UAV grounding that mainly relies on appearance or spatial cues, MoRe-UAV requires reasoning about motion cues under moving viewpoints.

Overview

Abstract

UAV visual grounding in real-world applications requires localizing a target referred to by language while both the target and the UAV move. Existing UAV grounding datasets mainly focus on images, while the few video-based benchmarks are still dominated by appearance and spatial cues. As a result, they do not adequately capture two key challenges: motion-centric grounding and drastic cross-view appearance changes caused by UAV ego-motion. To address this gap, we introduce MoRe-UAV, a large-scale benchmark for motion-aware visual grounding in UAV videos. MoRe-UAV contains 22,225 video-expression pairs and 7,415,622 annotated frames, covering diverse aerial scenes with moving targets and substantial viewpoint changes. We build the dataset through a scalable human-in-the-loop pipeline for efficient annotation with quality control. We establish an initial benchmark on MoRe-UAV with spatio-temporal video grounding methods, multimodal large language models, and hybrid MLLM+tracking pipelines. We further provide a stronger baseline with a Motion-aware Prefix Adapter and a Multi-view Alignment Adapter to enhance motion reasoning and cross-view alignment. Experiments show that existing methods struggle on MoRe-UAV and remain far below human performance, highlighting substantial room for future research on motion-aware and multi-view grounding in UAV videos.

Benchmark

What Makes MoRe-UAV Different

Existing datasets mainly focus on image-level grounding or UAV video grounding with appearance-centric or spatial-centric expressions. MoRe-UAV instead targets motion-aware visual grounding in UAV videos with moving targets, strong viewpoint changes, and frame-level target annotations.

| Dataset | Type | Source | #Videos / Images | #Frames | Expression Focus | Frame-level Ann. | Viewpoint Change | Motion-aware |
|---|---|---|---|---|---|---|---|---|
| AerialVG | Image | VisDrone2019 | 5,000 | -- | Appearance-centric | No | No | No |
| DVGBench | Image | ERA + VisDrone | 2,863 | -- | Appearance-centric | No | No | No |
| UAVIT-1M | Image | Low-altitude datasets | 789K | -- | QA-style | No | No | No |
| UAV-SVG | Video | CapERA + WebUAV-3M | 3,564 | 2.01M | Spatial-centric | Yes | No | No |
| MoRe-UAV (Ours) | Video | Self-collected | 22,225 | 7.41M | Motion-centric | Yes | Yes | Yes |
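
For scale, the totals above imply roughly 334 annotated frames per video-expression pair on average. The snippet below is a quick sanity check using only the numbers reported in the table; it is illustrative and not part of any benchmark toolkit.

frames, pairs = 7_415_622, 22_225  # totals reported for MoRe-UAV above
print(f"{frames / pairs:.1f} annotated frames per pair on average")  # -> 333.7
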
Dataset Construction

Annotation Pipeline

Overview of the annotation pipeline for constructing MoRe-UAV.

The pipeline consists of four stages: privacy and ethical review, trajectory annotation with manual correction, motion-aware expression generation, and dual-review human verification.
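
The control flow of this human-in-the-loop pipeline can be sketched as follows. Every function and field name here is a hypothetical placeholder (no MoRe-UAV annotation toolkit is implied); the sketch only shows how the four stages chain together and where human effort enters the loop.

from dataclasses import dataclass, field

@dataclass
class Clip:
    clip_id: str
    frames: list = field(default_factory=list)  # decoded frames, elided here

def passes_privacy_review(clip: Clip) -> bool:
    return True   # stage 1: privacy and ethical screening (placeholder)

def propose_trajectory(clip: Clip) -> list:
    return []     # automatic tracker proposal to reduce manual effort (placeholder)

def manual_correction(track: list) -> list:
    return track  # stage 2: annotators correct tracker drift (placeholder)

def generate_expression(clip: Clip, track: list) -> str:
    return "the white car turning left at the intersection"  # stage 3 (placeholder)

def dual_review(clip: Clip, track: list, expr: str) -> bool:
    return True   # stage 4: two independent reviewers must both accept the pair

def annotate(clip: Clip):
    if not passes_privacy_review(clip):
        return None  # clips failing the privacy/ethics filter are discarded
    track = manual_correction(propose_trajectory(clip))
    expr = generate_expression(clip, track)
    if dual_review(clip, track, expr):
        return {"video": clip.clip_id, "expression": expr, "boxes": track}
    return None      # rejected pairs go back for re-annotation or are dropped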

Baseline Framework

Motion-Aware Grounding Framework

Overview of the proposed UAV video grounding framework.

Our lightweight benchmark baseline builds on a frozen Qwen2.5-VL backbone with parameter-efficient tuning, using the Motion-aware Prefix Adapter (MPA) for motion-sensitive query modeling and the Multi-view Alignment Adapter (MVA) for cross-view temporal alignment.
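
As a rough illustration of where such adapters could sit, the PyTorch sketch below keeps the backbone features fixed and trains only two small modules. All shapes, internals, and names here are our own assumptions for illustration; this is a minimal sketch, not the authors' released implementation of MPA or MVA.

import torch
import torch.nn as nn

class MotionAwarePrefixAdapter(nn.Module):
    """Hypothetical MPA: pools frame-to-frame feature differences into a
    motion cue and emits prefix tokens conditioned on it."""
    def __init__(self, dim: int, num_prefix: int = 8):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.prefix = nn.Parameter(torch.zeros(num_prefix, dim))

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D) per-frame features from the frozen backbone
        motion = frame_feats[:, 1:] - frame_feats[:, :-1]     # temporal differences
        motion = self.proj(motion).mean(dim=1, keepdim=True)  # (B, 1, D) pooled motion cue
        return self.prefix.unsqueeze(0) + motion              # (B, P, D) motion-conditioned prefix

class MultiViewAlignmentAdapter(nn.Module):
    """Hypothetical MVA: self-attention across frames so that features from
    different viewpoints can be compared despite UAV ego-motion."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        aligned, _ = self.attn(frame_feats, frame_feats, frame_feats)
        return self.norm(frame_feats + aligned)  # residual keeps frozen features intact

# Parameter-efficient tuning: only the adapters receive gradients; the
# backbone (Qwen2.5-VL in the paper, a random tensor here) stays frozen.
feats = torch.randn(2, 16, 256)                  # (batch, frames, dim) dummy features
prefix = MotionAwarePrefixAdapter(256)(feats)    # -> torch.Size([2, 8, 256])
aligned = MultiViewAlignmentAdapter(256)(feats)  # -> torch.Size([2, 16, 256])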

Qualitative Examples

Motion-Aware UAV Grounding

The first four demos highlight representative challenging settings. Below them, a denser gallery shows additional annotated examples from the benchmark.

Citation

BibTeX

@misc{zhang2026moreuav,
  title={MoRe-UAV: A Large-Scale Benchmark for Motion-Aware Visual Grounding in UAV Videos},
  author={Zhipeng Zhang and Yiheng Zhang and Wei Suo and Le Liu and Ji Wang and Peng Wang},
  year={2026},
  howpublished={\url{https://more-uav.github.io/}},
  note={Project page, benchmark description, and release updates.}
}