arxiv:2506.05414

SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing

Published on Jun 4, 2025

Abstract

SAVVY-Bench is a benchmark for 3D spatial reasoning in dynamic audio-visual environments, and SAVVY, a training-free reasoning pipeline, significantly improves the performance of existing AV-LLMs.

3D spatial reasoning in dynamic, audio-visual environments is a cornerstone of human cognition yet remains largely unexplored by existing Audio-Visual Large Language Models (AV-LLMs) and benchmarks, which predominantly focus on static or 2D scenes. We introduce SAVVY-Bench, the first benchmark for 3D spatial reasoning in dynamic scenes with synchronized spatial audio. SAVVY-Bench comprises thousands of relationships involving static and moving objects, and requires fine-grained temporal grounding, consistent 3D localization, and multi-modal annotation. To tackle this challenge, we propose SAVVY, a novel training-free reasoning pipeline that consists of two stages: (i) Egocentric Spatial Tracks Estimation, which leverages AV-LLMs as well as other audio-visual methods to track the trajectories of key objects related to the query using both visual and spatial audio cues, and (ii) Dynamic Global Map Construction, which aggregates multi-modal queried object trajectories and converts them into a unified global dynamic map. Using the constructed map, a final QA answer is obtained through a coordinate transformation that aligns the global map with the queried viewpoint. Empirical evaluation demonstrates that SAVVY substantially enhances the performance of state-of-the-art AV-LLMs, setting a new standard and laying the groundwork for dynamic 3D spatial reasoning in AV-LLMs.
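
To make the final step more concrete, here is a minimal sketch of how a coordinate transformation might align a global map with a queried viewpoint. It is not the authors' implementation: the 2D top-down map, the function names `relative_position` and `side_of`, and the convention that positive bearings are to the viewer's left are all assumptions made for illustration.

```python
import math


def relative_position(target, viewer, facing):
    """Express a global-map point in a viewer-centred frame (hypothetical helper).

    target, viewer: (x, y) positions on an assumed 2D top-down global map.
    facing: (x, y) direction vector the viewer is looking along.
    Returns (distance, bearing_deg), where bearing_deg is 0 straight ahead,
    positive counter-clockwise (viewer's left), in the range [-180, 180).
    """
    dx = target[0] - viewer[0]
    dy = target[1] - viewer[1]
    distance = math.hypot(dx, dy)

    # Angle of the target and of the facing direction in the global frame.
    target_angle = math.degrees(math.atan2(dy, dx))
    facing_angle = math.degrees(math.atan2(facing[1], facing[0]))

    # Rotate into the viewer's frame and wrap into [-180, 180).
    bearing = (target_angle - facing_angle + 180.0) % 360.0 - 180.0
    return distance, bearing


def side_of(bearing_deg, tolerance_deg=1e-9):
    """Coarse left/right/ahead label from a bearing (assumed convention)."""
    if bearing_deg > tolerance_deg:
        return "left"
    if bearing_deg < -tolerance_deg:
        return "right"
    return "ahead"


if __name__ == "__main__":
    # Viewer at the origin facing +y; target up and to the +x side.
    dist, bearing = relative_position((1.0, 1.0), (0.0, 0.0), (0.0, 1.0))
    print(round(dist, 3), round(bearing, 1), side_of(bearing))
```

In this toy setup, a viewer at the origin facing along +y sees a target at (1, 1) on its right, at a bearing of about 45 degrees off the facing direction. A real pipeline would work with 3D, time-varying tracks, which this sketch does not attempt to model.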

Community

We preprocessed the videos and audio following the instructions in the authors' GitHub repo. The audio and video data are available at these two links:
https://drive.google.com/file/d/1AAgrbdBaz4S-By6k_A5fwp5QagXLet6o/view?usp=sharing
https://drive.google.com/file/d/1PT9Nmpd9-OJol88rddmQEwQNICzrfeoB/view?usp=sharing
