Visual Acoustic Matching

Changan Chen¹,⁴  Ruohan Gao²  Paul Calamia³  Kristen Grauman¹,⁴

¹University of Texas at Austin  ²Stanford University  ³Reality Labs Research at Meta  ⁴Meta AI

Abstract

We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment. Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials. To address this novel task, we propose a cross-modal transformer model that uses audio-visual attention to inject visual properties into the audio and generate realistic audio output. In addition, we devise a self-supervised training objective that can learn acoustic matching from in-the-wild Web videos, despite their lack of acoustically mismatched audio. We demonstrate that our approach successfully translates human speech to a variety of real-world environments depicted in images, outperforming both traditional acoustic matching and more heavily supervised baselines.

1. Introduction

The audio we hear is always transformed by the space we are in, as a function of the physical environment’s geometry, the materials of surfaces and objects in it, and the locations of sound sources around us. This means that we perceive the same sound differently depending on where we hear it. For example, imagine a person singing a song while standing on the hardwood stage in a spacious auditorium versus in a cozy living room with shaggy carpet. The underlying song content would be identical, but we would experience it in two very different ways.

For this reason, it is important to model room acoustics to deliver a realistic and immersive experience for many applications in augmented reality (AR) and virtual reality (VR). Hearing sounds with acoustics inconsistent with the scene is disruptive for human perception. In AR/VR, when the real space and virtually reproduced space have different acoustic properties, it causes a cognitive mismatch and the “room divergence effect” damages the user experience [63].

Figure 1. Goal of visual acoustic matching: transform the sound recorded in one space to another space depicted in the target visual scene. For example, given source audio recorded in a studio, re-synthesize that audio to match the room acoustics of a concert hall. (Panel labels: Source Audio, Target Space, Output Audio.)

Creating audio signals that are consistent with an environment has a long history in the audio community. If the geometry (often in the form of a 3D mesh) and material properties of the space are known, simulation techniques can be applied to generate a room impulse response (RIR), a transfer function between the sound source and the microphone that describes how the sound gets transformed by the space. RIRs can then be convolved with an arbitrary source audio signal to generate the audio signals received by the microphone [8, 9, 17, 50, 51]. In the absence of geometry and material information, the acoustical properties can be estimated blindly from audio captured in that room (e.g., reverberant speech), then used to auralize a signal [29, 42, 56]. However, both approaches have practical limitations: the former requires access to the full mesh and material properties of the target space, while the latter gets only limited acoustic information about the target space from the reverberation in the audio sample. Neither uses imagery of the target scene to perform acoustic matching.
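To make the RIR-based pipeline described above concrete, the following minimal sketch convolves a dry source signal with a room impulse response to obtain the signal received at the microphone. It is an illustration only, not the method proposed in this paper. The function names `synthetic_rir` and `auralize`, the 16 kHz sample rate, and the toy RIR model (white noise under an exponential decay that falls by 60 dB over a chosen reverberation time) are all assumptions made for the example; a real RIR would come from measurement or from an acoustic simulation of the room.

```python
import numpy as np


def synthetic_rir(rt60_s, sample_rate=16000, seed=0):
    """Toy RIR (assumption): decaying white noise plus a direct-path impulse."""
    rng = np.random.default_rng(seed)
    n = int(round(rt60_s * sample_rate))
    t = np.arange(n) / sample_rate
    # Amplitude envelope that drops by 60 dB (a factor of 1e-3) over rt60_s.
    envelope = 10.0 ** (-3.0 * t / rt60_s)
    rir = rng.standard_normal(n) * envelope
    rir[0] = 1.0  # direct sound arriving first
    return rir / np.max(np.abs(rir))  # scale so the peak magnitude is 1


def auralize(source, rir):
    """Convolve dry source audio with an RIR to simulate the received audio."""
    return np.convolve(source, rir, mode="full")


sample_rate = 16000

# Dry source: a single click, so the output is simply the RIR itself.
dry = np.zeros(sample_rate // 10)
dry[0] = 1.0

# Two hypothetical rooms that differ only in reverberation time.
rir_room = synthetic_rir(rt60_s=0.3, sample_rate=sample_rate)  # carpeted room
rir_hall = synthetic_rir(rt60_s=1.0, sample_rate=sample_rate)  # large hall

wet_room = auralize(dry, rir_room)
wet_hall = auralize(dry, rir_hall)

# Energy remaining after the first 200 ms: the hall keeps ringing longer.
late = sample_rate // 5
late_energy_room = float(np.sum(wet_room[late:] ** 2))
late_energy_hall = float(np.sum(wet_hall[late:] ** 2))
print(late_energy_room < late_energy_hall)
```

The same dry signal yields two different received signals depending only on the RIR, which is the sense in which the space transforms the sound. For long signals, an FFT-based convolution would typically be faster than direct convolution.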

We propose a novel task: visual acoustic matching. Given an image of the target environment and a source audio clip, the goal is to re-synthesize the audio as if it were recorded in the target environment (see Figure 1). The idea is to transform sounds from one space to another space by altering their scene-driven acoustic signatures. Visual

