













We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment. Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials. To address this novel task, we propose a cross-modal transformer model that uses audio-visual attention to inject visual properties into the audio and generate realistic audio output. In addition, we devise a self-supervised training objective that can learn acoustic matching from in-the-wild Web videos, despite their lack of acoustically mismatched audio. We demonstrate that our approach successfully translates human speech to a variety of real-world environments depicted in images, outperforming both traditional acoustic matching and more heavily supervised baselines.
The audio we hear is always transformed by the space we are in, as a function of the physical environment's geometry, the materials of surfaces and objects in it, and the locations of sound sources around us. This means that we perceive the same sound differently depending on where we hear it. For example, imagine a person singing a song while standing on the hardwood stage in a spacious auditorium versus in a cozy living room with shaggy carpet. The underlying song content would be identical, but we would experience it in two very different ways. For this reason, it is important to model room acoustics to deliver a realistic and immersive experience for many applications in augmented reality (AR) and virtual reality (VR). Hearing sounds with acoustics inconsistent with the scene is disruptive for human perception. In AR/VR, when the real space and virtually reproduced space have different acoustic properties, it causes a cognitive mismatch and the "room divergence effect" damages the user experience [63].

Creating audio signals that are consistent with an environment has a long history in the audio community. If the geometry (often in the form of a 3D mesh) and material properties of the space are known, simulation techniques can be applied to generate a room impulse response (RIR), a transfer function between the sound source and the microphone that describes how the sound gets transformed by the space. RIRs can then be convolved with an arbitrary source audio signal to generate the audio signals received by the microphone [8, 9, 17, 50, 51]. In the absence of geometry and material information, the acoustical properties can be estimated blindly from audio captured in that room (e.g., reverberant speech), then used to auralize a signal [29, 42, 56]. However, both approaches have practical limitations: the former requires access to the full mesh and material properties of the target space, while the latter gets only limited acoustic information about the target space from the reverberation in the audio sample. Neither uses imagery of the target scene to perform acoustic matching.

We propose a novel task: visual acoustic matching. Given an image of the target environment and a source audio clip, the goal is to re-synthesize the audio as if it were recorded in the target environment (see Figure 1). The idea is to transform sounds from one space to another space by altering their scene-driven acoustic signatures. Visual acoustic matching has many potential applications, including smart video editing where a user can inject sounding objects into new backgrounds, film dubbing to make a different actor's voice sound appropriate for the movie scene, audio enhancement for video conference calls, and audio synthesis for AR/VR to make users feel immersed in the visual space displayed to them.

Figure 1. Goal of visual acoustic matching: transform the sound recorded in one space to another space depicted in the target visual scene. For example, given source audio recorded in a studio, re-synthesize that audio to match the room acoustics of a concert hall.
To address visual acoustic matching, we introduce a cross-modal transformer model together with a novel self-supervised training objective that accommodates in-the-wild Web videos having unknown room acoustics.

Our approach accounts for two key challenges: how to faithfully model the complex cross-modal interactions, and how to achieve scalable training data. Regarding the first challenge, different regions of a room affect the acoustics in different ways. For example, reflective glass leads to longer reverberation in high frequencies while absorptive ceilings reduce the reverberation more quickly. Our model provides fine-grained audio-visual reasoning by attending to regions of the image and how they affect the acoustics. Furthermore, to capture the fine details of reverberation effects (which are typically much smaller in magnitude than the direct signal), we use 1D convolutions to generate time-domain signals directly and apply a multi-resolution generative adversarial audio loss.

Regarding the second key challenge, one would ideally have paired training data consisting of a sound sample not recorded in the target space plus its proper acoustic rendering for the scene shown in the target image, i.e., a source and target audio for each visual scene in the training set. However, such a strategy requires either physical access to the pictured environments, or knowledge of their room impulse response functions, either of which severely limits the source of viable training data. Meanwhile, though a Web video does exhibit strong correspondence between its visual scene and the scene acoustics, it offers only the audio recorded in the target space. Accounting for these tradeoffs, we propose a self-supervised objective that automatically creates acoustically mismatched audio for training with Web videos. The key insight is to use dereverberation and acoustic randomization to alter the original audio's acoustics while preserving its content.

We demonstrate our approach on challenging real-world sounds and environments, as well as controlled experiments with realistic acoustic simulations in scanned scenes. Our quantitative results and subjective evaluations via human studies show that our model generates audio that matches the target environment with high perceptual quality, outperforming a state-of-the-art model that has heavier supervision requirements [52] as well as traditional acoustic matching models.
Acoustic matching. The goal of acoustic matching is to transform an audio recording made in one environment to sound as if it were recorded in a target environment. The audio community addresses this task with various approaches depending on what information about the target environment is accessible. If audio recorded in the target environment is provided, blind estimation of two acoustic parameters is sufficient to create simple RIRs that yield plausibly matched audio [15, 18, 29, 38, 42, 65]: the direct-to-reverberant ratio (DRR), which describes the energy ratio of direct arrival sound and reflected sound, and the reverberation time (RT60), the time it takes for a sound to decay by 60 dB. Blind estimation of the room impulse response from reverberant speech has also been explored [54, 62]. In music production, acoustic matching is applied to change the reverberation to emulate that of a target space or processing algorithm [33, 49]. Recent work conditions the target-audio generation on a low-dimensional audio embedding [56]. Unlike any of the above, we introduce and tackle the visual acoustic matching problem, where the target environment is expressed via an input image.
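To make the two parameters concrete, the following is a minimal sketch of how RT60 and DRR could be read off a known impulse response. It is not the blind estimation procedure of the cited works (which operate on reverberant speech, not on the RIR). The synthetic RIR, the -5 dB to -25 dB fitting range, and the 2.5 ms direct-sound window are illustrative assumptions, and all function names are hypothetical.

```python
import numpy as np

def synthetic_rir(rt60, sr=16000, length_s=1.0, seed=0):
    """Toy RIR (assumption): a unit direct impulse plus exponentially decaying noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(sr * length_s)) / sr
    # Amplitude envelope that falls by 60 dB (a factor of 1000) after rt60 seconds.
    decay = 10.0 ** (-3.0 * t / rt60)
    rir = 0.1 * rng.standard_normal(t.size) * decay
    rir[0] = 1.0  # direct-path arrival
    return rir

def estimate_rt60(rir, sr=16000):
    """Schroeder backward integration, then a line fit extrapolated to -60 dB."""
    energy = np.cumsum(rir[::-1] ** 2)[::-1]       # energy remaining after each sample
    edc_db = 10.0 * np.log10(energy / energy[0])   # energy decay curve in dB
    t = np.arange(rir.size) / sr
    mask = (edc_db <= -5.0) & (edc_db >= -25.0)    # assumed fitting range
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)  # dB per second (negative)
    return -60.0 / slope

def direct_to_reverberant_ratio(rir, sr=16000, direct_ms=2.5):
    """Energy around the strongest peak versus everything after it, in dB."""
    peak = int(np.argmax(np.abs(rir)))
    half = int(sr * direct_ms / 1000)
    lo, hi = max(0, peak - half), peak + half + 1
    direct = np.sum(rir[lo:hi] ** 2)
    reverb = np.sum(rir[hi:] ** 2)
    return 10.0 * np.log10(direct / reverb)
```

In this toy setting a longer decay leaves more energy in the tail, so the estimated RT60 grows and the DRR drops as the simulated room becomes more reverberant.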
Visual understanding of room acoustics. The room impulse response (RIR) is the (time-domain) transfer function capturing the room acoustics for arbitrary source stimuli given specific source and receiver/listener positions in an environment. Convolving an RIR with a sound waveform yields the sound of that source in the context of the particular physical space. RIRs are traditionally measured with special equipment in the room itself [26, 53] or simulated with sound propagation models [5, 11, 43]. Recent work explores estimating an RIR from an input image [31, 52], which requires access to paired image and impulse response training data. While video recordings provide a natural source for learning the correspondence between space (captured by the visual stream) and acoustics (captured by the audio stream), they have not been explored in the literature. We show how to leverage Web video data for understanding room acoustics in a self-supervised fashion, obviating the need for expensive paired RIR-image training data. Our results demonstrate the advantages of this self-supervised approach.
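The convolution step mentioned above can be illustrated in a few lines. This is only a sketch: the helper name `auralize` and the choice to trim the output to the input length are assumptions made for the example.

```python
import numpy as np

def auralize(dry, rir):
    """Convolve a dry waveform with a room impulse response.

    The full convolution is len(dry) + len(rir) - 1 samples long; it is
    trimmed here to the input length (an assumption made for convenience).
    """
    wet = np.convolve(dry, rir, mode="full")
    return wet[: dry.size]

# A single click played through an RIR with one echo two samples later.
dry = np.array([1.0, 0.0, 0.0, 0.0])
rir = np.array([1.0, 0.0, 0.5])
wet = auralize(dry, rir)  # the click followed by a half-amplitude echo
```

An RIR consisting of a single unit impulse leaves the signal unchanged, which corresponds to an anechoic recording.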
Audio-visual learning. Recent advances in multi-modal video understanding enable new forms of self-supervised cross-modal feature learning from video [6, 34, 41], object localization [28], and audio-visual speech enhancement and source separation [1, 2, 12, 16, 27, 40, 44, 48, 67, 69]. Work in embodied AI explores acoustic simulations with real visual scans to study audio-visual navigation tasks [11, 14, 19], where an agent moves intelligently based on the visual and auditory observations. However, no prior work investigates the visual acoustic matching task as we propose.
Figure 3. AViTAR model illustration. We extract visual feature sequence V_i from input image I_T with a ResNet-18 [25], and audio feature sequence A_i from input audio A_S with 1D convolutions. V_i and A_i are passed into cross-modal encoders for cross-modal reasoning. The output feature sequence M_i is processed and upsampled with 1D convolutions to recover the output of the same temporal length. Finally, we use a multi-resolution speech GAN loss to guide the audio synthesis to be high fidelity. The acoustics alteration process is applied to the target audio during training if and only if there is no mismatched audio, e.g., on the Acoustic AVSpeech dataset.
the source speech but also the reverberation caused by the environment), and 2) audio recording should be reverberant (so that the physical space has influenced the audio). Cameras in this dataset are typically static, and thus we use single frames and their corresponding audio for this task. See Supp. for details. This yields 113k/3k/3k video clips for train/val/test splits. We refer to this filtered dataset as Acoustic AVSpeech. See Figure 2b.
We present the Audio-Visual Transformer for Audio Generation model (AViTAR) (Figure 3). AViTAR learns to perform cross-modal attention based on sequences of convolutional features of audio and images and then synthesizes the desired waveform Â_T. We first define the audio-visual features (Sec. 5.1) and their cross-modal attention (Sec. 5.2), followed by our approach to waveform generation (Sec. 5.3). Finally, we present our acoustics alteration idea to enable learning from in-the-wild video (Sec. 5.4).

To apply cross-modal attention, we first need to generate sequences of audio and visual features, where each element in the sequence represents features of a part of the input space. For visual sequence generation from image I_T, we use ResNet18 [25] and flatten the last feature map before the pooling layer, yielding the visual feature sequence V_i. For audio feature sequence generation from source audio A_S, we generate audio features A_i from the waveform directly with stacked 1D convolutions. We first use one 1D conv layer to embed the input waveform into a latent space. We then apply a sequence of strided 1D convolutions, each doubling the channel size while downsampling the input sequence. The output audio features are a sequence of vectors of size S, with length downsampled D times from the input. Weight normalization is applied to 1D conv layers. We employ 1D convolutions rather than STFT spectrograms so that the audio features are not limited to one resolution and can be optimized end-to-end to learn the most important features for the visual acoustic matching task.
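The shape bookkeeping of such an audio encoder can be sketched with plain NumPy. Everything numeric here is an assumption for illustration, since the text does not give kernel sizes, strides, channel counts, or the nonlinearity: the sketch uses a kernel-1 embedding, kernel-2 stride-2 downsampling layers, `tanh` activations, and random weights. The names `conv1d`, `weight_norm`, and `encode` are hypothetical.

```python
import numpy as np

def conv1d(x, weight, stride=1):
    """Strided 'valid' 1D convolution. x: (C_in, T); weight: (C_out, C_in, K)."""
    k = weight.shape[2]
    t_out = (x.shape[1] - k) // stride + 1
    # Indices of every sliding window: shape (T_out, K).
    idx = stride * np.arange(t_out)[:, None] + np.arange(k)[None, :]
    windows = x[:, idx]                                  # (C_in, T_out, K)
    return np.einsum("oik,itk->ot", weight, windows)     # (C_out, T_out)

def weight_norm(v, g):
    """Weight normalization: w = g * v / ||v||, with the norm taken per output channel."""
    norm = np.sqrt((v ** 2).sum(axis=(1, 2), keepdims=True))
    return g[:, None, None] * v / norm

def encode(wave, channels=8, num_down=3, seed=0):
    """Embed a waveform, then halve its length and double its channels num_down times."""
    rng = np.random.default_rng(seed)
    x = wave[None, :]                                    # (1, T)
    w = weight_norm(rng.standard_normal((channels, 1, 1)), np.ones(channels))
    x = conv1d(x, w)                                     # embedding layer, (channels, T)
    for _ in range(num_down):
        c = x.shape[0]
        w = weight_norm(rng.standard_normal((2 * c, c, 2)), np.ones(2 * c))
        x = np.tanh(conv1d(x, w, stride=2))              # (2c, T / 2)
    return x.T                                           # sequence of feature vectors

features = encode(np.ones(64))  # 64 samples -> 8 feature vectors of size 64
```

With these assumed settings the sequence length shrinks by a factor of 2 per layer while the feature size grows by the same factor, so the output is a short sequence of wide vectors that can be attended over.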
Prior work often models audio-visual inputs in a simplistic manner by representing the image feature with one single vector and concatenating it with the audio feature [11, 12, 16, 20, 21, 44, 67]. However, for visual acoustic matching, it is important to reason about how different regions of the space contribute to the acoustics differently. For example, a highly reflective glass door leads to longer reverberation time for high frequencies, while absorptive ceilings diminish it quickly. Thus, we propose to attend to image regions to reason about how different image patches contribute to the acoustics, leveraging recent advances on the transformer architecture [24, 30, 60].

For cross-modal attention, we first adopt the conformer variant [24] of encoder blocks, which adds one convolution layer inside the block for modeling local interaction for speech features. Based on this block, we insert one cross-modal attention layer A_cm after the first feed-forward layer, described as follows:

$$A_{cm}(A_i, V_i) = \mathrm{softmax}\left(\frac{A_i V_i^{T}}{\sqrt{S}}\right) V_i, \tag{1}$$

where the attention scores between the two sequences of features A_i and V_i are first calculated by dot-product, scaled by 1/√S, normalized by softmax, and finally used to weight the visual features V_i. This cross-modal attention allows the model to attend to different image region features and reason about how they affect the reverberation. Absolute positional encoding is added to the visual encoding. After passing V_i and A_i through N encoder blocks, we obtain the fused audio-visual feature sequence M_i, which has the same length as A_i.
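The cross-modal attention of Eq. (1) can be written out directly. This sketch follows the equation as stated, with no learned query, key, or value projections and no positional encoding; the function names and the toy sizes in the usage lines are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(A, V):
    """Eq. (1): A is (T_a, S) audio features, V is (T_v, S) visual features."""
    S = A.shape[-1]
    scores = A @ V.T / np.sqrt(S)        # (T_a, T_v) scaled dot products
    weights = softmax(scores, axis=-1)   # each audio step distributes weight over regions
    return weights @ V, weights          # (T_a, S) attended visual features

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))          # 5 audio steps, feature size 4 (toy values)
V = rng.standard_normal((3, 4))          # 3 image regions
attended, weights = cross_modal_attention(A, V)
```

The output has one row per audio step, which is why the fused sequence keeps the length of A_i. If the visual sequence is reduced to a single vector, the softmax has only one entry and every audio step receives that same vector, so region-level reasoning is no longer possible.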
Recent audio-visual work generates audio outputs by inferring spectrograms then using ISTFT reconstruction to obtain a waveform (e.g., [16, 20, 21, 66–68]). While sensible for source separation, where the target signal is a subset of the source signal, ratio mask prediction is inadequate for our task, because reverberation might occupy periods of silence in the input audio and the ratio will be unbounded (as we verify in results). Furthermore, generating audio based on spectrograms is limiting because 1) predicting the coherent phase component remains challenging [3, 13], and 2) the spectrogram has one fixed resolution (one FFT size, hop length, and window size).

Instead, we aim to synthesize time-domain signals directly, skipping the intermediate spectrogram generation step and allowing more flexibility for what losses can be imposed, inspired by recent advances on time-domain speech synthesis [32, 35, 47, 59]. Specifically, with the fused audio-visual feature sequence M_i, we apply a sequence of transposed strided 1D convolutions, each halving the channel size while upsampling the input sequence, which is exactly the reverse operation of the audio encoding. Altogether, we upsample the audio sequence D times and obtain a waveform of the same length as the input.

Next we incorporate a multi-resolution generative loss. We found that directly minimizing a Euclidean distance based loss between the target ground truth audio A_T and the inferred audio Â_T leads to distortion in the generated audio on this task (cf. Figure 5 and Tab. 2). Therefore, to let the model learn how to reverberate the input speech properly, we employ a generative adversarial loss where a set of discriminators operating at different resolutions are trained to identify reverberation patterns and guide the generated audio to sound like real examples. Specifically, we apply an adversarial loss [32] comprised of the generator and discriminator losses:
$$\mathcal{L}_G = \sum_{k=1}^{K} \big( \mathcal{L}_{Adv}(G; D_k) + \lambda_1 \mathcal{L}_{FM}(G; D_k) \big) + \lambda_2 \mathcal{L}_{Mel}(G),$$

$$\mathcal{L}_D = \sum_{k=1}^{K} \mathcal{L}_{Adv}(D_k; G),$$

where each D_k is a sub-discriminator that operates at one of K different scales and periods for distinguishing the fake and real examples. L_Adv is the LS-GAN [39] training objective, which trains the generator to fool the discriminator and trains the discriminator to distinguish real examples from fake ones. For the generator G, a feature matching loss [35] L_FM is used, which is a learned similarity metric measured by the difference in features of the discriminator between a ground truth sample and a generated sample. An additional mel-spectrogram loss L_Mel is imposed on the generator for improving the training efficiency and fidelity of the generated audio. λ_1 and λ_2 are two weighting factors for these two losses. The generator loss L_G and discriminator loss L_D are optimized alternately, competing against each other. For more details, refer to [32].

Figure 4. Acoustics alteration process. Spectrograms of the resulting audio after each step are shown. We first dereverberate the target audio A_T to obtain cleaner audio A_C, randomize its acoustics by applying an impulse response of another environment to obtain A_R, and finally, add Gaussian noise to A_R to create A_S. Notice how the spectral pattern changes in this process.
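The generator and discriminator objectives can be sketched numerically once the sub-discriminator outputs and intermediate features are available as arrays. The least-squares form follows LS-GAN; the L1 form of the feature matching and mel-spectrogram terms is an assumption based on common practice in the cited vocoder work, and the weights `lam_fm` and `lam_mel` are left as arguments because the text does not state their values. All function names are hypothetical.

```python
import numpy as np

def adv_loss_discriminator(d_real, d_fake):
    """LS-GAN discriminator term: push real scores to 1 and fake scores to 0."""
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

def adv_loss_generator(d_fake):
    """LS-GAN generator term: push scores of generated audio to 1."""
    return np.mean((d_fake - 1.0) ** 2)

def feature_matching_loss(feats_real, feats_fake):
    """L1 distance between discriminator features of real and generated audio (assumed form)."""
    return sum(np.mean(np.abs(r - f)) for r, f in zip(feats_real, feats_fake))

def generator_loss(d_fakes, feats_real, feats_fake, mel_real, mel_fake, lam_fm, lam_mel):
    """Sum over the K sub-discriminators, plus a weighted mel-spectrogram L1 term."""
    total = 0.0
    for d_fake, f_real, f_fake in zip(d_fakes, feats_real, feats_fake):
        total += adv_loss_generator(d_fake) + lam_fm * feature_matching_loss(f_real, f_fake)
    return total + lam_mel * np.mean(np.abs(mel_real - mel_fake))

def discriminator_loss(d_reals, d_fakes):
    """Sum of the adversarial terms over the K sub-discriminators."""
    return sum(adv_loss_discriminator(r, f) for r, f in zip(d_reals, d_fakes))
```

A discriminator that scores every real clip 1 and every generated clip 0 has zero loss, and in that state the generator's adversarial term is at 1 per sub-discriminator, which is what drives the alternating updates.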
The training paradigm differs in one important way depending on the source of training data (cf. Sec. 4). For the simulated SoundSpaces data, we have access to an anechoic audio sample A_S as well as the ground truth reverberated sample A_T as it should be rendered in the target environment for a camera seeing view I_T. This means we can train to (implicitly) discover the mapping that takes the target image to an RIR which, when convolved with A_S, yields A_T.

For the in-the-wild video data (AVSpeech), however, we have only A_T and I_T to train, i.e., we only observe sounds that do match their respective views. Thus, to leverage unannotated Web video, we need to create an audio clip that preserves the target audio content but has mismatched acoustics. Figure 4 illustrates the steps for this process. First we strip away the original acoustics of the target en-
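The three alteration steps summarized in the Figure 4 caption (dereverberate, apply another environment's impulse response, add Gaussian noise) can be laid out as a small pipeline. This is a sketch under stated assumptions: `dereverberate` is a placeholder that returns its input unchanged, standing in for whatever dereverberation model is actually used; the bank of impulse responses and the signal-to-noise parameter `snr_db` are hypothetical choices for the example.

```python
import numpy as np

def dereverberate(audio):
    """Placeholder (assumption): a real system would apply a dereverberation model here."""
    return audio.copy()

def randomize_acoustics(clean, rir_bank, rng):
    """Convolve with an impulse response drawn at random from another environment."""
    rir = rir_bank[rng.integers(len(rir_bank))]
    return np.convolve(clean, rir)[: clean.size]

def add_noise(audio, snr_db, rng):
    """Add white Gaussian noise at the requested signal-to-noise ratio."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return audio + rng.standard_normal(audio.size) * np.sqrt(noise_power)

def make_mismatched_source(a_t, rir_bank, snr_db=30.0, seed=0):
    """A_T -> A_C -> A_R -> A_S, mirroring the steps named in the Figure 4 caption."""
    rng = np.random.default_rng(seed)
    a_c = dereverberate(a_t)                        # cleaner audio A_C
    a_r = randomize_acoustics(a_c, rir_bank, rng)   # mismatched acoustics A_R
    return add_noise(a_r, snr_db, rng)              # training input A_S
```

During training the pair (A_S, I_T) would serve as input and the original A_T as the target, so the model cannot simply copy the acoustics already present in its input and has to rely on the image.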
                          SoundSpaces-Speech                              Acoustic AVSpeech
                          Seen                    Unseen                  Seen             Unseen
Method                    STFT   RTE (s)  MOSE    STFT   RTE (s)  MOSE    RTE (s)  MOSE    RTE (s)  MOSE
Input audio               1.192  0.331    0.617   1.206  0.356    0.611   0.387    0.658   0.392    0.
Blind Reverberator [61]   1.338  0.044    0.312   -      -        -       -        -       -        -
Image2Reverb [52]         2.538  0.293    0.508   2.318  0.317    0.518   -        -       -        -
AV U-Net [20]             0.638  0.095    0.353   0.658  0.118    0.367   0.156    0.570   0.188    0.
AViTAR w/o visual         0.862  0.140    0.217   0.902  0.186    0.236   0.194    0.504   0.207    0.
AViTAR                    0.665  0.034    0.161   0.822  0.062    0.195   0.144    0.481   0.183    0.

Table 1. Results on the SoundSpaces-Speech and Acoustic AVSpeech datasets for Seen and Unseen environments. All input audio at test time is novel (unheard during training). Note that the STFT metric is applicable only for SoundSpaces, where we can access the ground truth A_T's spectrogram. For all metrics, lower values are better. Standard errors for STFT, RTE and MOSE are all less than 0.04, 0.013 s and 0.01 on SoundSpaces-Speech. Standard errors for RTE and MOSE are all less than 0.005 s and 0.01 on Acoustic AVSpeech.
Table 1 (left) shows the results. As expected, the clean input audio baseline does poorly because it does not account for the target environment. Our AViTAR model has the lowest RT60 error and MOS error, indicating that it best predicts the correct acoustics from images, injects them into the speech, and synthesizes high-quality audio. The AV U-Net baseline has slightly lower STFT distance than ours, likely because its training objective is to minimize STFT distance. However, it has higher perceptual errors (RTE and MOSE). Image2Reverb's [52] high errors reveal the difficulty of our task and data, and its inapplicability to AVSpeech highlights our model's self-supervised training advantage. Despite having the estimated RT60 as input (and thus having low RT60 error), Blind Reverberator's STFT and MOS errors are much higher than AViTAR's, showing that images are a promising way to characterize room acoustics beyond the traditional RT60. Plus, its inapplicability for the other scenarios highlights fundamental advantages of AViTAR. Without access to visual information ("w/o visual"), AViTAR can only learn to add an average amount of reverberation to the input audio; this confirms that our model successfully learns the acoustics from the visual scene. Although this variant has higher RT60 error than AV U-Net, its MOS error is lower because the audio quality is better. See Supp. video for examples.
Ablations. Table 2 shows results for ablations on unseen images. For the model architecture, to understand if attending to different image regions with cross-modal attention is helpful, we train the full model with the length of visual feature sequence reduced to one by mean pooling the final ResNet feature map ("w/ pooled visual feature"). This model underperforms the full model on both STFT and RT60 metrics, showing that the audio-visual attention leads to a better visual understanding of room acoustics. Next we ablate the generative loss and replace it with the non-generative multi-resolution STFT loss [35] ("w/o generative loss"), which slightly improves the STFT error but leads to a large drop on the acoustics recovery and speech quality. Despite being multi-resolution, without learnable discriminators to learn to model those fine reverberation details, the audio quality gets worse. See Supp. for GAN loss ablations.

The synthetic dataset provides access to meta information useful to evaluate whether and how much AViTAR reasons about different visual properties. The location of the sound source matters for acoustics because it directly influences acoustic characteristics like the direct-to-reverberant ratio (DRR). When we remove the 3D humanoid from the scene ("w/o human") in all test images, all error metrics increase, which indicates that our model reasons about the location of the sound source in the image for accurate acoustic matching. To understand if the model learns meaningful information from the visuals, we replace the target image with a random image ("w/ random image"); this significantly harms our model's performance.

AViTAR                     STFT    RTE (s)   MOSE
Full model                 0.822   0.062     0.
w/ pooled visual feature   0.850   0.067     0.
w/o generative loss        0.777   0.081     0.
w/o human                  0.884   0.139     0.
w/ random image            0.940   0.236     0.

Table 2. Ablations on model design and data.
Next, we train our model on the in-the-wild AVSpeech videos, and test it on novel clean speech clips from LibriSpeech [45] (A_S) paired with target images (I_T) from AVSpeech. Here we do not have ground truth for the target speech, so we evaluate with RTE and MOSE.

Table 1 (right) shows the results. Our proposed AViTAR model achieves the lowest RT60 error compared to all baselines. This shows our model trained in its self-supervised fashion successfully generalizes to novel images and novel audio, and demonstrates we can do acoustic matching even for non-anechoic inputs. AViTAR's MOS error is also the lowest compared to all baselines, showing that it is able to synthesize high-fidelity audio while injecting the proper amount of reverberation into the speech. The absolute errors on AVSpeech are higher than on SoundSpaces, which makes sense because the YouTube imagery is more variable, and it has a narrower field of view and no depth, making the geometry and materials of the scene only partly visible. See Supp. for sim2real generalization.

[Figure 5 panel labels: Image, Input, Image2Reverb, AV U-Net, AViTAR, GT Target. Estimated RT60 in Row 2: Input 0.01 s, Office 0.34 s, Garage 0.40 s, Auditorium 0.58 s.]

Figure 5. Qualitative predicted audio. For all audio clips, we compute the magnitude spectrogram, convert the magnitude to dB, and plot the spectrogram with x-axis spanning from 0 to 1.28 s (left to right) and y-axis from 0 to 3000 Hz (bottom to top). Row 1: SoundSpaces-Speech example where the target space is a large empty room with a lot of reverberation. Our model predicts the audio closest to the target clip. AV U-Net's spectrogram is too smoothed compared to ours and misses some fine reverb details, which leads to perceptual distortion. Row 2: examples on Acoustic AVSpeech (unseen images). We feed one clean audio clip to match three different scenarios (office, garage, auditorium). From left to right, the audio spectrogram becomes more reverberant as phoneme patterns get extended and blurred on the temporal axis (est. RT60 times shown). NB: AViTAR processes waveforms, not spectrograms; here they are for visualization.

Acoustics Alteration                  Seen    Unseen
Dereverb. + Randomization + Noise     0.144   0.
Dereverb. + Randomization             0.178   0.
Dereverb. + Noise                     0.170   0.
Dereverb.                             0.230   0.
A_T + Randomization + Noise           0.236   0.

Table 3. Ablations on acoustics alteration. RTE is reported.
Ablations on acoustic alteration. Table 3 shows ablations on the proposed acoustics alteration strategy. In short, all three steps are necessary to create an acoustic mismatch with the image, thereby forcing the model to recover the correct acoustics based on the image and allowing better generalization to novel sounds. See Supp. for details.
User study. To supplement the quantitative metrics and directly capture the perceptual quality of the generated samples, we next conduct a user study. We show participants the image of the target environment I_T, the accompanying ground truth audio clip A_T as reference, and paired audio clips Â_T generated by AViTAR and each baseline. We ask participants to select the clip that most sounds as if it were recorded in the target environment and best matches the reverberation in the given clip. We select 30 reverberant examples from SoundSpaces-Speech and AVSpeech and ask 30 participants to complete the assignment on MTurk.

Table 4 shows the resulting preference scores. Compared to each baseline, AViTAR is always preferred. Note that no participant has a background in acoustics, and some might simply pick the one that sounds "clean" rather than having the correct room acoustics. This may be the reason even the anechoic input has a higher preference score than the U-Net model. Despite the lack of domain knowledge, participants still consistently favor our model over other baselines.

                     SoundSpaces       AVSpeech
Input Speech         42.1% / 57.9%     40.1% / 59.9%
Image2Reverb [52]    25.9% / 74.1%     - / -
AV U-Net [20]        29.8% / 70.2%     27.2% / 72.8%
AViTAR w/o visual    39.6% / 60.4%     46.3% / 53.9%

Table 4. User study results. X%/Y% indicates among all paired examples for this baseline and AViTAR, X% of participants prefer this baseline while Y% prefer AViTAR.

Qualitative examples. Figure 5 shows example outputs. Please see the Supp. video to gauge the audio quality.
We proposed the visual acoustic matching task and introduced the first model to address it. Given an image and audio clip, our method injects realistic room acoustics to match the target environment. Our results validate their realism with both objective and perceptual measures. Importantly, the proposed model is trainable with unannotated, in-the-wild Web videos. In future work we aim to extend our model to leverage the dynamics in target visual scenes in video. We discuss potential societal impact in Supp.

Acknowledgements. UT Austin is supported in part by a gift from Google and the IFML NSF AI Institute.
[30] Alexander Kolesnikov, Alexey Dosovitskiy, Dirk Weissenborn, Georg Heigold, Jakob Uszkoreit, Lucas Beyer, Matthias Minderer, Mostafa Dehghani, Neil Houlsby, Sylvain Gelly, Thomas Unterthiner, and Xiaohua Zhai. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2021.
[31] Homare Kon and Hideki Koike. Estimation of late reverberation characteristics from a single two-dimensional environmental image using convolutional neural networks. In Journal of the Audio Engineering Society, 2019.
[32] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. In NeurIPS, 2020.
[33] Junghyun Koo, Seungryeol Paik, and Kyogu Lee. Reverb conversion of mixed vocal tracks using an end-to-end convolutional deep neural network. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 81–85. IEEE, 2021.
[34] Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In NeurIPS, 2018.
[35] Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, and Aaron Courville. Melgan: Generative adversarial networks for conditional waveform synthesis. In NeurIPS, 2019.
[36] Yan-Bo Lin and Yu-Chiang Frank Wang. Audiovisual transformer with instance attention for audio-visual event localization. In ACCV, 2020.
[37] Chen-Chou Lo, Szu-Wei Fu, Wen-Chin Huang, Xin Wang, Junichi Yamagishi, Yu Tsao, and Hsin-Min Wang. Mosnet: Deep learning based objective assessment for voice conversion. In Proc. Interspeech 2019, 2019.
[38] Wolfgang Mack, Shuwen Deng, and Emanuël AP Habets. Single-channel blind direct-to-reverberation ratio estimation using masking. In INTERSPEECH, pages 5066–5070, 2020.
[39] Xudong Mao, Qing Li, Haoran Xie, Raymond Y.K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. arXiv preprint arXiv:1611.04076, 2016.
[40] Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, and Jesper Jensen. An overview of deep-learning-based audio-visual speech enhancement and separation. In arXiv, 2020.
[41] Pedro Morgado, Yi Li, and Nuno Vasconcelos. Learning representations from audio-visual spatial alignment. In NeurIPS, 2020.
[42] Prateek Murgai, Mark Rau, and Jean-Marc Jot. Blind estimation of the reverberation fingerprint of unknown acoustic environments. In Audio Engineering Society Convention 143. Audio Engineering Society, 2017.
[43] D.T. Murphy, Antti Kelloniemi, Jack Mullen, and Simon Shelley. Acoustic modeling using the digital waveguide mesh. Signal Processing Magazine, IEEE, 24:55–66, 04
[44] Andrew Owens and Alexei A Efros. Audio-visual scene analysis with self-supervised multisensory features. In ECCV, 2018.
[45] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5206–5210. IEEE, 2015.
[46] Mandela Patrick, Yuki M Asano, Bernie Huang, Ishan Misra, Florian Metze, Joao Henriques, and Andrea Vedaldi. Space-time crop & attend: Improving cross-modal video representation learning. arXiv preprint arXiv:2103.10211, 2021.
[47] Ryan Prenger, Rafael Valle, and Bryan Catanzaro. Waveglow: A flow-based generative network for speech synthesis. In arXiv, 2018.
[48] Mostafa Sadeghi, Simon Leglaive, Xavier Alameda-Pineda, Laurent Girin, and Radu Horaud. Audio-visual speech enhancement using conditional variational auto-encoders. In IEEE/ACM Transactions on Audio, Speech and Language Processing, 2020.
[49] Andy Sarroff and Roth Michaels. Blind arbitrary reverb matching. In Proceedings of the 23rd International Conference on Digital Audio Effects (DAFx-2020), 2020.
[50] Lauri Savioja and U Peter Svensson. Overview of geometrical room acoustic modeling techniques. The Journal of the Acoustical Society of America, 138(2):708–730, 2015.
[51] Lauri Savioja and Ning Xiang. Simulation-based auralization of room acoustics. Acoust. Today, 16(4):48–55, 2020.
[52] Nikhil Singh, Jeff Mentch, Jerry Ng, Matthew Beveridge, and Iddo Drori. Image2reverb: Cross-modal reverb impulse response synthesis. In ICCV, 2021.
[53] Guy-Bart Stan, Jean-Jacques Embrechts, and Dominique Archambeau. Comparison of different impulse response measurement techniques. Journal of the Audio Engineering Society, 50(4):249–262, April 2002.
[54] Christian Steinmetz, Vamsi Krishna Ithapu, and Paul Calamia. Filtered noise shaping for time domain room impulse response estimation from reverberant speech. In 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021.
[55] Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, et al. The Replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797,
Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Florence, Italy, 7 2019. Association for Computational Linguistics.
[59] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. In arXiv, 2016.
[60] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[61] Vesa Välimäki, Julian Parker, Lauri Savioja, Julius O. Smith, and Jonathan Abel. More than 50 years of artificial reverberation. In 60th International Conference: DREAMS, 2016.
[62] Sanna Wager, Keunwoo Choi, and Simon Durand. Dereverberation using joint estimation of dry speech signal and acoustic system. arXiv preprint arXiv:2007.12581, 2020.
[63] Stephan Werner, Florian Klein, Annika Neidhardt, Ulrike Sloma, Christian Schneiderwind, and Karlheinz Brandenburg. Creation of auditory augmented reality using a position-dynamic binaural synthesis system - technical components, psychoacoustic needs, and perceptual evaluation. Applied Sciences, 11(3), 2021.
[64] Fei Xia, Amir R. Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson Env: real-world perception for embodied agents. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on. IEEE,