FRIEREN: Efficient Video-to-Audio Generation with Rectified Flow Matching

Anonymous Author

Abstract. Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video, and it remains challenging to build V2A models with high generation quality, efficiency, and visual-audio temporal synchrony. We propose Frieren, a V2A model based on rectified flow matching. Frieren regresses the conditional transport vector field from noise to spectrogram latent along straight paths and samples by solving an ODE, outperforming autoregressive and score-based models in audio quality. By employing a non-autoregressive vector field estimator based on a feed-forward transformer and channel-level cross-modal feature fusion with strong temporal alignment, our model generates audio that is highly synchronized with the input video. Furthermore, through reflow and one-step distillation with a guided vector field, our model can generate decent audio in a few sampling steps, or even a single one. Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment on VGGSound, with alignment accuracy reaching 97.22% and a 6.2% improvement in inception score over the strong diffusion-based baseline. Audio samples are available at http://frieren-v2a.github.io.
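To make the link between straight transport paths and few-step sampling concrete, here is a minimal sketch of Euler ODE sampling. It is an illustration under stated assumptions, not the Frieren implementation: the names `euler_sample` and `toy_field` are hypothetical, t = 0 is assumed to be noise and t = 1 the data end of the path, the video condition and guidance are omitted, and the learned estimator is replaced by a toy field whose paths are exactly straight.

```python
def euler_sample(vector_field, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler updates."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = vector_field(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for a trained estimator (assumption): every path heads straight
# to one fixed target, so the velocity along a path is constant.
target = [1.0, -2.0, 0.5]


def toy_field(x, t):
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]


noise = [0.3, 0.1, -0.7]
one_step = euler_sample(toy_field, noise, num_steps=1)
many_steps = euler_sample(toy_field, noise, num_steps=25)
print(one_step)
print(many_steps)
```

Because the toy paths are perfectly straight, one Euler step lands on the same point as 25 steps, up to floating-point error. A learned vector field is only approximately straight, which is the gap that reflow and distillation aim to narrow.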

Model Overview



We illustrate the model architecture of Frieren at different levels in the figure. As shown in figure (a), we first use a pre-trained visual encoder with frozen parameters to extract a frame-level feature sequence from the video. The video frame rate is usually lower than the number of spectrogram latent frames per second. To align the visual feature sequence with the mel latent along the temporal dimension for the cross-modal feature fusion described below, we adopt a length regulator, which simply duplicates each item in the feature sequence by the ratio of the latent length per second to the video frame rate. The regulated feature sequence is then fed to the vector field estimator as the condition, together with x and t, to obtain the vector field prediction v.
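The length regulator described above can be sketched in a few lines. This is a minimal illustration, not the released code: the function name `length_regulate` and the rates used in the example (4 video frames and 16 latent frames per second) are hypothetical, and the sketch assumes the ratio is an integer.

```python
def length_regulate(features, latent_rate, frame_rate):
    """Duplicate each frame-level feature latent_rate / frame_rate times."""
    if latent_rate % frame_rate != 0:
        # Assumption of this sketch: the two rates divide evenly.
        raise ValueError("latent_rate must be an integer multiple of frame_rate")
    ratio = latent_rate // frame_rate
    return [feat for feat in features for _ in range(ratio)]


# Three video frames, hypothetical rates: 4 frames/s video, 16 frames/s latent.
frames = ["f0", "f1", "f2"]
regulated = length_regulate(frames, latent_rate=16, frame_rate=4)
print(len(regulated))  # 12
print(regulated[:5])   # ['f0', 'f0', 'f0', 'f0', 'f1']
```

After regulation, item i of the visual sequence and item i of the latent cover the same time span, which is what makes position-wise fusion possible.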

Figure (b) shows the structure of the vector field estimator, which is composed of a feed-forward transformer and some auxiliary layers. The regulated visual feature c and the point x on the transport path are first processed separately by stacks of shallow layers, each with an output dimension of half the transformer hidden dimension, and are then concatenated along the channel dimension to realize cross-modal feature fusion. This simple mechanism leverages the inherent alignment between the video and the audio, enforcing alignment without relying on learning-based mechanisms such as attention. As a result, the generated audio and the input video exhibit excellent temporal alignment. After the time step embedding is prepended, a learnable positional embedding is added to the sequence, which is then fed into the feed-forward transformer. The structure of the transformer block is illustrated in figure (c); its design is derived from the spatial transformer in latent diffusion, with the 2D convolution layers replaced by 1D ones. The feed-forward transformer does not involve temporal downsampling, so it preserves the resolution of the temporal dimension and further ensures that alignment is preserved. The output of the stacked transformer blocks is then passed through a normalization layer and a 1D convolution layer to obtain the final prediction of the vector field.
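The channel-level fusion step can be sketched with plain Python lists standing in for tensors. Everything named here is hypothetical: `project` is a placeholder for the shallow layers (it only truncates or zero-pads each vector, whereas the real layers are learned), `fuse` is an illustrative name, and the positional embedding and the transformer itself are left out.

```python
def project(seq, out_dim):
    """Placeholder for the shallow layers: truncate or zero-pad to out_dim channels."""
    return [(list(vec) + [0.0] * out_dim)[:out_dim] for vec in seq]


def fuse(visual, latent, t_emb, hidden_dim):
    """Concatenate per-step visual and latent features, then prepend the time embedding."""
    if len(visual) != len(latent):
        raise ValueError("sequences must already be aligned in time")
    if len(t_emb) != hidden_dim:
        raise ValueError("time embedding must have hidden_dim channels")
    half = hidden_dim // 2
    v = project(visual, half)
    x = project(latent, half)
    # Joining the two half-width vectors at each time step is the channel concat.
    fused = [vi + xi for vi, xi in zip(v, x)]
    return [list(t_emb)] + fused


visual = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # regulated visual features, 2 steps
latent = [[0.5], [0.25]]                      # point x on the transport path
tokens = fuse(visual, latent, t_emb=[9.0] * 4, hidden_dim=4)
print(tokens)
# [[9.0, 9.0, 9.0, 9.0], [1.0, 2.0, 0.5, 0.0], [4.0, 5.0, 0.25, 0.0]]
```

Step i of the video only ever meets step i of the latent, so the pairing is fixed by construction rather than learned through attention.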

Table of Contents

  • Video-to-Audio Generation Results
  • Few and Single Step Generation Results

    Video-to-Audio Generation Results

    In this section, we compare our results with other V2A models as well as ground truth videos. You may need to scroll right to see full results.

    Note that Im2Wav only supports generating 4.2 seconds of audio, so the video clips for Im2Wav are shorter.

    Content Type | Ground Truth | Frieren (ours) | DDPM (ours) | SpecVQGAN | Diff-Foley (w/o CG) | Diff-Foley (w/ CG) | Im2Wav
    playing the drum
    sea waves
    a man talking and then shooting
    striking bowling
    playing the cello
    train passing by
    bird twittering
    typing on computer keyboard
    playing billiards
    chainsawing wood
    playing the erhu
    eating crisps
    fireworks

    Few and Single Step Generation Results

    In this section, we demonstrate our few-step and single-step results. You may need to scroll right to see full results.

    Content Type | Ground Truth | Frieren (no reflow), 25 steps | Frieren (reflow), 5 steps | Frieren (reflow), 1 step | Frieren (reflow + distillation), 1 step
    playing the drum
    sea waves
    a man talking and then shooting
    striking bowling
    playing the cello
    train passing by
    bird twittering
    typing on computer keyboard
    playing billiards
    chainsawing wood
    playing the erhu
    eating crisps
    fireworks