TempoControl: Temporal Attention Guidance for Text-to-Video Models

Bar Ilan University

Abstract

Recent advances in generative video models have enabled the creation of high-quality videos based on natural language prompts. However, these models frequently lack fine-grained temporal control, meaning they do not allow users to specify when particular visual elements should appear within a generated sequence. In this work, we introduce TempoControl, a method that allows for temporal alignment of visual concepts during inference, without requiring retraining or additional supervision.

TempoControl utilizes cross-attention maps, a key component of text-to-video diffusion models, to guide the timing of concepts through a novel optimization approach. Our method steers attention using three complementary principles: aligning its temporal pattern with a control signal (correlation), adjusting its strength where visibility is required (magnitude), and preserving semantic consistency (entropy). TempoControl provides precise temporal control while maintaining high video quality and diversity. We demonstrate its effectiveness across various applications, including temporal reordering of single and multiple objects, action timing, and audio-aligned video generation.

TempoControl Applications

As a baseline, we use Wan 2.1-1.3B prompted with textual temporal cues, which we find the model largely ignores. Using the exact same seed and prompt, we then apply TempoControl optimization. Words aligned with the TempoControl signal are marked in orange, and words assigned a static temporal control signal, which keeps those objects visible throughout the video, are marked in blue.

Single Object Control

Here you can see how TempoControl guides when a single object appears in the video. The baseline model (left) receives the text prompt but fails to follow the temporal instructions, while our method (right) uses an additional control signal to precisely time when the object appears.

Multiple Object Control

Watch how our method handles multiple objects in one scene. The baseline (left) receives the temporal instruction but cannot execute it properly. TempoControl (right) keeps the blue object visible throughout and uses the control signal to restrict the appearance of the orange object to the second half, as instructed.

Movement Control

Beyond controlling objects, TempoControl can control when actions happen. The baseline (left) gets the text prompt but fails to time the movements correctly, while our method (right) follows the control signal to synchronize actions precisely, demonstrating temporal control over verbs and motion.

Audio-Video Alignment

Here we demonstrate TempoControl's ability to align visual actions with audio cues. The baseline (left) receives the text prompt but cannot synchronize visual events with the audio timing, while our method (right) aligns visual moments with the corresponding audio cues.

Motivation for TempoControl

TempoControl performs inference-time optimization by steering the latent variables during the diffusion process. At each denoising step, we apply a few gradient-descent iterations on the latent until a satisfactory level of temporal alignment is achieved, without updating any model parameters.
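
To make the procedure concrete, here is a minimal sketch of such a guided denoising step in PyTorch. The `denoiser` interface, the `return_attention` flag, and the step count and step size are illustrative assumptions, not the actual Wan 2.1 API or the exact TempoControl hyperparameters.

```python
import torch

def guided_denoising_step(z_t, t, denoiser, compute_loss,
                          num_grad_steps=3, step_size=0.1):
    """Steer the latent z_t with a few gradient steps before the usual
    denoising update; model parameters are never modified."""
    for _ in range(num_grad_steps):
        z_t = z_t.detach().requires_grad_(True)
        # Forward pass that also exposes cross-attention maps (assumed hook/interface).
        _noise_pred, attn_maps = denoiser(z_t, t, return_attention=True)
        loss = compute_loss(attn_maps)        # temporal alignment loss L^t
        grad = torch.autograd.grad(loss, z_t)[0]
        z_t = z_t - step_size * grad          # gradient descent on the latent only
    return z_t.detach()
```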

Method Overview

Illustration of TempoControl. During a single denoising step $t$, we extract spatial attention maps $\bar{A}^t_{j,i}$ (for word $i$ at temporal index $j$), aggregate them into a temporal attention signal $a_{i}^t$, and align it with the target mask vector $m_{i}$ via temporal and spatial losses. The gradients $\nabla \mathcal{L}$ are used to update the latent code $z_t$.

TempoControl's temporal losses rely on cross-attention maps that link words to their visual appearance in the video. Thus, steering the attention placement and strength can influence the temporal and spatial appearance of objects.
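
As a rough sketch of the aggregation step described above, the per-frame temporal signal can be obtained by averaging a word's spatial attention map over all spatial locations. The tensor layout below is an assumption for illustration; the actual attention shapes depend on the Wan 2.1 architecture.

```python
import torch

def temporal_attention_signal(attn: torch.Tensor, token_idx: int) -> torch.Tensor:
    """attn: cross-attention maps of shape (frames, height * width, tokens),
    i.e. one spatial map per frame and per text token (layout assumed for illustration).
    Returns a_i^t: one attention value per frame for the chosen word."""
    spatial_maps = attn[..., token_idx]   # (frames, height * width)
    return spatial_maps.mean(dim=-1)      # average over spatial locations -> (frames,)
```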

Attention Visualization

Motivation for our approach. Top: We show a video generated for the prompt "The video begins with a serene view centered on the cat, with no sign of the dog. In the second half, the dog unexpectedly appears, altering the dynamic of the scene." The top row displays the attention maps for the tokens cat and dog, extracted from the denoising step $t{=}3$, for frames $j{=}2,6,10,14,19$. On the left, the video is generated without our optimization. Despite the prompt specifying that the dog should appear in the second half, it appears early. This behavior is common, as Wan 2.1 often fails to depict objects or movements according to temporal cues in the prompt. On the right, after applying our conditioning method, the dog correctly appears in the second half of the video. Bottom: Temporal attention $a_{i,j}^t$ (blue) versus target mask $m_{i,j}$ (orange), with the corresponding Pearson correlation loss shown.

The loss $\mathcal{L}^t$ is built around one main component: (i) a temporal correlation term that aligns the attention pattern with the target mask. While this term is the most significant for aligning with the temporal condition, we include two additional terms that further improve the results: (ii) a magnitude term that amplifies or suppresses token-level activation, and (iii) an attention entropy regularization term that preserves spatial consistency:

(i) Temporal correlation term $\mathcal{L}_{\text{corr}}^t$:

Encourages the temporal shape of attention to match that of an external control signal, operating on the normalized attention vector to align the presence of a concept with the desired timing.

$$\begin{equation} \mathcal{L}_{\text{corr}}^t = -\frac{\text{Cov}(m_i, \tilde{a}_i^t)}{\sigma_{m_i} \sigma_{\tilde{a}_i^t}} \end{equation}$$
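A direct reading of the formula above as code might look as follows; the small epsilon for numerical stability is our addition and not part of the formula.

```python
import torch

def corr_loss(a_i: torch.Tensor, m_i: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative Pearson correlation between the (normalized) temporal attention
    signal a_i^t and the target mask m_i."""
    a_c = a_i - a_i.mean()
    m_c = m_i - m_i.mean()
    cov = (a_c * m_c).mean()                          # population covariance
    return -cov / (a_i.std(unbiased=False) * m_i.std(unbiased=False) + eps)
```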

(ii) Attention magnitude term $\mathcal{L}_{\text{energy}}^t$:

Directly promotes stronger attention in frames where the temporal signal is high, and suppresses it elsewhere, mitigating cases where correlation scores are high but attention values are too low to render the object visible.

$$\begin{equation} \mathcal{L}_{\oplus}^t = \frac{1}{T'} \sum_{j=1}^{T'} \mathbb{1}_{\{m_{i,j} > \tau\}} \cdot a_{i,j}^t \end{equation}$$

$$\begin{equation} \mathcal{L}_{\ominus}^t = \frac{1}{T'} \sum_{j=1}^{T'} \mathbb{1}_{\{m_{i,j} \leq \tau\}} \cdot a_{i,j}^t \end{equation}$$

$$\begin{equation} \mathcal{L}_{\text{energy}}^t = \mathcal{L}_{\ominus}^t - \mathcal{L}_{\oplus}^t \end{equation}$$
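
In code, this term reduces to comparing the average attention inside and outside the target window; the threshold value below is a placeholder, not the paper's setting.

```python
import torch

def energy_loss(a_i: torch.Tensor, m_i: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """L_energy^t = L_minus - L_plus: mean attention on frames outside the target
    window minus mean attention on frames inside it."""
    inside = (m_i > tau).float()
    T_prime = a_i.numel()
    l_plus = (inside * a_i).sum() / T_prime            # attention where the concept should appear
    l_minus = ((1.0 - inside) * a_i).sum() / T_prime   # attention where it should be absent
    return l_minus - l_plus
```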

(iii) Attention entropy regularization $\mathcal{L}_{\text{entropy}}^t$:

Maintains spatial focus, ensuring that when attention is activated, it remains coherent and not diffusely spread across the frame.

$$\begin{equation} \mathcal{L}_{\text{entropy}}^t = \frac{1}{T'} \sum_{j=1}^{T'} \mathbb{1}_{\{m_{i,j} > \tau\}} \cdot \mathcal{H}(\bar{A}^{t}_{j,i}) \end{equation}$$
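
A sketch of this regularizer: compute the spatial entropy of the word's attention map per frame and average it over the frames inside the target window. The tensor layout matches the (assumed) one used in the aggregation sketch above.

```python
import torch

def entropy_loss(attn: torch.Tensor, m_i: torch.Tensor, token_idx: int,
                 tau: float = 0.5, eps: float = 1e-8) -> torch.Tensor:
    """Mean spatial entropy H(A^t_{j,i}) over frames where m_{i,j} > tau."""
    maps = attn[..., token_idx]                          # (frames, height * width)
    p = maps / (maps.sum(dim=-1, keepdim=True) + eps)    # normalize to a spatial distribution
    H = -(p * (p + eps).log()).sum(dim=-1)               # per-frame entropy
    inside = (m_i > tau).float()
    return (inside * H).sum() / m_i.numel()
```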

Total Loss:

$$\begin{equation} \mathcal{L}^t = \mathcal{L}_{\text{corr}}^t + \lambda_1 \mathcal{L}_{\text{energy}}^t + \lambda_2 \mathcal{L}_{\text{entropy}}^t \end{equation}$$

where $\lambda_1$ and $\lambda_2$ are hyperparameters that balance the contributions of each loss term.
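
Putting the pieces together, the total loss can be assembled from the sketches above; the $\lambda$ values shown are illustrative placeholders rather than the paper's tuned hyperparameters.

```python
import torch

def total_loss(attn: torch.Tensor, m_i: torch.Tensor, token_idx: int,
               lambda1: float = 0.1, lambda2: float = 0.01) -> torch.Tensor:
    """L^t = L_corr + lambda1 * L_energy + lambda2 * L_entropy,
    using the sketch functions defined above; lambda1/lambda2 are placeholders."""
    a_i = temporal_attention_signal(attn, token_idx)
    return (corr_loss(a_i, m_i)
            + lambda1 * energy_loss(a_i, m_i)
            + lambda2 * entropy_loss(attn, m_i, token_idx))
```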

BibTeX

If you find this project useful for your research, please cite the following:

@misc{TempoControl2025,
  Author = {Shira Schiber and Ofir Lindenbaum and Idan Schwartz},
  Title  = {TempoControl: Temporal Attention Guidance for Text-to-Video Models},
  Year   = {2025},
}