TempoControl: Temporal Attention Guidance for Text-to-Video Models

Bar Ilan University

Abstract

Recent advances in generative video models have enabled the creation of high-quality videos based on natural language prompts. However, these models frequently lack fine-grained temporal control, meaning they do not allow users to specify when particular visual elements should appear within a generated sequence. In this work, we introduce TempoControl, a method that allows for temporal alignment of visual concepts during inference, without requiring retraining or additional supervision.

TempoControl utilizes cross-attention maps, a key component of text-to-video diffusion models, to guide the timing of concepts through a novel optimization approach. Our method steers attention using three complementary principles: aligning its temporal pattern with a control signal (correlation), adjusting its strength where visibility is required (magnitude), and preserving semantic consistency (entropy). TempoControl provides precise temporal control while maintaining high video quality and diversity. We demonstrate its effectiveness across various applications, including temporal reordering of single and multiple objects, action timing, and audio-aligned video generation.

TempoControl Applications

As a baseline, we use Wan 2.1-1.3B prompted with textual temporal cues, which we find the model largely ignores. Using the exact same seed and prompt, we then apply TempoControl optimization. Words aligned with the TempoControl signal are marked in orange, and words assigned a static temporal control signal, which keeps those objects visible throughout the video, are marked in blue.

Single Object Control

Here you can see how TempoControl guides when a single object appears in the video. The baseline model (left) receives the text prompt but fails to follow the temporal instructions, while our method (right) uses an additional control signal to precisely time when the object appears.

Multiple Object Control

Watch how our method handles multiple objects in one scene. The baseline (left) receives the temporal instruction but cannot execute it properly. TempoControl (right) keeps the blue object visible throughout and uses the control signal to restrict the appearance of the orange object to the second half, as instructed.

Movement Control

Beyond controlling objects, TempoControl can control when actions happen. The baseline (left) gets the text prompt but fails to time the movements correctly, while our method (right) follows the control signal to synchronize actions precisely, demonstrating temporal control over verbs and motion.

Audio-Video Alignment

Here we demonstrate TempoControl's ability to align visual actions with audio cues. The baseline (left) receives the text prompt but cannot synchronize visual events with the audio timing, while our method (right) aligns visual moments with the corresponding audio cues.

Motivation for TempoControl

TempoControl performs inference-time optimization by steering the latent variables during the diffusion process. At each denoising step, we apply a few gradient-descent iterations on the latent until a satisfactory level of temporal alignment is achieved, without updating any model parameters.
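
To make the procedure concrete, here is a minimal sketch of such a guided denoising step in PyTorch. The `denoiser` interface, the `return_attention` flag, and the step count and step size are illustrative assumptions, not the actual Wan 2.1 API or the exact TempoControl hyperparameters.

```python
import torch

def guided_denoising_step(z_t, t, denoiser, compute_loss,
                          num_grad_steps=3, step_size=0.1):
    """Steer the latent z_t with a few gradient steps before the usual
    denoising update; model parameters are never modified."""
    for _ in range(num_grad_steps):
        z_t = z_t.detach().requires_grad_(True)
        # Forward pass that also exposes cross-attention maps (assumed hook/interface).
        _noise_pred, attn_maps = denoiser(z_t, t, return_attention=True)
        loss = compute_loss(attn_maps)        # temporal alignment loss L^t
        grad = torch.autograd.grad(loss, z_t)[0]
        z_t = z_t - step_size * grad          # gradient descent on the latent only
    return z_t.detach()
```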

Method Overview

Illustration of TempoControl. During a single denoising step $t$, we extract spatial attention maps $\bar{A}^t_{j,i}$ (for word $i$ at temporal index $j$), aggregate them into a temporal attention signal $a_{i}^t$, and align it with the target mask vector $m_{i}$ via temporal and spatial losses. The gradients $\nabla \mathcal{L}$ are used to update the latent code $z_t$.

TempoControl's temporal losses rely on cross-attention maps that link words to their visual appearance in the video. Thus, steering the attention placement and strength can influence the temporal and spatial appearance of objects.
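
As a rough sketch of the aggregation step described above, the per-frame temporal signal can be obtained by averaging a word's spatial attention map over all spatial locations. The tensor layout below is an assumption for illustration; the actual attention shapes depend on the Wan 2.1 architecture.

```python
import torch

def temporal_attention_signal(attn: torch.Tensor, token_idx: int) -> torch.Tensor:
    """attn: cross-attention maps of shape (frames, height * width, tokens),
    i.e. one spatial map per frame and per text token (layout assumed for illustration).
    Returns a_i^t: one attention value per frame for the chosen word."""
    spatial_maps = attn[..., token_idx]   # (frames, height * width)
    return spatial_maps.mean(dim=-1)      # average over spatial locations -> (frames,)
```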

Attention Visualization

Motivation for our approach. Top: We show a video generated for the prompt "The video begins with a serene view centered on the cat, with no sign of the dog. In the second half, the dog unexpectedly appears, altering the dynamic of the scene." The top row displays the attention maps for the tokens cat and dog, extracted from the denoising step $t{=}3$, for frames $j{=}2,6,10,14,19$. On the left, the video is generated without our optimization. Despite the prompt specifying that the dog should appear in the second half, it appears early. This behavior is common, as Wan 2.1 often fails to depict objects or movements according to temporal cues in the prompt. On the right, after applying our conditioning method, the dog correctly appears in the second half of the video. Bottom: Temporal attention $a_{i,j}^t$ (blue) versus target mask $m_{i,j}$ (orange), with the corresponding Pearson correlation loss shown.

The loss $\mathcal{L}^t$ is built around one main component: (i) a temporal correlation term that aligns the attention pattern with the target mask. While this term is the most significant for aligning with the temporal condition, we include two additional terms that further improve the results: (ii) a magnitude term that amplifies or suppresses token-level activation, and (iii) an attention entropy regularization term that preserves spatial consistency:

(i) Temporal correlation term $\mathcal{L}_{\text{corr}}^t$:

Encourages the temporal shape of attention to match that of an external control signal, operating on the normalized attention vector to align the presence of a concept with the desired timing.

$$\begin{equation} \mathcal{L}_{\text{corr}}^t = -\frac{\text{Cov}(m_i, \tilde{a}_i^t)}{\sigma_{m_i} \sigma_{\tilde{a}_i^t}} \end{equation}$$
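A direct reading of the formula above as code might look as follows; the small epsilon for numerical stability is our addition and not part of the formula.

```python
import torch

def corr_loss(a_i: torch.Tensor, m_i: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative Pearson correlation between the (normalized) temporal attention
    signal a_i^t and the target mask m_i."""
    a_c = a_i - a_i.mean()
    m_c = m_i - m_i.mean()
    cov = (a_c * m_c).mean()                          # population covariance
    return -cov / (a_i.std(unbiased=False) * m_i.std(unbiased=False) + eps)
```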

(ii) Attention magnitude term $\mathcal{L}_{\text{energy}}^t$:

Directly promotes stronger attention in frames where the temporal signal is high, and suppresses it elsewhere, mitigating cases where correlation scores are high but attention values are too low to render the object visible.

$$\begin{equation} \mathcal{L}_{\oplus}^t = \frac{1}{T'} \sum_{j=1}^{T'} \mathbb{1}_{\{m_{i,j} > \tau\}} \cdot a_{i,j}^t \end{equation}$$

$$\begin{equation} \mathcal{L}_{\ominus}^t = \frac{1}{T'} \sum_{j=1}^{T'} \mathbb{1}_{\{m_{i,j} \leq \tau\}} \cdot a_{i,j}^t \end{equation}$$

$$\begin{equation} \mathcal{L}_{\text{energy}}^t = \mathcal{L}_{\ominus}^t - \mathcal{L}_{\oplus}^t \end{equation}$$
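
In code, this term reduces to comparing the average attention inside and outside the target window; the threshold value below is a placeholder, not the paper's setting.

```python
import torch

def energy_loss(a_i: torch.Tensor, m_i: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """L_energy^t = L_minus - L_plus: mean attention on frames outside the target
    window minus mean attention on frames inside it."""
    inside = (m_i > tau).float()
    T_prime = a_i.numel()
    l_plus = (inside * a_i).sum() / T_prime            # attention where the concept should appear
    l_minus = ((1.0 - inside) * a_i).sum() / T_prime   # attention where it should be absent
    return l_minus - l_plus
```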

(iii) Attention entropy regularization $\mathcal{L}_{\text{entropy}}^t$:

Maintains spatial focus, ensuring that when attention is activated, it remains coherent and not diffusely spread across the frame.

$$\begin{equation} \mathcal{L}_{\text{entropy}}^t = \frac{1}{T'} \sum_{j=1}^{T'} \mathbb{1}_{\{m_{i,j} > \tau\}} \cdot \mathcal{H}(\bar{A}^{t}_{j,i}) \end{equation}$$
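
A sketch of this regularizer: compute the spatial entropy of the word's attention map per frame and average it over the frames inside the target window. The tensor layout matches the (assumed) one used in the aggregation sketch above.

```python
import torch

def entropy_loss(attn: torch.Tensor, m_i: torch.Tensor, token_idx: int,
                 tau: float = 0.5, eps: float = 1e-8) -> torch.Tensor:
    """Mean spatial entropy H(A^t_{j,i}) over frames where m_{i,j} > tau."""
    maps = attn[..., token_idx]                          # (frames, height * width)
    p = maps / (maps.sum(dim=-1, keepdim=True) + eps)    # normalize to a spatial distribution
    H = -(p * (p + eps).log()).sum(dim=-1)               # per-frame entropy
    inside = (m_i > tau).float()
    return (inside * H).sum() / m_i.numel()
```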

Total Loss:

$$\begin{equation} \mathcal{L}^t = \mathcal{L}_{\text{corr}}^t + \lambda_1 \mathcal{L}_{\text{energy}}^t + \lambda_2 \mathcal{L}_{\text{entropy}}^t \end{equation}$$

where $\lambda_1$ and $\lambda_2$ are hyperparameters that balance the contributions of each loss term.
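
Putting the pieces together, the total loss can be assembled from the sketches above; the $\lambda$ values shown are illustrative placeholders rather than the paper's tuned hyperparameters.

```python
import torch

def total_loss(attn: torch.Tensor, m_i: torch.Tensor, token_idx: int,
               lambda1: float = 0.1, lambda2: float = 0.01) -> torch.Tensor:
    """L^t = L_corr + lambda1 * L_energy + lambda2 * L_entropy,
    using the sketch functions defined above; lambda1/lambda2 are placeholders."""
    a_i = temporal_attention_signal(attn, token_idx)
    return (corr_loss(a_i, m_i)
            + lambda1 * energy_loss(a_i, m_i)
            + lambda2 * entropy_loss(attn, m_i, token_idx))
```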

BibTeX

If you find this project useful for your research, please cite the following:

@misc{TempoControl2025,
  Author = {Shira Schiber and Ofir Lindenbaum and Idan Schwartz},
  Title  = {TempoControl: Temporal Attention Guidance for Text-to-Video Models},
  Year   = {2025},
}