TempoControl uses inference-time optimization, steering the latent variables during the diffusion process. At each denoising step, we apply a few gradient-descent iterations to the latent code until a satisfactory level of temporal alignment is achieved; importantly, the model parameters are never updated.
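As a minimal sketch of this inner loop (assuming a PyTorch-style interface; `denoiser.forward_with_attention`, the step size, and the iteration count are illustrative placeholders, not the exact Wan 2.1 API):

```python
import torch

def guided_denoising_step(z_t, t, denoiser, loss_fn, n_iters=3, step_size=0.1):
    """Steer the latent z_t at denoising step t; model weights stay frozen."""
    z_t = z_t.detach().requires_grad_(True)
    for _ in range(n_iters):
        # Forward pass only to expose cross-attention maps; gradients flow to z_t.
        attn_maps = denoiser.forward_with_attention(z_t, t)
        loss = loss_fn(attn_maps)  # L^t, defined below
        grad = torch.autograd.grad(loss, z_t)[0]
        z_t = (z_t - step_size * grad).detach().requires_grad_(True)
    return z_t
```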
Illustration of TempoControl. During a single denoising step $t$, we extract spatial attention maps $\bar{A}^t_{j,i}$ (for word $i$ at temporal index $j$), aggregate them into a temporal attention signal $a_{i}^t$, and align it with the target mask vector $m_{i}$ via the temporal and spatial losses. Gradients $\nabla \mathcal{L}$ are then used to update the latent code $z_t$.
TempoControl's temporal losses rely on cross-attention maps that link words to their visual appearance in the video. Thus, steering the attention placement and strength can influence the temporal and spatial appearance of objects.
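A possible implementation of the aggregation from spatial maps $\bar{A}^t_{j,i}$ to the temporal signal $a_i^t$, assuming the cross-attention tensor is arranged as (T', H, W, tokens) and that spatial mean pooling is used (both the layout and the pooling choice are assumptions, not taken from the paper):

```python
import torch

def temporal_attention_signal(attn, token_idx):
    """Collapse spatial attention maps A_{j,i} into a temporal signal a_i.

    attn: assumed shape (T', H, W, num_tokens); one map per frame j and token i.
    Returns a tensor of shape (T',): one attention value per temporal index.
    """
    A_i = attn[..., token_idx]         # (T', H, W) spatial maps for token i
    return A_i.flatten(1).mean(dim=1)  # spatial mean per frame (an assumption)
```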
Motivation for our approach. Top: We show a video generated for the prompt "The video begins with a serene view centered on the cat, with no sign of the dog. In the second half, the dog unexpectedly appears, altering the dynamic of the scene." The top row displays the attention maps for the tokens cat and dog, extracted from the denoising step $t{=}3$, for frames $j{=}2,6,10,14,19$. On the left, the video is generated without our optimization. Despite the prompt specifying that the dog should appear in the second half, it appears early. This behavior is common, as Wan 2.1 often fails to depict objects or movements according to temporal cues in the prompt. On the right, after applying our conditioning method, the dog correctly appears in the second half of the video. Bottom: Temporal attention $a_{i,j}^t$ (blue) versus target mask $m_{i,j}$ (orange), with the corresponding Pearson correlation loss shown.
The loss $\mathcal{L}^t$ consists of three components. The main one is (i) a temporal correlation term that aligns the attention pattern with the target mask and does most of the work of enforcing the temporal condition. Two auxiliary terms further improve the results: (ii) a magnitude term that amplifies or suppresses token-level activation, and (iii) an attention entropy regularization term that preserves spatial consistency:
(i) Temporal correlation term $\mathcal{L}_{\text{corr}}^t$:
Encourages the temporal shape of the attention to match that of the external control signal, operating on the normalized attention vector $\tilde{a}_i^t$ to align the presence of a concept with the desired timing.
$$\begin{equation}
\mathcal{L}_{\text{corr}}^t = -\frac{\text{Cov}(m_i, \tilde{a}_i^t)}{\sigma_{m_i} \sigma_{\tilde{a}_i^t}}
\end{equation}$$
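A direct transcription of this term, with an epsilon guard added for numerical stability:

```python
import torch

def corr_loss(m_i, a_i, eps=1e-8):
    """Negative Pearson correlation between mask m_i and normalized temporal
    attention a_i (both shape (T',)); minimizing it aligns their temporal shapes."""
    m_c, a_c = m_i - m_i.mean(), a_i - a_i.mean()
    cov = (m_c * a_c).mean()
    std_m = m_c.pow(2).mean().sqrt()
    std_a = a_c.pow(2).mean().sqrt()
    return -cov / (std_m * std_a + eps)
```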
(ii) Attention magnitude term $\mathcal{L}_{\text{energy}}^t$:
Directly promotes stronger attention in frames where the temporal signal is high, and suppresses it elsewhere, mitigating cases where correlation scores are high but attention values are too low to render the object visible.
$$\begin{equation}
\mathcal{L}_{\oplus}^t = \frac{1}{T'} \sum_{j=1}^{T'} \mathbb{1}_{\{m_{i,j} > \tau\}} \cdot a_{i,j}^t
\end{equation}$$
$$\begin{equation}
\mathcal{L}_{\ominus}^t = \frac{1}{T'} \sum_{j=1}^{T'} \mathbb{1}_{\{m_{i,j} \leq \tau\}} \cdot a_{i,j}^t
\end{equation}$$
$$\begin{equation}
\mathcal{L}_{\text{energy}}^t = \mathcal{L}_{\ominus}^t - \mathcal{L}_{\oplus}^t
\end{equation}$$
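These three equations translate directly into code; the threshold value below is a placeholder:

```python
def energy_loss(m_i, a_i, tau=0.5):
    """L_energy = L_minus - L_plus: mean attention over inactive frames minus
    mean attention over active frames, each normalized by T'."""
    active = (m_i > tau).float()
    T = m_i.numel()
    l_plus = (active * a_i).sum() / T           # attention where the object should appear
    l_minus = ((1.0 - active) * a_i).sum() / T  # attention where it should not
    return l_minus - l_plus
```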
(iii) Attention entropy regularization $\mathcal{L}_{\text{entropy}}^t$:
Maintains spatial focus, ensuring that when attention is activated, it remains coherent rather than diffusely spread across the frame.
$$\begin{equation}
\mathcal{L}_{\text{entropy}}^t = \frac{1}{T'} \sum_{j=1}^{T'} \mathbb{1}_{\{m_{i,j} > \tau\}} \cdot \mathcal{H}(\bar{A}^{t}_{j,i})
\end{equation}$$
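A sketch of this term, assuming each frame's spatial map is renormalized into a probability distribution before the Shannon entropy is taken:

```python
def entropy_loss(attn_maps, m_i, tau=0.5, eps=1e-8):
    """Mean spatial entropy H(A_{j,i}) over frames where the mask is active.

    attn_maps: assumed shape (T', H, W), the spatial maps for token i."""
    p = attn_maps.flatten(1)
    p = p / (p.sum(dim=1, keepdim=True) + eps)  # per-frame distribution
    H = -(p * (p + eps).log()).sum(dim=1)       # (T',) entropy per frame
    active = (m_i > tau).float()
    return (active * H).sum() / m_i.numel()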
Total Loss:
$$\begin{equation}
\mathcal{L}^t = \mathcal{L}_{\text{corr}}^t + \lambda_1 \mathcal{L}_{\text{energy}}^t + \lambda_2 \mathcal{L}_{\text{entropy}}^t
\end{equation}$$
where $\lambda_1$ and $\lambda_2$ are hyperparameters that balance the contributions of each loss term.
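Combining the sketches above (the $\lambda$ defaults are placeholders, not the paper's settings; note that the correlation term uses the normalized signal $\tilde{a}_i^t$ while the energy term uses the raw one):

```python
def total_loss(m_i, a_i, tilde_a_i, attn_maps, lam1=0.1, lam2=0.01):
    """L^t = L_corr + λ1·L_energy + λ2·L_entropy, using the sketches above."""
    return (corr_loss(m_i, tilde_a_i)
            + lam1 * energy_loss(m_i, a_i)
            + lam2 * entropy_loss(attn_maps, m_i))
```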
BibTeX
If you find this project useful for your research, please cite the following:
@misc{TempoControl2025,
  author = {Shira Schiber and Ofir Lindenbaum and Idan Schwartz},
  title  = {TempoControl: Temporal Attention Guidance for Text-to-Video Models},
  year   = {2025},
}