TIE: Time Interval Encoding for Video Generation over Events

2026 Preprint


Events are not points.

Let them live in time.

Main result: 77.34% → 96.03% temporal satisfaction with coarse intervals

Cost: 0 architecture overhead, drop-in over RoPE

Convergence: −12.6% FVD at 20K steps; nDTW +4.9%, EMD −9.8%

Effect: 96.03% temporal satisfaction; boundary error 0.261 s → 0.073 s

Presented by Matrix Team — Neural Interactive Simulation

arXiv · Paper PDF · Code (coming soon) · Dataset (coming soon) · BibTeX

arXiv: 2605.10543 · status: preprint · method: plug-and-play · OmniEvents: 419K clips · 3.7M interval events · general + robotics + gameplay

One prompt. Multiple events. Explicit time intervals.

TIE teaser showing event-conditioned video generation with explicit time intervals

TIE turns event descriptions into temporally grounded video control. Each event carries an explicit interval, allowing the model to generate overlapping and concurrent events that point-wise or single-active-prompt conditioning cannot represent.

Insight

What this paper says

Existing video generators know when a frame is, but not when an event lives. They encode time as discrete points, while events occupy intervals. This creates a structural mismatch: overlapping events collapse into ambiguous token sequences. TIE fixes it by generalizing RoPE from point-wise timestamps to interval-aware event keys.

No masks. Standard attention stays.

No frame-perfect labels. Coarse intervals suffice.

No extra architecture. RoPE becomes interval-aware.

Theory

Boundary timestamp noise decays with interval length.

Noise means errors in annotated event start and end times. TIE confines their effect to the interval margin.

$\mathcal{O}(\delta/r)$ vs. $\mathcal{O}(\theta_{\max}\delta)$

Experiment

Strong boundary noise still wins.

+12.2%

nDTW gain for TIE under significant event-boundary noise, relative to Finetuned trained on clean timestamps.

0.286 vs. 0.255

Experiment

Temporal drift remains lower.

−7.0%

EMD reduction for TIE under significant event-boundary noise, relative to Finetuned trained on clean timestamps.

0.132 vs. 0.142

Temporal Constraint Satisfaction

77.34%→96.03%

+24% relative

Boundary Error

0.261s→0.073s

−72% relative

Concurrent Events in the Wild

68% / 99%

general / robotics & gameplay

Architecture Overhead

0 (drop-in over RoPE)

Why existing video generators struggle with events

A structural problem, not a scaling problem.

In 68% of general clips and over 99% of robotics & gameplay clips, multiple events overlap in time. Yet every modern multi-event generator rests on a single-active-prompt assumption.

The old grammar breaks at overlap.

RoPE — Point-wise

Time as discrete points

Standard rotary embeddings assign each token a single timestamp. Events that span an interval — let alone overlap — collapse into ambiguous token sequences. Because all temporal evidence is concentrated at points or endpoints, timestamp noise directly perturbs the encoded phase. This is exactly the failure mode that appears in large-scale training, where event boundaries are produced by imperfect automatic annotators.

TIE — Interval-aware

Time as intervals

A textual event token carries an interval $I = [t^s, t^e]$. Cross-attention aggregates positional evidence over the entire span — so concurrent and overlapping events become first-class citizens.

Point-wise RoPE compared with interval-aware TIE

Point-wise RoPE cannot naturally model the point-to-interval activation pattern of an event. TIE supports it natively via interval integration.

Two principles → one closed-form encoder

No heuristics. The form of TIE is uniquely determined.

TIE asks a simple question: if a text token describes an event lasting from start to end, should attention see only one timestamp, two boundaries, or the whole interval?

TIE is the interval-encoding principle. RoTE is the uniform-kernel closed-form instantiation used in our experiments.

Two principles. One inevitable form.

Definition

Point-to-interval attention score

For a video query $q_i$ at time $m_i$ and an event key $k_j$ with interval $I_j=[t_j^s,t_j^e]$, TIE defines the attention logit by aggregating RoPE evidence over the full temporal support of the event.

$$s_{i,j}=s(q_i,k_j;m_i,I_j)$$ the logit used by DiT cross-attention

Temporal Integrability

$$\bar{s}_{i,j}=\mathbb{E}_{\tau\sim\mu_{I_j}}\!\left[ s_{\mathrm{RoPE}}(q_i,k_j;m_i,\tau)\right]$$

The cross-attention logit must integrate point-wise RoPE evidence across the event's full support. The interior of the interval is preserved, not collapsed to a center or boundary timestamp.

Duration Invariance

$$s_{i,j}=C(\mu_{I_j})^{-1}\bar{s}_{i,j},\qquad C(\mu_{I_j})>0$$

The final logit must reflect semantic relevance, not interval length. Normalization prevents long events from winning simply because they accumulate more temporal evidence.

$$\text{RoTE}(k, c, r) \;=\; \frac{1}{C_r}\, R_{c,r}\, k, \qquad \mathbf{A}_{i,c,r} = \operatorname{sinc}(\theta_i r) \begin{pmatrix}\cos(\theta_i c) & -\sin(\theta_i c)\\ \sin(\theta_i c) & \cos(\theta_i c)\end{pmatrix}$$

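Closed-form RoTE: the uniform-kernel instantiation of TIE. Here $c=(t^s+t^e)/2$ and $r=(t^e-t^s)/2$ are the center and radius of the interval, $\mathbf{A}_{i,c,r}$ is the $2\times 2$ block for rotary frequency $\theta_i$, and $R_{c,r}$ is the block-diagonal matrix assembled from these blocks.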
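As a brief check of where the $\operatorname{sinc}$ factor comes from, the sketch below averages a single rotary phase over a uniform interval. It assumes the complex form of RoPE (a key pair rotated by $e^{\mathrm{i}\theta\tau}$), the uniform measure on $[c-r,\,c+r]$, and the unnormalized convention $\operatorname{sinc}(x)=\sin(x)/x$; the normalizer $C_r$ is left out.

$$\frac{1}{2r}\int_{c-r}^{c+r} e^{\mathrm{i}\theta\tau}\,d\tau \;=\; \frac{e^{\mathrm{i}\theta(c+r)}-e^{\mathrm{i}\theta(c-r)}}{2\,\mathrm{i}\,\theta r} \;=\; e^{\mathrm{i}\theta c}\,\frac{\sin(\theta r)}{\theta r} \;=\; \operatorname{sinc}(\theta r)\,e^{\mathrm{i}\theta c}$$

Multiplying by $e^{\mathrm{i}\theta c}$ is a rotation by $\theta c$, and the real factor $\operatorname{sinc}(\theta r)$ tends to 1 as $r\to 0$, which is consistent with the reduction to point-wise RoPE.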

Visual intuition

Final method schematic from the paper: point-wise RoPE compared with interval-aware TIE

The figure illustrates the modeling consequence of the two definitions above: point-wise RoPE samples a single timestamp, while TIE turns each event token into an interval-aware key whose evidence is integrated across time.

The center $c$ controls the rotation phase; the radius $r$ acts as a built-in temporal low-pass filter via $\operatorname{sinc}(\theta r)$. As $r \to 0$, RoTE reduces to standard RoPE — so it slots into any DiT with zero overhead.
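The role of the center and radius can be checked numerically. The following is a minimal sketch, not the authors' implementation: it handles one 2-D feature pair at one rotary frequency, omits the normalizer $C_r$, and all function names (`rope_pair`, `rote_pair`, `averaged_rope_pair`) are hypothetical. It compares the closed form against a brute-force average of point-wise RoPE rotations over the interval.

```python
import math

def sinc(x: float) -> float:
    """Unnormalized sinc, sin(x)/x, with the value 1 at x = 0."""
    return 1.0 if abs(x) < 1e-12 else math.sin(x) / x

def rotate(pair, angle):
    """Rotate a 2-D feature pair counter-clockwise by `angle` radians."""
    x, y = pair
    ca, sa = math.cos(angle), math.sin(angle)
    return (ca * x - sa * y, sa * x + ca * y)

def rope_pair(pair, theta, t):
    """Point-wise RoPE: rotate by theta * t."""
    return rotate(pair, theta * t)

def rote_pair(pair, theta, c, r):
    """Closed form for a uniform interval [c - r, c + r] (normalizer C_r omitted)."""
    gain = sinc(theta * r)
    x, y = rotate(pair, theta * c)
    return (gain * x, gain * y)

def averaged_rope_pair(pair, theta, c, r, n=4000):
    """Midpoint-rule average of point-wise RoPE over [c - r, c + r]."""
    ax = ay = 0.0
    step = 2.0 * r / n
    for k in range(n):
        tau = c - r + (k + 0.5) * step
        x, y = rope_pair(pair, theta, tau)
        ax += x
        ay += y
    return (ax / n, ay / n)

key = (0.8, -0.3)
theta, c, r = 3.0, 4.5, 1.5
closed = rote_pair(key, theta, c, r)
numeric = averaged_rope_pair(key, theta, c, r)
point = rote_pair(key, theta, c, 0.0)   # zero radius: should match plain RoPE at c
```

In this sketch the closed form and the numerical average agree to quadrature error, and a zero radius gives back the point-wise rotation.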

Robust to noisy timestamps — by construction

Boundary perturbations only affect the marginal portion of the integration domain. The main theorem gives $\Delta_{\mathrm{RoTE}}=\mathcal{O}(\delta/r)$ when $\delta\le r/2$, while point-wise RoPE and boundary-only DoTE have local worst-case sensitivity $\mathcal{O}(\theta_{\max}\delta)$ with no decay in interval radius. This strict structural advantage is what makes interval-conditioned large-scale pretraining with VLM-derived annotations practical.
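The contrast in sensitivity can be illustrated with a small numerical experiment. This is a sketch under stated assumptions: it uses the unnormalized per-frequency block $\operatorname{sinc}(\theta r)\,e^{\mathrm{i}\theta c}$ in complex form, shifts each endpoint by at most $\delta$, and compares against the elementary bound $3\delta/(r-\delta)$, which follows from $|\partial_c A|\le 1/r$ and $|\partial_r A|\le 2/r$ for this block. The constant 3 is derived for the sketch only and is not the constant $K$ of the paper's theorem.

```python
import cmath
import math

def sinc(x: float) -> float:
    return 1.0 if abs(x) < 1e-12 else math.sin(x) / x

def rote_block(theta, start, end):
    """Unnormalized interval block sinc(theta*r) * exp(i*theta*c) for [start, end]."""
    c = 0.5 * (start + end)
    r = 0.5 * (end - start)
    return sinc(theta * r) * cmath.exp(1j * theta * c)

def rope_phase(theta, t):
    """Point-wise RoPE phase exp(i*theta*t)."""
    return cmath.exp(1j * theta * t)

thetas = (0.5, 5.0, 50.0)   # illustrative rotary frequencies (assumption)
delta = 0.1                 # maximum endpoint shift
center = 10.0
shifts = (-delta, 0.0, delta)

rows = []  # (radius, worst RoTE change, worst RoPE change, bound 3*delta/(r-delta))
for r in (0.5, 1.0, 2.0, 4.0, 8.0):
    s, e = center - r, center + r
    rote_worst = max(
        abs(rote_block(th, s + es, e + ee) - rote_block(th, s, e))
        for th in thetas for es in shifts for ee in shifts
    )
    rope_worst = max(
        abs(rope_phase(th, center + delta) - rope_phase(th, center)) for th in thetas
    )
    rows.append((r, rote_worst, rope_worst, 3.0 * delta / (r - delta)))

for r, rote_worst, rope_worst, bound in rows:
    print(f"r={r:4.1f}  RoTE={rote_worst:.4f}  RoPE={rope_worst:.4f}  bound={bound:.4f}")
```

The point-wise change depends only on $\theta\delta$, so it stays the same for every radius, while the interval block's worst-case change stays under a bound that shrinks as the interval grows.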

Results

Visual quality preserved · Temporal control dramatically improved · Robust to annotation noise.

Human-verified temporal grounding on OmniEvents

Experiment: 96.03% of temporal constraints satisfied, by human verification.

Headline temporal grounding metrics comparing Finetuned and TIE

Metric	Base	Finetuned	TIE (RoTE)	Δ vs. Finetuned
Temporal Constraint Satisfaction (TCSR)	—	77.34%	96.03%	+18.70 pp
Boundary Error	—	0.261 s	0.073 s	−0.188 s
Event Occurrence	—	80.45%	96.03%	+15.58 pp
Order Accuracy	—	63.37%	92.32%	+28.95 pp
Overlap Accuracy	—	66.23%	88.64%	+22.40 pp

These numbers follow our human evaluation protocol: 100 structured prompts, one Finetuned and one TIE video per prompt, and 10 human annotators. For each event $e_i=[s_i,t_i]$, annotators verify occurrence $o_i$ and boundary deviations $b_i^s=\hat{s}_i-s_i,\; b_i^t=\hat{t}_i-t_i$.

Event Occurrence

Does the requested event appear?

$\mathrm{Occ}=\frac{1}{N}\sum_i o_i$

Temporal Error

Boundary deviation, with missing events penalized by target duration.

$\mathrm{TE}_i=\frac{|b_i^s|+|b_i^t|}{2}$, or $t_i-s_i$ if $o_i=0$

Order / Overlap

Checks before-after relations and whether concurrent events remain concurrent.

$\min(\hat t_i,\hat t_j)>\max(\hat s_i,\hat s_j)$

TCSR

Prompt-level fraction of satisfied event, order, and overlap constraints.

$\mathrm{TCSR}=\frac{1}{|\mathcal C|}\sum_{c\in\mathcal C}\mathbf{1}[c\ \mathrm{satisfied}]$
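The occurrence, temporal-error, overlap, and TCSR definitions above can be sketched in a few lines. This is an illustrative reading of the formulas, not the evaluation code: the `EventJudgement` container and the example values are hypothetical, and what counts as a "satisfied" constraint is reduced here to simple booleans because the page does not state the tolerance used.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Interval = Tuple[float, float]

@dataclass
class EventJudgement:
    target: Interval               # requested [s_i, t_i]
    observed: Optional[Interval]   # annotated [s_hat_i, t_hat_i]; None if the event never appears

def occurrence_rate(events: List[EventJudgement]) -> float:
    """Occ = (1/N) * sum of o_i."""
    return sum(e.observed is not None for e in events) / len(events)

def temporal_error(e: EventJudgement) -> float:
    """Mean absolute boundary deviation, or the target duration if the event is missing."""
    s, t = e.target
    if e.observed is None:
        return t - s
    s_hat, t_hat = e.observed
    return (abs(s_hat - s) + abs(t_hat - t)) / 2.0

def overlaps(a: Interval, b: Interval) -> bool:
    """Two observed intervals overlap iff min(end) > max(start)."""
    return min(a[1], b[1]) > max(a[0], b[0])

def tcsr(satisfied: List[bool]) -> float:
    """Fraction of satisfied constraints for one prompt."""
    return sum(satisfied) / len(satisfied)

# Hypothetical judgements for one prompt with three requested events.
events = [
    EventJudgement(target=(0.0, 4.0), observed=(0.2, 3.9)),
    EventJudgement(target=(2.0, 6.0), observed=(2.5, 6.0)),
    EventJudgement(target=(7.0, 9.0), observed=None),
]
occ = occurrence_rate(events)
errors = [temporal_error(e) for e in events]
keeps_overlap = overlaps(events[0].observed, events[1].observed)
# Assumption: one occurrence constraint per event plus one overlap constraint.
score = tcsr([e.observed is not None for e in events] + [keeps_overlap])
```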

nDTW

Normalized dynamic time warping over a temporal cost matrix; higher means less temporal drift.

$\exp\!\left(-\frac{1}{\sigma L}\min_P\sum_{(i,j)\in P}D_{ij}\right)$
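A minimal sketch of this score follows, using the textbook dynamic-programming recurrence for the warping path. Two choices are assumptions, since the page does not define them: $L$ is taken to be the number of rows of the cost matrix, and $D_{ij}$ is taken to be an absolute difference between two short per-step sequences.

```python
import math
from typing import List

def dtw_cost(D: List[List[float]]) -> float:
    """Minimum cumulative cost over monotone warping paths through cost matrix D."""
    n, m = len(D), len(D[0])
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
            acc[i][j] = D[i - 1][j - 1] + best_prev
    return acc[n][m]

def ndtw(D: List[List[float]], sigma: float = 1.0) -> float:
    """exp(-DTW / (sigma * L)); L = number of rows is an assumption of this sketch."""
    L = len(D)
    return math.exp(-dtw_cost(D) / (sigma * L))

def cost_matrix(a: List[float], b: List[float]) -> List[List[float]]:
    """Hypothetical temporal cost: absolute difference between sequence entries."""
    return [[abs(x - y) for y in b] for x in a]

reference = [0.0, 1.0, 2.0]
aligned = ndtw(cost_matrix(reference, [0.0, 1.0, 2.0]))   # perfect alignment
shifted_cost = dtw_cost(cost_matrix(reference, [1.0, 2.0, 3.0]))
shifted = ndtw(cost_matrix(reference, [1.0, 2.0, 3.0]))
```

A perfectly aligned timeline has zero warping cost and scores 1; any drift lowers the score toward 0.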

CLIP-Event

Frame-level video-text alignment between event descriptions and the generated event windows.

EMD

Earth Mover's Distance on the temporal event distribution; lower means the generated timeline is closer.
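For one-dimensional histograms on equally spaced bins, Earth Mover's Distance reduces to the summed absolute difference of the two cumulative distributions. The sketch below uses that standard identity; treating the "temporal event distribution" as a per-bin activity histogram is an assumption, since the page does not spell out the binning.

```python
from typing import Sequence

def emd_1d(p: Sequence[float], q: Sequence[float]) -> float:
    """EMD between two histograms on the same unit-spaced bins (each normalized to mass 1)."""
    if len(p) != len(q):
        raise ValueError("histograms must share the same bins")
    total_p, total_q = sum(p), sum(q)
    cdf_gap = 0.0    # running difference between the two cumulative distributions
    distance = 0.0
    for a, b in zip(p, q):
        cdf_gap += a / total_p - b / total_q
        distance += abs(cdf_gap)
    return distance

# Hypothetical activity histograms over three time bins.
same = emd_1d([1, 2, 1], [1, 2, 1])
one_bin_late = emd_1d([1, 0, 0], [0, 1, 0])
two_bins_late = emd_1d([1, 0, 0], [0, 0, 1])
```

Mass that has to travel further along the timeline costs more, so the distance grows with temporal drift.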

Visual quality preserved on PexelsEvents

Method	FID	FVD	Visual Quality	Temporal Cons.	Text Align.	CLIP-Event
Base	59.68	357.51	2.73	2.78	2.68	0.226
Finetuned	43.74	234.40	3.03	3.01	2.86	0.235
DoTE (boundary-only)	43.84	234.79	3.05	3.01	2.88	0.241
TIE (RoTE)	42.53	217.29	3.10	3.05	2.92	0.246

Ablation isolates the gain: NoRoPE → DoTE shows boundary encoding helps; DoTE → RoTE shows interval interior matters; RoTE wins on every metric.

Robust to noisy temporal annotations

A core requirement for large-scale event-conditioned pretraining, not a minor stress test.

At scale, event intervals come from VLMs, action detectors, captioning systems, or other automatic annotators. Their boundaries are approximate, so the model must be robust to endpoint noise by design.

Imperfect timestamps are not an edge case. They are the world we train on.

Expected Boundary-Noise Sensitivity

Theory

$$\left|\mathbf{A}(\tilde{c},\tilde{r})-\mathbf{A}(c,r)\right|\le\frac{K\delta}{r-\delta},\qquad \Delta_{\mathrm{RoTE}}=\mathcal{O}(\delta/r)$$

In contrast, point-wise RoPE and boundary-only DoTE concentrate temporal information on one timestamp or two endpoints, yielding $\mathcal{O}(\theta_{\max}\delta)$ sensitivity with no decay as the event interval grows.

RoTE / TIE

Interval integral

$$\mathrm{RoTE}(k,c,r)=C_r^{-1}\,\mathbb{E}_{\tau\sim\mathcal{U}(I)}\!\left[\mathrm{RoPE}(k,\tau)\right]$$

Noise is averaged over the event support.

Frequency response

Built-in low-pass smoothing

$$\theta\,\left|\operatorname{sinc}(\theta r)\right|=\frac{\left|\sin(\theta r)\right|}{r}\le\frac{1}{r}$$

High-frequency timestamp noise is attenuated.

RoPE / DoTE

Point or endpoint phases

$$\Delta=\mathcal{O}(\theta_{\max}\delta)$$

No interval-radius decay under boundary noise.

At significant boundary-noise strength, TIE still beats clean-timestamp finetuning by +12.2% nDTW and −7.0% EMD.

Robustness of TIE to noisy temporal annotations

Following the paper, we perturb each interval as $[s, e] \to [s+\epsilon_s,\; e+\epsilon_e]$ with Gaussian noise. A noise level of $\sigma = 0.6$ is already large relative to many event durations; under this setting, RoTE still improves nDTW from 0.255 to 0.286 and reduces EMD from 0.142 to 0.132.
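The perturbation can be sketched as follows. Several details are assumptions of this sketch and are not stated on the page: $\sigma$ is treated as seconds, endpoints are re-sorted if the noise flips their order, and intervals are clamped to a hypothetical clip duration.

```python
import random
from typing import List, Tuple

def perturb_interval(start: float, end: float, sigma: float,
                     rng: random.Random, clip_duration: float) -> Tuple[float, float]:
    """Add independent Gaussian noise to both endpoints of [start, end]."""
    s = start + rng.gauss(0.0, sigma)
    e = end + rng.gauss(0.0, sigma)
    s, e = min(s, e), max(s, e)             # assumption: re-sort if noise flips the order
    s = min(max(s, 0.0), clip_duration)     # assumption: clamp to the clip
    e = min(max(e, 0.0), clip_duration)
    return s, e

rng = random.Random(0)                       # fixed seed so the sketch is repeatable
clean: List[Tuple[float, float]] = [(3.0, 9.0), (0.5, 1.2), (6.0, 6.8)]
noisy = [perturb_interval(s, e, sigma=0.6, rng=rng, clip_duration=10.0) for s, e in clean]

for (s, e), (ns, ne) in zip(clean, noisy):
    print(f"[{s:.2f}, {e:.2f}] -> [{ns:.2f}, {ne:.2f}]")
```

With $\sigma = 0.6$, short events such as the 0.7-unit and 0.8-unit intervals in this example can shift by a large fraction of their own length, which is why this noise level is a demanding setting.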

See it in action

Concurrent events and multi-subject interactions. We compare TIE with Seedance 2.0, a state-of-the-art prompt-based time-control baseline, to assess how each responds to timing instructions in complex scenarios. Each video is shown below with its exact text prompt and timestamps, and a few key visual changes are highlighted to make the differences easier to see. Prompt-based methods struggle with temporal accuracy, while TIE remains robust.

General concurrent events

Bimanual robot manipulation

Multi-subject combat (Elden Ring)

Long Video Scenario (No Seedance 2.0 Comparison)

Text-to-video temporal comparison between baseline and TIE

Localized future-event control on GameEvents

OmniEvents

A structured event-prompt dataset built specifically for the concurrent-event regime.

How it is built

OmniEvents combines general videos, robotics demonstrations, and gameplay traces. Open-domain and robotics clips are annotated with a self-reflective VLM pipeline: structured JSON generation, deterministic temporal checks, semantic self-verification, and iterative refinement.

Annotation format

{
  "subject": "left arm",
  "event": "closes around the spoon",
  "start": 3.00,
  "end": 9.00
}
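A record in this format can be consumed with a few lines of standard-library code. The sketch below is illustrative only: the second event, the clip duration, and the helper names are hypothetical, and the checks are one plausible reading of "deterministic temporal checks" and of per-clip overlap, not the pipeline's actual rules.

```python
import json
from itertools import combinations

# The first record follows the annotation format above; the second is a made-up companion event.
raw = """
[
  {"subject": "left arm",  "event": "closes around the spoon", "start": 3.00, "end": 9.00},
  {"subject": "right arm", "event": "lifts the bowl",          "start": 5.50, "end": 7.25}
]
"""

def check_events(events, clip_duration):
    """Return a list of problems: intervals must satisfy 0 <= start < end <= clip_duration."""
    problems = []
    for idx, ev in enumerate(events):
        if not (0.0 <= ev["start"] < ev["end"] <= clip_duration):
            problems.append(f"event {idx}: bad interval [{ev['start']}, {ev['end']}]")
    return problems

def has_concurrent_events(events):
    """True if at least one pair of events overlaps in time."""
    return any(
        min(a["end"], b["end"]) > max(a["start"], b["start"])
        for a, b in combinations(events, 2)
    )

events = json.loads(raw)
problems = check_events(events, clip_duration=10.0)
concurrent = has_concurrent_events(events)
```

Counting the clips for which `has_concurrent_events` is true, and dividing by the number of clips, is one way to arrive at a per-clip overlap probability.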

PexelsEvents

253,903 clips

General-domain web videos with structured event-interval annotations. 68% per-clip event overlap probability.

RoboticsEvents

85,956 clips

Task-specific robotics demonstrations from AgiBot, with event-level temporal supervision for bimanual manipulation.

GameEvents

79,959 clips

Elden Ring gameplay traces with frame-accurate event boundaries from game-state instrumentation.

OmniEvents distribution statistics

Dataset	Clips	Average events per clip	Average event duration (seconds)	Total events	Total event duration (seconds)	Total text-prompt length	Overlap probability
PexelsEvents	253,903	4.72	3.67	1,197,973	4,391,589	164,639,876	68.00%
RoboticsEvents	85,956	14.47	2.79	1,244,058	3,472,766	86,717,027	99.99%
GameEvents	79,959	16.01	1.24	1,280,208	1,584,762	42,368,009	99.63%

Dataset release coming soon

Authors & Affiliations

Zhilei Shu¹,²,∗, Shangwen Zhu²,³,∗, Zihang Liang²,⁶, Xiaofan Li², Qianyu Peng⁸, Xinyu Cui⁷, Bo Ye⁷, Yiming Li⁴, Fan Cheng³, Jian Zhao⁷, Yang Cao¹, Zheng-Jun Zha¹,†, Ruili Feng⁵,²,⁹,∗

¹University of Science and Technology of China · ²Matrix Team · ³Shanghai Jiao Tong University · ⁴Nanyang Technological University · ⁵University of Waterloo · ⁶The Pennsylvania State University · ⁷Zhongguancun Academy · ⁸The University of Hong Kong · ⁹NVIDIA Research

∗ Equal contribution. † Corresponding author. Project lead: Ruili Feng.


Citation

If you find this work helpful, please consider citing:

@misc{shu2026tie,
  title     = {TIE: Time Interval Encoding for Video Generation over Events},
  author    = {Shu, Zhilei and Zhu, Shangwen and Liang, Zihang and Li, Xiaofan
               and Peng, Qianyu and Cui, Xinyu and Ye, Bo and Li, Yiming
               and Cheng, Fan and Zhao, Jian and Cao, Yang
               and Zha, Zheng-Jun and Feng, Ruili},
  year      = {2026},
  eprint    = {2605.10543},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  doi       = {10.48550/arXiv.2605.10543},
  url       = {https://arxiv.org/abs/2605.10543}
}