TIE · Preprint · Matrix Team · 2026

Time Interval Encoding

for Video Generation over Events

Events are not points.

Let them live in time.

Main result: 77.34% → 96.03% temporal satisfaction with coarse intervals
Cost: 0 architecture overhead (drop-in over RoPE)
Convergence: −12.6% FVD at 20K steps; nDTW +4.9%, EMD −9.8%
Effect: 96.03% temporal satisfaction; boundary error 0.261 s → 0.073 s
Presented by Matrix Team — Neural Interactive Simulation
arXiv: 2605.10543 · Status: preprint · Method: plug-and-play · Code: coming soon · Dataset: coming soon
OmniEvents: 419K clips · 3.7M interval events · general + robotics + gameplay

One prompt. Multiple events. Explicit time intervals.

TIE teaser showing event-conditioned video generation with explicit time intervals

TIE turns event descriptions into temporally grounded video control. Each event carries an explicit interval, allowing the model to generate overlapping and concurrent events that point-wise or single-active-prompt conditioning cannot represent.

Insight

What this paper says

Existing video generators know when a frame is, but not when an event lives. They encode time as discrete points, while events occupy intervals. This creates a structural mismatch: overlapping events collapse into ambiguous token sequences. TIE fixes it by generalizing RoPE from point-wise timestamps to interval-aware event keys.

No masks. Standard attention stays.
No frame-perfect labels. Coarse intervals suffice.
No extra architecture. RoPE becomes interval-aware.
Theory

Boundary timestamp noise decays with interval length.

Noise means errors in annotated event start and end times. TIE confines their effect to the interval margin.

$\mathcal{O}(\delta/r)$ vs. $\mathcal{O}(\theta_{\max}\delta)$
Experiment

Strong boundary noise still wins.

+12.2%

nDTW over clean Finetuned under significant event-boundary noise.

0.286 vs. 0.255
Experiment

Temporal drift remains lower.

−7.0%

EMD below clean Finetuned under significant event-boundary noise.

0.132 vs. 0.142
Temporal Constraint Satisfaction
77.34% → 96.03%
+24% relative
Boundary Error
0.261 s → 0.073 s
−72% relative
Concurrent Events in the Wild
68% / 99%
general / robotics & gameplay
Architecture Overhead
0
drop-in over RoPE

Why existing video generators struggle with events

A structural problem, not a scaling problem.

In 68% of general clips and over 99% of robotics & gameplay clips, multiple events overlap in time. Yet every modern multi-event generator rests on a single-active-prompt assumption.

The old grammar breaks at overlap.
RoPE — Point-wise

Time as discrete points

Standard rotary embeddings assign each token a single timestamp. Events that span an interval — let alone overlap — collapse into ambiguous token sequences. Because all temporal evidence is concentrated at points or endpoints, timestamp noise directly perturbs the encoded phase. This is exactly the failure mode that appears in large-scale training, where event boundaries are produced by imperfect automatic annotators.

TIE — Interval-aware

Time as intervals

A textual event token carries an interval $I = [t^s, t^e]$. Cross-attention aggregates positional evidence over the entire span — so concurrent and overlapping events become first-class citizens.

Point-wise RoPE compared with interval-aware TIE

Point-wise RoPE cannot naturally model the point-to-interval activation pattern of an event. TIE supports it natively via interval integration.

Two principles → one closed-form encoder

No heuristics. The form of TIE is uniquely determined.

TIE asks a simple question: if a text token describes an event lasting from start to end, should attention see only one timestamp, two boundaries, or the whole interval?

TIE is the interval-encoding principle. RoTE is the uniform-kernel closed-form instantiation used in our experiments.
Two principles. One inevitable form.
Definition

Point-to-interval attention score

For a video query $q_i$ at time $m_i$ and an event key $k_j$ with interval $I_j=[t_j^s,t_j^e]$, TIE defines the attention logit by aggregating RoPE evidence over the full temporal support of the event.

$$s_{i,j}=s(q_i,k_j;m_i,I_j)$$ the logit used by DiT cross-attention
01

Temporal Integrability

$$\bar{s}_{i,j}=\mathbb{E}_{\tau\sim\mu_{I_j}}\!\left[ s_{\mathrm{RoPE}}(q_i,k_j;m_i,\tau)\right]$$

The cross-attention logit must integrate point-wise RoPE evidence across the event's full support. The interior of the interval is preserved, not collapsed to a center or boundary timestamp.

02

Duration Invariance

$$s_{i,j}=C(\mu_{I_j})^{-1}\bar{s}_{i,j},\qquad C(\mu_{I_j})>0$$

The final logit must reflect semantic relevance, not interval length. Normalization prevents long events from winning simply because they accumulate more temporal evidence.

$$\text{RoTE}(k, c, r) \;=\; \frac{1}{C_r}\, R_{c,r}\, k, \qquad R_{c,r} = \operatorname{blockdiag}_i\!\left(\mathbf{A}_{i,c,r}\right), \qquad \mathbf{A}_{i,c,r} = \operatorname{sinc}(\theta_i r) \begin{pmatrix}\cos(\theta_i c) & -\sin(\theta_i c)\\ \sin(\theta_i c) & \cos(\theta_i c)\end{pmatrix}$$
Closed-form RoTE — uniform kernel TIE instantiation
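
As a sanity check on the closed form, the uniform-kernel average of a single RoPE frequency can be worked out directly. This is a sketch under two assumptions that the page does not state explicitly at this point: the interval is parameterized as $I=[c-r,\,c+r]$, and $\operatorname{sinc}(x)=\sin(x)/x$ (the unnormalized convention, which is the one consistent with the frequency-response identity $\theta\,|\operatorname{sinc}(\theta r)| = |\sin(\theta r)|/r$ given in the robustness section). The normalizer $C_r$ is left untouched.

$$\frac{1}{2r}\int_{c-r}^{c+r} e^{\mathrm{i}\theta_i \tau}\,d\tau \;=\; \frac{e^{\mathrm{i}\theta_i (c+r)} - e^{\mathrm{i}\theta_i (c-r)}}{2\,\mathrm{i}\,\theta_i r} \;=\; e^{\mathrm{i}\theta_i c}\,\frac{\sin(\theta_i r)}{\theta_i r} \;=\; \operatorname{sinc}(\theta_i r)\, e^{\mathrm{i}\theta_i c}$$

Taking real and imaginary parts gives $\mathbb{E}[\cos(\theta_i\tau)] = \operatorname{sinc}(\theta_i r)\cos(\theta_i c)$ and $\mathbb{E}[\sin(\theta_i\tau)] = \operatorname{sinc}(\theta_i r)\sin(\theta_i c)$, which are exactly the entries of the $2\times 2$ block $\mathbf{A}_{i,c,r}$: a rotation by the center phase, damped by a factor that depends only on the radius.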
Visual intuition
Final method schematic from the paper: point-wise RoPE compared with interval-aware TIE

The figure illustrates the modeling consequence of the two definitions above: point-wise RoPE samples a single timestamp, while TIE turns each event token into an interval-aware key whose evidence is integrated across time.

The center $c$ controls the rotation phase; the radius $r$ acts as a built-in temporal low-pass filter via $\operatorname{sinc}(\theta r)$. As $r \to 0$, RoTE reduces to standard RoPE — so it slots into any DiT with zero overhead.
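
The following NumPy sketch illustrates the role of the center and radius described above. It is not the authors' implementation: the function names (`rope_frequencies`, `rope_rotate`, `rote_key`, `attention_logit`), the interleaved pair layout, the frequency base of 10000, and the choice $C_r = 1$ are all assumptions made for illustration.

```python
import numpy as np

def rope_frequencies(dim, base=10000.0):
    """Per-pair rotary frequencies theta_i (assumed standard RoPE schedule)."""
    return base ** (-np.arange(0, dim, 2) / dim)

def rope_rotate(x, t, thetas):
    """Point-wise RoPE: rotate each (even, odd) pair of x by the angle theta_i * t."""
    x1, x2 = x[0::2], x[1::2]
    cos, sin = np.cos(thetas * t), np.sin(thetas * t)
    out = np.empty_like(x)
    out[0::2] = cos * x1 - sin * x2
    out[1::2] = sin * x1 + cos * x2
    return out

def rote_key(k, c, r, thetas):
    """Uniform-kernel interval key: rotation at center c, damped by sinc(theta_i * r).

    Assumption: the normalizer C_r is set to 1 here.
    np.sinc(x) is sin(pi x) / (pi x), so the argument is divided by pi
    to obtain the unnormalized sin(x) / x.
    """
    damp = np.sinc(thetas * r / np.pi)
    return np.repeat(damp, 2) * rope_rotate(k, c, thetas)

def attention_logit(q, k, m, c, r, thetas):
    """Logit between a video query at time m and an event key on [c - r, c + r]."""
    return float(rope_rotate(q, m, thetas) @ rote_key(k, c, r, thetas))

if __name__ == "__main__":
    thetas = rope_frequencies(8)
    rng = np.random.default_rng(0)
    q, k = rng.normal(size=8), rng.normal(size=8)
    # With r = 0 the damping factor is 1, so the key is the plain RoPE key.
    print(np.allclose(rote_key(k, 2.0, 0.0, thetas), rope_rotate(k, 2.0, thetas)))
    print(attention_logit(q, k, m=4.0, c=5.0, r=3.0, thetas=thetas))
```

Because the damping is a per-pair scalar applied to the key, the query side and the attention operator are unchanged in this sketch, which is one way to read the "zero overhead" claim.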

Robust to noisy timestamps — by construction

Boundary perturbations only affect the marginal portion of the integration domain. The main theorem gives $\Delta_{\mathrm{RoTE}}=\mathcal{O}(\delta/r)$ when $\delta\le r/2$, while point-wise RoPE and boundary-only DoTE have local worst-case sensitivity $\mathcal{O}(\theta_{\max}\delta)$ with no decay in interval radius. This strict structural advantage is what makes interval-conditioned large-scale pretraining with VLM-derived annotations practical.
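
A small numerical sketch can make the contrast concrete. It works with one complex coefficient per frequency (the phase $e^{\mathrm{i}\theta t}$ for a point timestamp, and $\operatorname{sinc}(\theta r)\,e^{\mathrm{i}\theta c}$ for an interval) and measures the worst change when each endpoint moves by at most $\delta$. The frequency range with $\theta_{\max}=1$, the grid search, and the helper names are assumptions for illustration; this is not the paper's proof or evaluation code.

```python
import numpy as np

def rope_coeff(thetas, t):
    """Per-frequency phase of a point timestamp t."""
    return np.exp(1j * thetas * t)

def rote_coeff(thetas, s, e):
    """Per-frequency coefficient of the interval [s, e] under a uniform kernel."""
    c, r = (s + e) / 2.0, (e - s) / 2.0
    return np.sinc(thetas * r / np.pi) * np.exp(1j * thetas * c)

def rope_sensitivity(thetas, t, delta, n=21):
    """Worst coefficient change when the timestamp moves by at most delta."""
    base = rope_coeff(thetas, t)
    eps = np.linspace(-delta, delta, n)
    return max(float(np.abs(rope_coeff(thetas, t + d) - base).max()) for d in eps)

def rote_sensitivity(thetas, s, e, delta, n=21):
    """Worst coefficient change when both endpoints move by at most delta."""
    base = rote_coeff(thetas, s, e)
    eps = np.linspace(-delta, delta, n)
    worst = 0.0
    for ds in eps:
        for de in eps:
            diff = np.abs(rote_coeff(thetas, s + ds, e + de) - base)
            worst = max(worst, float(diff.max()))
    return worst

if __name__ == "__main__":
    thetas = np.linspace(0.01, 1.0, 200)  # assumed frequency range, theta_max = 1
    delta = 0.5
    print("point-wise:", rope_sensitivity(thetas, 10.0, delta))
    for r in (2.0, 8.0, 32.0):            # all satisfy delta <= r / 2
        print("radius", r, "interval:", rote_sensitivity(thetas, 40.0 - r, 40.0 + r, delta))
```

Under these assumptions the point-wise value does not depend on any interval length, while the interval value stays below a constant multiple of $\delta/(r-\delta)$.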

Results

Visual quality preserved · Temporal control dramatically improved · Robust to annotation noise.

Human-verified temporal grounding on OmniEvents

Experiment 96.03% Temporal constraints satisfied, by human verification.
Headline temporal grounding metrics comparing Finetuned and TIE
| Metric | Finetuned | TIE (RoTE) | Δ vs. Finetuned |
| --- | --- | --- | --- |
| Temporal Constraint Satisfaction (TCSR) | 77.34% | 96.03% | +18.70 pp |
| Boundary Error | 0.261 s | 0.073 s | −0.188 s |
| Event Occurrence | 80.45% | 96.03% | +15.58 pp |
| Order Accuracy | 63.37% | 92.32% | +28.95 pp |
| Overlap Accuracy | 66.23% | 88.64% | +22.40 pp |

These numbers follow our human evaluation protocol: 100 structured prompts, one Finetuned and one TIE video per prompt, and 10 human annotators. For each event $e_i=[s_i,t_i]$, annotators verify occurrence $o_i$ and boundary deviations $b_i^s=\hat{s}_i-s_i,\; b_i^t=\hat{t}_i-t_i$.

Event Occurrence

Does the requested event appear?

$\mathrm{Occ}=\frac{1}{N}\sum_i o_i$

Temporal Error

Boundary deviation, with missing events penalized by target duration.

$\mathrm{TE}_i=\frac{|b_i^s|+|b_i^t|}{2}$, or $t_i-s_i$ if $o_i=0$

Order / Overlap

Checks before-after relations and whether concurrent events remain concurrent.

$\min(\hat t_i,\hat t_j)>\max(\hat s_i,\hat s_j)$

TCSR

Prompt-level fraction of satisfied event, order, and overlap constraints.

$\mathrm{TCSR}=\frac{1}{|\mathcal C|}\sum_{c\in\mathcal C}\mathbf{1}[c\ \mathrm{satisfied}]$

nDTW

Normalized dynamic time warping over a temporal cost matrix; higher means less temporal drift.

$\exp\!\left(-\frac{1}{\sigma L}\min_P\sum_{(i,j)\in P}D_{ij}\right)$

CLIP-Event

Frame-level video-text alignment between event descriptions and the generated event windows.

EMD

Earth Mover's Distance on the temporal event distribution; lower means the generated timeline is closer.
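
The event-level definitions above (occurrence, temporal error, the overlap test, and TCSR) can be sketched in a few lines of Python. The `EventResult` container and all function names are hypothetical, and TCSR is computed here from a list of already-judged boolean constraints, since the page does not spell out how annotators decide that an individual constraint is satisfied.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class EventResult:
    """One requested event with its human-verified outcome (hypothetical container)."""
    target_start: float                 # s_i
    target_end: float                   # t_i
    occurred: bool                      # o_i
    pred_start: Optional[float] = None  # observed start, if the event occurred
    pred_end: Optional[float] = None    # observed end, if the event occurred

def occurrence_rate(events: Sequence[EventResult]) -> float:
    """Occ = (1/N) * sum_i o_i."""
    return sum(e.occurred for e in events) / len(events)

def temporal_error(e: EventResult) -> float:
    """Mean absolute boundary deviation; a missing event costs its target duration."""
    if not e.occurred:
        return e.target_end - e.target_start
    b_start = e.pred_start - e.target_start
    b_end = e.pred_end - e.target_end
    return (abs(b_start) + abs(b_end)) / 2.0

def overlaps(a: EventResult, b: EventResult) -> bool:
    """Two observed events overlap if the earlier end comes after the later start."""
    return min(a.pred_end, b.pred_end) > max(a.pred_start, b.pred_start)

def tcsr(satisfied: Sequence[bool]) -> float:
    """Fraction of satisfied constraints in the constraint set C."""
    return sum(satisfied) / len(satisfied)

if __name__ == "__main__":
    e1 = EventResult(1.0, 4.0, True, 1.2, 3.9)
    e2 = EventResult(2.0, 6.0, False)
    e3 = EventResult(3.0, 5.0, True, 3.0, 5.5)
    print(occurrence_rate([e1, e2, e3]))
    print(temporal_error(e1), temporal_error(e2))
    print(overlaps(e1, e3))
    print(tcsr([True, True, False, True]))
```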

Visual quality preserved on PexelsEvents

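| Method | FID | FVD | Visual Quality | Temporal Cons. | Text Align. | CLIP-Event |
| --- | --- | --- | --- | --- | --- | --- |
| Base | 59.68 | 357.51 | 2.73 | 2.78 | 2.68 | 0.226 |
| Finetuned | 43.74 | 234.40 | 3.03 | 3.01 | 2.86 | 0.235 |
| DoTE (boundary-only) | 43.84 | 234.79 | 3.05 | 3.01 | 2.88 | 0.241 |
| TIE (RoTE) | 42.53 | 217.29 | 3.10 | 3.05 | 2.92 | 0.246 |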

Ablation isolates the gain: NoRoPE → DoTE shows boundary encoding helps; DoTE → RoTE shows interval interior matters; RoTE wins on every metric.

Robust to noisy temporal annotations

A core requirement for large-scale event-conditioned pretraining, not a minor stress test.

At scale, event intervals come from VLMs, action detectors, captioning systems, or other automatic annotators. Their boundaries are approximate, so the model must be robust to endpoint noise by design.

Imperfect timestamps are not an edge case. They are the world we train on.
Expected Boundary-Noise Sensitivity
Theory
$$\lvert A(\tilde{c},\tilde{r}) - A(c,r)\rvert \le \frac{K\delta}{r-\delta}, \qquad \Delta_{\mathrm{RoTE}} = \mathcal{O}(\delta/r)$$

In contrast, point-wise RoPE and boundary-only DoTE concentrate temporal information on one timestamp or two endpoints, yielding $\mathcal{O}(\theta_{\max}\delta)$ sensitivity with no decay as the event interval grows.

RoTE / TIE

Interval integral

$$\mathrm{RoTE}(k,c,r) = C_r^{-1}\,\mathbb{E}_{\tau\sim U(I)}\!\left[\mathrm{RoPE}(k,\tau)\right]$$

Noise is averaged over the event support.
Frequency response

Built-in low-pass smoothing

$$\theta\,\lvert\operatorname{sinc}(\theta r)\rvert = \frac{\lvert\sin(\theta r)\rvert}{r} \le \frac{1}{r}$$

High-frequency timestamp noise is attenuated.
RoPE / DoTE

Point or endpoint phases

$$\Delta = \mathcal{O}(\theta_{\max}\delta)$$

No interval-radius decay under boundary noise.
At significant boundary-noise strength, TIE still beats clean-timestamp finetuning by +12.2% nDTW and −7.0% EMD.
Robustness of TIE to noisy temporal annotations

Following the paper, we perturb each interval as $[s, e] \to [s + \epsilon_s,\; e + \epsilon_e]$ with Gaussian noise. A noise level of $\sigma = 0.6$ is already large relative to many event durations; under this setting, RoTE still improves nDTW from 0.255 to 0.286 and reduces EMD from 0.142 to 0.132.
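
The perturbation itself is simple to reproduce in outline. The sketch below adds independent zero-mean Gaussian noise to both endpoints; the function name `perturb_intervals`, the re-sorting of endpoints that cross, and the minimum-duration floor are assumptions, since the page does not say how degenerate intervals are handled.

```python
import numpy as np

def perturb_intervals(intervals, sigma, rng, min_duration=1e-3):
    """Map each [s, e] to [s + eps_s, e + eps_e] with eps ~ N(0, sigma^2).

    Assumptions (not from the source): if noise makes the endpoints cross they
    are re-sorted, and every interval keeps at least `min_duration` of length.
    """
    iv = np.asarray(intervals, dtype=float)
    noisy = iv + rng.normal(0.0, sigma, size=iv.shape)
    noisy = np.sort(noisy, axis=1)                                   # keep start <= end
    noisy[:, 1] = np.maximum(noisy[:, 1], noisy[:, 0] + min_duration)
    return noisy

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = [[0.0, 3.0], [1.0, 2.5], [3.0, 9.0]]
    print(perturb_intervals(clean, sigma=0.6, rng=rng))
```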

See it in action

Concurrent events, multi-subject interactions, controllable future editing.

Text-to-video temporal comparison between baseline and TIE
Localized future-event control on GameEvents
Robotics temporal control comparison

OmniEvents

A structured event-prompt dataset built specifically for the concurrent-event regime.

How it is built

OmniEvents combines general videos, robotics demonstrations, and gameplay traces. Open-domain and robotics clips are annotated with a self-reflective VLM pipeline: structured JSON generation, deterministic temporal checks, semantic self-verification, and iterative refinement.

Annotation format
{
  "subject": "left arm",
  "event": "closes around the spoon",
  "start": 3.00,
  "end": 9.00
}
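
A record in this format maps directly onto the center and radius used by the interval encoder. The sketch below parses one annotation, applies a simple ordering check, and derives $c$ and $r$. The function name `load_event` and the specific check (`0 <= start < end`) are assumptions; the page mentions deterministic temporal checks but does not list them.

```python
import json

def load_event(raw: str) -> dict:
    """Parse one event annotation and attach its interval center and radius."""
    ev = json.loads(raw)
    start, end = float(ev["start"]), float(ev["end"])
    # Assumed sanity check: non-negative start and strictly positive duration.
    if not (0.0 <= start < end):
        raise ValueError(f"invalid interval [{start}, {end}]")
    ev["center"] = (start + end) / 2.0   # c, the rotation phase
    ev["radius"] = (end - start) / 2.0   # r, the half-duration
    return ev

if __name__ == "__main__":
    raw = '{"subject": "left arm", "event": "closes around the spoon", "start": 3.00, "end": 9.00}'
    ev = load_event(raw)
    print(ev["subject"], ev["event"], ev["center"], ev["radius"])
```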
PexelsEvents annotation example

PexelsEvents

253,903 clips

General-domain web videos with structured event-interval annotations. 68% per-clip event overlap probability.

RoboticsEvents annotation example

RoboticsEvents

85,956 clips

Task-specific robotics demonstrations from AgiBot, with event-level temporal supervision for bimanual manipulation.

GameEvents annotation example

GameEvents

79,959 clips

Elden Ring gameplay traces with frame-accurate event boundaries from game-state instrumentation.

OmniEvents distribution statistics

| Dataset | Clips | Average events per clip | Average event duration (s) | Total events | Total event duration (s) | Total text-prompt length | Overlap probability |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PexelsEvents | 253,903 | 4.72 | 3.67 | 1,197,973 | 4,391,589 | 164,639,876 | 68.00% |
| RoboticsEvents | 85,956 | 14.47 | 2.79 | 1,244,058 | 3,472,766 | 86,717,027 | 99.99% |
| GameEvents | 79,959 | 16.01 | 1.24 | 1,280,208 | 1,584,762 | 42,368,009 | 99.63% |
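
The headline figures quoted for OmniEvents (419K clips, 3.7M interval events) and the per-dataset averages can be cross-checked from the totals in the table. The short script below only re-derives numbers from the values listed above; the variable names are illustrative.

```python
# (clips, total events, total event duration in seconds), copied from the table
rows = {
    "PexelsEvents":   (253_903, 1_197_973, 4_391_589),
    "RoboticsEvents": (85_956, 1_244_058, 3_472_766),
    "GameEvents":     (79_959, 1_280_208, 1_584_762),
}

total_clips = sum(clips for clips, _, _ in rows.values())
total_events = sum(events for _, events, _ in rows.values())
print(f"total clips:  {total_clips:,}")
print(f"total events: {total_events:,}")

for name, (clips, events, seconds) in rows.items():
    events_per_clip = events / clips
    seconds_per_event = seconds / events
    print(f"{name}: {events_per_clip:.2f} events/clip, {seconds_per_event:.2f} s/event")
```

The totals round to the 419K clips and 3.7M events quoted at the top of the page, and the derived averages agree with the table to two decimals.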

Dataset release coming soon

Authors & Affiliations

Zhilei Shu1,2,, Shangwen Zhu2,3,, Zihang Liang2,6, Xiaofan Li2, Qianyu Peng8, Xinyu Cui7, Bo Ye7, Yiming Li4, Fan Cheng3, Jian Zhao7, Yang Cao1, Zheng-Jun Zha1,, Ruili Feng5,2,9,

1University of Science and Technology of China  ·  2Matrix Team  ·  3Shanghai Jiao Tong University  ·  4Nanyang Technological University  ·  5University of Waterloo  ·  6The Pennsylvania State University  ·  7Zhongguancun Academy  ·  8The University of Hong Kong  ·  9NVIDIA Research

Equal contribution.   Corresponding author.   Project lead: Ruili Feng.


Citation

If you find this work helpful, please consider citing:

@misc{shu2026tie,
  title     = {TIE: Time Interval Encoding for Video Generation over Events},
  author    = {Shu, Zhilei and Zhu, Shangwen and Liang, Zihang and Li, Xiaofan
               and Peng, Qianyu and Cui, Xinyu and Ye, Bo and Li, Yiming
               and Cheng, Fan and Zhao, Jian and Cao, Yang
               and Zha, Zheng-Jun and Feng, Ruili},
  year      = {2026},
  eprint    = {2605.10543},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  doi       = {10.48550/arXiv.2605.10543},
  url       = {https://arxiv.org/abs/2605.10543}
}