Inference#
1. Inference with the_matrix.py#
We provide the the_matrix class as an interface for easily generating interactive worlds. The related code can be found in the_matrix.py, and a generation example script, generation_example.py, is provided in the root directory.
from the_matrix import the_matrix
the_matrix_generator = the_matrix(generation_model_path="path/to/stage2_model", streaming_model_path="path/to/stage3_model")
the_matrix_generator.generate(
    prompt="...",
    length=8,
    output_folder="./",
    control_signal="..."
)
generation_model_path and streaming_model_path are the paths to the stage2 and stage3 models we provide. Note that the control_signal parameter can be set to None; in that case, the code generates a random control signal for video generation. Detailed parameter descriptions are as follows:
prompt:
The description of the video to be generated.
length:
Length, in seconds, of the generated video.
control_signal:
Control signal for generated video, like "D,D,D,D,D,DL,DL,DL,DL,D,D,D,DR,DR,DR,DR,DR".
Meanings:
"D": The car is moving straight ahead.
"DL": The car is turning left ahead.
"DR": The car is turning right ahead.
If the provided signal contains fewer than 4 * length + 1 control states, it will be randomly padded (see the sketch after this parameter list).
Leave it as None for random generation.
control_seed:
If control_signal is None, this seed determines the randomly generated control signal.
output_folder:
Folder path for saving generated videos.
guidance_scale:
Classifier-free guidance (CFG) scale. The default value is good enough for most cases.
seed:
Random seed for video generation.
gpu_id:
The index of GPU to be used.
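As a minimal sketch of the padding rule above, assuming padding simply appends random states (the helper below is hypothetical and not part of the_matrix):

```python
import random

# Hypothetical helper illustrating the 4 * length + 1 rule described above;
# not the library's actual implementation.
def pad_control_signal(signal, length, seed=0):
    """Pad a comma-separated control signal to 4 * length + 1 states."""
    states = signal.split(",") if signal else []
    required = 4 * length + 1
    rng = random.Random(seed)
    while len(states) < required:
        states.append(rng.choice(["D", "DL", "DR"]))
    return ",".join(states)

print(pad_control_signal("D,D,DL", length=8))  # 33 comma-separated states
```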
The generation function in the_matrix calls inference.py within the respective stage folders. For more detailed generation settings, you can use the corresponding scripts in those folders.
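For the random-signal path described above, a call might look like this (parameter values are illustrative):

```python
from the_matrix import the_matrix

the_matrix_generator = the_matrix(
    generation_model_path="path/to/stage2_model",
    streaming_model_path="path/to/stage3_model",
)

# control_signal=None: the code samples a random control signal;
# control_seed makes that sample reproducible.
the_matrix_generator.generate(
    prompt="...",
    length=8,
    output_folder="./",
    control_signal=None,
    control_seed=42,
)
```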
2. Inference with run_interactive.sh#
Summary#
run_interactive.sh launches a fully parallelized, low-latency pipeline that generates video at 16 FPS end to end (i.e., in real time). The script combines our 8-GPU DiT and VAE parallel inference with stream consistency models to reduce a single-GPU baseline's 32 s per 4 s of video down to 4 s (an 8× speedup) while maintaining infinite-horizon stability.
Highlights#
8-GPU Parallel Inference: The DiT and VAE stages each slice their work across 8 GPUs for a 6–8× speedup over single-GPU inference.
Stream Consistency Models: Novel consistency losses yield 7–10× higher throughput than naïve frame-by-frame generation.
Real-Time Feedback Loop: Sustains a continuous 16 FPS generation/playback cycle in real time.
Two Inference Modes#
API-Driven (`the_matrix.py`)
- Use when embedding generation inside your Python app.
- Offers interactive control via the_matrix.generate(…) calls.
- Suitable for few-shot or ad-hoc video snippets.
Scripted Pipeline (`run_interactive.sh`)
- End-to-end shell script for bulk or real-time production.
- Spins up a Ray cluster, runs all stages in parallel, and tears down automatically.
- Ideal for continuous/live deployments or performance benchmarking.
Configuration#
At the top of run_interactive.sh, set:
# GPUs for DiT stage
NUM_GPUS_DIT=1
# GPUs for VAE stage
NUM_GPUS_VAE=3
# Path to stage4 model weights
MODEL_PATH="../models/stage4"
The script computes:
GPU_IDS: a comma-separated list of the GPU indices NUM_GPUS_DIT through NUM_GPUS_DIT + NUM_GPUS_VAE - 1
CUDA_VISIBLE_DEVICES: exported for Ray and all Python processes
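A minimal sketch of that computation, assuming the DiT stage occupies GPUs 0 through NUM_GPUS_DIT - 1 (the actual script may differ):

```bash
NUM_GPUS_DIT=1
NUM_GPUS_VAE=3

# GPU indices NUM_GPUS_DIT .. NUM_GPUS_DIT+NUM_GPUS_VAE-1, comma-separated.
GPU_IDS=$(seq -s, "$NUM_GPUS_DIT" $((NUM_GPUS_DIT + NUM_GPUS_VAE - 1)))
echo "$GPU_IDS"   # -> 1,2,3
```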
Usage#
Run the full pipeline:
bash run_interactive.sh
Or override via environment:
export NUM_GPUS_DIT=2
export NUM_GPUS_VAE=6
export MODEL_PATH="../models/stage4"
bash run_interactive.sh
Sub-script: start_dit.sh#
bash start_dit.sh <NUM_GPUS_DIT> <MODEL_PATH>
- NUM_GPUS_DIT:
Number of GPUs allocated to DiT.
- MODEL_PATH:
Directory or prefix of stage4 checkpoint files.
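For example, to match the overridden settings above (values are illustrative):

```bash
# Two GPUs for DiT, stage4 weights from the default path.
bash start_dit.sh 2 ../models/stage4
```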
Environment Variables#
CUDA_VISIBLE_DEVICES: List of GPU indices assigned to the Ray head, DiT, VAE, etc.
PYTORCH_CUDA_ALLOC_CONF: Set to expandable_segments:True to optimize CUDA allocator behavior.
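For a manual launch, these might be set as follows (the GPU list is illustrative and depends on your allocation):

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3                      # Ray head, DiT, and VAE GPUs
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True  # optimize allocator behavior
bash run_interactive.sh
```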