TrackerBench
A neutral stress-degradation benchmark for released humanoid whole-body motion-tracking policies — do nominal rankings survive shoves, payloads, sensor noise, and latency?
Exploratory research, in progress (July 2026). This page is a live research log. Everything below comes from a pre-registered probe: kill criteria and thresholds were frozen before any data was collected, and instrument amendments were logged before the affected data existed. Simulation-only (MuJoCo sim2sim), two policy families so far — read as an existence proof, not a benchmark release.
📄 Findings report (PDF) — TrackerBench: Stress-Testing Released Humanoid Motion-Tracking Policies — probe findings, 2026-07-05.
TL;DR
Humanoid whole-body tracking policies (GMT, TWIST, SONIC, BeyondMimic, …) are ranked by nominal tracking precision — MPJPE under ideal conditions. TrackerBench asks whether that ranking survives contact with reality’s disturbances: shoves, payload, sensor noise, actuation latency.
First result, from ~1,120 paired rollouts on a simulated Unitree G1 (two released policies, eight motions, identical seeded perturbations for both policies in every cell):
- GMT tracks more precisely than TWIST on all 8 motions (1.4–2.9 cm vs 1.7–4.1 cm MPJPE) — by today’s metrics, GMT is simply “better.”
- Under stress the ranking inverts: in 33 cells the nominally worse TWIST survives significantly longer. It carries 16 kg of torso payload through a walk where GMT falls almost immediately (survival 1.00 vs 0.15), and it walks through 16× sensor noise that ends GMT.
- But neither policy is uniformly tougher: GMT out-survives TWIST under actuation latency and some pushes (14 reverse cells). Robustness is axis-specific — a single scalar “robustness score” would hide exactly the structure that matters.
- The crossovers are protocol-stable: identical under two different failure definitions (zero flips), seed test–retest agreement 0.88, and measured relative to each policy’s own re-hosted nominal (controlling the re-hosting confound).
Takeaway: nominal precision does not predict stress robustness. The single number every tracking paper reports hides which policy breaks under what — which is precisely what a deployment cares about.
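To make "crossover" and "zero flips" concrete, here is a minimal sketch of how a cell could be labeled and how flips between the two failure definitions could be counted. It is an illustration under stated assumptions, not the project's scoring code: the names `classify_cell` and `count_flips` are hypothetical, and the fixed `margin` stands in for whatever significance test the pre-registration actually specifies.

```python
from statistics import mean


def classify_cell(surv_a, surv_b, margin=0.2):
    """Label one cell from paired survival fractions (one pair per seed).

    Returns "A" if policy A out-survives B by more than `margin`,
    "B" for the reverse, and "tie" otherwise.
    ASSUMPTION: a fixed margin replaces the real significance test.
    """
    if len(surv_a) != len(surv_b):
        raise ValueError("rollouts must be paired seed-for-seed")
    diff = mean([a - b for a, b in zip(surv_a, surv_b)])
    if diff > margin:
        return "A"
    if diff < -margin:
        return "B"
    return "tie"


def count_flips(labels_rule1, labels_rule2):
    """Count cells whose winner reverses (A<->B) between two failure rules.

    A change between "tie" and a winner is not counted as a flip here;
    that choice is also an assumption of this sketch.
    """
    flips = 0
    for cell, first in labels_rule1.items():
        second = labels_rule2[cell]
        if {first, second} == {"A", "B"}:
            flips += 1
    return flips
```

A cell counts as a rank crossover when the survival winner is the policy with the worse nominal MPJPE. Running `count_flips` over the labels obtained under the "fell" and "fell-or-diverged" rules gives the flip count reported above.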
How it’s tested
Each released policy runs on its authors’ own MuJoCo deploy stack (model, PD gains, and observation pipeline verbatim; the policy is a black box). The harness then injects four frozen stress axes: pushes (50–400 N), torso payload (2–24 kg), proprioceptive noise (1–16× base), and action latency (20–100 ms). Randomness is seeded per cell, so every policy faces the same shove directions and noise draws. Failure is scored post hoc under two termination rules (fell; fell-or-diverged), and degradation is measured against each policy’s own nominal run. Pre-registered kill gates guard against the two ways the result could mislead: no reordering anywhere (the benchmark adds nothing) and protocol artifact (the “signal” comes from the harness, not the policies). Neither kill gate fired.
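The per-cell seeding idea can be sketched as follows. This is a hedged illustration rather than the harness itself: `cell_seed`, `push_schedule`, the hashing scheme, and the push timing window are all assumptions made for the example. The one property it is meant to show is that the seed is derived from the cell identity only, never from the policy, so both policies receive identical perturbations.

```python
import hashlib
import math
import random


def cell_seed(motion, axis, level, repeat):
    """Derive a deterministic seed from the cell identity.

    The policy name is deliberately absent from the key (assumed design),
    so every policy evaluated in this cell sees the same random draws.
    """
    key = f"{motion}|{axis}|{level!r}|{repeat}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "little")


def push_schedule(motion, level_newtons, repeat, n_pushes=3):
    """Hypothetical push sampler: (onset_s, force_x, force_y) tuples.

    Directions are uniform in the horizontal plane; the 1-5 s onset
    window is illustrative, not taken from the pre-registration.
    """
    rng = random.Random(cell_seed(motion, "push", level_newtons, repeat))
    pushes = []
    for _ in range(n_pushes):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        onset = rng.uniform(1.0, 5.0)
        pushes.append((onset,
                       level_newtons * math.cos(angle),
                       level_newtons * math.sin(angle)))
    return sorted(pushes)  # ordered by onset time
```

Calling `push_schedule("walk", 200.0, 0)` once for each policy returns the same list both times, which is what makes the rollouts paired.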
Research log
- 2026-07-05 — probe verdict: SURVIVE. GMT + TWIST, 8 shared clips, 1,120 paired cells. 33 rank-crossover cells stable under both failure rules; fingerprints axis-specific; instrument gates (re-host fidelity d=0.000, axis dynamic range 4/4, seed stability 0.86) all passed.
- 2026-07-04 — probe opened. Kill criteria frozen before data. One instrument amendment logged pre-data: world-frame root error is meaningless for heading-free trackers (GMT discards yaw) — divergence is scored in the root frame instead. GMT envelope measured: survives 200 N shoves, 12 kg payload, 8× noise without a single failure — far tougher than nominal numbers suggest.
- Next: third general family (OpenTrack re-host in progress; BeyondMimic blocked — no public checkpoint), PBHC/KungfuBot as a per-motion specialist family, terrain + reference-corruption axes, neutral clip set, then a public leaderboard.
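The 2026-07-04 instrument amendment (scoring divergence in the root frame rather than the world frame) can be illustrated with a small sketch. It assumes "root frame" means positions taken relative to the root and rotated by the inverse of the root yaw; the function names and the mean-keypoint-distance metric are hypothetical and may differ from what the harness computes.

```python
import math


def to_root_frame(point, root_pos, root_yaw):
    """Express a world-frame 3D point in a yaw-aligned frame at the root."""
    dx = point[0] - root_pos[0]
    dy = point[1] - root_pos[1]
    dz = point[2] - root_pos[2]
    c, s = math.cos(-root_yaw), math.sin(-root_yaw)
    return (c * dx - s * dy, s * dx + c * dy, dz)


def root_frame_error(sim_pts, sim_root, sim_yaw, ref_pts, ref_root, ref_yaw):
    """Mean keypoint distance after moving each pose into its own root frame.

    Because each pose is normalized by its own root position and yaw,
    a tracker that drifts in heading or position but reproduces the
    body pose is not penalized (the assumed intent of the amendment).
    """
    total = 0.0
    for p, q in zip(sim_pts, ref_pts):
        total += math.dist(to_root_frame(p, sim_root, sim_yaw),
                           to_root_frame(q, ref_root, ref_yaw))
    return total / len(sim_pts)
```

Under this definition, a rollout whose body pose matches the reference but whose heading has rotated by 90° scores zero divergence, whereas a world-frame root error would flag it as a failure.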
Working notes and the full pre-registration live in the project repo; this page tracks headline results as they land.