diff-sim benchmark

A unified benchmark of eight differentiable rigid-body simulators (May 2026).

Eight differentiable rigid-body simulators are wired into a single SimulatorAdapter contract and exercised by five experiments behind a shared methodology gate. Target venue: ICRA 2027.
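The SimulatorAdapter contract itself lives in the private repo. As a rough, non-authoritative sketch of what such an adapter interface could look like (only gradient_rollout_cost is named in this document; every other method name and signature below is an assumption):

```python
from typing import Protocol, Tuple
import numpy as np


class SimulatorAdapter(Protocol):
    """Hypothetical sketch of the shared adapter contract (illustrative only)."""

    def load_model(self, model_path: str) -> None:
        """Load a URDF/MJCF model into the backend simulator."""
        ...

    def step(self, q: np.ndarray, v: np.ndarray, tau: np.ndarray,
             dt: float) -> Tuple[np.ndarray, np.ndarray]:
        """Advance the state by one timestep and return (q_next, v_next)."""
        ...

    def rollout_cost(self, q0: np.ndarray, v0: np.ndarray,
                     taus: np.ndarray, dt: float) -> float:
        """Roll the dynamics forward under a torque sequence and return a scalar cost."""
        ...

    def gradient_rollout_cost(self, q0: np.ndarray, v0: np.ndarray,
                              taus: np.ndarray, dt: float) -> np.ndarray:
        """Gradient of rollout_cost w.r.t. its inputs, computed with the
        backend's native differentiation (autodiff or analytical derivatives)."""
        ...
```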

Links: paper PDF · repo (private)


TL;DR — gradient correctness on the Franka Panda

The headline result. Each adapter's gradient_rollout_cost is compared against a central finite-difference reference with step $h = \sqrt{\epsilon}$, evaluated on a 5-step rollout of the Franka Panda from $q = 0$, $v = 0$ with random torques.

| Adapter | rel. $L^2$ | rel. $L^\infty$ | cosine sim. | Status |
| --- | --- | --- | --- | --- |
| pinocchio | 8.82 × 10⁻⁸ | 1.88 × 10⁻⁷ | 1.0000 | ok |
| drake | 1.37 × 10⁻⁷ | 2.42 × 10⁻⁷ | 1.0000 | ok |
| jaxsim | 7.53 × 10⁻⁸ | 9.72 × 10⁻⁸ | 1.0000 | ok |
| brax | n/a | n/a | n/a | failed (Panda MJCF upstream bug) |
| mjx | 1.00 | 1.00 | −0.13 | failed (sm_120 XLA compile pathology) |
| newton | 1.5 × 10³⁰ | 7.2 × 10²⁹ | 0.00 | failed (tape contamination on Panda; ok on chain) |
| genesis | n/a | n/a | n/a | gradient not exposed in Python |
| tds | n/a | n/a | n/a | gradient bindings absent from the wheel |

Three of eight adapters (Pinocchio analytical RBD, Drake AutoDiffXd, JaxSim jax.grad) agree with FD to one part in $10^7$. The remaining five fail in different and informative ways, summarised in the findings section below.
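For concreteness, a small numpy sketch of how the central-difference reference and the three table metrics could be computed (the benchmark's exact perturbation set and normalisation are not given here, so treat this as illustrative):

```python
import numpy as np


def central_fd_gradient(rollout_cost, x0, h=np.sqrt(np.finfo(float).eps)):
    """Central finite-difference reference with step h = sqrt(machine eps)."""
    x0 = np.asarray(x0, dtype=float)
    grad = np.zeros_like(x0)
    for i in range(x0.size):
        e = np.zeros_like(x0)
        e[i] = h
        grad[i] = (rollout_cost(x0 + e) - rollout_cost(x0 - e)) / (2.0 * h)
    return grad


def gradient_metrics(g_adapter, g_fd):
    """Relative L2 / L-infinity error and cosine similarity vs the FD reference."""
    g_adapter, g_fd = np.asarray(g_adapter, float), np.asarray(g_fd, float)
    rel_l2 = np.linalg.norm(g_adapter - g_fd) / np.linalg.norm(g_fd)
    rel_linf = np.max(np.abs(g_adapter - g_fd)) / np.max(np.abs(g_fd))
    cosine = g_adapter @ g_fd / (np.linalg.norm(g_adapter) * np.linalg.norm(g_fd))
    return rel_l2, rel_linf, cosine
```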

Figure: Exp1 gradient correctness bar chart.

Robots in the benchmark

Four robots are loadable through the adapter contract. Panda and the parametric chain are exercised by every experiment; Go2, G1, and Allegro landed in Phase 3 and are wired through forward dynamics + floating-base SE(3) integration, but the locomotion / manipulation task suite (Phase 4+) is still ahead.

- Franka Panda: 7-DoF arm, fixed base. Used by Exp1/2/5.
- Unitree Go2: 12-DoF quadruped + 6-DoF floating base. Phase 3.
- Unitree G1: 29-DoF humanoid + 6-DoF floating base. Phase 3.
- Wonik Allegro: 16-DoF dexterous hand, fixed base. Phase 3.
GIFs are rendered from MuJoCo MJCF via the offscreen renderer. Go2 and G1 are shown in passive freefall — the contact-rich locomotion/sliding/impact task suite is Phase 4+ scope.
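The freefall clips exercise exactly the floating-base integration mentioned above. A minimal numpy sketch of one way to take an explicit-Euler step of a free-floating base (the (w, x, y, z) quaternion convention, world-frame linear velocity, and body-frame angular velocity are assumptions, not the adapters' actual conventions):

```python
import numpy as np


def integrate_floating_base(p, quat, v_lin, v_ang, dt):
    """One explicit-Euler step of a free-floating base: position in R^3 plus
    orientation as a unit quaternion updated via the rotation exponential."""
    p_next = p + dt * v_lin

    # Rotation increment from the body-frame angular velocity.
    phi = dt * np.asarray(v_ang, dtype=float)
    angle = np.linalg.norm(phi)
    if angle < 1e-12:
        dq = np.array([1.0, *(0.5 * phi)])          # small-angle approximation
    else:
        axis = phi / angle
        dq = np.concatenate([[np.cos(angle / 2.0)], np.sin(angle / 2.0) * axis])

    # Hamilton product quat * dq (body-frame update, so right multiplication).
    w0, x0, y0, z0 = quat
    w1, x1, y1, z1 = dq
    quat_next = np.array([
        w0 * w1 - x0 * x1 - y0 * y1 - z0 * z1,
        w0 * x1 + x0 * w1 + y0 * z1 - z0 * y1,
        w0 * y1 - x0 * z1 + y0 * w1 + z0 * x1,
        w0 * z1 + x0 * y1 - y0 * x1 + z0 * w1,
    ])
    return p_next, quat_next / np.linalg.norm(quat_next)
```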

A parametric chain robot (planar N-link revolute, $N \in \{3, 6, 12, 24, 48\}$) is also used for the scaling sweep (Exp3) and the learning experiment (Exp4 on chain_3).
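The chain models are generated programmatically for each $N$; a hypothetical generator for such a planar revolute-chain URDF might look roughly like this (link lengths, inertias, limits, and naming are illustrative, not the benchmark's actual values):

```python
def chain_urdf(n_links: int, link_len: float = 0.3, link_mass: float = 1.0) -> str:
    """Return a URDF string for a planar N-link revolute chain (illustrative sketch)."""
    links = ['  <link name="base"/>']
    joints = []
    for i in range(n_links):
        links.append(f"""  <link name="link{i}">
    <inertial>
      <origin xyz="0 0 {link_len / 2}"/>
      <mass value="{link_mass}"/>
      <inertia ixx="0.01" iyy="0.01" izz="0.01" ixy="0" ixz="0" iyz="0"/>
    </inertial>
  </link>""")
        parent = "base" if i == 0 else f"link{i - 1}"
        origin_z = 0.0 if i == 0 else link_len
        joints.append(f"""  <joint name="joint{i}" type="revolute">
    <parent link="{parent}"/>
    <child link="link{i}"/>
    <origin xyz="0 0 {origin_z}"/>
    <axis xyz="0 1 0"/>
    <limit effort="100" velocity="10" lower="-3.14" upper="3.14"/>
  </joint>""")
    body = "\n".join(links + joints)
    return f'<robot name="chain_{n_links}">\n{body}\n</robot>'


with open("chain_3.urdf", "w") as f:
    f.write(chain_urdf(3))
```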


Other experiments

Exp2 — wall-clock timing on the Panda

Figures: Exp2 step timing; Exp2 gradient timing.

Pinocchio analytical ABA: 16 µs/step, 1.9 ms/gradient — the obvious floor. Drake’s AutoDiffXd is 2× slower per step but still 50× faster than the JAX-traced paths (JaxSim, Newton) at this size.
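To make the per-step figure concrete, a minimal timing sketch of a bare ABA call through Pinocchio's Python bindings (using the bundled sample manipulator instead of the Panda URDF, and without the integration and cost bookkeeping the Exp2 harness adds, so absolute numbers will differ):

```python
import time
import numpy as np
import pinocchio as pin

# Build a small fixed-base manipulator model shipped with Pinocchio.
model = pin.buildSampleModelManipulator()
data = model.createData()
q = pin.neutral(model)
v = np.zeros(model.nv)
tau = np.random.randn(model.nv)

# Time repeated forward-dynamics (ABA) evaluations.
n_iters = 10_000
t0 = time.perf_counter()
for _ in range(n_iters):
    pin.aba(model, data, q, v, tau)
elapsed = time.perf_counter() - t0
print(f"ABA: {elapsed / n_iters * 1e6:.1f} us/call")
```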

Exp3 — scaling with chain DoF count

Figures: Exp3 step scaling; Exp3 gradient scaling.

Pinocchio's fitted step exponent $b \approx 0.46$ stays well below linear over this DoF range: ABA is linear-in-DoF in theory, and at these sizes fixed per-call overhead pulls the measured exponent even lower. Genesis is flat: Taichi kernel-launch overhead dominates the actual integration at these sizes.
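The scaling exponents come from a power-law fit $t \approx a N^b$ in log-log space; a sketch of that fit (the timing values below are placeholders, not the measured data):

```python
import numpy as np


def scaling_exponent(dofs, times):
    """Fit t = a * N^b in log-log space and return the exponent b."""
    b, _log_a = np.polyfit(np.log(dofs), np.log(times), 1)
    return b


dofs = np.array([3, 6, 12, 24, 48])
step_times = np.array([4e-6, 7e-6, 1.2e-5, 2.0e-5, 3.5e-5])  # placeholder seconds/step
print(f"step exponent b = {scaling_exponent(dofs, step_times):.2f}")
```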

Exp4 — gradient-based (SHAC) vs gradient-free (CEM)

Figures: Exp4 SHAC cost reduction; Exp4 CEM cost reduction.

Methodology gate: the four gradient-capable adapters that work on chain_3 (Drake, JaxSim, Newton, Pinocchio) all reduce the rollout cost by a factor of 1.02, identical to four significant digits. That agreement confirms the gradient signal is being delivered and consumed the same way across implementations.
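For reference, the gradient-free baseline is a standard cross-entropy-method loop over the open-loop torque sequence; a generic minimal version (hyperparameters are illustrative, not Exp4's tuned configuration) looks like:

```python
import numpy as np


def cem_minimize(rollout_cost, dim, iters=50, pop=64, elite_frac=0.1, sigma0=0.5):
    """Cross-entropy method: sample, keep the elites, refit a diagonal Gaussian."""
    mu = np.zeros(dim)
    sigma = sigma0 * np.ones(dim)
    n_elite = max(1, int(elite_frac * pop))
    for _ in range(iters):
        samples = mu + sigma * np.random.randn(pop, dim)
        costs = np.array([rollout_cost(s) for s in samples])
        elites = samples[np.argsort(costs)[:n_elite]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu


# Toy usage: minimize a quadratic in place of a rollout cost.
best = cem_minimize(lambda x: float(np.sum(x ** 2)), dim=5)
```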

Exp5 — import/load/first-step time + LOC

Figure: Exp5 usability.

JAX initialization accounts for ~3.9 s of JaxSim's first step; Newton's 200 ms first step is dominated by Warp's JIT kernel compilation. Adapter LOC sits in a narrow 239–494 range across all eight, suggesting the contract is at the right level of abstraction.
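Separating import, model load, and first step is what makes JIT warm-up visible in the first-step column; a sketch of such a measurement (the function arguments are placeholders, not the Exp5 harness API):

```python
import importlib
import time


def usability_timings(module_name, load_fn, step_fn):
    """Time import, model load, and first step separately so JIT warm-up
    (JAX tracing, Warp kernel compilation) lands in the first-step number."""
    t0 = time.perf_counter()
    importlib.import_module(module_name)
    t_import = time.perf_counter() - t0

    t0 = time.perf_counter()
    handle = load_fn()
    t_load = time.perf_counter() - t0

    t0 = time.perf_counter()
    step_fn(handle)
    t_first_step = time.perf_counter() - t0

    return t_import, t_load, t_first_step
```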


Findings

  1. MJX on sm_120 (RTX 5060). jit_rollout_cost takes ~60 min of XLA compilation and produces a gradient with cosine similarity $-0.13$ vs FD. Reproducible; isolated to the newest Blackwell SASS.
  2. Brax pipeline vs menagerie Panda. generalized.pipeline.init raises a vmap shape mismatch on the Panda MJCF. Chain models compile fine; since upstream Brax is deprecated, a fix is unlikely.
  3. Differentiable but not in Python. Genesis exposes gradients only via a checkpoint-style sim.sub_step_grad() API that doesn’t compose with a pure-function gradient_rollout_cost contract. TDS’s pip wheel ships the plain-double instantiation only; CppAD bindings exist in-tree but aren’t wired into Python.
  4. Newton gravity sign. Newton stores gravity as a signed scalar along up_vector; passing the positive vector norm (the obvious thing) flipped gravity skyward. Caught by the cross-adapter pendulum-agreement test.
  5. Newton gradient correct on chain, broken on Panda. The same warp.Tape code path reproduces FD to 0.5% rel. error on the 3-DoF chain ($\epsilon = 10^{-2}$) but produces cosine = 0 on the Panda in Exp1. This looks like tape-state contamination across the FD reference's ~45 perturbations; a per-perturbation fresh tape is the obvious next thing to try (sketched below).
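The fresh-tape idea from finding 5, illustrated on a toy Warp kernel rather than the Newton adapter's actual code path: a new wp.Tape is allocated per evaluation so no adjoint state can leak between successive calls (the kernel and cost here are placeholders):

```python
import numpy as np
import warp as wp

wp.init()


@wp.kernel
def square_sum(x: wp.array(dtype=float), out: wp.array(dtype=float)):
    # Toy "rollout cost": sum of squares of the inputs.
    i = wp.tid()
    wp.atomic_add(out, 0, x[i] * x[i])


def cost_and_grad(x_np):
    # Fresh Tape per call, so adjoint state cannot carry over between
    # evaluations (the per-perturbation isolation suggested in finding 5).
    x = wp.array(np.asarray(x_np, dtype=np.float32), dtype=float, requires_grad=True)
    out = wp.zeros(1, dtype=float, requires_grad=True)
    tape = wp.Tape()
    with tape:
        wp.launch(square_sum, dim=x.shape[0], inputs=[x, out])
    tape.backward(loss=out)
    return float(out.numpy()[0]), x.grad.numpy().copy()


cost, grad = cost_and_grad([1.0, 2.0, 3.0])
print(cost, grad)  # expected: 14.0, [2. 4. 6.]
```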

Reproducibility

```bash
git clone https://github.com/shubhamsingh91/sim_diff
cd sim_diff
DOCKER_BUILDKIT=0 docker build -t simdiff-bench:dev -f docker/Dockerfile .
./scripts/reproduce.sh
```

The first run takes ~30 min (it clones the robot_descriptions repos and builds JIT caches). Subsequent runs hit warm caches and finish in ~5 min for the non-MJX adapters; MJX on sm_120 stays at ~60 min because of the documented XLA-compile pathology. All numbers above are single-seed; multi-seed confidence intervals and a contact-rich locomotion task suite are the next milestone.

To run a single adapter:

```bash
docker run --rm --gpus all -v "$PWD:/workspace" -w /workspace \
    simdiff-bench:dev \
    python scripts/run_phase1.py --output-dir results/phase1 --adapter pinocchio
```