Store M as CSR and factor it per-block (dense / sparse / simple) by adenzler-nvidia · Pull Request #1424 · google-deepmind/mujoco_warp

adenzler-nvidia · 2026-06-12T14:56:55Z

Apologies for the size of this one. It is a single coherent change — making M's factorization per-block — but that representation threads through every consumer of M: how it is stored (io), the factor and solve kernels (smooth), the Newton Hessian assembly and the island solver (solver, support), plus the type definitions and tests. Most of the diff is the mechanical consequence of M always being CSR rather than many independent changes, so it is best read as one idea propagated outward. Happy to split it if that helps review.

This makes the inertia matrix M's factorization a per-block decision instead of a single global dense-or-sparse choice. Previously is_sparse picked one factorization for the whole matrix: a dense tile-Cholesky for small models, or MuJoCo's level-structured sparse LDL for everything else. A large model therefore paid the sparse LDL solve even for dofs whose diagonal block is trivial, and a model with one big tree plus many tiny joints could not use a dense factor for the tree without forcing it on everything.

M is now always stored in compact CSR form, and is_sparse only selects the constraint Jacobian/Hessian layout — it no longer governs how M is stored or factored. Each diagonal block of M (a connected kinematic sub-tree, a contiguous dof range) is classified independently into one of three categories: a decoupled diagonal block (a "simple" dof such as a point mass on orthogonal slides, or a free cloth particle) needs no factorization at all and uses D = 1/diag; a small coupled block (≤ 64 dofs) factors as a packed dense tile-Cholesky; a larger coupled block uses the sparse LDL factor. The factor is packed as qLD = [dense region | LDL region], the three passes write disjoint dofs, and a mixed model runs whichever passes it needs (e.g. one large tree plus many small free joints). Dense blocks are densified from the CSR layout with a tile_load_indexed gather, which is why this requires warp-lang 1.14 (the lockfile is pinned to the 1.14.0 release), and the block-Cholesky thread count scales with block size. The island solver's M-solve mirrors the same three-way split.

The win lands wherever a coupled block fits the dense tile, or where decoupled "simple" dofs dominate; a single tree larger than the dense cutoff stays on the sparse LDL path and is essentially unchanged. End-to-end throughput (steps/s, best of 8 on a single-process GPU) and the M factor+solve GPU time per step (the kernel this PR changes), branch vs upstream/main on the same GPU and the same warp 1.14.0:

model	nv	M path	baseline steps/s	this PR steps/s	speedup	M factor+solve baseline → PR
humanoid	27	dense block	7,640,133	8,037,783	1.05×	88.6 → 52.6 µs (−41%)
g1 + hands (512 worlds)	49	dense block	181,060	189,100	1.04×	194 → 46 µs (−76%)
g1 + hands (8192 worlds)	49	dense block	962,968	1,019,926	1.06×	941 → 636 µs (−32%)
three_humanoids	81	3 dense blocks	1,323,305	1,470,055	1.11×	888 → 290 µs (−67%)
flybody	108	sparse LDL	93,783	91,890	0.98×	319 → 328 µs (+3%)
aloha_clutter	136	small blocks	41,868	41,567	0.99×	58 → 23 µs (−60%)
cloth	~2,706	simple + dense	7,033	7,811	1.11×	17.1 → 3.5 µs/CG-iter (−80%)

Per-kernel figures are median per-launch GPU time from Nsight Systems over a real stepping rollout; unchanged control kernels (the Newton Hessian factor, _JTDAJ_sparse) match within 1% between baseline and branch, confirming the measurement. cloth's row is the M⁻¹ preconditioner applied every CG iteration (a free-joint dense solve plus a 1/diag solve for the 2700 particle dofs, replacing one sparse LDL solve over all dofs).

Two honest caveats. flybody (a single 108-dof tree) stays on the sparse LDL path, so it sees no win and a small overhead (the per-dof simple-skip check, with no payoff when there are no simple or dense blocks) — that is the expected cost of the unified path on a pure-sparse model. And mul_m regresses on dense-M models (humanoid: 2.9 → 5.2 µs/launch) because always-CSR M replaces the dense matmul with a CSR gather; this only affects models that were previously is_sparse=False, and the factor win outweighs it. Everything that was already sparse keeps its CSR mul_m unchanged.

A few non-obvious points for review. is_sparse changes meaning: it no longer describes M, only the constraint system, so any code that read it as "M is sparse" must use the new qLD_has_dense / qLD_has_sparse / qLD_has_simple flags. The get_data_into round-trip recomputes MuJoCo's LDL factor via mj_factorM whenever any block is dense or simple, since those factors do not live in the qLD LDL region — a pure-simple model has no LDL region at all. The Newton solver's Hessian assembly now reconstructs M's contribution from CSR (a scatter into the dense H tile, or an M_elemid lookup for the island Hessian) rather than a dense load, since M is no longer stored dense; the Hessian factorization itself is unchanged. And warp-lang 1.14 is now required (and locked to the 1.14.0 release) for tile_load_indexed.

Store the inertia matrix M unconditionally in compact CSR form and make its factorization a per-block decision rather than a global is_sparse choice. is_sparse now selects only the constraint Jacobian/Hessian layout; it no longer governs how M is stored or factored. M's diagonal blocks (connected kinematic sub-trees, each a contiguous dof range) are classified by size: a block small enough for a dense tile-Cholesky factors as a packed dense block, and the rest fall back to the sparse LDL factor. The packed factor is laid out as qLD = [packed dense region | nC LDL region]; the dense and sparse passes write disjoint dofs and may both run for a mixed model (e.g. one large tree plus many small free joints). Dense blocks are densified from the CSR layout with a tile_load_indexed gather (requires warp-lang 1.14), and the block-Cholesky thread count is sourced from BlockDim and scales with block size.

A decoupled diagonal block of M -- a "simple" dof such as a point mass on orthogonal slides or a free cloth particle -- needs no factorization at all: its factor is just D = 1/diag. Classifying these as a third block category (alongside dense and sparse) lets them skip the level-structured LDL solve, which otherwise drags every trivial dof through the forward/backward substitution. Classify a block as simple when it is decoupled (max M row nnz == 1) and give simple dofs dedicated factor and solve kernels, plus a fused factor+solve for factor_solve_i. solve_LD's sparse pass skips simple dofs so the passes stay disjoint. The island solver mirrors the same three-way split (dense / sparse / simple) with its own island-local simple solve. Also fix the get_data_into round-trip: a pure-simple model has no dense or LDL region, so reconstruct MuJoCo's LDL factor via mj_factorM whenever any block is dense or simple (simple factors live in qLDiagInv, never in the qLD LDL region).

Regenerate uv.lock for the warp-lang 1.14 bump and change the dev-extra specifier from a prerelease floor (>=1.14.0.dev0) to >=1.14, so the lock resolves to the 1.14.0 release reproducibly instead of tracking warp nightlies. The tile_load_indexed gather only needs the 1.14 release. Reduce the island packed-block Cholesky solve to the minimal change over upstream: keep only the packed-offset load (qLD_block_adr), the reshape, and fill_mode="upper", and restore upstream's local names so the diff shows just the storage change rather than a rename churn.

erikfrey · 2026-06-13T14:10:18Z

@adenzler-nvidia I notice the test against stable deps is failing with NaNs, but that you updated pyproject.toml so it's not immediately obvious to me why this would be. Let me know if you'd like us to review it even with the failing tests, or if you still have a bit more cleanup to do.

adenzler-nvidia · 2026-06-15T08:09:40Z

will put it to draft and investigate - sorry for the noise.

The dense Newton-Hessian assembly densifies M into the H tile with a tile_scatter_add loop whose trip count and stride used the compile-time block_dim. launch_tiled runs a single lane on the Warp CPU backend, so that loop scattered only 1/block_dim of M's entries -- a wrong Hessian and NaN solves on CPU. Local GPU runs were correct and masked it; the CI test jobs run on CPU and went red. Compute the per-lane trip count from the runtime wp.block_dim() so a single CPU lane covers every entry while CUDA keeps its parallel lanes. Keep the uniform count plus enable mask so the collective tile_scatter_add is called by all lanes; a divergent range(rank, nC, lanes) leaves some lanes out of the collective and deadlocks the GPU.

adenzler-nvidia · 2026-06-15T12:15:50Z

Ready for review now - sorry this got a bit big, but it's not trivial to separate this into individual steps.

erikfrey

This is super cool, and brings us one step closer to strong MJC : MJW parity which is great.

There might be a few opportunities to simplify / clean up a bit more as a result of these changes, some suggestions below

…k-mass-factor # Conflicts: # mujoco_warp/_src/derivative_test.py # mujoco_warp/_src/forward_test.py # mujoco_warp/_src/io.py # mujoco_warp/_src/smooth_test.py # mujoco_warp/_src/solver_test.py

Drop the vestigial singleton middle dimension from the CSR matrix family (M, qLD, qLU, qDeriv, qH, M_integration): (nworld, 1, X) -> (nworld, X). The 1 was a placeholder from when M had a dense (nworld, nv, nv) variant that shared the wp.array3d type and the same kernels; with M always CSR it is pure vestige. The whole family is flattened together since they flow through the shared factor_solve_i / solve_m / mul_m kernels. Remove the dead is_sparse branch from _tendon_armature: it wrote a dense M[i, j] layout that can no longer exist, so only the CSR path remains. Delete _update_gradient_set_h_M_dense_island. Both island variants add the CSR mass matrix to the dense island Hessian; the element-parallel sparse variant covers every layout, so a single pass suffices. Trim comments that restated M is CSR now that it is implied.

The dense-block densify index map was read back from the device via tile.adr.numpy(), but the same block-start data is already on the host in the tiles dict that the device array was built from. Read it directly to drop the round-trip (and its sync) from put_model.

The Store-M-as-CSR commit inadvertently dropped the flex_stiffnessadr / flex_bendingadr < 0 guards (and the zero-stiffness early-out) that main carries. Without them, _flex_elasticity and _flex_bending index flex_stiffness / flex_bending at a -1 base for flexes that have no elasticity / bending, reading out of bounds. Restore them to match main.

…k-mass-factor

M couples a dof only with its tree ancestors, so its diagonal blocks are exactly the kinematic trees. Use mjModel's tree_dofadr / tree_dofnum directly instead of re-deriving tree roots from dof_parentid. Verified to produce identical blocks across the test models. dof_simplenum is intentionally not used for the simple/coupled split: it is a contiguous-suffix run-length, so it misses interspersed decoupled dofs; the per-block M_rownnz check covers those.

solver.py: the M_in param comment still said (nworld, 1, nC); it's (nworld, nC) after the flatten. contrib/kernel_analyzer/package-lock.json had reverted to the pre-merge version, dropping upstream's dependabot bumps; restore upstream's.

…k-mass-factor

erikfrey

So good!

adenzler-nvidia added 3 commits June 12, 2026 14:19

adenzler-nvidia force-pushed the adenzler/dense-block-mass-factor branch from ff0093e to 3fd8639 Compare June 12, 2026 14:59

adenzler-nvidia marked this pull request as draft June 15, 2026 08:09

adenzler-nvidia marked this pull request as ready for review June 15, 2026 12:14

erikfrey requested changes Jun 15, 2026

View reviewed changes

Comment thread mujoco_warp/_src/types.py Outdated

Comment thread mujoco_warp/_src/types.py Outdated

Comment thread mujoco_warp/_src/smooth.py

Comment thread mujoco_warp/_src/solver.py Outdated

adenzler-nvidia added 2 commits June 15, 2026 20:50

Merge remote-tracking branch 'upstream/main' into adenzler/dense-bloc…

79e0101

…k-mass-factor # Conflicts: # mujoco_warp/_src/derivative_test.py # mujoco_warp/_src/forward_test.py # mujoco_warp/_src/io.py # mujoco_warp/_src/smooth_test.py # mujoco_warp/_src/solver_test.py