Store M as CSR and factor it per-block (dense / sparse / simple)#1424

Open
adenzler-nvidia wants to merge 13 commits into
google-deepmind:mainfrom
adenzler-nvidia:adenzler/dense-block-mass-factor

Conversation

@adenzler-nvidia

Collaborator

Apologies for the size of this one. It is a single coherent change — making M's factorization per-block — but that representation threads through every consumer of M: how it is stored (io), the factor and solve kernels (smooth), the Newton Hessian assembly and the island solver (solver, support), plus the type definitions and tests. Most of the diff is the mechanical consequence of M always being CSR rather than many independent changes, so it is best read as one idea propagated outward. Happy to split it if that helps review.

This makes the inertia matrix M's factorization a per-block decision instead of a single global dense-or-sparse choice. Previously is_sparse picked one factorization for the whole matrix: a dense tile-Cholesky for small models, or MuJoCo's level-structured sparse LDL for everything else. A large model therefore paid the sparse LDL solve even for dofs whose diagonal block is trivial, and a model with one big tree plus many tiny joints could not use a dense factor for the tree without forcing it on everything.

M is now always stored in compact CSR form, and is_sparse only selects the constraint Jacobian/Hessian layout — it no longer governs how M is stored or factored. Each diagonal block of M (a connected kinematic sub-tree, a contiguous dof range) is classified independently into one of three categories: a decoupled diagonal block (a "simple" dof such as a point mass on orthogonal slides, or a free cloth particle) needs no factorization at all and uses D = 1/diag; a small coupled block (≤ 64 dofs) factors as a packed dense tile-Cholesky; a larger coupled block uses the sparse LDL factor. The factor is packed as qLD = [dense region | LDL region], the three passes write disjoint dofs, and a mixed model runs whichever passes it needs (e.g. one large tree plus many small free joints). Dense blocks are densified from the CSR layout with a tile_load_indexed gather, which is why this requires warp-lang 1.14 (the lockfile is pinned to the 1.14.0 release), and the block-Cholesky thread count scales with block size. The island solver's M-solve mirrors the same three-way split.
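The three-way classification described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the PR's code: `classify_blocks`, `Block`, and `DENSE_MAX_DOFS` are hypothetical names, the 64-dof cutoff is the one quoted in the description, and `rownnz` is assumed to hold the stored nonzero count of each row of M.

```python
from dataclasses import dataclass

DENSE_MAX_DOFS = 64  # cutoff quoted in the description; the constant name is hypothetical


@dataclass
class Block:
    start: int  # first dof of the block (cf. tree_dofadr)
    size: int   # number of dofs in the block (cf. tree_dofnum)
    kind: str   # "simple", "dense", or "sparse"


def classify_blocks(tree_dofadr, tree_dofnum, rownnz):
    """Classify each contiguous diagonal block of M independently."""
    blocks = []
    for start, size in zip(tree_dofadr, tree_dofnum):
        if size == 0:
            continue  # nothing to factor for an empty block
        rows = rownnz[start:start + size]
        if max(rows) == 1:
            kind = "simple"   # decoupled: every row holds only its diagonal
        elif size <= DENSE_MAX_DOFS:
            kind = "dense"    # small coupled block: dense Cholesky
        else:
            kind = "sparse"   # large coupled block: sparse LDL
        blocks.append(Block(start, size, kind))
    return blocks


# Three blocks: 3 decoupled dofs, a 6-dof chain, and a 100-dof chain.
rownnz = [1, 1, 1] + [1, 2, 3, 4, 5, 6] + list(range(1, 101))
for block in classify_blocks([0, 3, 9], [3, 6, 100], rownnz):
    print(block)
```

Because each block gets exactly one label and the blocks cover disjoint dof ranges, the passes for the three categories can write their results without overlapping.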

The win lands wherever a coupled block fits the dense tile, or where decoupled "simple" dofs dominate; a single tree larger than the dense cutoff stays on the sparse LDL path and is essentially unchanged. End-to-end throughput (steps/s, best of 8 on a single-process GPU) and the M factor+solve GPU time per step (the kernel this PR changes), branch vs upstream/main on the same GPU and the same warp 1.14.0:

| model | nv | M path | baseline steps/s | this PR steps/s | speedup | M factor+solve baseline → PR |
| --- | --- | --- | --- | --- | --- | --- |
| humanoid | 27 | dense block | 7,640,133 | 8,037,783 | 1.05× | 88.6 → 52.6 µs (−41%) |
| g1 + hands (512 worlds) | 49 | dense block | 181,060 | 189,100 | 1.04× | 194 → 46 µs (−76%) |
| g1 + hands (8192 worlds) | 49 | dense block | 962,968 | 1,019,926 | 1.06× | 941 → 636 µs (−32%) |
| three_humanoids | 81 | 3 dense blocks | 1,323,305 | 1,470,055 | 1.11× | 888 → 290 µs (−67%) |
| flybody | 108 | sparse LDL | 93,783 | 91,890 | 0.98× | 319 → 328 µs (+3%) |
| aloha_clutter | 136 | small blocks | 41,868 | 41,567 | 0.99× | 58 → 23 µs (−60%) |
| cloth | ~2,706 | simple + dense | 7,033 | 7,811 | 1.11× | 17.1 → 3.5 µs/CG-iter (−80%) |

Per-kernel figures are median per-launch GPU time from Nsight Systems over a real stepping rollout; unchanged control kernels (the Newton Hessian factor, _JTDAJ_sparse) match within 1% between baseline and branch, confirming the measurement. cloth's row is the M⁻¹ preconditioner applied every CG iteration (a free-joint dense solve plus a 1/diag solve for the 2700 particle dofs, replacing one sparse LDL solve over all dofs).

Two honest caveats. flybody (a single 108-dof tree) stays on the sparse LDL path, so it sees no win and a small overhead (the per-dof simple-skip check, with no payoff when there are no simple or dense blocks) — that is the expected cost of the unified path on a pure-sparse model. And mul_m regresses on dense-M models (humanoid: 2.9 → 5.2 µs/launch) because always-CSR M replaces the dense matmul with a CSR gather; this only affects models that were previously is_sparse=False, and the factor win outweighs it. Everything that was already sparse keeps its CSR mul_m unchanged.
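To make the mul_m trade-off concrete, here is a minimal pure-Python sketch of a matrix-vector product over a CSR matrix that stores only the lower triangle. The storage convention (lower triangle with `rowadr` / `rownnz` / `colind` arrays) is an assumption for illustration, and `mul_m_csr` is a hypothetical name rather than the kernel in this PR. It shows why the CSR path is an indexed gather per stored entry instead of a regular dense product.

```python
def mul_m_csr(rowadr, rownnz, colind, values, vec):
    """Compute M @ vec for a symmetric M stored as lower-triangular CSR."""
    n = len(rowadr)
    res = [0.0] * n
    for i in range(n):
        for k in range(rowadr[i], rowadr[i] + rownnz[i]):
            j = colind[k]                 # indexed (gather) access into vec
            res[i] += values[k] * vec[j]
            if j != i:                    # mirror the off-diagonal entry
                res[j] += values[k] * vec[i]
    return res


# Lower triangle of [[2, 1, 0], [1, 3, 1], [0, 1, 4]].
rowadr = [0, 1, 3]
rownnz = [1, 2, 2]
colind = [0, 0, 1, 1, 2]
values = [2.0, 1.0, 3.0, 1.0, 4.0]
print(mul_m_csr(rowadr, rownnz, colind, values, [1.0, 1.0, 1.0]))  # [3.0, 5.0, 5.0]
```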

A few non-obvious points for review. is_sparse changes meaning: it no longer describes M, only the constraint system, so any code that read it as "M is sparse" must use the new qLD_has_dense / qLD_has_sparse / qLD_has_simple flags. The get_data_into round-trip recomputes MuJoCo's LDL factor via mj_factorM whenever any block is dense or simple, since those factors do not live in the qLD LDL region — a pure-simple model has no LDL region at all. The Newton solver's Hessian assembly now reconstructs M's contribution from CSR (a scatter into the dense H tile, or an M_elemid lookup for the island Hessian) rather than a dense load, since M is no longer stored dense; the Hessian factorization itself is unchanged. And warp-lang 1.14 is now required (and locked to the 1.14.0 release) for tile_load_indexed.
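The "scatter into the dense H tile" step can be pictured with a small pure-Python sketch that densifies one diagonal block from CSR. It assumes M is stored as lower-triangular CSR, and `densify_block` is a hypothetical helper; the real kernels use Warp tile operations rather than Python loops.

```python
def densify_block(rowadr, rownnz, colind, values, start, size):
    """Scatter one diagonal block of lower-triangular CSR into a dense symmetric matrix."""
    dense = [[0.0] * size for _ in range(size)]
    for i in range(start, start + size):
        for k in range(rowadr[i], rowadr[i] + rownnz[i]):
            j = colind[k]
            # Block-local indices; write both halves to get a symmetric tile.
            dense[i - start][j - start] = values[k]
            dense[j - start][i - start] = values[k]
    return dense


# A 1-dof block followed by a 2-dof coupled block.
rowadr = [0, 1, 2]
rownnz = [1, 1, 2]
colind = [0, 1, 1, 2]
values = [5.0, 3.0, 1.0, 4.0]
print(densify_block(rowadr, rownnz, colind, values, start=1, size=2))  # [[3.0, 1.0], [1.0, 4.0]]
```

Entries that are not stored stay zero, so the dense tile only needs to be cleared once before the scatter.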

Store the inertia matrix M unconditionally in compact CSR form and make
its factorization a per-block decision rather than a global is_sparse
choice. is_sparse now selects only the constraint Jacobian/Hessian
layout; it no longer governs how M is stored or factored.

M's diagonal blocks (connected kinematic sub-trees, each a contiguous
dof range) are classified by size: a block small enough for a dense
tile-Cholesky factors as a packed dense block, and the rest fall back to
the sparse LDL factor. The packed factor is laid out as
qLD = [packed dense region | nC LDL region]; the dense and sparse passes
write disjoint dofs and may both run for a mixed model (e.g. one large
tree plus many small free joints).

Dense blocks are densified from the CSR layout with a tile_load_indexed
gather (requires warp-lang 1.14), and the block-Cholesky thread count is
sourced from BlockDim and scales with block size.

A decoupled diagonal block of M -- a "simple" dof such as a point mass on
orthogonal slides or a free cloth particle -- needs no factorization at
all: its factor is just D = 1/diag. Classifying these as a third block
category (alongside dense and sparse) lets them skip the level-structured
LDL solve, which otherwise drags every trivial dof through the
forward/backward substitution.

Classify a block as simple when it is decoupled (max M row nnz == 1) and
give simple dofs dedicated factor and solve kernels, plus a fused
factor+solve for factor_solve_i. solve_LD's sparse pass skips simple dofs
so the passes stay disjoint. The island solver mirrors the same three-way
split (dense / sparse / simple) with its own island-local simple solve.
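As a rough illustration of why simple dofs need no factorization, the sketch below solves a mixed system block by block: simple blocks multiply by D = 1/diag, and a small coupled block goes through a textbook dense Cholesky. All names here (`solve_simple`, `cholesky_solve`, `solve_mixed`) are hypothetical, the sparse LDL path is omitted, and this is plain Python rather than the Warp kernels in the PR.

```python
import math


def solve_simple(diag, rhs):
    """Decoupled dofs: the 'factor' is D = 1/diag and the solve is a multiply."""
    return [(1.0 / d) * r for d, r in zip(diag, rhs)]


def cholesky_solve(A, b):
    """Solve A x = b for a small symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    y = [0.0] * n
    for i in range(n):  # forward substitution: L y = b
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # backward substitution: L^T x = y
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def solve_mixed(blocks, rhs):
    """blocks: list of (kind, start, data); each block writes a disjoint dof range."""
    out = [0.0] * len(rhs)
    for kind, start, data in blocks:
        size = len(data)
        segment = rhs[start:start + size]
        if kind == "simple":
            out[start:start + size] = solve_simple(data, segment)
        else:  # "dense"
            out[start:start + size] = cholesky_solve(data, segment)
    return out


blocks = [("simple", 0, [2.0, 4.0]), ("dense", 2, [[4.0, 2.0], [2.0, 3.0]])]
print(solve_mixed(blocks, [1.0, 2.0, 2.0, 1.0]))
```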

Also fix the get_data_into round-trip: a pure-simple model has no dense
or LDL region, so reconstruct MuJoCo's LDL factor via mj_factorM whenever
any block is dense or simple (simple factors live in qLDiagInv, never in
the qLD LDL region).

Regenerate uv.lock for the warp-lang 1.14 bump and change the dev-extra
specifier from a prerelease floor (>=1.14.0.dev0) to >=1.14, so the lock
resolves to the 1.14.0 release reproducibly instead of tracking warp
nightlies. The tile_load_indexed gather only needs the 1.14 release.

Reduce the island packed-block Cholesky solve to the minimal change over
upstream: keep only the packed-offset load (qLD_block_adr), the reshape,
and fill_mode="upper", and restore upstream's local names so the diff
shows just the storage change rather than a rename churn.

@adenzler-nvidia adenzler-nvidia force-pushed the adenzler/dense-block-mass-factor branch from ff0093e to 3fd8639 Compare June 12, 2026 14:59
@erikfrey

Collaborator

@adenzler-nvidia I notice the test against stable deps is failing with NaNs. Since you updated pyproject.toml, it's not immediately obvious to me why that would be. Let me know if you'd like us to review it even with the failing tests, or if you still have a bit more cleanup to do.

@adenzler-nvidia

Collaborator Author

Will move it to draft and investigate. Sorry for the noise.

@adenzler-nvidia adenzler-nvidia marked this pull request as draft June 15, 2026 08:09

The dense Newton-Hessian assembly densifies M into the H tile with a
tile_scatter_add loop whose trip count and stride used the compile-time
block_dim. launch_tiled runs a single lane on the Warp CPU backend, so
that loop scattered only 1/block_dim of M's entries -- a wrong Hessian
and NaN solves on CPU. Local GPU runs were correct and masked it; the CI
test jobs run on CPU and went red.

Compute the per-lane trip count from the runtime wp.block_dim() so a
single CPU lane covers every entry while CUDA keeps its parallel lanes.
Keep the uniform count plus enable mask so the collective tile_scatter_add
is called by all lanes; a divergent range(rank, nC, lanes) leaves some
lanes out of the collective and deadlocks the GPU.
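The lane arithmetic behind this fix can be checked with a small pure-Python simulation. It is not Warp code: `scatter_uniform` and `scatter_compile_time_stride` are hypothetical names that only model which entries each lane touches and how many collective calls each lane makes.

```python
def scatter_uniform(n_entries, lanes):
    """Fixed pattern: trip count from the runtime lane count, plus an enable mask."""
    trips = (n_entries + lanes - 1) // lanes  # identical on every lane
    hits = [0] * n_entries
    collective_calls = [0] * lanes
    for lane in range(lanes):
        for t in range(trips):
            idx = t * lanes + lane
            enabled = idx < n_entries
            collective_calls[lane] += 1       # every lane joins every collective call
            if enabled:
                hits[idx] += 1                # masked-off lanes contribute nothing
    return hits, collective_calls


def scatter_compile_time_stride(n_entries, block_dim, lanes):
    """Buggy pattern: stride and trip count use block_dim even if fewer lanes run."""
    trips = (n_entries + block_dim - 1) // block_dim
    hits = [0] * n_entries
    for lane in range(lanes):
        for t in range(trips):
            idx = t * block_dim + lane
            if idx < n_entries:
                hits[idx] += 1
    return hits


print(scatter_uniform(10, 4))                 # every entry hit once, 3 calls per lane
print(scatter_uniform(10, 1))                 # a single lane still covers all 10 entries
print(scatter_compile_time_stride(10, 4, 1))  # one lane with stride 4 hits entries 0, 4, 8 only
```

In the uniform version all lanes make the same number of collective calls regardless of how many entries they own, which is the property that a divergent `range(rank, nC, lanes)` loop would break.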
@adenzler-nvidia adenzler-nvidia marked this pull request as ready for review June 15, 2026 12:14
@adenzler-nvidia

Collaborator Author

Ready for review now - sorry this got a bit big, but it's not trivial to separate this into individual steps.

@erikfrey erikfrey left a comment

Collaborator

This is super cool, and brings us one step closer to strong MJC : MJW parity which is great.

There might be a few opportunities to simplify / clean up a bit more as a result of these changes; some suggestions are below.

Comment thread mujoco_warp/_src/types.py Outdated
Comment thread mujoco_warp/_src/types.py Outdated
Comment thread mujoco_warp/_src/smooth.py
Comment thread mujoco_warp/_src/solver.py Outdated
…k-mass-factor

# Conflicts:
#	mujoco_warp/_src/derivative_test.py
#	mujoco_warp/_src/forward_test.py
#	mujoco_warp/_src/io.py
#	mujoco_warp/_src/smooth_test.py
#	mujoco_warp/_src/solver_test.py

Drop the vestigial singleton middle dimension from the CSR matrix family
(M, qLD, qLU, qDeriv, qH, M_integration): (nworld, 1, X) -> (nworld, X).
The 1 was a placeholder from when M had a dense (nworld, nv, nv) variant
that shared the wp.array3d type and the same kernels; with M always CSR
it is pure vestige. The whole family is flattened together since they
flow through the shared factor_solve_i / solve_m / mul_m kernels.

Remove the dead is_sparse branch from _tendon_armature: it wrote a dense
M[i, j] layout that can no longer exist, so only the CSR path remains.

Delete _update_gradient_set_h_M_dense_island. Both island variants add
the CSR mass matrix to the dense island Hessian; the element-parallel
sparse variant covers every layout, so a single pass suffices.

Trim comments that restated M is CSR now that it is implied.

Comment thread mujoco_warp/_src/io.py Outdated
Comment thread mujoco_warp/_src/io.py
Comment thread mujoco_warp/_src/passive.py
Comment thread mujoco_warp/_src/passive.py
Comment thread mujoco_warp/_src/types.py Outdated

The dense-block densify index map was read back from the device via
tile.adr.numpy(), but the same block-start data is already on the host
in the tiles dict that the device array was built from. Read it directly
to drop the round-trip (and its sync) from put_model.

The Store-M-as-CSR commit inadvertently dropped the flex_stiffnessadr /
flex_bendingadr < 0 guards (and the zero-stiffness early-out) that main
carries. Without them, _flex_elasticity and _flex_bending index
flex_stiffness / flex_bending at a -1 base for flexes that have no
elasticity / bending, reading out of bounds. Restore them to match main.

M couples a dof only with its tree ancestors, so its diagonal blocks are
exactly the kinematic trees. Use mjModel's tree_dofadr / tree_dofnum
directly instead of re-deriving tree roots from dof_parentid. Verified to
produce identical blocks across the test models.

dof_simplenum is intentionally not used for the simple/coupled split: it
is a contiguous-suffix run-length, so it misses interspersed decoupled
dofs; the per-block M_rownnz check covers those.

solver.py: the M_in param comment still said (nworld, 1, nC); it's
(nworld, nC) after the flatten. contrib/kernel_analyzer/package-lock.json
had reverted to the pre-merge version, dropping upstream's dependabot
bumps; restore upstream's.

@erikfrey erikfrey left a comment

Collaborator

So good!

3 participants