Store M as CSR and factor it per-block (dense / sparse / simple)#1424
Open
adenzler-nvidia wants to merge 13 commits into
Open
Store M as CSR and factor it per-block (dense / sparse / simple)#1424adenzler-nvidia wants to merge 13 commits into
adenzler-nvidia wants to merge 13 commits into
Conversation
Store the inertia matrix M unconditionally in compact CSR form and make its factorization a per-block decision rather than a global is_sparse choice. is_sparse now selects only the constraint Jacobian/Hessian layout; it no longer governs how M is stored or factored. M's diagonal blocks (connected kinematic sub-trees, each a contiguous dof range) are classified by size: a block small enough for a dense tile-Cholesky factors as a packed dense block, and the rest fall back to the sparse LDL factor. The packed factor is laid out as qLD = [packed dense region | nC LDL region]; the dense and sparse passes write disjoint dofs and may both run for a mixed model (e.g. one large tree plus many small free joints). Dense blocks are densified from the CSR layout with a tile_load_indexed gather (requires warp-lang 1.14), and the block-Cholesky thread count is sourced from BlockDim and scales with block size.
A decoupled diagonal block of M -- a "simple" dof such as a point mass on orthogonal slides or a free cloth particle -- needs no factorization at all: its factor is just D = 1/diag. Classifying these as a third block category (alongside dense and sparse) lets them skip the level-structured LDL solve, which otherwise drags every trivial dof through the forward/backward substitution. Classify a block as simple when it is decoupled (max M row nnz == 1) and give simple dofs dedicated factor and solve kernels, plus a fused factor+solve for factor_solve_i. solve_LD's sparse pass skips simple dofs so the passes stay disjoint. The island solver mirrors the same three-way split (dense / sparse / simple) with its own island-local simple solve. Also fix the get_data_into round-trip: a pure-simple model has no dense or LDL region, so reconstruct MuJoCo's LDL factor via mj_factorM whenever any block is dense or simple (simple factors live in qLDiagInv, never in the qLD LDL region).
Regenerate uv.lock for the warp-lang 1.14 bump and change the dev-extra specifier from a prerelease floor (>=1.14.0.dev0) to >=1.14, so the lock resolves to the 1.14.0 release reproducibly instead of tracking warp nightlies. The tile_load_indexed gather only needs the 1.14 release. Reduce the island packed-block Cholesky solve to the minimal change over upstream: keep only the packed-offset load (qLD_block_adr), the reshape, and fill_mode="upper", and restore upstream's local names so the diff shows just the storage change rather than a rename churn.
ff0093e to
3fd8639
Compare
Collaborator
|
@adenzler-nvidia I notice the test against stable deps is failing with NaNs, but that you updated |
Collaborator
Author
|
will put it to draft and investigate - sorry for the noise. |
The dense Newton-Hessian assembly densifies M into the H tile with a tile_scatter_add loop whose trip count and stride used the compile-time block_dim. launch_tiled runs a single lane on the Warp CPU backend, so that loop scattered only 1/block_dim of M's entries -- a wrong Hessian and NaN solves on CPU. Local GPU runs were correct and masked it; the CI test jobs run on CPU and went red. Compute the per-lane trip count from the runtime wp.block_dim() so a single CPU lane covers every entry while CUDA keeps its parallel lanes. Keep the uniform count plus enable mask so the collective tile_scatter_add is called by all lanes; a divergent range(rank, nC, lanes) leaves some lanes out of the collective and deadlocks the GPU.
Collaborator
Author
|
Ready for review now - sorry this got a bit big, but it's not trivial to separate this into individual steps. |
erikfrey
requested changes
Jun 15, 2026
erikfrey
left a comment
Collaborator
There was a problem hiding this comment.
This is super cool, and brings us one step closer to strong MJC : MJW parity which is great.
There might be a few opportunities to simplify / clean up a bit more as a result of these changes, some suggestions below
…k-mass-factor # Conflicts: # mujoco_warp/_src/derivative_test.py # mujoco_warp/_src/forward_test.py # mujoco_warp/_src/io.py # mujoco_warp/_src/smooth_test.py # mujoco_warp/_src/solver_test.py
Drop the vestigial singleton middle dimension from the CSR matrix family (M, qLD, qLU, qDeriv, qH, M_integration): (nworld, 1, X) -> (nworld, X). The 1 was a placeholder from when M had a dense (nworld, nv, nv) variant that shared the wp.array3d type and the same kernels; with M always CSR it is pure vestige. The whole family is flattened together since they flow through the shared factor_solve_i / solve_m / mul_m kernels. Remove the dead is_sparse branch from _tendon_armature: it wrote a dense M[i, j] layout that can no longer exist, so only the CSR path remains. Delete _update_gradient_set_h_M_dense_island. Both island variants add the CSR mass matrix to the dense island Hessian; the element-parallel sparse variant covers every layout, so a single pass suffices. Trim comments that restated M is CSR now that it is implied.
thowell
reviewed
Jun 16, 2026
thowell
reviewed
Jun 16, 2026
thowell
reviewed
Jun 16, 2026
thowell
reviewed
Jun 16, 2026
thowell
reviewed
Jun 16, 2026
The dense-block densify index map was read back from the device via tile.adr.numpy(), but the same block-start data is already on the host in the tiles dict that the device array was built from. Read it directly to drop the round-trip (and its sync) from put_model.
The Store-M-as-CSR commit inadvertently dropped the flex_stiffnessadr / flex_bendingadr < 0 guards (and the zero-stiffness early-out) that main carries. Without them, _flex_elasticity and _flex_bending index flex_stiffness / flex_bending at a -1 base for flexes that have no elasticity / bending, reading out of bounds. Restore them to match main.
M couples a dof only with its tree ancestors, so its diagonal blocks are exactly the kinematic trees. Use mjModel's tree_dofadr / tree_dofnum directly instead of re-deriving tree roots from dof_parentid. Verified to produce identical blocks across the test models. dof_simplenum is intentionally not used for the simple/coupled split: it is a contiguous-suffix run-length, so it misses interspersed decoupled dofs; the per-block M_rownnz check covers those.
solver.py: the M_in param comment still said (nworld, 1, nC); it's (nworld, nC) after the flatten. contrib/kernel_analyzer/package-lock.json had reverted to the pre-merge version, dropping upstream's dependabot bumps; restore upstream's.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Apologies for the size of this one. It is a single coherent change — making M's factorization per-block — but that representation threads through every consumer of M: how it is stored (io), the factor and solve kernels (smooth), the Newton Hessian assembly and the island solver (solver, support), plus the type definitions and tests. Most of the diff is the mechanical consequence of M always being CSR rather than many independent changes, so it is best read as one idea propagated outward. Happy to split it if that helps review.
This makes the inertia matrix M's factorization a per-block decision instead of a single global dense-or-sparse choice. Previously
is_sparsepicked one factorization for the whole matrix: a dense tile-Cholesky for small models, or MuJoCo's level-structured sparse LDL for everything else. A large model therefore paid the sparse LDL solve even for dofs whose diagonal block is trivial, and a model with one big tree plus many tiny joints could not use a dense factor for the tree without forcing it on everything.M is now always stored in compact CSR form, and
is_sparseonly selects the constraint Jacobian/Hessian layout — it no longer governs how M is stored or factored. Each diagonal block of M (a connected kinematic sub-tree, a contiguous dof range) is classified independently into one of three categories: a decoupled diagonal block (a "simple" dof such as a point mass on orthogonal slides, or a free cloth particle) needs no factorization at all and uses D = 1/diag; a small coupled block (≤ 64 dofs) factors as a packed dense tile-Cholesky; a larger coupled block uses the sparse LDL factor. The factor is packed as qLD = [dense region | LDL region], the three passes write disjoint dofs, and a mixed model runs whichever passes it needs (e.g. one large tree plus many small free joints). Dense blocks are densified from the CSR layout with atile_load_indexedgather, which is why this requires warp-lang 1.14 (the lockfile is pinned to the 1.14.0 release), and the block-Cholesky thread count scales with block size. The island solver's M-solve mirrors the same three-way split.The win lands wherever a coupled block fits the dense tile, or where decoupled "simple" dofs dominate; a single tree larger than the dense cutoff stays on the sparse LDL path and is essentially unchanged. End-to-end throughput (steps/s, best of 8 on a single-process GPU) and the M factor+solve GPU time per step (the kernel this PR changes), branch vs upstream/main on the same GPU and the same warp 1.14.0:
Per-kernel figures are median per-launch GPU time from Nsight Systems over a real stepping rollout; unchanged control kernels (the Newton Hessian factor,
_JTDAJ_sparse) match within 1% between baseline and branch, confirming the measurement. cloth's row is the M⁻¹ preconditioner applied every CG iteration (a free-joint dense solve plus a 1/diag solve for the 2700 particle dofs, replacing one sparse LDL solve over all dofs).Two honest caveats. flybody (a single 108-dof tree) stays on the sparse LDL path, so it sees no win and a small overhead (the per-dof simple-skip check, with no payoff when there are no simple or dense blocks) — that is the expected cost of the unified path on a pure-sparse model. And
mul_mregresses on dense-M models (humanoid: 2.9 → 5.2 µs/launch) because always-CSR M replaces the dense matmul with a CSR gather; this only affects models that were previouslyis_sparse=False, and the factor win outweighs it. Everything that was already sparse keeps its CSRmul_munchanged.A few non-obvious points for review.
is_sparsechanges meaning: it no longer describes M, only the constraint system, so any code that read it as "M is sparse" must use the newqLD_has_dense/qLD_has_sparse/qLD_has_simpleflags. Theget_data_intoround-trip recomputes MuJoCo's LDL factor viamj_factorMwhenever any block is dense or simple, since those factors do not live in the qLD LDL region — a pure-simple model has no LDL region at all. The Newton solver's Hessian assembly now reconstructs M's contribution from CSR (a scatter into the dense H tile, or anM_elemidlookup for the island Hessian) rather than a dense load, since M is no longer stored dense; the Hessian factorization itself is unchanged. And warp-lang 1.14 is now required (and locked to the 1.14.0 release) fortile_load_indexed.