[docs/examples] Blackwell cute tutorials: narrow TMEM_LOAD atoms (32dp32b1x) carry a large per-load lowering cost; prefer wider atoms (32dp32b32x) in t2r epilogues #3313
Open
cfregly wants to merge 1 commit into
…LOAD atoms; prefer wider atoms in t2r epilogues ptxas (CUDA 13.x) lowers each tcgen05.ld of the 32dp32b1x atom through a per-load LEPC + CALL.ABS.NOINC + WARPSYNC convergence-helper call: 256 loads/thread for a 128x256 fp32 accumulator = 256 helper calls per warp, which can dominate a small kernel's fixed cost. Switching one line to SM100_TMEM_LOAD_32dp32b32x (8 loads/thread) measured 1.49x on a full GEMM kernel on GB300 (sm_103), bit-identical output; x128 regresses (serialized register writeback). Comment-only change to the five Blackwell tutorials.
What
All five Blackwell CuTe tutorials demonstrate the TMEM->register epilogue
with the narrowest TMEM_LOAD copy atom:

    // examples/cute/tutorial/blackwell/01_mma_sm100.cu (and 02..05 likewise)
    TiledCopy tiled_t2r_copy = make_tmem_copy(SM100_TMEM_LOAD_32dp32b1x{}, tCtAcc);

On sm_103 (CUDA 13.2 ptxas; also observed on the 13.x line generally) every
`tcgen05.ld.sync.aligned.32x32b.x1.b32` this atom emits is lowered to a
per-load convergence-helper subroutine call in SASS (`LEPC` + `CALL.ABS.NOINC` + `WARPSYNC`).

For a 128x256 fp32 accumulator that is 256 loads per thread = 256 helper
calls per warp (~60 cycles each), which can dominate a small kernel's fixed
cost. In a warp-specialized 2048^3 fp16 GEMM kernel built from these
tutorials we measured the t2r phase at 7.65-7.81 us of a 23.8-24.2 us
kernel (in-kernel `%globaltimer` stamps, per-CTA medians, 128 CTAs).

Switching the one line to `SM100_TMEM_LOAD_32dp32b32x` (8 loads/thread)
cut t2r to 0.42 us and the whole kernel from 23.8 -> 16.0 us (1.49x),
bit-identically (`torch.equal` on the outputs). The atom only changes how
many columns one instruction moves; each thread keeps the same
(row, all-columns) fragment, so per-element conversion and store mapping
are untouched.

Width sweep at the same shape: x16 nearly ties x32 (0.51 vs 0.42 us);
x128 regresses (0.99 us t2r plus slower writeback, because the 128-output
asm serializes register writeback), so widest is not best; x32 was the
sweet spot measured.
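
The load counts and ratios quoted above follow from simple arithmetic, sketched here in host-only C++ (no CUTLASS headers or GPU required). `loads_per_thread`, `helper_cycles`, and `speedup` are hypothetical helper names, not CUTLASS APIs. The sketch assumes each thread drains one full accumulator row of 32-bit columns, as with the 32dp32b atoms, and takes the ~60-cycle helper cost as the rough estimate given in the text.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper (not a CUTLASS API): number of tcgen05.ld instructions
// one thread issues to drain a row of `acc_cols` 32-bit columns when the atom
// moves `atom_cols` columns per instruction.
constexpr int loads_per_thread(int acc_cols, int atom_cols) {
    // Ceiling division: a partial final chunk still needs one load.
    return (acc_cols + atom_cols - 1) / atom_cols;
}

// Rough cycles spent in helper calls if every load pays `cycles_per_call`.
constexpr long helper_cycles(int loads, int cycles_per_call) {
    return static_cast<long>(loads) * cycles_per_call;
}

// Speedup as a ratio of kernel times (both arguments in the same unit).
constexpr double speedup(double before_us, double after_us) {
    return before_us / after_us;
}

// Print loads/thread for the atom widths covered by the sweep.
inline void print_width_sweep(int acc_cols) {
    assert(acc_cols > 0);
    const int widths[] = {1, 16, 32, 128};
    for (int w : widths) {
        std::printf("x%-3d : %3d loads/thread\n", w, loads_per_thread(acc_cols, w));
    }
}

static_assert(loads_per_thread(256, 1) == 256, "1x atom: one load per column");
static_assert(loads_per_thread(256, 32) == 8, "32x atom: eight loads per row");
```

For the 256-column accumulator this gives 256, 16, 8, and 2 loads per thread for x1, x16, x32, and x128. `helper_cycles(256, 60)` comes to 15360 cycles for the 1x atom, and `speedup(23.8, 16.0)` is about 1.49.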
A tutorial that demonstrates the 1x atom without comment teaches a ~1.5x
performance bug as the canonical epilogue.
Proposed change
Minimal, docs-only (no behavior change to any library kernel):
In tutorials 01-05, either switch the epilogue atom to
`SM100_TMEM_LOAD_32dp32b32x` (and refresh the affected print annotations),
or keep `32dp32b1x` for pedagogical simplicity and add a short comment
block noting the per-load lowering cost and pointing to the wider atom.

Optionally, a sentence in `media/docs/cpp/cute/0y_tmem_tensor.md` (or the
blackwell functionality doc) noting atom width as a first-class
performance knob.
Standalone evidence
A ~120-line reproducer (attached: `tmem_load_atom_repro.cu`) containing
only the TMEM alloc + `make_tmem_copy` t2r + store (no mainloop, no MMA),
built with `nvcc -std=c++20 -arch=sm_103a` against current CUTLASS headers,
compares the two atoms (`32dp32b1x` vs `32dp32b32x`) on `LDTM` in SASS and
on `CALL.ABS.NOINC` / `LEPC`. The difference can be seen with
`cuobjdump -sass` one-liners.

Environment
pipeline; CUDA-graph interleaved A/B, 7 reps/arm, zero distribution
overlap; ncu cross-check (`sm__pipe_tensor_cycles_active` 20.4% -> 37.9%
of elapsed)
tmem_load_atom_repro.cu
sass_evidence.txt