Matmul benchmark — cljrs vs numpy / jax / pytorch
Square float32 matrix multiplication: C = A · B with A, B ∈ ℝ^(N×N). We build random matrices with `(ml/randn N N)`, then time `(ml/matmul A B)` and `(ml/matmul-gpu A B)` using `performance.now()` around the wasm REPL call. Each configuration runs 3 times; we report the median. GFLOPS = 2·N³ / time / 1e9, with time in seconds.
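The timing harness can be sketched as follows. This is illustrative, not the page's actual code: the `run` callback stands in for whatever invokes `(ml/matmul A B)` through the wasm REPL, and the helper name is hypothetical.

```javascript
// Sketch of the median-of-3 timing loop described above.
// `run` is assumed to execute one (ml/matmul A B) call via the wasm REPL.
function benchmark(run, n, repeats = 3) {
  const times = [];
  for (let i = 0; i < repeats; i++) {
    const t0 = performance.now();
    run();                              // one C = A · B call
    times.push(performance.now() - t0); // wall time in milliseconds
  }
  times.sort((a, b) => a - b);
  const medianMs = times[Math.floor(repeats / 2)];
  // GFLOPS = 2·N³ / time / 1e9, with time converted from ms to seconds
  const gflops = (2 * n ** 3) / (medianMs / 1000) / 1e9;
  return { medianMs, gflops };
}
```

Note that `performance.now()` resolution is coarsened in browsers for side-channel mitigation, which is one more reason the small-N rows are noisy.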
The reference columns are conservative numbers from published f32 matmul benchmarks on a typical mid-grade machine (Apple M2, 8-core, single-threaded numpy 1.26 + OpenBLAS; JAX 0.4 with XLA JIT on the same CPU). They are not measured live in your browser — they're a fixed yardstick so you can read the cljrs columns in context. Your machine, your browser's wasm SIMD support, and thermal state will all move the cljrs numbers around.
| N | cljrs CPU ms (median) | cljrs CPU GFLOPS | cljrs GPU ms (median) | cljrs GPU GFLOPS | numpy ref GFLOPS | JAX ref GFLOPS | PyTorch ref GFLOPS |
|---|---|---|---|---|---|---|---|
| *click Run benchmark to start* | | | | | | | |
Chart
Bars are GFLOPS (higher is better). cljrs columns appear once a run finishes.
Reference baselines
- numpy 1.26 + OpenBLAS, single-threaded, Apple M2, f32, square matmul. Typical sustained throughput for the sizes shown.
- JAX 0.4 + XLA JIT, CPU backend, Apple M2, f32, square matmul. Includes JIT warmup amortized over many calls.
- PyTorch 2.x + MKL/Accelerate, CPU, single-threaded, Apple M2, f32, square matmul.
For an honest comparison, run those same benchmarks on your machine: the cljrs numbers above are real and live; the reference numbers are not. We'll publish a per-machine script in a follow-up.
Notes on the cljrs path
- The CPU matmul is a naive triple loop in `cljrs-ml` (no blocking, no SIMD intrinsics, no threading). It exists to prove the autograd graph works; a tiled / packed kernel is on the roadmap. Don't read these numbers as cljrs's ceiling.
- `ml/matmul-gpu` dispatches a wgpu compute kernel on native targets. In the browser wasm build it currently falls back to the same CPU kernel (a one-time warning is emitted to the console); WebGPU access from inside our wasm crate isn't wired up yet.
- Wall time includes the wasm/JS boundary and the cljrs opaque-tensor wrapper, but excludes the `(ml/randn …)` matrix construction (matrices are built once before timing).
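For a concrete picture of what "naive triple loop" means, here is the same loop structure in JavaScript. The actual cljrs kernel is Rust, so this is a shape-for-shape sketch, not the real implementation; it also shows why the kernel leaves so much throughput on the table (no blocking means the inner loop strides through B column-wise, missing cache).

```javascript
// Naive triple-loop f32 matmul over row-major N×N matrices,
// mirroring the loop structure of the current cljrs-ml CPU kernel:
// no blocking, no SIMD, no threading.
function naiveMatmul(a, b, n) {
  const c = new Float32Array(n * n);
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < n; j++) {
      let acc = 0;
      for (let k = 0; k < n; k++) {
        // b is walked down a column (stride n): cache-hostile
        acc += a[i * n + k] * b[k * n + j];
      }
      c[i * n + j] = acc; // rounded to f32 on store
    }
  }
  return c;
}
```

A tiled kernel reorders these loops over small blocks so that each block of A and B stays in cache while it is reused; that is the "tiled / packed kernel" the roadmap refers to.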