Matmul benchmark — cljrs vs numpy / jax / pytorch
Square float32 matrix multiplication: C = A · B with A, B ∈ ℝ^(N×N). We build random matrices with `(ml/randn N N)`, then time `(ml/matmul A B)` and `(ml/matmul-gpu A B)` using `performance.now()` around the wasm REPL call. Each configuration runs 3 times; we report the median. GFLOPS = 2·N³ / time / 1e9, with time in seconds.
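The timing harness can be sketched as follows. This is illustrative, not the page's actual code: the `run` callback stands in for whatever invokes `(ml/matmul A B)` through the wasm REPL, and the helper name is hypothetical.

```javascript
// Sketch of the median-of-3 timing loop described above.
// `run` is assumed to execute one (ml/matmul A B) call via the wasm REPL.
function benchmark(run, n, repeats = 3) {
  const times = [];
  for (let i = 0; i < repeats; i++) {
    const t0 = performance.now();
    run();                              // one C = A · B call
    times.push(performance.now() - t0); // wall time in milliseconds
  }
  times.sort((a, b) => a - b);
  const medianMs = times[Math.floor(repeats / 2)];
  // GFLOPS = 2·N³ / time / 1e9, with time converted from ms to seconds
  const gflops = (2 * n ** 3) / (medianMs / 1000) / 1e9;
  return { medianMs, gflops };
}
```

Note that `performance.now()` resolution is coarsened in browsers for side-channel mitigation, which is one more reason the small-N rows are noisy.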
The reference columns are conservative numbers from published f32 matmul benchmarks on a typical mid-grade machine (Apple M2, 8-core, single-threaded numpy 1.26 + OpenBLAS; JAX 0.4 with XLA JIT on the same CPU). They are not measured live in your browser — they're a fixed yardstick so you can read the cljrs columns in context. Your machine, your browser's wasm SIMD support, and thermal state will all move the cljrs numbers around.
| N | cljrs CPU ms (median) | cljrs CPU GFLOPS | cljrs GPU ms (median) | cljrs GPU GFLOPS | numpy ref GFLOPS | JAX ref GFLOPS | PyTorch ref GFLOPS |
|---|---|---|---|---|---|---|---|
| *click Run benchmark to start* | | | | | | | |
Chart
Bars are GFLOPS (higher is better). cljrs columns appear once a run finishes.
Reference baselines
- numpy 1.26 + OpenBLAS, single-threaded, Apple M2, f32, square matmul. Typical sustained throughput for the sizes shown.
- JAX 0.4 + XLA JIT, CPU backend, Apple M2, f32, square matmul. Includes JIT warmup amortized over many calls.
- PyTorch 2.x + MKL/Accelerate, CPU, single-threaded, Apple M2, f32, square matmul.
For an honest comparison, run those same benchmarks on your machine: the cljrs numbers above are real and live; the reference numbers are not. We'll publish a per-machine script in a follow-up.
Notes on the cljrs path
- The CPU matmul is a naive triple loop in `cljrs-ml` (no blocking, no SIMD intrinsics, no threading). It exists to prove the autograd graph works; a tiled / packed kernel is on the roadmap. Don't read these numbers as cljrs's ceiling.
- `ml/matmul-gpu` dispatches a wgpu compute kernel on native targets. In the browser wasm build it currently falls back to the same CPU kernel (a one-time warning is emitted to the console); WebGPU access from inside our wasm crate isn't wired up yet.
- Wall time includes the wasm/JS boundary and the cljrs opaque-tensor wrapper, but excludes the `(ml/randn …)` matrix construction (matrices are built once before timing).
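For a concrete picture of what "naive triple loop" means, here is the same loop structure in JavaScript. The actual cljrs kernel is Rust, so this is a shape-for-shape sketch, not the real implementation; it also shows why the kernel leaves so much throughput on the table (no blocking means the inner loop strides through B column-wise, missing cache).

```javascript
// Naive triple-loop f32 matmul over row-major N×N matrices,
// mirroring the loop structure of the current cljrs-ml CPU kernel:
// no blocking, no SIMD, no threading.
function naiveMatmul(a, b, n) {
  const c = new Float32Array(n * n);
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < n; j++) {
      let acc = 0;
      for (let k = 0; k < n; k++) {
        // b is walked down a column (stride n): cache-hostile
        acc += a[i * n + k] * b[k * n + j];
      }
      c[i * n + j] = acc; // rounded to f32 on store
    }
  }
  return c;
}
```

A tiled kernel reorders these loops over small blocks so that each block of A and B stays in cache while it is reused; that is the "tiled / packed kernel" the roadmap refers to.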