# Benchmarks
Reproducible numbers. Source and scripts are in the repo. Methodology, hardware, and known biases are listed below. If something looks off, please open an issue.
## CPU micro-benchmarks
Apple M3 Max, macOS 14.5, best of 5 trials after one warmup run. Numbers as of 2026-04-16.
| Benchmark | cljrs-native | Clojure/JVM | Babashka | jank | vs JVM |
|---|---|---|---|---|---|
| fib(35) | 0.048s | 0.12s | 3.7s | 0.63s | 2.4x |
| loop_sum 100M | 0.081s | 1.08s | 35.2s | 1.91s | 13.3x |
| cond_chain 50M | 0.077s | 0.76s | 8.9s | 1.44s | 9.9x |
### What each tests
- fib(35). Naive recursive Fibonacci. Function-call overhead, integer arithmetic, branching.
- loop_sum. `loop`/`recur` summing 100M integers. Tight-loop codegen.
- cond_chain. 10-arm `cond` over 50M values. Conditional dispatch.
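For reference, the two loop kernels compute roughly the following. This is a Rust sketch of the benchmark logic, not the repo's actual bench sources; names and the particular 10-arm dispatch are illustrative.

```rust
// Sketch of the loop_sum kernel: sum the integers 0..n in a tight loop.
fn loop_sum(n: i64) -> i64 {
    let mut acc: i64 = 0;
    let mut i: i64 = 0;
    while i < n {
        acc += i;
        i += 1;
    }
    acc
}

// Sketch of the cond_chain kernel: a 10-arm conditional dispatch per value.
// The real arms are unspecified in this doc; a modulo dispatch stands in here.
fn cond_chain(n: i64) -> i64 {
    let mut acc: i64 = 0;
    for v in 0..n {
        acc += match v % 10 {
            0 => 1, 1 => 2, 2 => 3, 3 => 4, 4 => 5,
            5 => 6, 6 => 7, 7 => 8, 8 => 9, _ => 10,
        };
    }
    acc
}
```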
### Method
```sh
git clone <repo> && cd cljrs
cargo build --release --features mlir --bin bench
cd bench && ./run.sh
```
For fib the source is:
```clojure
;; bench/fib.clj
(defn fib [n] (if (< n 2) n (+ (fib (- n 1)) (fib (- n 2)))))

;; bench/fib_native.clj
(defn-native fib ^i64 [^i64 n]
  (if (< n 2) n (+ (fib (- n 1)) (fib (- n 2)))))
```
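The `^i64` annotations pin the native signature to machine integers. In Rust terms, the compiled function should behave like the sketch below; this is an assumption about the intended semantics, not the code cljrs emits.

```rust
// Reference semantics for the natively compiled fib:
// plain i64 in and out, no boxing, naive double recursion.
fn fib(n: i64) -> i64 {
    if n < 2 { n } else { fib(n - 1) + fib(n - 2) }
}
```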
### Biases
- JVM numbers exclude JVM startup. We measure the inner loop via `(System/nanoTime)`. Startup-inclusive is worse for the JVM (about 1.5s cold).
- One warmup iteration. Longer warmup favors the JVM slightly.
- Babashka uses GraalVM native-image. Great for startup, poor for hot loops.
- jank uses its `time` macro with millisecond granularity. Numbers under 10ms are noisy.
- No I/O in timed loops. Each kernel returns a scalar, printed after the timer stops.
- One machine, M3 Max. x86_64 results may differ.
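The best-of-5-after-one-warmup protocol is easy to replicate. A minimal Rust harness in the same spirit; the function name and shape are illustrative, not the repo's bench code:

```rust
use std::time::Instant;

// Run `f` once as a warmup, then return the kernel result and the best
// (minimum) wall-clock time over `trials` timed runs, in seconds.
fn best_of<F: FnMut() -> i64>(mut f: F, trials: usize) -> (i64, f64) {
    let warm = f(); // one warmup iteration; result kept so the call isn't elided
    let mut best = f64::INFINITY;
    let mut result = warm;
    for _ in 0..trials {
        let t = Instant::now();
        result = f();
        best = best.min(t.elapsed().as_secs_f64());
    }
    (result, best)
}
```

In a release build, `std::hint::black_box` would be the sturdier way to keep the optimizer from deleting the kernel under test.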
### Versions

| Tool | Version |
|---|---|
| cljrs | current HEAD with `--features mlir` |
| Clojure | 1.12.0 on OpenJDK 22 |
| Babashka | 1.12.196 |
| jank | latest main |
| LLVM | 22.1.3 |
## GPU benchmarks
Elementwise `dst[i] = sin(x[i]) + cos(x[i] * 2)`, f32, on an Apple M3 (integrated GPU, Metal). Steady-state median of 10 runs after warmup, including GPU-to-CPU readback. Lower is better.
| N | cljrs-gpu | cljrs-cpu | numpy | pytorch-cpu | pytorch-mps |
|---|---|---|---|---|---|
| 100k | 1.35 ms | 0.41 ms | 0.24 ms | 0.17 ms | 0.43 ms |
| 1 M | 1.39 ms | 3.35 ms | 2.66 ms | 0.75 ms | 0.93 ms |
| 10 M | 3.63 ms | 34.0 ms | 27.5 ms | 16.96 ms | 4.59 ms |
| 100 M | 32.5 ms | 350 ms | 267 ms | 256 ms | 43.8 ms |
Fastest at each size: pytorch-cpu at 100k and 1M, cljrs-gpu at 10M and 100M.
### Runtimes
- cljrs-gpu. wgpu with Metal backend. vec4 grid-stride kernel emitted from cljrs. Buffer trio reused across calls, input kept on-device between steady-state iterations.
- cljrs-cpu. Single-threaded plain Rust.
- numpy 2.4.4. macOS Accelerate.
- pytorch-cpu, pytorch-mps 2.11.0. CPU and Apple Metal Performance Shaders with synchronize-per-op.
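The cljrs-cpu baseline is described as single-threaded plain Rust; the kernel it times presumably looks like this sketch (function name assumed):

```rust
// Elementwise dst[i] = sin(src[i]) + cos(src[i] * 2) over f32,
// one thread, no explicit SIMD intrinsics.
fn sincos_cpu(src: &[f32], dst: &mut [f32]) {
    for (d, &x) in dst.iter_mut().zip(src) {
        *d = x.sin() + (x * 2.0).cos();
    }
}
```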
### Reading the table
- At 100k and 1M, multithreaded CPU (pytorch, numpy) wins. Kernel launch overhead exceeds compute.
- At 10M and 100M, cljrs-gpu is fastest. Beats pytorch-mps by 20 to 35 percent despite going through generic wgpu instead of Apple's hand-tuned MPS kernels.
- Single-thread Rust is 10 to 13x behind vectorized CPU paths at large N, as expected.
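A quick sanity check on the large-N rows: each element moves 8 bytes (one f32 read, one write), so effective bandwidth can be derived from the table. This is back-of-envelope arithmetic, not a measured figure:

```rust
// Effective bandwidth in GB/s for an elementwise f32 kernel:
// n elements * 8 bytes (read + write) over the elapsed time in ms.
fn effective_gb_per_s(n: f64, ms: f64) -> f64 {
    n * 8.0 / (ms / 1e3) / 1e9
}
```

For the 100M cljrs-gpu row, `effective_gb_per_s(100e6, 32.5)` is about 24.6 GB/s, well below the M3's unified-memory bandwidth, consistent with launch and readback overhead still being part of the measurement.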
### Kernel source
vec4 grid-stride WGSL, emitted from cljrs:
```wgsl
@group(0) @binding(0) var<storage, read> src: array<vec4<f32>>;
@group(0) @binding(1) var<storage, read_write> dst: array<vec4<f32>>;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) gid: vec3<u32>,
        @builtin(num_workgroups) nwg: vec3<u32>) {
    let stride = nwg.x * 256u;
    let n = arrayLength(&src);
    var i = gid.x;
    loop {
        if (i >= n) { break; }
        let v = src[i];
        dst[i] = sin(v) + cos(v * 2.0);
        i = i + stride;
    }
}
```
Each thread processes four f32s per iteration via a single vec4 load, which cuts loop overhead by 4x and lets the memory subsystem issue coalesced 128-bit accesses.
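With the vec4 packing, the dispatch sizes itself off n/4 vec4 elements rather than n f32s. A sketch of the host-side arithmetic, matching the `@workgroup_size(256)` above; the workgroup cap is an assumption, since grid-stride kernels typically clamp total workgroups and let the loop cover the remainder:

```rust
// Workgroup count for a grid-stride dispatch over n f32s packed as vec4s,
// with workgroup_size = 256 and an assumed cap on total workgroups.
fn workgroups(n_f32: u64, cap: u64) -> u64 {
    let n_vec4 = n_f32.div_ceil(4); // elements after vec4 packing
    let full = n_vec4.div_ceil(256); // one thread per vec4 element
    full.min(cap) // the grid-stride loop covers anything beyond the cap
}
```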
### Reproduce
```sh
cargo run --release --features gpu --bin gpu-bench
python bench/gpu/sincos_numpy.py
python bench/gpu/sincos_pytorch.py
```
### Biases
- cljrs-gpu includes GPU to CPU readback per call. PyTorch-MPS keeps tensors on GPU by default. If you chained ops, MPS would look even better. These numbers are one op with full round-trip.
- One kernel. Doesn't exercise matmul, reductions, stencils.
- macOS Accelerate on Apple Silicon. On Linux x86 numpy would use OpenBLAS or MKL. Numbers may differ.
- No CUDA on this machine. Planned for a box with an NVIDIA GPU.
- WebGPU not timed. Same kernel runs in browsers on the GPU demo page, but reliable browser timing needs instrumentation we haven't built yet.