# Benchmarks
Reproducible numbers. Source and scripts are in the repo. Methodology, hardware, and known biases are listed below. If something looks off, please open an issue.
## CPU micro-benchmarks
Apple M3 Max, macOS 14.5, best of 5 trials after one warmup run. Numbers as of 2026-04-16.
| Benchmark | cljrs-native | Clojure/JVM | Babashka | jank | vs JVM |
|---|---|---|---|---|---|
| fib(35) | 0.048s | 0.12s | 3.7s | 0.63s | 2.4x |
| loop_sum 100M | 0.081s | 1.08s | 35.2s | 1.91s | 13.3x |
| cond_chain 50M | 0.077s | 0.76s | 8.9s | 1.44s | 9.9x |
### What each tests
- fib(35). Naive recursive Fibonacci. Function-call overhead, integer arithmetic, branching.
- loop_sum. `loop`/`recur` summing 100M integers. Tight-loop codegen.
- cond_chain. 10-arm `cond` over 50M values. Conditional dispatch.
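For reference, the two loop kernels compute roughly the following. This is a Rust sketch of the benchmark logic, not the repo's actual bench sources; names and the particular 10-arm dispatch are illustrative.

```rust
// Sketch of the loop_sum kernel: sum the integers 0..n in a tight loop.
fn loop_sum(n: i64) -> i64 {
    let mut acc: i64 = 0;
    let mut i: i64 = 0;
    while i < n {
        acc += i;
        i += 1;
    }
    acc
}

// Sketch of the cond_chain kernel: a 10-arm conditional dispatch per value.
// The real arms are unspecified in this doc; a modulo dispatch stands in here.
fn cond_chain(n: i64) -> i64 {
    let mut acc: i64 = 0;
    for v in 0..n {
        acc += match v % 10 {
            0 => 1, 1 => 2, 2 => 3, 3 => 4, 4 => 5,
            5 => 6, 6 => 7, 7 => 8, 8 => 9, _ => 10,
        };
    }
    acc
}
```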
### Method
```sh
git clone <repo> && cd cljrs
cargo build --release --features mlir --bin bench
cd bench && ./run.sh
```
For fib the source is:
```clojure
;; bench/fib.clj
(defn fib [n] (if (< n 2) n (+ (fib (- n 1)) (fib (- n 2)))))

;; bench/fib_native.clj
(defn-native fib ^i64 [^i64 n]
  (if (< n 2) n (+ (fib (- n 1)) (fib (- n 2)))))
```
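The `^i64` annotations pin the native signature to machine integers. In Rust terms, the compiled function should behave like the sketch below; this is an assumption about the intended semantics, not the code cljrs emits.

```rust
// Reference semantics for the natively compiled fib:
// plain i64 in and out, no boxing, naive double recursion.
fn fib(n: i64) -> i64 {
    if n < 2 { n } else { fib(n - 1) + fib(n - 2) }
}
```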
### Biases
- JVM numbers exclude JVM startup. We measure the inner loop via `(System/nanoTime)`. Startup-inclusive is worse for the JVM (about 1.5s cold).
- One warmup iteration. Longer warmup favors the JVM slightly.
- Babashka uses GraalVM native-image. Great for startup, poor for hot loops.
- jank uses its `time` macro with millisecond granularity. Numbers under 10ms are noisy.
- No I/O in timed loops. Each kernel returns a scalar, printed after the timer stops.
- One machine, M3 Max. x86_64 results may differ.
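The best-of-5-after-one-warmup protocol is easy to replicate. A minimal Rust harness in the same spirit; the function name and shape are illustrative, not the repo's bench code:

```rust
use std::time::Instant;

// Run `f` once as a warmup, then return the kernel result and the best
// (minimum) wall-clock time over `trials` timed runs, in seconds.
fn best_of<F: FnMut() -> i64>(mut f: F, trials: usize) -> (i64, f64) {
    let warm = f(); // one warmup iteration; result kept so the call isn't elided
    let mut best = f64::INFINITY;
    let mut result = warm;
    for _ in 0..trials {
        let t = Instant::now();
        result = f();
        best = best.min(t.elapsed().as_secs_f64());
    }
    (result, best)
}
```

In a release build, `std::hint::black_box` would be the sturdier way to keep the optimizer from deleting the kernel under test.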
### Versions

| Tool | Version |
|---|---|
| cljrs | current HEAD with `--features mlir` |
| Clojure | 1.12.0 on OpenJDK 22 |
| Babashka | 1.12.196 |
| jank | latest main |
| LLVM | 22.1.3 |
## GPU benchmarks
Elementwise `dst[i] = sin(x[i]) + cos(x[i] * 2)`, f32, on an Apple M3 (integrated GPU, Metal). Steady-state median of 10 runs after warmup, including GPU-to-CPU readback. Lower is better.
| N | cljrs-gpu | cljrs-cpu | numpy | pytorch-cpu | pytorch-mps |
|---|---|---|---|---|---|
| 100k | 1.35 ms | 0.41 ms | 0.24 ms | 0.17 ms | 0.43 ms |
| 1 M | 1.39 ms | 3.35 ms | 2.66 ms | 0.75 ms | 0.93 ms |
| 10 M | 3.63 ms | 34.0 ms | 27.5 ms | 16.96 ms | 4.59 ms |
| 100 M | 32.5 ms | 350 ms | 267 ms | 256 ms | 43.8 ms |
Fastest at each size: pytorch-cpu at 100k and 1M, cljrs-gpu at 10M and 100M.
### Runtimes
- cljrs-gpu. wgpu with Metal backend. vec4 grid-stride kernel emitted from cljrs. Buffer trio reused across calls, input kept on-device between steady-state iterations.
- cljrs-cpu. Single-threaded plain Rust.
- numpy 2.4.4. macOS Accelerate.
- pytorch-cpu, pytorch-mps 2.11.0. CPU and Apple Metal Performance Shaders with synchronize-per-op.
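The cljrs-cpu baseline is described as single-threaded plain Rust; the kernel it times presumably looks like this sketch (function name assumed):

```rust
// Elementwise dst[i] = sin(src[i]) + cos(src[i] * 2) over f32,
// one thread, no explicit SIMD intrinsics.
fn sincos_cpu(src: &[f32], dst: &mut [f32]) {
    for (d, &x) in dst.iter_mut().zip(src) {
        *d = x.sin() + (x * 2.0).cos();
    }
}
```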
### Reading the table
- At 100k and 1M, multithreaded CPU (pytorch, numpy) wins. Kernel launch overhead exceeds compute.
- At 10M and 100M, cljrs-gpu is fastest. Beats pytorch-mps by 20 to 35 percent despite going through generic wgpu instead of Apple's hand-tuned MPS kernels.
- Single-thread Rust is 10 to 13x behind vectorized CPU paths at large N, as expected.
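A quick sanity check on the large-N rows: each element moves 8 bytes (one f32 read, one write), so effective bandwidth can be derived from the table. This is back-of-envelope arithmetic, not a measured figure:

```rust
// Effective bandwidth in GB/s for an elementwise f32 kernel:
// n elements * 8 bytes (read + write) over the elapsed time in ms.
fn effective_gb_per_s(n: f64, ms: f64) -> f64 {
    n * 8.0 / (ms / 1e3) / 1e9
}
```

For the 100M cljrs-gpu row, `effective_gb_per_s(100e6, 32.5)` is about 24.6 GB/s, well below the M3's unified-memory bandwidth, consistent with launch and readback overhead still being part of the measurement.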
### Kernel source
vec4 grid-stride WGSL, emitted from cljrs:
```wgsl
@group(0) @binding(0) var<storage, read> src: array<vec4<f32>>;
@group(0) @binding(1) var<storage, read_write> dst: array<vec4<f32>>;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) gid: vec3<u32>,
        @builtin(num_workgroups) nwg: vec3<u32>) {
    let stride = nwg.x * 256u;
    let n = arrayLength(&src);
    var i = gid.x;
    loop {
        if (i >= n) { break; }
        let v = src[i];
        dst[i] = sin(v) + cos(v * 2.0);
        i = i + stride;
    }
}
```
Each thread processes four f32s per iteration via a single vec4 load, which cuts loop overhead by 4x and lets the memory subsystem issue coalesced 128-bit accesses.
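With the vec4 packing, the dispatch sizes itself off n/4 vec4 elements rather than n f32s. A sketch of the host-side arithmetic, matching the `@workgroup_size(256)` above; the workgroup cap is an assumption, since grid-stride kernels typically clamp total workgroups and let the loop cover the remainder:

```rust
// Workgroup count for a grid-stride dispatch over n f32s packed as vec4s,
// with workgroup_size = 256 and an assumed cap on total workgroups.
fn workgroups(n_f32: u64, cap: u64) -> u64 {
    let n_vec4 = n_f32.div_ceil(4); // elements after vec4 packing
    let full = n_vec4.div_ceil(256); // one thread per vec4 element
    full.min(cap) // the grid-stride loop covers anything beyond the cap
}
```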
### Reproduce
```sh
cargo run --release --features gpu --bin gpu-bench
python bench/gpu/sincos_numpy.py
python bench/gpu/sincos_pytorch.py
```
### Biases
- cljrs-gpu includes GPU to CPU readback per call. PyTorch-MPS keeps tensors on GPU by default. If you chained ops, MPS would look even better. These numbers are one op with full round-trip.
- One kernel. Doesn't exercise matmul, reductions, stencils.
- macOS Accelerate on Apple Silicon. On Linux x86 numpy would use OpenBLAS or MKL. Numbers may differ.
- No CUDA on this machine. Planned for a box with an NVIDIA GPU.
- WebGPU not timed. Same kernel runs in browsers on the GPU demo page, but reliable browser timing needs instrumentation we haven't built yet.