Benchmarks

Reproducible numbers. Source and scripts are in the repo. Methodology, hardware, and known biases are listed below. If something looks off, please open an issue.

CPU micro-benchmarks

Apple M3 Max, macOS 14.5, best of 5 trials after one warmup run. Numbers as of 2026-04-16.

| Benchmark       | cljrs-native | Clojure/JVM | Babashka | jank   | vs JVM (speedup) |
|-----------------|--------------|-------------|----------|--------|------------------|
| fib(35)         | 0.048 s      | 0.12 s      | 3.7 s    | 0.63 s | 2.4x             |
| loop_sum 100M   | 0.081 s      | 1.08 s      | 35.2 s   | 1.91 s | 13.3x            |
| cond_chain 50M  | 0.077 s      | 0.76 s      | 8.9 s    | 1.44 s | 9.9x             |

What each tests
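The workloads are named after what they stress. The real sources live in bench/ in the repo; the sketches below are hypothetical reconstructions of the likely shapes (only fib's actual source is reproduced under Method), assuming loop_sum is a tight loop/recur accumulation and cond_chain is repeated multi-way dispatch:

```clojure
;; Hypothetical sketches -- see bench/ in the repo for the real sources.

;; loop_sum 100M: tight loop/recur accumulation, stressing primitive
;; arithmetic and loop-carried locals.
(defn loop-sum []
  (loop [i 0 acc 0]
    (if (< i 100000000)
      (recur (inc i) (+ acc i))
      acc)))

;; cond_chain 50M: repeated multi-way dispatch, stressing branch-heavy code.
(defn cond-chain []
  (loop [i 0 acc 0]
    (if (< i 50000000)
      (recur (inc i)
             (cond (= 0 (rem i 4)) (+ acc 1)
                   (= 1 (rem i 4)) (+ acc 2)
                   (= 2 (rem i 4)) (+ acc 3)
                   :else           (+ acc 4)))
      acc)))
```

fib(35) itself stresses non-tail recursion and function-call overhead.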

Method

git clone <repo> && cd cljrs
cargo build --release --features mlir --bin bench
cd bench && ./run.sh

For fib, the JVM and native sources are:

;; bench/fib.clj
(defn fib [n] (if (< n 2) n (+ (fib (- n 1)) (fib (- n 2)))))

;; bench/fib_native.clj
(defn-native fib ^i64 [^i64 n]
  (if (< n 2) n (+ (fib (- n 1)) (fib (- n 2)))))
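To spot-check the JVM baseline outside the harness, a REPL timing along these lines works (illustrative only; bench/run.sh performs the warmup run and best-of-5 selection):

```clojure
;; Illustrative REPL timing of the JVM baseline; the real harness in
;; bench/run.sh does the warmup and takes the best of 5 trials.
(defn fib [n] (if (< n 2) n (+ (fib (- n 1)) (fib (- n 2)))))

(fib 35)          ; warmup call so the JIT compiles the hot path
(time (fib 35))   ; prints "Elapsed time: ... msecs"
```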

Biases

Versions

cljrs: current HEAD with --features mlir
Clojure: 1.12.0 on OpenJDK 22
Babashka: 1.12.196
jank: latest main
LLVM: 22.1.3

GPU benchmarks

Elementwise dst[i] = sin(x[i]) + cos(x[i] * 2), f32, Apple M3 (integrated GPU, Metal). Steady-state median of 10 after warmup, including GPU to CPU readback. Lower is better.

| N     | cljrs-gpu | cljrs-cpu | numpy   | pytorch-cpu | pytorch-mps |
|-------|-----------|-----------|---------|-------------|-------------|
| 100 k | 1.35 ms   | 0.41 ms   | 0.24 ms | 0.17 ms     | 0.43 ms     |
| 1 M   | 1.39 ms   | 3.35 ms   | 2.66 ms | 0.75 ms     | 0.93 ms    |
| 10 M  | 3.63 ms   | 34.0 ms   | 27.5 ms | 16.96 ms    | 4.59 ms     |
| 100 M | 32.5 ms   | 350 ms    | 267 ms  | 256 ms      | 43.8 ms     |

Fastest at each size: pytorch-cpu at 100 k and 1 M; cljrs-gpu at 10 M and 100 M.
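For sanity-checking the kernel output against a trusted scalar implementation, a reference along these lines is handy (illustrative sketch, not part of the harness or the benchmarked code):

```clojure
;; Scalar reference for dst[i] = sin(x[i]) + cos(x[i] * 2), useful for
;; verifying GPU output; illustrative only, not the benchmarked code.
(defn sincos-ref [xs]
  (mapv (fn [^double x] (+ (Math/sin x) (Math/cos (* 2.0 x)))) xs))

;; (sincos-ref [0.0]) => [1.0]   ; sin(0) + cos(0)
```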

Runtimes

Reading the table

Kernel source

The vec4 grid-stride WGSL kernel, as emitted from cljrs:

@group(0) @binding(0) var<storage, read>       src: array<vec4<f32>>;
@group(0) @binding(1) var<storage, read_write> dst: array<vec4<f32>>;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) gid: vec3<u32>,
        @builtin(num_workgroups) nwg: vec3<u32>) {
    let stride = nwg.x * 256u;
    let n = arrayLength(&src);
    var i = gid.x;
    loop {
        if (i >= n) { break; }
        let v = src[i];
        dst[i] = sin(v) + cos(v * 2.0);
        i = i + stride;
    }
}

Each thread processes four f32s per loop iteration via a single vec4 load. This cuts loop overhead by 4x and turns each access into a coalesced 128-bit transaction.
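Because the buffers are array<vec4<f32>>, the host sizes the dispatch in vec4 units rather than f32 elements. A sketch of the arithmetic, using a hypothetical helper (not the cljrs API):

```clojure
;; Hypothetical dispatch-sizing math (not the cljrs API): n f32 elements
;; pack into ceil(n/4) vec4s, covered by workgroups of 256 threads.
;; With the grid-stride loop, the real launch may cap the workgroup
;; count lower and let each thread iterate over several vec4s.
(defn dispatch-size [n]
  (let [vec4s (quot (+ n 3) 4)]       ; ceil(n / 4) vec4 elements
    (quot (+ vec4s 255) 256)))        ; ceil(vec4s / 256) workgroups

;; (dispatch-size 100000000) => 97657 workgroups for the 100 M case
```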

Reproduce

cargo run --release --features gpu --bin gpu-bench
python bench/gpu/sincos_numpy.py
python bench/gpu/sincos_pytorch.py

Biases