
# para:simd

```js
import simd from "para:simd";
```

para:simd is the CPU-side numerical primitive layer. WebAssembly v128 vectorizes Float32Array (4-lane f32) and Float64Array (2-lane f64) ops; for inputs above ~4 MiB, the wasm module reads straight from the original buffer rather than copying into wasm linear memory.

The same operations exist on para:gpu with device-dispatch fallback — para:simd is the floor that always works.

| Function | Description |
| --- | --- |
| `mulScalar(a, k)` | `a[i] * k`. Returns a fresh typed array. |
| `addScalar(a, k)` | `a[i] + k`. |
| `add(a, b)` | Element-wise sum. Same shape required. |
| `mul(a, b)` | Element-wise product. |
```js
const y = simd.mulScalar(new Float32Array([1, 2, 3, 4]), 3); // [3, 6, 9, 12]

const a = new Float32Array([1, 2]);
const b = new Float32Array([3, 4]);
const z = simd.add(a, b); // [4, 6]
const w = simd.mul(a, b); // [3, 8]
```
| Function | Description |
| --- | --- |
| `sum(a)` | Σ `a[i]`. Kahan-compensated. |
| `dot(a, b)` | Σ `a[i] * b[i]`. |
| `topK(a, k)` | Returns `{ indices: Int32Array, values: Float32Array }` of the top `k` by value. |
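For reference, Kahan compensation keeps a running error term so that small addends are not rounded away against a large partial sum. A plain-JS sketch of the algorithm (illustrative only; the shipped `sum` runs in wasm, and the name `kahanSum` is ours):

```js
// Kahan-compensated summation: `c` accumulates the low-order bits
// that a naive `s += x` would round away.
function kahanSum(a) {
  let s = 0;
  let c = 0; // running compensation
  for (let i = 0; i < a.length; i++) {
    const y = a[i] - c; // subtract the error carried from last round
    const t = s + y;    // big + small: low bits of y may be dropped...
    c = (t - s) - y;    // ...recover them algebraically
    s = t;
  }
  return s;
}
```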

matrix[nRows, nCols] row-major × vector[nCols] → result [nRows]. Used as the CPU fallback inside para:gpu’s matVec.
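A plain-JS reference of the row-major computation (illustrative; the shipped path runs in wasm, and the name `matVecRef` and its argument order are assumptions):

```js
// Row-major mat-vec: out[r] = Σ_c matrix[r * nCols + c] * vector[c].
function matVecRef(matrix, vector, nRows, nCols) {
  const out = new Float32Array(nRows);
  for (let r = 0; r < nRows; r++) {
    const base = r * nCols; // start of row r in the flat buffer
    let acc = 0;
    for (let c = 0; c < nCols; c++) {
      acc += matrix[base + c] * vector[c];
    }
    out[r] = acc;
  }
  return out;
}
```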

Element-wise function application; fn is (x, i) => number. Significantly faster than Array.prototype.map for typed arrays of non-trivial size, since the wasm side compiles a per-call closure. This remains a CPU ceiling unless para:gpu promotes the call to a runtime-compiled GPU kernel.

```js
const input = new Float32Array([0, 1, 2, 3]);
const r = simd.simdMap(x => Math.sqrt(x * x + 1), input);
```
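In plain JS, the semantics are equivalent to the following sketch (illustrative only; it shows the `(x, i)` callback signature and that a fresh typed array of the same kind is returned — the name `mapRef` is ours):

```js
// Reference semantics of simdMap: apply fn(value, index) per element,
// returning a new typed array of the same constructor as the input.
function mapRef(fn, a) {
  const out = new a.constructor(a.length);
  for (let i = 0; i < a.length; i++) {
    out[i] = fn(a[i], i);
  }
  return out;
}
```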

Returns a typed array backed by wasm linear memory. Operations on these inputs skip the host-to-wasm (HtoW) copy entirely.

```js
const buf = simd.alloc(1_000_000, "f32");
// fill buf in place...
const total = simd.sum(buf); // zero-copy
```

Returns true when `arr.buffer` is the wasm linear-memory `ArrayBuffer`.

isWasmAvailable() / wasmWinsForSize(op, n, elemBytes)


isWasmAvailable is false on hosts without v128 (typical x86-32, some embedded). wasmWinsForSize returns the calibrated CPU-vs-wasm crossover — for very small arrays the wasm dispatch overhead loses to a tight scalar JS loop, so the higher-level callers (and para:gpu) gate on this.
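The gating pattern a higher-level caller uses looks roughly like this sketch (`isWasmAvailable` / `wasmWinsForSize` are the documented calls; the helper name `sumDispatch`, the injected `simdImpl` parameter, and the scalar fallback are illustrative):

```js
// Dispatch between a tight scalar JS loop and the wasm kernel based on
// the calibrated crossover. `simdImpl` is passed in so the helper also
// runs on hosts without v128 support.
function sumDispatch(a, simdImpl) {
  const wasmWins =
    simdImpl.isWasmAvailable() &&
    simdImpl.wasmWinsForSize("sum", a.length, a.BYTES_PER_ELEMENT);
  if (wasmWins) return simdImpl.sum(a);

  // Small-n / no-v128 path: wasm dispatch overhead would dominate.
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i];
  return s;
}
```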

hasUnifiedMemoryGPU() and hasDiscreteGPU() report, respectively, whether the host has a Metal-style unified-memory accelerator or a separate-memory, CUDA-style discrete one. Useful for choosing a residency strategy upstream of para:gpu.

CPU release build, x86_64 (AVX2 supported), N=100k:

| op (f32) | `.map` / `.reduce` | tight scalar loop | para:simd |
| --- | --- | --- | --- |
| `mulScalar(a, 3)` | 808 µs | 60 µs | 30 µs |
| `add(a, b)` | 884 µs | 73 µs | 40 µs |
| `sum(a)` | 574 µs | 43 µs | 17 µs |
| `dot(a, b)` | 716 µs | 51 µs | 24 µs |

At this size the wasm path beats a tight scalar loop by roughly 2–2.5× and the JS array methods by roughly 20–35×. Above ~4 MiB the zero-copy path adds another ~10–15% by skipping the HtoW transfer.