para:arrow

import arrow from "para:arrow";

Apache Arrow’s columnar model, in-process, with no npm dep on apache-arrow. RecordBatches are typed-array views with optional validity bitmaps; tables are sequences of batches sharing a schema. The Arrow IPC streaming + file formats round-trip both directions against the canonical implementations.

recordBatch takes a map of column name → values and infers each column's type. Values can be:

  • A typed array — Int32Array → int32, BigInt64Array → int64, Float32Array → float32, Float64Array → float64, Uint8Array → bool.
  • A string[] → utf8.
  • A T[][] (array of arrays) → list<T> with the child type inferred from the flattened first non-empty row.
const batch = arrow.recordBatch({
  age: new Int32Array([25, 30, 35]),
  score: new Float64Array([0.95, 0.82, 0.71]),
  name: ["alice", "bob", "carol"],
  tags: [["a", "b"], [], ["c", "d", "e"]], // list<utf8>
});
batch.numRows; // 3
batch.column("age").get(0); // 25
batch.column("tags").get(2); // ["c", "d", "e"]

A Table concatenates batches sharing a schema. Its .column(name) returns a ConcatColumn, a virtual view across batches; pass it to any compute function for a table-wide aggregate.
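
A minimal sketch of the view semantics. arrow.table below is a placeholder name for whatever batch-concatenating constructor the module exports, and arrow.sum assumes the reductions are exposed on the module object:

const a = arrow.recordBatch({ age: new Int32Array([25, 30]) });
const b = arrow.recordBatch({ age: new Int32Array([35]) });
const tbl = arrow.table([a, b]); // hypothetical constructor name
const ages = tbl.column("age"); // ConcatColumn: virtual view over both batches, no copy
arrow.sum(ages); // 90, a table-wide aggregate (assumed export name)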

These helpers bridge row-shaped JS data and the columnar form; fromRows is the typical entry point from para:csv output:

import csv from "para:csv";
import arrow from "para:arrow";
const rows: any[] = [];
for await (const r of csv.parseCsv(Bun.file("data.csv"), { header: true })) rows.push(r);
const tbl = arrow.fromRows(rows);

The compute functions all take a Column or ConcatColumn. Numeric reductions return a single scalar; the column-producing functions (distinct, cumsum, diff, concat) return a new column or typed array. A usage sketch follows the list.

  • sum, mean — Kahan-compensated sum / mean.
  • min, max — Skip nulls. NaN propagation matches IEEE 754.
  • argMin, argMax — First-occurrence tie-break, NaN-aware.
  • count — Counts non-null entries.
  • variance(col, { ddof? }), stddev — Welford accumulator. ddof=0 (population) by default.
  • quantile(col, q), median(col) — Sort internally; honor nulls.
  • distinct(col) — Returns the unique values as a typed array (or string set for utf8).
  • cumsum(col), diff(col) — New column of running totals / first differences.
  • concat(col) — Materializes a ConcatColumn into a single typed array.
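
For instance, against the batch built earlier (assuming the compute functions are exported on the module object the same way filter and groupBy are; adjust to the actual export shape):

const ages = batch.column("age"); // Column backed by the Int32Array
arrow.sum(ages); // 90
arrow.mean(ages); // 30
arrow.quantile(batch.column("score"), 0.5); // 0.82
arrow.cumsum(ages); // running totals: 25, 55, 90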

filter returns a new RecordBatch keeping rows where predicate(row) is truthy. The predicate sees a row-shaped object keyed by column name.

const adults = arrow.filter(batch, row => row.age >= 30);

groupBy is a hash group-by. keys is a string or an array of column names; aggs is a map of output name → { column, op }. Supported ops: sum, mean, min, max, count, variance, stddev, distinct.

const result = arrow.groupBy(batch, "city", {
  rows: { column: "name", op: "count" },
  avgAge: { column: "age", op: "mean" },
  topScore: { column: "score", op: "max" },
});
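
Multiple grouping keys work the same way: pass an array of column names. (As with the example above, the city and year columns are illustrative rather than taken from the recordBatch sketch.)

const perCityYear = arrow.groupBy(batch, ["city", "year"], {
  rows: { column: "name", op: "count" },
  avgScore: { column: "score", op: "mean" },
});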

Stable sort by one or more keys. by is string | string[] | { name, descending?: boolean }[]. Returns a new batch with rows reordered.
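
For example, assuming the sort is exposed as arrow.sortBy (a hypothetical export name, mirroring filter and groupBy); the by shapes follow the signature above:

const byAge = arrow.sortBy(batch, "age"); // single key, ascending
const byScoreThenName = arrow.sortBy(batch, [
  { name: "score", descending: true },
  { name: "name" },
]);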

const bytes = arrow.toIPC(table); // Uint8Array
const restored = arrow.fromIPC(bytes); // Table

Continuation-prefixed Schema + RecordBatch messages, FlatBuffers metadata (hand-rolled builder/reader; no npm dep), 8-byte-aligned body buffers, EOS marker. DictionaryBatch decode is implemented for round-tripping apache-arrow’s default Dictionary<Utf8> for string columns.

Pass "file" as the second arg to write the ARROW1-bracketed file format:

const fileBytes = arrow.toIPC(table, "file"); // ARROW1 + messages + EOS + Footer + len + ARROW1
const restored = arrow.fromIPC(fileBytes); // auto-detects via head/tail magic

The Footer flatbuffer carries a redundant copy of the schema plus a list of Block { offset, metaDataLength, bodyLength } entries pointing at each RecordBatch / DictionaryBatch — random-access on read.

fromIPC auto-detects: if the bytes start with ARROW1\0\0 and end with ARROW1 the file path is taken (Footer’s schema and Block list drive the decode); otherwise it falls through to the streaming reader. Same callsite, both formats.

Each logical kind below lists its in-memory storage, its IPC type ID, and notes:

  • int32 — Int32Array; Int(32, signed). Reads narrow int8/int16/uint8/uint16 by widening.
  • int64 — BigInt64Array; Int(64, signed). Reads uint32 by widening (zero-extend); uint64 throws — no lossless target.
  • float32 — Float32Array; FloatingPoint(SINGLE).
  • float64 — Float64Array; FloatingPoint(DOUBLE).
  • bool — Uint8Array (one byte per value); Bool. Bit-packed on the wire.
  • utf8 — string[]; Utf8.
  • list<T> — Int32Array offsets + recursive child column; List. Depth-first FieldNode + buffer walk; lists of lists work.

Date / Time / Timestamp from upstream Arrow streams are coerced to int32 / int64 on read (unit and timezone metadata dropped). Round-tripping re-emits them as plain ints.

bench/parabun-arrow-ipc-interop/ round-trips both directions against apache-arrow@21.1.0:

  • Parabun encodes streaming + file → apache-arrow decodes.
  • apache-arrow encodes streaming + file (including default Dictionary<Utf8> strings + Date64) → Parabun decodes.

Mixed type table (Int8, Uint16, Uint32, Int32, Float64, Date64, Dictionary<Utf8>, List<Float64>) round-trips bit-for-bit through both formats.

The bytes Parabun produces are the same wire format pyarrow, arrow-rs, nanoarrow, polars, and duckdb consume on the streaming + file paths. Save with .arrow:

await Bun.write("data.arrow", arrow.toIPC(table, "file"));

Then in Python:

import pyarrow.feather as feather
df = feather.read_table("data.arrow")

fromParquet(bytes) reads and toParquet(source, opts?) writes Apache Parquet files. Hand-rolled Thrift compact-protocol codec, Snappy compressor + decompressor, dictionary + RLE + bit-pack hybrid decoders, RLE writer for definition levels — no npm dep.

// Read
const bytes = new Uint8Array(await Bun.file("rows.parquet").arrayBuffer());
const tbl = arrow.fromParquet(bytes);
// Write
const out = arrow.toParquet(tbl, { compression: "snappy" });
await Bun.write("rows.parquet", out);

toParquet options:

  • compression (default "snappy") — "uncompressed" | "snappy" | "gzip".

Feature support, read vs. write:

  • Physical types — Read: BOOLEAN, INT32, INT64, FLOAT, DOUBLE, BYTE_ARRAY (utf8); INT96 + FIXED_LEN_BYTE_ARRAY pending. Write: same set.
  • Encodings — Read: PLAIN, PLAIN_DICTIONARY (alias), RLE_DICTIONARY, RLE. Write: PLAIN for values, RLE for def levels (no dictionary yet; strings are PLAIN-encoded, less compact than pyarrow but correct).
  • Compression — Read: UNCOMPRESSED, SNAPPY, GZIP; LZ4, BROTLI, ZSTD follow when wired. Write: UNCOMPRESSED, SNAPPY, GZIP.
  • Pages — Read: V1 data pages with def-level null reconstruction, plus dictionary pages; V2 pages pending. Write: V1 data pages only.
  • Row groups — Read: multi-row-group reads. Write: single row group.
  • Schemas — Read: flat columns, required + optional; nested types need rep-level reconstruction (pending). Write: same.

Verified end-to-end against pyarrow output:

  • Read: round-trips 6-column fixtures (int32 / int64 / float32 / float64 / utf8 / bool) under uncompressed + snappy, plus a 10,000-row fixture with nulls at 1/5, 1/7, 1/13 ratios across 4 row groups under all three compression codecs.
  • Write: pyarrow reads Parabun’s output bit-for-bit (uncompressed, snappy, gzip); 100-row fixture with scattered nulls (1/7 id, 1/11 score, 1/13 name) → null counts match pyarrow’s 15 / 10 / 8 exactly.

Known gaps and follow-ups:

  • Dictionary encoding on write — strings emit as PLAIN today. pyarrow’s default RLE_DICTIONARY for low-cardinality columns is an opt-in encoding pass that lands when there’s a workload that needs the density.
  • Multi-row-group writes — large tables write as one giant row group today. Splitting at ~1M rows would match pyarrow’s defaults.
  • Struct / Map / FixedSizeList / Union / Decimal128 / FixedSizeBinary — nested + decimal types. The List<T> shape proves out the recursive FieldNode + buffer walk; the others would reuse it.
  • Dictionary delta batches (isDelta=true) — apache-arrow’s default is non-delta, so this is a long-tail follow-up.
  • uint64 — no lossless target among the supported storage kinds: int64 (BigInt64Array) is signed, and float64 loses integer precision above 2^53.
  • Lossless narrow-type round-trip — Parabun reads int8 by widening to int32, then writes int32. Lossless on values, lossy on the type tag. A typed wrapper that remembers wire types can land if there’s a use case.