para:arrow

import arrow from "para:arrow";

Apache Arrow’s columnar model, in-process, with no npm dep on apache-arrow. RecordBatches are typed-array views with optional validity bitmaps; tables are sequences of batches sharing a schema. The Arrow IPC streaming + file formats round-trip both directions against the canonical implementations.

recordBatch takes a map of column name → values and infers each column's type. Values can be:

  • A typed array — Int32Array → int32, BigInt64Array → int64, Float32Array → float32, Float64Array → float64, Uint8Array → bool.
  • A string[] → utf8.
  • A T[][] (array of arrays) → list<T> with the child type inferred from the flattened first non-empty row.
const batch = arrow.recordBatch({
  age: new Int32Array([25, 30, 35]),
  score: new Float64Array([0.95, 0.82, 0.71]),
  name: ["alice", "bob", "carol"],
  tags: [["a", "b"], [], ["c", "d", "e"]], // list<utf8>
});
batch.numRows; // 3
batch.column("age").get(0); // 25
batch.column("tags").get(2); // ["c", "d", "e"]

A Table concatenates batches sharing a schema. Its .column(name) returns a ConcatColumn, a virtual view across batches; pass it to any compute function for a table-wide aggregate.
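
A minimal sketch of the view semantics. arrow.table below is a placeholder name for whatever batch-concatenating constructor the module exports, and arrow.sum assumes the reductions are exposed on the module object:

const a = arrow.recordBatch({ age: new Int32Array([25, 30]) });
const b = arrow.recordBatch({ age: new Int32Array([35]) });
const tbl = arrow.table([a, b]); // hypothetical constructor name
const ages = tbl.column("age"); // ConcatColumn: virtual view over both batches, no copy
arrow.sum(ages); // 90, a table-wide aggregate (assumed export name)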

These helpers bridge row-shaped JS data and the columnar form; fromRows is the typical entry point from para:csv output:

import csv from "para:csv";
import arrow from "para:arrow";
const rows: any[] = [];
for await (const r of csv.parseCsv(Bun.file("data.csv"), { header: true })) rows.push(r);
const tbl = arrow.fromRows(rows);

The compute functions all take a Column or ConcatColumn. Numeric reductions return a single scalar; the column-producing functions (distinct, cumsum, diff, concat) return a new column or typed array. A usage sketch follows the list.

  • sum, mean — Kahan-compensated sum / mean.
  • min, max — Skip nulls. NaN propagation matches IEEE 754.
  • argMin, argMax — First-occurrence tie-break, NaN-aware.
  • count — Counts non-null entries.
  • variance(col, { ddof? }), stddev — Welford accumulator. ddof=0 (population) by default.
  • quantile(col, q), median(col) — Sort internally; honor nulls.
  • distinct(col) — Returns the unique values as a typed array (or string set for utf8).
  • cumsum(col), diff(col) — New column of running totals / first differences.
  • concat(col) — Materializes a ConcatColumn into a single typed array.
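
For instance, against the batch built earlier (assuming the compute functions are exported on the module object the same way filter and groupBy are; adjust to the actual export shape):

const ages = batch.column("age"); // Column backed by the Int32Array
arrow.sum(ages); // 90
arrow.mean(ages); // 30
arrow.quantile(batch.column("score"), 0.5); // 0.82
arrow.cumsum(ages); // running totals: 25, 55, 90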

filter returns a new RecordBatch keeping rows where predicate(row) is truthy. The predicate sees a row-shaped object keyed by column name.

const adults = arrow.filter(batch, row => row.age >= 30);

groupBy is a hash group-by. keys is a string or an array of column names; aggs is a map of output name → { column, op }. Supported ops: sum, mean, min, max, count, variance, stddev, distinct.

const result = arrow.groupBy(batch, "city", {
  rows: { column: "name", op: "count" },
  avgAge: { column: "age", op: "mean" },
  topScore: { column: "score", op: "max" },
});
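
Multiple grouping keys work the same way: pass an array of column names. (As with the example above, the city and year columns are illustrative rather than taken from the recordBatch sketch.)

const perCityYear = arrow.groupBy(batch, ["city", "year"], {
  rows: { column: "name", op: "count" },
  avgScore: { column: "score", op: "mean" },
});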

Stable sort by one or more keys. by is string | string[] | { name, descending?: boolean }[]. Returns a new batch with rows reordered.
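
For example, assuming the sort is exposed as arrow.sortBy (a hypothetical export name, mirroring filter and groupBy); the by shapes follow the signature above:

const byAge = arrow.sortBy(batch, "age"); // single key, ascending
const byScoreThenName = arrow.sortBy(batch, [
  { name: "score", descending: true },
  { name: "name" },
]);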

const bytes = arrow.toIPC(table); // Uint8Array
const restored = arrow.fromIPC(bytes); // Table

Continuation-prefixed Schema + RecordBatch messages, FlatBuffers metadata (hand-rolled builder/reader; no npm dep), 8-byte-aligned body buffers, EOS marker. DictionaryBatch decode is implemented for round-tripping apache-arrow’s default Dictionary<Utf8> for string columns.

Pass "file" as the second arg to write the ARROW1-bracketed file format:

const fileBytes = arrow.toIPC(table, "file"); // ARROW1 + messages + EOS + Footer + len + ARROW1
const restored = arrow.fromIPC(fileBytes); // auto-detects via head/tail magic

The Footer flatbuffer carries a redundant copy of the schema plus a list of Block { offset, metaDataLength, bodyLength } entries pointing at each RecordBatch / DictionaryBatch — random-access on read.

fromIPC auto-detects: if the bytes start with ARROW1\0\0 and end with ARROW1 the file path is taken (Footer’s schema and Block list drive the decode); otherwise it falls through to the streaming reader. Same callsite, both formats.

Each logical kind below lists its in-memory storage, its IPC type ID, and notes:

  • int32 — Int32Array; Int(32, signed). Reads narrow int8/int16/uint8/uint16 by widening.
  • int64 — BigInt64Array; Int(64, signed). Reads uint32 by widening (zero-extend); uint64 throws — no lossless target.
  • float32 — Float32Array; FloatingPoint(SINGLE).
  • float64 — Float64Array; FloatingPoint(DOUBLE).
  • bool — Uint8Array (one byte per value); Bool. Bit-packed on the wire.
  • utf8 — string[]; Utf8.
  • list<T> — Int32Array offsets + recursive child column; List. Depth-first FieldNode + buffer walk; lists of lists work.

Date / Time / Timestamp from upstream Arrow streams are coerced to int32 / int64 on read (unit and timezone metadata dropped). Round-tripping re-emits them as plain ints.

bench/parabun-arrow-ipc-interop/ round-trips both directions against apache-arrow@21.1.0:

  • Parabun encodes streaming + file → apache-arrow decodes.
  • apache-arrow encodes streaming + file (including default Dictionary<Utf8> strings + Date64) → Parabun decodes.

Mixed type table (Int8, Uint16, Uint32, Int32, Float64, Date64, Dictionary<Utf8>, List<Float64>) round-trips bit-for-bit through both formats.

The bytes Parabun produces are the same wire format pyarrow, arrow-rs, nanoarrow, polars, and duckdb consume on the streaming + file paths. Save with .arrow:

await Bun.write("data.arrow", arrow.toIPC(table, "file"));

Then in Python:

import pyarrow.feather as feather
df = feather.read_table("data.arrow")

fromParquet(bytes) reads and toParquet(source, opts?) writes Apache Parquet files. Hand-rolled Thrift compact-protocol codec, Snappy compressor + decompressor, dictionary + RLE + bit-pack hybrid decoders, RLE writer for definition levels — no npm dep.

// Read
const bytes = new Uint8Array(await Bun.file("rows.parquet").arrayBuffer());
const tbl = arrow.fromParquet(bytes);
// Write
const out = arrow.toParquet(tbl, { compression: "snappy" });
await Bun.write("rows.parquet", out);

toParquet options:

  • compression (default "snappy") — "uncompressed" | "snappy" | "gzip".

Feature support, read vs. write:

  • Physical types — Read: BOOLEAN, INT32, INT64, FLOAT, DOUBLE, BYTE_ARRAY (utf8); INT96 + FIXED_LEN_BYTE_ARRAY pending. Write: same set.
  • Encodings — Read: PLAIN, PLAIN_DICTIONARY (alias), RLE_DICTIONARY, RLE. Write: PLAIN for values, RLE for def levels (no dictionary yet; strings are PLAIN-encoded, less compact than pyarrow but correct).
  • Compression — Read: UNCOMPRESSED, SNAPPY, GZIP; LZ4, BROTLI, ZSTD follow when wired. Write: UNCOMPRESSED, SNAPPY, GZIP.
  • Pages — Read: V1 data pages with def-level null reconstruction, plus dictionary pages; V2 pages pending. Write: V1 data pages only.
  • Row groups — Read: multi-row-group reads. Write: single row group.
  • Schemas — Read: flat columns, required + optional; nested types need rep-level reconstruction (pending). Write: same.

Verified end-to-end against pyarrow output:

  • Read: round-trips 6-column fixtures (int32 / int64 / float32 / float64 / utf8 / bool) under uncompressed + snappy, plus a 10,000-row fixture with nulls at 1/5, 1/7, 1/13 ratios across 4 row groups under all three compression codecs.
  • Write: pyarrow reads Parabun’s output bit-for-bit (uncompressed, snappy, gzip); 100-row fixture with scattered nulls (1/7 id, 1/11 score, 1/13 name) → null counts match pyarrow’s 15 / 10 / 8 exactly.

Known gaps and follow-ups:

  • Dictionary encoding on write — strings emit as PLAIN today. pyarrow’s default RLE_DICTIONARY for low-cardinality columns is an opt-in encoding pass that lands when there’s a workload that needs the density.
  • Multi-row-group writes — large tables write as one giant row group today. Splitting at ~1M rows would match pyarrow’s defaults.
  • Struct / Map / FixedSizeList / Union / Decimal128 / FixedSizeBinary — nested + decimal types. The List<T> shape proves out the recursive FieldNode + buffer walk; the others would reuse it.
  • Dictionary delta batches (isDelta=true) — apache-arrow’s default is non-delta, so this is a long-tail follow-up.
  • uint64 — no lossless target among the supported storage kinds: int64 (BigInt64Array) is signed, and float64 loses integer precision above 2^53.
  • Lossless narrow-type round-trip — Parabun reads int8 by widening to int32, then writes int32. Lossless on values, lossy on the type tag. A typed wrapper that remembers wire types can land if there’s a use case.