Zippy Data System
A human-readable, schema-flexible document store built for modern ML and data engineering workflows. Store JSON documents with the simplicity of files and the speed of databases.
Why ZDS?
Modern ML and data workflows need flexibility that traditional formats struggle to provide. Parquet and Arrow enforce rigid schemas. SQLite requires SQL. Plain JSON has no indexing. ZDS bridges this gap.
Human-Readable
Debug with cat. Edit with vim. Version control with git. Your data is always accessible with standard tools.
Schema-Flexible
Each document defines its own shape. No migrations needed. Perfect for iterative development and heterogeneous data.
High Performance
Rust core with mmap, simd-json, and FxHashMap. O(1) random access. Writes at 4.6M records/second.
Multi-Language
Native bindings for Python, Node.js, and Rust. Query with DuckDB SQL. One format, every platform.
Quick Start
Python
pip install zippy-data
from zippy import ZDSStore, ZDSRoot, ZDataset
# Legacy helper: single-collection store (still supported)
store = ZDSStore.open("./my_dataset", collection="train")
# Add documents
store.put("doc_001", {"text": "Hello world", "label": 1})
store.put("doc_002", {"text": "Goodbye", "label": 0, "extra": [1, 2, 3]})
# Random access
print(store["doc_001"]) # {"text": "Hello world", "label": 1}
# Preferred: open a root once, then grab multiple collections
root = ZDSRoot.open("./my_dataset", native=True)
train = root.collection("train")
test = root.collection("test")
# Iterate like HuggingFace
dataset = ZDataset(train)
for doc in dataset.shuffle(seed=42):
print(doc["text"])
print(root.list_collections()) # ['test', 'train']
Node.js
npm install @zippydata/core
const { ZdsStore } = require('@zippydata/core');
const store = ZdsStore.open('./my_dataset', 'train');
store.put('doc_001', { text: 'Hello world', label: 1 });
console.log(store.get('doc_001'));
for (const doc of store.scan()) {
console.log(doc.text);
}
CLI
# Initialize a store
zippy init ./my_dataset -c train
# Add documents
zippy put ./my_dataset -c train doc_001 --data '{"text": "Hello"}'
# Query
zippy scan ./my_dataset -c train --fields text,label
How It Compares
| Feature | ZDS | Parquet | SQLite | Plain JSON |
|---|---|---|---|---|
| Human-readable | β | β | β | β |
| Schema-flexible | β | β | β οΈ | β |
| Fast random access | β | β | β | β |
| Indexed lookups | β | β | β | β |
| Git-friendly | β | β | β | β |
| No special tools | β | β | β οΈ | β |
| ML dataset API | β | β οΈ | β | β |
The Philosophy
βThe best format is one you can understand in 5 minutes and debug with
cat.β
ZDS follows proven patterns. A ZIP container wrapping human-readable JSONL documents, enhanced with binary indexes for performance. Like DOCX wraps XML, or EPUB wraps HTML.
my_dataset/
βββ collections/
βββ train/
βββ meta/
β βββ data.jsonl # Your data (one JSON per line)
βββ index.bin # Optional: O(1) lookups
This isnβt meant to be novel and itβs intentionally unoriginal. Novelty in file formats creates lock-in. We chose boring technologies that will outlast any single library.
Use Cases
Evaluation Pipelines
Run experiment β Generate 10,000 results
βββ Each result has: metrics, predictions, metadata
βββ Some results have additional debug info
βββ Need to inspect failures manually
βββ Want to version control changes
Synthetic Data Generation
Generate training examples with LLM
βββ Each example has variable structure
βββ Tool calls, function schemas, nested conversations
βββ Need to filter, edit, regenerate subsets
βββ Feed directly into training pipeline
Dataset Distribution
# Pack for sharing
zippy pack ./my_dataset dataset.zds
# Recipients can inspect without any library
unzip dataset.zds -d extracted/
cat extracted/collections/train/meta/data.jsonl | head -5 | jq .
Performance
Benchmarked on Apple M3 Max with 100,000 records:
| Operation | ZDS | SQLite | Pandas CSV | HF Datasets |
|---|---|---|---|---|
| Write | 4.66M rec/s | 237k | 205k | 633k |
| Read All (warm) | 510k rec/s | 263k | 8.18M* | 40k |
| Random Access | 308k rec/s | 88k | 227k | 30k |
*Pandas warm = in-memory DataFrame