Documentation
Everything you need to build with ZDS—from quick starts to deep dives.
Choose Your Path
Quick Installation
| Language | Command |
|---|---|
| Python | pip install zippy-data |
| Node.js | npm install @zippydata/core |
| Rust | cargo add zippy_data |
| CLI | Download from releases |
Core Concepts
Stores and Collections
A store is a directory (or ZIP archive) containing one or more collections. Each collection holds documents.
from zippy import ZDSStore, ZDataset
# Single collection (classic helper)
store = ZDSStore.open("./my_dataset", collection="train")
store.put("doc_001", {"text": "Hello world", "label": 1})
store.put("doc_002", {"text": "Goodbye", "label": 0, "extra": [1, 2, 3]})
# Multi-collection: omit the collection argument for a root-capable handle
store = ZDSStore.open("./my_dataset", native=True)
train = store.collection("train")
test = store.collection("test")
# Iterate like HuggingFace
dataset = ZDataset(train)
for doc in dataset.shuffle(seed=42):
    print(doc["text"])
print(store.list_collections()) # ['test', 'train']
# Advanced: inspect lock/mode state via the exposed root
native_root = store.root # NativeRoot / ZDSRoot
# ⚠️ Closing the root tears down every reader/writer for this path.
# Do this only during shutdown/cleanup.
native_root.close()
💡 ZDSRoot now lives under store.root. Only reach for it when you need explicit read/write modes, manual locking, or to share the memoized root with another runtime. ⚠️ Closing the root invalidates every handle into that store; call it once you are done writing and reading, never mid-workload.
Documents
Documents are JSON objects with a unique _id:
{"_id": "doc_001", "text": "Hello world", "label": 1}
{"_id": "doc_002", "text": "Goodbye", "nested": {"deep": "value"}}
Schema is per-document—each document can have different fields.
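Reading a document back returns the same JSON structure. A minimal round-trip sketch, reusing the store and documents written in the example above:

doc = store.get("doc_002")
print(doc["text"])    # "Goodbye"
print(doc["extra"])   # [1, 2, 3]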
Storage Modes
| Mode | Files | Best For |
|---|---|---|
| JSONL | meta/data.jsonl | Performance, streaming |
| File-per-doc | docs/*.json | Git diffs, manual editing |
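As a rough sketch, the two layouts look like this on disk (only the paths named in the table and in the Indexes section are documented; the rest of the tree is an assumption):

JSONL mode:
my_dataset/
├── meta/
│   └── data.jsonl      # one document per line
└── index.bin           # optional, see Indexes below

File-per-doc mode:
my_dataset/
├── docs/
│   ├── doc_001.json    # one file per document
│   └── doc_002.json
└── index.bin           # optional, see Indexes below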
Indexes
ZDS uses a binary index (index.bin) for O(1) lookups by document ID. The index is optional—without it, operations fall back to sequential scan.
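You normally never touch the index directly, but the idea behind it is easy to picture. A toy Python sketch (not the real index.bin format, which is binary and maintained by ZDS): record each document's byte offset once, then seek straight to it instead of re-reading every line.

import json

def build_offset_index(jsonl_path):
    """Scan the JSONL file once, recording the byte offset of each _id."""
    index = {}
    with open(jsonl_path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            index[json.loads(line)["_id"]] = offset
    return index

def lookup(jsonl_path, index, doc_id):
    """O(1) lookup: seek to the recorded offset instead of scanning."""
    with open(jsonl_path, "rb") as f:
        f.seek(index[doc_id])
        return json.loads(f.readline())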
API Overview
| Operation | Python | Node.js | Rust | CLI |
|---|---|---|---|---|
| Open/create | ZDSStore.open() | ZdsStore.open() | FastStore::open() | zippy init |
| Put | store.put(id, doc) | store.put(id, doc) | store.put(id, doc) | zippy put |
| Get | store.get(id) | store.get(id) | store.get(id) | zippy get |
| Delete | store.delete(id) | store.delete(id) | store.delete(id) | zippy delete |
| Scan | store.scan() | store.scan() | store.scan_all() | zippy scan |
| Count | len(store) | store.count | store.len() | zippy stats |
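Continuing with the single-collection store from the earlier example, the remaining Python operations look roughly like this; whether scan() yields plain documents or (id, document) pairs is not specified above, so treat the loop body as a sketch.

store.delete("doc_002")       # remove a document by id

for doc in store.scan():      # iterate the collection (yield shape assumed)
    print(doc)

print(len(store))             # number of documents in the collection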
HuggingFace Compatibility
ZDS provides a ZDataset class that mirrors the HuggingFace Dataset API:
from zippy import ZDataset
dataset = ZDataset.from_store("./data", collection="train")
# HuggingFace-style operations
shuffled = dataset.shuffle(seed=42)
filtered = dataset.filter(lambda x: x["label"] == 1)
batches = dataset.batch(32)
# Convert to/from HuggingFace
from zippy import from_hf, to_hf
zds = from_hf(hf_dataset, "./output")
hf = to_hf(zds)
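A short usage sketch tying these together; it assumes filter() returns another ZDataset and that each batch is a plain list of documents, neither of which is spelled out above.

filtered = dataset.filter(lambda x: x["label"] == 1)
for batch in filtered.batch(32):
    texts = [doc["text"] for doc in batch]   # assumed: a batch is a list of docs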
DuckDB Integration
Query ZDS collections with SQL:
from zippy import query_zds
results = query_zds(
    "./data",
    "SELECT label, COUNT(*) FROM train GROUP BY label"
)
print(results)
Need Help?
- GitHub Issues — Bug reports and feature requests
- Examples — Working code samples for all languages
- Paper — Design rationale and benchmarks