Getting Started

Get up and running with ZDS in under 5 minutes. This guide walks you through installation, creating your first dataset, and understanding what makes ZDS different.

Table of contents
  1. What is ZDS?
  2. Installation
  3. Your First Dataset
    1. Python
    2. Node.js
    3. CLI
  4. What Just Happened?
  5. Working with ML Datasets
    1. The ZDataset API
    2. Converting from HuggingFace
  6. DuckDB Integration
  7. Packaging for Distribution
  8. Next Steps

What is ZDS?

ZDS (Zippy Data System) is a document store designed for ML and data engineering workflows. It stores JSON documents in a human-readable format while providing database-like performance.

Key benefits:

  • Human-readable: Your data is stored as JSONL files you can inspect with cat, edit with vim, and version with git
  • Schema-flexible: Each document can have different fields—no migrations needed
  • Fast: O(1) random access, 4.6M writes/second, binary indexes
  • Zero lock-in: Standard ZIP + JSONL format works without any special tools
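The O(1) random-access claim comes down to an index that maps document IDs to byte offsets in the JSONL file, so a lookup is one seek plus one read. A minimal sketch of that idea (illustrative only; ZDS's actual index.bin layout is not described here):

```python
import io
import json

# Two JSONL lines standing in for a data.jsonl file.
lines = [
    '{"_id":"greeting_001","text":"Hello, world!"}\n',
    '{"_id":"greeting_002","text":"Bonjour le monde!"}\n',
]
data = "".join(lines).encode()

# Build id -> (byte offset, length) while "writing" the file.
index, offset = {}, 0
for line in lines:
    doc = json.loads(line)
    length = len(line.encode())
    index[doc["_id"]] = (offset, length)
    offset += length

# A lookup is one seek + one read - no scanning through earlier documents.
f = io.BytesIO(data)
off, length = index["greeting_002"]
f.seek(off)
doc = json.loads(f.read(length))
print(doc["text"])  # Bonjour le monde!
```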

Installation

Choose your preferred language:

Python

pip install zippy-data

For all integrations (Pandas, DuckDB, HuggingFace):

pip install zippy-data[all]

Node.js

npm install @zippydata/core

Rust

# Cargo.toml
[dependencies]
zippy_data = "0.1"

CLI

# macOS
brew install zippydata/tap/zippy

# Or download from releases
curl -L https://github.com/zippydata/zippy/releases/latest/download/zippy-$(uname -m)-apple-darwin.tar.gz | tar xz
sudo mv zippy /usr/local/bin/

Your First Dataset

Python

from zippy import ZDSStore

# Single collection (classic helper)
store = ZDSStore.open("./my_first_dataset", collection="examples")

# Add some documents
store.put("greeting_001", {
    "text": "Hello, world!",
    "language": "en",
    "sentiment": "positive"
})

store.put("greeting_002", {
    "text": "Bonjour le monde!",
    "language": "fr",
    "sentiment": "positive"
})

store.put("greeting_003", {
    "text": "Hola mundo!",
    "language": "es",
    "sentiment": "positive",
    "extra_field": ["this", "is", "flexible"]
})

print(f"Created {len(store)} documents")

# Retrieve by ID
doc = store.get("greeting_001")
print(doc["text"])  # "Hello, world!"

# Iterate all documents
for doc in store.scan():
    print(f"{doc['language']}: {doc['text']}")

# Multi-collection: omit the collection argument
multi = ZDSStore.open("./my_first_dataset", native=True)
examples = multi.collection("examples")
holdout = multi.collection("holdout")

examples.put("greeting_004", {"text": "Ciao mondo!", "language": "it"})
holdout.put("greeting_eval", {"text": "Hallo Welt!", "language": "de"})

print(multi.list_collections())  # ['examples', 'holdout']

# Advanced: access the underlying root (for locking/mode visibility)
native_root = multi.root
# ⚠️ Closing the root invalidates every handle into this path.
# Only do this during shutdown/cleanup.
native_root.close()

Node.js

const { ZdsStore } = require('@zippydata/core');

// Single collection (classic helper)
const store = ZdsStore.open('./my_first_dataset', { collection: 'examples' });

store.put('greeting_001', {
    text: 'Hello, world!',
    language: 'en',
    sentiment: 'positive'
});

store.put('greeting_002', {
    text: 'Bonjour le monde!',
    language: 'fr',
    sentiment: 'positive'
});

console.log(`Created ${store.count} documents`);

// Retrieve by ID
const doc = store.get('greeting_001');
console.log(doc.text);  // "Hello, world!"

// Iterate
for (const doc of store.scan()) {
    console.log(`${doc.language}: ${doc.text}`);
}

store.close();

// Multi-collection
const multi = ZdsStore.open('./my_first_dataset');
const examples = multi.collection('examples');
const holdout = multi.collection('holdout');

examples.put('greeting_003', { text: 'Hola mundo!', language: 'es' });
holdout.put('greeting_eval', { text: 'Hallo Welt!', language: 'de' });

console.log(multi.listCollections());  // ['examples', 'holdout']

// Advanced: inspect lock/mode via the exposed root
const nativeRoot = multi.root;
// ⚠️ Closing the root invalidates every handle for this path. Shutdown only.
nativeRoot.close();

CLI

# Initialize a store
zippy init ./my_first_dataset -c examples

# Add documents
zippy put ./my_first_dataset -c examples greeting_001 \
    --data '{"text": "Hello, world!", "language": "en"}'

zippy put ./my_first_dataset -c examples greeting_002 \
    --data '{"text": "Bonjour!", "language": "fr"}'

# View a document
zippy get ./my_first_dataset -c examples greeting_001 --pretty

# List all documents
zippy scan ./my_first_dataset -c examples

# Show statistics
zippy stats ./my_first_dataset

What Just Happened?

Your data is now stored in a human-readable format:

$ tree my_first_dataset/
my_first_dataset/
└── collections/
    └── examples/
        ├── meta/
        │   ├── data.jsonl      # Your documents
        │   ├── index.bin       # Binary index for O(1) lookups
        │   └── manifest.json   # Collection metadata
        └── docs/               # (optional file-per-doc mode)

$ cat my_first_dataset/collections/examples/meta/data.jsonl
{"_id":"greeting_001","text":"Hello, world!","language":"en","sentiment":"positive"}
{"_id":"greeting_002","text":"Bonjour le monde!","language":"fr","sentiment":"positive"}
{"_id":"greeting_003","text":"Hola mundo!","language":"es","sentiment":"positive","extra_field":["this","is","flexible"]}

No special tools needed to inspect your data!
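To make the zero-tooling point concrete, here is a sketch that parses the JSONL shown above using only Python's standard library, with no zippy import involved (the sample lines mirror the cat output):

```python
import json

# The data.jsonl contents from the cat output above, as a sample string.
jsonl = """\
{"_id":"greeting_001","text":"Hello, world!","language":"en","sentiment":"positive"}
{"_id":"greeting_002","text":"Bonjour le monde!","language":"fr","sentiment":"positive"}
{"_id":"greeting_003","text":"Hola mundo!","language":"es","sentiment":"positive","extra_field":["this","is","flexible"]}
"""

# Each line is one standalone JSON document.
docs = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
print([d["language"] for d in docs])  # ['en', 'fr', 'es']
```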


Working with ML Datasets

The ZDataset API

For ML workflows, use ZDataset, which provides a HuggingFace-compatible interface:

from zippy import ZDataset, ZIterableDataset

# Map-style dataset (random access)
dataset = ZDataset.from_store("./my_first_dataset", collection="examples")

# Length and indexing
print(len(dataset))    # 3
print(dataset[0])      # First document
print(dataset[-1])     # Last document

# Shuffle with seed
shuffled = dataset.shuffle(seed=42)

# Filter
english = dataset.filter(lambda x: x["language"] == "en")

# Map transformation
def add_uppercase(doc):
    return {**doc, "text_upper": doc["text"].upper()}

mapped = dataset.map(add_uppercase)

# Batching
for batch in dataset.batch(2):
    print(f"Batch of {len(batch)} documents")

# Streaming (memory-efficient for large datasets)
iterable = ZIterableDataset.from_store("./my_first_dataset", collection="examples")
for doc in iterable.shuffle(buffer_size=100):
    process(doc)
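As an aside, seeded shuffling on a map-style dataset is typically just a permuted index table over unchanged storage, which is why it is cheap and reproducible. A hedged sketch of that idea (not ZDS's actual implementation):

```python
import random

# Illustrative only: a map-style dataset that shuffles by permuting an
# index table, leaving the underlying documents untouched.
class TinyMapDataset:
    def __init__(self, docs, order=None):
        self.docs = docs
        self.order = order if order is not None else list(range(len(docs)))

    def __len__(self):
        return len(self.order)

    def __getitem__(self, i):
        return self.docs[self.order[i]]

    def shuffle(self, seed):
        # Same seed -> same permutation, so runs are reproducible.
        order = self.order[:]
        random.Random(seed).shuffle(order)
        return TinyMapDataset(self.docs, order)

ds = TinyMapDataset(["a", "b", "c", "d"])
shuffled = ds.shuffle(seed=42)
```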

Converting from HuggingFace

Already have HuggingFace datasets? Convert them:

from datasets import load_dataset
from zippy import from_hf, to_hf

# Load any HuggingFace dataset
hf_dataset = load_dataset("imdb", split="train")

# Convert to ZDS
zds = from_hf(hf_dataset, "./imdb_zds", collection="train")
print(f"Converted {len(zds)} documents")

# Now you can inspect with standard tools
# cat ./imdb_zds/collections/train/meta/data.jsonl | head -1 | jq .

# Convert back when needed
hf_back = to_hf(zds)

DuckDB Integration

Query your data with SQL:

from zippy import query_zds, register_zds
import duckdb

# Quick query
results = query_zds(
    "./my_first_dataset",
    "SELECT language, COUNT(*) as count FROM examples GROUP BY language"
)
print(results)
# [{'language': 'en', 'count': 1}, {'language': 'fr', 'count': 1}, ...]

# Register in DuckDB session for complex queries
conn = duckdb.connect()
register_zds(conn, "./my_first_dataset", collection="examples")

conn.execute("""
    SELECT * FROM examples 
    WHERE sentiment = 'positive'
    ORDER BY language
""").fetchall()

Packaging for Distribution

Pack your dataset into a single .zds file:

# Pack
zippy pack ./my_first_dataset my_dataset.zds

# Share the .zds file...

# Recipients unpack
zippy unpack my_dataset.zds ./extracted

# Or just unzip (it's a ZIP file!)
unzip my_dataset.zds -d extracted/
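Because a .zds file is a standard ZIP archive, any ZIP library can read it. A minimal sketch with Python's stdlib, building a tiny in-memory archive that mimics the layout shown earlier and reading it back (the internal path follows the tree output above):

```python
import io
import json
import zipfile

# Build a tiny ZIP in memory, mirroring the collections/ layout.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(
        "collections/examples/meta/data.jsonl",
        '{"_id":"greeting_001","text":"Hello, world!","language":"en"}\n',
    )

# Read it back with zipfile alone - no zippy tooling required.
with zipfile.ZipFile(buf) as zf:
    line = zf.read("collections/examples/meta/data.jsonl").decode().strip()
doc = json.loads(line)
print(doc["text"])  # Hello, world!
```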

Next Steps