Guides
In-depth coverage of DataChain capabilities. Start with Get Data In and Transform for the core workflow, then explore deeper topics as needed.
Get Data In
- Reading Data: storage files, structured formats, SQL databases, in-memory sources, metadata merging
- Remote Storage: S3, GCS, Azure configuration, credentials, and access patterns
Transform
- Data Engine Operations: filter, merge, group_by, mutate, and other SQL-speed operations
- Python Operations: map, gen, agg, setup, and class-based lifecycle
- Function Library: dc.func.* for distance, aggregate, window, path, string, and conditional functions
- Vector Search: embedding computation, similarity search, drift detection
Get Data Out
- Exporting Data: pandas, Parquet, CSV, JSON, PyTorch DataLoader, train/test split, storage, SQL databases
Datasets
- Datasets: creating, versioning, namespaces, comparing, management, metrics
Knowledge Base
- Knowledge Base: skill installation, dc-knowledge/generation, agent workflow, browsing
Scale and Recover
- Scaling and Performance: parallel, distributed, async prefetch, caching
- Delta Processing: incremental processing of new and changed files
- Checkpoints: automatic resume from failures
- Multi-Stage Pipelines: stage boundaries, comparative evaluation, cost tracking
Reference
- Best Practices: rules for writing correct, idiomatic DataChain code
- Error Handling and Retries: handling processing errors
- Data Processing Overview: overview of processing features
- Environment Variables: configuration options
- Namespaces: namespace and project details
- Local DB Migrations: handling upgrades