DataChain

Data Memory for AI Agents

The model floor is the same for everyone. The context ceiling is yours.

Your data lives in object storage (millions of images, hours of video, documents) and databases (structured tables). Every chain a teammate or agent runs deposits a typed, versioned dataset into Data Memory: embeddings, classifications, joins, scores. At scale, those datasets are too expensive to recompute and too scattered to find on demand.

DataChain is the Python library that runs your code over heavy files and tables in parallel and queries Data Memory at warehouse speed. Read from S3, GCS, or Azure, run your code, save as a Pydantic-typed dataset; the next pipeline or agent picks up from there.
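A minimal sketch of that flow, assuming the public datachain API (read_storage, map, filter, save); the bucket path, column name, and dataset name are hypothetical:

```python
import datachain as dc


def size_kb(file: dc.File) -> float:
    # Runs in parallel over every object listed from the bucket.
    return file.size / 1024


chain = (
    dc.read_storage("s3://my-bucket/images/")  # hypothetical bucket
    .map(size_kb=size_kb)        # typed column, inferred from the annotation
    .filter(dc.C("size_kb") > 100)
    .save("images-large")        # versioned dataset in Data Memory
)
```

The function's type annotations define the dataset schema, so the saved result is typed without a separate schema declaration.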

Why Data Memory

Claude Code, Cursor, and Codex made AI good at code by giving it the repo as context. Agents over your data need the same: a data context layer with schemas, lineage, and prior conclusions. That layer is captured during production, not curated after. Every DataChain pipeline run deposits a typed, versioned dataset into Data Memory; the Knowledge Base compiles those datasets into what agents read. Without pipelines running through DataChain, the layer has nothing structured to describe.
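How a later pipeline or agent picks a prior result back up, as a sketch (read_dataset is the assumed entry point; the dataset name refers to the hypothetical example above):

```python
import datachain as dc

# Load the latest version of a dataset a prior run saved to Data Memory.
images = dc.read_dataset("images-large")

# Inspect the typed schema and a few rows before building on top of it.
images.limit(5).show()
```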

Get started

  • 🤖 Agents - knowledge base for Claude Code, Codex, and Cursor
  • 🐍 Python - full control over data processing
  • 💡 Concepts - Data Memory, the Python and Query engines, and the Knowledge Base
  • 🧩 Use Cases - patterns where the harness changes the work

Architecture

[DataChain architecture diagram]