Files and Types
DataChain's type system is built on Pydantic. Every chain carries a schema, every column has a type, and custom models integrate automatically.
File Abstraction
File is the bridge between object storage and the data layer. Every chain begins with files: read_storage() produces a chain of File objects. File carries storage coordinates and provides methods to read content.
Storage Coordinates
File tracks everything needed to locate and identify a blob: source, path, version, etag, size, is_latest, and last_modified. This metadata is indexed by the Query Engine and available for data-engine operations without touching the actual bytes.
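To make the "without touching the actual bytes" point concrete, here is a minimal plain-Python sketch of metadata-only selection. BlobRef and latest_large_blobs are hypothetical stand-ins invented for illustration, not DataChain classes; only the field names come from the list above, and the sample values are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class BlobRef:
    """Hypothetical stand-in holding the coordinate fields listed above."""
    source: str              # storage root (assumed example: "s3://bucket")
    path: str                # key relative to the source
    version: str
    etag: str
    size: int                # size in bytes
    is_latest: bool
    last_modified: datetime


def latest_large_blobs(refs, min_size):
    """Select by metadata alone; no blob content is read."""
    return [r for r in refs if r.is_latest and r.size >= min_size]


# Made-up records: two versions of one object plus a small second object.
refs = [
    BlobRef("s3://bucket", "images/a.png", "v2", "e1", 2048, True,
            datetime(2024, 1, 2, tzinfo=timezone.utc)),
    BlobRef("s3://bucket", "images/a.png", "v1", "e0", 1024, False,
            datetime(2024, 1, 1, tzinfo=timezone.utc)),
    BlobRef("s3://bucket", "images/b.png", "v1", "e2", 512, True,
            datetime(2024, 1, 3, tzinfo=timezone.utc)),
]

selected = latest_large_blobs(refs, min_size=1000)
print([f"{r.source}/{r.path}@{r.version}" for r in selected])
```

Only the latest version of images/a.png passes both conditions. The selection needs nothing but the stored coordinates.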
Content Access
import datachain as dc
file = dc.File.at("s3://bucket/path/to/file.png")
content = file.read() # bytes
text = file.read_text() # str
with file.open("rb") as f: # stream large files
    chunk = f.read(4096)
file.ensure_cached() # cache locally
local = file.get_local_path() # local path after caching
file.export("/local/output/", placement="filename")
file.save("s3://bucket/output/result.png")
File inherits from DataModel (a Pydantic BaseModel subclass), so it participates in schemas, gets stored as columns, and flows through all chain operations.
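The Content Access snippet reads a single 4096-byte chunk. Processing a whole large file usually means looping until read() returns nothing. The sketch below shows that loop, with io.BytesIO standing in for the handle returned by file.open("rb"). That handle is assumed to behave like a standard binary file object, and sha256_of_stream is a hypothetical helper, not part of DataChain.

```python
import hashlib
import io


def sha256_of_stream(handle, chunk_size=4096):
    """Hash a binary file-like object chunk by chunk."""
    digest = hashlib.sha256()
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:  # empty bytes signals end of stream
            break
        digest.update(chunk)
    return digest.hexdigest()


# io.BytesIO stands in for a storage-backed handle (assumption).
payload = b"x" * 10_000
with io.BytesIO(payload) as f:
    streamed = sha256_of_stream(f)

# The chunked digest matches hashing the payload in one call.
print(streamed == hashlib.sha256(payload).hexdigest())
```

Memory use stays bounded by chunk_size regardless of file size, which is the reason to prefer open() over read() for large blobs.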
Modality Types
File has specialized subclasses for each data modality. Each adds domain-specific methods while inheriting all of File's capabilities.
import datachain as dc
images = dc.read_storage("s3://bucket/images/", type="image") # ImageFile
videos = dc.read_storage("s3://bucket/videos/", type="video") # VideoFile
audio = dc.read_storage("s3://bucket/audio/", type="audio") # AudioFile
- ImageFile: read() returns a PIL.Image.Image. get_info() returns Image metadata (width, height, format). save() supports format conversion.
- VideoFile: get_frames(start, end, step) yields VideoFrame objects. get_fragments(duration, start, end) yields VideoFragment time slices. get_info() returns Video metadata (fps, duration, codec, resolution).
- AudioFile: get_fragments(duration, start, end) yields AudioFragment chunks. get_info() returns Audio metadata (sample_rate, channels, duration, codec). save() supports format conversion.
Sub-File Units
Videos and audio tracks can be sliced into smaller units that are themselves DataModels:
from typing import Iterator

import datachain as dc

# Expand one video into thousands of frames
chain = (
    dc.read_storage("s3://bucket/videos/", type="video")
    .gen(frame=lambda file: file.get_frames(step=30))
    .save("video_frames")
)

# Time-based slicing
def split_into_clips(file: dc.VideoFile) -> Iterator[dc.VideoFragment]:
    yield from file.get_fragments(duration=10.0)

chain = (
    dc.read_storage("s3://videos/", type="video")
    .gen(frag=split_into_clips)
    .save("video_clips")
)
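The arithmetic behind step-based and duration-based slicing can be sketched in plain Python. This is not DataChain's implementation: frame_indices and fragment_windows are hypothetical helpers, and the handling of the final partial window (kept here, but shorter) is an assumption made for illustration.

```python
from typing import Iterator, Tuple


def frame_indices(total_frames: int, step: int) -> Iterator[int]:
    """Every step-th frame index, starting at frame 0."""
    yield from range(0, total_frames, step)


def fragment_windows(total: float, duration: float) -> Iterator[Tuple[float, float]]:
    """Consecutive (start, end) windows in seconds; the last may be shorter."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    i = 0
    # Multiply instead of accumulating to avoid floating-point drift.
    while i * duration < total:
        start = i * duration
        yield (start, min(start + duration, total))
        i += 1


print(list(frame_indices(total_frames=90, step=30)))
print(list(fragment_windows(total=25.0, duration=10.0)))
```

With step=30, a 30 fps video contributes one frame per second, so a chain over long videos grows quickly. Fixed-duration windows expand each file in the same way, one row per window.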
Annotation Types
Built-in DataModels for annotation primitives: BBox, OBBox, Pose, Pose3D, Segment.
from datachain import model
# BBox with format conversion
bbox = model.BBox.from_coco([100, 150, 200, 300], title="car")
bbox = model.BBox.from_yolo([0.5, 0.5, 0.4, 0.6], img_size=(640, 480), title="car")
coco = bbox.to_coco() # [x, y, w, h]
yolo = bbox.to_yolo(img_size=(640, 480)) # normalized [cx, cy, w, h]
bbox.point_inside(300, 250) # spatial queries
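The COCO and YOLO conventions differ in both origin and scale, so the conversions are easy to get subtly wrong. The following plain-Python sketch shows the arithmetic only. It is not DataChain's code: the helper names are hypothetical, the intermediate corner representation [x1, y1, x2, y2] is an assumption, and img_size is taken to be (width, height).

```python
def coco_to_corners(box):
    """COCO [x, y, w, h] (top-left corner plus size, pixels) -> [x1, y1, x2, y2]."""
    x, y, w, h = box
    return [x, y, x + w, y + h]


def yolo_to_corners(box, img_size):
    """YOLO normalized [cx, cy, w, h] -> pixel corners [x1, y1, x2, y2]."""
    cx, cy, w, h = box
    img_w, img_h = img_size  # assumed order: (width, height)
    half_w, half_h = w * img_w / 2, h * img_h / 2
    return [cx * img_w - half_w, cy * img_h - half_h,
            cx * img_w + half_w, cy * img_h + half_h]


def corners_to_yolo(corners, img_size):
    """Pixel corners -> YOLO normalized [cx, cy, w, h]."""
    x1, y1, x2, y2 = corners
    img_w, img_h = img_size
    return [(x1 + x2) / 2 / img_w, (y1 + y2) / 2 / img_h,
            (x2 - x1) / img_w, (y2 - y1) / img_h]


def point_inside(corners, x, y):
    """True if (x, y) lies within the box, edges included."""
    x1, y1, x2, y2 = corners
    return x1 <= x <= x2 and y1 <= y <= y2


corners = coco_to_corners([100, 150, 200, 300])
print(corners)
print(corners_to_yolo(corners, img_size=(640, 480)))
print(point_inside(yolo_to_corners([0.5, 0.5, 0.4, 0.6], (640, 480)), 300, 250))
```

YOLO coordinates are meaningless without the image size, which is why both YOLO conversions above take img_size while the COCO ones do not.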
Annotation types compose naturally into Pydantic models:
from pydantic import BaseModel
from datachain import model
class YoloPose(BaseModel):
    bbox: model.BBox
    pose: model.Pose
    confidence: float
DataModel and Custom Types
Use pydantic.BaseModel directly for custom types; DataChain accepts it natively:
from typing import Iterator

from pydantic import BaseModel
import datachain as dc

class AudioSegment(BaseModel):
    audio: dc.AudioFragment
    id: int
    channel: str
    rms: float

def get_segments(file: dc.AudioFile) -> Iterator[AudioSegment]:
    ...
    yield AudioSegment(audio=file, ...)

chain = (
    dc.read_storage("s3://mybucket/audio_dir", type="audio")
    .gen(segm=get_segments)
    .save("audio_segments")
)
For external Pydantic models (like Mistral's ChatCompletionResponse), register them explicitly: dc.DataModel.register(MistralResponse).
Core Classes
- DataChain: the core class for composing queries with 60+ methods
- DataModel: Pydantic base for structured types
- Column (aliased as C): references a column by name for vectorized expressions
- File, ImageFile, VideoFile, AudioFile, TextFile: storage-native file types
- BBox, OBBox, Pose, Pose3D, Segment: annotation types (in datachain.model)