Dataset Management

Your datasets deserve version control

Git for your computer vision data. Track changes, compare versions, and ensure every experiment is reproducible.

100% Reproducibility
0 Data loss
5x Faster iterations

Used by teams at SGS, RTE, Pellenc, and Skillcorner

Version Control

Track every change, reproduce any result

Your datasets evolve constantly: new images, corrected labels, filtered samples. Without version control, you're flying blind.

Immutable snapshots

Every dataset version is a permanent snapshot. Reference exact data states in experiments.

Label management

Create, rename, and merge labels across your dataset. Keep your taxonomy clean and consistent.

Fork for experiments

Fork dataset versions to test hypotheses without affecting production data.

Audit-ready history

Complete changelog with who changed what, when, and why. Perfect for compliance.
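
To make the snapshot idea concrete, here is a minimal sketch of how an immutable version could be fingerprinted and compared with another one. It is an illustration only, not how the platform stores versions: the manifest structure (file name mapped to a content hash) and the helper names `snapshot_fingerprint` and `diff_manifests` are hypothetical.

```python
import hashlib
import json


def snapshot_fingerprint(manifest: dict[str, str]) -> str:
    """Return a deterministic SHA-256 fingerprint for a dataset manifest.

    `manifest` maps a file name to the content hash of that file
    (a hypothetical structure, used here for illustration only).
    """
    # Sort keys so the same content always serializes identically.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """List files added, removed, or changed between two manifests."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Two toy versions of the same dataset (made-up file names and hashes).
v1 = {"img_001.jpg": "a1", "img_002.jpg": "b2"}
v2 = {"img_001.jpg": "a1", "img_002.jpg": "b9", "img_003.jpg": "c3"}

changes = diff_manifests(v1, v2)
print(snapshot_fingerprint(v1) != snapshot_fingerprint(v2))
print(changes)
```

Because the fingerprint depends only on the manifest contents, an experiment that records it can later check that it is reading exactly the same data state.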

Version Timeline

v2.0 · Dataset: defect-detection · Last modified Feb 28
Images: 2,100
Annotations: 8,400
Change: +650

Capabilities

Everything you need to manage datasets

Version, organize, and share your datasets. Everything connects to your experiments.

Never lose your data again

Git-like Version Control

Track every change to your datasets. Compare versions, roll back mistakes, and branch for experiments. Full lineage from raw data to trained models.

Full audit trail included
100% Reproducibility
5x Faster discovery
Structure without chaos

Smart Data Organization

Tag, filter, and slice your data in seconds. Create custom views, save queries, and share collections. No more hunting through folders.

60% Less coordination
Work together, not in silos

Team Collaboration

Share datasets across teams with fine-grained permissions. Track who changed what, when, and why. Comments and reviews are built in.

100% Audit compliance
Know where your data comes from

Full Data Lineage

Trace any prediction back to its training data. Audit-ready lineage for compliance. Understand model behavior through data.

Developer Experience

Programmatic dataset management

Full Python SDK with type hints, auto-completion, and comprehensive documentation. Integrate datasets directly into your ML pipelines.

dataset_management.py
from picsellia import Client

# Connect to the platform and open the datalake
client = Client()
datalake = client.get_datalake()

# Get an existing dataset by name
dataset = client.get_dataset("defect-detection")

# Create a new version
version = dataset.create_version(
    version="v3",
    description="Added edge cases"
)

# Add data from datalake, filtered by tags
data = datalake.list_data(
    tags=["edge-case", "validated"]
)
version.add_data(data)
labels_export.py
from picsellia.types.enums import AnnotationFileType

# Label manipulation (`version` is the dataset version created in dataset_management.py)
labels = version.list_labels()
version.create_label("scratch")

# Rename a label
label = version.get_label("defect")
label.update(name="surface_defect")

# Export annotations in COCO format
version.export_annotation_file(
    AnnotationFileType.COCO,
    "./training_data"
)
Supports COCO, YOLO, Pascal VOC
COCO: Object detection & segmentation
YOLO: YOLOv5/v8 format
Pascal VOC: XML annotations
Custom: JSON/CSV exports

Dataset Browser

train: 8,400 (70%)
validation: 1,800 (15%)
test: 1,800 (15%)
Total images: 12K
Annotations: 48K
Class distribution: Balanced

Data Organization

Structure your data the right way

Proper data splits are crucial for model performance. Create reproducible train/val/test splits, stratify by class, and ensure no data leakage.

Automatic stratified splits by class distribution
Custom split ratios with reproducible seeds
No overlap guarantee between splits
Re-split without losing annotations
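
The split guarantees listed above can be sketched in a few lines of plain Python. This is a generic illustration of a seeded, class-stratified split, not the platform's implementation: the function name `stratified_split` and the `labels` mapping (item id to class name) are assumptions made for the example, and the 70/15/15 ratios simply mirror the split shown in the Dataset Browser.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.7, 0.15, 0.15), seed=42):
    """Split item ids into train/validation/test, stratified by class.

    `labels` maps an item id to its class name (hypothetical input shape).
    Each id lands in exactly one split, so the splits never overlap.
    """
    rng = random.Random(seed)  # local RNG: the same seed gives the same split

    # Group ids by class, in sorted order so input ordering does not matter.
    by_class = defaultdict(list)
    for item_id, cls in sorted(labels.items()):
        by_class[cls].append(item_id)

    splits = {"train": [], "validation": [], "test": []}
    for cls in sorted(by_class):
        items = by_class[cls]
        rng.shuffle(items)
        n_train = round(len(items) * ratios[0])
        n_val = round(len(items) * ratios[1])
        # Slice each class separately so every split keeps the class balance.
        splits["train"] += items[:n_train]
        splits["validation"] += items[n_train:n_train + n_val]
        splits["test"] += items[n_train + n_val:]
    return splits


# Toy dataset: 40 items, two classes of 20 items each.
labels = {f"img_{i:03d}": ("scratch" if i % 2 else "dent") for i in range(40)}
splits = stratified_split(labels, seed=42)
print({name: len(ids) for name, ids in splits.items()})
```

Since the split touches only item ids, annotations attached to those ids are unaffected when the same data is re-split with a different seed or different ratios.
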
Workflow Integration

Fits into your existing workflow

Datasets connect directly to annotations, experiments, and deployments. No manual handoffs.

Auto-sync from datalake
Version every change
Reference in experiments

Ready to version your datasets?

Free trial, no credit card. Start versioning your datasets today.

No credit card required
14-day free trial
Unlimited versions
50M+ Images versioned