Zenodo Reproducibility Guide

This guide explains how to deposit your scientific project's data and environment on Zenodo so that anyone can reproduce your results from scratch — no Python installation required.

The workflow relies on:
- Zenodo: long-term, DOI-citable data archiving
- Docker: captures the complete software environment
- GitHub Container Registry (GHCR): hosts the container image

Overview

The reproducibility stack looks like this:

Reader
  │
  ├─ 1. Pulls Docker image from GHCR   ← contains: code + dependencies
  │       ghcr.io/you/my-analysis:v1.0.0
  │
  ├─ 2. Downloads data from Zenodo     ← contains: raw/processed data
  │       DOI: 10.5281/zenodo.XXXXXXX
  │
  └─ 3. docker compose run reproduce   ← mounts data, runs analysis, outputs plots
           ./plots/figure_1.pdf  ✓
           ./plots/figure_2.pdf  ✓

Why separate image and data? Docker images should be lightweight and versionable via git tags. Data (often GB+) belongs on a data repository (Zenodo) with its own DOI. Keeping them separate means:
- The image can be rebuilt without re-uploading 10 GB of data
- Data can be updated/versioned independently

One-Time Setup: Scaffold with `--docker`

When creating a new project, pass --docker to generate all Docker files:

pywatson --project-name my-analysis \
  --author-name "Jane Doe" \
  --author-email "jane@university.edu" \
  --description "Spin-chain Monte Carlo study" \
  --project-type full \
  --docker

This generates:

File	Purpose
`Dockerfile`	Builds the analysis environment
`.dockerignore`	Excludes data/plots from image (stays lean)
`docker-compose.yml`	One-command reproduction workflow
`README_DOCKER.md`	Reader-facing reproduction instructions
`.github/workflows/docker-publish.yml`	Automatic image builds on push
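
Before committing, it can be worth confirming that the scaffold actually produced every file listed in the table. The following is a minimal sketch, not part of PyWatson: the helper name `missing_docker_files` is hypothetical, and the file list is simply copied from the table above.

```python
from pathlib import Path

# File list copied from the table above; adjust it if your scaffold differs.
EXPECTED_DOCKER_FILES = (
    "Dockerfile",
    ".dockerignore",
    "docker-compose.yml",
    "README_DOCKER.md",
    ".github/workflows/docker-publish.yml",
)


def missing_docker_files(project_root: str) -> list[str]:
    """Return the expected scaffold files that are absent under project_root."""
    root = Path(project_root)
    return [name for name in EXPECTED_DOCKER_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Hypothetical usage: run from the directory that contains my-analysis/
    for name in missing_docker_files("my-analysis"):
        print(f"missing: {name}")
```

An empty result means all five files are present; anything printed points to a file that was not generated or was deleted afterwards.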

Commit and push immediately

cd my-analysis
git add .
git commit -m "feat: initial project scaffold with Docker"
git push -u origin main

GitHub Actions will now automatically build and push the Docker image to GHCR on every push to main and on every semver tag (e.g. v1.0.0).

Deposit Data on Zenodo

Step 1 — Connect Zenodo to GitHub (recommended)

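Go to https://zenodo.org/account/settings/github/
Toggle ON your repository
Publish a GitHub release for a tag → Zenodo automatically creates a record with a DOI (pushing a tag alone does not trigger archiving)

git tag v1.0.0
git push origin v1.0.0
# Publish a release for the tag (in the GitHub web UI, or with the GitHub CLI if installed)
gh release create v1.0.0 --generate-notes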

Zenodo will create:
- A record for the code snapshot (the tag's zip/tarball)
- DOI: 10.5281/zenodo.XXXXXXX

Important: this archives the code, not your HDF5 data.
You still need a separate data deposit (Step 2).

Step 2 — Create a standalone data deposit

Your raw and processed data does not live in git. Create a dedicated deposit:

Go to https://zenodo.org/uploads/new
Upload your data archive:

cd my-analysis
# Package only the data needed to reproduce the paper figures
tar -czf my-analysis-data-v1.0.0.tar.gz \
    data/sims/ \
    data/exp_pro/ \
    data/exp_raw/

# Optional: generate a checksum file
sha256sum my-analysis-data-v1.0.0.tar.gz > checksums.sha256

Fill in Zenodo metadata:
- Title: "Data for: My Analysis Paper Title"
- Authors: same as paper
- Description: what the data is, how it was generated, file format
- Keywords: your field, reproducibility, hdf5, python
- License: CC-BY 4.0 (recommended for open science)
- Related identifiers: link to your code DOI (from Step 1)
Reserve the DOI before publishing (Zenodo gives it to you upfront, so you can add it to your paper before the paper is accepted)
Publish: the deposit is now permanently citable
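
If you prefer to keep the deposit metadata in version control rather than typing it into the web form, the fields listed above can be collected in one structure. The sketch below is an illustration only: the function name is hypothetical, and the key names (`upload_type`, `creators`, `prereserve_doi`, `related_identifiers`, the `isCompiledBy` relation, the `cc-by-4.0` license id) are assumed to match Zenodo's legacy deposit REST API, so check them against the current Zenodo API documentation before relying on them.

```python
import json


def build_zenodo_metadata(title, authors, description, keywords, code_doi):
    """Collect the form fields from Step 2 into a JSON-serializable dict.

    Key names are assumptions modeled on Zenodo's legacy deposit API.
    """
    return {
        "metadata": {
            "upload_type": "dataset",
            "title": title,
            # Zenodo conventionally expects names as "Family, Given"
            "creators": [{"name": name} for name in authors],
            "description": description,
            "keywords": list(keywords),
            "license": "cc-by-4.0",
            "prereserve_doi": True,  # ask for the DOI before publishing
            "related_identifiers": [
                # Data -> Code link, using the relation from this guide's table
                {"relation": "isCompiledBy", "identifier": code_doi},
            ],
        }
    }


if __name__ == "__main__":
    meta = build_zenodo_metadata(
        title="Data for: My Analysis Paper Title",
        authors=["Doe, Jane"],
        description="Simulation and experimental data in HDF5 format.",
        keywords=["reproducibility", "hdf5", "python"],
        code_doi="10.5281/zenodo.XXXXXXX",  # placeholder, as in the guide
    )
    print(json.dumps(meta, indent=2))
```

Keeping this file next to the code makes it easy to review metadata changes and to reuse the same values for later data versions.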

Step 3 — Update README_DOCKER.md with the real DOI

# In README_DOCKER.md, replace the placeholders
# (GNU sed syntax; on macOS/BSD sed, write: sed -i '' 's/OLD/NEW/g' FILE)
sed -i 's/YOUR_ZENODO_ID/12345678/g' README_DOCKER.md
sed -i 's/YOUR_GITHUB_USERNAME/your-github-handle/g' README_DOCKER.md

git add README_DOCKER.md
git commit -m "docs: add Zenodo DOI and GHCR image URL"
git tag v1.0.1   # new patch version; under semver, "v1.0.0-final" would parse as a pre-release of 1.0.0
git push origin main --tags
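
Because `sed -i` behaves differently on GNU and BSD systems, the same substitution can also be done in Python. This is a minimal sketch under the assumption that the only placeholders are the two used in the commands above; the function names are hypothetical. Unlike `sed` without the `g` flag, `str.replace` substitutes every occurrence.

```python
from pathlib import Path


def fill_placeholders(text: str, zenodo_id: str, github_user: str) -> str:
    """Replace every occurrence of the two README placeholders."""
    return (
        text.replace("YOUR_ZENODO_ID", zenodo_id)
            .replace("YOUR_GITHUB_USERNAME", github_user)
    )


def update_readme(path: str, zenodo_id: str, github_user: str) -> None:
    """Rewrite the file at `path` in place with the placeholders filled in."""
    readme = Path(path)
    original = readme.read_text(encoding="utf-8")
    readme.write_text(
        fill_placeholders(original, zenodo_id, github_user), encoding="utf-8"
    )


if __name__ == "__main__":
    # Example values taken from the sed commands above
    update_readme("README_DOCKER.md", "12345678", "your-github-handle")
```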

Publish the Docker Image

The GitHub Actions workflow (.github/workflows/docker-publish.yml) does this automatically. It:

Builds the image on every push to main
Runs a smoke-test (generates sample data, runs analysis inside the container, verifies plots/ is non-empty)
Pushes to GHCR only if the smoke-test passes
Tags images as :latest, :SHA, and :v1.2.3 on semver tags
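
To make the tagging behaviour concrete, the sketch below models which image tags a given push could produce. It is not the generated workflow's actual logic, which this guide does not show. It assumes that `:latest` is applied only on pushes to main, that the `:SHA` tag is the 7-character short commit hash, and that only tags of the exact form `vMAJOR.MINOR.PATCH` count as semver tags. The function name is hypothetical.

```python
import re

# Assumption: only plain vMAJOR.MINOR.PATCH tags are treated as release tags.
SEMVER_TAG = re.compile(r"^v\d+\.\d+\.\d+$")


def image_tags(git_ref: str, commit_sha: str) -> list[str]:
    """Return the image tags assumed for a push to `git_ref` at `commit_sha`."""
    tags = [commit_sha[:7]]  # short SHA, always applied
    if git_ref == "refs/heads/main":
        tags.append("latest")
    elif git_ref.startswith("refs/tags/"):
        tag = git_ref[len("refs/tags/"):]
        if SEMVER_TAG.match(tag):
            tags.append(tag)
    return tags


if __name__ == "__main__":
    print(image_tags("refs/heads/main", "0123456789abcdef"))
    print(image_tags("refs/tags/v1.0.0", "0123456789abcdef"))
    print(image_tags("refs/tags/draft", "0123456789abcdef"))
```

Under these assumptions a tag such as `draft` yields only the SHA tag, which is why release tags should follow the `vX.Y.Z` pattern.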

After the first successful push, make the image public (GHCR packages are private by default):
- Open the package page (listed under "Packages" on your GitHub repo) → Package settings → Change visibility → Public

After that, docker pull ghcr.io/YOUR_USERNAME/my-analysis:latest works for anyone.

Manual image push (optional)

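# Build locally and tag the image as both v1.0.0 and latest
docker build \
  -t ghcr.io/YOUR_USERNAME/my-analysis:v1.0.0 \
  -t ghcr.io/YOUR_USERNAME/my-analysis:latest .

# Log in (GITHUB_TOKEN must be a personal access token with the write:packages scope)
echo "$GITHUB_TOKEN" | docker login ghcr.io -u YOUR_USERNAME --password-stdin

# Push both tags
docker push ghcr.io/YOUR_USERNAME/my-analysis:v1.0.0
docker push ghcr.io/YOUR_USERNAME/my-analysis:latest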

Reader Reproduction Steps

This is what you should document in your README and README_DOCKER.md:

# 1. Pull the image
docker pull ghcr.io/YOUR_USERNAME/my-analysis:v1.0.0

# 2. Download data (replace with real Zenodo URL)
curl -L "https://zenodo.org/records/XXXXXXX/files/my-analysis-data-v1.0.0.tar.gz" \
     -o data.tar.gz
tar -xzf data.tar.gz          # → ./data/sims/ etc.
mkdir -p plots

# 3. Reproduce (run from a directory that contains the project's docker-compose.yml)
docker compose run reproduce

# Plots appear in ./plots/

Total time for a reader: roughly 5 minutes for a small dataset. The data download dominates, so expect longer for multi-GB archives.
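
Readers may want to confirm that the downloaded archive is intact before extracting it. The sketch below assumes the author also uploaded the optional `checksums.sha256` file to the Zenodo record (the guide only describes generating it) and that the archive sits next to it under the file name recorded there, so download it under its original name rather than as `data.tar.gz`. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20) -> str:
    """Stream a file through SHA-256 so large archives need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file: str) -> dict[str, bool]:
    """Check each '<hex digest>  <file name>' line as written by sha256sum.

    File names are resolved relative to the checksum file's directory.
    """
    listing = Path(checksum_file)
    results = {}
    for line in listing.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum prefixes '*' in binary mode
        results[name] = sha256_of(listing.parent / name) == expected.lower()
    return results


if __name__ == "__main__":
    for name, ok in verify_checksums("checksums.sha256").items():
        print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```

On Linux, `sha256sum -c checksums.sha256` performs the same check; the Python version is mainly useful where that tool is unavailable.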

Linking Code, Data, and Software

Zenodo supports "related identifiers" to create a web of linked records:

Link (this record → related record)	Zenodo relation	Example related identifier
Paper → Code	`is_supplemented_by`	`10.5281/zenodo.code-doi`
Paper → Data	`is_supplemented_by`	`10.5281/zenodo.data-doi`
Code → Data	`is_documented_by`	`10.5281/zenodo.data-doi`
Data → Code	`is_compiled_by`	`10.5281/zenodo.code-doi`
Docker image → Code	`is_derived_from`	`10.5281/zenodo.code-doi`

Add these in the "Related identifiers" section of each Zenodo deposit.

Checklist Before Submitting a Paper

□ All scripts run from scratch against the Zenodo data
□ docker compose run reproduce produces all paper figures
□ README_DOCKER.md has correct Zenodo DOI and GHCR URL
□ Zenodo data deposit is PUBLISHED (not draft)
□ Code deposit / GitHub release is tagged and linked
□ DOIs added to paper and README
□ Docker image pushed and publicly accessible
□ Image smoke-test passes in GitHub Actions
□ uv.lock committed (pinned reproducible dependency graph)
□ git log is clean (descriptive commit messages, no leftover work-in-progress commits)
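
A few of the checklist items can be checked mechanically on a local checkout. The following sketch covers only those: `uv.lock` exists, `README_DOCKER.md` no longer contains placeholders, and `plots/` is non-empty after a reproduction run. The function name and the exact placeholder strings are assumptions based on the commands in this guide; items such as the Zenodo publication status still need a manual check.

```python
from pathlib import Path

# Placeholder strings as used in Step 3 of this guide
PLACEHOLDERS = ("YOUR_ZENODO_ID", "YOUR_GITHUB_USERNAME")


def preflight(project_root: str) -> list[str]:
    """Return a list of human-readable problems; empty means all checks passed."""
    root = Path(project_root)
    problems = []

    if not (root / "uv.lock").is_file():
        problems.append("uv.lock is missing")

    readme = root / "README_DOCKER.md"
    if not readme.is_file():
        problems.append("README_DOCKER.md is missing")
    else:
        text = readme.read_text(encoding="utf-8")
        problems += [
            f"README_DOCKER.md still contains {p}" for p in PLACEHOLDERS if p in text
        ]

    plots = root / "plots"
    if not plots.is_dir() or not any(plots.iterdir()):
        problems.append("plots/ is missing or empty")

    return problems


if __name__ == "__main__":
    for problem in preflight("."):
        print(f"[ ] {problem}")
```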

Advanced: Bit-Exact Reproducibility

For truly bit-exact reproduction (same random seeds, same OS, same library patch versions):

# Pin to an exact image SHA (not :latest)
docker pull ghcr.io/YOUR_USERNAME/my-analysis@sha256:ABCDEF...

# Run with the exact SHA
docker run --rm \
  -v "$PWD/data":/workspace/data:ro \
  -v "$PWD/plots":/workspace/plots \
  ghcr.io/YOUR_USERNAME/my-analysis@sha256:ABCDEF...

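Record the image digest (the sha256 value) in your paper's supplementary material. It is printed in the GitHub Actions log and shown on the GHCR package page. Pinning the digest fixes the software environment; floating-point results can still differ slightly across CPU architectures.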
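
When the digest is copied by hand into a paper or script, a truncated or mistyped value is easy to miss. This small sketch validates the digest format and builds the pinned reference string used in the commands above. The function name is hypothetical, and the check covers only the format, not whether the digest exists on GHCR.

```python
import re

# A sha256 digest is "sha256:" followed by exactly 64 lowercase hex characters.
DIGEST = re.compile(r"^sha256:[0-9a-f]{64}$")


def pinned_reference(image: str, digest: str) -> str:
    """Return 'image@digest', rejecting anything that is not a full sha256 digest."""
    if not DIGEST.match(digest):
        raise ValueError(f"not a full sha256 digest: {digest!r}")
    return f"{image}@{digest}"


if __name__ == "__main__":
    example_digest = "sha256:" + "0" * 64  # dummy value for illustration only
    print(pinned_reference("ghcr.io/you/my-analysis", example_digest))
```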

Generated by PyWatson.