Project structure#

This page documents the Open Climate Risk (OCR) repository layout and explains the purpose of key directories and files. Use this as a technical reference when contributing code, adding datasets, or extending documentation.

Repository overview#

The repository is organized into distinct layers: the core Python package (ocr/), supporting infrastructure (configuration, deployment, testing), input data management, exploratory research notebooks, and documentation. The layout follows common conventions for scientific Python projects, with an emphasis on reproducibility, modularity, and cloud-native execution.

graph TB
    subgraph Repository["OCR Repository"]
        subgraph Core["Core Package"]
            OCR[ocr/]
            OCR --> CONFIG[config.py - Configuration models]
            OCR --> TYPES[types.py - Type definitions]
            OCR --> DATASETS[datasets.py - Data catalog]
            OCR --> CONUS[conus404.py - Climate data]
            OCR --> UTILS[utils.py - Utilities]
            OCR --> TESTING[testing.py - Test helpers]

            OCR --> DEPLOY[deploy/]
            DEPLOY --> CLI[cli.py - CLI app]
            DEPLOY --> MANAGERS[managers.py - Orchestration]

            OCR --> PIPELINE[pipeline/]
            PIPELINE --> PROCESS[process_region.py]
            PIPELINE --> PARTITION[partition.py]
            PIPELINE --> STATS[fire_wind_risk_regional_aggregator.py]
            PIPELINE --> PYRAMID[create_pyramid.py]
            PIPELINE --> PMTILES[create_building_pmtiles.py<br/>create_building_centroid_pmtiles.py<br/>create_regional_pmtiles.py]
            PIPELINE --> WRITERS[write_aggregated_region_analysis_files.py]

            OCR --> INPUTDS[input_datasets/]
            INPUTDS --> INPUTCLI[cli.py - Ingest CLI]
            INPUTDS --> INPUTBASE[base.py - Base classes]
            INPUTDS --> STORAGE[storage.py - Storage utils]
            INPUTDS --> INPUTTENSOR[tensor/ - Tensor ingestion]
            INPUTDS --> INPUTVECTOR[vector/ - Vector ingestion]

            OCR --> RISKS[risks/]
            RISKS --> FIRE[fire.py - Risk models]
        end

        subgraph Data["Data & Inputs"]
            INPUT[input-data/]
            INPUT --> TENSOR[tensor/ - CONUS404, USFS fire risk]
            INPUT --> VECTOR[vector/ - Buildings, regions, structures]
        end

        subgraph Research["Research & Exploration"]
            NOTEBOOKS[notebooks/]
            NOTEBOOKS --> NB1[Wind analysis notebooks]
            NOTEBOOKS --> NB2[Fire risk kernels]
            NOTEBOOKS --> NB3[Scaling experiments]
        end

        subgraph Docs["Documentation"]
            DOCSDIR[docs/]
            DOCSDIR --> HOWTO[how-to/ - Guides]
            DOCSDIR --> METHODS[methods/ - Science docs]
            DOCSDIR --> REFERENCE[reference/ - API & specs]
        end

        subgraph Testing["Testing & QA"]
            TESTS[tests/]
            TESTS --> UNIT[Unit tests]
            TESTS --> INTEGRATION[Integration tests]
            TESTS --> SNAPSHOTS[Snapshot tests]
        end

        subgraph Config["Configuration & Build"]
            PYPROJECT[pyproject.toml - Package config]
            PIXI[pixi.lock - Environment lock]
            SPHINX[docs/conf.py - Docs config]
            ENV[ocr-*.env - Environment vars]
            GITHUB[.github/ - CI/CD workflows]
        end

        subgraph Infra["Infrastructure"]
            BUCKET[bucket_creation/ - S3 setup]
        end
    end

    style Core fill:#e1f5ff
    style Data fill:#fff4e1
    style Research fill:#f3e8ff
    style Docs fill:#e8f5e9
    style Testing fill:#ffebee
    style Config fill:#fafafa
    style Infra fill:#fff9c4

Core package (`ocr/`)#

Contains all production code organized into logical modules:

Top-level modules#

| Module | Purpose |
| --- | --- |
| `config.py` | Pydantic models for storage, chunking, Coiled, and processing configuration |
| `types.py` | Type definitions and enums (Environment, Platform, RiskType, RegionType) |
| `datasets.py` | Catalog abstraction for Zarr and GeoParquet datasets in S3 storage |
| `conus404.py` | CONUS404 climate data helpers: load variables, compute humidity, wind transformations |
| `utils.py` | DuckDB utilities, S3 secrets, vector sampling, file transfer helpers |
| `testing.py` | Snapshot testing extensions for xarray and GeoPandas |
| `console.py` | Rich console instance for pretty terminal output |
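To illustrate the kind of definitions `types.py` holds, the sketch below declares two of the enums named in the table using only the standard library. The member names and values, and the `parse_environment` helper, are assumptions made for this example and are not taken from the package.

```python
from enum import Enum


class Environment(str, Enum):
    # Hypothetical members; the real enum may define different ones.
    LOCAL = "local"
    STAGING = "staging"
    PRODUCTION = "production"


class RiskType(str, Enum):
    # Hypothetical members; the real enum may define different ones.
    FIRE = "fire"
    WIND = "wind"


def parse_environment(value: str) -> Environment:
    """Convert a raw string (for example from an env file) to an Environment."""
    # Enum lookup by value raises ValueError for unknown names.
    return Environment(value.strip().lower())
```

Mixing in `str` lets members compare equal to plain strings, which is convenient when values come from configuration files or CLI arguments.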

Deployment (`deploy/`)#

Orchestration layer for local and cloud execution:

cli.py - Typer-based CLI application (ocr command) with commands for processing regions, aggregation, PMTiles generation, and analysis file creation
managers.py - Abstract batch manager interface with CoiledBatchManager (cloud) and LocalBatchManager (local) implementations
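The abstract-interface pattern described for managers.py can be sketched as follows. This is a minimal illustration, not the package's code: the `submit` and `wait` method names, the `InProcessBatchManager` class, and the `run_regions` helper are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Callable


class BatchManager(ABC):
    """Hypothetical interface: submit jobs, then collect their results."""

    @abstractmethod
    def submit(self, func: Callable[..., object], *args: object) -> int:
        """Queue a job and return its ID."""

    @abstractmethod
    def wait(self) -> dict[int, object]:
        """Return results keyed by job ID once all jobs have finished."""


class InProcessBatchManager(BatchManager):
    """Runs each job in the current process, as a local backend might."""

    def __init__(self) -> None:
        self._results: dict[int, object] = {}

    def submit(self, func, *args):
        job_id = len(self._results)
        self._results[job_id] = func(*args)  # executed eagerly, no queue
        return job_id

    def wait(self):
        return dict(self._results)


def run_regions(manager: BatchManager, region_ids: list[str]) -> dict[int, object]:
    # The caller depends only on the interface, so a cloud backend
    # could be substituted without changing this function.
    for region_id in region_ids:
        manager.submit(str.upper, region_id)  # placeholder job
    return manager.wait()
```

Because `run_regions` only knows about `BatchManager`, the same calling code can target local or cloud execution by passing a different implementation.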

Pipeline (`pipeline/`)#

Internal processing modules coordinated by the CLI. These implement the data processing workflow:

process_region.py - Sample risk values to building locations
partition.py - Partition GeoParquet by geographic regions
fire_wind_risk_regional_aggregator.py - Compute regional statistics with DuckDB
create_pyramid.py - Generate ndpyramid multiscale Zarr for web visualization
create_building_pmtiles.py - Generate PMTiles for building footprint visualization
create_building_centroid_pmtiles.py - Generate PMTiles for building centroid visualization
create_regional_pmtiles.py - Generate PMTiles for regional aggregated statistics
write_aggregated_region_analysis_files.py - Write regional summary tables for all regions
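The first and third stages above (sampling risk at building locations, then computing regional statistics) can be illustrated with a toy example in pure Python. The grid, the building records, and both function names are hypothetical; the real pipeline works on Zarr and GeoParquet data and uses DuckDB for aggregation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical 3x3 risk raster; the cell at [row][col] covers
# x in [col, col + 1) and y in [row, row + 1).
RISK_GRID = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]

# Hypothetical buildings: (building_id, region, x, y)
BUILDINGS = [
    ("b1", "north", 0.5, 0.5),
    ("b2", "north", 2.2, 0.9),
    ("b3", "south", 1.5, 2.5),
]


def sample_risk(x: float, y: float) -> float:
    """Return the risk of the grid cell that contains the point."""
    return RISK_GRID[int(y)][int(x)]


def regional_mean_risk(buildings):
    """Sample each building, then average the samples per region."""
    by_region = defaultdict(list)
    for _building_id, region, x, y in buildings:
        by_region[region].append(sample_risk(x, y))
    return {region: mean(values) for region, values in by_region.items()}
```

Calling `regional_mean_risk(BUILDINGS)` returns one mean risk value per region, which is the shape of output the later PMTiles and summary-table stages would consume.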

Risk models (`risks/`)#

Domain-specific risk calculation logic:

fire.py - Fire/wind risk kernels, wind classification, elliptical spread models
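As a rough illustration of what a wind classification step can look like, the sketch below bins a wind direction in degrees into one of eight compass sectors. The number of sectors, the labels, and the function name are assumptions for this example; the classification in fire.py may differ.

```python
def classify_wind_direction(degrees: float) -> str:
    """Map a wind direction in degrees to one of 8 compass sectors."""
    sectors = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
    # Each sector spans 45 degrees and is centered on its bearing,
    # so shift by half a sector (22.5) before dividing.
    index = int(((degrees % 360) + 22.5) // 45) % 8
    return sectors[index]
```

The modulo operations make the function tolerant of negative angles and of values at or above 360.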

Input datasets (`input_datasets/`)#

Infrastructure for ingesting and processing input datasets:

cli.py - CLI application for dataset ingestion (ocr ingest-data command)
base.py - Abstract base classes for dataset processors
storage.py - Storage utilities for managing dataset files
tensor/ - Tensor (raster) dataset ingestion modules
vector/ - Vector (GeoParquet) dataset ingestion modules

Data management (`input-data/`)#

Organized storage for input datasets and ingestion scripts:

Tensor data (`tensor/`)#

conus404/ - CONUS404 climate reanalysis data (wind speed, direction, temperature, etc.)

Vector data (`vector/`)#

alexandre-2016/ - Historical fire perimeter data
calfire_stuctures_destroyed/ - Structure damage records from CalFire

Note

Raw data files are typically not committed. This directory contains ingestion scripts and metadata. Large datasets are stored on S3.

Research notebooks (`notebooks/`)#

Exploratory Jupyter notebooks for prototyping and analysis:

conus404-winds.ipynb - Wind data exploration and CONUS404 analysis
elliptical_kernel.ipynb - Fire spread kernel development
evaluating_wind_spreading.ipynb - Wind spreading validation
fire-weather-wind-mode-reprojected.ipynb - Wind mode analysis
wind_spread.ipynb - Wind-driven fire spread modeling
wind-spreading-kernels.ipynb - Wind spread kernel experiments
methods-figures.ipynb - Generate figures for methodology documentation
benchmarking.ipynb - Performance benchmarking experiments

Note

Convention: When a notebook reaches maturity and demonstrates stable workflows, consider converting it into a how-to guide under docs/how-to/.

Documentation (`docs/`)#

docs/
├── how-to/                    # Task-oriented guides
├── reference/                 # Information-oriented technical specs
├── methods/                   # Explanation-oriented background
├── assets/                    # Images, stylesheets, static files
├── access-data.md             # Quick reference for downloads
├── terms-of-data-access.md    # Terms that apply to downloads
└── index.md                   # Documentation home page

Documentation is built with Sphinx using the sphinx-book-theme and deployed automatically to ReadTheDocs on every PR and merge to main.

Testing (`tests/`)#

Comprehensive test suite with unit and integration tests:

| File | Purpose |
| --- | --- |
| `conftest.py` | Pytest fixtures and configuration |
| `test_config.py` | Configuration model validation |
| `test_conus404.py` | CONUS404 data loading and transformations |
| `test_datasets.py` | Dataset catalog and access patterns |
| `test_managers.py` | Batch manager orchestration logic |
| `test_utils.py` | Utility function tests |
| `test_pipeline_snapshots.py` | Snapshot-based integration tests for pipeline outputs |
| `risks/` | Risk model tests |

Test execution:

pixi run tests              # Unit tests only
pixi run tests-integration  # Integration tests (may require S3 access)
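For readers unfamiliar with snapshot testing, the idea behind the snapshot-based tests can be sketched with the standard library alone. This is a simplified stand-in, not the helpers in `testing.py`: the function names and the JSON format are assumptions, and the real extensions operate on xarray and GeoPandas objects.

```python
import json
import tempfile
from pathlib import Path


def assert_matches_snapshot(result: dict, snapshot_path: Path, update: bool = False) -> None:
    """Compare result with the stored snapshot; record it on first run or when update=True."""
    serialized = json.dumps(result, sort_keys=True, indent=2)
    if update or not snapshot_path.exists():
        snapshot_path.write_text(serialized)
        return
    if snapshot_path.read_text() != serialized:
        raise AssertionError(f"snapshot mismatch: {snapshot_path}")


def check_snapshot_roundtrip() -> bool:
    """Record a snapshot, match it, then confirm a changed value is rejected."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "region_stats.json"
        assert_matches_snapshot({"mean_risk": 0.2}, path)  # first run records
        assert_matches_snapshot({"mean_risk": 0.2}, path)  # second run matches
        try:
            assert_matches_snapshot({"mean_risk": 0.3}, path)
        except AssertionError:
            return True  # the regression was detected
        return False
```

The stored file acts as the expected output, so an unintended change in pipeline results shows up as a failing comparison rather than requiring hand-written expected values.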

Configuration files#

Package and environment#

pyproject.toml - Project metadata, dependencies (managed by Pixi), build config, tool settings (ruff, pytest, coverage)
pixi.lock - Locked dependency versions for reproducible environments
environment.yaml - Conda environment export (auto-generated from Pixi for Coiled deployments)

Documentation#

docs/conf.py - Sphinx configuration: theme, extensions, intersphinx mappings
.readthedocs.yaml - ReadTheDocs build configuration

Environment templates#

ocr-local.env - Template for local development (uses local filesystem)
ocr-coiled-s3.env - Template for cloud execution (S3 backend)
ocr-coiled-s3-staging.env - Staging environment configuration
ocr-coiled-s3-production.env - Production environment configuration
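Env templates of this kind are plain `KEY=VALUE` files. The sketch below shows one way such a file could be read into a dictionary; it is illustrative only, the variable names in `EXAMPLE` are invented, and the project may load these files through a different mechanism.

```python
def parse_env_file(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional double quotes.
        settings[key.strip()] = value.strip().strip('"')
    return settings


# Hypothetical contents; these names are not taken from the real templates.
EXAMPLE = """
# storage backend
EXAMPLE_STORAGE_ROOT="/tmp/ocr-data"
EXAMPLE_ENVIRONMENT=staging
"""
```

Keeping one template per environment means switching between local, staging, and production is a matter of pointing at a different file rather than editing code.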

Code quality#

.pre-commit-config.yaml - Pre-commit hooks for linting and formatting
.prettierrc.json - Prettier configuration for Markdown/YAML formatting
codecov.yml - Code coverage reporting configuration

Infrastructure (`bucket_creation/`)#

Helper scripts for cloud infrastructure setup:

create_s3_bucket.py - Script to create and configure S3 buckets with appropriate permissions and lifecycle policies

CI/CD (`.github/`)#

GitHub Actions workflows for automated testing, building, and deployment:

workflows/ - CI/CD pipeline definitions (tests, linting, docs deployment, releases)
scripts/ - Helper scripts for environment export and Coiled software creation
dependabot.yaml - Automated dependency updates configuration
release-drafter.yml - Automated release notes generation

Development workflows#

Adding new code#

1. Create a module under ocr/ (or in the appropriate subpackage).
2. Add tests under tests/ (unit tests are required; add integration tests for complex scenarios).
3. Update documentation:
   - Add a how-to guide if introducing a new user-facing workflow
   - Update the API reference if adding public functions or classes
   - Add a method explanation if introducing a new scientific approach

Adding new datasets#

1. Create an ingestion script under input-data/tensor/ or input-data/vector/.
2. Register the dataset in the ocr.datasets catalog with metadata.
3. Document provenance: add the new source information to docs/reference/data-sources.md.
4. Document ingestion: add information to docs/how-to/input-dataset-ingestion.md.
5. Document workflow: add information to docs/how-to/work-with-input-datasets.md.
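The registration step can be pictured with a small catalog sketch. Everything here is hypothetical, including the `DatasetEntry` fields, the `Catalog` methods, and the example bucket path; consult `ocr/datasets.py` for the actual catalog API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetEntry:
    """Hypothetical catalog record; field names are illustrative."""

    name: str
    data_format: str  # for example "zarr" or "geoparquet"
    path: str
    description: str = ""


@dataclass
class Catalog:
    _entries: dict[str, DatasetEntry] = field(default_factory=dict)

    def register(self, entry: DatasetEntry) -> None:
        # Reject duplicates so a new dataset cannot silently shadow an old one.
        if entry.name in self._entries:
            raise ValueError(f"dataset already registered: {entry.name}")
        self._entries[entry.name] = entry

    def get(self, name: str) -> DatasetEntry:
        try:
            return self._entries[name]
        except KeyError:
            raise KeyError(f"unknown dataset: {name}") from None


catalog = Catalog()
catalog.register(
    DatasetEntry(
        name="example-winds",
        data_format="zarr",
        path="s3://example-bucket/example-winds.zarr",  # placeholder path
        description="Placeholder entry for illustration",
    )
)
```

Registering metadata alongside the path gives downstream code a single lookup point, so pipeline modules can refer to datasets by name instead of hard-coding storage locations.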

Updating documentation#

1. Choose the appropriate section based on the Diátaxis framework:
   - How-to guides: task-oriented, assume prior knowledge
   - Reference: information-oriented, technical specifications
   - Methods: explanation-oriented, scientific background
2. Update the navigation toctree in docs/index.md if adding new top-level pages.
3. Test locally: run pixi run docs-build && pixi run docs-serve to preview changes.
4. Submit a PR: documentation builds are tested on ReadTheDocs (a PR preview link will be posted).

Release workflow#

See Release procedure for detailed release instructions.
