Input dataset ingestion#

This guide covers how to ingest and process input datasets for the Open Climate Risk (OCR) project using the unified CLI infrastructure.

Overview#

The input dataset infrastructure provides a consistent interface for ingesting both tensor (raster/Icechunk) and vector (GeoParquet) datasets.

Quick start#

Discovery#

List all available datasets:

pixi run ocr ingest-data list-datasets

Processing#

Process a dataset (always do a dry run first to preview the operations):

# Preview operations (recommended first step)
pixi run ocr ingest-data run-all scott-et-al-2024 --dry-run

# Execute the full pipeline
pixi run ocr ingest-data run-all scott-et-al-2024

# Use Coiled for distributed processing
pixi run ocr ingest-data run-all scott-et-al-2024 --use-coiled

Dataset-specific options#

Different datasets support different processing options:

# Vector datasets: Overture Maps - select data type
pixi run ocr ingest-data process overture-maps --overture-data-type buildings

# Vector datasets: Census TIGER - select geography and states
pixi run ocr ingest-data process census-tiger \
  --census-geography-type tracts \
  --census-subset-states California --census-subset-states Oregon

Available datasets#

Our processed input datasets have been transferred to a public AWS S3 bucket in us-west-2, hosted by the Source Cooperative project.
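
The output locations listed in this section are given as S3 URIs, while most S3 clients expect a bucket name and an object key. The following is a minimal sketch of that split using only the Python standard library; `split_s3_uri` is a hypothetical helper for illustration and is not part of the OCR codebase.

```python
from urllib.parse import urlsplit


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key). Hypothetical helper."""
    parts = urlsplit(uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The object key is the URI path without its leading slash.
    return parts.netloc, parts.path.lstrip("/")


bucket, key = split_s3_uri(
    "s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/"
    "input/fire-risk/vector/aggregated_regions/counties/counties.parquet"
)
```

Here `bucket` holds the bucket name and `key` the full object path, which can then be passed to whichever S3 client you prefer.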

Tensor datasets (raster/Icechunk)#

scott-et-al-2024#

USFS Wildfire Risk to Communities (2nd Edition)

  • RDS ID: RDS-2020-0016-02

  • Version: 2024-V2

  • Source: USFS Research Data Archive

  • Resolution: 30m (EPSG:4326), native 270m (EPSG:5070)

  • Coverage: CONUS

  • Variables: BP (Burn Probability), CRPS (Conditional Risk to Potential Structures), CFL (Conditional Flame Length), Exposure, FLEP4, FLEP8, RPS (Risk to Potential Structures), WHP (Wildfire Hazard Potential)

Pipeline:

  1. Download 8 TIFF files from USFS Box (one per variable)

  2. Merge TIFFs into Icechunk store (EPSG:5070, native resolution)

  3. Reproject to EPSG:4326 at 30m resolution

Usage:

# Full pipeline
pixi run ocr ingest-data run-all scott-et-al-2024 --dry-run
pixi run ocr ingest-data run-all scott-et-al-2024 --use-coiled

# Individual steps
pixi run ocr ingest-data download scott-et-al-2024
pixi run ocr ingest-data process scott-et-al-2024 --use-coiled

Outputs:

  • Raw TIFFs: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/RDS-2020-0016-02/input_tif/

  • Native Icechunk: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/RDS-2020-0016-02_all_vars_merge_icechunk/

  • Reprojected: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/scott-et-al-2024-30m-4326.icechunk/


riley-et-al-2025#

USFS Probabilistic Wildfire Risk - 2011 & 2047 climate runs

  • RDS ID: RDS-2025-0006

  • Version: 2025

  • Source: USFS Research Data Archive

  • Resolution: 30m (EPSG:4326), native 270m (EPSG:5070)

  • Coverage: CONUS

  • Variables: Multiple climate scenarios (2011 baseline, 2047 projections)

Pipeline:

  1. Download TIFF files for both time periods

  2. Process and merge into Icechunk stores

  3. Reproject to EPSG:4326 at 30m resolution

Usage:

pixi run ocr ingest-data run-all riley-et-al-2025 --use-coiled

Outputs:

  • Reprojected: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/riley-et-al-2025-30m-4326.icechunk/


dillon-et-al-2023#

USFS Spatial Datasets of Probabilistic Wildfire Risk Components (270m, 3rd Edition)

  • RDS ID: RDS-2016-0034-3

  • Version: 2023

  • Source: USFS Research Data Archive

  • Resolution: 30m (EPSG:4326), native 270m (EPSG:5070)

  • Coverage: CONUS

  • Variables: BP, FLP1-6 (Flame Length Probability levels)

Pipeline:

  1. Download ZIP archive and extract TIFFs

  2. Upload TIFFs to S3 and merge into Icechunk

  3. Reproject to EPSG:4326 at 30m resolution
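
Because EPSG:4326 coordinates are in degrees, a "30m" target resolution has to be expressed as an angular grid spacing. The arithmetic below is a rough sketch under a spherical approximation of about 111.32 km per degree; the exact spacing used by the OCR pipeline is not stated in this guide, and `meters_to_degrees` is a hypothetical helper.

```python
import math

# Approximate length of one degree of arc at the equator, in meters (assumption).
METERS_PER_DEGREE = 111_320.0


def meters_to_degrees(meters: float, latitude_deg: float = 0.0) -> tuple[float, float]:
    """Return an approximate (dlat, dlon) in degrees for a ground distance."""
    dlat = meters / METERS_PER_DEGREE
    # Longitude degrees shrink with the cosine of the latitude.
    dlon = meters / (METERS_PER_DEGREE * math.cos(math.radians(latitude_deg)))
    return dlat, dlon


dlat, dlon = meters_to_degrees(30.0)              # spacing at the equator
dlat_40, dlon_40 = meters_to_degrees(30.0, 40.0)  # spacing at a mid-CONUS latitude

# Going from a 270m native grid to a 30m grid refines each axis by this factor.
refinement = round(270 / 30)
cells_per_native_cell = refinement ** 2
```

Each 270m native cell therefore maps onto roughly `cells_per_native_cell` cells of the reprojected grid, which is why the reprojected stores are much larger than the native ones.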

Usage:

pixi run ocr ingest-data run-all dillon-et-al-2023 --use-coiled

Outputs:

  • Raw TIFFs: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/dillon-et-al-2023/raw-input-tiffs/

  • Native Icechunk: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/dillon-et-al-2023/processed-270m-5070.icechunk/

  • Reprojected: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/dillon-et-al-2023/processed-30m-4326.icechunk/


Vector datasets (GeoParquet)#

overture-maps#

Overture Maps building and address data for CONUS

  • Release: 2025-09-24.0

  • Source: Overture Maps Foundation

  • Format: GeoParquet (WKB geometry, zstd compression)

  • Coverage: CONUS (spatially filtered from global dataset)

  • Data types: Buildings (bbox + geometry), Addresses (full attributes), Region-Tagged Buildings (buildings + census identifiers)

Pipeline:

  1. Query Overture S3 bucket directly (no download step)

  2. Filter by CONUS bounding box using DuckDB

  3. Write subsetted data to carbonplan-ocr S3 bucket

  4. (If buildings processed) Perform spatial join with US Census blocks to add geographic identifiers
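
The bounding-box filter in step 2 amounts to a rectangle-overlap test on each feature's bbox. The sketch below shows that test in plain Python rather than DuckDB; the CONUS extent and the two sample features are illustrative values, not the ones used by the pipeline.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


# Approximate CONUS extent in degrees (assumption, for illustration only).
CONUS = BBox(xmin=-125.0, ymin=24.0, xmax=-66.0, ymax=50.0)


def intersects(a: BBox, b: BBox) -> bool:
    """True when two axis-aligned boxes overlap or touch."""
    return (
        a.xmin <= b.xmax
        and a.xmax >= b.xmin
        and a.ymin <= b.ymax
        and a.ymax >= b.ymin
    )


# Made-up feature bounding boxes: one inside CONUS, one in Hawaii.
features = [
    ("seattle_building", BBox(-122.3351, 47.6080, -122.3349, 47.6082)),
    ("honolulu_building", BBox(-157.8584, 21.3069, -157.8582, 21.3071)),
]
kept = [name for name, box in features if intersects(box, CONUS)]
```

A bbox test like this is cheap compared with exact geometry predicates, which is why it is commonly applied before any precise spatial operation.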

Region-tagged buildings processing:

When buildings are processed, an additional dataset is automatically created that tags each building with census geographic identifiers:

  • Loads census FIPS lookup table for state/county names

  • Creates spatial indexes on buildings and census blocks

  • Performs bbox-filtered spatial join using ST_Intersects

  • Adds identifiers at multiple administrative levels: state, county, tract, block group, and block
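
The administrative levels listed above nest inside the 15-digit Census block GEOID (2-digit state, 3-digit county, 6-digit tract, 4-digit block, with the block group given by the first digit of the block code). The sketch below derives each level from a block GEOID; it is only an illustration of that structure, the key names are hypothetical, and the pipeline may obtain these identifiers differently.

```python
def split_block_geoid(geoid: str) -> dict[str, str]:
    """Derive nested GEOIDs from a 15-digit census block GEOID."""
    if len(geoid) != 15 or not geoid.isdigit():
        raise ValueError(f"expected a 15-digit block GEOID, got {geoid!r}")
    return {
        "state": geoid[:2],          # state FIPS
        "county": geoid[:5],         # state + county
        "tract": geoid[:11],         # state + county + tract
        "block_group": geoid[:12],   # tract + first digit of the block code
        "block": geoid,              # full block GEOID
    }


# Illustrative GEOID only; it is not taken from the OCR outputs.
levels = split_block_geoid("060014001001000")
```

Keeping GEOIDs as strings preserves leading zeros, which matter for states such as California (`06`).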

Usage:

# Both buildings and addresses (default)
# Also creates region-tagged buildings automatically
pixi run ocr ingest-data run-all overture-maps

# Only buildings (also creates region-tagged buildings)
pixi run ocr ingest-data process overture-maps --overture-data-type buildings

# Only addresses (no region tagging)
pixi run ocr ingest-data process overture-maps --overture-data-type addresses

# Dry run
pixi run ocr ingest-data run-all overture-maps --dry-run

# Use Coiled for distributed processing
pixi run ocr ingest-data run-all overture-maps --use-coiled

Outputs:

  • Buildings: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/overture-maps/CONUS-overture-buildings-2025-09-24.0.parquet

  • Addresses: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/overture-maps/CONUS-overture-addresses-2025-09-24.0.parquet

  • Region-Tagged Buildings: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/overture-maps/CONUS-overture-region-tagged-buildings-2025-09-24.0.parquet


census-tiger#

US Census TIGER/Line geographic boundaries

  • Vintage: 2024 (tracts/counties), 2025 (blocks)

  • Source: US Census Bureau TIGER/Line

  • Format: GeoParquet (WKB geometry, zstd compression, schema v1.1.0)

  • Coverage: CONUS, i.e. the 48 contiguous states plus DC (49 state-level units; excludes Alaska and Hawaii)

  • Geography types: Blocks, Tracts, Counties

Pipeline:

  1. Download TIGER/Line shapefiles from Census Bureau (per-state for blocks/tracts)

  2. Convert to GeoParquet with spatial metadata

  3. Aggregate tract files using DuckDB

Usage:

# All geography types (default)
pixi run ocr ingest-data run-all census-tiger

# Only counties
pixi run ocr ingest-data process census-tiger --census-geography-type counties

# Tracts for specific states
pixi run ocr ingest-data process census-tiger --census-geography-type tracts \
  --census-subset-states California --census-subset-states Oregon

# Dry run
pixi run ocr ingest-data run-all census-tiger --dry-run

Outputs:

  • Blocks: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/aggregated_regions/blocks/blocks.parquet

  • Tracts (per-state): s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/aggregated_regions/tracts/FIPS/FIPS_*.parquet

  • Tracts (aggregated): s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/aggregated_regions/tracts/tracts.parquet

  • Counties: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/aggregated_regions/counties/counties.parquet

CLI reference#

Commands#

  • list-datasets: Show all available datasets

  • download <dataset>: Download raw source data (tensor datasets only)

  • process <dataset>: Process and upload to S3/Icechunk

  • run-all <dataset>: Complete pipeline (download + process + cleanup)

Global options#

  • --dry-run: Preview operations without executing (recommended before any real run)

  • --debug: Enable debug logging for troubleshooting

Tensor dataset options#

  • --use-coiled: Use Coiled for distributed processing (USFS datasets)

Vector dataset options#

Overture Maps#

  • --overture-data-type <type>: Which data to process

    • buildings: Only building geometries

    • addresses: Only address points

    • both: Both datasets (default)

Census TIGER#

  • --census-geography-type <type>: Which geography to process

    • blocks: Census blocks

    • tracts: Census tracts (per-state + aggregated)

    • counties: County boundaries

    • all: All three types (default)

  • --census-subset-states <state>: Process only specific states (repeatable)

    • Repeat option for each state: --census-subset-states California --census-subset-states Oregon

    • Use full state names (case-sensitive): California, Oregon, Washington, etc.
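
When the state list comes from a script, the repeated-flag form can be generated rather than typed by hand. The following sketch assembles the argument list shown in this guide; `census_tiger_command` is a hypothetical helper, and it only builds the command without running it.

```python
import shlex
from typing import Sequence


def census_tiger_command(geography: str = "all", states: Sequence[str] = ()) -> list[str]:
    """Build the `process census-tiger` command as an argument list."""
    cmd = [
        "pixi", "run", "ocr", "ingest-data", "process", "census-tiger",
        "--census-geography-type", geography,
    ]
    # The option is repeated once per state instead of taking a list.
    for state in states:
        cmd += ["--census-subset-states", state]
    return cmd


cmd = census_tiger_command("tracts", ["California", "New Mexico"])
# shlex.join quotes arguments containing spaces for safe pasting into a shell.
printable = shlex.join(cmd)
```

Passing `cmd` as a list to `subprocess.run` avoids shell quoting problems with multi-word state names such as New Mexico.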

Configuration#

Environment variables#

All settings can be overridden via environment variables:

# S3 configuration
export OCR_INPUT_DATASET_S3_BUCKET=my-bucket
export OCR_INPUT_DATASET_S3_REGION=us-east-1
export OCR_INPUT_DATASET_BASE_PREFIX=custom/prefix

# Processing options
export OCR_INPUT_DATASET_CHUNK_SIZE=16384
export OCR_INPUT_DATASET_DEBUG=true

# Temporary storage
export OCR_INPUT_DATASET_TEMP_DIR=/path/to/temp

Configuration class#

The InputDatasetConfig class (Pydantic model) provides:

  • Type validation for all settings

  • Automatic environment variable loading (prefix: OCR_INPUT_DATASET_)

  • Default values for all options

  • Case-insensitive environment variable names
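
To make the prefix and case-insensitivity behaviour concrete, here is a standard-library sketch of the same idea. It is not the actual InputDatasetConfig (which is a Pydantic model); the class name, the subset of fields, and the default values are placeholders chosen for illustration.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional

PREFIX = "OCR_INPUT_DATASET_"


@dataclass
class ConfigSketch:
    # Hypothetical fields and defaults, loosely mirroring the variables above.
    s3_bucket: str = "example-bucket"
    chunk_size: int = 8192
    debug: bool = False


def load_config(environ: Optional[Mapping[str, str]] = None) -> ConfigSketch:
    """Overlay prefixed environment variables on the defaults."""
    source = os.environ if environ is None else environ
    # Normalise names to upper case so lookups are case-insensitive.
    upper = {name.upper(): value for name, value in source.items()}
    cfg = ConfigSketch()
    for field in fields(ConfigSketch):
        raw = upper.get(PREFIX + field.name.upper())
        if raw is None:
            continue  # keep the default
        if isinstance(field.default, bool):  # check bool before int
            value = raw.strip().lower() in {"1", "true", "yes", "on"}
        elif isinstance(field.default, int):
            value = int(raw)  # raises ValueError on bad input
        else:
            value = raw
        setattr(cfg, field.name, value)
    return cfg


cfg = load_config({
    "ocr_input_dataset_chunk_size": "16384",
    "OCR_INPUT_DATASET_DEBUG": "true",
})
```

In this example `chunk_size` and `debug` come from the supplied mapping while `s3_bucket` keeps its default; a non-numeric chunk size would raise a `ValueError`, which is a simplified stand-in for type validation.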

Troubleshooting#

Dry run first#

Always test with --dry-run before executing:

pixi run ocr ingest-data run-all <dataset> --dry-run

This previews all operations without making changes.
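
If you script ingestion, the dry-run-first habit can be enforced in code. The sketch below runs the preview, then the real pipeline only if the preview exits cleanly and a confirmation callback agrees. `preview_then_run` and its parameters are hypothetical, and it assumes that a failed dry run returns a non-zero exit code, which this guide does not confirm.

```python
import subprocess
from typing import Callable, Sequence

BASE = ["pixi", "run", "ocr", "ingest-data", "run-all"]


def preview_then_run(
    dataset: str,
    runner: Callable[[Sequence[str]], "subprocess.CompletedProcess"] = subprocess.run,
    confirm: Callable[[], bool] = lambda: True,
) -> bool:
    """Run a dry run first; execute the real pipeline only if it succeeds."""
    preview = runner([*BASE, dataset, "--dry-run"])
    if preview.returncode != 0 or not confirm():
        return False  # stop before making any changes
    result = runner([*BASE, dataset])
    return result.returncode == 0


# Example (not executed here): ask the user before the real run.
# preview_then_run("census-tiger", confirm=lambda: input("Proceed? [y/N] ") == "y")
```

The `runner` parameter exists so the function can be exercised with a stub in tests instead of launching real jobs.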