Input dataset ingestion#

This guide covers how to ingest and process input datasets for the Open Climate Risk (OCR) project using the unified CLI infrastructure.

Overview#

The input dataset infrastructure provides a consistent interface for ingesting both tensor (raster/Icechunk) and vector (GeoParquet) datasets.

Quick start#

Discovery#

List all available datasets:

pixi run ocr ingest-data list-datasets

Processing#

Process a dataset (always dry run first to preview):

# Preview operations (recommended first step)
pixi run ocr ingest-data run-all scott-et-al-2024 --dry-run

# Execute the full pipeline
pixi run ocr ingest-data run-all scott-et-al-2024

# Use Coiled for distributed processing
pixi run ocr ingest-data run-all scott-et-al-2024 --use-coiled

Dataset-specific options#

Different datasets support different processing options:

# Vector datasets: Overture Maps - select data type
pixi run ocr ingest-data process overture-maps --overture-data-type buildings

# Vector datasets: Census TIGER - select geography and states
pixi run ocr ingest-data process census-tiger \
  --census-geography-type tracts \
  --census-subset-states California --census-subset-states Oregon

Available datasets#

Our processed input datasets have been transferred to a public AWS S3 bucket in us-west-2 hosted by the Source Cooperative project.

Tensor datasets (raster/Icechunk)#

scott-et-al-2024#

USFS Wildfire Risk to Communities (2nd Edition)

RDS ID: RDS-2020-0016-02
Version: 2024-V2
Source: USFS Research Data Archive
Resolution: 30m (EPSG:4326), native 270m (EPSG:5070)
Coverage: CONUS
Variables: BP (Burn Probability), CRPS (Conditional Risk to Potential Structures), CFL (Conditional Flame Length), Exposure, FLEP4 and FLEP8 (Flame Length Exceedance Probability at 4 ft and 8 ft), RPS (Risk to Potential Structures), WHP (Wildfire Hazard Potential)

Pipeline:

Download 8 TIFF files from USFS Box (one per variable)
Merge TIFFs into Icechunk store (EPSG:5070, native resolution)
Reproject to EPSG:4326 at 30m resolution

Usage:

# Full pipeline
pixi run ocr ingest-data run-all scott-et-al-2024 --dry-run
pixi run ocr ingest-data run-all scott-et-al-2024 --use-coiled

# Individual steps
pixi run ocr ingest-data download scott-et-al-2024
pixi run ocr ingest-data process scott-et-al-2024 --use-coiled

Outputs:

Raw TIFFs: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/RDS-2020-0016-02/input_tif/
Native Icechunk: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/RDS-2020-0016-02_all_vars_merge_icechunk/
Reprojected: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/scott-et-al-2024-30m-4326.icechunk/
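
To read the reprojected output listed above, something like the following sketch may work. It is not taken from the OCR codebase. It assumes the bucket allows anonymous reads, that the data sits on the store's "main" branch, and that recent versions of icechunk, xarray and zarr are installed. The helper names split_s3_uri and open_reprojected_store are hypothetical.

```python
from urllib.parse import urlparse

STORE_URI = (
    "s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/"
    "input/fire-risk/tensor/USFS/scott-et-al-2024-30m-4326.icechunk/"
)


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key prefix without surrounding slashes)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an s3 URI: {uri}")
    return parsed.netloc, parsed.path.strip("/")


def open_reprojected_store(uri: str = STORE_URI):
    """Open the store read-only; needs icechunk, xarray, zarr and network access."""
    import icechunk  # imported lazily so split_s3_uri works without it
    import xarray as xr

    bucket, prefix = split_s3_uri(uri)
    # Assumption: anonymous reads are allowed and the bucket is in us-west-2.
    storage = icechunk.s3_storage(
        bucket=bucket, prefix=prefix, region="us-west-2", anonymous=True
    )
    repo = icechunk.Repository.open(storage)
    session = repo.readonly_session("main")  # assumes a "main" branch exists
    return xr.open_zarr(session.store, consolidated=False)
```

Calling open_reprojected_store() should return a lazily loaded xarray Dataset, so no raster data is transferred until a variable is actually read.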

riley-et-al-2025#

USFS Probabilistic Wildfire Risk - 2011 & 2047 climate runs

RDS ID: RDS-2025-0006
Version: 2025
Source: USFS Research Data Archive
Resolution: 30m (EPSG:4326), native 270m (EPSG:5070)
Coverage: CONUS
Variables: Multiple climate scenarios (2011 baseline, 2047 projections)

Pipeline:

Download TIFF files for both time periods
Process and merge into Icechunk stores
Reproject to EPSG:4326 at 30m resolution

Usage:

pixi run ocr ingest-data run-all riley-et-al-2025 --use-coiled

Outputs:

Reprojected: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/riley-et-al-2025-30m-4326.icechunk/

dillon-et-al-2023#

USFS Spatial Datasets of Probabilistic Wildfire Risk Components (270m, 3rd Edition)

RDS ID: RDS-2016-0034-3
Version: 2023
Source: USFS Research Data Archive
Resolution: 30m (EPSG:4326), native 270m (EPSG:5070)
Coverage: CONUS
Variables: BP, FLP1-6 (Flame Length Probability levels)

Pipeline:

Download ZIP archive and extract TIFFs
Upload TIFFs to S3 and merge into Icechunk
Reproject to EPSG:4326 at 30m resolution

Usage:

pixi run ocr ingest-data run-all dillon-et-al-2023 --use-coiled

Outputs:

Raw TIFFs: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/dillon-et-al-2023/raw-input-tiffs/
Native Icechunk: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/dillon-et-al-2023/processed-270m-5070.icechunk/
Reprojected: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/tensor/USFS/dillon-et-al-2023/processed-30m-4326.icechunk/

Vector datasets (GeoParquet)#

overture-maps#

Overture Maps building and address data for CONUS

Release: 2025-09-24.0
Source: Overture Maps Foundation
Format: GeoParquet (WKB geometry, zstd compression)
Coverage: CONUS (spatially filtered from global dataset)
Data types: Buildings (bbox + geometry), Addresses (full attributes), Region-Tagged Buildings (buildings + census identifiers)

Pipeline:

Query Overture S3 bucket directly (no download step)
Filter by CONUS bounding box using DuckDB
Write subsetted data to carbonplan-ocr S3 bucket
(If buildings processed) Perform spatial join with US Census blocks to add geographic identifiers
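
The bounding-box filter in the pipeline above can be pictured as a DuckDB query over the Overture release. The sketch below only builds the SQL text and is not the project's actual query. The CONUS bounding box values, the function name conus_subset_sql, and the Overture bucket path (taken from Overture's public documentation, not from this guide) are all assumptions.

```python
# Assumed CONUS bounding box (west, south, east, north) in degrees;
# the project's exact values may differ.
CONUS_BBOX = (-125.0, 24.0, -66.0, 50.0)


def conus_subset_sql(release: str, theme: str, feature_type: str, bbox=CONUS_BBOX) -> str:
    """Build a DuckDB query selecting features whose bbox overlaps the given box."""
    west, south, east, north = bbox
    # Assumed layout of the public Overture bucket (hive-partitioned Parquet).
    source = (
        f"s3://overturemaps-us-west-2/release/{release}/"
        f"theme={theme}/type={feature_type}/*"
    )
    return (
        f"SELECT * FROM read_parquet('{source}', hive_partitioning=1)\n"
        f"WHERE bbox.xmin <= {east} AND bbox.xmax >= {west}\n"
        f"  AND bbox.ymin <= {north} AND bbox.ymax >= {south}"
    )


sql = conus_subset_sql("2025-09-24.0", "buildings", "building")
print(sql)
```

Filtering on the precomputed bbox columns lets DuckDB skip most of the global dataset without decoding any geometry.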

Region-tagged buildings processing:

When buildings are processed, an additional dataset is automatically created that tags each building with census geographic identifiers:

Loads census FIPS lookup table for state/county names
Creates spatial indexes on buildings and census blocks
Performs bbox-filtered spatial join using ST_Intersects
Adds identifiers at multiple administrative levels: state, county, tract, block group, and block
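
The idea behind a bbox-filtered join is to discard pairs whose bounding boxes cannot overlap before running the more expensive exact geometry test. The pure-Python sketch below shows only that prefilter step. The BBox type, the function names, and the sample identifiers are hypothetical, and the real pipeline performs this inside DuckDB rather than in Python.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def bboxes_overlap(a: BBox, b: BBox) -> bool:
    """True when the two axis-aligned boxes share at least one point."""
    return (
        a.xmin <= b.xmax and a.xmax >= b.xmin
        and a.ymin <= b.ymax and a.ymax >= b.ymin
    )


def candidate_pairs(
    buildings: dict[str, BBox], blocks: dict[str, BBox]
) -> list[tuple[str, str]]:
    """Return (building_id, block_id) pairs that survive the cheap bbox test.

    Only these pairs would then go through an exact test such as ST_Intersects.
    """
    return [
        (building_id, block_id)
        for building_id, building_box in buildings.items()
        for block_id, block_box in blocks.items()
        if bboxes_overlap(building_box, block_box)
    ]


buildings = {"b1": BBox(0.0, 0.0, 1.0, 1.0), "b2": BBox(5.0, 5.0, 6.0, 6.0)}
blocks = {"blockA": BBox(0.5, 0.5, 2.0, 2.0), "blockB": BBox(10.0, 10.0, 11.0, 11.0)}
print(candidate_pairs(buildings, blocks))
```

This nested loop is quadratic and only illustrates the predicate. The spatial indexes mentioned above exist to avoid comparing every building with every block.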

Usage:

# Both buildings and addresses (default)
# Also creates region-tagged buildings automatically
pixi run ocr ingest-data run-all overture-maps

# Only buildings (also creates region-tagged buildings)
pixi run ocr ingest-data process overture-maps --overture-data-type buildings

# Only addresses (no region tagging)
pixi run ocr ingest-data process overture-maps --overture-data-type addresses

# Dry run
pixi run ocr ingest-data run-all overture-maps --dry-run

# Use Coiled for distributed processing
pixi run ocr ingest-data run-all overture-maps --use-coiled

Outputs:

Buildings: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/overture-maps/CONUS-overture-buildings-2025-09-24.0.parquet
Addresses: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/overture-maps/CONUS-overture-addresses-2025-09-24.0.parquet
Region-Tagged Buildings: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/overture-maps/CONUS-overture-region-tagged-buildings-2025-09-24.0.parquet

census-tiger#

US Census TIGER/Line geographic boundaries

Vintage: 2024 (tracts/counties), 2025 (blocks)
Source: US Census Bureau TIGER/Line
Format: GeoParquet (WKB geometry, zstd compression, schema v1.1.0)
Coverage: CONUS + DC (48 contiguous states plus the District of Columbia; excludes Alaska and Hawaii)
Geography types: Blocks, Tracts, Counties

Pipeline:

Download TIGER/Line shapefiles from Census Bureau (per-state for blocks/tracts)
Convert to GeoParquet with spatial metadata
Aggregate tract files using DuckDB

Usage:

# All geography types (default)
pixi run ocr ingest-data run-all census-tiger

# Only counties
pixi run ocr ingest-data process census-tiger --census-geography-type counties

# Tracts for specific states
pixi run ocr ingest-data process census-tiger --census-geography-type tracts \
  --census-subset-states California --census-subset-states Oregon

# Dry run
pixi run ocr ingest-data run-all census-tiger --dry-run

Outputs:

Blocks: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/aggregated_regions/blocks/blocks.parquet
Tracts (per-state): s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/aggregated_regions/tracts/FIPS/FIPS_*.parquet
Tracts (aggregated): s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/aggregated_regions/tracts/tracts.parquet
Counties: s3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/input/fire-risk/vector/aggregated_regions/counties/counties.parquet

CLI reference#

Commands#

list-datasets: Show all available datasets
download <dataset>: Download raw source data (tensor datasets only)
process <dataset>: Process and upload to S3/Icechunk
run-all <dataset>: Complete pipeline (download + process + cleanup)

Global options#

--dry-run: Preview operations without executing (recommended before any real run)
--debug: Enable debug logging for troubleshooting

Tensor dataset options#

--use-coiled: Use Coiled for distributed processing (USFS datasets)

Vector dataset options#

Overture Maps#

--overture-data-type <type>: Which data to process
- buildings: Only building geometries
- addresses: Only address points
- both: Both datasets (default)

Census TIGER#

--census-geography-type <type>: Which geography to process
- blocks: Census blocks
- tracts: Census tracts (per-state + aggregated)
- counties: County boundaries
- all: All three types (default)
--census-subset-states <state>: Process only specific states
- Repeat option for each state: --census-subset-states California --census-subset-states Oregon
- Use full state names (case-sensitive): California, Oregon, Washington, etc.
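
When scripting these commands, the repeated-option form is easy to generate from a list of states. The following is a minimal sketch that uses only the flags documented above. The function name census_tiger_command is hypothetical, and the sketch does not check state names.

```python
import shlex
from typing import Optional, Sequence


def census_tiger_command(
    geography: str = "all",
    states: Optional[Sequence[str]] = None,
    dry_run: bool = False,
) -> list[str]:
    """Build the argv for `ocr ingest-data process census-tiger`."""
    cmd = [
        "pixi", "run", "ocr", "ingest-data", "process", "census-tiger",
        "--census-geography-type", geography,
    ]
    # The option is repeated once per state rather than taking a list.
    for state in states or []:
        cmd += ["--census-subset-states", state]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


cmd = census_tiger_command("tracts", ["California", "Oregon"], dry_run=True)
print(shlex.join(cmd))
```

The resulting list can be passed directly to subprocess.run, which avoids shell quoting problems for state names that contain spaces.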

Configuration#

Environment variables#

All settings can be overridden via environment variables:

# S3 configuration
export OCR_INPUT_DATASET_S3_BUCKET=my-bucket
export OCR_INPUT_DATASET_S3_REGION=us-east-1
export OCR_INPUT_DATASET_BASE_PREFIX=custom/prefix

# Processing options
export OCR_INPUT_DATASET_CHUNK_SIZE=16384
export OCR_INPUT_DATASET_DEBUG=true

# Temporary storage
export OCR_INPUT_DATASET_TEMP_DIR=/path/to/temp

Configuration class#

The InputDatasetConfig class (Pydantic model) provides:

Type validation for all settings
Automatic environment variable loading (prefix: OCR_INPUT_DATASET_)
Default values for all options
Case-insensitive environment variable names
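
As a rough illustration of the behavior listed above (prefix-based loading, type coercion, defaults, case-insensitive names), here is a stdlib-only sketch. It is not the project's InputDatasetConfig, which is a Pydantic model. The names SketchConfig and load_config and every default value are hypothetical. The field names are simply the documented variable names with the prefix removed.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional

ENV_PREFIX = "OCR_INPUT_DATASET_"


@dataclass
class SketchConfig:
    # Placeholder defaults, not the project's real defaults.
    s3_bucket: str = "example-bucket"
    s3_region: str = "us-west-2"
    base_prefix: str = "example/prefix"
    chunk_size: int = 8192
    debug: bool = False
    temp_dir: Optional[str] = None


def _coerce(raw: str, default):
    """Convert the raw string to the type of the field's default value."""
    if isinstance(default, bool):  # check bool first: bool is a subclass of int
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    if isinstance(default, int):
        return int(raw)
    return raw


def load_config(environ: Optional[Mapping[str, str]] = None) -> SketchConfig:
    """Overlay prefixed environment variables onto the defaults."""
    env = os.environ if environ is None else environ
    # Case-insensitive lookup: normalise every key to upper case first.
    upper = {key.upper(): value for key, value in env.items()}
    config = SketchConfig()
    for field in fields(config):
        raw = upper.get(ENV_PREFIX + field.name.upper())
        if raw is not None:
            setattr(config, field.name, _coerce(raw, getattr(config, field.name)))
    return config


print(load_config({"OCR_INPUT_DATASET_CHUNK_SIZE": "16384"}))
```

A Pydantic settings model gives the same result declaratively and adds validation errors for malformed values, which this sketch only approximates through int().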

Troubleshooting#

Dry run first#

Always test with --dry-run before executing:

pixi run ocr ingest-data run-all <dataset> --dry-run

This previews all operations without making changes.
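
In automation, the dry-run-first habit can be enforced with a small wrapper that runs the real pipeline only if the preview succeeds. The sketch below assumes the CLI exits with a non-zero status when a dry run fails, which this guide does not confirm. The function name run_with_preview and the injectable runner parameter are hypothetical.

```python
import subprocess
from typing import Callable, Sequence


def run_with_preview(
    dataset: str,
    extra_args: Sequence[str] = (),
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Run `run-all` with --dry-run first, then for real if the preview passed."""
    base = ["pixi", "run", "ocr", "ingest-data", "run-all", dataset, *extra_args]

    # Assumption: a failed preview is reported through a non-zero exit code.
    preview = runner([*base, "--dry-run"], check=False)
    if preview.returncode != 0:
        return False

    real = runner(base, check=False)
    return real.returncode == 0


if __name__ == "__main__":
    ok = run_with_preview("scott-et-al-2024", ["--use-coiled"])
    print("pipeline succeeded" if ok else "pipeline did not run or failed")
```

Passing a fake runner makes the wrapper testable without invoking pixi.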
