API reference


This page provides a structured, auto-generated reference for the ocr Python package. Each section links to the corresponding module(s) and surfaces docstrings, type hints, and signatures.


Package overview

High-level package entry points and public exports.


Core modules

Configuration

Configuration models for storage, chunking, Coiled, and processing settings.

class ocr.config.CoiledConfig(_case_sensitive=None, _nested_model_default_partial_update=None, _env_prefix=None, _env_prefix_target=None, _env_file=PosixPath('.'), _env_file_encoding=None, _env_ignore_empty=None, _env_nested_delimiter=None, _env_nested_max_split=None, _env_parse_none_str=None, _env_parse_enums=None, _cli_prog_name=None, _cli_parse_args=None, _cli_settings_source=None, _cli_parse_none_str=None, _cli_hide_none_type=None, _cli_avoid_json=None, _cli_enforce_required=None, _cli_use_class_docs_for_groups=None, _cli_exit_on_error=None, _cli_prefix=None, _cli_flag_prefix_char=None, _cli_implicit_flags=None, _cli_ignore_unknown_args=None, _cli_kebab_case=None, _cli_shortcuts=None, _secrets_dir=None, _build_sources=None, *, tag={'Project': 'OCR'}, forward_aws_credentials=False, spot_policy='spot_with_fallback', region='us-west-2', ntasks=1, vm_type='m8g.2xlarge', scheduler_vm_type='m8g.2xlarge')[source]#

Bases: BaseSettings

tag: dict[str, str]#
forward_aws_credentials: bool#
spot_policy: Literal['on-demand', 'spot', 'spot_with_fallback']#
region: str#
ntasks: Annotated[int, Gt(gt=0)]#
vm_type: str#
scheduler_vm_type: str#
model_config = {'arbitrary_types_allowed': True, 'case_sensitive': False, 'cli_avoid_json': False, 'cli_enforce_required': False, 'cli_exit_on_error': True, 'cli_flag_prefix_char': '-', 'cli_hide_none_type': False, 'cli_ignore_unknown_args': False, 'cli_implicit_flags': False, 'cli_kebab_case': False, 'cli_parse_args': None, 'cli_parse_none_str': None, 'cli_prefix': '', 'cli_prog_name': None, 'cli_shortcuts': None, 'cli_use_class_docs_for_groups': False, 'enable_decoding': True, 'env_file': None, 'env_file_encoding': None, 'env_ignore_empty': False, 'env_nested_delimiter': None, 'env_nested_max_split': None, 'env_parse_enums': None, 'env_parse_none_str': None, 'env_prefix': 'ocr_coiled_', 'env_prefix_target': 'variable', 'extra': 'forbid', 'json_file': None, 'json_file_encoding': None, 'nested_model_default_partial_update': False, 'protected_namespaces': ('model_validate', 'model_dump', 'settings_customise_sources'), 'secrets_dir': None, 'toml_file': None, 'validate_default': True, 'yaml_config_section': None, 'yaml_file': None, 'yaml_file_encoding': None}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
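pydantic-settings builds environment variable names by prepending env_prefix to the field name, and with case_sensitive=False the match ignores case. With the 'ocr_coiled_' prefix shown above, a variable such as OCR_COILED_REGION would therefore target the region field. The following is a minimal stdlib sketch of that naming rule only; it is not pydantic-settings itself, and the helper name collect_prefixed_env and the example environment are hypothetical.

```python
import os


def collect_prefixed_env(prefix, fields, environ=None):
    """Return {field_name: raw_value} for variables named prefix + field_name."""
    environ = os.environ if environ is None else environ
    prefix = prefix.lower()
    found = {}
    for key, value in environ.items():
        lowered = key.lower()  # case_sensitive=False: compare names in lower case
        if lowered.startswith(prefix):
            field = lowered[len(prefix):]
            if field in fields:
                found[field] = value
    return found


# Hypothetical environment; the field names come from the CoiledConfig listing.
example_env = {
    "OCR_COILED_REGION": "us-east-1",
    "OCR_COILED_NTASKS": "4",
    "HOME": "/root",
}
overrides = collect_prefixed_env(
    "ocr_coiled_", {"region", "ntasks", "vm_type"}, example_env
)
print(overrides)
```

The values arrive as raw strings; in the real settings class, type coercion and validation (for example ntasks > 0) are handled by pydantic.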

class ocr.config.ChunkingConfig(_case_sensitive=None, _nested_model_default_partial_update=None, _env_prefix=None, _env_prefix_target=None, _env_file=PosixPath('.'), _env_file_encoding=None, _env_ignore_empty=None, _env_nested_delimiter=None, _env_nested_max_split=None, _env_parse_none_str=None, _env_parse_enums=None, _cli_prog_name=None, _cli_parse_args=None, _cli_settings_source=None, _cli_parse_none_str=None, _cli_hide_none_type=None, _cli_avoid_json=None, _cli_enforce_required=None, _cli_use_class_docs_for_groups=None, _cli_exit_on_error=None, _cli_prefix=None, _cli_flag_prefix_char=None, _cli_implicit_flags=None, _cli_ignore_unknown_args=None, _cli_kebab_case=None, _cli_shortcuts=None, _secrets_dir=None, _build_sources=None, *, chunks=None, debug=False)[source]#

Bases: BaseSettings

chunks: dict | None#
debug: bool#
model_config = {'arbitrary_types_allowed': True, 'case_sensitive': False, 'cli_avoid_json': False, 'cli_enforce_required': False, 'cli_exit_on_error': True, 'cli_flag_prefix_char': '-', 'cli_hide_none_type': False, 'cli_ignore_unknown_args': False, 'cli_implicit_flags': False, 'cli_kebab_case': False, 'cli_parse_args': None, 'cli_parse_none_str': None, 'cli_prefix': '', 'cli_prog_name': None, 'cli_shortcuts': None, 'cli_use_class_docs_for_groups': False, 'enable_decoding': True, 'env_file': None, 'env_file_encoding': None, 'env_ignore_empty': False, 'env_nested_delimiter': None, 'env_nested_max_split': None, 'env_parse_enums': None, 'env_parse_none_str': None, 'env_prefix': 'ocr_chunking_', 'env_prefix_target': 'variable', 'extra': 'forbid', 'json_file': None, 'json_file_encoding': None, 'nested_model_default_partial_update': False, 'protected_namespaces': ('model_validate', 'model_dump', 'settings_customise_sources'), 'secrets_dir': None, 'toml_file': None, 'validate_default': True, 'yaml_config_section': None, 'yaml_file': None, 'yaml_file_encoding': None}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

model_post_init(_ChunkingConfig__context)[source]#

Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.

property extent[source]#
property extent_as_tuple[source]#
property extent_as_tuple_5070[source]#

Get extent in EPSG:5070 projection as tuple (xmin, xmax, ymin, ymax)

property ds[source]#
property transform[source]#
property chunk_info: dict[source]#

Get information about the dataset’s chunks

property valid_region_ids: list[source]#

Generate valid region IDs by checking which regions contain non-null data.

Returns:

List of valid region IDs (e.g., ‘y1_x3’, ‘y2_x4’, etc.)

Return type:

list

index_to_coords(x_idx, y_idx)[source]#

Convert array indices to EPSG:4326 coordinates

Parameters:
  • x_idx (int) – Index along the x-dimension (longitude)

  • y_idx (int) – Index along the y-dimension (latitude)

Returns:

x, y – Corresponding EPSG:4326 coordinates (longitude, latitude)

Return type:

tuple[float, float]
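As a rough illustration of an index-to-coordinate conversion, the sketch below applies a simple affine mapping for a regular, axis-aligned grid. The function name, the grid origin, the pixel size, and the pixel-centre convention are all assumptions made for the example; the actual method presumably relies on the config's own transform property.

```python
def index_to_coords_sketch(x_idx, y_idx, *, x_origin, y_origin, dx, dy):
    """Map array indices to pixel-centre coordinates on a regular grid.

    x_origin / y_origin are the coordinates of the outer corner of pixel
    (0, 0); dx / dy are the signed pixel sizes along each axis.
    """
    x = x_origin + (x_idx + 0.5) * dx
    y = y_origin + (y_idx + 0.5) * dy
    return x, y


# Hypothetical grid: lower-left corner at (-125.0, 22.0), 0.01 degree pixels,
# latitude ascending with the row index.
lon, lat = index_to_coords_sketch(
    100, 200, x_origin=-125.0, y_origin=22.0, dx=0.01, dy=0.01
)
print(lon, lat)
```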

chunks_to_slices(chunks)[source]#

Create a dict of chunk_ids and slices from input chunk dict

Parameters:

chunks (dict) – Dictionary with chunk sizes for ‘longitude’ and ‘latitude’

Returns:

Dictionary with chunk IDs as keys and corresponding slices as values

Return type:

dict
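One plausible way to build such a mapping is to accumulate the chunk lengths along each axis into start/stop offsets and pair them up. The sketch below assumes the input looks like an xarray-style chunks mapping (a tuple of chunk lengths per dimension) and that chunk IDs are (iy, ix) tuples; both are assumptions, and the function names are hypothetical.

```python
from itertools import accumulate, product


def axis_slices(lengths):
    """Turn chunk lengths along one axis into consecutive slices."""
    stops = list(accumulate(lengths))
    starts = [0, *stops[:-1]]
    return [slice(start, stop) for start, stop in zip(starts, stops)]


def chunks_to_slices_sketch(chunks):
    """Map (iy, ix) chunk IDs to (y_slice, x_slice) pairs."""
    y_slices = axis_slices(chunks["latitude"])
    x_slices = axis_slices(chunks["longitude"])
    return {
        (iy, ix): (y_slices[iy], x_slices[ix])
        for iy, ix in product(range(len(y_slices)), range(len(x_slices)))
    }


# Hypothetical chunk layout: three chunks along latitude, two along longitude.
example = chunks_to_slices_sketch(
    {"latitude": (6000, 6000, 3000), "longitude": (4500, 4500)}
)
print(example[(2, 1)])
```

Working from cumulative offsets rather than multiplying an index by a fixed size keeps the sketch correct when the last chunk along an axis is shorter than the others.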

region_id_chunk_lookup(region_id)[source]#

Given a region_id (e.g., ‘y5_x14’), return the corresponding chunk index (5, 14).

Parameters:

region_id (str) – The region_id for chunk_id lookup.

Returns:

index – The corresponding chunk (iy, ix) for the given region_id.

Return type:

tuple[int, int]
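The region ID format shown above ('y5_x14' for chunk (5, 14)) can be parsed with a small regular expression. This is a minimal sketch with hypothetical helper names, not the package's implementation, which may instead look the ID up in a precomputed mapping.

```python
import re

# 'y<iy>_x<ix>' with non-negative integer indices, as in the example 'y5_x14'.
_REGION_ID = re.compile(r"^y(\d+)_x(\d+)$")


def parse_region_id(region_id):
    """Return the (iy, ix) chunk index encoded in a region ID string."""
    match = _REGION_ID.match(region_id)
    if match is None:
        raise ValueError(f"Unrecognised region_id: {region_id!r}")
    iy, ix = (int(group) for group in match.groups())
    return iy, ix


def format_region_id(iy, ix):
    """Inverse of parse_region_id."""
    return f"y{iy}_x{ix}"


print(parse_region_id("y5_x14"))
```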

region_id_slice_lookup(region_id)[source]#

Given a region_id (e.g., ‘y5_x14’), return the corresponding array slices, e.g.: (slice(np.int64(30000), np.int64(36000), None), slice(np.int64(85500), np.int64(90000), None))

Parameters:

region_id (str) – The region_id for chunk_id lookup.

Returns:

indexer – The corresponding slices (y_slice, x_slice) for the given region_id.

Return type:

tuple[slice]

chunk_id_to_slice(chunk_id)[source]#

Convert a chunk ID (iy, ix) to corresponding array slices

Parameters:

chunk_id (tuple) – The chunk identifier as a tuple (iy, ix), where:
  • iy is the index along the y-dimension
  • ix is the index along the x-dimension

Returns:

chunk_slices – A tuple of slices (y_slice, x_slice) to extract data for this chunk

Return type:

tuple[slice]

region_id_to_latlon_slices(region_id)[source]#

Get latitude and longitude slices from region_id

Returns (lat_slice, lon_slice) where lat_slice.start < lat_slice.stop and lon_slice.start < lon_slice.stop (lower-left origin, lat ascending).

get_chunk_mapping()[source]#

Returns a dict of region_ids and their corresponding chunk_indexes.

Returns:

chunk_mapping – Dictionary with region IDs as keys and corresponding chunk indexes (iy, ix) as values

Return type:

dict

plot_all_chunks(color_by_size=False)[source]#

Plot all data chunks across the entire CONUS with their indices as labels

Parameters:

color_by_size (bool, default False) – If True, color chunks based on their size (useful to identify irregularities)

bbox_from_wgs84(xmin, ymin, xmax, ymax)[source]#

https://observablehq.com/@rdmurphy/u-s-state-bounding-boxes

get_chunks_for_bbox(bbox)[source]#

Find all chunks that intersect with the given bounding box

Parameters:

bbox (BoundingBox or tuple) – Bounding box to check for intersection. If tuple, format is (minx, miny, maxx, maxy)

Returns:

List of (iy, ix) tuples identifying the intersecting chunks

Return type:

list of tuples
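The intersection test behind a lookup like this is usually a pair of interval-overlap checks per axis. The sketch below assumes each chunk's bounds are already available as (minx, miny, maxx, maxy) tuples in the same CRS as the query box; the chunk_bounds mapping and both function names are hypothetical.

```python
def boxes_intersect(a, b):
    """True if two (minx, miny, maxx, maxy) boxes overlap or touch."""
    a_minx, a_miny, a_maxx, a_maxy = a
    b_minx, b_miny, b_maxx, b_maxy = b
    return (
        a_minx <= b_maxx
        and b_minx <= a_maxx
        and a_miny <= b_maxy
        and b_miny <= a_maxy
    )


def chunks_for_bbox_sketch(chunk_bounds, bbox):
    """Return the sorted (iy, ix) IDs whose bounds intersect bbox."""
    return sorted(
        chunk_id
        for chunk_id, bounds in chunk_bounds.items()
        if boxes_intersect(bounds, bbox)
    )


# Hypothetical 2 x 2 grid of 10 x 10 unit chunks.
chunk_bounds = {
    (0, 0): (0, 0, 10, 10),
    (0, 1): (10, 0, 20, 10),
    (1, 0): (0, 10, 10, 20),
    (1, 1): (10, 10, 20, 20),
}
hits = chunks_for_bbox_sketch(chunk_bounds, (12, 2, 18, 8))
print(hits)
```

Because the comparison uses <=, boxes that only share an edge count as intersecting; whether the real method treats touching chunks the same way is not stated in the source.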

visualize_chunks_on_conus(chunks=None, color_by_size=False, highlight_chunks=None, include_all_chunks=False)[source]#

Visualize specified chunks on CONUS map

Parameters:
  • chunks (list of tuples, optional) – List of (iy, ix) tuples specifying chunks to visualize. If None, all chunks are shown.

  • color_by_size (bool, default False) – If True, color chunks based on their size

  • highlight_chunks (list of tuples, optional) – List of (iy, ix) tuples specifying chunks to highlight

  • include_all_chunks (bool, default False) – If True, show all chunks in background with low opacity

class ocr.config.PyramidConfig(_case_sensitive=None, _nested_model_default_partial_update=None, _env_prefix=None, _env_prefix_target=None, _env_file=PosixPath('.'), _env_file_encoding=None, _env_ignore_empty=None, _env_nested_delimiter=None, _env_nested_max_split=None, _env_parse_none_str=None, _env_parse_enums=None, _cli_prog_name=None, _cli_parse_args=None, _cli_settings_source=None, _cli_parse_none_str=None, _cli_hide_none_type=None, _cli_avoid_json=None, _cli_enforce_required=None, _cli_use_class_docs_for_groups=None, _cli_exit_on_error=None, _cli_prefix=None, _cli_flag_prefix_char=None, _cli_implicit_flags=None, _cli_ignore_unknown_args=None, _cli_kebab_case=None, _cli_shortcuts=None, _secrets_dir=None, _build_sources=None, *, environment=Environment.QA, version=None, storage_root, output_prefix=None, debug=False)[source]#

Bases: BaseSettings

Configuration for visualization pyramid / multiscales

environment: Environment#
version: SemanticVersion | None#
storage_root: str#
output_prefix: str | None#
debug: bool#
model_config = {'arbitrary_types_allowed': True, 'case_sensitive': False, 'cli_avoid_json': False, 'cli_enforce_required': False, 'cli_exit_on_error': True, 'cli_flag_prefix_char': '-', 'cli_hide_none_type': False, 'cli_ignore_unknown_args': False, 'cli_implicit_flags': False, 'cli_kebab_case': False, 'cli_parse_args': None, 'cli_parse_none_str': None, 'cli_prefix': '', 'cli_prog_name': None, 'cli_shortcuts': None, 'cli_use_class_docs_for_groups': False, 'enable_decoding': True, 'env_file': None, 'env_file_encoding': None, 'env_ignore_empty': False, 'env_nested_delimiter': None, 'env_nested_max_split': None, 'env_parse_enums': None, 'env_parse_none_str': None, 'env_prefix': 'ocr_vector_', 'env_prefix_target': 'variable', 'extra': 'forbid', 'json_file': None, 'json_file_encoding': None, 'nested_model_default_partial_update': False, 'protected_namespaces': ('model_validate', 'model_dump', 'settings_customise_sources'), 'secrets_dir': None, 'toml_file': None, 'validate_default': True, 'yaml_config_section': None, 'yaml_file': None, 'yaml_file_encoding': None}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

model_post_init(_PyramidConfig__context)[source]#

Post-initialization to set up prefixes and URIs based on environment.

property pyramid_uri: UPath#
wipe()[source]#

Wipe the pyramid data storage.

class ocr.config.VectorConfig(_case_sensitive=None, _nested_model_default_partial_update=None, _env_prefix=None, _env_prefix_target=None, _env_file=PosixPath('.'), _env_file_encoding=None, _env_ignore_empty=None, _env_nested_delimiter=None, _env_nested_max_split=None, _env_parse_none_str=None, _env_parse_enums=None, _cli_prog_name=None, _cli_parse_args=None, _cli_settings_source=None, _cli_parse_none_str=None, _cli_hide_none_type=None, _cli_avoid_json=None, _cli_enforce_required=None, _cli_use_class_docs_for_groups=None, _cli_exit_on_error=None, _cli_prefix=None, _cli_flag_prefix_char=None, _cli_implicit_flags=None, _cli_ignore_unknown_args=None, _cli_kebab_case=None, _cli_shortcuts=None, _secrets_dir=None, _build_sources=None, *, environment=Environment.QA, version=None, storage_root, prefix=None, output_prefix=None, debug=False, metadata=None)[source]#

Bases: BaseSettings

Configuration for vector data processing.

environment: Environment#
version: SemanticVersion | None#
storage_root: str#
prefix: str | None#
output_prefix: str | None#
debug: bool#
model_config = {'arbitrary_types_allowed': True, 'case_sensitive': False, 'cli_avoid_json': False, 'cli_enforce_required': False, 'cli_exit_on_error': True, 'cli_flag_prefix_char': '-', 'cli_hide_none_type': False, 'cli_ignore_unknown_args': False, 'cli_implicit_flags': False, 'cli_kebab_case': False, 'cli_parse_args': None, 'cli_parse_none_str': None, 'cli_prefix': '', 'cli_prog_name': None, 'cli_shortcuts': None, 'cli_use_class_docs_for_groups': False, 'enable_decoding': True, 'env_file': None, 'env_file_encoding': None, 'env_ignore_empty': False, 'env_nested_delimiter': None, 'env_nested_max_split': None, 'env_parse_enums': None, 'env_parse_none_str': None, 'env_prefix': 'ocr_vector_', 'env_prefix_target': 'variable', 'extra': 'forbid', 'json_file': None, 'json_file_encoding': None, 'nested_model_default_partial_update': False, 'protected_namespaces': ('model_validate', 'model_dump', 'settings_customise_sources'), 'secrets_dir': None, 'toml_file': None, 'validate_default': True, 'yaml_config_section': None, 'yaml_file': None, 'yaml_file_encoding': None}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

metadata: dict[str, str] | None#
property metadata_dict: dict[str, str]#

Get metadata dict with ODBL license for vector data.

model_post_init(_VectorConfig__context)[source]#

Post-initialization to set up prefixes and URIs based on environment.

wipe()[source]#

Wipe the vector data storage.

property pmtiles_prefix: str[source]#
property buildings_pmtiles_uri: UPath[source]#
property building_centroids_pmtiles_uri: UPath[source]#
property region_pmtiles_uri: UPath[source]#
property region_geoparquet_prefix: str[source]#
property geoparquet_prefix: str[source]#
property region_geoparquet_uri: UPath[source]#
property aggregated_region_analysis_prefix: str[source]#
property aggregated_region_analysis_uri: UPath[source]#
property building_geoparquet_uri: UPath#
property region_summary_stats_prefix: UPath[source]#
property block_summary_stats_uri: UPath[source]#

URI for the block summary statistics file.

property tracts_summary_stats_uri: UPath[source]#

URI for the tracts summary statistics file.

property counties_summary_stats_uri: UPath[source]#

URI for the counties summary statistics file.

property states_summary_stats_uri: UPath[source]#

URI for the states summary statistics file.

property nation_summary_stats_uri: UPath[source]#

URI for the nation summary statistics file.

upath_delete(path)[source]#

Use UPath to handle deletion in a cloud-agnostic way

pretty_paths()[source]#

Pretty print key VectorConfig paths and URIs.

This method intentionally touches cached properties that create directories (e.g., via mkdir) so you can verify real locations.

class ocr.config.IcechunkConfig(_case_sensitive=None, _nested_model_default_partial_update=None, _env_prefix=None, _env_prefix_target=None, _env_file=PosixPath('.'), _env_file_encoding=None, _env_ignore_empty=None, _env_nested_delimiter=None, _env_nested_max_split=None, _env_parse_none_str=None, _env_parse_enums=None, _cli_prog_name=None, _cli_parse_args=None, _cli_settings_source=None, _cli_parse_none_str=None, _cli_hide_none_type=None, _cli_avoid_json=None, _cli_enforce_required=None, _cli_use_class_docs_for_groups=None, _cli_exit_on_error=None, _cli_prefix=None, _cli_flag_prefix_char=None, _cli_implicit_flags=None, _cli_ignore_unknown_args=None, _cli_kebab_case=None, _cli_shortcuts=None, _secrets_dir=None, _build_sources=None, *, environment=Environment.QA, version=None, storage_root, prefix=None, debug=False, metadata=None)[source]#

Bases: BaseSettings

Configuration for icechunk processing.

environment: Environment#
version: SemanticVersion | None#
storage_root: str#
prefix: str | None#
debug: bool#
metadata: dict[str, str] | None#
property metadata_dict: dict[str, str]#

Get metadata dict with CC-BY-4.0 license for icechunk data.

model_post_init(_IcechunkConfig__context)[source]#

Post-initialization to set up prefixes and URIs based on environment.

wipe()[source]#

Wipe the icechunk repository.

property uri: UPath[source]#

Return the URI for the icechunk repository.

property storage: Storage[source]#
init_repo()[source]#

Create an icechunk repo, or open it if it already exists.

repo_and_session(readonly=False, branch='main')[source]#

Open an icechunk repository and return the session.

delete()[source]#

Delete the icechunk repository.

create_template()[source]#

Create a template dataset for icechunk store

commit_messages_ancestry(branch='main')[source]#

Get the commit messages ancestry for the icechunk repository.

region_id_exists(region_id, *, branch='main')[source]#
processed_regions(*, branch='main')[source]#

Get a list of region IDs that have already been processed.

insert_region_uncooperative(subset_ds, *, region_id, branch='main')[source]#

Insert region into Icechunk store

Parameters:
  • subset_ds (xr.Dataset) – The subset dataset to insert into the Icechunk store.

  • region_id (str) – The region ID corresponding to the subset dataset.

  • branch (str, optional) – The branch to use in the Icechunk repository, by default ‘main’.

pretty_paths()[source]#

Pretty print key IcechunkConfig paths and URIs.

This version touches cached properties (e.g., uri, storage) to surface real configuration and types.

model_config = {'arbitrary_types_allowed': True, 'case_sensitive': False, 'cli_avoid_json': False, 'cli_enforce_required': False, 'cli_exit_on_error': True, 'cli_flag_prefix_char': '-', 'cli_hide_none_type': False, 'cli_ignore_unknown_args': False, 'cli_implicit_flags': False, 'cli_kebab_case': False, 'cli_parse_args': None, 'cli_parse_none_str': None, 'cli_prefix': '', 'cli_prog_name': None, 'cli_shortcuts': None, 'cli_use_class_docs_for_groups': False, 'enable_decoding': True, 'env_file': None, 'env_file_encoding': None, 'env_ignore_empty': False, 'env_nested_delimiter': None, 'env_nested_max_split': None, 'env_parse_enums': None, 'env_parse_none_str': None, 'env_prefix': '', 'env_prefix_target': 'variable', 'extra': 'forbid', 'json_file': None, 'json_file_encoding': None, 'nested_model_default_partial_update': False, 'protected_namespaces': ('model_validate', 'model_dump', 'settings_customise_sources'), 'secrets_dir': None, 'toml_file': None, 'validate_default': True, 'yaml_config_section': None, 'yaml_file': None, 'yaml_file_encoding': None}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class ocr.config.RegionIDStatus(provided_region_ids: set[str], valid_region_ids: set[str], invalid_region_ids: set[str], processed_region_ids: set[str], previously_processed_ids: set[str], unprocessed_valid_region_ids: set[str])[source]#

Bases: object

provided_region_ids: set[str]#
valid_region_ids: set[str]#
invalid_region_ids: set[str]#
processed_region_ids: set[str]#
previously_processed_ids: set[str]#
unprocessed_valid_region_ids: set[str]#
class ocr.config.OCRConfig(_case_sensitive=None, _nested_model_default_partial_update=None, _env_prefix=None, _env_prefix_target=None, _env_file=PosixPath('.'), _env_file_encoding=None, _env_ignore_empty=None, _env_nested_delimiter=None, _env_nested_max_split=None, _env_parse_none_str=None, _env_parse_enums=None, _cli_prog_name=None, _cli_parse_args=None, _cli_settings_source=None, _cli_parse_none_str=None, _cli_hide_none_type=None, _cli_avoid_json=None, _cli_enforce_required=None, _cli_use_class_docs_for_groups=None, _cli_exit_on_error=None, _cli_prefix=None, _cli_flag_prefix_char=None, _cli_implicit_flags=None, _cli_ignore_unknown_args=None, _cli_kebab_case=None, _cli_shortcuts=None, _secrets_dir=None, _build_sources=None, *, environment=Environment.QA, version=None, storage_root, vector=None, icechunk=None, pyramid=None, chunking=None, coiled=None, debug=False)[source]#

Bases: BaseSettings

Configuration settings for OCR processing.

environment: Environment#
version: SemanticVersion | None#
storage_root: str#
vector: VectorConfig | None#
icechunk: IcechunkConfig | None#
pyramid: PyramidConfig | None#
chunking: ChunkingConfig | None#
coiled: CoiledConfig | None#
debug: bool#
model_config = {'arbitrary_types_allowed': True, 'case_sensitive': False, 'cli_avoid_json': False, 'cli_enforce_required': False, 'cli_exit_on_error': True, 'cli_flag_prefix_char': '-', 'cli_hide_none_type': False, 'cli_ignore_unknown_args': False, 'cli_implicit_flags': False, 'cli_kebab_case': False, 'cli_parse_args': None, 'cli_parse_none_str': None, 'cli_prefix': '', 'cli_prog_name': None, 'cli_shortcuts': None, 'cli_use_class_docs_for_groups': False, 'enable_decoding': True, 'env_file': None, 'env_file_encoding': None, 'env_ignore_empty': False, 'env_nested_delimiter': None, 'env_nested_max_split': None, 'env_parse_enums': None, 'env_parse_none_str': None, 'env_prefix': 'ocr_', 'env_prefix_target': 'variable', 'extra': 'forbid', 'json_file': None, 'json_file_encoding': None, 'nested_model_default_partial_update': False, 'protected_namespaces': ('model_validate', 'model_dump', 'settings_customise_sources'), 'secrets_dir': None, 'toml_file': None, 'validate_default': True, 'yaml_config_section': None, 'yaml_file': None, 'yaml_file_encoding': None}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

model_post_init(_OCRConfig__context)[source]#

Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.

pretty_paths()[source]#

Pretty print key OCRConfig paths and URIs.

This method intentionally touches cached properties that create directories (e.g., via mkdir) so you can verify real locations.

resolve_region_ids(provided_region_ids, *, allow_all_processed=False)[source]#

Validate provided region IDs against valid + processed sets.

Parameters:
  • provided_region_ids (set[str]) – The set of region IDs to validate.

  • allow_all_processed (bool, optional) – If True, don’t raise an error when all regions are already processed. This is useful for production reruns where you want to regenerate vector outputs even if icechunk regions are complete. Default is False.

Returns:

Status object with validation results.

Return type:

RegionIDStatus

Raises:

ValueError – If no valid unprocessed region IDs remain and allow_all_processed is False.
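The validation described here amounts to a handful of set operations. The sketch below shows one plausible reading; the meaning of each set is inferred from the RegionIDStatus field names and is an assumption, as are the class and function names. The real method obtains the valid and processed sets from the chunking and icechunk configs rather than taking them as arguments.

```python
from dataclasses import dataclass


@dataclass
class RegionStatusSketch:
    provided: set
    valid: set
    invalid: set
    previously_processed: set
    unprocessed_valid: set


def resolve_sketch(provided, all_valid, processed, *, allow_all_processed=False):
    """Split provided region IDs into valid, invalid, and already-processed sets."""
    valid = provided & all_valid
    invalid = provided - all_valid
    previously_processed = valid & processed
    unprocessed_valid = valid - processed
    if not unprocessed_valid and not allow_all_processed:
        raise ValueError("No valid unprocessed region IDs remain")
    return RegionStatusSketch(
        provided, valid, invalid, previously_processed, unprocessed_valid
    )


# Hypothetical inputs.
all_valid = {"y1_x3", "y2_x4", "y5_x14"}
processed = {"y2_x4"}
status = resolve_sketch({"y1_x3", "y2_x4", "y9_x9"}, all_valid, processed)
print(status.unprocessed_valid, status.invalid)
```

With allow_all_processed=True the same call returns a status whose unprocessed set is empty instead of raising, which matches the rerun use case described in the parameter list.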

select_region_ids(region_ids, *, all_region_ids=False, allow_all_processed=False)[source]#

Helper to pick the effective set of region IDs (all or user-provided) and return the validated status object.

Parameters:
  • region_ids (list[str] | None) – User-provided region IDs to process.

  • all_region_ids (bool, optional) – If True, use all valid region IDs instead of user-provided ones. Default is False.

  • allow_all_processed (bool, optional) – If True, don’t raise an error when all regions are already processed. Passed through to resolve_region_ids. Default is False.

Returns:

Status object with validation results.

Return type:

RegionIDStatus

ocr.config.load_config(file_path)[source]#

Load OCR configuration from an env file (dotenv) or current environment.

Type definitions

Strongly typed enums for environment, platform, and risk types.

class ocr.types.Environment(*values)[source]#

Bases: StrEnum

QA = 'qa'#
STAGING = 'staging'#
PRODUCTION = 'production'#
class ocr.types.Platform(*values)[source]#

Bases: StrEnum

COILED = 'coiled'#
LOCAL = 'local'#
class ocr.types.RiskType(*values)[source]#

Bases: StrEnum

Available risk types for calculation.

FIRE = 'fire'#

Data access

Datasets

Dataset and Catalog abstractions for Zarr and GeoParquet on S3/local storage.

class ocr.datasets.Dataset(*, name, description, bucket, prefix, data_format, version='v1', license=None)[source]#

Bases: BaseModel

Base class for datasets.

name: str#
description: str#
bucket: str#
prefix: str#
data_format: Literal['geoparquet', 'zarr']#
version: str#
license: str | None#
to_xarray(*, is_icechunk=None, xarray_open_kwargs=None, xarray_storage_options=None)[source]#

Convert the dataset to an xarray.Dataset.

Parameters:
  • is_icechunk (bool | None, default None) – Whether to use icechunk to access the data.
      - If True: only try using icechunk
      - If None: try icechunk first, fall back to direct S3 access if it fails
      - If False: only use direct S3 access

  • xarray_open_kwargs (dict, optional) – Additional keyword arguments to pass to xarray.open_dataset.

  • xarray_storage_options (dict, optional) – Storage options for S3 access when not using icechunk.

Returns:

The opened dataset.

Return type:

xr.Dataset

Raises:
query_geoparquet(query=None, *, install_extensions=True)[source]#

Query a geoparquet file using DuckDB.

Parameters:
  • query (str, optional) – SQL query to execute. If not provided, returns all data.

  • install_extensions (bool, default True) – Whether to install and load the spatial and httpfs extensions.

Returns:

Result of the DuckDB query.

Return type:

duckdb.DuckDBPyRelation

Raises:

ValueError – If dataset is not in ‘geoparquet’ format.

Example

Example of querying buildings with a converted geometry column:

>>> buildings = catalog.get_dataset('conus-overture-buildings', 'v2025-03-19.1')
>>> result = buildings.query_geoparquet("""
...     SELECT
...         id,
...         roof_material,
...         geometry
...     FROM read_parquet('{s3_path}')
...     WHERE roof_material = 'concrete'
... """)
>>> # Then convert to GeoDataFrame
>>> gdf = buildings.to_geopandas("""
...     SELECT
...         id,
...         roof_material,
...         geometry
...     FROM read_parquet('{s3_path}')
...     WHERE roof_material = 'concrete'
... """)
to_geopandas(query=None, geometry_column='geometry', crs='EPSG:4326', target_crs=None, **kwargs)[source]#

Convert query results to a GeoPandas GeoDataFrame.

Parameters:
  • query (str, optional) – SQL query to execute. If not provided, returns all data.

  • geometry_column (str, default 'geometry') – The name of the geometry column in the query result.

  • crs (str, default 'EPSG:4326') – The coordinate reference system to use for the geometries.

  • target_crs (str, optional) – The target coordinate reference system to convert the geometries to.

  • **kwargs (dict) – Additional keyword arguments passed to query_geoparquet.

Returns:

A GeoPandas GeoDataFrame containing the queried data with geometries.

Return type:

gpd.GeoDataFrame

Raises:

ValueError – If dataset is not in ‘geoparquet’ format or if the geometry column is not found.

Example

Example of converting buildings to a GeoPandas GeoDataFrame (no need for ST_AsText()):

>>> buildings = catalog.get_dataset('conus-overture-buildings', 'v2025-03-19.1')
>>> gdf = buildings.to_geopandas("""
...     SELECT
...         id,
...         roof_material,
...         geometry
...     FROM read_parquet('{s3_path}')
...     WHERE roof_material = 'concrete'
... """)
>>> gdf.head()

model_config = {}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class ocr.datasets.Catalog(*, datasets)[source]#

Bases: BaseModel

Base class for datasets catalog.

datasets: list[Dataset]#
get_dataset(name, version=None, *, case_sensitive=True, latest=False)[source]#

Get a dataset by name and optionally version.

Parameters:
  • name (str) – Name of the dataset to retrieve

  • version (str, optional) – Specific version of the dataset. If not provided, returns the dataset if only one version exists, or raises an error if multiple versions exist, unless latest=True.

  • case_sensitive (bool, default True) – Whether to match dataset names case-sensitively

  • latest (bool, default False) – If True and version=None, returns the latest version instead of raising an error when multiple versions exist

Returns:

The matched dataset

Return type:

Dataset

Raises:
  • ValueError – If multiple versions exist and version is not specified (and latest=False)

  • KeyError – If no matching dataset is found

Examples

>>> # Get a dataset with a specific version
>>> catalog.get_dataset('conus-overture-buildings', 'v2025-03-19.1')
>>>
>>> # Get latest version of a dataset
>>> catalog.get_dataset('conus-overture-buildings', latest=True)
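The lookup rules documented for get_dataset (exact version match, a single-version shortcut, the latest flag, and the two error cases) can be sketched with plain Python objects. This is an illustration under stated assumptions rather than the Catalog implementation: DatasetSketch and get_dataset_sketch are hypothetical names, the second version string is invented for the example, and "latest" is assumed to mean the greatest version string in text order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSketch:
    name: str
    version: str


def get_dataset_sketch(datasets, name, version=None, *, case_sensitive=True, latest=False):
    """Pick one dataset by name and optional version."""

    def same_name(candidate):
        if case_sensitive:
            return candidate == name
        return candidate.lower() == name.lower()

    matches = [d for d in datasets if same_name(d.name)]
    if version is not None:
        matches = [d for d in matches if d.version == version]
    if not matches:
        raise KeyError(f"No dataset matching name={name!r}, version={version!r}")
    if len(matches) == 1:
        return matches[0]
    if latest:
        # Assumption: version strings sort chronologically when compared as text.
        return max(matches, key=lambda d: d.version)
    raise ValueError(f"Multiple versions of {name!r}; pass version=... or latest=True")


# Hypothetical catalog entries.
entries = [
    DatasetSketch("conus-overture-buildings", "v2024-11-01.1"),
    DatasetSketch("conus-overture-buildings", "v2025-03-19.1"),
]
newest = get_dataset_sketch(entries, "conus-overture-buildings", latest=True)
print(newest.version)
```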
model_config = {}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

CONUS404 helpers

Load CONUS404 variables, compute relative humidity, wind rotation and diagnostics. Geographic selection utilities (point/bbox) with CRS-aware transforms.

ocr.conus404.load_conus404(add_spatial_constants=True)[source]#

Load the CONUS 404 dataset.

Parameters:

add_spatial_constants (bool, optional) – If True, adds spatial constant variables (SINALPHA, COSALPHA) to the dataset.

Returns:

ds – The CONUS 404 dataset.

Return type:

xr.Dataset

ocr.conus404.compute_relative_humidity(ds)[source]#

Compute relative humidity from specific humidity, temperature, and pressure.

Parameters:

ds (xr.Dataset) – Input dataset containing ‘Q2’ (specific humidity), ‘T2’ (temperature in K), and ‘PSFC’ (pressure in Pa).

Returns:

hurs – Relative humidity as a percentage.

Return type:

xr.DataArray
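For orientation, relative humidity can be derived from these three inputs by converting specific humidity to vapour pressure and dividing by the saturation vapour pressure. The scalar sketch below uses the Bolton (1980) approximation for saturation vapour pressure over water; that choice, the function name, and the sample values are assumptions, and the package may use a different formulation (for example via xclim) and operates on xarray objects rather than floats.

```python
import math

EPSILON = 0.622  # ratio of the molar masses of water vapour and dry air


def relative_humidity_percent(q2, t2, psfc):
    """Relative humidity (%) from specific humidity (kg/kg), temperature (K), pressure (Pa)."""
    # Vapour pressure from specific humidity: q = eps * e / (p - (1 - eps) * e).
    vapour_pressure = q2 * psfc / (EPSILON + (1.0 - EPSILON) * q2)
    # Bolton approximation for saturation vapour pressure over liquid water (Pa).
    t_celsius = t2 - 273.15
    saturation = 611.2 * math.exp(17.67 * t_celsius / (t_celsius + 243.5))
    return 100.0 * vapour_pressure / saturation


# Hypothetical near-surface conditions: 10 g/kg, 20 degrees C, 1013.25 hPa.
rh = relative_humidity_percent(0.010, 293.15, 101325.0)
print(round(rh, 1))
```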

ocr.conus404.rotate_winds_to_earth(ds)[source]#

Rotate grid-relative 10 m winds (U10,V10) to earth-relative components. Uses SINALPHA / COSALPHA convention from WRF.

Parameters:

ds (xr.Dataset) – Input dataset containing ‘U10’, ‘V10’, ‘SINALPHA’, and ‘COSALPHA’.

Returns:

  • earth_u (xr.DataArray) – Earth-relative U component of wind at 10 m.

  • earth_v (xr.DataArray) – Earth-relative V component of wind at 10 m.

Return type:

tuple[DataArray, DataArray]
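The rotation commonly cited for WRF output is a plane rotation by the local grid angle, using the SINALPHA and COSALPHA fields. The scalar sketch below applies that formula; the function name is hypothetical, and the sign convention should be checked against the package source before relying on it.

```python
import math


def rotate_to_earth(u10, v10, sinalpha, cosalpha):
    """Rotate grid-relative wind components to earth-relative components."""
    earth_u = u10 * cosalpha - v10 * sinalpha
    earth_v = v10 * cosalpha + u10 * sinalpha
    return earth_u, earth_v


# Hypothetical grid angle of 0.3 rad and a grid-relative wind of (3, 4) m/s.
alpha = 0.3
earth_u, earth_v = rotate_to_earth(3.0, 4.0, math.sin(alpha), math.cos(alpha))
print(earth_u, earth_v)
```

Because this is a pure rotation, the wind speed is unchanged; only the direction shifts by the grid angle.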

ocr.conus404.compute_wind_speed_and_direction(u10, v10)[source]#

Derive hourly wind speed (m/s) and direction (degrees from) using xclim.

Parameters:
  • u10 (xr.DataArray) – U component of wind at 10 m (m/s).

  • v10 (xr.DataArray) – V component of wind at 10 m (m/s).

Returns:

wind_ds – Dataset containing wind speed (‘sfcWind’) and wind direction (‘sfcWindfromdir’).

Return type:

xr.Dataset
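Wind speed is the magnitude of the (u, v) vector, and the meteorological "from" direction is the compass bearing the wind blows from, measured clockwise from north. The scalar sketch below shows the usual conversion; the function name is hypothetical, and xclim, which the actual function uses, may apply additional conventions such as special handling of calm winds.

```python
import math


def wind_speed_and_from_direction(u10, v10):
    """Return (speed in m/s, direction in degrees the wind blows from)."""
    speed = math.hypot(u10, v10)
    # atan2(u, v) is the bearing the wind blows toward; add 180 for "from".
    direction = (180.0 + math.degrees(math.atan2(u10, v10))) % 360.0
    return speed, direction


# A wind blowing toward the west (u < 0, v = 0) comes from the east.
speed, direction = wind_speed_and_from_direction(-10.0, 0.0)
print(speed, direction)
```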


Utilities

General utilities

Helpers for DuckDB (extension loading, S3 secrets), vector sampling, and file transfer.

ocr.utils.get_temp_dir()[source]#

Get optimal temporary directory path for the current environment.

Returns the current working directory if running in /scratch (e.g., on Coiled clusters), otherwise returns None to use the system default temp directory.

On Coiled clusters, /scratch is bind-mounted directly to the NVMe disk, avoiding Docker overlay filesystem overhead and providing better I/O performance and more available space compared to /tmp which sits on the Docker overlay.

Returns:

Current working directory if in /scratch, None otherwise (uses system default).

Return type:

Path | None

Examples

>>> import tempfile
>>> from ocr.utils import get_temp_dir
>>> with tempfile.TemporaryDirectory(dir=get_temp_dir()) as tmpdir:
...     # tmpdir will be in /scratch on Coiled, system temp otherwise
...     pass
ocr.utils.apply_s3_creds(region='us-west-2', *, con=None)[source]#

Register AWS credentials as a DuckDB SECRET on the given connection.

Parameters:
  • region (str) – AWS region used for S3 access.

  • con (duckdb.DuckDBPyConnection | None) – Connection to apply credentials to. If None, uses duckdb’s default connection (duckdb.sql), preserving prior behavior.

ocr.utils.install_load_extensions(aws=True, spatial=True, httpfs=True, con=None)[source]#

Install and load DuckDB extensions.

Parameters:
  • aws (bool, optional) – Install and load AWS extension, by default True

  • spatial (bool, optional) – Install and load SPATIAL extension, by default True

  • httpfs (bool, optional) – Install and load HTTPFS extension, by default True

  • con (duckdb.DuckDBPyConnection | None) – Connection to apply extensions to. If None, uses duckdb’s default connection.

ocr.utils.extract_points(gdf, da)[source]#

Extract/sample points from a GeoDataFrame to an Xarray DataArray.

Parameters:
  • gdf (gpd.GeoDataFrame) – Input geopandas GeoDataFrame. Geometry should be points

  • da (xr.DataArray) – Input Xarray DataArray

Returns:

DataArray with geometry sampled

Return type:

xr.DataArray

Notes

Computing centroids in a geographic CRS triggers the following GeoPandas warning:

UserWarning: Geometry is in a geographic CRS. Results from ‘centroid’ are likely incorrect. Use ‘GeoSeries.to_crs()’ to re-project geometries to a projected CRS before this operation.

Because a building footprint is relatively small, its centroid should shift only very slightly when calculated in EPSG:4326 rather than EPSG:5070.

TODO: Should/can this be a DataArray for typing

ocr.utils.bbox_tuple_from_xarray_extent(ds, x_name='x', y_name='y')[source]#

Creates a bounding box from an Xarray Dataset extent.

Parameters:
  • ds (xr.Dataset) – Input Xarray Dataset

  • x_name (str, optional) – Name of x coordinate, by default ‘x’

  • y_name (str, optional) – Name of y coordinate, by default ‘y’

Returns:

Bounding box tuple in the form: (x_min, y_min, x_max, y_max)

Return type:

tuple
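The (x_min, y_min, x_max, y_max) ordering can be illustrated with plain coordinate sequences. This is a sketch under assumptions: bbox_from_coords is a hypothetical helper, and it uses the raw coordinate values. Whether the real function measures the extent from pixel centers or pixel edges is not stated here.

```python
def bbox_from_coords(xs, ys):
    """Return (x_min, y_min, x_max, y_max) for two 1D coordinate sequences.

    Hypothetical helper for illustration only.
    """
    xs = list(xs)
    ys = list(ys)
    if not xs or not ys:
        raise ValueError("both coordinate sequences must be non-empty")
    # min/max make the result independent of coordinate ordering,
    # which matters for rasters whose y axis is stored north-to-south.
    return (min(xs), min(ys), max(xs), max(ys))


bbox = bbox_from_coords([-120.0, -119.5, -119.0], [40.0, 39.5, 39.0])
print(bbox)
```

With an xarray Dataset, the same helper could be fed ds['x'].values and ds['y'].values, assuming the default coordinate names.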

ocr.utils.copy_or_upload(src, dest, overwrite=True, chunk_size=16777216)[source]#

Copy a single file from src to dest using UPath/fsspec.

  • Uses server-side copy if available on the same filesystem (e.g., s3->s3).

  • Falls back to streaming copy otherwise.

  • Creates destination parent directories when supported.

Parameters:
  • src (UPath) – Source UPath

  • dest (UPath) – Destination UPath (file path; if pointing to a directory-like path, src.name is appended)

  • overwrite (bool) – If False, raises if dest exists

  • chunk_size (int) – Buffer size for streaming copies

Return type:

None
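The streaming fallback mentioned above amounts to reading and writing fixed-size chunks. The sketch below shows that loop on generic file-like objects. stream_copy is a hypothetical name, and the real function works on UPath objects rather than in-memory buffers.

```python
import io


def stream_copy(src_file, dest_file, chunk_size=16 * 1024 * 1024):
    """Copy bytes from src_file to dest_file in chunks; return bytes copied.

    Hypothetical helper; the default chunk size equals 16777216 bytes.
    """
    total = 0
    while True:
        chunk = src_file.read(chunk_size)
        if not chunk:  # an empty read signals end of stream
            break
        dest_file.write(chunk)
        total += len(chunk)
    return total


source = io.BytesIO(b"x" * 10_000)
target = io.BytesIO()
copied = stream_copy(source, target, chunk_size=4096)
print(copied)
```

Chunked copying keeps memory use bounded by chunk_size regardless of file size, which is why the buffer size is exposed as a parameter.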

ocr.utils.geo_sel(ds, *, lon=None, lat=None, bbox=None, method='nearest', tolerance=None, crs_wkt=None)[source]#

Geographic selection helper.

Exactly one of the following must be provided:
  • (lon AND lat)

  • (lons AND lats)

  • bbox=(west, south, east, north)

Parameters:
  • ds (xr.Dataset) – Input dataset with x, y coordinates and a valid ‘crs’ variable with WKT

  • lon (float, optional) – Longitude of point to select, by default None

  • lat (float, optional) – Latitude of point to select, by default None

  • bbox (tuple, optional) – Bounding box to select (west, south, east, north), by default None

  • method (str, optional) – Method to use for point selection, by default ‘nearest’

  • tolerance (float, optional) – Tolerance (in units of the dataset’s CRS) for point selection, by default None

  • crs_wkt (str, optional) – WKT string for the dataset’s CRS. If None, attempts to read from ds.crs.attrs[‘crs_wkt’].

Returns:

  • Single point: time dimension only

  • Multiple points: adds ‘point’ dimension

  • BBox: retains y, x subset

Return type:

xarray.Dataset

Testing utilities#

Snapshot testing extensions for xarray and GeoPandas.

class ocr.testing.XarraySnapshotExtension[source]#

Bases: SingleFileSnapshotExtension

Snapshot extension for xarray DataArrays and Datasets stored as zarr.

Supports both local and remote (S3) storage via environment variable configuration:
  • SNAPSHOT_STORAGE_PATH: Base path for snapshots (local or s3://bucket/path). Default: s3://carbonplan-scratch/snapshots (configured in tests/conftest.py)

Examples

# Use default S3 storage (no env var needed)
pytest tests/test_snapshot.py --snapshot-update

# Override with local storage
SNAPSHOT_STORAGE_PATH=tests/__snapshots__ pytest tests/

# Override with different S3 bucket
SNAPSHOT_STORAGE_PATH=s3://my-bucket/snapshots pytest tests/

file_extension = 'zarr'#
classmethod dirname(*, test_location)[source]#

Return the directory for storing snapshots.

classmethod get_snapshot_name(*, test_location, index=0)[source]#

Generate snapshot name based on test name.

Sanitizes the test name to replace problematic characters (e.g., brackets from parametrized tests) with underscores for valid file paths.
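A sanitization step of this kind can be sketched with a single regular expression. The set of characters treated as safe here is an assumption, and sanitize_snapshot_name is a hypothetical helper rather than the method's actual code.

```python
import re


def sanitize_snapshot_name(test_name):
    """Replace every character outside [A-Za-z0-9_.-] with an underscore.

    Hypothetical helper; the allowed character set is an assumption.
    """
    return re.sub(r"[^A-Za-z0-9_.-]", "_", test_name)


# Parametrized pytest IDs contain brackets, which are awkward in paths.
print(sanitize_snapshot_name("test_risk[fire-y10_x2]"))
```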

classmethod get_location(*, test_location, index=0)[source]#

Get the full snapshot location path.

Override to properly handle S3 paths using upath instead of os.path.join.

serialize(data, **kwargs)[source]#

Convert DataArray to Dataset for consistent zarr storage. Returns the data unchanged.

matches(*, serialized_data, snapshot_data)[source]#

Check if serialized data matches snapshot using approximate comparison.

Uses assert_allclose instead of assert_equal to handle platform-specific numerical differences from OpenCV and scipy operations between macOS and Linux.

read_snapshot_data_from_location(*, snapshot_location, snapshot_name, session_id)[source]#

Read zarr snapshot from disk.

classmethod write_snapshot_collection(*, snapshot_collection)[source]#

Write snapshot collection to zarr format (local or remote).

diff_lines(serialized_data, snapshot_data)[source]#

Generate diff lines for test output.

class ocr.testing.GeoDataFrameSnapshotExtension[source]#

Bases: SingleFileSnapshotExtension

Snapshot extension for GeoPandas GeoDataFrames stored as parquet.

Supports both local and remote (S3) storage via environment variable configuration:
  • SNAPSHOT_STORAGE_PATH: Base path for snapshots (local or s3://bucket/path). Default: s3://carbonplan-scratch/snapshots (configured in tests/conftest.py)

Examples

# Use default S3 storage (no env var needed)
pytest tests/test_snapshot.py --snapshot-update

# Override with local storage
SNAPSHOT_STORAGE_PATH=tests/__snapshots__ pytest tests/

# Override with different S3 bucket
SNAPSHOT_STORAGE_PATH=s3://my-bucket/snapshots pytest tests/

file_extension = 'parquet'#
classmethod dirname(*, test_location)[source]#

Return the directory for storing snapshots.

classmethod get_snapshot_name(*, test_location, index=0)[source]#

Generate snapshot name based on test name.

Sanitizes the test name to replace problematic characters (e.g., brackets from parametrized tests) with underscores for valid file paths.

classmethod get_location(*, test_location, index=0)[source]#

Get the full snapshot location path.

Override to properly handle S3 paths using upath instead of os.path.join.

serialize(data, **kwargs)[source]#

Validate that data is a GeoDataFrame. Returns the data unchanged.

matches(*, serialized_data, snapshot_data)[source]#

Check if serialized data matches snapshot using GeoDataFrame comparison.

read_snapshot_data_from_location(*, snapshot_location, snapshot_name, session_id)[source]#

Read parquet snapshot from disk.

classmethod write_snapshot_collection(*, snapshot_collection)[source]#

Write snapshot collection to parquet format (local or remote).

diff_lines(serialized_data, snapshot_data)[source]#

Generate diff lines for test output.


Risk analysis#

Fire risk#

Core fire/wind risk utilities used by the pipeline (kernels, wind classification, risk composition).

ocr.risks.fire.haversine(lon1, lat1, lon2, lat2)[source]#

Calculate the great circle distance in meters between two points on the earth (specified in decimal degrees).

Uses the haversine formula from: https://stackoverflow.com/questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points
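For reference, the haversine formula can be written in a few lines of standard-library Python. This is an independent sketch, not the package's code: the function name haversine_m and the Earth radius constant are assumptions, and the argument order mirrors the signature above.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (assumed value)


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two points in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# One degree of latitude along a meridian is roughly 111 km.
one_degree = haversine_m(0.0, 0.0, 0.0, 1.0)
print(round(one_degree))
```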

ocr.risks.fire.get_grid_spacing_info(da)[source]#

Extract grid spacing information from a DataArray.

Returns the center coordinates and the spacing (in degrees) between pixels for latitude and longitude dimensions.

Parameters:

da (xr.DataArray) – DataArray with latitude and longitude coordinates.

Returns:

  • latitude (float) – Latitude at the center of the grid.

  • longitude (float) – Longitude at the center of the grid.

  • latitude_increment (float) – Spacing between latitude pixels in degrees.

  • longitude_increment (float) – Spacing between longitude pixels in degrees.

ocr.risks.fire.generate_weights(method='skewed', kernel_size=81.0, circle_diameter=35.0, direction='W', lat_pixel_size_meters=34, lon_pixel_size_meters=25)[source]#

Generate a 2D array of weights for a kernel.

Parameters:
  • method (str, optional) – The method to use for generating weights. Options are ‘skewed’ or ‘circular_focal_mean’. ‘skewed’ generates an elliptical kernel to simulate wind directionality. ‘circular_focal_mean’ generates a circular kernel, by default ‘skewed’

  • kernel_size (float, optional) – The size of the kernel, by default 81.0

  • circle_diameter (float, optional) – The diameter of the circle, by default 35.0

  • direction (str, optional) – Wind direction (‘N’, ‘NE’, ‘E’, ‘SE’, ‘S’, ‘SW’, ‘W’, ‘NW’), by default ‘W’

  • lat_pixel_size_meters (float, optional) – Physical size of one pixel in the latitude direction in meters, by default 34

  • lon_pixel_size_meters (float, optional) – Physical size of one pixel in the longitude direction in meters, by default 25

Returns:

weights – A 2D array of weights for the kernel.

Return type:

np.ndarray
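To make the ‘circular_focal_mean’ option concrete, the sketch below builds a small circular kernel with plain Python lists. It rests on several assumptions: the function name is hypothetical, distances are measured in pixels rather than meters, and the weights are normalized to sum to 1. None of these details are confirmed by the reference text, and the elliptical ‘skewed’ kernel is not covered.

```python
import math


def circular_focal_mean_weights(kernel_size, circle_diameter):
    """Square kernel with equal weights inside a centered circle, summing to 1.

    Hypothetical helper; units are pixels and normalization is an assumption.
    """
    size = int(kernel_size)
    center = (size - 1) / 2
    radius = circle_diameter / 2
    mask = [
        [1.0 if math.hypot(row - center, col - center) <= radius else 0.0
         for col in range(size)]
        for row in range(size)
    ]
    total = sum(map(sum, mask))
    return [[value / total for value in row] for row in mask]


weights = circular_focal_mean_weights(kernel_size=5, circle_diameter=3)
for row in weights:
    print([round(value, 3) for value in row])
```

A normalized kernel makes the convolution a local mean, so the smoothed field stays on the same scale as the input.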

ocr.risks.fire.generate_wind_directional_kernels(kernel_size=81.0, circle_diameter=35.0, latitude=38.0, longitude=-100, longitude_increment=0.0003, latitude_increment=0.0003)[source]#

Generate a dictionary of 2D arrays of weights for elliptical kernels oriented in different directions.

Parameters:
  • kernel_size (float, optional) – The size of the kernel, by default 81.0

  • circle_diameter (float, optional) – The diameter of the circle, by default 35.0

Returns:

kernels – A dictionary of 2D arrays of weights for elliptical kernels oriented in different directions.

Return type:

dict[str, np.ndarray]

ocr.risks.fire.apply_wind_directional_convolution(da, iterations=3, kernel_size=81.0, circle_diameter=35.0, latitude=34.0, longitude=100.0, latitude_increment=0.0003, longitude_increment=0.0003)[source]#

Apply a directional convolution to a DataArray.

Parameters:
  • da (xr.DataArray) – The DataArray to apply the convolution to.

  • iterations (int, optional) – The number of iterations to apply the convolution, by default 3

  • kernel_size (float, optional) – The size of the kernel, by default 81.0

  • circle_diameter (float, optional) – The diameter of the circle, by default 35.0

Returns:

ds – The Dataset with the directional convolution applied

Return type:

xr.Dataset

ocr.risks.fire.classify_wind_directions(wind_direction_ds)[source]#

Classify wind directions into 8 cardinal directions (0-7). The classification is:
  • 0: North (337.5-22.5)

  • 1: Northeast (22.5-67.5)

  • 2: East (67.5-112.5)

  • 3: Southeast (112.5-157.5)

  • 4: South (157.5-202.5)

  • 5: Southwest (202.5-247.5)

  • 6: West (247.5-292.5)

  • 7: Northwest (292.5-337.5)

Parameters:

wind_direction_ds (xarray.DataArray) – DataArray containing wind direction in degrees (0-360)

Returns:

result – DataArray with wind directions classified as integers 0-7

Return type:

xarray.DataArray
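The sector table above corresponds to shifting the angle by half a sector (22.5 degrees) and dividing by the sector width (45 degrees). The scalar sketch below shows that arithmetic. classify_direction is a hypothetical helper, and assigning an angle that falls exactly on a boundary to the higher sector is an assumption.

```python
DIRECTION_LABELS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def classify_direction(degrees):
    """Map a direction in degrees to a sector index 0-7 (0 = North).

    Hypothetical helper; boundary angles go to the higher sector.
    """
    shifted = (degrees + 22.5) % 360.0  # North now spans [0, 45)
    return int(shifted // 45.0)


for angle in (0.0, 350.0, 90.0, 270.0, 315.0):
    index = classify_direction(angle)
    print(angle, index, DIRECTION_LABELS[index])
```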

ocr.risks.fire.create_weighted_composite_bp_map(bp, wind_direction_distribution, *, distribution_direction_dim='wind_direction', weight_sum_tolerance=1e-05)[source]#

Create a weighted composite burn probability map using wind direction distribution.

Parameters:
  • bp (xr.Dataset) – Dataset containing 9 directional burn probability layers with variables named [‘N’,’NE’,’E’,’SE’,’S’,’SW’,’W’,’NW’,’circular’] produced by apply_wind_directional_convolution.

  • wind_direction_distribution (xr.DataArray) – Probability distribution over 8 cardinal directions with dimension ‘wind_direction’ and length 8, matching direction labels: [‘N’,’NE’,’E’,’SE’,’S’,’SW’,’W’,’NW’] (order must align). Values should sum to 1 where fire-weather hours exist; may be all 0 where none exist.

  • distribution_direction_dim (str, optional) – Name of the dimension in wind_direction_distribution that holds the direction labels, by default ‘wind_direction’.

  • weight_sum_tolerance (float, optional) – Tolerance for deviation from 1.0 in the sum of weights, by default 1e-05.

Returns:

weighted – Weighted composite burn probability with same spatial dims as inputs. Name: ‘wind_weighted_bp’. Missing (all-zero) distributions yield NaN.

Return type:

xr.DataArray
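For a single pixel, the weighted composite reduces to a dot product between the direction weights and the directional burn probabilities. The sketch below illustrates that reduction. weighted_bp is a hypothetical helper, raising an error when the weights miss the tolerance is an assumption, and the ‘circular’ layer is ignored here because its role in the composite is not described.

```python
import math

DIRECTIONS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def weighted_bp(bp_by_direction, weights, tolerance=1e-5):
    """Weighted burn probability for one pixel.

    Hypothetical helper: returns NaN for an all-zero distribution and
    raises if the weights do not sum to 1 within the tolerance (assumed).
    """
    total = sum(weights)
    if total == 0:
        return math.nan
    if abs(total - 1.0) > tolerance:
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return sum(w * bp_by_direction[d] for d, w in zip(DIRECTIONS, weights))


bp = {direction: 0.01 for direction in DIRECTIONS}
bp["W"] = 0.05  # westerly kernel gives a higher burn probability here
result = weighted_bp(bp, [0, 0, 0, 0, 0, 0.25, 0.75, 0])
print(result)
```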

ocr.risks.fire.create_wind_informed_burn_probability(wind_direction_distribution_30m_4326, riley_270m_5070)[source]#

Create wind-informed burn probability dataset by applying directional convolution and creating a weighted composite burn probability map.

Parameters:
  • wind_direction_distribution_30m_4326 (xr.DataArray) – Wind direction distribution data at 30m resolution in EPSG:4326 projection.

  • riley_270m_5070 (xr.DataArray) – Riley et al. (2011) burn probability data at 270m resolution in EPSG:5070 projection.

Returns:

smoothed_final_bp – Smoothed wind-informed burn probability data at 30m resolution in EPSG:4326 projection.

Return type:

xr.DataArray

ocr.risks.fire.calculate_wind_adjusted_risk(*, x_slice, y_slice, buffer=0.15)[source]#

Calculate wind-adjusted fire risk using climate run and wildfire risk datasets.

Parameters:
  • x_slice (slice) – Slice object for selecting longitude range.

  • y_slice (slice) – Slice object for selecting latitude range.

  • buffer (float, optional) – Buffer size in degrees to add around the region for edge effect handling (default 0.15). For 30m EPSG:4326 data, 0.15 degrees ≈ 16.7 km ≈ 540 pixels. This buffer ensures neighborhood operations (convolution, Gaussian smoothing) have adequate context at boundaries.

Returns:

fire_risk – Dataset containing wind-adjusted fire risk variables.

Return type:

xr.Dataset
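The buffer arithmetic quoted for the buffer parameter can be reproduced directly. The sketch assumes a spherical Earth with a 6371 km radius for the degree-to-kilometer conversion and a pixel spacing of one arc-second (1/3600 degree) for the pixel count. Both are assumptions chosen because they reproduce the stated figures.

```python
import math

buffer_deg = 0.15

# Length of one degree of arc on a sphere of radius 6371 km (assumed).
KM_PER_DEGREE = 2 * math.pi * 6371.0 / 360
# One arc-second pixels: 3600 pixels per degree (assumed spacing).
PIXELS_PER_DEGREE = 3600

buffer_km = buffer_deg * KM_PER_DEGREE
buffer_pixels = round(buffer_deg * PIXELS_PER_DEGREE)

print(f"{buffer_km:.1f} km, {buffer_pixels} pixels")
```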

ocr.risks.fire.direction_histogram(data_array)[source]#

Compute direction histogram on xarray DataArray with dask chunks.

Parameters:

data_array (xarray.DataArray) – Input data array containing direction indices (expected to be integers 0-7)

Returns:

Normalized histogram counts as a probability distribution

Return type:

xarray.DataArray
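The normalization step can be shown on a plain list of direction indices. This is a one-dimensional sketch with a hypothetical helper name. The real function operates on dask-backed DataArrays, and returning all zeros for an empty input is an assumption.

```python
from collections import Counter


def direction_histogram_1d(indices, n_bins=8):
    """Return the fraction of samples falling in each direction bin 0..n_bins-1.

    Hypothetical helper; an empty input yields all zeros (assumed).
    """
    counts = Counter(indices)
    total = sum(counts[i] for i in range(n_bins))
    if total == 0:
        return [0.0] * n_bins
    return [counts[i] / total for i in range(n_bins)]


hist = direction_histogram_1d([0, 0, 1, 7])
print(hist)
```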

ocr.risks.fire.fosberg_fire_weather_index(hurs, T2, sfcWind)[source]#

Calculate the Fosberg Fire Weather Index (FFWI) based on relative humidity, temperature, and wind speed.

The formula is taken from: https://wikifire.wsl.ch/tiki-indexb1d5.html?page=Fosberg+fire+weather+index&structure=Fire

hurs, T2, and sfcWind are arrays.

Parameters:
  • hurs (xr.DataArray) – Relative humidity in percentage (0-100).

  • T2 (xr.DataArray) – Temperature

  • sfcWind (xr.DataArray) – Wind speed in meters per second.

Returns:

Fosberg Fire Weather Index (FFWI).

Return type:

xr.DataArray
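For orientation, the commonly published scalar form of the FFWI is sketched below. It is not taken from the package source: the function names are hypothetical, and this formulation expects temperature in degrees Fahrenheit and wind speed in miles per hour, so inputs in other units (such as the m/s wind speed listed above) would need converting first. How the package handles units is not shown here.

```python
import math


def equilibrium_moisture(rh, temp_f):
    """Equilibrium moisture content (%) from relative humidity (%) and temperature (F)."""
    if rh < 10:
        return 0.03229 + 0.281073 * rh - 0.000578 * rh * temp_f
    if rh <= 50:
        return 2.22749 + 0.160107 * rh - 0.01478 * temp_f
    return 21.0606 + 0.005565 * rh**2 - 0.00035 * rh * temp_f - 0.483199 * rh


def ffwi(rh, temp_f, wind_mph):
    """Fosberg Fire Weather Index for scalar inputs (hypothetical helper)."""
    m = equilibrium_moisture(rh, temp_f) / 30.0
    eta = 1 - 2 * m + 1.5 * m**2 - 0.5 * m**3  # moisture damping coefficient
    return eta * math.sqrt(1 + wind_mph**2) / 0.3002


dry_windy = ffwi(rh=10, temp_f=80, wind_mph=20)
humid_windy = ffwi(rh=60, temp_f=80, wind_mph=20)
print(dry_windy > humid_windy)
```

The index rises with wind speed and falls as humidity increases, which matches its use for flagging fire-weather hours.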

ocr.risks.fire.compute_wind_direction_distribution(direction, fire_weather_mask)[source]#

Compute the wind direction distribution during fire weather conditions.

Parameters:
  • direction (xr.DataArray) – Wind direction in degrees (0-360).

  • fire_weather_mask (xr.DataArray) – Boolean mask indicating fire weather conditions.

Returns:

wind_direction_hist – Wind direction histogram during fire weather conditions.

Return type:

xr.Dataset

ocr.risks.fire.compute_modal_wind_direction(distribution)[source]#

Compute the modal wind direction from the wind direction distribution.

Parameters:

distribution (xr.DataArray) – Wind direction distribution.

Returns:

mode – Modal wind direction.

Return type:

xr.Dataset

ocr.risks.fire.rps_to_score(rps)[source]#

Convert RPS (Risk Percent to Structures) value(s) to a categorical fire risk score.

The scoring system uses 11 categories (0–10) with bin boundaries designed so that higher scores are increasingly rare: each higher score encompasses a progressively smaller share of the building population.

Parameters:

rps (float or array-like) – RPS value(s) in percent. Must be in the range [0, 100].

Returns:

Risk score(s) in the range [0, 10].

Return type:

int or numpy.ndarray

Examples

>>> rps_to_score(0.0)
0
>>> rps_to_score(0.005)
1
>>> rps_to_score(100.0)
10
>>> import numpy as np
>>> rps_to_score(np.array([0.0, 0.015, 3.5]))
array([ 0,  2, 10])

Internal pipeline modules#

Internal API

These modules are used internally by the pipeline and are not intended for direct public consumption. They are documented here for completeness and advanced use cases.

Batch managers#

Orchestration backends for local and Coiled execution.

class ocr.deploy.managers.AbstractBatchManager(*, debug=False)[source]#

Bases: BaseModel

Abstract base class for batch managers.

debug: bool#
submit_job(command, name, kwargs)[source]#

Submit a batch job.

wait_for_completion(exit_on_failure=False)[source]#

Wait for the batch job to complete.

model_config = {}#

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

class ocr.deploy.managers.CoiledBatchManager(*, debug=False, status_check_int=10, job_limit=1000, job_ids=<factory>)[source]#

Bases: AbstractBatchManager

Coiled batch manager for managing batch jobs.

status_check_int: int#
job_limit: int#
job_ids: list[str]#
submit_job(command, name, kwargs)[source]#

Submit a job to Coiled batch.

Parameters:
  • command (str) – The command to run.

  • name (str) – The name of the job.

  • kwargs (dict) – Additional keyword arguments to pass to coiled.batch.run.

Returns:

job_id – The ID of the submitted job.

Return type:

str

wait_for_completion(exit_on_failure=False)[source]#

Wait for all tracked jobs to complete.

Parameters:

exit_on_failure (bool, default False) – If True, raise an Exception immediately when a job failure is detected.

Returns:

completed, failed – A tuple of (completed_job_ids, failed_job_ids). If exit_on_failure is True and a failure is encountered the method will raise before returning.

Return type:

tuple[set[str], set[str]]

model_config = {}#

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

class ocr.deploy.managers.LocalBatchManager(*, debug=False, status_check_int=1, max_workers=4, jobs=<factory>)[source]#

Bases: AbstractBatchManager

Local batch manager for running jobs locally using subprocess.

status_check_int: int#
max_workers: int#
jobs: dict[str, dict]#
model_post_init(_LocalBatchManager__context)[source]#

Initialize the thread pool executor after model creation.

submit_job(command, name, kwargs)[source]#

Submit a job to run locally.

Parameters:
  • command (str) – The command to run.

  • name (str) – The name of the job.

  • kwargs (dict) – Additional keyword arguments to pass to subprocess.run.

Returns:

job_id – The ID of the submitted job.

Return type:

str

wait_for_completion(exit_on_failure=False)[source]#

Wait for all tracked jobs to complete.

Parameters:

exit_on_failure (bool, default False) – If True, raise an Exception immediately when a job failure is detected.

Returns:

completed, failed – A tuple of (completed_job_ids, failed_job_ids). If exit_on_failure is True and a failure is encountered the method will raise before returning.

Return type:

tuple[set[str], set[str]]

model_config = {}#

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
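The submit-then-wait pattern of a local batch manager can be sketched with a thread pool and subprocess. This is a standalone illustration, not the class's implementation: run_jobs is a hypothetical function, jobs are given as argument lists rather than command strings, and only the (completed, failed) return shape follows the description above.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_jobs(commands, max_workers=4):
    """Run each argv list in a worker thread; return (completed, failed) job IDs.

    Hypothetical helper: commands maps a job ID to an argument list.
    """
    completed, failed = set(), set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(subprocess.run, argv, capture_output=True): job_id
            for job_id, argv in commands.items()
        }
        for future in as_completed(futures):
            job_id = futures[future]
            try:
                ok = future.result().returncode == 0
            except OSError:  # e.g. the executable could not be started
                ok = False
            (completed if ok else failed).add(job_id)
    return completed, failed


completed, failed = run_jobs({
    "ok": [sys.executable, "-c", "pass"],
    "bad": [sys.executable, "-c", "raise SystemExit(3)"],
})
print(sorted(completed), sorted(failed))
```

Collecting failures into a set instead of raising on the first one corresponds to exit_on_failure=False. Raising inside the loop would give the fail-fast variant.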

CLI application#

Command-line interface exposed as the ocr command. For detailed usage and options, see the Data Pipeline guide.

ocr#

Run OCR deployment pipeline on Coiled

Usage

ocr [OPTIONS] COMMAND [ARGS]...

Options

--install-completion#

Install completion for the current shell.

--show-completion#

Show completion for the current shell, to copy it or customize the installation.

aggregate-region-risk-summary-stats#

Generate time-horizon based statistical summaries for county and tract level PMTiles creation

Usage

ocr aggregate-region-risk-summary-stats [OPTIONS]

Options

-e, --env-file <env_file>#

Path to the environment variables file. These will be used to set up the OCRConfiguration

-p, --platform <platform>#

If set, schedule this command on the specified platform instead of running inline.

Options:

coiled | local

--vm-type <vm_type>#

Coiled VM type override (Coiled only).

Default:

'm8g.16xlarge'

create-building-centroid-pmtiles#

Create building centroid PMTiles from the consolidated geoparquet file.

Usage

ocr create-building-centroid-pmtiles [OPTIONS]

Options

-e, --env-file <env_file>#

Path to the environment variables file. These will be used to set up the OCRConfiguration

-p, --platform <platform>#

If set, schedule this command on the specified platform instead of running inline.

Options:

coiled | local

--vm-type <vm_type>#

Coiled VM type override (Coiled only).

Default:

'c8g.8xlarge'

--disk-size <disk_size>#

Disk size in GB (Coiled only).

Default:

250

create-building-pmtiles#

Create PMTiles from the consolidated geoparquet file.

Usage

ocr create-building-pmtiles [OPTIONS]

Options

-e, --env-file <env_file>#

Path to the environment variables file. These will be used to set up the OCRConfiguration

-p, --platform <platform>#

If set, schedule this command on the specified platform instead of running inline.

Options:

coiled | local

--vm-type <vm_type>#

Coiled VM type override (Coiled only).

Default:

'c8g.8xlarge'

--disk-size <disk_size>#

Disk size in GB (Coiled only).

Default:

250

create-pyramid#

Create an ndpyramid / multiscale zarr pyramid for web visualization.

Usage

ocr create-pyramid [OPTIONS]

Options

-e, --env-file <env_file>#

Path to the environment variables file. These will be used to set up the OCRConfiguration

-p, --platform <platform>#

If set, schedule this command on the specified platform instead of running inline.

Options:

coiled | local

--vm-type <vm_type>#

Coiled VM type override (Coiled only).

Default:

'm8g.16xlarge'

create-regional-pmtiles#

Create PMTiles for regional risk statistics (counties and tracts).

Usage

ocr create-regional-pmtiles [OPTIONS]

Options

-e, --env-file <env_file>#

Path to the environment variables file. These will be used to set up the OCRConfiguration

-p, --platform <platform>#

If set, schedule this command on the specified platform instead of running inline.

Options:

coiled | local

--vm-type <vm_type>#

Coiled VM type override (Coiled only).

Default:

'c8g.8xlarge'

--disk-size <disk_size>#

Disk size in GB (Coiled only).

Default:

250

ingest-data#

Ingest and process input datasets

Usage

ocr ingest-data [OPTIONS] COMMAND [ARGS]...

download#

Download raw source data for a dataset.

Usage

ocr ingest-data download [OPTIONS] DATASET

Options

--dry-run#

Preview operations without executing

Default:

False

--debug#

Enable debug logging

Default:

False

Arguments

DATASET#

Required argument

Name of the dataset to download

list-datasets#

List all available datasets that can be ingested.

Usage

ocr ingest-data list-datasets [OPTIONS]

process#

Process downloaded data and upload to S3/Icechunk.

Usage

ocr ingest-data process [OPTIONS] DATASET

Options

--dry-run#

Preview operations without executing

Default:

False

--use-coiled#

Use Coiled for distributed processing

Default:

False

--software <coiled_software>#

Software environment to use (required if --use-coiled is set)

--debug#

Enable debug logging

Default:

False

--overture-data-type <overture_data_type>#

For overture-maps: which data to process (buildings, addresses, or both)

Default:

'both'

--census-geography-type <census_geography_type>#

For census-tiger: which geography to process (blocks, tracts, counties, or all)

Default:

'all'

--census-subset-states <census_subset_states>#

For census-tiger: subset of states to process (e.g., California Oregon)

Arguments

DATASET#

Required argument

Name of the dataset to process

run-all#

Run the complete pipeline: download, process, and cleanup.

Usage

ocr ingest-data run-all [OPTIONS] DATASET

Options

--dry-run#

Preview operations without executing

Default:

False

--use-coiled#

Use Coiled for distributed processing

Default:

False

--debug#

Enable debug logging

Default:

False

--overture-data-type <overture_data_type>#

For overture-maps: which data to process (buildings, addresses, or both)

Default:

'both'

--census-geography-type <census_geography_type>#

For census-tiger: which geography to process (blocks, tracts, counties, or all)

Default:

'all'

--census-subset-states <census_subset_states>#

For census-tiger: subset of states to process (e.g., California Oregon)

Arguments

DATASET#

Required argument

Name of the dataset to process

partition-buildings#

Partition buildings geoparquet by state and county FIPS codes.

Usage

ocr partition-buildings [OPTIONS]

Options

-e, --env-file <env_file>#

Path to the environment variables file. These will be used to set up the OCRConfiguration

-p, --platform <platform>#

If set, schedule this command on the specified platform instead of running inline.

Options:

coiled | local

--vm-type <vm_type>#

Coiled VM type override (Coiled only).

Default:

'c8g.12xlarge'

process-region#

Calculate and write risk for a given region to Icechunk CONUS template.

Usage

ocr process-region [OPTIONS] REGION_ID

Options

-e, --env-file <env_file>#

Path to the environment variables file. These will be used to set up the OCRConfiguration

-t, --risk-type <risk_type>#

Type of risk to calculate

Default:

<RiskType.FIRE: 'fire'>

Options:

fire

-p, --platform <platform>#

If set, schedule this command on the specified platform instead of running inline.

Options:

coiled | local

--vm-type <vm_type>#

Coiled VM type override (Coiled only).

--init-repo#

Initialize Icechunk repository (if not already initialized).

Default:

False

Arguments

REGION_ID#

Required argument

Region ID to process, e.g., y10_x2

run#

Run the OCR deployment pipeline. This will process regions, aggregate geoparquet files, and create PMTiles layers for the specified risk type.

Usage

ocr run [OPTIONS]

Options

-e, --env-file <env_file>#

Path to the environment variables file. These will be used to set up the OCRConfiguration

-r, --region-id <region_id>#

Region IDs to process, e.g., y10_x2

--all-region-ids#

Process all valid region IDs

Default:

False

-t, --risk-type <risk_type>#

Type of risk to calculate

Default:

<RiskType.FIRE: 'fire'>

Options:

fire

--write-regional-stats#

Write aggregated statistical summaries for each region (one file per region type with stats like averages, medians, percentiles, and histograms)

Default:

False

--create-pyramid#

Create ndpyramid / multiscale zarr for web-visualization

Default:

False

-p, --platform <platform>#

Platform to run the pipeline on

Default:

<Platform.LOCAL: 'local'>

Options:

coiled | local

--wipe#

Wipe the icechunk and vector data storages before running the pipeline

Default:

False

--dispatch-platform <dispatch_platform>#

If set, schedule this run command on the specified platform instead of running inline.

Options:

coiled | local

--vm-type <vm_type>#

VM type override for dispatch-platform (Coiled only).

--process-retries <process_retries>#

Number of times to retry failed process-region tasks (Coiled only). 0 disables retries.

Default:

2

write-aggregated-region-analysis-files#

Write aggregated statistical summaries for each region (CONUS, state, county, tract and block).

Creates one file per region type containing aggregated statistics for ALL regions, including building counts, average/median risk values, percentiles (p90, p95, p99), and histograms. Outputs in geoparquet, geojson, and csv formats.

Usage

ocr write-aggregated-region-analysis-files [OPTIONS]

Options

-e, --env-file <env_file>#

Path to the environment variables file. These will be used to set up the OCRConfiguration

-p, --platform <platform>#

If set, schedule this command on the specified platform instead of running inline.

Options:

coiled | local

--vm-type <vm_type>#

Coiled VM type override (Coiled only).

Default:

'r8g.4xlarge'