Work with data

Work with data#

This guide shows how to access Open Climate Risk output data using Icechunk, a versioned data format for cloud-native geospatial data.

Prerequisites#

Python environment with xarray, icechunk, duckdb, dask and lonboard installed.

What you’ll learn#

How to open and inspect the fire risk raster dataset
How to work with specific variables and spatial subsets
How to open and subset the raster sampled building vector dataset

Raster / Xarray#

Import required libraries#

import icechunk
import xarray as xr

Connect to the Icechunk repository#

Production fire risk data is stored in an Icechunk repository on S3. We’ll connect to version v1.1.0 of the wind-adjusted fire risk output. For valid versions, check out the GitHub releases page.

# Configure S3 storage for the Icechunk repository
version = 'v1.1.0'
storage = icechunk.s3_storage(
    bucket='us-west-2.opendata.source.coop',
    prefix=f'carbonplan/carbonplan-ocr/output/fire-risk/tensor/production/{version}/ocr.icechunk',
    region='us-west-2',
    anonymous=True,
)

# Open the repository
repo = icechunk.Repository.open(storage)

# Create a read-only session on the main branch
session = repo.readonly_session('main')

Open the dataset with xarray#

# Open the dataset
ds = xr.open_dataset(session.store, engine='zarr', chunks={})
ds

Explore the data variables#

The dataset contains wind-adjusted fire risk metrics. Let’s examine the available variables:

# List all data variables
print('Data variables:')
for var in ds.data_vars:
    print(f'  - {var}: {ds[var].attrs.get("long_name", "No description")}')

Select a spatial subset#

Extract data for a specific geographic region using coordinate slicing:

# Example: Select data for California region
california_subset = ds.sel(
    latitude=slice(42, 32),  # Southern to Northern California
    longitude=slice(-125, -114),  # Western to Eastern California
)

california_subset

Vector building dataset#

Next we will open and query the raster sampled building dataset stored in the geoparquet format.

Import required libraries#

Here we’ll import duckdb and load the spatial extension. The combination of duckdb spatial and geoparquet allows us to perform GIS queries that are beyond desktop GIS capabilities.

import duckdb

duckdb.sql("""INSTALL SPATIAL; LOAD SPATIAL; INSTALL HTTPFS; LOAD HTTPFS""")
dataset_uri = f's3://us-west-2.opendata.source.coop/carbonplan/carbonplan-ocr/output/fire-risk/vector/production/{version}/geoparquet/buildings.parquet/**'

Examine the dataset#

The SQL describe command shows us that the dataset contains raster sampled variables as well as geometry columns and regional identifiers.

duckdb.sql(f"""
DESCRIBE
SELECT
   *
FROM
   read_parquet('{dataset_uri}', hive_partitioning = TRUE)
""")

Load the first few rows#

Using the SQL LIMIT command, we can get just the first few rows of the dataset.

duckdb.sql(f"""
SELECT
   *
FROM
   read_parquet('{dataset_uri}', hive_partitioning = TRUE) LIMIT 5""")

Subset by state and county#

This geoparquet is partitioned by state and county using FIPS codes.

Let’s select data in LA County in California.

CA_FIPS_CODE = '06'
LA_FIPS_CODE = '037'

Get a count of the number of records in LA County#

Using the COUNT SQL syntax, we can get the total number of entires in our query.

duckdb.sql(f"""
SELECT
   COUNT(*) AS LA_building_count
FROM
   read_parquet('{dataset_uri}', hive_partitioning = TRUE)
WHERE
   state_fips = '{CA_FIPS_CODE}'
   AND county_fips = '{LA_FIPS_CODE}'""")

Subset by bounding box#

Looks like there are about 3 million building polygons in our query. Let’s subset that further.

We can get a bounding box (bbox) for an area around the Palisades fire.

palisades_bbox = (-118.761864, 34.026381, -118.466263, 34.152972)

palisades_query = duckdb.sql(f"""
SELECT
   rps_2011,
   geometry
FROM
   read_parquet('{dataset_uri}', hive_partitioning = TRUE)
WHERE
   state_fips = '{CA_FIPS_CODE}'
   AND county_fips = '{LA_FIPS_CODE}'
   AND bbox.xmin BETWEEN {palisades_bbox[0]} AND {palisades_bbox[2]}
   AND bbox.ymin BETWEEN {palisades_bbox[1]} AND {palisades_bbox[3]}""")

Visualize our query#

Next we’ll use the python plotting library, lonboard to visualize the wind informed risk in an interactive map.

from lonboard import Map, PolygonLayer
from lonboard.colormap import apply_continuous_cmap
from matplotlib import colormaps

layer = PolygonLayer.from_duckdb(
    palisades_query,
    get_fill_color=apply_continuous_cmap(
        palisades_query.fetchdf()['rps_2011'], colormaps['YlOrRd']
    ),
    pickable=True,
)

m = Map(layer, show_tooltip=True)
m

Work with data

Contents

Work with data#

Prerequisites#

What you’ll learn#

Raster / Xarray#

Import required libraries#

Connect to the Icechunk repository#

Open the dataset with xarray#

Explore the data variables#

Select a spatial subset#

Vector building dataset#

Import required libraries#

Examine the dataset#

Load the first few rows#

Subset by state and county#

Get a count of the number of records in LA County#

Subset by bounding box#

Visualize our query#