Hybrid Icechunk stores for serverless web mapping
This past winter, we saw one of the worst snowpacks in recent memory for the Western U.S., and we wanted to explore how this season compared to last. NOAA provides seasonal snowfall data as NetCDF, a format commonly used for scientific workflows and analysis. Web maps shine for exploring a dataset quickly, but NetCDF files have not traditionally been compatible with web mapping, and transforming them into a web map-compatible format is a lot of work.
We’ve been experimenting with a suite of tools and workflows that make visualizing archival formats like NetCDF in web maps significantly easier. Both the static map above and the interactive demo below render this NOAA data without the aid of a tile server, directly rendering bytes from the NetCDF analysis data.
Web mapping primer
The standard path to visualizing data in a web map relies on map tiles, organized into a gridded pyramid starting with full-resolution data at its base, and reduced-resolution multiscale levels above, typically in the Web Mercator projection. As you pan and zoom, the web map requests the tiles relevant to its current extent and zoom level from a server. These tiles can be generated on-the-fly from the source data, but this comes with performance, compute, and maintenance costs. Alternatively, you can generate tiles ahead of time and store them, but this leads to data duplication and storage costs. Ideally, we would just visualize the source data directly!
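To make the pyramid concrete, here is a minimal sketch of how a map client might work out which tiles cover its current view. It assumes the common XYZ ("slippy map") Web Mercator tiling scheme; the function names `lonlat_to_tile` and `tiles_for_extent` are ours, for illustration only, and are not taken from any particular mapping library.

```python
import math


def lonlat_to_tile(lon: float, lat: float, zoom: int) -> tuple[int, int]:
    """Return the (x, y) index of the Web Mercator tile containing a point."""
    n = 2 ** zoom  # number of tiles along each side at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the eastern or southern edge stay in range.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_for_extent(west: float, south: float, east: float, north: float,
                     zoom: int) -> list[tuple[int, int, int]]:
    """List the (zoom, x, y) tiles needed to cover a bounding box."""
    x_min, y_min = lonlat_to_tile(west, north, zoom)  # top-left tile
    x_max, y_max = lonlat_to_tile(east, south, zoom)  # bottom-right tile
    return [
        (zoom, x, y)
        for x in range(x_min, x_max + 1)
        for y in range(y_min, y_max + 1)
    ]


# Hypothetical extent roughly covering the Western U.S. at zoom level 5.
western_us = tiles_for_extent(-125.0, 31.0, -102.0, 49.0, zoom=5)
```

Each additional zoom level quadruples the number of tiles in the pyramid, which is where the storage and compute costs described above come from.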
Web maps without a tile server
Recently, CarbonPlan released zarr-layer, an open source library for rendering gridded data directly in the browser. Given a Zarr store, zarr-layer reprojects the data and renders it in a web map without requiring an intermediary tile server.
This approach works great if you have full control over your data and can store it as Zarr, but there are petabytes of gridded climate and weather data that are stored in archival file formats such as NetCDF/HDF5, GRIB, TIFF, and others. Fortunately, tools like VirtualiZarr let you read these archival data formats as if they were Zarr. These virtual Zarr references are stored in the Icechunk format and don’t duplicate the data, but instead point to byte ranges inside the original files.
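As a rough illustration of the idea behind virtual references, the sketch below maps chunk keys to byte ranges inside an existing file and reads each chunk without copying the file. This is a conceptual model only: the `ChunkRef` class, the `read_chunk` helper, and the chunk keys are hypothetical and do not reflect the actual VirtualiZarr API or the Icechunk on-disk format. In a cloud setting, the seek-and-read step would typically be an HTTP range request instead.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical reference to one chunk inside an archival file."""
    path: str    # location of the original file
    offset: int  # byte position where the chunk starts
    length: int  # number of bytes in the chunk


def read_chunk(ref: ChunkRef) -> bytes:
    """Read only the referenced byte range, leaving the rest of the file untouched."""
    with open(ref.path, "rb") as f:
        f.seek(ref.offset)
        return f.read(ref.length)


# Stand-in "archival file": a 6-byte header followed by two 4-byte chunks.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"HEADER" + b"AAAA" + b"BBBB")
    archive_path = tmp.name

# The "virtual store" is just a lookup from chunk key to byte range.
manifest = {
    "snow_depth/c/0/0": ChunkRef(archive_path, offset=6, length=4),
    "snow_depth/c/0/1": ChunkRef(archive_path, offset=10, length=4),
}

first = read_chunk(manifest["snow_depth/c/0/0"])
second = read_chunk(manifest["snow_depth/c/0/1"])
os.remove(archive_path)  # clean up the temporary stand-in file
```

The manifest is tiny compared to the data it points at, which is why a virtual store avoids duplicating the original files.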
To access these stores in a browser, we built icechunk-js, a lightweight Icechunk reader. Tying together a few tools (VirtualiZarr + icechunk-js + zarr-layer), we can render archival gridded datasets directly in the browser without a server or duplicate data.
For many lower resolution datasets, this combination works as-is. Viewing broad swaths of higher resolution data, on the other hand, still requires multiscale generation for acceptable web performance. Fortunately, these multiscales are lightweight and easy to add to our virtual store.
Adding multiscales to virtual stores
Direct access isn’t necessarily fast access. High resolution data isn’t relevant for zoomed-out views, but without multiscales, we’d be forced to load all of the data. Multiscales are pre-generated coarsened versions of data at multiple zoom levels. The Zarr community has recently converged on a standardized way of representing these coarsened versions of high resolution spatial data: the multiscale spec.
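A minimal sketch of what coarsening means in practice: each level averages non-overlapping 2x2 blocks of the level below it, so every level holds a quarter of the cells. Block averaging is an assumption made here for illustration; it is not necessarily the resampling method any particular library uses, and this toy version ignores missing values and coordinate metadata.

```python
def coarsen(grid: list[list[float]], factor: int = 2) -> list[list[float]]:
    """Average non-overlapping factor x factor blocks of a 2D grid.

    Trailing rows or columns that do not fill a whole block are dropped.
    """
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(0, rows - rows % factor, factor):
        out_row = []
        for c in range(0, cols - cols % factor, factor):
            block = [
                grid[r + i][c + j]
                for i in range(factor)
                for j in range(factor)
            ]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


def build_pyramid(grid: list[list[float]], levels: int) -> list[list[list[float]]]:
    """Return [full resolution, level 1, level 2, ...], each coarser than the last."""
    pyramid = [grid]
    for _ in range(levels):
        pyramid.append(coarsen(pyramid[-1]))
    return pyramid


# Toy 4x4 grid with values 0..15, coarsened twice.
full = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
pyramid = build_pyramid(full, levels=2)
```

A zoomed-out view can then read the small top levels of the pyramid instead of every full-resolution cell.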
To streamline the multiscale creation process, we released topozarr, a library for adding multiscales and associated metadata to Zarr and Icechunk stores. We can use this to create a “hybrid” store where the full resolution data is stored as references to the original analysis data, with relatively lightweight multiscales added on. This one-time operation is cheap compared to full tileset generation because no reprojection, regridding, or duplication of the data is required.
Direct access to data
Traditional web maps work with visualization-specific representations of the data, but the model we describe here streams the raw data to the browser. Since the data in the map is the actual analysis data, not an image representation, we can perform computations on it directly, revisualize it dynamically to home in on interesting trends, or query regions of interest for exact values or averages.
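Because the client holds raw values rather than pixels, operations like these reduce to plain array arithmetic. The sketch below shows a per-cell difference between two grids and an average over a rectangular region. The grids are made-up numbers, not NOAA data, and the helper names `difference` and `region_mean` are hypothetical.

```python
def difference(current: list[list[float]],
               previous: list[list[float]]) -> list[list[float]]:
    """Per-cell difference between two grids of the same shape."""
    return [
        [a - b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(current, previous)
    ]


def region_mean(grid: list[list[float]], rows: slice, cols: slice) -> float:
    """Average of the cells inside a rectangular window of the grid."""
    values = [v for row in grid[rows] for v in row[cols]]
    return sum(values) / len(values)


# Made-up snow depths on a tiny 2x3 grid for two winters.
this_winter = [[10.0, 20.0, 30.0], [40.0, 50.0, 60.0]]
last_winter = [[30.0, 20.0, 10.0], [60.0, 50.0, 40.0]]

change = difference(this_winter, last_winter)
# Mean change over the two westernmost columns.
west_mean = region_mean(change, slice(0, 2), slice(0, 2))
```

With image tiles, the same questions would require a round trip to a server that still has access to the underlying values.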
As a quick example, the map below shows the difference in each grid cell’s snow depth between this past winter and the previous winter. It also indicates whether the current zoom is rendering NetCDF data directly or the multiscale overviews.
Caveats: Chunking and bucket access
Since we’re rendering the underlying analysis data directly, we’re subject to the choices its creators made about chunking and codecs. Unfortunately, the chunking scheme chosen for analysis is not always best suited for web visualization. In our snowpack example, the NetCDF data is chunked in latitudinal strips that run from one end of the continent to the other, which is far from ideal for regional maps. To see the snow depth of your planned ski line in the Bitterroot Mountains, the map has to load full strips of data stretching from Seattle to New Hampshire. For some purposes, the performance is acceptable, but other chunking schemes might make this approach untenable.
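To get a feel for the cost of strip chunking, the sketch below counts how many grid cells must be fetched to cover a small regional window under two chunk layouts. The grid dimensions, chunk shapes, and window are hypothetical and chosen only for illustration; they are not the actual layout of the NOAA files. The arithmetic assumes every chunk is full size, so it slightly overestimates at grid edges.

```python
def chunks_touched(row_range: tuple[int, int], col_range: tuple[int, int],
                   chunk_shape: tuple[int, int]) -> int:
    """Count chunks intersecting a half-open [start, stop) window of cells."""
    (r0, r1), (c0, c1) = row_range, col_range
    chunk_rows, chunk_cols = chunk_shape
    # Index of the last chunk touched (ceiling division) minus the first.
    n_rows = -(-r1 // chunk_rows) - r0 // chunk_rows
    n_cols = -(-c1 // chunk_cols) - c0 // chunk_cols
    return n_rows * n_cols


def cells_loaded(row_range: tuple[int, int], col_range: tuple[int, int],
                 chunk_shape: tuple[int, int]) -> int:
    """Total cells fetched, since every touched chunk is read in full."""
    n_chunks = chunks_touched(row_range, col_range, chunk_shape)
    return n_chunks * chunk_shape[0] * chunk_shape[1]


# Hypothetical 200 x 200 cell window inside a grid that is 8000 cells wide.
window_rows, window_cols = (1000, 1200), (3000, 3200)
window_cells = 200 * 200

# Strips spanning the full grid width versus roughly square chunks.
strip_cells = cells_loaded(window_rows, window_cols, chunk_shape=(10, 8000))
square_cells = cells_loaded(window_rows, window_cols, chunk_shape=(256, 256))

strip_overhead = strip_cells / window_cells
square_overhead = square_cells / window_cells
```

Under these assumed shapes, the strip layout fetches several times more cells than the square layout for the same view, which is the overhead the Bitterroot example describes.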
Bucket access to the underlying dataset can be a problem too. If the bucket hosting the analysis data has access restrictions, or if Cross-Origin Resource Sharing (CORS) is not enabled, this approach will not work. NOAA’s hosting setup for seasonal snowpack data, for instance, doesn’t have CORS enabled, which prevents browsers from fetching it from another origin. As a workaround, we’ve mirrored the original NetCDF on S3 for this demo.
Try it out
This work is exploratory, and there are many underlying data formats we haven’t tested out yet. All of the pieces are open source and we’d love to see what you build. Please open issues on GitHub, reach out to us at hello@carbonplan.org, or find us in the CNG Slack. Thanks to the Earthmover and Development Seed teams for much of the tooling and conceptual work that underlies this workflow.