Getting Started

Installation

For development with all testing and documentation tools:

git clone https://github.com/PermafrostDiscoveryGateway/water-timeseries-v2
cd water-timeseries-v2
pip install -e ".[dev]"

Or with uv:

git clone https://github.com/PermafrostDiscoveryGateway/water-timeseries-v2
cd water-timeseries-v2
uv sync

To install only the runtime dependencies:

pip install .

Quick Example

from water_timeseries.dataset import DWDataset
import xarray as xr

# Load your data
ds = xr.open_dataset("your_data.nc")

# Create a DWDataset instance
dataset = DWDataset(ds)

# Access normalized data
normalized_data = dataset.ds_normalized

# Access the preprocessed dataset
preprocessed_ds = dataset.ds

Downloading from Google Earth Engine

The EarthEngineDownloader class allows you to download Dynamic World land cover data directly from Google Earth Engine.

Initialization

import os
from loguru import logger
from water_timeseries.downloader import EarthEngineDownloader

# Set your EE project via environment variable
os.environ["EE_PROJECT"] = "your-project"

# Or pass directly as parameter
dl = EarthEngineDownloader(ee_project="your-project", ee_auth=True, logger=logger)

Download Parameters

The download_dw_monthly() method supports the following parameters:

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| vector_dataset | str/Path | Path to input vector dataset (Parquet format) | Required |
| name_attribute | str | Column name in vector dataset for grouping | Required |
| years | List[int] | Years to process | 2017-2025 |
| months | List[int] | Months to process (June-September by default) | [6, 7, 8, 9] |
| bbox_west / bbox_east / bbox_north / bbox_south | float | Bounding box for spatial filtering | Global (-180 to 180, -90 to 90) |
| id_list | List[str] | Filter by specific IDs (from the name_attribute column) | None (all) |
| scale | float | Pixel scale in meters | 10 |
| max_total_requests | int | Max requests per chunk (controls chunking) | 500 |
| n_parallel | int | Number of parallel workers (1 = sequential) | 1 |
| no_download | bool | If True, only log parameters without downloading | False |
| save_to_file | str | Path to save dataset (.zarr or .nc); relative paths go to the output directory | None |
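As a rough mental model of how max_total_requests drives chunking, assume each (feature, year, month) combination is one Earth Engine request (this is an assumption for illustration, not the library's exact algorithm):

```python
import math

# Hypothetical back-of-the-envelope chunk count; the real chunking logic may
# differ, but max_total_requests is documented to control chunk size.
n_features = 118              # size of the bundled test dataset
years = [2024, 2025]
months = [6, 7, 8, 9]
max_total_requests = 500

total_requests = n_features * len(years) * len(months)     # 118 * 2 * 4 = 944
n_chunks = math.ceil(total_requests / max_total_requests)  # 2 chunks
```

With more parallel workers (n_parallel), these chunks can then be fetched concurrently.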

Usage Examples

Basic Download

Download all features from the test dataset:

dl = EarthEngineDownloader(ee_project="your-project", ee_auth=True, logger=logger)

ds = dl.download_dw_monthly(
    vector_dataset="tests/data/lake_polygons.parquet",
    name_attribute="id_geohash",
    years=[2024],
    months=[7, 8],
)

Filter by Specific IDs

Download only specific lakes using their geohash IDs:

ds = dl.download_dw_monthly(
    vector_dataset="tests/data/lake_polygons.parquet",
    name_attribute="id_geohash",
    id_list=["b7g6g1ny1mf7", "b7g4yc12k4yj", "b7g6c8gye56e"],
    years=[2024],
    months=[7, 8],
)

Parallel Download

Speed up large downloads by using multiple parallel workers:

ds = dl.download_dw_monthly(
    vector_dataset="tests/data/lake_polygons.parquet",
    name_attribute="id_geohash",
    n_parallel=4,  # Use 4 parallel workers
    max_total_requests=500,  # Control chunk size
    years=[2024, 2025],
    months=[6, 7, 8, 9],
)

Spatial Bounding Box Filter

Filter by geographic region:

ds = dl.download_dw_monthly(
    vector_dataset="tests/data/lake_polygons.parquet",
    name_attribute="id_geohash",
    bbox_west=-165,
    bbox_east=-164,
    bbox_south=66.2,
    bbox_north=66.6,
    years=[2024],
    months=[7, 8],
)
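Conceptually, the bounding-box filter keeps only features that fall inside the box. A minimal sketch with a hypothetical helper (not the package API, which filters the vector dataset's geometries):

```python
# Hypothetical point-in-bbox predicate illustrating the filter semantics,
# using the same bounds as the example above.
def in_bbox(lon, lat, west=-165.0, east=-164.0, south=66.2, north=66.6):
    """True when the coordinate lies inside the bounding box."""
    return west <= lon <= east and south <= lat <= north
```

A lake centered at (-164.5, 66.4) passes this check, while one at -150° longitude would be skipped.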

Preview Mode (No Download)

Test your parameters without actually downloading data:

ds = dl.download_dw_monthly(
    vector_dataset="tests/data/lake_polygons.parquet",
    name_attribute="id_geohash",
    years=[2024],
    months=[7, 8],
    no_download=True,  # Only logs parameters, skips actual download
)

Save to File

Automatically save the downloaded dataset to a file; the format is determined by the file extension:

# Save to Zarr format (relative path goes to output directory)
ds = dl.download_dw_monthly(
    vector_dataset="tests/data/lake_polygons.parquet",
    name_attribute="id_geohash",
    years=[2024],
    months=[7, 8],
    save_to_file="data.zarr",  # Saves to downloads/data.zarr
)

# Save to NetCDF format (absolute path)
ds = dl.download_dw_monthly(
    vector_dataset="tests/data/lake_polygons.parquet",
    name_attribute="id_geohash",
    years=[2024],
    months=[7, 8],
    save_to_file="/path/to/output/data.nc",
)
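The path and format conventions above can be summarized in a few lines. This is a sketch of the documented behavior (assuming "downloads" as the default output directory, as in the comment above), not the downloader's actual code:

```python
from pathlib import Path

def resolve_save_path(save_to_file, output_dir="downloads"):
    """Relative paths land in the output directory; the extension picks the format."""
    path = Path(save_to_file)
    if not path.is_absolute():
        path = Path(output_dir) / path
    file_format = "zarr" if path.suffix == ".zarr" else "netcdf"
    return path, file_format
```

For example, resolve_save_path("data.zarr") yields downloads/data.zarr in Zarr format, while an absolute .nc path is kept as-is and written as NetCDF.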

Test Dataset

The package includes a test dataset at tests/data/lake_polygons.parquet with 118 lake polygons in Alaska.

Saving and Loading Datasets

The package provides utility functions for saving and loading xarray datasets:

from water_timeseries.utils import save_xarray_dataset, load_xarray_dataset

# Save to Zarr format
save_xarray_dataset(ds, "output.zarr", output_dir="./data")

# Save to NetCDF format
save_xarray_dataset(ds, "/full/path/output.nc")

# Load from Zarr
ds = load_xarray_dataset("output.zarr")

# Load from NetCDF
ds = load_xarray_dataset("output.nc", format="netcdf")

Command Line Interface

The package includes a hierarchical CLI tool, water-timeseries, for running breakpoint detection, plotting time series, and launching the dashboard from the command line.

Installation

The CLI is installed automatically with the package:

uv sync

Basic Usage

# Show all options
uv run water-timeseries --help

# Show breakpoint-analysis subcommand help
uv run water-timeseries breakpoint-analysis --help

# Show plot-timeseries subcommand help
uv run water-timeseries plot-timeseries --help

# Show dashboard subcommand help
uv run water-timeseries dashboard --help

# Run breakpoint analysis
uv run water-timeseries breakpoint-analysis data.zarr output.parquet

# Run with optional parameters
uv run water-timeseries breakpoint-analysis \
    data.zarr \
    output.parquet \
    --chunksize 100 \
    --n-jobs 4

# Run with a config file
uv run water-timeseries breakpoint-analysis --config-file configs/config.yaml

# Run with ray backend (default) using all available CPUs
uv run water-timeseries breakpoint-analysis \
    data.zarr \
    output.parquet \
    --chunksize 50 \
    --n-jobs -1

# Run with joblib backend and simple break method
uv run water-timeseries breakpoint-analysis \
    data.zarr \
    output.parquet \
    --parallel-backend joblib \
    --break-method simple \
    --chunksize 50 \
    --n-jobs 20

# Run without geometry in output
uv run water-timeseries breakpoint-analysis \
    data.zarr \
    output.parquet \
    --no-output-geometry

# Plot lake timeseries
uv run water-timeseries plot-timeseries data.zarr --lake-id b7uefy0bvcrc

# Save figure to file
uv run water-timeseries plot-timeseries data.zarr --lake-id b7uefy0bvcrc --output-figure plot.png

# Save only (no popup window)
uv run water-timeseries plot-timeseries data.zarr --lake-id b7uefy0bvcrc --output-figure plot.png --no-show

# Launch the Streamlit dashboard (default port 8501)
uv run water-timeseries dashboard

# Launch dashboard on a custom port
uv run water-timeseries dashboard --port 8502

# Launch dashboard with custom data files
uv run water-timeseries dashboard --vector-file /path/to/lakes.parquet --dw-dataset-file /path/to/data.zarr

Using a Config File

Create a YAML configuration file:

# config.yaml
water_dataset_file: /path/to/your/data.zarr
output_file: /path/to/output.parquet

# Optional: vector dataset for bbox filtering
vector_dataset_file: /path/to/lakes.parquet

# Bounding box (optional)
bbox_west: -160
bbox_east: -155
bbox_north: 68
bbox_south: 66

# Processing options
chunksize: 100
n_jobs: 20
min_chunksize: 10
parallel_backend: ray  # or "joblib"
break_method: beast  # or "simple"

# Output options
output_geometry: true  # include geometry in output
output_geometry_all: false  # include geometry for all IDs

CLI arguments take priority over config file values.
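The precedence rule amounts to a simple merge in which explicitly set CLI options win. A hypothetical illustration (not the CLI's actual implementation):

```python
def merge_options(config_values, cli_values):
    """CLI options the user actually set (non-None) override config-file values."""
    merged = dict(config_values)
    merged.update({k: v for k, v in cli_values.items() if v is not None})
    return merged

opts = merge_options(
    {"chunksize": 100, "n_jobs": 20},   # from config.yaml
    {"chunksize": 50, "n_jobs": None},  # --chunksize 50 passed, --n-jobs omitted
)
# opts keeps n_jobs=20 from the config but takes chunksize=50 from the CLI
```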

Break Methods

The pipeline supports two breakpoint detection methods:

  • beast (default): Uses RBEAST (Bayesian change-point detection) for more accurate but slower analysis
  • simple: Uses a rolling-window statistical method (mean/median/max) for faster analysis
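The idea behind a rolling-window method can be sketched in a few lines: slide a window along the series and flag the index where the window statistic (here the mean) jumps the most. This is a toy illustration of the approach, not the package's implementation:

```python
from statistics import mean

def simple_breakpoint(values, window=3):
    """Return (index, jump) where the trailing/leading window means differ most."""
    best_idx, best_diff = None, 0.0
    for i in range(window, len(values) - window + 1):
        # Mean of the window just after i minus the window just before i
        diff = abs(mean(values[i:i + window]) - mean(values[i - window:i]))
        if diff > best_diff:
            best_idx, best_diff = i, diff
    return best_idx, best_diff

# A step series with a jump at index 5
series = [0.1, 0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.9]
idx, jump = simple_breakpoint(series, window=3)  # idx == 5
```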

Auto-saved Configuration

After running the pipeline, a YAML file with the parameters actually used is saved next to the output file. This includes:

  • All CLI arguments with their final values (after config merging and any automatic adjustments, such as n_jobs reduction)
  • The actual n_jobs value (which may be less than requested if it exceeded the number of chunks)

For example, if output is output.parquet, the config is saved as output.yaml.
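The output.parquet-to-output.yaml convention is a plain suffix swap, e.g. with pathlib:

```python
from pathlib import Path

# The auto-saved config sits next to the output file, with a .yaml suffix
output_file = Path("/path/to/output.parquet")
config_file = output_file.with_suffix(".yaml")  # /path/to/output.yaml
```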

Key Classes

LakeDataset

Base class for lake dataset handling. Provides preprocessing, normalization, and masking functionality.

DWDataset

Handles Dynamic World land cover data with classes for water, bare, snow, trees, grass, and more.

JRCDataset

Handles Joint Research Centre (JRC) water data with permanent and seasonal water classifications.

Testing

The package includes comprehensive tests for all functionality. Tests are organized by module and cover:

  • Dataset processing: Normalization, masking, and preprocessing
  • Breakpoint detection: Simple and RBEAST-based methods
  • Integration tests: End-to-end functionality with real and synthetic data

Interactive Dashboard

The package includes an interactive Streamlit dashboard for visualizing lake polygons and time series data.

Running the Dashboard

# Launch via CLI (recommended)
uv run water-timeseries dashboard

# Or with a custom port
uv run water-timeseries dashboard --port 8502

# Alternative: Run directly with streamlit
streamlit run src/water_timeseries/dashboard/app.py

Dashboard Features

The dashboard provides a graphical interface for:

  1. Map Visualization: Interactive map showing lake polygons from a parquet file
     • Hover over polygons to see attributes (id_geohash, area, net change)
     • Click on a polygon to select it
  2. Time Series Plotting: Automatically plots water extent over time for the selected lake
     • Shows a preview below the map
     • Click "Open Time Series in Popup" for a larger view
  3. Automatic Download: If the selected lake's data is not in the cached dataset:
     • Shows a "Downloading..." status
     • Automatically fetches data from Google Earth Engine
     • Displays the time series plot after the download completes
  4. Google Earth Engine Configuration:
     • Enter your EE project in the sidebar
     • Click "Set EE Project" to save it
  5. Satellite Timelapse Animation: Generate animated GIFs showing satellite imagery over time
     • Sentinel-2 (2016-2025): High-resolution optical imagery (10 m resolution)
     • Landsat (2000-2025): Longer historical record (30 m resolution)
     • Uses summer months (July-August) to maximize cloud-free observations
     • Creates a buffer around the lake for context

Generating Timelapse GIFs

In the dashboard sidebar, you'll find the "Satellite Timelapse" section with:

  • Sentinel-2 (2016-2025): Checkbox to generate a Sentinel-2 timelapse (enabled by default)
  • Landsat (2000-2025): Checkbox to generate a Landsat timelapse (disabled by default)
  • Create Timelapse: Button to start the generation process

The timelapse GIFs are saved to the gifs/ directory with the following naming convention:

  • {geohash}_S2.gif for Sentinel-2
  • {geohash}_LS.gif for Landsat
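That naming convention can be expressed as a small helper (hypothetical; the dashboard builds these names internally):

```python
def gif_filename(geohash, source):
    """Map a timelapse source to the documented file-name suffix."""
    suffix = {"sentinel2": "S2", "landsat": "LS"}[source]
    return f"{geohash}_{suffix}.gif"
```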

Timelapse Options

The timelapse generation can be customized via the create_timelapse() function:

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| input_lake_gdf | GeoDataFrame | Lake geometries with id_geohash column | Required |
| id_geohash | str | Specific lake to visualize | Required |
| timelapse_source | str | Image source ("sentinel2" or "landsat") | "sentinel2" |
| gif_outdir | str/Path | Output directory for GIF files | "gifs" |
| buffer | float | Buffer around lake in meters | 100 |
| start_year | int | Start year for timelapse | 2016 (Sentinel-2) / 2000 (Landsat) |
| end_year | int | End year for timelapse | 2025 |
| start_date | str | Start date within year (MM-DD) | "07-01" |
| end_date | str | End date within year (MM-DD) | "08-31" |
| frames_per_second | int | Animation speed | 1 |
| dimensions | int | GIF pixel dimensions | 512 |
| overwrite_exists | bool | Re-download if file exists | False |

Programmatic Usage

You can also generate timelapses programmatically:

import geopandas as gpd
from water_timeseries.utils.earthengine import create_timelapse

# Load your lake data
lakes_gdf = gpd.read_file("lakes.parquet")

# Generate Sentinel-2 timelapse (default)
gif_path = create_timelapse(
    input_lake_gdf=lakes_gdf,
    id_geohash="b7uefy0bvcrc",
    timelapse_source="sentinel2",
    gif_outdir="gifs",
)

# Generate Landsat timelapse (longer historical record)
gif_path = create_timelapse(
    input_lake_gdf=lakes_gdf,
    id_geohash="b7uefy0bvcrc",
    timelapse_source="landsat",
    start_year=2000,
    end_year=2025,
    gif_outdir="gifs",
)

# Overwrite existing files
gif_path = create_timelapse(
    input_lake_gdf=lakes_gdf,
    id_geohash="b7uefy0bvcrc",
    timelapse_source="sentinel2",
    overwrite_exists=True,  # Re-download even if file exists
)

Dashboard Arguments

The create_app() function accepts these parameters:

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| data_path | str/Path | Path to parquet file with lake polygons | tests/data/lake_polygons.parquet |
| zarr_path | str/Path | Path to zarr file with cached time series | tests/data/lakes_dw_test.zarr |

Using with Custom Data

The dashboard can be launched with custom data files via the CLI:

# Using default test data
uv run water-timeseries dashboard

# Using custom data files
uv run water-timeseries dashboard \
    --vector-file /path/to/lakes.parquet \
    --dw-dataset-file /path/to/data.zarr

Or programmatically with Python:

from water_timeseries.dashboard.map_viewer import create_app

# Create dashboard with custom paths
create_app(
    data_path="/path/to/your/lakes.parquet",
    zarr_path="/path/to/your/data.zarr"
)

Running Tests

To run the test suite:

# Install with development dependencies
pip install -e ".[dev]"

# Run all tests
pytest

# Run specific test modules
pytest tests/test_breakpoints.py
pytest tests/test_normalization.py

# Run with coverage
pytest --cov=water_timeseries

Test Data

Tests use both real and synthetic datasets:

  • Real data: located in tests/data/ (DW and JRC test datasets)
  • Synthetic data: generated programmatically for predictable breakpoint testing
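A predictable synthetic series for breakpoint testing can be as simple as a step change at a known index (a sketch of the idea, not the package's actual fixtures):

```python
def step_series(n=20, break_at=10, before=0.2, after=0.8):
    """Constant series with a single known jump at index `break_at`."""
    return [before if i < break_at else after for i in range(n)]

series = step_series()  # a detector should report a break near index 10
```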

Next Steps