Getting Started
Installation
For development with all testing and documentation tools:
git clone https://github.com/PermafrostDiscoveryGateway/water-timeseries-v2
cd water-timeseries-v2
pip install -e ".[dev]"
Or with uv:
git clone https://github.com/PermafrostDiscoveryGateway/water-timeseries-v2
cd water-timeseries-v2
uv sync
To install only the runtime dependencies:
pip install .
Quick Example
from water_timeseries.dataset import DWDataset
import xarray as xr
# Load your data
ds = xr.open_dataset("your_data.nc")
# Create a DWDataset instance
dataset = DWDataset(ds)
# Access normalized data
normalized_data = dataset.ds_normalized
# Access the preprocessed dataset
preprocessed_ds = dataset.ds
Downloading from Google Earth Engine
The EarthEngineDownloader class allows you to download Dynamic World land cover data directly from Google Earth Engine.
Initialization
import os
from loguru import logger
from water_timeseries.downloader import EarthEngineDownloader
# Set your EE project via environment variable
os.environ["EE_PROJECT"] = "your-project"
# Or pass directly as parameter
dl = EarthEngineDownloader(ee_project="your-project", ee_auth=True, logger=logger)
Download Parameters
The download_dw_monthly() method supports the following parameters:
| Parameter | Type | Description | Default |
|---|---|---|---|
| vector_dataset | str/Path | Path to input vector dataset (Parquet format) | Required |
| name_attribute | str | Column name in vector dataset for grouping | Required |
| years | List[int] | Years to process | [2017-2025] |
| months | List[int] | Months to process (default: June-September) | [6,7,8,9] |
| bbox_west/east/north/south | float | Bounding box for spatial filtering | Global (-180 to 180, -90 to 90) |
| id_list | List[str] | Filter by specific IDs (from name_attribute column) | None (all) |
| scale | float | Pixel scale in meters | 10 |
| max_total_requests | int | Max requests per chunk (controls chunking) | 500 |
| n_parallel | int | Number of parallel workers (1 = sequential) | 1 |
| no_download | bool | If True, only log parameters without downloading | False |
| save_to_file | str | Path to save dataset (.zarr or .nc). Relative paths go to output dir | None |
Usage Examples
Basic Download
Download all features from the test dataset:
dl = EarthEngineDownloader(ee_project="your-project", ee_auth=True, logger=logger)
ds = dl.download_dw_monthly(
vector_dataset="tests/data/lake_polygons.parquet",
name_attribute="id_geohash",
years=[2024],
months=[7, 8],
)
Filter by Specific IDs
Download only specific lakes using their geohash IDs:
ds = dl.download_dw_monthly(
vector_dataset="tests/data/lake_polygons.parquet",
name_attribute="id_geohash",
id_list=["b7g6g1ny1mf7", "b7g4yc12k4yj", "b7g6c8gye56e"],
years=[2024],
months=[7, 8],
)
Parallel Download
Speed up large downloads by using multiple parallel workers:
ds = dl.download_dw_monthly(
vector_dataset="tests/data/lake_polygons.parquet",
name_attribute="id_geohash",
n_parallel=4, # Use 4 parallel workers
max_total_requests=500, # Control chunk size
years=[2024, 2025],
months=[6, 7, 8, 9],
)
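The relationship between max_total_requests and the length of the time range can be pictured with some simple arithmetic. The sketch below is an illustration only: it assumes one request per feature per (year, month) pair, which this page does not confirm, and features_per_chunk and split_into_chunks are hypothetical helpers, not part of the package.

```python
def features_per_chunk(n_years: int, n_months: int, max_total_requests: int = 500) -> int:
    """Assumed rule: each feature costs one request per (year, month) pair."""
    requests_per_feature = n_years * n_months
    # Always allow at least one feature per chunk, even for very long time ranges.
    return max(1, max_total_requests // requests_per_feature)


def split_into_chunks(ids: list[str], chunk_size: int) -> list[list[str]]:
    """Split a list of feature IDs into consecutive chunks of at most chunk_size."""
    return [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]


# 118 placeholder IDs (the size of the bundled test dataset), 2 years x 4 months.
ids = [f"lake_{i}" for i in range(118)]
size = features_per_chunk(n_years=2, n_months=4, max_total_requests=500)
chunks = split_into_chunks(ids, size)
print(size, [len(c) for c in chunks])
```

Under this assumption, a lower max_total_requests yields more, smaller chunks, which gives n_parallel more units of work to distribute.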
Spatial Bounding Box Filter
Filter by geographic region:
ds = dl.download_dw_monthly(
vector_dataset="tests/data/lake_polygons.parquet",
name_attribute="id_geohash",
bbox_west=-165,
bbox_east=-164,
bbox_south=66.2,
bbox_north=66.6,
years=[2024],
months=[7, 8],
)
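To check which features a bounding box would cover before starting a download, a plain rectangle-overlap test is often enough. The following is a minimal sketch that assumes an "intersects" rule; this page does not state whether the package keeps features that intersect the box, lie fully within it, or have their centroid inside it. The BBox class and the lake extents are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    west: float
    south: float
    east: float
    north: float

    def intersects(self, other: "BBox") -> bool:
        """True if the two boxes overlap or touch (no antimeridian handling)."""
        return not (
            other.east < self.west
            or other.west > self.east
            or other.north < self.south
            or other.south > self.north
        )


# Region from the example above.
region = BBox(west=-165, south=66.2, east=-164, north=66.6)

# Hypothetical lake extents, given as (west, south, east, north).
lake_inside = BBox(-164.6, 66.30, -164.5, 66.35)
lake_outside = BBox(-150.2, 68.00, -150.1, 68.05)

print(region.intersects(lake_inside))
print(region.intersects(lake_outside))
```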
Preview Mode (No Download)
Test your parameters without actually downloading data:
ds = dl.download_dw_monthly(
vector_dataset="tests/data/lake_polygons.parquet",
name_attribute="id_geohash",
years=[2024],
months=[7, 8],
no_download=True, # Only logs parameters, skips actual download
)
Save to File
Automatically save the downloaded dataset to file. The format is determined by the file extension:
# Save to Zarr format (relative path goes to output directory)
ds = dl.download_dw_monthly(
vector_dataset="tests/data/lake_polygons.parquet",
name_attribute="id_geohash",
years=[2024],
months=[7, 8],
save_to_file="data.zarr", # Saves to downloads/data.zarr
)
# Save to NetCDF format (absolute path)
ds = dl.download_dw_monthly(
vector_dataset="tests/data/lake_polygons.parquet",
name_attribute="id_geohash",
years=[2024],
months=[7, 8],
save_to_file="/path/to/output/data.nc",
)
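The two rules described here (format chosen by extension, relative paths placed under the output directory) can be sketched with pathlib. This is an illustration rather than the package's code: resolve_output is a hypothetical helper, and the "downloads" default is taken from the comment in the example above.

```python
from pathlib import Path

# Extensions named in the parameter table.
FORMATS = {".zarr": "zarr", ".nc": "netcdf"}


def resolve_output(save_to_file: str, output_dir: str = "downloads") -> tuple[Path, str]:
    """Return (target path, format name) for a save_to_file value."""
    path = Path(save_to_file)
    fmt = FORMATS.get(path.suffix.lower())
    if fmt is None:
        raise ValueError(f"Unsupported extension: {path.suffix!r} (expected .zarr or .nc)")
    # Relative paths are placed under the output directory; absolute paths are kept.
    if not path.is_absolute():
        path = Path(output_dir) / path
    return path, fmt


print(resolve_output("data.zarr"))
print(resolve_output("/path/to/output/data.nc"))
```

Rejecting unknown extensions up front avoids finishing a long download only to fail at the save step.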
Test Dataset
The package includes a test dataset at tests/data/lake_polygons.parquet with 118 lake polygons in Alaska.
Saving and Loading Datasets
The package provides utility functions for saving and loading xarray datasets:
from water_timeseries.utils import save_xarray_dataset, load_xarray_dataset
# Save to Zarr format
save_xarray_dataset(ds, "output.zarr", output_dir="./data")
# Save to NetCDF format
save_xarray_dataset(ds, "/full/path/output.nc")
# Load from Zarr
ds = load_xarray_dataset("output.zarr")
# Load from NetCDF
ds = load_xarray_dataset("output.nc", format="netcdf")
Command Line Interface
The package includes a hierarchical CLI tool, water-timeseries, for running breakpoint analysis, plotting lake time series, and launching the dashboard from the command line.
Installation
The CLI is installed automatically with the package:
uv sync
Basic Usage
# Show all options
uv run water-timeseries --help
# Show breakpoint-analysis subcommand help
uv run water-timeseries breakpoint-analysis --help
# Show plot-timeseries subcommand help
uv run water-timeseries plot-timeseries --help
# Show dashboard subcommand help
uv run water-timeseries dashboard --help
# Run breakpoint analysis
uv run water-timeseries breakpoint-analysis data.zarr output.parquet
# Run with optional parameters
uv run water-timeseries breakpoint-analysis \
data.zarr \
output.parquet \
--chunksize 100 \
--n-jobs 4
# Run with a config file
uv run water-timeseries breakpoint-analysis --config-file configs/config.yaml
# Run with ray backend (default) using all available CPUs
uv run water-timeseries breakpoint-analysis \
data.zarr \
output.parquet \
--chunksize 50 \
--n-jobs -1
# Run with joblib backend and simple break method
uv run water-timeseries breakpoint-analysis \
data.zarr \
output.parquet \
--parallel-backend joblib \
--break-method simple \
--chunksize 50 \
--n-jobs 20
# Run without geometry in output
uv run water-timeseries breakpoint-analysis \
data.zarr \
output.parquet \
--no-output-geometry
# Plot lake timeseries
uv run water-timeseries plot-timeseries data.zarr --lake-id b7uefy0bvcrc
# Save figure to file
uv run water-timeseries plot-timeseries data.zarr --lake-id b7uefy0bvcrc --output-figure plot.png
# Save only (no popup window)
uv run water-timeseries plot-timeseries data.zarr --lake-id b7uefy0bvcrc --output-figure plot.png --no-show
# Launch the Streamlit dashboard (default port 8501)
uv run water-timeseries dashboard
# Launch dashboard on a custom port
uv run water-timeseries dashboard --port 8502
# Launch dashboard with custom data files
uv run water-timeseries dashboard --vector-file /path/to/lakes.parquet --dw-dataset-file /path/to/data.zarr
Using a Config File
Create a YAML configuration file:
# config.yaml
water_dataset_file: /path/to/your/data.zarr
output_file: /path/to/output.parquet
# Optional: vector dataset for bbox filtering
vector_dataset_file: /path/to/lakes.parquet
# Bounding box (optional)
bbox_west: -160
bbox_east: -155
bbox_north: 68
bbox_south: 66
# Processing options
chunksize: 100
n_jobs: 20
min_chunksize: 10
parallel_backend: ray # or "joblib"
break_method: beast # or "simple"
# Output options
output_geometry: true # include geometry in output
output_geometry_all: false # include geometry for all IDs
CLI arguments take priority over config file values.
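The precedence rule can be pictured as a dictionary overlay. The sketch below assumes that an option not given on the command line arrives as None; merge_settings is a hypothetical helper and the package may implement the merge differently.

```python
from typing import Any


def merge_settings(config: dict[str, Any], cli_args: dict[str, Any]) -> dict[str, Any]:
    """Overlay CLI values on config values; None means 'not given on the CLI'."""
    merged = dict(config)
    for key, value in cli_args.items():
        if value is not None:
            merged[key] = value
    return merged


# Values as they might be read from config.yaml.
config = {"chunksize": 100, "n_jobs": 20, "break_method": "beast"}
# Values as they might come from the command line (only --n-jobs 4 was passed).
cli_args = {"chunksize": None, "n_jobs": 4, "break_method": None}

print(merge_settings(config, cli_args))
```

Here n_jobs comes from the command line while chunksize and break_method keep their config-file values.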
Break Methods
The pipeline supports two breakpoint detection methods:
- beast (default): Uses RBEAST (Bayesian change-point detection) for more accurate but slower analysis
- simple: Uses rolling window statistical method (mean/median/max) for faster analysis
break_method: beast # or "simple"
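To give a feel for what a rolling-window check does, here is a toy detector that compares each value with the median of the preceding window. It is not the package's simple method: the function name, window size, and threshold are assumptions chosen for illustration, and the data are synthetic.

```python
from statistics import median


def rolling_median_breaks(values: list[float], window: int = 3, threshold: float = 0.25) -> list[int]:
    """Return indices where a value falls below the median of the preceding
    `window` values by more than `threshold` (in the units of `values`)."""
    breaks = []
    for i in range(window, len(values)):
        baseline = median(values[i - window:i])
        if baseline - values[i] > threshold:
            breaks.append(i)
    return breaks


# Synthetic water fraction per time step: stable, then an abrupt drop at index 5.
water_fraction = [0.90, 0.92, 0.91, 0.89, 0.90, 0.40, 0.38, 0.41, 0.39]
print(rolling_median_breaks(water_fraction))
```

Because the median lags behind the drop, this toy version flags the step after the drop as well (indices 5 and 6); a practical method would merge consecutive detections into one event.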
Auto-saved Configuration
After running the pipeline, a YAML file with the parameters used is automatically saved next to the output file. This includes:
- All CLI arguments with their final values (after config merging and any automatic adjustments such as n_jobs reduction)
- The actual n_jobs value (may be lower than requested if the request exceeded the number of chunks)
For example, if output is output.parquet, the config is saved as output.yaml.
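Both behaviors are easy to reproduce when predicting what a run will record. This sketch assumes that -1 means "all available CPUs", as in the CLI examples; effective_n_jobs and saved_config_path are hypothetical helpers, not package functions.

```python
import os
from pathlib import Path


def effective_n_jobs(requested: int, n_chunks: int) -> int:
    """Cap the worker count at the number of chunks; -1 means all CPUs."""
    if requested == -1:
        requested = os.cpu_count() or 1
    return max(1, min(requested, n_chunks))


def saved_config_path(output_file: str) -> Path:
    """Same location and stem as the output file, with a .yaml extension."""
    return Path(output_file).with_suffix(".yaml")


print(effective_n_jobs(requested=20, n_chunks=5))
print(saved_config_path("output.parquet"))
```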
Key Classes
LakeDataset
Base class for lake dataset handling. Provides preprocessing, normalization, and masking functionality.
DWDataset
Handles Dynamic World land cover data with classes for water, bare, snow, trees, grass, and more.
JRCDataset
Handles Joint Research Centre (JRC) water data with permanent and seasonal water classifications.
Testing
The package includes a test suite organized by module. It covers:
- Dataset processing: Normalization, masking, and preprocessing
- Breakpoint detection: Simple and RBEAST-based methods
- Integration tests: End-to-end functionality with real and synthetic data
Interactive Dashboard
The package includes an interactive Streamlit dashboard for visualizing lake polygons and time series data.
Running the Dashboard
# Launch via CLI (recommended)
uv run water-timeseries dashboard
# Or with a custom port
uv run water-timeseries dashboard --port 8502
# Alternative: Run directly with streamlit
streamlit run src/water_timeseries/dashboard/app.py
Dashboard Features
The dashboard provides a graphical interface for:
- Map Visualization: Interactive map showing lake polygons from a parquet file
    - Hover over polygons to see attributes (id_geohash, area, net change)
    - Click on a polygon to select it
- Time Series Plotting: Automatically plots water extent over time for the selected lake
    - Shows a preview below the map
    - Click "Open Time Series in Popup" for a larger view
- Automatic Download: If the selected lake's data is not in the cached dataset:
    - Shows "Downloading..." status
    - Automatically fetches data from Google Earth Engine
    - Displays the time series plot after download completes
- Google Earth Engine Configuration:
    - Enter your EE project in the sidebar
    - Click "Set EE Project" to save it
- Satellite Timelapse Animation: Generate animated GIFs showing satellite imagery over time
    - Sentinel-2 (2016-2025): High-resolution optical imagery (10m resolution)
    - Landsat (2000-2025): Longer historical record (30m resolution)
    - Uses summer months (July-August) to maximize cloud-free observations
    - Creates a buffer around the lake for context
Generating Timelapse GIFs
In the dashboard sidebar, you'll find the "Satellite Timelapse" section with:
- Sentinel-2 (2016-2025): Checkbox to generate Sentinel-2 timelapse (default enabled)
- Landsat (2000-2025): Checkbox to generate Landsat timelapse (default disabled)
- Create Timelapse: Button to start the generation process
The timelapse GIFs are saved to the gifs/ directory with naming convention:
- {geohash}_S2.gif for Sentinel-2
- {geohash}_LS.gif for Landsat
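Given this naming convention, the expected file for a lake can be computed ahead of time, for example to check whether a GIF already exists. The helper below is a hypothetical sketch; the source keys "sentinel2" and "landsat" and the "gifs" directory follow the values used elsewhere on this page.

```python
from pathlib import Path

# Suffixes taken from the naming convention above.
SOURCE_SUFFIX = {"sentinel2": "S2", "landsat": "LS"}


def gif_path(id_geohash: str, timelapse_source: str = "sentinel2", gif_outdir: str = "gifs") -> Path:
    """Build the expected GIF location for one lake and image source."""
    return Path(gif_outdir) / f"{id_geohash}_{SOURCE_SUFFIX[timelapse_source]}.gif"


target = gif_path("b7uefy0bvcrc", "landsat")
print(target.name)
print(target.exists())  # whether a timelapse has already been generated
```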
Timelapse Options
The timelapse generation can be customized via the create_timelapse() function:
| Parameter | Type | Description | Default |
|---|---|---|---|
| input_lake_gdf | GeoDataFrame | Lake geometries with id_geohash column | Required |
| id_geohash | str | Specific lake to visualize | Required |
| timelapse_source | str | Image source ("sentinel2" or "landsat") | "sentinel2" |
| gif_outdir | str/Path | Output directory for GIF files | "gifs" |
| buffer | float | Buffer around lake in meters | 100 |
| start_year | int | Start year for timelapse | 2016 (Sentinel-2) / 2000 (Landsat) |
| end_year | int | End year for timelapse | 2025 |
| start_date | str | Start date within year (MM-DD) | "07-01" |
| end_date | str | End date within year (MM-DD) | "08-31" |
| frames_per_second | int | Animation speed | 1 |
| dimensions | int | GIF pixel dimensions | 512 |
| overwrite_exists | bool | Re-download if file exists | False |
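The year and date parameters combine into one seasonal window per year. The sketch below shows that combination under two assumptions not confirmed here: that end_year is inclusive and that each year contributes one window. seasonal_windows is a hypothetical helper, not part of the package.

```python
from datetime import date
from typing import Optional

# Default first years per source, as listed in the table above.
DEFAULT_START_YEAR = {"sentinel2": 2016, "landsat": 2000}


def seasonal_windows(
    timelapse_source: str = "sentinel2",
    start_year: Optional[int] = None,
    end_year: int = 2025,
    start_date: str = "07-01",
    end_date: str = "08-31",
) -> list[tuple[date, date]]:
    """Return one (start, end) date pair per year, end_year included."""
    if start_year is None:
        start_year = DEFAULT_START_YEAR[timelapse_source]
    windows = []
    for year in range(start_year, end_year + 1):
        start = date.fromisoformat(f"{year}-{start_date}")
        end = date.fromisoformat(f"{year}-{end_date}")
        windows.append((start, end))
    return windows


windows = seasonal_windows("sentinel2")
print(len(windows), windows[0], windows[-1])
```

With the defaults, Sentinel-2 yields 10 windows and Landsat 26, which at frames_per_second=1 would correspond to animations of roughly 10 and 26 seconds if every year produces a frame.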
Programmatic Usage
You can also generate timelapses programmatically:
import geopandas as gpd
from water_timeseries.utils.earthengine import create_timelapse
# Load your lake data
lakes_gdf = gpd.read_parquet("lakes.parquet")  # GeoParquet files are read with read_parquet
# Generate Sentinel-2 timelapse (default)
gif_path = create_timelapse(
input_lake_gdf=lakes_gdf,
id_geohash="b7uefy0bvcrc",
timelapse_source="sentinel2",
gif_outdir="gifs",
)
# Generate Landsat timelapse (longer historical record)
gif_path = create_timelapse(
input_lake_gdf=lakes_gdf,
id_geohash="b7uefy0bvcrc",
timelapse_source="landsat",
start_year=2000,
end_year=2025,
gif_outdir="gifs",
)
# Overwrite existing files
gif_path = create_timelapse(
input_lake_gdf=lakes_gdf,
id_geohash="b7uefy0bvcrc",
timelapse_source="sentinel2",
overwrite_exists=True, # Re-download even if file exists
)
Dashboard Arguments
The create_app() function accepts these parameters:
| Parameter | Type | Description | Default |
|---|---|---|---|
| data_path | str/Path | Path to parquet file with lake polygons | tests/data/lake_polygons.parquet |
| zarr_path | str/Path | Path to zarr file with cached time series | tests/data/lakes_dw_test.zarr |
Using with Custom Data
The dashboard can be launched with custom data files via the CLI:
# Using default test data
uv run water-timeseries dashboard
# Using custom data files
uv run water-timeseries dashboard \
--vector-file /path/to/lakes.parquet \
--dw-dataset-file /path/to/data.zarr
Or programmatically with Python:
from water_timeseries.dashboard.map_viewer import create_app
# Create dashboard with custom paths
create_app(
data_path="/path/to/your/lakes.parquet",
zarr_path="/path/to/your/data.zarr"
)
Running Tests
To run the test suite:
# Install with development dependencies
pip install -e ".[dev]"
# Run all tests
pytest
# Run specific test modules
pytest tests/test_breakpoints.py
pytest tests/test_normalization.py
# Run with coverage
pytest --cov=water_timeseries
Test Data
Tests use both real and synthetic datasets:
- Real data: Located in tests/data/ (DW and JRC test datasets)
- Synthetic data: Generated programmatically for predictable breakpoint testing
Next Steps
- See API Reference for detailed class documentation
- Check out the Examples for more use cases
- Visit the GitHub repository