Dataset Module
Dataset processing classes for water timeseries analysis.
This module provides classes for processing and normalizing satellite-derived land cover and water classification data. It includes specialized handlers for different data sources and processing pipelines.
DWDataset
Bases: LakeDataset
Handler for Dynamic World land cover classification data.
Processes Dynamic World land cover classes including water, bare soil, snow/ice, trees, grass, flooded vegetation, crops, shrub/scrub, and built areas.
Attributes:
| Name | Type | Description |
|---|---|---|
| water_column | str | Fixed as "water" for DW data. |
| data_columns | list | All 9 DW land cover classes. |
Example
import xarray as xr
from water_timeseries.dataset import DWDataset

dw_data = DWDataset(xr.open_dataset("dynamic_world.nc"))
water_time_series = dw_data.ds_normalized["water"]
print(dw_data.data_columns)
# ['water', 'bare', 'snow_and_ice', 'trees', 'grass', 'flooded_vegetation', 'crops', 'shrub_and_scrub', 'built']
Source code in src/water_timeseries/dataset.py
__init__(ds)
Initialize DWDataset with Dynamic World data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ds | Dataset | Input xarray Dataset with at least the 9 DW class variables. | required |
Source code in src/water_timeseries/dataset.py
plot_timeseries(id_geohash, breakpoints=None, save_path=None)
Plot the time series for a specific geohash.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| id_geohash | str | The geohash identifier for the location. | required |
| breakpoints | BreakpointMethod | Breakpoint detection method to use. | None |
| save_path | str \| Path | Path to save the plot as an image file. | None |
Returns:
| Type | Description |
|---|---|
| Figure | plt.Figure: The matplotlib figure object. |
Source code in src/water_timeseries/dataset.py
plot_timeseries_interactive(id_geohash, breakpoints=None, save_path=None)
Plot the interactive time series for a specific geohash using Plotly.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| id_geohash | str | The geohash identifier for the location. | required |
| breakpoints | BreakpointMethod | Breakpoint detection method to use. | None |
| save_path | str \| Path | Path to save the plot as HTML file. | None |
Returns:
| Type | Description |
|---|---|
| plotly.graph_objects.Figure | Interactive Plotly figure. |
Source code in src/water_timeseries/dataset.py
JRCDataset
Bases: LakeDataset
Handler for JRC (Joint Research Centre) water classification data.
Processes JRC water occurrence data with separate classes for permanent water, seasonal water, and land.
Attributes:
| Name | Type | Description |
|---|---|---|
| water_column | str | Fixed as "area_water_permanent" for JRC data. |
| data_columns | list | ['area_water_permanent', 'area_water_seasonal', 'area_land']. |
Example
import xarray as xr
from water_timeseries.dataset import JRCDataset

jrc_data = JRCDataset(xr.open_dataset("jrc_water.nc"))
permanent_water = jrc_data.ds_normalized["area_water_permanent"]
seasonal_water = jrc_data.ds_normalized["area_water_seasonal"]
Source code in src/water_timeseries/dataset.py
__init__(ds)
Initialize JRCDataset with JRC water classification data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ds | Dataset | Input xarray Dataset with JRC water classification variables. | required |
Source code in src/water_timeseries/dataset.py
create_timelapse(lake_gdf, id_geohash, timelapse_source='landsat', gif_outdir='gifs', buffer=100, start_year=2000, end_year=2025, start_date='07-01', end_date='08-31', frames_per_second=1, dimensions=512, overwrite_exists=False)
Create a timelapse GIF for a specific lake.
This method generates an animated GIF showing satellite imagery over a date range for a lake identified by its geohash. The timelapse captures the summer period (July-August) each year to maximize cloud-free observations.
Default timelapse_source is 'landsat' for JRC data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| lake_gdf | GeoDataFrame | GeoDataFrame containing lake geometries with an 'id_geohash' column. | required |
| id_geohash | str | The geohash identifier for the specific lake to visualize. | required |
| timelapse_source | str | Image source for timelapse imagery ('sentinel2' or 'landsat'). | 'landsat' |
| gif_outdir | str \| Path | Output directory for the GIF file. | 'gifs' |
| buffer | float | Buffer distance in meters to expand the lake bounding box. | 100 |
| start_year | int | Start year for the timelapse. | 2000 |
| end_year | int | End year for the timelapse. | 2025 |
| start_date | str | Start date within each year (MM-DD format). | '07-01' |
| end_date | str | End date within each year (MM-DD format). | '08-31' |
| frames_per_second | int | Animation speed. | 1 |
| dimensions | int | Pixel dimensions for the output GIF. | 512 |
| overwrite_exists | bool | If False, skip download if output file already exists. If True, always re-download and overwrite existing file. | False |
Returns:
| Type | Description |
|---|---|
| Path \| None | Path to the generated GIF file, or None if skipped due to existing file. |
Source code in src/water_timeseries/dataset.py
plot_timeseries(id_geohash, breakpoints=None, save_path=None)
Plot the time series for a specific geohash.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| id_geohash | str | The geohash identifier for the location. | required |
| breakpoints | BreakpointMethod | Breakpoint detection method to use. | None |
| save_path | str \| Path | Path to save the plot as an image file. | None |
Returns:
| Type | Description |
|---|---|
| Figure | plt.Figure: The matplotlib figure object. |
Source code in src/water_timeseries/dataset.py
plot_timeseries_interactive(id_geohash, breakpoints=None, save_path=None)
Plot the interactive time series for a specific geohash using Plotly.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| id_geohash | str | The geohash identifier for the location. | required |
| breakpoints | BreakpointMethod | Breakpoint detection method to use (not used currently). | None |
| save_path | str \| Path | Path to save the plot as HTML file. | None |
Returns:
| Type | Description |
|---|---|
| plotly.graph_objects.Figure | Interactive Plotly figure. |
Source code in src/water_timeseries/dataset.py
LakeDataset
Base class for processing lake and water body datasets.
Handles common operations for dataset preprocessing, normalization, and masking. Provides a framework that can be extended for different data sources.
Attributes:
| Name | Type | Description |
|---|---|---|
| ds | Dataset | The input xarray Dataset containing raw data. |
| ds_normalized | Dataset | Normalized version of the dataset (0-1 scale). |
| preprocessed_ | bool | Whether preprocessing has been completed. |
| normalized_available_ | bool | Whether normalized data is available. |
| water_column | str | Name of the water/water extent column. |
| data_columns | list | Names of all data columns in the dataset. |
| ds_ismasked_ | bool | Whether the original dataset has been masked. |
| ds_normalized_ismasked_ | bool | Whether the normalized dataset has been masked. |
Example
lake_data = LakeDataset(xr.Dataset(...))
normalized = lake_data.ds_normalized
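The documentation does not spell out how ds_normalized is computed. As a rough illustration only, the sketch below assumes that normalization divides each class area by the total across all classes, which is consistent with the 0-1 scale described above. The function normalize_classes and the sample values are hypothetical and not part of water_timeseries.

```python
def normalize_classes(areas: dict[str, float]) -> dict[str, float]:
    """Scale raw per-class areas so that they sum to 1 (assumed 0-1 scale)."""
    total = sum(areas.values())
    if total == 0:
        # No valid observation: return NaN instead of dividing by zero.
        return {name: float("nan") for name in areas}
    return {name: value / total for name, value in areas.items()}


# Hypothetical raw areas for one lake on one date.
raw = {"water": 30.0, "trees": 50.0, "bare": 20.0}
normalized = normalize_classes(raw)
print(normalized)
```

With this assumption, the water value can be read directly as the fraction of the total observed area classified as water.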
Source code in src/water_timeseries/dataset.py
dates_
property
Get all valid dates from the dataset.
Returns:
| Name | Type | Description |
|---|---|---|
| list | list | List of all dates from the 'date' coordinate. |
object_ids_
property
Get all valid object IDs from the dataset.
Returns:
| Name | Type | Description |
|---|---|---|
| list | list | List of all object IDs from the id_field coordinate. |
__init__(ds, id_field='id_geohash')
Initialize the LakeDataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ds | Dataset | Input xarray Dataset with land cover or water classification data. | required |
| id_field | str | Name of the coordinate field that identifies individual time series. | 'id_geohash' |
Source code in src/water_timeseries/dataset.py
create_timelapse(lake_gdf, id_geohash, timelapse_source='sentinel2', gif_outdir='gifs', buffer=100, start_year=2016, end_year=2025, start_date='07-01', end_date='08-31', frames_per_second=1, dimensions=512, overwrite_exists=False)
Create a timelapse GIF for a specific lake.
This method generates an animated GIF showing satellite imagery over a date range for a lake identified by its geohash. The timelapse captures the summer period (July-August) each year to maximize cloud-free observations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| lake_gdf | GeoDataFrame | GeoDataFrame containing lake geometries with an 'id_geohash' column. | required |
| id_geohash | str | The geohash identifier for the specific lake to visualize. | required |
| timelapse_source | str | Image source for timelapse imagery ('sentinel2' or 'landsat'). | 'sentinel2' |
| gif_outdir | str \| Path | Output directory for the GIF file. | 'gifs' |
| buffer | float | Buffer distance in meters to expand the lake bounding box. | 100 |
| start_year | int | Start year for the timelapse. | 2016 |
| end_year | int | End year for the timelapse. | 2025 |
| start_date | str | Start date within each year (MM-DD format). | '07-01' |
| end_date | str | End date within each year (MM-DD format). | '08-31' |
| frames_per_second | int | Animation speed. | 1 |
| dimensions | int | Pixel dimensions for the output GIF. | 512 |
| overwrite_exists | bool | If False, skip download if output file already exists. If True, always re-download and overwrite existing file. | False |
Returns:
| Type | Description |
|---|---|
| Path \| None | Path to the generated GIF file, or None if skipped due to existing file. |
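To make the date parameters concrete, the following sketch builds one (start, end) window per year from start_year, end_year, start_date, and end_date. It assumes that end_year is inclusive, which the documentation does not state. The helper yearly_windows is hypothetical and not a function of the package.

```python
from datetime import date


def yearly_windows(start_year, end_year, start_date="07-01", end_date="08-31"):
    """Return one (start, end) date pair per year; end_year is assumed inclusive."""
    windows = []
    for year in range(start_year, end_year + 1):
        # start_date and end_date use the MM-DD format described in the table.
        start = date.fromisoformat(f"{year}-{start_date}")
        end = date.fromisoformat(f"{year}-{end_date}")
        windows.append((start, end))
    return windows


# Defaults of LakeDataset.create_timelapse: 2016 to 2025, July 1 to August 31.
windows = yearly_windows(2016, 2025)
print(len(windows), windows[0], windows[-1])
```

Under these assumptions, each window would correspond to one frame of the GIF, so the default settings would give ten frames.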
Source code in src/water_timeseries/dataset.py
merge(other, how='both')
Merge this LakeDataset with another LakeDataset.
Combines the .ds attributes of both datasets. Both datasets must have the same variables and be of the same type (e.g., both DWDataset or both JRCDataset). The merge strategy is determined by the how parameter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | LakeDataset | Another LakeDataset instance to merge with. | required |
| how | str | Merge strategy. Options: (1) "both": Merge along both dimensions (date and id_geohash). Combines all data from both datasets, keeping all unique dates and id_geohashes. (2) "date": Merge along the "date" dimension only. Both datasets must have the same id_geohash values, but can have different dates. New dates are appended to the existing time series. (3) "id_geohash": Merge along the "id_geohash" dimension only. Both datasets must have the same dates, but can have different id_geohashes. New id_geohashes (lakes) are added with their time series. | 'both' |
Returns:
| Name | Type | Description |
|---|---|---|
| LakeDataset | LakeDataset | A new LakeDataset with merged .ds data. |
Raises:
| Type | Description |
|---|---|
| TypeError | If the datasets are of different types. |
| ValueError | If the merge strategy is invalid or datasets are incompatible. |
Example
merged = dataset1.merge(dataset2, how="both")
merged = dataset1.merge(dataset2, how="date")  # Add new dates
merged = dataset1.merge(dataset2, how="id_geohash")  # Add new lakes
Source code in src/water_timeseries/dataset.py
plot_timeseries(id_geohash, breakpoints)
Plot the time series for a specific geohash.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| id_geohash | str | The geohash identifier for the location. | required |
| breakpoints | BreakpointMethod | Breakpoint detection method to use. | required |
Source code in src/water_timeseries/dataset.py
Merge Functionality
The LakeDataset class and its subclasses (DWDataset, JRCDataset) provide a merge() method to combine two datasets. This is useful for:
- Combining datasets from different time periods
- Adding new lakes to an existing dataset
- Combining partial datasets into a complete one
Merge Strategies
The merge() method accepts a how parameter with three options:
| Strategy | Description | Requirements |
|---|---|---|
| "both" | Merge along both dimensions (date and id_geohash). Combines all unique data from both datasets. | Same variables |
| "date" | Merge along the date dimension only. Adds new dates for the same lakes. | Same id_geohash values, same variables |
| "id_geohash" | Merge along the id_geohash dimension only. Adds new lakes with the same dates. | Same dates, same variables |
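The requirements in the table can be read as a set of pre-merge checks. The sketch below is one hypothetical way to express them in plain Python. The name check_merge_requirements and its arguments are illustrative, and the actual validation inside merge() may differ.

```python
VALID_STRATEGIES = ("both", "date", "id_geohash")


def check_merge_requirements(how, ids_a, ids_b, dates_a, dates_b, vars_a, vars_b):
    """Raise ValueError if the inputs violate the requirements for `how`."""
    if how not in VALID_STRATEGIES:
        raise ValueError(f"Unknown merge strategy: {how!r}")
    # All strategies require the same variables in both datasets.
    if set(vars_a) != set(vars_b):
        raise ValueError("Datasets must have the same variables")
    # "date" appends new dates, so the lakes must match.
    if how == "date" and set(ids_a) != set(ids_b):
        raise ValueError('how="date" requires the same id_geohash values')
    # "id_geohash" appends new lakes, so the dates must match.
    if how == "id_geohash" and set(dates_a) != set(dates_b):
        raise ValueError('how="id_geohash" requires the same dates')


# Same lake, different dates: acceptable for how="date".
ids = ["b7uefy0bvcrc"]
check_merge_requirements(
    "date", ids, ids, ["2022-07-01"], ["2023-07-01"], ["water"], ["water"]
)
```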
Examples
from water_timeseries.dataset import DWDataset
import xarray as xr
# Load two datasets
ds1 = xr.open_dataset("data_2020_2022.zarr")
dataset1 = DWDataset(ds1)
ds2 = xr.open_dataset("data_2023_2024.zarr")
dataset2 = DWDataset(ds2)
# Merge along both dimensions
merged = dataset1.merge(dataset2, how="both")
# Add new dates to existing time series (same lakes)
# Both datasets must have the same id_geohash values
merged = dataset1.merge(dataset2, how="date")
# Add new lakes with the same temporal coverage
# Both datasets must have the same dates
merged = dataset1.merge(dataset2, how="id_geohash")
Warnings
When there are overlapping values, a warning is issued:
- how="date": Warns if there are duplicate dates between datasets
- how="id_geohash": Warns if there are duplicate id_geohash values
In both cases, data from the second dataset will overwrite the first for overlapping values.
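As an illustration of the overwrite rule, here is a minimal sketch for how="date" on a single lake, using plain dictionaries instead of xarray. The helper merge_along_date and the sample values are hypothetical. Only the behavior (warn on duplicates, keep the second dataset's values) is taken from the text above.

```python
import warnings


def merge_along_date(first, second):
    """Merge two {date: value} series; values from `second` win on overlap."""
    overlap = sorted(set(first) & set(second))
    if overlap:
        warnings.warn(
            f"Duplicate dates found: {overlap}; values from the second dataset are kept"
        )
    # Later keys overwrite earlier ones, so `second` takes precedence.
    merged = {**first, **second}
    # ISO-formatted date strings sort chronologically.
    return dict(sorted(merged.items()))


first = {"2022-07-01": 0.40, "2022-08-01": 0.42}
second = {"2022-08-01": 0.45, "2023-07-01": 0.38}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    merged = merge_along_date(first, second)

print(merged)
print(len(caught), "warning(s) issued")
```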
Requirements
- Both datasets must be of the same type (both DWDataset or both JRCDataset)
- Both datasets must have the same variables
- The specific merge strategy may have additional requirements (see table above)
Return Value
The merge() method returns a new LakeDataset instance (of the same type as the first dataset) with the combined data. The returned dataset is fully preprocessed and normalized.
Plot Time Series
Both DWDataset and JRCDataset provide a plot_timeseries() method to visualize water extent over time for a specific lake.
DWDataset.plot_timeseries()
from water_timeseries.dataset import DWDataset
import xarray as xr
# Load data
ds = xr.open_zarr("lakes_dw.zarr")
dataset = DWDataset(ds)
# Plot time series for a specific lake
fig = dataset.plot_timeseries(
    id_geohash="b7uefy0bvcrc",
    breakpoints=None,  # Optional: pass BreakpointMethod to overlay detected breaks
)
# Show the plot
fig.show()
JRCDataset.plot_timeseries()
from water_timeseries.dataset import JRCDataset
import xarray as xr
# Load data
ds = xr.open_zarr("lakes_jrc.zarr")
dataset = JRCDataset(ds)
# Plot time series
fig = dataset.plot_timeseries(
    id_geohash="b7uefy0bvcrc",
    breakpoints=None,  # Optional: BreakpointMethod to overlay detected breaks
)
fig.show()
Parameters
| Parameter | Type | Description |
|---|---|---|
| id_geohash | str | The geohash identifier for the lake to plot |
| breakpoints | BreakpointMethod, optional | Breakpoint detection result to overlay on the plot (e.g., from SimpleBreakpoint or BeastBreakpoint) |
Return Value
Returns a matplotlib.figure.Figure object that can be displayed or saved.
With Breakpoint Overlay
import xarray as xr
from water_timeseries.dataset import DWDataset
from water_timeseries.breakpoint import SimpleBreakpoint
# Initialize dataset
dataset = DWDataset(xr.open_zarr("lakes_dw.zarr"))
# Detect breakpoints
bp = SimpleBreakpoint()
breaks = bp.calculate_break(dataset, object_id="b7uefy0bvcrc")
# Plot with breakpoint overlay
fig = dataset.plot_timeseries(
    id_geohash="b7uefy0bvcrc",
    breakpoints=breaks,
)
fig.show()
Visual Output
DWDataset Time Series

The DWDataset plot shows land cover class proportions as a stacked area chart:
- Water (blue): Primary water extent indicator
- Vegetation classes (trees, grass, crops, shrub): Grouped in green tones
- Other classes (built, bare, snow): Shown in distinct colors
- Values are normalized to total area (0-1 scale)
JRCDataset Time Series

The JRCDataset plot shows permanent vs seasonal water as a line chart:
- Permanent water (blue): Water present year-round
- Seasonal water (light blue): Water present seasonally
- Land: Dry land area (shown in brown/green)
- Values are percentages (0-100)
With Breakpoint Overlay
When a breakpoint is detected, a vertical dashed line marks when significant water extent changes occurred.