Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

SurtGIS

High-performance geospatial analysis in Rust. A single CLI binary and a set of Rust crates that compute 127 terrain, hydrology, imagery, morphology, interpolation, and classification algorithms — with native GeoTIFF I/O, streaming for rasters larger than RAM, native CRS reprojection, and an end-to-end STAC composite pipeline for Planetary Computer and Earth Search. 56 of the 127 algorithms also compile to WebAssembly and run in any modern browser without a server, including the complete Florinsky 14-curvature morphometric system.

This book is the user guide. For the auto-generated API reference (for embedding SurtGIS as a library in Rust code), see docs.rs/surtgis-core.

What this book is for

If you’re looking for:

  • “Does this fit my workflow?” → start with Installation then the First terrain analysis tutorial. 30 minutes and you’ll know.
  • “How do I do X?”How-to guides are task-focused and short. Pick the one that matches your problem.
  • “What flags does surtgis terrain slope take?”CLI reference.
  • “Why does SurtGIS do it this way?”Explanation covers architecture, memory model, and side-by-side comparisons with GDAL, GRASS, and WhiteboxTools.

The four sections don’t overlap by design. Tutorials teach. How-to guides solve. Reference is authoritative. Explanation reasons. This is the Diátaxis framework.

Design principles

  1. Native I/O, no GDAL required. GeoTIFF read/write including predictor 1/2/3, DEFLATE, GDAL_NODATA, and Cloud Optimized (COG) over HTTP range requests. GDAL is available as an optional feature.

  2. Streaming over in-memory when data grows. Algorithms that fit the strip-processing pattern (Horn slope, Gaussian smoothing, flow direction, fill-sinks, and about ten more) run with bounded RAM regardless of raster size. Detection is automatic; a --streaming flag forces the path when you want to be sure.

  3. One binary per platform. The CLI is statically-linked where possible. The default feature set for the precompiled binary covers STAC + COG + climate data (Zarr) + UTM reprojection. System-library dependencies (libnetcdf, HDF5) are opt-in via --all-features when building from source.

  4. Validated against the canonical alternatives. Slope, aspect, hillshade, flow accumulation, and watershed outputs agree with GDAL 3.11, GRASS 8.3, and WhiteboxTools within documented tolerances. The cross-validation numbers live in the paper (Environmental Modelling & Software, under review).

  5. Honest about tradeoffs. SurtGIS is not faster than everything at everything — see When to use SurtGIS vs alternatives for where the Rust reimplementation wins (I/O pipelines, parallel terrain factors, streaming) and where you’re better off reaching for GDAL or GRASS.

Current status

As of v0.8.1, SurtGIS has been used end-to-end on a 15-watershed Chilean landslide-susceptibility pipeline (Sentinel-2 composites, 41 M-cell areas, 60 labelled wells). The accompanying paper is in major-revision at Environmental Modelling & Software. Recent releases focused on: v0.7.5 fixed a critical decoder bug for Planetary Computer COGs with BitsPerSample=15 packed encoding; v0.7.4 added the native reproject command, removing the last dependency on gdalwarp; v0.8.x roughly doubled the WebAssembly surface, from 33 to 56 algorithms.

No external users are known yet — feedback from people outside the original developer/postdoc loop is exactly what would take this from “works for one expert” to “works for a community”. If that’s you, please open an issue or try the live demo — it runs the 56 WebAssembly algorithms client-side, so your DEM never leaves your machine.

Installation

Three paths, in order of increasing effort.

Download the archive for your platform from the GitHub releases page:

PlatformArchive
Linux x86_64surtgis-v<version>-x86_64-unknown-linux-gnu.tar.gz
macOS arm64 (Apple Silicon)surtgis-v<version>-aarch64-apple-darwin.tar.gz
Windows x86_64surtgis-v<version>-x86_64-pc-windows-msvc.zip

Unpack and put surtgis (or surtgis.exe) somewhere on your PATH:

# Linux / macOS
tar xzf surtgis-*.tar.gz
sudo mv surtgis /usr/local/bin/
surtgis --version

The precompiled binary has the feature set cloud,zarr,projections: STAC, COG, climate-data Zarr readers, and UTM reprojection. This covers every tutorial and every how-to in this book.

2. cargo install (for the full feature set)

If you have the Rust toolchain and want netcdf / grib support on top, install from crates.io:

cargo install surtgis --all-features

System libraries required for --all-features:

FeatureLinuxmacOSWindows
netcdflibnetcdf-devbrew install netcdfnot supported
griblibgribapi or eccodesbrew install eccodesnot supported

If you only need the same feature set as the precompiled binary:

cargo install surtgis

3. From source (for contributors)

git clone https://github.com/franciscoparrao/surtgis
cd surtgis
cargo build --release -p surtgis --bin surtgis
./target/release/surtgis --version

The workspace contains ten crates. You only need the surtgis package for the CLI; the library crates (surtgis-core, surtgis-algorithms, surtgis-cloud, etc.) are separately publishable for embedding.

Verifying the install

Any of these should return sensible output:

surtgis --version                    # surtgis 0.7.0
surtgis --help                       # top-level command list
surtgis terrain --help               # terrain subcommand list

If the binary runs but a specific command fails with “feature not enabled”, you’re on the precompiled binary and the command needs netcdf / grib / gdal. Rebuild from source with cargo install surtgis --all-features.

Next step

Head to the first tutorial for an end-to-end walk-through using real data from the Copernicus Digital Elevation Model.

First terrain analysis from a Copernicus DEM

This tutorial walks you end-to-end through a realistic terrain analysis: you will download a real digital elevation model from a public STAC catalog, reproject it to a metric coordinate system, compute slope, aspect, and hillshade, clip the result to a smaller area, and understand what the outputs actually mean. It takes about 30 minutes.

By the end you will have:

  • A 30-metre DEM of a ~11 km × 5 km patch of the Chilean Andes on disk.
  • A set of terrain-factor rasters derived from it.
  • A clear mental model of SurtGIS’s command flow: source → transform → derive → consume.

Prerequisites: SurtGIS installed (see Installation). Network access for the STAC download step (about 20 MB transferred).

1. Set up a working directory

mkdir -p ~/surtgis-tutorial
cd ~/surtgis-tutorial

All subsequent commands expect you to be in this directory.

2. Download the DEM from STAC

We’ll pull a patch of the Copernicus GLO-30 DEM (30-metre global coverage) from Microsoft’s Planetary Computer:

surtgis stac fetch-mosaic \
    --catalog pc \
    --collection cop-dem-glo-30 \
    --bbox=-71.0,-35.1,-70.9,-35.05 \
    dem_wgs84.tif

What each flag does:

  • --catalog pc — use Planetary Computer. The other supported catalogs are es (Earth Search) and any full STAC API URL. For cop-dem-glo-30, PC is the cleaner source (Earth Search returns multiple overlapping tiles that slow the download).
  • --collection cop-dem-glo-30 — the STAC collection ID. PC’s browser at planetarycomputer.microsoft.com/catalog lists every available.
  • --bbox=-71.0,-35.1,-70.9,-35.05 — west, south, east, north in degrees. Note the = syntax: with a negative number as the first value, the shell would otherwise interpret -71.0 as a flag.
  • dem_wgs84.tif — output filename. The _wgs84 suffix is a note to future-us: this DEM is in EPSG:4326 (latitude/longitude). We’ll reproject in a moment.

The command should take ~10 seconds. Verify:

surtgis info dem_wgs84.tif

You should see something close to:

Dimensions: 360 x 182 (65520 cells)
Cell size: 0.0002777777777777778
Bounds: (-71.000000, -35.100278) - (-70.900000, -35.049722)
CRS: EPSG:4326

Statistics:
  Min: 400.0
  Max: 1380.0
  Mean: ~800
  Valid cells: 65520 (100.0%)

Elevations between 400 m and 1380 m tell us this patch is at a foothill level of the Andes, not in the high peaks. That’s realistic for the latitude (35°S, near Curicó).

3. Why we have to reproject before computing slope

Here is a trap almost every newcomer falls into, so we’re going to walk straight into it on purpose.

Try computing slope on this DEM directly:

surtgis terrain slope dem_wgs84.tif slope_wgs84.tif
surtgis info slope_wgs84.tif

Look at the statistics:

Statistics:
  Min: 89.8
  Max: 90.0
  Mean: ~89.99

Every pixel is saying “89.99 degrees of slope” — essentially vertical, everywhere. This is wrong; the Andes foothills are not a sheer cliff.

Why this happens. Slope is atan(rise / run). The run in our DEM is measured in degrees of longitude (about 0.000278° per pixel), but the rise is measured in metres. When you divide metres by degrees, the ratio is on the order of hundreds or thousands, so the atan saturates to 90°.

The fix is to reproject the DEM into a coordinate system where both axes are metres. For Chile at 35°S, that’s UTM zone 19 South (EPSG:32719). SurtGIS doesn’t have a dedicated reproject command yet — but stac fetch-mosaic accepts --align-to which can reproject-on-read, and the resample command is the general-purpose tool for grid alignment. For this tutorial we’ll take a simpler route: ask STAC for the DEM in the right projection in one shot.

4. Reproject via resample to a reference grid

If you have a reference raster in the target CRS, resample is the cleanest path:

# (skip this if you don't have an existing UTM raster — see the next
# section for the alternate path)
surtgis resample dem_wgs84.tif dem_utm.tif --reference reference_utm.tif

Most first-time users don’t have a reference. For a standalone reprojection you’ll want the more direct approach below.

5. Alternative: process a native-metric DEM from a different source

For this tutorial we’ll sidestep the reprojection question entirely by working with a small pre-projected DEM we can compute slope on directly. The sentinel-2-l2a collection on Planetary Computer serves imagery already in UTM, and its companion DEM products do the same.

If you’re following along without a reference raster, skip to tutorial 2 Sentinel-2 composite and NDVI which starts with UTM data and avoids this rabbit hole.

For the remainder of this tutorial, if you want to compute terrain factors on the dem_wgs84.tif we just downloaded, you will need to reproject it first. The simplest way today is a one-liner with GDAL (yes, the tool SurtGIS doesn’t depend on — we’re not dogmatic about it):

gdalwarp -t_srs EPSG:32719 -tr 30 30 -r bilinear dem_wgs84.tif dem_utm.tif
surtgis info dem_utm.tif

Output:

Cell size: 30
CRS: EPSG:32719

Now the cells are 30 m on a side, and we can compute slope meaningfully.

This is a documented limitation of SurtGIS 0.7.0. A native reproject command is on the roadmap. Until it lands, GDAL’s gdalwarp is the recommended path. See the Reproject between UTM zones how-to for more.

6. Compute the core terrain factors

With a UTM DEM, we can compute the standard factors:

surtgis terrain slope    dem_utm.tif slope.tif
surtgis terrain aspect   dem_utm.tif aspect.tif
surtgis terrain hillshade dem_utm.tif hillshade.tif

Each of these takes a fraction of a second on a patch this size. Verify one:

surtgis info slope.tif

You should see a reasonable range now, roughly:

Min: 0.0
Max: ~45
Mean: ~12

Twelve degrees mean slope is consistent with Andean foothills. Peak slopes of 45° correspond to the steeper gullies.

7. Or: compute every standard factor in one pass

The terrain all subcommand computes the full set (slope, aspect, hillshade, curvature, TPI, TRI, roughness) in one shot, reading the DEM once and writing each factor as a separate file:

surtgis terrain all dem_utm.tif --outdir factors/ --compress
ls factors/
slope.tif  aspect.tif  hillshade.tif  curvature.tif  tpi.tif  tri.tif  roughness.tif

This is the ergonomic path for “give me the standard factor set to feed into a model”. For ad-hoc experimentation, the individual commands are fine.

8. Clip to a smaller area of interest

If your model only cares about the eastern half of this patch:

surtgis clip factors/slope.tif --bbox=300000,6110000,315000,6120000 slope_east.tif

--bbox is in the raster’s CRS (so UTM metres here, not degrees). Alternatively, --polygon my_aoi.gpkg clips by a vector geometry — see Clip by polygon how-to for that workflow.

9. Understand what you just produced

Let’s make sure the outputs actually make sense.

Slope is degrees above horizontal, 0 to 90. Flat plains are 0–3°, gentle hills 5–15°, steep terrain 25–45°, cliffs >45°. If your DEM yields slopes above 70° widely, something is wrong with units (as we saw in step 3) or with artifacts in the DEM.

Aspect is compass direction the slope faces, 0–360° (or −1 where the slope is too flat to have a direction). 0° = north, 90° = east, 180° = south, 270° = west.

Hillshade is a synthetic illumination value, 0–255, simulating a light source at a default azimuth (315°) and altitude (45°). It’s purely for visualisation; it has no physical unit. Darker values = in shadow from the light source; brighter = facing it.

Open factors/hillshade.tif in QGIS (or any raster viewer) and you should immediately see the terrain structure of the patch. That visual check is the single best way to confirm your pipeline is correct before trusting the downstream derivatives.

What’s next

  • For cloud masking and Sentinel-2 imagery, continue with Tutorial 2: Sentinel-2 composite and NDVI.
  • For scaling this to a much larger DEM (>5 GB), read Memory model and use --streaming.
  • For writing this pipeline as a script, every command here is composable via shell scripts; most users treat the CLI as the primary interface.

Common gotchas

“My slope is everywhere 89°.” You’re computing slope on a lat/lon-projected DEM. Reproject to a metric CRS first (step 5).

“All my outputs are black or blank.” Open in QGIS with “Singleband pseudocolor” and let it auto-stretch the min–max. The TIFF viewer on your OS is probably clamping to the raw byte range.

“Some pixels are NaN / nodata.” Edge pixels within one cell of the DEM boundary are undefined for 3×3 window algorithms (slope, aspect, TPI, TRI). This is intentional; the alternatives either extrapolate off the edge or silently invent values.

“The STAC download was slow.” For cop-dem-glo-30, prefer --catalog pc. Earth Search serves the same collection but with multiple overlapping tile granules — transfer volume can be 20× higher for the same bbox.

Sentinel-2 composite and NDVI from STAC

This tutorial builds a cloud-free Sentinel-2 composite from Microsoft’s Planetary Computer and derives the NDVI vegetation index from it — the classic “hello world” of satellite remote sensing. It takes about 15 minutes of wall clock (most of it network transfer).

By the end you will have:

  • Red and NIR band GeoTIFFs, mosaicked across multiple cloud-free scenes from one month, already in UTM at 10-metre resolution.
  • An NDVI raster ready to visualise or feed into a model.
  • An understanding of the STAC composite pipeline: search → mask → median → reproject.

Prerequisites: SurtGIS installed. Network access for STAC reads (~100 MB transferred for this tutorial).

1. Working directory

mkdir -p ~/surtgis-s2-tutorial
cd ~/surtgis-s2-tutorial

2. Pick a study area

We’ll use a 5 km × 5 km patch in central Chile (near Talca), Southern Hemisphere summer, so vegetation is green and cloud cover is manageable:

  • Bbox: -71.0, -35.1, -70.95, -35.05 (west, south, east, north in degrees)
  • Date range: 2024-01-01 to 2024-01-31

Small bboxes are kind to your network and to Planetary Computer’s rate limits. Once the pipeline works at 5 km, scaling to a watershed is just changing one flag.

3. Fetch a median composite of red + NIR

surtgis stac composite \
    --catalog pc \
    --collection sentinel-2-l2a \
    --asset "red,nir" \
    --bbox=-71.0,-35.1,-70.95,-35.05 \
    --datetime 2024-01-01/2024-01-31 \
    --max-scenes 3 \
    composite.tif

What’s happening, step by step:

  1. STAC search. The client queries Planetary Computer’s STAC API for sentinel-2-l2a items intersecting the bbox within the datetime range. Typically you get 5–10 candidate scenes for one month on a small bbox.
  2. Scene ranking. Up to --max-scenes (3 here) are picked by coverage of the bbox and reasonable cloud cover metadata.
  3. Per-scene fetch. For each selected scene, the COG reader opens the red and NIR assets via HTTP range requests, downloading only the tiles that intersect the bbox.
  4. Cloud masking. Sentinel-2 L2A ships with an SCL (Scene Classification Layer) raster that labels every pixel as cloud, cloud shadow, water, vegetation, etc. SurtGIS applies the mask to remove cloudy pixels.
  5. Median composite. Across the 3 cloud-masked scenes, each pixel takes the median value per band. The result is one pixel-stack with one value per band.
  6. Output. Written as UTM-projected GeoTIFFs, one per band: composite_red.tif and composite_nir.tif. The _red and _nir suffixes come from the asset names.

On a warm connection this should take about 65 seconds. Verify:

ls *.tif
surtgis info composite_red.tif
composite_nir.tif  composite_red.tif
...
Dimensions: 467 x 564 (263388 cells)
Cell size: 10
Bounds: (317595.06, 6114033.69) - (322265.06, 6119673.69)
CRS: EPSG:32719

10-metre cells, EPSG:32719 (UTM 19S), about 4.7 km × 5.6 km. Note that the output extent is slightly larger than the bbox you requested — the composite rounds to the Sentinel-2 native 10-metre grid, so you get a couple of extra pixels of padding.

4. Compute NDVI

NDVI = (NIR − Red) / (NIR + Red). It’s a vegetation index: roughly, higher values mean denser, greener vegetation. Bare soil is around 0.1, sparse grass 0.2–0.3, active crops 0.4–0.7, dense forest 0.7–0.9.

surtgis imagery ndvi \
    --red composite_red.tif \
    --nir composite_nir.tif \
    ndvi.tif

surtgis info ndvi.tif

You should see:

Statistics:
  Min: ~0.03
  Max: ~0.69
  Mean: ~0.40

A mean of 0.40 is consistent with a mix of crop fields and some bare land in mid-January. Open the TIFF in QGIS with the “pseudocolor” style and apply a red → yellow → green ramp; you’ll immediately see agricultural patterns.

5. Interpret the output

A few sanity checks to confirm your pipeline is working:

Histogram shape. NDVI over a real landscape has a bimodal or trimodal distribution: bare soil / pavement (~0.1), mixed vegetation (~0.3–0.5), dense vegetation (~0.7+). A single peak at 0 suggests the red/NIR bands got swapped. A single peak at 0.9 suggests the composite is hallucinating vegetation, possibly because the SCL mask is removing all non-vegetation pixels (unlikely but worth checking).

No NaN border. Unlike terrain derivatives (which have a 1-pixel NaN border from the 3×3 window), NDVI is a per-pixel formula and should have valid values everywhere the red and NIR bands both have values. If you see a large NaN region, that’s likely cloud cover that the SCL mask filtered out across every selected scene — increase --max-scenes to get more dates.

Distribution of valid pixels. surtgis info ndvi.tif reports “Valid cells: N/M”. For a cloud-free January in this region you should see 100%. In cloudier regions or seasons, expect 85–95%.

6. Scale to more bands and more dates

The real workflow for a machine-learning pipeline wants more than two bands. Here’s a 10-band composite over three months:

surtgis stac composite \
    --catalog pc \
    --collection sentinel-2-l2a \
    --asset "blue,green,red,rededge1,rededge2,rededge3,nir,nir08,swir16,swir22" \
    --bbox=-71.0,-35.1,-70.95,-35.05 \
    --datetime 2024-01-01/2024-03-31 \
    --max-scenes 8 \
    --cache \
    composite_all.tif

What’s new:

  • 10 bands at once. Writes composite_all_blue.tif, composite_all_green.tif, etc.
  • 3 months of data, 8 scenes median-composited.
  • --cache persists downloaded COG tiles at ~/.cache/surtgis/cog/ so re-runs reuse them instead of re-fetching.

Expected wall time: ~5 minutes. RAM peak: see the RAM budget how-to — SurtGIS will print a budget line at startup estimating peak usage.

7. Use NDVI as input to a model

Now you have a feature raster. Two common next steps:

Combine with terrain. Run the first tutorial on a DEM covering the same bbox, then the derived terrain factors (slope, aspect, TPI) plus this NDVI form a multi-feature stack that most landslide-susceptibility or vegetation-classification models want.

Extract training samples at labelled points. If you have a vector file of labelled points (e.g. wells.gpkg with a vegetation_class column):

surtgis extract \
    --features-dir ./composite_all_bands/ \
    --points wells.gpkg \
    --target vegetation_class \
    samples.csv

The result is a CSV with one row per point, columns = bands + target. Feed into XGBoost, scikit-learn, or any tabular model.

For CNN-based spatial classification, use extract-patches instead to generate image chips (see the extract-patches how-to).

Common gotchas

“The composite has huge black regions.” Your bbox straddles a Sentinel-2 MGRS tile boundary, and the scenes selected don’t all cover both sides. Increase --max-scenes to get scenes from both orbits.

“The download is slow or times out.” Planetary Computer occasionally rate-limits (HTTP 429). SurtGIS 0.6.25+ has exponential backoff with jitter built into the retry logic; transient failures should recover automatically. Persistent rate limits might warrant switching to Earth Search (--catalog es) or spreading the work across time.

“RAM climbs during long runs.” See Debug a stac composite using too much RAM. Short version: the [ram] log lines SurtGIS prints at strip/band transitions pinpoint where growth starts. Raise --band-chunk-size if you have headroom, or lower SURTGIS_RAM_BUDGET_GB if you don’t.

“I want the full scene, not just my bbox.” For a full MGRS tile (~110 km on a side at 10 m resolution), expect 10–100× the runtime and transfer volume. Use --strip-rows 256 and enable --cache. For tile-size extents the streaming pipeline really earns its keep.

Fluvial-tectonic morphometry from a DEM

This tutorial walks you from a raw digital elevation model to the five canonical metrics of tectonic geomorphology: χ (chi), normalised channel steepness (ksn), knickpoints, per-basin concavity (θ), and divide-migration asymmetry. It takes about 10 minutes once you have a DEM in hand and produces inputs that match the convention TopoToolbox (Schwanghart & Scherler 2014) established.

By the end you will have:

  • A reproducible CLI sequence that turns any projected DEM into the full v0.10 fluvial product family.
  • Three GeoJSON files (ksn segments, knickpoints, divides) ready to drape on a hillshade in QGIS or load into geopandas.
  • A CSV of per-basin θ with bootstrap-derived 95 % CIs.
  • Working intuition for which knob to turn when an output looks off.

Prerequisites: SurtGIS ≥ 0.10.1 installed (cargo install surtgis or pip install surtgis). A DEM in a projected coordinate system in metres — UTM is the standard choice. If yours is in EPSG:4326 (latitude/longitude), reproject first: surtgis reproject dem_wgs84.tif --to EPSG:32719 dem.tif.

If you don’t have one, the First terrain analysis tutorial walks through downloading a real 30 m Andes DEM in two steps.

1. Set up a working directory

mkdir -p ~/surtgis-fluvial-tutorial
cd ~/surtgis-fluvial-tutorial
cp /path/to/your/dem.tif dem.tif

All subsequent commands expect dem.tif to be present in the working directory and to be in a projected CRS with metre units.

2. Build the hydrology stack

The fluvial algorithms operate on three derived rasters, not on the DEM directly. Producing them is the standard SurtGIS hydrology pipeline:

surtgis hydrology fill-sinks       dem.tif       filled.tif
surtgis hydrology flow-direction   filled.tif    fdir.tif
surtgis hydrology flow-accumulation fdir.tif      facc.tif

Each command takes seconds on a 1000 × 1000 raster. The three outputs together represent the full hydrologic topology: every cell knows where its water comes from (facc) and where it goes next (fdir).

Now extract the stream network. This is where the new --from-facc flag matters: without it, stream-network would recompute the whole pipeline internally and the resulting stream cells would NOT be topologically consistent with the fdir.tif you just produced. With it, the command thresholds your existing facc.tif and the topology is coherent across all four files.

surtgis hydrology stream-network --from-facc --threshold 1000 \
                                  facc.tif    streams.tif

The --threshold 1000 is in cell counts. With a 30 m DEM this is ~0.9 km² of contributing area — a reasonable threshold for mountainous terrain that captures bedrock channels but suppresses noise in zero-order hollows. For a 10 m DEM, scale to --threshold 9000 to keep the area threshold constant.

3. Compute χ (chi)

χ is the path integral of (A₀/A(x))^θref along each river profile, measured from the network’s outlets upstream. In a steady-state landscape it linearises the elevation profile against a single slope — that slope is ksn.

surtgis fluvial chi streams.tif fdir.tif facc.tif chi.tif

Defaults: --theta-ref 0.45 (the canonical bedrock-channel reference, Perron & Royden 2013) and --a-0-m2 1e6 (1 km² normalisation). The output is a Float32 GeoTIFF where every stream cell has a χ value in metres; non-stream cells are NaN.

Quick visual check in QGIS: load chi.tif, style with a continuous ramp — χ should climb monotonically from blue (low, near outlets) to red (high, near divides). Any spatial discontinuity along a single channel flags a topology problem upstream (usually flow-direction ambiguity at a confluence).

4. Compute ksn (channel steepness)

surtgis fluvial ksn streams.tif fdir.tif facc.tif filled.tif ksn.tif \
                    --segments ksn_segments.geojson

Two outputs:

  • ksn.tif — per-cell raster, smoothed over a 500 m window along the channel (Wobus et al. 2006 standard).
  • ksn_segments.geojson — one LineString per stream segment between graph terminals (confluence ↔ confluence, confluence ↔ outlet), attributes ksn_mean and n_cells. Coordinates are reprojected to WGS84 by default (RFC 7946); pass --keep-crs to preserve source CRS metres with a legacy crs declaration.

High ksn cells correspond to either fast uplift or resistant lithology. Sub-basins with anomalously high ksn relative to their neighbours are the first thing to investigate for tectonic signal.

5. Detect knickpoints

surtgis fluvial knickpoints streams.tif fdir.tif facc.tif filled.tif \
                            knickpoints.geojson \
                            --raster knickpoint_raster.tif

The output GeoJSON Points carry four properties:

  • elevation_m — denoised z at the knickpoint cell
  • magnitude_m — elevation drop across a ±2-cell window
  • chi — χ value at the cell
  • polarityconcave (slope decreases downstream, often lithology contrast) or convex (slope increases downstream, often transient response to a tectonic pulse)

The categorical raster (--raster) writes 0 for non-knickpoint, 1 for concave, 2 for convex — useful when you want to overlay knickpoint density on a hillshade.

Tuning: if you see false positives in headwaters, raise --min-magnitude-m from 10 (default) to 20-30. If you suspect real knickpoints are being smoothed away, lower --tvd-lambda from 0.5 to 0.3.

6. Per-basin concavity θ

You’ll need a basins raster first. Identify pour-points (rows/cols of the outlets of the sub-basins you care about) — usually the cells with the highest facc values along the AOI boundary. If you have them as coordinates from a GIS, convert to (row, col) using the facc.tif transform.

surtgis hydrology watershed --pour-points "1429,778;1613,466" \
                            fdir.tif basins.tif

surtgis fluvial concavity streams.tif fdir.tif facc.tif filled.tif basins.tif \
                          concavity.csv \
                          --bootstrap-n 200

The CSV has one row per basin with theta_opt, theta_ci_low, theta_ci_high, n_cells, rmse. Typical bedrock channels in steady state cluster around θ = 0.45; values < 0.3 or > 0.6 are interesting and warrant inspection — they suggest transient response, lithologic heterogeneity, or a non-uniform uplift signal.

7. Divide migration

surtgis fluvial divide-migration basins.tif filled.tif facc.tif \
                                 divides.geojson \
                                 --chi chi.tif

One LineString feature per adjacent-basin pair, with median Δχ (Willett 2014), median Gilbert Δelev, median Δrelief, and n_pairs (cell-pair count along the divide).

Interpretation: a divide where median_chi_diff > 0 and basin_a (the smaller numeric ID) is on the higher-elevation side suggests basin_a is “losing area” to basin_b. The threshold for significance is empirical — in well-studied cases like Willett’s southern Sierra Madre, |Δχ| of a few hundred metres is diagnostic; for your AOI calibrate against the obvious cases first.

8. Verify in QGIS

Load the rasters and vectors in QGIS:

LayerStyle
filled.tifhillshade (Tools → Raster → Analysis → Hillshade)
ksn.tifgraduated, viridis 10–500 range, transparent for NaN
knickpoints.geojsoncategorized on polarity (concave = blue, convex = red), graduated size by magnitude_m
ksn_segments.geojsongraduated line on ksn_mean, viridis
divides.geojsongraduated line on median_chi_diff, diverging colormap centred at 0

If you used the default WGS84 output, everything aligns directly on a basemap layer (OpenStreetMap, satellite). If you used --keep-crs, make sure the project CRS matches your source raster CRS.

Where to next

  • “What do these numbers actually mean?”Tectonic geomorphology with the fluvial module walks through the interpretation framework.
  • “My output looks weird.” — the pitfalls section of the explanation chapter covers the eight failure modes the spec calls out, including the cell-size-vs-resolution sensitivity that bites on coarse DEMs.
  • “I want to publish this.” — every output above carries provenance (parameters, source CRS, etc.) in its filename or properties; for a paper-grade pipeline log, capture the full command sequence and the SurtGIS version (surtgis --version) alongside the figures.

Reproject a raster between CRSes

SurtGIS 0.7.4 added the surtgis reproject command, which uses proj4rs for the coordinate transform and a Rayon-parallelised inverse mapping for the per-pixel sampling. There is no system GDAL dependency — this works from a single static surtgis binary.

The 30-second version

# UTM 19S → WGS84 lat/lon
surtgis reproject dem_utm.tif dem_wgs84.tif --to EPSG:4326

# WGS84 → UTM 19S, preserving roughly the same resolution
surtgis reproject dem_wgs84.tif dem_utm.tif --to EPSG:32719

# UTM 19S → Web Mercator at 30 m, nearest-neighbour (for categorical
# rasters like a classification result)
surtgis reproject classes.tif classes_mercator.tif \
    --to EPSG:3857 --pixel-size 30 --method nearest

The output pixel size is auto-inferred when both CRSes are the same kind (metric ↔ metric, or geographic ↔ geographic). For unit-changing reprojections you can pass --pixel-size <X> in target-CRS units.

Required flags

  • --to EPSG:XXXX (or just --to XXXX): the target CRS. Any EPSG code supported by proj4rs works (UTM zones, Web Mercator 3857, Lambert, national grids, etc.).

Optional flags

  • --from EPSG:XXXX: override the source CRS when the input GeoTIFF doesn’t embed one. Most modern GeoTIFFs do, in which case you can omit this.
  • --method nearest|bilinear: resampling kernel. Default is bilinear which is right for continuous data (elevation, NDVI, reflectance). Use nearest for categorical data like geomorphons or land-cover classifications, where interpolating values across class boundaries would produce meaningless “tween” classes.
  • --pixel-size <X>: output pixel size in target-CRS units (metres for projected CRSes, degrees for geographic). When omitted, SurtGIS picks a sensible default that preserves the source resolution.

What it produces

$ surtgis reproject benchmarks/results/dems/fbm_1000_raw.tif out.tif --to EPSG:4326
Reprojecting EPSG:32719 → EPSG:4326 (1000, bilinear method)
✓ wrote out.tif (649×1001) in 0.13 s

On the i7-1270P benchmark machine: a 1 000 × 1 000 UTM 19S DEM reprojects to WGS84 in 0.13 s, and to Web Mercator in 0.59 s (the output grid is bigger because Web Mercator stretches near the poles). The reproject is parallelised across output rows, so multi-core machines scale roughly linearly until memory bandwidth saturates.

Same-CRS shortcut

If the source and target CRSes are the same, SurtGIS writes the input verbatim and tells you what it did:

$ surtgis reproject foo.tif foo_copy.tif --to EPSG:32719
source and target CRS are the same (EPSG:32719); copying input to output
✓ wrote foo_copy.tif in 0.04 s

This makes reproject safe to put in scripts where the source CRS varies; the no-op case is cheap.

Alternative paths

The reproject command is the right tool for a standalone reprojection. For two adjacent use cases, prefer the dedicated shortcut:

1. Aligning to a template raster

If you already have a raster in the target CRS and want the output on the same grid (same origin, pixel size, dimensions), use resample:

surtgis resample source.tif aligned.tif --reference template.tif

resample handles CRS mismatch internally; you don’t need to call reproject first.

2. Pulling cloud data already in the target CRS

If the source is a STAC asset, the cloud commands accept --align-to and reproject-on-read so only the tiles intersecting the target grid are fetched:

surtgis stac composite \
    --catalog pc \
    --collection cop-dem-glo-30 \
    --bbox=... \
    --align-to reference_utm.tif \
    dem_aligned.tif

This is usually faster than download → reproject because only the tiles within the target extent are pulled.

Choosing a target CRS

For terrain analysis at mid-latitudes, the local UTM zone is almost always the right choice. The formula:

zone = floor((longitude + 180) / 6) + 1
hemisphere = "S" if latitude < 0 else "N"
EPSG = 32600 + zone   (Northern hemisphere)
EPSG = 32700 + zone   (Southern hemisphere)

Examples:

  • Chile around Curicó (−71°, −35°) → zone 19 South → EPSG:32719.
  • Spain around Madrid (−3°, 40°) → zone 30 North → EPSG:32630.
  • Western US around Los Angeles (−118°, 34°) → zone 11 North → EPSG:32611.

For continental-scale areas that span multiple UTM zones, use an equal-area projection appropriate to your region (Lambert Azimuthal for continental Europe, Albers Equal Area Conic for the continental US, etc.). Terrain slope is less sensitive to projection distortion than area, so UTM within a single zone is almost always good enough for analyses smaller than ~500 km across.

Caveats

  • Only nearest and bilinear are implemented today. Cubic resampling is on the roadmap. For elevation and most reflectance data, bilinear is the right default; cubic’s main advantage is smoother contour rendering, which reproject is not the right tool for anyway.
  • The output extent is computed by transforming the source raster’s four corners plus three edge midpoints, which covers most projections. Highly curved projections (large-area polar or cylindrical projections crossing the antimeridian) may need --pixel-size set explicitly.
  • For geographic-to-metric reprojections without --pixel-size, SurtGIS infers a default that roughly preserves the output column count. If you need an exact resolution (say 30 m to match Copernicus DEM), pass it explicitly.

Mosaic rasters in different CRSes

surtgis mosaic assumes all inputs share a CRS and grid. If they don’t, you’ll hit an error. Two recommended approaches.

Pre-reproject everything to a target CRS

Pick a reference raster (or create an empty one) in your target CRS, then use resample on each input to align them:

for f in tile_*.tif; do
    surtgis resample "$f" "aligned_$f" --reference reference_utm.tif
done
surtgis mosaic aligned_*.tif out.tif

resample handles both the CRS change and the grid alignment in one step, so aligned_*.tif are all on the same grid by construction.

Use stac composite if the sources are on a STAC catalog

The multi-UTM case that drove SurtGIS 0.6.1’s critical fix is exactly this: a bbox spanning two UTM zones, with different scenes in each zone. stac composite handles it automatically by reprojecting each tile’s bbox into the tile’s native CRS before reading, then unifying at mosaic time. You don’t think about zone boundaries:

surtgis stac composite \
    --bbox=-71.5,-35.5,-70.5,-34.5 \   # spans UTM 19S and 18S
    --asset red,nir \
    ...

If you really want to mosaic mixed-CRS files yourself

Reproject them all to a common CRS first (either via resample --reference or gdalwarp), then surtgis mosaic. There is no “mosaic with automatic reprojection” command; deliberately so, because implicit reprojection during mosaic usually produces surprising seams when the target grid isn’t chosen carefully.

Overlap handling

surtgis mosaic uses NaN-aware last-write-wins: a source pixel only overwrites the output if it is finite. This means irregular tile shapes (Sentinel-2 scenes with masked borders, Landsat fill pixels) do not overwrite valid data from adjacent tiles. Order of the input arguments matters only for genuinely overlapping valid regions — there, the last file wins.

Cache COG tiles for fast re-runs

Any STAC or COG command that supports --cache will persist downloaded tiles to disk and serve them from cache on subsequent runs over the same bbox and asset.

Enable

surtgis stac composite --cache --catalog pc --asset red,nir ...
surtgis stac fetch-mosaic --cache --catalog pc --collection cop-dem-glo-30 ...

The first run downloads from the origin; subsequent runs over the same bbox read from local disk. No code changes, no extra flags beyond --cache being present.

Where the cache lives

Default: $XDG_CACHE_HOME/surtgis/cog/ (typically ~/.cache/surtgis/cog/ on Linux, ~/Library/Caches/surtgis/cog/ on macOS). Override with:

export XDG_CACHE_HOME=/path/to/big/disk

Cache keys are derived from the COG base URL (without SAS query parameters) plus the exact bbox. This means re-signed Azure Blob URLs still hit cache, but shifting the bbox by even one pixel produces a cache miss.

When to use it

Always, if you’re iterating on the same area of interest. The cost is disk space (each tile is typically 1–10 MB, a medium study area might cache 0.5–2 GB); the benefit is that subsequent runs go from minutes to seconds.

When not to use it

  • One-shot production runs where you don’t expect to re-read the tiles.
  • Highly variable bboxes that don’t share tile coverage (each run caches fresh, adds disk pressure, returns nothing usable).
  • Sentinel-2 scenes with very short SAS token lifetimes where you might cache items whose signed URLs go stale. This is rare in practice; SurtGIS re-signs tokens automatically within the same run.

Cache hygiene

No automatic eviction yet. Manually clear when the cache gets too big:

rm -rf ~/.cache/surtgis/cog/

A TTL-based eviction policy is on the roadmap. Until then, a cron job removing files older than N days is a reasonable workaround.

Disk layout

Tiles are sharded into 256 subdirectories by hash prefix so no single directory grows unbounded:

~/.cache/surtgis/cog/
├── 00/
│   ├── 42/
│   │   └── <rest-of-hash>.tif
│   └── ...
├── 01/
└── ...

Safe to browse with any file manager. The .tif files are regular GeoTIFFs and can be opened directly.

Tune the RAM budget

For workflows that approach or exceed your machine’s RAM, SurtGIS gives you two dials.

Override the budget target (stac composite)

stac composite auto-sizes strip_rows to fit within a 16 GB budget by default. Lower or raise it with an environment variable:

# Conservative — fit the pipeline in 8 GB
SURTGIS_RAM_BUDGET_GB=8 surtgis stac composite ...

# Permissive — machine has 64 GB, let strips be larger for fewer HTTP calls
SURTGIS_RAM_BUDGET_GB=32 surtgis stac composite ...

SurtGIS will print the budget and the derived strip_rows on startup:

RAM budget (16.0 GB target, band_chunk_size=1) — output: 3.3 GB | mask cache: 5.0 GB | scene strips: 1.3 GB | band working: 0.4 GB | decode: 0.1 GB (strip_rows=512) → ~10.1 GB peak

That ~10.1 GB peak is the estimate. Real peak is typically within ±30%. If it comes in much higher, you’ve hit an edge case — see Debug a stac composite using too much RAM.

Override the memory ceiling (general commands)

For commands that operate on in-memory rasters:

surtgis --max-memory 4G terrain slope huge.tif slope.tif

Any raster larger than --max-memory (default 500 MB) triggers automatic streaming mode on commands that support it. Force it explicitly with --streaming:

surtgis --streaming terrain slope huge.tif slope.tif

If you pass --streaming on a command that hasn’t been adapted to the streaming path, you get a hard error. See Memory model for the list of streamable algorithms.

Rule of thumb for sizing

On a host with X GB of RAM, leave about 0.4 × X GB free for the OS and other processes. So:

Host RAMSafe SURTGIS_RAM_BUDGET_GB
8 GB4
16 GB9
32 GB18
38 GB (postdoc machine)22
64 GB36

These aren’t hard caps — SurtGIS won’t strictly respect the budget, it sizes strip_rows to aim for it. Overshoot is possible. When the overshoot matters (you’re on a shared machine, or the OOM killer is watching), pair the budget override with an external watchdog:

# Kill if RSS exceeds 14 GB
( surtgis stac composite ... & PID=$!
  while kill -0 $PID 2>/dev/null; do
    RSS=$(awk '/VmRSS:/ {print $2}' /proc/$PID/status 2>/dev/null || echo 0)
    if [ "$RSS" -gt 14000000 ]; then kill -9 $PID; break; fi
    sleep 30
  done
)

Raising band_chunk_size for speed at RAM cost

The --band-chunk-size K flag on stac composite controls how many bands are downloaded and processed together per scene. K=1 (default) is minimum RAM; higher K reduces HTTP request count at proportional RAM cost:

KScene stripsHTTP requests per scene
1 (default)1× baselinen_bands ×
33× baselinen_bands/3 ×
55× baselinen_bands/5 ×

On a 38 GB host with a 10-band composite, K=5 is comfortable. The budget printed at startup scales with K explicitly, so you can see the predicted peak before committing.

Debug a stac composite using too much RAM

If surtgis stac composite is consuming more RAM than expected, work through these in order.

1. Read the budget line

Every stac composite run prints at startup something like:

RAM budget (16.0 GB target, band_chunk_size=1) — output: 3.3 GB | mask cache: 5.0 GB | scene strips: 1.3 GB | band working: 0.4 GB | decode: 0.1 GB (strip_rows=512) → ~10.1 GB peak

Five components:

ComponentWhat it isScales with
outputPre-allocated output buffers, one per bandn_bands × total_cells
mask cacheMosaicked cloud masks, one per scenen_scenes × strip_rows
scene stripsPer-scene data for the active band chunkband_chunk_size × n_scenes × strip_rows
band workingThe active band’s mosaicked tile cacheband_chunk_size × strip_rows
decodeIn-flight tile decode bufferstile_concurrency

If ~X GB peak is comfortably within --max-memory (or your host’s free RAM), proceed; if not, see step 2.

2. Watch the [ram] transition log lines

Since v0.6.26, SurtGIS prints RSS and cumulative-tile counters at strip/phase/band-chunk transitions:

[ram] baseline before strip loop: RSS=120 MB
[ram] strip 1/2 start: RSS=3300 MB, tiles_cumulative=0
[ram] strip 1/2 after Phase A (masks): RSS=5800 MB, phase_a_tiles=1890, cumulative=1890
[ram] strip 1/2 chunk bands [0..1] start: RSS=6200 MB, cumulative=1890
[ram] strip 1/2 chunk bands [0..1] end: RSS=6400 MB, cumulative=3780

Rule of thumb: if RSS grows between a chunk start and the matching chunk end by more than the budget’s scene strips + band working, the model is wrong for this workload — file an issue with the full log. That’s exactly how the fragmentation bug in v0.6.24 was found.

3. Lower the budget, then the bbox

In order of impact:

Cut SURTGIS_RAM_BUDGET_GB. Halves strip_rows, proportionally halves mask cache and scene strips:

SURTGIS_RAM_BUDGET_GB=8 surtgis stac composite ...

Cut --band-chunk-size. K=1 is the minimum; you can’t go lower. If you’re already at 1 and still over budget, the workload is structurally too big for the host.

Cut --max-scenes. Each scene you drop trims mask cache and scene strips proportionally. Going from 42 scenes to 20 halves both. Median composite quality degrades, but on a cloud-free month that can be acceptable.

Cut the bbox. Halving the bbox area quarters output and proportionally shrinks every other component. If the bbox is a watershed, process by sub-watershed and mosaic the outputs afterwards.

4. If RSS grows linearly with no plateau

That’s the fragmentation signature from the v0.6.19–v0.6.24 debugging arc. Should not recur on v0.6.26+ (mimalloc fixed it). If it does on a later version:

  • Confirm you’re on v0.6.26 or later: surtgis --version.
  • Share the [ram] log plus surtgis --version plus OS + allocator in an issue.

5. Checklist for a stuck run

Before giving up:

  • Current version ≥ v0.6.26?
  • --cache enabled if re-running the same bbox?
  • SURTGIS_RAM_BUDGET_GB set to ~0.6× host RAM?
  • --band-chunk-size 1?
  • bbox reasonable (< ~200 km × 200 km for S2 L2A at 10 m)?
  • Host not competing with another RAM-heavy process?

If all yes and it still OOMs, that’s a genuine bug, not a tuning problem.

Extract patches: points vs polygons

surtgis extract-patches writes CNN-ready NPY tensors from feature rasters. The two modes answer different questions.

Points mode

Use when you have labels at specific point locations and want spatial context (a patch) around each point for the model to learn from.

Example: 60 water wells with measured water-table depth. Each well gets a 256 × 256 patch of the surrounding terrain/imagery features; the model learns to predict water depth from that spatial context.

surtgis extract-patches \
    --features-dir factors/ \
    --points wells.gpkg \
    --label-col water_depth \
    --size 256 \
    patches/
  • One patch per point.
  • --label-col can be numeric (i64 for classification, f32 for regression, auto-detected) or a string that parses to a number.
  • Points whose patch would extend outside the raster are silently skipped.

Polygons mode

Use when you have labelled regions and want many patches per region for training a CNN to classify or regress pixels.

Example: a shapefile with land-cover polygons (crop, pasture, bare soil, forest). Each polygon produces many patches; the model learns per-polygon class from spatial context.

surtgis extract-patches \
    --features-dir factors/ \
    --polygons land_use.gpkg \
    --label-col class \
    --size 256 \
    --stride 128 \
    patches/
  • Grid sampling: patches are placed every --stride pixels (default = --size, giving non-overlapping tiles). Setting --stride to half of --size gives 50 % overlap, which common for data augmentation.
  • Each candidate centre is tested for point-in-polygon; patches whose centre falls inside the polygon inherit its --label-col value.
  • MultiPolygons are flattened. Interior rings (holes) are not honoured in v1 — patch centres in holes still get the outer polygon’s label. This is a documented limitation.

Which to pick

  • Sparse labels on a grid you care about → points.
  • Dense labels over connected regions → polygons.
  • Both → run twice to two different output directories and merge.

Output format

Both modes write the same four files:

patches_dir/
├── patches.npy    # [N, bands, H, W] f32
├── labels.npy     # [N] i64 or f32
├── manifest.csv   # idx, label, center_row/col, center_x/y, source_idx
└── meta.json      # bands order, CRS, pixel size, skipped counts, seed

Load in Python:

import numpy as np
X = np.load('patches_dir/patches.npy')  # ready for torch.from_numpy(X)
y = np.load('patches_dir/labels.npy')

Memory ceiling

The entire patches.npy tensor is held in RAM before writing in v0.7.0. For N = 10,000 patches of 256 × 256 × 10 bands × 4 bytes, that’s 26 GB. Two mitigations:

  • --max-patches N deterministically subsamples down to N (seeded by --seed, default 42). Useful for balancing very dense polygon grids.
  • Shrink --size or --stride to reduce patch count.

Streaming writes (no full-tensor-in-RAM) are on the roadmap; they require knowing N upfront without extracting data, which is doable but not yet done.

NaN filtering

--skip-nan-threshold 0.1 (default) skips any patch where more than 10 % of pixels are NaN across all bands. Raise it to 0.5 if you’re tolerant of partial coverage; lower it to 0.0 to require fully valid patches. The per-patch NaN count lands in manifest.csv for downstream filtering.

Prepare training data for a Geospatial Foundation Model

This guide walks through the full pipeline from a STAC bbox to a tensor ready to fine-tune Prithvi-EO-2.0 (NASA / IBM). The same flow works for Clay v1.5 by swapping one flag.

The motivation: in October 2025 the InstaGeo paper identified that no published Geospatial Foundation Model ships its preprocessing pipeline — users get only model checkpoints. SurtGIS fills that gap with a single CLI command.

Step 1 — fetch a multi-temporal HLS composite

Prithvi was pre-trained on Harmonized Landsat-Sentinel-2 (HLS). For each timestamp you want in the temporal stack, run one stac composite call into its own subdirectory. The subdirectory name becomes the timestamp label in the output tensor.

# 2024 Maule, Chile — three months of HLS coverage
for month in 01 02 03; do
    surtgis stac composite \
        --catalog pc \
        --collection hls2-s30 \
        --asset "B02,B03,B04,B05,B06,B07" \
        --bbox=-72.0,-35.5,-71.5,-35.0 \
        --datetime "2024-${month}-01/2024-${month}-31" \
        --max-scenes 10 \
        features/t_2024_${month}/composite.tif
done

Each composite uses the HLS Fmask asset for cloud masking automatically — the asset name “Fmask” triggers SurtGIS’s HlsFmask strategy (cloud, adjacent-to-cloud, and cloud-shadow bits dropped; cirrus and snow kept). The output is one .tif per band per timestamp.

After the loop your tree should look like:

features/
├── t_2024_01/
│   ├── composite_B02.tif
│   ├── composite_B03.tif
│   └── ... (one per asset)
├── t_2024_02/
│   └── ...
└── t_2024_03/
    └── ...

Step 2 — extract Prithvi-ready chips

One command turns the per-timestamp directories into a tensor matched to Prithvi-EO-2.0’s input convention:

surtgis extract-patches \
    --features-dir features/ \
    --points labels.geojson \
    --label-col landslide_class \
    --profile prithvi-v2 \
    --size 224 \
    --output-format zarr \
    --emit-stac \
    chips/

What the flags do, in order:

  • --features-dir features/: SurtGIS auto-detects multi-timestamp mode because the directory contains subdirectories (each holding the same band set). Subdirs are sorted lexicographically — that’s why we used t_2024_01, t_2024_02, t_2024_03.
  • --profile prithvi-v2: validates that 6 bands are present in the order Prithvi expects (B02, B03, B04, B05, B06, B07), then applies the per-band z-score normalization from the official Prithvi-EO-2.0-300M config. Records full provenance in meta.json.
  • --size 224: matches Prithvi’s pre-training tile size. The profile warns if you pass a different size, since the model would then need to resize at training time.
  • --output-format zarr: emit chunked Zarr v2 instead of a single .npy. One chunk per chip, parallel I/O. Optional but useful when N is large.
  • --emit-stac: write a STAC ML-AOI Collection and one Item per chip, embedding the STAC MLM extension with mlm:model_target = ibm-nasa-geospatial/Prithvi-EO-2.0-300M and the full mlm:input descriptor. This is what makes the dataset publishable on its own.

The output tree:

chips/
├── patches.zarr/            # [N, 6, 3, 224, 224] f32, chunk = 1 chip
│   ├── .zarray
│   ├── .zattrs
│   └── 0.0.0.0.0, 1.0.0.0.0, ...
├── labels.npy               # [N] i64 or f32
├── manifest.csv
├── meta.json                # bands, timestamps, gfm_profile{...}
└── stac/
    ├── collection.json      # MLM + ML-AOI
    └── items/
        ├── chip_000000.json
        └── chip_000001.json

The tensor shape [N, 6, 3, 224, 224] is the standard [batch, channels, time, height, width] that Prithvi accepts directly.

Step 3 — load and fine-tune

With TerraTorch (the IBM fine-tuning toolkit):

import zarr
import numpy as np
import torch

X = zarr.open('chips/patches.zarr', mode='r')   # [N, 6, 3, 224, 224]
y = np.load('chips/labels.npy')

# Materialise into a Torch dataset — the Zarr open is lazy, [:] forces it.
X_tensor = torch.from_numpy(X[:])
y_tensor = torch.from_numpy(y)

# Prithvi-EO-2.0 expects (B, C, T, H, W) which is exactly our layout.
# No transpose, no extra normalization — SurtGIS already z-scored using
# the official Prithvi statistics.

The profile metadata in meta.json lets a TerraTorch config file pick up the right band order, normalization params, and tile size without hand-editing:

import json
with open('chips/meta.json') as f: m = json.load(f)
print(m['gfm_profile']['model_target'])
# → 'ibm-nasa-geospatial/Prithvi-EO-2.0-300M'
print(m['tensor_layout'])
# → '[N, C, T, H, W]'

Variation — Clay v1.5 instead

Clay expects 10 Sentinel-2 bands at tile 256 with reflectance values in [0, 1]. Swap the profile and band list:

surtgis stac composite \
    --catalog pc --collection sentinel-2-l2a \
    --asset "B02,B03,B04,B05,B06,B07,B08,B8A,B11,B12" \
    --bbox=-72.0,-35.5,-71.5,-35.0 --datetime 2024-02-01/2024-02-28 \
    features/t0/composite.tif

surtgis extract-patches \
    --features-dir features/ --points labels.geojson \
    --label-col landslide_class --profile clay-v1.5 --size 256 \
    --output-format zarr --emit-stac chips/

Cloud masking switches to Sentinel-2 SCL automatically because the collection is L2A, not HLS. No flag needed.

Variation — NPY instead of Zarr

If your downstream loader is np.load()-based, drop --output-format zarr:

surtgis extract-patches \
    --features-dir features/ --points labels.geojson \
    --label-col landslide_class --profile prithvi-v2 --size 224 \
    --emit-stac chips/

You get chips/patches.npy with the same shape and dtype. Pick Zarr when N is large enough that np.load would page-fault your machine.

Validation

After the run, verify the tensor matches what Prithvi expects:

python3 -c "
import zarr, json
X = zarr.open('chips/patches.zarr', mode='r')
print('shape:', X.shape, 'dtype:', X.dtype)
m = json.load(open('chips/meta.json'))
spec = m['gfm_profile']
print('bands:', spec['bands_order'])
print('mean:', spec['band_norm_mean'])
print('std:',  spec['band_norm_std'])
print('source:', spec['source_url'])
"

You should see dtype: float32, the canonical Prithvi band order, and the means around [1087, 1342, 1433, 2734, 1958, 1363].

When things break

  • “Profile expects N bands, but M feature rasters were loaded”: curate features/<timestamp>/ to contain exactly the bands the profile expects, in the expected order. The error message lists them.
  • “Band-name mismatch at timestamp ‘t_2024_02’”: all timestamp subdirs must declare the same bands. Re-run the composite step for the offending timestamp.
  • “Mixed mode: top-level .tif AND subdirs with .tifs”: move the top-level files into a subdirectory. SurtGIS refuses to guess.
  • STAC items have bbox in source CRS instead of WGS84: compile with --features projections (default in precompiled binaries). The warning message tells you when this fallback triggered.

CLI reference

Every top-level surtgis subcommand, with flags and a minimal example. Pages are generated from --help output and regenerated on every CLI-surface change.

Top-level commands

Global flags

Available on every subcommand:

  -v, --verbose                  Verbose output
      --compress                 Compress output GeoTIFFs (deflate)
      --streaming                Force streaming mode for large rasters (auto-detected if >500MB)
      --max-memory <MAX_MEMORY>  Maximum memory to use (e.g., 4G, 1024MB, 500MiB). If raster would exceed this when decompressed, force streaming
  -h, --help                     Print help
  -V, --version                  Print version

Supported formats

Raster formats

FormatReadWriteFeature flagNotes
GeoTIFF(always)Predictor 1/2/3, DEFLATE, LZW, GDAL_NODATA
Cloud Optimized GeoTIFF (COG)cloudHTTP range reads; write not supported yet
Zarr (via zarrs)zarrAzure Blob + Planetary Computer auth
NetCDFnetcdfRequires libnetcdf system library
GRIB2gribRequires libgribapi or eccodes
JP2000, HDF4, ECW, MrSID, ERDAS .imgUse GDAL

Vector formats

FormatReadNotes
GeoJSON (.geojson, .json)
Shapefile (.shp + sidecar files)Via shapefile crate
GeoPackage (.gpkg)Via rusqlite

Vector writing is not supported in v0.7.0. Clip / rasterize accept vectors as input; produce rasters as output.

Floating-point value handling

For float element types (f32, f64), the GDAL_NODATA tag value is replaced with NaN on read. Algorithms treat NaN as “invalid” and propagate it as such — no sentinel values buried inside floats.

On write, the same convention: NaN pixels are written out as GDAL_NODATA values if the output format supports the tag.

For integer types (i16, u16, u8), the original nodata sentinel is preserved.

Endianness and TIFF predictors

The native GeoTIFF reader handles both big-endian and little-endian files. Predictor 1 (no prediction), 2 (horizontal differencing), and 3 (floating-point prediction) are all supported. Predictor 3 was added in v0.6.21 specifically for Copernicus DEM GLO-30 which uses it.

Compression support

Read: DEFLATE, LZW, packbits, none. Write: DEFLATE (via --compress flag) or none.

Write support for LZW is not planned — DEFLATE compresses better on geospatial rasters for comparable CPU cost.

Size limits

The tiff crate enforces a default size limit to prevent malicious inputs. For large DEMs (e.g. 20,000 × 20,000 or bigger), SurtGIS sets Limits::unlimited() internally so uncompressed reads aren’t rejected.

Files on disk can be arbitrarily large; streaming algorithms don’t care about total size.

Environment variables

SurtGIS reads the following environment variables at runtime. Most configuration is via CLI flags; env vars are for settings that are awkward to thread through every command.

SURTGIS_RAM_BUDGET_GB

Consumed by: stac composite

Default: 16

Sets the target peak-RAM budget for the STAC composite pipeline. The command auto-sizes its strip_rows to fit within this budget, accounting for output buffers, per-scene mask cache, active-band scene strips, band working set, and decode overhead.

SURTGIS_RAM_BUDGET_GB=8 surtgis stac composite ...   # conservative
SURTGIS_RAM_BUDGET_GB=24 surtgis stac composite ...  # permissive

Real peak is typically within ±30% of the target. See Tune the RAM budget for sizing guidance.

XDG_CACHE_HOME

Consumed by: stac, cog (when --cache is enabled)

Default: $HOME/.cache (Linux), $HOME/Library/Caches (macOS)

COG tile cache is stored under $XDG_CACHE_HOME/surtgis/cog/. Override to put the cache on a different disk:

export XDG_CACHE_HOME=/mnt/fast-ssd/cache

HOME

Used as a fallback for XDG_CACHE_HOME when the latter is unset. Also consulted for conventional per-user config paths.

CARGO_TERM_COLOR, NO_COLOR, TERM

Standard CLI-colour env vars. SurtGIS honours NO_COLOR=1 to disable ANSI escapes; useful for CI logs.

RUST_LOG

Controls the tracing subscriber’s verbosity when you pass --verbose.

RUST_LOG=debug surtgis --verbose terrain slope dem.tif slope.tif
RUST_LOG=surtgis_cloud=trace surtgis --verbose stac composite ...

Module-scoped levels (surtgis_cloud=trace) are useful for debugging cloud read issues without drowning in unrelated debug output.

Variables SurtGIS does not consume

Common ones we explicitly don’t use:

  • GDAL_DATA, PROJ_LIB — SurtGIS’s reprojection uses proj4rs which embeds its own grid data, not the system GDAL/PROJ installation.
  • AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY — not used for cop-dem-glo-30 via s3:// (the bucket is public). For private S3 buckets, authenticated access is not yet supported.
  • PC_SDK_SUBSCRIPTION_KEY — Planetary Computer’s private-API key. Not consumed; SurtGIS uses the public anonymous endpoint.

Architecture

This page explains why SurtGIS is structured the way it is. If you want to learn what flags to pass, you’re in the wrong section — try CLI reference. If you want to know why --streaming exists at all, keep reading.

Workspace layout

SurtGIS is a Cargo workspace of ten crates:

surtgis/
├── crates/core           # Raster<T>, GeoTransform, CRS, GeoTIFF I/O, streaming, mosaic, vector
├── crates/algorithms     # 250+ public functions across 10 modules
├── crates/parallel       # Rayon-based strategies, maybe_rayon compile-time switch
├── crates/cli            # 19 top-level subcommands, binary `surtgis`
├── crates/cloud          # COG reader, STAC client, bbox reprojection, s3:// normalisation
├── crates/wasm           # WebAssembly bindings (~17% of algorithms currently)
├── crates/python         # PyO3 bindings: extract_at_points, predict_raster
├── crates/gui            # egui desktop app (experimental)
└── crates/colormap       # Color schemes and raster → PNG rendering

The split exists because the dependency graphs differ wildly. surtgis-core has zero non-trivial deps beyond tiff and ndarray; surtgis-cloud pulls in reqwest + tokio for HTTP; surtgis-python needs PyO3 at build time; surtgis-gui is egui and eframe. Jamming them into one crate would force every user — including someone who just wants Raster<f32> — to compile a GUI framework they don’t need.

Concretely: embedding SurtGIS in a Rust program that only needs the algorithm library pulls two crates (surtgis-core, surtgis-algorithms). Adding STAC support adds a third. Nothing more.

The Raster<T> core

Everything starts from one type:

pub struct Raster<T> {
    data: ndarray::Array2<T>,
    transform: GeoTransform,
    crs: Option<CRS>,
    nodata: Option<f64>,
}

Generic over the element type (f32, f64, i16, u8, u16), with an affine GeoTransform (origin + pixel size + optional rotation) and an optional CRS. Nodata is stored separately from the data array, and the convention across the library is: for float element types, nodata values are replaced by NaN on read, and NaN is the only signal algorithms check for invalid data. No sentinel values buried inside floats.

This is deliberately more restrictive than GDAL’s model, which supports arbitrary block organisations, band counts, and offsets per band. SurtGIS treats a multi-band raster as a Vec<Raster<T>> — one per band — and forces each band through its own file on disk. The cost: you pay one HTTP round-trip per band when reading from COGs. The benefit: the entire library’s worth of algorithms doesn’t have to special-case multi-band layouts.

Native I/O via the tiff crate

A deliberate non-decision: SurtGIS does not depend on GDAL by default. It reads and writes GeoTIFFs using the pure-Rust tiff crate plus a thin layer in surtgis-core::io::native that handles the geospatial tags (ModelTiepointTag, ModelPixelScaleTag, GeoKeyDirectory, GDAL_NODATA).

Trade-offs of this choice:

Wins:

  • Single binary. Ship as precompiled. No libgdal.so version mismatch hell.
  • Works on WASM. 56 of the 127 algorithms compile to wasm32-unknown-unknown and run in any modern browser without a server. This includes the full Florinsky 14-curvature system, D8/MFD/D-infinity hydrology, 14 spectral indices, the focal-stats suite, and basic morphology. The browser path is unique to SurtGIS among comprehensive terrain libraries.
  • Cloud-optimised reads. The COG reader in surtgis-cloud issues HTTP range requests directly against the TIFF byte layout; no GDAL VSI layer in between.

Losses:

  • Exotic formats: HDF4, JP2, ERDAS .img, etc. GDAL reads ~150 formats; SurtGIS natively reads GeoTIFF + (optionally) Zarr + NetCDF + GRIB2. If you need Pléiades JP2000s in, GDAL is the answer.
  • TIFF edge cases that GDAL’s robustness has absorbed over two decades: odd compression combos, non-standard predictor tags, malformed GeoKey directories. SurtGIS handles the common cases (predictor 1/2/3, DEFLATE, LZW, big/little endian, tiled and stripped), but not every ancient file your grandfather’s theodolite produced.

A gdal feature flag exists as an escape hatch — cargo install surtgis --features gdal — which falls back to the GDAL crate for I/O when the native path can’t handle something. Most users never need it.

Streaming: strip processing for arbitrarily large rasters

The StripProcessor trait in surtgis-core::streaming is where SurtGIS handles DEMs larger than RAM. About ten window-based algorithms (Horn slope, aspect, hillshade, curvature, Gaussian smoothing, Laplacian, and a few more) are written against this trait rather than the in-memory Raster<T>.

A strip is a horizontal band of rows read into memory together with enough overlap above and below (the halo) to compute window-based operations without cross-strip coordination. For a 3×3 Horn slope, halo = 1; for Gaussian sigma=5, halo = ceil(3σ) ≈ 15.

+--- DEM on disk ---+
|                   |
|  strip 1 [halo]   |   ← read N rows + halo below
|  strip 1 [data]   |   ← compute, write
|  strip 1 [halo]   |
|                   |
|  strip 2 [halo]   |   ← reuse bottom halo of strip 1
|  strip 2 [data]   |
|  strip 2 [halo]   |
|  ...              |
+-------------------+

Peak RAM = 2 × halo × cols × sizeof(T) plus one strip’s worth of output. For a 100,000 × 100,000 DEM at f32 with Horn slope, that’s roughly 0.8 MB working set regardless of total file size.

Automatic detection: if a raster on disk would exceed --max-memory (or 500 MB by default) when decompressed, algorithms that support streaming switch automatically. --streaming forces the path; --max-memory 16G raises the threshold; passing --streaming for an algorithm that doesn’t support it is a hard error rather than silent fallback.

The list of streamable algorithms is small on purpose — adding one requires writing the algorithm against the halo-strip iterator pattern, which is harder than writing against a fully-resident array. We only absorb that cost for algorithms people actually want to run on huge DEMs.

Parallelism: maybe_rayon

The surtgis-parallel crate provides a compile-time switch:

#[cfg(feature = "parallel")]
use rayon::prelude::*;
#[cfg(not(feature = "parallel"))]
use std::iter as rayon_fallback;

With the parallel default feature, algorithm loops use par_iter / par_bridge. Without it (useful for WASM, where threading is a separate setup), the same code compiles to sequential iterators. One source, two targets.

The WASM build is currently serial because browser thread pools require SharedArrayBuffer and CORS headers that most hosts don’t provide by default. Adding WASM threading is tracked but not a priority — the per-tile work in a typical browser use case (small extents) doesn’t benefit enough to justify the deployment complexity.

STAC + COG: the cloud read path

surtgis-cloud handles the cloud-native read path, which is structurally different from “download the whole file then process it”:

  1. STAC search returns a list of items, each with asset hrefs (HTTPS or s3://) pointing at COG files on blob storage.
  2. CogReader::open(href) fetches the first ~64 KiB — enough to parse the TIFF header, IFD chain, and GeoKey directory — without reading tile data.
  3. reader.read_bbox(bbox) computes which internal tiles of the COG intersect the requested bbox, fetches exactly those tiles via HTTP range requests, decompresses them in parallel, and returns a Raster<T> of the intersection.

The key insight: for a 100 MB COG where your bbox only covers 1% of the image, you transfer ~1 MB and decompress ~1 MB. The stac composite pipeline composes this with strip processing so even very wide STAC extents have bounded peak memory.

See STAC integration philosophy for why the design chose this path over the alternatives (download-then-process, or GDAL-VSI-based access).

Why workspaces pin internal versions loosely

The ten crates share a version via [workspace.package] and reference each other with version = "0.7" (caret range) plus path = "../core". Local builds always use the path; crates.io resolution uses the version. This means:

  • Bumping any one crate breaks the build unless they all bump together.
  • Cutting a release = one version bump in the root Cargo.toml, one cargo publish per crate in dependency order.
  • cargo update picks up patch releases automatically; a minor bump (v0.7 → v0.8) requires consumers to update their dependency declaration.

This is heavier than publishing one monolithic crate, but it lets someone embed surtgis-core in a low-footprint context without dragging in the CLI or the cloud stack.

Memory model

SurtGIS has two distinct modes for handling raster data, and understanding when each applies is the difference between a pipeline that finishes and one that OOMs.

In-memory mode

The default. A raster is read entirely into an ndarray::Array2<T>, algorithms operate on the array in place, and the result is written back to disk. This is what every command does unless a streaming condition is met.

Peak RAM ≈ input size + output size, both fully decompressed. For a 20,000 × 20,000 f32 DEM that’s 20k² × 4 bytes = 1.6 GB twice, so about 3.2 GB working set. Most commands fit comfortably in modern host RAM.

Streaming mode (strip processing)

For about 10 window-based algorithms (Horn slope, aspect, hillshade, curvature, Gaussian smoothing, Laplacian, fill-sinks, flow direction, and a few more), SurtGIS can process the DEM in horizontal strips with configurable overlap, reading and writing one strip at a time.

Peak RAM is independent of total raster size: roughly 2 × halo × cols × sizeof(T) plus one strip of output. For a 100,000 × 100,000 f32 DEM with Horn slope (halo = 1), that’s about 800 KB of data actively in memory at any moment, not 40 GB.

Triggered automatically when the raster on disk would exceed --max-memory (default 500 MB). Force explicitly with --streaming.

STAC composite mode

The stac composite pipeline is neither: it operates on a stream of scenes fetched from STAC, mosaics them into per-strip outputs, and writes results per band. Peak RAM is bounded by a 5-component model documented in the command’s startup log line and discussed in the debug-stac-ram how-to.

Practical decision tree

Am I running a window-based algorithm on a DEM?
├── Yes, raster fits in RAM → in-memory mode, no action needed
├── Yes, raster exceeds --max-memory → streaming kicks in automatically
└── No, I'm running STAC composite → 5-component budget model applies,
                                      see --band-chunk-size and
                                      SURTGIS_RAM_BUDGET_GB

Which algorithms support streaming?

Roughly the ones where the operation is definable in a bounded window: terrain slope, terrain aspect, terrain hillshade, terrain curvature, terrain tpi, terrain tri, terrain gaussian-smoothing, hydrology fill, hydrology flow-direction, morphology dilation, morphology erosion.

Algorithms that need global information (flow accumulation across an entire drainage network, watershed delineation from any-point pour, mosaic of unrelated rasters) don’t fit the strip pattern and run in-memory. For huge DEMs these would need a different approach (Dask-style tile graphs, or out-of-core sort/scan algorithms). Not currently implemented.

When to use SurtGIS vs alternatives

SurtGIS is a newer, narrower tool than the incumbents in this space. This page is the honest “should I bother?” guide. It will not tell you SurtGIS is better at everything — it isn’t.

The incumbents

ToolWhat it’s great at
GDALI/O breadth (~150 formats), raster transforms, the reference implementation of nearly every GIS operation
GRASS GISHydrology, mature topological algorithms, serious spatial analysis workflows with long history
WhiteboxToolsHydrology specifically, very clean algorithm implementations, good defaults
SAGA GISTerrain analysis, geomorphometry, wide algorithm selection with a visual UI
QGISUser interface, visualisation, integrating all of the above

All four are freely available, battle-tested, and vastly more mature than SurtGIS. If one of them already solves your problem, use it.

Where SurtGIS is a genuine win

You want a single-binary CLI with no system dependencies

No libgdal.so version matching. No Python environment. Works in Docker FROM scratch. Works in CI runners without privileged installs. This matters for:

  • Reproducible benchmarks.
  • Embedding geospatial processing into systems that can’t afford a package manager (industrial control, embedded, browser via WASM).
  • Distributing analysis scripts that a non-GIS user can actually run. The friction from “install Anaconda then conda-forge GDAL then pip pyproj” to “download this tarball” is substantial.

You’re processing rasters larger than your RAM

The streaming strip processor (see Memory model) runs slope, aspect, hillshade, curvature, fill-sinks, flow direction, Gaussian smoothing, and ~10 others with bounded RAM regardless of input size. GDAL and WhiteboxTools can also do this, but typically require explicit block processing setup. SurtGIS does it automatically when --max-memory would be exceeded.

You want end-to-end STAC workflows with no glue code

surtgis stac composite --catalog pc --asset red,nir,swir16,swir22 ... searches Planetary Computer, downloads only the tiles intersecting your bbox via HTTP range requests, cloud-masks with SCL, and median-composites across dates. One command. Peer tools (rasterio + pystac-client + rioxarray + dask) can match this but typically require 30+ lines of Python glue and an understanding of Dask.

You need terrain + hydrology on the same binary

Many projects combine DEM preprocessing with spectral analysis. Mixing WhiteboxTools for hydrology, GDAL for I/O, and a Python notebook for NDVI is normal but operationally painful. SurtGIS covers all three in one tool, which matters most for scripts that run daily and need to be debuggable years later.

You want honest, reproducible benchmarks against GDAL/GRASS

The paper cross-validates SurtGIS outputs against GDAL 3.11, GRASS 8.3, and WhiteboxTools on identical DEMs, and publishes the tolerances as tables. RMSE vs GDAL for slope is 1.6×10⁻⁴ degrees. That level of validation is a design choice and is part of the CI today.

Where the incumbents beat SurtGIS

You need a format SurtGIS doesn’t read natively

Netcdf, GRIB2, and Zarr are supported but gated behind feature flags that require system libraries on some platforms. HDF4, JP2, ECW, MrSID, ERDAS .img: not supported at all. Reach for GDAL.

You need mature, edge-case-hardened algorithms

GRASS’s hydrology has been tuned for 25 years against flat terrain, coastal areas, pit-fill strategies, and every weird DEM edge case you’ve ever seen. SurtGIS implements the standard algorithms (D8, D∞, priority-flood fill, watershed-from-pour-points) and validates them against GRASS on common cases, but any production pipeline that hits edge cases will hit them first on SurtGIS, because GRASS has already been debugged on them.

You need a GUI

surtgis-gui exists but is experimental, has no real user base, and is not the project’s priority. QGIS is a dramatically better interactive environment. SurtGIS integrates cleanly as a QGIS algorithm provider if you want the CLI-backed pipeline with the QGIS UX.

You need Python interoperability beyond a handful of functions

surtgis-python exposes extract_at_points and predict_raster for ML pipelines. That’s it. If you’re building on numpy/xarray/rasterio and want the full toolbox, rasterio + rioxarray + pyproj + scikit-image is the mature stack. SurtGIS is for Rust-first users who occasionally want to bridge to Python.

You need a wider algorithm surface than 250 functions

That’s a lot, but GRASS has ~500 modules and GDAL has ~100 high-level operations on top of its I/O. If your workflow uses obscure morphological filters, multi-criteria decision analysis, or mature geostatistical tools beyond IDW/kriging/natural-neighbour, the incumbents have more.

A decision rule

A blunt three-question filter:

  1. Is your data in GeoTIFF (+ maybe Zarr/NetCDF/GRIB2 with feature flags)? If not, stop here and use GDAL.
  2. Do you want your pipeline to be one binary that installs in 10 seconds and runs anywhere? If not, GRASS + WhiteboxTools give you more at the cost of install complexity.
  3. Are you writing a Rust program that needs geospatial primitives embedded? Here SurtGIS has no real competition — gdal-rust bindings exist but the FFI surface and build requirements are hostile.

If you answer yes–yes–maybe, SurtGIS is a strong fit. Otherwise you’re probably better served by the incumbents, and we’d rather you use them and come back to SurtGIS when a concrete gap bites.

Where we’re honestly worse right now

  • Docs. This book is the first attempt at a real user guide; the incumbents have decades of accumulated tutorials, stackoverflow answers, and textbook chapters.
  • Community. No Slack / mailing list / forum. One postdoc in production; no GitHub stars to speak of.
  • Mature error messages. GDAL has tuned its error text over two decades; SurtGIS still emits some Rust-ish errors that won’t mean much to a GIS user.
  • Edge-case robustness. Every weird TIFF file in the world has been thrown at GDAL. SurtGIS has been thrown at a few thousand.

These are not bugs; they are the cost of being young. If you hit one, please report it — that’s the only way the list gets shorter.

STAC integration philosophy

The surtgis stac command group is designed around one principle: fetch the minimum bytes you need, compose them into the output the user actually asked for, in one command.

This rules out the two obvious alternatives:

  1. Download the whole file, then process. A full Sentinel-2 MGRS tile is ~700 MB. Over a year that’s hundreds of GB of transfer for a workflow that might only care about a 5 km × 5 km bbox. The COG format exists specifically so we don’t have to do this.

  2. Use GDAL’s VSI layer. GDAL can read COGs over HTTP (/vsicurl/https://...). This works, but puts GDAL between SurtGIS and the byte layout of the file, which means any bug or performance quirk in GDAL’s VSI cache shows up as mysterious SurtGIS behaviour. Keeping the COG reader in surtgis-cloud means when something is wrong we can debug it, not just report it to GDAL upstream.

What the pipeline does

surtgis stac composite walks this chain:

  1. Search. Hit the STAC API for items matching bbox + datetime + collection. Page through results, honouring per-catalog page-size limits (PC=1000, ES=250).
  2. Rank and select. From the candidates, pick --max-scenes by coverage-aware ranking.
  3. Per scene: resolve asset URLs. Each band and the cloud mask become a signed or unsigned HTTPS URL (or s3:// for Earth Search — which SurtGIS rewrites to https://bucket.s3.amazonaws.com/key internally).
  4. Per strip of output: fetch the tiles that intersect. COG tiles are typically 512² or 1024² pixels. For a strip of output, the reader computes which internal tiles to pull, issues parallel HTTP range requests, decompresses.
  5. Mosaic within scene. If the bbox crosses UTM zones, the scene’s tiles may be in different CRSes; surtgis-cloud::reproject aligns them before surtgis_core::mosaic stitches.
  6. Mask clouds. Apply the scene’s SCL (Sentinel-2) or QA_PIXEL (Landsat) to nullify cloudy pixels.
  7. Median across scenes. For each output pixel, the median of the non-null values across all selected scenes.
  8. Temporal + spatial fill. If a pixel is still null after the median (every scene was cloudy there), try temporal fallback from the least-cloudy scene, then 3×3 spatial mean.

Why median and not mean

Median is robust to cloud-mask leakage. A single bright cloud pixel that slipped past SCL distorts a mean significantly but affects a median of 3+ values almost not at all. Cost: slightly noisier output on very sparse scene counts (N=3 gives a noisier median than N=20).

Why per-strip processing

The composite for a 100 km × 100 km extent at 10 m resolution is 10,000 × 10,000 × n_bands × 8 bytes = 8 GB for 10 bands. Holding that plus all the per-scene intermediate arrays in RAM blows any host. Strip processing bounds the peak regardless of extent size.

The trade-off was learned expensively over v0.6.19 through v0.6.26 — see the post-mortem for the full story.

What’s intentionally not supported

  • Arbitrary temporal aggregations. Median only. Mean, max, percentile are easy to add if someone asks; nobody has yet.
  • Sub-pixel reprojection. Bilinear resampling is the only option. If you want cubic or Lanczos for a composite, reach for GDAL’s gdalwarp after the fact.
  • Streaming output. The output band rasters are written whole at the end, not incrementally. For extents where the output itself exceeds RAM (very rare in practice, requires >> 100 GB bbox coverage), you need a different tool.

Tectonic geomorphology with the fluvial module

The five algorithms in surtgis fluvial answer different questions about a river network. This chapter explains which algorithm to reach for when, gives the briefest necessary geomorphological context for a developer who doesn’t have it, and documents the parameter tuning that the spec calls out as easy to get wrong.

For the runnable end-to-end sequence, see Fluvial-tectonic morphometry from a DEM. For the formal references, see the spec at docs/SPEC_morfometria_fluvial_tectonica.md in the repo.

The framework in one paragraph

The stream-power law (Howard 1994, Whipple & Tucker 1999) postulates that a river’s incision rate scales as E = K · A^m · S^n with A the drainage area, S the local slope, and K an erodibility constant determined by lithology and climate. In topographic steady state — incision balanced by tectonic uplift — this inverts to a slope-area scaling S = ks · A^(-θ) with θ = m/n. Fixing θ at a reference value (0.45 is the convention for bedrock channels) and solving for ks gives the normalised channel steepness ksn, which is comparable across basins. Departures from this clean picture — knickpoints, anomalous concavity, divide asymmetry — flag transient landscape response or non-uniform erodibility.

That’s the framework. The five algorithms each measure a different facet of it.

When to use what

QuestionAlgorithmOutput
“Is this basin in steady state? Where’s the base level?”chiPer-cell χ raster
“Which channels are climbing fastest? Where’s the high U/K?”ksnPer-cell raster + per-segment vector
“Are there transient pulses propagating upstream?”knickpointsVector points with polarity
“Is θ for this basin actually 0.45?”concavityPer-basin θ with bootstrap CI
“Are basin boundaries migrating?”divide-migrationVector lines with Δχ

In the order they’re typically computed: chi → ksn → knickpoints in parallel; concavity once basins are defined; divide-migration last because it consumes the χ raster from the first step.

chi (χ)

The path integral of (A₀/A(x))^θref along the channel from a base level upward. Geometric: it’s a base-level-anchored reference distance that climbs faster where drainage area is small (i.e. headwaters “count more”). Plotted as elevation vs χ, a steady-state profile is a straight line whose slope equals ksn.

When χ alone is enough: discriminating divide asymmetry, comparing sub-basins, checking whether the network has reached topographic steady state (look for the elevation~χ curve being roughly linear).

Key parameter: --theta-ref (default 0.45). Almost never worth changing for inter-basin comparison; the literature standard exists because comparing across studies requires it. If you want a basin-specific θ, run concavity instead.

ksn (channel steepness)

ksn = S · A^θref per cell, smoothed along the channel over a window (default 500 m). High ksn = either high uplift or resistant bedrock or both.

When ksn is the right primary metric: you have a basin or a region and want to map relative U/K spatially. ksn is the workhorse — every tectonic-geomorphology paper has a ksn map.

Tuning that bites:

  • --segment-length-m (default 500 m): the smoothing window. With a 10–30 m DEM, 500 m is the literature standard. With a 90 m DEM, bump to 1000 m or accept noise.
  • --min-drainage-area-m2 (default 1 km²): cells below this give unstable ksn (small numerator, noisy slope). Don’t lower without a reason.
  • --theta-ref: must match what you’d compare against in the literature. Don’t tune this per dataset unless you’re explicit.

ksn requires channel-following slope, not the 2-D terrain slope output of surtgis terrain slope. The fluvial module computes the correct version internally; if you find yourself reaching for terrain slope to feed an analysis, stop — you’ve gone off the geomorphology rails.

knickpoints

Sharp breaks in the slope of a river’s long profile. A knickpoint can be a transient wave-form response to a tectonic perturbation propagating upstream, or a stationary feature anchored to a lithologic contrast. SurtGIS classifies each detection by polarity:

  • Convex (slope increases downstream) — most often transient, flag for tectonic interpretation.
  • Concave (slope decreases downstream) — most often lithology-pinned, expect a corresponding contact on the geology map.

The method (Neely et al. 2017) total-variation-denoises the χ-z profile, computes the discrete second derivative d²z/dχ², and flags cells where the magnitude exceeds a threshold AND the elevation drop across a small window exceeds min_magnitude_m. Cells within 5 of a segment end are excluded (default) because confluences and outlets induce spurious curvature.

Tuning that bites:

  • --tvd-lambda (default 0.5): smoothing strength. Too low → every noise spike is a knickpoint. Too high → real knickpoints smoothed away. Default tuned for 10–30 m DEMs.
  • --min-magnitude-m (default 10): the elevation-drop guard. Raise to 30+ if you’re seeing dozens of false positives in pristine headwaters; the real knickpoints are usually >> 30 m.
  • --curvature-threshold (default 1.0): controls how sharp a break has to be. Most users never touch this; tune tvd_lambda and min_magnitude_m first.

concavity (θ per basin)

Grid-searches θ over [0.1, 0.9] for each basin, picking the value that linearises the elevation~χ regression best (minimum RMSE), then bootstraps for a 95 % CI.

When: you want to test whether assuming θ = 0.45 is reasonable for your basins, OR you want θ per basin as its own measurement (some geomorphologists treat it as a tectonic signal independent of ksn).

What the values mean:

  • θ ∈ [0.4, 0.6]: textbook bedrock channel in steady state. Boring, confirmatory.
  • θ < 0.3 or > 0.7: interesting. Could indicate transient response (Mudd et al. 2018), non-uniform erodibility, or a non-uniform uplift pattern. Investigate.
  • Bootstrap CI that spans 0.2 or more: the basin has too few channels or too much scatter to pin down θ. Don’t over-interpret.

Tuning:

  • --bootstrap-n (default 200): more iterations = tighter CI, more compute. 200 is enough for most basins; 1000 if you need publication-grade error bars.
  • --min-basin-cells (default 30): below this the estimator is unreliable.

divide-migration

For every pair of adjacent basins, compute the median asymmetry across their shared divide: Δχ (Willett 2014), Δelev (Gilbert metric), Δrelief.

When: you suspect basin geometries are not stable. Active tectonic uplift, base-level fall, or differential erosion all push divides toward the “losing” basin. A sign in Δχ across a long divide is the canonical detection.

Interpretation:

  • Δχ ≈ 0 along the divide: basins in (chi) equilibrium, no migration signal.
  • Consistent Δχ > 0 with basin_a (smaller numeric ID by convention) on the higher-elevation side: basin_a is being “captured” by basin_b. The basin with lower χ at the divide is winning.
  • The Gilbert Δelev / Δrelief metric is the older indicator — useful as cross-check but Δχ is the modern primary.

Tuning:

  • --min-divide-length-m (default 500): suppress divides too short to be statistically meaningful. Raise to 2000+ for regional studies.
  • --chi (optional but recommended): provide the χ raster from surtgis fluvial chi. Without it, you only get the Gilbert metrics — usable but weaker.

Pitfalls the spec calls out

These bit the spec author often enough that they’re in the spec’s §8. They will bite you too.

  1. Cell size > 30 m: ksn and knickpoints are sensitive to DEM resolution. With FabDEM at 25 m you’re fine; with SRTM at 90 m you’ll get noisier results that bias toward false positives. --cell-size-m N overrides the auto-detection; the algorithms warn when cell > 30 m.

  2. Stream threshold matters: too low and convergent noise from zero-order hollows contaminates the analysis; too high and you miss real tributaries. 1000 cells at 30 m (≈ 0.9 km²) is a sensible default; for arid regions go higher, for humid regions lower.

  3. Short tributaries are unreliable: by default the algorithms filter cells with drainage area below 1 km² or segments with fewer than ~20 cells. Don’t relax these without a reason.

  4. NoData propagates upstream: a single missing cell in the DEM knocks out χ for everything upstream. Fill or interpolate carefully.

  5. Boundary outlets: outlets at the raster edge get partial trajectories. The algorithms flag and exclude them from medians; if your AOI is dominated by boundary outlets, expand the bbox.

  6. θ may not be unique: the RMSE landscape can be flat. The bootstrap CI is honest about this — read it.

  7. TVD λ at the wrong scale: see the knickpoints tuning section above.

  8. CRS must be projected: every fluvial algorithm assumes metres. EPSG:4326 inputs are rejected with a clear error; reproject to UTM (or any local projected CRS) first.

  9. Confluence noise in knickpoints: confluences cause sudden area discontinuities that look like knickpoints to the curvature detector. The default 5-cell buffer at segment ends filters this; widen it if you still see clustering at junctions.

When SurtGIS isn’t the right tool

For interactive stream-network pruning (drawing on a map, removing specific channels), use TopoToolbox 2 (MATLAB) or pyTopoToolbox. SurtGIS is batch-only; the interactive use case isn’t its sweet spot.

For landscape evolution modelling (TTLEM / Landlab), likewise — those are a different category of tool and out of scope for the spec.

For prospectivity mapping or other geological-AI workflows, the SurtGIS GFM preprocessing pipeline is the right entry point; fluvial outputs can be inputs to it but shouldn’t be its terminus.

References

  • Howard (1994) WRR 30(7) — detachment-limited stream-power law
  • Whipple & Tucker (1999) JGR — stream-power model dynamics
  • Wobus et al. (2006) GSA SP 398 — ksn operational paper
  • Perron & Royden (2013) ESPL 38, 570 — χ-transform fundacional
  • Willett et al. (2014) Science 343, 1248765 — divide reorganisation
  • Neely et al. (2017) EPSL 469 — knickpoint detection method
  • Whipple, Forte et al. (2017) JGR-Earth Surface 122 — divide migration
  • Mudd et al. (2018) ESurf 6 — variable θ as signal
  • Schwanghart & Scherler (2014) ESurf 2 — TopoToolbox 2 (the reference implementation in MATLAB; SurtGIS targets parity with its main outputs)