SurtGIS
High-performance geospatial analysis in Rust. A single CLI binary and a set of Rust crates that compute 127 terrain, hydrology, imagery, morphology, interpolation, and classification algorithms — with native GeoTIFF I/O, streaming for rasters larger than RAM, native CRS reprojection, and an end-to-end STAC composite pipeline for Planetary Computer and Earth Search. 56 of the 127 algorithms also compile to WebAssembly and run in any modern browser without a server, including the complete Florinsky 14-curvature morphometric system.
This book is the user guide. For the auto-generated API reference (for embedding SurtGIS as a library in Rust code), see docs.rs/surtgis-core.
What this book is for
If you’re looking for:
- “Does this fit my workflow?” → start with Installation then the First terrain analysis tutorial. 30 minutes and you’ll know.
- “How do I do X?” → How-to guides are task-focused and short. Pick the one that matches your problem.
- “What flags does
surtgis terrain slopetake?” → CLI reference. - “Why does SurtGIS do it this way?” → Explanation covers architecture, memory model, and side-by-side comparisons with GDAL, GRASS, and WhiteboxTools.
The four sections don’t overlap by design. Tutorials teach. How-to guides solve. Reference is authoritative. Explanation reasons. This is the Diátaxis framework.
Design principles
-
Native I/O, no GDAL required. GeoTIFF read/write including predictor 1/2/3, DEFLATE, GDAL_NODATA, and Cloud Optimized (COG) over HTTP range requests. GDAL is available as an optional feature.
-
Streaming over in-memory when data grows. Algorithms that fit the strip-processing pattern (Horn slope, Gaussian smoothing, flow direction, fill-sinks, and about ten more) run with bounded RAM regardless of raster size. Detection is automatic; a
--streamingflag forces the path when you want to be sure. -
One binary per platform. The CLI is statically-linked where possible. The default feature set for the precompiled binary covers STAC + COG + climate data (Zarr) + UTM reprojection. System-library dependencies (libnetcdf, HDF5) are opt-in via
--all-featureswhen building from source. -
Validated against the canonical alternatives. Slope, aspect, hillshade, flow accumulation, and watershed outputs agree with GDAL 3.11, GRASS 8.3, and WhiteboxTools within documented tolerances. The cross-validation numbers live in the paper (Environmental Modelling & Software, under review).
-
Honest about tradeoffs. SurtGIS is not faster than everything at everything — see When to use SurtGIS vs alternatives for where the Rust reimplementation wins (I/O pipelines, parallel terrain factors, streaming) and where you’re better off reaching for GDAL or GRASS.
Current status
As of v0.8.1, SurtGIS has been used end-to-end on a 15-watershed Chilean
landslide-susceptibility pipeline (Sentinel-2 composites, 41 M-cell areas,
60 labelled wells). The accompanying paper is in major-revision at
Environmental Modelling & Software. Recent releases focused on:
v0.7.5 fixed a critical decoder bug for Planetary Computer COGs with
BitsPerSample=15 packed encoding; v0.7.4 added the native reproject
command, removing the last dependency on gdalwarp; v0.8.x roughly
doubled the WebAssembly surface, from 33 to 56 algorithms.
No external users are known yet — feedback from people outside the original developer/postdoc loop is exactly what would take this from “works for one expert” to “works for a community”. If that’s you, please open an issue or try the live demo — it runs the 56 WebAssembly algorithms client-side, so your DEM never leaves your machine.
Installation
Three paths, in order of increasing effort.
1. Precompiled binaries (recommended for evaluation)
Download the archive for your platform from the GitHub releases page:
| Platform | Archive |
|---|---|
| Linux x86_64 | surtgis-v<version>-x86_64-unknown-linux-gnu.tar.gz |
| macOS arm64 (Apple Silicon) | surtgis-v<version>-aarch64-apple-darwin.tar.gz |
| Windows x86_64 | surtgis-v<version>-x86_64-pc-windows-msvc.zip |
Unpack and put surtgis (or surtgis.exe) somewhere on your PATH:
# Linux / macOS
tar xzf surtgis-*.tar.gz
sudo mv surtgis /usr/local/bin/
surtgis --version
The precompiled binary has the feature set cloud,zarr,projections:
STAC, COG, climate-data Zarr readers, and UTM reprojection. This covers
every tutorial and every how-to in this book.
2. cargo install (for the full feature set)
If you have the Rust toolchain and want netcdf / grib support on top,
install from crates.io:
cargo install surtgis --all-features
System libraries required for --all-features:
| Feature | Linux | macOS | Windows |
|---|---|---|---|
netcdf | libnetcdf-dev | brew install netcdf | not supported |
grib | libgribapi or eccodes | brew install eccodes | not supported |
If you only need the same feature set as the precompiled binary:
cargo install surtgis
3. From source (for contributors)
git clone https://github.com/franciscoparrao/surtgis
cd surtgis
cargo build --release -p surtgis --bin surtgis
./target/release/surtgis --version
The workspace contains ten crates. You only need the surtgis package for
the CLI; the library crates (surtgis-core, surtgis-algorithms,
surtgis-cloud, etc.) are separately publishable for embedding.
Verifying the install
Any of these should return sensible output:
surtgis --version # surtgis 0.7.0
surtgis --help # top-level command list
surtgis terrain --help # terrain subcommand list
If the binary runs but a specific command fails with “feature not enabled”,
you’re on the precompiled binary and the command needs netcdf / grib /
gdal. Rebuild from source with cargo install surtgis --all-features.
Next step
Head to the first tutorial for an end-to-end walk-through using real data from the Copernicus Digital Elevation Model.
First terrain analysis from a Copernicus DEM
This tutorial walks you end-to-end through a realistic terrain analysis: you will download a real digital elevation model from a public STAC catalog, reproject it to a metric coordinate system, compute slope, aspect, and hillshade, clip the result to a smaller area, and understand what the outputs actually mean. It takes about 30 minutes.
By the end you will have:
- A 30-metre DEM of a ~11 km × 5 km patch of the Chilean Andes on disk.
- A set of terrain-factor rasters derived from it.
- A clear mental model of SurtGIS’s command flow: source → transform → derive → consume.
Prerequisites: SurtGIS installed (see Installation). Network access for the STAC download step (about 20 MB transferred).
1. Set up a working directory
mkdir -p ~/surtgis-tutorial
cd ~/surtgis-tutorial
All subsequent commands expect you to be in this directory.
2. Download the DEM from STAC
We’ll pull a patch of the Copernicus GLO-30 DEM (30-metre global coverage) from Microsoft’s Planetary Computer:
surtgis stac fetch-mosaic \
--catalog pc \
--collection cop-dem-glo-30 \
--bbox=-71.0,-35.1,-70.9,-35.05 \
dem_wgs84.tif
What each flag does:
--catalog pc— use Planetary Computer. The other supported catalogs arees(Earth Search) and any full STAC API URL. Forcop-dem-glo-30, PC is the cleaner source (Earth Search returns multiple overlapping tiles that slow the download).--collection cop-dem-glo-30— the STAC collection ID. PC’s browser atplanetarycomputer.microsoft.com/cataloglists every available.--bbox=-71.0,-35.1,-70.9,-35.05— west, south, east, north in degrees. Note the=syntax: with a negative number as the first value, the shell would otherwise interpret-71.0as a flag.dem_wgs84.tif— output filename. The_wgs84suffix is a note to future-us: this DEM is in EPSG:4326 (latitude/longitude). We’ll reproject in a moment.
The command should take ~10 seconds. Verify:
surtgis info dem_wgs84.tif
You should see something close to:
Dimensions: 360 x 182 (65520 cells)
Cell size: 0.0002777777777777778
Bounds: (-71.000000, -35.100278) - (-70.900000, -35.049722)
CRS: EPSG:4326
Statistics:
Min: 400.0
Max: 1380.0
Mean: ~800
Valid cells: 65520 (100.0%)
Elevations between 400 m and 1380 m tell us this patch is at a foothill level of the Andes, not in the high peaks. That’s realistic for the latitude (35°S, near Curicó).
3. Why we have to reproject before computing slope
Here is a trap almost every newcomer falls into, so we’re going to walk straight into it on purpose.
Try computing slope on this DEM directly:
surtgis terrain slope dem_wgs84.tif slope_wgs84.tif
surtgis info slope_wgs84.tif
Look at the statistics:
Statistics:
Min: 89.8
Max: 90.0
Mean: ~89.99
Every pixel is saying “89.99 degrees of slope” — essentially vertical, everywhere. This is wrong; the Andes foothills are not a sheer cliff.
Why this happens. Slope is atan(rise / run). The run in our DEM is
measured in degrees of longitude (about 0.000278° per pixel), but the
rise is measured in metres. When you divide metres by degrees, the ratio
is on the order of hundreds or thousands, so the atan saturates to 90°.
The fix is to reproject the DEM into a coordinate system where both axes
are metres. For Chile at 35°S, that’s UTM zone 19 South (EPSG:32719).
SurtGIS doesn’t have a dedicated reproject command yet — but stac fetch-mosaic accepts --align-to which can reproject-on-read, and the
resample command is the general-purpose tool for grid alignment. For
this tutorial we’ll take a simpler route: ask STAC for the DEM in the right
projection in one shot.
4. Reproject via resample to a reference grid
If you have a reference raster in the target CRS, resample is the
cleanest path:
# (skip this if you don't have an existing UTM raster — see the next
# section for the alternate path)
surtgis resample dem_wgs84.tif dem_utm.tif --reference reference_utm.tif
Most first-time users don’t have a reference. For a standalone reprojection you’ll want the more direct approach below.
5. Alternative: process a native-metric DEM from a different source
For this tutorial we’ll sidestep the reprojection question entirely by
working with a small pre-projected DEM we can compute slope on directly.
The sentinel-2-l2a collection on Planetary Computer serves imagery
already in UTM, and its companion DEM products do the same.
If you’re following along without a reference raster, skip to tutorial 2 Sentinel-2 composite and NDVI which starts with UTM data and avoids this rabbit hole.
For the remainder of this tutorial, if you want to compute terrain factors
on the dem_wgs84.tif we just downloaded, you will need to reproject it
first. The simplest way today is a one-liner with GDAL (yes, the tool
SurtGIS doesn’t depend on — we’re not dogmatic about it):
gdalwarp -t_srs EPSG:32719 -tr 30 30 -r bilinear dem_wgs84.tif dem_utm.tif
surtgis info dem_utm.tif
Output:
Cell size: 30
CRS: EPSG:32719
Now the cells are 30 m on a side, and we can compute slope meaningfully.
This is a documented limitation of SurtGIS 0.7.0. A native reproject command is on the roadmap. Until it lands, GDAL’s
gdalwarpis the recommended path. See the Reproject between UTM zones how-to for more.
6. Compute the core terrain factors
With a UTM DEM, we can compute the standard factors:
surtgis terrain slope dem_utm.tif slope.tif
surtgis terrain aspect dem_utm.tif aspect.tif
surtgis terrain hillshade dem_utm.tif hillshade.tif
Each of these takes a fraction of a second on a patch this size. Verify one:
surtgis info slope.tif
You should see a reasonable range now, roughly:
Min: 0.0
Max: ~45
Mean: ~12
Twelve degrees mean slope is consistent with Andean foothills. Peak slopes of 45° correspond to the steeper gullies.
7. Or: compute every standard factor in one pass
The terrain all subcommand computes the full set (slope, aspect,
hillshade, curvature, TPI, TRI, roughness) in one shot, reading the DEM
once and writing each factor as a separate file:
surtgis terrain all dem_utm.tif --outdir factors/ --compress
ls factors/
slope.tif aspect.tif hillshade.tif curvature.tif tpi.tif tri.tif roughness.tif
This is the ergonomic path for “give me the standard factor set to feed into a model”. For ad-hoc experimentation, the individual commands are fine.
8. Clip to a smaller area of interest
If your model only cares about the eastern half of this patch:
surtgis clip factors/slope.tif --bbox=300000,6110000,315000,6120000 slope_east.tif
--bbox is in the raster’s CRS (so UTM metres here, not degrees).
Alternatively, --polygon my_aoi.gpkg clips by a vector geometry — see
Clip by polygon how-to for that workflow.
9. Understand what you just produced
Let’s make sure the outputs actually make sense.
Slope is degrees above horizontal, 0 to 90. Flat plains are 0–3°, gentle hills 5–15°, steep terrain 25–45°, cliffs >45°. If your DEM yields slopes above 70° widely, something is wrong with units (as we saw in step 3) or with artifacts in the DEM.
Aspect is compass direction the slope faces, 0–360° (or −1 where the slope is too flat to have a direction). 0° = north, 90° = east, 180° = south, 270° = west.
Hillshade is a synthetic illumination value, 0–255, simulating a light source at a default azimuth (315°) and altitude (45°). It’s purely for visualisation; it has no physical unit. Darker values = in shadow from the light source; brighter = facing it.
Open factors/hillshade.tif in QGIS (or any raster viewer) and you should
immediately see the terrain structure of the patch. That visual check is
the single best way to confirm your pipeline is correct before trusting
the downstream derivatives.
What’s next
- For cloud masking and Sentinel-2 imagery, continue with Tutorial 2: Sentinel-2 composite and NDVI.
- For scaling this to a much larger DEM (>5 GB), read
Memory model and use
--streaming. - For writing this pipeline as a script, every command here is composable via shell scripts; most users treat the CLI as the primary interface.
Common gotchas
“My slope is everywhere 89°.” You’re computing slope on a lat/lon-projected DEM. Reproject to a metric CRS first (step 5).
“All my outputs are black or blank.” Open in QGIS with “Singleband pseudocolor” and let it auto-stretch the min–max. The TIFF viewer on your OS is probably clamping to the raw byte range.
“Some pixels are NaN / nodata.” Edge pixels within one cell of the DEM boundary are undefined for 3×3 window algorithms (slope, aspect, TPI, TRI). This is intentional; the alternatives either extrapolate off the edge or silently invent values.
“The STAC download was slow.” For cop-dem-glo-30, prefer
--catalog pc. Earth Search serves the same collection but with multiple
overlapping tile granules — transfer volume can be 20× higher for the
same bbox.
Sentinel-2 composite and NDVI from STAC
This tutorial builds a cloud-free Sentinel-2 composite from Microsoft’s Planetary Computer and derives the NDVI vegetation index from it — the classic “hello world” of satellite remote sensing. It takes about 15 minutes of wall clock (most of it network transfer).
By the end you will have:
- Red and NIR band GeoTIFFs, mosaicked across multiple cloud-free scenes from one month, already in UTM at 10-metre resolution.
- An NDVI raster ready to visualise or feed into a model.
- An understanding of the STAC composite pipeline: search → mask → median → reproject.
Prerequisites: SurtGIS installed. Network access for STAC reads (~100 MB transferred for this tutorial).
1. Working directory
mkdir -p ~/surtgis-s2-tutorial
cd ~/surtgis-s2-tutorial
2. Pick a study area
We’ll use a 5 km × 5 km patch in central Chile (near Talca), Southern Hemisphere summer, so vegetation is green and cloud cover is manageable:
- Bbox:
-71.0, -35.1, -70.95, -35.05(west, south, east, north in degrees) - Date range:
2024-01-01to2024-01-31
Small bboxes are kind to your network and to Planetary Computer’s rate limits. Once the pipeline works at 5 km, scaling to a watershed is just changing one flag.
3. Fetch a median composite of red + NIR
surtgis stac composite \
--catalog pc \
--collection sentinel-2-l2a \
--asset "red,nir" \
--bbox=-71.0,-35.1,-70.95,-35.05 \
--datetime 2024-01-01/2024-01-31 \
--max-scenes 3 \
composite.tif
What’s happening, step by step:
- STAC search. The client queries Planetary Computer’s STAC API for
sentinel-2-l2aitems intersecting the bbox within the datetime range. Typically you get 5–10 candidate scenes for one month on a small bbox. - Scene ranking. Up to
--max-scenes(3 here) are picked by coverage of the bbox and reasonable cloud cover metadata. - Per-scene fetch. For each selected scene, the COG reader opens the red and NIR assets via HTTP range requests, downloading only the tiles that intersect the bbox.
- Cloud masking. Sentinel-2 L2A ships with an SCL (Scene Classification Layer) raster that labels every pixel as cloud, cloud shadow, water, vegetation, etc. SurtGIS applies the mask to remove cloudy pixels.
- Median composite. Across the 3 cloud-masked scenes, each pixel takes the median value per band. The result is one pixel-stack with one value per band.
- Output. Written as UTM-projected GeoTIFFs, one per band:
composite_red.tifandcomposite_nir.tif. The_redand_nirsuffixes come from the asset names.
On a warm connection this should take about 65 seconds. Verify:
ls *.tif
surtgis info composite_red.tif
composite_nir.tif composite_red.tif
...
Dimensions: 467 x 564 (263388 cells)
Cell size: 10
Bounds: (317595.06, 6114033.69) - (322265.06, 6119673.69)
CRS: EPSG:32719
10-metre cells, EPSG:32719 (UTM 19S), about 4.7 km × 5.6 km. Note that the output extent is slightly larger than the bbox you requested — the composite rounds to the Sentinel-2 native 10-metre grid, so you get a couple of extra pixels of padding.
4. Compute NDVI
NDVI = (NIR − Red) / (NIR + Red). It’s a vegetation index: roughly, higher values mean denser, greener vegetation. Bare soil is around 0.1, sparse grass 0.2–0.3, active crops 0.4–0.7, dense forest 0.7–0.9.
surtgis imagery ndvi \
--red composite_red.tif \
--nir composite_nir.tif \
ndvi.tif
surtgis info ndvi.tif
You should see:
Statistics:
Min: ~0.03
Max: ~0.69
Mean: ~0.40
A mean of 0.40 is consistent with a mix of crop fields and some bare land in mid-January. Open the TIFF in QGIS with the “pseudocolor” style and apply a red → yellow → green ramp; you’ll immediately see agricultural patterns.
5. Interpret the output
A few sanity checks to confirm your pipeline is working:
Histogram shape. NDVI over a real landscape has a bimodal or trimodal distribution: bare soil / pavement (~0.1), mixed vegetation (~0.3–0.5), dense vegetation (~0.7+). A single peak at 0 suggests the red/NIR bands got swapped. A single peak at 0.9 suggests the composite is hallucinating vegetation, possibly because the SCL mask is removing all non-vegetation pixels (unlikely but worth checking).
No NaN border. Unlike terrain derivatives (which have a 1-pixel NaN
border from the 3×3 window), NDVI is a per-pixel formula and should have
valid values everywhere the red and NIR bands both have values. If you see
a large NaN region, that’s likely cloud cover that the SCL mask filtered
out across every selected scene — increase --max-scenes to get more
dates.
Distribution of valid pixels. surtgis info ndvi.tif reports “Valid
cells: N/M”. For a cloud-free January in this region you should see 100%.
In cloudier regions or seasons, expect 85–95%.
6. Scale to more bands and more dates
The real workflow for a machine-learning pipeline wants more than two bands. Here’s a 10-band composite over three months:
surtgis stac composite \
--catalog pc \
--collection sentinel-2-l2a \
--asset "blue,green,red,rededge1,rededge2,rededge3,nir,nir08,swir16,swir22" \
--bbox=-71.0,-35.1,-70.95,-35.05 \
--datetime 2024-01-01/2024-03-31 \
--max-scenes 8 \
--cache \
composite_all.tif
What’s new:
- 10 bands at once. Writes
composite_all_blue.tif,composite_all_green.tif, etc. - 3 months of data, 8 scenes median-composited.
--cachepersists downloaded COG tiles at~/.cache/surtgis/cog/so re-runs reuse them instead of re-fetching.
Expected wall time: ~5 minutes. RAM peak: see the RAM budget how-to — SurtGIS will print a budget line at startup estimating peak usage.
7. Use NDVI as input to a model
Now you have a feature raster. Two common next steps:
Combine with terrain. Run the first tutorial on a DEM covering the same bbox, then the derived terrain factors (slope, aspect, TPI) plus this NDVI form a multi-feature stack that most landslide-susceptibility or vegetation-classification models want.
Extract training samples at labelled points. If you have a vector file
of labelled points (e.g. wells.gpkg with a vegetation_class column):
surtgis extract \
--features-dir ./composite_all_bands/ \
--points wells.gpkg \
--target vegetation_class \
samples.csv
The result is a CSV with one row per point, columns = bands + target. Feed into XGBoost, scikit-learn, or any tabular model.
For CNN-based spatial classification, use extract-patches instead to
generate image chips (see the extract-patches how-to).
Common gotchas
“The composite has huge black regions.” Your bbox straddles a
Sentinel-2 MGRS tile boundary, and the scenes selected don’t all cover
both sides. Increase --max-scenes to get scenes from both orbits.
“The download is slow or times out.” Planetary Computer occasionally
rate-limits (HTTP 429). SurtGIS 0.6.25+ has exponential backoff with
jitter built into the retry logic; transient failures should recover
automatically. Persistent rate limits might warrant switching to Earth
Search (--catalog es) or spreading the work across time.
“RAM climbs during long runs.” See
Debug a stac composite using too much RAM.
Short version: the [ram] log lines SurtGIS prints at strip/band
transitions pinpoint where growth starts. Raise --band-chunk-size if
you have headroom, or lower SURTGIS_RAM_BUDGET_GB if you don’t.
“I want the full scene, not just my bbox.” For a full MGRS tile (~110
km on a side at 10 m resolution), expect 10–100× the runtime and transfer
volume. Use --strip-rows 256 and enable --cache. For tile-size extents
the streaming pipeline really earns its keep.
Fluvial-tectonic morphometry from a DEM
This tutorial walks you from a raw digital elevation model to the five canonical metrics of tectonic geomorphology: χ (chi), normalised channel steepness (ksn), knickpoints, per-basin concavity (θ), and divide-migration asymmetry. It takes about 10 minutes once you have a DEM in hand and produces inputs that match the convention TopoToolbox (Schwanghart & Scherler 2014) established.
By the end you will have:
- A reproducible CLI sequence that turns any projected DEM into the full v0.10 fluvial product family.
- Three GeoJSON files (ksn segments, knickpoints, divides) ready to drape on a hillshade in QGIS or load into geopandas.
- A CSV of per-basin θ with bootstrap-derived 95 % CIs.
- Working intuition for which knob to turn when an output looks off.
Prerequisites: SurtGIS ≥ 0.10.1 installed
(cargo install surtgis or pip install surtgis). A DEM in a
projected coordinate system in metres — UTM is the standard
choice. If yours is in EPSG:4326 (latitude/longitude), reproject first:
surtgis reproject dem_wgs84.tif --to EPSG:32719 dem.tif.
If you don’t have one, the First terrain analysis tutorial walks through downloading a real 30 m Andes DEM in two steps.
1. Set up a working directory
mkdir -p ~/surtgis-fluvial-tutorial
cd ~/surtgis-fluvial-tutorial
cp /path/to/your/dem.tif dem.tif
All subsequent commands expect dem.tif to be present in the working
directory and to be in a projected CRS with metre units.
2. Build the hydrology stack
The fluvial algorithms operate on three derived rasters, not on the DEM directly. Producing them is the standard SurtGIS hydrology pipeline:
surtgis hydrology fill-sinks dem.tif filled.tif
surtgis hydrology flow-direction filled.tif fdir.tif
surtgis hydrology flow-accumulation fdir.tif facc.tif
Each command takes seconds on a 1000 × 1000 raster. The three outputs
together represent the full hydrologic topology: every cell knows where
its water comes from (facc) and where it goes next (fdir).
Now extract the stream network. This is where the new
--from-facc flag matters: without it, stream-network would
recompute the whole pipeline internally and the resulting stream cells
would NOT be topologically consistent with the fdir.tif you just
produced. With it, the command thresholds your existing facc.tif
and the topology is coherent across all four files.
surtgis hydrology stream-network --from-facc --threshold 1000 \
facc.tif streams.tif
The --threshold 1000 is in cell counts. With a 30 m DEM this is
~0.9 km² of contributing area — a reasonable threshold for mountainous
terrain that captures bedrock channels but suppresses noise in
zero-order hollows. For a 10 m DEM, scale to --threshold 9000 to
keep the area threshold constant.
3. Compute χ (chi)
χ is the path integral of (A₀/A(x))^θref along each river profile,
measured from the network’s outlets upstream. In a steady-state
landscape it linearises the elevation profile against a single
slope — that slope is ksn.
surtgis fluvial chi streams.tif fdir.tif facc.tif chi.tif
Defaults: --theta-ref 0.45 (the canonical bedrock-channel reference,
Perron & Royden 2013) and --a-0-m2 1e6 (1 km² normalisation). The
output is a Float32 GeoTIFF where every stream cell has a χ value in
metres; non-stream cells are NaN.
Quick visual check in QGIS: load chi.tif, style with a continuous
ramp — χ should climb monotonically from blue (low, near outlets) to
red (high, near divides). Any spatial discontinuity along a single
channel flags a topology problem upstream (usually flow-direction
ambiguity at a confluence).
4. Compute ksn (channel steepness)
surtgis fluvial ksn streams.tif fdir.tif facc.tif filled.tif ksn.tif \
--segments ksn_segments.geojson
Two outputs:
ksn.tif— per-cell raster, smoothed over a 500 m window along the channel (Wobus et al. 2006 standard).ksn_segments.geojson— one LineString per stream segment between graph terminals (confluence ↔ confluence, confluence ↔ outlet), attributesksn_meanandn_cells. Coordinates are reprojected to WGS84 by default (RFC 7946); pass--keep-crsto preserve source CRS metres with a legacycrsdeclaration.
High ksn cells correspond to either fast uplift or resistant lithology. Sub-basins with anomalously high ksn relative to their neighbours are the first thing to investigate for tectonic signal.
5. Detect knickpoints
surtgis fluvial knickpoints streams.tif fdir.tif facc.tif filled.tif \
knickpoints.geojson \
--raster knickpoint_raster.tif
The output GeoJSON Points carry four properties:
elevation_m— denoised z at the knickpoint cellmagnitude_m— elevation drop across a ±2-cell windowchi— χ value at the cellpolarity—concave(slope decreases downstream, often lithology contrast) orconvex(slope increases downstream, often transient response to a tectonic pulse)
The categorical raster (--raster) writes 0 for non-knickpoint,
1 for concave, 2 for convex — useful when you want to overlay knickpoint
density on a hillshade.
Tuning: if you see false positives in headwaters, raise
--min-magnitude-m from 10 (default) to 20-30. If you suspect real
knickpoints are being smoothed away, lower --tvd-lambda from 0.5 to
0.3.
6. Per-basin concavity θ
You’ll need a basins raster first. Identify pour-points (rows/cols of
the outlets of the sub-basins you care about) — usually the cells
with the highest facc values along the AOI boundary. If you have
them as coordinates from a GIS, convert to (row, col) using the
facc.tif transform.
surtgis hydrology watershed --pour-points "1429,778;1613,466" \
fdir.tif basins.tif
surtgis fluvial concavity streams.tif fdir.tif facc.tif filled.tif basins.tif \
concavity.csv \
--bootstrap-n 200
The CSV has one row per basin with theta_opt, theta_ci_low,
theta_ci_high, n_cells, rmse. Typical bedrock channels in steady
state cluster around θ = 0.45; values < 0.3 or > 0.6 are interesting
and warrant inspection — they suggest transient response, lithologic
heterogeneity, or a non-uniform uplift signal.
7. Divide migration
surtgis fluvial divide-migration basins.tif filled.tif facc.tif \
divides.geojson \
--chi chi.tif
One LineString feature per adjacent-basin pair, with median Δχ
(Willett 2014), median Gilbert Δelev, median Δrelief, and n_pairs
(cell-pair count along the divide).
Interpretation: a divide where median_chi_diff > 0 and basin_a
(the smaller numeric ID) is on the higher-elevation side suggests
basin_a is “losing area” to basin_b. The threshold for significance
is empirical — in well-studied cases like Willett’s southern Sierra
Madre, |Δχ| of a few hundred metres is diagnostic; for your AOI
calibrate against the obvious cases first.
8. Verify in QGIS
Load the rasters and vectors in QGIS:
| Layer | Style |
|---|---|
filled.tif | hillshade (Tools → Raster → Analysis → Hillshade) |
ksn.tif | graduated, viridis 10–500 range, transparent for NaN |
knickpoints.geojson | categorized on polarity (concave = blue, convex = red), graduated size by magnitude_m |
ksn_segments.geojson | graduated line on ksn_mean, viridis |
divides.geojson | graduated line on median_chi_diff, diverging colormap centred at 0 |
If you used the default WGS84 output, everything aligns directly on
a basemap layer (OpenStreetMap, satellite). If you used --keep-crs,
make sure the project CRS matches your source raster CRS.
Where to next
- “What do these numbers actually mean?” — Tectonic geomorphology with the fluvial module walks through the interpretation framework.
- “My output looks weird.” — the pitfalls section of the explanation chapter covers the eight failure modes the spec calls out, including the cell-size-vs-resolution sensitivity that bites on coarse DEMs.
- “I want to publish this.” — every output above carries provenance
(parameters, source CRS, etc.) in its filename or properties; for a
paper-grade pipeline log, capture the full command sequence and the
SurtGIS version (
surtgis --version) alongside the figures.
Reproject a raster between CRSes
SurtGIS 0.7.4 added the surtgis reproject command, which uses
proj4rs for the coordinate transform and a Rayon-parallelised inverse
mapping for the per-pixel sampling. There is no system GDAL dependency
— this works from a single static surtgis binary.
The 30-second version
# UTM 19S → WGS84 lat/lon
surtgis reproject dem_utm.tif dem_wgs84.tif --to EPSG:4326
# WGS84 → UTM 19S, preserving roughly the same resolution
surtgis reproject dem_wgs84.tif dem_utm.tif --to EPSG:32719
# UTM 19S → Web Mercator at 30 m, nearest-neighbour (for categorical
# rasters like a classification result)
surtgis reproject classes.tif classes_mercator.tif \
--to EPSG:3857 --pixel-size 30 --method nearest
The output pixel size is auto-inferred when both CRSes are the same kind
(metric ↔ metric, or geographic ↔ geographic). For unit-changing
reprojections you can pass --pixel-size <X> in target-CRS units.
Required flags
--to EPSG:XXXX(or just--to XXXX): the target CRS. Any EPSG code supported by proj4rs works (UTM zones, Web Mercator 3857, Lambert, national grids, etc.).
Optional flags
--from EPSG:XXXX: override the source CRS when the input GeoTIFF doesn’t embed one. Most modern GeoTIFFs do, in which case you can omit this.--method nearest|bilinear: resampling kernel. Default isbilinearwhich is right for continuous data (elevation, NDVI, reflectance). Usenearestfor categorical data like geomorphons or land-cover classifications, where interpolating values across class boundaries would produce meaningless “tween” classes.--pixel-size <X>: output pixel size in target-CRS units (metres for projected CRSes, degrees for geographic). When omitted, SurtGIS picks a sensible default that preserves the source resolution.
What it produces
$ surtgis reproject benchmarks/results/dems/fbm_1000_raw.tif out.tif --to EPSG:4326
Reprojecting EPSG:32719 → EPSG:4326 (1000, bilinear method)
✓ wrote out.tif (649×1001) in 0.13 s
On the i7-1270P benchmark machine: a 1 000 × 1 000 UTM 19S DEM reprojects to WGS84 in 0.13 s, and to Web Mercator in 0.59 s (the output grid is bigger because Web Mercator stretches near the poles). The reproject is parallelised across output rows, so multi-core machines scale roughly linearly until memory bandwidth saturates.
Same-CRS shortcut
If the source and target CRSes are the same, SurtGIS writes the input verbatim and tells you what it did:
$ surtgis reproject foo.tif foo_copy.tif --to EPSG:32719
source and target CRS are the same (EPSG:32719); copying input to output
✓ wrote foo_copy.tif in 0.04 s
This makes reproject safe to put in scripts where the source CRS
varies; the no-op case is cheap.
Alternative paths
The reproject command is the right tool for a standalone
reprojection. For two adjacent use cases, prefer the dedicated
shortcut:
1. Aligning to a template raster
If you already have a raster in the target CRS and want the output
on the same grid (same origin, pixel size, dimensions), use
resample:
surtgis resample source.tif aligned.tif --reference template.tif
resample handles CRS mismatch internally; you don’t need to call
reproject first.
2. Pulling cloud data already in the target CRS
If the source is a STAC asset, the cloud commands accept --align-to
and reproject-on-read so only the tiles intersecting the target grid
are fetched:
surtgis stac composite \
--catalog pc \
--collection cop-dem-glo-30 \
--bbox=... \
--align-to reference_utm.tif \
dem_aligned.tif
This is usually faster than download → reproject because only the
tiles within the target extent are pulled.
Choosing a target CRS
For terrain analysis at mid-latitudes, the local UTM zone is almost always the right choice. The formula:
zone = floor((longitude + 180) / 6) + 1
hemisphere = "S" if latitude < 0 else "N"
EPSG = 32600 + zone (Northern hemisphere)
EPSG = 32700 + zone (Southern hemisphere)
Examples:
- Chile around Curicó (−71°, −35°) → zone 19 South → EPSG:32719.
- Spain around Madrid (−3°, 40°) → zone 30 North → EPSG:32630.
- Western US around Los Angeles (−118°, 34°) → zone 11 North → EPSG:32611.
For continental-scale areas that span multiple UTM zones, use an equal-area projection appropriate to your region (Lambert Azimuthal for continental Europe, Albers Equal Area Conic for the continental US, etc.). Terrain slope is less sensitive to projection distortion than area, so UTM within a single zone is almost always good enough for analyses smaller than ~500 km across.
Caveats
- Only
nearestandbilinearare implemented today. Cubic resampling is on the roadmap. For elevation and most reflectance data, bilinear is the right default; cubic’s main advantage is smoother contour rendering, whichreprojectis not the right tool for anyway. - The output extent is computed by transforming the source raster’s
four corners plus three edge midpoints, which covers most
projections. Highly curved projections (large-area polar or
cylindrical projections crossing the antimeridian) may need
--pixel-sizeset explicitly. - For geographic-to-metric reprojections without
--pixel-size, SurtGIS infers a default that roughly preserves the output column count. If you need an exact resolution (say 30 m to match Copernicus DEM), pass it explicitly.
Mosaic rasters in different CRSes
surtgis mosaic assumes all inputs share a CRS and grid. If they don’t,
you’ll hit an error. Two recommended approaches.
Pre-reproject everything to a target CRS
Pick a reference raster (or create an empty one) in your target CRS, then
use resample on each input to align them:
for f in tile_*.tif; do
surtgis resample "$f" "aligned_$f" --reference reference_utm.tif
done
surtgis mosaic aligned_*.tif out.tif
resample handles both the CRS change and the grid alignment in one step,
so aligned_*.tif are all on the same grid by construction.
Use stac composite if the sources are on a STAC catalog
The multi-UTM case that drove SurtGIS 0.6.1’s critical fix is exactly
this: a bbox spanning two UTM zones, with different scenes in each zone.
stac composite handles it automatically by reprojecting each tile’s
bbox into the tile’s native CRS before reading, then unifying at mosaic
time. You don’t think about zone boundaries:
surtgis stac composite \
--bbox=-71.5,-35.5,-70.5,-34.5 \ # spans UTM 19S and 18S
--asset red,nir \
...
If you really want to mosaic mixed-CRS files yourself
Reproject them all to a common CRS first (either via resample --reference
or gdalwarp), then surtgis mosaic. There is no “mosaic with automatic
reprojection” command; deliberately so, because implicit reprojection
during mosaic usually produces surprising seams when the target grid isn’t
chosen carefully.
Overlap handling
surtgis mosaic uses NaN-aware last-write-wins: a source pixel only
overwrites the output if it is finite. This means irregular tile shapes
(Sentinel-2 scenes with masked borders, Landsat fill pixels) do not
overwrite valid data from adjacent tiles. Order of the input arguments
matters only for genuinely overlapping valid regions — there, the last
file wins.
Cache COG tiles for fast re-runs
Any STAC or COG command that supports --cache will persist downloaded
tiles to disk and serve them from cache on subsequent runs over the same
bbox and asset.
Enable
surtgis stac composite --cache --catalog pc --asset red,nir ...
surtgis stac fetch-mosaic --cache --catalog pc --collection cop-dem-glo-30 ...
The first run downloads from the origin; subsequent runs over the same
bbox read from local disk. No code changes, no extra flags beyond
--cache being present.
Where the cache lives
Default: $XDG_CACHE_HOME/surtgis/cog/ (typically ~/.cache/surtgis/cog/
on Linux, ~/Library/Caches/surtgis/cog/ on macOS). Override with:
export XDG_CACHE_HOME=/path/to/big/disk
Cache keys are derived from the COG base URL (without SAS query parameters) plus the exact bbox. This means re-signed Azure Blob URLs still hit cache, but shifting the bbox by even one pixel produces a cache miss.
When to use it
Always, if you’re iterating on the same area of interest. The cost is disk space (each tile is typically 1–10 MB, a medium study area might cache 0.5–2 GB); the benefit is that subsequent runs go from minutes to seconds.
When not to use it
- One-shot production runs where you don’t expect to re-read the tiles.
- Highly variable bboxes that don’t share tile coverage (each run caches fresh, adds disk pressure, returns nothing usable).
- Sentinel-2 scenes with very short SAS token lifetimes where you might cache items whose signed URLs go stale. This is rare in practice; SurtGIS re-signs tokens automatically within the same run.
Cache hygiene
No automatic eviction yet. Manually clear when the cache gets too big:
rm -rf ~/.cache/surtgis/cog/
A TTL-based eviction policy is on the roadmap. Until then, a cron job removing files older than N days is a reasonable workaround.
Disk layout
Tiles are sharded into 256 subdirectories by hash prefix so no single directory grows unbounded:
~/.cache/surtgis/cog/
├── 00/
│ ├── 42/
│ │ └── <rest-of-hash>.tif
│ └── ...
├── 01/
└── ...
Safe to browse with any file manager. The .tif files are regular
GeoTIFFs and can be opened directly.
Tune the RAM budget
For workflows that approach or exceed your machine’s RAM, SurtGIS gives you two dials.
Override the budget target (stac composite)
stac composite auto-sizes strip_rows to fit within a 16 GB budget by
default. Lower or raise it with an environment variable:
# Conservative — fit the pipeline in 8 GB
SURTGIS_RAM_BUDGET_GB=8 surtgis stac composite ...
# Permissive — machine has 64 GB, let strips be larger for fewer HTTP calls
SURTGIS_RAM_BUDGET_GB=32 surtgis stac composite ...
SurtGIS will print the budget and the derived strip_rows on startup:
RAM budget (16.0 GB target, band_chunk_size=1) — output: 3.3 GB | mask cache: 5.0 GB | scene strips: 1.3 GB | band working: 0.4 GB | decode: 0.1 GB (strip_rows=512) → ~10.1 GB peak
That ~10.1 GB peak is the estimate. Real peak is typically within ±30%.
If it comes in much higher, you’ve hit an edge case — see
Debug a stac composite using too much RAM.
Override the memory ceiling (general commands)
For commands that operate on in-memory rasters:
surtgis --max-memory 4G terrain slope huge.tif slope.tif
Any raster larger than --max-memory (default 500 MB) triggers automatic
streaming mode on commands that support it. Force it explicitly with
--streaming:
surtgis --streaming terrain slope huge.tif slope.tif
If you pass --streaming on a command that hasn’t been adapted to the
streaming path, you get a hard error. See Memory model
for the list of streamable algorithms.
Rule of thumb for sizing
On a host with X GB of RAM, leave about 0.4 × X GB free for the OS
and other processes. So:
| Host RAM | Safe SURTGIS_RAM_BUDGET_GB |
|---|---|
| 8 GB | 4 |
| 16 GB | 9 |
| 32 GB | 18 |
| 38 GB (postdoc machine) | 22 |
| 64 GB | 36 |
These aren’t hard caps — SurtGIS won’t strictly respect the budget, it
sizes strip_rows to aim for it. Overshoot is possible. When the
overshoot matters (you’re on a shared machine, or the OOM killer is
watching), pair the budget override with an external watchdog:
# Kill if RSS exceeds 14 GB
( surtgis stac composite ... & PID=$!
while kill -0 $PID 2>/dev/null; do
RSS=$(awk '/VmRSS:/ {print $2}' /proc/$PID/status 2>/dev/null || echo 0)
if [ "$RSS" -gt 14000000 ]; then kill -9 $PID; break; fi
sleep 30
done
)
Raising band_chunk_size for speed at RAM cost
The --band-chunk-size K flag on stac composite controls how many bands
are downloaded and processed together per scene. K=1 (default) is
minimum RAM; higher K reduces HTTP request count at proportional RAM
cost:
| K | Scene strips | HTTP requests per scene |
|---|---|---|
| 1 (default) | 1× baseline | n_bands × |
| 3 | 3× baseline | n_bands/3 × |
| 5 | 5× baseline | n_bands/5 × |
On a 38 GB host with a 10-band composite, K=5 is comfortable. The budget printed at startup scales with K explicitly, so you can see the predicted peak before committing.
Debug a stac composite using too much RAM
If surtgis stac composite is consuming more RAM than expected, work
through these in order.
1. Read the budget line
Every stac composite run prints at startup something like:
RAM budget (16.0 GB target, band_chunk_size=1) — output: 3.3 GB | mask cache: 5.0 GB | scene strips: 1.3 GB | band working: 0.4 GB | decode: 0.1 GB (strip_rows=512) → ~10.1 GB peak
Five components:
| Component | What it is | Scales with |
|---|---|---|
| output | Pre-allocated output buffers, one per band | n_bands × total_cells |
| mask cache | Mosaicked cloud masks, one per scene | n_scenes × strip_rows |
| scene strips | Per-scene data for the active band chunk | band_chunk_size × n_scenes × strip_rows |
| band working | The active band’s mosaicked tile cache | band_chunk_size × strip_rows |
| decode | In-flight tile decode buffers | tile_concurrency |
If ~X GB peak is comfortably within --max-memory (or your host’s free
RAM), proceed; if not, see step 2.
2. Watch the [ram] transition log lines
Since v0.6.26, SurtGIS prints RSS and cumulative-tile counters at strip/phase/band-chunk transitions:
[ram] baseline before strip loop: RSS=120 MB
[ram] strip 1/2 start: RSS=3300 MB, tiles_cumulative=0
[ram] strip 1/2 after Phase A (masks): RSS=5800 MB, phase_a_tiles=1890, cumulative=1890
[ram] strip 1/2 chunk bands [0..1] start: RSS=6200 MB, cumulative=1890
[ram] strip 1/2 chunk bands [0..1] end: RSS=6400 MB, cumulative=3780
Rule of thumb: if RSS grows between a chunk start and the matching
chunk end by more than the budget’s scene strips + band working, the
model is wrong for this workload — file an issue
with the full log. That’s exactly how the fragmentation bug in v0.6.24
was found.
3. Lower the budget, then the bbox
In order of impact:
Cut SURTGIS_RAM_BUDGET_GB. Halves strip_rows, proportionally
halves mask cache and scene strips:
SURTGIS_RAM_BUDGET_GB=8 surtgis stac composite ...
Cut --band-chunk-size. K=1 is the minimum; you can’t go lower.
If you’re already at 1 and still over budget, the workload is structurally
too big for the host.
Cut --max-scenes. Each scene you drop trims mask cache and
scene strips proportionally. Going from 42 scenes to 20 halves both.
Median composite quality degrades, but on a cloud-free month that can be
acceptable.
Cut the bbox. Halving the bbox area quarters output and
proportionally shrinks every other component. If the bbox is a watershed,
process by sub-watershed and mosaic the outputs afterwards.
4. If RSS grows linearly with no plateau
That’s the fragmentation signature from the v0.6.19–v0.6.24 debugging arc. Should not recur on v0.6.26+ (mimalloc fixed it). If it does on a later version:
- Confirm you’re on v0.6.26 or later:
surtgis --version. - Share the
[ram]log plussurtgis --versionplus OS + allocator in an issue.
5. Checklist for a stuck run
Before giving up:
- Current version ≥ v0.6.26?
-
--cacheenabled if re-running the same bbox? -
SURTGIS_RAM_BUDGET_GBset to ~0.6× host RAM? -
--band-chunk-size 1? - bbox reasonable (< ~200 km × 200 km for S2 L2A at 10 m)?
- Host not competing with another RAM-heavy process?
If all yes and it still OOMs, that’s a genuine bug, not a tuning problem.
Extract patches: points vs polygons
surtgis extract-patches writes CNN-ready NPY tensors from feature
rasters. The two modes answer different questions.
Points mode
Use when you have labels at specific point locations and want spatial context (a patch) around each point for the model to learn from.
Example: 60 water wells with measured water-table depth. Each well gets a 256 × 256 patch of the surrounding terrain/imagery features; the model learns to predict water depth from that spatial context.
surtgis extract-patches \
--features-dir factors/ \
--points wells.gpkg \
--label-col water_depth \
--size 256 \
patches/
- One patch per point.
--label-colcan be numeric (i64 for classification, f32 for regression, auto-detected) or a string that parses to a number.- Points whose patch would extend outside the raster are silently skipped.
Polygons mode
Use when you have labelled regions and want many patches per region for training a CNN to classify or regress pixels.
Example: a shapefile with land-cover polygons (crop, pasture, bare soil, forest). Each polygon produces many patches; the model learns per-polygon class from spatial context.
surtgis extract-patches \
--features-dir factors/ \
--polygons land_use.gpkg \
--label-col class \
--size 256 \
--stride 128 \
patches/
- Grid sampling: patches are placed every
--stridepixels (default =--size, giving non-overlapping tiles). Setting--strideto half of--sizegives 50 % overlap, which common for data augmentation. - Each candidate centre is tested for point-in-polygon; patches whose
centre falls inside the polygon inherit its
--label-colvalue. - MultiPolygons are flattened. Interior rings (holes) are not honoured in v1 — patch centres in holes still get the outer polygon’s label. This is a documented limitation.
Which to pick
- Sparse labels on a grid you care about → points.
- Dense labels over connected regions → polygons.
- Both → run twice to two different output directories and merge.
Output format
Both modes write the same four files:
patches_dir/
├── patches.npy # [N, bands, H, W] f32
├── labels.npy # [N] i64 or f32
├── manifest.csv # idx, label, center_row/col, center_x/y, source_idx
└── meta.json # bands order, CRS, pixel size, skipped counts, seed
Load in Python:
import numpy as np
X = np.load('patches_dir/patches.npy') # ready for torch.from_numpy(X)
y = np.load('patches_dir/labels.npy')
Memory ceiling
The entire patches.npy tensor is held in RAM before writing in v0.7.0.
For N = 10,000 patches of 256 × 256 × 10 bands × 4 bytes, that’s
26 GB. Two mitigations:
--max-patches Ndeterministically subsamples down toN(seeded by--seed, default 42). Useful for balancing very dense polygon grids.- Shrink
--sizeor--strideto reduce patch count.
Streaming writes (no full-tensor-in-RAM) are on the roadmap; they require
knowing N upfront without extracting data, which is doable but not yet
done.
NaN filtering
--skip-nan-threshold 0.1 (default) skips any patch where more than 10 %
of pixels are NaN across all bands. Raise it to 0.5 if you’re tolerant
of partial coverage; lower it to 0.0 to require fully valid patches.
The per-patch NaN count lands in manifest.csv for downstream filtering.
Prepare training data for a Geospatial Foundation Model
This guide walks through the full pipeline from a STAC bbox to a tensor ready to fine-tune Prithvi-EO-2.0 (NASA / IBM). The same flow works for Clay v1.5 by swapping one flag.
The motivation: in October 2025 the InstaGeo paper identified that no published Geospatial Foundation Model ships its preprocessing pipeline — users get only model checkpoints. SurtGIS fills that gap with a single CLI command.
Step 1 — fetch a multi-temporal HLS composite
Prithvi was pre-trained on Harmonized Landsat-Sentinel-2 (HLS). For
each timestamp you want in the temporal stack, run one stac composite
call into its own subdirectory. The subdirectory name becomes the
timestamp label in the output tensor.
# 2024 Maule, Chile — three months of HLS coverage
for month in 01 02 03; do
surtgis stac composite \
--catalog pc \
--collection hls2-s30 \
--asset "B02,B03,B04,B05,B06,B07" \
--bbox=-72.0,-35.5,-71.5,-35.0 \
--datetime "2024-${month}-01/2024-${month}-31" \
--max-scenes 10 \
features/t_2024_${month}/composite.tif
done
Each composite uses the HLS Fmask asset for cloud masking
automatically — the asset name “Fmask” triggers SurtGIS’s
HlsFmask strategy (cloud, adjacent-to-cloud,
and cloud-shadow bits dropped; cirrus and snow kept). The output is
one .tif per band per timestamp.
After the loop your tree should look like:
features/
├── t_2024_01/
│ ├── composite_B02.tif
│ ├── composite_B03.tif
│ └── ... (one per asset)
├── t_2024_02/
│ └── ...
└── t_2024_03/
└── ...
Step 2 — extract Prithvi-ready chips
One command turns the per-timestamp directories into a tensor matched to Prithvi-EO-2.0’s input convention:
surtgis extract-patches \
--features-dir features/ \
--points labels.geojson \
--label-col landslide_class \
--profile prithvi-v2 \
--size 224 \
--output-format zarr \
--emit-stac \
chips/
What the flags do, in order:
--features-dir features/: SurtGIS auto-detects multi-timestamp mode because the directory contains subdirectories (each holding the same band set). Subdirs are sorted lexicographically — that’s why we usedt_2024_01,t_2024_02,t_2024_03.--profile prithvi-v2: validates that 6 bands are present in the order Prithvi expects (B02, B03, B04, B05, B06, B07), then applies the per-band z-score normalization from the official Prithvi-EO-2.0-300M config. Records full provenance inmeta.json.--size 224: matches Prithvi’s pre-training tile size. The profile warns if you pass a different size, since the model would then need to resize at training time.--output-format zarr: emit chunked Zarr v2 instead of a single.npy. One chunk per chip, parallel I/O. Optional but useful when N is large.--emit-stac: write a STAC ML-AOI Collection and one Item per chip, embedding the STAC MLM extension withmlm:model_target = ibm-nasa-geospatial/Prithvi-EO-2.0-300Mand the fullmlm:inputdescriptor. This is what makes the dataset publishable on its own.
The output tree:
chips/
├── patches.zarr/ # [N, 6, 3, 224, 224] f32, chunk = 1 chip
│ ├── .zarray
│ ├── .zattrs
│ └── 0.0.0.0.0, 1.0.0.0.0, ...
├── labels.npy # [N] i64 or f32
├── manifest.csv
├── meta.json # bands, timestamps, gfm_profile{...}
└── stac/
├── collection.json # MLM + ML-AOI
└── items/
├── chip_000000.json
└── chip_000001.json
The tensor shape [N, 6, 3, 224, 224] is the standard
[batch, channels, time, height, width] that Prithvi accepts directly.
Step 3 — load and fine-tune
With TerraTorch (the IBM fine-tuning toolkit):
import zarr
import numpy as np
import torch
X = zarr.open('chips/patches.zarr', mode='r') # [N, 6, 3, 224, 224]
y = np.load('chips/labels.npy')
# Materialise into a Torch dataset — the Zarr open is lazy, [:] forces it.
X_tensor = torch.from_numpy(X[:])
y_tensor = torch.from_numpy(y)
# Prithvi-EO-2.0 expects (B, C, T, H, W) which is exactly our layout.
# No transpose, no extra normalization — SurtGIS already z-scored using
# the official Prithvi statistics.
The profile metadata in meta.json lets a TerraTorch config file pick
up the right band order, normalization params, and tile size without
hand-editing:
import json
with open('chips/meta.json') as f: m = json.load(f)
print(m['gfm_profile']['model_target'])
# → 'ibm-nasa-geospatial/Prithvi-EO-2.0-300M'
print(m['tensor_layout'])
# → '[N, C, T, H, W]'
Variation — Clay v1.5 instead
Clay expects 10 Sentinel-2 bands at tile 256 with reflectance values in
[0, 1]. Swap the profile and band list:
surtgis stac composite \
--catalog pc --collection sentinel-2-l2a \
--asset "B02,B03,B04,B05,B06,B07,B08,B8A,B11,B12" \
--bbox=-72.0,-35.5,-71.5,-35.0 --datetime 2024-02-01/2024-02-28 \
features/t0/composite.tif
surtgis extract-patches \
--features-dir features/ --points labels.geojson \
--label-col landslide_class --profile clay-v1.5 --size 256 \
--output-format zarr --emit-stac chips/
Cloud masking switches to Sentinel-2 SCL automatically because the collection is L2A, not HLS. No flag needed.
Variation — NPY instead of Zarr
If your downstream loader is np.load()-based, drop --output-format zarr:
surtgis extract-patches \
--features-dir features/ --points labels.geojson \
--label-col landslide_class --profile prithvi-v2 --size 224 \
--emit-stac chips/
You get chips/patches.npy with the same shape and dtype. Pick Zarr
when N is large enough that np.load would page-fault your machine.
Validation
After the run, verify the tensor matches what Prithvi expects:
python3 -c "
import zarr, json
X = zarr.open('chips/patches.zarr', mode='r')
print('shape:', X.shape, 'dtype:', X.dtype)
m = json.load(open('chips/meta.json'))
spec = m['gfm_profile']
print('bands:', spec['bands_order'])
print('mean:', spec['band_norm_mean'])
print('std:', spec['band_norm_std'])
print('source:', spec['source_url'])
"
You should see dtype: float32, the canonical Prithvi band order, and
the means around [1087, 1342, 1433, 2734, 1958, 1363].
When things break
- “Profile expects N bands, but M feature rasters were loaded”:
curate
features/<timestamp>/to contain exactly the bands the profile expects, in the expected order. The error message lists them. - “Band-name mismatch at timestamp ‘t_2024_02’”: all timestamp subdirs must declare the same bands. Re-run the composite step for the offending timestamp.
- “Mixed mode: top-level .tif AND subdirs with .tifs”: move the top-level files into a subdirectory. SurtGIS refuses to guess.
- STAC items have bbox in source CRS instead of WGS84: compile with
--features projections(default in precompiled binaries). The warning message tells you when this fallback triggered.
CLI reference
Every top-level surtgis subcommand, with flags and a minimal example.
Pages are generated from --help output and regenerated on every
CLI-surface change.
Top-level commands
surtgis info— Show information about a raster filesurtgis terrain— Terrain analysis algorithmssurtgis hydrology— Hydrology algorithmssurtgis imagery— Imagery / spectral index algorithmssurtgis morphology— Mathematical morphology algorithmssurtgis landscape— Landscape ecology metrics (global patch/class/landscape level)surtgis extract— Extract raster values at point locations to CSVsurtgis extract-patches— Extract raster patches centered on points or sampled from polygons for CNN trainingsurtgis clip— Clip a raster by polygon or bounding boxsurtgis reproject— Reproject a raster from one CRS to another (native, no GDAL dependency)surtgis rasterize— Rasterize a vector file to a raster grid (.geojson, .shp, .gpkg)surtgis resample— Resample a raster to match the grid of a reference rastersurtgis mosaic— Mosaic multiple rasters into one covering the union extentsurtgis cog— Read and process Cloud Optimized GeoTIFFs (COGs) via HTTPsurtgis stac— Search and fetch data from STAC catalogs (Planetary Computer, Earth Search)surtgis pipeline— Pipeline: integrated workflows for specific use casessurtgis vector— Vector geoprocessing: intersection, union, difference, dissolve, buffersurtgis interpolation— Geostatistical interpolation: variogram, kriging, universal kriging, regression krigingsurtgis temporal— Temporal analysis: trend, anomaly, phenology, statisticssurtgis classification— Classification: unsupervised/supervised raster classification (k-means, PCA, etc.)surtgis texture— Texture analysis: edge detection and GLCM texture measuressurtgis statistics— Statistics: focal, zonal, and spatial autocorrelation
Global flags
Available on every subcommand:
-v, --verbose Verbose output
--compress Compress output GeoTIFFs (deflate)
--streaming Force streaming mode for large rasters (auto-detected if >500MB)
--max-memory <MAX_MEMORY> Maximum memory to use (e.g., 4G, 1024MB, 500MiB). If raster would exceed this when decompressed, force streaming
-h, --help Print help
-V, --version Print version
Supported formats
Raster formats
| Format | Read | Write | Feature flag | Notes |
|---|---|---|---|---|
| GeoTIFF | ✓ | ✓ | (always) | Predictor 1/2/3, DEFLATE, LZW, GDAL_NODATA |
| Cloud Optimized GeoTIFF (COG) | ✓ | — | cloud | HTTP range reads; write not supported yet |
Zarr (via zarrs) | ✓ | — | zarr | Azure Blob + Planetary Computer auth |
| NetCDF | ✓ | — | netcdf | Requires libnetcdf system library |
| GRIB2 | ✓ | — | grib | Requires libgribapi or eccodes |
| JP2000, HDF4, ECW, MrSID, ERDAS .img | — | — | — | Use GDAL |
Vector formats
| Format | Read | Notes |
|---|---|---|
GeoJSON (.geojson, .json) | ✓ | |
Shapefile (.shp + sidecar files) | ✓ | Via shapefile crate |
GeoPackage (.gpkg) | ✓ | Via rusqlite |
Vector writing is not supported in v0.7.0. Clip / rasterize accept vectors as input; produce rasters as output.
Floating-point value handling
For float element types (f32, f64), the GDAL_NODATA tag value is
replaced with NaN on read. Algorithms treat NaN as “invalid” and
propagate it as such — no sentinel values buried inside floats.
On write, the same convention: NaN pixels are written out as
GDAL_NODATA values if the output format supports the tag.
For integer types (i16, u16, u8), the original nodata sentinel is
preserved.
Endianness and TIFF predictors
The native GeoTIFF reader handles both big-endian and little-endian files. Predictor 1 (no prediction), 2 (horizontal differencing), and 3 (floating-point prediction) are all supported. Predictor 3 was added in v0.6.21 specifically for Copernicus DEM GLO-30 which uses it.
Compression support
Read: DEFLATE, LZW, packbits, none.
Write: DEFLATE (via --compress flag) or none.
Write support for LZW is not planned — DEFLATE compresses better on geospatial rasters for comparable CPU cost.
Size limits
The tiff crate enforces a default size limit to prevent malicious
inputs. For large DEMs (e.g. 20,000 × 20,000 or bigger), SurtGIS sets
Limits::unlimited() internally so uncompressed reads aren’t rejected.
Files on disk can be arbitrarily large; streaming algorithms don’t care about total size.
Environment variables
SurtGIS reads the following environment variables at runtime. Most configuration is via CLI flags; env vars are for settings that are awkward to thread through every command.
SURTGIS_RAM_BUDGET_GB
Consumed by: stac composite
Default: 16
Sets the target peak-RAM budget for the STAC composite pipeline. The
command auto-sizes its strip_rows to fit within this budget, accounting
for output buffers, per-scene mask cache, active-band scene strips, band
working set, and decode overhead.
SURTGIS_RAM_BUDGET_GB=8 surtgis stac composite ... # conservative
SURTGIS_RAM_BUDGET_GB=24 surtgis stac composite ... # permissive
Real peak is typically within ±30% of the target. See Tune the RAM budget for sizing guidance.
XDG_CACHE_HOME
Consumed by: stac, cog (when --cache is enabled)
Default: $HOME/.cache (Linux), $HOME/Library/Caches (macOS)
COG tile cache is stored under $XDG_CACHE_HOME/surtgis/cog/. Override
to put the cache on a different disk:
export XDG_CACHE_HOME=/mnt/fast-ssd/cache
HOME
Used as a fallback for XDG_CACHE_HOME when the latter is unset. Also
consulted for conventional per-user config paths.
CARGO_TERM_COLOR, NO_COLOR, TERM
Standard CLI-colour env vars. SurtGIS honours NO_COLOR=1 to disable
ANSI escapes; useful for CI logs.
RUST_LOG
Controls the tracing subscriber’s verbosity when you pass --verbose.
RUST_LOG=debug surtgis --verbose terrain slope dem.tif slope.tif
RUST_LOG=surtgis_cloud=trace surtgis --verbose stac composite ...
Module-scoped levels (surtgis_cloud=trace) are useful for debugging
cloud read issues without drowning in unrelated debug output.
Variables SurtGIS does not consume
Common ones we explicitly don’t use:
GDAL_DATA,PROJ_LIB— SurtGIS’s reprojection usesproj4rswhich embeds its own grid data, not the system GDAL/PROJ installation.AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY— not used for cop-dem-glo-30 vias3://(the bucket is public). For private S3 buckets, authenticated access is not yet supported.PC_SDK_SUBSCRIPTION_KEY— Planetary Computer’s private-API key. Not consumed; SurtGIS uses the public anonymous endpoint.
Architecture
This page explains why SurtGIS is structured the way it is. If you want to
learn what flags to pass, you’re in the wrong section — try
CLI reference. If you want to know
why --streaming exists at all, keep reading.
Workspace layout
SurtGIS is a Cargo workspace of ten crates:
surtgis/
├── crates/core # Raster<T>, GeoTransform, CRS, GeoTIFF I/O, streaming, mosaic, vector
├── crates/algorithms # 250+ public functions across 10 modules
├── crates/parallel # Rayon-based strategies, maybe_rayon compile-time switch
├── crates/cli # 19 top-level subcommands, binary `surtgis`
├── crates/cloud # COG reader, STAC client, bbox reprojection, s3:// normalisation
├── crates/wasm # WebAssembly bindings (~17% of algorithms currently)
├── crates/python # PyO3 bindings: extract_at_points, predict_raster
├── crates/gui # egui desktop app (experimental)
└── crates/colormap # Color schemes and raster → PNG rendering
The split exists because the dependency graphs differ wildly. surtgis-core
has zero non-trivial deps beyond tiff and ndarray; surtgis-cloud pulls
in reqwest + tokio for HTTP; surtgis-python needs PyO3 at build time;
surtgis-gui is egui and eframe. Jamming them into one crate would force
every user — including someone who just wants Raster<f32> — to compile a
GUI framework they don’t need.
Concretely: embedding SurtGIS in a Rust program that only needs the
algorithm library pulls two crates (surtgis-core, surtgis-algorithms).
Adding STAC support adds a third. Nothing more.
The Raster<T> core
Everything starts from one type:
pub struct Raster<T> {
data: ndarray::Array2<T>,
transform: GeoTransform,
crs: Option<CRS>,
nodata: Option<f64>,
}
Generic over the element type (f32, f64, i16, u8, u16), with an
affine GeoTransform (origin + pixel size + optional rotation) and an
optional CRS. Nodata is stored separately from the data array, and the
convention across the library is: for float element types, nodata values are
replaced by NaN on read, and NaN is the only signal algorithms check for
invalid data. No sentinel values buried inside floats.
This is deliberately more restrictive than GDAL’s model, which supports
arbitrary block organisations, band counts, and offsets per band. SurtGIS
treats a multi-band raster as a Vec<Raster<T>> — one per band — and forces
each band through its own file on disk. The cost: you pay one HTTP round-trip
per band when reading from COGs. The benefit: the entire library’s worth of
algorithms doesn’t have to special-case multi-band layouts.
Native I/O via the tiff crate
A deliberate non-decision: SurtGIS does not depend on GDAL by default. It
reads and writes GeoTIFFs using the pure-Rust tiff
crate plus a thin layer in surtgis-core::io::native that handles the
geospatial tags (ModelTiepointTag, ModelPixelScaleTag, GeoKeyDirectory,
GDAL_NODATA).
Trade-offs of this choice:
Wins:
- Single binary. Ship as precompiled. No
libgdal.soversion mismatch hell. - Works on WASM. 56 of the 127 algorithms compile to
wasm32-unknown-unknownand run in any modern browser without a server. This includes the full Florinsky 14-curvature system, D8/MFD/D-infinity hydrology, 14 spectral indices, the focal-stats suite, and basic morphology. The browser path is unique to SurtGIS among comprehensive terrain libraries. - Cloud-optimised reads. The COG reader in
surtgis-cloudissues HTTP range requests directly against the TIFF byte layout; no GDAL VSI layer in between.
Losses:
- Exotic formats: HDF4, JP2, ERDAS .img, etc. GDAL reads ~150 formats; SurtGIS natively reads GeoTIFF + (optionally) Zarr + NetCDF + GRIB2. If you need Pléiades JP2000s in, GDAL is the answer.
- TIFF edge cases that GDAL’s robustness has absorbed over two decades: odd compression combos, non-standard predictor tags, malformed GeoKey directories. SurtGIS handles the common cases (predictor 1/2/3, DEFLATE, LZW, big/little endian, tiled and stripped), but not every ancient file your grandfather’s theodolite produced.
A gdal feature flag exists as an escape hatch — cargo install surtgis --features gdal
— which falls back to the GDAL crate for I/O when the native path can’t
handle something. Most users never need it.
Streaming: strip processing for arbitrarily large rasters
The StripProcessor trait in surtgis-core::streaming is where SurtGIS
handles DEMs larger than RAM. About ten window-based algorithms (Horn slope,
aspect, hillshade, curvature, Gaussian smoothing, Laplacian, and a few more)
are written against this trait rather than the in-memory Raster<T>.
A strip is a horizontal band of rows read into memory together with enough
overlap above and below (the halo) to compute window-based operations
without cross-strip coordination. For a 3×3 Horn slope, halo = 1; for
Gaussian sigma=5, halo = ceil(3σ) ≈ 15.
+--- DEM on disk ---+
| |
| strip 1 [halo] | ← read N rows + halo below
| strip 1 [data] | ← compute, write
| strip 1 [halo] |
| |
| strip 2 [halo] | ← reuse bottom halo of strip 1
| strip 2 [data] |
| strip 2 [halo] |
| ... |
+-------------------+
Peak RAM = 2 × halo × cols × sizeof(T) plus one strip’s worth of output.
For a 100,000 × 100,000 DEM at f32 with Horn slope, that’s roughly 0.8 MB
working set regardless of total file size.
Automatic detection: if a raster on disk would exceed --max-memory (or
500 MB by default) when decompressed, algorithms that support streaming
switch automatically. --streaming forces the path; --max-memory 16G
raises the threshold; passing --streaming for an algorithm that doesn’t
support it is a hard error rather than silent fallback.
The list of streamable algorithms is small on purpose — adding one requires writing the algorithm against the halo-strip iterator pattern, which is harder than writing against a fully-resident array. We only absorb that cost for algorithms people actually want to run on huge DEMs.
Parallelism: maybe_rayon
The surtgis-parallel crate provides a compile-time switch:
#[cfg(feature = "parallel")]
use rayon::prelude::*;
#[cfg(not(feature = "parallel"))]
use std::iter as rayon_fallback;
With the parallel default feature, algorithm loops use par_iter /
par_bridge. Without it (useful for WASM, where threading is a separate
setup), the same code compiles to sequential iterators. One source, two
targets.
The WASM build is currently serial because browser thread pools require SharedArrayBuffer and CORS headers that most hosts don’t provide by default. Adding WASM threading is tracked but not a priority — the per-tile work in a typical browser use case (small extents) doesn’t benefit enough to justify the deployment complexity.
STAC + COG: the cloud read path
surtgis-cloud handles the cloud-native read path, which is structurally
different from “download the whole file then process it”:
- STAC search returns a list of items, each with asset hrefs (HTTPS or
s3://) pointing at COG files on blob storage. CogReader::open(href)fetches the first ~64 KiB — enough to parse the TIFF header, IFD chain, and GeoKey directory — without reading tile data.reader.read_bbox(bbox)computes which internal tiles of the COG intersect the requested bbox, fetches exactly those tiles via HTTP range requests, decompresses them in parallel, and returns aRaster<T>of the intersection.
The key insight: for a 100 MB COG where your bbox only covers 1% of the
image, you transfer ~1 MB and decompress ~1 MB. The stac composite
pipeline composes this with strip processing so even very wide STAC extents
have bounded peak memory.
See STAC integration philosophy for why the design chose this path over the alternatives (download-then-process, or GDAL-VSI-based access).
Why workspaces pin internal versions loosely
The ten crates share a version via [workspace.package] and reference each
other with version = "0.7" (caret range) plus path = "../core". Local
builds always use the path; crates.io resolution uses the version. This
means:
- Bumping any one crate breaks the build unless they all bump together.
- Cutting a release = one version bump in the root
Cargo.toml, onecargo publishper crate in dependency order. cargo updatepicks up patch releases automatically; a minor bump (v0.7 → v0.8) requires consumers to update their dependency declaration.
This is heavier than publishing one monolithic crate, but it lets someone
embed surtgis-core in a low-footprint context without dragging in the CLI
or the cloud stack.
Memory model
SurtGIS has two distinct modes for handling raster data, and understanding when each applies is the difference between a pipeline that finishes and one that OOMs.
In-memory mode
The default. A raster is read entirely into an ndarray::Array2<T>,
algorithms operate on the array in place, and the result is written back
to disk. This is what every command does unless a streaming condition is
met.
Peak RAM ≈ input size + output size, both fully decompressed. For a
20,000 × 20,000 f32 DEM that’s 20k² × 4 bytes = 1.6 GB twice, so
about 3.2 GB working set. Most commands fit comfortably in modern host
RAM.
Streaming mode (strip processing)
For about 10 window-based algorithms (Horn slope, aspect, hillshade, curvature, Gaussian smoothing, Laplacian, fill-sinks, flow direction, and a few more), SurtGIS can process the DEM in horizontal strips with configurable overlap, reading and writing one strip at a time.
Peak RAM is independent of total raster size: roughly
2 × halo × cols × sizeof(T) plus one strip of output. For a
100,000 × 100,000 f32 DEM with Horn slope (halo = 1), that’s about
800 KB of data actively in memory at any moment, not 40 GB.
Triggered automatically when the raster on disk would exceed
--max-memory (default 500 MB). Force explicitly with --streaming.
STAC composite mode
The stac composite pipeline is neither: it operates on a stream of
scenes fetched from STAC, mosaics them into per-strip outputs, and writes
results per band. Peak RAM is bounded by a 5-component model documented
in the command’s startup log line and discussed in the
debug-stac-ram how-to.
Practical decision tree
Am I running a window-based algorithm on a DEM?
├── Yes, raster fits in RAM → in-memory mode, no action needed
├── Yes, raster exceeds --max-memory → streaming kicks in automatically
└── No, I'm running STAC composite → 5-component budget model applies,
see --band-chunk-size and
SURTGIS_RAM_BUDGET_GB
Which algorithms support streaming?
Roughly the ones where the operation is definable in a bounded window:
terrain slope, terrain aspect, terrain hillshade, terrain curvature, terrain tpi, terrain tri, terrain gaussian-smoothing,
hydrology fill, hydrology flow-direction, morphology dilation,
morphology erosion.
Algorithms that need global information (flow accumulation across an entire drainage network, watershed delineation from any-point pour, mosaic of unrelated rasters) don’t fit the strip pattern and run in-memory. For huge DEMs these would need a different approach (Dask-style tile graphs, or out-of-core sort/scan algorithms). Not currently implemented.
When to use SurtGIS vs alternatives
SurtGIS is a newer, narrower tool than the incumbents in this space. This page is the honest “should I bother?” guide. It will not tell you SurtGIS is better at everything — it isn’t.
The incumbents
| Tool | What it’s great at |
|---|---|
| GDAL | I/O breadth (~150 formats), raster transforms, the reference implementation of nearly every GIS operation |
| GRASS GIS | Hydrology, mature topological algorithms, serious spatial analysis workflows with long history |
| WhiteboxTools | Hydrology specifically, very clean algorithm implementations, good defaults |
| SAGA GIS | Terrain analysis, geomorphometry, wide algorithm selection with a visual UI |
| QGIS | User interface, visualisation, integrating all of the above |
All four are freely available, battle-tested, and vastly more mature than SurtGIS. If one of them already solves your problem, use it.
Where SurtGIS is a genuine win
You want a single-binary CLI with no system dependencies
No libgdal.so version matching. No Python environment. Works in Docker
FROM scratch. Works in CI runners without privileged installs. This
matters for:
- Reproducible benchmarks.
- Embedding geospatial processing into systems that can’t afford a package manager (industrial control, embedded, browser via WASM).
- Distributing analysis scripts that a non-GIS user can actually run. The friction from “install Anaconda then conda-forge GDAL then pip pyproj” to “download this tarball” is substantial.
You’re processing rasters larger than your RAM
The streaming strip processor (see Memory model) runs
slope, aspect, hillshade, curvature, fill-sinks, flow direction, Gaussian
smoothing, and ~10 others with bounded RAM regardless of input size. GDAL
and WhiteboxTools can also do this, but typically require explicit block
processing setup. SurtGIS does it automatically when --max-memory would
be exceeded.
You want end-to-end STAC workflows with no glue code
surtgis stac composite --catalog pc --asset red,nir,swir16,swir22 ...
searches Planetary Computer, downloads only the tiles intersecting your
bbox via HTTP range requests, cloud-masks with SCL, and median-composites
across dates. One command. Peer tools (rasterio + pystac-client +
rioxarray + dask) can match this but typically require 30+ lines of Python
glue and an understanding of Dask.
You need terrain + hydrology on the same binary
Many projects combine DEM preprocessing with spectral analysis. Mixing WhiteboxTools for hydrology, GDAL for I/O, and a Python notebook for NDVI is normal but operationally painful. SurtGIS covers all three in one tool, which matters most for scripts that run daily and need to be debuggable years later.
You want honest, reproducible benchmarks against GDAL/GRASS
The paper cross-validates SurtGIS outputs against GDAL 3.11, GRASS 8.3, and WhiteboxTools on identical DEMs, and publishes the tolerances as tables. RMSE vs GDAL for slope is 1.6×10⁻⁴ degrees. That level of validation is a design choice and is part of the CI today.
Where the incumbents beat SurtGIS
You need a format SurtGIS doesn’t read natively
Netcdf, GRIB2, and Zarr are supported but gated behind feature flags that require system libraries on some platforms. HDF4, JP2, ECW, MrSID, ERDAS .img: not supported at all. Reach for GDAL.
You need mature, edge-case-hardened algorithms
GRASS’s hydrology has been tuned for 25 years against flat terrain, coastal areas, pit-fill strategies, and every weird DEM edge case you’ve ever seen. SurtGIS implements the standard algorithms (D8, D∞, priority-flood fill, watershed-from-pour-points) and validates them against GRASS on common cases, but any production pipeline that hits edge cases will hit them first on SurtGIS, because GRASS has already been debugged on them.
You need a GUI
surtgis-gui exists but is experimental, has no real user base, and is
not the project’s priority. QGIS is a dramatically better interactive
environment. SurtGIS integrates cleanly as a QGIS algorithm provider if
you want the CLI-backed pipeline with the QGIS UX.
You need Python interoperability beyond a handful of functions
surtgis-python exposes extract_at_points and predict_raster for ML
pipelines. That’s it. If you’re building on numpy/xarray/rasterio and
want the full toolbox, rasterio + rioxarray + pyproj + scikit-image is the
mature stack. SurtGIS is for Rust-first users who occasionally want to
bridge to Python.
You need a wider algorithm surface than 250 functions
That’s a lot, but GRASS has ~500 modules and GDAL has ~100 high-level operations on top of its I/O. If your workflow uses obscure morphological filters, multi-criteria decision analysis, or mature geostatistical tools beyond IDW/kriging/natural-neighbour, the incumbents have more.
A decision rule
A blunt three-question filter:
- Is your data in GeoTIFF (+ maybe Zarr/NetCDF/GRIB2 with feature flags)? If not, stop here and use GDAL.
- Do you want your pipeline to be one binary that installs in 10 seconds and runs anywhere? If not, GRASS + WhiteboxTools give you more at the cost of install complexity.
- Are you writing a Rust program that needs geospatial primitives
embedded? Here SurtGIS has no real competition —
gdal-rustbindings exist but the FFI surface and build requirements are hostile.
If you answer yes–yes–maybe, SurtGIS is a strong fit. Otherwise you’re probably better served by the incumbents, and we’d rather you use them and come back to SurtGIS when a concrete gap bites.
Where we’re honestly worse right now
- Docs. This book is the first attempt at a real user guide; the incumbents have decades of accumulated tutorials, stackoverflow answers, and textbook chapters.
- Community. No Slack / mailing list / forum. One postdoc in production; no GitHub stars to speak of.
- Mature error messages. GDAL has tuned its error text over two decades; SurtGIS still emits some Rust-ish errors that won’t mean much to a GIS user.
- Edge-case robustness. Every weird TIFF file in the world has been thrown at GDAL. SurtGIS has been thrown at a few thousand.
These are not bugs; they are the cost of being young. If you hit one, please report it — that’s the only way the list gets shorter.
STAC integration philosophy
The surtgis stac command group is designed around one principle:
fetch the minimum bytes you need, compose them into the output the user
actually asked for, in one command.
This rules out the two obvious alternatives:
-
Download the whole file, then process. A full Sentinel-2 MGRS tile is ~700 MB. Over a year that’s hundreds of GB of transfer for a workflow that might only care about a 5 km × 5 km bbox. The COG format exists specifically so we don’t have to do this.
-
Use GDAL’s VSI layer. GDAL can read COGs over HTTP (
/vsicurl/https://...). This works, but puts GDAL between SurtGIS and the byte layout of the file, which means any bug or performance quirk in GDAL’s VSI cache shows up as mysterious SurtGIS behaviour. Keeping the COG reader insurtgis-cloudmeans when something is wrong we can debug it, not just report it to GDAL upstream.
What the pipeline does
surtgis stac composite walks this chain:
- Search. Hit the STAC API for items matching bbox + datetime + collection. Page through results, honouring per-catalog page-size limits (PC=1000, ES=250).
- Rank and select. From the candidates, pick
--max-scenesby coverage-aware ranking. - Per scene: resolve asset URLs. Each band and the cloud mask become
a signed or unsigned HTTPS URL (or
s3://for Earth Search — which SurtGIS rewrites tohttps://bucket.s3.amazonaws.com/keyinternally). - Per strip of output: fetch the tiles that intersect. COG tiles are typically 512² or 1024² pixels. For a strip of output, the reader computes which internal tiles to pull, issues parallel HTTP range requests, decompresses.
- Mosaic within scene. If the bbox crosses UTM zones, the scene’s
tiles may be in different CRSes;
surtgis-cloud::reprojectaligns them beforesurtgis_core::mosaicstitches. - Mask clouds. Apply the scene’s SCL (Sentinel-2) or QA_PIXEL (Landsat) to nullify cloudy pixels.
- Median across scenes. For each output pixel, the median of the non-null values across all selected scenes.
- Temporal + spatial fill. If a pixel is still null after the median (every scene was cloudy there), try temporal fallback from the least-cloudy scene, then 3×3 spatial mean.
Why median and not mean
Median is robust to cloud-mask leakage. A single bright cloud pixel that slipped past SCL distorts a mean significantly but affects a median of 3+ values almost not at all. Cost: slightly noisier output on very sparse scene counts (N=3 gives a noisier median than N=20).
Why per-strip processing
The composite for a 100 km × 100 km extent at 10 m resolution is
10,000 × 10,000 × n_bands × 8 bytes = 8 GB for 10 bands. Holding that
plus all the per-scene intermediate arrays in RAM blows any host. Strip
processing bounds the peak regardless of extent size.
The trade-off was learned expensively over v0.6.19 through v0.6.26 — see the post-mortem for the full story.
What’s intentionally not supported
- Arbitrary temporal aggregations. Median only. Mean, max, percentile are easy to add if someone asks; nobody has yet.
- Sub-pixel reprojection. Bilinear resampling is the only option. If
you want cubic or Lanczos for a composite, reach for GDAL’s
gdalwarpafter the fact. - Streaming output. The output band rasters are written whole at the
end, not incrementally. For extents where the output itself exceeds RAM
(very rare in practice, requires
>> 100GB bbox coverage), you need a different tool.
Tectonic geomorphology with the fluvial module
The five algorithms in surtgis fluvial answer different questions
about a river network. This chapter explains which algorithm to
reach for when, gives the briefest necessary geomorphological
context for a developer who doesn’t have it, and documents the
parameter tuning that the spec calls out as easy to get wrong.
For the runnable end-to-end sequence, see
Fluvial-tectonic morphometry from a DEM.
For the formal references, see the spec at
docs/SPEC_morfometria_fluvial_tectonica.md in the repo.
The framework in one paragraph
The stream-power law (Howard 1994, Whipple & Tucker 1999) postulates
that a river’s incision rate scales as E = K · A^m · S^n with
A the drainage area, S the local slope, and K an erodibility
constant determined by lithology and climate. In topographic steady
state — incision balanced by tectonic uplift — this inverts to a
slope-area scaling S = ks · A^(-θ) with θ = m/n. Fixing θ at a
reference value (0.45 is the convention for bedrock channels) and
solving for ks gives the normalised channel steepness ksn, which
is comparable across basins. Departures from this clean picture
— knickpoints, anomalous concavity, divide asymmetry — flag transient
landscape response or non-uniform erodibility.
That’s the framework. The five algorithms each measure a different facet of it.
When to use what
| Question | Algorithm | Output |
|---|---|---|
| “Is this basin in steady state? Where’s the base level?” | chi | Per-cell χ raster |
| “Which channels are climbing fastest? Where’s the high U/K?” | ksn | Per-cell raster + per-segment vector |
| “Are there transient pulses propagating upstream?” | knickpoints | Vector points with polarity |
| “Is θ for this basin actually 0.45?” | concavity | Per-basin θ with bootstrap CI |
| “Are basin boundaries migrating?” | divide-migration | Vector lines with Δχ |
In the order they’re typically computed: chi → ksn → knickpoints in parallel; concavity once basins are defined; divide-migration last because it consumes the χ raster from the first step.
chi (χ)
The path integral of (A₀/A(x))^θref along the channel from a base
level upward. Geometric: it’s a base-level-anchored reference distance
that climbs faster where drainage area is small (i.e. headwaters
“count more”). Plotted as elevation vs χ, a steady-state profile is a
straight line whose slope equals ksn.
When χ alone is enough: discriminating divide asymmetry, comparing sub-basins, checking whether the network has reached topographic steady state (look for the elevation~χ curve being roughly linear).
Key parameter: --theta-ref (default 0.45). Almost never worth
changing for inter-basin comparison; the literature standard exists
because comparing across studies requires it. If you want a
basin-specific θ, run concavity instead.
ksn (channel steepness)
ksn = S · A^θref per cell, smoothed along the channel over a window
(default 500 m). High ksn = either high uplift or resistant bedrock or
both.
When ksn is the right primary metric: you have a basin or a region and want to map relative U/K spatially. ksn is the workhorse — every tectonic-geomorphology paper has a ksn map.
Tuning that bites:
--segment-length-m(default 500 m): the smoothing window. With a 10–30 m DEM, 500 m is the literature standard. With a 90 m DEM, bump to 1000 m or accept noise.--min-drainage-area-m2(default 1 km²): cells below this give unstable ksn (small numerator, noisy slope). Don’t lower without a reason.--theta-ref: must match what you’d compare against in the literature. Don’t tune this per dataset unless you’re explicit.
ksn requires channel-following slope, not the 2-D terrain slope
output of surtgis terrain slope. The fluvial module computes the
correct version internally; if you find yourself reaching for
terrain slope to feed an analysis, stop — you’ve gone off the
geomorphology rails.
knickpoints
Sharp breaks in the slope of a river’s long profile. A knickpoint can be a transient wave-form response to a tectonic perturbation propagating upstream, or a stationary feature anchored to a lithologic contrast. SurtGIS classifies each detection by polarity:
- Convex (slope increases downstream) — most often transient, flag for tectonic interpretation.
- Concave (slope decreases downstream) — most often lithology-pinned, expect a corresponding contact on the geology map.
The method (Neely et al. 2017) total-variation-denoises the χ-z
profile, computes the discrete second derivative d²z/dχ², and flags
cells where the magnitude exceeds a threshold AND the elevation drop
across a small window exceeds min_magnitude_m. Cells within 5 of a
segment end are excluded (default) because confluences and outlets
induce spurious curvature.
Tuning that bites:
--tvd-lambda(default 0.5): smoothing strength. Too low → every noise spike is a knickpoint. Too high → real knickpoints smoothed away. Default tuned for 10–30 m DEMs.--min-magnitude-m(default 10): the elevation-drop guard. Raise to 30+ if you’re seeing dozens of false positives in pristine headwaters; the real knickpoints are usually >> 30 m.--curvature-threshold(default 1.0): controls how sharp a break has to be. Most users never touch this; tunetvd_lambdaandmin_magnitude_mfirst.
concavity (θ per basin)
Grid-searches θ over [0.1, 0.9] for each basin, picking the value that linearises the elevation~χ regression best (minimum RMSE), then bootstraps for a 95 % CI.
When: you want to test whether assuming θ = 0.45 is reasonable for your basins, OR you want θ per basin as its own measurement (some geomorphologists treat it as a tectonic signal independent of ksn).
What the values mean:
- θ ∈ [0.4, 0.6]: textbook bedrock channel in steady state. Boring, confirmatory.
- θ < 0.3 or > 0.7: interesting. Could indicate transient response (Mudd et al. 2018), non-uniform erodibility, or a non-uniform uplift pattern. Investigate.
- Bootstrap CI that spans 0.2 or more: the basin has too few channels or too much scatter to pin down θ. Don’t over-interpret.
Tuning:
--bootstrap-n(default 200): more iterations = tighter CI, more compute. 200 is enough for most basins; 1000 if you need publication-grade error bars.--min-basin-cells(default 30): below this the estimator is unreliable.
divide-migration
For every pair of adjacent basins, compute the median asymmetry across their shared divide: Δχ (Willett 2014), Δelev (Gilbert metric), Δrelief.
When: you suspect basin geometries are not stable. Active tectonic uplift, base-level fall, or differential erosion all push divides toward the “losing” basin. A sign in Δχ across a long divide is the canonical detection.
Interpretation:
- Δχ ≈ 0 along the divide: basins in (chi) equilibrium, no migration signal.
- Consistent Δχ > 0 with basin_a (smaller numeric ID by convention) on the higher-elevation side: basin_a is being “captured” by basin_b. The basin with lower χ at the divide is winning.
- The Gilbert Δelev / Δrelief metric is the older indicator — useful as cross-check but Δχ is the modern primary.
Tuning:
--min-divide-length-m(default 500): suppress divides too short to be statistically meaningful. Raise to 2000+ for regional studies.--chi(optional but recommended): provide the χ raster fromsurtgis fluvial chi. Without it, you only get the Gilbert metrics — usable but weaker.
Pitfalls the spec calls out
These bit the spec author often enough that they’re in the spec’s §8. They will bite you too.
-
Cell size > 30 m: ksn and knickpoints are sensitive to DEM resolution. With FabDEM at 25 m you’re fine; with SRTM at 90 m you’ll get noisier results that bias toward false positives.
--cell-size-m Noverrides the auto-detection; the algorithms warn when cell > 30 m. -
Stream threshold matters: too low and convergent noise from zero-order hollows contaminates the analysis; too high and you miss real tributaries. 1000 cells at 30 m (≈ 0.9 km²) is a sensible default; for arid regions go higher, for humid regions lower.
-
Short tributaries are unreliable: by default the algorithms filter cells with drainage area below 1 km² or segments with fewer than ~20 cells. Don’t relax these without a reason.
-
NoData propagates upstream: a single missing cell in the DEM knocks out χ for everything upstream. Fill or interpolate carefully.
-
Boundary outlets: outlets at the raster edge get partial trajectories. The algorithms flag and exclude them from medians; if your AOI is dominated by boundary outlets, expand the bbox.
-
θ may not be unique: the RMSE landscape can be flat. The bootstrap CI is honest about this — read it.
-
TVD λ at the wrong scale: see the knickpoints tuning section above.
-
CRS must be projected: every fluvial algorithm assumes metres. EPSG:4326 inputs are rejected with a clear error; reproject to UTM (or any local projected CRS) first.
-
Confluence noise in knickpoints: confluences cause sudden area discontinuities that look like knickpoints to the curvature detector. The default 5-cell buffer at segment ends filters this; widen it if you still see clustering at junctions.
When SurtGIS isn’t the right tool
For interactive stream-network pruning (drawing on a map, removing specific channels), use TopoToolbox 2 (MATLAB) or pyTopoToolbox. SurtGIS is batch-only; the interactive use case isn’t its sweet spot.
For landscape evolution modelling (TTLEM / Landlab), likewise — those are a different category of tool and out of scope for the spec.
For prospectivity mapping or other geological-AI workflows, the SurtGIS GFM preprocessing pipeline is the right entry point; fluvial outputs can be inputs to it but shouldn’t be its terminus.
References
- Howard (1994) WRR 30(7) — detachment-limited stream-power law
- Whipple & Tucker (1999) JGR — stream-power model dynamics
- Wobus et al. (2006) GSA SP 398 —
ksnoperational paper - Perron & Royden (2013) ESPL 38, 570 — χ-transform fundacional
- Willett et al. (2014) Science 343, 1248765 — divide reorganisation
- Neely et al. (2017) EPSL 469 — knickpoint detection method
- Whipple, Forte et al. (2017) JGR-Earth Surface 122 — divide migration
- Mudd et al. (2018) ESurf 6 — variable θ as signal
- Schwanghart & Scherler (2014) ESurf 2 — TopoToolbox 2 (the reference implementation in MATLAB; SurtGIS targets parity with its main outputs)