Perspective

We ran Mask3D on a terrestrial scan it was never trained on

An honest field report on running an aerial-trained 3D instance segmentation model against a ground-based LiDAR capture of a small civic building. What worked, what broke, what it means for clients evaluating off-the-shelf 3D AI.

Jim Coleman

This is a research note from the lab, not a sales pitch. We ran Mask3D — a transformer-based 3D instance segmentation model from RWTH Aachen — against a colorized terrestrial LiDAR capture of an Idaho administrative building. The model was never trained on terrestrial data. We used the STPLS3D checkpoint, which was trained on synthetic aerial photogrammetry. We wanted to know what would happen.

You can scrub the output yourself — the second viewer on the page is the classified result. Below is what it took to get there, what the model actually saw, and what I’d tell a client who asked “can we just point an off-the-shelf model at our scans?”

The setup

The capture is a Trimble X9 scan I did for fun last summer. About 24M points, colorized RGB, registered into a single coordinate system, exported as a .las file. The original is multi-gigabyte; the public-facing viewer shows the whole thing as a Cloud-Optimized Point Cloud (COPC) streamed from S3.

For this experiment I cropped a 50 m × 50 m square around the building — roughly 600,000 points after a 10 cm voxel downsample. Single block, single GPU pass.
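For readers who want to see what that crop and downsample amounts to, here is a minimal numpy sketch. It is not what we ran (the pipeline uses PDAL for this step). The function name, the parameter names, and the choice to keep the first point found in each voxel are all assumptions for illustration.

```python
import numpy as np

def crop_and_voxel_downsample(xyz, center_xy, half_size=25.0, voxel=0.10):
    """Keep points inside a square around center_xy, then one point per voxel.

    Hypothetical helper: half_size=25 m gives a 50 m x 50 m crop and
    voxel=0.10 m matches the 10 cm downsample described in the text.
    """
    xyz = np.asarray(xyz, dtype=np.float64)
    # Square crop in the XY plane.
    offset = np.abs(xyz[:, :2] - np.asarray(center_xy, dtype=np.float64))
    cropped = xyz[np.all(offset <= half_size, axis=1)]
    # Integer voxel index per point; points with equal indices share a voxel.
    keys = np.floor(cropped / voxel).astype(np.int64)
    # Keep the first point encountered in each occupied voxel.
    _, first = np.unique(keys, axis=0, return_index=True)
    return cropped[np.sort(first)]
```

Keeping one real point per voxel (instead of averaging) preserves the original coordinates and colors, which makes it easier to map predictions back onto the source cloud later.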

The hardware on my desk is a single RTX 4090 in an Arch Linux box. The whole pipeline runs in a Podman container with NVIDIA’s CDI device passthrough. No cloud GPU rental, no SaaS, no manual labeling. Total wall-clock time from “this is the input file” to “here is a colored point cloud” was about four hours, of which forty minutes was actual compute and the rest was fighting Mask3D’s environment.

The model

Mask3D treats 3D instance segmentation the way DETR treats 2D object detection: a sparse convolutional backbone (built on MinkowskiEngine) extracts point features, a transformer decoder refines a set of object queries against those features, and each query emits a class label and a binary mask over the input points. It set the state of the art on indoor benchmarks (ScanNet, S3DIS) when it was published and is competitive on the outdoor STPLS3D benchmark.
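To make the query-to-mask idea concrete, here is a toy numpy sketch of the final decoding step only. It assumes each query and each point have already been reduced to feature vectors of the same width, and it scores every (query, point) pair with a dot product. The real model wraps this in learned layers and attention, so treat the function and argument names as illustrative, not as Mask3D's API.

```python
import numpy as np

def queries_to_masks(query_emb, point_feat, class_logits, mask_threshold=0.5):
    """Toy decoding step: one class label and one binary point mask per query.

    query_emb:    (Q, D) query embeddings (stand-in for decoder output)
    point_feat:   (N, D) per-point features (stand-in for backbone output)
    class_logits: (Q, C) per-query class scores
    """
    query_emb = np.asarray(query_emb, dtype=float)
    point_feat = np.asarray(point_feat, dtype=float)
    # One mask logit per (query, point) pair via a dot product.
    mask_logits = query_emb @ point_feat.T            # (Q, N)
    mask_prob = 1.0 / (1.0 + np.exp(-mask_logits))    # sigmoid
    masks = mask_prob > mask_threshold                # (Q, N) boolean masks
    labels = np.asarray(class_logits).argmax(axis=1)  # (Q,) class ids
    return labels, masks
```

The useful thing to notice is the output shape: every query proposes a mask over all N points, so masks from different queries can overlap and have to be reconciled afterwards.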

We picked the STPLS3D-trained checkpoint specifically because:

  1. It’s the closest published checkpoint to the geometric scale and content of a building exterior.
  2. STPLS3D’s class taxonomy includes the things we’d expect to see in this scan — buildings, vegetation, fences, light poles, vehicles.
  3. The community treats it as a reasonable proxy for outdoor capture even though it’s trained on synthetic aerial photogrammetry.

That third point is the interesting one. STPLS3D is aerial and synthetic. Our capture is terrestrial and real. The geometry of how points distribute around objects is fundamentally different: an aerial capture sees rooftops; a terrestrial scanner sees facades. Terrestrial point density also falls off with distance from the scanner head instead of staying roughly uniform across the scene.

This is exactly the kind of distribution shift that breaks production ML.

What worked

The model successfully classified 433,987 of 602,835 points (72%). The remaining 28% landed in the “unclassified” bucket, which on inspection is almost entirely ground — Mask3D’s STPLS3D loader is configured to filter out the ground class because in the training set it dominates everything. So “unclassified” here means “the model wasn’t asked to label this.” That’s correct behavior.

Of the 750 instance proposals the model emitted, 73 cleared a 0.5 confidence threshold. The class distribution:

  • 309,355 points classified as building (51% of the cloud)
  • 69,175 as high vegetation (11%)
  • 54,457 as fence (9%)
  • 645 as low vegetation
  • 349 as light pole
  • 6 as street sign
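Getting from instance proposals to the per-point counts above means thresholding the proposals and collapsing their masks into one label per point. A minimal sketch of that step, assuming boolean masks of shape (proposals, points) and using the highest-confidence proposal wherever masks overlap. The function name and the -1 sentinel are hypothetical, and the exact comparison at the threshold is our guess.

```python
import numpy as np

UNCLASSIFIED = -1  # hypothetical sentinel: no surviving proposal covers the point

def proposals_to_point_labels(masks, classes, scores, min_score=0.5):
    """Collapse per-instance masks into one class id per point.

    masks: (P, N) bool, classes: (P,) int, scores: (P,) float.
    Where surviving masks overlap, the highest-scoring proposal wins.
    """
    masks = np.asarray(masks, dtype=bool)
    classes = np.asarray(classes)
    scores = np.asarray(scores, dtype=float)
    labels = np.full(masks.shape[1], UNCLASSIFIED, dtype=np.int64)
    keep = scores >= min_score
    if not keep.any():
        return labels
    masks, classes, scores = masks[keep], classes[keep], scores[keep]
    # Score of each proposal at the points it covers, -inf everywhere else.
    per_point = np.where(masks, scores[:, None], -np.inf)   # (P_kept, N)
    winner = per_point.argmax(axis=0)
    covered = masks.any(axis=0)
    labels[covered] = classes[winner[covered]]
    return labels
```

With only a few dozen proposals surviving the threshold, the dense (proposals, points) score array stays manageable for a cloud of this size; a much larger scene would want a loop over proposals instead.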

Standing in front of the viewer with the classified COPC loaded, the building mask is genuinely impressive. The model identified the brick walls, the roof, the chimneys, and most of the architectural detail as a single coherent “building” class. The trees behind the structure came out as high vegetation. The picket-style fence around the perimeter came out as fence. With no terrestrial training data, no labels, no fine-tuning.

This matters because most of the value of a 3D classifier in a buildings context is exactly this: separating “the building” from “everything else around it.” Once you have a clean building mask, you can do change detection, energy modeling, facade analysis, or just clean up the cloud for downstream BIM work. The model gets that 80% right, out of the box, on a sensor it never saw during training.

What broke

Several things, in roughly increasing order of how much they’d matter to a client.

The light poles weren’t really light poles. STPLS3D’s “light pole” class was trained on tall, slender, isolated vertical structures viewed from above. Our terrestrial scan sees the same shape in the chimneys on the building. The model labeled the chimneys as light poles with high confidence. This is a textbook distribution shift failure: the geometry matches what the model learned, but the semantic context is wrong.

The fences are a mess. STPLS3D fences are typically tall, dense, perimeter structures. Our scan has a low picket fence that the scanner caught from one side. Some of it correctly came out as fence. Most of it got swept into “building” because it’s spatially close to the wall.

A bench got labeled as a street sign. Six points, low confidence, but it’s a tell. The model is reaching for the closest training class for any compact upright object.

The “ground” is not really ground. It’s grass, gravel, sidewalk, and a parking apron, all conflated into “no class.” A terrestrial-trained model would distinguish these. Mask3D-on-STPLS3D can’t, because the training set never asked it to.

Inference is slow. 3:25 of GPU time on a 4090 for 600k points. That’s fine for an experiment, but it works out to nearly six minutes per million points, which means a full multi-gigabyte building scan would take an hour-plus. There’s a lot of headroom (we ran with num_workers=0 to avoid out-of-memory issues, voxelized to 33cm, and ran a single block), but it’s a real number to plan around.

The setup cost is brutal. Mask3D ships with a snapshot of dependencies from late 2022 (PyTorch 1.13, MinkowskiEngine 0.5.4, an older Detectron2, a custom-built pointnet2 CUDA op). Most of those won’t compile against current CUDA toolchains without intervention. Getting a clean container takes a serious afternoon. Most of the four hours was here, not in inference. None of this is the model’s fault, but it’s a real-world cost.

If you’re a buyer, treat 2022-vintage research code as carrying about as much operational debt as a system from 2018. The model is good. The packaging is not.

The honest read for a client

If you’re sitting on a pile of point clouds and wondering whether a model like this can save you the cost of manual labeling, the answer in 2026 is: partially, and with caveats you have to plan for.

What you get for free, today:

  • A solid building/non-building separation.
  • Reasonable vegetation segmentation.
  • Bulk classification of the easy 70-80% of any outdoor scene.
  • Per-instance object proposals with confidence scores you can threshold.

What you don’t get for free:

  • Domain-correct labels. An aerial-trained model will hallucinate labels that match its training prior, not your sensor’s reality.
  • Speed. Plan for hour-scale inference per scan on a single GPU.
  • A “drop-in” model. You will spend non-trivial engineering time getting the inference environment to run.
  • Anything resembling a quality guarantee. You need a human review step.

What’s worth the investment, today:

  • Fine-tuning on your data. A few hundred labeled instances of your sensor, your scenes, would fix most of the failure modes above. The base model has clearly learned generalizable 3D feature extraction; the issue is the classifier head trained on the wrong distribution.
  • Treating off-the-shelf inference as a pre-labeling step. Run Mask3D, threshold at 0.5, hand the result to a human reviewer who fixes the obvious mistakes. The reviewer goes 5-10x faster than starting from raw.
  • Investing in a sensible MLOps wrapper if you’re going to do this at any scale. The model needs to be containerized once and run many times.

The actual answer to “can we use AI on our point clouds?” is the same answer as “can we use AI for X?” everywhere else right now. You can use it, you should use it, and you should be honest about the gap between the demo and the deployment.

A second pass: cheap geometry beats more model

Looking at the v1 output, the most obvious failure modes weren’t subtle. Half a dozen “high vegetation” predictions sitting six inches off the ground. Tall trees labeled “fence.” A solid 28% of the cloud left unclassified because Mask3D’s STPLS3D loader explicitly drops the ground class during training.

The instinct in machine learning is to reach for the model: better checkpoint, fine-tune, more data. We did the opposite first. We ran two well-known classical algorithms on the raw point cloud and used their outputs to sanity-check the model.

Step 1. Run PDAL’s Cloth Simulation Filter on the source LAS to identify ground points. CSF drapes a virtual cloth over the inverted point cloud and tags whatever the cloth comes to rest on. It’s been around since 2016, has no learned parameters, and does one thing well.

Step 2. Run hag_nn (Height Above Ground via Nearest Neighbour), using CSF’s ground points as the reference surface. Every non-ground point gets a HeightAboveGround value: how far up it sits relative to the nearest ground point.

Step 3. Apply a five-rule decision table on top of Mask3D’s predictions:

if csf_says_ground:
    refined = "Ground"                                         # trust CSF
elif mask3d == "HighVeg" and hag < 0.3:
    refined = "Ground"                                         # tree on the ground = grass
elif mask3d == "HighVeg" and hag < 1.5:
    refined = "LowVeg"                                         # tree at knee height = bush
elif mask3d == "Fence" and hag > 4.0:
    refined = "HighVeg"                                        # 4m fence = tree
elif mask3d == "Unclassified" and hag > 5.0:
    refined = "HighVeg"                                        # tall mystery = canopy
else:
    refined = mask3d                                           # leave it alone

That’s the entire post-processing pass. The results, on the same 50m chunk:

| Metric | Mask3D only | + CSF/HAG rules |
|---|---|---|
| Points classified | 433,987 (72%) | 555,874 (92.5%) |
| Ground points recovered | 0 | 102,954 |
| “High veg” demoted to grass/bush | n/a | 1,660 |
| Tree-shaped “fence” predictions reclassified | n/a | 17,773 |
| Missed canopy points labeled HighVeg | n/a | 46,329 |

The 28% unclassified hole essentially closed. The most embarrassing visual failure (trees-as-fence) went away. And the cost was about a minute of CPU time and forty lines of Python.
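The decision table is written per point, but on a few hundred thousand points you would apply it array-wise. A minimal numpy sketch of the same rules, under the assumption that labels arrive as class-name strings and HAG in metres; the function name is hypothetical. Rules are applied in reverse priority order so the result matches the first-match-wins behavior of the if/elif chain.

```python
import numpy as np

def refine_labels(mask3d, hag, csf_ground):
    """Array-wise version of the if/elif table: the first matching rule wins."""
    mask3d = np.asarray(mask3d, dtype=object)        # per-point class names
    hag = np.asarray(hag, dtype=float)               # HeightAboveGround, metres
    csf_ground = np.asarray(csf_ground, dtype=bool)  # True where CSF says ground
    rules = [  # (condition, new label), same priority order as the table
        (csf_ground, "Ground"),
        ((mask3d == "HighVeg") & (hag < 0.3), "Ground"),
        ((mask3d == "HighVeg") & (hag < 1.5), "LowVeg"),
        ((mask3d == "Fence") & (hag > 4.0), "HighVeg"),
        ((mask3d == "Unclassified") & (hag > 5.0), "HighVeg"),
    ]
    refined = mask3d.copy()
    # Conditions are computed from the original labels; applying them in
    # reverse lets higher-priority rules overwrite lower-priority ones.
    for condition, label in reversed(rules):
        refined[condition] = label
    return refined
```

Keeping the rules as a list of (condition, label) pairs also makes it cheap to log how many points each rule touched, which is where the per-rule counts in the table come from in spirit.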

Why this matters as a pattern, not just a fix. The interesting bit isn’t “we cleaned up a viewer.” It’s that for production point-cloud workflows, the right architecture for the next 12-18 months is almost certainly modern 3D model + classical geometric features as a sanity layer, not modern 3D model on its own.

The model is good at the perceptual judgements humans struggle to encode as rules — what’s a building shape, what’s a tree shape, where instance boundaries fall. Classical filters like CSF and HAG are good at the geometric facts that are tedious to label and easy to compute — what’s the ground, how far up is each point, what’s the local roughness. The two together give you something stronger than either alone, and the engineering effort is small.

The same reflex applies in non-point-cloud AI work. The model gives you a hypothesis. The classical layer gives you the audit trail.

For the consulting brief: if a client comes to you wanting to “use AI on their scans,” the answer almost always involves at least a CSF or SMRF pass before or after the model. Anyone selling pure-AI as the whole pipeline is leaving 20+ points of accuracy on the floor for no good reason.

The one fix we didn’t attempt in v2 is the chimney-as-light-pole bug. Catching that requires real spatial reasoning (“is this slim vertical thing inside the convex hull of a building cluster?”), which is more than a one-line numpy filter. It’s an obvious next step but not a five-line one.

v3: the roof problem

The biggest visual failure left over from v2 wasn’t the chimneys. It was the roof.

From the terrestrial scanner’s perspective, large patches of the building’s slate roof read as “high vegetation.” The texture is busy, the colour is dark, and the model — trained from above on aerial photogrammetry where roofs and canopies are surprisingly easy to confuse — defaults to “tree.” About 150,000 points of clearly-building rooftop sat there in the v2 viewer painted bright green.

This is also a spatial reasoning problem, but a much simpler one than chimney containment. The fix:

  1. Take every point Mask3D did label as “Building” (the walls, gables, and most of the roof — not the misclassified parts). Rasterise them into a 1m × 1m occupancy grid in the XY plane. For each cell, remember the maximum Z value of any building point in it. That’s the building’s roofline at that location.
  2. For every non-Building, non-Ground point, look up its (x, y) cell in the grid. If the cell has a meaningful number of building points in it (so we know we’re inside the footprint), and the point’s Z is at or above the cell’s roofline (within a few-meter window), reclassify the point as Building.

That’s the entire rule. Forty-ish lines of numpy, runs in a couple of seconds.

It catches several failure modes in one pass:

| Failure mode | Source class | Points reclassified |
|---|---|---|
| Roof read as “high vegetation” | HighVeg | 21,975 |
| Roof read as “fence” | Fence | 1,376 |
| Small roof holes left unclassified | Unclassified | 3,377 |
| Chimneys read as “light pole” | LightPole | 13 |
| Miscellaneous points on the roof | LowVeg, StreetSign | 54 |

26,795 reclassifications at 10cm. Densified to the 3cm viewer cloud, that’s roughly 150,000 visible roof points that snapped from green to building-orange.

The chimney-as-light-pole bug got fixed almost incidentally. Chimneys sit inside the building’s 2D footprint and rise only a small distance above the roof, so the rule’s Z window covers them. The 13 LightPole-to-Building reclassifications at 10cm are exactly that.
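For reference, the footprint rule described above can be sketched in numpy roughly as follows. This is a reconstruction from the prose, not our production script: the function name is hypothetical, labels are assumed to be class-name strings, and the defaults (1 m cell, 4 building points per cell, a Z window from 0.3 m below to 4 m above the roofline) are this site's calibration values.

```python
import numpy as np

def footprint_roof_rule(xyz, labels, cell=1.0, min_pts=4, z_lo=-0.3, z_hi=4.0):
    """Relabel points that sit in a building footprint cell, near its roofline."""
    xyz = np.asarray(xyz, dtype=float)
    labels = np.asarray(labels, dtype=object)
    is_bld = labels == "Building"
    out = labels.copy()
    if not is_bld.any():
        return out  # nothing to anchor the footprint on
    # Integer XY cell index per point, shifted so indices start at zero.
    ij = np.floor(xyz[:, :2] / cell).astype(np.int64)
    ij -= ij.min(axis=0)
    nx, ny = (int(v) for v in ij.max(axis=0) + 1)
    flat = ij[:, 0] * ny + ij[:, 1]
    # Per-cell building point count and roofline (max Z of building points).
    count = np.bincount(flat[is_bld], minlength=nx * ny)
    roof = np.full(nx * ny, -np.inf)
    np.maximum.at(roof, flat[is_bld], xyz[is_bld, 2])
    # Height of every point relative to its cell's roofline; cells with no
    # building points give +inf here and therefore never pass the window test.
    dz = xyz[:, 2] - roof[flat]
    hit = (
        ~is_bld
        & (labels != "Ground")
        & (count[flat] >= min_pts)
        & (dz >= z_lo)
        & (dz <= z_hi)
    )
    out[hit] = "Building"
    return out
```

The density floor is what keeps a stray building-labeled point from turning a whole cell into "footprint"; the Z window is what lets chimneys through while leaving trees that overhang the roof by more than a few metres alone.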

Cumulative scoreboard, end-to-end:

| Stage | Classification rate | Time cost | Notes |
|---|---|---|---|
| Mask3D alone (v1) | 72.0% | ~3:25 GPU | Tree canopies wrong, no ground, tree-as-fence |
| + CSF / HAG rules (v2) | 92.5% | +~1 min CPU | Ground recovered, fence/canopy fixes |
| + Building footprint rule (v3) | 93.3% | +~5 sec CPU | Roof reclassified, chimneys fixed |

The 0.8-point bump from v2 to v3 understates the change because most of the misclassified roof points were already labeled; they were just labeled wrongly. The accuracy improvement isn’t visible in the “% classified” number; it’s visible the moment you toggle the viewer to “Classification” mode and the building stops looking like it has a forest growing out of its roof.

There’s a meta-point here that’s worth pulling out for the consulting brief: classical geometry filters and spatial reasoning rules are radically cheaper to write, debug, and maintain than equivalent ML capability. Each of these post-processing layers is on the order of fifty lines of code and runs in seconds. Replacing any one of them with “fine-tune Mask3D until it gets this right” would be a multi-day project. For most production-grade point-cloud workflows in 2026, the right architecture is modern model + thin classical sanity layer, and the engineering ratio is something like 80% model, 20% rules — but the rules deliver the last mile of perceived quality.

Where this stops working

We need to be honest about what we just did. Every constant in the v2 and v3 rules was hand-tuned against this one scan of this one building. Read them again with that in mind:

  • CSF parameters (resolution=0.4, rigidness=2, step=0.65) — picked because the site is “lawn and parking lot.” A forested site or a steep grade needs different values; the wrong rigidness will eat tree trunks or miss the ground entirely.
  • HAG bands at 0.3m / 1.5m / 4.0m / 5.0m — calibrated by eye against this building’s geometry. A two-story warehouse with a real 6m perimeter fence breaks the “fence taller than 4m must be a tree” rule on the first scan.
  • Footprint cell at 1.0m, density floor of 4 points/cell, Z window of −0.3m to +4m — tuned to a building with a clean rectangular-ish footprint and a single coherent roofline. An industrial site with no clean outline has nothing to anchor on. A multi-story mid-construction site with intermediate floor decks would have the rule reclassify legitimate interior structure as roof.
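One way to keep these constants honest is to pull them out of the code and into a per-site calibration object. A minimal sketch, assuming a JSON override file per site; the class name, field names, and the warehouse override value are all hypothetical, and only the defaults come from this scan.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class SiteCalibration:
    """Hand-tuned constants for one site; defaults are this building's values."""
    # Cloth Simulation Filter (ground detection)
    csf_resolution: float = 0.4
    csf_rigidness: int = 2
    csf_step: float = 0.65
    # Height-above-ground bands, metres
    hag_grass_max: float = 0.3
    hag_bush_max: float = 1.5
    hag_fence_max: float = 4.0
    hag_canopy_min: float = 5.0
    # Building footprint rule
    footprint_cell: float = 1.0
    footprint_min_points: int = 4
    roof_z_below: float = -0.3
    roof_z_above: float = 4.0

def load_calibration(text):
    """Build a calibration from JSON; keys not present keep the defaults."""
    return SiteCalibration(**json.loads(text))

THIS_SITE = SiteCalibration()
# Illustrative override for a site with a tall perimeter fence.
WAREHOUSE = load_calibration('{"hag_fence_max": 7.0}')
THIS_SITE_JSON = json.dumps(asdict(THIS_SITE), indent=2)
```

Nothing about this makes the numbers transfer. It just makes the tuning step visible: a new site means a new calibration file and a reviewer who signed off on it.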

Run this exact pipeline against:

  • A two-story brick warehouse with a tall perimeter fence → the 4m fence rule reclassifies the actual fence as canopy.
  • An industrial yard with no contiguous building → the footprint rasterisation has no anchor; v3 contributes nothing and may add noise.
  • A forested park → CSF parameters undershoot, half the ground gets missed, the HAG values become unreliable, every downstream rule degrades.
  • A mid-construction site → “building” instances are fragmentary, the footprint grid is swiss cheese, and the rule misfires on intermediate floor structure.

So when we report 93.3% classified, what we’re really reporting is “93.3% on this site after a half day of by-eye tuning.” That number does not transfer.

There’s also a diminishing-returns signal in the experiment itself. v1→v2 added 20 points of classification rate. v2→v3 added 0.8 points and most of the visible improvement was relabeling already-labeled points. A v4 would chase smaller and smaller cohorts on this single scan. We’d be optimizing for the screenshot.

The honest deliverable from the experiment is not the percentage. It’s the architecture. Specifically:

  1. The pattern “modern 3D model + classical geometry filters + a thin spatial-reasoning layer” is the right shape for the next twelve months of production point-cloud work.
  2. The constants in any one implementation of that pattern are site-specific. They are a calibration step, not an algorithm.
  3. A real production deployment forks the rules per site (consultancy / per-engagement work), exposes them as a config (commodity tooling), or learns them (back to ML, defeating the point of the rule layer).

That’s the framing we’d take to a client. Not “here is our 93% pipeline.” Rather: “here is the shape of a pipeline that gets you to the high 80s or low 90s on most sites, here is the half-day of tuning per new site, and here is when that stops being good enough and you should invest in fine-tuning instead.” Anyone selling a single number across all sites is selling a screenshot, not a system.

This is also why we stopped at v3 instead of writing v4. The next rule would have been more prescriptive than the last and would have generalised even less. Knowing when to stop tuning is itself a finding.

What this looks like as consulting work

For most of our clients, the relevant question isn’t “can the model classify a building?” It’s “what do we do with this capability now that it exists?” The teams worth working with on this are:

  • Engineering and architecture firms with archives of scans and no fast way to extract structure from them. The 80% pre-labeling story is real, and the cost-per-scan trends down fast once the pipeline is in place.
  • Asset operators — utilities, municipalities, industrial sites — who scan periodically and need change detection. Mask3D-style segmentation gives you semantic anchors for differencing scans across time.
  • Reality capture vendors wondering whether to build vs. buy AI capability. Mostly: build a thin layer on top of existing open-source. Don’t reinvent the model. Do invest in your own label data.

If any of that resonates, that’s a conversation worth having. The full reproduction recipe — Dockerfile, preprocessing scripts, Hydra overrides, back-mapping code — lives in our infrastructure repo. We’re happy to walk through it.

The pipeline, abbreviated

For the engineers reading this: the steps were

  1. PDAL crop a 50m square around the building, voxel downsample to 10cm.
  2. Convert to STPLS3D’s expected .npy format. Generate fake ground-truth instance labels — Mask3D’s data loader has an obscure recursion if a scene has fewer than two unique instance ids, which cost an embarrassing amount of debugging time.
  3. Build a Mask3D container on top of nvidia/cuda:11.7.1-devel-ubuntu22.04. Pin PyTorch to 1.13.1+cu117. Set TORCH_CUDA_ARCH_LIST=7.5;8.0;8.6 (Ada-class 8.9 isn’t supported by 1.13 and triggers a build failure in Detectron2). Compile MinkowskiEngine and the custom pointnet2 op. Install with pip install . not setup.py install (the latter creates a zip egg that Python 3.10 doesn’t import correctly).
  4. Run inference with general.train_mode=false, general.export=true, general.use_dbscan=true, general.dbscan_eps=14.0. Disable workers (data.num_workers=0) and caching (data.cache_data=false) to fit in memory on a single block.
  5. Parse the per-instance binary masks back to per-point class labels. Resolve overlap with a winner-takes-all rule weighted by confidence.
  6. Stamp classification and user_data on a clone of the input LAS. Convert to COPC with PDAL. Upload to S3.
  7. (v2) Run PDAL’s CSF + hag_nn on the raw chunk. Apply the five-rule decision table above. Re-emit COPC.
  8. (v3) Rasterise the v2 Building points into a 1m XY grid. Reclassify any non-Building point that sits inside the footprint and within ~4m above the cell’s roofline. Re-emit COPC. Done.

Most of those steps are obvious in hindsight. None of them were obvious going in. We deliberately did not write a step 9; the constants in steps 7 and 8 are calibration choices for this site, and pretending otherwise would be dishonest.

That gap, between “obvious in hindsight” and “obvious going in,” and the discipline to stop iterating once the pattern is clear instead of chasing the last few points on a single dataset, is roughly the value an outside team brings to a client doing this for the first time.
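For step 7 of the list above, the PDAL side is a short pipeline. The sketch below only builds the pipeline JSON in Python; it does not run PDAL. File names are placeholders, the CSF values are this site's calibration, and the writer options (LAS 1.4 with all extra dimensions kept so HeightAboveGround survives) are our assumption about a reasonable setup, so check them against the PDAL documentation for your version.

```python
import json

def csf_hag_pipeline(src, dst, resolution=0.4, rigidness=2, step=0.65):
    """Return PDAL pipeline JSON: CSF ground classification, then nearest-neighbour HAG."""
    stages = [
        {"type": "readers.las", "filename": src},
        # CSF marks ground points with Classification 2.
        {
            "type": "filters.csf",
            "resolution": resolution,
            "rigidness": rigidness,
            "step": step,
        },
        # hag_nn adds a HeightAboveGround dimension measured from ground points.
        {"type": "filters.hag_nn"},
        # Keep extra dimensions so HeightAboveGround is written to the output.
        {
            "type": "writers.las",
            "filename": dst,
            "minor_version": 4,
            "extra_dims": "all",
        },
    ]
    return json.dumps({"pipeline": stages}, indent=2)

# Placeholder file names, for illustration only.
PIPELINE_JSON = csf_hag_pipeline("chunk.las", "chunk_ground_hag.las")
```

Written to a file, that JSON is what you would hand to the pdal pipeline command; the decision-table rules then run on the resulting Classification and HeightAboveGround values.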


This whole experiment lives on the model branch of reality capture — a labelled point cloud is a substrate for downstream BIM, change-detection, and analysis tools. There’s a parallel branch — the render branch — that takes the same source data toward Gaussian Splats and other view-synthesis artifacts for human audiences. The split between the two, and why conflating them is the most expensive mistake in the field, is its own writeup: Your scan is actually two scans.
