Show HN: Pipeline and datasets for data-centric AI on real-world floor plans

Posted by standfest 1 day ago

Show HN: Pipeline and datasets for data-centric AI on real-world floor plans(archilyse.standfest.science)

11 points | 4 comments

jaynamburi 1 day ago|

Interesting work floor plans are a great real world testbed for data centric AI because the bottleneck is almost always annotation quality, not model architecture.

We’ve seen similar patterns in document layout and indoor mapping projects: cleaning mislabeled walls/doors, fixing class imbalance (e.g., tiny symbols vs large rooms), and enforcing geometric consistency often gives bigger gains than switching models. For example, simply normalizing scale, snapping lines, and correcting room boundary labels can outperform moving from a basic U-Net to a heavier transformer.

A reproducible pipeline + curated datasets here feels especially valuable for downstream tasks like indoor navigation, energy modeling, or digital twins where noisy labels quickly compound into bad geometry.

Would be curious how you handle symbol ambiguity (stairs vs ramps, doors vs windows) and cross-domain generalization between architectural styles.

Nice focus on data quality over model churn.

standfest 1 day ago|

Totally agree that for floor plans the bottleneck is usually label/geometry quality, not model architecture. We looked at CV early on, but real plan archives are a pretty adversarial input: ~100-year-old drawings mixed with modern exports, lots of drafting styles/implicit ontologies, low-res scans + distortion, and sometimes multiple conflicting “truths” for the same plan (revisions, partial updates, different sources). Even with decent models, you still pay heavily in expert cleanup.

So we optimized against the real baseline: manual CAD-style annotation. The “data-centric” work for us was making manual annotation cheap and auditable: limited ontology, a web editor that enforces structure (scale normalization, closed rooms, openings must attach to walls, etc.), plus hard QA gates against external numeric truth (client index / measured areas, room counts). Typical QA tolerance is ~3%; in Swiss Dwellings we report median area deviation <1.2% with a hard max of 5%. Once we could hit those bounds at <~1/10th the prevailing manual cost, CV stopped being a clear value add for this stage.

On ambiguity (doors vs windows, stairs vs ramps): we try not to “guess” — we push it into constraints + consistency checks (attachment to walls, adjacency, unit connectivity, cross-floor consistency) and flag conflicts for review. On generalization: I don’t think this is zero-shot across styles; the goal is bounded adaptation (stable primitives + QA gates, small mapping/rules layer changes). Trade-off is less expressiveness, but for geometry-sensitive downstream tasks small errors compound fast.

V_Shukla 1 day ago|

Data realism tends to be the quiet differentiator in generation systems. Once base model capability becomes commoditized, the biggest performance gap often comes from dataset curation, labeling quality, and how closely the training data reflects real deployment conditions. In visual generation workflows, even small improvements in dataset realism can significantly reduce the “synthetic look” that usually breaks usability in production contexts.

standfest 22 hours ago|

Agree that “data realism” is the quiet differentiator in mature visual generation domains.

Floor plans / technical drawings feel a lot less mature though — we don’t really have generators that are “good” in the sense that they preserve the constraints that matter (scale, closure, topology, entrances, unit stats, cross-floor consistency, etc.). A lot of outputs can look plausible but fall apart the moment you treat them as geometry for downstream tasks.

That’s why I’ve been pushing the idea that simplistic generators are kind of doomed without a context graph (spatial topology + semantics + building/unit/site constraints, ideally with environmental context). Otherwise you’re generating pretty pictures, not usable plans.

Also: I’m a bit surprised how few researchers have used these datasets for basic EDA. Even before training anything, there’s a ton of value in just mapping distributions, correlations, biases, and failure modes. Feels like we’re skipping the “understand the data” step far too often.