[OSM-dev] Server-side data validation

Fri Jul 13 18:58:23 BST 2012

On Fri, Jul 13, 2012 at 07:27:25PM +0200, Paweł Paprota wrote:
> Today I have encountered a lot of bad data in my area - duplicated
> nodes/ways. These probably stem from an inexperienced user or faulty
> editor software when drawing buildings. I corrected a lot of this stuff,
> see changesets:
> 
> http://www.openstreetmap.org/browse/changeset/12208202
> http://www.openstreetmap.org/browse/changeset/12208389
> http://www.openstreetmap.org/browse/changeset/12208467
> http://www.openstreetmap.org/browse/changeset/12208498
> 
> As you can see, these changesets remove thousands of nodes/ways. I have
> done this using JOSM validators and "Fix it" option which automatically
> merges/deletes nodes that are duplicated.
> 
> That is all fine of course but this sparked a thought... why is
> garbage data like this allowed into the database in the first place? Of
> course it can always be fixed client-side (JOSM, even some autobots) but
> why allow unconnected untagged nodes or duplicated nodes, duplicated
> ways etc.?
> 
> I understand (though don't wholly agree...) the concept of having a very
> generic data model where anyone can push anything into the database but
> it would be trivial to implement some server-side validations for these
> cases (so that API throws errors and does not accept such data) and thus
> reduce client-side work by a very significant margin - i.e. I could have
> been working on something more useful in that time than removing garbage
> data.
> 
> Server-side validation could be of course taken even further - OSM
> server could reject meaningless tag combinations etc. - basically JOSM
> validators on the "error" level should be implemented as server-side
> validators, some "warning" level validators possibly as well.
> 
> This would ensure data consistency and integrity at least a little
> bit... (of course first bad data would have to be pruned from existing
> database so that it is consistent with validation logic but that's for
> another discussion).
> 
> What is the current consensus within OSM dev community on this aspect of
> OSM architecture?

This is a difficult subject because it goes to the core of the way we do
things in OSM. Traditionally there have been very few checks in the server
(tag key/value length, maximum number of nodes in a way, ...). This gives us a
lot of flexibility: changes in the way we model the world are easier.
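
To illustrate how small that class of checks is, here is a minimal sketch in
Python. It is not the server's actual code: the function names are
hypothetical, and the limits (255 characters per tag key/value, 2000 nodes
per way) are assumptions chosen for the example.

```python
# Hypothetical limits, assumed for illustration only.
MAX_TAG_LENGTH = 255
MAX_WAY_NODES = 2000


def check_tags(tags):
    """Return a list of error strings for over-long tag keys or values."""
    errors = []
    for key, value in tags.items():
        if len(key) > MAX_TAG_LENGTH:
            errors.append("tag key too long: %s..." % key[:20])
        if len(value) > MAX_TAG_LENGTH:
            errors.append("tag value too long for key: %s" % key[:20])
    return errors


def check_way(node_refs, tags):
    """Return a list of error strings for a way (node id list plus tags)."""
    errors = check_tags(tags)
    if len(node_refs) > MAX_WAY_NODES:
        errors.append("way has %d nodes, limit is %d"
                      % (len(node_refs), MAX_WAY_NODES))
    return errors
```

Note that these checks only look at the object being uploaded. They need no
knowledge of what the tags mean or of any other object in the database.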

On the other hand we see more and more garbage data in the database as you
describe. There are some things that are "merely annoying" like duplicated
nodes or ways with only a single node in them. Fixing these issues takes
a lot of work, but even if left unfixed these problems normally do not have
grave consequences.
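
As a rough sketch of what detecting these two "merely annoying" cases could
look like, assuming the data is already available as plain Python
dictionaries (the function names and the input layout are hypothetical, and
"duplicated" is taken here to mean identical coordinates after rounding to 7
decimal places):

```python
from collections import defaultdict


def find_duplicate_nodes(nodes):
    """nodes: dict of node id -> (lat, lon).

    Returns a list of sorted id lists, one per position shared by
    more than one node.
    """
    by_position = defaultdict(list)
    for node_id, (lat, lon) in nodes.items():
        # Assumption: positions equal after rounding count as duplicates.
        by_position[(round(lat, 7), round(lon, 7))].append(node_id)
    return [sorted(ids) for ids in by_position.values() if len(ids) > 1]


def find_degenerate_ways(ways):
    """ways: dict of way id -> list of node ids.

    Returns the sorted ids of ways with fewer than two distinct nodes.
    """
    return sorted(way_id for way_id, refs in ways.items()
                  if len(set(refs)) < 2)
```

The duplicate check has to compare a node against other nodes, so unlike a
length check it cannot be done by looking at one uploaded object in
isolation.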

But there are much bigger problems. At any time a lot of the existing
multipolygon or route relations are broken. It is nearly impossible to get a
complete set of coastlines or boundaries, because something is always broken
somewhere. With every fix new problems are introduced. Those problems are much
larger because a single error in one part of the world can break everything in
the same country or on the same continent. I think we seriously need to think about how we
can improve this situation.
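
One common kind of breakage in multipolygons, coastlines and boundaries is a
ring that does not close. A minimal sketch of a check for that, assuming the
member ways are given as lists of node ids (the function name and input
layout are hypothetical): every node used as a way endpoint must be used an
even number of times, otherwise the ring is open there. This is only one
necessary condition; it says nothing about self-intersections or inner/outer
roles.

```python
from collections import Counter


def find_open_ends(member_ways):
    """member_ways: list of node id lists, one per member way.

    Returns the sorted node ids at which the ways fail to join up
    into closed rings.
    """
    endpoint_count = Counter()
    for refs in member_ways:
        if len(refs) < 2:
            continue  # degenerate way, contributes no segment
        # A closed way has the same first and last node and adds 2 to it.
        endpoint_count[refs[0]] += 1
        endpoint_count[refs[-1]] += 1
    return sorted(node for node, count in endpoint_count.items()
                  if count % 2 == 1)
```

For example, `find_open_ends([[1, 2, 3], [3, 4, 5]])` reports nodes 1 and 5,
the two loose ends. The check needs all member ways of the relation at once,
which is part of why a single bad edit to one way can break a large object.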

And one thing we always have to keep in mind: The central server is OSM's
bottleneck. We can't do anything that would substantially increase the
load on it.

Jochen
-- 
Jochen Topf  jochen at remote.org  http://www.remote.org/jochen/  +49-721-388298