[OSM-dev] Server-side data validation

Fri Jul 13 19:54:31 BST 2012

Hi Peter,

Thanks for the response. 

> 1) The OSM API is a restful api that allows "live" editing: The editor 
> software a) opens a changeset, b) creates a node, c) adds a tag - same 
> for ways.
> Between b and c there's an untagged osm element in the database (even if 
> it's in most cases a very short time).

I think that is largely orthogonal to validation. True, it means that
some validation would probably have to run when a changeset closes, for
example, but the more important point is that even with the API calls
you describe it should not be possible to _end up_ with broken data. So
for now I'm trying to discuss this at a more abstract level: the
contract would be "we can't have X in the database", but how it is
implemented (at changeset close, maybe?) I cannot say yet, as I am no
expert in OSM. For now it matters more whether this kind of thinking
even makes sense to you.

> 2) Ways without tags may be part of relations nevertheless: an outer way 
> of a multipolygon does not necessarily have tags, as the tags applied to 
> the multipolygon should go to the relation.

Yes, that was just a quick example, and saying "don't allow unconnected
untagged nodes" may be too simplistic. Still, there is a lot of
business logic that could be placed on the server and would help
increase the quality of OSM data across the board.
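
A minimal sketch of what such a contract could look like if it were
checked when a changeset closes, taking the multipolygon case into
account. Everything here is an assumption for illustration: the
`Element` class, the `find_orphans` function and the rule itself are
hypothetical and are not part of the OSM API. A real check would also
have to look at elements already in the database, not only at the ones
passed in.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical, simplified stand-in for an OSM node, way or relation."""
    kind: str                                  # "node", "way" or "relation"
    id: int
    tags: dict = field(default_factory=dict)
    refs: list = field(default_factory=list)   # (kind, id) pairs of members


def find_orphans(elements):
    """Return (kind, id) of untagged nodes/ways that nothing references.

    An untagged node used by a way, or an untagged way used by a
    relation (e.g. the outer way of a multipolygon), is not reported.
    """
    # Collect everything that some way or relation points at.
    referenced = {tuple(ref) for el in elements for ref in el.refs}
    return [
        (el.kind, el.id)
        for el in elements
        if el.kind in ("node", "way")
        and not el.tags
        and (el.kind, el.id) not in referenced
    ]


# State at changeset close (assumed to include all referencing elements).
elements = [
    Element("node", 1),                                   # part of way 10
    Element("node", 2),                                   # stray node
    Element("way", 10, refs=[("node", 1)]),               # untagged outer way
    Element("relation", 100, {"type": "multipolygon"}, [("way", 10)]),
]
orphans = find_orphans(elements)
```

Because the check only looks at the final state, the short-lived
untagged node between "create node" and "add tag" would not be rejected.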

> 3) the free tagging scheme would allow similar stuff for nodes, too 
> (while I don't know any issue where that's used currently). A 
> theoretical example would be a set of nodes, which are defined points 
> inside a fuzzy area/region and others which are defined points outside 
> (where there's no concrete, hard boundary defined, e.g. for "the Alps").
> 

I understand the benefits of the "free tagging" approach. On the other
hand, it is kind of strange that even for "core" keys (e.g. "highway" or
"surface") there is no validation/schema/whatever one calls it.

In this case, what is more efficient:

1. Adding one more possible value for "highway" when it is needed and
deploying such a change to production.
2. Constantly cleaning up the database when there are inconsistent
entries (typos etc.). In fact, I think there is no such process as a
global cleanup - a couple of bots do this here and there, but overall
the data can be inconsistent.
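
To illustrate option 1, here is a minimal sketch of a value check for
"core" keys that leaves all other keys free. The value lists below are
a made-up, incomplete sample, and `CORE_KEY_VALUES` and `check_tags` are
hypothetical names; nothing like this exists in the OSM server as far
as this discussion goes.

```python
import difflib

# Assumed, deliberately incomplete sample of accepted values per core key.
# Adding a new value would mean editing this table and redeploying.
CORE_KEY_VALUES = {
    "highway": {"motorway", "primary", "secondary", "tertiary",
                "residential", "footway"},
    "surface": {"asphalt", "gravel", "paved", "unpaved"},
}


def check_tags(tags):
    """Return (key, value, suggestion) for each rejected core-key value.

    Keys that are not listed in CORE_KEY_VALUES are never rejected, so
    free tagging still works for everything outside the core keys.
    """
    problems = []
    for key, value in tags.items():
        allowed = CORE_KEY_VALUES.get(key)
        if allowed is None or value in allowed:
            continue
        # Suggest the closest known value, if any, to help with typos.
        close = difflib.get_close_matches(value, sorted(allowed), n=1)
        problems.append((key, value, close[0] if close else None))
    return problems


problems = check_tags(
    {"highway": "residental", "surface": "asphalt", "name": "Foo"}
)
```

With the sample table above, the misspelled "residental" is reported
together with "residential" as the suggested value, while "name" passes
untouched because it is not a core key.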

Paweł

> Pushing this validation to the server side has several drawbacks:
> - usually server load is the bottleneck in osm, not client load.

I understand the infrastructure constraints, but I think that in the
(very) long term, pushing stuff to the client side will cause much more
trouble than dealing with the load issues while having a consistent
database and business logic (validation) in place.

> - a check on server side would fix the corresponding tagging and makes 
> other tagging schemes invalid probably, a contradiction to the free 
> tagging scheme we have.
> - the api would have to change to use transaction-like semantics, which 
> is again higher server load, but the only way to make sure not to create 
> this invalid stuff.
> 

For now it is just a thought exercise and a discussion, but if I were to
propose some changes and perhaps implement a proof of concept, would
it be taken seriously? You could say that "open source is about working,
not talking" and that I should do something rather than discuss it, but
as you can see these are pretty high-level things that go against the
status quo - that's why I want to make sure my time is well spent...

Paweł

> regards
> Peter
> 
> Am 13.07.2012 19:27, schrieb Paweł Paprota:
> > Hi all,
> >
> > Today I have encountered a lot of bad data in my area - duplicated
> > nodes/ways. These probably stem from an inexperienced user or faulty
> > editor software when drawing buildings. I corrected a lot of this stuff,
> > see changesets:
> >
> > http://www.openstreetmap.org/browse/changeset/12208202
> > http://www.openstreetmap.org/browse/changeset/12208389
> > http://www.openstreetmap.org/browse/changeset/12208467
> > http://www.openstreetmap.org/browse/changeset/12208498
> >
> > As you can see, these changesets remove thousands of nodes/ways. I have
> > done this using JOSM validators and "Fix it" option which automatically
> > merges/deletes nodes that are duplicated.
> >
> > That is all fine of course but this sparked a thought... why is
> > garbage data like this allowed into the database in the first place? Of
> > course it can always be fixed client-side (JOSM, even some autobots) but
> > why allow unconnected untagged nodes or duplicated nodes, duplicated
> > ways etc.?
> >
> > I understand (though don't wholly agree...) the concept of having a very
> > generic data model where anyone can push anything into the database but
> > it would be trivial to implement some server-side validations for these
> > cases (so that API throws errors and does not accept such data) and thus
> > reduce client-side work by a very significant margin - i.e. I could have
> > been working on something more useful in that time than removing garbage
> > data.
> >
> > Server-side validation could be of course taken even further - OSM
> > server could reject meaningless tag combinations etc. - basically JOSM
> > validators on the "error" level should be implemented as server-side
> > validators, some "warning" level validators possibly as well.
> >
> > This would ensure data consistency and integrity at least a little
> > bit... (of course first bad data would have to be pruned from existing
> > database so that it is consistent with validation logic but that's for
> > another discussion).
> >
> > What is the current consensus within OSM dev community on this aspect of
> > OSM architecture?
> >
> > Paweł
> >
> > _______________________________________________
> > dev mailing list
> > dev at openstreetmap.org
> > http://lists.openstreetmap.org/listinfo/dev