[OSM-dev] Server-side data validation

Fri Jul 13 21:38:09 BST 2012

On 13.07.2012 20:54, Paweł Paprota wrote:
> Hi Peter,
>
> Thanks for the response.
>
>> 1) The OSM API is a restful api that allows "live" editing: The editor
>> software a) opens a changeset, b) creates a node, c) adds a tag - same
>> for ways.
>> Between b and c there's an untagged osm element in the database (even if
>> it's in most cases a very short time).
> I think that is a rather orthogonal issue to validation, meaning that
> some validation should probably be launched when a changeset closes for
> example - true - but more important is the fact that even with the API
> calls that you described it is not possible to _end up_ with broken
> data.
Of course it is.
Imagine the user's internet connection breaks after (b) and the
changeset gets closed later by a server-side timeout.
The database then contains that empty node, and other users may already
be using it (by downloading and editing it).
It's not possible to completely invalidate that node afterwards, because
trying to do so may cause conflicts.
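
To make that failure case concrete, here is a minimal sketch. It assumes
a hypothetical in-memory stand-in for the database; FakeDb and its
methods are invented for illustration and are not the real OSM API or
the rails code.

```python
class FakeDb:
    """Hypothetical in-memory stand-in for the OSM database."""

    def __init__(self):
        self.nodes = {}               # node id -> dict of tags
        self.open_changesets = set()  # ids of changesets still open
        self._next_id = 1

    def _new_id(self):
        new_id = self._next_id
        self._next_id += 1
        return new_id

    def _require_open(self, changeset):
        if changeset not in self.open_changesets:
            raise ValueError("changeset is closed")

    def open_changeset(self):
        changeset = self._new_id()
        self.open_changesets.add(changeset)
        return changeset

    def create_node(self, changeset):
        self._require_open(changeset)
        node_id = self._new_id()
        self.nodes[node_id] = {}      # stored immediately, still untagged
        return node_id

    def add_tag(self, changeset, node_id, key, value):
        self._require_open(changeset)
        self.nodes[node_id][key] = value

    def close_changeset(self, changeset):
        self.open_changesets.discard(changeset)

    def untagged_nodes(self):
        return [n for n, tags in self.nodes.items() if not tags]


db = FakeDb()
cs = db.open_changeset()    # (a)
node = db.create_node(cs)   # (b)
# The connection drops here, so step (c) never arrives.
db.close_changeset(cs)      # server-side timeout closes the changeset
orphans = db.untagged_nodes()
```

After the timeout, orphans still contains the node created in (b): each
call was valid on its own, but the end state is the "broken" one.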
>   So for now I'm trying to discuss this at a more abstract level -
> that the contract would be "we can't have X in the database" but how it
> is implemented (at changeset close maybe?) - I cannot say (yet) as I am
> no expert in OSM. For now more important is whether this kind of
> thinking even makes sense for you.
The idea makes sense IMHO, but I don't see how to handle these checks
intelligently without big changes to the API style.
> [...]
>> 3) the free tagging scheme would allow similar stuff for nodes, too
>> (while I don't know any issue where that's used currently). A
>> theoretical example would be a set of nodes, which are defined points
>> inside a fuzzy area/region and others which are defined points outside
>> (where there's no concrete, hard boundary defined, e.g. for "the Alps").
>>
> I understand the benefits of "free tagging" approach. On the other hand
> it is kind of strange that even for "core" keys (e.g. "highway" or
> "surface") there is no validation/schema/whatever one calls it.
The big question is what "core keys" are - and what they may or may not
contain.
What do you want to check for the highway key, for example?
The most prominent values are easy - but there are tons of other values,
too, that are less easy to "validate", and as long as
highway=emergency_access_point, highway=give_way and similar are
"allowed", how would we reject invalid highway tags?
> In this case, what is more efficient:
>
> 1. Adding one more possible value for "highway" when it is needed and
> deploying such a change to production.
> 2. Constantly cleaning up the database when there are inconsistent
> entries (typos etc). In fact I think there is no such process as global
> cleanup - there are couple of bots that do so here and there but overall
> the data can be inconsistent.
Sure.
But if you go with the first option, Mary Mapper cannot decide to
add a useful highway=electrocycleway for the growing number of
e-bikes, because the server would reject that tag.
The rails coders have to decide to change that before anyone can add
that tag.
Sure, we can assume that new values for highway will appear only
rarely - but is that helpful?
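
A rough sketch of what such a server-side whitelist check could look
like. The set of allowed values and the function name are assumptions
made up for illustration; nothing like this exists in the rails code as
far as I know.

```python
# Hypothetical whitelist; the real set of highway values is open-ended.
ALLOWED_HIGHWAY = {
    "motorway",
    "primary",
    "residential",
    "emergency_access_point",
    "give_way",
}


def validate_highway(tags):
    """Accept tags with no highway key or with a whitelisted highway value."""
    value = tags.get("highway")
    return value is None or value in ALLOWED_HIGHWAY


typo_ok = validate_highway({"highway": "residental"})           # a typo
new_value_ok = validate_highway({"highway": "electrocycleway"})  # Mary's tag
```

Both calls are rejected. The check cannot tell a typo from a new,
deliberate value, so Mary's tag stays blocked until the whitelist itself
is changed and deployed.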

Again, I think multipolygons are something of a special case, but from
another point of view: the multipolygon is in effect the fourth basic
datatype of OSM today.
>> Pushing this validation to the server side has several drawbacks:
>> - usually server load is the bottleneck in osm, not client load.
> I understand infrastructure constraints but I think (very-)long-term
> pushing stuff to the client-side will cause much more trouble than
> dealing with load issues but having consistent database and business
> logic (validation) in place.
True, as long as you really can restrict the tagging. As mentioned, that
brings other problems: it kills free tagging even in the corner cases
where it is the big benefit of OSM.
>> - a server-side check would lock in the corresponding tagging scheme
>> and probably make other tagging schemes invalid, a contradiction to
>> the free tagging scheme we have.
>> - the API would have to change to use transaction-like semantics, which
>> again means higher server load, but is the only way to make sure this
>> invalid stuff is never created.
> For now it is just a thought exercise and discussion but if I could
> propose some changes and perhaps implement some proof of concept, would
> it be taken seriously? You can say that "open source is about working
> not talking" and I should rather do something instead of discussing but
> as you can see these are pretty high level things that go against status
> quo - that's why I want to make sure my time is well spent...
Well...
I'm afraid I'm not much of a coder (for OSM) yet. I looked into the
rails code, but I don't know Rails ;)
So I'm probably the wrong person to ask.

But as far as I can see, solutions are usually taken seriously if they
follow the right path, and that right path might not be what one
imagines - especially if it hasn't been discussed beforehand.

regards
Peter