[OSM-talk] The temporal dimension (was: Re: Overpass API: new version 0.6.93 ...)
Roland Olbricht
roland.olbricht at gmx.de
Mon Sep 5 21:58:34 BST 2011
> > [@newer=2011-08-01]
> > restricts the data to only those data last edited after the given
> > date. This is only possible in combination with another conditional.
>
> Why wasn't something like [@timestamp>2011-08-01] used?
Well, for a start, because there is no [@older=..] or [@timestamp<..]. By the
way, to correct myself, it should be something like
[@newer=2011-08-01T09:15:00Z], because the simplistic parser needs a full
date. [@older=..] in turn doesn't exist because it wouldn't make sense in the
current state of affairs. See below.
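To illustrate: combined with a tag conditional, a complete request could look
roughly like the following XAPI-style line (the tag and the timestamp are
made-up values, just for illustration):

  node[amenity=pub][@newer=2011-08-01T09:15:00Z]

This would return only those pubs whose last edit is more recent than the
given date.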
[@newer=..] is meant to have a humble look and feel, because it is a very
humble solution. To spark the discussion, imagine the following data items:
- Element A hasn't changed since January.
- Element B was re-uploaded without changes yesterday.
- Element C was deleted yesterday.
- Element D substantially changed its meaning yesterday, such that it is now
out of scope.
(To make this clearer: suppose you are searching for bridges, and D is a way
that has been split such that one half is no longer a bridge.)
- Element E was created yesterday morning but disappeared again yesterday
evening due to a bad edit.
Now what would you expect the [@newer=24-hours-ago] operator to find? A good
implementation should surely produce C and D, and maybe E and/or B. In fact,
Overpass API yields B and only B, which is unsatisfactory.
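Spelled out, the 24-hours-ago request for the bridge scenario would look
something like this (the timestamp is a made-up stand-in for "24 hours ago"):

  way[bridge=yes][@newer=2011-09-04T22:00:00Z]

Of the five elements, only B both still exists and carries a timestamp within
the last 24 hours.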
The reason for this: at any moment, Overpass API represents the data as it
would appear in a fictional Planet.osm, obtained by patching the last real
Planet.osm with the diff files applied so far. And a Planet.osm would likewise
show only B. This is closely related to discussions of the type "How do I find
deleted elements?".
What are the other options? (Please note: all of these are just thoughts, not
even vapourware.)
Overpass API could become a full-blown history server. This would allow it to
give sane answers to [@timestamp<..] and [@timestamp>..], to offer a diff
option, to produce search results from a certain time in the past, and a lot
of other amazing things. But this has at least four downsides:
1. This partly breaks with the OSM data model. For example, a way can change
its geometry without literally being changed itself: just move the underlying
nodes. The OSM element doesn't record this, but the database must detect it
for consistent data delivery; the way might, for example, have entered or left
the bounding box you searched for. The issue popped up at some point in a
discussion about the undo features of Potlatch, but I don't have a link to
that. Another thing is that such a server would, at a certain time in the
future, mix CC-BY-SA and ODbL data, which is an unnecessary legal hassle. Most
likely, I would be slow enough on development to roll out the software after
the licence change :) In any case, this may produce a flame war on details,
which is exactly what I don't want to get the project into, and I'm not
diplomat enough to avoid it.
2. The hourly, daily, and weekly diffs are incompatible with this goal, and
even Planet.osm and the minute updates need diligent analysis. Note that in
all of the diffs, multiple changes to an element are collapsed into a single
change. Thus, an element like E above might never appear on the server. I'm
not sure whether the minute updates are guaranteed to contain all changes, but
losing changes that were reverted within less than a minute might be
acceptable. The full-history Planet.osm could replace an ordinary Planet.osm,
but mind that it is an order of magnitude bigger.
3. All of this could multiply the hardware requirements, and I'm simply not
sure what the current server can handle. For this reason, I started with the
Planet.osm metadata, which already doubled the data amount from roughly 35 GB
to roughly 65 GB. With history data, I rather expect 100 GB to 150 GB. The
impact on query times can probably be kept under control (if we keep the
historic data apart from the current data), but the data updates will in that
case become much slower.
4. And it will need a lot of programming effort. The documentation alone,
needed to make clear all of the decisions from points 1 and 2, would take
weeks. Implementation and testing will take the same effort or more, depending
on how many tricks are necessary to keep the system responsive. While it is
challenging, I don't see a demand massive enough compared to the other
features that would be postponed in that case.
A second option would be to produce some kind of feed that you subscribe to
in order to get changes. This could be realized quite easily, because the way
and relation updaters already receive a kind of internal feed to update their
geometries from their members. You would subscribe with an arbitrary query,
e.g. a bounding box, a certain tag, or a combination of both, and would get
all changes concerning that query roughly every few hours, without any
assertion of completeness. But I don't expect that many users would be
interested in such a service [responses by e-mail may convince me of the
contrary :)]. It will be implemented at some point in the future to improve
area update speed, but on the road map that is rather at the end of this
year.
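Such a subscription query, combining a bounding box with a tag, might look
like this (a purely hypothetical sketch with made-up coordinates; no such
service exists yet):

  way[bridge=yes][bbox=7.0,50.6,7.3,50.8]

You would then receive, roughly every few hours, the changes concerning
bridge ways inside that bounding box.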
A third option would be to regularly freeze the data. This is technically
easy (just stop the updates at a certain point and copy the database), but out
of scope with regard to the hard disk sizes on the overpass-api.de server.
Other ideas on how to give [@newer=..] proper semantics, comments on the
above ideas, and personal opinions on their usefulness are welcome.
Cheers,
Roland