[OSM-talk] The temporal dimension (was: Re: Overpass API: new version 0.6.93 ...)
Roland Olbricht
roland.olbricht at gmx.de
Mon Sep 5 21:58:34 BST 2011
> > [@newer=2011-08-01]
> > restricts the data to only those data last edited after the given
> > date. This is only possible in combination with another conditional.
>
> Why wasn't something like [@timestamp>2011-08-01] used?
Well, for a start, because there is no [@older=..] or [@timestamp<..]. By the
way, to correct myself, it should be something like
[@newer=2011-08-01T09:15:00Z], because the simplistic parser needs a full
date. [@older=..] in turn doesn't exist because it wouldn't make sense in the
current state of affairs. See below.
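To illustrate: combined with a tag conditional, a complete request could look
roughly like the following XAPI-style line (the tag and the timestamp are
made-up values, just for illustration):

  node[amenity=pub][@newer=2011-08-01T09:15:00Z]

This would return only those pubs whose last edit is more recent than the
given date.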
[@newer=..] is meant to have a humble look and feel, because it is a very
humble solution. To spark the discussion, imagine the following data items:
- Element A hasn't changed since January.
- Element B was re-uploaded without changes yesterday.
- Element C was deleted yesterday.
- Element D substantially changed its meaning yesterday, such that it is now
out of scope.
(To make this clearer: suppose you are searching for bridges, and D is a way
that has been split such that one half is no longer a bridge.)
- Element E was created yesterday morning but disappeared again yesterday
evening due to a bad edit.
Now what would you expect the [@newer=24-hours-ago] operator to find? A good
implementation should surely produce C and D, and maybe E and/or B. In fact,
Overpass API yields B and only B, which is unsatisfactory.
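Spelled out, the 24-hours-ago request for the bridge scenario would look
something like this (the timestamp is a made-up stand-in for "24 hours ago"):

  way[bridge=yes][@newer=2011-09-04T22:00:00Z]

Of the five elements, only B both still exists and carries a timestamp within
the last 24 hours.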
The reason for this: at any moment, Overpass API represents the data as it
would appear in a fictional Planet.osm, obtained by patching the last real
Planet.osm with the diff files applied so far. And a Planet.osm would likewise
show only B. This is closely related to discussions of the type "How do I find
deleted elements?".
What are the other options? (Please note: all of these are just thoughts, not
even vapourware.)
Overpass API could become a full-blown history server. This would allow it to
give sane answers to [@timestamp<..] and [@timestamp>..], to offer a diff
option, to produce search results from a certain time in the past, and a lot
of other amazing things. But this has at least four downsides:
1. This partly breaks with the OSM data model. For example, a way can change
its geometry without literally being changed itself: just move the underlying
nodes. The OSM element doesn't record this, but the database must detect it
for consistent data delivery; the way might, for example, have entered or left
the bounding box you searched for. The issue popped up at some point in a
discussion about the undo features of Potlatch, but I don't have a link to
that. Another thing is that such a server would, at a certain time in the
future, mix CC-BY-SA and ODbL data, which is an unnecessary legal hassle. Most
likely, I would be slow enough on development to roll out the software after
the licence change :) In any case, this may produce a flame war on details,
which is exactly what I don't want to get the project into, and I'm not
diplomat enough to avoid it.
2. The hourly, daily, and weekly diffs are incompatible with this goal, and
even Planet.osm and the minute updates need diligent analysis. Note that in
all of the diffs, multiple changes to an element are collapsed into a single
change. Thus, an element like E above might never appear on the server. I'm
not sure whether the minute updates are guaranteed to contain all changes, but
losing changes that were reverted within less than a minute might be
acceptable. The full-history Planet.osm could replace an ordinary Planet.osm,
but mind that it is an order of magnitude bigger.
3. All of this could multiply the hardware requirements, and I'm simply not
sure what the current server can handle. For this reason, I started with the
Planet.osm metadata, which already doubled the data amount from roughly 35 GB
to roughly 65 GB. With history data, I rather expect 100 GB to 150 GB. The
impact on query times can probably be kept under control (if we keep the
historic data apart from the current data), but the data updates will in that
case become much slower.
4. And it will need a lot of programming effort. The documentation alone,
needed to make clear all of the decisions from points 1 and 2, would take
weeks. Implementation and testing will take the same effort or more, depending
on how many tricks are necessary to keep the system responsive. While it is
challenging, I don't see a demand massive enough compared to the other
features that would be postponed in that case.
A second option would be to produce some kind of feed that you subscribe to
in order to get changes. This could be realized quite easily, because the way
and relation updaters already receive a kind of internal feed to update their
geometries from their members. You would subscribe with an arbitrary query,
e.g. a bounding box, a certain tag, or a combination of both, and would get
all changes concerning that query roughly every few hours, without any
assertion of completeness. But I don't expect that many users would be
interested in such a service [responses by e-mail may convince me of the
contrary :)]. It will be implemented at some point in the future to improve
area update speed, but on the road map that is rather at the end of this
year.
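Such a subscription query, combining a bounding box with a tag, might look
like this (a purely hypothetical sketch with made-up coordinates; no such
service exists yet):

  way[bridge=yes][bbox=7.0,50.6,7.3,50.8]

You would then receive, roughly every few hours, the changes concerning
bridge ways inside that bounding box.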
A third option would be to regularly freeze the data. This is technically
easy (just stop the updates at a certain point and copy the database), but out
of scope with regard to the hard disk sizes on the overpass-api.de server.
Other ideas on how to give [@newer=..] proper semantics, comments on the
above ideas, and personal opinions on their usefulness are welcome.
Cheers,
Roland