[OSM-dev] Minute Diffs Broken

Brett Henderson brett at bretth.com
Tue May 5 22:53:59 BST 2009


Greg Troxel wrote:
>   My aim all along has been to provide people with up to date data.  The 
>   nice thing about the minute changesets is that they let you have an 
>   offline database that exactly matches the API as of 6 minutes ago.  I'd 
>   completely agree with you if the API only released data once the 
>   changeset was closed but that's not the case.
>
> I am a bit confused by some of the terms being used here.  The basic
> issue for me is that we have API operations, which correspond to
> database transactions.  Ignoring SERIALIZABLE vs READ COMMITTED, these
> operations are quite safe.  These operations are not changesets.
>   
I've been struggling with terminology too.  I've always called the 
osmosis files "changesets" because to me that's what they were.  However 
now API 0.6 has introduced its own concept of a changeset which has a 
lot of similarities but isn't the same thing.  I sometimes call the 
osmosis files "diffs" which is perhaps closer to the mark.  Neither type 
of changeset has anything to do with a transaction, though; they both 
exist independently of transactions.
> Given the way the world is, it seems like the minute diffs really should
> be looking for new transactions, not new changesets.  I can see
> Frederik's point of only exporting closed changesets, but for that to
> really make sense I think the main database has to isolate changesets
> from each other until they are fully committed (meaning either
> long-running transactions or an API change to have an API operation be
> open/upload/close) -- trying to add transaction properties on a copy
> when they aren't there in the original seems like it just won't work.
>   
Just to be clear, osmosis isn't looking for new changesets or 
transactions, it is just looking for entities that have been modified 
within a specific time period.  It doesn't know what an API changeset or 
database transaction is.  Perhaps it should be looking for transactions 
(although I don't see how that will solve anything yet) but that is not 
currently the case.
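
As a rough illustration of that time-window selection (a sketch only, not
osmosis's actual code; the EntityVersion shape and the modified_between
name are made up for this example):

```python
from datetime import datetime
from typing import NamedTuple


class EntityVersion(NamedTuple):
    """One stored version of a node, way or relation (hypothetical shape)."""
    kind: str            # "node", "way" or "relation"
    entity_id: int
    version: int
    timestamp: datetime  # when this version was written
    changeset_id: int    # carried along, but never used for selection


def modified_between(rows, start, end):
    """Return the versions whose timestamp falls in [start, end)."""
    return [r for r in rows if start <= r.timestamp < end]


rows = [
    EntityVersion("node", 1, 1, datetime(2009, 5, 5, 21, 40, 10), 100),
    EntityVersion("node", 2, 1, datetime(2009, 5, 5, 21, 41, 5), 100),
    EntityVersion("way", 7, 3, datetime(2009, 5, 5, 21, 41, 30), 101),
    EntityVersion("node", 3, 1, datetime(2009, 5, 5, 21, 42, 0), 100),
]

# Select everything written during the 21:41 minute.
window = modified_between(
    rows, datetime(2009, 5, 5, 21, 41), datetime(2009, 5, 5, 21, 42)
)
selected_ids = [(r.kind, r.entity_id) for r in window]
```

In this made-up data, changeset 100 has edits in three different minutes,
so it ends up spread over three windows.  The selection never looks at the
changeset id at all.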
> This is also confusing in wording because in svn changeset is a
> transaction, and it's not just SERIALIZABLE but actually SERIALIZED, so
> the word changeset can have a wrong connotation.
>
> I think we have
>
>   uploads == db transactions (perhaps "microchangesets" or "changeset fragments"??)
>
>   changesets == (some group of uploads, with a common id and comment)
>
>   minute diffs == (some collection of uploads)
>
> or maybe we will have
>
>   minute diffs == (some collection of changesets)
>
> but in that case the db created by the minute diff may refer to objects
> which are not present, breaking the integrity guarantees that 0.6 got
> us.
>   
I'm still against the idea of minute diffs being a "collection of 
changesets".  The "collection of uploads" is closer to the mark, 
although uploads are just an API convenience: they have no 
representation in the database and no meaning to osmosis.  Minute 
diffs are really a minimal diff to get from one point in time to another.

To complicate things slightly further, the full history files
http://planet.openstreetmap.org/history/
are similar but contain a full delta from one point in time to another 
and may contain several versions of a single entity.
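
To make the distinction concrete, here is a small sketch.  It assumes that
"minimal" means keeping only the newest version of each entity touched in
the interval, and the function names and dict layout are invented for the
example rather than taken from osmosis:

```python
from datetime import datetime


def full_delta(versions, start, end):
    """Every version written in [start, end), ordered by entity and version."""
    in_window = [v for v in versions if start <= v["timestamp"] < end]
    return sorted(in_window, key=lambda v: (v["kind"], v["id"], v["version"]))


def minimal_diff(versions, start, end):
    """Only the newest version of each entity touched in [start, end)."""
    latest = {}
    for v in full_delta(versions, start, end):
        latest[(v["kind"], v["id"])] = v  # later versions overwrite earlier ones
    return list(latest.values())


versions = [
    {"kind": "node", "id": 1, "version": 1, "timestamp": datetime(2009, 5, 5, 21, 41, 5)},
    {"kind": "node", "id": 1, "version": 2, "timestamp": datetime(2009, 5, 5, 21, 41, 40)},
    {"kind": "way", "id": 7, "version": 3, "timestamp": datetime(2009, 5, 5, 21, 41, 50)},
]
start, end = datetime(2009, 5, 5, 21, 41), datetime(2009, 5, 5, 21, 42)

delta_versions = [(v["kind"], v["id"], v["version"]) for v in full_delta(versions, start, end)]
diff_versions = [(v["kind"], v["id"], v["version"]) for v in minimal_diff(versions, start, end)]
```

Here the delta carries both versions of node 1, while the diff carries only
version 2, which is all a replica needs to reach the end state.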

So perhaps the term "diffs" is the right one for the existing files and 
"deltas" is the right one for full history files.

The reason I've tended to avoid the word "diffs" is because the planet 
directory also contains diffs between planet files.  These diffs are yet 
another way of describing changes/differences and are truly a difference 
between two planet files.
> I don't have a clue about how the uploads are numbered and how easy it
> is to extract all of them, but given that the main DB can have committed
> transactions with uploads that are not part of a closed changeset, I
> think the minute diff and replicated dbs should have that too.
>   
If you're not familiar with it already, please check out the API 
schema.  If information isn't stored there, we can't query it.  For 
example, there is no concept of an upload in the database, the only 
grouping feature it has is changesets.
http://gweb.bretth.com/apidb06-pgsql-latest.sql

Brett




