[OSM-dev] Why are so many changesets so large?

Wed Oct 17 13:53:23 BST 2012

On Wed, 2012-10-17 at 00:28 +0100, Tom Hughes wrote:
> On 17/10/12 00:04, Alex Barth wrote: 
> > - Are there technical reasons why changesets should tend to be 
> > large? Are they expensive on some level?
> 
> I believe it's entirely because we've got so many people doing 
> mechanical or semi-mechanical edits.
> 
> That includes bots but also things like people using xapi or overpass to 
> download all objects matching some set of tags, then change those tags 
> and reupload.

the historical answer to this is that when changesets were added to the
OSM API there were two different intentions for their use which got
conflated: first, that changesets were structures for grouping edits
sharing common attributes. and second, that changesets were VCS-style
'commits' which would be uploaded in a single request and applied
atomically.

effectively, the first use-case was for users, and tried to make
changesets as open-ended as possible. from this, we get tags on
changesets for comments, editor, bot-ness, etc... and the ability to
keep uploading into an open changeset.

the second use-case was a technical thing - the sheer number of API
calls to individual elements, even from normal-sized editing sessions,
could cause problems. and, for small calls, HTTP headers and round-trip
latencies would dominate the cost of an upload. further, editors had to
cope with the situation where an upload failed half-way through and to
re-try the failed calls. from this, we get a single changeset/#id/upload
call which applies atomically.
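
as a rough sketch of what that batching looks like from an editor's side: instead of one HTTP call per element, the edits are packed into a single osmChange document and sent in one request to changeset/#id/upload. the function name, the dict layout for the edits and the example values below are made up for illustration; only the osmChange element names follow the API 0.6 format, and nothing is actually sent over the network here.

```python
import xml.etree.ElementTree as ET


def build_osmchange(changeset_id, modified_nodes):
    """pack many node edits into one osmChange document (returned as a string).

    hypothetical helper: `modified_nodes` is a list of dicts with
    id, version, lat, lon and optional tags.
    """
    root = ET.Element("osmChange", version="0.6")
    modify = ET.SubElement(root, "modify")
    for node in modified_nodes:
        el = ET.SubElement(
            modify,
            "node",
            id=str(node["id"]),
            version=str(node["version"]),
            changeset=str(changeset_id),
            lat=str(node["lat"]),
            lon=str(node["lon"]),
        )
        for key, value in node.get("tags", {}).items():
            ET.SubElement(el, "tag", k=key, v=value)
    return ET.tostring(root, encoding="unicode")


# illustrative data only
edits = [
    {"id": 1, "version": 2, "lat": 51.5, "lon": -0.1, "tags": {"amenity": "pub"}},
    {"id": 2, "version": 5, "lat": 51.6, "lon": -0.2},
]
body = build_osmchange(42, edits)
# `body` would be POSTed once to changeset/42/upload, rather than
# issuing one call per node and re-trying whichever ones failed.
```

the point is that headers and round-trips are paid once per upload, and the server can accept or reject the whole document in one go.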

at the time, this seemed like a good way to satisfy both use-cases. and,
while it does what it set out to, i think we should consider splitting
these in the next API version: explicitly reifying uploads as objects on
which bboxes / coverage sets and change counts can be stored. changesets
can then simply be collections of uploads.
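
to make the split a bit more concrete, here is a minimal sketch of the data model i have in mind. all class and field names are hypothetical (none of this exists in the current API), and a plain bbox tuple stands in for whatever "coverage set" representation would really be used.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# (min_lon, min_lat, max_lon, max_lat); an assumed representation
BBox = Tuple[float, float, float, float]


@dataclass
class Upload:
    """one atomic upload, carrying its own footprint and change count."""
    bbox: BBox
    num_changes: int


@dataclass
class Changeset:
    """user-facing grouping: tags plus a collection of uploads."""
    tags: Dict[str, str] = field(default_factory=dict)
    uploads: List[Upload] = field(default_factory=list)

    def bbox(self) -> Optional[BBox]:
        # the changeset-level bbox is just the union of the upload bboxes
        if not self.uploads:
            return None
        return (
            min(u.bbox[0] for u in self.uploads),
            min(u.bbox[1] for u in self.uploads),
            max(u.bbox[2] for u in self.uploads),
            max(u.bbox[3] for u in self.uploads),
        )

    def num_changes(self) -> int:
        return sum(u.num_changes for u in self.uploads)


# illustrative values: two small uploads a long way apart
cs = Changeset(tags={"comment": "fix pubs"})
cs.uploads.append(Upload(bbox=(-0.2, 51.4, -0.1, 51.5), num_changes=10))
cs.uploads.append(Upload(bbox=(13.3, 52.4, 13.5, 52.6), num_changes=5))
```

in this example the changeset-level bbox spans everything between the two areas, while each upload's own bbox stays small, which is the finer granularity i mean.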

getting to the point: this might to some extent mitigate the "large
changesets" issue, as it would allow bboxes to be collected at a smaller
granularity. however, it wouldn't be a full solution and we'd probably
still need something like OWL to break down the geographic footprint of
changesets further.

cheers,

matt