[OSM-talk] Some measure to prevent duplicate uploads of same data ...

andrzej zaborowski balrogg at gmail.com
Fri Mar 5 23:57:27 GMT 2010


Hi,

On 6 March 2010 00:16, MP <singularita at gmail.com> wrote:
> While API 0.6 has implemented object versioning, preventing
> accidental overwriting of someone else's changes, with the
> introduction of atomic uploads I now see many problems with
> duplicate data.
>
> These often come with data imports, or generally when someone
> uploads new data without modifying any existing data (for example
> when someone traces hundreds of buildings from orthophoto imagery).
>
> Since in JOSM (and possibly other tools) the atomic upload is the
> default method, the user presses the "upload" button and within a
> few seconds all the changes are sent to the server, which then
> starts processing them (this can take some time for larger changes);
> once it is finished, it sends the new node IDs back to the editor.
>
> Unfortunately, sometimes while waiting for the server to process the
> uploaded data the connection times out, so the user sees an error
> message.  Thinking the upload failed, he presses "upload" again,
> pushing a new copy of all the objects to the server.  Later, the
> server wants to return the IDs from the first upload, but nobody is
> listening on the other end anymore.
>
> The ultimate result is sometimes 2 to 4 identical copies of some
> data, sometimes thousands of duplicate nodes and ways.
>
> Suggestion for one possible countermeasure:
>  after the server receives a complete successful atomic upload from
> a user, compute a SHA1, MD5, or other checksum of the uploaded XML.
> Store it, and if the user tries uploading exactly the same thing
> again (because he thinks the upload failed, which is not true), send
> him an error message instead, like: "You have already uploaded this
> data".

This sounds like a good idea to me.  Perhaps it should only be
employed for diff uploads containing only <create>s; in all other
cases a re-upload will fail with a conflict.  An identical measure can
be implemented in a client such as JOSM (see the sketch below).  Only
uploads consisting solely of new objects need the extra caution, but
even for other uploads JOSM could admittedly handle network errors
better, for example by looking at the last open changeset and
retrieving the new IDs and versions of the objects which should have
been in the server response.
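
As a rough illustration of the client-side variant (this is not JOSM
code, and the helper names are made up for the example), something
along these lines would be enough to refuse re-sending a byte-identical
payload:

import hashlib

_last_upload_digest = None

def payload_digest(osmchange_xml):
    # Stable checksum of the serialized osmChange document (bytes).
    return hashlib.sha1(osmchange_xml).hexdigest()

def already_uploaded(osmchange_xml):
    # True if this exact payload already went out in this session.
    return _last_upload_digest == payload_digest(osmchange_xml)

def remember_upload(osmchange_xml):
    # Record the payload once the POST has been sent out completely.
    global _last_upload_digest
    _last_upload_digest = payload_digest(osmchange_xml)

The editor would call already_uploaded() before a retry and only
resend after confirming with the user.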

I have a very experimental script that generates the server response
based on the content uploaded and the corresponding changeset as
downloaded from the API, which I use for bulk uploads, at
http://svn.openstreetmap.org/applications/utils/import/bulkupload/change2diff2.py
It only works if the changeset contains just that single diff, and it
makes other significant assumptions.
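
Very roughly, the idea is something like the following simplified
sketch (not the actual change2diff2.py): pair the <create>d objects of
the uploaded diff, in document order per element type, with the created
objects in the changeset downloaded from
/api/0.6/changeset/<id>/download, and emit a fake <diffResult>:

import sys
import xml.etree.ElementTree as ET

def created(root, elem_type):
    # Created elements of one type, in document order.
    return [e for block in root.findall('create')
              for e in block.findall(elem_type)]

def fake_diff_result(uploaded_diff, changeset_dump):
    up = ET.parse(uploaded_diff).getroot()      # the osmChange we POSTed
    cs = ET.parse(changeset_dump).getroot()     # changeset/<id>/download
    result = ET.Element('diffResult', version='0.6')
    for elem_type in ('node', 'way', 'relation'):
        ours, theirs = created(up, elem_type), created(cs, elem_type)
        if len(ours) != len(theirs):
            raise SystemExit('changeset does not match the upload')
        for sent, stored in zip(ours, theirs):
            ET.SubElement(result, elem_type,
                          old_id=sent.get('id'),            # placeholder id
                          new_id=stored.get('id'),          # server-assigned id
                          new_version=stored.get('version'))
    return ET.tostring(result, encoding='unicode')

if __name__ == '__main__':
    print(fake_diff_result(sys.argv[1], sys.argv[2]))

That obviously breaks down as soon as anything else has touched the
changeset, hence the caveats above.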

Generally, if you're not uploading through a proxy and the diff does
not conflict with existing data (for example because it only creates
new objects), I notice that the upload will always hit the database if
100% of the XML is uploaded, i.e. once the last byte has been sent out
the API never cancels the commit.  If, on the contrary, not all bytes
were sent out, the API will not be able to parse the XML, so the
outcome is deterministic.
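
A client could exploit that determinism after a timeout: before
retrying, download the changeset contents and check whether the
objects actually got committed.  A minimal sketch, using the standard
changeset download call of API 0.6 (upload_committed() and
resend_diff() are just illustrative names):

import urllib.request
import xml.etree.ElementTree as ET

API = 'https://api.openstreetmap.org/api/0.6'

def upload_committed(changeset_id):
    # The changeset download returns an osmChange document with
    # <create>/<modify>/<delete> blocks for everything already stored.
    url = '%s/changeset/%d/download' % (API, changeset_id)
    with urllib.request.urlopen(url) as resp:
        osmchange = ET.parse(resp).getroot()
    return any(len(block) for block in osmchange)

# After a timed-out POST:
# if not upload_committed(cs_id):
#     resend_diff(cs_id)    # only now is it safe to push the diff again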

Cheers



