[OSM-talk] Some measure to prevent duplicate uploads of same data ...
Sebastian Klein
bastikln at googlemail.com
Sat Mar 6 00:11:51 GMT 2010
MP wrote:
> While API 0.6 has implemented object versioning, preventing
> accidental overwriting of someone else's changes, with the introduction
> of atomic uploads I now see many problems with duplicate data.
>
> These often arise with data imports, or generally whenever someone
> uploads new data without modifying any existing data (for example,
> someone tracing hundreds of buildings from orthophoto, or the like ...)
>
> Since in JOSM (and possibly in other tools) the atomic upload is the
> default method, the user presses some "upload" button and within a few
> seconds all the changes are uploaded to the server, which then starts
> processing them (this could take some time for larger changes) and,
> once it is finished, sends the new node IDs back to the editor.
>
> Unfortunately, sometimes while waiting for the server to process the
> uploaded data, the connection will time out, so the user sees an
> error message - thinking the upload failed, he presses "upload"
> again, pushing a new copy of all the objects to the server.
> Later, the server wants to return the IDs from the first upload, but
> nobody is listening on the other end anymore.
>
> The ultimate result is sometimes 2 to 4 identical copies of some
> data, sometimes thousands of duplicate nodes and ways.
>
> Suggestion for one possible countermeasure:
> after the server receives a complete, successful atomic upload from a
> user, compute an SHA1, MD5, or some other checksum of the uploaded XML.
> Store it, and if the user tries uploading exactly the same thing again
> (because he thinks the upload has failed, which is not true), send him
> just an error message instead, like: "You have already uploaded this
> data".
>
> Or alternatively, send the user whatever result there was from the
> last upload (either the new set of IDs, or some error message in case
> the previous upload failed because of some error).
>
> I think perhaps the last 2 or 3 checksums could be stored, in case
> someone has multiple parallel uploads in multiple editors.
>
> Martin
This is really a problem, especially for large data imports. The
solution might not be so easy:
JOSM offers to upload the data in chunks of different sizes, or even
each object separately. If the upload fails (due to a timeout), the user
might vary these parameters, so the checksums become useless.
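Just to make Martin's checksum idea concrete, the server-side part could
look roughly like this (a hypothetical Python sketch with invented names
and storage, not actual API code; as said above, it only helps when the
client re-sends byte-identical XML):

import hashlib
from collections import deque

_recent_uploads = {}   # user_id -> deque of (checksum, previous_response)

def handle_atomic_upload(user_id, upload_xml, process):
    """upload_xml is the raw request body (bytes); process() is whatever
    the server normally does with a diff upload."""
    digest = hashlib.sha1(upload_xml).hexdigest()
    cache = _recent_uploads.setdefault(user_id, deque(maxlen=3))

    # If the very same payload was uploaded before, replay the stored
    # result (the new IDs, or the old error) instead of creating the
    # objects a second time.
    for checksum, response in cache:
        if checksum == digest:
            return response

    response = process(upload_xml)   # the normal diff-upload handling
    cache.append((digest, response))
    return response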
There was a discussion on this topic in the JOSM trac:
http://josm.openstreetmap.de/ticket/4401
It was suggested that a final handshake should be required after the
diff result is sent back from the server. If the client does not
respond, the upload is discarded.
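Sketched from the client side, such a handshake could look roughly like
this (purely hypothetical Python; these endpoints and headers do not
exist in API 0.6, the point is only to show the two-phase shape of the
idea):

import requests

API = "https://api.openstreetmap.org/api/0.7"   # imaginary future version

def upload_with_handshake(changeset_id, osmchange_xml):
    # Phase 1: send the diff. The server prepares the result (the id
    # mapping) but does not make it permanent yet, and returns a token.
    r = requests.post(f"{API}/changeset/{changeset_id}/upload",
                      data=osmchange_xml, timeout=300)
    r.raise_for_status()
    token = r.headers["X-Upload-Token"]          # imaginary header

    # Phase 2: acknowledge that the diff result arrived. Only now does
    # the server commit; uploads that are never acknowledged are rolled
    # back, so a blind retry cannot create duplicates.
    requests.post(f"{API}/changeset/{changeset_id}/confirm/{token}",
                  timeout=60).raise_for_status()
    return r.text                                # the diffResult XML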
It would be nice to have a solution for this in API 0.7, but in the
meantime the editors should learn to handle this in a better way.
The user should be informed that the dataset is in a dirty state and
offered the option to download the changeset. The new objects of the
current dataset should then be matched heuristically (by their
coordinates and tags) against the objects in the changeset.
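Such a heuristic could be as simple as the following sketch (Python for
brevity, although JOSM itself is Java; the data structures here are
invented for illustration):

def match_new_nodes(local_new_nodes, changeset_nodes, tol=1e-7):
    """Pair locally created nodes (still carrying negative ids) with
    nodes already uploaded in the changeset, by position and identical
    tags."""
    matches = {}
    for local in local_new_nodes:
        for remote in changeset_nodes:
            if (abs(local["lat"] - remote["lat"]) < tol
                    and abs(local["lon"] - remote["lon"]) < tol
                    and local["tags"] == remote["tags"]):
                # The local object was in fact uploaded: adopt the
                # server id instead of uploading it a second time.
                matches[local["id"]] = remote["id"]
                break
    return matches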
--
Sebastian