[OSM-dev] 0.6 bulk uploader
Dave Stubbs
osm.list at randomjunk.co.uk
Thu Jan 22 10:25:09 GMT 2009
2009/1/22 Frederik Ramm <frederik at remote.org>:
> Hi,
>
> Shaun McDonald wrote:
>> It would be best if the bulk_import.py script was updated for 0.6. As
>> everything needs to be wrapped into a changeset, it makes the bulk
>> upload more complex than before.
>
> Yes and no... if you're talking about uploads that are small enough to
> fit into one diff upload (i.e. not something like a TIGER county ;-)
> then bulk uploading should become trivial, because you don't even have
> to keep track of the object IDs; you just throw your diff at the server
> and that's it. Such a bulk upload could basically be handled by a shell
> script that makes three lwp-request calls.
>
> Hm, I see that each object in the diff must explicitly reference the
> changeset ID... so that would probably add one "sed" call to the shell
> script ;-)
>
> BTW: It seems that we're not currently imposing an upper limit for the
> number of changes in a diff upload, is that true? If so, we should
> perhaps add such a limit, because the transactionality of diff uploads
> would otherwise make it too easy for a thoughtless script writer to
> mess up our database... The only thing I'm unsure about is whether we
> should simply abort after "n" cycles in the DiffReader.commit method
> (easy to implement, but by the time we abort the database has already
> been loaded unnecessarily), or whether there is perhaps a way to make
> this depend on the size (in bytes) of the upload, which could easily be
> checked before we even start processing it.
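To make the quoted workflow concrete: the three calls Frederik describes
map onto the 0.6 changeset API as create / upload / close. Below is a
minimal sketch in Python rather than shell, assuming HTTP Basic auth, the
third-party requests library, and an osmChange file that uses
changeset="0" as a placeholder for the sed-style substitution; the URL,
credentials and function name are illustrative, not from the thread.

    import requests

    API = "https://api.openstreetmap.org/api/0.6"  # assumed API base URL
    AUTH = ("username", "password")                # assumed Basic auth credentials

    def upload_in_one_changeset(osc_path, comment):
        # Call 1: open a changeset.
        payload = ('<osm><changeset>'
                   '<tag k="created_by" v="bulk_upload_sketch"/>'
                   '<tag k="comment" v="' + comment + '"/>'
                   '</changeset></osm>')
        r = requests.put(API + "/changeset/create", data=payload, auth=AUTH)
        r.raise_for_status()
        csid = r.text.strip()  # the API returns the new changeset ID as plain text

        # The "sed" step: stamp the changeset ID onto every element in the
        # diff, assuming changeset="0" is used as a placeholder.
        diff = open(osc_path).read().replace('changeset="0"',
                                             'changeset="%s"' % csid)

        # Call 2: upload the whole diff in one transaction.
        r = requests.post(API + "/changeset/" + csid + "/upload",
                          data=diff.encode("utf-8"), auth=AUTH)
        r.raise_for_status()

        # Call 3: close the changeset.
        requests.put(API + "/changeset/" + csid + "/close",
                     auth=AUTH).raise_for_status()
        return csid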
Don't forget changeset size limitations.
As I remember it, we decided on something like a 50,000-edit limit to
keep changesets from becoming land mines that take out poor innocent
passers-by as they suddenly find themselves trying to view a 1 GB city
upload.
Last I saw, that limit was being enforced by the API, so any diff
upload that's bigger than 50,000 changes will fail automatically --
just not until Rails runs the validation, probably after the whole diff
has been processed. So the bulk uploader needs to split the data into
useful changesets, not just multiple uploads.
Dave
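Dave's point is that the uploader itself has to chunk the data so that no
single changeset exceeds the limit. A rough sketch of such splitting,
assuming the osmChange file fits in memory and ignoring the ID remapping
needed when a later chunk references placeholder IDs of objects created in
an earlier one; the function name is illustrative.

    import xml.etree.ElementTree as ET

    MAX_CHANGES = 50000  # the per-changeset limit mentioned above

    def split_osmchange(osc_path, limit=MAX_CHANGES):
        # Yield osmChange documents containing at most `limit` changes each,
        # preserving the order of the <create>/<modify>/<delete> blocks.
        root = ET.parse(osc_path).getroot()
        chunk = ET.Element("osmChange", root.attrib)
        current = None   # the block we are currently copying into
        count = 0
        for block in root:           # <create>, <modify> or <delete>
            for elem in block:       # individual <node>/<way>/<relation> changes
                if count == limit:
                    yield chunk
                    chunk = ET.Element("osmChange", root.attrib)
                    current, count = None, 0
                if current is None or current.tag != block.tag:
                    current = ET.SubElement(chunk, block.tag)
                current.append(elem)
                count += 1
        if count:
            yield chunk

Each chunk would then get its own create/upload/close cycle, for example
by serialising it with ET.tostring() and feeding it to an upload helper
like the one sketched earlier in this message.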