[OSM-dev] Ideas for speeding up the TIGER import

Jon Burgess jburgess777 at googlemail.com
Sat Sep 1 17:45:16 BST 2007


On Sat, 2007-09-01 at 18:09 +0200, Martijn van Oosterhout wrote:
> On 9/1/07, Jon Burgess <jburgess777 at googlemail.com> wrote:
> > I think it may be wise to step back a minute and take a look at just how
> > to handle the TIGER data import.
> 
> <snip>
> 
> > Currently the OSM DB has around 36M objects so we are talking about
> > growing the DB by a factor of 20 times. I think we can probably do
> better than simply importing it all via the API on the production
> > system.
> 
> Given the current setup, I don't think running the script on the main
> server will be any faster than it is currently. The bottleneck is the
> database, and nothing will change that.

I agree the current database setup is the bottleneck. Where we differ is
that I think we can fix that, and I think we need to.

We also need to do something soon: even at the present import rate we
will fill all the available disk space on the DB server within the next
few weeks.

I don't want us to have to stop everything while we wait for new
hardware. It might be better to pause the TIGER import and reconsider
our approach.


> > One idea I had to improve the import speed is as follows:
> 
> It is absolutely true that we could import it faster. But I don't think
> we should do something drastic.
> 
> - With the current method we have gradual growth of the DB, which
> means if there's an issue we have more time to deal with it.

Whereas if we had a secondary DB holding just the TIGER data, we could
trivially erase it and start again without worrying about leaving
garbage in the history table (as we have from the previous TIGER
import).
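To make that concrete, a minimal sketch, assuming a MySQL backend; the
"tiger_staging" name and the connection details are illustrative only:

    import MySQLdb

    # Dropping and recreating the staging DB restarts the import from
    # scratch; the production history tables are never touched.
    conn = MySQLdb.connect(host="db", user="osm", passwd="secret")
    cur = conn.cursor()
    cur.execute("DROP DATABASE IF EXISTS tiger_staging")
    cur.execute("CREATE DATABASE tiger_staging")
    conn.close()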

> - The current method is weeding out bugs in the system. TIGER isn't
> going to be the only import and the experience we get now will help
> with later imports.
> - In the long term the speed will come from something like osmosis
> writing to the DB directly. It's not quite ready yet but one day it
> will be.

Sure, I totally agree. Using the production API for this is not the
solution.

I agree that the long-term answer is something like Osmosis, which can
translate OSM->SQL so that we can load the SQL directly into the DB.

What I'm proposing here is a way of doing that conversion now: run an
external rails app on a setup specifically tuned for fast OSM->SQL
conversion, then take the resulting SQL data and load it into the main
DB.
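Roughly, the pipeline would look like the sketch below. This is an
illustration only: the database names are hypothetical, and it assumes
the data ends up in the usual current_* tables of the API schema.

    import subprocess

    STAGING_DB = "osm_staging"    # dedicated box, tuned for fast writes
    PROD_DB = "openstreetmap"     # illustrative name for the main DB
    TABLES = ("current_nodes current_segments current_ways "
              "current_way_tags current_way_segments")

    # Stage 1: TIGER has already been imported into STAGING_DB through a
    # local rails instance; dump the resulting rows as plain SQL (data
    # only, no table definitions).
    subprocess.check_call("mysqldump --no-create-info %s %s > tiger.sql"
                          % (STAGING_DB, TABLES), shell=True)

    # Stage 2: bulk-load the dump straight into the main DB, bypassing
    # the API entirely.
    subprocess.check_call("mysql %s < tiger.sql" % PROD_DB, shell=True)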


> And finally:
> - In terms of bang for buck the biggest benefits are going to come
> from a bulk-upload interface in the server.

I think our present TIGER import process is fine for medium-sized data
sets, just not for TIGER-sized ones. Take the AND data as a more
typical example: it has ~10M objects. The TIGER import is progressing
at ~30 objects/s, which means an AND-sized import would complete in
around 4 days. That seems reasonable to me.
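(For the arithmetic: 10,000,000 objects / 30 objects/s ≈ 333,000 s,
which is a little under 4 days.)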

I don't expect many more TIGER-sized imports. If we do get some, then
it is worth spending some time hand-holding the import process.

Part of the reason I wrote the original email was to alert everyone to
the sheer size of the data we are attempting to pull in. Once the
import is complete, the existing OSM data will make up just 5% of the
combined data set.
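(Using the factor-of-20 growth figure from above: 36M existing objects
out of roughly 720M total is about 5%.)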

We should not underestimate how many things are going to break when we
import all the TIGER data, e.g.:

- the disk space on the DB server will run out;

- the weekly planet export script, which currently takes a few hours to
run, will start taking several days;

- the planet.osm.bz2 file will probably grow to ~7GB (and ~80GB
uncompressed).

Osmosis producing weekly or nightly diffs will hopefully solve the last
two.

	Jon