[OSM-dev] Ideas for speeding up the TIGER import

Sun Sep 2 01:54:14 BST 2007

Tom Hughes wrote:
> In message <46D9A74E.9010900 at arjam.net>
>           "Robert (Jamie) Munro" <rjmunro at arjam.net> wrote:
>
>   
>> Jon Burgess wrote:
>>
>>     
>>> Part of the reason why I wrote the original email was also to try to
>>> alert everyone to the sheer size of the data we are attempting to pull
>>> in. Once complete, the existing OSM data will be just 5% of the combined
>>> data set.
>>>
>>> We should not underestimate how many things are going to get broken by
>>> importing all the TIGER data, e.g.
>>>
>>> - the current disk space in DB.
>>>       
>> Disks are cheap. I think the foundation could easily raise a few
>> thousand pounds and buy a new high-specification DB server. In fact,
>> they may have the money already.
>>     
>
> We do have some money, and a DB server is among the things I am
> speccing at the moment. It's just that I may have underspecced it
> a bit for Tiger ;-)
>
> We're going to hit some sort of database/schema wall long before
> Tiger is finished anyway - we're pretty close now I suspect. That
> isn't something simple hardware upgrades are going to solve either.
>   
I had no idea TIGER was so big, or at least I hadn't thought it through 
properly.  While I think continuing with gradual uploads is a good thing 
to iron out issues with the import, there is almost no way the current 
OSM design can support the full dataset.

To second Tom's comments above, while the upcoming elimination of 
segments is an excellent start, we need to start thinking about more 
drastic changes to the database to support this amount of data.  
Everything from the use of non-transactional MyISAM tables (and even 
MySQL itself) to the current planet creation processes needs to be 
revisited.  The editing API itself may not need to change but 
everything behind it might.

The comments below are just my random thoughts, but are some of the 
first things to come to mind.

It may be necessary to partition data in the database in some way.  Note 
that I'm against the idea of splitting the database by region because I 
think it will create more problems than it solves, but there may be 
other ways of partitioning it.  Splitting history tables chronologically is a 
possibility.  One problem with the current design is that data in the 
history tables is effectively read-only but new data is being added 
online.  This leads to compromises being made such as the current lack 
of transaction support when adding history items.  If there were some 
way of keeping recent history in a modifiable table while regularly 
moving old data out to read-only "log" tables, it might allow more 
efficient updates to be made to these tables.  It might also greatly 
reduce the risk of database corruption, and reduce downtime if 
corruption does occur.
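
To make the recent/archive idea concrete, here is a minimal sketch in 
Python.  It uses SQLite purely as a stand-in so that it runs anywhere; 
the table names (node_history_recent, node_history_archive), the 
columns and the cutoff date are all hypothetical and are not the real 
OSM schema.  The point is only that the copy and the delete happen in 
one transaction, so a failure part way through leaves both tables as 
they were.

```python
import sqlite3

# Hypothetical schema: both tables share the same columns.
COLUMNS = "id INTEGER, version INTEGER, ts TEXT, lat REAL, lon REAL"


def make_demo_db():
    """Create an in-memory database with a live table and an archive table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE node_history_recent ({COLUMNS})")
    conn.execute(f"CREATE TABLE node_history_archive ({COLUMNS})")
    conn.executemany(
        "INSERT INTO node_history_recent VALUES (?, ?, ?, ?, ?)",
        [
            (1, 1, "2006-03-01", 51.5, -0.1),
            (1, 2, "2006-11-20", 51.5, -0.2),
            (2, 1, "2007-08-15", 40.7, -74.0),
        ],
    )
    conn.commit()
    return conn


def archive_old_history(conn, cutoff):
    """Move rows older than `cutoff` (ISO date string) into the archive table.

    Returns the number of rows moved.
    """
    with conn:  # one transaction: both statements commit, or neither does
        conn.execute(
            "INSERT INTO node_history_archive "
            "SELECT * FROM node_history_recent WHERE ts < ?",
            (cutoff,),
        )
        moved = conn.execute(
            "DELETE FROM node_history_recent WHERE ts < ?", (cutoff,)
        ).rowcount
    return moved


def count_rows(conn, table):
    """Return the number of rows in the given table."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

Calling archive_old_history(conn, "2007-01-01") on the demo database 
moves the two 2006 rows and leaves the 2007 row in the live table.  In 
a real deployment the archive tables would presumably be made read-only 
by the database itself; that part is not shown here.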

Another thing to consider is which database we use.  While I generally 
like MySQL, it appears to have its limitations when handling large 
datasets.  The biggest impact I've noticed is that an outage is required 
whenever a relatively minor change is made such as the addition of a new 
index.  "Enterprise" databases such as Oracle don't have this issue.

Planet dumps may also be contenders for partitioning.  Perhaps a 
region-based split is appropriate here, although I'm still 
uncomfortable with region splits.  Chronological splits using 
changesets as produced by osmosis have the potential to assist here.
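
As a rough illustration of the chronological approach, the sketch 
below (Python) rebuilds a current snapshot from an older base snapshot 
plus a list of changes.  It is not osmosis code and does not read any 
real file format; the data layout (a dict keyed by entity id, and 
changes as (action, id, data) tuples) is an assumption made for the 
example.  The create/modify/delete actions are modelled on the ones 
osmosis change files use.

```python
def apply_changes(snapshot, changes):
    """Return a new snapshot with `changes` applied in order.

    snapshot: dict mapping entity id -> entity data (hypothetical layout)
    changes:  iterable of (action, entity_id, data) tuples, where action is
              "create", "modify" or "delete"; data is ignored for deletes
    """
    result = dict(snapshot)  # leave the base snapshot untouched
    for action, entity_id, data in changes:
        if action in ("create", "modify"):
            result[entity_id] = data
        elif action == "delete":
            result.pop(entity_id, None)
        else:
            raise ValueError(f"unknown action: {action!r}")
    return result
```

With something like this, a consumer would only need to download a 
full planet occasionally and fetch the much smaller change files in 
between.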

As far as osmosis is concerned, much of it may need to be revisited as 
well.  The current polygon task I'm looking at will be useless on a 
complete dataset; it would be necessary to query by bounding box on the 
database itself first.  Hopefully the database synchronisation tasks 
will scale appropriately.
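
The bounding-box-first idea can be sketched as follows (Python, not 
osmosis code).  The list comprehension in nodes_in_polygon stands in 
for the bounding box query that would really be run on the database; 
the function names and the (id, x, y) node layout are assumptions made 
for the example.  Only the nodes that survive the cheap bounding box 
filter go through the more expensive point-in-polygon test, which here 
is a standard ray-casting check.

```python
def bounding_box(polygon):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def point_in_polygon(x, y, polygon):
    """Ray-casting test: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def nodes_in_polygon(nodes, polygon):
    """Return ids of nodes inside `polygon`; nodes are (id, x, y) tuples."""
    min_x, min_y, max_x, max_y = bounding_box(polygon)
    # Stand-in for a bounding box query against the database.
    candidates = [
        (node_id, x, y)
        for node_id, x, y in nodes
        if min_x <= x <= max_x and min_y <= y <= max_y
    ]
    # Exact test only on the (hopefully small) candidate set.
    return [node_id for node_id, x, y in candidates
            if point_in_polygon(x, y, polygon)]
```

For a triangle with vertices (0, 0), (4, 0) and (0, 4), a node at 
(10, 10) is dropped by the bounding box filter, a node at (3, 3) passes 
the filter but fails the polygon test, and a node at (1, 1) is kept.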