[OSM-dev] Ideas for speeding up the TIGER import
Brett Henderson
brett at bretth.com
Sun Sep 2 01:54:14 BST 2007
Tom Hughes wrote:
> In message <46D9A74E.9010900 at arjam.net>
> "Robert (Jamie) Munro" <rjmunro at arjam.net> wrote:
>
>
>> Jon Burgess wrote:
>>
>>
>>> Part of the reason why I wrote the original email was also to try to
>>> alert everyone to the sheer size of the data we are attempting to pull
>>> in. Once complete, the existing OSM data will be just 5% of the combined
>>> data set.
>>>
>>> We should not under estimate how many things are going to get broken by
>>> importing all the tiger data, e.g.
>>>
>>> - the current disk space in DB.
>>>
>> Disks are cheap. I think the foundation could easily raise a few
>> thousand pounds and buy a new high-specification DB server. In fact,
>> they may have the money already.
>>
>
> We do have some money, and a DB server is among the things I am
> speccing at the moment. It's just that I may have underspecced it
> a bit for Tiger ;-)
>
> We're going to hit some sort of database/schema wall long before
> Tiger is finished anyway - we're pretty close now I suspect. That
> isn't something simple hardware upgrades is going to solve either.
>
I had no idea TIGER was so big, or at least I hadn't thought it through
properly. While I think continuing with gradual uploads is a good thing
to iron out issues with the import, there is almost no way the current
OSM design can support the full dataset.
To second Tom's comments above, while the upcoming elimination of
segments is an excellent start, we need to start thinking about more
drastic changes to the database to support this amount of data.
Everything from the use of non-transactional MyISAM tables (and even MySQL
itself) to the current planet creation processes needs to be revisited.
The editing API itself may not need to change but everything behind it
might.
The comments below are just my random thoughts, but are some of the
first things to come to mind.
It may be necessary to partition data in the database in some way. Note
that I'm against the idea of splitting the database by region, because I
think it will create more problems than it solves, but there may be other
ways of partitioning it. Splitting history tables chronologically is a
possibility. One problem with the current design is that data in the
history tables is effectively read-only but new data is being added
online. This leads to compromises being made such as the current lack
of transaction support when adding history items. If there were some way
of keeping recent history in a modifiable table while regularly moving old
data out to read-only "log" tables, it might allow more efficient updates
to be made to these tables. It might also greatly reduce the risk of database
corruption, and reduce downtime if corruption does occur.
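To illustrate the recent/log split described above, here is a minimal sketch. It uses SQLite through Python purely so that it is self-contained; the table names node_history_recent and node_history_log, the ts column, and the function archive_old_history are all hypothetical and are not part of the actual OSM schema. The only point it tries to show is that the copy and the delete happen inside one transaction.

```python
import sqlite3


def archive_old_history(conn, cutoff):
    """Move history rows older than `cutoff` into the log table.

    Table and column names are hypothetical, for illustration only.
    Returns the number of rows removed from the recent table.
    """
    # "with conn" wraps both statements in one transaction: either the
    # copy and the delete both commit, or both are rolled back.
    with conn:
        conn.execute(
            "INSERT INTO node_history_log "
            "SELECT * FROM node_history_recent WHERE ts < ?",
            (cutoff,),
        )
        cur = conn.execute(
            "DELETE FROM node_history_recent WHERE ts < ?", (cutoff,)
        )
    return cur.rowcount


# Small in-memory demonstration with made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE node_history_recent
        (id INTEGER, version INTEGER, ts TEXT, lat REAL, lon REAL);
    CREATE TABLE node_history_log
        (id INTEGER, version INTEGER, ts TEXT, lat REAL, lon REAL);
    """
)
conn.executemany(
    "INSERT INTO node_history_recent VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "2006-03-01", 51.50, -0.10),
        (1, 2, "2006-11-15", 51.51, -0.10),
        (1, 3, "2007-08-30", 51.52, -0.11),
    ],
)
conn.commit()

# ISO-formatted dates compare correctly as plain strings.
moved = archive_old_history(conn, "2007-01-01")
```

Once rows are in the log table nothing needs to update them again, so in a real deployment that table could be backed up once and indexed or stored differently from the table taking live writes.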
Another thing to consider is which database we use. While I generally
like MySQL, it appears to have its limitations when handling large
datasets. The biggest impact I've noticed is that an outage is required
whenever a relatively minor change is made, such as the addition of a new
index. "Enterprise" databases such as Oracle don't have this issue.
Planet dumps may also be contenders for partitioning. Perhaps a
region-based split is appropriate here, although I'm still uncomfortable with
region splits. Chronological splits using changesets as produced by
osmosis have the potential to assist here.
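A rough sketch of the chronological idea: publish a full snapshot only occasionally, and in between publish small changesets that consumers replay on top of it. The sketch below assumes a changeset is simply an ordered list of create, modify and delete actions; the tuple layout and the function name apply_changeset are hypothetical and this is not the osmosis file format or API.

```python
def apply_changeset(snapshot, changeset):
    """Return a new snapshot with the changeset applied.

    snapshot:  dict mapping entity id -> entity data
    changeset: ordered list of (action, entity_id, data) tuples, where
               action is "create", "modify" or "delete" (hypothetical layout)
    """
    result = dict(snapshot)  # leave the input snapshot untouched
    for action, entity_id, data in changeset:
        if action in ("create", "modify"):
            result[entity_id] = data
        elif action == "delete":
            result.pop(entity_id, None)
        else:
            raise ValueError("unknown action: %r" % (action,))
    return result


# A base dump plus two later changesets, replayed in order.
base = {1: {"lat": 51.5, "lon": -0.1}, 2: {"lat": 48.8, "lon": 2.3}}
week1 = [("modify", 1, {"lat": 51.6, "lon": -0.1}),
         ("create", 3, {"lat": 40.7, "lon": -74.0})]
week2 = [("delete", 2, None)]

current = base
for changes in (week1, week2):
    current = apply_changeset(current, changes)
```

With this arrangement the amount of data a consumer has to download scales with how much changed since the last dump, not with the size of the whole planet.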
As far as osmosis is concerned, much of it may need to be revisited as
well. The current polygon task I'm looking at will be useless on a
complete dataset; it would be necessary to query by bounding box on the
database itself first. Hopefully the database synchronisation tasks
will scale appropriately.
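The bounding box idea can be sketched as a two-stage filter: a cheap rectangle test first (the part that would be pushed down to the database as an indexed range query), then the exact polygon test on the few candidates that survive. Everything below is a hypothetical in-memory illustration, not the osmosis polygon task; the function names and the (id, (lon, lat)) node layout are assumptions.

```python
def bounding_box(polygon):
    """Return (min_lon, min_lat, max_lon, max_lat) for a list of (lon, lat)."""
    lons = [p[0] for p in polygon]
    lats = [p[1] for p in polygon]
    return min(lons), min(lats), max(lons), max(lats)


def in_bbox(point, bbox):
    lon, lat = point
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def in_polygon(point, polygon):
    """Ray-casting test: count edge crossings of a ray going in +x."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def nodes_in_polygon(nodes, polygon):
    """nodes: iterable of (node_id, (lon, lat)). Returns matching ids."""
    bbox = bounding_box(polygon)
    # Stage 1: cheap rectangle filter. On a full dataset this step would be
    # a bounding box query against the database rather than a Python loop.
    candidates = [(nid, pt) for nid, pt in nodes if in_bbox(pt, bbox)]
    # Stage 2: exact test, run only on the candidates.
    return [nid for nid, pt in candidates if in_polygon(pt, polygon)]


triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
nodes = [
    (1, (1.0, 1.0)),    # inside the triangle
    (2, (3.0, 3.0)),    # inside the bounding box, outside the triangle
    (3, (10.0, 10.0)),  # outside the bounding box entirely
]
selected = nodes_in_polygon(nodes, triangle)
```

The expensive polygon test then only ever sees data near the area of interest, which is what matters once the polygon covers a tiny fraction of the dataset.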