[OSM-dev] improving TIGER-to-OSM memory usage/performance

Nick Galbreath nickg at modp.com
Sat Sep 15 20:09:19 BST 2007


Hi everyone,

This is my first post to dev.

Recently I've been working on speeding up USA TIGER data conversion to
OSM format.  I was able to take Dave Hansen's code
(http://wiki.openstreetmap.org/index.php/TIGER) and run it on my home
Mac, with minimal issues.  nice!

But I'm a bit late to the game: he has already generated the data.
However, he suspects the OSM data will need to be regenerated, so
hopefully this is still useful.

The main issue I noticed with the Ruby code is that it eats a lot of
memory.  It runs pretty fast on small data sets (under a minute), but
on large data sets it quickly runs out of memory, starts swapping, and
goes painfully slow.  So I focused on improving memory usage.  The new
code is attached (hopefully attachments work):

NOTES:

1) The current code reads in entire files, processes them, then
deletes the data.  While this is very clear to read, it requires a lot
of memory up front.  The new code reads in the data line at a time and
processes it.  It's more of a streaming model.
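The difference between the two models can be sketched like this (a
minimal illustration, not the actual tiger.rb code; the filename and
method name are made up):

```ruby
# Whole-file model: the entire file is read into memory up front.
#   records = File.read("TGR06001.RT1").split("\n")
#
# Streaming model: only one line is resident at a time, so peak memory
# stays flat no matter how big the file is.
def each_record(path)
  File.open(path, "r") do |f|
    f.each_line do |line|
      yield line.chomp
    end
  end
end
```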

2) Ruby uses a garbage collection memory management system, which
means it might have "trouble" when a large number of long-lived
objects are in memory.  I wrote up a little something on this here:

http://blog.modp.com/2007/09/scripting-languages-and-memory.html

Perl/Python/etc. (which use reference counting) might suck even more,
or they might be much better -- I don't know.  You'll see the problem
with GC'ed systems in the next few points.  (BTW, I'm not picking on
Ruby.  I like Ruby.  Java would have the same problem.)

3) The current code takes each line and chops it up into its fields
right away.  Normally that's fine, but here, given the number of
records and objects, the object/string overhead is actually _larger_
than the data itself!  To fix this, we use a "lazy lookup", where we
store the raw line and do a string slice only when a field is
requested.  This reduces the cost (in RT1) from 47 objects to 2
objects per record.  I wrote up the details on this here:

http://blog.modp.com/2007/09/processing-fixed-length-records.html
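The lazy-lookup idea in miniature (a sketch only -- the field offsets
below are made up for illustration and are NOT the real TIGER RT1
layout):

```ruby
# Lazy fixed-width record: keep the raw line, slice a field only when
# it is asked for.  Per record this allocates just two objects (the
# wrapper and the raw string) instead of one string per field.
class LazyRecord
  # name => [offset, length] -- hypothetical offsets, not real RT1
  FIELDS = {
    :rt     => [0, 1],
    :tlid   => [5, 10],
    :fename => [19, 30],
  }

  def initialize(line)
    @line = line
  end

  def [](name)
    off, len = FIELDS[name]
    @line[off, len].strip
  end
end
```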


In addition, I used specialized classes for each of the RT files.
They could be made a lot better -- right now there is a good bit of
cut-and-paste.  I'm not a Ruby expert, really!  I just hacked up Ruby
for this project.
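One possible way to factor out the cut-and-paste would be to generate
each RT class from a field table with a tiny bit of metaprogramming
(just a sketch of the idea; class names and offsets are hypothetical):

```ruby
# Shared base: subclasses declare fields as (name, offset, length) and
# get a lazy slicing accessor for free, instead of duplicating the
# accessor code in every RT class.
class FixedWidthRecord
  def initialize(line)
    @line = line
  end

  def self.field(name, offset, length)
    define_method(name) { @line[offset, length].strip }
  end
end

class RT1Record < FixedWidthRecord
  field :tlid, 5, 10    # hypothetical offset/length
end

class RT2Record < FixedWidthRecord
  field :tlid, 5, 10    # same declaration, no pasted accessor code
end
```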

--------
Results:

output should be identical
performance is as good, if not better
memory usage is hundreds of megabytes less on large files

With this patch, I was finally able to process some of the larger
california counties on my 1G machine.

It looks like the output-generation code could also be made
"streaming" and save hundreds more megabytes.  This, however, might
change the order of the output, so verification would be a bit harder.

---------
NOTES TO DAVE:

Hi dave..

the main change is in tiger.rb.  The 'diff' will look horrible, but if
you look at the code it should be mostly familiar.  Lots of potential
for cleanup.

I made some minor changes to tiger-zip-to-osm.sh that allow it to work
on other Unix/Linux variants.

The changes should be current to version 0.7

It should work with
---------

thanks all,

I hope you find this useful.

COMMENT, TIPS, IMPROVEMENTS WELCOME!

--nickg
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tiger.rb
Type: text/x-ruby-script
Size: 13654 bytes
Desc: not available
URL: <http://lists.openstreetmap.org/pipermail/dev/attachments/20070915/0dfa3962/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tiger-zip-to-osm.sh
Type: application/x-sh
Size: 3601 bytes
Desc: not available
URL: <http://lists.openstreetmap.org/pipermail/dev/attachments/20070915/0dfa3962/attachment.sh>
