[osmosis-dev] Osmosis + Hadoop (was: Re: Changes to Osmosis Pgsql Schema)

Sat Aug 7 19:15:58 BST 2010

On Sat, Aug 07, 2010 at 05:22:28PM +0200, Lars Francke wrote:
> Hi,
> 
> > 2: I am beginning a project to parallelize OSM data processing
> > with Hadoop, and the postgreSQL copy-format output is perfect
> > for loading into HDFS. (If this goes well, I'd want to discuss
> > ideas for adapting Osmosis to talk to Hadoop, eventually.)
> 
> that is very interesting. I'm doing the same with great success (I've
> recently written about it[1]) and I'm currently putting the final
> touches on a HBase patch[2] to allow bulk loading of OSM data into
> HBase.
> 
> Just as a heads up: If you're using Hive the PostgreSQL copy-format is
> unfortunately not perfect as the output of boolean columns is not
> recognized by Hive ('t' and 'f') resulting in NULL columns.
> 
> Would you mind sharing a few of your ideas and use-cases (in regards
> to OSM(osis) and Hadoop). What exactly do you mean by "Hadoop" and how
> do you think Osmosis could help here?
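
On the 't'/'f' booleans: a minimal sketch of how copy-format rows might be
rewritten before loading, assuming the target wants the literal strings
"true" and "false". The function name fix_booleans and the bool_columns
parameter are made up for illustration. Which columns hold booleans depends
on the table, so the caller has to supply the indexes.

```python
def fix_booleans(line, bool_columns):
    """Rewrite PostgreSQL COPY booleans ('t'/'f') as 'true'/'false'.

    line         -- one tab-delimited row in COPY text format
    bool_columns -- zero-based indexes of the boolean columns (assumption:
                    the caller knows the table layout)
    """
    fields = line.rstrip("\n").split("\t")
    for i in bool_columns:
        if i >= len(fields):
            continue  # row is shorter than expected; leave it alone
        if fields[i] == "t":
            fields[i] = "true"
        elif fields[i] == "f":
            fields[i] = "false"
        # anything else (e.g. \N for NULL) is passed through unchanged
    return "\t".join(fields)


row = "42\tt\tsome text\n"
fixed = fix_booleans(row, [1])
```

Only the listed columns are touched, so a text column that happens to
contain a bare "t" or "f" is left as it is.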

Firstly, I should point out that I only learned of Map/Reduce and Hadoop 
within the past two weeks, and I don't know Java (yet), so I've only 
gotten as far as some thought experiments.

The "easy" use case would be as a fast replacement/preprocessor for TagStat, 
i.e. frequency counts of tags.  An enhancement would be the ability to 
report those by geographic area, or better yet, by a user's native language.
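
As a thought experiment only, the tag counting could be sketched as a map
step and a reduce step. This is plain Python, not Hadoop code, and it
assumes each input line is a tab-delimited row of (element id, key, value);
the real Osmosis output may be laid out differently. The names map_tags and
reduce_counts are made up for illustration.

```python
from collections import Counter


def map_tags(lines):
    """Map step: emit (tag key, 1) for every tag row."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip rows that do not match the assumed layout
        yield fields[1], 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per tag key."""
    totals = Counter()
    for key, n in pairs:
        totals[key] += n
    return totals


# Hypothetical sample rows: id, key, value
sample = [
    "1\thighway\tresidential\n",
    "2\thighway\tprimary\n",
    "2\tname\tMain Street\n",
]
counts = reduce_counts(map_tags(sample))
```

Reporting by geographic area would mean emitting something like an
(area, key) pair from the map step instead of the key alone.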

My original thought was simply to rapidly create the feature geometries for 
import to postGIS, for research and for quick setup of test/development 
environments.  I found a Master's thesis and some code on-line where the 
author used the Java Topology Suite in a study of parallel GIS processing [1]; 
unfortunately he seems not to have learned how to do joins, either natively 
or with Hive or Pig, and probably had never encountered HBase before 
submitting his work.

If you look at the rate of growth of the OSM data [2], and look at the 
work we have to do in order to make postGIS handle what we have now [3][4], 
I think the handwriting on the wall is telling us that parallel processing 
is the only way we'll be able to scale, especially as we gain exposure 
through efforts like Bing's and MapQuest/AOL Local's.

So... my pie in the sky is to see Mapnik work with HBase and be 
able to scale out the rendering as much as we need, and vastly 
reduce/eliminate the need for postGIS.

I also intend to explore how we could use these techniques for 
error checking, and maybe as an aid in processing imports.

Now... I shall go read your notes :)

[1] http://www.nathankerr.com/projects/parallel-gis-processing/
[2] http://wiki.openstreetmap.org/wiki/Stats
[3] http://wiki.openstreetmap.org/wiki/SotM_2010_session:Tuning_the_Mapnik_Rendering_Chain
[4] http://www.slideshare.net/loffenauer/nogago-distributed-bulk-rendering