[osmosis-dev] Osmosis + Hadoop (was: Re: Changes to Osmosis Pgsql Schema)
dlc at halibut.com
Sat Aug 7 19:15:58 BST 2010
On Sat, Aug 07, 2010 at 05:22:28PM +0200, Lars Francke wrote:
> > 2: I am beginning a project to parallelize OSM data processing
> > with Hadoop, and the postgreSQL copy-format output is perfect
> > for loading into HDFS. (If this goes well, I'd want to discuss
> > ideas for adapting Osmosis to talk to Hadoop, eventually.)
> that is very interesting. I'm doing the same with great success (I've
> recently written about it) and I'm currently putting the final
> touches on a HBase patch to allow bulk loading of OSM data into
> Just as a heads up: If you're using Hive the PostgreSQL copy-format is
> unfortunately not perfect as the output of boolean columns is not
> recognized by Hive ('t' and 'f') resulting in NULL columns.
> Would you mind sharing a few of your ideas and use-cases (in regards
> to OSM(osis) and Hadoop). What exactly do you mean by "Hadoop" and how
> do you think Osmosis could help here?
Firstly, I should point out that I only learned of Map/Reduce and Hadoop
within the past two weeks, and I don't know Java (yet), so I've only
gotten as far as some thought experiments.
The "easy" use case would be as a fast replacement/preprocessor for TagStat,
i.e. frequency counts of tags. An enhancement would be the ability to
report those by geographic area, or better yet, a user's native language.
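As a rough illustration of the tag-count idea, here is a minimal sketch in
Python of the map and reduce steps, run locally rather than on Hadoop. It
assumes each input line is a tab-separated row of (element id, tag key, tag
value), which is my guess at what a tag table looks like in copy-format
output; the function names are made up for the example.

```python
from itertools import groupby


def map_tags(lines):
    """Map step: emit (tag_key, 1) for every well-formed tag row."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip rows that do not match the assumed layout
        _element_id, key, _value = fields
        yield key, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts for each key.

    Sorting stands in for the shuffle phase, which would normally
    deliver all pairs with the same key to one reducer.
    """
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


def tag_frequencies(lines):
    """Run map then reduce over an iterable of copy-format lines."""
    return dict(reduce_counts(map_tags(lines)))
```

Splitting on the tab character should be safe here because the copy text
format escapes literal tabs inside values. Reporting by geographic area
would presumably mean emitting something like (area, key) as the map key
instead of the tag key alone.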
My original thought was simply to rapidly create the feature geometries for
import to PostGIS, for research and for quick setup of test/development
environments. I found a Master's thesis and some code on-line where the
author used the Java Topology Suite in a study of parallel GIS processing;
unfortunately he seems not to have learned how to do joins, either natively
or with Hive or Pig, and probably had never encountered HBase before
submitting his work.
If you look at the rate of growth of the OSM data, and look at the
work we have to do in order to make PostGIS handle what we have now,
I think the handwriting on the wall is telling us that parallel processing
is the only way we'll be able to scale, especially as we gain exposure
through efforts like Bing's and MapQuest/AOL Local's.
So... my pie in the sky is to see Mapnik work with HBase and be
able to scale out the rendering as much as we need, and vastly
reduce/eliminate the need for PostGIS.
I also intend to explore how we could use these techniques for
error checking, and maybe an aid in processing imports.
Now... I shall go read your notes :)