[OSM-dev] New app for massaging osm data

Thu Jun 21 10:25:57 BST 2007

Hi All,

Over the last few weeks I've written a command line java app for 
processing OSM data.  It started out as a way of speeding up the import 
of a planet file into a local database and grew from there.

Apologies in advance for the huge email but hopefully there's enough in 
here to pique interest (or otherwise).  There may be a complete lack of 
interest if there are already better alternatives to most of this.

The entire codebase is just over 10000 lines so I can provide more info 
on the structure if required.  A copy of the package (including jar, 
javadocs and source) is available here:
http://www.bretth.com/osmosis/osmosis-latest.zip

The subversion repository is available here:
https://www.bretth.com/repos/main/osmosis/

Much of the contents of this email is available here:
https://www.bretth.com/wiki/Wiki.jsp?page=OpenStreetMap

It is all running off my home ADSL connection so if there are any 
problems getting access I'll find somewhere more appropriate to put it.

The tool consists of a series of pluggable components that can be 
chained together to perform a larger operation.  For example, it has 
components for reading from database and from file, components for 
writing to database and to file, components for deriving and applying 
change sets to data sources, components for sorting data, etc.  The 
number of features is relatively small at the moment but it's intended 
to be easy to add new features without re-writing common tasks such as 
file or database handling.
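
As a rough illustration of the pluggable-component idea, here is a minimal sketch of sources, sinks and a middle stage chained together.  All of the names (ElementSink, ElementSource, ListSource, PrefixFilter, CollectingSink) are hypothetical and are not taken from the Osmosis source, and elements are reduced to plain strings to keep it short.

```java
import java.util.ArrayList;
import java.util.List;

// All names here are hypothetical; they are not taken from the Osmosis source.

/** Anything that can receive a stream of elements. */
interface ElementSink {
    void process(String element);
    void complete();
}

/** Anything that produces elements and pushes them into a connected sink. */
interface ElementSource {
    void setSink(ElementSink sink);
}

/** Stands in for a file or database reader. */
class ListSource implements ElementSource {
    private final List<String> elements;
    private ElementSink sink;

    ListSource(List<String> elements) { this.elements = elements; }

    public void setSink(ElementSink sink) { this.sink = sink; }

    /** Pushes every element downstream, then signals the end of the stream. */
    void run() {
        for (String element : elements) {
            sink.process(element);
        }
        sink.complete();
    }
}

/** A middle stage: consumes a stream and forwards only matching elements. */
class PrefixFilter implements ElementSink, ElementSource {
    private final String prefix;
    private ElementSink sink;

    PrefixFilter(String prefix) { this.prefix = prefix; }

    public void setSink(ElementSink sink) { this.sink = sink; }

    public void process(String element) {
        if (element.startsWith(prefix)) {
            sink.process(element);
        }
    }

    public void complete() { sink.complete(); }
}

/** Stands in for a file or database writer. */
class CollectingSink implements ElementSink {
    final List<String> elements = new ArrayList<>();
    boolean completed;

    public void process(String element) { elements.add(element); }

    public void complete() { completed = true; }
}
```

Because each stage only knows about the sink it pushes into, a new component can be dropped into the chain without touching the readers or writers on either side.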

Some of the things it can perform with the current feature set are:
* dump a database
* load a database
* produce a diff file between two data sources (either database, file, 
or both)
* apply a diff to a data source
* sort data in a variety of ways using a file based sort algorithm to 
minimise memory
* extract data from a bounding box
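
For the bounding box item, the core test is small enough to sketch.  This assumes the box edges are inclusive and only covers the check for a single node position; the class name is made up, and how segments and ways that cross the edge are treated is not covered here.

```java
// Hypothetical sketch; the class name and the inclusive-edge rule are assumptions.
class BoundingBoxSketch {
    final double left;
    final double right;
    final double top;
    final double bottom;

    BoundingBoxSketch(double left, double right, double top, double bottom) {
        this.left = left;
        this.right = right;
        this.top = top;
        this.bottom = bottom;
    }

    /** Returns true if the given position lies inside or on the edge of the box. */
    boolean contains(double longitude, double latitude) {
        return longitude >= left && longitude <= right
                && latitude >= bottom && latitude <= top;
    }
}
```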

Note that all features, including sorting, are performed in a "streamy" 
fashion, and memory usage remains low at all times.
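
One common way to sort a stream without holding it all in memory is an external merge sort: sort small chunks, write each to a temporary file, then merge the files.  The sketch below shows that technique only; it is not necessarily the algorithm the tool uses, the class and method names are invented, and elements are reduced to single-line strings compared alphabetically.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Consumer;

// Hypothetical sketch of an external merge sort; not the Osmosis implementation.
class FileBasedSortSketch {

    /** One open chunk file together with the line currently at its head. */
    private static final class ChunkCursor implements Comparable<ChunkCursor> {
        final BufferedReader reader;
        String head;

        ChunkCursor(Path chunk) throws IOException {
            reader = Files.newBufferedReader(chunk);
            head = reader.readLine();
        }

        public int compareTo(ChunkCursor other) {
            return head.compareTo(other.head);
        }
    }

    /** Sorts the input while keeping at most chunkSize elements in memory during the split phase. */
    static void sort(Iterator<String> input, int chunkSize, Consumer<String> output) {
        try {
            // Phase 1: split the stream into sorted chunks on disk.
            List<Path> chunks = new ArrayList<>();
            List<String> buffer = new ArrayList<>();
            while (input.hasNext()) {
                buffer.add(input.next());
                if (buffer.size() >= chunkSize) {
                    chunks.add(writeChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunks.add(writeChunk(buffer));
            }

            // Phase 2: merge, holding only one line per chunk in memory.
            PriorityQueue<ChunkCursor> queue = new PriorityQueue<>();
            for (Path chunk : chunks) {
                queue.add(new ChunkCursor(chunk));
            }
            while (!queue.isEmpty()) {
                ChunkCursor cursor = queue.poll();
                output.accept(cursor.head);
                cursor.head = cursor.reader.readLine();
                if (cursor.head != null) {
                    queue.add(cursor);
                } else {
                    cursor.reader.close();
                }
            }
            for (Path chunk : chunks) {
                Files.delete(chunk);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sorts one buffer in memory and writes it to a temporary file, one element per line. */
    private static Path writeChunk(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path chunk = Files.createTempFile("sort-chunk", ".txt");
        Files.write(chunk, buffer);
        return chunk;
    }
}
```

The output goes to a callback rather than a returned list so that the sorted stream can be handed straight to the next stage.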

Most of these current features can already be performed by other tools, 
but there are a couple of possible advantages to this tool.
* At last check, a complete planet import took around 75 minutes versus 
approx 8 hours for the planetosm-to-db.pl script.
* The differencing algorithm works at an object level allowing diffs and 
merges to be performed against any data source, not just planet.osm files.
* The difference format allows changes to be applied to a database (the 
component for this is not yet written) instead of just planet files.  
This allows slave databases 
to be maintained for purposes such as rendering without requiring 
complete imports every time.  These synchronisations could be performed 
as regularly as required without significant overhead to the primary server.
* Many tasks can be chained together allowing a number of transforms or 
operations to be combined.
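
To give a feel for object-level differencing, here is a minimal sketch that walks two streams in step and emits create, modify and delete changes.  It assumes both inputs are already sorted by id and reduces each object to an id plus a data string; the names (ElementSketch, ChangeDeriverSketch) and the string form of the changes are invented for illustration and do not reflect the actual change format.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch; names and the string change format are assumptions.
class ElementSketch {
    final long id;
    final String data;

    ElementSketch(long id, String data) {
        this.id = id;
        this.data = data;
    }
}

class ChangeDeriverSketch {

    /** Compares two id-sorted streams and lists what must change to turn "from" into "to". */
    static List<String> derive(Iterator<ElementSketch> from, Iterator<ElementSketch> to) {
        List<String> changes = new ArrayList<>();
        ElementSketch a = next(from);
        ElementSketch b = next(to);
        while (a != null || b != null) {
            if (b == null || (a != null && a.id < b.id)) {
                // Present only in the old data.
                changes.add("delete " + a.id);
                a = next(from);
            } else if (a == null || b.id < a.id) {
                // Present only in the new data.
                changes.add("create " + b.id);
                b = next(to);
            } else {
                // Same id on both sides: a change only if the content differs.
                if (!a.data.equals(b.data)) {
                    changes.add("modify " + b.id);
                }
                a = next(from);
                b = next(to);
            }
        }
        return changes;
    }

    private static ElementSketch next(Iterator<ElementSketch> it) {
        return it.hasNext() ? it.next() : null;
    }
}
```

Since only one object from each side is held at a time, this works the same whether the streams come from files or from a database.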

There are a few features on my todo list:
* Upgrade to support 0.4.  This works with a 0.3 schema but will 
presumably require changes since the Rails deployment.
* Fix date handling.  This shouldn't be difficult but different osm 
files appeared to be using different date formats so dates are currently 
being ignored.
* Write a component for reading and writing changesets to databases 
directly using table history instead of comparing two complete data sets 
(ie. full planet files or complete table reads).  This would vastly 
improve performance and make master-slave database replication more 
feasible.
* Write a component allowing regular expression based updates to data.
* Write unit tests ...

I'll try to illustrate with some examples (the complete java command 
line is omitted for brevity):
IMPORT PLANET
Osmosis --read-xml file="planet.osm" --write-mysql host="x" database="x" 
user="x" password="x"

EXPORT PLANET
Osmosis --read-mysql host="x" database="x" user="x" password="x" 
--write-xml file="planet.osm"

GENERATE DIFF BETWEEN PLANETS
Osmosis --read-xml file="planet1.osm" --read-xml file="planet2.osm" 
--derive-change --write-xml-change file="planetdiff-1-2.osc"

GENERATE DIFF BETWEEN PLANET AND DATABASE
Osmosis --read-xml file="planet1.osm" --read-mysql host="x" database="x" 
user="x" password="x" --derive-change --write-xml-change 
file="planetdiff-1-2.osc"

APPLY DIFF TO PLANET
Osmosis --read-xml file="planet1.osm" --read-xml-change 
file="planetdiff-1-2.osc" --apply-change --write-xml file="planet2.osm"

SORT CONTENTS OF OSM FILE
Osmosis --read-xml file="data.osm" --sort type="TypeThenId" --write-xml 
file="data-sorted.osm"

The above examples make use of the default pipe connection feature.  
However, a simple read and write planet file command line could be 
written in two ways.  The first example below uses the default pipe 
connection; the second explicitly connects the two components using a 
pipe named "mypipe".  The default pipe connection will always work so 
long as each task is specified in the correct order.
Osmosis --read-xml file="planetin.osm" --write-xml file="planetout.osm"
Osmosis --read-xml file="planetin.osm" outPipe.0="mypipe" --write-xml 
file="planetout.osm" inPipe.0="mypipe"
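
All of the command lines above share one shape: a task name starting with "--" followed by key=value arguments that belong to it.  A minimal sketch of grouping such an argument list into tasks might look like the following.  The class names are hypothetical, this is not the actual Osmosis parser, and it assumes the shell has already stripped the quotes around the values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the actual Osmosis command line parser.
class TaskSpecSketch {
    final String name;
    final Map<String, String> args = new LinkedHashMap<>();

    TaskSpecSketch(String name) {
        this.name = name;
    }
}

class CommandLineSketch {

    /** Groups tokens into tasks: each "--name" starts a task, each "key=value" belongs to the latest task. */
    static List<TaskSpecSketch> parse(String[] argv) {
        List<TaskSpecSketch> tasks = new ArrayList<>();
        for (String token : argv) {
            if (token.startsWith("--")) {
                tasks.add(new TaskSpecSketch(token.substring(2)));
            } else {
                if (tasks.isEmpty()) {
                    throw new IllegalArgumentException("Argument before any task: " + token);
                }
                int eq = token.indexOf('=');
                if (eq < 1) {
                    throw new IllegalArgumentException("Expected key=value: " + token);
                }
                TaskSpecSketch current = tasks.get(tasks.size() - 1);
                current.args.put(token.substring(0, eq), token.substring(eq + 1));
            }
        }
        return tasks;
    }
}
```

With the explicit-pipe example above, this yields two tasks whose outPipe.0 and inPipe.0 arguments both carry the value "mypipe", which is the information needed to connect them.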

A complete list of the available tasks, their io pipes, and their 
arguments with default values is specified below:

--read-mysql
  outPipe.0: Produces an element stream.
  host=localhost
  database=osm
  user=osm
  password=

--write-mysql
  inPipe.0: Consumes an element stream.
  host=localhost
  database=osm
  user=osm
  password=

--read-xml
  outPipe.0: Produces an element stream.
  file=dump.osm

--write-xml
  inPipe.0: Consumes an element stream.
  file=dump.osm

--bounding-box
  inPipe.0: Consumes an element stream.
  outPipe.0: Produces an element stream.
  left=-180
  right=180
  top=90
  bottom=-90

--derive-change
  inPipe.0: Consumes an element stream.
  inPipe.1: Consumes an element stream.
  outPipe.0: Produces a change stream.

--apply-change
  inPipe.0: Consumes an element stream.
  inPipe.1: Consumes a change stream.
  outPipe.0: Produces an element stream.

--read-xml-change
  outPipe.0: Produces a change stream.
  file=change.osc

--write-xml-change
  inPipe.0: Consumes a change stream.
  file=change.osc

--write-null
  inPipe.0: Consumes an element stream.

--write-null-change
  inPipe.0: Consumes a change stream.

--sort
  inPipe.0: Consumes an element stream.
  outPipe.0: Produces an element stream.
  type=TypeThenId

--sort-change
  inPipe.0: Consumes a change stream.
  outPipe.0: Produces a change stream.
  type=streamable[|seekable]

I look forward to hearing your thoughts.
Cheers,
Brett