[OSM-dev] New app for massaging osm data
Brett Henderson
brett at bretth.com
Thu Jun 21 10:25:57 BST 2007
Hi All,
Over the last few weeks I've written a command line java app for
processing OSM data. It started out as a way of speeding up the import
of a planet file into a local database and grew from there.
Apologies in advance for the huge email but hopefully there's enough in
here to pique interest (or otherwise). There may be a complete lack of
interest if there are already better alternatives to most of this.
The entire codebase is just over 10000 lines so I can provide more info
on the structure if required. A copy of the package (including jar,
javadocs and source) is available here:
http://www.bretth.com/osmosis/osmosis-latest.zip
The subversion repository is available here:
https://www.bretth.com/repos/main/osmosis/
Much of the contents of this email is available here:
https://www.bretth.com/wiki/Wiki.jsp?page=OpenStreetMap
It is all running off my home ADSL connection, so if there are any
problems getting access I'll find somewhere more appropriate to put it.
The tool consists of a series of pluggable components that can be
chained together to perform a larger operation. For example, it has
components for reading from database and from file, components for
writing to database and to file, components for deriving and applying
change sets to data sources, components for sorting data, etc. The
number of features is relatively small at the moment but it's intended
to be easy to add new features without re-writing common tasks such as
file or database handling.
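
To illustrate the chaining idea, here is a minimal sketch of how pluggable components could hand data to one another. It is not taken from the Osmosis source: the names OsmNode, NodeSink, BoxFilter, CollectingSink and PipelineSketch are hypothetical, and nodes are reduced to an id and a position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical element type: an id and a position only.
final class OsmNode {
    final long id;
    final double lat;
    final double lon;
    OsmNode(long id, double lat, double lon) { this.id = id; this.lat = lat; this.lon = lon; }
}

// Hypothetical consuming end of a pipe.
interface NodeSink {
    void process(OsmNode node);
    void complete();
}

// A pass-through component: forwards only the nodes inside the box.
final class BoxFilter implements NodeSink {
    private final NodeSink next;
    private final double left, right, top, bottom;
    BoxFilter(NodeSink next, double left, double right, double top, double bottom) {
        this.next = next; this.left = left; this.right = right; this.top = top; this.bottom = bottom;
    }
    public void process(OsmNode n) {
        if (n.lon >= left && n.lon <= right && n.lat >= bottom && n.lat <= top) {
            next.process(n);
        }
    }
    public void complete() { next.complete(); }
}

// A terminal component: records the ids it receives.
final class CollectingSink implements NodeSink {
    final List<Long> ids = new ArrayList<>();
    boolean completed;
    public void process(OsmNode n) { ids.add(n.id); }
    public void complete() { completed = true; }
}

final class PipelineSketch {
    // A source component: pushes each node downstream, one at a time.
    static void readAll(List<OsmNode> input, NodeSink sink) {
        for (OsmNode n : input) {
            sink.process(n);
        }
        sink.complete();
    }

    public static void main(String[] args) {
        CollectingSink out = new CollectingSink();
        NodeSink chain = new BoxFilter(out, -1.0, 1.0, 52.0, 51.0);
        readAll(List.of(new OsmNode(1, 51.5, -0.1), new OsmNode(2, 48.8, 2.3)), chain);
        System.out.println(out.ids);
    }
}
```

Because every component only sees one element at a time, a new component written against the same interface can be dropped anywhere in the chain without touching the file or database code.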
Some of the things it can perform with the current feature set are:
* dump a database
* load a database
* produce a diff file between two data sources (either database, file,
or both)
* apply a diff to a data source
* sort data in a variety of ways using a file based sort algorithm to
minimise memory
* extract data from a bounding box
Note that all features are performed in a "streamy" fashion, including
sorting, so memory usage remains low at all times.
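
One common way to sort a stream without holding it all in memory is an external merge sort: write small sorted chunks to temporary files, then merge them. The sketch below shows that technique on plain ids. It is an assumption about what a "file based sort" might look like, not a description of the Osmosis implementation, and the class name FileSortSketch and the chunkSize parameter are made up for the example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.LongConsumer;

final class FileSortSketch {
    // One open chunk file plus the value most recently read from it.
    private static final class Run {
        final BufferedReader reader;
        long head;
        Run(BufferedReader reader, long head) { this.reader = reader; this.head = head; }
    }

    // Sorts the input while buffering at most chunkSize ids at a time.
    static void sort(Iterator<Long> input, int chunkSize, LongConsumer output) {
        try {
            List<Path> chunks = new ArrayList<>();
            List<Long> buffer = new ArrayList<>();
            while (input.hasNext()) {
                buffer.add(input.next());
                if (buffer.size() >= chunkSize || !input.hasNext()) {
                    chunks.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            merge(chunks, output);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sorts one buffer in memory and writes it to a temporary file, one id per line.
    private static Path writeSortedChunk(List<Long> buffer) throws IOException {
        Collections.sort(buffer);
        Path file = Files.createTempFile("sort-chunk", ".txt");
        try (BufferedWriter writer = Files.newBufferedWriter(file)) {
            for (long id : buffer) {
                writer.write(Long.toString(id));
                writer.newLine();
            }
        }
        return file;
    }

    // Merges the sorted chunk files, keeping only one value per chunk in memory.
    private static void merge(List<Path> chunks, LongConsumer output) throws IOException {
        PriorityQueue<Run> queue = new PriorityQueue<Run>((a, b) -> Long.compare(a.head, b.head));
        for (Path chunk : chunks) {
            BufferedReader reader = Files.newBufferedReader(chunk);
            String line = reader.readLine();
            if (line != null) {
                queue.add(new Run(reader, Long.parseLong(line)));
            } else {
                reader.close();
            }
        }
        while (!queue.isEmpty()) {
            Run run = queue.poll();
            output.accept(run.head);
            String line = run.reader.readLine();
            if (line != null) {
                run.head = Long.parseLong(line);
                queue.add(run);
            } else {
                run.reader.close();
            }
        }
        for (Path chunk : chunks) {
            Files.delete(chunk);
        }
    }

    public static void main(String[] args) {
        List<Long> sorted = new ArrayList<>();
        sort(List.of(5L, 3L, 9L, 1L, 7L).iterator(), 2, v -> sorted.add(v));
        System.out.println(sorted);
    }
}
```

The result is delivered through a callback rather than returned as a list, so the sorted output can itself be streamed into the next component.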
Most of these current features can already be performed by other tools,
but there are a couple of possible advantages to this tool.
* At last check, a complete planet import takes around 75 minutes
versus approx 8 hours for the planetosm-to-db.pl script.
* The differencing algorithm works at an object level allowing diffs and
merges to be performed against any data source, not just planet.osm files.
* The difference format allows changes to be applied to a database (the
component for this is not yet written) instead of just planet files.
This allows slave databases to be maintained for purposes such as
rendering without requiring complete imports every time. These
synchronisations could be performed as regularly as required without
significant overhead to the primary server.
* Many tasks can be chained together allowing a number of transforms or
operations to be combined.
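
As an illustration of object-level differencing, the sketch below walks two streams in step and emits a create, modify or delete for each object that differs. It assumes both inputs are already sorted by id and reduces each object to an id plus a data string. The names DiffSketch, Entity and deriveChange are hypothetical and the output strings are not the real change file format.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class DiffSketch {
    // Hypothetical object: an id plus a string standing in for tags, position, etc.
    static final class Entity {
        final long id;
        final String data;
        Entity(long id, String data) { this.id = id; this.data = data; }
    }

    // Compares two id-sorted streams and lists the changes that turn "from" into "to".
    static List<String> deriveChange(Iterator<Entity> from, Iterator<Entity> to) {
        List<String> changes = new ArrayList<>();
        Entity a = from.hasNext() ? from.next() : null;
        Entity b = to.hasNext() ? to.next() : null;
        while (a != null || b != null) {
            if (b == null || (a != null && a.id < b.id)) {
                // Present only in the old source.
                changes.add("delete " + a.id);
                a = from.hasNext() ? from.next() : null;
            } else if (a == null || b.id < a.id) {
                // Present only in the new source.
                changes.add("create " + b.id);
                b = to.hasNext() ? to.next() : null;
            } else {
                // Same id in both sources: report it only if the content differs.
                if (!a.data.equals(b.data)) {
                    changes.add("modify " + b.id);
                }
                a = from.hasNext() ? from.next() : null;
                b = to.hasNext() ? to.next() : null;
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        List<Entity> oldData = List.of(new Entity(1, "a"), new Entity(2, "b"), new Entity(4, "d"));
        List<Entity> newData = List.of(new Entity(2, "b2"), new Entity(3, "c"), new Entity(4, "d"));
        System.out.println(deriveChange(oldData.iterator(), newData.iterator()));
    }
}
```

Since the comparison only needs the current object from each side, it does not matter whether a stream comes from a file or a database, and neither data set has to fit in memory.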
There are a few features on my todo list:
* Upgrade to support 0.4. This works with a 0.3 schema but will
presumably require changes since the Rails deployment.
* Fix date handling. This shouldn't be difficult but different osm
files appeared to be using different date formats so dates are currently
being ignored.
* Write component for reading and writing changesets to databases
directly using table history instead of comparing two complete data sets
(ie. full planet files or complete table reads). This would vastly
improve performance and make master-slave database replication more
feasible.
* Write a component allowing regular expression based updates to data.
* Write unit tests ...
I'll try to illustrate with some examples (complete java command line
not provided for brevity):
IMPORT PLANET
Osmosis --read-xml file="planet.osm" --write-mysql host="x" database="x"
user="x" password="x"
EXPORT PLANET
Osmosis --read-mysql host="x" database="x" user="x" password="x"
--write-xml file="planet.osm"
GENERATE DIFF BETWEEN PLANETS
Osmosis --read-xml file="planet1.osm" --read-xml file="planet2.osm"
--derive-change --write-xml-change file="planetdiff-1-2.osc"
GENERATE DIFF BETWEEN PLANET AND DATABASE
Osmosis --read-xml file="planet1.osm" --read-mysql host="x" database="x"
user="x" password="x" --derive-change --write-xml-change
file="planetdiff-1-2.osc"
APPLY DIFF TO PLANET
Osmosis --read-xml file="planet1.osm" --read-xml-change
file="planetdiff-1-2.osc" --apply-change --write-xml file="planet2.osm"
SORT CONTENTS OF OSM FILE
Osmosis --read-xml file="data.osm" --sort type="TypeThenId" --write-xml
file="data-sorted.osm"
The above examples make use of the default pipe connection feature,
however a simple read and write planet file command line could be
written in two ways. The first example uses default pipe connection,
the second explicitly connects the two components using a pipe named
"mypipe". The default pipe connection will always work so long as each
task is specified in the correct order.
Osmosis --read-xml file="planetin.osm" --write-xml file="planetout.osm"
Osmosis --read-xml file="planetin.osm" outPipe.0="mypipe" --write-xml
file="planetout.osm" inPipe.0="mypipe"
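
The sketch below shows one way default and named pipe connections could be resolved from the order of tasks on the command line. It is a guess for illustration only: it assumes unclaimed default outputs are handed out first-in, first-out, which matches the pipe types in the "apply diff" example, but the actual rule used by the tool is not stated here. The names PipeWiringSketch, Task and connect are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PipeWiringSketch {
    // Hypothetical task description: a name, pipe counts, and any explicit pipe arguments.
    static final class Task {
        final String name;
        final int inputs;
        final int outputs;
        final Map<String, String> args;
        Task(String name, int inputs, int outputs, Map<String, String> args) {
            this.name = name; this.inputs = inputs; this.outputs = outputs; this.args = args;
        }
    }

    // Returns one "producer -> consumer" entry per connected pipe, in command line order.
    static List<String> connect(List<Task> tasks) {
        Deque<String> unclaimed = new ArrayDeque<>();   // default outputs not yet consumed (FIFO, assumed)
        Map<String, String> named = new HashMap<>();    // explicit pipe name -> producing task
        List<String> links = new ArrayList<>();
        for (Task task : tasks) {
            for (int i = 0; i < task.inputs; i++) {
                String pipe = task.args.get("inPipe." + i);
                String producer = (pipe != null) ? named.remove(pipe) : unclaimed.poll();
                if (producer == null) {
                    throw new IllegalStateException("No input available for " + task.name);
                }
                links.add(producer + " -> " + task.name);
            }
            for (int i = 0; i < task.outputs; i++) {
                String pipe = task.args.get("outPipe." + i);
                if (pipe != null) {
                    named.put(pipe, task.name);
                } else {
                    unclaimed.add(task.name);
                }
            }
        }
        return links;
    }

    public static void main(String[] args) {
        List<Task> explicit = List.of(
            new Task("read-xml", 0, 1, Map.of("outPipe.0", "mypipe")),
            new Task("write-xml", 1, 0, Map.of("inPipe.0", "mypipe")));
        System.out.println(connect(explicit));
    }
}
```

Under this assumption a task listed before its input exists fails immediately, which is why the order of tasks matters when relying on default connections.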
A complete list of the available tasks, their io pipes, and their
arguments with default values is specified below:
--read-mysql
    outPipe.0: Produces an element stream.
    host=localhost
    database=osm
    user=osm
    password=
--write-mysql
    inPipe.0: Consumes an element stream.
    host=localhost
    database=osm
    user=osm
    password=
--read-xml
    outPipe.0: Produces an element stream.
    file=dump.osm
--write-xml
    inPipe.0: Consumes an element stream.
    file=dump.osm
--bounding-box
    inPipe.0: Consumes an element stream.
    outPipe.0: Produces an element stream.
    left=-180
    right=180
    top=90
    bottom=-90
--derive-change
    inPipe.0: Consumes an element stream.
    inPipe.1: Consumes an element stream.
    outPipe.0: Produces a change stream.
--apply-change
    inPipe.0: Consumes an element stream.
    inPipe.1: Consumes a change stream.
    outPipe.0: Produces an element stream.
--read-xml-change
    outPipe.0: Produces a change stream.
    file=change.osc
--write-xml-change
    inPipe.0: Consumes a change stream.
    file=change.osc
--write-null
    inPipe.0: Consumes an element stream.
--write-null-change
    inPipe.0: Consumes a change stream.
--sort
    inPipe.0: Consumes an element stream.
    outPipe.0: Produces an element stream.
    type=TypeThenId
--sort-change
    inPipe.0: Consumes a change stream.
    outPipe.0: Produces a change stream.
    type=streamable[|seekable]
I look forward to hearing your thoughts.
Cheers,
Brett