I'm pretty much offline at the moment (home ADSL is out) but I've started making some changes which should allow SQLite to be plugged in. If we focus on standard osm data (ignoring changesets), all data is passed through the osmosis pipeline using an EntityContainer class which can hold a single node, way or relation. I'm adding a new class (actually an interface) called DatasetContainer which provides access to bulk data. In effect osmosis will then support three types of data: entities (normal data), changes (normal data plus an action), and datasets. I'm not sure a dataset of changes makes sense, so I'll ignore that for now.
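In task interface terms that might look something like this (a sketch only; Sink and ChangeSink are trimmed to their process methods, and DatasetSink is a name I'm inventing here, nothing final):

public interface Sink {
    void process(EntityContainer entityContainer);
}

public interface ChangeSink {
    void process(ChangeContainer changeContainer);
}

public interface DatasetSink {
    void process(DatasetContainer datasetContainer);
}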

My initial skeleton interface looks like this:

public interface DatasetContainer {

    /**
     * Allows the entire collection of nodes to be iterated across.
     *
     * @return An iterator pointing to the start of the node collection.
     */
    public ReleasableIterator<Node> iterateNodes();

    /**
     * Allows the entire collection of ways to be iterated across.
     *
     * @return An iterator pointing to the start of the way collection.
     */
    public ReleasableIterator<Way> iterateWays();

    /**
     * Allows the entire collection of relations to be iterated across.
     *
     * @return An iterator pointing to the start of the relation collection.
     */
    public ReleasableIterator<Relation> iterateRelations();

    /**
     * Retrieves a specific node by its identifier.
     *
     * @param id
     *            The id of the node.
     * @return The node.
     */
    public Node getNode(long id);

    /**
     * Retrieves a specific way by its identifier.
     *
     * @param id
     *            The id of the way.
     * @return The way.
     */
    public Way getWay(long id);

    /**
     * Retrieves a specific relation by its identifier.
     *
     * @param id
     *            The id of the relation.
     * @return The relation.
     */
    public Relation getRelation(long id);

    /**
     * Allows all nodes within a bounding box to be iterated across.
     *
     * @param left The longitude of the left (west) edge of the box.
     * @param right The longitude of the right (east) edge of the box.
     * @param top The latitude of the top (north) edge of the box.
     * @param bottom The latitude of the bottom (south) edge of the box.
     * @return An iterator pointing to the start of the result nodes.
     */
    public ReleasableIterator<Node> iterateNodesInBoundingBox(
            double left, double right, double top, double bottom);

    /**
     * Allows all ways within a bounding box to be iterated across. The box
     * parameters match iterateNodesInBoundingBox.
     *
     * @return An iterator pointing to the start of the result ways.
     */
    public ReleasableIterator<Way> iterateWaysInBoundingBox(
            double left, double right, double top, double bottom);

    /**
     * Allows all relations within a bounding box to be iterated across. The
     * box parameters match iterateNodesInBoundingBox.
     *
     * @return An iterator pointing to the start of the result relations.
     */
    public ReleasableIterator<Relation> iterateRelationsInBoundingBox(
            double left, double right, double top, double bottom);
}

I am planning to write an abstract base class that provides a default implementation for the last three methods while requiring a concrete implementation to supply the first six. A smarter database can maintain spatial indexes and override the last three methods to perform the bounding box extraction against those indexes directly.
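A rough sketch of that base class, showing only the node case (the coordinate test assumes a simple lat/lon box, and everything beyond the interface above is provisional):

public abstract class BaseDatasetContainer implements DatasetContainer {

    /**
     * Default implementation: scan every node and keep those inside the box.
     * A smarter implementation overrides this and consults a spatial index.
     */
    public ReleasableIterator<Node> iterateNodesInBoundingBox(
            final double left, final double right, final double top, final double bottom) {
        final ReleasableIterator<Node> allNodes = iterateNodes();

        return new ReleasableIterator<Node>() {
            private Node next;

            public boolean hasNext() {
                // Advance through the full node collection until a node
                // inside the box is found or the collection is exhausted.
                while (next == null && allNodes.hasNext()) {
                    Node candidate = allNodes.next();
                    if (candidate.getLongitude() >= left && candidate.getLongitude() <= right
                            && candidate.getLatitude() >= bottom && candidate.getLatitude() <= top) {
                        next = candidate;
                    }
                }
                return next != null;
            }

            public Node next() {
                if (!hasNext()) {
                    throw new java.util.NoSuchElementException();
                }
                Node result = next;
                next = null;
                return result;
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }

            public void release() {
                // Pass the release through to the underlying iterator.
                allNodes.release();
            }
        };
    }

    // iterateWaysInBoundingBox and iterateRelationsInBoundingBox would be
    // implemented the same way, resolving member nodes via getNode(id).
}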

Once I add this new "dataset" support to the pipeline, new tasks can be written to utilise it. For example, a task could accept a standard entity stream, write it to SQLite, instantiate a DatasetContainer implementation over the result, and pass that to downstream tasks. A downstream dataset-capable bounding box task could then use the bounding box methods to extract only the data within a box, without having to read the entire dataset or use temporary files. If somebody wishes to write an alternative storage mechanism they can do so and plug it in as a new task, allowing different storage mechanisms to be optimised for different purposes.
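A sketch of what the downstream side might look like (DatasetSink is the provisional interface from above, and the container wiring is illustrative, not final):

public class DatasetBoundingBoxFilter implements DatasetSink {
    private final double left, right, top, bottom;
    private final Sink sink;

    public DatasetBoundingBoxFilter(double left, double right, double top, double bottom, Sink sink) {
        this.left = left;
        this.right = right;
        this.top = top;
        this.bottom = bottom;
        this.sink = sink;
    }

    public void process(DatasetContainer dataset) {
        // No full scan and no temporary files: ask the dataset directly
        // for the contents of the box and stream the results downstream.
        ReleasableIterator<Node> nodes = dataset.iterateNodesInBoundingBox(left, right, top, bottom);
        try {
            while (nodes.hasNext()) {
                sink.process(new NodeContainer(nodes.next()));
            }
        } finally {
            nodes.release();
        }
        // Ways and relations would be handled the same way.
    }
}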

A very smart database can also be clever about detecting ways that overlap a bounding box without having any of their nodes inside the box.
Problems such as losing nodes outside the box due to "streamy" processing should be eliminated entirely because data access is seekable.
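The naive segment test for that overlap case could look like this (a fragment only; the accessor names are approximate and a real implementation would want to batch the getNode calls):

private boolean wayOverlapsBox(Way way, DatasetContainer dataset,
        double left, double right, double top, double bottom) {
    // A way can cross the box even though none of its nodes fall inside it,
    // so test each segment of the way against the box edges.
    java.awt.geom.Rectangle2D box =
            new java.awt.geom.Rectangle2D.Double(left, bottom, right - left, top - bottom);
    Node previous = null;
    for (WayNode wayNode : way.getWayNodes()) {
        Node current = dataset.getNode(wayNode.getNodeId());
        if (previous != null
                && box.intersectsLine(previous.getLongitude(), previous.getLatitude(),
                        current.getLongitude(), current.getLatitude())) {
            return true;
        }
        previous = current;
    }
    return false;
}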

This is just a skeleton for now and may change significantly, but it seems like it should work so far. It shouldn't take me too long to get something working. The main blockers at the moment are the UTF-8 encoding issues and my home ADSL (and Christmas of course ...).

Brett

On 12/17/07, Jochen Topf <jochen@remote.org> wrote:
Hi!

There is probably no one-size-fits-all. For some datasets and some
operations streaming will be fine; for others some sort of database is
needed.

To explore the "database option" without the overhead of using a real
database like MySQL or PostgreSQL I have started experimenting with
SQLite. SQLite uses only a single file and is therefore easy to set up
and use. Libraries for all major languages are available and, because
it uses more or less standard SQL, it is easier to use than a custom
setup with DB files or so.

There is a Ruby library at http://osmlib.rubyforge.org/osmlib-sqlite which
currently can import an .osm file into a sqlite database and dump the
sqlite database out again as an .osm file. The schema for the database is
a very simple direct mapping of the osm data model into tables for
nodes, ways, relations, node_tags, way_tags, relation_tags, way_nodes,
and members. In addition each table has a "marked" column, so you can
mark data you are interested in using SQL commands and only this data is
dumped out. The idea is to have the following workflow:

1. Create database and fill it from .osm file (done)
2. Create indexes on database (the SQL to create the indexes is there,
   but it is not integrated yet; it has to be called by hand)
3. Decide which parts of the database you are interested in and call
   SQL commands to mark the data; eventually there could be tools to
   help you do that for common filtering tasks. Basically this is things
   like:
   UPDATE way_tags SET marked=1 WHERE key='highway'
4. Call a script that (recursively) marks data needed for data integrity
   (like marking all nodes used in already marked ways).
5. Dump the marked objects in the database out into a .osm file. (done)

If this proves to be a viable approach I could even imagine that somebody
would prepare the sqlite database for the planet file and people could
download it directly, sparing them the first step.

Jochen

On Mon, Dec 17, 2007 at 05:26:43PM +1100, Brett Henderson wrote:
> Date: Mon, 17 Dec 2007 17:26:43 +1100
> From: Brett Henderson <brett@bretth.com>
> To: OSM-Dev Openstreetmap <dev@openstreetmap.org>
> Subject: Re: [OSM-dev] Osmosis bug when using 'completeWays' option?
>
> As already mentioned, the current osmosis pipeline design doesn't lend
> itself very well to random data access.
>
> The current design focuses each task on a small and specific purpose,
> which is great for increased utility but leads to compromises in
> performance.
>
> A couple of possible solutions that come to mind are:
> 1. Write a new task that combines file read and bounding box
> extraction. Randomly seeking over a raw xml file is unlikely to provide
> ideal performance but it may be the simplest to implement. This essentially
> mimics the way Frederik's perl program works.
> 2. Add seekable data support to the pipeline. This will take more
> effort but may be the best solution, usable in a wide variety of scenarios.
>
> Step two would require a new data type to be added to the pipeline for
> dealing with data as a complete (seekable) set. Currently the pipeline
> can process entities and changes. It would be possible to add a new
> type called dataset or similar. A "reader" task could then read an
> entire data set, store it in a seekable (and indexed) form, and pass
> that complete seekable set (exposed through a suitable interface) to
> downstream tasks. That would eliminate the need for each task to
> perform its own temporary file data buffering, which will scale more
> effectively to large numbers of bounding boxes. It would be possible to
> completely disconnect the seekable store creation task from the data
> processing task, which would allow a seekable data store to be re-used
> between osmosis invocations.
>
> If the above description sounds a lot like a database then you're right,
> it is. The current temporary files in osmosis provide some database-like
> features such as random access and indexes, but they're very simple
> and limited. I've been thinking about writing a more complicated
> database or using something like bdb to implement some of this, but it is
> not trivial and could take some time to complete. Note that the
> existing production mysql schema is not suitable for this; it takes
> forever to import and the indexing is unlikely to be ideal. A simple
> read-only data store can provide huge speed advantages over a full sql
> database but requires a brand new "schema".
>
> But, I may be smoking something. I'm sure there are many ways of doing
> this.
>
> Note that all of this will incur a significant performance hit; most
> osmosis tasks currently operate in a "streamy" fashion which requires
> minimal memory and is very fast for processing a planet file end to
> end. Any random access processing will slow that down.
>
> Karl Newman wrote:
> > On Dec 2, 2007 2:44 PM, Frederik Ramm <frederik@remote.org> wrote:
> >
> >> Hi,
> >>
> >>
> >>>>> If you're working on a
> >>>>> particular area, you might want to start with a simple bounding
> >>>>> box first (no completeWays) to limit the data that is buffered.
> >>>>>
> >>>> Currently I'm working on a small area, but it will target the entire world.
> >>>> Maybe I need to split the world into a few sub-areas first (e.g. Europe,
> >>>> North America, South America, Africa and Asia) before cutting them into
> >>>> small pieces.
> >>>>
> >>> I should talk to Brett about making some sort of shared entity
> >>> cache/buffer that could be re-used by multiple downstream tasks, so in
> >>> the case of a "tee" with "completeWays" bounding boxes there would be
> >>> only one copy of all the entities instead of one set per tee.
> >>>
> >> Maybe the pre-cutting of a bounding box could be automated. Based on
> >> the assumption that no way is longer than "x", the area filter could
> >> hold only references to those objects which are *not* inside the
> >> selected area but in the vicinity ("x" around the bounding box), for
> >> possible later inclusion.
> >>
> >>
> > That's a possibility; now solve for "x" :-)
> >
> >
> >> As far as I understand, you're keeping the full objects. My Perl
> >> polygon extractor only kept the object IDs (in a bit set) and
> >> afterwards either used seek() to jump back to the beginning of the
> >> relevant section in the input file to retrieve the missing objects (or
> >> simply re-opened it if it was not seekable). Also, for effective
> >> output ordering, it created three temporary output files (for nodes,
> >> ways, relations) and concatenated them at the end of the operation. I
> >> fear it will be very hard to put such optimisations into Osmosis as
> >> they don't fit well with the pipeline concept.
> >>
> >>
> > Well, it does keep the object IDs in a bit set, but the pipeline idea
> > is the problem--the source of the entity data could be a database. The
> > pipeline paradigm doesn't lend itself to seeking backwards. That's why
> > we store the entities (three separate temporary files, here, too).
> >
> > Karl
>

--
Jochen Topf  jochen@remote.org  http://www.remote.org/jochen/  +49-721-388298