[OSM-dev] New OSM binary fileformat implementation.
scrosby06 at gmail.com
Sun May 9 17:05:09 BST 2010
On Thu, May 6, 2010 at 11:13 AM, Nolan Darilek <nolan at thewordnerd.info>wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> Sorry for the delay in responding; crazy life, and I've been fixing
> existing bugs in my project rather than thinking about breaking new ground.
> On 05/02/2010 12:35 AM, Scott Crosby wrote:
> > With pruning out metadata, some judicious filtering of uninteresting
> > and increasing the granularity to 10 microdegrees (about 1m resolution),
> > I've fit the whole planet in 3.7gb.
> Sweet. I hope this format works for my use case.
> > I have no code for pulling entities out by ID, but that would be
> > straightforward to add, if there was a demand for it.
> I would definitely need that. I'm coding to the travelingsalesman API's
> DataSet interface which does include retrieval by ID.
This could be done by having a file sorted by ID. Metadata could be the
minimum and maximum ID in each block.
> have to pay a disk seek whether it is in my format or not. My format
> > very dense, might let RAM hold the working set and avoid the disk seek.
> > to decompress is already far faster than a hard drive, though not a SSD.
> Keeping everything in RAM is probably workable. At the very least, to go
> global with a format like this would seem to be a matter of starting
> with a mid-level VPS that stores everything on disk and eventually
> upgrading to a high-RAM, low disk space EC2 or GoGrid instance. Without
> it, I'm looking at half a TB of storage and possibly a significant chunk
> of RAM, and even so I don't think my current dataset can handle that.
> In other words, I like the option of keeping everything in RAM far
> better than what I'm doing right now. :)
> > Could you tell me more about the kinds of lookups your application will
> Sure. You can see the interface I've implemented here:
> Basically, the executive summary is that there are four broad kinds of
> Entity by ID, as mentioned earlier
> Entities based on intersection with bounding box, currently done by the
> somewhat inaccurate method of finding all contained nodes, then
> returning any associated ways/relations. Would be great if I could
> locate contained ways even if they don't have a node in the box, but
> even if not, it'd be no worse than what's there now. :)
Doable, even for ways that do not contain any node in the box.
> Entities by presence of certain tags, in some instances also with
> bounding box conditions (I.e. all "amenity"->"fuel" nodes, or all of
> such nodes within a given bounds)
There are several ways of doing this, perhaps the simpliest is to segregate
such PoI entities into a separate set of blocks and do the same
expanding-search mentioned below until a hit is found. There could be more
than one class of such PoI blocks (amenities, cities, interstate exits,
> Nearest entity to a given point, expanding outward.
> I can, for instance roughly find the nearest way by
finding the node nearest to a set of coordinates,
This part is doable: find the blocks that contain the coordinates, sort to
find the closest.
> checking for its presence in any ways,
My current design does not include this functionality; no
getWaysForNode(nodeID). However, there is another solution that may be
acceptable and not require writing an extension: If a file is geographically
sorted, what can be done is to load all ways intersecting a bounding box,
and all nodes intersecting a bounding box, and then combine these two
datasets together in RAM to figure out which way happens to be the closest.
This result however is not guaranteed to be complete and correct. For ways
extending out of the bounding box, this algorithm would not load those nodes
into RAM and the full geographic path of such a way would be unknown. A fix
might be to use a larger bounding box when loading nodes. To aid in choosing
what size of bounding box of nodes to load, each block of ways could store
the maximum inter-node spacing among the ways in that box. To be efficient,
the data needs to have a limit on maximum inter-node spacing in ways.
then finding the
> next nearest and recursing outward until the conditions are met. The
> conditions check is done externally, so the search need only return the
> nearest entity, next nearest, etc.)
> I know you've said elsewhere that you don't want this format to replace
> the need for a database, and I respect that. I just don't quite know
> where that line is. Even so, I clearly don't need all of my database's
> functionality for the OSM-facing aspects of this app and hope that these
> limited uses are in scope.
Its software. I don't think there is any hard-and-fast line. Just a set of
tradeoffs. My fileformat just demonstrates another position in the design
I had a set of goals: Small size, much faster to read, simple parsers, and
Small size means that redundant indexing information must be optional or
left entirely out.
Application-specific extensions must not impact forward and backward
compatability. My vision is that there should be one simple core that
contains the OSM data, readable by *all* programs. Extensions in the
form of new fields or fileblocks can ignored by other software without
OSM data. Therefore, any program can read any file in my format and
a file in my format with the extensions it needs and with the entity
Once my design is stabalized then hopefully this kind of compatability will
be guaranteed and anyone could add as many database-like indexes and
features as they wanted, without altering the core format. Each 'fileblock'
identifies what data it contains. Programs can skip fileblocks that they do
not understand. I have only definied two fileblocks so far, OSMHeader and
OSMData. Indexing, routing, and other kinds of fileblocks can be defined. I
estimate that indexes supporting getWaysForNode(nodeID),
getRelationsForWay(wayID), and getRelationsForNode(nodeID)) would add 2-3gb
of uncompressed size.
As a separate discussion, two extensions seem to be valuable and re-usable
enough to standardize on and have small overheads. In particular, two sort
orders: sort geographically and sort by ID, and optional metadata for each
fileblock: min_id/max_id and bounding boxes. There are some open questions
as to the format of these extensions, such as should each block of ways
indicate the maximum inter-node spacing among the ways in that block? Unlike
node positions, bounding boxes are derived data and can use a different
resolution coordinate system. So, should bounding boxes use a circular
coodrinate system (2<<32/360.0 * lon)?, or a lower resolution (2<<24/360.0
* lon)? Or measure in nanodegrees or microdegrees? lon/10e-9? lon/10e-6?
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the dev