[OSM-dev] New OSM binary fileformat implementation.

Scott Crosby scrosby06 at gmail.com
Sun May 2 06:35:09 BST 2010

---------- Forwarded message ----------
(Accidentally did not reply to the list)

> Some of these questions may be a bit premature, but I don't know how far
> along your design is, and perhaps asking them now may influence that
> design in ways that work for me.
I'd call the file format design mostly complete, except for some of the header
design issues I've brought up already. The open question is which extensions
make sense to define now, such as bounding boxes, and what the right
definition for each of them is.

> Unfortunately, this method introduces a variety of complications. First,
> the database for TX alone is 10 gigs. Ballpark estimations are that I
> might need half a TB or more to store the entire planet. I'll also need
> substantial RAM to store the working set for the DB index. All this
> means that, to launch this project on a global scale, I'd need a lot
> more funding than I as an individual am likely to find.

By pruning out metadata, judiciously filtering uninteresting tags, and
increasing the granularity to 10 microdegrees (about 1 m resolution), I've fit
the whole planet in 3.7 GB.
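
As a rough sketch of that quantization (the function names here are mine, not
the implementation's), a coordinate is stored as an integer count of
granularity units, so 10-microdegree units give roughly metre-scale precision:

```python
def quantize(degrees, granularity_microdeg=10):
    """Store a coordinate as an integer multiple of the granularity."""
    return round(degrees * 1_000_000 / granularity_microdeg)

def dequantize(units, granularity_microdeg=10):
    """Recover the (rounded) coordinate in degrees."""
    return units * granularity_microdeg / 1_000_000

lat = 29.75                      # roughly Houston, TX
q = quantize(lat)                # 2975000 ten-microdegree units
# Worst-case rounding error is half a unit: 5 microdegrees, about 0.55 m.
assert abs(dequantize(quantize(29.7604)) - 29.7604) <= 5e-6
```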

> Is there a performance or size penalty to ordering the data
> geographically rather than by ID?

I expect no performance penalty.

As for a size penalty, it will be a mixed bag. Ordering geographically should
reduce the similarity between consecutive node ID numbers, increasing the
space required to store them, but it should increase the similarity between
consecutive latitude and longitude values, reducing their size. It might also
change how often strings are re-used. On the whole, I suspect the filesize
would stay within 10% of what it is now, and I'd expect it to decrease, but I
can't know without measuring.
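
To make that delta-coding intuition concrete, here is a sketch (using zig-zag
varints as in Protocol Buffers; the ID values are made up) of how the byte
cost of a sequence depends on how similar consecutive values are:

```python
def zigzag(n):
    """Map signed ints to unsigned so small negatives stay small."""
    return 2 * n if n >= 0 else -2 * n - 1

def varint_len(n):
    """Bytes needed to store n as a zig-zag-encoded base-128 varint."""
    z = zigzag(n)
    length = 1
    while z >= 0x80:
        z >>= 7
        length += 1
    return length

def delta_cost(values):
    """Total bytes to store a sequence delta-coded against the previous value."""
    prev = total = 0
    for v in values:
        total += varint_len(v - prev)
        prev = v
    return total

sorted_ids = [600000000, 600000001, 600000003, 600000004]  # ID order: tiny deltas
geo_ids    = [600000003, 123456789, 600000000, 987654321]  # geo order: large deltas
assert delta_cost(sorted_ids) < delta_cost(geo_ids)
```

The same mechanism is why geographic ordering helps coordinates: nearby nodes
have nearby latitudes and longitudes, so their deltas shrink.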

> I understand that this won't be the
> default case, but I'm wondering if there would likely be any major
> performance issues for using it in situations where you're likely to
> want bounding-box access rather than simply pulling out entities by ID.
I have no code for pulling entities out by ID, but that would be
straightforward to add if there were demand for it.

There should be no problems at all with geographic queries. My vision for
bounding-box access is that the file lets you skip 'most' blocks that are
irrelevant to a query; how much 'most' covers depends a lot on the data and on
how exactly the dataset is sorted for geographic locality.
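
A hypothetical sketch of that block-skipping idea, assuming each block carried
a bounding box in its header (the field layout here is invented for
illustration, not taken from the format):

```python
def intersects(a, b):
    """Axis-aligned boxes as (min_lon, min_lat, max_lon, max_lat) tuples."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def blocks_to_read(block_bboxes, query_bbox):
    """Indices of blocks whose bbox intersects the query; all others are skipped."""
    return [i for i, bb in enumerate(block_bboxes) if intersects(bb, query_bbox)]

blocks = [(-106.6, 25.8, -93.5, 36.5),   # roughly Texas
          (2.0, 48.0, 3.0, 49.5)]        # roughly the Paris region
assert blocks_to_read(blocks, (-96.0, 29.0, -95.0, 30.0)) == [0]
```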

There may, however, be problems with some geographic queries. Things like
cross-continental airways, if they are in the OSM planet file, would cause
huge problems: their bounding box would cover a whole continent, intersecting
virtually any geographic lookup. Those lookups would then need to find the
nodes in those long ways, which would require loading virtually every block
containing nodes. I have considered solutions for this issue, but I do not
know whether problematic ways like this exist. Does OSM have ways like this?

> Also, is there any reason that this format wouldn't be suitable for a
> site with many active users performing geographic, read-only queries of
> the data?

A lot of that depends on the query locality. Each block has to be
independently decompressed and parsed before its contents can be examined,
which takes around 1 ms. At a small penalty in filesize, you can use 4k
entities per block, which decompresses and parses faster. If the client is
interested in many ways in a particular geographic locality, as yours seems
to be, then this is perfect: grab the blocks and cache the decompressed data
in RAM, where it can be re-used for subsequent geographic queries in the same
area.

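A minimal sketch of that caching strategy, with an invented block store
standing in for the real file (and zlib standing in for whatever compression
the format uses):

```python
import zlib
from functools import lru_cache

# Hypothetical store mapping block offsets to compressed bytes.
BLOCKS = {0: zlib.compress(b"entities for block 0"),
          1: zlib.compress(b"entities for block 1")}

@lru_cache(maxsize=1024)        # keep up to 1024 decompressed blocks in RAM
def decompress_block(offset):
    """Pay the ~1 ms decompress-and-parse cost only on a cache miss."""
    return zlib.decompress(BLOCKS[offset])

decompress_block(0)             # first access decompresses
decompress_block(0)             # repeat access is served from RAM
assert decompress_block.cache_info().hits >= 1
```
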
> Again, I'd guess not, since the data isn't compressed as such,
> but maybe seeking several gigs into a file to locate nearby entities
> would be a factor, or it may work just fine for single-user access but
> not so well with multiple distinct seeks for different users in widely
> separate locations.
Ultimately, it depends on your application, which has a particular locality in
its lookups. Application locality, combined with the file format, determines
the working-set size. If RAM is insufficient to hold the working set, you'll
have to pay a disk seek whether the data is in my format or not. My format,
being very dense, might let RAM hold the working set and avoid the disk seek;
1 ms to decompress is already far faster than a hard-drive seek, though not
faster than an SSD.

Could you tell me more about the kinds of lookups your application will do?

> Anyhow, I realize these questions may be naive at such an early stage,
> but the idea that I may be able to pull this off without infrastructure
> beyond my budget is an appealing one. Are there any reasons your binary
> format wouldn't be able to accommodate this situation, or couldn't be
> optimized to do so?
No optimization may be necessary to do what you want; all that would be needed
is to standardize the location and format of bounding-box messages, and to
physically reorder the entity stream before it goes into the serializer. I
have some ideas for ways of doing that.
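
One standard way to physically reorder entities for geographic locality, not
necessarily what I'd end up using, is to sort by a space-filling curve such as
Morton (Z-order) over the quantized coordinates, so that entities close on the
curve are usually close on the map:

```python
def interleave_bits(x, y, bits=16):
    """Morton (Z-order) code: interleave the bits of two non-negative coords."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def geo_sort(nodes):
    """Sort nodes by the Z-order code of their quantized coordinates."""
    return sorted(nodes, key=lambda n: interleave_bits(n["x"], n["y"]))

nodes = [{"id": 3, "x": 60000, "y": 100},   # far away from the others
         {"id": 1, "x": 5, "y": 6},
         {"id": 2, "x": 7, "y": 5}]
assert [n["id"] for n in geo_sort(nodes)] == [2, 1, 3]
```

After this sort, the serializer sees nearby entities in adjacent blocks, which
is what makes per-block bounding boxes selective.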

