[OSM-dev] New OSM binary fileformat implementation.

jamesmikedupont at googlemail.com jamesmikedupont at googlemail.com
Sun May 2 10:25:34 BST 2010

Ok, my reader is now working, it can read to the end of the file,
now am fleshing out the template dump functions to emit the data.
git at github.com:h4ck3rm1k3/OSM-Osmosis.git

My new idea is that we could use a binary version of the rtree, I have
already ported the rtree to my older template classes.

We could use the rtree to sort the data and emit the blocks based on
that. the rtree data structures themselves could be stored in the
protobuffer so that it is persistent and also readable by all.



doxygen here :

On Sun, May 2, 2010 at 7:35 AM, Scott Crosby <scrosby06 at gmail.com> wrote:
> ---------- Forwarded message ----------
> (Accidently did not reply to list)
>> Some of these questions may be a bit premature, but I don't know how far
>> along your design is, and perhaps asking them now may influence that
>> design in ways that work for me.
> I'm willing to call what I've designed so far in the file format mostly
> complete, except for some of the header design issues I've brought up
> already. The question is what extensions make sense to define now, such as
> bounding boxes, and choosing the right definition for them.
>> Unfortunately, this method introduces a variety of complications. First,
>> the database for TX alone is 10 gigs. Ballpark estimations are that I
>> might need half a TB or more to store the entire planet. I'll also need
>> substantial RAM to store the working set for the DB index. All this
>> means that, to launch this project on a global scale, I'd need a lot
>> more funding than I as an individual am likely to find.
> With pruning out metadata, some judicious filtering of uninteresting tags,
> and increasing the granularity to 10 microdegrees (about 1m resolution),
> I've fit the whole planet in 3.7gb.
>> Is there a performance or size penalty to ordering the data
>> geographically rather than by ID?
> I expect no performance penalty.
> As for a size penalty, it will be a mixed bag. Ordering geographically
> should reduce the similarity for node ID numbers, increasing the space
> required to store them. It should increase the similarity for latitude and
> longitude numbers, which would reduce the size. It might change the re-use
> frequency of strings. On the whole, I suspect the filesize would remain
> within 10% of what it is now and believe it will decrease, but I have no way
> to know.
>> I understand that this won't be the
>> default case, but I'm wondering if there would likely be any major
>> performance issues for using it in situations where you're likely to
>> want bounding-box access rather than simply pulling out entities by ID.
> I have no code for pulling entities out by ID, but that would be
> straightforward to add, if there was a demand for it.
> There should be no problems at all for doing geographic queries. My vision
> for a bounding box access is that the file lets you skip 'most' blocks that
> are irrelevant to a query. 'most' depends a lot on the data and how exactly
> the dataset is sorted for geographic locality.
> But there may be problems in geographic queries. Things like
> cross-continental airways if they are in the OSM planet file would cause
> huge problems; their bounding box would cover the whole continent,
> intersecting virtually any geographic lookup. Those geographic lookups would
> then need to find the nodes in those long ways which would require loading
> virtually every block containing nodes.  I have considered solutions for
> this issue, but I do not know if problematic ways like this exist. Does OSM
> have ways like this.
>> Also, is there any reason that this format wouldn't be suitable for a
>> site with many active users performing geographic, read-only queries of
>> the data?
> A lot of that depends on the query locality. Each block has to be
> indpendently decompressed and parsed before the contents can be examined,
> that takes around 1ms. At a small penalty in filesize, you can use 4k
> entities in a block which decompress and parse faster. If the client is
> interested in many ways in a particular geographic locality, as yours seems
> to, then this is perfect. Grab the blocks and cache the decompressed data in
> RAM where it can be re-used for subsequent geographic queries in the same
> locality.
>> Again, I'd guess not, since the data isn't compressed as such,
>> but maybe seeking several gigs into a file to locate nearby entities
>> would be a factor, or it may work just fine for single-user access but
>> not so well with multiple distinct seeks for different users in widely
>> separate locations.
> Ultimately, it depends on your application, which has a particular locality
> in its lookups. Application locality, combined with a fileformat, defines
> the working set size. If RAM is insufficient to hold the working set, you'll
> have to pay a disk seek whether it is in my format or not. My format being
> very dense, might let RAM hold the working set and avoid the disk seek. 1ms
> to decompress is already far faster than a hard drive, though not a SSD.
> Could you tell me more about the kinds of lookups your application will do?
>> Anyhow, I realize these questions may be naive at such an early stage,
>> but the idea that I may be able to pull this off without infrastructure
>> beyond my budget is an appealing one. Are there any reasons your binary
>> format wouldn't be able to accomodate this situation, or couldn't be
>> optimized to do so?
> No optimization may be necessary to do what you want; all that would be
> needed would be standardize the location and format of bounding box messages
> and physically reorder the entity stream before it goes into the serializer.
> I have some ideas for ways of doing that.
> Scott
> _______________________________________________
> dev mailing list
> dev at openstreetmap.org
> http://lists.openstreetmap.org/listinfo/dev

More information about the dev mailing list