[OSM-dev] New OSM binary fileformat implementation.
Nolan Darilek
nolan at thewordnerd.info
Thu Jun 17 01:12:53 BST 2010
Hey, just wondering what the status of this project is?
My app, which could greatly benefit from this format, is advancing to a
usable state, but the main bottleneck for me right now is database size.
My current format stores Texas in around 10 GB, which doesn't scale
easily. If I can fit the entire globe into just over half that, well,
that'd rock. :)
Thanks.
On 05/02/2010 04:25 AM, jamesmikedupont at googlemail.com wrote:
> OK, my reader is now working: it can read to the end of the file.
> Now I am fleshing out the template dump functions to emit the data.
> git@github.com:h4ck3rm1k3/OSM-Osmosis.git
>
> My new idea is that we could use a binary version of the rtree; I have
> already ported the rtree to my older template classes.
>
> We could use the rtree to sort the data and emit the blocks based on
> that. The rtree data structures themselves could be stored in the
> protobuffer so that the index is persistent and also readable by all.
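For illustration, here is a rough, hypothetical sketch (in Java) of what a
persisted rtree node might carry if the index were written into the file
alongside the data blocks. None of these names come from the format or from
the code linked below; they are placeholders only.

    // Illustrative placeholder: what a stored R-tree node might hold if the
    // index were serialized next to the data blocks.
    final class StoredRTreeNode {
        // Bounding box covering everything below this node (nanodegrees).
        long minLat, minLon, maxLat, maxLon;
        // For interior nodes: positions of child nodes in the serialized tree.
        int[] children;
        // For leaf nodes: byte offset of the data block this leaf points at.
        long blockOffset;

        boolean isLeaf() {
            return children == null || children.length == 0;
        }
    }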
>
>
> https://code.launchpad.net/~jamesmikedupont/+junk/EPANatReg
>
>
> notes:
> http://www.openstreetmap.org/user/h4ck3rm1k3/diary/9042
>
> doxygen here :
> http://xhema.flossk.org:8080/mapdata/03/EPANatReg/html/classRTreeWorld.html
>
> On Sun, May 2, 2010 at 7:35 AM, Scott Crosby <scrosby06 at gmail.com> wrote:
>> ---------- Forwarded message ----------
>> (Accidentally did not reply to the list)
>>
>>
>>> Some of these questions may be a bit premature, but I don't know how far
>>> along your design is, and perhaps asking them now may influence that
>>> design in ways that work for me.
>>>
>>
>> I'm willing to call what I've designed so far in the file format mostly
>> complete, except for some of the header design issues I've brought up
>> already. The question is which extensions make sense to define now, such
>> as bounding boxes, and how to choose the right definition for them.
>>
>>
>>
>>> Unfortunately, this method introduces a variety of complications. First,
>>> the database for TX alone is 10 gigs. Ballpark estimates are that I
>>> might need half a TB or more to store the entire planet. I'll also need
>>> substantial RAM to store the working set for the DB index. All this
>>> means that, to launch this project on a global scale, I'd need a lot
>>> more funding than I, as an individual, am likely to find.
>>
>> By pruning out metadata, judiciously filtering uninteresting tags, and
>> increasing the granularity to 10 microdegrees (about 1 m resolution),
>> I've fit the whole planet into 3.7 GB.
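As a purely illustrative aside on the granularity point: coarser units make
the stored integers (and the deltas between nearby values) smaller, so they
take fewer varint bytes. The numbers and class name below are examples only,
not anything taken from the format.

    // Toy illustration of the granularity trade-off: the same latitude
    // stored in finer vs. coarser integer units.
    public class GranularityDemo {
        public static void main(String[] args) {
            double lat = 30.2672;                 // degrees
            long nano = Math.round(lat * 1e9);    // 30267200000 nanodegrees
            long fine = nano / 100;               // 100-nanodegree units
            long coarse = nano / 10000;           // 10-microdegree units (~1 m)
            System.out.println(fine);             // 302672000
            System.out.println(coarse);           // 3026720
            // The coarser value fits in fewer varint bytes, and deltas
            // between nearby nodes shrink in the same way.
        }
    }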
>>
>>>
>>> Is there a performance or size penalty to ordering the data
>>> geographically rather than by ID?
>>
>> I expect no performance penalty.
>>
>> As for a size penalty, it will be a mixed bag. Ordering geographically
>> should reduce the similarity between node ID numbers, increasing the
>> space required to store them. It should increase the similarity between
>> latitude and longitude values, which would reduce the size. It might
>> change the re-use frequency of strings. On the whole, I suspect the file
>> size would stay within 10% of what it is now, and I believe it will
>> decrease, but I have no way to know.
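To make the "similarity" point concrete, here is a toy, self-contained sketch
of why delta-coded values take fewer bytes when consecutive values are close
together. The encoding shown (plain unsigned varints over absolute deltas) is
a simplification for illustration, not the format's actual wire encoding.

    // Toy comparison: delta-coded sizes for IDs listed in ID order vs. in a
    // geographic order where consecutive IDs are far apart.
    public class DeltaSizeDemo {
        // Bytes a non-negative value needs as an unsigned varint.
        static int varintBytes(long v) {
            int n = 1;
            while ((v >>>= 7) != 0) n++;
            return n;
        }

        static int deltaCodedBytes(long[] values) {
            int total = 0;
            long prev = 0;
            for (long v : values) {
                total += varintBytes(Math.abs(v - prev));
                prev = v;
            }
            return total;
        }

        public static void main(String[] args) {
            long[] idOrder  = {100000001L, 100000002L, 100000003L};
            long[] geoOrder = {100000002L, 734210987L, 100000001L};
            System.out.println(deltaCodedBytes(idOrder));   // small deltas
            System.out.println(deltaCodedBytes(geoOrder));  // large deltas
        }
    }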
>>
>>>
>>> I understand that this won't be the
>>> default case, but I'm wondering if there would likely be any major
>>> performance issues for using it in situations where you're likely to
>>> want bounding-box access rather than simply pulling out entities by ID.
>>>
>>
>> I have no code for pulling entities out by ID, but that would be
>> straightforward to add if there were demand for it.
>>
>> There should be no problems at all with geographic queries. My vision
>> for bounding-box access is that the file lets you skip 'most' of the
>> blocks that are irrelevant to a query; how much 'most' is depends a lot
>> on the data and on exactly how the dataset is sorted for geographic
>> locality.
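A minimal sketch of what that skipping could look like at read time, assuming
each block carries (or is indexed by) a bounding box. BlockSummary, its field
names, and readAndDecode() are invented here for illustration; they are not
part of the format.

    import java.util.List;

    // Walk a per-block bounding-box index and only decode blocks that can
    // intersect the query rectangle; everything else is skipped unread.
    class BoundingBoxQuery {
        static final class BlockSummary {
            long minLat, minLon, maxLat, maxLon;  // block bbox, nanodegrees
            long fileOffset;                      // where the block starts
        }

        void query(List<BlockSummary> index,
                   long qMinLat, long qMinLon, long qMaxLat, long qMaxLon) {
            for (BlockSummary b : index) {
                boolean disjoint = b.maxLat < qMinLat || b.minLat > qMaxLat
                                || b.maxLon < qMinLon || b.minLon > qMaxLon;
                if (disjoint) continue;           // cannot contain any hits
                // readAndDecode(b.fileOffset);   // only now pay the decode cost
            }
        }
    }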
>>
>> But there may be problems with geographic queries. Things like
>> cross-continental airways, if they are in the OSM planet file, would
>> cause huge problems: their bounding box would cover a whole continent,
>> intersecting virtually any geographic lookup. Those lookups would then
>> need to find the nodes in those long ways, which would require loading
>> virtually every block containing nodes. I have considered solutions for
>> this issue, but I do not know whether problematic ways like this exist.
>> Does OSM have ways like this?
>>
>>>
>>> Also, is there any reason that this format wouldn't be suitable for a
>>> site with many active users performing geographic, read-only queries of
>>> the data?
>>
>> A lot of that depends on query locality. Each block has to be
>> independently decompressed and parsed before its contents can be
>> examined, which takes around 1 ms. At a small penalty in file size, you
>> can put 4k entities in a block, which decompresses and parses faster. If
>> the client is interested in many ways in a particular geographic
>> locality, as yours seems to be, then this is perfect: grab the blocks
>> and cache the decompressed data in RAM, where it can be re-used for
>> subsequent geographic queries in the same locality.
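One simple way to do that caching, sketched here with standard-library Java
only; the class name and the eviction policy (LRU keyed by block offset) are
just one example of the idea, not a prescribed design.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Keep up to maxBlocks decompressed blocks in RAM, keyed by file offset,
    // evicting the least recently used one when the cache is full.
    class DecodedBlockCache<V> extends LinkedHashMap<Long, V> {
        private final int maxBlocks;

        DecodedBlockCache(int maxBlocks) {
            super(16, 0.75f, true);   // access-order gives LRU behaviour
            this.maxBlocks = maxBlocks;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<Long, V> eldest) {
            return size() > maxBlocks;
        }
    }
    // Usage sketch: look up by block offset; on a miss, decompress (~1 ms),
    // parse, and put() the decoded block for later queries nearby.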
>>
>>>
>>> Again, I'd guess not, since the data isn't compressed as such,
>>> but maybe seeking several gigs into a file to locate nearby entities
>>> would be a factor, or it may work just fine for single-user access but
>>> not so well with multiple distinct seeks for different users in widely
>>> separated locations.
>>>
>>
>> Ultimately, it depends on your application, which has a particular
>> locality in its lookups. Application locality, combined with the file
>> format, defines the working-set size. If RAM is insufficient to hold the
>> working set, you'll have to pay a disk seek whether the data is in my
>> format or not. My format, being very dense, might let RAM hold the
>> working set and avoid the disk seek. 1 ms to decompress is already far
>> faster than a hard drive seek, though not an SSD.
>>
>> Could you tell me more about the kinds of lookups your application will do?
>>
>>>
>>> Anyhow, I realize these questions may be naive at such an early stage,
>>> but the idea that I may be able to pull this off without infrastructure
>>> beyond my budget is an appealing one. Are there any reasons your binary
>>> format wouldn't be able to accommodate this situation, or couldn't be
>>> optimized to do so?
>>>
>>
>> No optimization may be necessary to do what you want; all that would be
>> needed is to standardize the location and format of bounding-box
>> messages and to physically reorder the entity stream before it goes
>> into the serializer. I have some ideas for ways of doing that.
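As one illustration of what "physically reorder the entity stream" could mean
in practice, the sketch below sorts nodes by a coarse geographic tile key
before they would be handed to a serializer. The ~0.1-degree tiling and all
names here are assumptions made for the example, not the actual plan referred
to above.

    import java.util.Comparator;
    import java.util.List;

    // Sort nodes by a coarse row-major tile key so that nearby nodes end up
    // adjacent in the stream (and therefore in the same output blocks).
    class GeoReorder {
        static final class Node {
            long id;
            double lat, lon;
        }

        // Roughly 0.1-degree tiles; any similar space-filling key would do.
        static long tileKey(Node n) {
            long row = (long) Math.floor((n.lat + 90.0) * 10.0);
            long col = (long) Math.floor((n.lon + 180.0) * 10.0);
            return row * 3600L + col;
        }

        static void sortGeographically(List<Node> nodes) {
            nodes.sort(Comparator.comparingLong(GeoReorder::tileKey)
                                 .thenComparingLong(n -> n.id));
        }
    }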
>>
>> Scott
>>
>