[OSM-dev] New OSM binary fileformat implementation.

jamesmikedupont at googlemail.com jamesmikedupont at googlemail.com
Wed May 5 10:30:25 BST 2010


I have now reworked the reader code to use only the stock Google
protocol buffer features for reading. It can now read the entire file
to the end. The code is committed.
I have also checked in a small example protobuf file for testing.
http://github.com/h4ck3rm1k3/OSM-Osmosis

Any testers would be appreciated.

In the directory OSM-Osmosis/src/crosby/binary/cpp, build and then run:
./osmprotoread albania.osm.protobuf3 > out.txt
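
If anyone wants to see how little code the stock reader path needs,
here is a minimal sketch of the reading loop. It assumes each block is
stored as a varint length followed by the serialized message; "Block"
and "osmformat.pb.h" are placeholders for whatever the repo's .proto
actually generates:

    // block_reader.cpp -- minimal sketch using only stock protobuf I/O.
    #include <fcntl.h>
    #include <cstdint>
    #include <cstdio>
    #include <google/protobuf/io/coded_stream.h>
    #include <google/protobuf/io/zero_copy_stream_impl.h>
    #include "osmformat.pb.h"  // placeholder generated header

    int main(int argc, char **argv) {
      if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }
      int fd = open(argv[1], O_RDONLY);
      google::protobuf::io::FileInputStream raw(fd);
      google::protobuf::io::CodedInputStream coded(&raw);
      uint32_t size = 0;
      while (coded.ReadVarint32(&size)) {
        // Restrict the parser to this block so it stops at the block's end.
        google::protobuf::io::CodedInputStream::Limit limit =
            coded.PushLimit(size);
        Block block;  // placeholder message type
        if (!block.ParseFromCodedStream(&coded)) break;
        coded.PopLimit(limit);
        // ... dump the block's contents here ...
      }
      return 0;
    }

(For multi-gigabyte files you would recreate the CodedInputStream per
block, since it enforces a total-bytes safety limit by default.)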

mike

On Sun, May 2, 2010 at 11:25 AM, jamesmikedupont at googlemail.com <
jamesmikedupont at googlemail.com> wrote:

> OK, my reader is now working; it can read to the end of the file. I am
> now fleshing out the template dump functions to emit the data.
> git at github.com:h4ck3rm1k3/OSM-Osmosis.git
>
> My new idea is that we could use a binary version of the R-tree; I have
> already ported the R-tree to my older template classes.
>
> We could use the R-tree to sort the data and emit the blocks based on
> that. The R-tree data structures themselves could be stored in the
> protobuf so that the index is persistent and readable by all; see the
> sketch below.
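>
> To make that concrete, here is a rough sketch of persisting one R-tree
> node into a protobuf message. Both MemNode and pbf::RTreeNode are
> made-up names for illustration, not the actual format:
>
> #include <cstdint>
> #include <vector>
> #include "rtree.pb.h"  // hypothetical generated header
>
> // Minimal in-memory node shape, for illustration only.
> struct MemNode {
>   double min_lat, min_lon, max_lat, max_lon;
>   std::vector<int64_t> entity_ids;        // leaf: entities in this cell
>   std::vector<const MemNode *> children;  // inner node: child subtrees
> };
>
> // Copy the tree into nested protobuf messages so the index is
> // persistent on disk and readable by any protobuf client.
> void StoreNode(const MemNode &n, pbf::RTreeNode *out) {
>   out->set_min_lat(n.min_lat); out->set_min_lon(n.min_lon);
>   out->set_max_lat(n.max_lat); out->set_max_lon(n.max_lon);
>   for (int64_t id : n.entity_ids) out->add_entity_id(id);
>   for (const MemNode *c : n.children) StoreNode(*c, out->add_child());
> }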
>
>
> https://code.launchpad.net/~jamesmikedupont/+junk/EPANatReg
>
>
> notes:
> http://www.openstreetmap.org/user/h4ck3rm1k3/diary/9042
>
> Doxygen here:
> http://xhema.flossk.org:8080/mapdata/03/EPANatReg/html/classRTreeWorld.html
>
> On Sun, May 2, 2010 at 7:35 AM, Scott Crosby <scrosby06 at gmail.com> wrote:
> > ---------- Forwarded message ----------
> > (Accidentally did not reply to list)
> >
> >
> >> Some of these questions may be a bit premature, but I don't know how far
> >> along your design is, and perhaps asking them now may influence that
> >> design in ways that work for me.
> >>
> >
> > I'm willing to call what I've designed so far in the file format mostly
> > complete, except for some of the header design issues I've brought up
> > already. The question is what extensions make sense to define now, such
> > as bounding boxes, and choosing the right definition for them.
> >
> >
> >
> >> Unfortunately, this method introduces a variety of complications. First,
> >> the database for TX alone is 10 gigs. Ballpark estimations are that I
> >> might need half a TB or more to store the entire planet. I'll also need
> >> substantial RAM to store the working set for the DB index. All this
> >> means that, to launch this project on a global scale, I'd need a lot
> >> more funding than I as an individual am likely to find.
> >
> > By pruning out metadata, doing some judicious filtering of uninteresting
> > tags, and increasing the granularity to 10 microdegrees (about 1 m
> > resolution), I've fit the whole planet in 3.7 GB.
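> >
> > For the arithmetic behind that granularity: if coordinates are stored
> > as integers in units of the granularity, a stored value decodes back
> > to degrees as below, and one 10-microdegree step is about 1.1 m of
> > latitude. A sketch, assuming that encoding (names are illustrative):
> >
> > #include <cstdint>
> >
> > // Decode a stored coordinate, assuming integer units of 'granularity'
> > // nanodegrees (10 microdegrees = 10000 nanodegrees).
> > double DecodeDegrees(int64_t v, int32_t granularity_nanodeg) {
> >   return 1e-9 * granularity_nanodeg * v;
> > }
> > // One degree of latitude is ~111 km, so 1e-5 degrees is ~1.1 m.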
> >
> >>
> >> Is there a performance or size penalty to ordering the data
> >> geographically rather than by ID?
> >
> > I expect no performance penalty.
> >
> > As for a size penalty, it will be a mixed bag. Ordering geographically
> > should reduce the similarity of adjacent node ID numbers, increasing
> > the space required to store them. It should increase the similarity of
> > latitude and longitude numbers, which would reduce the size. It might
> > change the re-use frequency of strings. On the whole, I suspect the
> > file size would remain within 10% of what it is now, and I believe it
> > will decrease, but I have no way to know.
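> >
> > The mixed bag comes from delta coding: consecutive values are stored
> > as differences, and varints make small deltas cheap. A toy
> > illustration of why the sort order matters, assuming zigzag varint
> > deltas (not necessarily the format's exact encoding):
> >
> > #include <cstdint>
> >
> > // Bytes a varint needs for a zigzag-encoded signed delta.
> > int VarintBytes(int64_t delta) {
> >   uint64_t z = ((uint64_t)delta << 1) ^ (uint64_t)(delta >> 63);
> >   int n = 1;
> >   while (z >>= 7) ++n;
> >   return n;
> > }
> > // Sorted by ID, successive ID deltas are tiny (1 byte each); sorted
> > // geographically, ID deltas can be millions (3-4 bytes each), while
> > // lat/lon deltas shrink instead.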
> >
> >>
> >> I understand that this won't be the
> >> default case, but I'm wondering if there would likely be any major
> >> performance issues for using it in situations where you're likely to
> >> want bounding-box access rather than simply pulling out entities by ID.
> >>
> >
> > I have no code for pulling entities out by ID, but that would be
> > straightforward to add if there were demand for it.
> >
> > There should be no problems at all for doing geographic queries. My
> > vision for bounding-box access is that the file lets you skip 'most'
> > blocks that are irrelevant to a query. 'Most' depends a lot on the
> > data and how exactly the dataset is sorted for geographic locality.
> >
> > But there may be problems in geographic queries. Things like
> > cross-continental airways, if they are in the OSM planet file, would
> > cause huge problems; their bounding box would cover the whole
> > continent, intersecting virtually any geographic lookup. Those
> > geographic lookups would then need to find the nodes in those long
> > ways, which would require loading virtually every block containing
> > nodes. I have considered solutions for this issue, but I do not know
> > if problematic ways like this exist. Does OSM have ways like this?
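> >
> > The block-skipping test itself is cheap; a sketch, assuming each
> > block carries a bounding box (the types are illustrative):
> >
> > struct BBox { double min_lat, min_lon, max_lat, max_lon; };
> >
> > // Visit a block only if its bounding box overlaps the query box.
> > bool Overlaps(const BBox &a, const BBox &b) {
> >   return a.min_lat <= b.max_lat && b.min_lat <= a.max_lat &&
> >          a.min_lon <= b.max_lon && b.min_lon <= a.max_lon;
> > }
> > // A continent-spanning way defeats this test: its box overlaps
> > // nearly every query, so its blocks can never be skipped.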
> >
> >>
> >> Also, is there any reason that this format wouldn't be suitable for a
> >> site with many active users performing geographic, read-only queries of
> >> the data?
> >
> > A lot of that depends on the query locality. Each block has to be
> > independently decompressed and parsed before the contents can be
> > examined; that takes around 1 ms. At a small penalty in file size, you
> > can use 4k entities in a block, which decompress and parse faster. If
> > the client is interested in many ways in a particular geographic
> > locality, as yours seems to be, then this is perfect. Grab the blocks
> > and cache the decompressed data in RAM, where it can be re-used for
> > subsequent geographic queries in the same locality; see the cache
> > sketch below.
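> >
> > A sketch of that caching idea, assuming decompressed blocks are keyed
> > by their file offset; BlockCache and DecodedBlock are illustrative
> > names, not part of the format:
> >
> > #include <cstddef>
> > #include <cstdint>
> > #include <list>
> > #include <unordered_map>
> > #include <utility>
> >
> > // Tiny LRU cache of decompressed, parsed blocks.
> > template <typename DecodedBlock>
> > class BlockCache {
> >  public:
> >   explicit BlockCache(size_t max_blocks) : max_(max_blocks) {}
> >   // Returns the cached block, or nullptr on a miss.
> >   DecodedBlock *Get(uint64_t offset) {
> >     auto it = map_.find(offset);
> >     if (it == map_.end()) return nullptr;
> >     lru_.splice(lru_.begin(), lru_, it->second);  // mark most-recent
> >     return &it->second->second;
> >   }
> >   // Inserts a block not already present, evicting the oldest if full.
> >   void Put(uint64_t offset, DecodedBlock block) {
> >     lru_.emplace_front(offset, std::move(block));
> >     map_[offset] = lru_.begin();
> >     if (lru_.size() > max_) {
> >       map_.erase(lru_.back().first);  // evict least-recently used
> >       lru_.pop_back();
> >     }
> >   }
> >  private:
> >   typedef std::list<std::pair<uint64_t, DecodedBlock> > List;
> >   size_t max_;
> >   List lru_;
> >   std::unordered_map<uint64_t, typename List::iterator> map_;
> > };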
> >
> >>
> >> Again, I'd guess not, since the data isn't compressed as such,
> >> but maybe seeking several gigs into a file to locate nearby entities
> >> would be a factor, or it may work just fine for single-user access but
> >> not so well with multiple distinct seeks for different users in widely
> >> separate locations.
> >>
> >
> > Ultimately, it depends on your application, which has a particular
> > locality in its lookups. Application locality, combined with a file
> > format, defines the working-set size. If RAM is insufficient to hold
> > the working set, you'll have to pay a disk seek whether it is in my
> > format or not. My format, being very dense, might let RAM hold the
> > working set and avoid the disk seek. 1 ms to decompress is already
> > far faster than a hard drive, though not an SSD.
> >
> > Could you tell me more about the kinds of lookups your application
> > will do?
> >
> >>
> >> Anyhow, I realize these questions may be naive at such an early stage,
> >> but the idea that I may be able to pull this off without infrastructure
> >> beyond my budget is an appealing one. Are there any reasons your binary
> >> format wouldn't be able to accommodate this situation, or couldn't be
> >> optimized to do so?
> >>
> >
> > No optimization may be necessary to do what you want; all that would
> > be needed would be to standardize the location and format of
> > bounding-box messages and to physically reorder the entity stream
> > before it goes into the serializer. I have some ideas for ways of
> > doing that; one is sketched below.
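> >
> > One option for that reordering (a suggestion, not a settled part of
> > the format) is a space-filling curve: sort entities by the Morton
> > (Z-order) code of their quantized location before serializing, so
> > nearby entities land in the same blocks:
> >
> > #include <cstdint>
> >
> > // Interleave the bits of quantized lon/lat into a Z-order key;
> > // sorting by this key clusters geographically nearby entities.
> > uint64_t MortonKey(uint32_t x, uint32_t y) {
> >   uint64_t key = 0;
> >   for (int i = 0; i < 32; ++i) {
> >     key |= (uint64_t)((x >> i) & 1) << (2 * i);
> >     key |= (uint64_t)((y >> i) & 1) << (2 * i + 1);
> >   }
> >   return key;
> > }
> > // Sort nodes by MortonKey(lon_q, lat_q) and write them to blocks in
> > // that order; ways can follow the position of their first node.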
> >
> > Scott
> >
> >
> >
> > _______________________________________________
> > dev mailing list
> > dev at openstreetmap.org
> > http://lists.openstreetmap.org/listinfo/dev
> >
> >
>