I have now reworked the reader code to use only the internal Google protobuf features for reading. It can now read the entire file to the end. The code is committed.<br>I have also checked in a small example protobuf file for testing.<br>


<a href="http://github.com/h4ck3rm1k3/OSM-Osmosis">http://github.com/h4ck3rm1k3/OSM-Osmosis</a><br><br>Any testers would be appreciated.<br><br>In the directory OSM-Osmosis/src/crosby/binary/cpp, build and then run:<br>./osmprotoread albania.osm.protobuf3 > out.txt<br>


<br>mike<br><br><div class="gmail_quote">On Sun, May 2, 2010 at 11:25 AM, <a href="mailto:jamesmikedupont@googlemail.com">jamesmikedupont@googlemail.com</a> <span dir="ltr"><<a href="mailto:jamesmikedupont@googlemail.com">jamesmikedupont@googlemail.com</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">OK, my reader is now working; it can read to the end of the file.<br>

I am now fleshing out the template dump functions to emit the data.<br>

<div class="im">git@github.com:h4ck3rm1k3/OSM-Osmosis.git<br>

<br>

</div>My new idea is that we could use a binary version of the rtree. I have<br>

already ported the rtree to my older template classes.<br>

<br>

We could use the rtree to sort the data and emit the blocks based on<br>

that. The rtree data structures themselves could be stored in the<br>

protobuffer so that they are persistent and also readable by all.<br>

<div class="im"><br>

<br>

<a href="https://code.launchpad.net/%7Ejamesmikedupont/+junk/EPANatReg" target="_blank">https://code.launchpad.net/~jamesmikedupont/+junk/EPANatReg</a><br>

<br>

<br>

</div>notes:<br>

<a href="http://www.openstreetmap.org/user/h4ck3rm1k3/diary/9042" target="_blank">http://www.openstreetmap.org/user/h4ck3rm1k3/diary/9042</a><br>

<br>

doxygen here :<br>

<a href="http://xhema.flossk.org:8080/mapdata/03/EPANatReg/html/classRTreeWorld.html" target="_blank">http://xhema.flossk.org:8080/mapdata/03/EPANatReg/html/classRTreeWorld.html</a><br>

<div><div></div><div class="h5"><br>

On Sun, May 2, 2010 at 7:35 AM, Scott Crosby <<a href="mailto:scrosby06@gmail.com">scrosby06@gmail.com</a>> wrote:<br>

> ---------- Forwarded message ----------<br>

> (Accidentally did not reply to list)<br>

><br>

><br>

>> Some of these questions may be a bit premature, but I don't know how far<br>

>> along your design is, and perhaps asking them now may influence that<br>

>> design in ways that work for me.<br>

>><br>

><br>

> I'm willing to call what I've designed so far in the file format mostly<br>

> complete, except for some of the header design issues I've brought up<br>

> already. The question is what extensions make sense to define now, such as<br>

> bounding boxes, and choosing the right definition for them.<br>

><br>

><br>

><br>

>> Unfortunately, this method introduces a variety of complications. First,<br>

>> the database for TX alone is 10 gigs. Ballpark estimations are that I<br>

>> might need half a TB or more to store the entire planet. I'll also need<br>

>> substantial RAM to store the working set for the DB index. All this<br>

>> means that, to launch this project on a global scale, I'd need a lot<br>

>> more funding than I as an individual am likely to find.<br>

><br>

> With pruning out metadata, some judicious filtering of uninteresting tags,<br>

> and increasing the granularity to 10 microdegrees (about 1m resolution),<br>

> I've fit the whole planet in 3.7gb.<br>
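The "10 microdegrees is about 1m" figure above can be checked with a little arithmetic. This is only a sketch: it assumes a spherical Earth with a mean radius of 6,371 km, and the function name `granularity_to_metres` is made up for illustration, not part of the file format or its tools.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumption: mean radius of a spherical Earth


def granularity_to_metres(microdegrees, latitude_deg=0.0):
    """Return the (north-south, east-west) size in metres of one grid step."""
    step_deg = microdegrees * 1e-6
    # Length of one degree of latitude on the assumed sphere.
    metres_per_degree = 2 * math.pi * EARTH_RADIUS_M / 360
    north_south = step_deg * metres_per_degree
    # Meridians converge towards the poles, so east-west steps shrink with cos(latitude).
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


if __name__ == "__main__":
    for lat in (0.0, 45.0, 60.0):
        ns, ew = granularity_to_metres(10, lat)
        print(f"lat {lat:4.1f}: {ns:.2f} m north-south, {ew:.2f} m east-west")
```

Under these assumptions a 10 microdegree step is a little over a metre north-south, and shorter east-west away from the equator.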

><br>

>><br>

>> Is there a performance or size penalty to ordering the data<br>

>> geographically rather than by ID?<br>

><br>

> I expect no performance penalty.<br>

><br>

> As for a size penalty, it will be a mixed bag. Ordering geographically<br>

> should reduce the similarity for node ID numbers, increasing the space<br>

> required to store them. It should increase the similarity for latitude and<br>

> longitude numbers, which would reduce the size. It might change the re-use<br>

> frequency of strings. On the whole, I suspect the filesize would remain<br>

> within 10% of what it is now and believe it will decrease, but I have no way<br>

> to know.<br>
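The "mixed bag" described above can be illustrated with a toy model. This sketch assumes values are stored as deltas from the previous value, zigzag-mapped and written as base-128 varints (the standard protobuf integer encoding); whether the real file layout matches this exactly is an assumption. The node data is synthetic, and sorting by latitude alone is only a crude stand-in for a geographic ordering, so the numbers say nothing about real planet files.

```python
import random


def zigzag(n):
    """Map signed ints to unsigned so that small magnitudes stay small."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def varint_len(n):
    """Number of bytes needed to store an unsigned int as a base-128 varint."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def delta_encoded_size(values):
    """Total bytes when each value is stored as a zigzag varint delta."""
    total, previous = 0, 0
    for value in values:
        total += varint_len(zigzag(value - previous))
        previous = value
    return total


def compare_orderings(n=10_000, seed=0):
    """Compare encoded sizes for ID order versus latitude order (synthetic data)."""
    rng = random.Random(seed)
    # Hypothetical nodes: sequential IDs, random latitudes in units of 10 microdegrees.
    nodes = [(node_id, rng.randrange(-9_000_000, 9_000_001)) for node_id in range(1, n + 1)]

    def sizes(ordered):
        return {
            "ids": delta_encoded_size([node_id for node_id, _ in ordered]),
            "lats": delta_encoded_size([lat for _, lat in ordered]),
        }

    return {"by_id": sizes(nodes), "by_lat": sizes(sorted(nodes, key=lambda node: node[1]))}


if __name__ == "__main__":
    for ordering, result in compare_orderings().items():
        print(ordering, result)
```

In this toy model the ID column grows and the latitude column shrinks when the nodes are reordered, which is the trade-off the paragraph describes.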

><br>

>><br>

>> I understand that this won't be the<br>

>> default case, but I'm wondering if there would likely be any major<br>

>> performance issues for using it in situations where you're likely to<br>

>> want bounding-box access rather than simply pulling out entities by ID.<br>

>><br>

><br>

> I have no code for pulling entities out by ID, but that would be<br>

> straightforward to add, if there was a demand for it.<br>

><br>

> There should be no problems at all for doing geographic queries. My vision<br>

> for a bounding box access is that the file lets you skip 'most' blocks that<br>

> are irrelevant to a query. 'most' depends a lot on the data and how exactly<br>

> the dataset is sorted for geographic locality.<br>
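A minimal sketch of the block-skipping idea above, assuming each block carries a bounding box and that a reader has collected them into an in-memory index of (file offset, bounding box) pairs. The `BBox` class, the index layout and `blocks_to_read` are hypothetical; the thread has not yet standardized where bounding box messages live.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float

    def intersects(self, other):
        """True when the two boxes overlap or touch."""
        return not (
            self.max_lon < other.min_lon
            or other.max_lon < self.min_lon
            or self.max_lat < other.min_lat
            or other.max_lat < self.min_lat
        )


def blocks_to_read(index, query):
    """Return the offsets of blocks whose bounding box overlaps the query box."""
    return [offset for offset, bbox in index if bbox.intersects(query)]


if __name__ == "__main__":
    # Hypothetical index: three blocks at made-up file offsets.
    index = [
        (0, BBox(19.0, 39.5, 21.0, 42.7)),
        (4096, BBox(-10.0, 35.0, 5.0, 45.0)),
        (8192, BBox(19.5, 41.0, 20.0, 41.5)),
    ]
    query = BBox(19.7, 41.2, 19.9, 41.4)
    print(blocks_to_read(index, query))
```

How many blocks such a filter can skip depends on how tightly the data was sorted for geographic locality before it was split into blocks.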

><br>

> But there may be problems in geographic queries. Things like<br>

> cross-continental airways if they are in the OSM planet file would cause<br>

> huge problems; their bounding box would cover the whole continent,<br>

> intersecting virtually any geographic lookup. Those geographic lookups would<br>

> then need to find the nodes in those long ways which would require loading<br>

> virtually every block containing nodes.  I have considered solutions for<br>

> this issue, but I do not know if problematic ways like this exist. Does OSM<br>

> have ways like this?<br>

><br>

>><br>

>> Also, is there any reason that this format wouldn't be suitable for a<br>

>> site with many active users performing geographic, read-only queries of<br>

>> the data?<br>

><br>

> A lot of that depends on the query locality. Each block has to be<br>

> independently decompressed and parsed before the contents can be examined,<br>

> which takes around 1ms. At a small penalty in filesize, you can use 4k<br>

> entities in a block which decompress and parse faster. If the client is<br>

> interested in many ways in a particular geographic locality, as yours seems<br>

> to, then this is perfect. Grab the blocks and cache the decompressed data in<br>

> RAM where it can be re-used for subsequent geographic queries in the same<br>

> locality.<br>
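The caching approach above could look roughly like the following. It is a sketch under stated assumptions: blocks are taken to be zlib-compressed (the thread does not name the codec), `load_compressed` is a hypothetical callback that returns the raw bytes for a block offset, and the cache keeps only the decompressed bytes rather than parsed entities.

```python
import zlib
from collections import OrderedDict


class BlockCache:
    """Least-recently-used cache of decompressed blocks, keyed by file offset."""

    def __init__(self, load_compressed, capacity=256):
        self._load = load_compressed  # hypothetical: offset -> compressed bytes
        self._capacity = capacity
        self._blocks = OrderedDict()
        self.misses = 0

    def get(self, offset):
        if offset in self._blocks:
            # Cache hit: mark the block as most recently used.
            self._blocks.move_to_end(offset)
            return self._blocks[offset]
        # Cache miss: pay the decompression cost once.
        self.misses += 1
        data = zlib.decompress(self._load(offset))
        self._blocks[offset] = data
        if len(self._blocks) > self._capacity:
            # Evict the least recently used block.
            self._blocks.popitem(last=False)
        return data


if __name__ == "__main__":
    # Stand-in for a file: offsets mapped to compressed payloads.
    store = {offset: zlib.compress(b"block %d" % offset) for offset in (0, 4096, 8192)}
    cache = BlockCache(store.__getitem__, capacity=2)
    for offset in (0, 4096, 0, 8192, 0):
        cache.get(offset)
    print("misses:", cache.misses)
```

Repeated queries in the same locality then hit RAM, and only a query that wanders into a new area pays for decompression again.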

><br>

>><br>

>> Again, I'd guess not, since the data isn't compressed as such,<br>

>> but maybe seeking several gigs into a file to locate nearby entities<br>

>> would be a factor, or it may work just fine for single-user access but<br>

>> not so well with multiple distinct seeks for different users in widely<br>

>> separate locations.<br>

>><br>

><br>

> Ultimately, it depends on your application, which has a particular locality<br>

> in its lookups. Application locality, combined with a fileformat, defines<br>

> the working set size. If RAM is insufficient to hold the working set, you'll<br>

> have to pay a disk seek whether it is in my format or not. My format, being<br>

> very dense, might let RAM hold the working set and avoid the disk seek. 1ms<br>

> to decompress is already far faster than a hard drive, though not a SSD.<br>

><br>

> Could you tell me more about the kinds of lookups your application will do?<br>

><br>

>><br>

>> Anyhow, I realize these questions may be naive at such an early stage,<br>

>> but the idea that I may be able to pull this off without infrastructure<br>

>> beyond my budget is an appealing one. Are there any reasons your binary<br>

>> format wouldn't be able to accommodate this situation, or couldn't be<br>

>> optimized to do so?<br>

>><br>

><br>

> No optimization may be necessary to do what you want; all that would be<br>

> needed would be to standardize the location and format of bounding box messages<br>

> and physically reorder the entity stream before it goes into the serializer.<br>

> I have some ideas for ways of doing that.<br>

><br>

> Scott<br>

><br>

><br>

><br>

</div></div><div><div></div><div class="h5">

</div></div></blockquote></div><br>