---------- Forwarded message ----------<br><div class="gmail_quote">(Accidentally did not reply to list)<br><br><br>

<div class="gmail_quote"><div class="im"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Some of these questions may be a bit premature, but I don't know how far<br>


along your design is, and perhaps asking them now may influence that<br>

design in ways that work for me.<br>

<br></blockquote></div><div><br>I'm willing to call what I've designed so far in the file format mostly complete, except for some of the header design issues I've brought up already. The question is which extensions, such as bounding boxes, make sense to define now, and how to choose the right definition for them.<div class="im">

<br>

<br><br><br>> Unfortunately, this method introduces a variety of complications. First,<br>

> the database for TX alone is 10 gigs. Ballpark estimations are that I<br>> might need half a TB or more to store the entire planet. I'll also need<br>

> substantial RAM to store the working set for the DB index. All this<br>> means that, to launch this project on a global scale, I'd need a lot<br>> more funding than I as an individual am likely to find.<br>


<br></div>By pruning metadata, judiciously filtering out uninteresting tags, and coarsening the granularity to 10 microdegrees (about 1 m resolution), I've fit the whole planet in 3.7 GB.<br> </div><div class="im">

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


Is there a performance or size penalty to ordering the data<br>

geographically rather than by ID? </blockquote></div><div><br>I expect no performance penalty.<br><br>As for a size penalty, it will be a mixed bag. Ordering geographically should reduce the similarity between consecutive node ID numbers, increasing the space required to store them. It should increase the similarity between consecutive latitude and longitude values, which would reduce the size. It might also change how often strings are reused. On the whole, I suspect the file size would stay within 10% of what it is now, and I believe it will decrease, but I have no way to know for sure.<br>
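<br>A minimal sketch of why ordering affects size, assuming each value is stored as a variable-length delta from the previous value. That encoding is an assumption made for illustration, the helper names are hypothetical, and this is not the format's actual encoder:<br>
<pre>
```python
def zigzag(n: int) -> int:
    """Map a signed integer to an unsigned one so small magnitudes stay small."""
    return 2 * n if n >= 0 else -2 * n - 1


def varint_len(u: int) -> int:
    """Bytes needed to store an unsigned integer at 7 payload bits per byte."""
    return max(1, (u.bit_length() + 6) // 7)


def delta_encoded_size(values: list[int]) -> int:
    """Total bytes when each value is stored as a zigzag varint delta
    from its predecessor (the first value is a delta from zero)."""
    total = 0
    previous = 0
    for value in values:
        total += varint_len(zigzag(value - previous))
        previous = value
    return total
```
</pre>
Calling delta_encoded_size on the same node IDs, once in ID order and once in a geographically shuffled order, gives a larger byte count for the shuffled order. Coordinates should behave the opposite way, since neighbours in a geographic ordering have nearby latitudes and longitudes.<br>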


 </div><div class="im"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">I understand that this won't be the<br>

default case, but I'm wondering if there would likely be any major<br>

performance issues for using it in situations where you're likely to<br>

want bounding-box access rather than simply pulling out entities by ID.<br><br></blockquote></div><div><br>I have no code for pulling entities out by ID, but that would be straightforward to add if there were demand for it.<br>


<br>There should be no problem at all doing geographic queries. My vision for bounding-box access is that the file lets you skip most blocks that are irrelevant to a query. How many "most" covers depends heavily on the data and on exactly how the dataset is sorted for geographic locality. <br>
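<br>A minimal sketch of that block-skipping idea, assuming each block carries a bounding box and a file offset in some index. The BBox class, the index layout, and blocks_to_read are all hypothetical names chosen for illustration; the real location and format of bounding-box messages is not settled:<br>
<pre>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Axis-aligned box in integer coordinate units."""
    min_lon: int
    min_lat: int
    max_lon: int
    max_lat: int

    def intersects(self, other: "BBox") -> bool:
        """True unless one box lies entirely to one side of the other."""
        return not (
            self.min_lon > other.max_lon
            or other.min_lon > self.max_lon
            or self.min_lat > other.max_lat
            or other.min_lat > self.max_lat
        )


def blocks_to_read(index: list[tuple[BBox, int]], query: BBox) -> list[int]:
    """Return file offsets of the blocks whose bounding box overlaps the query.
    Every other block can be skipped without being decompressed."""
    return [offset for box, offset in index if box.intersects(query)]
```
</pre>
The tighter the per-block boxes are, the fewer offsets come back for a given query, which is why the sort order of the dataset matters so much.<br>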


<br>But some geographic queries may run into problems. Things like cross-continental airways, if they are in the OSM planet file, would cause huge problems: their bounding box would cover the whole continent, intersecting virtually any geographic lookup. Those lookups would then need to find the nodes in those long ways, which would require loading virtually every block containing nodes. I have considered solutions for this issue, but I do not know whether problematic ways like this exist. Does OSM have ways like this? <br>


 </div><div class="im"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Also, is there any reason that this format wouldn't be suitable for a<br>

site with many active users performing geographic, read-only queries of<br>

the data? </blockquote></div><div><br>A lot of that depends on the query locality. Each block has to be independently decompressed and parsed before its contents can be examined, which takes around 1 ms. At a small penalty in file size, you can use blocks of 4k entities, which decompress and parse faster. If the client is interested in many ways in a particular geographic locality, as yours seems to be, then this is perfect: grab the blocks and cache the decompressed data in RAM, where it can be reused for subsequent geographic queries in the same locality.<br>
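<br>A minimal sketch of such a cache, assuming blocks are zlib-compressed and addressed by file offset. Both are assumptions for illustration, and BlockCache and read_compressed are hypothetical names, not part of any existing reader:<br>
<pre>
```python
import zlib
from collections import OrderedDict


class BlockCache:
    """Keep the most recently used decompressed blocks in RAM."""

    def __init__(self, read_compressed, capacity: int = 256):
        # read_compressed(offset) returns the raw compressed bytes of one block.
        self._read = read_compressed
        self._capacity = capacity
        self._blocks = OrderedDict()
        self.misses = 0

    def get(self, offset: int) -> bytes:
        """Return the decompressed block, decompressing only on a cache miss."""
        if offset in self._blocks:
            self._blocks.move_to_end(offset)  # mark as most recently used
            return self._blocks[offset]
        self.misses += 1
        data = zlib.decompress(self._read(offset))
        self._blocks[offset] = data
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)  # evict the least recently used
        return data
```
</pre>
Repeated queries in the same locality hit the same offsets, so the decompression cost is paid once per block for as long as the block stays in the cache.<br>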


<br></div><div class="im"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Again, I'd guess not, since the data isn't compressed as such,<br>


but maybe seeking several gigs into a file to locate nearby entities<br>

would be a factor, or it may work just fine for single-user access but<br>

not so well with multiple distinct seeks for different users in widely<br>

separate locations.<br>

<br></blockquote></div><div><br>Ultimately, it depends on your application, which has a particular locality in its lookups. Application locality, combined with a file format, defines the working set size. If RAM is insufficient to hold the working set, you'll have to pay for a disk seek whether the data is in my format or not. My format, being very dense, might let RAM hold the working set and avoid the disk seek. At 1 ms, decompression is already far faster than a hard drive seek, though not faster than an SSD.<br>


<br>Could you tell me more about the kinds of lookups your application will do?<br> </div><div class="im"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


Anyhow, I realize these questions may be naive at such an early stage,<br>

but the idea that I may be able to pull this off without infrastructure<br>

beyond my budget is an appealing one. Are there any reasons your binary<br>

format wouldn't be able to accommodate this situation, or couldn't be<br>

optimized to do so?<br>

<br></blockquote></div><div><br>No optimization may be necessary to do what you want; all that would be needed is to standardize the location and format of bounding-box messages and to physically reorder the entity stream before it goes into the serializer. I have some ideas for ways of doing that.<br>
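<br>One possible way to do the reordering, sketched here purely as an illustration and not necessarily one of the ideas referred to above, is to sort nodes by a Z-order (Morton) key built from their coordinates. The function names are hypothetical, and the unit of 10 microdegrees is taken from the granularity mentioned earlier:<br>
<pre>
```python
UNITS_PER_DEGREE = 100_000  # one unit = 10 microdegrees (assumed granularity)


def z_order_key(x: int, y: int, bits: int = 26) -> int:
    """Interleave the low `bits` bits of non-negative x and y into one key.
    Bit i of x lands at position 2i, bit i of y at position 2i + 1."""
    key = 0
    for i in range(bits):
        key += ((x // 2**i) % 2) * 4**i
        key += ((y // 2**i) % 2) * 2 * 4**i
    return key


def reorder_nodes(nodes):
    """Sort (node_id, lat_units, lon_units) tuples so that nodes that are
    close on the map tend to be close in the output stream."""
    def key(node):
        _, lat, lon = node
        # Shift into non-negative ranges before interleaving.
        return z_order_key(lon + 180 * UNITS_PER_DEGREE,
                           lat + 90 * UNITS_PER_DEGREE)
    return sorted(nodes, key=key)
```
</pre>
With 26 bits per axis the key covers the full 360-degree longitude range at this granularity. Blocks cut from the sorted stream would then tend to have compact bounding boxes.<br>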


<br>Scott<br><br>

</div></div>

</div><br>