<div dir="ltr"><div dir="ltr">Hi Frederik,<div><br></div><div>Requiring a sequential read makes the PBF format difficult to use in data-parallel processing.</div><div><br></div><div>When files are split into equal-sized chunks to be processed in parallel, each worker must be able to seek to the beginning of the next block (blob) and begin processing there.</div><div><br></div><div>This is not currently possible with the PBF format, as the file _must_ be read sequentially to figure out where one blob ends and the next begins. With an index, or even just a simple delimiter, this could be determined in a parallel-processing scenario.</div><div><br></div><div>My workaround was to pre-process the file and separate the blobs into a delimited format before processing.</div><div><br></div><div>Best,</div><div><br></div><div>Will Temperley</div></div></div><br><div class="gmail_quote"><div dir="ltr">On Mon, 15 Oct 2018 at 23:58, Frederik Ramm <<a href="mailto:frederik@remote.org">frederik@remote.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>
<br>
On 10/15/2018 11:32 PM, Nick Stallman wrote:<br>
> In doing this I noticed that all the tools handling PBF files are<br>
> horribly inefficient. With PBF files being in essentially raw<br>
> chronological order, extracting any kind of meaningful data out<br>
> of them requires scanning the entire file (even if you are only<br>
> interested in a small region) and vast quantities of random reads.<br>
<br>
I don't think your analysis is correct. I am not aware of any tool that<br>
processes PBFs and does random reads - they're all streaming, and worst<br>
case they read the file in full three times, but with no seeking. And<br>
reading "only a region" from a PBF file is kind of a niche use case;<br>
most people get the file that covers the area they need, and load it<br>
into a database, where derived data structures will be built for<br>
whatever the use case is.<br>
<br>
The osmium command-line tool is relatively good and efficient at cutting<br>
out regions from a planet file if needed. Indexing a planet file would<br>
only make sense if your use case involves repeatedly cutting out small<br>
areas from a planet file.<br>
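[For reference, cutting a region with osmium-tool looks something like the following. This is an illustrative sketch: the bounding-box coordinates and file names are made up, while `--bbox`, `--polygon` and `-o` are documented options of `osmium extract`.]

```shell
# Cut a bounding-box extract (minlon,minlat,maxlon,maxlat) from a planet file.
osmium extract --bbox 13.0,52.3,13.8,52.7 planet.osm.pbf -o berlin.osm.pbf

# Or cut along a polygon boundary (e.g. a .poly file) instead.
osmium extract --polygon berlin.poly planet.osm.pbf -o berlin.osm.pbf
```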
<br>
> Judging from the PBF wiki page, all the work was done ~8 years ago and<br>
> included the foresight to have fields for indexing, but from what I can<br>
> find nothing has been done about that since. Adding an index seems like<br>
> a logical step which would reduce processing times for many common<br>
> operations drastically.<br>
<br>
As I said, most people take a PBF and load it into a database, and I<br>
don't see how that processing would benefit from an index. What are the<br>
"many common operations" you are thinking of?<br>
<br>
> Some tools do make their own index or cache but<br>
> it needs to be done for each tool and is sub optimal.<br>
<br>
I'm only aware of Overpass, which is essentially a database<br>
implementation of its own; it not only does regional cuts but also<br>
filters by tags, and would certainly not be able to simply replace its<br>
own database with an indexed PBF.<br>
<br>
> I'm a little tempted to find the time to create an indexed format myself<br>
> if needed and submit patches to the relevant tools so they can benefit<br>
> from it.<br>
<br>
Again, I struggle to understand which operations and tools would<br>
benefit; I don't think the general OSM data user struggles with the<br>
issues an index would solve. I could imagine if you ran a custom extract<br>
server like <a href="http://extract.bbbike.org" rel="noreferrer" target="_blank">extract.bbbike.org</a> then having random, regionally indexed<br>
access to a raw data file could be beneficial, but that's about the only<br>
case I can think of.<br>
> With this scheme, if you needed to make a country extract it would be<br>
> easy: blobs could simply be copied as-is, selected by their geohash.<br>
> A later step could then filter by polygon or bounding box if<br>
> required over the resulting, significantly smaller file. If the entire<br>
> planet file was being imported into PostGIS then it could be done in a<br>
> single pass since everything would be easily locatable.<br>
<br>
The planet is imported into PostGIS in a single pass even now, at least<br>
if you use osm2pgsql.<br>
<br>
I am running a nightly job that splits the planet into tons of country<br>
and smaller extracts on <a href="http://download.geofabrik.de" rel="noreferrer" target="_blank">download.geofabrik.de</a>. It takes a couple of<br>
hours every night. Having an indexed planet file could save a little<br>
time in the process but I'm not sure if it would be worth it. The reason<br>
many people download country extracts from <a href="http://download.geofabrik.de" rel="noreferrer" target="_blank">download.geofabrik.de</a> is<br>
probably not that the planet file isn't indexed and therefore extracting<br>
a region is hard - it's that the planet file is huge and they don't want<br>
to download that much. An indexed planet file would not help these users.<br>
<br>
Not saying you shouldn't try it, but I haven't yet understood the benefits.<br>
<br>
Bye<br>
Frederik<br>
<br>
-- <br>
Frederik Ramm ## eMail <a href="mailto:frederik@remote.org" target="_blank">frederik@remote.org</a> ## N49°00'09" E008°23'33"<br>
<br>
_______________________________________________<br>
dev mailing list<br>
<a href="mailto:dev@openstreetmap.org" target="_blank">dev@openstreetmap.org</a><br>
<a href="https://lists.openstreetmap.org/listinfo/dev" rel="noreferrer" target="_blank">https://lists.openstreetmap.org/listinfo/dev</a><br>
</blockquote></div>
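[Editor's note: to make the sequential constraint Will describes concrete, here is a minimal sketch in plain Python (standard library only), assuming the documented OSMPBF fileblock layout from fileformat.proto - a 4-byte big-endian BlobHeader length, the BlobHeader message (field 1 = type, field 2 = indexdata, field 3 = datasize), then datasize bytes of Blob. The function names are illustrative, not from any real tool.]

```python
import io
import struct

def _read_varint(buf, pos):
    """Decode a protobuf varint starting at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

def blob_offsets(f):
    """Yield (offset, size) of each fileblock by walking headers sequentially.

    Layout per OSMPBF fileformat.proto: a 4-byte big-endian length of the
    BlobHeader message, the BlobHeader itself (field 1 = type, field 2 =
    indexdata, field 3 = datasize), then `datasize` bytes of Blob.
    """
    while True:
        start = f.tell()
        raw = f.read(4)
        if len(raw) < 4:                       # clean end of file
            return
        (header_len,) = struct.unpack(">I", raw)
        header = f.read(header_len)
        datasize = None
        pos = 0
        while pos < len(header):
            tag, pos = _read_varint(header, pos)
            field, wire = tag >> 3, tag & 7
            if wire == 0:                      # varint field
                val, pos = _read_varint(header, pos)
                if field == 3:                 # datasize
                    datasize = val
            elif wire == 2:                    # length-delimited field
                length, pos = _read_varint(header, pos)
                pos += length
            else:
                raise ValueError("unexpected wire type in BlobHeader")
        f.seek(datasize, io.SEEK_CUR)          # skip the Blob body
        yield start, f.tell() - start
```

Each iteration needs the offset produced by the previous one; a worker handed an arbitrary byte offset in the middle of the file has no delimiter to resynchronise on, which is exactly the problem described at the top of this message.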