[Tile-serving] [openstreetmap/osm2pgsql] Possible improvements for new ram middle (#1466)

Thu Apr 29 08:39:37 UTC 2021

The new ram middle (#1461) is better than the previous implementations, but there are several places where it could be improved further. Below are some ideas that we can work on as time permits.

Note that many of these are interrelated and related to other upcoming changes in the code. So in many cases it makes sense to defer decisions on what best to do until later, when we have a better idea of the environment this code runs in. (In fact, this is one reason I am writing this issue instead of improving the code directly; the other is limited time.)

* [ ] The storage for node locations and way node lists is, in the end, a `std::string` (`node_locations_t::m_data` and `middle_ram_t::m_way_nodes_data`, respectively). As more data arrives this will be resized again and again. Copying the memory around when that happens isn't that expensive, so that part isn't too bad. But while a resize takes place, we temporarily need roughly twice the memory, because the old and the new buffer exist at the same time. So when you get near your memory limit, this can fail even though enough memory would be available if we broke the data into smaller chunks.
* [ ] The `node_locations_t` store writes new locations into a temporary cache (`m_block`) which gets written out to the real data storage (`m_data`) when full. This could be done directly instead. We would then no longer need the `freeze()` function.
* [ ] The `node_locations_t` store writes out all ids of one block and then all locations of that block. It might be better to write out all (id, location) pairs in order.
* [ ] The `node_locations_t::block_size = 32` should be evaluated and tuned.
* [ ] Reading node locations back from `node_locations_t` is expensive, because we have to read through a lot of varints to find the data we need. We should probably use some kind of cache so that we can reuse decoded blocks. I have not implemented this, because we might want to run the reading code in multiple threads later. If that's the case, we might want to have the cache outside the `node_locations_t` class so threads don't trample on each other. Or maybe we need a `mutex`.
* [ ] When two-stage processing is requested by the flex output, the middle needs to store tags (and possibly attributes) of objects (currently only way objects, but in the future also nodes and possibly relations) in addition to the node locations and way node lists. The current implementation simply stores the complete objects as they are in the `osmium::memory::Buffer`, which takes a lot of memory. This was the simplest implementation I could think of, and two-stage processing isn't used widely, so it is a reasonable compromise for the time being. But we can do much better here, details TBD.
* [ ] The ram middle could also use the existing disk flat node store or any future disk-based node stores we work on. Or some implementation where the data is stored on disk but we keep the index in memory. This can be improved once we have the new version of the pgsql middle and see better where we can share code.
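
To make the first item in the list above more concrete, here is a minimal sketch of what "smaller chunks" could look like. It is not osm2pgsql code: the class `chunked_store_t` and all its members are hypothetical names made up for illustration. The idea is that growing the store only ever allocates one new fixed-size chunk, so the existing data never has to be copied into a buffer twice the size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical append-only byte store (illustration only, not part of
// osm2pgsql). Data lives in fixed-size chunks instead of one big string.
class chunked_store_t
{
public:
    explicit chunked_store_t(std::size_t chunk_size)
    : m_chunk_size(chunk_size)
    {
        assert(chunk_size > 0);
    }

    // Append len bytes and return the global offset of the first byte.
    std::size_t append(char const *data, std::size_t len)
    {
        std::size_t const offset = m_size;
        while (len > 0) {
            if (m_chunks.empty() || m_chunks.back().size() == m_chunk_size) {
                // Only a new chunk is allocated; old chunks stay in place.
                m_chunks.emplace_back();
                m_chunks.back().reserve(m_chunk_size);
            }
            std::string &chunk = m_chunks.back();
            std::size_t const n =
                std::min(len, m_chunk_size - chunk.size());
            chunk.append(data, n);
            data += n;
            len -= n;
            m_size += n;
        }
        return offset;
    }

    // Read a single byte at a global offset (must be < size()).
    char at(std::size_t offset) const
    {
        assert(offset < m_size);
        return m_chunks[offset / m_chunk_size][offset % m_chunk_size];
    }

    std::size_t size() const noexcept { return m_size; }
    std::size_t num_chunks() const noexcept { return m_chunks.size(); }

private:
    std::size_t m_chunk_size;
    std::size_t m_size = 0;
    std::vector<std::string> m_chunks;
};
```

The outer `std::vector` still reallocates as it grows, but that only moves the small string objects, not the chunk contents (as long as the chunk size is larger than the small-string buffer). The price is that data can straddle a chunk boundary, so readers that want contiguous bytes need to handle that case or entries must be kept from crossing chunks.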

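For the cache of decoded blocks mentioned above, one possible shape is a small cache object that lives outside the store, so each reader thread can own its own instance and no `mutex` is needed. The following is only a sketch under that assumption. `block_cache_t`, `decoded_block_t`, `location_t` and the decoder callback are all hypothetical names; the real varint decoding in `node_locations_t` is not shown and is represented here by a caller-supplied function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a node location (illustration only).
struct location_t
{
    std::int32_t x;
    std::int32_t y;
};

// One decoded block: (id, location) pairs in id order.
using decoded_block_t = std::vector<std::pair<std::int64_t, location_t>>;

// Hypothetical single-entry cache for the most recently decoded block.
// Intended to be owned by one thread; it does no locking itself.
class block_cache_t
{
public:
    // The decoder turns a block index into the decoded block. It stands
    // in for whatever code reads the varints from the real storage.
    using decoder_t = std::function<decoded_block_t(std::size_t)>;

    explicit block_cache_t(decoder_t decoder)
    : m_decoder(std::move(decoder))
    {}

    // Return the decoded block, decoding only if it is not the cached
    // one. The reference is valid until the next call with another index.
    decoded_block_t const &get(std::size_t block_index)
    {
        if (!m_valid || m_index != block_index) {
            m_block = m_decoder(block_index);
            m_index = block_index;
            m_valid = true;
        }
        return m_block;
    }

private:
    decoder_t m_decoder;
    decoded_block_t m_block;
    std::size_t m_index = 0;
    bool m_valid = false;
};
```

A single cached block only helps if consecutive lookups tend to hit the same block, which seems plausible for the nodes of one way but would need measuring. A cache holding several blocks would follow the same pattern.
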
https://github.com/openstreetmap/osm2pgsql/issues/1466