[OSM-dev] pbf2osm development has started [code to test!]

Scott Crosby scrosby at cs.rice.edu
Thu Sep 30 15:59:33 BST 2010


On Thu, Sep 30, 2010 at 6:04 AM, Stefan de Konink <stefan at konink.de> wrote:

> On Thu, 30 Sep 2010, Scott Crosby wrote:
>
>  I proposed exactly this for the mkgmap splitter. If you're going to do
>> this,
>> can I propose a tweak where you can output thousands of files
>> simultaneously? The differences are minor: Instead of tracking if a
>> node/way/relation was output or was missed with two bitsets, track which
>> areas it has been dumped to with two multimaps from the ID to the list of
>> areas it was output to or the list of areas in which it was missed.
>>
>
> How would you a priori know if a node is in a certain area if you haven't
> observed its locatio or trace it?
>
>
You don't know a.priori. However, you can track when which
nodes/ways/relations have already been output and which are missing. On the
second pass, output the missing ones, and also add to the list of.

But in the first pass, you output nodes&ways&relations. You also track which
nodes/ways/relations have already been output, and also track *missing*
nodes/ways/relations. (So that you know that they need to be output on the
second pass.)

DO:

while (node = stream.getNode()) {
   nodeid = node.getId();
   // Handle missing nodes.
   for (areaid : missingNodesMultiMap.get(nodeid) {
       areas[areaid].writeNode(node);
       alreadyOutputMultiMap.addKeyVal(nodeid,areaid);
   }
   missingNodesMap.clearKey(nodeid)
   // Only on the first pass:
   if (isFirstPass) {
     for (area : areas) {
         if area.containsNode(node) {
             area.writeNode(node)
             alreadyOutputMultiMap.addKeyVal(nodeid,area.getId());
             missingNodesMultiMap.remove(node.getId(),area.getId());
         }
      }
  }

while (way = stream.getWay()) {
   wayid = way.getId();
   // Handle relations reporting us as missing on second and later passes.
   for (areaid : missingWaysMultiMap(way)) {
      for (nodeid : way.getNodesIds()) {
        if (!alreadyOutputNodes.contains(nodeid,areaid))
           missingNodes.addKeyVal(wayid,areaid);
        }
   missingWaysMap.clearKey(wayid)

  // Only needed on the first pass to find the inital set of ways in the
area ; any way that contains a node in the area in question.
  if (isFirstPass) {
    for (area : areas) {
        areaid=arae.getid();
       for (nodeid : way.getNodesIds()) {
        if (alreadyOutputedNodes.contains(nodeid,areaid))
            toOutput = true;
        }
       for (nodeid : way.getNodesIds()) {
         if (toOutput && !alreadyOutputNodes.contains(nodeid,areaid))
             missingNodes.addKeyVal(nodeid,nodeid);
        }
    }
}

// Relations are similar to ways. Track what has been output and what
contained relations/ways/nodes are missing and need to be output on the next
pass.

WHILE( missingNodes && missingWays && missingRelations are nonempty);

>From my java benchmarks done with the splitter, a whole planet can be split
into 6000 disjoint areas at a time on a 8gb machine.


>> BBOX?
>>
>> Count of nodes/ways/relations in that block?
>>
>> What else?
>>
>
> I would find it very interesting if different types of output could be
> exported individually.


Definitely doable, but this needs to be thought through. To preserve the
semantics of each node/way/relation is in the file exactly once, the
exact partitioning strategy needs to be planned very carefully to guarantee
that each node is in exactly one partition, guarantee that all
of required partitions can be found, so that the output isn't different,
while being efficient and skipping as much of the file as possible.
Furthermore, it should work when composed with geographic sorting.

For example being context aware. Some data is landuse, I don't need landuse
> for routing, so it might be exported in a completely different part of the
> pbf. So if the format would be descriptive about 'exclusive roads' that
> might also help the application that uses the data to extract or leave the
> set.
>
> I don't think that counts are useful. The mbr is.


Veeac requested the counts in the prior message.



>  Ok, Is XML's gzipped size or parsing speed a bottleneck for storing or
>> processing changes? I'd be happy to offer suggestions on the protocol
>> buffer
>> architecture.
>>
>
> From what I observe now the bottleneck seems to be actually protocol
> buffers, while my output code can become slightly faster.


This would imply that doing binary changesets isn't a critical necessity.

Scott
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openstreetmap.org/pipermail/dev/attachments/20100930/87e59841/attachment.html>


More information about the dev mailing list