[josm-dev] Path to spatial index [long][Was: Refactoring of the JOSM architecture vs. Plugins]

Tue Aug 19 21:11:33 BST 2008

Frederik Ramm napsal(a):
> Ok, then let's do it in small steps in order to (a) break as little as 
> possible and (b) give everybody a chance to understand what's happening. 
> Let's make the spatial index our concrete goal and do whatever is 
> required to get this off the ground.

This is very constructive approach, thanks.

With the risk of explaining obvious, let me describe the spatial index related
design decisions I made while implementing josm-ng, the reasoning behind them
and how do I evaluate them now. This should help us understand required granularity
of the changes and impact of the decisions on different (josm) model.
When I'll talk about memory usage, sample data would be germany.osm with around
8M nodes and 1M ways - a dataset that is usable in josm-ng on a 2GB RAM (~1GB java heap).
Note also that I might be wrong on some of the josm's inner workings, having spent
much more time on the -ng codebase.

> Do you propose having the spatial index as part of the OsmPrimitive 
> object, or would you recommend having an extra "display" object instance 
> to sit next to each OsmPrimitive that handles these things?

The index is not going to be "inside" OsmPrimitive - it is going to index the primitives themselves.
So it has to be either part of the DataSet or part of the OsmDataLayer.

Let's look at the use cases (API clients) for the index:
A) rendering. No question about it. You want to render only the things on the screen.
B) selection. Whereever you click, josm tries to find the thing you have clicked on. Having
    a filter here is very desirable for larger datasets.
B1) thing-under-cursor-in-status-line. A variation of B, have to be superfast.
c) merging - when merging DataSets, the MergeVisitor tries to match new nodes (id=0) if they have
    the same coordinates. But given that merging is already fast enough and this may unnecessarily
    complicate the code, I won't emphasis this.
d) back references lookup - again filtering the candidates would speed up things tremendously.

The first decision is what to base the index on. You can index either according to latLon (1,model space)
or eastNorth (2,viewSpace). Both (A) and (B) support (2) and this is what I did in -ng.
Less important (c) and (d) would vote for (1) and you can always transform the filter rectangle
between coordinate spaces, at worst enlarging it slightly.

Now back to the "extra display object" - this is what I did in -ng (View class), and it would slightly
correlate to the eastLong field of josm's Node. This was a clean-design design decision:
I have a model (DataSet) that contains only the primary (osm) data, then I have a view cache inside
the OsmLayer view implementation, which is completely transient - I can recreate it anytime just from
the primary data and this is what I really do e.g. in case of runtime projection switch.
In theory, I could have two views of the same DataSet visible at once, each of them in different
projection. Also, I could parse a DataSet for some batch processing and there will be no unnecessary
clutter wasting my memory. This is the good side.
The bad side is increased memory usage for the common (editing) case. For each model entity, there
is a view entity (at least 8B on additional object header, 72MB for germany). There is also
a mapping between the two (64MB in quite densely populated Storage's object array, don't even think
of HashMap, and another hypothetical* 32MB in back references) and a view-hierarchy equivalent
of all the Way's node collections (50MB). All in all, 30% more memory.
For josm, we should for long stick with single entity (it comprises so many objects already anyway)
and that makes indexing by lonLat a very viable option. At the cost of making OsmPrimitive even more
cluttered later (see below)

*) JVMs typically have 8B object size granularity, so, simplified, every second field is for free.

> We'd need protected access to the node position
Right. I would even go as far as updating the other coordinate automatically.
On the other hand, most of the position writes should go through the MoveCommand anyway,
so it won't be that painful.

> and member list for ways 
Right. This looks more painful on the first sight, but given the fact that every commited
node list modification should be done by ChangeCommanding the old way with a new one, it's
theoretically done already.
But there's an opportunity to enforce that contract (and that is,
prevent undo system from a disaster caused by a random plugin) once we're touching the code.
There are 1,5 options here and taking this path means quite a lot of changes instead of none:

1. Let nodes be a read-only view of the actual nodes collection. Everybody would _have to_
   create a new way to set the nodes. The pattern would then change from:
			Way wnew = new Way(wold);
			wnew.nodes.add(node);
			cmds.add(new ChangeCommand(wold, wnew));
   to
			List<Node> nodes = new ArrayList(wold.nodes);
			nodes.add(node);
			Way wnew = new Way(wold, nodes);
			cmds.add(new ChangeCommand(wold, wnew));
   IMHO preferable initial solution for josm (not for -ng, for a reason)

1,25. let getNodes() return a separate copy of the collection:
			List<Node> nodes = wold.getNodes();
			nodes.add(node);
			Way wnew = new Way(wold, nodes);
			cmds.add(new ChangeCommand(wold, wnew));
   Simplifies code for common modification case but costs time during rendering.
   This might be a good intermediate step after (1) while leaving the nodes field public.

1,5. let getNodes return a copy-on-write List view over private Node[] nodes.
   This trick removes the duality, allows you to easily use given collection,
   yet reduces the allocation during rendering. I'm going to implement this for -ng (which
   currently uses getNodes() returning read-only view), so such a custom List implementation
   will be available under GPLv2 for you.

> (and relations for good measure), right?
Right. You do some attempts on relation rendering, so filtering them will come handy.
Not because there are many of them (to the contrary), but because the index API
would be very simple then: Iterable<OsmPrimitive> giveMeAllThePrimitivesInside(Bounds bounds);
(or better "visitAllThePrimitivesInside(Bounds bounds, Visitor vis);"

 > Do we also have to protect the
> tag set or other features, initially?

Not initially, but you will, earlier or later, end up caching the associated style too,
(this is the "below" where you'll clutter OsmPrimitive with another field) and this is
tags dependent. Also, the index might take current zoom into account too and structurally
filter out unimportant (highway=residential, untagged node) primitives when looking
at the whole shebang. See:
http://svn.openstreetmap.org/applications/editors/josm-ng/src/org/openstreetmap/josmng/view/osm/View.java

Now more to the index implementation (the most important part of this message, I think):

In the very first approach, I used a simple 1d index of nodes (x axis), no index for ways
(just kept current bbox) and skipped painting nodes at all when zoomed to wide.
This worked quite fine (speed wise) but I used simple TreeMap for keeping the nodes.
Don't even try anything like that. That is, anything that keeps a cell (TreeMap$Entry
in my case) for every primitive. It is just a waste of memory and you don't need that
much detail.
josm-ng currently uses a self-tuning qtree that splits the cell whenever there are >100
entries in there. If you zoom into a sparse area, you'll get tens of offscreen objects,
but who cares as long as there aren't tens of thousands of them. If you're worried,
you can still postfilter them before actual paintin. See:
http://svn.openstreetmap.org/applications/editors/josm-ng/src/org/openstreetmap/josmng/view/osm/QTree.java

The second important thing is the way you'll react to modifications. You've got the hook
in the form of setCoords, so what would you do?
1. Update the index for given Node (that is, remove and readd it)
2. Find all the referers (ways, relations). Well, (d) from third paragraph has higher
    importance, it seems, as long as you don't have self-consistent backref baked into
    the DataSet (josm-ng does)
3. Update referer bboxes (maybe recursively)

Is this right? NO! Imagine a user drags a 4000node forest (there is one such in Germany,
and I've even been (physically) to one 5500+ nodes in Poland recently ;-) )
You're going to get 4000 node updates and you're likely to end up visiting all the 4000 nodes
each time to compute the new bounds (4000^2). Also, possibly traversing a large hierarchy
of relations (there is one such in Germany, covering all the motorways, boundaries and so on)
each time won't be nice too. Safer is to:
1. Add the node to a queue. Period. Full stop.
OK, 2. before each index inquiry, collect all the primitives influenced by the queue content
and update them only once.

That's all for now. HTH.

-- 
Petr "Nenik" Nejedly, NetBeans/Sun Microsystems, http://www.netbeans.org
355/113 -- Not the famous irrational number PI, but an incredible simulation!