[josm-dev] Optimizations for larger data sets

Sat Dec 8 15:16:07 GMT 2007

Hello,

Czech mapping community started generating czechia.osm subsets of the world map
and we'd like to be able to work with such a dataset inside JOSM.

Current trunk has quite big troubles working with such a dataset (~500.000 nodes,
50.000 ways), but I have already done few changes in my copy that makes
working on the data completely comfortable.

These changes are, so far, more local optimizations than real scalability changes,
but it may change in future. I'd like to be as little intrusive as possible,
at least initially.

I'll try to split and submit my changes in digestible pieces for better review-ability
and smaller granularity of potential disagreements. I just wanted to post a heads-up
in advance so we can clear up any misunderstanding early.

Why would anybody want to work with such a dataset in JOSM?
   - first, everybody will benefit from performance improvements, regardless of the dataset
     size. Even after just fetching maximum allowed fragment around Prague from server,
     JOSM feels slower.
   - you can easily work offline on the data set, doing edits on the road across whole
     country.
   - We use czechia.osm as input for navit. Being able to fix small bugs in map
     and regenerate the navit map is very valuable.

Overview of changes done so far (for usage w/o mappaint):
   1. Paint arrow on selection even when arrows are disabled.
   2. Avoid duplicate Strings in data (reduces memory used
      by czechia.osm 260MB->160MB)
   3. Different implementation of tags Map in OsmPrimitive.
      HashMap is unnecessarily expensive for such few entries
      OSM primitives typically carry. Czechia.osm: 160MB->106MB
   4. Use ArrayLists instead of LinkedLists in DataSet. They are
      cheaper memory-wise and faster to iterate. 106MB-> <100MB
   5. Rewritten MergeVisitor to avoid O(N^2) algorithm.
      Otherwise updating a window from the server takes looong
      time and you'd like to do such an update before you push
      your few-days-old-snapshot based changes back to the server.
   6. Reasonable offscreen filtering in SimplePaintVisitor, limit usage
      of floats during painting, limit doing the same transformation
      twice, don't paint too small objects.

These changes were possible and effective without the need to encapsulate
data fields in the dataset. I have few more ideas that would require
encapsulating the data inside the DataSet, but such a change would be
incompatible with current plugins. If there is a will, we could do this
in phases: first introduce API replacing direct field and collection access,
then rewrite plugins to the API and finally make the fields private and
implement the improvements:
  *) Order the nodes in the collection along single axis. This would
     allow faster (typically O(sqrt(N)) painting of nodes in close-zoom cases.
  *) Keep precomputed bbox of complete ways:
     This allows faster painting of ways both in close-zoom (you filter out
     the complete way quickly if it is off-screen) and wide-view (you filter
     out/dot-replace small ways) cases.

For even larger data sets, it might help to split the data into more thematic
layers and keep some of the layers in optimized/non editable format just for
rendering. For example, for Czech Republic, we expect import of forest
database into the OSM, which will enlarge the data set significantly.
This part of the data will hardly be manually edited, though.

I'd like to try optimizing mappaint mode later, as it is probably highly
preferred over SimplePaint (even by me). There seems to be some low
hanging fruit there as well.

Would you accept the above mentioned changes? Even the intrusive ones?
Do you have your own improvement ideas?

-- 
Petr "Nenik" Nejedly, NetBeans/Sun Microsystems, http://www.netbeans.org
355/113 -- Not the famous irrational number PI, but an incredible simulation!