[Routing] [OSM-dev] pre-compute routing

Sun Sep 30 21:38:22 BST 2007

Hi,

Steve Coast wrote:
> (please reply to the new routing list, sent to dev so that you know  
> it's there)
> 
> We have about 4 million ways.

...but we don't need to route between them all.  The thing to route 
between, I think, would be "fork" nodes - i.e. nodes where more than two 
ways join.  I have a query running to find these (as the first stage to 
producing more routing-optimised data).  At the start and end of the 
route, assuming the starting/ending node isn't itself a fork node, you 
end up with a choice of (at most) four routes (from the next fork node 
in each direction on the start and end ways) and pick the 
shortest/fastest best of those four.

My query isn't finished yet (I have ideas for a much faster version, but 
I want this one to get finished first).  But I'm going to go out on a 
limb and say that we have about a million fork nodes.

> An all pairs routing graph has therefore 16 million entries. We  
> assume a route from a to b may be different than from b to a due to  
> one way streets, turn restrictions etc.

I think your maths here is wrong (if I'm wrong, a lot of the rest of 
this mail is going to be irrelevant).  If I had 4 nodes I was routing 
between (and assuming I respect one-way restrictions), I'd have

1->2, 1->3, 1->4, 2->1, 2->3, 2->4, 3->1, 3->2, 3->4, 4->1, 4->2, 4->3

= 12 routes.  Basically, if I have N nodes, I have N*(N-1) routes.  For 
your figure of 4 million, I end up with 16 trillion routes.  This enters 
the realm of the unrealistic storagewise (at least for OSM, probably not 
for Google).  I'm not sure you need to be quite so pessimistic about the 
one-way stuff, though - if a route's not affected by one-way stuff, you 
could just store one of the directions and use some simple logic to 
retrieve that for the other direction too if the direction you're 
looking for doesn't have a route.

For 4 nodes, I then have

1->2, 1->3, 1->4, 2->3, 2->4, 3->4

= 6 routes.  For N nodes, (N*(N-1))/2.  Assume 25% of routes have a 
differing reverse direction, then we have ((N*(N-1))/2)*1.25 -> 
5(N*(N-1))/8.

Going with the million-fork-node assumption from above, all-pairs would 
be 5*(1000000*999999)/8 ~= 625,000,000,000 routes.  Assume 75% of these 
are unroutable (different continents, mixing of cycle paths and 
motorways, etc.) and you have 156,250,000,000 routes.

> Almost every route is going to be something like "take street a,b,c,  
> get on a motorway, come off and do streets d,e,f". That is, almost  
> every route is going to be fairly simple. 

Not if you also do shortest route - these tend to be significantly 
longer.  I've also had ideas about things like "scenic route" (weight 
ways on scenicness based on surrounding streets, green areas, etc.), 
"shortest route, but keep it to vaguely sane roads - i.e. 
non-residential, non-track, etc.".  Some kind of 
use-trains-or-flights-when-appropriate thing would rule too.

> Let us a assign L to be the  
> length of route in number of ways traversed. Of course many routes  
> won't exist (you can't drive across oceans).

Even assuming you do the famous Google "swim across Atlantic ocean" 
thing, this is effectively the concatenation of two routes.

> Let us assign N to be the number of types of route, and lets say N is  
> 3 for car, walk, cycle.

I'd guess at N=7 or more.  Car shortest, car scenic, car fastest.  Cycle 
normal, cycle avoid-fast-roads (or something).  Walk.  The 
use-trains-or-flights-when-appropriate thing.  Doubtless other cool 
things that people think of.

> Let us assign B to be the the number of bytes per way id. B=2, 16  
> bits can currently encode all ways.

Er, how?  We have more than 65535 ways - I make that you need at least 
22 bits as of right now (and 23 before very long).  However, since 
you're (presumably) not going to be storing the descriptions, you need 
to do a query on ways anyway.  By making your query return slightly more 
data, you can significantly save on route storage.  At each fork node, 
get all the ways which proceed from that node, sorted by way ID.  You 
can then store routes as:

Start at node 12345
First fork node: take way with the lowest ID = 0
Second fork node: take way with the third lowest ID = 2
etc.

Assuming the most ways joining at a single node is 16 (if anything, 8 
seems more likely), you need 4 bits per step of the route (making 16 the 
shinier choice, since 4 bits/step fits much more nicely into bytes).

> The total disk space to store all routes is therefore
> 
> ~= 4 * 10e6 * (L * B) * N

By my calculations thus far:

~= 1.5625 * 10e11 * (L * 0.5) * N

> I'm going to go out on a limb here, and almost any route I've got  
> directions for is about 10 steps. Some will be more of course, but  
> many will be much less.

I'd go with 20, bearing in mind the shortest-route thing and so on. 
It's easy to get 50+ steps on these routes, so that seems more 
reasonable to me.

> ~= 4 * 10e6 * (10 * 2) * 3 ~= 240 million bytes.

~= 1.5625 * 10e11 * (20 * 0.5) * 7

~= 10,937,500,000,000 bytes

~= 10,430,813 MB

~= 10,186 GB

~= 10TB

> Anyway the key point is we can pre-compute all routes and serve them  
> from one box fairly easily if we want to. Even if you add an order of  
> magnitude complexity, it's doable.

With 10TB - its doable, but I think we can drop "easy"...  You could cut 
it down to 1 type of route from 7 for starters, getting you down to a 
couple of TB, but it's still pretty big.

> This, I'm told, is exactly what Google do. But their dataset is a bit  
> bigger.

Google have lots of 10TBs, I'm guessing OSM doesn't (yet) have one 10TB...

But!  All is not lost.  Almost all of that incredible number of routes 
will never be asked for.  So you can probably cache every route anyone 
ever asks for.

Looking back to the precalculation goal, though: 99,9% of the routes you 
have will be A->B->the entirety of some other route->C->D.  You could 
store a "jump" to another route as being "take the 16th highest way ID" 
followed by a route ID.  You can recognize when to do this jumping by 
doing some variant of Longest Common Subsequence matching whenever you 
generate a new route.  Unfortunately, LCS is itself NP-complete and 
CPU-intensive.

Long story short: I think it *might* be possible to precalculate stuff. 
  But I think if you go down that road, you need to be using every trick 
in the book to save space (or getting someone to donate you a very large 
--- and growing --- storage array).  Assuming you're saving space, this 
will extend the time you spend precalculating.  Which means you need to 
be thinking of something more like a monthly or two-monthly update cycle 
for the routing DB.  I don't know how good a match this is to OSM, at 
least at its current stage of life.

--
Jon