[Geocoding] Agenda
Brian Quinion
openstreetmap at brian.quinion.co.uk
Wed Jul 15 12:32:41 BST 2009
>>> ... UK is not a good test case...
>> The code handles the obvious variants, streets, saints, etc.
>> Undoubtedly there are some that I've missed and the current list would
>> need to be extended.
>
> What about the non-British stuff, like Sankt and the equivalence of German ß
> and ss or ü and ue or Danish Å and Aa for example? Or the way in which
> sometimes people write Potsdammerplatz and others Potsdammer Platz (though
> maybe your algorithm isn't as sensitive to word units as mine is). Can you
> deal with user searches for "Potsdammerplatz" when the database has
> "Potsdammerpl" and vice versa?
Most of this should be fine, although this code hasn't really been well
tested yet - I hope to hand it to the Germans and they can tell me if
it works :-) At the moment matching Potsdammerplatz against Potsdammer
Platz wouldn't work. I have some trigram-matching code that would
probably allow this to work, but at the moment it is turned off because
it was too slow to be useful. I'm hoping to try using a custom aspell
dictionary to solve this - I've had good experience with this
technique in the past - but it's still on the todo list.
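
Just to make the trigram idea concrete, here's a rough Python sketch of
the sort of overlap measure I mean - illustrative only, not the actual
code, which would need to run as a database index to be fast enough:

  def trigrams(name):
      """Letter trigrams of a name, ignoring case, spaces and punctuation."""
      s = ' ' + ''.join(ch for ch in name.lower() if ch.isalnum()) + ' '
      return {s[i:i + 3] for i in range(len(s) - 2)}

  def similarity(a, b):
      """Jaccard-style overlap between two trigram sets, 0..1."""
      ta, tb = trigrams(a), trigrams(b)
      return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

  # Once spaces are stripped, 'Potsdammerplatz' and 'Potsdammer Platz'
  # produce the same trigram set, so they match with similarity 1.0.
  print(similarity('Potsdammerplatz', 'Potsdammer Platz'))

The aspell approach would presumably work the other way round: build a
custom dictionary from the indexed names and let the spell-checker suggest
the nearest known form for a query it can't match directly.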
> Do you deal with object types as well? For example, if a cinema is tagged
> "name=Odeon" will "Odeon cinema" give you a hit, or "cinemas near Chelsea"?
Well, there doesn't seem to be an Odeon in Chelsea, so the last one
certainly won't work :-)
Substitute Kensington and it seems to work for all these examples, as
will 'kino kensington', since this is my one (and only) item of German
test data. 'Odeon cinema' produces a slightly odd result because it
first finds extra word matches and there are 2 items explicitly
labelled 'Odeon cinema' in the UK - but adding a town/city forced the
right result - and I will now go and edit those 2 items in OSM ready
for the next import to fix it for good!
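
Roughly, what I mean by treating a type word as a filter rather than as
part of the name is something like this (a simplified sketch with made-up
table entries, not the real lookup):

  TYPE_WORDS = {
      'cinema': ('amenity', 'cinema'),
      'cinemas': ('amenity', 'cinema'),
      'kino': ('amenity', 'cinema'),      # German word for cinema
      'station': ('railway', 'station'),
  }

  def split_query(words):
      """Separate recognised object-type words from the name/place words."""
      filters = [TYPE_WORDS[w] for w in words if w in TYPE_WORDS]
      name_words = [w for w in words if w not in TYPE_WORDS]
      return name_words, filters

  print(split_query('odeon cinema kensington'.split()))
  # -> (['odeon', 'kensington'], [('amenity', 'cinema')])

The odd 'Odeon cinema' result comes from the name match winning over the
type filter when two objects carry the type word in their name.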
>> Some of the examples Ed Freyfogle gave are
>> possibly beyond what could reasonably be parsed, although it is always
>> fun to try.
>
> Indeed. (Though ironically the worst one I thought - the txtspk one, what
> was it, 2ton for Twoton or some such - would be almost trivial to do in
> Namefinder!).
Really? I thought that one was horrible because of the multilingual
aspect of the problem - does 2 mean 'two', 'zwei' or 'deux'? - and
substituting in the other direction requires something that can parse
multilingual numbers. I could see the list being effectively
infinite and couldn't see any way round the problem. How were you
thinking this could be solved / indexed?
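
To illustrate why I think the list explodes, even a toy substitution table
needs an entry per digit per language, in both directions (the languages
and words below are just examples):

  NUMBER_WORDS = {
      'en': {'1': 'one', '2': 'two', '3': 'three'},
      'de': {'1': 'eins', '2': 'zwei', '3': 'drei'},
      'fr': {'1': 'un', '2': 'deux', '3': 'trois'},
  }

  def variants(token):
      """Every spelling a token like '2ton' might stand for."""
      out = {token}
      for words in NUMBER_WORDS.values():
          for digit, word in words.items():
              if digit in token:
                  out.add(token.replace(digit, word))
              if word in token:
                  out.add(token.replace(word, digit))
      return out

  print(variants('2ton'))   # {'2ton', 'twoton', 'zweiton', 'deuxton'}

and that's before compound numbers, ordinals, or names where the digit is
part of a word that shouldn't be expanded at all.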
> Sometimes the biggest issue is thinking what might happen. A recent request
> I had was very reasonably for "somewhere railway station" to match rather
> than "somewhere station" (and that of course leads to "somewhere train
> station" and "somewhere rail station" too).
Yes - at the moment I've got a fairly short list of special case
words/phrases - I can see it getting longer VERY quickly!
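
Roughly, what I mean is a phrase-to-canonical-form mapping applied before
matching - something like this sketch (the entries are examples, not the
actual list):

  PHRASE_VARIANTS = {
      'railway station': 'station',
      'train station': 'station',
      'rail station': 'station',
      'saint': 'st',
      'street': 'st',
  }

  def normalise(query):
      """Collapse known phrase variants to one canonical form before matching."""
      q = ' %s ' % query.lower()
      for phrase, canonical in PHRASE_VARIANTS.items():
          q = q.replace(' %s ' % phrase, ' %s ' % canonical)
      return q.strip()

  print(normalise('Somewhere Railway Station'))   # 'somewhere station'

The same normalisation has to be applied to the names at import time, of
course, so both sides end up in the canonical form.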
>>> Does it deal with updates as well as reloads (which are clearly
>>> impractical as a frequent solution)?
>>
>> This is the big 'todo' item, although osm2pgsql includes a framework
>> for doing this that means that a lot of the work has already been
>> done. My intention was to try and get something out for people to
>> test and then start work on this.
>
> I don't know about your algorithm, but mine is focussed around names, and
> because people move objects, split ways, delete objects a change can result
> in a name not present in the update becoming "visible" to the search if a
> similarly named object is deleted. Deleting a way doesn't necessarily mean
> that street disappears from the search.
That sounds tricky. I like the idea of merging adjacent ways sharing
the same name - it feels very much like the right approach - but I can
see how that could make updates a lot trickier.
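
Just to make the idea concrete (purely an illustrative sketch - not your
code or mine): adjacent ways with the same name can be grouped by their
shared end nodes, e.g. with a union-find over way ids:

  from collections import defaultdict

  def merge_named_ways(ways):
      """ways: list of (way_id, name, [node_ids]); returns groups of way ids."""
      parent = {w[0]: w[0] for w in ways}

      def find(x):
          while parent[x] != x:
              parent[x] = parent[parent[x]]
              x = parent[x]
          return x

      def union(a, b):
          parent[find(a)] = find(b)

      # Index ways by (name, end node) so touching ways with the same name
      # end up in the same group.
      by_end = defaultdict(list)
      for way_id, name, nodes in ways:
          for node in (nodes[0], nodes[-1]):
              for other in by_end[(name, node)]:
                  union(way_id, other)
              by_end[(name, node)].append(way_id)

      groups = defaultdict(list)
      for way_id, _, _ in ways:
          groups[find(way_id)].append(way_id)
      return list(groups.values())

  ways = [(1, 'High Street', [10, 11, 12]),
          (2, 'High Street', [12, 13]),
          (3, 'Station Road', [13, 14])]
  print(merge_named_ways(ways))   # [[1, 2], [3]]

The update problem is then that deleting or splitting way 2 means the
whole 'High Street' group has to be rebuilt, not just the one object
touched by the diff.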
Obviously I've not yet written this code - but I don't think I'm
likely to be affected by this particular problem since everything is
referenced back to the original OSM ids at every point. I'm sure there
are plenty of other problems though. At the moment the one that is
annoying me immensely is that the indexing process is being slowed down
considerably because I've apparently interpolated all the postcodes
along with the house numbers - all because the format used in the US is
slightly different and the code didn't catch it.
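
For anyone not familiar with it, the interpolation step expands the
implied house numbers between the two ends of an addr:interpolation way,
roughly like this (a simplified sketch of the idea, not the indexer code):

  def interpolate_housenumbers(start, end, step=2):
      """House numbers implied between two address nodes on an
      addr:interpolation=even/odd way (step=2) or =all way (step=1)."""
      lo, hi = sorted((start, end))
      return list(range(lo + step, hi, step))

  # A way tagged addr:interpolation=even running from number 2 to 12
  # implies houses 4, 6, 8 and 10 along it.
  print(interpolate_housenumbers(2, 12))   # [4, 6, 8, 10]

Applying the same expansion to postcode values is what is slowing the
indexing down so much.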
All these problems would be trivial if there weren't such a huge
amount of data - it magnifies even trivial problems and makes testing
and debugging the whole thing painful in the extreme.
--
Brian