[Geocoding] Agenda

Tue Jul 14 18:23:06 BST 2009

Hi,

> It may be worth me switching the database over to postgresql. Tom explained
> the difference in how the indexing is done in the two products, and given
> that it may well be that we would get very significant improvements for
> searches involving more than one word without changing my algorithm at all.
> In principle, this should be straightforward: it should involve just
> changing the database communication layer. But I also need to get to know
> Postgres and its tools and idiosyncrasies.

If you do decide to do this I can probably give you some pointers on
the idiosyncrasies - I seem to have spent the last month dealing with
nothing else.  PostgreSQL, while good, is definitely far from perfect.

> Whether I do this depends on what people want to do next with Brian's
> product (does it have a name I can call it by? I'll refer to it as BN for
> "Brian's Namefinder" for now). There doesn't seem much point in me putting
> significant effort into namefinder if it is going to be superseded.

Sorry, no name at all at the moment.  Its code has been checked in as
'gazetteer' but only really because that is what TomH had called his
original osm2pgsql code.  The current version is here:

http://svn.openstreetmap.org/applications/utils/export/osm2pgsql/gazetteer/

> Plans...
> --------
> On the other hand, I don't know whether BN handles the language, alternative
> name and abbreviation variations that namefinder does. UK is not a good test
> case. Nevertheless cases like "The Red Lion" vs "Red Lion" and "St Peter's
> St" vs "St Peters Street" are pretty much a prerequisite, I discovered early
> on. The examples that Ed Freyfogle gave push this a lot further.

The code handles the obvious variants, streets, saints, etc.
Undoubtedly there are some that I've missed and the current list would
need to be extended.  Some of the examples Ed Freyfogle gave are
possibly beyond what could reasonably be parsed, although it is always
fun to try.
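
To make the "obvious variants" idea concrete, here is a minimal sketch
of the kind of normalisation involved. It is not taken from the
gazetteer code: the abbreviation table and the function name
normalise_name are made up for illustration, and a real list would
need to be much longer and language-aware.

```python
import re

# Hypothetical abbreviation table (an assumption, not the gazetteer's list).
# Both "street" and "saint" collapse to "st" so either spelling matches.
ABBREVIATIONS = {
    "street": "st",
    "saint": "st",
    "road": "rd",
    "avenue": "ave",
}


def normalise_name(name):
    """Reduce a name to a canonical token string used only for matching."""
    text = name.lower()
    # Drop apostrophes so "peter's" and "peters" compare equal.
    text = text.replace("'", "").replace("\u2019", "")
    tokens = re.findall(r"[a-z0-9]+", text)
    # Ignore a leading article: "The Red Lion" vs "Red Lion".
    if len(tokens) > 1 and tokens[0] == "the":
        tokens = tokens[1:]
    # Map long forms onto their abbreviations.
    tokens = [ABBREVIATIONS.get(token, token) for token in tokens]
    return " ".join(tokens)


if __name__ == "__main__":
    for raw in ("The Red Lion", "Red Lion", "St Peter's St", "St Peters Street"):
        print(raw, "->", normalise_name(raw))
```

The same function would be applied to both the indexed names and the
search query, so that the comparison happens on the canonical form
rather than on what the user or mapper happened to type.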

> I also don't know how or whether BN deals with rationalising multiple names.
> I did have plans to improve this a bit in namefinder (mainly to try to avoid
> it hitting on a small spur of a road rather than the main drag). Namefinder
> already gives you different hits for the "same" road in different contexts
> ("M11 near Harlow" vs "M11 near Cambridge" for example), but not duplicates
> in the same locality.

I've taken a very similar approach to the one you describe above.  At
the moment it can sometimes split out multiple spans of the same road
if it thinks they have separate postcodes, which is possibly going a
bit far, for instance:

http://dev.openstreetmap.org/~twain/?q=deerlands+avenue

returns multiple results for the same street because they are each in
different postcode sectors.
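
As a rough illustration of why this happens, the sketch below groups
road segments by a key before returning one hit per group. It is only
an assumption about the general shape of the problem, not the actual
gazetteer logic; the field names, the function group_segments and the
sample data (including the postcode sectors) are all invented.

```python
from collections import defaultdict


def group_segments(segments, split_by_postcode=True):
    """Group way ids into search hits, one hit per distinct key."""
    groups = defaultdict(list)
    for seg in segments:
        if split_by_postcode:
            # Finer key: the same street in two sectors gives two hits.
            key = (seg["name"], seg["postcode_sector"])
        else:
            # Coarser key: one hit per street name per locality.
            key = (seg["name"], seg["locality"])
        groups[key].append(seg["way_id"])
    return dict(groups)


# Invented sample data: one street mapped as three ways across two sectors.
SEGMENTS = [
    {"way_id": 1, "name": "Example Avenue", "locality": "Exampleton", "postcode_sector": "XX1 2"},
    {"way_id": 2, "name": "Example Avenue", "locality": "Exampleton", "postcode_sector": "XX1 2"},
    {"way_id": 3, "name": "Example Avenue", "locality": "Exampleton", "postcode_sector": "XX1 3"},
]

if __name__ == "__main__":
    print(len(group_segments(SEGMENTS, split_by_postcode=True)))
    print(len(group_segments(SEGMENTS, split_by_postcode=False)))
```

With the postcode in the key the street comes back twice; keyed on
locality alone it comes back once, which is closer to what a casual
user would expect.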

> Does it deal with updates as well as reloads (which are clearly impractical
> as a frequent solution)?

This is the big 'todo' item, although osm2pgsql includes a framework
for doing this that means that a lot of the work has already been
done.  My intention was to try and get something out for people to
test and then start work on this.

> Is the source somewhere I can get at? Is there a development plan, or is
> this just a hacked together demo?

There definitely isn't a development plan :-)  This has very much been
an experimental 'let's see if this works' project and at the moment the
code is a bit on the hacked together side - it works but it needs more
commenting, documentation, etc.

> Geocoder...
> -----------
> I still need to look at the newly open-sourced Geocoding app, but one
> immediate thing that strikes me is that country specific parsing is probably
> not enough. While in principle we can identify the country from the OSM data
> (albeit not a straightforward problem), we still have to parse input from
> the user, and it is (probably) not possible to determine what country they
> are interested in when they search for something.
>
> My guess (and it's only a wild guess) is that this will help, but probably
> only address half of the problem. OTOH, there's presumably the opportunity
> to contribute back any developments we make into the project now it is open
> source.

Having looked at the source I tend to agree, although what I'd really
like to see is a working demo so I can try some searches and see the
results.  As Ed Freyfogle said (and everyone I think agrees) it all
comes down to the user experience and without trying various queries
it's hard to get a feel for how flexible the indexing really is.

> Strategy
> --------
> 1. Co-operate: Try to bring the software together and take the best aspects
> of both (plus the geocoder in due course).
> 2. Compete: Provided Tom will offer results from both sources (and, within
> reason, in the form we want), we could take the capitalist route of
> developing competing products. If we don't have equal access to the home
> page, though, there's little point in doing this. Also, it's doubtful I can
> compete with someone who is being paid to do this as a full time job.
> 3. Abandon one or the other product: Possibly I step out of the picture;
> there's plenty else for me to do.

In pretty much every way it would make sense to co-operate.  If I
could go back in time and redo this I would undoubtedly be contacting
you before starting and coming to some arrangement to join in with
your existing development work.  I can only blame hubris and an
impression that development on the existing namefinder had ceased for
this really rather stupid decision.

However, having reached the position we are in now the decision is
less clear cut - the obvious problem with co-operating is that the two
products have virtually nothing in common in terms of implementation
and, to a lesser extent, algorithm, so it probably isn't possible to
just merge them.

With regard to competing, just to make this clear this isn't my full
time job but purely a hobby.  It did follow on from some coding I did
for work but that was so simplistic as to now be completely
irrelevant.

Competing on the front page is certainly a way to handle this.  In
some ways it is the OSM way :-) But it tends to lead to problems from a
user interface point of view.  Multiple choices confuse rather than
empower the casual user and it potentially splits development
efforts in what is already a not hugely well populated area.

At this point about the only thing I'm certain of is what I don't know
- which is whether the code I've written is actually any good and
whether it can scale to the whole world - so I'm tempted to suggest
that, given how close it is, we should put any decision on hold for a
couple of days until we have a bit more information.

In conclusion, my preference is co-operation, but it comes down to
whether that is practical at this point.
--
 Brian