[Geocoding] Agenda
David Earl
david at frankieandshadow.com
Tue Jul 14 13:32:47 BST 2009
Regarding the short term future of the namefinder...
----------------------------------------------------
I have started a planet import into separate InnoDB tables in the
namefinder database, after a couple of minor false starts due to (a) a
small tweak needed for API 0.6 (which postdates my last attention to
this) and, more importantly, (b) the failure of PHP to read bz2 files
> 4GB (a threshold which has been passed since the last time I did a
full planet import, in Feb 2008). This should address the problem of the update
being too heavy handed in locking out searches.
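
As an illustration of the bz2 point only: the sketch below streams a
compressed planet file line by line, so the size of the file never has
to fit in memory or in a 32-bit offset. It is written in Python rather
than the PHP that namefinder uses, and the function names
(iter_bz2_lines, count_nodes) are hypothetical, not part of namefinder.

```python
import bz2


def iter_bz2_lines(path):
    """Yield decoded text lines from a .bz2 file, decompressing as we go."""
    # bz2.open in text mode decompresses incrementally, so only a small
    # buffer is held in memory regardless of the file size.
    with bz2.open(path, "rt", encoding="utf-8") as stream:
        for line in stream:
            yield line


def count_nodes(path):
    """Count lines that open a <node> element (a crude stand-in for a real parser)."""
    return sum(1 for line in iter_bz2_lines(path)
               if line.lstrip().startswith("<node"))
```

A real import would feed the stream to an incremental XML parser rather
than matching on line prefixes; the point here is only that the file is
consumed as a stream.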
To some extent the long planet import doesn't matter too much provided
the daily updates work in a reasonably timely fashion and don't
interfere with searches - which was the problem previously.
Assuming this works (and there may be other problems after I've allowed
it to get rusty for so long), we can get the current search up to date
in the short term.
Next...
-------
It may be worth me switching the database over to postgresql. Tom
explained the difference in how the indexing is done in the two
products, and given that, it may well be that we would get very
significant improvements for searches involving more than one word
without changing my algorithm at all.
In principle, this should be straightforward: it should involve just
changing the database communication layer. But I also need to get to
know Postgres and its tools and idiosyncrasies.
Whether I do this depends on what people want to do next with Brian's
product (does it have a name I can call it by? I'll refer to it as BN
for "Brian's Namefinder" for now). There doesn't seem much point in me
putting significant effort into namefinder if it is going to be superseded.
I should say that I believe the performance issues in Namefinder on the
search side are nearly all to do with database access, not really the
PHP code, and in particular with the two issues I've mentioned which
Tom identified. Though I could be wrong.
Plans...
--------
I did have plans to try to deal with a much more free-format and
language neutral user input - get rid of the "near" syntax, deal with
street numbers at least to the extent of recognising and discarding them
so we could still get a hit on the street, optimising place name
searches as these are the vast majority, deal with more variants, use
context from one part of the search (e.g. place name) to inform and
guide the rest of the parsing, and so on. However, again it doesn't seem
productive for me to put effort into this if it is going to get thrown away.
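
To make the street number idea concrete, here is a minimal sketch in
Python (not namefinder's PHP, and not something namefinder currently
does). It assumes the number comes first, as in UK addresses, and the
pattern and function name are hypothetical.

```python
import re

# Hypothetical pattern: a leading house number such as "12", "12a" or
# "12-14", followed by whitespace or a comma.
_HOUSE_NUMBER = re.compile(
    r"^\s*\d+[a-z]?(?:\s*-\s*\d+[a-z]?)?[\s,]+", re.IGNORECASE
)


def strip_house_number(query):
    """Drop a leading house number so the rest can still match a street."""
    return _HOUSE_NUMBER.sub("", query, count=1).strip()


print(strip_house_number("12a High Street, Cambridge"))
```

Queries without a leading number, such as "M11 near Harlow", pass
through unchanged. Countries that write the number after the street
name would need a different rule, which is part of why this needs to be
language neutral.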
On the other hand, I don't know whether BN handles the language,
alternative name and abbreviation variations that namefinder does. UK is
not a good test case. Nevertheless cases like "The Red Lion" vs "Red
Lion" and "St Peter's St" vs "St Peters Street" are pretty much a
prerequisite, I discovered early on. The examples that Ed Freyfogle gave
push this a lot further.
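
A minimal sketch of the kind of canonicalisation those two cases need,
in Python for illustration. It is not namefinder's actual code: the
abbreviation table is a hypothetical, English-only sample. Mapping both
"street" and "saint" to "st" avoids having to decide which one a bare
"St" means.

```python
import re

# Hypothetical, English-only sample tables; a real index would need
# per-language variants.
ABBREVIATIONS = {"street": "st", "saint": "st", "road": "rd", "avenue": "ave"}
LEADING_ARTICLES = {"the"}


def canonical_name(name):
    """Reduce a name to a form in which common variants compare equal."""
    # Remove apostrophes (ASCII and typographic) so "Peter's" == "Peters".
    text = name.lower().replace("'", "").replace("\u2019", "")
    # Split into runs of letters and digits, discarding other punctuation.
    words = re.findall(r"[^\W_]+", text)
    # Drop a leading article, but never reduce a name to nothing.
    if len(words) > 1 and words[0] in LEADING_ARTICLES:
        words = words[1:]
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


print(canonical_name("St Peter's St") == canonical_name("St Peters Street"))
print(canonical_name("The Red Lion") == canonical_name("Red Lion"))
```

Both the indexed names and the user's query would go through the same
function, so the comparison happens on the canonical forms.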
I also don't know how or whether BN deals with rationalising multiple
names. I did have plans to improve this a bit in namefinder (mainly to
try to avoid it hitting on a small spur of a road rather than the main
drag). Namefinder already gives you different hits for the "same" road
in different contexts ("M11 near Harlow" vs "M11 near Cambridge" for
example), but not duplicates in the same locality.
Does BN deal with updates as well as reloads (which are clearly
impractical as a frequent solution)?
Is the source somewhere I can get at? Is there a development plan, or is
this just a hacked together demo?
Gazetteer...
------------
The main thing I want to do, though - and this is really somewhat
independent of search - is to produce a gazetteer - contextual A-Z
listing pages of names organised by place, street and POI. This would
run off the same index but doesn't need the heuristics that search does.
The two motivations for doing this are:
1. Geographic searches in Google (et al) get frequent hits on OSM results.
2. Offer comprehensible URLs with place names in them
Geocoder...
-----------
I still need to look at the newly open-sourced Geocoding app, but one
immediate thing that strikes me is that country specific parsing is
probably not enough. While in principle we can identify the country from
the OSM data (albeit not a straightforward problem), we still have to
parse input from the user, and it is (probably) not possible to
determine what country they are interested in when they search for
something.
My guess (and it's only a wild guess) is that this will help, but
probably only address half of the problem. OTOH, there's presumably the
opportunity to contribute back any developments we make into the project
now it is open source.
Strategy
--------
We could...
1. Co-operate: Try to bring the software together and take the best
aspects of both (plus the geocoder in due course).
2. Compete: Provided Tom will offer results from both sources (and,
within reason, in the form we want), we could take the capitalist route
of developing competing products. If we don't have equal access to the
home page, though, there's little point in doing this. Also, it's
doubtful I can compete with someone who is being paid to do this as a
full time job.
3. Abandon one or the other product: Possibly I step out of the picture;
there's plenty else for me to do.
David
Brian Quinion wrote:
> Hi,
>
>> http://apis.dev.openstreetmap.org/~twain/
>
> First of all - I've put this (hopefully!) back to a working state so
> you can have a play if you wish. UK data only at the moment and the
> whole thing is slow because I'm doing a full planet import to another
> database. The fact that this causes performance problems is obviously
> an issue in its own right.
>
>> It might perhaps be helpful if David and Brian could each give a brief
>> summary of where they think the two current codebases are, and how they
>> think we should best move forward.
>
> Possibly it is best to start with a bit of background. Ideally I'd
> have done this over a pint at SotM if only I'd been there, but here we
> go...
>
> I originally started working on this because I'd built a very basic
> geocoder for my work based on php and postgresql. I was aware that
> there were some performance problems with namefinder and approached
> various people on IRC to see if I could take a stab at the problem. I
> had a look at namefinder, and Tom let me have a copy of his original
> gazetteer; of the two, Tom's code seemed to match most closely with
> what I'd already written, plus I liked that the import routine would
> be written in C for performance.
>
> By the hack weekend I had something that seemed to work for the UK and
> was allowed to use a spare server to try implementing it for the whole
> planet.
>
> Unfortunately it turned out that my original technique didn't scale
> very well (putting it mildly) and I've ended up rewriting it
> completely over the last month.
>
> The current codebase consists of:
> a postgresql import module for osm2pgsql
> a postgresql module with a C helper function
> a set of plpgsql database functions and triggers that handle indexing
> a short php script that performs the search queries and returns the results
>
> The code has support for simple text queries, 'near' queries, house
> level addressing (including interpolation from number ranges), various
> levels of postcode to handle the different countries' standards and
> special support for interpolating unknown UK postcodes. It can handle
> the various name:en, name:fr, etc. standards used in OSM and return
> address strings based on the browser accept-language settings. There
> is support for alt names, common names and similar.
>
> Search performance is generally fairly good with queries taking around
> 0.01 to 0.1 of a second database side with some additional time spent
> presenting the data using php. Index generation performance is
> dreadful, and it can take up to 5 days to reindex a full planet file.
> The UK is processed in around 6 hours. I have some ideas on how to
> improve this but recently decided to stop fiddling with the code and
> try and get something actually finished even if it isn't perfect.
>
> Although the UK version has had some testing, as yet the full planet
> version hasn't been tested at all. I was hoping to make a test
> version available some time early next week and get some feedback on
> how well it works for international addresses, although I'll happily
> put this on hold if needed until we have worked out where we are
> going.
>
> Look forward to hearing from everyone else.
>
> Cheers,
> --
> Brian