[OSM-dev] extracting house-number from string
Marcus Wolschon
Marcus at Wolschon.biz
Sun Mar 15 17:14:11 GMT 2009
On Sun, Mar 15, 2009 at 5:52 PM, Stefan Breunig
<stefan at mathphys.fsk.uni-heidelberg.de> wrote:
>> So it misses 5bis and 135sous ?
> Yes, it would. But maybe it's the wrong approach to identify the
> housenumber and then look it up. Instead, building a list of all
> available housenumber/street combinations from the dataset and doing a
> "similar" search might yield the best results, independently from the
> exact address definition (similar means like Google does its "did you
> mean…?" feature).
That would mean loading all streets of all matching cities from disk,
then loading an area of nodes for each of them to get the house numbers
and evaluating all house-number interpolations of all streets of all
matching (potentially major) cities...
That is not feasible.
Try to search for a street+house-number in "Frankfurt" that way.
You get 2 major cities to load from disk, and the user expects a
fast result because he is sitting in his car, about to start driving.
By "dataset" we mean the whole planet, or at least
one of its major continents.
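For reference, the "similar" search proposed in the quoted text could look
roughly like the sketch below when run over a small in-memory list. It uses
difflib from the Python standard library; the function name
similar_addresses and the sample addresses are made up for illustration.
It says nothing about the cost of building that list for a whole city,
which is the objection raised above.

```python
import difflib


def similar_addresses(query, known_addresses, limit=3):
    """Return up to `limit` known addresses that look similar to `query`.

    `known_addresses` is assumed to be a small list of "street number"
    strings already held in memory (hypothetical data).
    """
    # Compare case-insensitively, but return the original spelling.
    by_lowercase = {address.lower(): address for address in known_addresses}
    hits = difflib.get_close_matches(
        query.lower(), list(by_lowercase), n=limit, cutoff=0.6
    )
    return [by_lowercase[hit] for hit in hits]


# Illustrative data only, not taken from any real dataset.
KNOWN = ["Zeil 42", "Zeil 44", "Kaiserstrasse 5bis", "Hauptstrasse 13b"]
```

Calling similar_addresses("Kaiserstrase 5bis", KNOWN) puts the correctly
spelled "Kaiserstrasse 5bis" first, without parsing the house number at all.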
> Otherwise I really don't see a way to get it right every time. If the
> "address style" differs from city to city, one would have to have a
> huge database on different address formats which still doesn't catch
> user typos or if someone gives out addresses in the wrong format. But
> that goes far beyond 'simple' and may require extensive resources (but
> it's possible,
Such databases mapping country -> address format exist. I know that
at least OsCommerce contains one.
> Google did it for their Maps and it even only requires
> one field). On the other hand an algorithm that finds the correct
> address most of the time is probably good enough.
> For navigating it's possible to drop the entire lettering stuff (i.e.
> "13a", "13b"…) because the houses are so close in these cases that it
> really doesn't matter. My guess is the same goes for "5bis" or
> "135sous". Maybe we want to just drop any lettering and just take the
> most likely number?
There is no guarantee that 13b is near 13, and anyway... if you can identify
the "13b" in the search field, you have already correctly identified it.
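As a minimal sketch of what identifying "13b", "5bis" or "135sous" in a
search field could look like: the code below scans the whitespace-separated
tokens from the end and accepts the first one that is digits plus an
optional known suffix. The suffix list (single letters, "bis", "ter",
"sous") and the function name split_house_number are assumptions for
illustration, not a complete description of any country's address format.

```python
import re

# Digits followed by an optional suffix. The suffix list is illustrative only.
HOUSE_NUMBER = re.compile(r"(\d+)(bis|ter|sous|[a-z])?", re.IGNORECASE)


def split_house_number(query):
    """Split a search string into (street, number, suffix).

    Returns (query, None, None) when no token looks like a house number.
    """
    tokens = query.split()
    # Scan from the end, so the last number-like token wins.
    for index in range(len(tokens) - 1, -1, -1):
        match = HOUSE_NUMBER.fullmatch(tokens[index].strip(",."))
        if match:
            street = " ".join(tokens[:index] + tokens[index + 1:])
            suffix = match.group(2)
            return street, int(match.group(1)), suffix.lower() if suffix else None
    return query, None, None
```

With this, split_house_number("5bis rue de Rivoli") yields the street
"rue de Rivoli", the number 5 and the suffix "bis", so the suffix is kept
rather than dropped. A token such as "12th" is not treated as a house
number because "th" is not in the suffix list.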
Just because it's a hard problem does not mean we cannot try to find
a good solution that works in as many cases as possible.
Marcus