[Talk-us] matching external data to OSM

Jason Remillard remillard.jason at gmail.com
Sat Jan 12 18:05:08 GMT 2013


Hi,

I'm not sure if this has been discussed before; if it has, a pointer
to the old thread would be much appreciated!

People working on imports, QA tools, and similar projects run into the
problem of comparing data in OSM to external data sources. So far, I
have seen proposals that work like git and use the change history, or
that store an ID in OSM corresponding to an ID in the external data
set. However, both of these approaches have critical issues. The
external IDs can be deleted or damaged during normal editing, and the
change history can be lost if editors delete an area and retrace it
rather than moving the existing data around. OSM needs to optimize
first and foremost for normal hand editing, so adding restrictions to
prevent either kind of activity is obviously out of the question.

There is another approach: we should instead focus on weighted
matching algorithms.

http://en.wikipedia.org/wiki/Bipartite_graph
http://en.wikipedia.org/wiki/Ford%E2%80%93Fulkerson_algorithm

This is a well-studied problem. Because our data is already roughly
aligned, each feature only has a handful of nearby candidates, so the
graph stays sparse and the matching is tractable rather than NP-hard.
Doing 30,000 matches in a minute should be no problem; the algorithms
and modern computers are fast enough for real-world data sets.

The idea is that we build feature vectors that get turned into a
feature-to-feature match score. If you are matching buildings, use the
Euclidean distance between centers of mass, the area, the ratio of
area to perimeter, etc. If we are matching roads, break them into
segments and use length, the Euclidean distance between centers of
mass, PCA to get orientation, etc. If the roads have names, you could
add the Hamming distance between the names to the score. The feature
vectors would depend on the specific kind of data being matched; we
would probably need a handful of them to handle all of the different
kinds of data OSM has. (A sketch of a building score follows the links
below.)

http://en.wikipedia.org/wiki/Shape_analysis_(digital_geometry)
http://en.wikipedia.org/wiki/Principal_component_analysis
http://en.wikipedia.org/wiki/Hamming_distance
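
To make that concrete, here is a minimal sketch of a building-to-building
score using the shapely Python library. The features, the weights, and
the 50 m candidate cutoff are all placeholders I made up for
illustration; it assumes geometries are in a projected coordinate system
so distances come out in meters. Note it computes a cost, where lower
means a better match, because the assignment solvers minimize.

from shapely.geometry import Polygon

def building_features(poly):
    # Feature vector: centroid x/y, area, and area-to-perimeter ratio.
    c = poly.centroid
    return (c.x, c.y, poly.area, poly.area / poly.length)

def match_score(osm_poly, ext_poly, max_dist=50.0):
    # Lower is better; None means the pair is not even a candidate.
    fx, fy, fa, fr = building_features(osm_poly)
    gx, gy, ga, gr = building_features(ext_poly)
    centroid_dist = ((fx - gx) ** 2 + (fy - gy) ** 2) ** 0.5
    if centroid_dist > max_dist:
        return None
    area_diff = abs(fa - ga) / max(fa, ga)    # normalized area difference
    ratio_diff = abs(fr - gr) / max(fr, gr)   # compactness difference
    return centroid_dist / max_dist + area_diff + ratio_diff

a = Polygon([(0, 0), (10, 0), (10, 8), (0, 8)])
b = Polygon([(1, 0), (11, 0), (11, 8), (1, 8)])
print(match_score(a, b))  # small cost -> probably the same building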

For example, suppose you are matching houses. You would find all of
the houses in the external data set within a specified distance of
each OSM house and calculate a match score between the OSM house and
each of those candidates. This forms a weighted bipartite graph
between the OSM data and the external data, with the weight of each
edge being the match score derived from the feature vectors. You then
use something like an augmenting-path maximum-flow algorithm to
calculate the globally optimal match.
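
For illustration, here is what that step might look like. I am using
scipy's linear_sum_assignment (a Hungarian-style assignment solver) as
a stand-in for a hand-rolled augmenting-path implementation; it solves
the same globally optimal assignment. It assumes the match_score
sketch from above, and NO_MATCH is an arbitrary large penalty:

import numpy as np
from scipy.optimize import linear_sum_assignment

NO_MATCH = 1e6  # penalty cost for pairs outside the candidate distance

def match_datasets(osm_polys, ext_polys):
    # Build the weighted bipartite graph as a cost matrix:
    # rows are OSM features, columns are external features.
    cost = np.full((len(osm_polys), len(ext_polys)), NO_MATCH)
    for i, a in enumerate(osm_polys):
        for j, b in enumerate(ext_polys):
            s = match_score(a, b)
            if s is not None:
                cost[i, j] = s
    # Globally optimal assignment over the whole cost matrix.
    rows, cols = linear_sum_assignment(cost)
    # Drop the padding pairs; what is left are the real matches.
    return [(i, j, cost[i, j]) for i, j in zip(rows, cols)
            if cost[i, j] < NO_MATCH]

In practice you would use a spatial index (e.g. an R-tree) to generate
the nearby candidates instead of the quadratic loop shown here.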

You would need to fiddle with the scoring and the feature vectors, but
this would automate the trivial matches, and by inspecting the matches
with low scores, the end user could focus his/her attention on the
areas that actually need a person looking at them. If you went crazy,
you could even include the OSM change history to find items that were
intentionally removed from OSM but are still present in the external
data source. Soft matching followed by a global optimization would let
us compare OSM to external data as it exists right now.
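
A sketch of that triage step, continuing from the match_datasets output
above (remember the sketch uses costs, so the high-cost pairs are the
low-scoring matches a person should review; the threshold is a made-up
number that would need tuning per data set):

REVIEW_THRESHOLD = 0.5  # placeholder; tune against hand-checked matches

def triage(matches, n_osm, n_ext):
    accepted = [(i, j) for i, j, s in matches if s <= REVIEW_THRESHOLD]
    review = [(i, j, s) for i, j, s in matches if s > REVIEW_THRESHOLD]
    # Features left unmatched on either side are also worth a look:
    # possibly new in the external data, or intentionally deleted in OSM.
    only_osm = set(range(n_osm)) - {i for i, j, s in matches}
    only_ext = set(range(n_ext)) - {j for i, j, s in matches}
    return accepted, review, only_osm, only_ext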

It seems like having some code to do this would be useful to a bunch
of different people in OSM. This kind of problem is tackled all the
time by people doing machine learning and image processing.

Has this been tried by anybody? Is there any code available?

Thanks
Jason


