[OSM-talk] Keeping imported data updated with source changes

Jason Remillard remillard.jason at gmail.com
Sat Jan 10 15:44:20 UTC 2015


 Hi Wiktor,

I don't think an address tag is needed or desirable.

The best way of doing this is to compare versions of the official data
(perhaps every 6 months), making a list of things that have changed so
that they can be examined in OSM.

Of course, the big issue is that the matching is not trivial. First
devise a matching score that combines the distance between the two
addresses with the edit distance between their names and numbers.
These scores are the edge weights. Then use one of the weighted
bipartite graph matching algorithms (augmenting path) that works well
on sparse data. If you keep the search radius down, the graph will be
very sparse, so it should be manageable. Using the match, you can get a
list of nodes that have been moved, deleted, or edited in the official
data set.
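
To make the idea concrete, here is a rough sketch in Python. Everything
in it is an assumption for illustration: the record layout ("pos",
"street", "number"), the 50 m radius, the way distance and edit distance
are added into one score, and the choice of successive shortest
augmenting paths (with Bellman-Ford relaxation) as the matching
algorithm. It finds a minimum-cost, maximum-size matching between an
old and a new snapshot of the official data, using only candidate pairs
inside the search radius.

```python
import math

def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(h))

def edit_distance(s, t):
    """Levenshtein distance, computed one row at a time."""
    row = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        diag, row[0] = row[0], i
        for j, ct in enumerate(t, 1):
            diag, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1,
                                       diag + (cs != ct))
    return row[len(t)]

def candidate_edges(old, new, radius_m=50.0):
    """Score every old/new pair within radius_m (lower is better).
    Brute force for brevity; real data would want a spatial index."""
    edges = {}
    for i, a in enumerate(old):
        for j, b in enumerate(new):
            d = haversine_m(a["pos"], b["pos"])
            if d <= radius_m:
                text = (edit_distance(a["street"], b["street"])
                        + edit_distance(a["number"], b["number"]))
                edges[(i, j)] = d / radius_m + text  # hypothetical weighting
    return edges

def min_cost_matching(n_old, edges):
    """Successive shortest augmenting paths on a sparse bipartite graph.
    edges maps (old_index, new_index) -> cost; returns {old: new}."""
    adj = [[] for _ in range(n_old)]
    for (i, j), cost in edges.items():
        adj[i].append((j, cost))
    match_old, match_new = {}, {}
    while True:
        # dist[i]: cheapest alternating path from any unmatched old node to i.
        dist = [math.inf if i in match_old else 0.0 for i in range(n_old)]
        prev = [None] * n_old
        for _ in range(n_old):  # Bellman-Ford rounds
            changed = False
            for i in range(n_old):
                if dist[i] == math.inf:
                    continue
                for j, cost in adj[i]:
                    k = match_new.get(j)
                    if k is None or k == i:
                        continue
                    nd = dist[i] + cost - edges[(k, j)]
                    if nd < dist[k] - 1e-9:
                        dist[k], prev[k], changed = nd, i, True
            if not changed:
                break
        # Cheapest way to finish a path on a still-unmatched new node.
        best = None
        for i in range(n_old):
            if dist[i] == math.inf:
                continue
            for j, cost in adj[i]:
                if j not in match_new and (best is None or dist[i] + cost < best[0]):
                    best = (dist[i] + cost, i, j)
        if best is None:
            return match_old
        _, i, j = best
        while True:  # flip the edges along the augmenting path
            old_j = match_old.get(i)
            match_old[i], match_new[j] = j, i
            if old_j is None:
                break
            i, j = prev[i], old_j

def classify(old, new, match, moved_m=1.0):
    """Turn a matching into lists of moved, edited, deleted and added records."""
    moved = [(i, j) for i, j in match.items()
             if haversine_m(old[i]["pos"], new[j]["pos"]) > moved_m]
    edited = [(i, j) for i, j in match.items()
              if (old[i]["street"], old[i]["number"])
              != (new[j]["street"], new[j]["number"])]
    deleted = [i for i in range(len(old)) if i not in match]
    taken = set(match.values())
    added = [j for j in range(len(new)) if j not in taken]
    return moved, edited, deleted, added
```

Typical use would be something like
classify(old, new, min_cost_matching(len(old), candidate_edges(old, new))).
Note that a maximum-size matching will pair up any two records within
the radius if nothing better is available, so the radius (or an extra
cost cutoff) decides how eagerly unrelated addresses get matched.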

Jason

On Sat, Jan 10, 2015 at 4:59 AM, Wiktor Niesiobedzki <osm at vink.pl> wrote:
> Hi,
>
> In Poland we have had quite a few addresses imported from government
> sources for quite a long time, but as time goes on, changes are made to
> the source databases, and local communities don't have any viable
> tools to track what has changed in the source. In the city of
> Skarżysko-Kamienna, a local mapper tried hard to track all the changes
> in the source (as well as checking them on site), but still missed a lot
> of changes, and as it stands there is no tooling to help such users.
>
> What I'd like to do is prepare a service that will generate
> changes for OSM containing the differences for each municipality, so a
> local mapper can load, review and decide what to import.
>
> But to be efficient, this tool needs additional information to be
> stored in OSM: the identifier of the object in the source database, for
> which I propose the tag ref:addr.
>
> This tag would be used to identify what was already imported. I'd also
> like to create a protocol: if there is some "wrong"
> data in the import source, we would leave a point in OSM containing:
> ref:addr
> source:addr
>
> So we can instruct further imports to skip this point unless there
> is some change in the source data.
>
> I find this solution the most robust, as it gives a great signal-to-noise
> ratio for local mappers when they are identifying what needs to be
> updated, and it gives us resilience when someone accidentally
> deletes an address.
>
> In Poland there are thousands of people employed by the government to
> keep this data in good quality, and using the OSM community to duplicate
> their work is, in my opinion, wasteful. Using this method, we can use
> their work, and use the OSM community to improve the data that the
> government is sourcing. And this is something we should consider for all
> imports.
>
> We have already had some discussion about this in the Polish community,
> but as it seems it might be a philosophical change for this project, I'd
> like to raise this issue at the global level.
>
> Apart from addresses, I plan to start importing national heritage
> objects, for which I see exactly the same problem.
>
> The other solution that we discussed in our community is to keep track
> of the import source state in a separate database, and use this to see
> what has changed in the source and to generate files for local mappers,
> but I see the following disadvantages of such a solution:
> - it doesn't take into account the current state of objects in
> OSM, which may generate duplicates or miss data that were accidentally
> deleted
> - it makes it harder to fork the OSM project, as you need to fork two
> databases and know about them, and the license for such a database
> should be open
> - it still needs some "protocol" for this database, to mark that an
> import was done (and to what extent); it would require additional
> tooling and might be an additional problem for casual mappers, and
> probably would render the tool unusable
> - it gives no tools for integrity with OSM databases
> - it needs additional support
>
>
> The disadvantages of my solution that I found most concerning were:
> - nodes containing only ref:addr and source:addr might be hard for
> newcomers to understand, especially as ref:addr doesn't contain any
> human-understandable data
> - ref:addr might get clobbered when nodes are merged
>
> But I hope that with an extensive description on the Wiki we can handle
> those problems.
>
> Cheers,
>
> Wiktor Niesiobędzki