[Imports] Proposal for proper OSM import solution (OpenMetaMap)

Serge Wroclawski emacsen at gmail.com
Thu Aug 18 16:22:14 UTC 2011


On Thu, Aug 18, 2011 at 11:31 AM, Jaak Laineste <jaak at nutiteq.com> wrote:

>> 2. This approach implies that external data sets are correct.
>>
>> Underlying this approach is an assumption that we can rely on other
>> datasets accuracy. Sadly this is not the case. As I work with more
>> datasets and compare them to on the ground surveying, I find that many
>> government datasets are either wrong, or out of date.

> I do not really agree with this implication; it does not assume that the external dataset is correct. The process of linking (resolving conflations) would actually be the same as with a normal import: somebody has to review all the data overlaps/conflicts/duplicates and resolve them.

> It assumes that using external data is better than having nothing - the same assumption you make with any external data usage and import.

For those of us who've been in the project for more than a year or
two, the jury is still out on this.

It's easy to say "Well, isn't some data better than no data?" but
then we see that places where imports have taken place also have low
community uptake. This is of course a correlation, and we can't
automatically assume causation, but we can certainly raise it as a
concern.

> So after the first linking round you would have corrections to the external data, at least as far as is possible at that point.

If you have the corrections of the external data in OSM, you might as
well have the data in OSM in the first place.

> But then later it could happen that data which was OK at the initial linking is changed to something else (worse). Here you are right - an external data provider can do harm to our data. I would say we assume that the external data provider works in the direction of making the data better, not worse. In other words: it is OK to have bad data in the beginning, but it is not OK if the data modifications go in the wrong direction.

The problem isn't that external datasets get worse, the problem is
that external datasets make it hard to see what's missing, or worse,
what's wrong.

The premise of OSM is that many people mapping improves data quality.
We rely on the mappers to improve the map. We've shown through studies
that where we have few mappers, our data is of lower quality, and
where we have many mappers, our data is of very high quality.

Therefore one of our main focuses as a project is to get more
mappers. My concern is that by relying on external datasets, we
reduce mappers' motivation, and therefore end up with fewer active
mappers.

> Also there will be always problem of added new data - maintainer of database links has to do occasional reviews and correct this also. So with usual import you have to fix the data once. There is no bulk update possible so you do not need to worry about later updates. Now when we have later updates, maintainer has to start taking care about it also. More gain, more pain.

I'm sorry, while your English is far better than my Estonian, I do not
understand this paragraph. Can you rephrase?

> Actually I'm afraid that most external datasources will be rather static (just OSM files). This way there is no risk that an external dataset will be suddenly damaged. There would be no benefit from later updates, but even then there is an advantage to the MetaMap database - you keep the datasets clean and separated.

The key value proposition of external datasets is that they could be
updated by external entities (think distributed version control). If
you think this isn't the case, or isn't the case you're designing
around, then I see no benefit to using this technique over improving
our conflation tools inside OSM, which is something we need today!
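
To be concrete about what I mean by conflation tooling, here's a
rough sketch (hypothetical data, and a crude degrees-as-distance
check, so a sketch rather than a real tool) of the matching step
such a tool performs:

    # A minimal sketch of a conflation check: is this external POI
    # already in OSM?  Hypothetical data; a real tool would project
    # coordinates properly and match names more fuzzily.
    from math import hypot

    osm_pois = [{"id": 1, "lat": 40.0010, "lon": -75.0020,
                 "name": "Main St Cafe"}]
    external = [{"lat": 40.0011, "lon": -75.0021,
                 "name": "Main St Cafe"}]

    def matches(a, b, tol=0.0005):
        # Same feature if close together and same name; degrees are
        # a crude stand-in for meters here.
        close = hypot(a["lat"] - b["lat"], a["lon"] - b["lon"]) < tol
        return close and a["name"].lower() == b["name"].lower()

    for ext in external:
        dupes = [p for p in osm_pois if matches(p, ext)]
        print("already in OSM" if dupes else "needs manual review")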

>> 3. Data in the aggregated map won't be collected by on the ground mappers.
>>
>> Some data, like the road data, will appear in both OSM and external
>> datasets, but there's other data which may just never get collected by
>> the community, if the map appears to already be complete.

> This is a valid point. This is a very general problem: data which is already there (from imports, or even just from other mappers before you) is quite likely to be left alone, not reviewed, and trusted. This is a separate issue which I do not solve here.

The point of the map is not just to exist, but to be better. If we do
things which hurt the community, then they had better come with a huge
benefit.

> I assume here that usage of external datasets is often good and reasonable, and in many cases unavoidable (admin borders, shorelines and other examples).

This sentence has two statements:

1. You assume that imports are often unavoidable.

2. You assume that often the imports are good and reasonable.

1 isn't true. We see lots of imports of data that could be collected
manually. TIGER could have been done manually, given time. GNIS
could likely have been done manually, and even Corine could have been
done manually. OSM took shortcuts. That doesn't mean they were bad,
but they weren't unavoidable. And if you look at the datasets users
plop in most often without discussing them with the community, those
too could have been collected manually. Again, that doesn't mean it's
bad, but it's certainly avoidable.

2 isn't true at all. In fact, we have tons of problems due to
imports. Imports are hard to get right (I'll address more technical
issues later in this mail). We have had to revert changesets, and
we've had to fix problems. I've spent a lot of time fixing TIGER
data, as have many US mappers. That's time we spend fixing instead
of mapping.

> I propose here that MetaMapping is a better way of using other datasets than importing. There are several cases (possibly roads) where other datasources should be avoided.

I'll go into more technical depth later in this mail on why the OSM
model doesn't lend itself well to this.

> In fact with OpenMetaMap you would always have two views and maps - one is pure OSM, all made by our mappers, and the other would be the complete map (the OMM map) with all external sources. This is something you cannot get with the current imports approach. So if you wish you can ignore the complete map and work on OSM only.

Sort of a devil's bargain, eh?

> And there is always the risk that a mapper finds an Internet site called Google Maps and discovers that "the map" is already there and complete :)

Is it? If that were true, Google wouldn't have accidentally used OSM
on at least one occasion (I think I remember two). There are places
where OSM is of higher quality than Google; we just aren't as
consistently good across the globe.

>> 4. It assumes OSM object IDs remain constant.
>>
>> OSM object IDs change. They don't change a lot, but they do change,
>> and you can't force users to jump through hoops to preserve them (as
>> we've seen people propose).
>
> Yes, it assumes that IDs do not change. This is most important. Can you explain more about why and how OSM object IDs change? I've heard it too, but to analyze the cases I'd need to know the details.

They change because people delete things, and add things, and move
things around.

A simple example: often I'll see a POI node, draw the building
outline, move the POI's tags onto the building, and then delete the
node.

Another example would be that I might delete a road segment and redraw
it, if it's easier to do that than to move every single node around.
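
To make the consequence concrete, here's a rough sketch (the link
table and IDs are invented) of how a link keyed on an OSM ID dies
silently under exactly the edits I just described:

    # Hypothetical OpenMetaMap-style link table pinning an external
    # record to an OSM object by (type, id).
    osm = {("node", 123): {"shop": "bakery"}}
    links = {"ext-shop-42": ("node", 123)}

    # A mapper improves the map: the POI node is deleted and its
    # tags move onto a newly drawn building way.
    del osm[("node", 123)]
    osm[("way", 456)] = {"building": "yes", "shop": "bakery"}

    # The shop still exists in OSM, but the link now points at
    # nothing, and no edit to the link table ever happened.
    print(osm.get(links["ext-shop-42"]))  # -> None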

There have been proposals which ask or require users to map a
certain way, but that's not the OSM way. Nothing guarantees that the
ID attached to a real-world feature stays constant. Proposals to
address this have been made, and none of them have been feasible.

And by the way, since we're on the topic of object IDs, your proposal
only addresses one end product: rendering.

How do you propose to handle routing?

And what about layers?

And what about objects which contain other objects? Even if you
ignore ways, you still have relations.
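
To make the routing point concrete: a router builds its graph by
joining ways at shared node IDs, so a merge that happens only at
render time gives you roads that cross on screen but never connect.
A rough sketch with invented IDs:

    # Ways are lists of node IDs.  A junction exists only where two
    # ways share a node ID.
    osm_way = [1, 2, 3]        # from OSM
    ext_way = [101, 102, 103]  # from an external dataset

    junctions = set(osm_way) & set(ext_way)
    print(junctions or "no shared nodes; the router cannot turn here")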

>> 5. It assumes external data sets IDs remain constant
>>
>> One of the whole points of this project seems to be to keep up to date
>> with external datasets, such as those put out by local governments
>> every quarter.
>> Since most of these external datasets will be given in Shapefile
>> format, there will need to be a conversion process.
>>
>> You can't be assured that the ID numbers on objects will remain
>> constant from Q1 to Q2. Heck, I bet you'd find that even their own
>> internal IDs won't remain constant, at least not for every single ID
>> on every single object on every single external database, of which
>> there may be dozens or more.
>>
>> So you're constantly in a race to conflate changing object IDs.
>
>  I would put into the API specification that object IDs must not change, by definition.

And how do you propose to enforce that for every object in every
dataset for every organization? Our import page mentions at least 30
datasets, but with the floodgates open, how many more would you have
to deal with, and then enforce these rules on?
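
To illustrate the churn, here's a rough sketch (feature IDs invented)
of diffing two quarterly releases of the same external dataset; every
vanished ID is a link someone has to re-conflate by hand:

    # Release snapshots: feature ID -> attribute of interest.
    q1 = {"A1": "Elm St", "A2": "Oak St"}
    q2 = {"B7": "Elm St", "A2": "Oak Street"}  # A1 renumbered to B7

    vanished = sorted(q1.keys() - q2.keys())
    appeared = sorted(q2.keys() - q1.keys())
    print(vanished, appeared)  # -> ['A1'] ['B7']: same street, new ID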

>> 6. License nightmare
>>
>> This is a powder-keg ready to explode, but I'll just say this:
>> Incompatible licenses will not allow this.
>
> Yes, by using OMM, OSM and DBX data you would create a derivative of all of them, and they must be compatible. But here again - this is a general issue which I neither solve nor create here. I'm comparing the OMM solution with a usual import, and the license issues are basically the same. Maybe the problem just happens later - with imports the importer has to check it over once; with OMM-linking the user has to be sure that he merges compatible databases.
>
> Actually it would reduce the nightmare a lot in some cases - if someone has imported data that was OK in 2010 but is no longer OK in 2012.

That problem is solved with the CT (Contributor Terms).

And we solve the general issue by /generally/ discouraging imports,
especially those where a strict process hasn't been followed.

I'm under the assumption that in your system, any user will be able to
add a dataset.


>  In principle I do not see significantly more work than you need to do with imports now. Extra work comes only from extra data updates - instead of data bursts you will have a continuous stream to take care of, with all the gains and pains. You can use very similar tools (scripts and JOSM) as now. I hope that if external data providers can quite easily get community edits back, then they should actually be much more motivated to look after their OSM/OMM derivative than they are now.


I think that you've touched on an important bit here. What you propose
is not OpenStreetMap, and you couldn't call it OpenStreetMap.


>> These are the reasons I never went forward with this project.

> I really hope you are open to reconsider :)

When some in OSM wanted to split the project, I stayed here. Many of
them will be encouraging of your work. Some of the people supporting
you are folks who are banned from editing in OpenStreetMap. That, I
think, is why they're so encouraging of your idea: it may be
something they feel could give them the advantages of OSM without
being OSM.

I think that's sad. But no, despite its faults, I like OpenStreetMap
and will stay with the project for the foreseeable future.

- Serge


