[Imports-us] Restarting NYC building and address imports

Serge Wroclawski emacsen at gmail.com
Mon Feb 17 15:02:32 UTC 2014

There is an imports-us hangout tomorrow.

I have a few concerns that remain:


This is really an entirely new model for an import. It's not an automated
import, which is broken but understandable, and it's not a community
import, where there are people who are local, on the ground working on it,
having it take "as long as it takes". This is a concentrated effort, mainly
(almost exclusively) centred around a company and their paid employees, who
are loading data from an external dataset, then applying that dataset to
OSM in a merge process.

This new process brings with it some new challenges, which is what we've
been trying to navigate.

Specifically, at the root of many of the concerns is the fact that nearly
400,000 buildings were done in about a month. In the restart discussions,
Alex said he'd like the import finished quickly, with the schedule going
from the 2 months originally slated to (I believe) six months. I understand
that to be six months total, rather than six additional months, meaning
that there will be ~600,000 buildings (and more addresses than buildings)
in 4 months. If this assumption is incorrect, someone should tell me.

1. The key to finding the errors has been the people doing validation.
Validation is a slow process. It's ideally what an importer would be doing,
but based on the number of errors that I and the other validators have
found, it has not been done as much as we'd have hoped in the past.

I think that we need to assume that the paid importers are not good
candidates for validators. This is based on the past actions of these same
individuals, as well as on the general motivation of these folks to appear
as good employees by importing as much as possible (this is certainly what
I'd do if I were paid to be importing!).

This work is time-consuming, repetitive and detail-oriented. It also
requires knowledge of OSM in general, and an eye for "what doesn't look
right".

I'd like to see us address how validation will be done going forward.

2. In the last meeting, Alex stated that MapBox support would end when the
import was complete. Since the validation step takes so much longer than
the import step, this position unfortunately forces us to slow down the
import in order to catch the problems and correct them while MapBox is
still willing to fix them. This is tied to the previous issue; how can we
address it?

3. Part of the argument for having the import go this quickly is that we
would be able to do updates easily, but there are no agreements in place
that there will be any updates, and there has (to the best of my knowledge)
never been any import that's had an update. I'd like to see something
concrete in place about a future update (in a year or two perhaps).

4. A lot of the previous discussion has been that if the data and OSM don't
match, the differences should be stored in Github issues.

This is because Github allows Alex to centralize the issues in the same way
that he would if this was a software development project.

Github issues are fine for systemic issues, but for individual problems, we
need to store the data somewhere else, such as in notes. There has been
resistance to using notes in the past because there would be tens of
thousands, but there's certainly a place for notes when dealing with low
volume issues, such as data merging. I'd like to see that fleshed out more.

5. We've had a few mass automated edits now based on correcting bad
imported data. I'm somewhat uncomfortable with this being done on an ad-hoc
basis, without the normal checks we do for automated edits.

I'd like to see us address this as well.

You may notice that most of my concerns are procedural, rather than
technical. I think we have a good track record for handling the technical
issues, but these procedural ones are proving to be more difficult.

- Serge
