[Talk-us] [Talk-us-newyork] Interested in importing address points in New York State

Fri Jul 17 04:01:00 UTC 2020

Thank you so much for your reply! That's exactly the kind of insight I was 
hoping for by posting here.

On July 16, 2020 12:16:19 Kevin Kenny <kevin.b.kenny at gmail.com> wrote:
>
> I'm less sanguine than Skyler is about the data quality.  I suspect
> s/he (the given name doesn't clearly identify a preferred pronoun) has
> been looking at urban or suburban areas in counties whose GIS
> departments have relatively stable funding. In those situations, yes,
> the data are fairly good.  There is still a serious conflation issue
> that isn't addressed, with respect to buildings whose footprints are
> already mapped but do not bear addresses, where the address point may
> or may not be in the building footprint.  Many address points, too,
> get clustered at the entrance of a private or shared driveway, rather
> than being on the individual dwellings. I seem to recall that at least
> one or two of the apartment and townhouse complexes in the general
> area of https://www.openstreetmap.org/#map=18/42.83211/-73.89931 had
> to have their house numbers collected on foot, because the E911 data
> showed all the address points in a single cluster.
>
> In the rural areas, particularly in the counties with tiny
> populations, the situation is grimmer. I'm not certain that Schuyler
> or Wyoming Counties even would _have_ dedicated GIS departments!
> Until relatively recently, when grant money was available to have this
> information in GIS systems for E911 use, they mostly were still using
> paper maps, often referenced to an unknown datum.  (The first job in
> dealing with any scanned tax plat is figuring out what coordinate
> frame it's using - around here, NAD27 differs from NAD83 by a few tens
> of metres.) The address points may be parcel centroids, or building
> centroids, or the point where the driveway meets the road, or even
> just something that was digitized from a pencil sketch made by an
> assessor.  Import of this sort of data could well prove to be a
> short-term gain but impose a heavy long-term burden; consider the
> love-hate relationship that we all have with TIGER. (The import means
> that we've got a nearly-filled-in map, a lot of which is of
> halfway-decent quality, and we don't have the mappers to have done it
> nearly as quickly any other way. Nevertheless, for some years we've
> been paying the price in bad data and worse conflation.)
>
> So, my advice for both legal and technical reasons would be to use
> caution, and recognize that mechanical import is likely to be a
> disaster - the data will need to be eyeballed by human beings and
> corrected.

I certainly did not do an extensive check of the quality, so this is a 
super useful perspective. (I wanted more clarity on the legal aspect before 
investing more time in that, since, after all, if it's a definite no go 
from a legal perspective, why waste any time at all?) It's unfortunate that 
there's such a big variation in quality, although not unexpected, since 
they come from the counties themselves.

However, the examples you gave would not necessarily make me consider the 
data unusable without extensive correction. The way I look at this is: if a 
person were to stand at the exact spot marked by the point, could they find 
the place they are looking for? If the answer is yes for the vast majority 
of the data, then I would call that a net gain for OSM.
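
As a rough sketch of how that "close enough" test could be checked on a 
sample, assuming we had a hand-surveyed position for some of the addresses 
(we don't yet), something like the following would give the share of points 
within a walking-distance threshold. The 50 m threshold, the function names 
and the sample format are all my own assumptions, not anything from the 
county data.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; fine for short distances


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def share_close_enough(pairs, threshold_m=50.0):
    """pairs: list of ((lat, lon) from the import, (lat, lon) surveyed).

    Returns the fraction of pairs whose two positions lie within
    threshold_m of each other. The threshold is an assumed value.
    """
    if not pairs:
        return 0.0
    ok = sum(1 for a, b in pairs if haversine_m(*a, *b) <= threshold_m)
    return ok / len(pairs)
```

A county where that share comes out high on a random sample would be a much 
easier case to argue for than one where it doesn't.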

Furthermore, if the data were never manually reviewed and corrected, would 
it still be valuable enough to import? You obviously have extensive 
experience with this data set, so I would trust your judgment on this, but 
if the worst problems we see are mostly the ones you described, it would 
sound to me like the pros outweigh the cons, even if the points were never 
corrected.

For example, I've personally seen many roads from TIGER imports that are 
way way off, or even nonexistent, especially long driveways in deeply rural 
areas. But the fact that the main named roads are there at all is a huge 
benefit to OSM, even if not every road is perfectly accurate, and many will 
simply never be reviewed.

(With that said, obviously I would want the data to be as accurate as 
possible, and I'm not making a case to import all the data as is with no 
review or correction, but simply thinking through the practical reality of 
the task of making all the data completely accurate. We don't want perfect 
to be the enemy of good.)

As for conflation with existing buildings that have no address tags, that 
might be too difficult to handle without reviewing every case by hand, 
which might be practically infeasible. I've seen a lot of cases where there 
is a house and a detached garage, or an in-law unit right next to the 
house. It might be possible to detect when exactly one point falls inside a 
building, but for the other cases you mentioned, where the point might 
instead be the centroid of the parcel, or at the intersection of the 
driveway and the street, I don't think there would be a way around fixing 
these by hand, which indeed would be infeasible without a large number of 
people participating.
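
To make the "only one point inside a building" idea concrete, here is a 
minimal sketch. It assumes the building outlines and address points are 
already in the same planar coordinate system, ignores inner rings and 
points lying exactly on an edge, and uses made-up names throughout; it is 
not an existing import tool.

```python
from collections import defaultdict


def point_in_polygon(x, y, ring):
    """Ray-casting test. ring is a list of (x, y) vertices of the outline."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can cross it.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def match_addresses(buildings, points):
    """buildings: {building_id: ring}; points: {address: (x, y)}.

    Returns (merge, leave_as_nodes). merge maps a building to an address
    only when that address is the single point inside that building;
    everything else is left as a standalone node for manual review.
    """
    hits = defaultdict(list)
    leave_as_nodes = []
    for addr, (x, y) in points.items():
        containing = [b for b, ring in buildings.items()
                      if point_in_polygon(x, y, ring)]
        if len(containing) == 1:
            hits[containing[0]].append(addr)
        else:
            leave_as_nodes.append(addr)
    merge = {}
    for b, addrs in hits.items():
        if len(addrs) == 1:
            merge[b] = addrs[0]
        else:
            leave_as_nodes.extend(addrs)  # e.g. a row of townhouses
    return merge, sorted(leave_as_nodes)
```

Anything that lands in the second list (clustered points, parcel centroids, 
driveway entrances) would still be an untagged-building case that needs a 
human.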

I think this goes back to my earlier point: if the address points were 
added and not conflated with an existing building, would that still be 
valuable? It may not be perfect. It may go against the "one feature, one 
object" principle. But I think at the end of the day, it might provide 
enough value to do it anyway.

Thinking about it in terms of short- vs long-term gains vs work, I don't 
have extensive experience cleaning up bad imports, so I appreciate that I 
may be missing some perspective on the woes of bad data... but one could 
also see all of the missing addresses and houses as long-term work, the 
same way that fixing the accuracy of imported data is long-term work. If 
you see *all* of it as work, the question is whether, at the other end of 
an import, there was a net gain in work accomplished. If there aren't 
extensive problems with the address data, then you could argue that adding 
the good address data accomplished more work than the bad or 
not-perfect-but-usable data created.

> From the legal standpoint, it would be best to proceed only
> with those counties that have granted fairly broad authority to use
> their cadastral data. Those include the five boroughs of New York City
> (that is, Bronx, Kings, New York, Richmond and Queens Counties), and
> the counties of Cayuga, Chautauqua, Cortland, Erie, Genesee, Greene,
> Lewis, Ontario, Orange, Rensselaer, Sullivan, Tioga, Tompkins, Ulster,
> Warren and Westchester.  In New York City, the job is essentially
> done, because there have been massive (and relatively well curated)
> imports of the public data from the city's GIS department.  I'd
> recommend avoiding the Long Island counties of Nassau and Suffolk,
> because they've been litigious in the past about their data.

Thanks so much for this list! Is there anything specific we can reference 
as far as some kind of proof of such granted authority? It might be useful 
to add that to the wiki.
--
Skyler