[Imports] HIFLD

Greg Troxel gdt at lexort.com
Sun Oct 2 10:09:38 UTC 2022


James Crawford via Imports <imports at openstreetmap.org> writes:

> I’ve been preparing an import for the HIFLD for a couple months
> now. This dataset is maintained by the US Department of Homeland
> Security, and contains a wide array of data in the US.

A few questions:

  Thanks for including your osm user name at the end; I couldn't
  remember if I had seen your name or not.  I'm glad to see you have
  been actively mapping.  That's not, as far as I know, part of the rules,
  but I believe that imports are vastly harder than anyone who hasn't
  imported thinks, and that people doing imports should have
  substantial hand-mapping experience.  Sorry if you've been on the
  email lists and I failed to recognize you.

  Are you an employee of or acting on behalf of DHS?  Again, no rules,
  but I think people should understand how an importer relates to the
  data publisher.  <humor>With disroot I can't tell if you're a
  cypherpunk or a fed in hiding</humor>.

> The data varies a lot in type, accuracy, size, etc. and as such I make
> individual documentation on importing with special instructions for
> each individual dataset.

Really only high-quality data should be imported, so I don't follow a
plan to import data of varying quality.  By high-quality I mean that
substantially all (>= 99%) of objects in the data set exist and the
positions are close (within 20m?) to the correct positions.

A typical issue in databases is recency.  Around me, pharmacies, to pick
a random example, come and go.  I have added and removed them in some
cases (Walgreen's opens a new store, closes after a year).  It would not
be ok for an import to be re-adding them.   So data quality assessment
needs to ask "do >= 99% of the objects in the db currently exist".

Given how these datasets seem to have been created, I am pretty sure
quality assessment has to be done for each state separately; you can't
assume different processes have the same kinds of outcomes.

In the web page, quality is labeled with subjective terms, and for an
effort of this scope I'd like to see quantitative definitions.

In general, I am uncomfortable with advice for people to download data,
transform tags and upload.  I think it's far better to have a published
program (e.g. python script) that:

  takes the data (downloaded shapefile, whatever)
  takes an OSM snapshot (e.g. planet sub-chunk in postgis)
  takes a defined area

  converts the shapefile to OSM format doing the tag transformation and
  tag dropping

  conflates the data, producing separate output that has
    features not in OSM that could be uploaded
    features already in OSM
    something else depending on what's learned in the process

This way, people can run the conflation and examine the results to
assess quality.  And, I think actually writing this as code and
expecting it to be run repeatedly sharpens the thinking about the import
transformation process and shines a more careful light on quality.
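
To make that concrete, here is a minimal sketch of the shape of such a
script (assuming Python with geopandas; the file names, the tag mapping,
and the 30 m match distance are placeholders of mine, not something from
the HIFLD proposal):

  import geopandas as gpd

  MATCH_DISTANCE_M = 30   # assumed threshold; needs tuning per dataset

  # Source data: one HIFLD point layer, already clipped to the work area.
  src = gpd.read_file("hifld_layer.shp")

  # OSM snapshot: existing candidate objects for the same area, e.g.
  # exported from a planet extract or Overpass as GeoJSON.
  osm = gpd.read_file("osm_existing.geojson")

  # Tag transformation and tag dropping: keep only fields that map to
  # OSM tags, rename them, drop the rest.
  src = src.rename(columns={"NAME": "name"})[["name", "geometry"]]

  # Work in a metric CRS so distances are in meters (CONUS Albers).
  src = src.to_crs(epsg=5070)
  osm = osm.to_crs(epsg=5070)

  # Conflation: nearest existing OSM object within the match distance.
  joined = gpd.sjoin_nearest(src, osm, how="left",
                             max_distance=MATCH_DISTANCE_M,
                             distance_col="dist_m")
  matched = joined["index_right"].notna()

  # Separate outputs: upload candidates vs. features needing review.
  joined[~matched].to_file("not_in_osm.geojson", driver="GeoJSON")
  joined[matched].to_file("already_in_osm.geojson", driver="GeoJSON")

Real conflation would also compare names and tags, not just distance,
but even this much forces the tag mapping to be written down and makes
the match rate measurable.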

> If a dataset has about <1000 objects, I feel that it is reasonably
> small enough that I could check each object individually, conflate as
> needed, and upload all at once, (possibly using the one feature at a
> time option in JOSM so that the bounding box isn’t large) This way I
> don’t have to waste my time making micro-changesets in each state for
> like 1-2 objects each. (this is the main strategy for most of the
> financial datasets)

I wonder what others think, but I don't personally object to import
changesets being largish.   It would be nice if JOSM could do "one
changeset per admin-level N boundary" and you picked 4, though.

But, once there is code to transform/conflate, you'll have candidate
upload files.

> 3: Imported using MapRoulette
>
> I’m not very familiar with MapRoulette, but it was suggested as a
> viable option for importing this data. For some very particular
> datasets, such as the National Bridge Inventory, it would likely work
> best, because a MR user could add a nearby bridge based on the data
> provided in the point. This may also work well for some data in the
> public health section for example, but it remains to be seen.

Keep in mind that mappers in some states have added various data and
often there is state data, so you should evaluate if the DHS data is
better or worse than the state data, and not import it if it's worse.

> 4: Imported on a state/local level without review
>
> If the locals of a state are in support of having data added without
> manual review before uploading, I am willing to have data added with
> review being done after upload by the locals of a state. I don’t plan
> on doing this unless it is explicitly requested.

This isn't really ok, depending on what you mean.   I think it's ok to
take a dataset and do statistical quality control, where some fraction
of points are checked (against on-the-ground reality), and then if >99%
of them are correct, to assume they are all correct (enough that "fix
later" is ok).

Then, there still needs to be conflation, to avoid overwriting
manually-mapped data, and to avoid duplication.

In my view, as stated above, conflation should be done by programs and
the programs should be published.

> Obtaining local support:
>
> I am willing to do the legwork. I plan on spending some time in each
> state to find any mappers that are active in the state and can speak
> on behalf of the local community, so I make sure that I can have full
> local support rather than just pinging an empty slack channel and
> taking silence as a yes. I’ll publish a table of the active mappers by
> state on a personal wiki page for everyone’s enjoyment as well :)

Note that some states, including MA, have email lists, and a number of
active mappers do not believe the use of Slack is legitimate (because
it's a proprietary system requiring signing a contract with a particular
company).  However others think it's ok.

And obviously talk-us, but it makes sense to get a more baked proposal
here.

> Licensing:
>
> Any data published by a national agency in the US is required to be in
> the public domain. If the HIFLD has external data published, it is
> automatically in the public domain.

I'm not super worried about licensing, but can you provide a citation
for this claim?  I realize that "works by the US government" including
those by employees under the work-for-hire doctrine are PD, but I have
not seen a statute that says mere publication of material from others
results in PD status.

Some of these datasets seem to be compilations of other datasets.
Nursing homes, which I picked because I can sort of armchair assess
quality, seems to be copied from state databases, at least in MA.  The
source data is by address, so it was geocoded somehow.  All of this is
unclear about licensing, so that makes me really want to understand the
"if published by the US, is PD" claim.q



The website doesn't really get this right.  Looking at "nursing homes":
  https://hifld-geoplatform.opendata.arcgis.com/datasets/geoplatform::nursing-homes/about
says

  License

  None (Public Use). Users are advised to read the data set's metadata
  thoroughly to understand appropriate use and data limitations.

which conflates public domain with data for which a license under
copyright law is needed.

> You can read all the information about this import, as well as a few
> drafts I've written for importing certain datasets here:
> https://wiki.openstreetmap.org/wiki/HIFLD


I really don't follow some of the comments like:

  Fine for import, probably good for equity analysis

This list is about "is this data high quality enough to import".
Anybody can do any kind of analysis they want with OSM and external
data, so mentioning that seems a bit of a red herring.  But it did make me
realize that the wiki page is describing a set of data sources more than
it is an import proposal.

The wiki page should probably be renamed to be clearly a US thing, and a
DHS thing within that.  I bet I am in the upper few % of mappers in
speaking fed acronyms especially in the emergency management world, and
HIFLD was totally unknown to me.


I am particularly skeptical of trail data, and this page doesn't clearly
separate import candidates from "recommend against import; useful as
reference layer".


I did look at nursing home data (which includes assisted living).  Some
of it seemed pretty good.  For one campus in Worcester MA the ALR and
SNF locations were backwards (both are already correct in OSM).  There
were some SNFs that I had not heard of, and on researching them, some
seem ok, and some I can't figure out.  And, SNFs are coded as
TYPE="NURSING HOME" and NAICS_CODE="NURSING CARE FACILITIES (SKILLED
NURSING FACILITIES)", the second of which is fine, but for example the
"Transitional Care Unit" at Emerson Hospital is listed.  That's a rehab
facility, but it is not a "nursing home".  The WINTER HILL REST HOME in
Worcester is coded as SNF, and on searching I find no web presence, and
one index site of unknown quality listing it as Assisted Living.   But
an ALR without a web site doesn't make sense.  A MA dataset shows it as
a 'rest home', which is not even an ALR.

So from a quick glance this dataset fails badly at the ">= 99% correct"
test.  It does seem useful for people in state to go over the map with
and identify places that could be added, but that's not import.

This isn't even asking: What is the quality in each other state?  If the
data is gathered from state databases on a state-by-state basis (which
seems like the only sane thing for DHS to have done), then there's no
reason to expect a quality assessment for one process to be valid for a
different process, with different source data.  But in this case, rest
homes are coded as SNFs, and that indicates a fundamental lack of
quality control on the part of DHS.  And all I did was download the file
geodatabase, add it in QGIS, symbolize by TYPE, display NAME, and
click around in areas where I have a fairly good clue about
on-the-ground reality, for maybe 20-30 minutes total.  So I am
immediately skeptical of data quality in general -- if I found this much
wrong in 20 minutes including what seem to be systematic problems, how
careful could they have been?
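
For anyone who wants to repeat that kind of spot check without QGIS, a
rough equivalent in Python (the geodatabase path, layer name, and STATE
field name here are guesses and would need checking against the actual
download):

  import fiona
  import geopandas as gpd

  gdb = "NursingHomes.gdb"           # placeholder path to the download
  print(fiona.listlayers(gdb))       # find the real layer name first

  gdf = gpd.read_file(gdb, layer=fiona.listlayers(gdb)[0])
  ma = gdf[gdf["STATE"] == "MA"]     # assumes a STATE attribute exists
  print(ma["TYPE"].value_counts())
  print(ma[["NAME", "TYPE", "NAICS_CODE"]].head(20))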

Greg