NMNH-US-National-Herbarium.md

Dataset Summary

The US National Herbarium dataset is a natural history dataset containing over 4.5 million plant specimens in over 1,300 families, with images of each specimen as well as corresponding occurrence data. Occurrence data include the species classification, the date/time and site/location of collection, and other metadata conforming to the Darwin Core data standard (https://dwc.tdwg.org). The plant collections of the Smithsonian Institution reside in the US National Herbarium within the National Museum of Natural History. The herbarium today holds over 5 million historical plant records, placing it among the world's largest and most important herbaria. The overwhelming majority of these specimens are digitized and publicly available.

Languages

English

Data Instances

[ 
  { 
    "gbifID": "", 
    "abstract": "", 
    "accessRights": "", 
    "accrualMethod": "", 
    "accrualPeriodicity": "", 
    "accrualPolicy": "", 
    "alternative": "", 
    "audience": "", 
    "available": "", 
    "bibliographicCitation": "", 
    "conformsTo": "", 
    "contributor": "", 
    "coverage": "", 
    "created": "", 
    "creator": "", 
    "date": "", 
    "dateAccepted": "", 
    "dateCopyrighted": "", 
    "dateSubmitted": "", 
    "description": "", 
    "educationLevel": "", 
    "extent": "", 
    "format": "", 
    "hasFormat": "", 
    "hasPart": "", 
    "hasVersion": "", 
    "identifier": "https://collections.nmnh.si.edu/search/botany/", 
    "instructionalMethod": "", 
    "isFormatOf": "", 
    "isPartOf": "", 
    "isReferencedBy": "", 
    "isReplacedBy": "", 
    "isRequiredBy": "", 
    "isVersionOf": "", 
    "issued": "", 
    "language": "", 
    "license": "CC0_1_0", 
    "mediator": "", 
    "medium": "", 
    "modified": "", 
    "provenance": "", 
    "publisher": "", 
    "references": "", 
    "relation": "", 
    "replaces": "", 
    "requires": "", 
    "rights": "", 
    "rightsHolder": "", 
    "source": "", 
    "spatial": "", 
    "subject": "", 
    "tableOfContents": "", 
    "temporal": "", 
    "title": "", 
    "type": "PhysicalObject", 
    "valid": "", 
    "institutionID": "urn:lsid:biocol.org:col:34871", 
    "collectionID": "urn:uuid:60e28f81-e634-4869-aa3e-732caed713c8", 
    "datasetID": "", 
    "institutionCode": "US", 
    "collectionCode": "", 
    "datasetName": "NMNH Extant Biology", 
    "ownerInstitutionCode": "", 
    "basisOfRecord": "PRESERVED_SPECIMEN", 
    "informationWithheld": "", 
    "dataGeneralizations": "", 
    "dynamicProperties": "", 
    "occurrenceID": "", 
    "catalogNumber": "", 
    "recordNumber": "", 
    "recordedBy": "", 
    "recordedByID": "", 
    "individualCount": "", 
    "organismQuantity": "", 
    "organismQuantityType": "", 
    "sex": "", 
    "lifeStage": "", 
    "reproductiveCondition": "", 
    "behavior": "", 
    "establishmentMeans": "", 
    "degreeOfEstablishment": "", 
    "pathway": "", 
    "georeferenceVerificationStatus": "", 
    "occurrenceStatus": "PRESENT", 
    "preparations": "Pinned", 
    "disposition": "", 
    "associatedOccurrences": "", 
    "associatedReferences": "", 
    "associatedSequences": "", 
    "associatedTaxa": "", 
    "otherCatalogNumbers": "", 
    "occurrenceRemarks": "EMu record was created as part of the Smithsonian Institution Digitization Program Office (SI DPO) mass digitization project.", 
    "organismID": "", 
    "organismName": "", 
    "organismScope": "", 
    "associatedOrganisms": "", 
    "previousIdentifications": "", 
    "organismRemarks": "", 
    "materialSampleID": "", 
    "eventID": "", 
    "parentEventID": "", 
    "fieldNumber": "", 
    "eventDate": "", 
    "eventTime": "", 
    "startDayOfYear": "", 
    "endDayOfYear": "", 
    "year": "", 
    "month": "", 
    "day": "", 
    "verbatimEventDate": "", 
    "habitat": "", 
    "samplingProtocol": "", 
    "sampleSizeValue": "", 
    "sampleSizeUnit": "", 
    "samplingEffort": "", 
    "fieldNotes": "", 
    "eventRemarks": "", 
    "locationID": "", 
    "higherGeographyID": "", 
    "higherGeography": "", 
    "continent": "", 
    "waterBody": "", 
    "islandGroup": "", 
    "island": "", 
    "countryCode": "", 
    "stateProvince": "", 
    "county": "", 
    "municipality": "", 
    "locality": "", 
    "verbatimLocality": "", 
    "verbatimElevation": "", 
    "verticalDatum": "", 
    "verbatimDepth": "", 
    "minimumDistanceAboveSurfaceInMeters": "", 
    "maximumDistanceAboveSurfaceInMeters": "", 
    "locationAccordingTo": "", 
    "locationRemarks": "", 
    "decimalLatitude": "", 
    "decimalLongitude": "", 
    "coordinateUncertaintyInMeters": "", 
    "coordinatePrecision": "", 
    "pointRadiusSpatialFit": "", 
    "verbatimCoordinateSystem": "", 
    "verbatimSRS": "", 
    "footprintWKT": "", 
    "footprintSRS": "", 
    "footprintSpatialFit": "", 
    "georeferencedBy": "", 
    "georeferencedDate": "", 
    "georeferenceProtocol": "", 
    "georeferenceSources": "", 
    "georeferenceRemarks": "", 
    "geologicalContextID": "", 
    "earliestEonOrLowestEonothem": "", 
    "latestEonOrHighestEonothem": "", 
    "earliestEraOrLowestErathem": "", 
    "latestEraOrHighestErathem": "", 
    "earliestPeriodOrLowestSystem": "", 
    "latestPeriodOrHighestSystem": "", 
    "earliestEpochOrLowestSeries": "", 
    "latestEpochOrHighestSeries": "", 
    "earliestAgeOrLowestStage": "", 
    "latestAgeOrHighestStage": "", 
    "lowestBiostratigraphicZone": "", 
    "highestBiostratigraphicZone": "", 
    "lithostratigraphicTerms": "", 
    "group": "", 
    "formation": "", 
    "member": "", 
    "bed": "", 
    "identificationID": "", 
    "verbatimIdentification": "", 
    "identificationQualifier": "", 
    "typeStatus": "", 
    "identifiedBy": "", 
    "identifiedByID": "", 
    "dateIdentified": "", 
    "identificationReferences": "", 
    "identificationVerificationStatus": "", 
    "identificationRemarks": "", 
    "taxonID": "", 
    "scientificNameID": "", 
    "acceptedNameUsageID": "", 
    "parentNameUsageID": "", 
    "originalNameUsageID": "", 
    "nameAccordingToID": "", 
    "namePublishedInID": "", 
    "taxonConceptID": "", 
    "scientificName": "", 
    "acceptedNameUsage": "", 
    "parentNameUsage": "", 
    "originalNameUsage": "", 
    "nameAccordingTo": "", 
    "namePublishedIn": "", 
    "namePublishedInYear": "", 
    "higherClassification": "Plantae", 
    "kingdom": "Plantae", 
    "phylum": "", 
    "class": "", 
    "order": "", 
    "family": "", 
    "subfamily": "", 
    "genus": "", 
    "genericName": "", 
    "subgenus": "", 
    "infragenericEpithet": "", 
    "specificEpithet": "", 
    "infraspecificEpithet": "", 
    "cultivarEpithet": "", 
    "taxonRank": "", 
    "verbatimTaxonRank": "", 
    "vernacularName": "", 
    "nomenclaturalCode": "", 
    "taxonomicStatus": "", 
    "nomenclaturalStatus": "", 
    "taxonRemarks": "", 
    "datasetKey": "", 
    "publishingCountry": "US", 
    "lastInterpreted": "", 
    "elevation": "", 
    "elevationAccuracy": "", 
    "depth": "", 
    "depthAccuracy": "", 
    "distanceAboveSurface": "", 
    "distanceAboveSurfaceAccuracy": "", 
    "issue": "", 
    "mediaType": "", 
    "hasCoordinate": "", 
    "hasGeospatialIssues": "", 
    "taxonKey": "", 
    "acceptedTaxonKey": "", 
    "kingdomKey": "", 
    "phylumKey": "", 
    "classKey": "", 
    "orderKey": "", 
    "familyKey": "", 
    "genusKey": "", 
    "subgenusKey": "", 
    "speciesKey": "", 
    "species": "", 
    "acceptedScientificName": "", 
    "verbatimScientificName": "", 
    "typifiedName": "", 
    "protocol": "DWC_ARCHIVE", 
    "lastParsed": "", 
    "lastCrawled": "", 
    "repatriated": "", 
    "relativeOrganismQuantity": "", 
    "level0Gid": "", 
    "level0Name": "", 
    "level1Gid": "", 
    "level1Name": "", 
    "level2Gid": "", 
    "level2Name": "", 
    "level3Gid": "", 
    "level3Name": "", 
    "iucnRedListCategory": "" 
  } 
] 

Data Fields

Fields conform to the Darwin Core data standard and are detailed here: https://dwc.tdwg.org.
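
Because most Darwin Core fields are empty strings in any given record, as in the data instance above, it can help to drop them before further processing. The following is a minimal sketch, assuming a record has been loaded as a Python dict keyed by the field names shown above. The helper name `non_empty_fields` is hypothetical and is not part of any GBIF or Smithsonian tooling.

```python
import json


def non_empty_fields(record: dict) -> dict:
    """Return only the fields whose values are not empty or whitespace-only strings."""
    return {
        key: value
        for key, value in record.items()
        if not (isinstance(value, str) and value.strip() == "")
    }


# Abbreviated record using field names and values from the data instance above.
raw = json.loads("""
{
  "gbifID": "",
  "license": "CC0_1_0",
  "institutionCode": "US",
  "basisOfRecord": "PRESERVED_SPECIMEN",
  "kingdom": "Plantae",
  "family": "",
  "decimalLatitude": ""
}
""")

populated = non_empty_fields(raw)
print(sorted(populated))
```

Filtering on empty strings rather than on a fixed list of field names keeps the sketch independent of which Darwin Core terms a particular export happens to populate.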

Curation Rationale

The dataset represents all records from the U.S. National Botany Collection (Herbarium: US) with digital records and images as of September 2023. Over 4.5 million specimen records (including over 115,000 type specimens with images) are currently available in the online catalog. The U.S. National Herbarium was founded in 1848, when the first collections were accessioned from the United States Exploring Expedition (50,000 specimens of 10,000 species). This collection is among the ten largest in the world, representing about 8% of the plant collection resources of the United States. A seven-year effort to digitize the herbarium through a digitization conveyor system, ending in May 2022, resulted in 3.8 million new specimen images, 2.8 million new label transcriptions, and over 80,000 new taxonomic names added to the data catalog. This effort was a collaboration between the Smithsonian National Museum of Natural History Department of Botany, the Smithsonian Office of the Chief Information Officer Digitization Program Office, and the digitization company Picturae.

Initial Data Collection and Normalization

The plant collections of the Smithsonian Institution began with the acquisition of specimens collected by the United States Exploring Expedition (1838-1842). These formed the foundation of a National Herbarium.

NMNH specimen data are exported to the Global Biodiversity Information Facility (GBIF) on a weekly basis through an installation of an Integrated Publishing Toolkit (IPT, https://collections.nmnh.si.edu/ipt/). Some data transformation takes place within EMu, and GBIF likewise normalizes the data to meet its standards.

Who are the source language producers?

The occurrence data were produced by humans, observed and written onto paper labels over the museum’s history, and then transcribed from the herbarium sheet labels.

Annotations

The annotations are the specimen occurrence data, recorded in Darwin Core fields.

Annotation process

The occurrence data were transcribed from the labels into Darwin Core fields, both by the conveyor vendor and by NMNH staff over 40 years.

Who are the annotators?

Original collectors and identifiers were botanists and researchers from the Smithsonian and other institutions. Collectors may not be specialists. Data transcription was carried out by online volunteers and professional transcription service workers. Demographic data of transcribers are unknown.

Personal and Sensitive Information

The dataset contains the names of the collectors and identifiers.

Social Impact of Dataset

Digitized natural history collections have the potential to be used in diverse research applications in evolutionary biology, ecology, and climate change.

The dataset contains records for species listed on the U.S. Endangered Species List.

Some site/location names could cause harm as they are insensitive or racist towards indigenous communities.

Discussion of Biases

Estimates of species geographic ranges based on these data may not be complete. There are many reasons collectors may collect more frequently from some areas rather than others, including their own taxonomic interests, proximity to collections institutions, accessibility via roads, ability to acquire permits for a specific area, or for geopolitical reasons.

The geographic distribution of specimens in this dataset is not necessarily representative of the geographic distribution of plant diversity. The absence of a taxon from a specific area in the herbarium data does not prove that the species does not exist in that area; herbarium data can only demonstrate presence, not absence.

Other Known Limitations

As with all natural history collections data, there is the potential that some metadata are inaccurate or inconsistent given that they have been collected and recorded over the course of the past 185 years. Smithsonian staff seek to correct these errors as they are identified but the dataset as presented is a snapshot in time.

Species identifications may be inaccurate or not up-to-date based on the latest classification.

Collector names may not be consistent across records (e.g. the same person’s name may be written differently). For women’s names, which were often historically recorded as Mrs. <spouse’s name>, only the spouse’s name may appear.
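
One way to surface inconsistent spellings is to group `recordedBy` values under a crude normalized key. The following is a rough sketch, not a method used by the dataset curators. The function names are hypothetical, and the example strings are made up for illustration rather than drawn from the dataset.

```python
import re
from collections import defaultdict


def name_key(name: str) -> str:
    """Crude normalization: lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(cleaned.split())


def group_name_variants(names):
    """Map each normalized key to the set of raw spellings that share it."""
    groups = defaultdict(set)
    for name in names:
        groups[name_key(name)].add(name)
    return dict(groups)


# Made-up spellings of one hypothetical collector.
variants = ["A. B. Example", "A B Example", "a.b. example"]
groups = group_name_variants(variants)
print(groups)
```

This only catches differences in case, punctuation, and spacing. It does not reconcile reordered names (such as "Example, A. B."), abbreviated versus full given names, or records where only a spouse's name was written, so any grouping it produces should be reviewed by hand.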

Locality data may use historical place names that are no longer used.

Although there are written locality data for most specimen records, a large percentage do not have geocoordinates.
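
Since many records have empty coordinate fields, downstream code should treat coordinates as optional. The following is a minimal sketch, assuming records are dicts with string values in the `decimalLatitude` and `decimalLongitude` fields shown in the data instance. The function name and the example coordinate values are hypothetical.

```python
from typing import Optional, Tuple


def parse_coordinates(record: dict) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) as floats, or None if missing or invalid."""
    try:
        lat = float(record.get("decimalLatitude", ""))
        lon = float(record.get("decimalLongitude", ""))
    except (TypeError, ValueError):
        # Empty strings, None, and non-numeric text all land here.
        return None
    # Reject values outside the valid ranges (this also rejects NaN).
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return None
    return lat, lon


# Illustrative values only; not taken from the dataset.
with_coords = {"decimalLatitude": "38.89", "decimalLongitude": "-77.03"}
without_coords = {"decimalLatitude": "", "decimalLongitude": ""}

print(parse_coordinates(with_coords))
print(parse_coordinates(without_coords))
```

Returning `None` instead of a placeholder such as `(0.0, 0.0)` avoids silently placing ungeoreferenced specimens at a real location.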

Dates may sometimes have been recorded by original collectors inconsistently or may be incomplete (no month/day information).
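
Incomplete dates can be represented at whatever precision the record supports instead of being discarded or padded with invented values. The following is a minimal sketch, assuming the `year`, `month`, and `day` fields from the data instance hold numeric strings or are empty. The function name and example values are hypothetical, and the sketch does not check that the components form a valid calendar date.

```python
def partial_iso_date(record: dict) -> str:
    """Build the most specific ISO 8601-style date the year/month/day fields support."""
    parts = []
    for field, width in (("year", 4), ("month", 2), ("day", 2)):
        value = str(record.get(field, "")).strip()
        if not value.isdigit():
            break  # stop at the first missing or non-numeric component
        parts.append(value.zfill(width))
    return "-".join(parts)


# Illustrative values only; not taken from the dataset.
print(partial_iso_date({"year": "1902", "month": "7", "day": "4"}))
print(partial_iso_date({"year": "1902", "month": "7", "day": ""}))
print(partial_iso_date({"year": "1902", "month": "", "day": ""}))
```

Stopping at the first missing component means a record with a year and a day but no month yields only the year, which keeps the result unambiguous.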

For endangered species, locality data is not included in the dataset.

For specimens collected from Brazil, specimen images are not included in the dataset.

Plant specimens from Cactaceae and Orchidaceae do not have images online because of poaching concerns. Locality data of these families is also kept offline.

The approximate numbers reported in the dataset card are accurate as of September 2023, but specimens are continually being deposited in the US Herbarium and subsequently digitized.

Dataset Curators

Smithsonian National Museum of Natural History, Department of Botany.

Licensing Information

Public domain, Creative Commons CC0.

Citation Information

Orrell T, Informatics Office (2023). NMNH Extant Specimen Records (USNM, US). Version 1.71. National Museum of Natural History, Smithsonian Institution. Occurrence dataset https://doi.org/10.15468/hnhrg3 accessed via GBIF.org on 2023-08-21.

Contributions

Thanks to NMNH for adding this dataset.