The past as the key to the future

This post will share an article about digital preservation. The article is titled: Bridging the Two Fossil Records: Paleontology’s “big data” future resides in museum collections.

The authors begin by introducing the concept of two distinct but intertwined fossil records. The first is the physical record, consisting of material objects, the fossils themselves, either residing in collections or remaining in nature awaiting future discovery (Allmon et al. 2018). The second is the abstracted fossil record, consisting of contextual and comparative information gathered by researchers, including any interpretations (Allmon et al. 2018). The entirety of the abstracted record is based on the physical record; the authors state the obvious, that this is simply how paleontology works (Allmon et al. 2018). Over time the physical record requires reinterpretation, adding to the abstracted record; this cycle reveals that the physical record is the true source of data. While both records can be examined, Allmon et al. (2018) make very plain the benefits of studying or re-studying the physical record. As primary data sources, fossils preserve information that may not be currently accessible, meaning that new technologies and knowledge can yield new discoveries from specimens collected hundreds of years ago. Fossils also form the basis of biology and paleontology, as verification and replication of observations are essential to advancing the fields and are a fundamental concept of science itself.

A look at the history of paleontology shows that in the 1970s and 1980s there was a shift in focus from studying specimens to data digitization and big data studies drawn from the published literature. This was innovative for the time and allowed many questions to be answered that had previously been impossible even to ask. Allmon et al. state that digitized big data is still the future, but the literature is a finite resource, and much has developed scientifically and technologically since then. Current paleontology databases are good tools, but they need improvement for a variety of reasons. Firstly, only a very small percentage of data is digitally available compared to the potential scope, as the vast majority of collections have not been digitized, making the databases far from comprehensive (Allmon et al. 2018). Gaps in the digitized record are evident not only in which specimens and geographic locations are included; the metadata is also not standardized, so some specimens carry metadata that others lack. This leads to interoperability problems as well as quality issues. Another quality concern is the reliability of taxonomic identifications: outdated, incorrect, or missing identifications (a large proportion of museum specimens have not been identified at all) are a problem in a research database, since incorrect information leads to incorrect results (Allmon et al. 2018). Increased engagement of researchers could alleviate some of these issues; however, there is little if any reward system or incentive for them to do so, as there is for writing publications and receiving tenure. The last major obstacle to digitizing specimen data is the availability of funding to 'digitize everything'.
They include some crude calculations of the cost of digitizing only the identified specimens in American museums: at $1 USD per specimen, about $75 million USD, which they compare with the $10 million USD the NSF allocated in its 2017 budget to the Advancing Digitization of Biological Collections program for all natural history collections (Allmon et al. 2018). Seeing these rough totals, they conclude that digitizing all or even most specimens is very unlikely. Questions then remain, such as what to prioritize and how much digitization is sufficient.
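The authors' back-of-envelope arithmetic can be sketched as follows (the per-specimen cost and specimen count are the rough figures from the article; treating the NSF allocation as an annual budget devoted entirely to fossils is my own illustrative assumption):

```python
# Rough cost arithmetic from Allmon et al. (2018):
# roughly 75 million identified specimens in U.S. museums at about $1 each.
identified_specimens = 75_000_000
cost_per_specimen_usd = 1
total_cost = identified_specimens * cost_per_specimen_usd  # $75 million

# NSF's Advancing Digitization of Biological Collections allocation
# (2017 budget, covering ALL natural history collections, not just fossils).
adbc_annual_budget = 10_000_000

# Even if that entire budget went to fossil specimens alone (unrealistic),
# digitizing just the identified specimens would take years:
years_needed = total_cost / adbc_annual_budget
print(total_cost, years_needed)  # 75000000 7.5
```

Seen this way, the authors' pessimism is easy to follow: the gap is not marginal but close to an order of magnitude per year.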

I found this article very interesting, as it brings up points we've discussed during the digital preservation class and in other weeks' topics. Their concern is that while digitization of data has already been able to answer some big data research questions, it has come to a point where the databases need maintenance and new data to continue being useful; we talked in class about digital preservation being an ongoing activity, and that is clear here. They bring up issues of bias in which data is excluded; in this case, identified specimens are more likely to be included than unidentified ones. We've talked about bias in LAM (libraries, archives, and museums) before, concerning which objects are included or how they are described. Being more inclusive would lead to a more robust digital collection, enhanced accuracy of data, and better research. This would happen in an ideal world; however, it is quite sad that the funding simply isn't there. I recall a paper from earlier in the course by Erway and Schaffner, who advocated for mass digitization as soon as possible to expose hidden collections, enhancing quality over time as resources allow (Erway and Schaffner 2007). While this can be a positive approach for many projects, Allmon et al. are of the opinion that in this case, for fossil collections residing in the U.S. (even just the identified specimens), not even this is feasible.


Allmon, W.D., Dietl, G.P., Hendricks, J.R., Ross, R.M. (2018) Bridging the Two Fossil Records: Paleontology’s “big data” future resides in museum collections, in Rosenberg, G.D., and Clary, R.M., eds., Museums at the Forefront of the History and Philosophy of Geology: History Made, History in the Making: Geological Society of America Special Paper 535.

Erway, R., Schaffner J. 2007. Shifting Gears: Gearing Up to Get Into the Flow. OCLC Programs and Research. Published online.


Pyrite in Two Libraries

This post is a response to the prompt to compare metadata in two different digital libraries. The two libraries are the Colorado School of Mines – Mineral Specimens collection and the System for Earth Sample Registration. Each library and its metadata will be discussed, and towards the end we'll look at a pyrite sample from each collection.

To begin, we'll examine the Colorado School of Mines – Mineral Specimens project, in which the library, institutional repository, and geology museum collaborated to promote a themed mineral collection focusing on the historical mining district of Creede, Colorado (Dunn, 2018). Minerals in this collection are on virtual display. For example, the record for a pyrite (54072) shows users a brief overview of the mineral in addition to a full metadata record. From the element names it's clear that the schema is Dublin Core, indicated by the prefix 'dc.'. Elements include: contributor, coverage, date, identifier, description, publisher, relation, rights, subject, title, and type. However, there are variants of these elements too. Dublin Core was intentionally created to be simple, so as to be widely usable, but over time it has adapted to become more specific when needed by adding qualifiers (Riley, 2017). Looking at the record for pyrite (54072), the contributor, coverage, date, identifier, description, relation, and subject elements have qualifiers.

Below is a summary of the elements used for this collection. It’s worth pointing out that Dublin Core is a flat schema.

  • coverage.spatial
  • date
  • date.accessioned
  • date.available
  • identifier
  • identifier.uri
  • description
  • description.abstract
  • publisher
  • relation.ispartof
  • rights
  • subject.lcsh
  • title
  • type
  • contributor.institution
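A record built from these elements might look like the following sketch. The element names are those listed above; the field values are invented placeholders, not the actual metadata for specimen 54072:

```python
# Hypothetical qualified Dublin Core record for a mineral specimen.
# The schema is flat: a qualifier like "coverage.spatial" refines an
# element's meaning but adds no nesting.
record = {
    "dc.title": "Pyrite",
    "dc.type": "photograph",
    "dc.identifier": "54072",
    "dc.coverage.spatial": "Creede, Colorado",
    "dc.date.accessioned": "2018-01-01",
    "dc.description.abstract": "Cubic pyrite crystals on matrix.",
    "dc.subject.lcsh": "Pyrite",
    "dc.relation.ispartof": "Mineral Specimens",
}

# Flatness means every value is reachable with a single key lookup,
# with no tree to walk:
assert record["dc.coverage.spatial"] == "Creede, Colorado"
```

This flatness is what keeps Dublin Core simple to implement, and it is the main structural contrast with the IGSN schema discussed below.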

The second digital library, the System for Earth Sample Registration (SESAR), is an NSF-funded project of the EarthChem Program and part of IEDA, the Interdisciplinary Earth Data Alliance (SESAR, 2018). SESAR is more than a digital library; it also distributes International Geo Sample Numbers (IGSNs). IGSNs are similar to ORCID iDs for people, in that they are unique identifiers that help overcome the problem of ambiguous naming. The IGSN corresponds to the identifier element in Dublin Core. More element equivalents between the IGSN schema and Dublin Core are in the table below (IGSN Metadata Version 1.0, 2016). Before looking at the schema and its elements, please note there are in fact two schemas at work. Also of importance is that unlike Dublin Core, IGSN is a hierarchical schema.

dc:element      IGSN Descriptive Element
dc:contributor  N/A
dc:coverage     samplingLocation
dc:creator      sampleCollector
dc:date         samplingTime
dc:description  comments
dc:format       materialType
dc:identifier   IGSN
dc:language     N/A
dc:publisher    sampleCurator
dc:relation     relatedResources
dc:rights       N/A
dc:source       N/A
dc:subject      classification
dc:title        sampleName
dc:type         sampleType
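The table above amounts to a crosswalk between the two schemas. A minimal sketch of applying it (the mappings come from the table; the sample record values are invented):

```python
# Crosswalk from Dublin Core elements to IGSN descriptive elements,
# per the mapping table above. None marks DC elements with no IGSN
# equivalent (contributor, language, rights, source).
dc_to_igsn = {
    "dc:contributor": None,
    "dc:coverage": "samplingLocation",
    "dc:creator": "sampleCollector",
    "dc:date": "samplingTime",
    "dc:description": "comments",
    "dc:format": "materialType",
    "dc:identifier": "IGSN",
    "dc:language": None,
    "dc:publisher": "sampleCurator",
    "dc:relation": "relatedResources",
    "dc:rights": None,
    "dc:source": None,
    "dc:subject": "classification",
    "dc:title": "sampleName",
    "dc:type": "sampleType",
}

def crosswalk(dc_record):
    """Translate a flat DC record into IGSN element names,
    dropping fields with no IGSN equivalent."""
    return {igsn: value
            for dc, value in dc_record.items()
            if (igsn := dc_to_igsn.get(dc)) is not None}

print(crosswalk({"dc:title": "Pyrite", "dc:rights": "CC-BY"}))
# {'sampleName': 'Pyrite'}
```

Note what the crosswalk silently discards: a rights statement has nowhere to go in IGSN, which foreshadows the difference between the two pyrite records discussed at the end of this post.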

Below is the IGSN registration metadata schema, which functions as administrative metadata (IGSN Metadata Version 1.0, 2016). There are four elements, each containing at least one sub-element.

  • sampleNumber
    • identifierType
  • registrant
    • registrantName
    • nameIdentifier
      • nameIdentifierScheme
  • relatedResourceIdentifier
    • relatedIdentifierType
    • relationType
  • log
    • logElement
      • event
      • timestamp
      • comment

Below is the IGSN descriptive metadata schema (IGSN Metadata Version 1.0, 2016). Here there are 17 top-level elements, only slightly more than the Dublin Core example above; however, if the sub-elements in the hierarchy are included, there are substantially more.

  • identifier
    • identifierType
  • name
  • alternateIdentifiers
    • alternateIdentifier
    • identifierType
  • parentIdentifier
    • identifierType
  • CollectionIdentifier
    • identifierType
  • relatedIdentifiers
    • relatedIdentifier
    • identifierType
    • relationType
  • description
  • registrant
    • identifier
      • identifierType
    • name
    • affiliation
      • identifier
        • identifierType
      • name
  • collector
    • identifier
      • identifierType
    • name
    • affiliation
      • identifier
        • identifierType
      • name
  • contributors
    • contributor
      • contributorType
      • identifier
        • identifierType
      • name
  • geoLocations
    • geoLocation
      • geometry
        • geometryType
        • sridType
      • toponym
        • identifier
          • identifierType
        • name
  • resourceTypes
    • resourceType
    • alternateResourceTypes
      • alternateResourceType
  • materials
    • material
    • alternateMaterials
      • alternateMaterial
  • collectionMethods
    • collectionMethod
    • alternatecollectionMethods
      • alternatecollectionMethod
  • collectionTime
  • sampleAccess
  • supplementalMetadata
    • record
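The hierarchy above contrasts sharply with Dublin Core's flat structure. A sketch of one nested branch as a data structure makes the difference concrete (the IGSN is the one from the pyrite record discussed later; every other value here is an invented placeholder):

```python
# A nested branch of the IGSN descriptive schema, modeled as nested
# dictionaries. Unlike flat Dublin Core, reaching a value requires
# walking down the tree.
igsn_record = {
    "identifier": {"value": "HRV0005V9", "identifierType": "IGSN"},
    "name": "Pyrite",
    "geoLocations": [
        {
            "geoLocation": {
                "toponym": {
                    "name": "Hypothetical locality",
                    "identifier": {
                        "value": "12345",          # invented
                        "identifierType": "GeoNames",
                    },
                },
            },
        },
    ],
}

# Reaching the toponym name takes a four-step path, where the flat
# Dublin Core equivalent (dc.coverage.spatial) is one key lookup:
toponym = igsn_record["geoLocations"][0]["geoLocation"]["toponym"]["name"]
```

The depth is not gratuitous: each level can carry its own qualifying information (an identifier type, a scheme), which is exactly the specificity the next paragraph weighs against Dublin Core's simplicity.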

Comparing the Dublin Core and IGSN schemas, there are noticeable differences. IGSN consists of two separate schemas that work together; Dublin Core has one. IGSN elements are very specific compared to Dublin Core's. The elements in Dublin Core are intentionally vague so as to be adaptable to a variety of resources (Riley, 2017). IGSN, on the other hand, is tailored to geological samples and not easily adaptable to other resources, which rather limits its uses. Specificity and complexity have both benefits and burdens. With more tailored metadata, searching becomes more efficient for the schema's specific audience, and new relationships between resources can be identified. This in turn can result in new data creation, spurring further research. The burden is that this much metadata is very time-consuming to enter, or may not be entirely known.

To make the comparison between the metadata of these two digital libraries more concrete, let us look at a record from each: two pyrites, #54072 and IGSN HRV0005V9. These were chosen because they are the same mineral and therefore very similar, yet the approaches to describing them are quite different. As discussed above, different schemas and elements describe each pyrite. The result is that pyrite 54072 is described as though it were a photo: while a photo is included in the library record, the only element that describes the physical specimen is dc.description, which gives the specimen size. The other elements appear to describe the photo of the pyrite, such as dc.type being photograph, the description naming the photographer, and the presence of a rights element. Minerals, by definition, are not created by people, so they cannot themselves be copyrighted. Pyrite HRV0005V9, however, seems centered on the mineral itself, despite the fact that not many of the elements have been filled in. There is no rights information here, but rather the current archive and curator. The geolocation in IGSN is comparable to dc.coverage.spatial, but the hierarchical level of detail in the metadata implies different philosophies and different users. Overall, the differences between the descriptions of the two pyrites, and between the schemas and their elements, are quite drastic.



Dunn, L. 2018. Library/Museum Partnership, Colorado School of Mines. in Geoscience Information Society Newsletter. Geoscience Information Society 278. (8). Published online.

Riley, J. 2017. Understanding Metadata: What is Metadata and what is it for?. National Information Standards Organization.  Published online.

IGSN Metadata Version 1.0. 2016. GitHub. Accessed Oct. 20, 2018.

SESAR. 2018. System for Earth Sample Registration. Interdisciplinary Earth Data Alliance, Accessed Oct. 20, 2018.

Digitizing Natural History Collections

In this first blog post, I wanted to respond to the reading titled Shifting Gears (Erway & Schaffner, 2007), which discussed digitization philosophy. A few semesters ago I read an article on digitizing natural history collections which covered the efforts of the Natural History Museum (NHM) in London (Blagoderov et al., 2012). The project at London's NHM was a mass digitization effort which involved photographing specimens by the drawer and then dividing the large image into smaller individual specimen images. Unique IDs were assigned, in addition to some other metadata. For example, if all specimens in a drawer were of the same biological phylum, then this information would be entered in the relevant Darwin Core metadata element. All other metadata would be filled in at a later time by professionals and by citizen scientists via crowd-sourcing. When I read this article about a year ago, I was impressed by the amount of work they could achieve, but was rather uneasy about the missing metadata and description. With very abbreviated metadata, the surrogate records for the specimens seemed quite useless for discoverability and research.
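The drawer-level workflow can be sketched roughly as follows. The ID format, function names, and drawer metadata here are invented for illustration; they are not the NHM's actual scheme, only the general pattern of stub records inheriting drawer-wide metadata:

```python
# Sketch of drawer-level digitization: one drawer image yields many
# specimen stub records that inherit whatever metadata applies to the
# whole drawer (e.g. phylum), with the rest left blank for later curation.

def digitize_drawer(drawer_id, specimen_boxes, shared_metadata):
    """Create a stub Darwin Core record for each specimen in a drawer.

    specimen_boxes: bounding boxes (x, y, width, height) locating each
    specimen within the full drawer image.
    """
    records = []
    for i, box in enumerate(specimen_boxes, start=1):
        record = {
            "occurrenceID": f"{drawer_id}-{i:04d}",  # invented ID format
            "imageRegion": box,  # crop region within the drawer image
            # Darwin Core fields known at drawer level, e.g. "phylum":
            **shared_metadata,
            # Everything else awaits experts or citizen scientists:
            "scientificName": None,
        }
        records.append(record)
    return records

stubs = digitize_drawer("DRAWER-042",
                        [(0, 0, 100, 80), (120, 0, 90, 85)],
                        {"phylum": "Arthropoda"})
print(len(stubs), stubs[0]["occurrenceID"], stubs[0]["phylum"])
# 2 DRAWER-042-0001 Arthropoda
```

The empty scientificName field is exactly the gap that made me uneasy: the stub is discoverable as an inventory entry, but not yet usable for taxonomic research.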

In the Erway and Schaffner report (2007), several points are made that are in agreement with the NHM's approach. They promote access over preservation, state that selection for digitization has already been done in assembling the holdings, and argue that quantity is more important than quality. The report recognizes that these are not long-standing values or philosophies in digitization projects in LAM (libraries, archives, and museums), but suggests they become more accepted. They base this reasoning on nearly two decades of digitization projects where quality was more important than quantity, preservation mattered more than access, and selection of certain objects was preferred to digitizing everything. They argue that moving past these old philosophies would be beneficial for efficiently exposing entire collections to users. Essentially, they want to allow users to browse collections and find objects of interest before information professionals feel that proper metadata and description are present, before certain objects have been chosen to be featured, before everything is scanned at the highest quality, before it's all perfect.

My reaction to Erway and Schaffner's report (2007) was initially rather skeptical, just as it was for London's NHM project (Blagoderov et al., 2012). I am still of the opinion that quality description, metadata, and imaging are very important. However, in light of the recent devastating fire at Brazil's National Museum, I'm starting to see things differently. A vast quantity of the museum's collection was lost in the fire (Solly, 2018). In the aftermath of such a massive loss, having a digital photographic record of the museum collection would mean, at the least, having an inventory of the collection, with a record of how many specimens there were in various collections or categories. It's important to say that for some objects a photograph is not adequate for detailed study, so it could be argued that 3D scans would be more fitting, or even that no digital representation can replace holding the original. However, this way of thinking seems to be exactly what Erway and Schaffner (2007) are suggesting the community move away from, and after considering the ramifications of the fire, I can start to agree with them.

It seems that if digitization projects become more of a digitization process, with the mindset that quality can come after quantity, the LAM community may be less hesitant to undertake them. In my mind, I conceptualize this philosophical change as moving from the notion of a once-and-done project to a gradual process that starts with mass digitization and is refined over time with more description, more metadata, and better images or scans of the objects. This is still an expensive and time-consuming effort, but at an industrial scale such as that demonstrated by London's NHM (Blagoderov et al. 2012), the costs can be reduced. Perhaps institutions could advocate for digitization through annual budget funds, rather than grant funds, on the grounds that a digital record of their holdings, and the global scientific access and communication it enables, are valuable.

There are other articles in the same volume of ZooKeys in which No Specimen Left Behind was published that cover similar topics and look worth reading.


Blagoderov, V., Kitching, I., Livermore, L., Simonsen, T. J., and Smith, V. S. 2012. No Specimen Left Behind: Industrial Scale Digitization of Natural History Collections. ZooKeys 209: 133–146.

Erway, R., Schaffner J. 2007. Shifting Gears: Gearing Up to Get Into the Flow. OCLC Programs and Research. Published online.

Solly, M. 2018. Why Brazil's National Museum Fire was a Devastating Blow to South America's Cultural Heritage. Smithsonian, Sep. 4, 2018. Published online.