This post discusses an article about digital preservation: “Bridging the Two Fossil Records: Paleontology’s ‘big data’ future resides in museum collections” by Allmon et al. (2018).
The authors begin by introducing the concept of two distinct but intertwined fossil records. The first is the physical record: the material objects, the fossils themselves, whether residing in collections or remaining in nature awaiting future discovery (Allmon et al. 2018). The second is the abstracted record: the contextual and comparative information gathered by researchers, including their interpretations (Allmon et al. 2018). The entire abstracted record is built on the physical record; as the authors note, this is simply how paleontology works (Allmon et al. 2018). Over time the physical record requires reinterpretation, which adds to the abstracted record; this cycle shows that the physical record is the true source of data. While both records can be examined, Allmon et al. (2018) make plain the benefits of studying, or re-studying, the physical fossil record. As primary data sources, fossils preserve information that may not yet be accessible, meaning new technologies and knowledge can yield discoveries from specimens collected hundreds of years ago. Fossils also form the evidentiary basis of biology and paleontology: verification and replication of observations are essential to advancing both fields and are fundamental to science itself.
A look at the history of paleontology shows a shift in the 1970s and 1980s from studying specimens to data digitization and big-data studies drawn from the published literature. This was innovative for its time and allowed many questions to be answered that had previously been impossible even to ask. Allmon et al. maintain that digitized big data is still the future, but the literature is a finite resource, and much has developed scientifically and technologically since then. Current paleontology databases are useful tools but need improvement for several reasons. First, only a very small percentage of the potential data is digitally available, since the vast majority of collections have not been digitized, leaving the databases far from comprehensive (Allmon et al. 2018). Gaps in the digitized record appear not only in which specimens and geographic locations are included; the metadata are also not standardized, so some specimens carry metadata that others lack. This leads to interoperability problems as well as quality issues. Another quality concern is the reliability of taxonomic identifications: outdated, incorrect, or missing identifications (a large proportion of museum specimens have never been identified at all) are a serious problem in a research database, since incorrect information leads to incorrect results (Allmon et al. 2018). Increased engagement by researchers could alleviate some of these issues, but there is little if any reward or incentive for them to do so, as there is for writing publications and earning tenure. The last major obstacle to digitizing specimen data is the availability of funding to ‘digitize everything’.
They include a rough calculation of the cost of digitizing only the identified specimens in American museums: at $1 USD per specimen, about $75 million USD, compared with the $10 million USD the NSF allocated in its 2017 budget to the Advancing Digitization of Biological Collections program for all natural history collections (Allmon et al. 2018). Given these rough totals, they conclude that digitizing all, or even most, of the record is very unlikely. Questions then remain, such as what to prioritize and how much digitization is sufficient.
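The authors’ back-of-envelope comparison can be reproduced in a few lines. The figures (roughly 75 million identified specimens, $1 USD per specimen, and a $10 million ADBC budget in 2017) come from the article; the years-to-complete ratio is my own illustrative extension of their arithmetic, not a claim they make.

```python
# Rough reproduction of the cost estimate summarized above.
# Figures from Allmon et al. (2018): ~75 million identified specimens
# in American museums, ~$1 USD digitization cost per specimen, and
# ~$10 million USD in NSF's 2017 ADBC budget (which covers ALL natural
# history collections, not just fossils).
identified_specimens = 75_000_000
cost_per_specimen_usd = 1.00
adbc_budget_2017_usd = 10_000_000

total_cost_usd = identified_specimens * cost_per_specimen_usd
# Illustrative only: how long digitization would take if the entire
# ADBC budget went to fossils every year, which it does not.
years_at_full_budget = total_cost_usd / adbc_budget_2017_usd

print(f"Estimated cost: ${total_cost_usd:,.0f}")        # Estimated cost: $75,000,000
print(f"Years at 2017 ADBC budget: {years_at_full_budget:.1f}")  # Years at 2017 ADBC budget: 7.5
```

Even under these unrealistically generous assumptions, the shortfall makes the authors’ pessimism about ‘digitizing everything’ easy to see.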
I found this article very interesting, as it raises points we’ve discussed during the digital preservation class and in other weeks’ topics. The authors note that while digitization has already enabled some big-data research questions to be answered, the databases now need maintenance and new data to remain useful; we talked in class about digital preservation being an ongoing activity, and that is clear here. They also raise the issue of bias in which data are excluded: in this case, identified specimens are far more likely to be included than unidentified ones. We’ve talked before about bias in LAM institutions, concerning which objects are included and how they are described. Being more inclusive would lead to a more robust digital collection, more accurate data, and better research. That is the ideal; sadly, the funding simply isn’t there. I recall a paper from earlier in the course by Erway and Schaffner, who advocated for mass digitization as soon as possible to expose hidden collections, with quality enhanced over time as resources allow (Erway and Schaffner 2007). While that can be a positive approach for many projects, Allmon et al. argue that for fossil collections residing in the U.S., even limited to the identified specimens, it is not feasible.
Allmon, W.D., Dietl, G.P., Hendricks, J.R., Ross, R.M. (2018) Bridging the Two Fossil Records: Paleontology’s “big data” future resides in museum collections, in Rosenberg, G.D., and Clary, R.M., eds., Museums at the Forefront of the History and Philosophy of Geology: History Made, History in the Making: Geological Society of America Special Paper 535.
Erway, R., Schaffner, J. (2007) Shifting Gears: Gearing Up to Get Into the Flow. OCLC Programs and Research. Published online.