Digitizing for Preservation vs Access

Something I’ve thought a lot about ever since my intro classes last year is the issue of digitizing for preservation vs digitizing for access – digitizing with individual care to replicate resources in the highest possible quality, or digitizing quickly to make sure that lower-quality versions of resources are available to the public as soon as possible. It’s a complicated question, and the answer varies depending on the purpose of the collection; archivists might digitize the Mona Lisa for preservation, placing emphasis on the best replication of colors and marks, but they might digitize written documents from Da Vinci for researcher access, as their value comes primarily from their contents, not their artistic execution. A low-quality digital double is not an archival-quality facsimile, just as a black and white scan on a copier is a poor replication of the pages of the original book; some grainy 8.5×11 copies serve their purpose just fine, but a copy of a book with color illustrations would lose much of its nuance and value.

The preservation vs access debate is ongoing in digital archive circles. In their article Digitization as a Preservation Strategy, Krystyna Matusiak and Tamara Johnson explain that cultural heritage digitization started purely as a strategy for access, a way of creating easily distributable copies of original resources, and that the movement to include digitization in resource preservation plans is relatively new and controversial. According to their assessment, however, digitization for preservation is becoming more and more popular, particularly for endangered materials like photograph negatives and audio recordings. Instinctively, I’m inclined to agree without reservations; I’m a millennial digital hoarder who likes to buy their movies in 1080p, and lower-quality digitization pains me. It could be because I grew up in the age of the internet, where storage seems endless and everything lasts forever (especially when you don’t want it to), but I err on the side of new digitization thought – that digitization can be part of a long-term preservation plan for analog visual archives.

However, I’m learning to spot when I need to temper that instinct. In Thirteen Ways of Looking At … Digital Preservation, Brian Lavoie and Lorcan Dempsey discuss both preservation and access. In their section presenting digital preservation as “a selection process,” they remind us that digitization can be expensive, and that it is advantageous for leaders of digitization efforts to think ahead regarding which objects they plan to digitize for preservation, and which they plan to digitize for access. While digitizing en masse and sorting later may sound appealing, particularly when unconcerned with storage space limits, they note that “saving is not preserving” – digitizing an entire collection at the highest possible quality is rarely affordable or a good use of time, but digitizing an entire collection at a lower quality and sorting back through for preservation would require re-digitizing selected resources, which is not an effective use of money or time, either.

This all comes back to stuff I’ve been chewing on thanks to other classes dealing with archives – namely, how do we as archivists actually know what people in the future will need high-quality digitizations of? The Mona Lisa is a pretty safe bet, but what if researchers in the future need high-resolution scans of Da Vinci’s ‘boring’ handwritten records to study his penmanship? It’s impossible to anticipate every possible need, which Lavoie and Dempsey acknowledge when they refer to digital preservation as “an ongoing activity,” one that, with changing file formats and digitization technologies, is never truly complete. All archivists can do at this stage of the evolution of digitization technologies is make educated guesses based on current research patterns. Perhaps, as technologies evolve further, someone will create a miraculous workflow that allows for high-quality, time-sensitive digitization, and all of these problems will be solved – assuming, of course, that the technology is open source. Fingers crossed.

Lavoie, B. and Dempsey, L. (2004). Thirteen ways of looking at … digital preservation. D-Lib Magazine, 10(7-8). http://www.dlib.org/dlib/july04/lavoie/07lavoie.html

Matusiak, K. and Johnson, T. (2014). Digitization as a preservation strategy. The American Archivist, 77(1), 241-269. http://www.unesco.org/new/fileadmin/MULTIMEDIA/HQ/CI/CI/pdf/mow/VC_Matusiak_Johnston_28_B_1400.pdf


Best Practices for Metadata Aggregation

If it’s not already obvious, I love metadata. I love it. Someday, when I don’t have to worry as much about job security, I’m going to get ‘META DATA’ tattooed on my knuckles. This is the kind of nerd that I am.

I especially love metadata as a means of facilitating aggregation. This class, however, is the first time I’ve had the chance to think critically about what creating metadata to facilitate sharing actually means; I knew in a vague, general sense that metadata for aggregation couldn’t look the same as in-house metadata, but Shreeves, Riley, and Milewicz’s article Moving Towards Shareable Metadata lays out the changes necessary in a way I hadn’t encountered before. They use “the six C’s of shareable metadata” – content, consistency, coherence, context, communication, and conformance to standards – to remind digital metadata librarians what they need to do to make sure their in-house metadata translates well to aggregated databases. I thought a lot about my own metadata schema for the digital library project as I was reading; some of my elements, like titles and rights, would transfer well to an aggregated database, but a little in-house corner that I cut to save time – not putting ‘pixels’ after the dimensions of images to clarify the unit of measurement – would translate terribly to an aggregated database pulling from other collections that used inches or centimeters to measure dimensions. My photos aren’t 6000 inches by 4000 inches, fortunately for my Google Drive storage.
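To make the dimensions example concrete, here is a minimal sketch of recording the unit alongside the numbers so the value still makes sense outside its home collection. The Dimensions class, its field names, and the output format are hypothetical illustrations, not part of any schema or of the article discussed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimensions:
    """Hypothetical record of an object's size with an explicit unit."""
    width: int
    height: int
    unit: str  # for example "pixels", "inches", or "centimeters"

    def to_metadata_value(self) -> str:
        # Always write the unit out, so an aggregator never has to guess it.
        return f"{self.width} x {self.height} {self.unit}"


photo = Dimensions(width=6000, height=4000, unit="pixels")
print(photo.to_metadata_value())
```

The in-house shortcut would have stored only "6000 x 4000"; spelling out the unit is the small change that keeps the value from being misread next to records measured in inches.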

Several articles we’ve read have mentioned the Open Archives Initiative Protocol for Metadata Harvesting – the OAI-PMH – and I was curious what it actually looked like. I tracked down version 2.0 of the protocol, which is freely available at the Open Archives Initiative’s website. The document is huge, with thorough definitions of the terms it uses, paragraphs to provide context, and endless boxes of XML examples. I would need a seasoned guide and/or a few uninterrupted days of research to fully understand everything, but from what I can gather right now, the OAI-PMH provides options for selective harvesting, letting harvesters limit a request by datestamp range and by set membership; if metadata librarians at individual institutions strictly follow Dublin Core, or whatever other schemas they use with their collections, and group their records into meaningful sets, harvesters using the OAI-PMH would be able to reliably harvest specific data sets from those collections, based on the purpose of the aggregation.
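For a sense of what a selective harvesting request looks like in practice, here is a minimal sketch that only builds the request URL. The ListRecords verb and the metadataPrefix, set, from, and until arguments come from the OAI-PMH 2.0 specification; the function name, the example.org base URL, and the "photographs" set are assumptions made up for illustration.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           set_spec=None, from_date=None, until_date=None):
    """Build an OAI-PMH ListRecords URL with optional selective-harvesting arguments."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    # The optional arguments narrow the harvest to one set or a datestamp range.
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository and set name, for illustration only.
print(build_list_records_url("https://example.org/oai",
                             set_spec="photographs",
                             from_date="2020-01-01"))
```

A real harvester would also need to send the request, parse the XML response, and follow resumption tokens for long result lists, all of which this sketch leaves out.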

Ultimately, a good portion of the responsibility in developing metadata for aggregation still rests on metadata librarians. Many of Shreeves, Riley, and Milewicz’s six C’s apply to in-house metadata as well as shareable metadata, particularly content and consistency, but preparing objects for aggregation requires coordination with other repositories, many of which might use the same Dublin Core elements in different ways to suit their individual collections. Librarians should develop their in-house metadata to make the work of aggregators as easy as possible; in addition to the OAI-PMH, the Open Archives Initiative developed a set of best practices to facilitate the use of the OAI-PMH. The best practices were published in 2006, so they were difficult to track down in their original form, but I found a summary. The document starts with a list of “seven characteristics of quality metadata” – completeness, accuracy, provenance, conformance to expectations, logical consistency and coherence, timeliness, and accessibility – and an additional five characteristics of good shareable metadata – proper context, content coherence, use of standard vocabularies, consistency, and technical conformance – that largely overlap with Shreeves, Riley, and Milewicz’s six C’s. It goes on to offer best practices for a long list of metadata concerns, including recommendations to help record things like names, dates, and geographic locations in ways that facilitate aggregation.

I was especially curious about geographic locations – I’ve never built a schema involving geographic data, so I had no idea going in what a standardized method of recording geographic places would look like. The section starts by emphasizing the importance of “explicitly us[ing] relevant controlled vocabularies,” which seems like good universal shareable metadata advice, and of including information in the presentation of the element that specifies which controlled vocabulary is in use. The document then includes links to several geographic controlled vocabularies, which I didn’t realize even existed before this semester, and which are very cool. The Dublin Core even suggests a “point encoding scheme,” which uses geographic coordinates to locate points in space; not immediately intelligible for most human browsers, but good for machine reading, and possible to interpret visually with mapping tools.
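As a rough illustration of why the point encoding is easy for machines to read, here is a minimal sketch that parses a value written in the DCMI Point style, where components such as east, north, and name are given as key=value pairs separated by semicolons. The function name is hypothetical, the coordinates are only an approximate location for the Louvre, and the sketch ignores the optional units and projection components.

```python
def parse_dcmi_point(value: str) -> dict:
    """Parse a DCMI Point style string into a dictionary of its components."""
    components = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing semicolon
        key, separator, raw = part.partition("=")
        if not separator:
            raise ValueError(f"Expected key=value, got: {part!r}")
        components[key.strip()] = raw.strip()
    # Convert the numeric components so mapping tools can use them directly.
    for axis in ("east", "north", "elevation"):
        if axis in components:
            components[axis] = float(components[axis])
    return components


# Approximate, illustrative coordinates (longitude as east, latitude as north).
point = parse_dcmi_point("east=2.3376; north=48.8606; name=Louvre")
print(point["name"], point["north"], point["east"])
```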

What I gathered from all of this OAI digging was that Shreeves, Riley, and Milewicz knew what they were talking about when they said that “the commitment to make the necessary changes” in metadata development for shareability would be one of the most important factors in clearing the path for metadata librarians and harvesters to move forward into an age of streamlined digital aggregation. As someone who would like to work in metadata in some capacity, I think it’s important for me to keep shareability in mind as I dig deeper into the specifics of the profession – LIS jobs are constantly evolving, and it’s our job as emerging professionals to guide that evolution in ways that will benefit the science and the public.

Best practices for shareable metadata. (2005). The Open Archives Initiative. http://webservices.itcs.umich.edu/mediawiki/oaibp/sites/oaibp/uploads/f/f4/ShareableMetadataBestPractices.doc

Lagoze, C. (2002). Open archives initiative – protocol for metadata harvesting – 2.0. The Open Archives Initiative. https://www.openarchives.org/OAI/openarchivesprotocol.html

Shreeves, S., Riley, J. and Milewicz, L. (2006). Moving towards shareable metadata. First Monday, 11(8). https://firstmonday.org/ojs/index.php/fm/article/view/1386/1304

Fandom, Copyright, and Digital Archives

When I first ventured into the fledgling world of mid-2000s social media, one of the forces that shaped my experience – and that has come up in my LIS classes many, many more times than I expected it to – was online fandom culture. I was fourteen when I started my LiveJournal, and I probably read more fanfiction than original work during my high school years. I’m … not going to comment on whether or not that trend has continued into my adult life.

What definitely has continued into my adult life is an interest in the phenomenon of fandom: its evolution through history, its implications as a broader social model, and the ethics of fanwork produced and shared without the explicit blessing of the creators of the source material from which those works derive. That third thing is the one that tends to come up as part of my LIS studies, and it’s the first thing I think of whenever anyone mentions copyright and intellectual property. Dedicated fanwork hosting websites like Archive of Our Own are digital libraries in their own right, sourcing their content from thousands of creators, maintaining searchable databases through custom metadata schemas, and doing their best to protect the legal rights of their contributors in the rocky terrain of derivative work.

One of the most interesting pieces of scholarship I’ve come across about fandom is Brittany Johnson’s piece about fanfiction and copyright law titled Live Long and Prosper. In addition to its excellent title, the article comes from a legal scholar who can speak with authority on the implications of fanwork. What used to be a purely nonprofit enterprise has become more complicated with the phenomenon of ‘pulling to publish,’ when works like Fifty Shades of Grey are plucked from their fandom roots by publishers and turned into works of ‘original’ fiction. This upsets the plausible deniability that has protected fandom in the past – that fanwork was fair use because it was not and would never be a source of profit, much less a source of profit that would pull income from the original title. Fair use is weighed on four factors – the purpose and character of the new work, the nature of the copyrighted work, the amount of the copyrighted work used in the new work, and the impact of the new work on the potential market of the copyrighted work. Some fanworks borrow more heavily from source material than others, but the precedent of pulling to publish places even the most derivative of derivative works in a tenuous position if the copyright holder of the source material decides they’re not happy with something they perceive as a threat to their own revenue.

The fandom conversation came to mind as I read Peter Hirtle’s Learning to Live with Risk and its discussion of copyright in libraries and archives. Hirtle discusses institutions that are “unintentionally violat[ing] copyright law with the best intentions” and that distribute intellectual property in ways that may not be protected under the banner of fair use. As applied to fandom, while the hosting archive has full permission from the creator of the fanwork, it might run into trouble with the copyright holder of the source material from which the fanwork derives, which puts it in a similar position to the institutions Hirtle discusses. Fandom spheres are the ones in which I have the most personal experience, but nontraditional archives, particularly digital nontraditional archives, will likely have a slew of similar problems. If a library holds digitizations of an artist’s collage pieces, for example, is it beholden to the original copyright holders of the assets used in each collage? It’s difficult to anticipate when copyright holders will decide to crack down; some unexpected creators have swept through fandom spaces to purge fanwork of their intellectual property for a number of dubious reasons.

Johnson proposes an amendment to existing copyright law that would protect derivative works, as long as the hosts of those works (digital fandom libraries) had contributors meet non-commercialization and attribution standards agreed upon by the hosts and the copyright holders of the source material. It would be a huge shift in fandom culture, perhaps even more so than the shift already happening around pulling to publish; fandom has always been a clandestine thing, a space primarily occupied by unpaid creative women sharing their work amongst themselves, but as it becomes a more mainstream phenomenon, the way that creators (particularly female creators) occupy those spaces is changing. An amendment to copyright law that would protect both creators and the hosts of their work might be necessary as fandom moves into its new digital era. No matter what happens, however, the owners and curators of digital fandom archives will be in a unique position as the bridge between fandom creators and copyright holders.

Hirtle, P. (2012). Learning to live with risk. Art Libraries Journal, 37(2), 1-15. http://ecommons.cornell.edu/bitstream/1813/24519/2/ARLIS%20UK%20final.pdf

Johnson, B. (2016). Live long and prosper: How the persistent and increasing popularity of fan fiction requires a new solution in copyright law. Minnesota Law Review, (4), 1645. http://www.minnesotalawreview.org/wp-content/uploads/2016/04/Johnson_ONLINEPDF.pdf