(a working document)
2023-10-11/2024-04-17
The internet is a powerful tool for exchanging digital information. But the Internet’s contents change constantly: websites are launched and taken down, webpages change, and content gets archived offline or is lost forever. In other words, on the Internet we are subjected to a permanent now in which knowledge is fleeting.
By design, a web address, or Uniform Resource Locator (URL), points to a specific internet location from which a resource, like a webpage, can be retrieved. However, a URL does not provide a way to verify that a retrieved webpage was the one we asked for. 1
Imagine using a URL-like reference to find a book at a library: instead of locating a book by what it is (e.g., title, author), you refer to a book by its location (e.g., third shelf on the second row next to the window). With this, a book becomes unfindable if moved to another shelf. And, if you do manage to find a book at the referenced location, how would you know you’ve found the book you are looking for?
Instead of pointing to where books are located, librarians point to them using a bibliographic reference. For practical reasons, only a few identifying clues are included in such a reference (e.g., author, year of publication, title, and publisher). So, librarians refer to content by what it is, and knowing where it may be located is secondary.
A bibliographic citation:
Darwin, C. 1859. On the Origin of Species. John Murray.
Thanks to recent advances in mathematics 2, we can add digital fingerprints to bibliographic citations of digital content. A digital fingerprint uniquely describes any digital content (e.g., a webpage, a digital image, a pdf document) by a fixed-length sequence of numbers and letters 3. It is generated by performing a calculation 4 on the content itself. Citations that include a digital fingerprint are also referred to as signed citations 5.
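As a minimal sketch of such a calculation, assuming a Linux/Mac terminal with the standard sha256sum tool available:

# create a small file with some digital content
echo "On the Origin of Species" > content.txt
# compute its digital fingerprint: a fixed-length sequence of numbers and letters;
# the same content always yields the same fingerprint, and any change to the
# content yields a completely different one
sha256sum content.txt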
These digital fingerprints open up a way to automatically verify, with astronomically high certainty, that you got the digital content that you asked for.
On the internet, we’ve learned to say:
“I’d like to get the latest content from this web address.”
, and trust that the retrieved content is what we asked for.
This may work well for a current news website or an internet search engine.
However, for retrieving specific content, like a newspaper article or research paper, we’d like to have a way of saying:
“I’d like to get the content with this digital fingerprint.”
, and verify that the retrieved content is exactly what we asked for.
In asking for content using its unique digital fingerprint, the current Internet location of the content becomes secondary. More importantly, we can use digital fingerprints to refer to content regardless of the digital communication medium that happens to be in fashion now. In other words, digital fingerprints help preserve references to digital content into a future (or past!) beyond the internet.
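As a minimal sketch, assuming a terminal with curl and sha256sum available, asking for content by fingerprint and verifying the answer looks like this (the fingerprint is that of the Hash URI Specification draft mentioned later in this document):

# ask for content by its digital fingerprint...
curl -sL "https://linker.bio/hash://sha256/3fee21854fb6d81573b166c833db2771b21f0c77daa3095aab542764d89c94c1" \
 | sha256sum
# ...and verify: if the printed fingerprint reads 3fee21854fb6d815...,
# the retrieved content is exactly what we asked for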
This is where I’d like to introduce the content-verse as the collection of every single piece of digital content and each associated digital fingerprint. By definition, the content-verse contains all content (or knowledge) ever to be created.
https://linker.bio builds a bridge 7 from the exciting, dynamic internet to its reliable, boring cousin — the content-verse. In this content-verse, digital fingerprints are used as links instead of those pesky, unreliable URLs. Unlike URLs, these digital fingerprints do not break, or expire 8.
Where the internet excels in spreading new information, the content-verse excels at referencing known information.
Through digital fingerprints, linker.bio provides a bridge to access billions of openly available biodiversity data records 9, millions of Open Science publications through Zenodo, over eight hundred thousand datasets via DataOne, billions of open source files via the Software Heritage Library, and more than ninety-seven million freely usable media files from WikiMedia Commons.
The beauty of digital fingerprints is that, fifty years from now, you may use that same fingerprint to find that information, regardless of where it may be located, or how it is stored or transmitted 10.
Cassette tapes have been around since their introduction in the 1960s, but their use has dwindled over time. Similarly, the Internet is expected to give way to some other way to exchange content.
Digital fingerprints are independent of content communication protocols or digital storage media popular today. This is why these fingerprints can refer to digital knowledge inside and beyond the Internet and into the future. So, by adopting fingerprints as digital content identifiers, we can help carry our digital knowledge into the future.
Michael Elliott, José Fortes and Cypress Hansen provided comments that helped improve the description of today’s internet and the benefits of the content-verse.
https://linker.bio/ helps to request information, wherever it may be, using a notation like:
https://linker.bio/[fingerprint][.extension]
The extension is optional.
For instance, to get a copy of a scientific paper, like the signed citations paper listed at the end of this document, you can ask for:
https://linker.bio/hash://sha256/f849c870565f608899f183ca261365dce9c9f1c5441b1c779e0db49df9c2a19d
or, to get a copy of a scientific dataset, like a historical CO2 Record from the Vostok Ice Core, you can ask for:
https://linker.bio/hash://md5/e27c99a7f701dab97b7d09c467acf468
or, perhaps even better, you can also ask for a picture of a 🐇 (Oryctolagus cuniculus) by JM Ligero Loarte -
https://linker.bio/hash://sha1/86fa30f32d9c557ea5d2a768e9c3595d3abb17a2.jpg.
or, to review an initial draft of the Hash URI Specification by Ben Trask -
https://linker.bio/hash://sha256/3fee21854fb6d81573b166c833db2771b21f0c77daa3095aab542764d89c94c1.
So far, https://linker.bio may appear to be a “black box”: you ask for some content by its fingerprint, and linker.bio attempts to retrieve that content.
Now, you may wonder: how does “linker.bio” work? And, how could I build my own “linker.bio”?
linker.bio is powered by Preston. Preston builds a bridge from the content-verse (e.g., a digital fingerprint) to content stored in physical locations. Preston is the little machine that responds when you ask for the picture of the bunny using the URL https://linker.bio/hash://sha1/86fa30f32d9c557ea5d2a768e9c3595d3abb17a2. And, if you know how to run a program on your computer, you can run your own machine (or server) that looks up that bunny picture. In other words, with some effort, you can build your own bridge without having to ask for permission or pay some kind of license fee 11.
For the tech savvy: you can run Preston in server mode on Linux/Mac 12 by executing the following in the terminal:
preston server --remote https://wikimedia.org
or,
preston s --remote https://wikimedia.org
for short.
On starting the server, you’ll see some cryptic messages that end with
[main] INFO org.eclipse.jetty.server.AbstractConnector - Started ServerConnector@76a4d6c{HTTP/1.1, (http/1.1)}{localhost:8080}
[main] INFO org.eclipse.jetty.server.Server - Started @561ms
This means that the Preston server is waiting for requests.
Now, you can visit http://localhost:8080/hash://sha1/86fa30f32d9c557ea5d2a768e9c3595d3abb17a2 to retrieve the bunny picture. On receiving your request, Preston will ask https://wikimedia.org whether it has any content in its https://commons.wikimedia.org/ library with that digital fingerprint. If so, Preston will ask Wikimedia to send that content, and then pass it on to you. The next time you ask for the bunny picture, you’ll receive it much faster, because Preston remembers the content associated with the digital fingerprint and doesn’t have to ask https://wikimedia.org again.
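If you prefer the terminal over a web browser, a minimal sketch using curl and the standard sha1sum tool looks like this (the output filename is illustrative):

# fetch the bunny picture through your local Preston server
curl -sL "http://localhost:8080/hash://sha1/86fa30f32d9c557ea5d2a768e9c3595d3abb17a2" -o bunny.jpg
# verify the digital fingerprint of what you received;
# it should print 86fa30f32d9c557ea5d2a768e9c3595d3abb17a2
sha1sum bunny.jpg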
In addition to Wikimedia Commons, Preston knows how to talk to https://zenodo.org, https://softwareheritage.org, https://dataone.org and … other Preston servers!
So, in the previous example, Preston talked directly to wikimedia.org. In the example below, your Preston server would talk to https://linker.bio instead, and https://linker.bio would relay the request:
preston s --remote https://linker.bio
You can even provide a list of “remotes”. If a list is provided, Preston asks the provided locations in order of appearance. With the example below, Preston would first ask linker.bio; then, if linker.bio doesn’t have the content, it’ll ask wikimedia.org.
preston s --remote https://linker.bio,https://wikimedia.org
So, with this, you can create elaborate combinations of ways to ask for content. One example of such an elaborate setup is a content delivery network that facilitates reliable access to well-known content, as sketched below.
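As a minimal sketch of such a setup, reusing only commands and fingerprints shown earlier in this document (the output filename is an assumption):

# start a local relay that asks linker.bio first, then wikimedia.org
preston s --remote https://linker.bio,https://wikimedia.org

# in another terminal: fetch the Vostok ice core CO2 dataset through the local relay;
# once fetched, the relay remembers it and can serve it even when the remotes are down
curl -sL "http://localhost:8080/hash://md5/e27c99a7f701dab97b7d09c467acf468" -o vostok-co2.txt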
If you’d like to learn more about how to run a Preston server, but don’t know where to start, please send an email to Jorrit or open a GitHub issue.
For more information and background, see:
Elliott, M.J., Poelen, J.H. & Fortes, J.A.B. Signing data citations enables data verification and citation persistence. Sci Data 10, 419 (2023). https://doi.org/10.1038/s41597-023-02230-y hash://sha256/f849c870565f608899f183ca261365dce9c9f1c5441b1c779e0db49df9c2a19d
Imagine studying a pine pest caused by weevils, plant-eating beetles of the superfamily Curculionoidea. In preparation for answering a research question, you may want to understand what is known about them and some of their host pine trees. By combining large versioned corpora compiled using digital fingerprints, you can answer complex questions across disciplines. The examples below show how some related questions span digital collections made available through natural history collections, taxonomic literature, genetic records of plants, and biodiversity literature.
Nimble, yet powerful, data processing tools like Preston and Nomer make it possible to use your laptop (or some powerful computer you have access to) to answer complex and specific interdisciplinary research questions.
| corpus | last updated | size (approx) | version |
|---|---|---|---|
| GIB (GBIF, iDigBio and BioCASe) | 2024-04-01 | ~3TB | hash://sha256/37bdd8… |
| OBIS | 2024-04-01 | ~40GB | hash://sha256/61827c… |
| ChecklistBank | 2024-04-01 | ~30GB | hash://sha256/00989c… |
| Biodiversity Heritage Library | 2024-04-01 | ~300GB | hash://sha256/9afaca… |
| GenBank PLN Division | 2023-06-28 | ~250GB | hash://sha256/efa589… |
| Nomer Corpus of Taxonomic Resources | 2024-03-12 | ~10GB | hash://md5/706450… |
| OpenAlex 13 | 2023-11-01 | ~300GB | hash://sha256/f19011… 14 |
To prepare for addressing more complex research questions, the following basic question may need answering:
Q1. How many specimens of Weevils (plant-eating beetles of the superfamily Curculionoidea) have been recorded globally?
The following steps can help towards answering Q1.
step 1. list GIB corpus content at version hash://sha256/a755...
step 2. print all related biodiversity records
step 3. for each record, select origin and scientific name
step 4. align names with Catalogue of Life as included in Nomer Corpus version hash://sha256/1205...
step 5. count only aligned records that mention "Curculionoidea"
And these steps may be implemented in linux bash using Preston, jq, mlr, Nomer, grep, and pv:
preston cat \
 --no-cache \
 --remote https://linker.bio \
 hash://sha256/a755a6ac881e977bc32f11536672bfb347cf1b7657446a8a699abb639de59419 \
 | grep --after 10 'application/dwca' \
 | grep hasVersion \
 | preston dwc-stream \
 --no-cache \
 --remote https://linker.bio \
 | jq -c '{ "src": .["http://www.w3.org/ns/prov#wasDerivedFrom"], "name": .["http://rs.tdwg.org/dwc/terms/scientificName"] }' \
 | mlr --ijsonl --otsv cat \
 | nomer append col \
 | grep -v NONE \
 | grep Curculionoidea \
 | pv -l
where lines 1-4 implement step 1., lines 5-9 implement step 2., lines 10-11 implement step 3., line 12 implements step 4., and lines 13-15 implement step 5.
The implementation above shows a brute-force way to answer Q1. Resources permitting, optimized workflows can be generated to allow for more responsive and specialized services. So, you can answer your question based on versioned digital corpora, and optimize when needed or possible. One such optimization can be to clone the needed versioned datasets onto a fast solid state drive to reduce network delays, as sketched below. Another could be to configure a search engine to answer a selective kind of user query quickly. And, with the versioned corpus, these uses of datasets of known origin can be built independently of the corpus itself, allowing teams to work independently to improve the use of a well-defined knowledge corpus.
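A minimal sketch of that first optimization, reusing the preston cat command shown above (the mount point /mnt/ssd and the output filename are assumptions):

# copy a versioned dataset, like the Vostok ice core CO2 record,
# onto a fast local solid state drive once...
preston cat \
 --remote https://linker.bio \
 hash://md5/e27c99a7f701dab97b7d09c467acf468 \
 > /mnt/ssd/vostok-co2.txt
# ...then verify the copy (it should print e27c99a7f701dab97b7d09c467acf468)
# and work from local disk, without network delays
md5sum /mnt/ssd/vostok-co2.txt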
Other possible questions include:
Q2. How many distinct species of Weevils, or plant-eating beetles of the superfamily Curculionoidea, have been described in taxonomic literature and checklists?
Q3. How many times do species names of Weevils (plant-eating beetles of the superfamily Curculionoidea) occur in a copy of the transcribed texts made available through the Biodiversity Heritage Library?
Q4. How many genetic sequences for Pinus taeda (loblolly pine) are available through GenBank?
As a way to promote the mobility and usability of digital data, the FAIR principles 15 have gained traction in the science community. In order for data to be FAIR, they have to be “Findable”, “Accessible”, “Interoperable”, and “Reusable.” But what exactly does it mean to be FAIR? Who determines whether data is FAIR?
Thousands of Darwin Core Archives 16 (DwC-A) containing valuable biodiversity data are published by Natural History Collections (e.g., the Field Museum, the Museum of Southwestern Biology), Community Science Initiatives (e.g., iNaturalist, eBird), and Taxonomic Authorities (e.g., Integrated Taxonomic Information System (ITIS), World Register of Marine Species (WoRMS)). To increase their reach, many of these archives are registered with the Global Biodiversity Information Facility (https://gbif.org), Integrated Digitized Biocollections (iDigBio) or Ocean Biodiversity Information System (OBIS).
Since 2018/2019 17, Preston processes have been tracking registered datasets in GBIF, iDigBio, and OBIS. Now, many years later, a wealth of data is available on which archives were registered with networks including, but not limited to, GBIF, iDigBio and OBIS. By sampling monthly, a detailed temporal record is kept of the origin and content of these archives. So, if an archive has left a trace in these registry records, the organization that published the archive can say that their data is FAIR. They are FAIR because the Preston tracking process was able to Find the archive in a registry, Access its associated content, show its Interoperability through its adoption of a recognized standard, DwC-A, and Reuse the archive by keeping versioned copies as proof of registration.
To make it easier to see whether an archive is FAIR according to the methods described above, you can get your FAIR assessment badge using:
https://linker.bio/badge/[your archive DOI/UUID/URL]
For instance, the University of California, Santa Barbara’s Invertebrate Zoology Collection (UCSB-IZC) has registered the location of their archive (i.e., https://ecdysis.org/content/dwca/UCSB-IZC_DwC-A.zip) with iDigBio and GBIF. iDigBio assigned the UCSB-IZC the recordset uuid urn:uuid:65007e62-740c-4302-ba20-260fe68da291, and GBIF assigned both a DOI (i.e., 10.15468/w6hvhv) and a UUID (i.e., urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0).
Now, the FAIRness of the UCSB-IZC archive can be visualized by visiting one of the following locations in a web browser:
https://linker.bio/badge/https://ecdysis.org/content/dwca/UCSB-IZC_DwC-A.zip (by archive location)
https://linker.bio/badge/urn:uuid:65007e62-740c-4302-ba20-260fe68da291 (by iDigBio RecordSet UUID)
https://linker.bio/badge/10.15468/w6hvhv (by GBIF DOI)
https://linker.bio/badge/urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0 (by GBIF Dataset UUID)
https://linker.bio/badge/hash://sha256/f5d8f67c1eca34cbba1abac12f353585c78bb053bc8ce7ee7e7a78846e1bfc4a (by content id)
To make the links direct to the underlying content instead of showing a badge, you can drop the badge/ part. For example, the tracked content associated with the FAIR badge:
https://linker.bio/badge/hash://sha256/f5d8f67c1eca34cbba1abac12f353585c78bb053bc8ce7ee7e7a78846e1bfc4a
can be accessed via:
https://linker.bio/hash://sha256/f5d8f67c1eca34cbba1abac12f353585c78bb053bc8ce7ee7e7a78846e1bfc4a
If an archive reference (by location, UUID, DOI, or content id) is associated with a tracked DwC-A, a download badge is generated for a recently tracked versioned copy of the FAIR archive. If an archive reference could not be resolved in the corpus of tracked biodiversity archives, a 404 unknown archive badge is generated. With this, an independent FAIR assessment badge service is available: the service is independent of the publisher (UCSB-IZC) or the registries (iDigBio, GBIF). These badges may be used by institutions to show off their commitment to FAIRness, or by registries to show that they contribute to the findability of existing data archives.
An example of a FAIR badge for UCSB-IZC is rendered by https://linker.bio/badge/10.15468/w6hvhv.
You can embed this particular badge in a markdown document using a notation like:
[![](https://linker.bio/badge/10.15468/w6hvhv)](https://linker.bio/10.15468/w6hvhv)
or, by including the following HTML fragment in your web page:
<a href="https://linker.bio/10.15468/w6hvhv" target="_blank">
<img src="https://linker.bio/badge/10.15468/w6hvhv"/>
</a>
Note that the corpus of tracked biodiversity datasets used to determine this FAIRness assessment can be cloned, copied, and verified. This means that others can implement FAIR assessment services (or any other kind of service using the biodiversity data archives) on the verifiably exact same tracked corpus as the one that https://linker.bio uses.
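As a minimal sketch of cloning and verifying, assuming your Preston version provides clone and verify subcommands (check preston --help; the exact options may differ):

# clone the tracked corpus from linker.bio into the local working directory...
preston clone --remote https://linker.bio
# ...then check that every copied piece of content still matches its digital fingerprint
preston verify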
If you’d like to learn more about how this service works, please read through the history of the feature, review an associated GBIF forum discussion, or contact the author of this document.
Please note that this FAIR assessment feature was heavily influenced by discussion following the WorldFAIR project report by Trekels et al. 2023 18.