(a working document)
2023-10-11/2024-04-17
The internet is a powerful tool for exchanging digital information. But the Internet’s contents change constantly: websites are launched and taken down, webpages change, and content gets archived offline or is lost forever. In other words, on the Internet we are subjected to a permanent now in which knowledge is fleeting.
By design, a web address, or Uniform Resource Locator (URL), points to a specific internet location from which a resource, like a webpage, can be retrieved. However, a URL does not provide a way to verify that a retrieved webpage was the one we asked for. 1
Imagine using a URL-like reference to find a book at a library: instead of locating a book by what it is (e.g., title, author), you refer to a book by its location (e.g., third shelf on the second row next to the window). With this, a book becomes unfindable if moved to another shelf. And, if you do manage to find a book at the referenced location, how would you know you’ve found the book you are looking for?
Instead of pointing to where books are located, librarians point to them using a bibliographic reference. For practical reasons, only a few identifying clues are included in such a reference (e.g., author, year of publication, title, and publisher). So, librarians refer to content by what it is, and knowing where it may be located is secondary.
A bibliographic citation:
Darwin, C. 1859. On the Origin of Species. John Murray.
Thanks to recent advances in mathematics 2, we can add digital fingerprints to bibliographic citations of digital content. A digital fingerprint uniquely describes any digital content (e.g., a webpage, a digital image, a pdf document) by a fixed-length sequence of numbers and letters 3. It is generated by performing a calculation 4 on the content itself. Citations that include a digital fingerprint are also referred to as signed citations 5.
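As a minimal sketch of such a calculation, assuming a Linux/Mac terminal with the standard sha256sum tool available:

# create a small file with some digital content
echo "On the Origin of Species" > content.txt
# compute its digital fingerprint: a fixed-length sequence of numbers and letters;
# the same content always yields the same fingerprint, and any change to the
# content yields a completely different one
sha256sum content.txt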
These digital fingerprints open up a way to automatically verify, with astronomically high certainty, that you got the digital content that you asked for.
On the internet, we’ve learned to say:
“I’d like to get the latest content from this web address.”
, and trust that the retrieved content is what we asked for.
This may work well for a current news website or an internet search engine.
However, for retrieving specific content, like a newspaper article or research paper, we’d like to have a way of saying:
“I’d like to get the content with this digital fingerprint.”
, and verify that the retrieved content is exactly what we asked for.
In asking for content using its unique digital fingerprint, the current Internet location of the content becomes secondary. More importantly, we can use digital fingerprints to refer to content regardless of the digital communication medium that happens to be in fashion now. In other words, digital fingerprints help preserve references to digital content into a future (or past!) beyond the internet.
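As a minimal sketch, assuming a terminal with curl and sha256sum available, asking for content by fingerprint and verifying the answer looks like this (the fingerprint is that of the Hash URI Specification draft mentioned later in this document):

# ask for content by its digital fingerprint...
curl -sL "https://linker.bio/hash://sha256/3fee21854fb6d81573b166c833db2771b21f0c77daa3095aab542764d89c94c1" \
 | sha256sum
# ...and verify: if the printed fingerprint reads 3fee21854fb6d815...,
# the retrieved content is exactly what we asked for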
This is where I’d like to introduce the content-verse as the collection of every single piece of digital content and each associated digital fingerprint. By definition, the content-verse contains all content (or knowledge) ever to be created.
https://linker.bio builds a bridge 7 from the exciting, dynamic internet to its reliable, boring cousin — the content-verse. In this content-verse, digital fingerprints are used as links instead of those pesky, unreliable URLs. Unlike URLs, these digital fingerprints do not break, or expire 8.
Where the internet excels in spreading new information, the content-verse excels at referencing known information.
Through digital fingerprints, linker.bio provides a bridge to access billions of openly available biodiversity data records 9, millions of Open Science publications through Zenodo, over eight hundred thousand datasets via DataOne, billions of open source files via the Software Heritage Library, and more than ninety-seven million freely usable media files from WikiMedia Commons.
The beauty of digital fingerprints is that, fifty years from now, you may use that same fingerprint to find that information, regardless of where it may be located, or how it is stored or transmitted 10.
Cassette tapes have been around since their introduction in the 1960s, but their use has dwindled over time. Similarly, the Internet is expected to give way to some other way to exchange content.
Digital fingerprints are independent of content communication protocols or digital storage media popular today. This is why these fingerprints can refer to digital knowledge inside and beyond the Internet and into the future. So, by adopting fingerprints as digital content identifiers, we can help carry our digital knowledge into the future.
Michael Elliott, José Fortes and Cypress Hansen provided comments that helped improve the description of today’s internet and the benefits of the content-verse.
https://linker.bio/ helps to request information, wherever it may be, using a notation like:
https://linker.bio/[fingerprint][.extension]
The extension is optional.
For instance, to get a copy of a scientific paper, like the signed citations paper listed at the end of this document, you can ask for:
https://linker.bio/hash://sha256/f849c870565f608899f183ca261365dce9c9f1c5441b1c779e0db49df9c2a19d
or, to get a copy of a scientific dataset, like a historical CO2 Record from the Vostok Ice Core, you can ask for:
https://linker.bio/hash://md5/e27c99a7f701dab97b7d09c467acf468
or, perhaps even better, you can also ask for a picture of a 🐇 (Oryctolagus cuniculus) by JM Ligero Loarte -
https://linker.bio/hash://sha1/86fa30f32d9c557ea5d2a768e9c3595d3abb17a2.jpg.
or, to review an initial draft of the Hash URI Specification by Ben Trask -
https://linker.bio/hash://sha256/3fee21854fb6d81573b166c833db2771b21f0c77daa3095aab542764d89c94c1.
So far, https://linker.bio may appear to be a “black box”: you ask for some content by its fingerprint, and linker.bio attempts to retrieve that content.
Now, you may wonder: how does “linker.bio” work? And, how could I build my own “linker.bio”?
linker.bio is powered by Preston. Preston builds a bridge from the content-verse (e.g., a digital fingerprint) to content stored in physical locations. Preston is the little machine that responds when you ask for the picture of the bunny using the URL https://linker.bio/hash://sha1/86fa30f32d9c557ea5d2a768e9c3595d3abb17a2. And, if you know how to run a program on your computer, you can run your own machine (or server) that looks up that bunny picture. In other words, with some effort, you can build your own bridge without having to ask for permission or pay some kind of license fee 11.
For the tech savvy: you can run Preston in server mode on Linux/Mac 12 by executing the following in the terminal:
preston server --remote https://wikimedia.org
or,
preston s --remote https://wikimedia.org
for short.
On starting the server, you’ll see some cryptic messages that end with
[main] INFO org.eclipse.jetty.server.AbstractConnector - Started ServerConnector@76a4d6c{HTTP/1.1, (http/1.1)}{localhost:8080}
[main] INFO org.eclipse.jetty.server.Server - Started @561ms
This means that the Preston server is waiting for requests.
Now, you can visit http://localhost:8080/hash://sha1/86fa30f32d9c557ea5d2a768e9c3595d3abb17a2 to retrieve the bunny picture. On receiving your request, Preston will ask https://wikimedia.org whether it has any content in its https://commons.wikimedia.org/ library with that digital fingerprint. If so, Preston will ask Wikimedia to send that content, and then pass it on to you. The next time you ask for the bunny picture, you’ll receive it much faster, because Preston remembers the content associated with the digital fingerprint and doesn’t have to ask https://wikimedia.org again.
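If you prefer the terminal over a web browser, a minimal sketch using curl and the standard sha1sum tool looks like this (the output filename is illustrative):

# fetch the bunny picture through your local Preston server
curl -sL "http://localhost:8080/hash://sha1/86fa30f32d9c557ea5d2a768e9c3595d3abb17a2" -o bunny.jpg
# verify the digital fingerprint of what you received;
# it should print 86fa30f32d9c557ea5d2a768e9c3595d3abb17a2
sha1sum bunny.jpg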
In addition to Wikimedia Commons, Preston knows how to talk to https://zenodo.org, https://softwareheritage.org, https://dataone.org and … other Preston servers!
So, in the previous example, Preston talked directly to wikimedia.org. In the example below, your Preston server would talk to https://linker.bio instead, and https://linker.bio would relay the request:
preston s --remote https://linker.bio
You can even provide a list of “remotes”. If a list is provided, Preston asks the provided locations in order of appearance. With the example below, Preston would first ask linker.bio; then, if linker.bio doesn’t have the content, it’ll ask wikimedia.org.
preston s --remote https://linker.bio,https://wikimedia.org
So, with this, you can create elaborate combinations of ways to ask for content. One example of such an elaborate setup is a content delivery network that facilitates reliable access to well-known content, as sketched below.
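As a minimal sketch of such a setup, reusing only commands and fingerprints shown earlier in this document (the output filename is an assumption):

# start a local relay that asks linker.bio first, then wikimedia.org
preston s --remote https://linker.bio,https://wikimedia.org

# in another terminal: fetch the Vostok ice core CO2 dataset through the local relay;
# once fetched, the relay remembers it and can serve it even when the remotes are down
curl -sL "http://localhost:8080/hash://md5/e27c99a7f701dab97b7d09c467acf468" -o vostok-co2.txt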
If you’d like to learn more about how to run a Preston server, but don’t know where to start, please send an email to Jorrit or open a GitHub issue.
For more information and background, see:
Elliott, M.J., Poelen, J.H. & Fortes, J.A.B. Signing data citations enables data verification and citation persistence. Sci Data 10, 419 (2023). https://doi.org/10.1038/s41597-023-02230-y hash://sha256/f849c870565f608899f183ca261365dce9c9f1c5441b1c779e0db49df9c2a19d
Imagine studying a pine pest caused by weevils, plant-eating beetles of the superfamily Curculionoidea. In preparation for answering a research question, you may want to understand what is known about them and some of their host pine trees. By combining large versioned corpora compiled using digital fingerprints, you can answer complex questions across disciplines. The examples below show how some related questions span digital collections made available through natural history collections, taxonomic literature, genetic records of plants, and biodiversity literature.
Nimble, yet powerful, data processing tools like Preston and Nomer make it possible to use your laptop (or some powerful computer you have access to) to answer complex and specific interdisciplinary research questions.
| corpus | last updated | size (approx) | version |
|---|---|---|---|
| GIB (GBIF, iDigBio and BioCASe) | 2024-04-01 | ~3TB | hash://sha256/37bdd8… |
| OBIS | 2024-04-01 | ~40GB | hash://sha256/61827c… |
| ChecklistBank | 2024-04-01 | ~30GB | hash://sha256/00989c… |
| Biodiversity Heritage Library | 2024-04-01 | ~300GB | hash://sha256/9afaca… |
| GenBank PLN Division | 2023-06-28 | ~250GB | hash://sha256/efa589… |
| Nomer Corpus of Taxonomic Resources | 2024-03-12 | ~10GB | hash://md5/706450… |
| OpenAlex 13 | 2023-11-01 | ~300GB | hash://sha256/f19011… 14 |
To prepare for addressing more complex research questions, the following basic question may need answering:
Q1. How many specimens of Weevils (plant-eating beetles of the superfamily Curculionoidea) have been recorded globally?
The following steps can help towards answering Q1.
step 1. list GIB corpus content at version hash://sha256/a755...
step 2. print all related biodiversity records
step 3. for each record, select origin and scientific name
step 4. align names with Catalogue of Life as included in Nomer Corpus version hash://sha256/1205...
step 5. count only aligned records that mention "Curculionoidea"
And these steps may be implemented in linux bash using Preston, jq, mlr, Nomer, grep, and pv:
preston cat \
 --no-cache \
 --remote https://linker.bio \
 hash://sha256/a755a6ac881e977bc32f11536672bfb347cf1b7657446a8a699abb639de59419 \
 | grep --after 10 'application/dwca' \
 | grep hasVersion \
 | preston dwc-stream \
 --no-cache \
 --remote https://linker.bio \
 | jq -c '{ "src": .["http://www.w3.org/ns/prov#wasDerivedFrom"], "name": .["http://rs.tdwg.org/dwc/terms/scientificName"] }' \
 | mlr --ijsonl --otsv cat \
 | nomer append col \
 | grep -v NONE \
 | grep Curculionoidea \
 | pv -l
where lines 1-4 implement step 1., lines 5-9 implement step 2., lines 10-11 implement step 3., line 12 implements step 4., and lines 13-15 implement step 5.
The implementation above shows a brute-force way to answer Q1. Resources permitting, optimized workflows can be generated to allow for more responsive and specialized services. So, you can answer your question based on versioned digital corpora, and optimize when needed or possible. One such optimization can be to clone the needed versioned datasets onto a fast solid state drive to reduce network delays, as sketched below. Another could be to configure a search engine to answer a selective kind of user query quickly. And, with the versioned corpus, these uses of datasets of known origin can be built independently of the corpus itself, allowing teams to work independently to improve the use of a well-defined knowledge corpus.
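A minimal sketch of that first optimization, reusing the preston cat command shown above (the mount point /mnt/ssd and the output filename are assumptions):

# copy a versioned dataset, like the Vostok ice core CO2 record,
# onto a fast local solid state drive once...
preston cat \
 --remote https://linker.bio \
 hash://md5/e27c99a7f701dab97b7d09c467acf468 \
 > /mnt/ssd/vostok-co2.txt
# ...then verify the copy (it should print e27c99a7f701dab97b7d09c467acf468)
# and work from local disk, without network delays
md5sum /mnt/ssd/vostok-co2.txt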
Other possible questions include:
Q2. How many distinct species of Weevils, or plant-eating beetles of the superfamily Curculionoidea, have been described in taxonomic literature and checklists?
Q3. How many times do species names of Weevils (plant-eating beetles of the superfamily Curculionoidea) occur in a copy of the transcribed texts made available through the Biodiversity Heritage Library?
Q4. How many genetic sequences for Pinus taeda (loblolly pine) are available through GenBank?
As a way to promote the mobility and usability of digital data, the FAIR principles 15 have gained traction in the science community. In order for data to be FAIR, they have to be “Findable”, “Accessible”, “Interoperable”, and “Reusable.” But what exactly does it mean to be FAIR? Who determines whether data is FAIR?
Thousands of Darwin Core Archives 16 (DwC-A) containing valuable biodiversity data are published by Natural History Collections (e.g., the Field Museum, the Museum of Southwestern Biology), Community Science Initiatives (e.g., iNaturalist, eBird), and Taxonomic Authorities (e.g., Integrated Taxonomic Information System (ITIS), World Register of Marine Species (WoRMS)). To increase their reach, many of these archives are registered with the Global Biodiversity Information Facility (https://gbif.org), Integrated Digitized Biocollections (iDigBio) or Ocean Biodiversity Information System (OBIS).
Since 2018/2019 17, Preston processes have been tracking registered datasets in GBIF, iDigBio, and OBIS. Now, many years later, a wealth of data is available on which archives were registered with networks including, but not limited to, GBIF, iDigBio and OBIS. By sampling monthly, a detailed temporal record is kept of the origin and content of these archives. So, if an archive has left a trace in these registry records, the organization that published the archive can say that their data is FAIR. They are FAIR because the Preston tracking process was able to Find the archive in a registry, Access its associated content, show its Interoperability through its adoption of a recognized standard, DwC-A, and Reuse the archive by keeping versioned copies as proof of registration.
To make it easier to see whether an archive is FAIR according to the methods described above, you can get your FAIR assessment badge using:
https://linker.bio/badge/[your archive DOI/UUID/URL]
For instance, the University of California, Santa Barbara’s Invertebrate Zoology Collection (UCSB-IZC) has registered the location of their archive (i.e., https://ecdysis.org/content/dwca/UCSB-IZC_DwC-A.zip) with iDigBio and GBIF. iDigBio assigned the UCSB-IZC the recordset uuid urn:uuid:65007e62-740c-4302-ba20-260fe68da291, and GBIF assigned both a DOI (i.e., 10.15468/w6hvhv) and a UUID (i.e., urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0).
Now, the FAIRness of the UCSB-IZC archive can be visualized by visiting one of the following locations in a web browser:
https://linker.bio/badge/https://ecdysis.org/content/dwca/UCSB-IZC_DwC-A.zip (by archive location)
https://linker.bio/badge/urn:uuid:65007e62-740c-4302-ba20-260fe68da291 (by iDigBio RecordSet UUID)
https://linker.bio/badge/10.15468/w6hvhv (by GBIF DOI)
https://linker.bio/badge/urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0 (by GBIF Dataset UUID)
https://linker.bio/badge/hash://sha256/f5d8f67c1eca34cbba1abac12f353585c78bb053bc8ce7ee7e7a78846e1bfc4a (by content id)
To make the links direct to the underlying content instead of showing a badge, you can drop the badge/ part. For example, the tracked content associated with the FAIR badge:
https://linker.bio/badge/hash://sha256/f5d8f67c1eca34cbba1abac12f353585c78bb053bc8ce7ee7e7a78846e1bfc4a
can be accessed via:
https://linker.bio/hash://sha256/f5d8f67c1eca34cbba1abac12f353585c78bb053bc8ce7ee7e7a78846e1bfc4a
If an archive reference (by location, UUID, DOI, or content id) is associated with a tracked DwC-A, a download badge is generated for a recently tracked versioned copy of the FAIR archive. If an archive reference could not be resolved in the corpus of tracked biodiversity archives, a 404 unknown archive badge is generated. With this, an independent FAIR assessment badge service is available: the service is independent of the publisher (UCSB-IZC) or the registries (iDigBio, GBIF). These badges may be used by institutions to show off their commitment to FAIRness, or by registries to show that they contribute to the findability of existing data archives.
An example of a FAIR badge for UCSB-IZC is rendered by https://linker.bio/badge/10.15468/w6hvhv.
You can embed this particular badge in a markdown document using a notation like:
[![](https://linker.bio/badge/10.15468/w6hvhv)](https://linker.bio/10.15468/w6hvhv)
or, by including the following HTML fragment in your web page:
<a href="https://linker.bio/10.15468/w6hvhv" target="_blank">
<img src="https://linker.bio/badge/10.15468/w6hvhv"/>
</a>
Note that the corpus of tracked biodiversity datasets used to determine this FAIRness assessment can be cloned, copied, and verified. This means that others can implement FAIR assessment services (or any other kind of service using the biodiversity data archives) on the verifiably exact same tracked corpus as the one that https://linker.bio uses.
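As a minimal sketch of cloning and verifying, assuming your Preston version provides clone and verify subcommands (check preston --help; the exact options may differ):

# clone the tracked corpus from linker.bio into the local working directory...
preston clone --remote https://linker.bio
# ...then check that every copied piece of content still matches its digital fingerprint
preston verify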
If you’d like to learn more about how this service works, please read through the history of the feature, review an associated GBIF forum discussion, or contact the author of this document.
Please note that this FAIR assessment feature was heavily influenced by discussion following the WorldFAIR project report by Trekels et al. 2023 18.