How P2P and blockchains make it easier to work with scientific objects – three hypotheses
The financial industry, medical records, the Internet of Things – millions are currently being invested in blockchain research, financed from numerous private and public sources. The technology underlying the cryptocurrency bitcoin promises to resolve a number of major problems using a decentralised approach. In this blog post, I show that blockchains and P2P (peer-to-peer) systems also represent a unique opportunity for digital libraries to meet the challenge of Big Data. They could make it considerably easier for users to identify, describe, transport, use, edit and version scientific data on the web.
Those who already know all about blockchains, content-addressed storage, GitHub and DOI can skim over the sections preceding the three hypotheses.
The internet is broken – it must become more decentralised
The internet as we currently use it has many bottlenecks that make it susceptible to breakdowns and – more or less subtle – attacks. In a 2015 appeal that is well worth reading, Brewster Kahle, founder of the Internet Archive, explained why we need a decentralised web. In the following year, a number of internet celebrities and pioneers called for a rethink of the fundamental structures of the web at the Decentralized Web Summit of the Internet Archive. In this blog post, I would like to show that work with digital scientific objects can also benefit from decentralised approaches, which we are familiar with from P2P networks and cryptocurrencies such as bitcoin.
When a file system becomes “interplanetary” …
In 2014, Juan Benet outlined a shared, versioned, content-addressable file system under the immodest name of the Interplanetary File System (IPFS). For files and structured collections of files (which I often refer to in short as objects in the following), IPFS automatically creates a cryptographic hash: a checksum that is, for practical purposes, unforgeable. The hash changes as soon as a file has been altered. Objects can be called up universally using this hash. “Universal” is the decisive term here: once they have been made available in IPFS, objects can be used by all other computers where IPFS has been installed. (To exchange the data, IPFS uses a BitTorrent-inspired protocol in the background.)
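The principle of content addressing can be tried out with a few lines of shell. This is only a minimal sketch, assuming `sha256sum` from GNU coreutils is available; the helper name `content_id` is made up for illustration, and real IPFS identifiers are derived and encoded differently from a plain SHA-256 digest.

```shell
#!/bin/sh
# content_id: print the SHA-256 digest of whatever arrives on standard input.
# Hypothetical helper for illustration; not how IPFS encodes its identifiers.
content_id() {
  sha256sum | cut -d' ' -f1
}

original=$(printf 'measurement: 42\n' | content_id)
same=$(printf 'measurement: 42\n' | content_id)
altered=$(printf 'measurement: 43\n' | content_id)

echo "original: $original"
echo "same:     $same"
echo "altered:  $altered"
```

Identical content always yields the identical name, no matter who computes it or where, while a single changed character yields a completely different one.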
One example of how IPFS works
If I need the graphic that can be accessed at https://www.tib.eu/typo3conf/ext/tib_tmpl_bootstrap/Resources/Public/images/TIB_Logo_en.png, I must hope that the tib.eu server is indeed online – and that it always provides me with exactly the same image at the specified URL. Maybe I’m in luck and there’s another copy online, e.g. https://upload.wikimedia.org/wikipedia/commons/f/f7/Logo_of_the_German_National_Library_of_Science_and_Technology%2C_or_Technische_Informationsbibliothek_%28TIB%29%2C_in_English_language.png – but I cannot tell at a glance whether it really is exactly the same image. In contrast, the same image at https://ipfs.io/ipfs/QmWngwr7TzDJujWKXD282tmpnjZuD7d5JutVaYuoZ9Xq1t will never change – I only need to know the hash value, i.e. the part of the URL that comes after “ipfs.io/ipfs/”. Incidentally, the server ipfs.io is just an aid to explain this example – thanks to IPFS, I don’t actually need a central server at all. Once IPFS has been installed on my computer, http://localhost:8080/ipfs/QmWngwr7TzDJujWKXD282tmpnjZuD7d5JutVaYuoZ9Xq1t suffices (i.e. the same object name, but with “localhost” instead of the host name of an external server), and I am able to view the file, at least as long as some computer that makes this file available is connected to the IPFS network. An interesting side effect: files that you would otherwise copy again and again, even on the same computer, because you want to use them in different contexts then exist once and only once. They are not only impossible to confuse, but also take up less storage space and bandwidth.
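The fact that the object name stays the same while the host is interchangeable can be written down as a small shell sketch. The helper `gateway_url` is a made-up name for illustration; the two hosts and the hash are the ones from the example above, and the commented `curl` call assumes a local IPFS node that listens on port 8080.

```shell
#!/bin/sh
# gateway_url: build the HTTP address of an IPFS object on a given gateway.
# Hypothetical helper; arguments are the gateway base URL and the object hash.
gateway_url() {
  printf '%s/ipfs/%s\n' "$1" "$2"
}

logo="QmWngwr7TzDJujWKXD282tmpnjZuD7d5JutVaYuoZ9Xq1t"   # hash of the TIB logo from the example

public_url=$(gateway_url "https://ipfs.io" "$logo")
local_url=$(gateway_url "http://localhost:8080" "$logo")

echo "$public_url"
echo "$local_url"

# With a local IPFS node running, the file could then be fetched with, e.g.:
#   curl -o TIB_Logo_en.png "$local_url"
```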
Content-addressable objects – nothing new
If an object changes in Git, even only minimally, then its hash value also changes, which makes this an ideal concept on which to build version management. The idea of content-addressable objects became popular at the latest when the commercial Git service provider GitHub arrived on the scene. GitHub is a platform where source code and other data can be stored, versioned and edited collaboratively. With regard to these functions, GitHub can be both a blessing and a curse. A blessing because it has popularised the public provision of data by offering particularly convenient ways to collaborate, and to enable third parties to look on and possibly even join in. A curse because, as a commercially successful company, it does not exactly alleviate the problems mentioned above of a highly centralised internet.
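How Git derives such a hash for a single file can be reproduced without Git itself. The sketch below assumes Git's classic SHA-1 object format and `sha1sum` from GNU coreutils; the function name `git_blob_id` is made up for illustration.

```shell
#!/bin/sh
# git_blob_id: compute the ID Git assigns to a file's content (SHA-1 object format).
# Git hashes the header "blob <size in bytes>" plus a NUL byte, followed by the content.
git_blob_id() {
  blob_file=$1
  blob_size=$(wc -c < "$blob_file" | tr -d ' ')
  { printf 'blob %s\0' "$blob_size"; cat "$blob_file"; } | sha1sum | cut -d' ' -f1
}

tmp=$(mktemp)

printf 'hello\n' > "$tmp"
v1=$(git_blob_id "$tmp")

printf 'hello!\n' > "$tmp"    # a one-character change
v2=$(git_blob_id "$tmp")

echo "version 1: $v1"
echo "version 2: $v2"
rm -f "$tmp"
```

If Git is installed, `git hash-object <file>` should print the same value for the same file, as long as the SHA-1 format is in use.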
One example of collaboration in GitHub: pull requests
Version management systems such as Git have made it easier to keep a public archive of my files up-to-date. Just one command (“push”), and the new variant of a file I’ve just edited is stored in a public project archive that I’ve set up (“repository”). In the process, the old version of the file is not overwritten. Instead, the most recent version is automatically saved next to the old version, and versioning information is added. It is then easy for third parties to tell who changed what in the file, and when. If I want, I can also retrieve a copy of an external repository. Once I have revised the copy, I can ask the creator whether they want to incorporate these changes into their “original”: a “pull request”. The ability to execute such functions conveniently, so that even third parties can follow them easily in the browser, is one of the main strengths of the GitHub platform.
Collaborative, transparent “accounting” without all parties having to trust a central authority – is it possible?
Users must have a certain degree of trust in the operators of the platform to perform the above-mentioned functions offered by GitHub. When I copy files into a repository, it should be clear for all to see that these files were created by me, and not someone else. And it should be possible to retrieve the actual repository on a long-term basis under its own name, regardless of whether, and how, its content has changed in the meantime. Reliably recording such transactions by many users so that every single change can be traced by all at any time – until now, this has always meant that users have to trust some central entity or other.
Bitcoin entered the scene in 2008, claiming to solve problems of this kind in a completely different way. The key element here is the “blockchain”, the public ledger of this virtual currency, in which all payment transactions are irreversibly and clearly documented for all to see. Everyone should be able to register a transaction for inclusion in the blockchain, but the participants in the system must first reach agreement on whether, and when, the transaction will be included in the blockchain. What’s special about the blockchain approach is that such a decision cannot simply be intentionally prevented or influenced by malicious participants. (To see how this can be prevented in the case of the blockchain by a “proof of work” or a “proof of stake”, refer to the Wikipedia article on distributed consensus.)
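Why entries in such a ledger are so hard to alter afterwards can be illustrated with a toy hash chain. This is a deliberately simplified sketch, assuming `sha256sum` from GNU coreutils: it leaves out everything that makes a real blockchain work in a network (signatures, consensus, proof of work), and the function name `chain_tip` and the sample transactions are invented for illustration.

```shell
#!/bin/sh
# chain_tip: read transactions line by line from stdin and print the final block hash.
# Each block hash is SHA-256 over "<previous hash>:<transaction>", so every
# hash depends on the entire history before it.
chain_tip() {
  prev="genesis"
  while IFS= read -r line; do
    prev=$(printf '%s:%s' "$prev" "$line" | sha256sum | cut -d' ' -f1)
  done
  printf '%s\n' "$prev"
}

tip=$(printf '%s\n' 'alice pays bob 5' 'bob pays carol 2' 'carol pays dave 1' | chain_tip)

# Same ledger, but the first (oldest) entry has been manipulated.
tip_tampered=$(printf '%s\n' 'alice pays bob 50' 'bob pays carol 2' 'carol pays dave 1' | chain_tip)

echo "honest tip:   $tip"
echo "tampered tip: $tip_tampered"
```

Anyone who knows the most recent hash can detect a change to any earlier entry, because the change propagates through all subsequent hashes.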
Blockchains for special tasks: Namecoin & Co.
In 2011, i.e. three years after Satoshi Nakamoto had published the idea of bitcoin, the first fork of bitcoin was launched: Namecoin, which runs a blockchain of its own. Unlike with bitcoin, the primary purpose of the blockchain in this case is not to establish another cryptocurrency. Instead, the datasets entered in the blockchain should be able to contain additional useful information, which is irreversibly recorded in this way. The developers of the IPFS presented above also want to use Namecoin to ensure that, one day, files and objects can be named by anyone.
Whereas the allocation of a host name requires the centralised “Domain Name System” (DNS), and with GitHub, for example, the allocation of a repository name requires the user to conclude a contract with the operators of this platform, with blockchains this could all take place without a central entity. A blockchain could be used to record, for example, who originally created a file, and who subsequently made changes to it.
Timestamps – one example of a service based on blockchains
Let’s take a look at the example of the TIB logo (the example from above): using a blockchain-based timestamp service such as Originstamp, you can check whether the image file already existed at https://www.tib.eu/typo3conf/ext/tib_tmpl_bootstrap/Resources/Public/images/TIB_Logo_en.png on 2 May 2017. At https://app.originstamp.org/s/c18158e5fefadb4ff24318bf9fdd41e446b2939f09bec06dbf56821f5aeb72ce, the service provides tamper-proof evidence that you could also check without this service, i.e. without having to trust a central entity. A remarkable detail: you only need to check the hash value. The object itself does not have to be published, similar to a notarial attestation. I could also prove that the document was in my possession at this point in time by adding an appropriate remark before the timestamp process.
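Checking a file against a previously timestamped hash value comes down to a comparison like the one sketched below. The sketch assumes that the timestamped value is the plain SHA-256 digest of the file's bytes and that `sha256sum` from GNU coreutils is available; the function name `matches_stamp` and the sample document are invented, and the step of looking up the hash in a blockchain is not shown.

```shell
#!/bin/sh
# matches_stamp: succeed if the file (first argument) has the SHA-256 digest
# given as second argument.
# Assumption: the timestamped value is the plain SHA-256 of the file's bytes.
matches_stamp() {
  actual=$(sha256sum "$1" | cut -d' ' -f1)
  [ "$actual" = "$2" ]
}

# Stand-in for a document whose hash was timestamped earlier.
doc=$(mktemp)
printf 'lab notebook, page 1\n' > "$doc"
stamped=$(sha256sum "$doc" | cut -d' ' -f1)   # only this value would be submitted

if matches_stamp "$doc" "$stamped"; then status_before="unchanged"; else status_before="modified"; fi

printf 'a later edit\n' >> "$doc"
if matches_stamp "$doc" "$stamped"; then status_after="unchanged"; else status_after="modified"; fi

echo "before edit: $status_before"
echo "after edit:  $status_after"
rm -f "$doc"
```

The document itself never has to leave my computer; only the 64-character hash value is handed over to the timestamp service.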
Transferred to a scenario related to research infrastructure: a measuring instrument for research purposes that links up to the net could obtain a timestamp for all of the data recorded in a certain timeframe. (An occasional, slow exchange of a few datasets with an – easily replaceable – blockchain-based timestamp service would suffice; it would even be a realistic scenario for a sensor far from civilisation.) This way, it would be possible to prove beyond doubt, also in retrospect, which data was created when by which sensor – a step that could help support the reproducible collection and processing of research data.
Allocating a ‘name’ to digital scientific objects using a DOI
The “Digital Object Identifier” (DOI) is a widely used system for allocating persistent names to digital scientific objects. What’s special about DOI is that the digital archive (which can be a scientific publisher or a library, for example) takes care of allocating names, and simultaneously assures that the object can be accessed long-term under this name. If the URL changes, the archive updates the relevant DOI dataset which, incidentally, may also contain further useful metadata. Major consortia behind DOI, such as Crossref and DataCite, ensure that this system has functioned well for various object types for many years.
In recent years, these two consortia have increasingly drawn attention to themselves with new, beneficial value-added services, which are offered in a more or less modular fashion alongside the “core business” of allocating DOI names. One example is Crossref DOI Event Data, a service that records mentions of DOIs in the literature, as well as on various online platforms, and that makes the collected data freely available. The infrastructure of Event Data can and should also be utilised to track the use of other URIs in addition to the DOI. Such a URI could also be a hash value from a P2P file system, for example.
Hypothesis 1: It would be better for researchers to allocate persistent object names than for digital archives to do so
As shown above, content-addressable objects are suitable for ensuring that the same object is obtained each time. GitHub, at the latest, has demonstrated that it is possible to build a versioning system on this basis that makes it easy for many people to share data and that also supports the collaborative processing of data. And if this method of name allocation can be made available without a central bottleneck (whether a commercial platform operator or a consortium), all the better. This possibility is opened up by IPFS and supplementary blockchain-based services.
Digital archives attempt to transfer objects from the “private domain” of researchers or the “shared domain” in which a team works on the data, for example, to the third, public domain with a minimum of efficiency losses. (See also Andrew Treloar’s Data Curation Continuum; in the figures below, my amendments are highlighted in red.) Universalising automated name allocation for versioned objects, as currently practised on GitHub as a typical “shared domain” – that would greatly reduce inefficiency.
Hypothesis 2: From name allocation plus archiving plus x as a “package solution” to an open market of modular services
The mere allocation of a persistent name does not ensure the long-term accessibility of objects. This is also the case for a P2P file system such as IPFS. After all, a file will only be accessible if at least one copy of the file is available on a computer connected to the network. Since name allocation using IPFS or a blockchain is not necessarily linked to the guarantee of permanent availability, the latter must be offered as a separate service. To achieve this, digital archives will continue to be needed as service providers. Archives can make their services intelligible and predictable by following common standards (as is currently undertaken in Crossref and DataCite), or they can offer value-added services. This is nothing new either: journals, for example, take care to archive “their” articles, which in turn meet certain additional criteria (“peer reviewed”).
Compared to a DOI consortium, decentralised name allocation merely changes the threshold to enter the market. Contracts with consortium partners are no longer essential. Those wishing to offer new approaches to the allocation of names, to ensuring availability or other value-added services could do so – without any kind of contract. The only prerequisite would be compliance with the protocols of the relevant decentralised system.
In any case, the fact that digital archives “possess” scientific objects plays an increasingly marginal role. This “possession” will still be apparent in a name component of the DOI or the host name where the “original” can be accessed. However, many scientific objects can de facto be found in several places, or copies of them can be retrieved. It would be all the more beneficial to identify objects primarily on the basis of their content. The editing, publication and versioning of a file could then be controlled as directly as possible by the relevant creators. Supplementary services (collecting and archiving, classifying and mining, making discoverable and usable, etc.) could then be offered on a modular basis by anybody, with a minimum of efficiency losses between creators and supplementary service providers, as well as among the different service providers.
Hypothesis 3: It is possible to make large volumes of data scientifically usable more easily without APIs and central hosts
Text and data mining of large volumes of data has now become common scientific procedure in many subject areas. Services that enable scientific content to be discovered, assessed or further processed are growing in popularity. In view of these developments, it would be desirable to access large, complex, dynamically growing collections of objects as easily as accessing content on our own hard disks. To achieve this, we especially need alternatives to application programming interfaces (APIs), which are vulnerable and present too many obstacles.
Just as, from the perspective of digital archives, decentralised approaches can help reduce efficiency losses in the transfer from the private to the public domain, the same approach, from the perspective of data users, can help make it easier to use previously published data. This increased level of accessibility for all, combined with an open market for modular services (see Hypothesis 2), could also have a positive effect on the landscape of scientific digital services. There is hope that we will see more innovative, reliable and reproducible services in the future, also provided by less privileged players; services that may turn out to be beneficial and inspirational to actors in the scientific community.
Epilogue: Decentralised applications help to structure Big Data for academics in a reliable and trustworthy manner – now it’s up to the established players in the scientific infrastructure to make a move
This blog article explored the potential use of blockchains and P2P in working with large, complex, dynamically growing collections of objects in the scientific community. In this respect, decentralised applications can help give reliable, trustworthy responses to the challenge of structuring Big Data according to the needs of researchers. It is therefore now up to libraries, as well as funders, universities, publishers and other players in the scientific infrastructure, to use the momentum of decentralisation, and to further develop their own roles and business models accordingly.
Additional potential approaches are demonstrated by Sönke Bartling and Benedikt Fecher, among others. For example, the financial flows that take place within research funding could be regulated in future by smart contracts, i.e. programs that are executed by a distributed, blockchain-based virtual machine. The multifaceted function of peer review in science could also benefit from decentralisation, since it would be easier to use reliable pseudonyms, protecting researchers from unfair assessment.
6 responses to “How P2P and blockchains make it easier to work with scientific objects – three hypotheses”
IPFS is great! Blockchains are useful! Decentralization indeed improves robustness and efficiency. Services do help to streamline and automate many aspects, thus making development and often research more transparent and reproducible. But IMHO such decentralization and reliance on services should be accompanied by a “local human-compatibility-layer” to retain some control and some guarantees of resource availability and introspection, so that the entire ecosystem does not become a black box of hard-to-understand machinery.
As you noted with the git/github example, it is important to be able to conveniently manipulate available content (locally and/or in the “cloud”) without losing track of the history of events etc. That is where git helps us, in a distributed and “duplicating” fashion, indeed without much of an API (just with basic human-accessible UIs). We can browse and share our histories and content, while retaining a clearly identifiable, verifiable and traceable exact copy of the content, and still allowing for “updates” and merges. The beauty is that, due to its truly distributed nature, git can be used even without any services/internet. We can be sure that if github goes down we can still continue our work, although possibly in a temporarily hindered fashion (if we rely on additional services). Our workflows could be as decentralized as we decide them to be (not as some service mandates). We can be sure that we have all data we need/want today and tomorrow.
I expect the ideal ecosystem to be as resilient and flexible, and have a good tool belt to withstand possible “cloud disturbances”.
While manipulating data, projects such as git-annex could help to orchestrate data access, not only overcoming git’s shortcoming of not being able to hold large files but also, similar to IPFS, providing access to copies of the data available elsewhere (possibly in IPFS). Extending your example, here is a complete script to generate a git-annex repository where the TIB_Logo_en.png file could be obtained from any of the URLs you have mentioned, while guaranteeing that it is exactly that copy:
mkdir /tmp/demo; cd /tmp/demo
git init; git annex init
git annex addurl --pathdepth=-1 https://www.tib.eu/typo3conf/ext/tib_tmpl_bootstrap/Resources/Public/images/TIB_Logo_en.png
git commit -m 'added a favorite logo to annex'
echo "registering more sources for it"
git annex addurl --file TIB_Logo_en.png https://upload.wikimedia.org/wikipedia/commons/f/f7/Logo_of_the_German_National_Library_of_Science_and_Technology%2C_or_Technische_Informationsbibliothek_%28TIB%29%2C_in_English_language.png
git annex addurl --file TIB_Logo_en.png https://ipfs.io/ipfs/QmWngwr7TzDJujWKXD282tmpnjZuD7d5JutVaYuoZ9Xq1t
echo "now I can 'drop' it having verified that it is available"
git annex drop TIB_Logo_en.png
echo "and if needed get it back"
git annex get TIB_Logo_en.png
Having such repositories, we can still collaborate, share, merge, and have knowledge about data availability, even when some services are down.
If you are interested in seeing more data available through git-annex, have a look at http://datasets.datalad.org/ with our collection of primarily neuroimaging datasets, where the actual data load often comes from the original data sharing portals, thus providing decentralization of storage and a (hopefully) unified, convenient interface to access it.