Technical Infrastructure

Open access to digital information

A robust infrastructure compliant with FAIR principles is needed

With approximately 1.5 billion objects to be digitised, bringing natural science collections into the information age is expected to result in 90 petabytes of new data over the next decades, used on average by 5,000 to 15,000 unique users every day. A robust technical infrastructure is required to support working with digital specimens and collections over their entire research data life cycle and to provide unified open access to the digital information, ensuring that it is Findable, Accessible, Interoperable and Reusable (FAIR).
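To give a feel for the scale, the figures quoted above can be turned into a rough average volume per digitised object. This is only a back-of-envelope sketch: it assumes decimal (SI) units and that the data is spread evenly across all objects, neither of which is stated in the text.

```python
# Back-of-envelope estimate based on the figures quoted above.
PETABYTE = 10**15  # assumption: decimal (SI) petabyte, in bytes
MEGABYTE = 10**6   # assumption: decimal (SI) megabyte, in bytes

total_objects = 1_500_000_000   # approximate number of objects to be digitised
total_bytes = 90 * PETABYTE     # expected volume of new data

# Assumption: data is spread evenly across all objects.
bytes_per_object = total_bytes / total_objects
mb_per_object = bytes_per_object / MEGABYTE

print(f"Average data per object: {mb_per_object:.0f} MB")
```

In practice the distribution is likely to be very uneven, since an imaging dataset for one specimen can be far larger than a transcribed label for another.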

How the infrastructure collects the data

Historical data combined with data from new techniques

The DiSSCo infrastructure will combine earlier investments in data interoperability practices with technological advancements in digitisation, cloud services and semantic linking. The infrastructure will connect historical collection data with data emerging from new techniques that is derived from the specimen but is not necessarily linked to species names. These new data include DNA barcodes, whole genome sequences, proteomics and metabolomics data, chemical data, trait data, and imaging data, e.g., Computer-assisted Tomography (CT) and Synchrotron data.

An advanced infrastructure is needed to deliver the diagnostic information required by new approaches and technologies for accelerated field identification of species, regular environmental monitoring, trend analysis and future prediction. Machine readability and actionability will enable integration of quality-assured FAIR data into analytical workflows and tools.

The DiSSCo technical infrastructure will provide eServices tailored to researchers' actual needs. This is achieved through discussions with users, an inventory of user stories (more than 160 thus far) and an agile development and operations process that allows for testing and feedback after each development round (sprint). All user stories are collected in the DiSSCo GitHub repository. Software components for the technical infrastructure are developed as open source.

A provisional Data Management Plan (DMP) has been provided as deliverable D6.6 by the ICEDIG project. It is a living document by design that will be updated in the DiSSCo Prepare project. The DiSSCo DMP describes the main DiSSCo data management principles and requirements, adopting Digital Object Architecture (DOA) and FAIR Digital Objects (FDO) as its foundation. The DiSSCo DMP offers unified policies for data providers, managers and users, and guidance on technical standards to be applied.

The three building blocks of the technical infrastructure:

Repositories with data provided by the DiSSCo Facilities

The Infrastructure will connect data that is provided by the DiSSCo facilities in trusted repositories. These can include local institutional repositories as well as global thematic repositories such as GBIF. It will also connect data in third-party repositories like genetic sequence and literature databases. All data that can be linked to collection objects (specimens) are in scope.

Digital Object Infrastructure

The data will be linked through a Digital Object (DO) infrastructure in which Digital Collection (DC) objects and Digital Specimen (DS) objects are the principal object types. The plan is to use CORDRA software as the basis for a Natural Sciences Identifier Registry (NSIDR). A sandbox, nsidr.org, is presently (early 2020) being used to demonstrate and develop this. The DO infrastructure will include tools for federation and linkage as well as services to support annotation and enrichment of the data by the scientific community. It will draw upon common services provided at the global level or by the European Open Science Cloud (EOSC), for example for authentication and authorization.

Community Services

The infrastructure will provide community services to discover, consume and interact with the federated Digital Collection and Digital Specimen data. Some of these services will be provided in collaboration with other research infrastructures to enable innovative services for multi-disciplinary science.

Digital object infrastructure in more detail:

Digital Specimens (DS)

A Digital Specimen (DS) contains the data or links to data about a physical specimen in a natural sciences collection, and as such acts as its digital representation (or surrogate) on the Internet. Digital Specimens provide an anchoring function for data locked up in physical specimens and released through digitization and other computational practices. However, they are more than just digital representations. Philosophically, digital objects (of which DS are a kind) represent a new category of industrial object sitting alongside natural objects (such as rocks, plants and animals) and tools (hammers, drills, screwdrivers). This opens many new and exciting possibilities for digital manipulation and computation that can lead to new working practices and a digital transformation in collections-based science.

Natural Sciences Identifiers (NSId)

A Natural Sciences Identifier (NSId) is a kind of universal and stable persistent identifier – a long-lasting reference to a digital resource – that is used to unambiguously, uniquely and globally identify a Digital Specimen.

The notion of Natural Sciences Identifiers for Digital Specimens is central to museums’ ambitions for widening access, and to proposed notions of Extended Specimens (Webster et al. 2017, Lendemer et al. 2019) and Next Generation Collections (Schindel and Cook 2018). NSIds act as a digital doorway that allows more than just finding, accessing and re-using specimen data. A wide variety of novel first- and third-party services become possible, including for example: harmonizing the arrangement of loans and visits through the European Loans and Visits System (ELViS), finding specimens related to one another (think: ‘frequently bought together’, ‘customers who viewed this also viewed these’), linking to third-party information, and providing support to the Nagoya Protocol on Access and Benefit Sharing.
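The idea of an identifier that uniquely resolves to one Digital Specimen, which in turn links out to third-party data, can be sketched as a small in-memory registry. Everything here is hypothetical and purely illustrative: the class names, the field names, the identifier string and the linked reference are invented for the example and do not reflect the actual NSId syntax or the NSIDR implementation.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalSpecimen:
    """Hypothetical, simplified stand-in for a Digital Specimen record."""
    nsid: str                  # persistent identifier (format is illustrative only)
    physical_specimen_id: str  # catalogue number held by the institution
    links: dict = field(default_factory=dict)  # label -> external reference


class IdentifierRegistry:
    """Toy in-memory registry; a real registry would be a persistent service."""

    def __init__(self):
        self._records = {}

    def register(self, specimen):
        # Enforce uniqueness: one identifier refers to exactly one specimen.
        if specimen.nsid in self._records:
            raise ValueError(f"identifier already registered: {specimen.nsid}")
        self._records[specimen.nsid] = specimen

    def resolve(self, nsid):
        # Resolution returns the single matching record or fails loudly.
        try:
            return self._records[nsid]
        except KeyError:
            raise LookupError(f"unknown identifier: {nsid}") from None


# Usage with made-up values.
registry = IdentifierRegistry()
registry.register(DigitalSpecimen(
    nsid="example-prefix/specimen-0001",
    physical_specimen_id="HERB-12345",
    links={"dna_barcode": "sequence-db:EXAMPLE0001"},
))
specimen = registry.resolve("example-prefix/specimen-0001")
print(specimen.physical_specimen_id, specimen.links)
```

Rejecting a second registration of the same identifier is what keeps the reference unambiguous; services such as loans or related-specimen lookups would then build on `resolve` rather than on institution-specific catalogue numbers.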

Digital Collections (DC)

Digital Collections (DC) are another kind of FAIR Digital Object supported by DiSSCo. They are used to provide descriptions of distinctly identifiable collections of natural sciences specimens, such as a specific herbarium collection or a collection of insects. DC objects are defined based on the emerging TDWG metadata standard for Collection Descriptions.

FAIR Digital Objects

Technically, Digital Specimens, Digital Collections and other kinds of DiSSCo digital objects are ‘FAIR digital objects’. That is, they are bit sequences stored in repositories with a persistent identifier, metadata and a type definition. By being part of the FAIR Digital Object Framework, they are findable, accessible, interoperable and reusable by default. DiSSCo needs a type definition for Digital Specimens (as well as for every other object type). This will be provided by a new standard, ‘openDS’, which is intended to supply the harmonization needed across research infrastructure subsystems and at the global level. A COST MOBILISE workshop in Warsaw (February 2020) agreed to move ahead with its development during the remainder of 2020.
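The four ingredients named above (bit sequence, persistent identifier, metadata and type definition) can be illustrated with a minimal sketch. The type definitions below are hypothetical placeholders and are not taken from openDS; the field names, identifier and validation rules are assumptions made only to show how a type definition lets software check an object mechanically.

```python
from dataclasses import dataclass

# Hypothetical type definitions: type name -> metadata keys that every object
# of that type must carry. Placeholders only, NOT the openDS specification.
TYPE_DEFINITIONS = {
    "DigitalSpecimen": {"physicalSpecimenId", "institution"},
    "DigitalCollection": {"collectionName", "institution"},
}


@dataclass
class FairDigitalObject:
    pid: str        # persistent identifier
    type_name: str  # key into TYPE_DEFINITIONS
    metadata: dict  # descriptive key/value pairs
    content: bytes  # the stored bit sequence


def validation_errors(obj):
    """Return a list of problems; an empty list means the object conforms."""
    errors = []
    if not obj.pid:
        errors.append("missing persistent identifier")
    required = TYPE_DEFINITIONS.get(obj.type_name)
    if required is None:
        errors.append(f"unknown type: {obj.type_name}")
    else:
        missing = sorted(required - set(obj.metadata))
        if missing:
            errors.append(f"missing metadata: {', '.join(missing)}")
    return errors


# Usage with made-up values.
complete = FairDigitalObject(
    pid="example-prefix/specimen-0001",
    type_name="DigitalSpecimen",
    metadata={"physicalSpecimenId": "HERB-12345", "institution": "Example Museum"},
    content=b"{}",
)
incomplete = FairDigitalObject(
    pid="example-prefix/specimen-0002",
    type_name="DigitalSpecimen",
    metadata={"physicalSpecimenId": "HERB-67890"},
    content=b"{}",
)
print(validation_errors(complete))
print(validation_errors(incomplete))
```

A shared type definition of this kind is what would allow independently built subsystems to agree on whether an object is a well-formed Digital Specimen without inspecting each other's internals.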