In a previous blog, we discussed the ‘F’, or Findability, of the FAIR data principles. Even when a dataset is well described by clear, concise and exhaustive metadata, researchers still often face the difficult task of obtaining access to it.
Accessibility of a dataset does not imply that data should be instantly available for use, but rather that it should be “as open as possible and as closed as necessary” [European Commission]. Typically, there are three methods to obtain access to data:
While in healthcare settings the latter approach is typically employed, obtaining access to data can often be difficult, with complex request processes and long turnaround times for decisions. Data governance policies and restrictions imposed by data owners rightly try to strike a balance between satisfying ethical and regulatory requirements and preserving data confidentiality, particularly when data contains personal information. This can, however, restrict the ability of researchers to collate the datasets they need to analyse and innovate to help improve patient outcomes. See our blog article discussing this balancing act in relation to data sharing here.
The Accessibility (or ‘A’) of FAIR defines a dataset to be accessible when:

- A1: (meta)data are retrievable by their identifier using a standardised communications protocol;
- A1.1: the protocol is open, free and universally implementable;
- A1.2: the protocol allows for an authentication and authorisation procedure, where necessary;
- A2: metadata are accessible, even when the data are no longer available.
With open, free and standardised protocols such as HTTP(S) to retrieve data via identifiers (as defined in principle F1), the process of accessing a dataset is primarily defined through authentication and authorisation workflows.
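To make this concrete, below is a minimal sketch, assuming a hypothetical catalogue endpoint and identifier scheme, of retrieving a dataset’s metadata record by its identifier over HTTPS:

```python
import requests  # third-party HTTP client: pip install requests

# Hypothetical metadata catalogue endpoint; real platforms will differ.
CATALOGUE_URL = "https://fair.example.org/api/datasets"

def fetch_metadata(dataset_id: str) -> dict:
    """Retrieve a dataset's metadata record by its identifier (FAIR A1)."""
    response = requests.get(f"{CATALOGUE_URL}/{dataset_id}", timeout=10)
    # A 401/403 response here would indicate that an authentication and
    # authorisation procedure is required (FAIR A1.2).
    response.raise_for_status()
    return response.json()

metadata = fetch_metadata("doi-10.1234-example")
print(metadata.get("title"))
```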
Note that these principles do not outline how these workflows are implemented, what features of a dataset are released or how much of the dataset is released. For example, this may be a synthetic or de-identified version of the dataset, or a limited subset of data defined by the data owner. Private datasets could adhere to the Accessible principles if a process exists to access this data.
Authentication can be added to protect the visibility and security of data, requiring users to register and log in before they can see and access it. Open identity management standards such as OAuth, and established solutions such as Azure Active Directory, Okta and Auth0, can be used to protect and control access to data and associated metadata. Adding authentication as a condition of obtaining or requesting access to data is entirely optional; however, a number of tiered levels of per-dataset visibility and access exist:
A data platform providing access to data will ideally implement all authentication tiers, allowing data owners to select the level of dataset visibility and access on a per-dataset basis.
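As an illustration, the snippet below sketches how a platform might enforce per-dataset visibility tiers. The tier names and the shape of the user and dataset records are assumptions for the example, not a prescribed scheme:

```python
from enum import Enum

class Visibility(Enum):
    PUBLIC = "public"          # metadata and data visible to everyone
    REGISTERED = "registered"  # visible only to authenticated users
    RESTRICTED = "restricted"  # authenticated users must request access

def can_view(dataset: dict, user: dict | None) -> bool:
    """Decide whether a user may see a dataset.

    `dataset` carries a per-dataset visibility tier chosen by its owner;
    `user` is None for anonymous visitors, otherwise an authenticated
    principal (e.g. resolved from an OAuth access token).
    """
    tier = Visibility(dataset["visibility"])
    if tier is Visibility.PUBLIC:
        return True
    if user is None:
        return False  # REGISTERED and RESTRICTED both require login
    if tier is Visibility.REGISTERED:
        return True
    # RESTRICTED: visible only once an access request has been approved.
    return dataset["id"] in user.get("approved_datasets", [])

user = {"sub": "user-123", "approved_datasets": ["ds-42"]}
print(can_view({"id": "ds-42", "visibility": "restricted"}, user))  # True
```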
Authorisation is an ambiguous term when applied to procedures of requesting and obtaining access to data. Below are some important components of a potential authorisation procedure, or Data Access Request (DAR), in relation to a data aggregator platform, i.e. a platform collating and providing access to datasets from various communities and data owners:
To reduce the risk of data misuse, leakage or breaches of any data usage agreements in place, requesting users should be verified before any data is shared: primarily to confirm their identity, but also, as is common, to confirm that they belong to a valid institution that would plausibly be interested in using the requested data.
Three such approaches that are often combined to reduce the aforementioned risks are:
Equally important to user verification is understanding why a user requires access to a dataset. As a minimum, data owners want to understand:
The data access request form captures this information. Commonly these forms are static and are applied to all datasets within a data aggregator platform; however, with the greater volumes and types of data being generated in healthcare settings, forms should be both extensible and relevant to the dataset being requested, e.g. a dataset with strict release conditions will require more information than one with fewer restrictions. By doing so, requesting users need not supply unnecessary information that a data owner does not require as part of the decision review process.
Processes should exist to allow users to request access to multiple datasets concurrently, adopting smart form capabilities to streamline the user’s request experience while still capturing the information required by the data owner. Similarly, an access request need not apply to the entire dataset: a data selection extension can be added to the request form, giving requesting users the ability to select tables or define cohorts of interest. Understanding how the data will be used can be supported by introducing the Data Use Ontology (DUO) to semantically tag datasets with usage restrictions that requesting users must specify and adhere to upon access (e.g. “for non-profit use only”, “for Alzheimer’s research”, etc.).
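As a sketch of how such tagging could be enforced, the snippet below checks a request’s declared uses against a dataset’s usage restrictions. The tag strings are illustrative stand-ins, not official DUO identifiers; a real implementation would use the GA4GH Data Use Ontology terms:

```python
# Illustrative usage tags loosely modelled on DUO-style terms.
DATASET_USE_TERMS = {
    "ds-42": {"non-profit-use-only", "alzheimers-research"},
}

def request_satisfies_restrictions(dataset_id: str, declared_uses: set[str]) -> bool:
    """A request must declare (and agree to) every usage restriction
    the data owner has tagged the dataset with."""
    required = DATASET_USE_TERMS.get(dataset_id, set())
    return required.issubset(declared_uses)

print(request_satisfies_restrictions(
    "ds-42", {"non-profit-use-only", "alzheimers-research"}))  # True
print(request_satisfies_restrictions("ds-42", {"commercial-use"}))  # False
```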
Request forms should also be machine-readable (e.g. JSON) to support sending request information to external systems when the request review process occurs elsewhere. A good example of this is work underway as part of the ADDI AD Workbench, which allows requests to be completed by users in FAIR Data Services and sent to the data owner for review via an API integration with their request review system. A request decision (approval or denial) is issued by a callback to FAIR Data Services, which then informs the requesting user.
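The shape of such an integration might look like the following sketch, in which the payload fields, endpoints and callback URL are all hypothetical:

```python
import requests

# Hypothetical machine-readable access request; real schemas are defined
# by the platform and the data owner.
access_request = {
    "request_id": "req-001",
    "dataset_id": "ds-42",
    "requester": {"name": "Dr A. Researcher", "institution": "Example University"},
    "purpose": "Cohort analysis of disease progression",
    "declared_uses": ["non-profit-use-only", "alzheimers-research"],
    # Where the external review system should POST its approve/deny decision.
    "decision_callback": "https://fair.example.org/api/requests/req-001/decision",
}

# Forward the request to the data owner's own review system as JSON.
response = requests.post(
    "https://owner.example.org/api/review",
    json=access_request,
    timeout=10,
)
response.raise_for_status()
```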
A dataset will usually have one or more associated agreements that need to be entered into before data is released to a user. Regardless of the agreement in place, data owners must clearly outline the conditions of access to help requesting users decide whether requesting access is worthwhile. This may involve exposing a Data Usage Agreement (DUA) defining, but not limited to:
Governance measures should be developed to audit datasets, capture all request outcomes, list the users who have been granted access, and record which associated agreements were entered into and when. Dataset agreements and compliance merit a detailed overview that will be covered in a future blog post.
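One way to capture such an audit trail, sketched below with assumed field names, is an append-only record per request outcome:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class AccessAuditEntry:
    """Append-only record of a request outcome and any agreement entered into."""
    dataset_id: str
    user_id: str
    decision: str             # "approved" or "denied"
    agreement_id: str | None  # e.g. the DUA version the user accepted
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

audit_log: list[AccessAuditEntry] = []
audit_log.append(AccessAuditEntry("ds-42", "user-123", "approved", "dua-v2"))
```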
Research organisations and data owners will employ different workflows for how a data access request is processed and ultimately approved or denied. This can depend on an organisation’s internal processes, interactions with users, the content of the request or the dataset in question, to name a few factors. Workflows should be assigned on a per-dataset basis and, ideally, can be created and deployed by data owners in a self-service manner.
Detailed below are five types of data access request that should be supported by a data aggregator platform:
The example workflow shown (represented as a Business Process Model and Notation (BPMN) chart) depicts a common access request workflow that includes multiple stages and business rules. Note that either review stage could be handled internally within a data aggregator’s platform or externally via a third party.
In the example, an access request is received and, if the dataset is an open dataset, the request can be automatically approved by the platform. Otherwise, the request must be reviewed by both the Research and Data teams to gain approval; denial can occur at either of these stages. Data access request processes can be greatly sped up by integrating a feedback loop in which requesting users can contact data owners and iteratively rework their access requests, or obtain feedback when requests have been denied.
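As a simplified sketch of this routing logic, assuming each review stage can be modelled as a function that approves or denies:

```python
from typing import Callable

def process_access_request(dataset: dict,
                           research_review: Callable[[dict], bool],
                           data_review: Callable[[dict], bool]) -> str:
    """Route an access request through the example BPMN workflow.

    Open datasets are auto-approved; all others pass through a Research
    team review and then a Data team review, and may be denied at either
    stage. Each review callable returns True (approve) or False (deny).
    """
    if dataset.get("open", False):
        return "approved"  # business rule: open datasets skip review
    if not research_review(dataset):
        return "denied at research review"  # feedback loop could start here
    if not data_review(dataset):
        return "denied at data review"
    return "approved"

print(process_access_request({"open": True}, lambda d: True, lambda d: True))
# -> approved
```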
A data owner needs at least all the information provided by the requesting user to make a judgement on approving or denying an access request. However, additional information and features should be made available to improve both the correctness and timeliness of the decision. For example:
Upon a reviewer’s decision, requesting users should be informed of both the decision and the reason for it, particularly when a request is denied. Access should then be granted to the user in a secure manner compliant with any agreements in place, e.g. by securely transferring data to a Digital/Trusted Research Environment.
The requirements listed above provide one blueprint for improving existing data access request processes. Without continued improvement in this area, lengthy and complex data access processes will soon become the primary bottleneck of healthcare data innovation. Automated, customised and streamlined access request processes will be key going forward.
We will explore some of the concepts above in more detail in further blog posts as well as outline how FAIR Data Services can help make data accessible.
February 5, 2021
Gary joined Aridhia in 2019 as the Product Owner of FAIR Data Services. With three years’ prior product management experience in the cloud computing domain, Gary also has a PhD in Computer Science and has worked in a number of postdoctoral researcher roles, from developing online distance learning courses to improving the usability of HPC infrastructures. He has also worked at Apple and CERN in similar capacities, improving cloud computing infrastructures. As a FAIR data advocate, Gary is now responsible for the development of Aridhia FAIR Data Services and the implementation of the wider product strategy.