Earlier in the year, members of our development team attended a Microsoft Hackathon to investigate ways that we could utilise Azure AI services as part of the Aridhia Digital Research Environment.
The short duration of these events means that after a brief exploratory period, you need to settle on one or two ideas to take forward for the rest of the hackathon. On that basis, we chose to investigate two possible implementations in more detail:
• Use of Azure Language Service LLM to provide human-readable summaries of dataset metadata
• Introducing AI-backed Vector search in FAIR Data Services
Both of these proved to be interesting avenues of exploration.
We quickly established the type of prompt and data format that produced the best summaries of the metadata. FAIR allows users to export dataset metadata in both JSON and Markdown formats, and the Azure Language Service (ALS) produced noticeably better outputs when asked to summarise data presented in Markdown.
As expected, the quality of summaries was also heavily dependent on both the volume of metadata presented to the ALS and the specificity of the prompt provided to the service. This was a useful lesson, and while dataset metadata summarisation doesn’t feel like an essential feature at this stage, a similar implementation could prove more useful in other contexts within FAIR and the wider DRE.
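To illustrate the Markdown-first approach described above, here is a minimal sketch of rendering a metadata record as Markdown and wrapping it in a specific prompt. The field names (`title`, `description`, `fields`), the example record and the prompt wording are all assumptions made for this illustration. They are not FAIR's actual export schema or the prompt used at the hackathon, and the call to the language service itself is deliberately left out.

```python
def metadata_to_markdown(metadata: dict) -> str:
    """Render a (hypothetical) dataset metadata record as Markdown."""
    lines = [
        f"# {metadata['title']}",
        "",
        metadata["description"],
        "",
        "## Fields",
        "",
        "| Name | Type | Description |",
        "| --- | --- | --- |",
    ]
    for field in metadata["fields"]:
        lines.append(
            f"| {field['name']} | {field['type']} | {field['description']} |"
        )
    return "\n".join(lines)


def build_summary_prompt(markdown: str, audience: str = "researchers") -> str:
    """Combine a specific instruction with the Markdown metadata."""
    return (
        f"Summarise the dataset described below in three sentences for {audience}. "
        "Mention the subject area and the most important fields.\n\n"
        f"{markdown}"
    )


# Illustrative record only; the structure is an assumption, not a real export.
EXAMPLE_METADATA = {
    "title": "Example cohort",
    "description": "Synthetic records used to demonstrate metadata summaries.",
    "fields": [
        {"name": "age", "type": "integer", "description": "Age at enrolment"},
        {"name": "diagnosis", "type": "string", "description": "Primary diagnosis"},
    ],
}

prompt = build_summary_prompt(metadata_to_markdown(EXAMPLE_METADATA))
print(prompt)
```

The resulting string would then be sent to whichever summarisation service is in use. Keeping the instruction explicit about length, audience and focus reflects the point above that more specific prompts produced better summaries.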
FAIR currently uses Azure Cognitive Search, which largely depends on string similarity to return search results (i.e. if a user wants to discover datasets about Alzheimer’s disease, their search must include the term “alzheimer’s”).
AI-backed Vector search provides users with a more powerful and intuitive search experience. It can match on:
• semantic or conceptual likeness (“dog” and “canine”, conceptually similar yet linguistically distinct)
• multilingual content (“dog” in English and “hund” in German)
• multiple content types (“dog” in plain text and a photograph of a dog in an image file)
In our context, this means that a user could return datasets on Alzheimer’s disease while using a more general, but semantically similar search term like “cognitive decline”. This clearly represents a significant boost to our existing search capabilities, and as such, introducing it into FAIR is a near-term priority.
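The difference between the two approaches can be sketched in a few lines. In the example below, the dataset titles and the three-dimensional vectors are hand-made for illustration. Real embeddings come from an embedding model and have hundreds or thousands of dimensions, and this is not how Azure Cognitive Search is implemented internally. The sketch only shows the core idea: a keyword search needs overlapping text, while a vector search ranks by closeness (here, cosine similarity) between vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings": hand-picked values, NOT output from a real model.
DATASETS = {
    "Alzheimer's disease cohort": [0.9, 0.1, 0.0],
    "Dementia imaging study": [0.7, 0.3, 0.2],
    "Paediatric asthma registry": [0.0, 0.1, 0.9],
}

QUERY_TEXT = "cognitive decline"
# Assumed vector for the query text, placed near the neurology datasets.
QUERY_VECTOR = [0.85, 0.15, 0.05]


def keyword_search(query, titles):
    """Return titles that contain the query as a substring."""
    return [t for t in titles if query.lower() in t.lower()]


def vector_search(query_vector, datasets, top_k=2):
    """Return the top_k titles ranked by cosine similarity to the query."""
    ranked = sorted(
        datasets,
        key=lambda title: cosine_similarity(query_vector, datasets[title]),
        reverse=True,
    )
    return ranked[:top_k]


print(keyword_search(QUERY_TEXT, DATASETS))  # no title contains the phrase
print(vector_search(QUERY_VECTOR, DATASETS))
```

With these toy values, the keyword search finds nothing for “cognitive decline”, while the vector search ranks the Alzheimer’s and dementia datasets ahead of the unrelated asthma registry.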
The immediate future of AI in the DRE is an extended trial of Vector search in FAIR. This will allow us to determine which metadata is best suited to vectorisation and help identify any changes we may want to make to our existing metadata model to better support it. Vector search will only be available on DRE hubs that are part of this trial and will not be generally enabled. Even when the trial is complete and the feature is in production, we anticipate that it will be enabled at the discretion of the hub owner.
Clearly AI is still in its infancy as a tool, and its use in trusted research environments is an even more recent development. However, there are already a number of possible use cases within the DRE that we will be exploring in the coming months:
• Research transparency is an emerging theme, with data owners keen to make information about the use of their data publicly available. Using the metadata summarisation technologies we researched as part of the hackathon to summarise data access requests could provide data owners with an easy way to generate public summaries of data usage.
• Data export security is a key benefit of the Aridhia DRE, provided by our workspace Airlock feature. Our development team is already exploring the use of AI to make this even more secure, by investigating how we can use LLMs to identify and report when sensitive data, such as PII, is pushed into an airlock for export.
• Federated analysis is another emerging technology in healthcare research, which we are already integrating with the DRE. This requires data owners to review and approve code to be run against their data. Not all data owners have the skills or capacity to carry out these reviews, but AI-assisted code reviews represent a potential way to resolve this issue.
Our use of these new technologies will continue to develop over the coming months and years. If you’d like to know more about what Aridhia is doing in this space, please get in touch.
August 29, 2024
Ross joined the Aridhia Product Team in January 2022. He is the Product Owner for FAIR Data Services, and Aridhia's open source federation project. He works with our customers to understand their needs, and with our Development Team to introduce new features and improve our products. Outside of work, he likes to go hill walking and is slowly working his way through Scotland's Munros.