An Experiment in Using Offline LLMs Within a Secure Data Environment – A Breast Cancer Case Study

Introduction

In health research, data privacy and security are paramount. Researchers work with sensitive datasets, such as patient records, where any breach could have significant legal and ethical implications. Yet the need for advanced analytical tools like large language models (LLMs) is growing. These tools simplify complex tasks, such as exploratory data analysis (EDA), making them accessible to researchers with varying technical expertise.

This is where a Trusted Research Environment (TRE) shines. By integrating offline LLM frameworks like Ollama, researchers can leverage the power of AI without the fear of data or prompt leakage. I’m using an open-source model, Mistral-7B, which balances performance against resource consumption (no expensive GPU required). Because the model runs locally, all prompts and data stay within the TRE, which supports compliance with data protection regulations such as GDPR.

At Aridhia, our DRE/TRE platform is designed to support secure, compliant, and collaborative data analysis. It empowers teams to harness cutting-edge AI capabilities while safeguarding sensitive health data. This was covered more extensively in a previous blog.

In this first post, we’ll explore how AI-driven coding assistance can guide less experienced R users through some simple EDA, demonstrating these capabilities using the publicly available Breast Cancer Wisconsin dataset. We’ll do a more detailed EDA in a later post in this series.

Background Setup for the Demo

Dataset Upload

For this demo, I’ve downloaded the Breast Cancer Wisconsin dataset, which is publicly available on Kaggle, as a CSV file. I then uploaded it securely into my TRE workspace via the inbound airlock, where it is ready for analysis.

Workspace/TRE Environment

This demo is conducted entirely within a Workspace/TRE, which provides a secure, controlled environment for analysing sensitive data. To provide the AI coding assistance, I’ve installed the Ollama LLM framework along with the Mistral-7B model (specifically the Q4_K_M quantised version, which uses up to 7GB of RAM). These tools were also uploaded securely through the workspace inbound airlock.
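The demo itself uses Ollama's interactive prompt in a terminal. As an optional aside, here is a minimal sketch of sending a single prompt to the local model from R by shelling out to the `ollama run` command. The helper names `build_ollama_args` and `ask_local_llm` are hypothetical, and the model tag `"mistral"` is an assumption; use whatever name the model was registered under in your workspace.

```r
# Hypothetical helper: build the arguments for `ollama run <model> "<prompt>"`.
# The default model tag "mistral" is an assumption, not taken from the demo.
build_ollama_args <- function(prompt, model = "mistral") {
  c("run", model, shQuote(prompt))
}

# Hypothetical helper: send one prompt to the local model and return its reply
# as a single string. Nothing leaves the machine; Ollama runs locally.
ask_local_llm <- function(prompt, model = "mistral") {
  if (!nzchar(Sys.which("ollama"))) {
    stop("The 'ollama' executable was not found on the PATH.")
  }
  reply <- system2("ollama", args = build_ollama_args(prompt, model),
                   stdout = TRUE)
  paste(reply, collapse = "\n")
}

# Example call (only works where Ollama and the model are installed):
# cat(ask_local_llm("How do I read a CSV file into R?"))
```

Copying code from the terminal into the R console, as in the rest of this post, works just as well; the wrapper is only a convenience.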

With this setup, I can confidently use advanced AI capabilities without risking data exposure, making it ideal for privacy-sensitive applications like healthcare research.

Visual Insights through AI-Powered EDA

Let’s walk through a practical EDA example, showcasing how an AI coding assistant can guide researchers unfamiliar with R through each step. I have the LLM prompt open in one terminal, and an R console for running commands open in another.

Step 1: Load the Dataset

Prompt:
“I’ve uploaded a CSV file named `wdbc.csv` into my TRE workspace. Can you help me load it into R and take a quick look at the data?”

Here’s the response from my LLM:

[Screenshot: LLM response]

And here’s a snipped screengrab after I pasted the code in my R session:

[Screenshot: R session output]

You might notice the weird “X” column of blank values in the file. As it turns out, the CSV file from Kaggle has a formatting error: its header row contains one more column name than there are data columns, so R creates an extra column with no values. It’s always good to preview the data to check that it has loaded correctly. Next, we’ll fix this.
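As a rough sketch of what this step might look like in base R (not the LLM's actual response), the following loads the file and prints a quick overview, including a check for columns with no values. The helper name `load_and_preview` is hypothetical, and the code assumes `wdbc.csv` sits in the current working directory.

```r
# Hypothetical helper: load a CSV file and print a quick overview of it.
load_and_preview <- function(path) {
  data <- read.csv(path, stringsAsFactors = FALSE)

  cat("Rows:", nrow(data), " Columns:", ncol(data), "\n")

  # Show the first rows of the first few columns only, to keep output short
  print(head(data[, seq_len(min(5, ncol(data))), drop = FALSE]))

  # Flag any column that contains no values at all
  empty <- names(data)[colSums(!is.na(data)) == 0]
  if (length(empty) > 0) {
    cat("Columns with no values:", paste(empty, collapse = ", "), "\n")
  }

  invisible(data)
}

# Assumes wdbc.csv is in the current working directory
if (file.exists("wdbc.csv")) {
  wdbc <- load_and_preview("wdbc.csv")
}
```

Printing only the first few columns keeps the preview readable for a wide file, while the empty-column check still scans every column.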

Step 2: Fix Empty Column

Prompt:
“It looks like there’s a column with completely missing values. Can you help me identify and remove it?”
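One way this fix could be written in base R is sketched below; it is an illustration, not the LLM's actual response. It drops every column whose values are all missing rather than hard-coding the column name. The helper name `drop_empty_columns` is hypothetical, and the file is assumed to be `wdbc.csv` in the working directory.

```r
# Hypothetical helper: remove columns in which every value is missing (NA).
drop_empty_columns <- function(data) {
  is_empty <- colSums(!is.na(data)) == 0
  if (any(is_empty)) {
    message("Removing empty column(s): ",
            paste(names(data)[is_empty], collapse = ", "))
  }
  data[, !is_empty, drop = FALSE]
}

# Assumes wdbc.csv is in the current working directory
if (file.exists("wdbc.csv")) {
  wdbc <- read.csv("wdbc.csv", stringsAsFactors = FALSE)
  wdbc <- drop_empty_columns(wdbc)
  cat("Columns remaining:", ncol(wdbc), "\n")
}
```

Columns that are only partly missing are kept, so this removes nothing that carries data.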

Step 3: Generate a Histogram

Prompt:
“I want to see how tumour sizes are distributed. Can you help me create a histogram for the `radius_mean` column?”
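A histogram like this can be drawn with base R's `hist()`. The sketch below is illustrative rather than the LLM's actual response; the helper name `plot_column_histogram`, the bin count, and the colours are arbitrary choices, and `wdbc.csv` is assumed to be in the working directory.

```r
# Hypothetical helper: draw a histogram of one numeric column.
# `bins` is passed to hist() as a suggested number of bins; R may adjust it.
plot_column_histogram <- function(data, column = "radius_mean", bins = 30) {
  values <- data[[column]]
  stopifnot(is.numeric(values))
  hist(values,
       breaks = bins,
       main = paste("Distribution of", column),
       xlab = column,
       ylab = "Number of samples",
       col = "steelblue",
       border = "white")
}

# Assumes wdbc.csv is in the current working directory
if (file.exists("wdbc.csv")) {
  wdbc <- read.csv("wdbc.csv", stringsAsFactors = FALSE)
  plot_column_histogram(wdbc, "radius_mean")
}
```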

Step 4: Create a Scatter Plot

Prompt:
“How can I visualise the relationship between `radius_mean` and `texture_mean`, and show different tumour types in different colours?”
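A base R sketch of such a plot follows; again, this is an illustration and not the LLM's actual response. It assumes the tumour type is stored in a `diagnosis` column coded `"B"` (benign) and `"M"` (malignant), and that `wdbc.csv` is in the working directory. The helper name `plot_radius_vs_texture` and the colours are hypothetical choices.

```r
# Hypothetical helper: scatter plot of radius_mean against texture_mean,
# coloured by diagnosis. Assumes diagnosis is coded "B" or "M"; any other
# value gets no colour and is therefore not drawn.
plot_radius_vs_texture <- function(data) {
  colours <- c(B = "steelblue", M = "firebrick")
  point_colours <- colours[as.character(data$diagnosis)]

  plot(data$radius_mean, data$texture_mean,
       col = point_colours,
       pch = 19,
       xlab = "radius_mean",
       ylab = "texture_mean",
       main = "radius_mean vs texture_mean by diagnosis")
  legend("topright",
         legend = c("Benign (B)", "Malignant (M)"),
         col = colours,
         pch = 19,
         title = "Diagnosis")

  invisible(point_colours)
}

# Assumes wdbc.csv is in the current working directory
if (file.exists("wdbc.csv")) {
  wdbc <- read.csv("wdbc.csv", stringsAsFactors = FALSE)
  plot_radius_vs_texture(wdbc)
}
```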

Prompt:
“Can you help me create a correlation heatmap for the first 10 features to see which ones are strongly related?”
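A correlation heatmap can be built in base R with `cor()` and `heatmap()`. The sketch below is one possible approach, not the LLM's actual response. It assumes the identifier column is named `id` and treats the first 10 remaining numeric columns as the "first 10 features"; the helper names `feature_correlations` and `plot_correlation_heatmap` are hypothetical.

```r
# Hypothetical helper: correlation matrix of the first n numeric features,
# skipping identifier columns (assumed here to be named "id").
feature_correlations <- function(data, n_features = 10, exclude = "id") {
  numeric_cols <- names(data)[sapply(data, is.numeric)]
  numeric_cols <- setdiff(numeric_cols, exclude)
  selected <- head(numeric_cols, n_features)
  cor(data[, selected, drop = FALSE], use = "complete.obs")
}

# Hypothetical helper: draw the matrix as a heatmap. heatmap() reorders rows
# and columns by hierarchical clustering, so related features sit together.
plot_correlation_heatmap <- function(cor_matrix) {
  heatmap(cor_matrix,
          symm = TRUE,
          col = colorRampPalette(c("steelblue", "white", "firebrick"))(50),
          margins = c(10, 10),
          main = "Feature correlations")
}

# Assumes wdbc.csv is in the current working directory
if (file.exists("wdbc.csv")) {
  wdbc <- read.csv("wdbc.csv", stringsAsFactors = FALSE)
  cor_matrix <- feature_correlations(wdbc, n_features = 10)
  print(round(cor_matrix, 2))
  plot_correlation_heatmap(cor_matrix)
}
```

Printing the rounded matrix alongside the plot makes it easy to read off exact values for the pairs that stand out in the heatmap.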

Here’s what this all looks like inside the TRE workspace, with my LLM prompt in one terminal window, the R console in another (I could just as easily have used RStudio), and the resulting heatmap plot.

The Power of LLMs in a Secure Data Environment

Running LLMs like Mistral-7B inside a TRE gives researchers access to cutting-edge tools without compromising data security. Because the model runs offline and no prompts or data are sent to external services, researchers can confidently explore and analyse sensitive data while maintaining strict privacy standards.

At Aridhia, we provide a secure, scalable platform that not only meets regulatory requirements but also enhances productivity by integrating advanced AI tools into the research workflow. Whether you’re performing EDA, conducting federated learning, or managing complex multi-site collaborations, our DRE/TRE offers the flexibility and security you need to succeed.

Conclusion

Offline AI tools like Ollama can provide valuable assistance for researchers, especially those with limited coding expertise. Because they run entirely within a Trusted Research Environment, sensitive health data remains protected while researchers still get practical, step-by-step help with their analysis. Whether you’re performing EDA or diving deeper into modelling, integrating AI into your TRE workflow can make that work faster and more accessible.

Stay tuned for the next post, where we’ll explore Retrieval-Augmented Generation (RAG) and show how to train and query your datasets entirely offline.

Kim Carter
