0:00
Hi, everyone, and thank you very much for coming to this talk on unlocking prognostic biomarkers.
0:07
I'll be talking about how we can achieve this within digital pathology with the use of foundation models, alongside some use cases of foundation models in digital pathology.
0:20
So yeah, thanks for your introduction.
0:22
I'm Matt Lee.
0:23
I'm the Director of AI and Medical Imaging at Sonrai.
0:26
I'll kick off today with an introduction to Sonrai, talk about what we do, the kinds of data that we support, and why we've built these capabilities with digital pathology images.
0:36
Then I'll get on to the current state of play within computational pathology before talking about the challenges, of which there are quite a few, and then we'll dive into some use cases in precision medicine.
0:47
And we'll finish up with a demonstration of a prognostic modelling task on a public data set.
0:54
I don't think it's controversial to say that the future of precision medicine is multimodal.
0:58
It's a message I've heard a lot over the course of this conference.
1:02
And we'll look at how digital pathology can contribute to that multimodal future.
1:09
So Sonrai.
1:10
Sonrai are a Belfast based company with a mission to power the most important precision medicine breakthroughs through connected data and collaborative science.
1:19
We have a platform, Sonrai Discovery, which is a trusted research environment ensuring auditability, reproducibility and traceability.
1:27
On here we have a suite of tools, pipelines and features for working with multiomic data, clinical data and imaging data.
1:36
What I'll present here today is complemented really well by a webinar my colleague James Leitch did earlier in the year alongside our partners at Paige.
1:45
They've since been acquired by Tempus.
1:47
So if you like what you hear here today and you want to take a deeper dive, that's a really good place to get going.
1:57
Computational pathology is having a real impact across a range of different areas.
2:01
I'm not going to go into all of these in detail, but we're at a biomarkers conference, so it seems fitting to start with novel biomarker discovery.
2:09
So we'll see how we can derive features from our digital pathology images that can act as novel biomarkers.
2:16
They can be used to model things like prognosis, therapy response or resistance, for example.
2:21
In pharma and diagnostics, we can do things such as tumour microenvironment analysis and get a quantitative assessment of immune infiltration, right the way through to clinical impact, where we can build automated scoring algorithms based upon our biomarkers.
2:37
Also risk stratification models and classification models that can support clinical decision making.
2:46
OK. The challenges, there are many, as I said.
2:48
If we start with the data, there are a lot of variables, and there have been a few talks so far across these two days on the challenges here.
2:58
There are a lot of variables in sample preparation, sample staining, scanning, and the various file formats that are produced, which can hinder reproducibility.
3:09
Validation and regulation are obviously really important if you want these algorithms to get anywhere near a patient, but few AI tools meet the bar for external validation and regulatory approval.
3:23
Interpretability and trust is a really big one.
3:26
Arguably, the more interpretable we can make our algorithms, the more trust we can earn and the more rapidly they'll be rolled out.
3:37
Bias and generalizability are a known challenge across precision medicine and imaging is no exception.
3:43
The risk is that narrow training data sets give you poor performance on more diverse populations. And then there's multiomic integration.
3:52
So there's been a lot of talk about multimodal AI.
3:55
How does pathology fit in?
3:56
How can we align the rich set of information that's hidden amongst the billions of pixels in our pathology images and get that to work with genomics, proteomics, lipidomics and so on, as well as the clinical data?
4:12
So computational pathology is a really broad term.
4:16
I'll be focusing on the use of foundation models.
4:19
So an imaging foundation model takes an image as input, a two-dimensional image, normally with three colour channels, although there's been a lot of talk about multiplex today and yesterday.
4:34
And it creates a representation, a distillation of that image into a set of features, which is sometimes called a feature vector but interchangeably called an image embedding.
4:45
This works at a tile level.
4:46
So I'm sure most of you know a whole slide image can consist of tens of thousands or even 100,000 tiles.
4:54
So if we apply a foundation model across our whole slide image, we can stack the tile embeddings that we generate, and you can see how the data is now in a much more readily workable format.
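To make that concrete, here's a minimal sketch of that tiling-and-embedding step in Python. A generic torchvision ResNet-50 stands in for a pathology foundation model such as Virchow2, and the `tiles` list is assumed to have already been cut from the slide:

```python
# A generic ResNet-50 stands in here for a pathology foundation model
# such as Virchow2; swap in the real model and its own preprocessing.
import torch
import torchvision.models as models
from torchvision.models import ResNet50_Weights

weights = ResNet50_Weights.DEFAULT
encoder = models.resnet50(weights=weights)
encoder.fc = torch.nn.Identity()          # keep the 2048-d features, drop the classifier
encoder.eval()
preprocess = weights.transforms()

def embed_tiles(tiles, batch_size=64):
    """tiles: list of RGB PIL images cut from one whole slide image."""
    embeddings = []
    with torch.no_grad():
        for i in range(0, len(tiles), batch_size):
            batch = torch.stack([preprocess(t) for t in tiles[i:i + batch_size]])
            embeddings.append(encoder(batch))
    return torch.cat(embeddings)          # shape: (n_tiles, embedding_dim)
```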
5:10
We can do this across a cohort of images.
5:13
So for a cohort of images, all of the same indication, we can generate those stacked tile embeddings.
5:18
And from there we can apply processes such as clustering.
5:22
So any clustering process will do the job.
5:25
We've used K-means clustering here, which lets us define the number of clusters upfront.
5:31
And what we find is that, because tile embeddings should be similar for the same tissue type or morphology, they naturally cluster together.
5:39
And once we've done the clustering, we can present tiles randomly from each cluster and ask a pathologist or a specialist to label the morphology or tissue type of that cluster.
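A minimal sketch of that clustering and labelling loop, assuming `cohort_tile_embeddings` is the stacked array of tile embeddings pooled across the cohort and the number of clusters is a choice made upfront:

```python
import numpy as np
from sklearn.cluster import KMeans

n_clusters = 16                                   # chosen upfront; 16 is an arbitrary example
kmeans = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0)
cluster_ids = kmeans.fit_predict(cohort_tile_embeddings)   # one label per tile in the cohort

# Sample a handful of tiles per cluster for a pathologist to label
rng = np.random.default_rng(0)
for c in range(n_clusters):
    members = np.flatnonzero(cluster_ids == c)
    sample = rng.choice(members, size=min(8, len(members)), replace=False)
    # show the corresponding tiles to a pathologist and record the label for cluster c
```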
5:51
That's what we've done here.
5:53
I've seen UMAPs a good number of times across these two days.
5:56
Here's another one.
5:58
So that's the laser here.
6:01
So in this case, there were two clusters that were labelled as containing immune cells, another two for acinar cells.
6:07
And you can see the kind of relation here on UMAP.
6:10
Up here at the top, you see the immune cell clusters, quite small.
6:15
They're adjacent to each other, as you would expect.
6:18
And the acinar clusters here and here, excuse the shaky hand.
6:23
So you expect clusters representing the same tissue to be adjacent.
6:27
It's interesting as well to see the adipose clusters at the extremes of the plot, which you might expect given that adipose tissue looks quite different to other tissues in a pathology image.
6:39
So that's a visual representation of what we've got so far.
6:42
But we can go one step further and we can generate something like a whole slide embedding.
6:48
So just as a tile level embedding is a condensed representation of the information contained in a single tile, we can generate a whole slide embedding which is a condensed representation of the information within the whole slide image itself.
7:05
And the way we can do this has been inspired by some work that was produced by Mahmood Lab and published last year.
7:13
We can take the stacked tile embeddings from an individual case.
7:17
We can take the cluster centroids, which we've generated by clustering on a range of cases.
7:22
And you can use those to initialise, or seed, a Gaussian mixture modelling process.
7:27
And the result of that is the whole slide embedding.
7:30
So it's a series of Gaussian distributions where each Gaussian distribution captures how that tissue type or morphology is presented in that whole slide case.
7:41
So it's a really rich set of information.
7:44
And it's worth pointing out the dimensions of the whole slide embedding: they're determined by the number of clusters, which was something that we chose upfront, and also by the size of the feature vector or tile embedding, which is a property of the foundation model that we've chosen.
7:58
So there are two things that we know and can define upfront.
8:01
Just to emphasise, it's not dependent on the number of tiles in your whole slide image.
8:07
So that's a really valuable property of the whole slide embedding that we'll come back to later on.
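Here's a minimal sketch of one way to build such an embedding with scikit-learn, seeding a Gaussian mixture with the cohort-level K-means centroids. The published recipe differs in its details, so treat this as an illustration of the idea rather than the exact method:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def whole_slide_embedding(slide_tile_embeddings, kmeans_centroids):
    """Fixed-size embedding for one slide from its stacked tile embeddings."""
    n_clusters, dim = kmeans_centroids.shape
    gmm = GaussianMixture(
        n_components=n_clusters,
        covariance_type="diag",
        means_init=kmeans_centroids,      # seed with the shared cohort centroids
        max_iter=25,
        random_state=0,
    )
    gmm.fit(slide_tile_embeddings)
    # Concatenate mixture weights, means and (diagonal) covariances.
    # The length depends only on n_clusters and dim, never on the tile count.
    return np.concatenate([gmm.weights_, gmm.means_.ravel(), gmm.covariances_.ravel()])
```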
8:15
Why use a pathology specific foundation model?
8:19
So the same lab, the Mahmood Lab, have done some great work on this recently and I'll try to summarise it here as best I can.
8:27
So they've taken a set of tiles for which the gene expression is known.
8:32
From those tiles they then extracted tile embeddings using the various foundation models you see on the X axis here and have used those tile embeddings to train a small model to predict the gene expression.
8:47
What we see on the Y axis is the correlation coefficient between the predicted gene expression and the actual gene expression.
8:53
So the higher the correlation coefficient, the better a job the foundation model has done of condensing that information.
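As an illustration of the style of probe being described, not the benchmark itself, a simple regression from tile embeddings to a single gene's expression, scored with a Pearson correlation, might look like this (`tile_embeddings` and `gene_expression` are assumed to be pre-computed arrays):

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from scipy.stats import pearsonr

X_train, X_test, y_train, y_test = train_test_split(
    tile_embeddings, gene_expression, test_size=0.2, random_state=0
)
probe = Ridge(alpha=1.0).fit(X_train, y_train)   # small linear probe on frozen embeddings
r, _ = pearsonr(probe.predict(X_test), y_test)
print(f"correlation between predicted and measured expression: {r:.2f}")
```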
9:03
So on the left here we have ResNet-50, which is perhaps the best known generic imaging foundation model.
9:11
As pathology specific foundation models have come into play, we can see an increased performance.
9:17
So these are ordered by average correlation coefficient across these genes or benchmarks.
9:23
So we can see an increase in performance culminating in the most recent foundation models such as Virchow2, UNI v1.5 and H-optimus.
9:34
This is a little bit dated, so there's maybe a few more that have come out since as well.
9:40
So these foundation models, these pathology specific foundation models have been trained on hundreds of thousands or sometimes millions of cases across different continents, lots of different scanners, sometimes different stains as well.
9:51
Most of these are H&E, but there's also IHC staining incorporated into the training data.
9:57
And there's evidence that by using a foundation model such as this as a backbone for a modelling process, it's much more generalizable than if you trained a model from scratch.
10:07
There's also evidence that you need far less data to get state-of-the-art results.
10:12
So pathology specific foundation models are really adding value.
10:18
These are some of the things that we can do with them.
10:20
So we kind of unlock features across image management, tile level analysis and slide level analysis.
10:26
And I'll go into some of these in some detail.
10:30
So if we take the cluster assignments that we've assigned to each of the tile embeddings for an individual case, we can map those back onto the original image.
10:42
And you can see this looks something like self-supervised annotation.
10:47
This could perhaps be used for macro dissection region proposal or other segmentation tasks that are quite laborious for pathologists to carry out.
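A minimal sketch of that mapping step, assuming `tile_coords` holds each tile's (row, column) position on the slide grid and `cluster_ids` its cluster assignment:

```python
import numpy as np
import matplotlib.pyplot as plt

def cluster_map(tile_coords, cluster_ids):
    """Paint each tile's cluster ID onto a low-resolution grid of the slide."""
    rows, cols = tile_coords[:, 0], tile_coords[:, 1]
    grid = np.full((rows.max() + 1, cols.max() + 1), np.nan)
    grid[rows, cols] = cluster_ids
    return grid

plt.imshow(cluster_map(tile_coords, cluster_ids), cmap="tab20", interpolation="nearest")
plt.axis("off")
plt.show()
```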
10:55
There's a lot of talk about the pathology workforce dwindling.
11:02
And if we can save them some time, that's a real bonus.
11:09
At the slide level, we can also do things like anomaly detection.
11:15
If we have a research project of several hundred or a few thousand cases, we can generate the whole slide embeddings upfront and look for outliers. Before we get going with our research project analysis, we can flag these for investigation and decide whether they should be included in the study cohort.
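One simple way to do that screening, sketched here with a standard outlier detector over a matrix of whole slide embeddings (`wse_matrix` and `slide_ids` are assumed to be pre-computed):

```python
from sklearn.ensemble import IsolationForest

detector = IsolationForest(contamination=0.02, random_state=0)   # ~2% flagged; tune to taste
flags = detector.fit_predict(wse_matrix)                         # -1 marks a suspected outlier
for slide_id, flag in zip(slide_ids, flags):
    if flag == -1:
        print(f"{slide_id}: flag for review before inclusion in the study cohort")
```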
11:35
So it's also fairly trivial to get the slide composition.
11:38
We just look at the proportion of each cluster within each slide, and you can imagine this could be used for perhaps filtering by immune response, sorting by immune response, and various other workflows.
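A minimal sketch of that composition calculation, assuming `per_slide_cluster_ids` maps each slide to its tiles' cluster assignments and `immune_clusters` holds the cluster IDs a pathologist labelled as immune:

```python
import numpy as np
import pandas as pd

def composition(cluster_ids, n_clusters):
    """Fraction of a slide's tiles falling in each cluster."""
    return np.bincount(cluster_ids, minlength=n_clusters) / len(cluster_ids)

comp = pd.DataFrame(
    {sid: composition(ids, n_clusters) for sid, ids in per_slide_cluster_ids.items()}
).T                                                  # rows: slides, columns: cluster IDs
immune_fraction = comp[list(immune_clusters)].sum(axis=1)
print(immune_fraction.sort_values(ascending=False).head())   # most immune-rich slides first
```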
11:52
But for me, the most exciting use of foundation models is in generating a whole slide embedding.
11:59
So whole slide embeddings help solve two really significant challenges in pathology image modelling.
12:05
One is the sheer scale of the data.
12:06
We're working with whole slide images of billions of pixels, gigabytes of data, and the other is the variability in that scale.
12:14
We can have small biopsies in the same cohort as large resections, and machine learning models need a fixed size input.
12:23
Whole slide embeddings give us that.
12:24
They give us a condensed fixed size representation of each image, regardless of the size of the imaged tissue.
12:34
They also represent a huge amount of compression, so a whole slide embedding is several megabytes.
12:40
So we've distilled gigabytes worth of data into a really representative set of features, which is, you know, a thousandth of the size.
12:52
Whole slide embeddings are also really amenable to being incorporated or integrated with multiomic data and with clinical data to enable that multimodal modelling approach.
13:06
I've just mentioned multimodal modelling, but to show the undiluted utility of a whole slide embedding, I've used only whole slide embeddings for this modelling task.
13:17
So we've taken an open, public data set, TCGA, with 196 cases.
13:28
So we've got the H&E images, and we've got a clinical metadata.
13:31
We've just used the survival data from the clinical metadata.
13:33
We've not used anything else here.
13:36
We've generated tile level embeddings first using Virchow2 as our foundation model.
13:42
We've then used the K-means clustering process and Gaussian mixture modelling processes I described previously to generate whole slide embeddings.
13:50
From there we've taken the first 30 principal components and trained a Cox proportional hazards model using scikit-survival.
14:02
So I'll pause here and emphasise a few things.
14:05
This is just the pixel-derived data, so we're taking only the digital pathology images and the whole slide embeddings generated from them to build this model.
14:17
Another thing worth noting is that the whole slide embeddings are versatile.
14:20
So we've used them here for survival modelling, for prognosis, but they could equally be used for various other prognostic tasks or regression tasks or classification tasks.
14:30
They're really versatile.
14:31
And then finally, I just want to emphasise how trivial or straightforward the analysis is once those whole slide embeddings have been generated.
14:40
We've done principal component analysis, which is pretty straightforward to do, and then trained a Cox proportional hazards model using a freely available and widely used tool.
14:51
We've done this in Python.
14:52
It could equally be done in a language of your choice.
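For anyone who wants to reproduce the shape of this, here's a minimal sketch of the PCA and Cox steps with scikit-survival. `X` is assumed to be the matrix of whole slide embeddings, with `event_observed` and `survival_days` coming from the clinical metadata; in practice you'd fit on a training split and evaluate on a held-out test set:

```python
from sklearn.decomposition import PCA
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.util import Surv

X_pca = PCA(n_components=30, random_state=0).fit_transform(X)   # first 30 principal components
y = Surv.from_arrays(event=event_observed.astype(bool), time=survival_days)

cox = CoxPHSurvivalAnalysis(alpha=0.1)    # small ridge penalty for numerical stability
cox.fit(X_pca, y)
risk_scores = cox.predict(X_pca)          # higher score = higher predicted risk
```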
14:58
So on to the results, which we've generated by stratifying the cohort by the model risk score.
15:08
So low and high risk are split at the 50th percentile of the score.
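Continuing from the previous sketch, that stratification, the Kaplan-Meier curves and a log-rank test might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from sksurv.nonparametric import kaplan_meier_estimator
from sksurv.compare import compare_survival

high_risk = risk_scores >= np.median(risk_scores)                # 50th percentile split
for label, mask in [("low risk", ~high_risk), ("high risk", high_risk)]:
    t, s = kaplan_meier_estimator(y[mask]["event"], y[mask]["time"])
    plt.step(t, s, where="post", label=label)

chisq, p_value = compare_survival(y, high_risk)                  # log-rank test
plt.xlabel("time (days)")
plt.ylabel("survival probability")
plt.legend()
plt.title(f"log-rank p = {p_value:.3g}")
plt.show()
```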
15:12
And this is the training set.
15:13
So we see a really clear separation on the Kaplan-Meier curves and a very statistically significant P value, and that holds up on the held-out test set.
15:23
So we see a really clear separation.
15:26
The P value is not quite as spectacular, but still very statistically significant.
15:31
So hopefully I've conveyed the power of the signal that we've got in our digital pathology images, even using them in isolation.
15:42
And that whole slide embeddings are a really great way to work with them. To loop back to roughly where I started, and to what I've heard a lot over the past couple of days: the future of precision medicine is multimodal.
15:58
And I think whole slide embeddings are the best approach I've seen yet to incorporating the rich set of information we've got within our digital pathology images into that multimodal future.
16:10
Earlier this year, ArteraAI received the first FDA de novo authorization for multimodal AI in digital pathology.
16:17
That's for prostate cancer.
16:18
That's a real milestone for the entire field and arguably de risks working in this area.
16:24
For everyone else, the regulatory path has been trod.
16:29
So lastly, I'd like to finish up just by introducing my colleagues Hamzah and Chris.
16:34
Thank you for joining.
16:36
So please do come and speak to any one of us if you'd like to take a deeper dive and learn more, and I look forward to any questions.