0:01
OK, for our last talk before lunch, we have Ragavi Shanmugan and Agni Sinha.
0:10
Ragavi is a Data Architect at Zifo R&D Solutions with a decade of experience in bioinformatics.
0:18
She has previously worked as a bioinformatician decoding drug resistance and infectious diseases at Gilead and SDSU Research Foundation.
0:28
Her expertise lies in integrating data, technology and science to streamline scientific infrastructures and engineer knowledge to accelerate research.
0:39
Agni is a bioinformatics analyst at Zifo Technologies who specialises in next-gen omics and its intersection with machine learning.
0:48
He has a background in machine-learning-aided drug discovery, having previously worked at Schrodinger Inc, and has a Master’s in Bioinformatics from Georgia Tech.
1:09
Good afternoon everyone.
1:10
Thank you for coming to our talk.
1:13
I hope you are enjoying the conference and learning all the fun ways to study biology.
1:19
So I love hiking right?
1:21
But the problem with hiking is that you have to carry this heavy backpack, and you have to make sure you have enough food and water and everything you need for the hike, right?
1:35
And I hate packing that backpack even more, because carrying is one thing, but you also have to make sure you have all the items you need at the optimal weight.
1:46
So most of the time when I hike in California, I always wish I had a Sherpa.
1:55
So we are like the Sherpas of drug discovery.
2:02
So, you know, we can help you carry half of the load on your research hike, and then you can go and make the discoveries you really want to make.
2:13
So we have two parts to this talk.
2:17
In the first part, we'll be talking about how we can effectively store and manage the data, and how we can visualise the data for transcriptomics: single-cell and spatial transcriptomics.
2:31
And then the second part, Agni will be talking about how we can integrate this data.
2:38
So with these single-cell and spatial technologies, it's not just that the technology is getting more and more complex; the data is also getting complex.
2:50
So when we had bulk transcriptomics alone, the data just came out as a CSV or TSV file.
2:56
You just got expression matrices that you could even put in Excel and look at.
3:02
But with single cell, we added more layers to it, right?
3:06
So it has turned into things like Seurat or AnnData objects.
3:10
So it has become multidimensional matrices, and then spatial technology came in, and we are now adding images to it.
3:19
So the data has become multimodal and even more complex, so traditional systems, for example RDBMSs, are becoming obsolete for storing and managing this data.
3:31
So in our company, we were trying to understand how we can accelerate spatial research, or spatial data analysis.
3:42
And we identified three major challenges with the data analysis.
3:46
You know, the domain itself is evolving.
3:49
And as you can see in this conference, you know, there are interesting discoveries to be made from this data.
3:56
And the data is so complex that even though the scientists are highly knowledgeable in biology, they find it hard to understand and work with this data.
4:09
And then the technology is also multidimensional.
4:12
So if you have legacy systems, they are not able to handle it.
4:17
So we wanted to try out a few storage methods, and as part of that, we identified a couple of technologies that can help with it.
4:28
So Zifo as a company has partnered with multiple data and storage platform vendors, and TileDB is one of our partners.
4:39
We picked TileDB because it is an array-based storage system, and it kind of mimics how your AnnData or Seurat object is stored.
4:51
Otherwise, we are vendor agnostic.
4:52
We work with all kinds of platform providers. The other technology we picked to test for storage is Parquet, because Parquet has been a friend to bioinformaticians and comp bio people for a long time.
5:08
So we wanted to compare how Parquet can be used for spatial storage versus TileDB.
5:18
So here is what we observed.
5:19
Parquet, as you know, has traditionally been used for archival.
5:23
But when you have to query from Parquet, you need to add another layer to it.
5:27
So we added what is called DuckDB on top of Parquet to make the data in Parquet searchable.
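To give a flavour of that layer, here is a minimal sketch of querying Parquet through DuckDB in Python; the file layout and column names are illustrative assumptions, not our actual schema.

```python
import duckdb

# Query flattened expression data straight out of Parquet files.
# 'expression/*.parquet' and the columns below are placeholders.
con = duckdb.connect()
df = con.execute("""
    SELECT cell_id, gene, expression
    FROM read_parquet('expression/*.parquet')
    WHERE cell_type = 'T cell' AND expression > 1.0
""").df()
print(df.head())
```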
5:36
With TileDB, on the other hand, you can ingest the h5ad files as they are, and it stores them in a special object storage format called SOMA objects; the SOMA objects pack the data and help you query it within TileDB.
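A minimal sketch of that flow, assuming the tiledbsoma Python API; the experiment URI, the cell_type filter, and the X layer name are placeholders.

```python
import tiledbsoma
import tiledbsoma.io

# Ingest an h5ad file into a SOMA experiment on disk.
uri = tiledbsoma.io.from_h5ad(
    "soma_experiment",
    input_path="dataset.h5ad",
    measurement_name="RNA",
)

# Query cells by metadata and pull the slice back out as AnnData.
with tiledbsoma.Experiment.open(uri) as exp:
    with exp.axis_query(
        measurement_name="RNA",
        obs_query=tiledbsoma.AxisQuery(value_filter="cell_type == 'B cell'"),
    ) as query:
        adata = query.to_anndata(X_name="data")
```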
5:57
So we ingested large datasets, and we benchmarked how Parquet versus TileDB can be used for spatial storage.
6:07
And we have seen advantages and disadvantages with both of these systems, with Parquet being particularly interesting with respect to ingestion.
6:18
But ingestion, even when you're handling hundreds of datasets, is probably a one-time effort, or a week's effort.
6:26
You'll probably be consuming data from these systems a lot more than you're ingesting data into them.
6:33
With that assumption, we are not showing all the benchmarking here, but our datasets ranged from 40 MB to 20 GB, and we had around 7 million cells at the maximum, or up to 18 million cells, I guess.
6:49
So we were querying data from different datasets, as you would if you're trying to do a meta-analysis or combine different public data.
7:00
The metadata is also important in this research.
7:03
So you have to make sure all the metadata is properly captured and properly queryable, and that you're getting the right dataset for your research.
7:11
So we had a lot of different criteria for querying from the system, and we were able to benchmark against those metrics.
7:19
And we found that you can choose either Parquet or TileDB, but with respect to performance, TileDB performed better for us for the single-cell transcriptomic data.
7:34
OK, storing and retrieving data is one thing; the next step is presenting it in a way that the scientists can understand.
7:42
Because visualisation matters: even in the previous talk there were some amazing visualisations, since the way you plot your data gives you more insight than looking at your raw data, right?
7:58
So this is for one of our customers, where the comp bio team was overwhelmed with ad hoc requests.
8:05
So they wanted us to build a very generic platform with a lot of visualisations, and they even wanted to see both bulk and single-cell data.
8:16
And they wanted to see whether they could salvage some bulk transcriptomics data for single-cell analysis, and so on.
8:22
So we worked with them to identify what kind of plots they need.
8:27
And this was across different target classes, too.
8:30
So we had to come up with a set of plots that would help the scientists view the data for their research purposes.
8:41
And the nice thing about it is it was highly interactive.
8:45
So different sets of users are able to use it.
8:49
For example, a data scientist will probably be looking at a different set of plots than, say, a VP or director who just wants to see how many experiments there are and what kinds of samples there are.
9:03
So this platform was supposed to serve all kinds of users across this organisation.
9:09
So as you can see, we had everything from a simple dashboard to GSEA or UMAP kinds of plots, to serve a wide range of users.
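As a rough flavour of one such view, here is a minimal scanpy sketch producing a UMAP coloured by cell type; the actual platform was a custom interactive application, and the file name and obs column here are assumptions.

```python
import scanpy as sc

adata = sc.read_h5ad("dataset.h5ad")  # placeholder file

# Standard embedding pipeline: neighbours graph, then UMAP.
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.umap(adata)

# Colour the embedding by an annotation column, e.g. cell type.
sc.pl.umap(adata, color="cell_type")
```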
9:20
So far, I've only been talking about individual datasets.
9:26
The next step for research is combining all these datasets together, right?
9:31
There are different ways to combine or integrate these datasets.
9:36
You can do it through modelling or knowledge-graph kinds of applications, or you can use machine learning methods, which we'll talk about in a very short time.
9:46
With this knowledge graph, we tried to combine not just multi-omics data; we wanted to see whether we could include drug data and clinical data in it too.
9:56
Then you have a more holistic picture of what is happening, because, as I heard in this conference, biology is a team sport.
10:04
If you're just looking at individual data, you are probably missing out on a lot of other information.
10:10
So this knowledge graph is again an internal effort, trying to bring in all the publicly available data that we could.
10:18
So we included GWAS, transcriptomics, and microbiome data on the omics side, and then we also included some systems biology databases, along with a drug database, the DrugBank information.
10:31
And then we used Neo4j to see how disease comorbidities are connected.
10:39
So we looked into the link between obesity and diabetes, and between diabetes and coronary heart disease.
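A sketch of the kind of query behind that, via the official Neo4j Python driver; the node labels, relationship types, disease names, and credentials are hypothetical, not our actual graph schema.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Genes associated with both diseases are candidate links behind the comorbidity.
query = """
MATCH (d1:Disease {name: $a})-[:ASSOCIATED_WITH]->(g:Gene)
      <-[:ASSOCIATED_WITH]-(d2:Disease {name: $b})
RETURN g.symbol AS gene
"""
with driver.session() as session:
    for record in session.run(query, a="obesity", b="type 2 diabetes"):
        print(record["gene"])
driver.close()
```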
10:46
And we were getting some interesting hits on the genes, even just from the public data; we didn't use any patient data.
10:52
We don't generate any patient data.
10:55
So it was an interesting case study too.
10:59
And now I'll hand over to Agni to talk about other ways of integrating the data.
11:03
Thank you.
11:11
Thanks Ragavi, I really liked your analogy.
11:14
I can't come up with my own so I'm just going to steal yours.
11:18
So, the hiking bag has been packed for you. Now what?
11:23
There are various routes you can take.
11:25
You can take one specific route, you can take another.
11:28
Or maybe just combine them, right?
11:30
There have been lots of talks about multi-omics and integration, and this is just a brief look at a few integration methods, to compare them and see what is there and what can be done.
11:46
So there were roughly four aims.
11:48
One was understanding broadly how multi-omics integration is done: what methods and use cases there are, and what's available.
11:58
And secondly, we wanted to see if we can leverage machine learning to maybe accelerate or enhance this.
12:04
Specifically, we thought about LLMs because of their multimodality, their ability to understand complex information, and their huge feature space.
12:14
And again, we wanted to have a demonstrable version of this code, as well as benchmarks for the LLM that we were using.
12:23
So we mostly worked with real-world datasets from the public domain. One was the BMMC dataset, and another one was a flu dataset.
12:36
So again, I know a lot of people here have spoken about scGPT.
12:42
So it's pretty common.
12:43
It's widely known.
12:46
It's trained on the CELLxGENE Atlas.
12:49
So it's trained on about 33 million data points.
12:54
And we took it as a standard benchmark because we felt it was very representative of the direction large language models for genomics are going.
13:06
And we just wanted to compare it with gold-standard methods like WNN or MOFA, for integration specifically, and see if there's anything special that comes out of leveraging large language models for this particular task.
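For reference, one way to run a WNN-style baseline in Python is through muon; this is a sketch assuming a MuData file with per-modality PCA already computed, not our exact pipeline.

```python
import muon as mu
import scanpy as sc

# Placeholder file: a MuData object with preprocessed 'rna' and 'prot'
# modalities, each with PCA already run.
mdata = mu.read("citeseq_preprocessed.h5mu")

mu.pp.neighbors(mdata, key_added="wnn")  # weighted nearest neighbours across modalities
sc.tl.leiden(mdata, neighbors_key="wnn", key_added="leiden_wnn")
mu.tl.umap(mdata, neighbors_key="wnn")
```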
13:24
So a lot of these simpler models are algorithmic.
13:28
They find linear relationships.
13:30
We wanted to test the broader understanding of LLMs.
13:34
Can they find identities and relationships that these can't?
13:39
What do those look like?
13:41
Yeah, so this was the rough roadmap that we followed, right?
13:48
Really simple, around five aspects, right?
13:52
So we worked with CITE-seq data; again, it's been covered well here, right?
13:57
So it's a combination of gene and protein expression data, right?
14:00
So we worked on ingestion and, after that, preprocessing, which involves dimensionality reduction and normalisation.
14:09
With CITE-seq data, the correlation between the RNA and the ADT data is really important, because the RNA data will tell you what the cell does, but the protein data will tell you what's actually happening.
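A minimal preprocessing sketch along those lines with scanpy and muon; the file name and parameter choices are assumptions.

```python
import scanpy as sc
import muon as mu

# muon splits a 10x CITE-seq run into 'rna' (gene expression) and
# 'prot' (antibody capture) modalities; the file name is a placeholder.
mdata = mu.read_10x_h5("citeseq_filtered_feature_bc_matrix.h5")
rna, adt = mdata.mod["rna"], mdata.mod["prot"]

# RNA: library-size normalisation, log transform, dimensionality reduction.
sc.pp.normalize_total(rna, target_sum=1e4)
sc.pp.log1p(rna)
sc.pp.highly_variable_genes(rna, n_top_genes=2000)
sc.pp.pca(rna, n_comps=50)

# ADT (protein): centred log-ratio transform, the usual choice for CITE-seq.
mu.prot.pp.clr(adt)
sc.pp.pca(adt, n_comps=20)
```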
14:29
So the model we used was scGPT, and we performed something called fine-tuning.
14:36
This is currently the best way to use large language models in this space, because it is really costly to train a model from scratch.
14:45
So instead, we give the model new information and tweak it, teaching it to work on that new information.
14:54
It's like someone who's seen oranges his entire life and knows an orange is a fruit; you give him an apple, and he makes the connection, right?
15:03
So we wanted to see how scGPT, which had not been trained on CITE-seq, performed for CITE-seq integration.
15:10
And we wanted to see how this compared to those other methods.
15:16
So like I said, we had hyperparameter tuning.
15:19
We performed a four-fold cross-validation over the hyperparameters listed here: learning rate, batch size, and epochs.
15:26
And then we did tokenisation and fine-tuning, and then we compared the results with MOFA and WNN, based on roughly two metrics.
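Schematically, the search looked something like this; fine_tune_and_score and cells are hypothetical stand-ins for the actual scGPT fine-tuning run and the dataset.

```python
from itertools import product

import numpy as np
from sklearn.model_selection import KFold

# Grid of hyperparameters to search over (values are illustrative).
param_grid = {"lr": [1e-4, 1e-5], "batch_size": [32, 64], "epochs": [5, 10]}
kf = KFold(n_splits=4, shuffle=True, random_state=0)

best_params, best_score = None, -np.inf
for lr, bs, ep in product(*param_grid.values()):
    # fine_tune_and_score is hypothetical: fine-tune on the train fold
    # and return the integration score on the held-out fold.
    scores = [
        fine_tune_and_score(cells[train_idx], cells[val_idx],
                            lr=lr, batch_size=bs, epochs=ep)
        for train_idx, val_idx in kf.split(cells)
    ]
    if np.mean(scores) > best_score:
        best_params, best_score = (lr, bs, ep), float(np.mean(scores))
```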
15:39
So again, like I said, two important metrics were used: average BioScore and the number of cell types identified.
15:47
So average BioScore is a clustering metric.
15:50
It tells you how well the integration has happened; the number of cell types identified is compared with the ground truth from the literature, just to see the power of the models themselves.
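If the average BioScore here follows the AvgBIO convention from the scGPT paper, that is, the mean of NMI, ARI, and a rescaled cell-type silhouette width (an assumption on our part), it can be sketched with scikit-learn:

```python
import numpy as np
from sklearn.metrics import (
    adjusted_rand_score,
    normalized_mutual_info_score,
    silhouette_score,
)

def avg_bio(embedding, cluster_labels, cell_type_labels):
    """Mean of NMI, ARI, and cell-type ASW rescaled to [0, 1]."""
    nmi = normalized_mutual_info_score(cell_type_labels, cluster_labels)
    ari = adjusted_rand_score(cell_type_labels, cluster_labels)
    asw = (silhouette_score(embedding, cell_type_labels) + 1) / 2
    return float(np.mean([nmi, ari, asw]))
```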
16:08
So we found that scGPT roughly outperformed all of these, specifically in cell-type identification.
16:16
And this is to be expected because of the enhanced feature space that scGPT has.
16:23
It essentially makes connections based on prior existing data.
16:29
It's like how, let's say in spatial biology, you always validate your hypothesis with a pathologist, because he's been seeing this data and he can identify patterns.
16:42
So LLM is like that.
16:43
It's trained on a huge corpus of information, right?
16:47
So we wanted to see how it could leverage this, while of course comparing the average BioScore to make sure it stays robust on that metric.
17:02
And again, as you can see here, these are the clusters we got.
17:10
And again, these are the benchmarks and the results, the heatmaps that we got.
17:15
So after cross-fold validation, we reached about 0.75 in terms of average BioScore.
17:23
And that was for both datasets that we tried, and across the various folds of the data.
17:37
And yeah, just to summarise from the start: Ragavi spoke about what you do with the data.
17:46
How do you store it?
17:47
How do you query it?
17:48
How do you scale it?
17:49
And this is one look into what you could do with the scaled up data.
17:53
How would you integrate it, and why would you choose a smaller model versus a larger model, or a larger model versus a smaller one?
18:03
What are the things that LLMs can do specifically for this kind of data that you cannot do without LLMs?
18:12
And this is just a summary of what was said and how we feel.
18:21
This can accelerate science, right?
18:22
So: increase predictive power, identify cell types; and, on the storage side, Parquet and TileDB.
18:30
So you're going to generate huge amounts of information by conducting your experiments, right?
18:34
You're going to need a scalable and efficient way to store and query them for your own models later on, right?
18:41
So you need access.
18:43
So architecture is really important when you're going to deploy at that scale.
18:49
There are also applications of large language models beyond this.
18:52
These kinds of models can be used for perturbation analysis, to reduce the number of samples needed for knockout studies, and there are applications of reasoning models for things like text mining and mechanism-of-action detection.
19:09
So we live in a very weird time, in which we can be very optimistic about the future.
19:20
Thank you.