Summary
When it comes to analysing multiomics data, there are three main challenges: the volume, the complexity, and the integration of different data modalities. Bioinformaticians usually chain together a combination of tools to conduct the analysis, and the challenge is to transform fast-moving R&D in multiomics into robust, scalable, and reusable operational solutions.
Since meaningful analysis requires long chains of interconnected tools, the work is rarely structured or reproducible when handled solely through ad hoc research practices. To cope with increased workload, complexity, and ad hoc tasks, the team at Data Intuitive advocated for more standardisation and modularity.
This presentation examined four key themes. First, the gap between exploratory research and stable operations: researchers often prototype methods in computational notebooks, which are powerful for exploration but unsuitable for long-term, automated workflows. Therefore, translating these prototypes into robust modules requires the right tooling and processes.
The second theme is reuse. As multiple workflows develop, OpenPipelines, high-throughput pipelines, and spatial workflows increasingly overlap, which makes maintenance more complicated. The talk explained that without a shared module system, teams fall into copy-and-paste maintenance traps. Data Intuitive instead suggests a centralised, dependency-based toolbox that enables true reuse and reduces duplication.
Thirdly, reproducibility: containerisation is necessary but insufficient. The speaker commented that versioning, compatibility management, and careful handling of bug fixes versus new features are essential to ensure analyses remain reproducible over time.
The final theme is ease of use and adaptability: workflows must be adaptable to changing user needs without constant forking or rewriting. Wrappers around validated workflows allow new interfaces, integrations, and input formats while keeping the core logic stable. Overall, the talk emphasised building flexible, modular, and reproducible systems, drawing on IT and software engineering practice, so that rapid R&D can scale into dependable, maintainable, and reusable operational pipelines.
Transcript
0:03
Good afternoon, I will talk about how to turn R&D into solutions that you can build on.
0:12
First, a few words about myself, only a few.
0:15
My life for the most part is quite boring, but I want to pinpoint one specific thing here.
0:21
I spent more than 10 years in and around IT organisations and this has taught me a lot, and I will come back to that later in my presentation.
0:31
Now, first things first: this conference is about omics data, with the focus on omics, and in my case on data as well, because I'm mostly focused on and interested in the data analysis part.
0:43
Now, when we're talking about omics data, there's three main challenges that we have to deal with.
0:49
First of all, there's the volume.
0:52
Usually, whatever omics modality we're talking about, it comes with a lot of volume, right?
1:00
Second is complexity.
1:01
And I quite like this illustration in a paper by Krassowski et al. because it quite nicely illustrates the complexity of the types of data that we're working with.
1:11
And then the third big factor is when we start to integrate different modalities.
1:20
It's not only that we have to deal with the volume and the complexity, multiplied for every modality; it's also that we want and need to integrate these different modalities.
1:30
So the complexity still rises considerably.
1:36
So these three main challenges are something that we have to keep in mind.
1:40
Now, typically what happens in bioinformatics when we want to effectively analyse this type of data, it's not just one tool that you should run in order to do the analysis part.
1:52
It's usually a combination of tools.
1:55
This is a conceptual overview of a workflow that we have built.
2:00
And I will come back to the workflow.
2:01
It's OpenPipelines.
2:03
It's a conceptual overview.
2:04
Why?
2:04
Because every step in this workflow is rather a group of steps.
2:08
It's not just one individual step.
2:11
So as I pointed out already, this is part of OpenPipelines.
2:15
You can refer to the QR code if you're interested.
2:19
And just to zoom into the last part of this conceptual overview, you will see that already more than 10 steps are involved in this one specific part of the workflow.
2:28
So imagine that in total, there are more than 150 components that you can potentially run in the scope of this OpenPipelines workflow.
2:37
Now, typically there's a couple of people that are involved in analysing this type of data.
2:44
You see a few on the slides.
2:47
And I think it's fair to say that if you take a look at what type of work they do, how they handle these tasks, it's not always in a very structured and very robust way.
2:59
I think it's fair to say that these people are typically more on the research side of things rather than on making things operational and robust.
3:13
And this typically reminds me of when I was in IT, remember my earlier career: the business requirements, the questions from customers.
3:27
Things got more and more complex over time.
3:30
And it just wasn't possible anymore to cope with this load, with this workload and complexity just by doing things manually ad hoc with a bunch of people.
3:40
And so what was needed at that time was more standardisation and modularity in order to get the complexity under control.
3:49
So in the end, what you want to end up with is this kind of situation, let's say, with big thanks to ChatGPT.
3:58
Now, let's dive into the matter and take a further look at what kind of challenges we're encountering, what kind of topics are relevant in this scope, in this context.
4:10
And there's a couple that I put on the slide.
4:12
We're not going to tackle them all.
4:13
We're just going to zoom into four of those, the green ones.
4:17
And let's start with R&D versus operations.
4:22
I only quickly touched upon that, but let me start from an anecdote here.
4:28
Two years ago, we were asked to implement an scGPT tool.
4:33
It was a tool that was just published.
4:35
So, a GPT approach to annotating genes in single-cell data.
4:40
OK, interesting.
4:41
And we were told that the researcher who came up with the method had already developed it completely and that it was ready to go.
4:49
And we just had to implement it in the workflow, the OpenPipelines workflow.
4:54
We started working with the researcher, and it turned out that
4:56
she had created the process in a computational notebook.
5:02
Now, for those who do not know, a computational notebook is a very powerful tool for a researcher to dive into an analysis, connect to the data, see what the results are, create a plot, a table, whatever.
5:14
A very powerful tool.
5:15
It's just not something that you can take and just put in a workflow and think it will run forever in a robust way.
5:21
It's not meant for that.
5:23
It's meant for interactive work.
5:25
So what we had to do was take a look at that code, work with the researcher, turn it into a module, and add that module to our workflow.
5:34
Now, it's only possible to do that kind of transformation quickly if you have the proper tool set and the proper platform to do it.
5:44
In the end, we were able to quite quickly create an scGPT annotation module out of this, which has since been included in our OpenPipelines workflow.
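To make this concrete, the sketch below shows the general shape of such a module: the notebook logic becomes a function with explicit inputs and outputs behind a command-line interface, so a workflow engine can call it non-interactively. The names, arguments, and file handling are purely illustrative and not the actual OpenPipelines scGPT module.

```python
# Minimal, hypothetical sketch of turning notebook logic into a stand-alone module
# with an explicit command-line interface (illustrative only).
import argparse
import shutil


def annotate(input_path: str, output_path: str, model_path: str) -> None:
    """Placeholder for the annotation logic that originally lived in the notebook."""
    # A real module would load the data, apply the model at model_path, and write
    # the annotated result; here we simply copy the input through as a stand-in.
    shutil.copyfile(input_path, output_path)


def main() -> None:
    parser = argparse.ArgumentParser(description="Annotate single-cell data (sketch)")
    parser.add_argument("--input", required=True, help="path to the input data file")
    parser.add_argument("--output", required=True, help="path to write the annotated output")
    parser.add_argument("--model", required=True, help="path to the pretrained model")
    args = parser.parse_args()
    annotate(args.input, args.output, args.model)


if __name__ == "__main__":
    main()
```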
5:57
So again, there is a big gap between the researcher who has to innovate, who has to work quickly to come up with new scenarios, new prototypes, new ways of working on the left-hand side.
6:08
But on the other hand, we often also need robust operations.
6:12
We want to be able to depend on whatever has been developed on the left-hand side, and there's a gap in between.
6:19
And somehow, if you don't think it through and if you don't have the tool set and the processes, it's really hard to bridge that gap.
6:28
To put it differently, if you want to run at the scale that we did for Tabula Sapiens version 2, for instance, and this is just a subset of that data set, just to be clear, you don't do this from a notebook.
6:40
You have to have different tools to run these types of data sets through your analysis.
6:48
Second topic, reuse.
6:51
All right, let's start with the story on the left hand side.
6:55
There's three conceptual workflows here.
6:57
The left one is basically a pointer to OpenPipelines.
7:01
Again, quickly after we started developing OpenPipelines, we also started developing a high-throughput workflow.
7:07
And I'll come back to the high throughput workflow later.
7:10
But just for now, know that this is the second workflow that we started working on quickly after the first one.
7:16
And we developed the modules for OpenPipelines and also for the high-throughput pipeline.
7:20
And we quickly noticed that some of those modules are actually shared.
7:24
It's the same module with some different parameterisation, some under the hood changes, but basically the same module.
7:31
At that time, we didn't have the tool set yet to be able to share modules across workflows.
7:36
So we had to copy paste the module to the other workflow.
7:39
But of course there's two versions of the same thing that you have to maintain, which makes life harder.
7:46
Later this year we started creating a spatial single cell workflow and it quickly turned out that lots of the downstream components and modules were actually the same or could be reused from the OpenPipelines workflow.
8:02
If we didn't have the tool set, we would have had to copy-paste all these modules to our new workflow, which of course would be a big pain to maintain.
8:11
Luckily, by then we had our platform, we had our tool set.
8:15
So what we are currently doing is basically we have this common set of modules that are used in various places and we just create a dependency to this toolbox.
8:28
And in terms of the spatial workflow, we don't copy paste any modules whatsoever.
8:34
We just point to them as a dependency.
8:36
We point to whatever already exists in OpenPipelines, which is of course a much better approach to deal with these things.
8:46
In other words, be careful if you're doing R&D, just make sure that you can reuse it, that you can effectively use it, that it doesn't go to waste.
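As a rough illustration of the dependency idea, the following sketch (hypothetical function names, not the actual OpenPipelines tooling) shows two workflows consuming the same shared module instead of each carrying its own copy:

```python
# Minimal, hypothetical sketch of module reuse via a shared toolbox
# (illustrative only; not the actual OpenPipelines dependency mechanism).

def normalise_counts(counts: list[float]) -> list[float]:
    """A module shared by several workflows instead of being copy-pasted into each."""
    total = sum(counts) or 1.0
    return [c / total for c in counts]


def high_throughput_workflow(counts: list[float]) -> list[float]:
    # Depends on the shared module...
    return normalise_counts(counts)


def spatial_workflow(counts: list[float]) -> list[float]:
    # ...and so does this one: one implementation, maintained in one place,
    # fixed once when a bug is found.
    return normalise_counts(counts)


if __name__ == "__main__":
    print(high_throughput_workflow([2.0, 3.0, 5.0]))
    print(spatial_workflow([1.0, 1.0]))
```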
8:58
Third topic, reproducibility.
9:00
This is a nice one.
9:02
It's all about containers, right?
9:06
I was talking to a bioinformatician some time ago and I asked him, well, what are you doing in terms of reproducibility?
9:12
And he told me, ah, OK, well, we have the code of our notebooks in source control, so that's fine.
9:18
But basically, well, we should really create containers for all of our work.
9:22
I said, OK, that's a good start.
9:26
It's a necessary condition.
9:27
It's not a sufficient condition.
9:29
It only starts with creating containers.
9:31
Let me illustrate that.
9:33
Let's say the following scenario: you created version 1 of a workflow; it contains four modules, four steps, a simple workflow.
9:41
Because it's all sequential, every step has its own container.
9:46
Now be careful: you have to keep versions of the containers, because if a container is gone, you cannot reproduce the workflow anymore.
9:55
So it's important to keep versioned containers as well, but then you want to create version 2.
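As a toy illustration of what versioned containers per step could look like (hypothetical image names, tags, and registry; a real setup would pin actual digests):

```python
# Hypothetical manifest pinning one container image per workflow step for version 1
# of the workflow, so the exact environments can be retrieved and re-run later.
WORKFLOW_V1_CONTAINERS: dict[str, str] = {
    "module_a": "registry.example.com/module-a:1.0",
    "module_b": "registry.example.com/module-b:1.0",
    "module_c": "registry.example.com/module-c:1.0",
    "module_d": "registry.example.com/module-d:1.0",
}


def container_for(step: str) -> str:
    """Resolve the pinned container image for a given step of workflow version 1."""
    return WORKFLOW_V1_CONTAINERS[step]


if __name__ == "__main__":
    print(container_for("module_b"))
```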
10:00
Why?
10:01
Well, two reasons.
10:02
First of all, people recognised that with certain data sets there's a bug in module B.
10:08
So it doesn't work in some cases.
10:10
So there needs to be a bug fix.
10:12
OK, fair enough.
10:13
And also we would like some module E at the end because perhaps there's some nice reporting that we would add to this workflow.
10:20
OK, nice, interesting.
10:21
Now, it turns out that the output of the new, bug-fixed module B is not compatible with module C anymore.
10:32
Now you might say, well, OK, this seems a bit far fetched.
10:34
Well, I can assure you I've encountered these situations a few times over the last years, so it's not as far-fetched as you might think.
10:41
And so now the question is what to do now?
10:44
Well, the answer is let's take a step back and see how these kinds of things have been tackled in different IT organisations, but also in computer science in general and just look at this with a different perspective.
10:57
And then one thing you should notice is that B prime, the new version of B, adding that bug fix, is a completely different thing from adding a module at the end of your pipeline.
11:09
This is new functionality.
11:10
You add something to the workflow.
11:12
The other one is a bug fix.
11:14
So let's handle it as a bug fix.
11:15
And let's call that version 1.1, not 2, because it doesn't add any functionality.
11:20
But then how do I cope with the incompatibility of B prime?
11:23
Well, let's add a module that only transforms the output of B prime to whatever C uses or needs as input.
11:32
The end result will be the same, and the issue has been resolved in version 1.1.
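A minimal sketch of that adapter idea (hypothetical data shapes, not tied to any real module) makes the point: B prime and C stay untouched, and a small conversion step in between restores compatibility.

```python
# Hypothetical illustration: the bug-fixed module B' now emits a mapping keyed by sample,
# while module C still expects a flat list of (sample, value) tuples.
# Instead of changing B' or C, a small adapter module restores compatibility in v1.1.

def module_b_prime(samples: list[str]) -> dict[str, float]:
    """Bug-fixed module B': new output format (a mapping instead of a list)."""
    return {s: float(len(s)) for s in samples}


def adapter_b_to_c(b_output: dict[str, float]) -> list[tuple[str, float]]:
    """Adapter: convert B' output into the format module C has always expected."""
    return sorted(b_output.items())


def module_c(pairs: list[tuple[str, float]]) -> float:
    """Unchanged module C, still consuming the old format."""
    return sum(value for _, value in pairs)


if __name__ == "__main__":
    result = module_c(adapter_b_to_c(module_b_prime(["s1", "s2", "s3"])))
    print(result)  # same end result as before, without modifying B' or C
```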
11:37
And you can safely create version 2 with the new functionality and with a different type of B prime, or the same one, it doesn't really matter.
11:46
Both will have to be validated.
11:47
But this one, version 1.1, is an easy job, while version 2 will have to be validated from scratch.
11:54
So please keep in mind that the shelf life of your analysis is only as long as you can reproduce it.
12:02
Fourth topic, ease of use, also a very nice one.
12:07
I told you already that we have been working on this high throughput pipeline for JNJ.
12:12
It's been running for five years now and recently we got the question of can you maybe provide a different interface to that workflow?
12:21
OK, what should be changed?
12:22
Well, we would like to have multiple inputs at the same time so it runs the workflow in parallel.
12:27
OK, nice thing.
12:29
But also, we would like to provide the input via a CSV file rather than a form, basically, because a form was the way it was done before.
12:38
OK, interesting.
12:40
And then can we then maybe replace the form with that new format?
12:44
No, of course not.
12:45
We should still be able to use the old way.
12:48
OK, all right.
12:50
If you think about these kinds of requirements, it sounds really easy, but in practice it's not.
12:55
Because when you look at, for instance, typical Nextflow pipelines, you get this monolithic structure, a monolithic pipeline.
13:03
It's not really modular either, even though they call it modular.
13:07
And you have this whole long list of possible parameters, and that's it.
13:11
That's the interface.
13:13
If you want to change that, you create a different version with a different interface.
13:17
Again, that means additional maintenance, because you want to keep things in sync.
13:22
That's not the way to do it.
13:24
OK, but then what is?
13:26
Well, again, it comes down to creating the tool set, the ways to do that.
13:31
What if we could just add a wrapper around an existing workflow that deals with these changes in requirements for input and output, for instance. So your inner workflow remains exactly the same.
13:42
It is validated and can be stable.
13:44
It can be run as is, as it was in the past, but you add a wrapper around it that can deal with different requirements in terms of input output.
13:56
But maybe also you want to integrate with your LIMS system, or you want to integrate with the data governance policies that are available in your company, alternative outputs obviously, etcetera.
14:07
Maybe you want to deal with some sane defaults for your organisation, again without having to recreate or fork your existing workflow.
14:18
And if you think it through, you can add as many of these possible alternatives as you want, because you only have to validate the inner workflow once; the rest is just a different interface.
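A rough Python sketch of that wrapper idea (hypothetical function names; it assumes the validated inner workflow can be invoked once per sample) shows how new interfaces can be layered on without touching the inner logic:

```python
# Hypothetical sketch: a thin wrapper adds a CSV-based, parallel interface around a
# validated inner workflow without changing the inner workflow itself.
import csv
from concurrent.futures import ThreadPoolExecutor


def inner_workflow(sample_id: str, input_path: str) -> str:
    """Stand-in for the validated workflow; processes one sample, returns an output path."""
    return f"results/{sample_id}.out"  # placeholder result


def run_from_form(sample_id: str, input_path: str) -> str:
    """Old interface: a single run, exactly as before."""
    return inner_workflow(sample_id, input_path)


def run_from_csv(csv_path: str, max_workers: int = 4) -> list[str]:
    """New interface: a CSV of samples, fanned out in parallel over the same inner workflow."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))  # expected columns: sample_id, input_path
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(inner_workflow, r["sample_id"], r["input_path"]) for r in rows]
        return [f.result() for f in futures]
```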
14:34
And that made me think that if you think about usability, usually we try to come up with the best design for everyone, but maybe it's the other way around.
14:43
Maybe we should just think about how to make our systems as adaptable as possible.
14:50
OK, so this covered the four topics that I wanted to cover.
14:56
You can find us at our booth.
15:01
And if you're interested, you can use the QR code here for some more information about Data Intuitive and what we do. Thank you.