Key Takeaways

  • Yale and Google built an open-source AI model (Cell2Sentence-Scale 27B) for single-cell RNA analysis.
  • The model helps interpret huge, complex single-cell datasets.
  • It generated a new, experimentally confirmed hypothesis about cancer cells.
  • The model predicted drug candidates for immune signal boosting – most were previously unknown.
  • Researchers say this lets AI simulate real human cell behaviour in silico.
  • The preprint is available on bioRxiv, and the open-source model and code are on Hugging Face and GitHub.

Researchers at Yale University and Google have collaborated on a project that applies large language models (LLMs) to the analysis of single-cell RNA data. Together, they have developed and released Cell2Sentence-Scale 27B (C2S-Scale 27B), a 27-billion-parameter AI model designed to interpret single-cell data.

Detangling Single-Cell Data

Single-cell sequencing datasets are enormous and difficult to distil into interpretable hypotheses, and they require specialised tools to analyse. That makes sifting through this important data slow and cumbersome.

Google worked with David van Dijk’s lab at Yale to develop the initial version of Cell2Sentence-Scale, calling it “a family of powerful, open-source large language models (LLMs) trained to ‘read’ and ‘write’ biological data at the single-cell level.”
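To make the idea of "reading" a cell as text more concrete, here is a minimal Python sketch. It assumes the commonly described Cell2Sentence representation, which this article does not spell out: a cell's expression profile becomes a "sentence" of gene names ordered from most to least expressed. The function name `cell_to_sentence`, the `top_k` parameter, and the counts are hypothetical and for illustration only; this is not the project's actual code.

```python
from typing import Dict, Optional


def cell_to_sentence(expression: Dict[str, float], top_k: Optional[int] = None) -> str:
    """Turn one cell's gene expression profile into a space-separated 'sentence'.

    Assumption (not stated in the article): genes are listed in order of
    decreasing expression, and genes with zero expression are left out.
    """
    # Keep only genes that are actually expressed in this cell.
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    # Highest expression first; ties are broken alphabetically for determinism.
    ranked = sorted(expressed, key=lambda pair: (-pair[1], pair[0]))
    names = [gene for gene, _ in ranked]
    if top_k is not None:
        names = names[:top_k]
    return " ".join(names)


# Made-up counts for a single cell.
example_cell = {"CD74": 12, "B2M": 30, "ACTB": 12, "GAPDH": 0}
print(cell_to_sentence(example_cell))
```

A representation like this would let a language model treat cells as ordinary text, which is one way to read the "read and write" description in the quote above.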

“A milestone for AI in science”

In its announcement, Google said that the newest version of the project “marks a milestone for AI in science.” According to the company, the model generated a novel hypothesis about cancer cell behaviour, which was then confirmed experimentally in living cells. “This discovery reveals a promising new pathway for developing therapies to fight cancer,” said Google.

Finding ‘surprising hits’

The team asked C2S-Scale 27B to find a drug that would conditionally amplify immune signals, acting only in immune-context-positive environments: those where low levels of interferon are present, but not at levels high enough to induce antigen presentation.

They simulated the effect of more than 4,000 drugs to predict which would boost antigen presentation. The model highlighted a broad range of candidates, only 10-30% of which were already known hits in the literature.
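The screening logic can be pictured as a simple filter. The Python sketch below is an illustration under stated assumptions, not the researchers' pipeline: it assumes each drug has two predicted scores for its effect on antigen presentation, one in a low-interferon context and one in a context with no immune signal (the second context is an assumption, as the article does not describe it). The class, function names, thresholds, drug names, and scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class DrugPrediction:
    name: str
    score_with_interferon: float     # predicted boost, low-interferon context
    score_without_interferon: float  # predicted boost, no immune signal (assumed)


def conditional_amplifiers(
    predictions: Iterable[DrugPrediction],
    min_boost: float = 0.5,    # hypothetical threshold
    max_neutral: float = 0.1,  # hypothetical threshold
) -> List[DrugPrediction]:
    """Keep drugs predicted to act only when some interferon is present."""
    hits = [
        p for p in predictions
        if p.score_with_interferon >= min_boost
        and abs(p.score_without_interferon) <= max_neutral
    ]
    # Rank by how much stronger the effect is in the immune-positive context.
    return sorted(
        hits,
        key=lambda p: p.score_with_interferon - p.score_without_interferon,
        reverse=True,
    )


def known_fraction(hits: List[DrugPrediction], known_names: Set[str]) -> float:
    """Share of hits that already appear in a list of known compounds."""
    if not hits:
        return 0.0
    return sum(p.name in known_names for p in hits) / len(hits)


# Made-up predictions for four placeholder drugs.
screen = [
    DrugPrediction("drug_a", 0.90, 0.05),   # conditional booster
    DrugPrediction("drug_b", 0.80, 0.70),   # boosts everywhere, so excluded
    DrugPrediction("drug_c", 0.10, 0.00),   # too weak
    DrugPrediction("drug_d", 0.60, -0.02),  # conditional booster
]
hits = conditional_amplifiers(screen)
print([p.name for p in hits])
print(known_fraction(hits, {"drug_d"}))
```

Comparing the hit list against known compounds in this way shows how a figure such as "10-30% already known" could be computed, though the article does not say how the team arrived at it.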

David van Dijk, Assistant Professor and head of the lab at Yale, said: “We can finally begin to simulate how real human cells behave – in context, in silico. This is where AI stops being just an analysis tool and starts becoming a model system for biology itself.”

Open for Research

These findings will be published in a forthcoming paper. The preprint is available to read on bioRxiv. C2S-Scale 27B is open-source; its code and resources can be accessed on Hugging Face and GitHub.