Single-cell technologies make it possible to profile molecular features, such as gene expression, chromatin accessibility, methylation, and surface protein markers, in individual cells. The challenge lies in making the most of the rich datasets that these technologies produce.
In this presentation, Luca Pinello outlines SIMBA, a method for building gene regulatory maps. Starting from the feature matrices that single-cell assays produce, SIMBA arranges the cells on a two-dimensional map in which cells with common features sit closer together. Crucially, SIMBA also labels regions of the map with the relevant features, much like the metadata overlaid on Google Maps. The aim is to elucidate the underlying principles of gene regulation.
One way to analyse such data and construct these maps borrows from natural language processing (NLP), the broadly applicable set of techniques that enable computers to process and understand text. In NLP, words are encoded as vectors in a latent space, where words with similar meanings are positioned closer together. The meaning of a word can then be inferred from its position relative to other words, and relationships between words show up as vector offsets. For example, king + (woman – man) ≈ queen.
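The analogy arithmetic can be illustrated with a minimal sketch. The two-dimensional vectors below are hand-picked for illustration (one axis loosely standing for "royalty", the other for "gender"); they are assumptions, not embeddings learned by any real model, and the `analogy` helper is hypothetical.

```python
import math

# Hand-picked toy vectors (hypothetical, not learned from text):
# axis 0 loosely encodes "royalty", axis 1 loosely encodes "gender".
vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) + (vec(b) - vec(c)), excluding the inputs."""
    target = tuple(x + (y - z) for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# king + (woman - man)
result = analogy("king", "woman", "man", vectors)
print(result)
```

With real embeddings the match is only approximate, which is why the nearest remaining word is reported rather than an exact equality.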
Embeddings can also capture hierarchical structures. For example, Pinello described how words within a tweet can be embedded in relation to higher-level elements such as hashtags, forming a hierarchical representation. Subsequent research has expanded the capabilities of NLP tools: not only can they embed individual words and sentences, but also entire hierarchical graphs. This enables the analysis of relationships across multiple levels of language, including words, sentences, and full articles.
Applied to biology, SIMBA constructs a hierarchical graph of the many factors that influence gene expression. This graph connects cells and genes to additional features such as ATAC-seq peaks, motifs, and k-mers. After embedding the graph, researchers can explore the latent space to find cells of interest together with the features that sit close to them.
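To make the idea of such a graph concrete, here is a minimal sketch of how cells, genes, peaks, and motifs could be linked as typed edges. It is not SIMBA's API: the toy matrices, the relation names, and the `build_edges` helper are all hypothetical, and the rule of drawing an edge wherever a value is non-zero is a simplifying assumption.

```python
from collections import defaultdict

# Toy inputs; every name and value here is hypothetical.
expression = {  # cell -> {gene: count}
    "cell_1": {"gene_X": 5, "gene_Y": 0},
    "cell_2": {"gene_X": 0, "gene_Y": 3},
}
accessibility = {  # cell -> {peak: 1 if accessible else 0}
    "cell_1": {"peak_A": 1, "peak_B": 0},
    "cell_2": {"peak_A": 0, "peak_B": 1},
}
peak_motifs = {"peak_A": ["motif_1"], "peak_B": ["motif_2"]}  # peak -> motifs


def build_edges(expression, accessibility, peak_motifs):
    """Turn the toy matrices into (source, relation, target) edges."""
    edges = []
    for cell, genes in expression.items():
        for gene, count in genes.items():
            if count > 0:  # assumption: any non-zero count yields an edge
                edges.append((cell, "expresses", gene))
    for cell, peaks in accessibility.items():
        for peak, is_open in peaks.items():
            if is_open:
                edges.append((cell, "has_open", peak))
    for peak, motifs in peak_motifs.items():
        for motif in motifs:
            edges.append((peak, "contains", motif))
    return edges


edges = build_edges(expression, accessibility, peak_motifs)

# Undirected adjacency view, handy for inspecting what each node touches.
neighbors = defaultdict(set)
for src, _relation, dst in edges:
    neighbors[src].add(dst)
    neighbors[dst].add(src)

print(sorted(neighbors["peak_A"]))
```

An edge list of this shape is the kind of input a graph embedding method consumes: each node, whether a cell or a feature, receives its own vector in the shared latent space.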
By leveraging proximity in the embedded space, SIMBA facilitates the identification of key genes, transcriptional regulators, and regulatory regions associated with specific cell types.
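A proximity query of this kind can be sketched as a nearest-neighbour lookup in the shared space. The coordinates and node names below are made up for illustration, and `nearest_features` is a hypothetical helper rather than a SIMBA function; Euclidean distance is assumed here purely for simplicity.

```python
import math

# Hypothetical 2-D coordinates for cells and features in one shared space.
embedding = {
    "cell_type_A": (0.9, 0.1),
    "cell_type_B": (-0.8, 0.2),
    "gene_X": (1.0, 0.0),
    "gene_Y": (-0.9, 0.1),
    "motif_1": (0.8, 0.3),
    "motif_2": (-0.7, 0.4),
}


def nearest_features(query, embedding, prefix, k=1):
    """Return the k nodes whose names start with `prefix` that lie closest to `query`."""
    origin = embedding[query]
    candidates = [n for n in embedding if n.startswith(prefix) and n != query]
    # math.dist computes the Euclidean distance (Python 3.8+).
    return sorted(candidates, key=lambda n: math.dist(embedding[n], origin))[:k]


top_gene = nearest_features("cell_type_A", embedding, "gene_")
top_motif = nearest_features("cell_type_A", embedding, "motif_")
print(top_gene, top_motif)
```

Because cells and features live in the same space, the same lookup works in the other direction, for example starting from a gene and listing the cells nearest to it.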