Sophia Tsoka, Reader in Bioinformatics at King’s College London, explored how to use pathways, computational frameworks, and deep learning to make data more accurate and interpretable. However large and complex the generated data sets are, they remain subject to the “large p, small n” problem: a small number of samples (n) is characterised by a large number of features or predictors (p).
The central aim of the platform is to reduce dimensionality from thousands of gene features to hundreds of pathways, making computational problems more tractable and biologically meaningful. Feature selection methods tend to neglect the interactions that exist among features and treat them as independent, which is not accurate. The task is therefore to aggregate gene-level features into pathway-level data.
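To make the aggregation step concrete, the following is a minimal sketch rather than the method presented in the talk. It assumes a samples-by-genes matrix and a dictionary mapping pathway names to member genes, and it uses an unweighted mean as the pathway score. The function name `aggregate_to_pathways`, the gene labels, and the pathway labels are all hypothetical.

```python
import numpy as np

def aggregate_to_pathways(expr, gene_names, pathways):
    """Collapse a samples x genes matrix into a samples x pathways matrix.

    Assumption: a pathway score is the unweighted mean of its member genes.
    Genes that are not present in the matrix are skipped.
    """
    index = {gene: j for j, gene in enumerate(gene_names)}
    names, columns = [], []
    for name, members in pathways.items():
        cols = [index[g] for g in members if g in index]
        if not cols:
            continue  # none of this pathway's genes were measured
        names.append(name)
        columns.append(expr[:, cols].mean(axis=1))
    if not columns:
        return names, np.empty((expr.shape[0], 0))
    return names, np.column_stack(columns)

# Toy data: 3 samples, 4 genes, 2 hypothetical pathways
expr = np.array([[1.0, 3.0, 5.0, 7.0],
                 [2.0, 4.0, 6.0, 8.0],
                 [0.0, 0.0, 1.0, 1.0]])
gene_names = ["G1", "G2", "G3", "G4"]
pathways = {"P_A": ["G1", "G2"], "P_B": ["G3", "G4", "G9"]}  # G9 is unmeasured

names, scores = aggregate_to_pathways(expr, gene_names, pathways)
print(names)   # pathway labels, in dictionary order
print(scores)  # one row per sample, one column per pathway
```

An unweighted mean treats every gene in a pathway as equally informative, which is the simplest possible choice; replacing it with learned per-gene weights is what turns aggregation into an optimisation problem.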
This approach was tested on a few hundred pathways, with the data matrix decomposed into pathway-specific data sets. The optimisation problem is to find a weight for each gene such that the resulting pathway-level values separate the disease classes as clearly as possible. A study on breast cancer showed that the top 15 pathways separated breast cancers well; through these pathways, five breast cancer subtypes were revealed.
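One way to picture "a weight per gene that separates the classes" is Fisher's linear discriminant, used here only as a stand-in technique: the talk summary does not specify the actual optimisation model. The sketch assumes two classes, a samples-by-genes matrix for a single pathway, and synthetic data; the function name `fisher_gene_weights` and the `ridge` parameter are hypothetical.

```python
import numpy as np

def fisher_gene_weights(X, y, ridge=1e-3):
    """Per-gene weights from Fisher's linear discriminant (two classes).

    X: samples x genes matrix for one pathway, y: labels in {0, 1}.
    The ridge term keeps the within-class scatter matrix invertible
    when there are few samples relative to genes.
    """
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: summed outer products of centred samples
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    Sw += ridge * np.eye(X.shape[1])
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)  # unit length, for comparability

# Synthetic pathway with 4 genes; only gene 0 differs between classes
rng = np.random.default_rng(0)
n_per_class, n_genes = 20, 4
X = rng.normal(size=(2 * n_per_class, n_genes))
y = np.repeat([0, 1], n_per_class)
X[y == 1, 0] += 2.0

w = fisher_gene_weights(X, y)
pathway_score = X @ w  # one weighted pathway-level value per sample
print("gene weights:", np.round(w, 3))
print("mean score, class 0:", pathway_score[y == 0].mean())
print("mean score, class 1:", pathway_score[y == 1].mean())
```

The weighted sum `X @ w` plays the role of the pathway-level feature. Computing it for every pathway and ranking pathways by how well their scores separate the classes would be one plausible route to a "top pathways" list, again as an assumption rather than the procedure reported in the talk.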
The mathematical nature of these problems offers scientists adaptable ways of reflecting different biological constraints, and the formulation must be adapted to each type of problem statement. The key objective is to minimise the number of misclassifications, and there are different ways of formulating that objective as a specific mathematical problem.
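As an illustration of how the same goal can be written as different mathematical problems, the sketch below evaluates two objectives on the same pathway scores: a direct count of misclassified samples and a hinge loss, which is a common convex surrogate for that count. Both functions, the threshold rule, and the toy numbers are assumptions for illustration and are not taken from the talk.

```python
import numpy as np

def misclassification_count(scores, y, threshold=0.0):
    """0-1 objective: number of samples on the wrong side of the threshold."""
    predicted = (scores > threshold).astype(int)
    return int((predicted != y).sum())

def hinge_loss(scores, y, threshold=0.0):
    """Convex surrogate: also penalises correct samples inside a unit margin."""
    signs = 2 * y - 1  # map labels {0, 1} to {-1, +1}
    margins = signs * (scores - threshold)
    return float(np.maximum(0.0, 1.0 - margins).sum())

# Toy pathway scores for four samples and their true classes
scores = np.array([-2.0, -0.5, 0.3, 1.5])
y = np.array([0, 0, 0, 1])

print("misclassified samples:", misclassification_count(scores, y))
print("hinge loss:", hinge_loss(scores, y))
```

The count is the quantity of real interest but is discrete, so optimising it directly typically calls for integer programming; the hinge loss is continuous and convex, so it can be handled by standard continuous solvers at the cost of optimising a proxy.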
Deep learning is also critical to pathway-based classification, where it relies on incorporating prior knowledge to improve performance and interpretability. Tsoka added that specific neural networks can encode and decode data, reconstructing entire expression profiles and offering better separability of disease classes.
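Networks that encode data and then decode it to reconstruct the input are commonly called autoencoders. The sketch below shows one possible way to build pathway knowledge into such a network: a linear autoencoder in NumPy whose encoder weights are multiplied by a gene-to-pathway membership mask, so that each hidden unit only sees the genes of one pathway. The mask, the synthetic data, the linear layers, and the training settings are all assumptions for illustration; the architecture described in the talk is not specified in this summary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes, n_pathways = 40, 6, 2

# Hypothetical membership mask: genes 0-2 in pathway 0, genes 3-5 in pathway 1
mask = np.zeros((n_genes, n_pathways))
mask[:3, 0] = 1.0
mask[3:, 1] = 1.0

# Synthetic expression data driven by two pathway activities plus noise
activity = rng.normal(size=(n_samples, n_pathways))
X = activity @ mask.T + 0.1 * rng.normal(size=(n_samples, n_genes))

W_enc = 0.1 * rng.normal(size=(n_genes, n_pathways))
W_dec = 0.1 * rng.normal(size=(n_pathways, n_genes))

def forward(X, W_enc, W_dec):
    """Encode genes into pathway units (masked), then decode back to genes."""
    Z = X @ (W_enc * mask)
    return Z, Z @ W_dec

def mse(X, X_hat):
    return float(np.mean((X_hat - X) ** 2))

_, X_hat = forward(X, W_enc, W_dec)
loss_before = mse(X, X_hat)

# Plain gradient descent on the mean squared reconstruction error
lr = 0.05
for _ in range(500):
    Z, X_hat = forward(X, W_enc, W_dec)
    grad_out = 2.0 * (X_hat - X) / X.size             # d(mse)/d(X_hat)
    grad_dec = Z.T @ grad_out
    grad_enc = (X.T @ (grad_out @ W_dec.T)) * mask    # masked entries stay unused
    W_enc -= lr * grad_enc
    W_dec -= lr * grad_dec

Z, X_hat = forward(X, W_enc, W_dec)
loss_after = mse(X, X_hat)
print("reconstruction error before:", loss_before)
print("reconstruction error after: ", loss_after)
```

Because each column of `Z` depends only on the genes of one pathway, the hidden units can be read as pathway-level activities, and those columns could then be passed to a classifier. A practical model would normally use non-linear layers and a deep learning framework; the linear NumPy version is kept only so the masking idea is visible in a few lines.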
Rationalising data and linking specific pathways to disease classes makes the typical black-box behaviour of deep learning more comprehensible. The next step is to extend these methods to single-cell RNA sequencing data and to examine other functional units for classification.