Sophia Tsoka, Reader in Bioinformatics at King’s College London, explained that methodologies must be adapted to the specific problems posed by specific data sets. Machine learning can deliver good predictive performance, but it is essential that predictions are delivered in an interpretable manner, so that scientists can trace decisions and flexibly model and represent both the data and the problems they encounter.
Tsoka argued that the problem is not the volume of data but its heterogeneity. Different data types, such as numerical and text data, have distinct inherent properties, so a range of factors must be weighed when developing an appropriate data science methodology for a specific problem.
The ‘large p, small n problem’, in which the number of features far exceeds the number of samples, recurs throughout computational science. To tackle it, the team turned to feature selection methods, but most of these fail to represent relationships between features. Genes and proteins, for instance, do not act in isolation, so scientists must adapt their methods accordingly.
Computational scientists are therefore seeking to identify latent patterns. Tsoka uses mathematical optimisation to predict phenotypes with good accuracy while keeping the resulting models interpretable and explainable. She suggested that mathematical optimisation models can infer gene weights that effectively separate sample phenotypes and express pathway activities, with the aim of minimising misclassifications and improving prediction performance.
For pathway activity inference, the methodology decomposes large data matrices into pathway-specific matrices and uses optimisation procedures to derive pathway activities, which are then used for classification tasks. This approach reduces noise and improves the robustness of the data representation.
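The decomposition step can be sketched in a few lines of Python. This is a minimal illustration and not the method presented in the talk: the gene names, pathway definitions, and toy data are invented, and the gene weights come from a simple difference of class means, used here as a stand-in for the mathematical optimisation model described above.

```python
import numpy as np


def split_by_pathway(X, gene_names, pathways):
    """Decompose a samples-by-genes matrix into one submatrix per pathway."""
    column = {gene: i for i, gene in enumerate(gene_names)}
    blocks = {}
    for name, members in pathways.items():
        # Keep only pathway members that are actually measured in X.
        cols = [column[g] for g in members if g in column]
        blocks[name] = X[:, cols]
    return blocks


def pathway_activity(block, y):
    """Collapse a pathway submatrix into one activity score per sample.

    Assumption: weights are the difference of class means on standardised
    genes, a simple stand-in for weights inferred by an optimisation model.
    """
    Z = (block - block.mean(axis=0)) / (block.std(axis=0) + 1e-9)
    w = Z[y == 1].mean(axis=0) - Z[y == 0].mean(axis=0)
    total = np.abs(w).sum()
    if total > 0:
        w = w / total  # scale so the absolute weights sum to 1
    return Z @ w, w


# Hypothetical toy data: 20 samples, 6 genes, two phenotype classes.
rng = np.random.default_rng(0)
gene_names = ["g1", "g2", "g3", "g4", "g5", "g6"]
y = np.array([0] * 10 + [1] * 10)
X = rng.normal(size=(20, 6))
X[y == 1, :3] += 2.0  # genes g1-g3 are shifted in class 1

# Hypothetical pathway definitions (not taken from any real database).
pathways = {"P_signal": ["g1", "g2", "g3"], "P_noise": ["g4", "g5", "g6"]}

blocks = split_by_pathway(X, gene_names, pathways)
results = {name: pathway_activity(block, y) for name, block in blocks.items()}

# Samples-by-pathways matrix that a downstream classifier would consume.
activities = np.column_stack([results[name][0] for name in pathways])
weights = {name: results[name][1] for name in pathways}
```

The classifier then works with one column per pathway rather than one per gene, and each pathway's weights can be inspected directly. In a real analysis the weights would be fitted on training samples only and evaluated on held-out data.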
Tsoka tested her ML model on cardiovascular, breast, and colorectal cancer datasets. It performed strongly, outperforming other methods on accuracy and other standard performance metrics. Tsoka selected three distinct metrics: multiclass classification accuracy, robustness to noise, and survival.
She stressed that her model is interpretable, which enabled her to trace the entire modelling process and understand how decisions are made. As a result, she was able to adjust analytical capabilities to the extent that the model allowed. Visualisations show improved sample separability using pathway-based representations.
Finally, Tsoka touched on ongoing work to integrate these concepts into neural network architectures, aiming to improve accuracy and reduce the number of parameters by using pathway-based autoencoders. The team is also extending its methods to single-cell RNA-seq data for cell-type annotation.
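One way a pathway-based encoder could reduce the parameter count is by masking the first layer so that each hidden node connects only to the genes of one pathway. The sketch below illustrates that idea with NumPy. It is an assumption about the general approach, not the team's architecture, and the genes, pathways, and weights are hypothetical.

```python
import numpy as np


def pathway_mask(gene_names, pathways):
    """Boolean genes-by-pathways matrix: True where a gene belongs to a pathway."""
    row = {gene: i for i, gene in enumerate(gene_names)}
    mask = np.zeros((len(gene_names), len(pathways)), dtype=bool)
    for j, members in enumerate(pathways.values()):
        for gene in members:
            if gene in row:
                mask[row[gene], j] = True
    return mask


def encode(X, W, mask):
    """Masked encoder layer: each pathway node sees only its member genes."""
    return np.tanh(X @ (W * mask))


# Hypothetical genes and pathway membership.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
membership = {"P_a": ["g1", "g2", "g3"], "P_b": ["g4", "g5", "g6"]}

mask = pathway_mask(genes, membership)
dense_params = mask.size         # weights in a fully connected layer
masked_params = int(mask.sum())  # weights left after pathway masking

rng = np.random.default_rng(1)
W = rng.normal(size=mask.shape)  # untrained weights, for illustration only
X = rng.normal(size=(5, len(genes)))
codes = encode(X, W, mask)       # one value per sample per pathway

# Changing a gene outside P_a leaves the P_a node untouched.
X_perturbed = X.copy()
X_perturbed[:, genes.index("g4")] += 10.0
codes_perturbed = encode(X_perturbed, W, mask)
```

Because each hidden node corresponds to a named pathway, the encoded values stay interpretable, and the masked layer has fewer trainable weights than a dense layer of the same shape.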