In the field of precision medicine, there is a demand for robust biomarkers and robust methods for characterising complex diseases. Professor Graham Ball, Director of the Medical Technologies Research Centre at Anglia Ruskin University, outlined several challenges and strategies related to AI/ML applications in biomarker discovery.
Characterising disease is complex because so many variables must be measured. The post-genomic era can help, since it generates vast amounts of potentially useful data, but finding the relevant information within that mass of data is itself a challenge.
Big data describes large, hard-to-manage volumes of data, much of it unstructured. The main statistical challenge is false discovery. A t-test at the conventional 5% significance threshold means that roughly 5% of truly uninformative features will still be flagged as significant, so large data sets sharply increase the risk of false positives: in a sample of 50,000 features, that equates to around 2,500 false markers by chance alone. To mitigate this, one can apply multiple-testing corrections such as Bonferroni or Benjamini-Hochberg.
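As an illustration (not taken from the talk), the minimal Python sketch below simulates 50,000 purely null features and shows why an uncorrected 5% threshold yields roughly 2,500 false discoveries, while Bonferroni and Benjamini-Hochberg thresholds suppress them. All numbers and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: 50,000 features with no real signal.
# At an uncorrected 5% threshold we expect ~2,500 false positives.
n_features = 50_000
p_values = rng.uniform(0, 1, n_features)  # p-values under the null
alpha = 0.05

uncorrected_hits = np.sum(p_values < alpha)

# Bonferroni: compare each p-value against alpha / number of tests.
bonferroni_hits = np.sum(p_values < alpha / n_features)

# Benjamini-Hochberg (step-up): reject the k smallest p-values, where k is
# the largest rank with p_(k) <= (k / m) * alpha.
order = np.argsort(p_values)
ranked = p_values[order]
thresholds = alpha * np.arange(1, n_features + 1) / n_features
passing = np.nonzero(ranked <= thresholds)[0]
bh_hits = passing[-1] + 1 if passing.size else 0

print(f"Uncorrected discoveries: {uncorrected_hits}")   # roughly 2,500
print(f"Bonferroni discoveries:  {bonferroni_hits}")    # close to 0
print(f"Benjamini-Hochberg:      {bh_hits}")            # close to 0
```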
There is also a challenge around quality and replication. Proteomic profiles and protein samples can degrade rapidly, so standard operating procedures and protocols are needed to ensure that the data are of the best possible quality. Professor Ball therefore emphasised the importance of experimental design: robust data sets, blinded samples, and routine QC checks.
The curse of dimensionality arises when a very large number of parameters interact, often in a non-linear fashion. High dimensionality masks what is actually happening in the data: genuine features can be swamped by noise, making it difficult to analyse the data accurately or judge the importance of specific markers, and increasing the risk of false discovery. A further challenge with high-dimensional data sets is computational time; Professor Ball explained that analysing massive datasets requires high-performance computing and GPU acceleration to be feasible within a reasonable timeframe.
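A hypothetical simulation of this swamping effect (again not from the talk) is sketched below: a single weak but genuine marker is mixed with increasing numbers of pure-noise features, and its correlation ranking against the outcome steadily worsens as chance correlations from the noise overtake it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sketch: one weak but genuine marker among a growing number of
# pure-noise features. As dimensionality rises, chance correlations from the
# noise increasingly outrank the real signal.
n_samples = 100
y = rng.integers(0, 2, n_samples).astype(float)          # binary outcome
true_marker = y + rng.normal(scale=1.5, size=n_samples)  # weak real signal

for n_noise in (100, 1_000, 10_000, 50_000):
    X = np.column_stack([true_marker, rng.normal(size=(n_samples, n_noise))])
    # Absolute Pearson correlation of every feature with the outcome.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    rank = int((corr > corr[0]).sum()) + 1  # rank of the true marker (1 = best)
    print(f"{n_noise:>6} noise features: true marker ranked {rank}")
```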
A model for biomarker discovery must be applicable to a broad population rather than constrained to narrow conditions. Professor Ball advocated a hypothesis-free approach when examining big data: “What I've seen in the past is people who are using linear methods pull out features that are linearly constrained. That immediately throws away all the complexities of biology.”
To address these issues, Professor Ball highlighted several strategies for robust AI/ML applications. Feature selection and validation are crucial: Monte Carlo cross-validation, and ranking genes by information content rather than fold-change alone, help ensure that selected markers are stable. Rather than analysing thousands of genes simultaneously, breaking the problem into smaller subsets reduces overfitting and improves generalisability.
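A minimal sketch of this resampling idea, assuming a scikit-learn style workflow rather than Professor Ball's actual pipeline, might use ShuffleSplit for Monte Carlo cross-validation and mutual information as the information-content score, keeping only features that repeatedly rank near the top. The synthetic dataset and thresholds below are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import ShuffleSplit

# Hypothetical sketch: Monte Carlo cross-validation for stable feature ranking.
# Features are scored by mutual information (an information-content measure)
# on many random training subsets, and only features that repeatedly reach
# the top are retained as candidate markers.
X, y = make_classification(n_samples=200, n_features=500, n_informative=10,
                           n_redundant=0, shuffle=False, random_state=0)

splitter = ShuffleSplit(n_splits=50, train_size=0.8, random_state=0)
top_k = 20
top_counts = np.zeros(X.shape[1])

for train_idx, _ in splitter.split(X):
    scores = mutual_info_classif(X[train_idx], y[train_idx], random_state=0)
    top_counts[np.argsort(scores)[-top_k:]] += 1

# Keep features that appear in the top-k in at least 80% of the resamples.
stable = np.flatnonzero(top_counts >= 0.8 * splitter.get_n_splits())
print("Stable candidate features:", stable)
```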
Professor Ball also stressed the importance of multi-dataset and multi-algorithm approaches, identifying robust biomarkers by confirming concordance across multiple models and datasets. Beyond computational findings, biological validation is essential; for example, a proliferation marker (SPAG5) identified through ML was later confirmed through functional assays, demonstrating its real biological significance.
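The concordance idea can be sketched in the same hypothetical style: rank features with two unrelated algorithms (here a random forest and mutual information, on synthetic data) and retain only the candidates both methods agree on. In practice the same check would be repeated across independent datasets before any biological validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

# Hypothetical sketch of the multi-algorithm idea: rank features with two
# unrelated methods and keep only the candidates on which both agree.
X, y = make_classification(n_samples=300, n_features=1000, n_informative=10,
                           n_redundant=0, shuffle=False, random_state=1)
top_k = 25

rf = RandomForestClassifier(n_estimators=500, random_state=1).fit(X, y)
top_rf = set(np.argsort(rf.feature_importances_)[-top_k:])

mi = mutual_info_classif(X, y, random_state=1)
top_mi = set(np.argsort(mi)[-top_k:])

concordant = sorted(top_rf & top_mi)
print("Markers supported by both algorithms:", concordant)
```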
AI/ML has significant potential in precision medicine and biomarker discovery but requires robust validation and interpretable models. Professor Ball stressed that the combination of computational efficiency, biological validation, and interpretability is key to making ML-driven discoveries clinically relevant.