Introduction

Computing power has increased greatly over the last few decades due to advances in technology. Despite this increase, the requirements of various applications still exceed the computing power available on smaller general-purpose machines. To tackle this, specialised machines called high performance computers (HPC) are periodically constructed with the best available technology. They provide a very large amount of concentrated compute power and give the best possible answers for such demanding applications. The next generation of HPC, expected sometime after 2020, is called Exascale, a name that refers to the amount of computation available: on the order of 10^18 floating-point operations per second.

In the last decade, a new breed of user of very large machines has appeared: those concerned with Big Data. Big Data problems usually deal with less sophisticated models but with many more parameters, and choose the model parameters by analysing large amounts of data with relatively little associated computation. However, there are problems in this area for which the data are very expensive to generate. In such cases it becomes important to use more sophisticated models to squeeze as much knowledge as possible out of the data. Such problems sit at the juncture of HPC and Big Data: they have large data sets to analyse, yet should exploit more sophisticated models through computation to make the most of the available data.

The ExCAPE project is about how to tackle such problems. The core of the project is the mathematics and software involved, and how they perform on HPC machines. However, advancing the state of the art is easier with a concrete problem to tackle. For this we take the chemogenomics problem, that of predicting the activity of compounds in the drug discovery phase of the pharmaceutical industry, which gives the project its name (Exascale Compound Activity Prediction Engines, ExCAPE). Building such predictive models belongs to the field of Machine Learning.

The overall objectives of the project are to find methods and systems that can tackle large and complex machine learning problems, such as chemogenomics. This will require algorithms and software that make efficient use of the latest HPC machines. Creating these, along with preparing the data to give the system something to work on, is the main work of the project. The project is part of the H2020 European Initiative, the biggest EU Research and Innovation programme ever with nearly €80 billion of funding available over 7 years (2014 to 2020). The RHUL team contribute in the area of Uncertainty Quantification with their expertise in Conformal Prediction and Venn-Abers Prediction.

Reference:
Tom Ashby TechReport.

Deliverable reports

  • Report #1 : Conformal Predictors The report summarises some preliminary findings of WP1.4: Confidence Estimation and feature significance. It presents an application of conformal predictors, in transductive and inductive modes, to the large, high-dimensional, sparse and imbalanced data sets found in Compound Activity Prediction, taken from the PubChem public repository. The report describes a version of conformal predictors called the Mondrian Predictor, which keeps validity guarantees for each class. The report also briefly describes the parallelization approach that made it possible to distribute the computational load and reduce execution time. Download PDF
  • Report #2 : Probabilistic prediction The objective of this subpackage is to complement the bare prediction of bioactivity with an estimate of its uncertainty. The previous report described Conformal Prediction and discussed some results of its application to BioAssay data. This report introduces multi-probabilistic prediction, also referred to as Venn prediction. Download PDF
  • Report #3 : Integration of Conformal Prediction with ML Algorithms The objective of this subpackage is to integrate Conformal Prediction (CP) with the Machine Learning methods adopted in ExCAPE. Conformal Prediction was described in the first report, with particular emphasis on its Inductive and Class-conditional (Mondrian) forms. The deliverable is a Python module that implements Mondrian Inductive CP (MICP). The module is meant to be used as a stage downstream of the ML algorithms in the overall pipeline, which takes the EXCAPE DB data as input and produces predictions as final output. The module is offered as a reference implementation, and partners in WP2 and WP3 are free to re-implement it. Download PDF
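
To make the idea behind Mondrian Inductive CP more concrete, here is a minimal sketch in Python. It is not the ExCAPE module itself: the function names (micp_p_values, prediction_sets) and the toy numbers are hypothetical, and the nonconformity scores are assumed to be supplied by an upstream ML model (for example, one minus the predicted probability of a label). For each candidate label, the p-value of a test object is computed only against calibration examples of that same class, which is what gives the per-class validity mentioned above.

```python
import numpy as np


def micp_p_values(cal_scores, cal_labels, test_scores, classes):
    """Class-conditional (Mondrian) inductive conformal p-values.

    cal_scores:  (n_cal,) nonconformity score of each calibration
                 example, computed for its true label.
    cal_labels:  (n_cal,) true labels of the calibration examples.
    test_scores: (n_test, n_classes) nonconformity score of each test
                 object under each candidate label, in `classes` order.
    Returns an (n_test, n_classes) array of p-values.
    """
    cal_scores = np.asarray(cal_scores, dtype=float)
    cal_labels = np.asarray(cal_labels)
    test_scores = np.asarray(test_scores, dtype=float)

    p_values = np.empty_like(test_scores)
    for j, label in enumerate(classes):
        # Mondrian step: compare only against calibration scores of this class.
        class_scores = cal_scores[cal_labels == label]
        # Count calibration scores at least as nonconforming as the test score.
        n_ge = (class_scores[None, :] >= test_scores[:, j, None]).sum(axis=1)
        # The +1 terms account for the test object itself.
        p_values[:, j] = (n_ge + 1) / (len(class_scores) + 1)
    return p_values


def prediction_sets(p_values, classes, epsilon):
    """Keep every label whose p-value exceeds the significance level."""
    return [[c for c, p in zip(classes, row) if p > epsilon] for row in p_values]


# Toy, made-up data: four calibration examples of class 0, two of class 1.
classes = [0, 1]
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.8]
cal_labels = [0, 0, 0, 0, 1, 1]
test_scores = [[0.25, 0.9], [0.5, 0.5]]

p = micp_p_values(cal_scores, cal_labels, test_scores, classes)
sets = prediction_sets(p, classes, epsilon=0.25)
print(p)
print(sets)
```

A prediction set may contain one label, several labels, or none. With imbalanced data, as in the compound activity setting, computing the p-values per class keeps the minority class from being calibrated against the scores of the majority class.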

Technical reports


Publications

Researchers