ExCAPE Project at Royal Holloway University of London

Introduction

Computing power has increased greatly over the last few decades due to advances in technology. Despite this increase, the requirements of various applications still exceed the computing power available in smaller general-purpose machines. To tackle this, specialised machines called high performance computers (HPC) are periodically constructed with the best available technology. They provide a very large amount of concentrated compute power in order to give the best possible answers for such demanding applications. The next generation of HPC, expected sometime after 2020, is called Exascale, a name related to the amount of computation available.

In the last decade, a new breed of user of very large machines has appeared: those concerned with Big Data. Big Data problems usually deal with less sophisticated models that have many more parameters, and try to choose the model parameters by analysing large amounts of data with relatively little associated computation. However, there are problems in this area for which the data are very expensive to generate. In this case it becomes important to use more sophisticated models in order to squeeze as much knowledge as possible out of the data. Such problems are at the juncture of HPC and Big Data in that they have large data sets to analyse, yet should exploit more sophisticated models through computation to make the most of the available data.

The ExCAPE project is about how to tackle such problems. The project focuses on maths and software and how they work on HPC machines. However, to advance the state of the art it helps to have a concrete problem to tackle. For this we take the chemogenomics problem, that of predicting the activity of compounds in the drug discovery phase of the pharmaceutical industry, leading to the project name (Exascale Compound Activity Prediction Engines - ExCAPE). Making such predictive models belongs to the field of Machine Learning.

The overall objectives of the project are to find methods and systems that can tackle large and complex machine learning problems, such as chemogenomics. This will require algorithms and software that make efficient use of the latest HPC machines. Creating these, along with preparing the data to give the system something to work on, is the main work of the project. The project is part of the H2020 European Initiative, the biggest EU Research and Innovation programme ever with nearly €80 billion of funding available over 7 years (2014 to 2020). The RHUL team contribute in the area of Uncertainty Quantification with their expertise in Conformal Prediction and Venn-Abers Prediction.

Reference:
Tom Ashby TechReport.

Deliverable reports

Report #1 : Conformal Predictors The report summarises some preliminary findings of WP1.4: Confidence Estimation and feature significance. It presents an application of conformal predictors in transductive and inductive modes to the large, high-dimensional, sparse and imbalanced data sets found in Compound Activity Prediction from the PubChem public repository. The report describes a version of conformal predictors called the Mondrian Predictor, which keeps validity guarantees for each class. The report also briefly describes the parallelization approach that made it possible to distribute the computational load and reduce execution time. Download PDF
Report #2 : Probabilistic prediction The objective of this subpackage is to complement the bare prediction of bioactivity with an estimate of its uncertainty. The previous report described Conformal Prediction and discussed some results of its application to BioAssay data. This Report introduces Multi-probabilistic prediction, also referred to as Venn prediction. Download PDF
Report #3 : Integration of Conformal Prediction with ML Algorithms The objective of this subpackage is to integrate Conformal Prediction (CP) with the Machine Learning methods adopted in ExCAPE. Conformal Prediction was described in the first Report, with particular emphasis on Inductive and Class-conditional (Mondrian) forms. The deliverable is a Python module that implements Mondrian Inductive CP (MICP). The module is meant to be used as a stage downstream of the ML algorithms in the overall pipeline that takes the EXCAPE DB data as input and produces predictions as final output. The module is intended as a reference implementation; partners in WP2 and WP3 are free to re-implement it. Download PDF
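
To give a flavour of what Mondrian Inductive CP computes, here is a minimal sketch in Python. It is not the project's deliverable module: the function names, argument layout and use of NumPy are assumptions made for illustration only. It assumes that nonconformity scores have already been produced by an upstream ML algorithm, both for a held-out calibration set and for each test object under each hypothesised label.

```python
import numpy as np


def mondrian_icp_p_values(cal_scores, cal_labels, test_scores, classes):
    """Class-conditional (Mondrian) inductive conformal p-values.

    Hypothetical helper, not the ExCAPE module's API.

    cal_scores:  nonconformity score of each calibration example (1-D).
    cal_labels:  true label of each calibration example (1-D).
    test_scores: array of shape (n_test, n_classes); column j holds the
                 score of each test object under the label classes[j].
    """
    cal_scores = np.asarray(cal_scores, dtype=float)
    cal_labels = np.asarray(cal_labels)
    test_scores = np.asarray(test_scores, dtype=float)
    p_values = np.empty_like(test_scores)
    for j, label in enumerate(classes):
        # Mondrian step: compare only against calibration examples
        # that belong to the hypothesised class.
        same_class = cal_scores[cal_labels == label]
        # Count calibration scores at least as nonconforming as the test score.
        at_least = (same_class[None, :] >= test_scores[:, [j]]).sum(axis=1)
        p_values[:, j] = (at_least + 1) / (len(same_class) + 1)
    return p_values


def prediction_sets(p_values, classes, significance):
    """Keep every label whose p-value exceeds the significance level."""
    return [[c for c, p in zip(classes, row) if p > significance]
            for row in p_values]
```

Because each p-value is computed against calibration examples of one class only, the error rate is controlled separately for each class, which matters for imbalanced data such as compound activity sets. A call might look like `prediction_sets(mondrian_icp_p_values(cal_scores, cal_labels, test_scores, [0, 1]), [0, 1], 0.2)`, where the significance level 0.2 is an arbitrary illustrative choice.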

Technical reports

Performance analysis of Mondrian Conformal Prediction for the top 10 targets in the ExCAPE dataset Khuong An Nguyen, internal report, August 2018.
Multi-target learning Ilia Nouretdinov, internal report, August 2018.
Inductive Venn–Abers prediction for regression Ivan Petej, internal report, July 2018.
Venn–Abers partial ordering method applied to ExCAPE datasets Ivan Petej, internal report, July 2018.
Prediction in bioinformatics applications by conformal predictors Alex Gammerman, invited talk at ICPB 2016, Pattaya, Thailand.
Applying Conformal Predictions on Public BioAssay Data Paolo Toccaceli, poster presented at ICPB 2016, Pattaya, Thailand.

Publications

Conformal Prediction of Biological Activity of Chemical Compounds Paolo Toccaceli, Ilia Nouretdinov, Alexander Gammerman; Annals of Mathematics and Artificial Intelligence, p.1-19, 2017.
Combination of Conformal Predictors for Classification Paolo Toccaceli, Alexander Gammerman; Proceedings of Machine Learning Research. Vol.60, p.39-61, 2017.
Conformal Predictors for Compound Activity Prediction Paolo Toccaceli, Ilia Nouretdinov, Alexander Gammerman; 5th International Symposium on Conformal and Probabilistic Prediction with Applications, 2016.

Researchers

Prof. Alex Gammerman
Principal Investigator
Prof. Vladimir Vovk
Co-Investigator
Prof. Zhiyuan Luo
Co-Investigator
Dr. Lars Carlsson
Co-Investigator
Dr. Khuong An Nguyen
Research Assistant
Dr. Paolo Toccaceli
Research Assistant
Dr. Ilia Nouretdinov
Research Assistant
Dr. Ivan Petej
Research Assistant
