Improving the Protein-Ligand Scoring Function for Molecular Docking by Fuzzy Rule-based Machine Lear


Protein–ligand docking is an essential step in the modern drug discovery process. The challenge is to accurately predict and efficiently optimize the position and orientation of ligands in the binding pocket of a target protein. In this research, we investigate the application of machine learning (ML) algorithms to the development of an ML-based protein–ligand scoring function.

In particular, we will examine rule-based induction and swarm search optimization algorithms. The former yields insights into protein–ligand interaction in the form of human-interpretable decision rules; the latter addresses the problems posed by high-dimensional feature spaces. With sufficient insight into the highly non-linear relations between the protein–ligand descriptors (the input features) and the predicted outcome (conformation or otherwise), the logic of the scoring function can be refined. The swarm search optimization algorithms help to find a significant feature subset within the large conformational space. The resulting ML-based model will be generalized, tested, and integrated as a new scoring function into some popular molecular docking programs. The expected result is an improvement in both the efficiency (speed) and the accuracy of the protein–ligand docking process.

Protein–ligand docking has become an important step in modern structure-based drug design. Given a biological target related to the disease of interest, a docking program helps to decide whether a small molecule (the ligand) can bind to the target protein with a desirable level of affinity.

High-quality docking predictions can markedly reduce the time and cost of experimental tests, and thus both academic and industrial research has focused intensely on improving the accuracy and efficiency of docking algorithms.

Technically speaking, a protein–ligand docking algorithm consists of two main steps: conformation generation and scoring. The former uses sampling techniques to generate different ligand orientations at different positions inside the protein binding pocket. Each of these conformations is evaluated by a scoring function, and the top-scoring ligand conformations are reported as a ranked list. In flexible ligand docking, the size of the conformational space (the search space) depends on the volume of the protein binding pocket and the number of rotatable bonds of the ligand of interest, while the energy landscape of the search space is determined by the energetic properties of protein–ligand binding, which is known to be complex and rugged in shape.
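The two-step structure described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Pose` representation (a position plus one torsion angle per rotatable bond), the cubic search box, the uniform random sampler, and in particular `toy_score`, which is a placeholder with a deliberately rugged shape and not a real scoring function. The sketch also assumes the common convention that a lower score is better.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Pose:
    """Hypothetical ligand conformation: placement plus internal torsions."""
    position: Tuple[float, float, float]  # ligand centre relative to pocket centre
    torsions: Tuple[float, ...]           # one angle (radians) per rotatable bond


def sample_pose(rng: random.Random, box_half_width: float, n_rotatable_bonds: int) -> Pose:
    """Step 1 (conformation generation): draw one random pose inside the box."""
    position = tuple(rng.uniform(-box_half_width, box_half_width) for _ in range(3))
    torsions = tuple(rng.uniform(-math.pi, math.pi) for _ in range(n_rotatable_bonds))
    return Pose(position, torsions)


def toy_score(pose: Pose) -> float:
    """Placeholder score (lower is better); NOT a physical energy model."""
    distance_term = sum(x * x for x in pose.position)
    # Cosine terms give several local minima per torsion, mimicking ruggedness.
    torsion_term = sum(1.0 - math.cos(3.0 * t) for t in pose.torsions)
    return distance_term + torsion_term


def dock(score: Callable[[Pose], float], n_samples: int, top_k: int,
         box_half_width: float = 5.0, n_rotatable_bonds: int = 4,
         seed: int = 0) -> List[Tuple[float, Pose]]:
    """Sample poses, score each one (step 2), and return the top_k ranked list."""
    rng = random.Random(seed)
    poses = [sample_pose(rng, box_half_width, n_rotatable_bonds) for _ in range(n_samples)]
    ranked = sorted(((score(p), p) for p in poses), key=lambda pair: pair[0])
    return ranked[:top_k]


if __name__ == "__main__":
    for value, pose in dock(toy_score, n_samples=200, top_k=5):
        print(f"{value:8.3f}  {pose.position}")
```

The search space grows with both `box_half_width` (pocket volume) and `n_rotatable_bonds`, which is why plain random sampling of this kind quickly becomes inadequate and smarter search strategies are needed.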

To search the huge conformational space quickly and intelligently, heuristic or metaheuristic algorithms, which find near-optimal solutions rather than the global optimum, are the method of choice for initial docking studies and high-throughput virtual screening. The candidate ligand conformations they produce can then be further optimized by more expensive but more accurate modeling methods.

If computational virtual screening is both fast and accurate, much of the cost, time, and effort otherwise spent on expensive software and equipment can be saved.

In this project, we develop new metaheuristics that speed up the search and identify the significant features that drive accuracy. They act as a post-processing step for the first part, conformation generation, and as a pre-processing step for the second part, scoring, which requires a machine learning model.
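As an illustration of swarm-based feature selection, the sketch below uses binary particle swarm optimization, chosen here only as one well-known representative of swarm search; it is not necessarily the metaheuristic this project will develop. The fitness function `toy_fitness` is a hypothetical stand-in: in practice it would be something like the cross-validated accuracy of a scoring model trained on the selected descriptors. The set of "informative" descriptor indices and all parameter values are assumptions.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def toy_fitness(mask: Sequence[int]) -> float:
    """Hypothetical fitness: reward informative descriptors, penalize subset size."""
    informative = {0, 3, 7}              # assumed useful descriptor indices
    hits = sum(mask[i] for i in informative)
    return hits - 0.1 * sum(mask)        # small cost for every descriptor kept


def binary_pso(fitness: Callable[[Sequence[int]], float], n_features: int,
               n_particles: int = 20, n_iters: int = 50,
               w: float = 0.7, c1: float = 1.5, c2: float = 1.5,
               v_max: float = 4.0, seed: int = 0) -> Tuple[List[int], float]:
    """Maximize `fitness` over 0/1 masks (1 = keep descriptor) with binary PSO."""
    rng = random.Random(seed)
    xs = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vs = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [x[:] for x in xs]                       # personal best masks
    pbest_fit = [fitness(x) for x in xs]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]     # global best mask

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                v = (w * vs[i][d]
                     + c1 * r1 * (pbest[i][d] - xs[i][d])
                     + c2 * r2 * (gbest[d] - xs[i][d]))
                v = max(-v_max, min(v_max, v))       # clamp the velocity
                vs[i][d] = v
                # The sigmoid turns the velocity into the probability of a 1 bit.
                prob_one = 1.0 / (1.0 + math.exp(-v))
                xs[i][d] = 1 if rng.random() < prob_one else 0
            fit = fitness(xs[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = xs[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = xs[i][:], fit
    return gbest, gbest_fit


if __name__ == "__main__":
    mask, fit = binary_pso(toy_fitness, n_features=10)
    print("selected descriptors:", [i for i, bit in enumerate(mask) if bit])
    print("fitness:", fit)
```

Each particle encodes a candidate feature subset, so the swarm searches the space of subsets without enumerating all of them, which is the property that matters when the number of descriptors is large.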

While current docking algorithms are able to generate docked conformations reasonably close to the native complexes, the problem lies in the difficulty of accurately predicting the binding affinities of the docked complexes so as to distinguish active ligands from decoys.

In addition, highly accurate scoring functions are essential for lead optimization in the later stage of the drug discovery process. Despite years of effort, the performance of scoring functions is still far from satisfactory.

One promising alternative to conventional scoring functions is to apply machine learning (ML) algorithms to construct models for protein–ligand binding prediction. Because ML algorithms learn directly from data through fitting and inference, without prior assumptions about the statistical distribution of the data, ML is well suited to biological data that is noisy and complex and that lacks a comprehensive theory. Studies applying ML techniques to construct scoring functions have begun to emerge only in the last two years.

In this project, we aim to identify the best ML techniques (there are hundreds of algorithms, with new ones emerging every month) for predicting the affinity of protein–ligand interaction. In particular, we emphasize the fuzzy rule-based branch of ML algorithms, because the decision rules offer insights into the highly non-linear relationships between the descriptors and the conformation, which would be of great interest to both the computer science and bioinformatics research communities. Since the relations among the descriptors of the protein, the ligand, and the protein–ligand complex, as well as the affinity energies, are highly non-linear and fuzzy in nature, our research efforts will naturally be directed towards fuzzy rule-based ML models.
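To make the idea of a fuzzy rule-based model concrete, the following is a minimal sketch of zero-order Takagi-Sugeno inference. The two descriptors (a normalized hydrogen-bond score and a normalized hydrophobic-contact score), the membership functions, the four rules, and the consequent affinity values are all hypothetical and chosen only for illustration; in the proposed work such rules would be induced from data rather than written by hand.

```python
from typing import Callable, List, Tuple

Membership = Callable[[float], float]


def high(x: float) -> float:
    """Degree to which a descriptor normalized to [0, 1] is 'high'."""
    return min(1.0, max(0.0, x))


def low(x: float) -> float:
    """Degree to which a normalized descriptor is 'low' (complement of 'high')."""
    return 1.0 - high(x)


# Each rule reads: IF hbond IS <set> AND contact IS <set> THEN affinity = <value>.
# The consequent values are made-up numbers on an arbitrary affinity scale.
RULES: List[Tuple[Membership, Membership, float]] = [
    (high, high, 9.0),
    (high, low, 6.0),
    (low, high, 5.0),
    (low, low, 3.0),
]


def predict_affinity(hbond: float, contact: float) -> float:
    """Combine all rules, weighting each consequent by its firing strength."""
    weighted_sum = 0.0
    total_strength = 0.0
    for hbond_set, contact_set, consequent in RULES:
        # Fuzzy AND is taken as the minimum of the two membership degrees.
        strength = min(hbond_set(hbond), contact_set(contact))
        weighted_sum += strength * consequent
        total_strength += strength
    return weighted_sum / total_strength if total_strength > 0.0 else 0.0


if __name__ == "__main__":
    print(predict_affinity(0.8, 0.3))
```

Because every rule fires to a degree rather than all-or-nothing, the prediction changes smoothly as the descriptors change, while each rule remains readable as a plain IF-THEN statement.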