Chemical Language Models and Active Learning for identification of good binding ligands

In this work, we use chemical language models such as ChemBERTa and Molformer in combination with Bayesian optimization and active learning to efficiently identify ligands with promising binding affinity for a given protein.

The rough process is described in the following figure. We employ a two-level approach. Instead of directly optimizing over the whole molecule pool, we first cluster the available molecules using k-means and select promising clusters using a multi-armed bandit. In each iteration, we select the most rewarding cluster and only then select the molecule within that cluster with the highest score according to Bayesian optimization.
The multi-armed bandit selection can be seen as a coarse pruning of the input space, focusing the search on regions where promising ligands have already been observed.
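
The two-level selection above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the README does not say which bandit algorithm or reward is used, so UCB1 with the negated docking score as reward is an assumption, and `select_cluster`, `two_level_search`, `acquisition` and `oracle` are hypothetical names. The clusters are assumed to be precomputed (for example by k-means on the language model embeddings).

```python
import math


def select_cluster(counts, reward_sums, available, t, c=2.0):
    """UCB1 over clusters that still hold unevaluated molecules (assumed bandit)."""
    untried = [k for k in available if counts[k] == 0]
    if untried:
        return untried[0]  # play every cluster once before comparing them
    return max(
        available,
        key=lambda k: reward_sums[k] / counts[k]
        + math.sqrt(c * math.log(t) / counts[k]),
    )


def two_level_search(clusters, acquisition, oracle, n_iter):
    """Return the evaluated (molecule, score) pairs in selection order.

    clusters:    list of lists of molecule identifiers (precomputed clustering)
    acquisition: hypothetical callable (molecule, history) -> float, higher is better
    oracle:      hypothetical callable molecule -> docking score, lower is better
    """
    pools = [list(members) for members in clusters]
    counts = [0] * len(pools)
    reward_sums = [0.0] * len(pools)
    history = []
    for t in range(1, n_iter + 1):
        available = [k for k, pool in enumerate(pools) if pool]
        if not available:
            break  # every molecule has been evaluated
        # Level 1: the bandit picks the most rewarding cluster.
        k = select_cluster(counts, reward_sums, available, t)
        # Level 2: pick the molecule with the highest acquisition score in it.
        mol = max(pools[k], key=lambda m: acquisition(m, history))
        pools[k].remove(mol)
        score = oracle(mol)
        counts[k] += 1
        reward_sums[k] += -score  # assumption: reward is the negated docking score
        history.append((mol, score))
    return history


if __name__ == "__main__":
    # Toy data: integers stand in for molecules; larger integers dock better.
    toy_clusters = [[0, 1, 2], [10, 11, 12], [20, 21, 22]]
    result = two_level_search(
        toy_clusters,
        acquisition=lambda m, history: m,  # stand-in for a surrogate-based score
        oracle=lambda m: -m / 10,          # stand-in for a docking run
        n_iter=6,
    )
    print(result)
```

In a real run, `acquisition` would come from a surrogate model fitted on `history`, and `oracle` would be either a lookup of a precalculated docking score or a docking call.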

Results


[Figures: results for chembl203 and 4UNN]

Setup

Install the required Python packages:

conda env create -f environment.yml

We use smina for docking. For receptor preparation, install the ADFR suite in a suitable location, e.g. "~/.local/share":

cd $HOME/.local/share \
    && wget -O ADFRsuite.tar.gz https://ccsb.scripps.edu/adfr/download/1038/ \
    && tar -xzvf ADFRsuite.tar.gz \
    && cd ADFRsuite_* \
    && echo "Y" | ./install.sh -d . -c 0 \
    && cd .. \
    && rm -rf ADFRsuite.tar.gz \
    && ln -s $HOME/.local/share/ADFRsuite_x86_64Linux_1.0/bin/prepare_receptor $HOME/.local/bin

Similarly, to install smina itself:

cd $HOME/.local/share \
    && wget -O smina https://sourceforge.net/projects/smina/files/smina.static/download \
    && chmod +x smina \
    && mv smina ../bin/

Docker

Alternatively, build the provided Docker image:

docker build --build-arg USER_ID=$UID --build-arg USER_NAME=$(id -un) -t protein_ligand:base -f Dockerfile .

Optionally install development packages:

conda install --file dev_requirements.txt
pre-commit install

Run

Bayesian optimization can be run either on a dataset with precalculated docking scores or in an online fashion, where ligands are docked with smina in each iteration.
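
The difference between the two modes can be illustrated with a small sketch. It is not the repository's code: `lookup_oracle`, `smina_command` and `run_smina` are hypothetical helpers, and the file paths are placeholders. The smina options used (`-r`, `-l`, `--autobox_ligand`, `-o`) are standard smina command-line options, but the options the project actually passes are an assumption here.

```python
import subprocess


def lookup_oracle(score_table):
    """Offline mode: return a function that reads precalculated docking scores."""
    def score(smiles):
        return score_table[smiles]
    return score


def smina_command(receptor, ligand, reference_ligand, out_file):
    """Online mode: build a smina call (assumed options, placeholder paths)."""
    return [
        "smina",
        "-r", receptor,                        # prepared receptor file
        "-l", ligand,                          # ligand to dock
        "--autobox_ligand", reference_ligand,  # search box around a reference ligand
        "-o", out_file,                        # docked poses are written here
    ]


def run_smina(receptor, ligand, reference_ligand, out_file):
    """Run smina and raise CalledProcessError if docking fails."""
    command = smina_command(receptor, ligand, reference_ligand, out_file)
    subprocess.run(command, check=True)
    return out_file
```

Both modes expose the same idea to the optimizer: given a ligand, obtain a docking score. Offline this is a dictionary lookup; online it requires running smina and reading the score from its output.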

More information about running experiments can be found in the help text of the command-line interface. Also, check out scripts/compare_and_analyze.sh.

export PYTHONPATH=$PYTHONPATH:.
conda run -n protein_ligand python experiments/run.py --help

The results are stored in an SQLite database. When docking online, SDF files are stored in the user-defined output directory. Each file carries the MD5 hash of the ligand's SMILES string instead of the raw string, to avoid possible issues with the special characters that appear in SMILES notation.
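
As an illustration of why hashing helps, the sketch below maps a SMILES string to a file-system-safe name. The helper `smiles_to_filename`, the `.sdf` suffix and the default directory are assumptions made for this example; the exact naming scheme used by the project may differ.

```python
import hashlib
from pathlib import Path


def smiles_to_filename(smiles: str, out_dir: str = "output") -> Path:
    """Map a SMILES string to a path that contains only hex digits and '.sdf'."""
    # SMILES can contain characters such as '/', '\\', '#', '(' and '@',
    # which are awkward or invalid in file names; the hex digest is not.
    digest = hashlib.md5(smiles.encode("utf-8")).hexdigest()
    return Path(out_dir) / f"{digest}.sdf"


if __name__ == "__main__":
    print(smiles_to_filename("C[C@H](N)C(=O)O"))
```

Because the hash is one-way, recovering the SMILES string from a file name requires hashing the candidate ligands again and comparing the digests.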

Data

The data directory contains preprocessed files for the 5 proteins in ChEMBL with the most recorded IC50 activity values, as well as the 10k and HTS collections from David Graff's publication "Accelerating high-throughput virtual screening through molecular pool-based active learning".

Repository: uds-lsv/chemical-lm-active-learning
