Predict Suzuki Reaction Outcomes: A Guide to CGR Modeling with GitHub
Want to predict the yield of Suzuki reactions? This guide provides a step-by-step approach using Condensed Graph of Reaction (CGR) modeling with tools available on GitHub. Learn how to set up your environment, process data, and train machine learning models to forecast reaction outcomes.
1. Set Up Your CGR Modeling Environment for Suzuki Reactions
Before diving into the modeling process, you'll need to create the necessary environments. Run the provided script to establish three distinct environments:
- cgr-frag: For generating the crucial CGR fragments.
- ml-env: The standard machine learning environment for algorithms like Gradient Boosting Machines (GBM), k-Nearest Neighbors (kNN), and Random Forests (RF).
- dl-env: Specifically tailored for deep learning using ChemProp and Multi-Task Neural Networks.
These environments ensure compatibility and streamline your workflow for modeling Suzuki reactions with CGRs.
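The three environment names come from the setup described above; the repo's own script handles creation. As a minimal conda-based sketch (the envs/&lt;name&gt;.yml file layout is an assumption, not the repo's actual structure), the loop below prints one creation command per environment:

```shell
# Dry-run sketch: echo the creation command for each environment.
# Drop `echo` to actually create them (requires conda on PATH).
for env in cgr-frag ml-env dl-env; do
  echo "conda env create -f envs/$env.yml"
done
```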
2. Prepare Your Suzuki Reaction Dataset
A pre-processed dataset of Suzuki reactions from a JACS 2022 paper is included. It is located at jacs_data_extraction_scripts/dataset/suzuki_USPTO_with_hetearomatic.txt, and the data is already organized into splits under data/parsed_jacs_data/splits.
Need to recreate the dataset? Simply run the make_jacs_dataset_and_frags.sh script, which automates these steps:
- Parses the raw reaction data.
- Atom-maps the reactions for accurate representation.
- Removes any duplicate entries.
- Fragments the CGRs, making them ready for Suzuki reaction modeling.
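As an illustration of the deduplication step, the sketch below keeps only the first record per reaction SMILES. The tab-separated layout and column order are assumptions for the example, not the repo's actual file format:

```shell
# Build a tiny example file: reaction SMILES, then a yield column (assumed layout).
printf 'CC(=O)O>>CCO\t80\nCC(=O)O>>CCO\t80\nBrc1ccccc1>>c1ccccc1\t65\n' > reactions.txt

# Keep only the first occurrence of each reaction SMILES (column 1).
awk -F'\t' '!seen[$1]++' reactions.txt > reactions_dedup.txt

wc -l < reactions_dedup.txt   # 2 unique reactions remain
```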
3. Model Training for Suzuki Reaction Prediction
Optimized hyperparameter configurations reside in the hpopt/coarse_solvs folder, including config files tailored for RF, GBM, D-MPNN, and CGR-MTNN models. Full configuration files offering the complete range of model options are located in the configs folder.
4. Reproducing Published Suzuki Reaction Results
To replicate the original paper's findings, execute these scripts:
- test_all_models_cpu.sh -s coarse: trains and tests the baseline models (Pop. Baseline, Sim. Baseline, and RF) on coarse solvent classes without GPU acceleration.
- test_all_models_gpu.sh -s coarse: trains and tests the GPU-accelerated models (CGR GBM, CGR MTNN, Morgan MTNN, and CGR D-MPNN).
Repeat these steps for the 'fine' solvent classification, adding the -c flag to indicate that the coarse hyperparameters should be reused, since they were optimized on the coarse solvent classes.
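Putting both phases together, the loop below prints the four invocations described above. The echo acts as a dry-run guard; remove it to actually launch the runs:

```shell
for split in coarse fine; do
  flags="-s $split"
  # The fine runs reuse the coarse-optimized hyperparameters via -c.
  [ "$split" = "fine" ] && flags="$flags -c"
  echo "bash test_all_models_cpu.sh $flags"
  echo "bash test_all_models_gpu.sh $flags"
done
```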
5. Hyperparameter Optimization to Improve Prediction
For even better performance, consider re-running the hyperparameter optimization. Be warned: this is computationally intensive!
- Optimize CGR-RF models.
- Optimize CGR GBM, CGR MTNN, and CGR D-MPNN models.
Optimized configurations land in the hpopt/coarse_solvs directory. Manually copy the optimized hyperparameters from the CGR MTNN run into the appropriate configuration files under configs/coarse_solvs/cgr_mtnn/split_seed_{split}/.yml and configs/fine_solvs_coarse_hparams/cgr_mtnn/split_seed_{split}/.yml.
6. Run Models to Predict Suzuki Reaction Yield
The MorganFP and CGR-MTNN models are built with PyTorch Lightning and can be launched directly from the command line. The entry point will either fit (train) or test the model, depending on the specified configuration file, which makes it well suited to scripted, repeatable Suzuki reaction analysis.
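A typical PyTorch Lightning CLI invocation looks like the sketch below. The entry-point script name (train.py) and the config path are assumptions for illustration, so check the repo for the actual names:

```shell
CONFIG="configs/coarse_solvs/cgr_mtnn/split_seed_0/config.yml"  # hypothetical path

# Echoed as a dry run; remove `echo` to launch inside dl-env.
echo "python train.py fit  --config $CONFIG"   # train the model
echo "python train.py test --config $CONFIG"   # evaluate a trained model
```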