Predict Suzuki Reaction Outcomes: A Guide to CGR Modeling with GitHub
Want to predict the yield of Suzuki reactions? This guide provides a step-by-step approach using Condensed Graph of Reaction (CGR) modeling with tools available on GitHub. Learn how to set up your environment, process data, and train machine learning models to forecast reaction outcomes.
1. Set Up Your CGR Modeling Environment for Suzuki Reactions
Before diving into the modeling process, you'll need to create the necessary environments. Run the provided script to establish three distinct environments:
- cgr-frag: For generating the crucial CGR fragments.
- ml-env: The standard machine learning environment for algorithms like Gradient Boosting Machines (GBM), k-Nearest Neighbors (kNN), and Random Forests (RF).
- dl-env: Specifically tailored for deep learning using ChemProp and Multi-Task Neural Networks.
These environments ensure compatibility and streamline your workflow for modeling Suzuki reactions with CGRs.
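The three environment names come from the setup described above; the repo's own script handles creation. As a minimal conda-based sketch (the envs/&lt;name&gt;.yml file layout is an assumption, not the repo's actual structure), the loop below prints one creation command per environment:

```shell
# Dry-run sketch: echo the creation command for each environment.
# Drop `echo` to actually create them (requires conda on PATH).
for env in cgr-frag ml-env dl-env; do
  echo "conda env create -f envs/$env.yml"
done
```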
2. Prepare Your Suzuki Reaction Dataset
A pre-processed dataset of Suzuki reactions from a JACS 2022 paper is included. It is located at jacs_data_extraction_scripts/dataset/suzuki_USPTO_with_hetearomatic.txt, and the data is already organized into splits under data/parsed_jacs_data/splits.
Need to recreate the dataset? Simply run the make_jacs_dataset_and_frags.sh script, which automates these steps:
- Parses the raw reaction data.
- Atom-maps the reactions for accurate representation.
- Removes any duplicate entries.
- Fragments the CGRs, making them ready for Suzuki reaction modeling.
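As an illustration of the deduplication step, the sketch below keeps only the first record per reaction SMILES. The tab-separated layout and column order are assumptions for the example, not the repo's actual file format:

```shell
# Build a tiny example file: reaction SMILES, then a yield column (assumed layout).
printf 'CC(=O)O>>CCO\t80\nCC(=O)O>>CCO\t80\nBrc1ccccc1>>c1ccccc1\t65\n' > reactions.txt

# Keep only the first occurrence of each reaction SMILES (column 1).
awk -F'\t' '!seen[$1]++' reactions.txt > reactions_dedup.txt

wc -l < reactions_dedup.txt   # 2 unique reactions remain
```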
3. Model Training for Suzuki Reaction Prediction
Optimized hyperparameter configurations reside in the hpopt/coarse_solvs folder, including config files tailored for RF, GBM, D-MPNN, and CGR-MTNN models. Full configuration files offering the complete range of model options are located in the configs folder.
4. Reproducing Published Suzuki Reaction Results
To replicate the original paper's findings, execute these scripts:
- test_all_models_cpu.sh -s coarse: trains and tests the baseline models (Pop. Baseline, Sim. Baseline, and RF) on coarse solvent classes without GPU acceleration.
- test_all_models_gpu.sh -s coarse: trains and tests the GPU-accelerated models (CGR GBM, CGR MTNN, Morgan MTNN, and CGR D-MPNN).
Repeat these steps for the 'fine' solvent classification, adding the -c flag to indicate that the coarse hyperparameters should be reused, since they were optimized on the coarse solvent classes.
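Putting both phases together, the loop below prints the four invocations described above. The echo acts as a dry-run guard; remove it to actually launch the runs:

```shell
for split in coarse fine; do
  flags="-s $split"
  # The fine runs reuse the coarse-optimized hyperparameters via -c.
  [ "$split" = "fine" ] && flags="$flags -c"
  echo "bash test_all_models_cpu.sh $flags"
  echo "bash test_all_models_gpu.sh $flags"
done
```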
5. Hyperparameter Optimization to Improve Prediction
For even better performance, consider re-running the hyperparameter optimization. Be warned: this is computationally intensive!
- Optimize CGR-RF models.
- Optimize CGR GBM, CGR MTNN, and CGR D-MPNN models.
Optimized configurations land in the hpopt/coarse_solvs directory. Manually copy the optimized hyperparameters from the CGR MTNN run into the appropriate configuration files under configs/coarse_solvs/cgr_mtnn/split_seed_{split}/.yml and configs/fine_solvs_coarse_hparams/cgr_mtnn/split_seed_{split}/.yml.
6. Run Models to Predict Suzuki Reaction Yield
The MorganFP and CGR-MTNN models are built with PyTorch Lightning and can be launched directly from the command line. The entry point will either fit (train) or test the model, depending on the specified configuration file, which makes it well suited to scripted, repeatable Suzuki reaction analysis.
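A typical PyTorch Lightning CLI invocation looks like the sketch below. The entry-point script name (train.py) and the config path are assumptions for illustration, so check the repo for the actual names:

```shell
CONFIG="configs/coarse_solvs/cgr_mtnn/split_seed_0/config.yml"  # hypothetical path

# Echoed as a dry run; remove `echo` to launch inside dl-env.
echo "python train.py fit  --config $CONFIG"   # train the model
echo "python train.py test --config $CONFIG"   # evaluate a trained model
```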