Using Catboost-SMOTE Machine Learning to Discover Brand New Amines which can be Synthesized and Tested for their CO2 Capture Performance
This post highlights the Synthetic Minority Over-sampling Technique (SMOTE), a statistical technique that we used in combination with Categorical Boosting (Catboost) in our recently published work on a predictive degradation model for amines used in the CO2 capture process. Formally, SMOTE addresses class imbalance by generating synthetic samples for the minority class. Put more simply, SMOTE systematically creates additional data points from existing, experimentally derived ones, so that more data are available for training an accurate predictive model.
Specifically, in our work only 27 amines of different chemical groups and structures were initially selected and tested for their degradation rates. SMOTE was then used to increase the number of amines to 51, based on the experimental data for the 27 known amines. The 24 synthetic amines from SMOTE filled in the amine groups that had insufficient structural variation. This step introduced more variation into all the chemical groups, which in turn allowed the degradation rates to be assessed more effectively. SMOTE also gave us data for amine groups that were either not commercially available or too expensive to test in an actual experiment.
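To make the idea concrete, here is a minimal sketch of the interpolation step at the heart of SMOTE: pick a real sample from an under-represented group, pick one of its nearest neighbors in the same group, and place a new point somewhere on the line between them. This is not the code from our published work. The function name `smote_like_samples`, the two descriptors, and the numbers in `known_group` are all hypothetical and chosen only for illustration.

```python
import math
import random


def smote_like_samples(points, n_new, k=2, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation.

    points: equal-length numeric feature vectors from ONE minority group.
    """
    if len(points) < 2:
        raise ValueError("need at least two points to interpolate")
    rng = random.Random(seed)          # fixed seed for repeatable output
    k = min(k, len(points) - 1)        # cannot have more neighbors than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(points))
        base = points[i]
        # Rank the other points by Euclidean distance to the base point.
        others = [p for j, p in enumerate(points) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        lam = rng.random()             # interpolation weight in [0, 1)
        # New point lies on the segment between base and neighbor.
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical descriptors (NOT data from the study) for three amines in one
# under-represented group:
# [carbons between N and OH, carbons in the end-alkyl group]
known_group = [[2.0, 1.0], [2.0, 2.0], [3.0, 1.0]]
new_points = smote_like_samples(known_group, n_new=3)
for p in new_points:
    print([round(v, 2) for v in p])
```

Because each synthetic point is interpolated between two real ones, it always stays within the range spanned by the known members of the group. The interpolated descriptors are continuous values, so mapping them back to an actual chemical structure is a separate step that this sketch does not cover.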
For example, in our work SMOTE generated six synthetic amines (syn-1 through syn-6) and estimated their degradation rates, as shown in the table in the graphical abstract. Of these six, syn-3, syn-4, and syn-6 belong structurally to the same amine group, the tertiary N,N'-alkylalkanolamines, for which experimental data were originally available for only 3 known amines (shown in blue). Including these 3 synthetic amines (shown in yellow) added more structural variation to this amine family (i.e., in the end-alkyl group and in the number of carbons between the amino nitrogen and the hydroxyl (-OH) group), thus increasing the accuracy of the amine degradation rate model built by Catboost at the final stage. SMOTE can also be used on its own, without Catboost, to help identify and synthesize a desired amine whose chemical structure is least prone to degradation, or that has other desirable CO2 capture characteristics, before any rigorous lab testing is carried out. This capability can eliminate much of the expensive lab work normally needed for amine selection and formulation.