Sunday, December 3, 2017

TS_conf_search: Conformer search for transition states

Here's som prototype code for performing a conformer search for a transition state.  The idea is that you have found a TS for small model of the system and now you want to attach floppy ligands, generate different conformers for these ligands, and MMFF minimise them while keeping the core (model) frozen.  Each conformer is then used as an initial guess for a TS search, perhaps preceded by a constrained minimisation using the QM method.

To use the program type "python TS_conf_search ts.sdf small_model.sdf  5".

ts.sdf is an sdf file that contains a guess structure of the TS you want to do a conformational search for.  The coordinates of the small model-part of the structures does not have to match that of the small model and it doesn't even have to be very TS-like but it has reflect the bonding in the small model. Alternatively you can provide a SMILES string of the molecule.

small_model.sdf is an sdf file with with the optimized coordinates for the small model TS.  You can use any name for the file as long as the extension is sdf. TS_conf will only use the coordinates of the non-hydrogen atoms, unless the word "keepHs" is part of the file name.  If you use keepHs then the hydrogen atoms have to have corresponding  hydrogen atoms in the big model (see examples below)

5 is the number conformers you want to generate and you can use any integer you want here.  The default is 20.

What TS_conf_search does
TS_conf_search uses the ConstrainedEmbed function implemented in RDKit, which generates a conformer of a molecule, finds the atoms in the molecule corresponding to the small model, and then performs a FF minimization in which the Cartesian coordinates of these atoms are constrained to match the Cartesian coordinates of the small model.  If the small model contains two separate molecules then intramolecular energy terms are ignored.

sdf files
sdf files contain the Cartesian coordinates of the atoms but also explicitly defines the bond orders (single, double, etc) and any formal charges.  It is important the the bond orders in the ts.sdf and small_model.sdf files match. sdf files are basically the same as mol files, but if your program generates mol files be sure to change the file extension to sdf.  You can use OpenBabel to generate sdf files from nearly any other format (however, see next section).

Examples (and when to use keepHs)
The GitHub repository contains two Diels-Alder examples, where the small models are the PM3 optimized transition states for R = H, and the big model is R = CH2CH2CH2CH3.

I created the small model sdf files by converting the GAMESS output files to sdf using Open Babel.  However, the bond order didn't correspond to that shown in the picture above, so I manually edited the bond orders in the sdf file and remved the "RAD line" that OpenBabel created.

For Diels_Alder1 we type 'python "Diels_Alder1;ClC=C.C=C(CCCC)C=C" Diels_Alder1_small.sdf 5' i.e. we build the TS guess from the SMILES string of the molecule.

For Diels_Alder2 we type "python Diels_Alder2.sdf Diels_Alder2_small_keepHs.sdf 5". Here I got Diels_Alder2.sdf by loading the small model in to Avogadro, adding the CH2CH2CH2CH3 chain in some arbitrary conformation, and minimising the  structure.

In the second example, I use keepHs because the dienophile (ethene) only has two heavy atoms, which is not enough to define the orientation of the plane of the molecule relative to the diene. Because I use keepHs I have removed the hydrogen corresponding to the R group, so that all the hydrogens in the small model can be matched to hydrogens in the large model.

While Diels-Alder2 provides one example where keepHs is useful.  Another is where hydrogens atoms are an integral part of the mechanism, such as in hydrogen or proton transfer TSs.


I used each of the geometries as a starting guess for a TS search using GAMESS where I calculated the Hessian to start with (hess=calc) and after every fifth step (ihrep=5).

In the first example none of the five runs found the correct TS, while for the latter example all 5 worked. When I re-ran the first example with python "Diels_Alder1;ClC=C.C=C(CCCC)C=C" Diels_Alder1_small_keepHs.sdf 5 it worked fine.

Not using keepHs has the advantage that one doesn't have to make different small model sdf files when different hydrogens are replaced by substituents. But in this case an extra constrained minimisation step appears to be needed to get the hydrogen atoms placed correctly.

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Friday, November 24, 2017

Random versus systematic errors in computed reaction enthalpies

Jimmy Kromann, Alexander Welford, Anders Christensen, and I have been testing the connectivity-based hierarchy (CBH) approach, developed by Raghavachari and co-workers, for semempirical methods. For most methods the improvement has been quite modest and here I discuss why (you'll need to read the paper to follow what I write here).

For example, for PM6 the RMSE relative to G4 for the 23 parent reactions is 9.4 kcal/mol, which increases to 16.4 kcal/mol for CBH-1 and decreases to 7.1 kcal/mol for CBH-2.  For comparison, the corresponding values for HF/6-311G(d,p) are 15.2, 11.4, and 3.0 kcal/mol, respectively.

To see what's going on, let's look at the CBH-1 correction for the first reaction in the set. The PM6 error is -0.3 kcal/mol for the parent reaction and the CBH-1 correction is -12.7 kcal/mol. PM6 is doing great for the parent reaction so, ideally, the CBH-1 correction should be 0.  Why is it so large in magnitude?

It is a Diels-Alder reaction, so two double bonds are turned into four single bonds and the CBH-1 reaction is (using SMILES notation) 4 C + 2 C=C -> 4 CC.  The error in the heats of formation (HOF) relative to G4 are C: -5.6, C=C -3.3, and CC: -4.1 kcal/mol. So the CBH-1 correction is -[4(-4.1) - 2(-3.3) - 4(-5.6)] = -12.6 kcal/mol.  So the CBH-1 correction is large because the error for C is somewhat larger than for CC and, especially, C=C.

These errors also shows that the 0 kcal/mol error for the parent reaction is clearly fortuitous. If the errors in the HOFs of the reactants and products are similar to those observed for C, C=C, and CC then one would expect an error in the reaction energy of about 3-6 kcal/mol since you go from 2 to 1 molecule.

Let's compare the PM6 results to HF/6-311G(d,p). The error for the parent reaction is -12.2 kcal/mol, which decreases to -7.9 with the CBH-1 correction.  The error in the HOFs relative to G4 are C: -95.4, C=C -140.6, and CC: -166.8 kcal/mol. So the CBH-1 correction is -[4(-166.8) - 2(-140.6) - 4(-95.4)] = 4.4 kcal/mol, which when added to -12.2 lowers the error in reaction energy to -7.9 kcal/mol.  Clearly the errors in HOFs are much than larger for PM6, yet most of the error cancels.  Why?

The answer is that the magnitude of the HOF error scales with the number and types of bonds in the molecule.  For example, the HOF-error of C=CC is -213.2 kcal/mol, which is approximately the sum of the HOF-errors of CC and C=C, minus the error for C [-166.8 -140.6 + 95.4 = -212.0 kcal/mol].  This type of error scaling makes HF-results very amenable to improvement using the CBH approach.

Implications for optimisation of new semiempirical methods
Most NDDO-based semi-empirical methods are optimised by reducing the HOF-RMSE (or MUE). If the absolute errors for predictions are consistently below, say, 1 kcal/mol then the maximum possible reaction energy-error will be 2, 3, and 4 kcal/mol for a A -> B, A + B -> C, and A + B -> C + D type reactions respectively.

However, if the errors are larger (say 5 kcal/mol on average) then the best case scenario for this approach is one where the errors are as systematic as possible (i.e. that all are as close to 5 kcal/mol as possible).  Then the reaction energy-error will be around 0 for A -> B and A + B -> C + D reactions and typically around 5 kcal/mol for A + B -> C reaction (which one could correct for since it is consistent).

As shown by the HF example, another approach could be to minimise the the error in a "bond-corrected HOF" for larger molecules, e.g. minimise the errors in HOF(C), HOF(C=C), HOF(CC), and [HOF(C=CC) - HOF(C=C) - HOF(CC) + HOF(C)] and then present a CBH-1-corrected HOF for C=CC to the user.

This approach might make it easier to find more generally applicable parameters that better and systematically minimise the error, since the underlying HF approach "naturally" gives larger errors for larger systems and we are no longer asking the parameters to "undo that" by searching for small errors independent of system size.

Can we tell from the fragment-HOF errors if CBH-corrections will improve results?
No, but perhaps we can identify cases where the correction makes it much, much worse.  For example, the fifth reaction in the set is also a Diels-Alder reaction, so the CBH-1 correction is the same as for reaction 1, but the PM6 error for the parent reaction is 9.1 kcal/mol, so the CBH-1 correction lowers the error to -3.2 kcal/mol.

On the other hand, the PM6 CBH-2 correction for Reaction 23 is -44.2 kcal/mol, which is unusually large (and makes things much worse).  The CBH-2 reaction 2 C#C + 3 CC + 5 C=C=C -> 6 C=C + 5 C#CC and the large error comes from the fact that the HOF error is 7.6 kcal/mol, much larger and of opposite sign than the other HOF-errors in the reaction, plus the fact that it is multiplied by 5.

However, it is not enough to simply look for outliers. The largest PM6 HOF error among the fragment we have studied so far is 11.7 kcal/mol for NN, and this enters in to the CBH-2 correction for R10, but this correction actually lowers the error relative to the parent reaction. 

Perhaps the best one can do is to view corrections that are significantly larger than 20 kcal/mol with suspicion since such large errors are rarely observed for the parent reactions.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Sunday, November 12, 2017

Atom_mapper: determining atom correspondence between reactants and products

Many TS search algorithms interpolate between reactants and products and require that the atom order is identical. This is time consuming to enforce by hand and hard to automate, but this paper by the folks at Schrödinger describe a solution that seems to work pretty well. The general idea is also fairly easy to implement if you are familiar with RDKit or similar cheminformatics tools and here I describe my implementation so far.

The basic idea
If two molecules are identical but have different atom numbering then RDKit can map the atoms using "GetSubstructMatch". So (pseudocode) "OCC.GetSubstructMatch(CCO)" gives "(2,1,0)" meaning that atom 2 in CCO corresponds to atom 0 in OCC and so forth.  And "RenumberAtoms(CCO, (2,1,0))" gives OCC.

Let's say our reaction is CH3-CH=O -> CH2=CH-OH.  "product.GetSubstructMatch(reactant)" will not give a match because the molecules are different. To focus on connectivity we remove bond orders: CH3-CH-O -> CH2-CH-OH, but the molecules are still different. The next step is to break each bond in turn and compare. For the reactant you get [H + CH2-CH-O, CH3 + CH-OH, H + CH3-C-O, O + CH3-CH2] and the first entry will match the last entry in the corresponding list for the product [H + CH-CH-OH, CH2 + CH-OH, H + CH2-C-OH, H + CH2-CH-O].

It is very easy to check to for equivalence by converting the list of fragmented molecules to a set of corresponding SMILES strings and looking for identical elements in the two sets: "matches = list(set(react_frag_smiles) & set(prod_frag_smiles))". If there is a match then the atoms can be matched as before because they are identical (note that the bond breaking does not alter the atom numbering of the molecule). If no match is found then the procedure is repeated for the fragments.

One problem is that equivalent atoms are not matched in 3D. For example, if you label (i.e. number) the equivalent H atoms in a methyl group then the methyl group is chiral and RDKit will create a random chirality.  So, the Cartesian coordinated of methyl hydrogens in the CCO example above are not necessarily superimposable, which creates problems in TS search algorithms that use interpolation.

Following the ideas presented in the Schrödinger paper, I try to solve this issue by switching pairs of equivalent atoms and comparing the RMSDs between reactant and product before and after.  However, this will only work if the reactants and products have the same conformation, so I use a modified version of ConstrainedEmbed (ConstrainedEmbedRP) which uses either Cartesian or distance constraints to match two structures using a UFF force field minimization of one of the structures.

Using ConstrainedEmbed has the added advantage of producing reactant and product structures that are in the same orientation and, therefore, often good starting points for interpolation.

* I have implemented these ideas in Atom_mapper and made it available on GitHub under the MIT license.
* The input is SMILES strings of reactant and products but the code can be adapted to work with user-supplied coordinates via sdf files.
* The output is two sdf files labelled "template" and "mol" with (hopefully) identical atom ordering. Template is the most rigid of the two and the coordinates come from an unconstrained UFF minimization. Mol is superimposed on template using ConstrainedEmbedRP.
* Both geometries should probably be minimized using the energy function you plan to use for the TS search and, depending on the optimization algorithm, the structures should probably be rigidly superimposed before interpolation
Current limitations
* I haven't tested the code extensively. If you run into cases where it doesn't work, file a bug report on GitHub
* No conformation search is done so the conformations chosen will be random. Solution: start with coordinates rather than SMILES.
* If reactants or products consists of two molecules their relative orientation resulting from the UFF minimization will be somewhat arbitrary. This is not a problem if only the reactants or products consists of two molecules, since the coordinates will be superimposed on a more rigid molecule. However, it is a problem if the reactants and products both consist of two molecules. Solution: start with coordinates rather than SMILES, but not sure how to determine best intermolecular orientation automatically.
* The code only permutes pairs of equivalent atoms so it won't work for equivalent groups, e.g. methyl groups in CC(C)(C)CO because we need to switch several pairs (corresponding to C and 3 H atoms) simultaneously.  Not sure how to fix that easily. Suggestions welcome.

There are also some general limitations to this "active bonds" approach as described in the article.

This work is licensed under a Creative Commons Attribution 4.0

Sunday, September 3, 2017

Automatic generation of a set of molecules

Many quantum chemistry projects have reached a point where setup and analysis consumes more human time than CPU time, e.g. it takes all day to set-up enough input files to keep the computer busy overnight. Many people use scripts to automatically extract and analyse the data, but few use scripts to generate the coordinates.

Here I show how this can easily be done using the RDKit toolkit.  The following python script adds F and OH substituents to benzene making all possible 91 combinations of mono-, di-, hexa-substituted molecules.

However, the script can easily be changed to do something else by changing parent_smiles and rxn_smarts_list in line 4 and 5.  If you are not familiar with SMILES start here and there are plenty of GUIs, such as this, that generate SMILES.

To use the Reaction SMARTS you have to learn SMARTS, which can be a bit tricky, but it is a very powerful tool. For example, if you change [cX3;H1:1]>>[*:1]F to [*;H1:1]>>[*:1]F then the program will add H to any atom with one H, i.e. also the OH group to create the OF substituent.  So it you set substitutions = 2, you'll get mono-substituted Ph-OF in addition to mono- and di-substituted Ph-F and Ph-OH.

Similarly, ff you add [cX3;H1:1][cX3;H1:2]>>[c:1]1[c:2]cccc1 to the list (and use substitutions = 2) you'll get un- and mono-substituted napthalene as well as un-substituted anthracene and phenanthrene.

In my experience, the only thing that limits what I can build with this approach is my understanding of SMARTS.  Hope this is of some use to you.

This work is licensed under a Creative Commons Attribution 4.0

Thursday, August 10, 2017

Predicting most labile CH bond using semiempirical methods

Yesterday, I mentioned on Twitter that I had written some prototype code to predict labile proton using PM3/COSMO.  A few people expressed interest so I put the code on GitHub.

The code is based on some previous work and removes all CH protons/hydrides/H atoms and finds the position with the lowest free energy.  I've just finished the code so I have no idea how well it works or if PM3/COSMO is the best choice.  Also, there are very few comments and some aspects of the setup (paths, etc) are particular to my machine and needs to be changed.

In its current form the method doesn't consider different reference molecules, so the prediction is "pure PM3/COSMO".  The basic code to do it is there but I haven't extended it yet. Also, the code only considers CH groups, so if have other groups (OH, NH, etc) with lower pKa they won't be considered.  This is very much a work in progress but if you have suggestions, test cases, ideas, or any other kind of feedback please let me know.  I hope you find it useful.

This work is licensed under a Creative Commons Attribution 4.0

Saturday, July 15, 2017

Planned papers for 2017 - six months in

In January I wrote about the papers I plan to publish and made this list:

1. Protein structure refinement using a quantum mechanics-based chemical shielding predictor
2. Prediction of pKa values for drug-like molecules using semiempirical quantum chemical methods

3. Intermolecular Interactions in the Condensed Phase: Evaluation of Semi-empirical Quantum Mechanical Methods
4. Fast Prediction of the Regioselectivity of Electrophilic Aromatic Substitution Reactions of Heteroaromatic Systems Using Semi-Empirical Quantum Chemical Methods
5. Benchmarking cost vs. accuracy for computation of NMR shielding constants by quantum mechanical methods
6. Improved prediction of chemical shifts using machine learning
7. PM6 for all elements in GAMESS, including PCM interface

Probably not in 2017
8. Protonator: an open source program for the rapid prediction of the dominant protonation states of organic molecules in aqueous solution
9. pKa prediction using semi-empirical methods: difficult cases
10. Prediction of C-H pKa values and homolytic bond strengths using semi-empirical methods
11. High throughput transition state determination using semi-empirical methods

The status is

1. Protein structure refinement using a quantum mechanics-based chemical shielding predictor
2. Prediction of pKa values for drug-like molecules using semiempirical quantum chemical methods

Probable (at least submission)
4. Fast Prediction of the Regioselectivity of Electrophilic Aromatic Substitution Reactions of Heteroaromatic Systems Using Semi-Empirical Quantum Chemical Methods

We had a draft ready to be sent in in mid-April, but decided to include a "few" more types of heteroaromatics and the study has now ballooned to nearly 600 compounds (up from about 150 in the original draft)! The calculations and most of the analysis is done, but the paper basically has to be rewritten, and this really has to be done by my synthetic chemistry co-author, since the study is now much more about the chemistries of these heteroaromatic and much less about the method.  Even though its out of my hands I am still hopeful that we can submit it this year.

9. pKa prediction using semi-empirical methods: difficult cases automation

This paper is 2/3 written and presents a completely automated PM3-based pKa prediction protocol. The method works quite well, but most outliers turn out to be due to high energy conformations. The main remaining issue is to find a conformer-search protocol that consistently gives low-energy conformations. Depending on how much time I have to devote to paper 4 and the proposal mentioned below, I am still hopefull I can get this published this year.

7. PM6 for all elements in GAMESS, including PCM interface

This will basically be a paper on the accuracy of the SMD method using semiempirical methods.  I'm hopefull we will submit this year, but there is still a lot to do.  Among other things, it explains why PM3/COSMO works best for pKa predictions: the solvation energy errors happen to be smallest for this method.

10. Prediction of C-H pKa values and homolytic bond strengths using semi-empirical methods

I am planning to submit a proposal on prediction of pKa,  homolytic bond strengths, etc as predictors of CH activation-sites. The proposals is due September 25th, so much of my "spare" time until then will be spent on getting preliminary results and writing.

This work is licensed under a Creative Commons Attribution 4.0

Sunday, May 14, 2017

Predicting pKa values using PM3 - conformer search using implicit solvation

Disclaimer: These are preliminary results and may contain errors

In a previous post I showed that minimising RDKit-generated conformers generated with MMFF led to slightly worse PM3/COSMO pKa predictions. One reason might be that the MMFF minimisation is done in the gas phase.  Anders Christensen reminded me that TINKER can do MMFF/GBSA minimisations so I used TINKER so minimise 20 and 50 RDKit-generated conformers in used these as initial guesses for the PM3/COSMO energy minimisations (20mmtk and 50mmtk

As you can see there is relatively little difference in the overall performance of Xmm and Xmmtk. The error distribution is a bit tighter for the tk variants and 50mmtk does better at finding structures closer to the "global" minimum, but 50nomm is still the best choice for avoiding >1 pKa errors.

So, either GBSA is not a good substitute for COSMO or MMFF is not a good substitute for PM3.  In light of this recent paper my money is on the latter.  So I'll go with 50nomm for now.

This work is licensed under a Creative Commons Attribution 4.0