3.3. Force field design¶
3.3.1. Physical Background¶
The Alexandria force-fields have a functional form consisting of van der Waals (vdw), electrostatics (coul), polarization (pol) and bonded terms including a radial (b), angular (a), out-of-plane dihedral (i) and torsion (d) terms.
where \(r_{ij}\) refers to the distance between atoms \(i\) and \(j\), \(r_{cs}\) to the distance between a core and a shell (Drude) particle, \(\theta_{ijk}\) to the angle given by three atoms \(i\), \(j\) and \(k\), and \(\phi_{ijkl}\) to a torsional angle between the planes given by atoms \(i,j,k\) and atoms \(j,k,l\). For each of the terms, multiple functional forms are available, so that within the Alexandria framework, different force-fields, including previously published ones, can be reconstructed and compared to one another in a systematic manner [6]. For details on these potentials we refer to Chapter Energy Function.
In total there are seven atom parameter types and 7-14 bond parameter types. A variety of virtual sites can be used, like those used in water models [34, 93] or to model anisotropy due to \(\sigma\)-holes on halogen atoms or water [26].
3.3.2. Determining atom types¶
When importing structure files (SDF, PDB, or XYZ), ACT uses RDKit for the initial chemical perception, including sanitization and aromaticity assignment [94].
Attention
If bond information is present in the PDB or SDF file, it will be used by ACT. That means the user is responsible for providing correct information, unless the compound is a special case, see below. If there is no bond information in the PDB (or a XYZ) file, it will be generated using RDKit. There is no guarantee that the output from RDKit is correct in all cases, however. Please check your output.
Atom typing in ACT is based on SMARTS patterns stored in share/atom_bond.xml [95]. The distributed atom_bond.xml file assigns atom types for carbon, nitrogen, oxygen, phosphorus, and sulfur according to their hybridization state, adding 1,2,3 behind the element symbol. For all other elements, atom types are assigned based on formal charge. For neutral atoms, the default atom type is the lowercase element symbol.
The file also contains special cases, such as resonance structures or situations where RDKit assigns bond orders or hybridizations that differ from those required by the ACT force fields. For example, sulfate (SO\(_4\)\(^{2-}\)) is assigned with one sulfur as s3 and four oxygens with mixed types (o- and o2), which does not reflect the resonance equivalence of the oxygens. These special rules are typically placed at the top of the file.
You may provide your own atom_bond.xml. ACT always checks the current working directory first (the filename must be exactly atom_bond.xml), and if no file is found there, it falls back to the shared directory. The file is ordered from most specific to most general rules (see Table 3.2 and Table 3.3).
ACT provides two Jupyter notebooks to assist with inspecting and developing SMARTS patterns:
atom_bond_demo.ipynb: Provide a specific molecule to obtain the assigned bonds and atom types, together with a labeled 3D visualization.
test_smarts.ipynb: Provide a molecule and a SMARTS pattern and the notebook returns the matching atoms and a 2D depiction with highlighted hits.
Users are strongly encouraged to verify that the chemical interpretation is appropriate for their specific use case. If you need assistance creating new SMARTS patterns, please open an issue on GitHub.
The code to determine atomtypes is implemented in both the C++ code and in a Python file get_mol_dict.py that is part of the ACT distribution. The latter can be used in conjunction with the generating quantion chemistry data in the SaptACT repository.
Molecule/Rule |
SMARTS / recursive SMARTS |
atom_type(s) |
q |
bo |
|---|---|---|---|---|
sulfate |
[#16](=[#8])(=[#8])(-[#8-])-[#8-] |
s3, o2, o2, o2, o2 |
-2 |
1.5 |
phosphate2 |
[#8-]-[#15](-[#8-])(-[#8-])=[#8] |
o3, p3, o3, o3, o3 |
-3 |
1.5 |
phosphate |
[#15D4]([#8D1])([#8D1])([#8-,#8D1])([#8-,#8D1]) |
p3, o3, o3, o3, o3 |
-3 |
1.5 |
phosphate2 (variant 2) |
[#8-]-[#15](-[#8-])(-[#8-]) |
o3, p3, o3, o3 |
-2 |
1.5 |
carboxylate |
[#6X3](-[#8-])=[#8] |
c2, o2, o2 |
-1 |
1.5 |
phosphate3 |
[#8-]-[#15](-[#8-]) |
o3, p3, o3 |
-1 |
1 |
nitro1 |
[#7+](-[#8-])=[#8] |
n2, o2, o2 |
0 |
1.5 |
nitro2 |
[#7+](-[#8])=[#8] |
n2, o2, o2 |
0 |
1.5 |
water |
[#8](-[#1])-[#1] |
ow, hw, hw |
0 |
1 |
Atom type rule |
SMARTS / recursive SMARTS |
atom_type(s) |
q |
|---|---|---|---|
gaff h5 ring 2EWG |
[#1X1;$(*-[c](~[#7,#8,#16,#17,#35,#53])~[#7,#8,#16,#17,#35,#53])] |
h5 |
0 |
gaff h4 ring 1EWG |
[#1X1;$(*-[c]~[#7,#8,#16,#17,#35,#53])] |
h4 |
0 |
gaff h3 chain 3EWG |
[#1X1;$(*-[C]([#7,#8,#16,#17,#35,#53,F,Cl,Br,I])([#7,#8,#16,#17,#35,#53,F,Cl,Br,I])([#7,#8,#16,#17,#35,#53,F,Cl,Br,I]))] | h3 | 0 |
||
gaff hc tertcarbon like |
[#1X1;$(*-[#6X4]([#6])([#6])[#1])] |
hc |
0 |
gaff h2 C eq 2EWG |
[#1X1;$(*-[C](=[#7,#8,#9,#16])([#7,#8,#9,#16]))] |
h2 |
0 |
gaff h2 chain 2EWG |
[#1X1;$(*-[C]([#7,#8,#9,#16,#17,#35,#53])[#7,#8,#9,#16,#17,#35,#53])] |
h2 |
0 |
gaff h1 chain 1EWG |
[#1X1;$(*-[C]-[#7,#8,#9,#16,#17,#35,#53])] |
h1 |
0 |
gaff h1 chain 1EWG double |
[#1X1;$(*-[C]=[#7,#8,#9,#16])] |
h1 |
0 |
gaff ha conjugated CX3 eq C |
[#1X1;$(*-[#6X3]=[#6])] |
ha |
0 |
gaff ha triplebond |
[#1X1;$(*-[#6X2]#[#6])] |
ha |
0 |
gaff hc aliphatic C |
[#1X1;$(*-[#6X4])] |
hc |
0 |
hcl |
[#1X1;$(*-Cl)] |
hcl |
0 |
hbr |
[#1X1;$(*-Br)] |
hbr |
0 |
gaff ho oxygen |
[#1X1;$(*-O)] |
ho |
0 |
gaff hn nitrogen |
[#1X1;$(*-N)] |
hn |
0 |
gaff hn aromatic n |
[#1X1;$(*-n)] |
hn |
0 |
gaff hs sulfur |
[#1X1;$(*-S)] |
hs |
0 |
gaff hp phosphorus |
[#1X1;$(*-P)] |
hp |
0 |
hf |
[#1X1;$(*-F)] |
hf |
0 |
hi |
[#1X1;$(*-I)] |
hi |
0 |
gaff ha generic |
[#1X1] |
ha |
0 |
gaff hc any |
[#1] |
hc |
0 |
p sp |
[#15;^1] |
p1 |
0 |
p sp2 |
[#15;^2] |
p2 |
0 |
p sp3 |
[#15;^3] |
p3 |
0 |
p sp3 sp3d |
[#15;^4] |
p3 |
0 |
p sp3 sp3d2 |
[#15;^5] |
p3 |
0 |
s sp |
[#16;^1] |
s1 |
0 |
s sp2 |
[#16;^2] |
s2 |
0 |
s sp3 |
[#16;^3] |
s3 |
0 |
s sp3 sp3d |
[#16;^4] |
s3 |
0 |
s sp3 sp3d2 |
[#16;^5] |
s3 |
0 |
c sp |
[#6;^1] |
c1 |
0 |
c sp2 |
[#6;^2] |
c2 |
0 |
c sp3 |
[#6;^3] |
c3 |
0 |
c sp3 sp3d |
[#6;^4] |
c3 |
0 |
c sp3 sp3d2 |
[#6;^5] |
c3 |
0 |
n sp3 +1 |
[#7+;^3] |
n4 |
1 |
n sp |
[#7;^1] |
n1 |
0 |
n sp2 |
[#7;^2] |
n2 |
0 |
n sp3 |
[#7;^3] |
n3 |
0 |
n sp3 sp3d |
[#7;^4] |
n3 |
0 |
n sp3 sp3d2 |
[#7;^5] |
n3 |
0 |
o sp |
[#8;^1] |
o1 |
0 |
o sp2 |
[#8;^2] |
o2 |
0 |
o sp3 |
[#8;^3] |
o3 |
0 |
o sp3 sp3d |
[#8;^4] |
o3 |
0 |
o sp3 sp3d2 |
[#8;^5] |
o3 |
0 |
li +1 |
[Li+] |
Li+ |
1 |
li 0 |
[Li] |
li |
0 |
na +1 |
[Na+] |
Na+ |
1 |
na 0 |
[Na] |
na |
0 |
k +1 |
[K+] |
K+ |
1 |
k 0 |
[K] |
k |
0 |
rb +1 |
[Rb+] |
Rb+ |
1 |
rb 0 |
[Rb] |
rb |
0 |
cs +1 |
[Cs+] |
Cs+ |
1 |
cs 0 |
[Cs] |
cs |
0 |
be +2 |
[Be+2] |
Be2+ |
2 |
be 0 |
[Be] |
be |
0 |
mg +2 |
[Mg+2] |
Mg2+ |
2 |
mg 0 |
[Mg] |
mg |
0 |
ca +2 |
[Ca+2] |
Ca2+ |
2 |
ca 0 |
[Mg] |
mg |
0 |
al +3 |
[Al+3] |
Al3+ |
3 |
al 0 |
[Al] |
al |
0 |
si -4 |
[Si-4] |
Si4- |
-4 |
si +4 |
[Si+4] |
Si4+ |
4 |
si 0 |
[Si] |
si |
0 |
p -3 |
[P-3] |
P3- |
-3 |
p +3 |
[P+3] |
P3+ |
3 |
p +5 |
[P+5] |
P5+ |
5 |
f -1 |
[F-] |
F- |
-1 |
cl -1 |
[Cl-] |
Cl- |
-1 |
cl +1 |
[Cl+] |
Cl+ |
1 |
cl +3 |
[Cl+3] |
Cl3+ |
3 |
cl +5 |
[Cl+5] |
Cl5+ |
5 |
cl +7 |
[Cl+7] |
Cl7+ |
7 |
cl 0 |
[Cl] |
cl |
0 |
br -1 |
[Br-] |
Br- |
-1 |
br +1 |
[Br+] |
Br+ |
1 |
br +3 |
[Br+3] |
Br3+ |
3 |
br +5 |
[Br+5] |
Br5+ |
5 |
br +7 |
[Br+7] |
Br7+ |
7 |
i -1 |
[I-] |
I- |
-1 |
i +1 |
[I+] |
I+ |
1 |
i +3 |
[I+3] |
I3+ |
3 |
i +5 |
[I+5] |
I5+ |
5 |
i +7 |
[I+7] |
I7+ |
7 |
br 0 |
[Br] |
br |
0 |
he 0 |
[He] |
he |
0 |
ne 0 |
[Ne] |
ne |
0 |
ar 0 |
[Ar] |
ar |
0 |
kr +2 |
[Kr+2] |
Kr2+ |
2 |
kr 0 |
[Kr] |
kr |
0 |
xe 0 |
[Xe] |
xe |
0 |
xe +2 |
[Xe+2] |
Xe2+ |
2 |
xe +4 |
[Xe+4] |
Xe4+ |
4 |
xe +6 |
[Xe+6] |
Xe6+ |
6 |
rn 0 |
[Rn] |
rn |
0 |
rn +2 |
[Rn+2] |
Rn2+ |
2 |
3.3.3. Determining partial charges¶
The electrostatic potential (ESP, Section Electrostatic Potential) has historically been used to determine partial charges [96] and the ACT supports training models to reproduce the ESP. However, in a very recent paper we have shown that fitting charges to reproduce the ESP in a limited volume around a compound is fundamentally flawed due to lack of information [38]. If the purpose is to build models that reproduce electrostatic interactions, this can be done directly by training models to reproduce SAPT energy components. Here, the split-charge equilibration (SQE) algorithm [97] is used to generate the effective partial charge on each atom in a molecule. SQE, in turn, is based on the electronegativity equalization method (EEM), as developed by Rapp{'e} and Goddard [98]. In brief, EEM uses the atomic hardness \(\eta\) and electronegativity \(\chi\) to determine the atomic charges in a molecule from a Taylor expansion of the molecular energy in terms of charges. The SQE algorithm introduces a correction to the atomic electronegativities for bonded atoms \(\Delta\chi\) as well as a bond hardness \(\Delta\eta\). With this addition, charge can flow through bonds only, which overcomes issues with over-polarization in the EEM [99].
The ACT code implements the possibility to generate charges for compounds in dimers or clusters where charge transfer between compounds is disallowed which is a reasonable approximation since charge transfer has been shown to have limited impact on the binding energy of non-covalent complexes [100]. For the SQE algorithm two atomic parameters (\(\chi\) and \(\eta\)) as well as two bond parameter types (\(\Delta\chi\) and \(\Delta\eta\)) need to be determined and the ACT can train SQE parameters to reproduce electrostatic and induction energies [38]. For background information we refer the reader to an excellent review by Jensen [101], but below follows a break-down of using SQE with shells or virtual sites.
3.3.4. Charge Equilibration with Shells or Virtual Sites¶
Among the approaches to modeling the charge-dependent component of a force field, those rooted in the chemical potential equalization principle are especially notable, as the principle stems directly from density functional theory [102]. The first computational implementation of the chemical potential equalization principle was the electronegativity equalization method (EEM) [98, 103]. However, due to limitations of this model, Chelli et al. proposed the atom-atom charge transfer (AACT) model [104]. Later, Nistor and co-workers combined the EEM and AACT approaches into a single framework, the split-charge equilibration (SQE) model, which fulfills the essential criteria for a successful charge-transfer potential [97, 99]. The ACT implements both the EEM and the SQE as algorithms for determining partial charges.
In brief, EEM minimizes an empirical model of the intramolecular electrostatic energy (computed from the atomic electronegativity \(\chi_i\) and atomic hardness \(\eta_i\)) with respect to the atomic partial charges \(q_i\), where \(i\) are the atoms. This method comprises a second order expansion of the molecular energy \(E_{\mathrm{EEM}}\) in terms of the partial charges \(q_i\):
where \(N\) is the number of atoms, \(\chi_i\) are the atomic electronegativities, \(\eta_i\) the atomic hardness, and \(J_{il}\) the Coulomb interaction between atoms. The factor \(\frac{1}{2}\) before the Coulomb matrix is to avoid double counting.
In this work, the SQE method is used, which addresses a shortcoming of the EEM, namely that molecules tend to get over-polarized [99]. For more background, we refer to the recent review on charge flow models by Jensen [101].
Verstraelen and co-workers proposed the following variant of the molecular energy:
where \(p_{ij}\) corresponds to the (intramolecular) charge transfer over bonds, \(\Delta\eta_{ij}\) is the bond hardness and \(\Delta\chi_{ij}\) is the bond electronegativity correction. Therefore, the charge variables \(q_i\) are replaced by charge-transfer variables \(p_{ij}\) which are related by
where \(q_{\mathrm{tot}}\) is the net charge on the compound and \(p_{ij} = -p_{ji}\). Although it is trivial to determine the partial charges \(q_i\) from the charge transfer \(p_{ij}\), the reverse is not necessarily true. As outlined by Chen et al. [105], the problem can be solved by expressing the energy in terms of the charge transfer variables. By substituting \(J_{ii} = \eta_i\) in Eq. (3.50), inserting Eq. (3.50) into Eq. (3.51) and introducing \(M_x\) as the number of bonds for species \(x\), we obtain:
The next step is to determine the \(p_{ij}\) that minimize \(E_{\mathrm{SQE}}\). Since all summations run over atoms \(i,j,k,l\), we take the derivative with respect to \(p_{ij}\) and equate it to zero:
using the identity of the Coulomb-matrix elements (\(J_{ij} = J_{ji}\)), the terms involving the atomic hardness \(\eta_x\) are incorporated into the diagonals of the \(J_{xy}\) matrix, excluding the contribution from the total charge \(q_{\mathrm{tot}}\). Note that the \(q_{\mathrm{tot}}\) terms in the two sums cancel. The first term in Eq. (3.53) is the difference in electronegativity between atoms \(i\) and \(j\) sharing the bond plus the correction; the second term represents the difference in electrostatic potentials at the atoms, and the third term accounts for the interaction between \(i\) and \(j\) times the charge transfer. This results in a coupled set of equations, written in matrix form as
where \(\mathbf{M}\) is a square matrix of dimension equal to the number of bonds, \(\mathbf{P}\) is the vector of the charge transfers for all bonds, and \(\mathbf{R}\) is the right-hand side of the equations. The matrix elements are given by
where \(\delta_{ij,kl}\) is one if bond \(ij\) is identical to \(kl\) and zero otherwise. The right-hand side is defined by the electronegativity terms according to
The charges of the shells (and virtual sites) are treated as constant in this algorithm, meaning that \(q_{tot}\) becomes the sum of the charges of the shells and virtual sites and the total charge of the compound. During force field training, all of these charges can be modified alongside the SQE parameters. As noted in the paper describing the ACT software [1], the SQE algorithm may not be flexible enough to reproduce optimal charge distributions, and other algorithms [101, 106] may need to be implemented in future versions of the software.