S-CGFG Algorithm

The core engine of ChemFAST is the S-CGFG (SMILES/SMARTS-driven Coarse-Graining/Fine-Graining) algorithm. This framework treats molecular simulation not as a geometric packing problem, but as a “Chemical Compilation” process.

By abstracting molecular systems into a Generalized Topological Representation, ChemFAST achieves resolution independence and chemical completeness. This chapter details the algorithmic stages from chemical definition to final force field assignment.

1. Chemical Definition: The “Source Code”

In ChemFAST, the “source code” of any simulation is defined purely by chemical descriptors, decoupling the user from the complexity of configuration files (e.g., ITP, RTP, PDB, GRO).

SMILES (Node Identity): Defines the atomic composition of a coarse-grained (CG) bead.
- Example: A styrene monomer bead is defined as C=Cc1ccccc1.
- Significance: This allows arbitrary mapping resolution. You can map a whole monomer to one bead, or split a monomer into backbone (CC) and side-chain (c1ccccc1) beads, simply by changing the input SMILES.
SMARTS (Connectivity Logic & Graph Modification): Defines the “Reaction Rules” that govern how beads connect. In ChemFAST, SMARTS acts as a chemical language for subgraph matching and modification.
- Syntax (Labeled vs. Unlabeled Atoms): In a reaction like vinyl polymerization encoded as [C:1]=C.[C:2]=C >> C[C:1][C:2]=C, the numerical maps (:1, :2) are critical. They define the active atoms that will undergo bond breaking or formation. Unlabeled atoms simply provide the necessary chemical context (e.g., “this active carbon must be adjacent to an unmapped double-bonded carbon”) but are not modified.
- Function: It mathematically enforces regioselectivity (Head-to-Tail), stereochemistry (cis/trans), and branching logic during both the CG assembly and All-Atom reconstruction phases.

Under the Hood: High-Performance Fragmented Reactions

Building a massive polymer topology poses a significant algorithmic challenge. If ChemFAST were to build a macromolecule by sequentially applying SMARTS reactions to an ever-growing molecular graph, the underlying subgraph matching would become exponentially slower as the molecule scaled up in size.

To ensure high computational efficiency, ChemFAST employs a Fragmented Reaction Algorithm:

Static Matching: Subgraph matching is always performed on the original, small reactant monomers-never on the growing macromolecule.
Delta Recording: Instead of immediately altering the molecular graph, ChemFAST simply calculates and records the “add, delete, and modify” operations for the mapped atoms and bonds.
Unified Modification: After traversing all required reactions for the system, a single, unified graph modification operation is executed to assemble the final macroscopic topology.

State Tracking & The “Reacted Atom Set”

Because ChemFAST matches against the original monomer templates every time, it must prevent the same reactive site from being erroneously used multiple times (which would yield chemically impossible, hyper-branched structures).

To manage this, the engine utilizes an allow_p (Allow Polymerization/Reaction) validation flag governed by a dynamically updated Reacted Atom Set.

As reactions are recorded, any atom that carries a map label (e.g., [C:1]) in the reactant SMARTS is added to the Reacted Atom Set.
Before any new reaction delta is recorded, the allow_p logic checks if the mapped atoms of the current match already exist in the Reacted Atom Set.
If all the mapped atoms in the current match (SMARTS) have already been “used,” the reaction is blocked at that specific site, ensuring strict valency and correct topological connectivity are maintained.

Note

Computational Efficiency & Reaction Order Dependency

This fragmented approach ensures that the computational cost of subgraph matching remains constant-dependent only on the size of the monomer templates-while the unified graph modification step scales linearly. This results in an overall efficient assembly process even for massive networks.

However, this algorithm is strictly order-dependent. For example, forming an A-B-C chain requires the compiler to know that A reacted with B first, and then the A-B intermediate reacted with C.

Explicit Order (Recommended): The optimal reaction sequence is typically provided directly by the CG simulation engine (e.g., HOOMD-blue/PyGAMD), which records bonds exactly as they form based on spatial proximity during the reaction trajectory.
Implicit Order (Fallback): If explicit order information is missing, ChemFAST will employ a Breadth-First Search (BFS) algorithm to deduce a default reaction sequence. While this fallback usually succeeds, it can occasionally lead to topological conflicts. Specifically, an incorrect sequence might prematurely add an atom to the Reacted Atom Set, causing the allow_p validation to fail and improperly blocking subsequent, valid reactions.

2. HSP-based CG Parameterization

To ensure the Coarse-Grained (CG) pre-equilibration phase is physically meaningful, ChemFAST derives interaction potentials directly from chemical structure using Hansen Solubility Parameters (HSP).

Instead of arbitrary tuning, parameters are generated via a physics-based pipeline:

HSP Prediction: A pre-trained Graph Attention Network (GAT) predicts the dispersion (\(\delta_D\)), polar (\(\delta_P\)), and hydrogen-bonding (\(\delta_H\)) components for each bead based on its SMILES graph. The model is trained on a dataset of experimentally measured HSP values (from HSPiP).
Interaction Strength (\(\epsilon\)): The interaction well-depth is derived from the Cohesive Energy Density (CED), calculated as \(\text{CED} = \delta_{tot}^2\). Parameters are scaled relative to a reference compound (e.g., PEO):

\[\epsilon_{i} = \frac{\delta_{tot,i}^2}{\delta_{tot,ref}^2} \times \epsilon_{ref}\]
Bead Size (\(\sigma\)): Effective diameters are estimated from the statistical average of gyration radii of 3D conformers generated via ETKDG.

This protocol ensures that the CG simulation captures correct thermodynamic trends (e.g., phase separation, solubility).

(i) Usage in CG Pre-Equilibration

HSP-derived parameters are used for the initial CG pre-equilibration. To maintain local structural accuracy, bonded parameters (e.g., equilibrium bond lengths, angles, and dihedrals) are derived from the statistically averaged geometries of the reaction products. Using monomer SMILES and reaction SMARTS as inputs, users can generate the CG topology and parameters by running the following code:

import json
from domd_cgbuilder.cg_ff import get_cg_ff

# Function to predict CG parameters and save to cg_parameters.txt
def predict_cg_params(molecules, reaction_template, iterations=3):
    graphs, prod_mols, stats, all_bonded, pairs, rigid_meta = get_cg_ff(molecules,reaction_template,iterations=iterations, create_epsilon=True)
    nonBondPara = {k:stats[k] for k in stats if '-' not in k}
    BondPara = {}
    AngPara = {}
    for k in stats:
        if '-' in k:
            if len(k.split('-')) == 2:
                BondPara[k] = stats[k]
            if len(k.split('-')) == 3:
                AngPara[k] = stats[k]
    for k in pairs:
        stats[str(k)] = pairs[k]
    json_str = json.dumps(stats,indent=4)
    with open('cg_parameters.txt', 'w', encoding='utf-8') as f:
        f.write(json_str)
    return

smiA = 'Nc1ccc(Oc2ccc(N)cc2)cc1'
smiB = 'O1C(=O)c2cc(Oc3cc4C(=O)OC(=O)c4cc3)ccc2C1=O'

reaction_template = {
        'B-A': {
            'cg_reactant_list': [('A', 'B')],
            'smarts': '[#7H2:1].[#6:3](=[#8:4])[#8:2][#6:5]=[#8:6]>>[#6:3](=[#8:4])[#7:1][#6:5]=[#8:6].[#8:2]',
            'prod_idx': [0]
        }
    }
mols = {
    'A': {'smiles': smiA, 'file': None},
    'B': {'smiles': smiB, 'file': None},
}

# Predict CG parameters and save to cg_parameters.txt
predict_cg_params(mols, reaction_template)

3. CG Construction Pathways

ChemFAST employs a multi-pathway strategy to generate the initial Coarse-Grained topology, adapting to the complexity of the target system:

(i) In Situ Polymerization (Reactive)

Target: Complex, amorphous networks (e.g., Epoxy resins, Hydrogels).
Algorithm: The system starts as a “soup” of reactive monomers. A stochastic reaction scheduler attempts bond formation based on spatial proximity and the defined SMARTS rules.
Criticality: This is the efficient and valid method for networks, as it naturally captures the kinetic topology and defects formed during synthesis.

(ii) Algorithmic Construction (Pre-defined)

Target: Linear chains, Star polymers, Block copolymers.
Algorithm: Uses a Self-Avoiding Random Walk (SARW).
Mechanism: Chains are grown step-by-step. The step length and excluded volume parameters are derived from a conformational analysis of oligomers generated via SMILES.

(iii) Post-hoc Reconstruction (BFS)

Target: Pre-assembled configurations or user-supplied snapshots.
Algorithm: Uses a Breadth-First Search (BFS) algorithm to traverse the neighbor graph of a bead system. It infers the global connectivity logic by matching local neighbors against the allowed reaction templates.

(iv) Rigid-Body Protocol (Bio-Hybrid)

Target: Structured biomolecules (Proteins, DNA, RNA, Nanoparticles).
Algorithm:
1. Forward Mapping: The high-resolution input (PDB) is discretized into a CG point cloud using Adaptive Voxelization. The grid resolution (\(\sigma_{rb}\)) is dynamically matched to the surrounding polymer matrix to prevent interaction mismatches.
2. Dynamics: The biomolecule moves as a rigid body during pre-equilibration.
3. Backward Mapping: A deterministic Principal Axes Alignment (PAA) maps the original atomic structure back into the equilibrated cavity, strictly preserving secondary structures (\(\alpha\)-helices, \(\beta\)-sheets) and chirality.

4. All-Atom Topology Building

The “Backmapping” process in ChemFAST is a deterministic compilation step divided into topological and geometric reconstruction.

(i) Connectivity Reconstruction (Graph Assembly)

ChemFAST reconstructs the global topology graph by stitching local subgraphs: * Nodes: Replaced by the atomic graph defined in the bead’s SMILES. * Edges: Created by applying the SMARTS transformation rules. * Stereochemistry: Chirality (R/S) and isomerism (cis/trans) defined in the SMILES are explicitly enforced in the graph attributes, ensuring stereochemical fidelity.

(ii) Coordinate Embedding (Geometry)

Converting the graph to 3D coordinates involves a hierarchical optimization: 1. Fragment Generation: High-fidelity 3D conformers for each monomer are generated using the ETKDG algorithm. 2. Analytical Orientation Optimization: To connect fragments \(i\) and \(j\) without bond distortion, the algorithm analytically solves for the optimal rotation matrix \(M\) that minimizes the distance error between connecting atoms:

\[f(M) = \sum || M_i(x_i) - M_j(x_j) ||^2\]

Domain Decomposition: For macroscopic systems, the box is split into spatial chunks (chunks_per_d). Monomers are optimized in parallel within chunks, followed by a boundary relaxation pass.
Soft-Core Relaxation: A stepwise energy minimization with soft-core potentials is applied to resolve local steric clashes (e.g., ring spearing) without breaking the topology.

5. Automated Force Field Assignment

The final stage is the assignment of physical parameters (charges, \(\sigma\), \(\epsilon\), bonded terms). ChemFAST utilizes a Hybrid Strategy:

Database Search (Priority): The engine hashes the chemical environment of each atom (WL-Hash) and queries the local OPLS-AA Database (derived from BOSS/LigParGen). This ensures exact matches use validated parameters.
ML Prediction (Fallback): For novel chemical environments (e.g., conjugated polyimide backbones), a pre-trained Graph Attention Network (GAT) infers the parameters.
- Partial Charges: Predicted via physics-informed regression that enforces global electroneutrality.
- Bonded Types: Predicted via multi-class classification.

This hybrid approach allows ChemFAST to parameterize millions of atoms in minutes with accuracy comparable to quantum mechanical calculations.