======================================================
Wrapped functions
======================================================

The ``domd_tools`` module serves as the high-level "Chemical Compiler" interface for ChemFAST. It encapsulates the complex S-CGFG algorithms, Force Field assignment engines, and I/O exporters into four primary stages: Coarse-Graining, Backmapping, Parameterization, and Exporting.

The main APIs for each stage are:

* **Coarse-Grained System Generation**: ``create_cg_system``
* **CG Parameter Prediction**: ``get_cg_ff``
* **All-Atom Backmapping**: ``build_aa_topology``
* **Force Field Assignment**: ``assign_ff_parameters``
* **System Export**: ``get_gmx``

This document first explains the theoretical logic and API usage for each stage, followed by a complete end-to-end example.

Part 1: The Compilation Pipeline & wrapped functions
================================================

1. Chemical Definition Logic
----------------------------
Before utilizing the API, the system's chemical composition and reaction logic must be defined using standard human-readable descriptors:

* **SMILES**: Defines the atomic connectivity and structure of individual monomers.
* **SMARTS**: Defines the reaction rules for polymerization or crosslinking. Correct atom map indices within SMARTS (e.g., ``[#7H2:1]``) are crucial for defining regio-specificity (e.g., ensuring Head-to-Tail connectivity).

2. Coarse-Grained System Generation
-----------------------------------
To generate a relaxed, pre-equilibrated topology, ChemFAST maps the SMILES/SMARTS into a Coarse-Grained (CG) representation. The force field parameters for these CG beads are derived automatically using the **Hansen Solubility Parameter (HSP)** predictor.

**Wrapped function: ``create_cg_system``**

.. code-block:: python

    create_cg_system(molecules, reaction_template, cg_config, rc=1.0, rho=0.85, boxl=None, program='hoomd', filename='chemfast_cg')

* **molecules** (*dict*): Dictionary defining monomers (SMILES).
* **reaction_template** (*dict*): SMARTS rules defining how monomers connect.
* **cg_config** (*list[dict]*): The structural blueprint (e.g., number of chains, topology types, rigidity).
* **rc** (*float*, default: ``1.0``): Cutoff radius for initial geometric embedding.
* **rho** (*float*, default: ``0.85``): Target number density. ChemFAST calculates the box size automatically based on this density and the total number of generated beads.
* **boxl** (*float*, default: ``None``): Explicit box length. If provided, overrides ``rho``.
* **program** (*str*, default: ``'hoomd'``): Target engine for the CG output XML.
* **filename** (*str*, default: ``'chemfast_cg'``): Generates output files like ``out_chemfast_cg.xml``.

.. note::
    **Standalone Parameter Prediction:**
    If you only want to predict the CG force field parameters without building a full simulation box, you can bypass ``create_cg_system`` and use the HSP predictor directly. This is extremely useful for rapidly deriving parameters for novel monomers.

    *Example:* Call ``get_cg_ff(molecules, reaction_template, iterations=3)`` from ``domd_cgbuilder.cg_ff`` to extract bonded and non-bonded parameters.

    .. code-block:: python

        import json
        from domd_cgbuilder.cg_ff import get_cg_ff

        def predict_cg_params(molecules, reaction_template, iterations=3):
            graphs, prod_mols, stats, all_bonded, pairs, rigid_meta = get_cg_ff(molecules,reaction_template,iterations=iterations, create_epsilon=True)

            # Extract and format parameters
            nonBondPara = {k:stats[k] for k in stats if '-' not in k}
            BondPara = {}
            AngPara = {}
            for k in stats:
                if '-' in k:
                    if len(k.split('-')) == 2:
                        BondPara[k] = stats[k]
                    if len(k.split('-')) == 3:
                        AngPara[k] = stats[k]
            for k in pairs:
                stats[str(k)] = pairs[k]

            # Export to JSON
            json_str = json.dumps(stats, indent=4)
            with open('cg_parameters.txt', 'w', encoding='utf-8') as f:
                f.write(json_str)
            return

        # Execute prediction
        mols = {
            'A': {'smiles': 'Nc1ccc(Oc2ccc(N)cc2)cc1', 'file': None},
            'B': {'smiles': 'O1C(=O)c2cc(Oc3cc4C(=O)OC(=O)c4cc3)ccc2C1=O', 'file': None},
        }
        reaction_template = {
            'B-A': {
                'cg_reactant_list': [('A', 'B')],
                'smarts': '[#7H2:1].[#6:3](=[#8:4])[#8:2][#6:5]=[#8:6]>>[#6:3](=[#8:4])[#7:1][#6:5]=[#8:6].[#8:2]',
                'prod_idx': [0]
            }
        }
        predict_cg_params(mols, reaction_template)
        # Outputs 'cg_parameters.txt' containing the HSP-derived bonded and non-bonded parameters.

3. All-Atom Backmapping (CG to AA)
----------------------------------
This stage reconstructs the all-atom details from an equilibrated CG trajectory using geometric embedding algorithms and chemical SMARTS rules.

**Wrapped function: ``build_aa_topology``**

.. code-block:: python

    build_aa_topology(mols, reaction_template, xml, reactions=[], large=500, chunks_per_d=1, parallel=False)

* **mols** & **reaction_template**: The atomic structures and reaction rules that will replace the CG beads.
* **xml** (*str*): Path to the relaxed CG configuration file (e.g., ``final.xml``).
* **reactions** (*list*, default: ``[]``): Specific reaction tuples. If left empty, ChemFAST automatically infers connectivity directly from the bond section of the XML.
* **large** (*int*, default: ``500``): Atom count threshold. If a polymer network exceeds this number of atoms, ChemFAST switches to a grid-based embedding strategy to ensure stability and avoid spatial overlaps.
* **chunks_per_d** (*int*, default: ``1``): Grid subdivision parameter for embedding large molecules.
* **parallel** (*bool*, default: ``False``): If set to ``True``, utilizes multiprocessing to generate AA coordinates in parallel. Highly recommended for systems with thousands of chains.

.. note::
    **Advanced Usage: Standalone Topology Building and Backmapping**

    Under the hood, the ``build_aa_topology`` function executes a two-phase pipeline: first, it logically constructs the full all-atom connectivity (Topology Building), and second, it folds these structures into the 3D space defined by the CG beads (Backmapping).

    If you require granular control—for instance, if you only want to generate the 2D polymer structures without 3D coordinates, or if you want to custom-embed a single specific polymer chain for testing—you can decouple these steps and call the underlying modules directly.

    **Phase 1: Standalone Topology Building (Constructing the 2D Graph)**
    You can use the ``Reactor`` class to chemically link the monomers based on the CG connectivity matrix, bypassing any 3D embedding. This is highly useful for generating massive libraries of polymer SMILES or topological graphs.

    .. code-block:: python

        from domd_topology.reactor import Reactor
        from misc.cg_system import read_cg_topology
        from misc.io.xml_reader import XmlParser
        from domd_topology.functions import set_molecule_id_for_h
        from rdkit import Chem
        from misc.logger import logger
        logger.setLevel('INFO')

        reaction_template = ...
        mols = ...

        # 1. Parse the CG simulation output
        xml = XmlParser('final.xml')
        cg_sys, cg_mols = read_cg_topology(xml, mols)
        print(cg_mols[0].edges(data=True))

        # 2. Extract bond connections to act as reaction triggers
        reactions = [(bond[0], bond[1], bond[2]) for bond in xml.data['bond']]

        # 3. Initialize the Reactor and synthesize the 2D all-atom structures
        reactor = Reactor(mols, reaction_template)
        aa_mols, meta = reactor.process(cg_mols, reactions)

        # 4. Standardize the RDKit molecules (add hydrogens for full valency)
        [Chem.SanitizeMol(_) for _ in aa_mols]
        aa_mols_h = [Chem.AddHs(m) for m in aa_mols]
        [set_molecule_id_for_h(m) for m in aa_mols_h]

        # Output: aa_mols_h now contains the fully connected 2D polymer structures!

    **Phase 2: Standalone Backmapping (Embedding into 3D Space)**
    Once you have the synthesized 2D RDKit molecule (``aa_mol_h``) and its corresponding 3D CG scaffold (``cg_mol``), you can embed the all-atom structure into the simulation box independently using the ``embed_molecule`` function.

    .. code-block:: python

        from domd_xyz.embed_molecule import embed_molecule
        import numpy as np

        # Define the simulation box boundaries
        # the box information can be extracted from the XML or defined based on your system's requirements
        # box = np.array([xml.box.lx, xml.box.ly, xml.box.lz]).astype(float) *10  # Extracted from XML, usually in nm, convert to Angstroms
        box = np.array([100.0, 100.0, 100.0]) # Angstroms, adjust as needed based on your CG system

        # Select a specific chain (e.g., the first molecule in the system)
        target_aa_mol = aa_mols_h[0]
        target_cg_mol = cg_mols[0]

        # Embed the 2D molecule into the 3D CG beads
        # 'large' triggers grid-decomposition for massive chains; 'chunk_per_d' defines grid resolution
        conf = embed_molecule(target_aa_mol, target_cg_mol, box=box, large=500, chunk_per_d=1)

        # Extract the generated 3D coordinates
        positions = conf.GetPositions()

        # Output: 'positions' is a NumPy array containing the exact 3D coordinates of the embedded atoms.


4. Force Field Assignment
-------------------------
The assignment engine processes the reconstructed all-atom graphs to allocate charges, bonds, angles, and dihedrals.

**Wrapped function: ``assign_ff_parameters``**

.. code-block:: python

    assign_ff_parameters(molecules, strategies, custom_rules=[], parallel=False)

* **molecules** (*list*): The NetworkX graphs returned by the backmapping step.
* **strategies** (*list[str]*): Specifies the parameterization strategy for each molecule.
  Options include:

  * ``'hybrid'`` *(default)*: A robust method utilizing the BOSS database, GMX template rules, and ML models. **The assignment logic follows a strict ML -> GMX -> DB priority sequence.** If a parameter is predicted/found in a higher priority method, it safely overwrites the lower one.
  * ``'ml'``: Pure Graph Attention Network (GAT) ML model.
  * ``'db'``: Pure BOSS database lookup.
  * ``'tpl'``: Pure GMX rules.
* **custom_rules** (*list[str]*, default: ``['none']``): Provides fine-grained control over interaction parameterization. You can pass specific rule flags to override the default assignment logic:

  * ``'none'`` *(default)*: Applies the standard strategy logic without custom overrides.
  * ``'all'``: Forces *all* parameters (both bonded and non-bonded) to be generated by the ML model, regardless of the overarching strategy chosen.
  * ``'bonded'``: Forces only the *bonded* parameters (bonds, angles, dihedrals) to be generated by the ML model, leaving non-bonded parameters to the standard hierarchy.
* **parallel** (*bool*, default: ``False``): Distributes parameter assignment across CPU cores.

5. System Export
----------------
Assembles the global geometry, connectivity, and force field data into formats ready for MD engines.

**Wrapped function: ``get_gmx``**

.. code-block:: python

    get_gmx(molecules, confs, ffs, box, filename='chemfast', crossbk_scale=1.0, crossak_scale=1.0, charge_scale=1.0)

* **molecules**, **confs**, **ffs**, **box**: The accumulated outputs from all previous steps (topology graphs, coordinates, parameters, and box dimensions).
* **filename** (*str*, default: ``'chemfast'``): Output prefix (generates ``.gro``, ``.top``, and ``.xml``).
* **crossbk_scale**, **crossak_scale**, **charge_scale** (*float*, default: ``1.0``): Scaling factors for bonds, angles, and charges that span across different CG beads.
.. note::
    This scaling protocol (typically set to < 1.0) is specifically designed to preserve local stereochemistry (such as chirality and *cis/trans* isomerism) during Energy Minimization (EM). By temporarily softening these boundary interactions, it prevents the system from experiencing unphysically high energy spikes that could trigger impossible conformational inversions. The `charge_scale` factor is used with polyelectrolytes, usually `charge_scale=0.5-0.8`.


Part 2: End-to-End Example (Polyimide)
==============================================

To demonstrate how these APIs are chained together in a real-world scenario, we walk through the ``polyimides.py`` example script. This script utilizes the compiler to build a cross-linked polyimide network.

The Complete Script
-------------------

Here is the complete code for the end-to-end workflow. You can copy and run this directly if you have the required pre-equilibrated XML file.

.. code-block:: python

    from domd_tools import create_cg_system, build_aa_topology, assign_ff_parameters, get_gmx
    from misc.logger import logger

    logger.setLevel('INFO')

    # 1. Chemical Definition
    smiA = 'Nc1ccc(Oc2ccc(N)cc2)cc1'
    smiB = 'O1C(=O)c2cc(Oc3cc4C(=O)OC(=O)c4cc3)ccc2C1=O'

    mols = {
        'A': {'smiles': smiA, 'file': None},
        'B': {'smiles': smiB, 'file': None},
    }

    reaction_template = {
        'B-A': {
            'cg_reactant_list': [('A', 'B')],
            'smarts': '[#7H2:1].[#6:3](=[#8:4])[#8:2][#6:5]=[#8:6]>>[#6:3](=[#8:4])[#7:1][#6:5]=[#8:6].[#8:2]',
            'prod_idx': [0]
        }
    }

    # 2. CG Pre-equilibrium (Optional if pre-equilibrated XML is provided)
    """
    meta1 = {'type': ['A','B']*10, 'bond': '0-1,1-2...', 'N': 20, 'is_rigid': False, 'is_angle': True}
    meta2 = {'type': ['A','B']*5, 'bond': '0-1,1-2...', 'N': 20, 'is_rigid': False, 'is_angle': True}
    create_cg_system(mols, reaction_template, [meta1, meta2])
    """

    # We use a pre-equilibrated CG configuration for this example
    xmlfile = 'CG_pre_eq/final.xml'

    # 3. AA Backmapping
    confs, mol_graphs, box = build_aa_topology(mols, reaction_template, xmlfile)

    # 4. Force Field Assignment
    ffs = assign_ff_parameters(mol_graphs, strategies=['ml'] * len(mol_graphs))

    # 5. GROMACS Output
    get_gmx(mol_graphs, confs, ffs, box, filename='chemfast')

Step-by-Step Breakdown
----------------------

Let's break down what is happening at each stage of the script above.

**Step 1: Chemical Definition**
We start by feeding the script the SMILES of our two constituent monomers (A and B). The ``reaction_template`` uses SMARTS to tell the compiler exactly how an amine group on monomer A reacts with an anhydride group on monomer B.

**Step 2: CG Pre-equilibrium**
If starting from scratch, we would define topological blueprints (``meta1``, ``meta2``) and call ``create_cg_system`` to generate a random walk box and predicted HSP parameters. To save computational time in this tutorial, we bypass this and directly load a pre-equilibrated trajectory (``CG_pre_eq/final.xml``).

**Step 3: All-Atom Backmapping**
The ``build_aa_topology()`` function parses the ``final.xml`` box. It reads the coarse-grained positions, applies our SMARTS reaction rules to link the monomers, and embeds the all-atom geometries into the 3D space, returning the coordinates (``confs``) and the topology maps (``mol_graphs``).

**Step 4: Force Field Assignment**
We pass our reconstructed topology graphs to ``assign_ff_parameters()``. By providing the list ``['ml'] * len(mol_graphs)``, we are explicitly instructing the compiler to use the pure Graph Attention Network (GAT) ML model to predict and assign force field parameters for every single molecule in the system.

**Step 5: System Export**
Finally, ``get_gmx()`` takes all the gathered data—the connectivity graphs, the 3D coordinates, the assigned force fields, and the box dimensions—and translates them into formats ready for MD engines.

**Outputs Generated:**

* ``chemfast.gro``: The equilibrated all-atom coordinates.
* ``chemfast.top``: The complete system topology (atom types, charges, interactions).
* ``out_chemfast.xml``: Topology and coordinates formatted for PyGAMD/GALAMOST.