Molecule PolyGraphDiscrepancy

MoleculePGD is a PolyGraphDiscrepancy metric based on different molecule descriptors.

By default, we use TabPFN as the binary classifier and evaluate it by data log-likelihood. The resulting PolyGraphDiscrepancy is an estimated lower bound on the Jensen-Shannon distance between the generated and true graph distributions.

import rdkit.Chem
from polygraph.metrics.molecule_pgd import MoleculePGD

smiles_a = [
    "CC(=O)Oc1ccccc1C(=O)O",
    "CC(=O)Nc1ccc(O)cc1",
    "CC(C)Cc1ccc(cc1)C(C)C(=O)O",
    "CC1(C)SC2C(NC(=O)C2=O)C1(C)C(=O)N",
    "C1C(=O)N(C2=CC=CC=C12)C3=CC=C(C=C3)C(F)(F)F",
    "CCCCCCOc1ccc(C(=O)C=Cc2c(C=Cc3ccc(OC)cc3)cc(OC)cc2OC)cc1",
    "O=C(Nc1nc(-c2ccc(Cl)s2)cs1)c1ccncc1",
    "COc1nc(N(C)C)ncc1-n1nc2c(c1C(C)C)C(c1ccc(C#N)c(F)c1)N(c1c[nH]c(=O)c(Cl)c1)C2=O",
]
smiles_b = [
    "CC1=C(C=CC=C1)NC2=NC=CC(=N2)NC3=CC=CC=C3C(=O)NC4=CC=CC=N4",
    "CN1CCN(C2=CC3=C(C=C2)N=CN3C)C4=CC=CC=C14",
    "CN(C)CCCN1C2=CC=CC=C2SC3=CC=CC=C31",
    "CC(C)C(C(=O)NCC(C)C)NC(=O)C1=CC=CC=C1C(C)C(C)NC(=O)C2=CN=CC=C2",
    "CN1C(=O)CN=C(C2=CC=CC=C12)C3=CC=CC=C3Cl",
    "O=C(c1cc(-c2ccc(Cl)cc2Cl)n[nH]1)N1CCCC1",
    "COc1cccc(OC)c1C=CC(=O)NC1CCCCC1",
    "O=C1NC(O)CCN1C1OC(CO)C(O)C1O",
]
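# Parse the SMILES strings into RDKit molecule objects.
mols_a = [rdkit.Chem.MolFromSmiles(smiles) for smiles in smiles_a]
mols_b = [rdkit.Chem.MolFromSmiles(smiles) for smiles in smiles_b]
# Use mols_a as the reference set and compare mols_b against it.
metric = MoleculePGD(mols_a)
print(metric.compute(mols_b))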
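
To illustrate how a classifier's data log-likelihood can yield a lower bound on the Jensen-Shannon distance, the following is a minimal sketch using only the Python standard library. It is not the library's implementation: the function name `js_distance_lower_bound`, the clamping, and the normalization by log 2 (so that the result lies in [0, 1]) are assumptions made for illustration. The inputs are the classifier's predicted probabilities of the "reference" class on held-out samples.

```python
import math


def js_distance_lower_bound(p_ref, p_gen, eps=1e-12):
    """Hypothetical helper: bound the JS distance from classifier outputs.

    p_ref: predicted P(reference) for held-out reference samples.
    p_gen: predicted P(reference) for held-out generated samples.
    """
    # Average log-likelihood of the correct label on each side.
    ll_ref = sum(math.log(max(p, eps)) for p in p_ref) / len(p_ref)
    ll_gen = sum(math.log(max(1.0 - p, eps)) for p in p_gen) / len(p_gen)

    # Variational bound: JSD >= log 2 + (E_ref[log D] + E_gen[log(1 - D)]) / 2.
    js_div = math.log(2.0) + 0.5 * (ll_ref + ll_gen)

    # Clamp to the valid range; a poor classifier can push the estimate below 0.
    js_div = min(max(js_div, 0.0), math.log(2.0))

    # Normalize to base 2 and take the square root to get a distance in [0, 1].
    return math.sqrt(js_div / math.log(2.0))


# A classifier that cannot tell the two sets apart gives a bound near 0,
# while a perfectly confident and correct classifier gives a bound of 1.
uninformative = js_distance_lower_bound([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
perfect = js_distance_lower_bound([1.0, 1.0], [0.0, 0.0])
```

Because any classifier gives a valid bound, a stronger classifier can only tighten the estimate, which is why the result is described as a lower bound.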

MoleculePGD

polygraph.metrics.molecule_pgd.MoleculePGD

Bases: PolyGraphDiscrepancy[Mol]

A PolyGraphDiscrepancy metric for comparing molecule distributions that combines several molecule descriptors.

Parameters:
  • reference_molecules (Collection[Mol]) –

    Reference RDKit molecules.

polygraph.metrics.molecule_pgd.MoleculePGDInterval

Bases: PolyGraphDiscrepancyInterval[Mol]

Uncertainty quantification for MoleculePGD.

Parameters:
  • reference_molecules (Collection[Mol]) –

    Reference RDKit molecules.

  • subsample_size (int) –

    Size of each subsample. It should be consistent with the number of reference and generated molecules passed to MoleculePGD for point estimates.

  • num_samples (int, default: 10) –

    Number of samples to draw for uncertainty quantification.
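
The roles of subsample_size and num_samples can be illustrated with a generic subsampling loop. The sketch below is an assumption about the general technique, not the code behind MoleculePGDInterval: the helper `subsampled_metric`, the toy `mean_gap` metric, sampling without replacement, and the mean/standard-deviation summary are all hypothetical choices made for illustration.

```python
import random
import statistics


def subsampled_metric(reference, generated, metric_fn,
                      subsample_size, num_samples=10, seed=0):
    """Hypothetical helper: evaluate metric_fn on repeated random subsamples."""
    rng = random.Random(seed)
    scores = []
    for _ in range(num_samples):
        # Draw equally sized subsamples (without replacement) from both sets.
        ref_sub = rng.sample(reference, subsample_size)
        gen_sub = rng.sample(generated, subsample_size)
        scores.append(metric_fn(ref_sub, gen_sub))
    # Summarize the spread; stdev needs at least two samples.
    return statistics.mean(scores), statistics.stdev(scores), scores


def mean_gap(a, b):
    """Toy stand-in for a real metric: absolute difference of the means."""
    return abs(statistics.mean(a) - statistics.mean(b))


reference = list(range(100))
generated = list(range(50, 150))
mean_score, std_score, scores = subsampled_metric(
    reference, generated, mean_gap, subsample_size=20, num_samples=10
)
```

Since many metrics depend on sample size, keeping subsample_size equal to the number of molecules used for a point estimate keeps the two quantities comparable.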