Chemistry – Compound Lookups

Several Python packages exist that simplify the process of looking compounds and their chemical structure.

This means that we can look-up compounds by name or structure and create a generative sequence of steps that allows us obtain the structure of a compound, and various properties of it, simply by starting with its common name.

With a representation of the structure to hand, we can then visualise its structure as demonstrated in a later section.

This allows us in part to create one-piece generative document workflows where information rich outputs, as well as rich media outputs, can be generated from (potentially hidden) generative scripts contained within the source document.

For creators of educational materials, this means we can create narratives where a compound is identified by name, a lookup is made, and informational descriptions can then be generated from the looked up object.

This means that the materials will be factually correct as the informational properties will be rendered directly from the retrieved object.

Warning

One of the risks associated with generative documents is that if for some reason the generative code cannot be run or fails to execute as intended, the rendered document will be incomplete at best, meaningless at worst.

It is therefore recommended that generative documents should be saved in an archived form with all cells run.

Using a “loop back” generative method, where assets are created, saved to disk, and then loaded back in is another approach that can be used to generate assets that can be persisted for local use by the generating document, or even reused elsewhere.

Through being able to look up properties of materials, we can also generate factual statements about them or generate quizzes or opportunities for self-test activities around them.

pubchempy

The pubchempy package provides a Python interface to PubChem, a freely accessible chemical information lookup service. PubChem allows for the searching of chemicals by name and lookups of molecular formulas and structures using conventional representations such as a SMILES representation.

SMILES

SMILES (Simplified molecular-input line-entry system) is a simple text format for desctibing molecular structure.

Related: [SMILES tutorial](https://www.epa.gov/sites/production/files/2015-05/documents/appendf.pdf(

Packages like pubchempy can be used in conjunction with visualisers such as py3Dmol to look up a molecule by name (using pubchempy) and then display its structure as an interactive 3D model (using py3Dmol).

Import the base package:

import pubchempy as pcp

We can look up a chemical by name, select the first result, and then generate the canonical or isomeric SMILES representation:

ethanol = pcp.get_compounds('ethanol', 'name')[0]

ethanol.canonical_smiles, ethanol.isomeric_smiles
('CCO', 'CCO')

If we create a reference to the .iupac_name of the compound, we can use that in a markdown field.

from myst_nb import glue

glue("g_ethanol", ethanol.iupac_name, display=False)

Such a reference is embedded in this piece of text: you looked up the compound ethanol.

We can also look up compounds by formula.

Note

If required, responses from look-ups can be returned as pandas dataframes.

nitric_acid_df = pcp.get_compounds('HNO3', 'formula', as_dataframe=True)[:3]

nitric_acid_df[['iupac_name', 'molecular_formula', 'canonical_smiles', 'inchi']]
iupac_name molecular_formula canonical_smiles inchi
cid
944 nitric acid HNO3 [N+](=O)(O)[O-] InChI=1S/HNO3/c2-1(3)4/h(H,2,3,4)
123349 hydroxy nitrite HNO3 N(=O)OO InChI=1S/HNO3/c2-1-4-3/h3H
12025424 deuterio nitrate HNO3 [N+](=O)(O)[O-] InChI=1S/HNO3/c2-1(3)4/h(H,2,3,4)/i/hD

Round tripping of looking up a compound by name, obtaining its SMILES representation, looking the compound by SMILES representation, and then displaying the name of the final object retrieved, is also possible and can be used as an integrity check:

#Look up by SMILES string
compound = pcp.get_compounds('CCO', 'smiles')[0]
compound.iupac_name
'ethanol'

We can also use other chemical reference codes for the lookup, such as an InChI identifier:

InChI

InChI, the IUPAC (International Chemical Identifier) is a standardised textual identifier for chemical substances.

The InChIKey reference scheme is a condensed 27 character hashed version of the full InChI identifier string.

pcp.get_compounds("InChI=1S/HNO3/c2-1(3)4/h(H,2,3,4)","inchi",
                  as_dataframe=True).iloc[0]['iupac_name']
'nitric acid'

The retrieved object stores a wide range of useful informational properties associated with the compound.

For example, looking up the chemical formula (this is described as the molecular formula although I think it’s actually the empirical formula?):

from IPython.display import Latex

_compound_latex = '$\ce{'+compound.molecular_formula+'}$' 

# Save a reference to the Latex equivalent of the compound name and formula
glue("compoundName", compound.iupac_name, display=False)
glue("compoundLatex", Latex(_compound_latex), display=False)

compound.molecular_formula
'C2H6O'

A simple, literally stated formula can be rendered using MathJax if the mhchem package is imported.

\[\require{mhchem}\]

Note

Support for rendering chemical notation using inline LaTeX markup is enabled by including the following installation requirement into a markdown cell: $$\require{mhchem}$$.

Formulas can then be rendered using the an inline LaTeX invocation of the form $\ce{C2H6O}$ to give output of the form \(\ce{C2H6O}\) .

Embedding the glued reference for the equation of ethanol gives us:

()\[\ce{C2H6O}\]

We can also lookup various properties of the compound and use that as the basis of a custom display outputter for the pubchempy.Compound object.

Note

Remember, we always collapse or completely remove generator code elements included in the original source document from the rendered text.

# Create a function to display compound properties
def pcp_compound_properties(compound):
    """Display various compound properties."""
    properties = []
    properties.append(f'Name: <strong>{compound.iupac_name}</strong>')
    properties.append(f'Formula: <strong>{compound.molecular_formula}</strong>')
    properties.append(f'Common synonyms: <strong>{", ".join(compound.synonyms[:3])}</strong>')
    properties.append(f'SMILES: <strong>{compound.canonical_smiles}</strong>')
    properties.append(f'cid:  <strong>{compound.cid}</strong>')
    properties.append(f'inchi:  <strong>{compound.inchi}</strong>')
    properties.append(f'Molecular weight: <strong>{compound.molecular_weight}</strong>')

    return '<br/>'.join(properties)

# Get a reference to the Compound class
from pubchempy import Compound

# Create a custom formatter for the class
# https://ipython.readthedocs.io/en/stable/config/integrating.html 
html_formatter = get_ipython().display_formatter.formatters['text/html']
html_formatter.for_type(Compound, pcp_compound_properties);

That code can easily be added to a package or included and hidden within source document.

Now if we call the compound we get a prettified, custom output for it:

compound
Name: ethanol
Formula: C2H6O
Common synonyms: ethanol, ethyl alcohol, alcohol
SMILES: CCO
cid: 702
inchi: InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3
Molecular weight: 46.07

chembl_webresource_client

The chembl_webresource_client package provides a Python interface to ChEMBL, a “manually curated chemical database of bioactive molecules with drug-like properties maintained by the European Bioinformatics Institute (EBI), of the European Molecular Biology Laboratory (EMBL)” to support the lookup of bioactive chemical compounds by common name.

Import the package and create a reference for a molecule lookup:

from chembl_webresource_client.new_client import new_client

molecule = new_client.molecule

From this client, we can look up bioactive compounds in a variety of ways, including by popular name:

aspirin = molecule.search('aspirin')

for r in aspirin:
    pref_name = r['pref_name']
    if pref_name is not None:
        print(pref_name)
ASPIRIN
ASPIRIN DL-LYSINE
ASPIRIN EUGENOL ESTER

We can also look-up syonyms of the compound, which might include trade names:

aspirin = [r for r in aspirin if r['pref_name']=='ASPIRIN'][0]
aspirin['molecule_synonyms'][:5]
[{'molecule_synonym': '8-hour bayer',
  'syn_type': 'TRADE_NAME',
  'synonyms': '8-HOUR BAYER'},
 {'molecule_synonym': 'Acetosalic Acid',
  'syn_type': 'TRADE_NAME',
  'synonyms': 'Acetosalic Acid'},
 {'molecule_synonym': 'Acetylsalic acid',
  'syn_type': 'TRADE_NAME',
  'synonyms': 'ACETYLSALIC ACID'},
 {'molecule_synonym': 'Acetylsalicylic Acid',
  'syn_type': 'INN',
  'synonyms': 'Acetylsalicylic Acid'},
 {'molecule_synonym': 'Acetylsalicylic Acid',
  'syn_type': 'TRADE_NAME',
  'synonyms': 'Acetylsalicylic Acid'}]

As before, we can generae an _repr_ method on the object to provide us with a summary output from the object in a form we desire:

from IPython.display import HTML

# Create a function to display compound properties
def chembl_wrc_molecule_properties(molecule):
    """Display various compound properties."""
    properties = []
    _wikipedia_name =  [c['xref_id'] for c in molecule['cross_references'] if c["xref_src"]=='Wikipedia']
    _wikipedia_name = '' if not _wikipedia_name else _wikipedia_name[0]
    properties.append(f'Name: <strong>{molecule["pref_name"]}</strong>')
    properties.append(f'Common Wikipedia (common) name: <strong>{_wikipedia_name}</strong>')
    properties.append(f'Formula: <strong>{molecule["molecule_properties"]["full_molformula"]}</strong>')
    properties.append(f'Species: <strong>{molecule["molecule_properties"]["molecular_species"]}</strong>')
    properties.append(f'SMILES: <strong>{molecule["molecule_structures"]["canonical_smiles"]}</strong>')
    properties.append(f'inchi:  <strong>{molecule["molecule_structures"]["standard_inchi"]}</strong>')
    properties.append(f'CHEMBL id:  <strong>{molecule["molecule_chembl_id"]}</strong>')
    properties.append(f'Therapeutic flag: <strong>{molecule["therapeutic_flag"]}</strong>')
    properties.append(f'First approval: <strong>{molecule["first_approval"]}</strong>')

    return HTML('<br/>'.join(properties))

If we call the object, we get a summary review of it:

chembl_wrc_molecule_properties(aspirin)
Name: ASPIRIN
Common Wikipedia (common) name: Aspirin
Formula: C9H8O4
Species: ACID
SMILES: CC(=O)Oc1ccccc1C(=O)O
inchi: InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)
CHEMBL id: CHEMBL25
Therapeutic flag: True
First approval: 1950

As well as looking up compounds by name, we can lookup a molecules more specifically from a SMILES string:

from chembl_webresource_client.new_client import new_client

molecule = new_client.molecule
ethanol_chembl = molecule.get(ethanol.canonical_smiles)

chembl_wrc_molecule_properties(ethanol_chembl)
Name: ALCOHOL
Common Wikipedia (common) name: Ethanol
Formula: C2H6O
Species: NEUTRAL
SMILES: CCO
inchi: InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3
CHEMBL id: CHEMBL545
Therapeutic flag: False
First approval: None

We can also start to visualise the structure of the compound:

from IPython.display import SVG
from chembl_webresource_client.utils import utils

chembl_smiles = ethanol_chembl["molecule_structures"]["canonical_smiles"]

SVG(utils.smiles2svg(chembl_smiles))
../_images/compound_lookups_38_0.svg

pypdb

The pypdb package provides simple Python API for performing searches with the RCSB (Research Collaboratory for Structural Bioinformatics) Protein Data Bank (PDB) [about].

import pypdb

# Broken?
#pypdb.Query('aspirin').search()