SMILES notation
===============
This introduction to SMILES notation is based on a more detailed
description in the following paper,
"SMILES, a chemical language and information system", D. Weininger, Journal
of Chemical Information and Computer Sciences, 28 (1988) pp31-36.
SMILES (Simple Molecular Input Line Entry System) notation allows the
two-dimensional graph of a molecule (and certain aspects of its
three-dimensional structure) to be written as a concise, one-dimensional
string of characters. This allows computers to store large numbers of chemical
structures in a small space and also enables them to be processed extremely
quickly. The beauty of SMILES, as compared to other encoding systems you
could devise, is that humans can also quite easily look at a SMILES string
and determine what molecule it represents and, conversely, easily construct
a SMILES string that represents a given structure.
A SMILES string is a sequence of characters terminated by white space.
Hydrogens may be omitted or included. Aromatic structures can be
specified directly or in their Kekulé form.
Atoms
=====
Atoms are represented by their atomic symbols. Atoms in the "organic
subset" (B, C, N, O, P, S, F, Cl, Br, and I) may be written as they stand;
all other atoms are written enclosed in square brackets to separate them
from the next atom. The presence of enough hydrogen atoms to fill up any
unused bonds to an atom is implied unless the atom symbol is enclosed in
square brackets, in which case the number of attached hydrogens is stated
explicitly. Charges on an atom, if present, may also be specified in the
square brackets.
For example
  SMILES              Molecule
  C                   methane
  N                   ammonia
  O                   water
  [Au]                elemental gold
  [OH-]               hydroxyl anion
  [OH3+]              hydronium cation
  [Fe+2] or [Fe++]    iron (II) cation
  [NH4+]              ammonium cation
Atoms in aromatic rings are specified by lower case letters, eg 'c' for
an aromatic carbon atom.
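
The atom rules above can be sketched in a few lines of Python. This is only
an illustration: the names atom_tokens and parse_bracket_atom and the two
regular expressions are assumptions made for this example and come neither
from the SMILES paper nor from MoleDraw. It splits a string into atom symbols
and reads the hydrogen count and charge from a bracketed atom.

```python
import re

# Bracketed atoms first, then two-letter organic-subset symbols (so 'Cl' is
# not read as 'C' followed by 'l'), then one-letter and aromatic symbols.
ATOM_RE = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")

# Inside brackets: symbol, optional '@'/'@@', optional H count, optional charge.
BRACKET_RE = re.compile(
    r"\[([A-Z][a-z]?|[a-z]{1,2})(@{0,2})(?:H(\d*))?(\++\d*|-+\d*)?\]"
)


def atom_tokens(smiles):
    """Return the atom symbols of a SMILES string, in written order."""
    return ATOM_RE.findall(smiles)


def parse_charge(text):
    """Turn '+', '++', '+2', '-', ... into a signed integer."""
    if not text:
        return 0
    sign = 1 if text[0] == "+" else -1
    digits = text.lstrip("+-")
    return sign * (int(digits) if digits else len(text))


def parse_bracket_atom(token):
    """Return (symbol, hydrogen_count, charge) for a token such as '[NH4+]'."""
    match = BRACKET_RE.fullmatch(token)
    if match is None:
        raise ValueError("not a bracketed atom: " + token)
    symbol, _chirality, hydrogens, charge = match.groups()
    if hydrogens is None:
        h_count = 0              # no 'H' written inside the brackets
    elif hydrogens == "":
        h_count = 1              # a bare 'H' means one hydrogen
    else:
        h_count = int(hydrogens)
    return symbol, h_count, parse_charge(charge)


print(atom_tokens("CCN(CC)CC"))
print(parse_bracket_atom("[OH3+]"))
print(parse_bracket_atom("[Fe++]"), parse_bracket_atom("[Fe+2]"))
```

Both ways of writing the iron (II) cation give the same charge, as the table
above suggests they should.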
Bonds
=====
Single, double, triple, and aromatic bonds are represented by the symbols
'-', '=', '#', and ':' respectively. Single and aromatic bond symbols may
be, and usually are, omitted.
Examples
  CC                  ethane
  C=C                 ethene
  CCO                 ethanol
  C#N                 hydrogen cyanide
  [H][H]              molecular hydrogen
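
To show how the bond symbols and the implied hydrogens fit together, here is
a minimal Python sketch for unbranched, acyclic chains of organic-subset
atoms only. The function name implicit_hydrogens is made up for this example,
and the VALENCES table holds the usual normal valences of these elements,
which is an assumption added here (the text above does not list them).

```python
import re

BOND_ORDER = {"-": 1, "=": 2, "#": 3}

# Assumed normal valences; the lowest one that fits the bonds is used.
VALENCES = {"B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
            "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,)}

TOKEN_RE = re.compile(r"Cl|Br|[BCNOPSFI]|[-=#]")


def implicit_hydrogens(chain):
    """Return [(symbol, implied_H), ...] for an unbranched, acyclic chain."""
    atoms = []     # atom symbols in written order
    used = []      # sum of bond orders already used at each atom
    pending = 1    # order of the bond to the next atom (single if omitted)
    for token in TOKEN_RE.findall(chain):
        if token in BOND_ORDER:
            pending = BOND_ORDER[token]
            continue
        if atoms:
            used[-1] += pending      # bond from the previous atom ...
            used.append(pending)     # ... to this one
        else:
            used.append(0)
        atoms.append(token)
        pending = 1
    result = []
    for symbol, bonds in zip(atoms, used):
        valence = next(v for v in VALENCES[symbol] if v >= bonds)
        result.append((symbol, valence - bonds))
    return result


print(implicit_hydrogens("C#N"))   # hydrogen cyanide
print(implicit_hydrogens("CCO"))   # ethanol
```

For C#N the carbon has one bond left over and so carries one hydrogen, while
the nitrogen has none, which is hydrogen cyanide as listed above.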
Branches
========
Branches off the main chain are enclosed in parentheses.
For example (these and the following more complicated structures are
drawn out in the file 'SMILESegs')
  CCN(CC)CC           triethylamine
  CC(=O)O             ethanoic acid
Branches may be nested, for example
C=CC(CCC)C(C(C)C)CCC
is a perfectly valid SMILES string.
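
One common way to read branches is with a stack: an opening parenthesis
remembers the atom the branch hangs from, and a closing parenthesis goes back
to it. The Python sketch below works under that assumption and handles
acyclic strings only (no ring closure numbers); branch_bonds is a name chosen
for this example.

```python
import re

TOKEN_RE = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[-=#()]")
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def branch_bonds(smiles):
    """Return (atoms, bonds) where bonds are (from_index, to_index, order)."""
    atoms, bonds = [], []
    stack = []      # atoms to return to when a branch closes
    prev = None     # index of the atom the next atom will be bonded to
    order = 1       # order of the next bond (single if no symbol is given)
    for token in TOKEN_RE.findall(smiles):
        if token == "(":
            stack.append(prev)          # remember the branch point
        elif token == ")":
            prev = stack.pop()          # back to the branch point
        elif token in BOND_ORDER:
            order = BOND_ORDER[token]
        else:
            atoms.append(token)
            current = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, current, order))
            prev = current
            order = 1
    return atoms, bonds


atoms, bonds = branch_bonds("CC(=O)O")
print(atoms)
print(bonds)
```

In ethanoic acid both oxygens end up bonded to the second carbon (index 1),
one of them by a double bond. The nested example above parses the same way,
giving 13 atoms joined by 12 bonds.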
Cyclic structures
=================
Rings are first converted to linear structures by breaking a single (or
aromatic) bond. The SMILES for the resulting linear structure is then
written as normal except that a ring closure number is added after each of
the two atoms that had the bond between them broken.
For example (remembering that lower case atom symbols imply aromaticity)
  C1CCCCC1            cyclohexane
  c1ccccc1            benzene
  C1C=CC=C1           cyclopentadiene
  Oc1ccccc1           phenol
  Brc1cc(Br)cc(Br)c1  1,3,5-tribromobenzene
There may be more than one way of writing the structure as a SMILES. For
example 1-methyl-3-bromo-cyclohexene may be written as
CC1=CC(Br)CCC1
or as
CC1=CC(CCC1)Br
An individual atom may be involved in closing more than one ring, as in
cubane for example. In this case all the ring closure numbers associated
with the atom are written after it.
So cubane may be written as
C12C3C4C1C5C4C3C25
Ring closure digits may be reused once the ring they mark has been closed.
More than 9 may still be needed, however; in this case (i.e. for ring
closure numbers of 10 or greater) the two digits are preceded by a '%'
symbol.
To illustrate both these things
C%12CCCCC%12N=NC%12CCCCC%12
represents two cyclohexane rings joined by a two nitrogen atom linker.
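
The pairing of ring closure numbers can be sketched as below. This is an
illustrative Python fragment, not taken from the paper: ring_closures is a
made-up name, and the code only reports which atoms (counted from 0 in
written order) are joined by ring closure bonds, ignoring everything else.

```python
import re

# Bracketed atoms are matched whole, so digits inside them (as in [NH4+])
# are never mistaken for ring closure numbers.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]|%\d\d|\d")


def ring_closures(smiles):
    """Return a list of (atom_index, atom_index) pairs joined by ring bonds."""
    open_rings = {}    # ring closure number -> atom that opened it
    pairs = []
    atom_index = -1
    for token in TOKEN_RE.findall(smiles):
        if token[0] == "%" or token.isdigit():
            number = int(token.lstrip("%"))
            if number in open_rings:
                # Second appearance: close the ring and free the number.
                pairs.append((open_rings.pop(number), atom_index))
            else:
                open_rings[number] = atom_index
        else:
            atom_index += 1
    if open_rings:
        raise ValueError("unclosed ring numbers: %s" % sorted(open_rings))
    return pairs


print(ring_closures("c1ccccc1"))
print(ring_closures("C12C3C4C1C5C4C3C25"))
print(ring_closures("C%12CCCCC%12N=NC%12CCCCC%12"))
```

Cubane gives five ring closure bonds; together with the seven bonds along the
chain of eight carbons that makes the twelve edges of the cube. In the last
string the number 12 is used twice, once for each cyclohexane ring.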
Disconnected structures
=======================
Disconnected structures are written as individual SMILES separated by a
full stop. For example sodium phenoxide can be written as
[Na+].[O-]c1ccccc1
or even
c1cc([O-].[Na+])ccc1
Note, however, that no association of ions is implied by the order in
which disconnected structures appear in the SMILES.
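
Because a full stop may appear inside a branch, simply splitting the string at
each full stop is not enough to find the separate structures. The Python
sketch below instead joins bonded atoms with a small union-find and counts
the groups that remain. The name count_components and the approach are
assumptions made for this example; bond symbols are ignored since only
connectivity matters here.

```python
import re

TOKEN_RE = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]|%\d\d|\d|[().]")


def count_components(smiles):
    """Count the disconnected structures in a SMILES string."""
    parent = []    # union-find forest over atom indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    stack = []         # branch points
    open_rings = {}    # ring closure number -> atom that opened it
    prev = None        # atom the next atom will be bonded to
    for token in TOKEN_RE.findall(smiles):
        if token == "(":
            stack.append(prev)
        elif token == ")":
            prev = stack.pop()
        elif token == ".":
            prev = None            # the next atom is not bonded to this one
        elif token[0] == "%" or token.isdigit():
            number = int(token.lstrip("%"))
            if number in open_rings:
                union(open_rings.pop(number), prev)
            else:
                open_rings[number] = prev
        else:
            parent.append(len(parent))
            current = len(parent) - 1
            if prev is not None:
                union(prev, current)
            prev = current
    return len({find(i) for i in range(len(parent))})


print(count_components("[Na+].[O-]c1ccccc1"))
print(count_components("c1cc([O-].[Na+])ccc1"))
```

Both ways of writing sodium phenoxide shown above give two structures.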
Isomerism
=========
The stereochemistry at chiral centres can be specified in SMILES. The
chiral atom should be enclosed in square brackets with either one or two '@'
symbols following it. One '@' implies that the branches that follow it in
the SMILES string occur in an anticlockwise arrangement. Two '@' symbols
mean the branches occur in a clockwise arrangement. This is undoubtedly
totally unclear, so here is an example
OC(=O)[C@@]([H])(N)Cc1ccc(O)cc1 L-Tyrosine
here the [H], N and Cc1ccc(O)cc1 are arranged clockwise when viewed along
the bond from the carboxyl group, OC(=O), to the chiral carbon atom. The
other isomer is
OC(=O)[C@]([H])(N)Cc1ccc(O)cc1 D-Tyrosine
which has the three groups in an anticlockwise arrangement. These can be
written more simply as
OC(=O)[C@@H](N)Cc1ccc(O)cc1 L-Tyrosine
OC(=O)[C@H](N)Cc1ccc(O)cc1 D-Tyrosine
As an explanation for the use of the '@' symbol, and as an aid to
remembering which is which: an '@' symbol is an 'a' with an anticlockwise
circle around it.
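
Since '@' and '@@' refer to the order in which the neighbours are written,
the same isomer can carry either symbol depending on how the string is
arranged. Swapping two neighbours turns a clockwise arrangement into an
anticlockwise one, so an odd number of swaps flips the symbol and an even
number keeps it. The Python sketch below checks this; the function names and
the group labels ("COOH", "R" and so on) are invented for this example, and
N[C@@H](R)C(=O)O is simply another way of ordering the L-Tyrosine neighbours,
not a string taken from the text above.

```python
def permutation_parity(order_a, order_b):
    """Return 0 if order_b is an even rearrangement of order_a, 1 if odd."""
    positions = [order_a.index(item) for item in order_b]
    seen = [False] * len(positions)
    parity = 0
    for start in range(len(positions)):
        length = 0
        j = start
        while not seen[j]:         # walk one cycle of the permutation
            seen[j] = True
            j = positions[j]
            length += 1
        if length:
            parity ^= (length - 1) & 1
    return parity


def same_isomer(order_a, symbol_a, order_b, symbol_b):
    """Compare two chiral centres given neighbour order and '@' or '@@'."""
    flipped = permutation_parity(order_a, order_b) == 1
    return (symbol_a == symbol_b) != flipped


# Neighbours of the chiral carbon in OC(=O)[C@@]([H])(N)Cc1ccc(O)cc1
l_tyrosine = ["COOH", "H", "N", "R"]

# Same neighbours, ordered as in N[C@@H](R)C(=O)O (an even rearrangement).
print(same_isomer(l_tyrosine, "@@", ["N", "H", "R", "COOH"], "@@"))
# Swapping just N and H needs the other symbol to describe the same isomer.
print(same_isomer(l_tyrosine, "@@", ["COOH", "N", "H", "R"], "@"))
# Same order but the other symbol is the other isomer (D-Tyrosine).
print(same_isomer(l_tyrosine, "@@", l_tyrosine, "@"))
```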
The cis/trans isomerism of double bonds can also be specified. The
symbols '/' and '\' are used; they should precede and/or follow the atoms
which are doubly bonded. For example
  Cl\C=C/Cl           cis dichloro-ethene
  Cl\C=C\Cl           trans dichloro-ethene
or for a double bond with two groups at each end
Cl/C(Br)=C(/I)F
This will have Cl and I trans to each other, with the Br at the same end
as the Cl, and the F at the same end as the I.
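
For the simplest case, one group written before the double bond and one
written after it, the examples above follow a pattern: the same symbol on
both sides gives trans, different symbols give cis. The Python sketch below
encodes just that pattern. The name double_bond_geometry is made up for this
example, and the function deliberately refuses anything more complicated
(branches, rings, or several double bonds).

```python
import re

# group, '/' or '\', the C=C unit, '/' or '\', group
SIMPLE_ALKENE_RE = re.compile(r"(\w+)([/\\])C=C([/\\])(\w+)")


def double_bond_geometry(smiles):
    """Return 'cis' or 'trans' for strings of the form X/C=C/Y."""
    match = SIMPLE_ALKENE_RE.fullmatch(smiles)
    if match is None:
        raise ValueError("only the simple X/C=C/Y form is handled: " + smiles)
    before, after = match.group(2), match.group(3)
    # Matching symbols put the two groups on opposite sides of the bond.
    return "trans" if before == after else "cis"


print(double_bond_geometry(r"Cl\C=C/Cl"))
print(double_bond_geometry(r"Cl\C=C\Cl"))
```

Raw strings (the r prefix) are used so that Python does not treat the '\' in
the SMILES as the start of an escape sequence.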
Using the above rules almost all organic structures can be written in
SMILES notation. To demonstrate this point the final complicated example,
morphine
O1C2C(O)C=CC3C2(C4)c5c1c(O)ccc5CC3N(C)C4
can be written in a simple (it is when you get used to it!) and concise
way.
---
Simon Kilvington, 3/11/94