Chem Quest

Table of contents

  • What is this?
  • Format of the input
  • Which features can be extracted
  • Output
  • GitHub link

What is this?

This app was requested by a group of chemistry PhD students who needed a quick and easy way to extract data from PubChem.

Format of the input

The input can be a .csv, .txt, .xls, or .xlsx file.
It must include a column named Molecules that contains the molecule IDs. An ID can be either the name or the CAS number. All other columns are ignored.
If CAS numbers are used, there is a high probability of recovering all molecules.
With names, the chance of recovering the properties varies widely because of misspellings and extra white space. The list of unrecovered molecules can be downloaded to check why they were not recovered.
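As an illustration of the input rules above, here is a minimal Python sketch that checks a CSV input before it is submitted. It is not the app's own code: the function names `read_molecules` and `is_cas` are hypothetical, and the whitespace clean-up is an assumption about what helps name lookups. The CAS test uses the standard CAS check-digit rule.

```python
import csv
import io
import re

# CAS numbers look like 2-7 digits, 2 digits, 1 check digit: e.g. 50-78-2.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_cas(identifier: str) -> bool:
    """Return True if the string has CAS format and a valid check digit."""
    match = CAS_PATTERN.match(identifier)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... from right to left; the sum modulo 10
    # must equal the check digit.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def read_molecules(csv_text: str) -> list[str]:
    """Return the cleaned IDs found in the "Molecules" column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or "Molecules" not in reader.fieldnames:
        raise ValueError('the input needs a column named "Molecules"')
    # Assumption: collapsing repeated white space and trimming the ends
    # removes one common cause of failed name lookups.
    ids = [" ".join((row["Molecules"] or "").split()) for row in reader]
    return [i for i in ids if i]  # skip empty cells
```

For example, `read_molecules("Molecules,Note\n 50-78-2 ,x\nacetic  acid,y\n")` returns the two cleaned IDs, and `is_cas` tells them apart: the first is a valid CAS number, the second is treated as a name.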

Which features can be extracted

Most PubChem features can be extracted. The exhaustive list is:
  • Synonyms
  • Identifiers: CAS, European Community (EC) Number, ICSC Number, NSC Number, UN Number, Pharos Ligand ID, UNII, DSSTox Substance ID, Nikkaji Number, Wikidata, Wikipedia, RXCUI, Metabolomics Workbench ID, ChEMBL ID, NCI Thesaurus Code
  • Formula: IUPAC Name, InChI, InChIKey, Canonical SMILES
  • Experimental properties: Physical Description, Color/Form, Odor, Boiling Point, Melting Point, Flash Point, Solubility, Density, Vapor Density, Vapor Pressure, LogP, Henry's Law Constant, Stability / Shelf Life, Decomposition, Corrosivity, Odor Threshold, Other Experimental Properties, Chemical Classes
  • Computed properties: Molecular Weight, XLogP3, Hydrogen Bond Donor Count, Hydrogen Bond Acceptor Count, Rotatable Bond Count, Exact Mass, Monoisotopic Mass, Topological Polar Surface Area, Heavy Atom Count, Formal Charge, Complexity, Isotope Atom Count, Defined Atom Stereocenter Count, Undefined Atom Stereocenter Count, Defined Bond Stereocenter Count, Undefined Bond Stereocenter Count, Covalently-Bonded Unit Count, Compound Is Canonicalized

A feature can be extracted only if it exists on PubChem. No feature is computed here or scraped from another database.
For experimental properties, only the first value listed on the PubChem page for each category is extracted (a cleaner extraction and output are in progress).
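The "first value per category" rule can be sketched in a few lines of Python. This is only an illustration of the rule as described, not the app's implementation: the function name `first_per_category` and the `(category, value)` input shape are assumptions, and the values in the usage note are placeholders rather than real PubChem data.

```python
def first_per_category(records):
    """Keep the first value seen for each category, in page order.

    `records` is an iterable of (category, value) pairs, listed in the
    order in which they appear on the page (hypothetical input shape).
    """
    first = {}
    for category, value in records:
        # setdefault stores the value only if the category is not present yet,
        # so later values for the same category are ignored.
        first.setdefault(category, value)
    return first
```

With the records `[("Boiling Point", "first reported value"), ("Boiling Point", "second reported value"), ("Odor", "first reported value")]`, only the first Boiling Point entry and the Odor entry are kept.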

Output

Chem Quest first displays the results as a data table with the input name, the CAS number, the 2D formula, and then the selected features. This table can be downloaded in CSV format. The 2D formula appears as the HTML code that displays the image.
If any molecule is not recovered, for any reason, the list of unrecovered molecules can be downloaded as a CSV file containing their input IDs.
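The two downloads described above can be pictured with a short Python sketch. It is a sketch under assumptions, not the app's code: `build_outputs` and `to_csv` are hypothetical names, the column headers "CAS" and "2D formula" are guesses at the real headers, and `recovered` stands in for whatever the PubChem lookup returned.

```python
import csv
import io


def to_csv(rows, columns):
    """Serialize a list of dicts to CSV text with the given column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def build_outputs(input_ids, recovered, features):
    """Return (results CSV, unrecovered CSV).

    `recovered` maps an input ID to a dict of values; an ID that is
    missing from it counts as unrecovered (hypothetical data shape).
    """
    columns = ["Molecules", "CAS", "2D formula"] + list(features)
    table, unrecovered = [], []
    for molecule in input_ids:
        values = recovered.get(molecule)
        if values is None:
            unrecovered.append({"Molecules": molecule})
            continue
        row = {"Molecules": molecule}
        # Features that were not found are left empty.
        row.update({column: values.get(column, "") for column in columns[1:]})
        table.append(row)
    return to_csv(table, columns), to_csv(unrecovered, ["Molecules"])
```

Because the 2D formula cell holds HTML such as `<img src="...">`, the `csv` module quotes it automatically, so the file still parses cleanly when it is read back.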

GitHub link

Git repository of the project
If you have a comment, a problem to report, or a feature to suggest, open an issue on the repository.