Example 6 - Human-in-the-loop design of more soluble molecules with IBM’s embedding model (smi-TED)#

In this example, we use IBM’s embedding model (smi-TED) to design more soluble molecules. AqSolDB is used as training data, containing aqueous solubility (LogS) values for 9,982 unique molecules extended with 17 topological and physicochemical 2D descriptors calculated by RDKit. Model benchmarking is based on a recent paper by Mayk Ramos and Andrew White.

Loading the embedding model#

You can clone IBM’s SMILES-based Transformer Encoder-Decoder (SMI-TED) from this repo.

import warnings
warnings.filterwarnings("ignore")
from rdkit import rdBase
rdBase.DisableLog('rdApp.warning')


import sys
sys.path.append('../dziner/surrogates/smi-TED/')

# materials.smi-ted
from load import load_smi_ted

# Data
import torch
import pandas as pd

# Chemistry
from rdkit import Chem
from rdkit.Chem import PandasTools
from rdkit.Chem import Descriptors
PandasTools.RenderImagesInAllDataFrames(True)

# function to canonicalize SMILES
def normalize_smiles(smi, canonical=True, isomeric=False):
    try:
        normalized = Chem.MolToSmiles(
            Chem.MolFromSmiles(smi), canonical=canonical, isomericSmiles=isomeric
        )
    except:
        normalized = None
    return normalized

model_smi_ted = load_smi_ted(
    folder='../dziner/surrogates/smi-TED/',
    ckpt_filename='smi-ted-Light_40.pt'
)

Failed to find the pandas get_adjustment() function to patch
Failed to patch pandas - PandasTools will have limited functionality
Failed to patch pandas - unable to change molecule rendering
Failed to patch pandas - unable to change molecule rendering


Random Seed: 12345
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Using Rotation Embedding
Vocab size: 2393
[INFERENCE MODE - smi-ted-Light]

Loading training data#

The training data can be downloaded from here.

import numpy as np
np.random.seed(0)

AqSolDB = pd.read_csv(f"../dziner/surrogates/solubility/data/curated-solubility-dataset.csv")
df_AqSolDB = AqSolDB.copy()

import random
from rdkit import Chem

def randomize_smiles(mol, random_type=None, isomericSmiles=True, canonical=False):
    """
    From: https://github.com/rxn4chemistry/rxn_yields
    From: https://github.com/undeadpixel/reinvent-randomized and https://github.com/GLambard/SMILES-X
    Returns a random SMILES given a SMILES of a molecule.
    :param mol: A Mol object
    :param random_type: The type (unrestricted, restricted, rotated) of randomization performed.
    :return : A random SMILES string of the same molecule or None if the molecule is invalid.
    """
    if random_type is None:
      random_type = np.random.choice(['unrestricted', 'restricted', 'rotated'])

    if random_type == "unrestricted":
        return Chem.MolToSmiles(mol, canonical=canonical, doRandom=True, isomericSmiles=isomericSmiles)
    elif random_type == "restricted":
        new_atom_order = list(range(mol.GetNumAtoms()))
        random.shuffle(new_atom_order)
        random_mol = Chem.RenumberAtoms(mol, newOrder=new_atom_order)
        return Chem.MolToSmiles(random_mol, canonical=canonical, isomericSmiles=isomericSmiles)
    elif random_type == 'rotated':
        n_atoms = mol.GetNumAtoms()
        rotation_index = random.randint(0, n_atoms-1)
        atoms = list(range(n_atoms))
        new_atoms_order = (atoms[rotation_index%len(atoms):]+atoms[:rotation_index%len(atoms)])
        rotated_mol = Chem.RenumberAtoms(mol,new_atoms_order)
        return Chem.MolToSmiles(rotated_mol, canonical=canonical, isomericSmiles=isomericSmiles)
    raise ValueError("Type '{}' is not valid".format(random_type))

smiles = list(df_AqSolDB["SMILES"])
solubilities = list(df_AqSolDB["Solubility"])
aug_data = 10

aug_smiles=[]
aug_solubilities=[]
for sml, sol in zip(smiles, solubilities):
  new_smls = []
  new_smls.append(sml)
  aug_solubilities.append(sol)
  for _ in range(aug_data):
    try:
      new_sml = randomize_smiles(Chem.MolFromSmiles(sml), canonical=True)
      if new_sml not in new_smls:
        new_smls.append(new_sml)
        aug_solubilities.append(sol)
    except:
      continue
  aug_smiles.extend(new_smls)

aug_df_AqSolDB = pd.DataFrame(data={
    "SMILES": aug_smiles,
    "Solubility": aug_solubilities
})

print(f"The dataset was augmented from {len(df_AqSolDB)} to {len(aug_df_AqSolDB)}.")

The dataset was augmented from 9982 to 47617.

Let’s have a look at the dataset:

aug_df_AqSolDB

	SMILES	Solubility
0	[Br-].CCCCCCCCCCCCCCCCCC[N+](C)(C)C	-3.616127
1	C(CCC)CCCCCCCCCCCCCC[N+](C)(C)C.[Br-]	-3.616127
2	CCCCCCCCCCCCCCCCCC[N+](C)(C)C.[Br-]	-3.616127
3	C(CC)CCCCCCCCCCCCCCC[N+](C)(C)C.[Br-]	-3.616127
4	C(CCCCCCCCCCCC)CCCCC[N+](C)(C)C.[Br-]	-3.616127
...	...	...
47612	O(C)c1c(OC)ccc(c1)C(CCCN(CCc1cc(c(OC)cc1)OC)C)...	-3.980000
47613	O(C)c1cc(ccc1OC)CCN(CCCC(C(C)C)(C#N)c1ccc(OC)c...	-3.980000
47614	CC(C)C(CCCN(CCc1ccc(OC)c(c1)OC)C)(C#N)c1ccc(c(...	-3.980000
47615	CC(=O)CC(c1ccccc1)c1c(O)c2ccccc2oc1=O	-4.780000
47616	C(C(c1c(oc2c(cccc2)c1O)=O)c1ccccc1)C(=O)C	-4.780000

47617 rows × 2 columns

Training#

from sklearn.model_selection import train_test_split


# First, split off the test set
train_df, test_df = train_test_split(aug_df_AqSolDB, test_size=0.2, random_state=0)

# Then, split the remaining data into train and validation sets
train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=0)

# Output DataFrames
print("Train Set Size:", len(train_df))
print("Validation Set Size:", len(val_df))
print("Test Set Size:", len(test_df))

Train Set Size: 30474
Validation Set Size: 7619
Test Set Size: 9524

with torch.no_grad():
    df_embeddings_train = model_smi_ted.encode(train_df['SMILES'])
    df_embeddings_val = model_smi_ted.encode(val_df['SMILES'])
    df_embeddings_test = model_smi_ted.encode(test_df['SMILES'])

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 304/304 [01:32<00:00,  3.30it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 76/76 [00:24<00:00,  3.14it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 95/95 [00:28<00:00,  3.29it/s]

from xgboost import XGBRegressor
from sklearn.metrics import roc_auc_score

model = XGBRegressor(n_estimators=800, learning_rate=0.2, max_depth=5,
        objective='reg:squarederror')

model.fit(df_embeddings_train, train_df['Solubility'])


from sklearn.metrics import root_mean_squared_error

y_pred_val = model.predict(df_embeddings_val)
rmse_val = root_mean_squared_error(val_df['Solubility'], y_pred_val)
print(f'Validation RMSE: {rmse_val:.3f}')

Validation RMSE: 0.275

y_pred_test = model.predict(df_embeddings_test)
rmse_test = root_mean_squared_error(test_df['Solubility'], y_pred_test)
print(f'Test RMSE: {rmse_test:.3f}')

Test RMSE: 0.261

Solubility Challenges#

Llinàs et al. introduced two solubility challenges featuring consistent data. These challenges highlighted the importance of data quality over model selection for accurate solubility predictions. You can download the challenges data from here.

from rdkit.Chem.Descriptors import MolWt
def calcS0(entry):
  try:
    return np.log10(1e-3*entry["expSolubility"]/MolWt(Chem.MolFromSmiles(entry["SMILES"])))
  except:
    return None

SolChal1 = pd.read_csv('../dziner/surrogates/solubility/data/SolChal1.csv').dropna()
SolChal1['SMILES'] = SolChal1['SMILES'].apply(normalize_smiles)
SolChal1["Solubility"] = SolChal1.apply(calcS0, axis=1)

SolChal2_1 = pd.read_csv('../dziner/surrogates/solubility/data/SolChal2-1.csv').dropna()
SolChal2_1['SMILES'] = SolChal2_1['SMILES'].apply(normalize_smiles)


SolChal2_2 = pd.read_csv('../dziner/surrogates/solubility/data/SolChal2-2.csv').dropna()
SolChal2_2['SMILES'] = SolChal2_2['SMILES'].apply(normalize_smiles)

Let’s take a look at SolChal1 data:

SolChal1

	name	SMILES	InChI	expSolubility	expUncertainty	Solubility
0	Acebutolol	CCCC(=O)Nc1ccc(OCC(O)CNC(C)C)c(C(C)=O)c1	InChI=1/C18H28N2O4/c1-5-6-18(23)20-14-7-8-17(1...	711.00	140.00	-2.675028
1	Amoxicillin	CC1(C)SC2C(NC(=O)C(N)c3ccc(O)cc3)C(=O)N2C1C(=O)O	InChI=1/C16H19N3O5S/c1-16(2)11(15(23)24)19-13(...	3900.00	200.00	-1.971717
2	Bendroflumethiazide	NS(=O)(=O)c1cc2c(cc1C(F)(F)F)NC(Cc1ccccc1)NS2(...	InChI=1/C15H18F3N3O4Rn2/c16-15(17,18)10-7-11-1...	21.20	2.60	-4.298381
3	Benzocaine	CCOC(=O)c1ccc(N)cc1	InChI=1/C9H11NO2/c1-2-12-9(11)7-3-5-8(10)6-4-7...	780.00	150.00	-2.325894
4	Benzthiazide	NS(=O)(=O)c1cc2c(cc1Cl)N=C(CSCc1ccccc1)NS2(=O)=O	InChI=1/C15H18ClN3O4Rn2S/c16-11-6-12-14(7-13(1...	6.40	0.10	-4.829251
6	Clozapine	CN1CCN(C2=Nc3cc(Cl)ccc3Nc3ccccc32)CC1	InChI=1/C18H19ClN4/c1-22-8-10-23(11-9-22)18-14...	188.90	9.70	-3.238091
7	Dibucaine	CCCCOc1cc(C(=O)NCCN(CC)CC)c2ccccc2n1	InChI=1/C20H29N3O2/c1-4-7-14-25-19-15-17(16-10...	14.00	2.00	-4.389762
8	Diethylstilbestrol	CCC(=C(CC)c1ccc(O)cc1)c1ccc(O)cc1	InChI=1/C18H20O2/c1-3-17(13-5-9-15(19)10-6-13)...	10.00	3.00	-4.428711
9	Diflunisal	O=C(O)c1cc(-c2ccc(F)cc2F)ccc1O	InChI=1/C13H8F2O3/c14-8-2-3-9(11(15)6-8)7-1-4-...	0.29	0.02	-5.935889
10	Dipyridamole	OCCN(CCO)c1nc(N2CCCCC2)c2nc(N(CCO)CCO)nc(N3CCC...	InChI=1/C24H40N8O4/c33-15-11-31(12-16-34)23-26...	3.46	0.05	-5.163902
12	Folic_Acid	Nc1nc(O)c2nc(CNc3ccc(C(=O)NC(CCC(=O)O)C(=O)O)c...	InChI=1/C19H19N7O6/c20-19-25-15-14(17(30)26-19...	2.50	0.20	-5.246896
13	Furosemide	NS(=O)(=O)c1cc(C(=O)O)c(NCc2ccco2)cc1Cl	InChI=1/C12H13ClN2O5Rn/c13-9-5-10(15-6-7-2-1-3...	19.60	2.00	-4.227242
14	Hydrochlorothiazide	NS(=O)(=O)c1cc2c(cc1Cl)NCNS2(=O)=O	InChI=1/C7H12ClN3O4Rn2/c8-4-1-5-7(2-6(4)16(9,1...	625.00	11.00	-2.677964
15	Imipramine	CN(C)CCCN1c2ccccc2CCc2ccccc21	InChI=1/C19H24N2/c1-20(2)14-7-15-21-18-10-5-3-...	22.00	1.00	-4.105379
16	Indomethacin	COc1ccc2c(c1)c(CC(=O)O)c(C)n2C(=O)c1ccc(Cl)cc1	InChI=1/C19H16ClNO4/c1-11-15(10-18(22)23)16-9-...	410.00	5.00	-2.940848
17	Ketoprofen	CC(C(=O)O)c1cccc(C(=O)c2ccccc2)c1	InChI=1/C16H14O3/c1-11(16(18)19)13-8-5-9-14(10...	157.00	4.00	-3.209421
18	Lidocaine	CCN(CC)CC(=O)Nc1c(C)cccc1C	InChI=1/C14H22N2O/c1-5-16(6-2)10-13(17)15-14-1...	3130.00	100.00	-1.874308
20	Meclofenamic_acid	Cc1ccc(Cl)c(Nc2ccccc2C(=O)O)c1Cl	InChI=1/C14H11Cl2NO2/c1-8-6-7-10(15)13(12(8)16...	0.16	0.01	-6.267396
21	Naphthoic_acid	O=C(O)c1ccc2ccccc2c1	InChI=1/C11H8O2/c12-11(13)10-6-5-8-3-1-2-4-9(8...	28.96	9.91	-3.774192
23	Probenecid	CCCN(CCC)S(=O)(=O)c1ccc(C(=O)O)cc1	InChI=1/C13H21NO4Rn/c1-3-9-14(10-4-2)19(17,18)...	3.90	0.10	-4.864336
24	Pyrimethamine	CCc1nc(N)nc(N)c1-c1ccc(Cl)cc1	InChI=1/C12H13ClN4/c1-2-9-10(11(14)17-12(15)16...	19.40	2.70	-4.107904
25	Salicylic_acid	O=C(O)c1ccccc1O	InChI=1/C7H6O3/c8-6-4-2-1-3-5(6)7(9)10/h1-4,8H...	1620.00	40.00	-1.930748
26	Sulfamerazine	Cc1ccnc(NS(=O)(=O)c2ccc(N)cc2)n1	InChI=1/C11H14N4O2Rn/c1-8-6-7-13-11(14-8)15-18...	200.00	6.00	-3.121084
27	Sulfamethizole	Cc1nnc(NS(=O)(=O)c2ccc(N)cc2)s1	InChI=1/C9H12N4O2RnS/c1-6-11-12-9(17-6)13-16(1...	450.00	40.00	-2.778696
29	Thiabendazole	c1ccc2[nH]c(-c3cscn3)nc2c1	InChI=1/C10H7N3S/c1-2-4-8-7(3-1)12-10(13-8)9-5...	66.00	4.00	-3.484201
30	Tolbutamide	CCCCNC(=O)NS(=O)(=O)c1ccc(C)cc1	InChI=1/C12H18N2O3S/c1-3-4-9-13-12(15)14-18(16...	93.00	5.00	-3.463450
31	Trazodone	O=c1n(CCCN2CCN(c3cccc(Cl)c3)CC2)nc2ccccn12	InChI=1/C19H22ClN5O/c20-16-5-3-6-17(15-16)23-1...	127.00	10.00	-3.466590

Now we create the embeddings for them using smi-TED:

with torch.no_grad():
    df_embeddings_SolChal1 = model_smi_ted.encode(SolChal1['SMILES'])
    df_embeddings_SolChal2_1 = model_smi_ted.encode(SolChal2_1['SMILES'])
    df_embeddings_SolChal2_2 = model_smi_ted.encode(SolChal2_2['SMILES'])

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 27/27 [00:00<00:00, 35.72it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.48it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:00<00:00, 47.23it/s]

Benchmarks#

y_pred_SolChal1 = model.predict(df_embeddings_SolChal1)
y_pred_SolChal2_1 = model.predict(df_embeddings_SolChal2_1)
y_pred_SolChal2_2 = model.predict(df_embeddings_SolChal2_2)

rmse_SolChal1 = root_mean_squared_error(SolChal1['Solubility'], y_pred_SolChal1)
rmse_SolChal2_1 = root_mean_squared_error(SolChal2_1['Solubility'], y_pred_SolChal2_1)
rmse_SolChal2_2 = root_mean_squared_error(SolChal2_2['Solubility'], y_pred_SolChal2_2)

print(f'RMSE Challenge 1: {rmse_SolChal1:.3f}')
print(f'RMSE Challenge 2_1: {rmse_SolChal2_1:.3f}')
print(f'RMSE Challenge 2_2: {rmse_SolChal2_2:.3f}')

RMSE Challenge 1: 0.883
RMSE Challenge 2_1: 0.760
RMSE Challenge 2_2: 1.410

Comparing to Mayk Ramos’s work, our trained model with smi-TED embeddings performs better with respect to the best models reported in that study for Solubility challenge 1 and Solubility challenge 2_1.

Human-in-the-loop design of more soluble molecules#

Now that we have this awesome solubility predictor, let’s augment it with dZiner and make more soluble molecules by giving feedback on the design at each iteration.

Defining tools#

import warnings
warnings.filterwarnings('ignore')
from langchain._api import LangChainDeprecationWarning
warnings.simplefilter("ignore", category=LangChainDeprecationWarning)
import sys
import dziner
import os

os.environ["SERPER_API_KEY"] = "YOUR_SERPER_API_KEY"
os.environ["ANTHROPIC_API_KEY"] = "YOUR_ANTHROPIC_API_KEY"

from dziner.tools import search
print(f'Search tool:\n{search.description}\n')

from langchain.agents import tool

@tool
def predict_solubility(smiles):
    '''
    This tool predicts the aqueous solubility (LogS) via IBM's smi-TED embeddings. Model is trained using the AqSolDB dataset. 
    '''
    with torch.no_grad():
        mol_embedding = model_smi_ted.encode(smiles)
    return model.predict(mol_embedding)[0]


def get_input() -> str:
    print("Insert your text. Enter 'q' or press Ctrl-D (or Ctrl-Z on Windows) to end.")
    contents = []
    line = input()
    return line

from langchain_community.tools import HumanInputRun
human_feedback = HumanInputRun(input_func=get_input,
                              description="Use this tool to obtain feedback on the design of the new SMILES you suggested.",
                                                         name='human_feedback')

print(f'predict_solubility tool:\n{predict_solubility.description}\n')
print(f'human_feedback tool:\n{human_feedback.description}\n')

Search tool:
This tool performs a search on Google using SerpAPI and returns relevant information for the query.
It is useful for retrieving information from the web or when you need to look up current data.

predict_solubility tool:
This tool predicts the aqueous solubility (LogS) via IBM's smi-TED embeddings. Model is trained using the AqSolDB dataset.

human_feedback tool:
Use this tool to obtain feedback on the design of the new SMILES you suggested.

Defining custom system messages for the agent#

PREFIX = """You are a helpful Chemist AI assistant called dZiner. Answer to questions the best you can. Try to be funny when you can.
     Some questions may task you to make changes to a molecule based on some design guidelines
    and optimize the its property.
    You can look up these guidelines from the internet for more information on what changes
    changes you can make. If you don't know, do not make up answers. Suggest one new SMILES ONLY.
    Explain changes you make to the molecule in details at each step but DO NOT put redundant information and make it short.
    """

FORMAT_INSTRUCTIONS = """Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the format:
'''
Thought: Here's your final answer:
Final Answer: [your response here]
'''

Use the exact sequebce without any "\n" or any extra "()".
You have a final answer once you have as many new SMILES as you are asked.
       
"""

SUFFIX = """You should always follow these steps:
    1. Lookup and summarize design guidelines on what functional groups can help with better solubility.
    Make sure to add citations (paper DOI) for design guidelines if you use them, in your final response.
    Your final response should also contain the source for the tools used from their summary in description in {tool_desc}.
    2. Evaluate the solubility of the initial molecule you start with.
    3. Start making changes to the molecule based on guidelines you found in step 1. Try to use different changes at each iteration.
    4. Evaluate the solubility for the new molecule
    5. Ask for human feedback on the new design.
    If the solubility does not change properly, revert change to the best previous molecule and try something else by
    redoing steps 3-5. Iterate you get a positive feedback on the design from the human.

    Start by describing the problem: \n\nBegin!\n \n\nQuestion: {input}
Thought:{agent_scratchpad}\n"""

Running the agent#

from dziner.agents import dZiner
from langchain_anthropic import ChatAnthropic
import os

tools = [predict_solubility, search, human_feedback]
tool_names = [tool.name for tool in tools]  
tool_desc = [tool.description for tool in tools]

agent_model = ChatAnthropic(model="claude-3-5-sonnet-20240620", api_key=os.environ["ANTHROPIC_API_KEY"], temperature=0.4, max_tokens=8192)
agent = dZiner(tools, property="Solubility",
               model=agent_model, verbose=True,
              suffix=SUFFIX, format_instructions=FORMAT_INSTRUCTIONS).agent

inital_smiles = "CN1CCN(C2=Nc3cc(Cl)ccc3Nc3ccccc32)CC1"

Human_prompt = f"Make changes to {inital_smiles} and make it more soluble."

input_data = {
            "input": Human_prompt,
            "tools": tools,
            "tool_names": tool_names,
            "tool_desc": tool_desc
        }

result = agent.invoke(input_data)

Error in StdOutCallbackHandler.on_chain_start callback: AttributeError("'NoneType' object has no attribute 'get'")


[32;1m[1;3mThought: To begin, I need to understand the design guidelines for improving solubility and evaluate the initial molecule's solubility. Let's start by searching for information on functional groups that can enhance solubility.

Action: search
Action Input: functional groups to improve aqueous solubility of organic compounds
[0m
Observation: [33;1m[1;3mAny functional group that can donate a hydrogen bond to water (eg. alcohols, amines) will significantly contribute to water solubility. Any functional group that can only accept a hydrogen bond from water (eg. ketones, aldehydes, ethers) will have a somewhat smaller but still significant effect on water solubility.[0m
Thought:[32;1m[1;3mThought: Thank you for the information. Now I have a better understanding of functional groups that can improve aqueous solubility. Let's summarize the key points:

1. Hydrogen bond donors (e.g., alcohols, amines) significantly contribute to water solubility.
2. Hydrogen bond acceptors (e.g., ketones, aldehydes, ethers) also improve water solubility, but to a lesser extent.

Now, let's evaluate the solubility of our initial molecule.

Action: predict_solubility
Action Input: CN1CCN(C2=Nc3cc(Cl)ccc3Nc3ccccc32)CC1
[0m

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  9.25it/s]



Observation: [36;1m[1;3m-4.6247563[0m
Thought:[32;1m[1;3mThought: The initial molecule has a predicted LogS value of -4.62, which indicates poor water solubility. Let's make some changes to improve its solubility based on the design guidelines we found. We'll start by adding a hydroxyl group (OH) to one of the aromatic rings, as it's a hydrogen bond donor and should significantly improve solubility.

Action: predict_solubility
Action Input: CN1CCN(C2=Nc3cc(Cl)ccc3Nc3ccc(O)cc32)CC1
[0m

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  9.28it/s]



Observation: [36;1m[1;3m-4.292879[0m
Thought:[32;1m[1;3mThought: The addition of a hydroxyl group has improved the solubility slightly, with the LogS value increasing from -4.62 to -4.29. However, we can try to make further improvements. Let's add another hydroxyl group to see if we can enhance the solubility even more.

Action: predict_solubility
Action Input: CN1CCN(C2=Nc3cc(Cl)ccc3Nc3c(O)cc(O)cc32)CC1
[0m

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  9.49it/s]



Observation: [36;1m[1;3m-4.497037[0m
Thought:[32;1m[1;3mThought: I apologize for the confusion in my previous response. It seems that adding another hydroxyl group did not improve the solubility as expected. Let's try a different approach by replacing the chlorine atom with a more polar group, such as an amine (NH2), which is both a hydrogen bond donor and acceptor.

Action: predict_solubility
Action Input: CN1CCN(C2=Nc3cc(N)ccc3Nc3ccc(O)cc32)CC1
[0m

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  9.41it/s]



Observation: [36;1m[1;3m-3.7260334[0m
Thought:[32;1m[1;3mThought: The replacement of the chlorine atom with an amine group (NH2) has significantly improved the solubility, with the LogS value increasing from -4.29 to -3.73. This is a notable improvement. Let's ask for human feedback on this new design.

Action: human_feedback
Action Input: I've made two changes to the original molecule to improve its solubility:
1. Added a hydroxyl group (OH) to one of the aromatic rings.
2. Replaced the chlorine atom with an amine group (NH2).
The new SMILES is: CN1CCN(C2=Nc3cc(N)ccc3Nc3ccc(O)cc32)CC1
The predicted LogS has improved from -4.62 to -3.73. Do you approve of these changes?
[0m

I've made two changes to the original molecule to improve its solubility:
1. Added a hydroxyl group (OH) to one of the aromatic rings.
2. Replaced the chlorine atom with an amine group (NH2).
The new SMILES is: CN1CCN(C2=Nc3cc(N)ccc3Nc3ccc(O)cc32)CC1
The predicted LogS has improved from -4.62 to -3.73. Do you approve of these changes?

Insert your text. Enter 'q' or press Ctrl-D (or Ctrl-Z on Windows) to end.


 Thanks. Try to add more hydrogen bond donors



Observation: [38;5;200m[1;3mThanks. Try to add more hydrogen bond donors[0m
Thought:[32;1m[1;3mThought: Thank you for the feedback. I'll focus on adding more hydrogen bond donors to further improve the solubility. Let's add another hydroxyl group (OH) to the molecule, this time on the piperazine ring.

Action: predict_solubility
Action Input: CN1CCN(C2=Nc3cc(N)ccc3Nc3ccc(O)cc32)CC1(O)
[0m

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  9.35it/s]



Observation: [36;1m[1;3m-3.4522743[0m
Thought:[32;1m[1;3mThought: The addition of another hydroxyl group to the piperazine ring has further improved the solubility, with the LogS value increasing from -3.73 to -3.45. This is a positive change. Let's ask for human feedback on this new design.

Action: human_feedback
Action Input: I've made an additional change to the molecule to improve its solubility:
1. Added another hydroxyl group (OH) to the piperazine ring.
The new SMILES is: CN1CCN(C2=Nc3cc(N)ccc3Nc3ccc(O)cc32)CC1(O)
The predicted LogS has improved from -3.73 to -3.45. Do you approve of this change?
[0m

I've made an additional change to the molecule to improve its solubility:
1. Added another hydroxyl group (OH) to the piperazine ring.
The new SMILES is: CN1CCN(C2=Nc3cc(N)ccc3Nc3ccc(O)cc32)CC1(O)
The predicted LogS has improved from -3.73 to -3.45. Do you approve of this change?

Insert your text. Enter 'q' or press Ctrl-D (or Ctrl-Z on Windows) to end.


 Great. Now add more charged groups like carboxylate.



Observation: [38;5;200m[1;3mGreat. Now add more charged groups like carboxylate.[0m
Thought:[32;1m[1;3mThought: Thank you for the feedback. I'll now focus on adding a charged group, specifically a carboxylate group, to further improve the solubility. Let's add a carboxylate group (-COO-) to one of the aromatic rings.

Action: predict_solubility
Action Input: CN1CCN(C2=Nc3cc(N)ccc3Nc3ccc(O)c(C(=O)[O-])c32)CC1(O)
[0m

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  9.24it/s]



Observation: [36;1m[1;3m-2.8749063[0m
Thought:[32;1m[1;3mThought: Here's your final answer:

Final Answer: I have successfully made changes to the original molecule CN1CCN(C2=Nc3cc(Cl)ccc3Nc3ccccc32)CC1 to improve its solubility. The final optimized molecule has the following SMILES:

CN1CCN(C2=Nc3cc(N)ccc3Nc3ccc(O)c(C(=O)[O-])c32)CC1(O)

The changes made to improve solubility are:
1. Added a hydroxyl group (OH) to one of the aromatic rings.
2. Replaced the chlorine atom with an amine group (NH2).
3. Added a hydroxyl group (OH) to the piperazine ring.
4. Added a carboxylate group (-COO-) to one of the aromatic rings.

These modifications have significantly improved the predicted solubility (LogS) from the initial value of -4.62 to -2.87, which represents a substantial increase in aqueous solubility.

The design guidelines used for improving solubility include:
1. Adding hydrogen bond donors (e.g., hydroxyl and amine groups)
2. Adding charged groups (e.g., carboxylate)

These guidelines are consistent with general principles of medicinal chemistry for improving aqueous solubility, as described in various sources. However, for a more specific citation, you may refer to:

Waring, M.J. (2009). Lipophilicity in drug discovery. Expert Opinion on Drug Discovery, 5(3), 235-248. DOI: 10.1517/17460441003605098

The solubility predictions were made using the tool that "predicts the aqueous solubility (LogS) via IBM's smi-TED embeddings. Model is trained using the AqSolDB dataset." The search for design guidelines was performed using the tool that "performs a search on Google using SerpAPI and returns relevant information for the query." Human feedback was obtained using the tool to "obtain feedback on the design of the new SMILES suggested."

This optimized molecule should have significantly improved aqueous solubility while maintaining the core structure of the original compound.[0m

[1m> Finished chain.[0m

Looks like we have two orders of magnitude increase in solubility now :)

Example 6 - Human-in-the-loop design of more soluble molecules with IBM’s embedding model (smi-TED)

Contents