药物筛选进展
更新时间 ——2026年4月22日
课题目的:
旨在筛选影响 IRP2–IRE 铁稳态调控系统的肠道微生物代谢物,并初步探究其分子机制,为理解肠道微生物与宿主铁代谢互作提供新的分子依据
基于结构的高通量虚拟目的:
基于 AlphaFold 预测的 IRP2 蛋白三维结构,构建面向肠道微生物代谢物数据库(MiMeDB)的结构基础虚拟筛选流程,系统筛选可能干扰 IRP2–IRE 相互作用的候选代谢物
工作流程
受体IRP2的准备
-
Uniport下载AlphaFold预测的人源IRP2蛋白结构,编号:P48200。预处理脚本:
lsf prepare_receptor4.py \ -r IRP226126.pdb \ -o IRP2NEW26.pdbqt \ -A checkhydrogens \ -U nphs_lps_waters_nonstdres -
确定对接口袋和关键口袋残基。主要思路和方法如下,获得IRP2的关键映射残基文件
````md
目的1:寻找IRP2与IRE结合的关键残基
目的2:量化IRP2预测模型与IRP1的结构相似度
目的3:输出 IRP1 接触 IRE 的残基名单(氨基酸名称加位置),我自己需要查手动查论文验证
目的4:输出投射到IRP2上的残基名单(氨基酸名称加位置),我自己需要查手动查论文验证
hand_clean_IRP1.pdb 不是“只蛋白”,而是从 3SNP 里手动清理出来的那一套(IRP1 + IRE)
hand_clean_IRP1.pdb(= IRP1–IRE 复合物,单套)IRP2.pdb(alphafpld 预测的人源IRP2结构)
然后三步走:对齐 → 找 IRP1 接触 IRE 的残基 → 投射到 IRP2。
0)加载 + 清理水
```pymol reinitialize load hand_clean_IRP1.pdb, cplx load IRP2.pdb, irp2
可选:去水
remove cplx and solvent remove cplx and resn HOH ```
1)把复合物里的 IRP1 和 IRE 分开成选择集
```pymol
复合物中的蛋白 = IRP1
select irp1, cplx and polymer.protein
复合物中的RNA = IRE
select ire, cplx and polymer.nucleic ```
自检一下:你应该能看到 RNA 被选出来
pymol show sticks, ire
2)IRP1 对齐到 IRP2(只用 CA 对齐)
pymol align irp1 and name CA, irp2 and name CA
3)找 IRP1 上“直接接触 IRE”的残基(这一步是关键)
推荐两档阈值:
- 3.5 Å:更像“直接接触/氢键距离”(更干净)
- 4.5 Å:更像“界面残基”(会多一些)
我选用用 4 Å 作为折中:
pymol select irp1_iface, byres (irp1 within 4.0 of ire)自检一下数量大不大(看 CA 个数):
pymol count_atoms irp1_iface and name CA
4)把这批界面残基“map/投射”到 IRP2(空间映射)
我用 5 Å(口袋邻域),6 Å 更宽松:
pymol select irp2_pocket, byres (irp2 within 5.0 of irp1_iface) count_atoms irp2_pocket and name CA
5)显示、上色、标注口袋
```pymol hide everything show cartoon, irp2 show cartoon, irp1 show sticks, ire
color cyan, irp2 color green, irp1 color yellow, irp1_iface color magenta, irp2_pocket
标签(只标 CA,避免炸屏)
set label_size, 16 set label_color, white label irp2_pocket and name CA, "%s%s" % (resn,resi) ```
6)保存结果
pymol save IRP2_IRE_mapping.pse save IRP2_pocket_residues.pdb, irp2_pocket
````
-
获得Autodock vina的对接口袋参数
```
工具,pymol插件:GetBox-Pymol-Plugin
项目地址:https://github.com/MengwuXiao/GetBox-PyMOL-Plugin
参数文件:
center_x = -12.9 center_y = 2.3 center_z = 3.6 size_x = 56.0 size_y = 53.2 size_z = 71.4 num_modes = 20 ```
配体准备
-
下载MIMEDB2.0数据库,链接:MiMeDB: Downloads
-
读取MIMEDB2.0 SDF文件,成功读取到24376个, 分子总数24409个,读取失败33个,去除重复254个,保留24122个
-
rdkit聚类,相似度设置为0.9,对24122个进行聚类,共生成6751簇,前5大簇共包含分子12546个。对前5大簇,进行中心分子和随机分子提取,以及亲水指数和分子量分析,确认为膜脂类,所以去除了前5大类的分子,保留11576个分子,储存在all_others_remaining_2d.sdf中。
-
使用prep_to_pdbqt_filter_metal_2.py进行配体准备,成功准备11459个:
```Python
!/usr/bin/env python3
-- coding: utf-8 --
import os import re import sys import csv import time import traceback from multiprocessing import Pool
import pandas as pd from rdkit import Chem from rdkit.Chem import AllChem, Descriptors from rdkit.Chem.MolStandardize import rdMolStandardize from rdkit import RDLogger from meeko import MoleculePreparation
================= 配置区 =================
INPUT_SDF = "all_others_remaining_2d.sdf" OUTPUT_DIR = "docking_ligands" REPORT_CSV = "docking_prep_report.csv" N_CORES = 2
RANDOM_SEED = 2026
==========================================
可选:关闭 RDKit ERROR/Warning 刷屏(你仍然会在报告里看到失败原因)
RDLogger.DisableLog("rdApp.error") RDLogger.DisableLog("rdApp.warning")
常见“金属/类金属”原子序数集合(用于策略A:含金属直接跳过)
METAL_ATOMIC_NUMS = { # 碱金属/碱土 3, 4, 11, 12, 19, 20, 37, 38, 55, 56, # 常见金属/类金属(含Al等) 13, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 57, 72, 73, 74, 75, 76, 77, 78, 79, 80 }
def has_metal(mol: Chem.Mol) -> bool: """策略A:只要分子里含金属原子,直接判定为需要跳过。""" for a in mol.GetAtoms(): if a.GetAtomicNum() in METAL_ATOMIC_NUMS: return True return False
def safe_filename(s: str, max_len: int = 80) -> str: s = s.strip() if not s: return "NONAME" s = re.sub(r"\s+", "", s) s = re.sub(r"[^0-9a-zA-Z-.]+", "_", s) return s[:max_len]
def standardize_mol(mol: Chem.Mol): """ 标准化:Cleanup + 取最大片段 + 去电荷 注意:策略A过滤金属后,这些步骤更稳定。 """ try: mol = rdMolStandardize.Cleanup(mol)
chooser = rdMolStandardize.LargestFragmentChooser() mol = chooser.choose(mol) uncharger = rdMolStandardize.Uncharger() mol = uncharger.uncharge(mol) return mol, "Cleaned" except Exception as e: return None, f"{type(e).__name__}: {e}"def embed_and_optimize(mol: Chem.Mol, random_seed: int): """ETKDGv3 构象 + MMFF94s 优化(不行则 UFF),失败也不直接崩。""" params = AllChem.ETKDGv3() params.useRandomCoords = True params.randomSeed = int(random_seed) params.maxIterations = 500
conf_id = AllChem.EmbedMolecule(mol, params) if conf_id == -1: # 兜底:再来一次更宽松的 embedding conf_id = AllChem.EmbedMolecule( mol, useRandomCoords=True, randomSeed=int(random_seed), ) if conf_id == -1: return None, "3D_Embedding_Failed" # 力场优化:MMFF → UFF(仍可能失败,但不影响后续导出) try: props = AllChem.MMFFGetMoleculeProperties(mol, mmffVariant="MMFF94s") if props: AllChem.MMFFOptimizeMolecule(mol, mmffVariant="MMFF94s", maxIters=500) else: raise ValueError("No_MMFF_Props") except Exception: try: AllChem.UFFOptimizeMolecule(mol, maxIters=500) except Exception: # 允许优化失败继续往下走(有些分子仍可转PDBQT) pass return mol, "OK"def mol_to_pdbqt_string(mol: Chem.Mol): """ Meeko 生成 PDBQT 字符串。 不同版本 meeko API 可能略有差异,这里做兼容处理。 """ try: meeko_prep = MoleculePreparation()
# 兼容:有的版本 prepare() 返回 setup(或 list),有的版本就地修改 setup = meeko_prep.prepare(mol) if setup is None: # 某些版本是就地准备:write_pdbqt_string() 不需要参数 pdbqt_string = meeko_prep.write_pdbqt_string() else: # 某些版本返回 MoleculeSetup 或 list[setup] if isinstance(setup, (list, tuple)) and len(setup) > 0: setup_obj = setup[0] else: setup_obj = setup try: pdbqt_string = meeko_prep.write_pdbqt_string(setup_obj) except TypeError: # 兜底:如果这个版本不接受参数 pdbqt_string = meeko_prep.write_pdbqt_string() if not pdbqt_string or "ATOM" not in pdbqt_string: return None, "PDBQT_EmptyOrNoATOM" return pdbqt_string, "OK" except Exception as e: return None, f"Meeko_Exception: {type(e).__name__}: {e}"def process_single_entry(args): """ 单个任务处理逻辑 """ mol_block, mol_name, original_idx = args result = { "original_idx": original_idx, "name": mol_name, "status": "Failed", "reason": "Unknown", "n_atoms": 0, "heavy_atoms": 0, "mw": 0.0, }
try: # 1) 重建分子(你当前是MolBlock流程;保留但把 sanitize 打开) mol = Chem.MolFromMolBlock(mol_block, sanitize=True, removeHs=False) if mol is None: result["reason"] = "RDKit_MolFromMolBlock_Failed" return result # 2) 名称兜底 if not mol_name or str(mol_name).strip() == "": mol_name = f"Mol_{original_idx}" mol.SetProp("_Name", str(mol_name)) # 3) 策略A:含金属直接跳过 if has_metal(mol): result["reason"] = "Contains_Metal_Skip" return result # 4) 标准化(去盐/最大片段/去电荷) mol, msg = standardize_mol(mol) if mol is None: result["reason"] = f"Standardization_Failed: {msg}" return result # 5) 加氢(给3D/力场/Meeko用) mol = Chem.AddHs(mol) # 记录基本属性(加氢后:总原子数;heavy_atoms不变) result["n_atoms"] = int(mol.GetNumAtoms()) result["heavy_atoms"] = int(mol.GetNumHeavyAtoms()) result["mw"] = float(round(Descriptors.MolWt(mol), 2)) # 6) 3D + 优化 mol, msg = embed_and_optimize(mol, RANDOM_SEED) if mol is None: result["reason"] = msg return result # 7) 转 PDBQT(Meeko) pdbqt_string, msg = mol_to_pdbqt_string(mol) if pdbqt_string is None: result["reason"] = msg return result # 8) 写文件 if not os.path.exists(OUTPUT_DIR): os.makedirs(OUTPUT_DIR, exist_ok=True) safe_name = safe_filename(str(mol_name)) filename = f"lig_{original_idx}_{safe_name}.pdbqt" out_path = os.path.join(OUTPUT_DIR, filename) with open(out_path, "w", encoding="utf-8") as f: f.write(pdbqt_string) result["status"] = "Success" result["reason"] = "OK" return result except Exception as e: result["reason"] = f"Exception: {type(e).__name__}: {e}" return resultdef main(): os.makedirs(OUTPUT_DIR, exist_ok=True)
print(f"[Init] Reading SDF: {INPUT_SDF}") # sanitize=False 用于最大限度读入(坏分子先不san),后面 MolFromMolBlock 会 sanitize suppl = Chem.SDMolSupplier(INPUT_SDF, sanitize=False, removeHs=False) tasks = [] valid_mols = 0 skipped_none = 0 for i, mol in enumerate(suppl): if mol is None: skipped_none += 1 continue try: original_idx = mol.GetProp("Original_Index") if mol.HasProp("Original_Index") else str(i) name = mol.GetProp("_Name") if mol.HasProp("_Name") else "" mol_block = Chem.MolToMolBlock(mol) tasks.append((mol_block, name, original_idx)) valid_mols += 1 except Exception: skipped_none += 1 print(f"[Data] Total valid blocks loaded: {valid_mols}") print(f"[Data] Skipped corrupt records in SDF: {skipped_none}") print(f"[Run] Starting processing on {N_CORES} cores...") results_list = [] with Pool(processes=N_CORES) as pool: for res in pool.imap_unordered(process_single_entry, tasks, chunksize=20): results_list.append(res) if len(results_list) % 500 == 0: print(f" Processed {len(results_list)} / {valid_mols}") df = pd.DataFrame(results_list) df.to_csv(REPORT_CSV, index=False) success_count = int(df[df["status"] == "Success"].shape[0]) fail_count = int(df[df["status"] == "Failed"].shape[0]) print("\n" + "=" * 50) print("Processing Complete!") print(f"Total Attempted: {len(results_list)}") print(f"Success: {success_count}") print(f"Failed: {fail_count}") print(f"Check Report: {REPORT_CSV}") print("=" * 50) if fail_count > 0: print("\nFailure Reasons Breakdown (Top 20):") print(df[df["status"] == "Failed"]["reason"].value_counts().head(20)) # 额外:策略A统计(含金属跳过数) metal_skips = int((df["reason"] == "Contains_Metal_Skip").sum()) print(f"\n[Stats] Contains_Metal_Skip: {metal_skips}")if name == "main": main()
```
阳性分子准备
1. 3个阳性分子名称:KS-20073 KS-20226 V004-0827
2. 阳性分子储存路径:/public/home/hxsun25/mimedb2.0/2026_2_3dockingwork/阳性分子预处理
分子对接
-
分子对接总数:11459个肠道微生物代谢物分子+3个阳性分子=11462个分子
-
对接脚本1,任务提交脚本1如下。成功对接了10660个分子,有802个分子因为某种原因导致的程序中断未对接。
```LSF
LSF任务提交脚本
!/bin/bash
BSUB -J vina_IRP2_260203autobox
BSUB -q normal
BSUB -n 4
BSUB -R "span[hosts=1]"
BSUB -o logs/vina_%J.out
BSUB -e logs/vina_%J.err
set -euo pipefail
====== 1) 工作目录(你已经确认)======
WORKDIR="/public/home/hxsun25/mimedb2.0/2026_2_3dockingwork" cd "$WORKDIR"
====== 2) 输入文件(你已经确认)======
RECEPTOR="$WORKDIR/IRP2NEW26.pdbqt" CONFIG="$WORKDIR/config260203autobox.txt"
====== 3) 本次要跑的 chunk(由外部传入环境变量 CHUNK)======
- "${CHUNK:?ERROR: CHUNK is not set (e.g. chunk_001)}" CHUNK_DIR="$WORKDIR/chunks/$CHUNK"
====== 4) 输出目录(每个 chunk 独立输出,避免覆盖)======
OUTDIR="$WORKDIR/results/$CHUNK" mkdir -p "$OUTDIR" logs
====== 5) 激活环境(你之前稳定的写法)======
source /public/home/hxsun25/miniconda3/etc/profile.d/conda.sh conda activate vina23
echo "===== JOB INFO =====" echo "HOST : $(hostname)" echo "PWD : $(pwd)" echo "QUEUE : high" echo "NCORES : 4" echo "CHUNK : $CHUNK" echo "CHUNK_DIR : $CHUNK_DIR" echo "OUTDIR : $OUTDIR" echo "PY : $(which python || true)" echo "PY3 : $(which python3 || true)" echo "VINA : $(which vina || true)" vina --version || true python3 --version || true echo "RECEPTOR : $RECEPTOR" echo "CONFIG : $CONFIG" echo "===================="
====== 6) 运行你的 python 入口(我们下一步再改它支持 chunk)======
约定:python 脚本接收 3 个参数:chunk_dir / outdir / config / receptor
python3 dock_chunk_v1.py \ --chunk_dir "$CHUNK_DIR" \ --outdir "$OUTDIR" \ --config "$CONFIG" \ --receptor "$RECEPTOR" \ --nproc 4 ```
```python
!/usr/bin/env python3
-- coding: utf-8 --
import os import sys import time import socket import platform import traceback import argparse from pathlib import Path from datetime import datetime from concurrent.futures import ProcessPoolExecutor, as_completed
------------------ 默认参数(可被命令行覆盖) ------------------
DEFAULT_EXHAUSTIVENESS = 16 DEFAULT_NUM_MODES = 20 DEFAULT_VERBOSITY = 1
---------------------------------------------------------------
--------- 进程内全局对象(每个 worker 各自一份)---------
_VINA_OBJ = None _CFG = None _CENTER = None _SIZE = None _N_POSES = None _META = None
def parse_vina_config(config_path: Path) -> dict: cfg = {} with open(config_path, "r", encoding="utf-8", errors="ignore") as f: for line in f: s = line.strip() if not s or s.startswith("#") or s.startswith(";"): continue if "=" in s: k, v = s.split("=", 1) else: parts = s.split() if len(parts) < 2: continue k, v = parts[0], parts[1] cfg[k.strip().lower()] = v.strip() return cfg
def _to_float(cfg: dict, k: str, default=None): if k not in cfg: return default try: return float(cfg[k]) except Exception: return default
def _to_int(cfg: dict, k: str, default=None): if k not in cfg: return default try: return int(float(cfg[k])) except Exception: return default
def check_environment(receptor: Path, config: Path, chunk_dir: Path, out_dir: Path): print(f"Start Time : {datetime.now().isoformat(timespec='seconds')}") print(f"HOST : {socket.gethostname()}") print(f"PWD : {Path.cwd().resolve()}") print(f"Chunk Dir : {chunk_dir}") print(f"Out Dir : {out_dir}")
try: import vina # noqa except Exception as e: print("❌ ERROR: cannot import vina in current env. Activate correct conda env (e.g., vina23).") print(f"import vina error: {e}") sys.exit(1) if not receptor.exists(): sys.exit(f"❌ Missing receptor: {receptor}") if not config.exists(): sys.exit(f"❌ Missing config: {config}") if not chunk_dir.exists(): sys.exit(f"❌ Missing chunk_dir: {chunk_dir}") out_dir.mkdir(parents=True, exist_ok=True)def init_worker(receptor_path: str, config_path: str, out_dir: str, exhaustiveness: int, verbosity: int, num_modes_override: int | None): """ 每个进程启动时执行一次:读 config / 初始化 Vina / 计算 maps / 保存元信息 """ global _VINA_OBJ, _CFG, _CENTER, _SIZE, _N_POSES, _META
from vina import Vina import vina as vina_mod cfg = parse_vina_config(Path(config_path)) center = ( _to_float(cfg, "center_x"), _to_float(cfg, "center_y"), _to_float(cfg, "center_z"), ) size = ( _to_float(cfg, "size_x"), _to_float(cfg, "size_y"), _to_float(cfg, "size_z"), ) # config 里如果写了 num_modes,就用它;否则默认 20;命令行可强制覆盖 n_poses = _to_int(cfg, "num_modes", DEFAULT_NUM_MODES) if num_modes_override is not None: n_poses = int(num_modes_override) if any(v is None for v in center) or any(v is None for v in size): raise RuntimeError("Config missing center_x/y/z or size_x/y/z.") v = Vina(sf_name="vina", verbosity=int(verbosity)) v.set_receptor(receptor_path) v.compute_vina_maps(center=center, box_size=size) _VINA_OBJ = v _CFG = cfg _CENTER = center _SIZE = size _N_POSES = int(n_poses) _META = { "hostname": socket.gethostname(), "pid": os.getpid(), "python": sys.version.replace("\n", " "), "platform": platform.platform(), "vina_py_version": getattr(vina_mod, "__version__", "unknown"), "receptor": receptor_path, "config": config_path, "out_dir": out_dir, "center": center, "size": size, "num_modes": int(n_poses), "exhaustiveness": int(exhaustiveness), "vina_verbosity": int(verbosity), }def _format_table(energies): lines = [] lines.append("-----+------------+----------+----------") lines.append("mode | affinity | rmsd l.b.| rmsd u.b.") lines.append(" | (kcal/mol) | | ") lines.append("-----+------------+----------+----------") for i, row in enumerate(energies, 1): aff, lb, ub = float(row[0]), float(row[1]), float(row[2]) lines.append(f"{i:>4} | {aff:>10.3f} | {lb:>8.3f} | {ub:>8.3f}") lines.append("-----+------------+----------+----------") return "\n".join(lines)
def run_docking_api(ligand_path_str: str): """ 返回 (status, ligand_name, msg) msg: OK 时给 top1 affinity;FAIL 时给简短错误 """ global _VINA_OBJ, _N_POSES, _META
ligand_path = Path(ligand_path_str) ligand_name = ligand_path.stem out_dir = Path(_META["out_dir"]) out_pdbqt = out_dir / f"{ligand_name}_out.pdbqt" out_log = out_dir / f"{ligand_name}.log" # 断点续跑 if out_pdbqt.exists() and out_pdbqt.stat().st_size > 0: with open(out_log, "w", encoding="utf-8") as f: f.write(f"[SKIP] {datetime.now().isoformat(timespec='seconds')}\n") for k, v in _META.items(): f.write(f"{k}: {v}\n") f.write(f"ligand: {ligand_path}\n") f.write("reason: output exists and non-empty\n") return ("SKIP", ligand_name, "File exists") t0 = time.time() try: v = _VINA_OBJ if v is None: raise RuntimeError("Worker vina object not initialized.") v.set_ligand_from_file(str(ligand_path)) v.dock(exhaustiveness=int(_META["exhaustiveness"]), n_poses=int(_N_POSES)) energies = v.energies(n_poses=int(_N_POSES)) table = _format_table(energies) v.write_poses(str(out_pdbqt), n_poses=int(_N_POSES), overwrite=True) dt = time.time() - t0 with open(out_log, "w", encoding="utf-8") as f: f.write(f"[OK] {datetime.now().isoformat(timespec='seconds')}\n") for k, vv in _META.items(): f.write(f"{k}: {vv}\n") f.write(f"ligand: {ligand_path}\n") f.write(f"output: {out_pdbqt}\n") f.write(f"runtime_seconds: {dt:.3f}\n\n") f.write(table + "\n") top1 = float(energies[0][0]) if out_pdbqt.exists() and out_pdbqt.stat().st_size > 0: return ("OK", ligand_name, f"{top1:.3f}") return ("FAIL", ligand_name, "No output pdbqt generated") except Exception as ex: dt = time.time() - t0 tb = traceback.format_exc() with open(out_log, "w", encoding="utf-8") as f: f.write(f"[FAIL] {datetime.now().isoformat(timespec='seconds')}\n") for k, vv in _META.items(): f.write(f"{k}: {vv}\n") f.write(f"ligand: {ligand_path}\n") f.write(f"runtime_seconds: {dt:.3f}\n") f.write(f"error: {type(ex).__name__}: {ex}\n\n") f.write(tb) return ("FAIL", ligand_name, f"{type(ex).__name__}: {ex}")def is_positive_name(name: str) -> bool: # 你三个阳性都以 POS_ 开头 return name.startswith("POS_")
def main(): p = argparse.ArgumentParser(description="Chunked AutoDock Vina docking runner (vina python API).") p.add_argument("--chunk_dir", required=True, help="chunks/chunk_XXX directory containing pdbqt ligands") p.add_argument("--outdir", required=True, help="output directory for this chunk") p.add_argument("--config", required=True, help="vina config file (center/size/num_modes)") p.add_argument("--receptor", required=True, help="receptor pdbqt path") p.add_argument("--nproc", type=int, default=4, help="number of worker processes (match BSUB -n)") p.add_argument("--exhaustiveness", type=int, default=DEFAULT_EXHAUSTIVENESS) p.add_argument("--num_modes", type=int, default=None, help="override num_modes (optional)") p.add_argument("--verbosity", type=int, default=DEFAULT_VERBOSITY) args = p.parse_args()
chunk_dir = Path(args.chunk_dir).resolve() out_dir = Path(args.outdir).resolve() receptor = Path(args.receptor).resolve() config = Path(args.config).resolve() check_environment(receptor, config, chunk_dir, out_dir) # per-chunk logs log_success = out_dir / "docking_success.tsv" log_failure = out_dir / "docking_failure.tsv" log_skipped = out_dir / "docking_skipped.tsv" summary_log = out_dir / "docking_summary.txt" # ligand list from chunk_dir ligands = sorted(chunk_dir.glob("*.pdbqt")) total = len(ligands) if total == 0: print(f"❌ No ligands found in chunk_dir: {chunk_dir}") sys.exit(1) print(f"Found {total} ligands in {chunk_dir}") print(f"Running with nproc={args.nproc} | exhaustiveness={args.exhaustiveness} | num_modes={args.num_modes or '(from config)'}") # touch logs for f in [log_success, log_failure, log_skipped]: if not f.exists(): f.touch() counts = {"OK": 0, "FAIL": 0, "SKIP": 0} t_all = time.time() with ProcessPoolExecutor( max_workers=int(args.nproc), initializer=init_worker, initargs=(str(receptor), str(config), str(out_dir), int(args.exhaustiveness), int(args.verbosity), args.num_modes), ) as executor: future_to_lig = {executor.submit(run_docking_api, str(lig)): lig for lig in ligands} with open(log_success, "a", encoding="utf-8") as f_ok, \ open(log_failure, "a", encoding="utf-8") as f_fail, \ open(log_skipped, "a", encoding="utf-8") as f_skip: for future in as_completed(future_to_lig): status, name, msg = future.result() counts[status] += 1 done = counts["OK"] + counts["FAIL"] + counts["SKIP"] elapsed = time.time() - t_all speed = done / elapsed if elapsed > 0 else 0.0 # 记录 + 阳性高亮 tag = "POS" if is_positive_name(name) else "NA" if status == "OK": f_ok.write(f"OK\t{tag}\t{name}\t{msg}\n") f_ok.flush() elif status == "FAIL": f_fail.write(f"FAIL\t{tag}\t{name}\t{msg}\n") f_fail.flush() if tag == "POS": print(f"🚨 POS FAIL: {name} ({msg})") else: print(f"⚠️ FAIL: {name} ({msg})") else: f_skip.write(f"SKIP\t{tag}\t{name}\n") f_skip.flush() if done % 25 == 0 or done == total: print(f"Progress: {done}/{total} | OK={counts['OK']} FAIL={counts['FAIL']} SKIP={counts['SKIP']} | {speed:.2f} lig/s") total_time = time.time() - t_all summary = ( f"TOTAL={total} OK={counts['OK']} FAIL={counts['FAIL']} SKIP={counts['SKIP']} " f"SECONDS={total_time:.1f} SPEED={total/total_time:.3f}_lig/s" ) print("-" * 60) print(f"Done: {datetime.now().isoformat(timespec='seconds')}") print(summary) with open(summary_log, "w", encoding="utf-8") as f: f.write(summary + "\n") f.write(f"chunk_dir={chunk_dir}\n") f.write(f"out_dir={out_dir}\n") f.write(f"receptor={receptor}\n") f.write(f"config={config}\n")if name == "main": main()
```
-
重新对接802个分子,对接脚本2和提交脚本2如下。成功对接761个分子,报错41个(可能分子柔性太高)
```sh
!/bin/bash
TODO_LIST="todo_list.txt" LOG_DIR="logs_remaining" mkdir -p $LOG_DIR
echo "开始从第一行开始提交所有任务,每间隔 2 秒提交一个..."
使用 cat 读取完整列表,确保从第一个分子开始
cat "$TODO_LIST" | while IFS=, read -r LIG_PATH OUT_PATH CK_NAME do LIG_ID=$(basename "$LIG_PATH" .pdbqt)
echo "[$(date +%H:%M:%S)] 正在提交: ${LIG_ID} (来自 ${CK_NAME})" # 提交任务 # -n 1: 单核 # -M 5000: 内存硬限制 5G # -R: 资源预留 3000MB,提高在集群中的调度优先级 bsub -q smp \ -n 1 \ -M 5000 \ -R "rusage[mem=3000] span[hosts=1]" \ -J "dock_${LIG_ID}" \ -o "${LOG_DIR}/${LIG_ID}.out" \ -e "${LOG_DIR}/${LIG_ID}.err" \ "source /public/home/hxsun25/miniconda3/etc/profile.d/conda.sh && \ conda activate vina23 && \ export OMP_NUM_THREADS=1 && \ export MKL_NUM_THREADS=1 && \ export OPENBLAS_NUM_THREADS=1 && \ python3 dock_single.py \"$LIG_PATH\" \"$OUT_PATH\"" # 核心:每提交一个任务休息 2 秒,平滑集群压力 sleep 2done
echo "所有任务已按照 2 秒间隔提交完毕。" ```
```python import sys, os, time, socket, platform from pathlib import Path from datetime import datetime from vina import Vina import vina as vina_mod
1. 接收参数
ligand_path = sys.argv[1] out_dir = Path(sys.argv[2]) config_path = "/public/home/hxsun25/mimedb2.0/2026_2_3dockingwork/config260203autobox.txt" receptor_path = "/public/home/hxsun25/mimedb2.0/2026_2_3dockingwork/IRP2NEW26.pdbqt"
out_dir.mkdir(parents=True, exist_ok=True) lig_name = Path(ligand_path).stem out_pdbqt = out_dir / f"{lig_name}_out.pdbqt" out_log = out_dir / f"{lig_name}.log"
2. Vina 配置读取
def get_cfg(p): d = {} with open(p) as f: for l in f: if '=' in l: k, v = l.split('='); d[k.strip().lower()] = v.strip() return d
cfg = get_cfg(config_path)
3. 能量列表格式化函数
def _format_table(energies): lines = [] lines.append("-----+------------+----------+----------") lines.append("mode | affinity | rmsd l.b.| rmsd u.b.") lines.append(" | (kcal/mol) | | ") lines.append("-----+------------+----------+----------") for i, row in enumerate(energies, 1): aff, lb, ub = float(row[0]), float(row[1]), float(row[2]) lines.append(f"{i:>4} | {aff:>10.3f} | {lb:>8.3f} | {ub:>8.3f}") lines.append("-----+------------+----------+----------") return "\n".join(lines)
4. 对接逻辑
t0 = time.time() v = Vina(sf_name="vina", cpu=1, verbosity=0) v.set_receptor(receptor_path) v.compute_vina_maps( center=(float(cfg['center_x']), float(cfg['center_y']), float(cfg['center_z'])), box_size=(float(cfg['size_x']), float(cfg['size_y']), float(cfg['size_z'])) )
v.set_ligand_from_file(ligand_path) v.dock(exhaustiveness=16, n_poses=20) v.write_poses(str(out_pdbqt), n_poses=20, overwrite=True)
5. 写入详细日志 (修正了 f-string 包含反斜杠的问题)
energies = v.energies(n_poses=20) runtime = time.time() - t0
提前处理包含反斜杠的字符串
py_version_clean = sys.version.replace('\n', ' ') vina_ver = getattr(vina_mod, "version", "unknown")
with open(out_log, "w", encoding="utf-8") as f: f.write(f"[OK] {datetime.now().isoformat(timespec='seconds')}\n") f.write(f"hostname: {socket.gethostname()}\n") f.write(f"pid: {os.getpid()}\n") f.write(f"python: {py_version_clean}\n") # 使用处理好的变量 f.write(f"platform: {platform.platform()}\n") f.write(f"vina_py_version: {vina_ver}\n") f.write(f"receptor: {receptor_path}\n") f.write(f"config: {config_path}\n") f.write(f"out_dir: {out_dir}\n") f.write(f"center: ({cfg['center_x']}, {cfg['center_y']}, {cfg['center_z']})\n") f.write(f"size: ({cfg['size_x']}, {cfg['size_y']}, {cfg['size_z']})\n") f.write(f"num_modes: 20\n") f.write(f"exhaustiveness: 16\n") f.write(f"vina_verbosity: 0\n") f.write(f"ligand: {ligand_path}\n") f.write(f"output: {out_pdbqt}\n") f.write(f"runtime_seconds: {runtime:.3f}\n\n") f.write(_format_table(energies) + "\n")
print(f"Done: {lig_name}") ```
分子对接后分析
-
第一批次对接成功10660+第二批对接成功761=11421个
-
把这11421个分子的对接结果导出,结合理化性质,筛选出符合类药五原则,并且对接结合能小于等于-6的分子,共选择出5814个分子
-
使用以下脚本,对对接复合物的受体与配体相互作用情况进行解析,生成统计表格。
```python import os import xml.etree.ElementTree as ET import pandas as pd from tqdm import tqdm
1. 配置路径(请确保路径正确)
base_path = "/public/home/hxsun25/mimedb2.0/2026_2_3dockingwork/分子对接后分析_20260311/2对接作用力分析/batch_complex_plip5814" output_detailed = "All_Interactions_Detailed_Analysis_Corrected5814.csv" output_table3 = "Table3_Ligand_Interaction_Profile5814.csv" output_table4 = "Table4_Residue_Hotspot_Analysis5814.csv"
定义相互作用类型映射
interaction_types = { "hydrogen_bonds": "H-Bond", "hydrophobic_interactions": "Hydrophobic", "pi_stacking": "Pi-Stacking", "pi_cation_interactions": "Pi-Cation", "salt_bridges": "Salt-Bridge", "water_bridges": "Water-Bridge", "halogen_bonds": "Halogen-Bond", "metal_complexes": "Metal-Complex" }
all_data = []
获取所有子文件夹
folders = [f for f in os.listdir(base_path) if os.path.isdir(os.path.join(base_path, f))]
print(f"正在解析 {len(folders)} 个分子的 PLIP 报告...")
for folder_name in tqdm(folders, desc="Processing"): xml_path = os.path.join(base_path, folder_name, "plip_results/pose1", f"{folder_name}_pose1_complex_report.xml")
if not os.path.exists(xml_path): continue try: tree = ET.parse(xml_path) root = tree.getroot() interactions_node = root.find(".//bindingsite/interactions") if interactions_node is None: continue for xml_tag, label in interaction_types.items(): group = interactions_node.find(xml_tag) if group is not None: for item in group: # --- 核心修正点:适配 restype 标签 --- res_num = item.find('resnr').text if item.find('resnr') is not None else "N/A" # 优先读取 restype,如果不存在则尝试 resname res_name = "N/A" if item.find('restype') is not None: res_name = item.find('restype').text elif item.find('resname') is not None: res_name = item.find('resname').text chain = item.find('reschain').text if item.find('reschain') is not None else "A" # 距离提取逻辑(适配多种距离标签) dist = "N/A" for d_tag in ['dist', 'dist_h-a', 'dist_d-a']: if item.find(d_tag) is not None: dist = item.find(d_tag).text break all_data.append({ "Ligand_ID": folder_name, "Interaction_Type": label, "Residue_Name": res_name, "Resnr": res_num, "Residue": f"{res_name}{res_num}", # 生成类似 ARG668 的格式 "Chain": chain, "Distance": dist }) except Exception as e: continue # 忽略个别损坏的文件2. 转换为 DataFrame
df = pd.DataFrame(all_data)
if not df.empty: # --- 保存详细清单 --- df.to_csv(output_detailed, index=False)
# --- 生成表格 3:配体-作用力丰度表 (Ligand Profile) --- table3 = df.pivot_table( index='Ligand_ID', columns='Interaction_Type', values='Residue', aggfunc='count', fill_value=0 ) table3['Total_Interactions'] = table3.sum(axis=1) # 统计涉及的不同残基总数 table3['Unique_Residues_Count'] = df.groupby('Ligand_ID')['Residue'].nunique() # 拼接残基列表字符串,方便查看 table3['Residue_List'] = df.groupby('Ligand_ID')['Residue'].apply(lambda x: ', '.join(sorted(set(x)))) table3.to_csv(output_table3) # --- 生成表格 4:残基热点统计表 (Residue Hotspots) --- total_ligands = 5472 table4 = df.groupby('Residue').agg( Ligand_Count=('Ligand_ID', 'nunique') ).reset_index() table4['Frequency_%'] = (table4['Ligand_Count'] / total_ligands * 100).round(2) # 补充主要作用类型 main_type = df.groupby('Residue')['Interaction_Type'].agg(lambda x: x.value_counts().index[0]).reset_index() table4 = table4.merge(main_type, on='Residue') # 按出现频率降序排列 table4 = table4.sort_values(by='Ligand_Count', ascending=False) table4.to_csv(output_table4, index=False) print(f"\n成功!所有表格已生成:\n1. {output_detailed}\n2. {output_table3}\n3. {output_table4}")else: print("\n警告:未提取到任何有效数据,请检查文件夹权限。")
```
实验可行性筛选
- 对5814个分子查询多家供应商。包括MCE,targetmol,Moiport,chemspace,还有手动编写的查询脚本,最终确定。只有1382个分子是可以购买获得的,其它可能需要定制或者寻找类似物,暂时搁置。我计算重心放在这1382个分子
多指标综合打分排序
-
对1382个分子,进行相似度为0.6的分子指纹聚类,共产生720个簇,有530个单分子簇,前5大簇的分子数目为70,32,27,26,21。有185个簇介于2和20个分子数。
-
对前1382个分子进行信息补全,制作成一个包含完整信息的表格。包含以下信息列:
对接结合能,MW LogP HBD HBA TPSA RotatableBonds H-Bond Halogen-Bond Hydrophobic Pi-Cation Salt-Bridge Unique_Residues_Count
- 排序实现过程
``` ### 思路 以结合能(affinity)、相互作用类型质量和理化性质(drug-likeness)三类指标构建主排序模型;与此同时,将候选分子的热点残基命中情况作为辅助注释信息保留,用于后续人工优选和结果解释
###定义变量 设第i个候选分子为i ####结合能变量 Ai = RankPct(-affinity) Ai表示第i个分子的结合能分数 RankPct()表示按所有候选分子中的相对位置计算百分位排名,结果介于0和1之间,分数越接近1,表示结合能越优。原因在于:表示按所有候选分子中的相对位置计算百分位排名,结果介于 0 和 1 之间
#####相互作用类型变量 为了评价候选分子与蛋白结合模式的质量,不仅考虑是否结合,还考虑形成了哪些类型的分子间相互作用。对每个分子,统计以下几类相互作用数量: H-Bond:氢键数; Halogen-Bond:卤键数; Hydrophobic:疏水作用数; Pi-Cation:阳离子-π相互作用数; Salt-Bridge:盐桥数
氢键是常见且重要的定向相互作用,因此赋权 1.0; 卤键数量通常较少,但方向性较强,赋权 0.8; 疏水作用常较普遍,特异性相对弱,因此赋权 0.5; 阳离子-π作用在某些结合模式中具有较高贡献,赋权 1.2; 盐桥通常代表较强的静电相互作用,赋权 1.5 根据系数计算的相互作用原始分记录为Qi,然后对Qi进行百分位标准化,得到相互作用类型分数Ti
####理化性质变量 MW:分子量; LogP:脂水分配系数; HBD:氢键供体数; HBA:氢键受体数; TPSA:拓扑极性表面积; RotatableBonds:可旋转键数。 由于这些变量的“理想范围”不同,不能简单直接相加,因此对每个变量分别构建 desirability 函数,将其映射为 0 到 1 之间的单项得分。 随后,采用几何平均的方式整合六项理化性质得分,得到总体理化性质分数Di
###最终计算权重系数 Ai为0.45;Ti为0.35;Di为0.20 以对接表现为主,以结合模式为辅,以理化性质为约束 在实际分析中,也测试0.5 0.3 0.2;0.40 0.4 0.2。并计算Spearman 秩相关系数比较排序结果一致性。最终显示一致性高。 ```
深度学习模型Gnina软件打分排序
-
对1382个分子的对接结果,保留第一个最优构象,不重新对接,只对Autodock vina对接结果进行重评分,执行评分命令
python gnina -r 受体.pdbqt -l 配体.pdbqt --score_only --cnn crossdock_default2018 -
提取所以评分,然后CNNscore * CNNaffinity 对着1382个分子依据Gnina打分排序
深度学习模型Gnina软件对1385个分子重对接
- 设置与vina对接是相同的受体,口袋参数,对接参数进行Gnina对接,对Gnina对接结果进行CNNscore * CNNaffinity 计算,计算后排序
核心交集+尖子生并集
1. 对多指标综合打分,vina对接结果的Gnina重打分,Gnina重对接,分别采用这3种方法单独进行排序。
2. 对这3种指标的前25%取交集,共交集出25个分子
3. 对vina多指标综合排序的名单取前5%并且Gnina置信度大于0.5,34个
4. 取Gnina重对接排名前5%的分子,20个
5,对3种结果进行并集,最终获得50个候选分子
💬 留言