基于生存分析的NIPT时间选择与胎儿异常判定
作者:余明羲,叶子枫,吴悠
日期:2025年9月7日
摘要
无创产前检测(NIPT)因其高准确性和低风险在胎儿染色体异常筛查中得到广泛应用,但检测结果受孕周、BMI、胎儿性别等多因素影响。本文基于大规模孕妇检测数据,构建数学模型,系统分析了孕妇特征与胎儿性染色体浓度的关系,并提出数据驱动的BMI分组与最佳检测时点推荐方法。
对于问题一,本文首先对Y染色体浓度、孕周、BMI等数据进行了数据预处理和可视化分析,为后续模型的建立提供可靠的数据支持。随后,由于各变量均未通过Shapiro–Wilk检验,使用**Spearman偏相关系数**分析Y染色体浓度与孕妇各生理指标间的相关性。为验证并量化影响,进一步采用**多元线性回归模型**,用**最小二乘法**拟合得到最终的模型,并进行残差分析以及显著性检验,计算得到决定系数 $R^2=0.045$,F统计量的p值为 $1.66\times10^{-8}$,表明整个回归模型是高度统计显著的。
对于问题二,本文采用结合 B样条 的半参数Cox比例风险模型 ,并为每个个体预测其95%成功率的达标孕周(Q95时间点)。由于临床数据删失严重,本文将每位孕妇的纵向观测数据转化为生存分析中的标准区间数据,并使用了多重插补(MI)缓解Cox比例风险模型无法处理的 左删失 问题。然后,本文利用保序回归 对此“BMI-Q95预测周数”关系进行单调化处理 ,以此为监督信号,通过回归树 算法寻找最优的BMI分组切点,得到4个分组,并结合剪枝与最小样本量等策略保证分组的稳健性。最后,在每个分组内,使用Kaplan–Meier 方法估计生存曲线和推荐检测时间点(t95),通过敏感性分析确认了模型在面对测量误差时的稳健性与可靠性。
对于问题三,为综合考虑多种因素对Y染色体浓度达标比例的影响,本文在问题二的生存分析框架下的标准区间数据的基础上,拟合包含BMI、年龄等协变量的区间删失 加速失效时间模型 (AFT),最佳分布为log-normal 。随后,本文对每个个体的首次达标时间分布进行初步估计,得到个体层的精确分布参数与分位点(Q90/Q95)及 π(25)=P(T≤25)。与问题二类似,为最大化组间差异,本文采用回归树 进行分组;在得到的3个分组内,采用基于 AFT 条件分布的多重插补(MI) ,再结合KM 方法估计生存曲线和推荐检测时间点(t95),当 KM 结果不可靠时则回退至 AFT 模型的组内中位预测。
对于问题四,本文首先完成特征工程以增强原始数据。随后,考虑到临床上漏诊付出代价较高,本文以 最小化非对称临床代价函数 为唯一优化目标,构建了端到端集成学习模型 。该方法融合了XGBoost 和核SVM 两种模型的优势,并通过100次K折交叉验证 对所有超参数进行联合寻优。在验证集上,训练得到的最佳模型的临床代价为297 ,AUC 分数为 0.8216 ,表明模型具有良好的区分与预测能力。最后,本文使用SHAP 分析各特征的特征重要性并将结果可视化,增强了模型的可解释性及临床应用能力。
关键词: 多元线性回归, Cox模型, AFT模型, KM模型, 回归树, 集成学习模型
问题重述
问题背景
无创产前检测(NIPT,Non-invasive Prenatal Testing)是一种用于筛查胎儿染色体异常的技术,其通过采集母体血液、检测其中胎儿的游离DNA片段、分析胎儿染色体是否存在异常三个步骤来鉴定胎儿的健康状况,近年来在产前检测领域得到了广泛应用。NIPT能够在早期检测出唐氏综合征、爱德华氏综合征、帕陶氏综合征等胎儿染色体异常情况,这三种综合征分别对应胎儿21号、18号、13号染色体的异常。该方法具有较高的准确性和较低的风险,是临床上推荐使用的产前筛查方法。
尽管NIPT在临床中表现出较高的精确度,其结果仍受多种因素影响,尤其是胎儿性染色体浓度的测定——男胎的Y染色体或女胎的X染色体的游离DNA片段的测定比例。此外,不同孕妇的孕期、体重指数(BMI)以及胎儿的性别等因素都可能对检测结果产生影响。其中对于男胎,由于其携带Y染色体,其浓度的变化受到孕妇孕周数和BMI等因素的显著影响。
为了提高NIPT检测的准确性并降低潜在风险,临床实践中通常根据孕妇的BMI值对其进行分组,并根据不同群体的特点选择最佳的检测时点。然而,由于每个孕妇的个体差异,简单的经验分组并不能适用于所有情况。因此,如何在不同BMI组别中合理选择检测时点,确保检测准确性并尽量减少胎儿异常风险,是一个亟待解决的问题。
本研究旨在通过数学建模,分析胎儿性染色体浓度与孕妇的孕期、BMI等因素的关系,探讨如何根据不同孕妇群体的特点,优化NIPT的检测时点,以提高检测的准确性并降低早期发现异常所带来的潜在风险。同时,本文将考虑不同因素对检测误差的影响,以制定更加科学合理的分组策略和检测时点选择标准。
问题要求
本题以无创产前检测(NIPT)为背景,利用附件提供的孕妇检测数据,包括孕周、BMI、身高、体重、染色体浓度、Z值、GC含量等数据,用来研究胎儿染色体浓度的变化规律及其影响因素,建立数学模型,分析胎儿性染色体浓度与孕妇特征的关系,并确定不同孕妇群体的最佳检测时点,此外再针对女胎提出异常判定方法,旨在为临床优化NIPT检测策略提供科学依据。
问题1: 依据表格有关男胎的Y染色体浓度数据和孕妇特征数据,研究Y染色体浓度与孕周、BMI等变量的相关性,建立定量关系模型,并对模型进行显著性检验,验证关系是否可靠。
问题2: BMI是影响男胎Y染色体浓度最早达标时间的主要因素,需要按BMI对男胎孕妇进行分组并确定最佳检测时点(浓度首次大于等于4%),从而最小化潜在风险,并分析检测误差影响。
问题3: 在问题二中BMI分组的基础上,综合考虑身高、体重、年龄等多种因素,以及检测误差和达标比例(浓度达到或超过4%的比例),再结合男胎孕妇的BMI,适当调整分组情况以及每组的最佳NIPT时点,使得孕妇潜在风险达到最小化,并分析检测误差对结果的影响。
问题4: 针对女胎,为了明确女胎异常的判定方法,需结合21号、18号、13号染色体的Z值、GC含量、读段数及其比例、X染色体相关指标、孕妇BMI等特征,输出一个可用于判定女胎是否异常的模型或规则。
问题分析
问题一分析
针对问题一,本文选取孕妇的检测孕周、BMI、年龄三个指标,并对其与Y染色体浓度的偏相关性进行分析。首先通过Shapiro–Wilk检验(W检验)确认了所有变量均不服从正态分布,因此本文选用非参数的**Spearman偏相关分析**替代Pearson偏相关分析,在控制其他变量后,即可得出Y染色体浓度与孕妇各特征指标的偏相关性正负情况。其次,为验证并量化影响,进一步采用**多元线性回归模型**,并使用**最小二乘法**来寻找最佳的系数估计值,由此得到最终的拟合模型。最后,对该回归模型进行评估,计算 $R^2$ 与F统计量的p值,分别评判该模型的可解释性与显著性。
问题二分析
针对问题二,需要基于孕妇BMI对其分组,并为其推荐一个能以高概率成功检测到Y染色体的最优孕周。由于数据上严重的左删失 问题,本文将每位孕妇的纵向观测数据转化为生存分析框架 下的标准区间数据,明确标识出删失类型,并采用了多重插补(MI)技术,对删失区间内的真实达标时间进行合理的随机插补,以构建可供分析的完整数据集。然后,拟合结合 B样条 的半参数Cox比例风险模型 ,以灵活捕捉BMI对达标风险的非线性影响,并为每个个体预测其95%成功率的达标孕周(Q95时间点)。为确保业务逻辑的合理性,本文利用保序回归 对此“BMI-Q95预测周数”关系进行单调化处理 ,然后以此为监督信号,通过CART回归树 算法以数据驱动的方式寻找最优的BMI分组切点,并结合剪枝与最小样本量等策略保证分组的稳健性。最后,在每个分组内,使用Kaplan–Meier 方法估计生存曲线和推荐时间点(t95),再经跨组保序和半周圆整后得出最终推荐时点。为了分析检测误差对结果的影响,本文通过“模糊阈值”分析与“加噪蒙特卡洛 ”分析两种独立的敏感性分析来评估模型的稳健性,最终确认了模型在面对测量误差时的稳健性与可靠性。
问题三分析
针对问题三,需要在问题二的基础上考虑多种因素(如身高、体重、年龄等)的影响,按孕妇BMI对其进行分组,并为每组推荐一个最佳检测时点。在问题二的生存分析框架下的标准区间数据的基础上,首先拟合包含BMI、年龄等协变量的**区间删失加速失效时间模型**(AFT),比较得出最佳分布为log-normal,对每个个体的首次达标时间分布进行初步估计,得到个体层的精确分布参数与分位点(Q90/Q95)及 π(25)=P(T≤25)。其次,基于AFT模型的预测结果,利用CART回归树以 π(25) 为监督信号(最小化SSE)进行分组,以最大化组间差异;在每个分组内,为在高删失背景下稳健估计组内生存曲线,采用基于AFT条件分布的**多重插补(MI)**,再结合KM方法估计生存曲线和推荐时间点(t95),当KM结果不可靠时则回退至AFT模型的组内中位预测。此外,与问题二同理,本文通过“模糊区间”分析与“加噪蒙特卡洛”分析两种独立的敏感性分析来评估模型的稳健性,最终确认了模型在面对测量误差时的稳健性与可靠性。
问题四分析
针对女胎非整倍体异常的判定问题,核心挑战在于假阴性(漏诊)的临床代价远高于假阳性(误诊)。为应对此挑战,本文在进行特征工程 后,构建了一个以最小化非对称临床代价函数 为唯一优化目标的端到端集成学习模型 。该方法融合了XGBoost 和核SVM 两种模型的优势,并通过100次5折交叉验证对包括模型参数、集成权重和分类阈值在内的所有超参数进行联合寻优。此策略确保了最终模型的所有决策都精确地服务于“不惜代价降低漏诊率”这一核心临床需求,而非追求传统的准确率或AUC指标,从而在经过数据质控筛选后的高置信度样本上,实现了临床效用最大化的智能诊断。
符号说明
| 符号 | 说明 | 单位 |
| --- | --- | --- |
| $Y$ | Y染色体浓度 | / |
| $X_1$ | 孕妇检测孕周 | 周 |
| $X_2$ | 孕妇BMI值 | $kg/m^2$ |
| $X_3$ | 孕妇年龄 | 岁 |
| $\beta$ | 回归系数 | / |
| $\epsilon$ | 随机误差 | / |
| $T_i$ | 第 $i$ 位受试者的达标时间 | 周 |
| $L_i$ | 达标时间左边界阈值 | 周 |
| $R_i$ | 达标时间右边界阈值 | 周 |
数据分析
本题旨在探究孕妇的各项生理指标(如孕周、年龄、BMI)与胎儿游离性染色体浓度之间的关系。数据集包含了数千份胎儿的母体血浆样本检测记录。本题所提供的孕妇一系列特征数据具有明确的分布特点及规律,本节以男胎检测数据为例进行分析。
首先,从妊娠方式来看,绝大多数孕妇为自然受孕(98.5%),通过体外受精(IVF)或宫腔内人工授精(IUI)等辅助生殖技术受孕的孕妇比例极低,表明样本主要代表了普遍的自然受孕人群。其次,在胎儿健康状况方面,96.5%的胎儿被记录为健康,这为后续分析提供了一个相对均质的基线。
年龄、BMI、孕周以及Y染色体浓度等指标的分布揭示了样本的关键构成。通过分布柱状图可知,在年龄分布上,25-30岁年龄段的孕妇构成了最大的群体(超过400人),其次是30-35岁年龄段,整体呈现以青壮年育龄女性为主的典型分布。然而,在BMI方面,数据显示出本研究孕妇的一个显著特点:绝大多数孕妇(近770人)的BMI在27-37之间,少部分低于27或高于37。这表明本研究的样本群体主要由高BMI孕妇构成,为深入探究BMI对检测指标的影响提供了充足的数据支持。检测孕周无论按周还是按天排列,都呈现出明显的多峰形态,峰值大致出现在90天(约13周)、110天(约16周)和150天(约21周)附近。这表明样本的采集并非在孕期内均匀分布,而是集中在几个关键的临床检查时间点。作为本研究的核心因变量,Y染色体浓度的分布呈现出典型的严重右偏态。绝大多数样本的浓度值集中在较低的区间(0.05-0.10),仅有少数样本具有非常高的浓度值。X染色体的浓度分布呈现较标准的正态分布,峰值在0.05左右,但此数据对本题研究男胎的帮助不大。
此外,本文对不同指标的特殊值进行了整理与探究。通过四张箱线图,直观地揭示了研究孕妇队列中关键变量的分布特征。它表明,该研究的样本主要由年龄集中在26-32岁的青壮年女性构成,但一个显著的特点是孕妇的身体质量指数(BMI)普遍偏高,且存在大量高值异常点。更重要的是,作为核心指标的Y染色体浓度呈现出典型的严重右偏态分布,即绝大多数样本的浓度值都集中在较低的区间,仅有少数样本具有非常高的浓度。
最后,从数据采集的时间趋势来看,从2023年1月至2024年5月,每月的检测数量呈现出一定的周期性波动,峰值出现在2023年春夏季。同时,孕妇的平均BMI在不同月份间也存在小幅波动,但未显示出与检测量同步的明确趋势。这些时间维度的信息有助于理解数据采集的背景,并评估潜在的时间混杂效应。此外,对孕妇的检测抽血次数分析显示,绝大多数孕妇仅进行1-2次检测,也为模型的构建提供了数据结构信息。
数据预处理
本题所提供的数据存在格式不统一、读取不方便、数据不合理等问题,在建模之前,需要先进行数据预处理,以确保分析结果更加科学、合理,模型建构更加稳定。本文对原始数据集进行了系统化的调整,包括对日期格式的统一、孕周格式的转换以及对无效数据的剔除。该流程旨在提升数据的一致性、完整性与可用性,从而减少噪声与偏差对模型性能的干扰。
日期格式的统一
原始数据中,不同的日期存在格式不统一的情况——"末次月经"的数据的年月日被“/”分开,而“检测日期”的数据的年月日被直接拼接在一起。为了读取数据更方便,本文将日期字段的数据都改成了“某年某月某日”的格式。具体示例如下:
日期格式统一示例

| 原始数据 | 改后数据 |
| --- | --- |
| 2023/5/20 或 20230520 | 2023年5月20日 |
孕周格式的转换
附件表格中的“检测孕周”字段的数据存在“周+天”的混合表示。本文将其全部换算成天数,方便数据的读取与比较。举例如下:
孕周格式转换示例:例如,记作“13w+6”(即13周又6天)的检测孕周换算为 $13\times 7+6=97$ 天。
初步的统计分析显示:样本覆盖的孕周主要集中在12周至20周之间,中位孕周约为14周;孕妇年龄分布广泛,平均年龄约31岁;BMI的中位数为 $23.5\,kg/m^2$,但存在部分高BMI($>30\,kg/m^2$)的样本,呈现右偏态分布。关键指标Y染色体浓度的原始数值分布极不均匀,同样呈现明显的右偏态,说明大部分样本浓度值较低,少数样本浓度非常高,为后续的统计建模提供了明确思路。
唯一比对读段数的筛选
在无创产前检测中,“唯一比对的读段数”是衡量数据有效性的重要指标,其能够唯一映射到参考基因组某一位置,反映了测序读段在参考基因组上的有效比对数量,还可有效减少因重复序列或错误比对导致的假阳性结构变异信号。本文对该指标进行了两个层面的筛选——读段数范围的界定以及读段数异常值的剔除。
(1)唯一比对读段数范围的界定
本文采用文献检索法,根据多篇方法学和临床研究,唯一比对读段数不应低于一个最低阈值,否则可能会因被检测基因片段过少而导致检测不准确。因此,依据文献内容,本文规定在检测第21号、18号、13号染色体浓度时,将 $0.15\times\text{覆盖度}\approx 3$ M条作为常用NIPT测序量的最低阈值。据此剔除掉了所有唯一比对读段数低于300万条的数据,避免因测序量不足或比对效率低导致的浓度估计偏差。
(2)唯一比对读段数异常值的剔除
附件中数据存在“唯一比对读段数”大于“原始读段数”的情况。因为前者是后者的一个子集,所以这在正常情况下是不可能发生的。因此,本文剔除了所有存在此情况的孕妇数据,一共71条。部分异常值数据如下:
唯一比对读段数异常值部分示例

| 序号 | 孕妇代码 | 原始读段数 | 唯一比对的读段数 |
| --- | --- | --- | --- |
| 690 | A169 | 2132408 | 4395037 |
| 695 | A171 | 2879248 | 3626619 |
| 696 | A171 | 3636973 | 3737311 |
| 698 | A171 | 3439440 | 3745489 |
经过上述步骤,数据集在时间格式、孕周表示及测序质量方面均实现了统一与优化。此外,本文对BMI值与身高、体重是否匹配进行了检验,发现全部匹配,无需调整。上述处理显著降低了因数据格式不一致或低质量样本引入的偏差风险,为后续的相关性分析、分组策略制定及数学模型构建奠定了坚实基础。
问题一的模型的建立和求解
数据整理与目标分析
本节旨在探究男胎样本中,哪些因素对母亲血浆中的胎儿Y染色体浓度产生影响。本文选取了孕周、孕妇BMI和孕妇年龄三个关键变量,希望评估它们与Y染色体浓度之间的独立关系。
针对预处理后的男胎检测数据,实行进一步的整理。对孕周数,仅提取周数作为数值,舍弃不满一周的天数;对需研究的四个变量,定义Y染色体浓度为 $Y$,检测孕周数为 $X_1$,孕妇BMI为 $X_2$,孕妇年龄为 $X_3$。由于需要判定变量间纯净的相关性,首先考虑使用偏相关系数进行判定。但是,标准的偏相关系数内使用的是Pearson相关系数,而使用Pearson相关分析的前提条件是所有变量均服从正态分布。经过Shapiro–Wilk检验(W检验),发现数据并不满足正态性,则放弃Pearson相关,转而采用更适合的非参数方法——Spearman偏相关分析。使用该方法,可以计算每个变量与其他所有变量之间的偏相关系数,从而分析变量之间的相关性强弱与正负。为了给出相应的关系模型,本节采用了多元线性回归,并对其显著性进行更深层次的分析,验证这三个因素对Y染色体浓度的独立影响。
为探究孕周、孕妇BMI及年龄对Y染色体浓度的独立影响,本研究构建了系统的统计分析模型。首先,通过正态性检验(Shapiro–Wilk检验)评估数据分布特性,结果显示所有关键变量均不服从正态分布。基于此,本文采用非参数的Spearman偏相关分析来衡量各变量间的单调关系强度。同时,为量化各因素的综合影响并建立预测模型,本文运用普通最小二乘法构建了多元线性回归模型,最终得到一个能够解释Y染色体浓度变化的数学方程。
正态性检验
Pearson相关系数的有效性依赖于变量服从正态分布的假设。为检验此假设,本文对四个核心变量采用Shapiro–Wilk检验进行正态性分析。
原假设 $H_0$:变量的样本数据来自于一个正态分布的总体。
备择假设 $H_1$:变量的样本数据不来自于一个正态分布的总体。
检验结果如下表:

Shapiro–Wilk检验结果

| 变量 | 检验p值 | 是否服从正态分布 |
| --- | --- | --- |
| Y染色体浓度 | <0.0001 | 否(p<0.05) |
| 检测孕周 | <0.0001 | 否(p<0.05) |
| 孕妇BMI | <0.0001 | 否(p<0.05) |
| 年龄 | <0.0001 | 否(p<0.05) |
由此可知,所有变量的p值均远小于0.05的显著性水平。因此,本文拒绝了所有变量服从正态分布的假设。由于数据不满足正态性假设,使用Pearson偏相关分析可能会导致结果的偏差。因此,本文选择Spearman偏相关分析作为替代方法。Spearman相关是基于等级的非参数检验,它不要求数据服从特定的分布,对于非线性的关系也更为稳健,是处理当前数据的更优选择。
Spearman偏相关分析
Spearman偏相关分析结合了Spearman秩相关与偏相关的思想,既能处理非正态、非线性数据,又能在分析两个变量关系时控制其他变量的干扰。它不要求数据服从正态分布,适用于像本题数据一样分布未知的数据。
(1)**数据排序**:将所有涉及的变量 $(Y,X_1,X_2,X_3)$ 的原始数据转换为等级。
(2)**计算偏相关系数**:当需要评估多个变量之间的独立性关系时,最系统的方法是使用**逆矩阵法**一次性计算所有偏相关系数。
构建相关系数矩阵:首先,本文为所有涉及的变量 $(Y,X_1,X_2,X_3)$ 创建一个零阶Spearman相关系数矩阵 $\mathbf{C}$。
$$
\mathbf{C} = \begin{pmatrix}
1 & r_{YX_1} & r_{YX_2} & r_{YX_3} \\
r_{X_1Y} & 1 & r_{X_1X_2} & r_{X_1X_3} \\
r_{X_2Y} & r_{X_2X_1} & 1 & r_{X_2X_3} \\
r_{X_3Y} & r_{X_3X_1} & r_{X_3X_2} & 1
\end{pmatrix}
$$
计算逆矩阵:计算该相关系数矩阵 $\mathbf{C}$ 的逆矩阵,记为 $\mathbf{P} = \mathbf{C}^{-1}$。
计算偏相关系数:任意两个变量 $i$ 和 $j$ 在控制了集合中所有其他变量后的偏相关系数,可以通过逆矩阵 $\mathbf{P}$ 的元素计算得出:

$$
r_{ij \cdot Z} = - \frac{p_{ij}}{\sqrt{p_{ii}\, p_{jj}}}
$$
例如,要计算Y染色体浓度($Y$)与检测孕周($X_1$)在控制了BMI($X_2$)和年龄($X_3$)后的偏相关系数 $r_{YX_1 \cdot X_2X_3}$,本文使用以下公式:

$$
r_{YX_1 \cdot X_2X_3} = - \frac{p_{YX_1}}{\sqrt{p_{YY}\, p_{X_1X_1}}}
$$
这种方法为计算多变量控制下的偏相关系数提供了一个系统性的框架。其通过对相关矩阵求逆,一次性获得在控制其他变量影响后的全部成对相关系数,计算效率高、结果结构化且对称一致,适合多变量和高维数据分析;结合Spearman秩相关矩阵时,还能兼具抗异常值和非正态分布的稳健性。
(3)**假设检验**:对计算出的每个偏相关系数进行显著性检验。
原假设 $H_0$:两个变量在控制了协变量后不相关,即偏相关系数 $\rho = 0$。
备择假设 $H_1$:两个变量在控制了协变量后存在相关性,即 $\rho \neq 0$。
p值的计算基于t分布,其统计量和自由度的计算如下:
t统计量:$t = r \sqrt{\dfrac{n - k - 2}{1 - r^2}}$;
自由度:$df = n - k - 2$,其中 $n$ 是样本量,$k$ 是控制变量的数量。
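上述逆矩阵法及其显著性检验可以用很简短的代码实现。下面给出一个示意性的Python实现(其中DataFrame的列名、函数命名均为示意性假设,并非附件中的原始字段名):

```python
import numpy as np
import pandas as pd
from scipy import stats

def spearman_partial_corr(df: pd.DataFrame) -> pd.DataFrame:
    """逆矩阵法:先取秩得到零阶Spearman相关矩阵C,再由P=C^{-1}计算全部偏相关系数。"""
    ranked = df.rank()                                  # (1) 数据排序:转换为等级
    C = np.corrcoef(ranked.to_numpy(), rowvar=False)    # 等级上的Pearson相关即Spearman相关
    P = np.linalg.inv(C)                                # (2) 逆矩阵 P = C^{-1}
    denom = np.sqrt(np.outer(np.diag(P), np.diag(P)))
    partial = -P / denom                                # r_{ij·Z} = -p_ij / sqrt(p_ii * p_jj)
    np.fill_diagonal(partial, 1.0)
    return pd.DataFrame(partial, index=df.columns, columns=df.columns)

def partial_corr_pvalue(r: float, n: int, k: int) -> float:
    """(3) 假设检验:基于t分布计算单个偏相关系数的双侧p值,k为控制变量个数。"""
    t = r * np.sqrt((n - k - 2) / (1 - r ** 2))
    return 2 * stats.t.sf(abs(t), df=n - k - 2)

# 用法示意:mat = spearman_partial_corr(df[["Y染色体浓度", "检测孕周", "孕妇BMI", "年龄"]])
```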
使用Spearman方法,计算了每个变量与其他所有变量之间的偏相关系数。结果矩阵如下:
Spearman偏相关系数矩阵(基于秩次)

| | Y染色体浓度 | 检测孕周 | 孕妇BMI | 年龄 |
| --- | --- | --- | --- | --- |
| Y染色体浓度 | 1.000000 | 0.086422 | -0.140745 | -0.095451 |
| 检测孕周 | 0.086422 | 1.000000 | 0.145435 | -0.013050 |
| 孕妇BMI | -0.140745 | 0.145435 | 1.000000 | 0.027144 |
| 年龄 | -0.095451 | -0.013050 | 0.027144 | 1.000000 |
结果显示:
(1)Y染色体浓度与检测孕周($r=0.086$):在控制了孕妇BMI和年龄后,Y染色体浓度与检测孕周存在微弱的**正相关**关系。
(2)Y染色体浓度与孕妇BMI($r=-0.141$):在控制了孕周和年龄后,Y染色体浓度与孕妇BMI存在弱的**负相关**关系。这是三个因素中相关性最强的一个。
(3)Y染色体浓度与孕妇年龄($r=-0.095$):在控制了孕周和BMI后,Y染色体浓度与孕妇年龄存在微弱的**负相关**关系。
通过对相关性系数的分析与对比,得出结论:孕周与Y染色体浓度呈正相关,在BMI和年龄相近的情况下,孕周越长,Y染色体浓度越高;孕妇BMI与Y染色体浓度呈负相关。在孕周和年龄相近的情况下,孕妇BMI越高,Y染色体浓度越低,这可能与“稀释效应”有关;孕妇年龄与Y染色体浓度呈负相关。在孕周和BMI相近的情况下,孕妇年龄越大,Y染色体浓度越低。
多元线性回归模型的建立与求解
根据上述分析,为量化孕周、孕妇BMI、孕妇年龄三个自变量与Y染色体浓度的关系模型,本文采用多元线性回归,其能够同时考虑多个自变量对因变量的影响,在控制其他因素的情况下量化各变量的独立贡献,从而更全面、准确地解释因变量的变化规律;其可通过回归系数、显著性检验和拟合优度等指标评估模型的统计可靠性与预测能力。本题中,该模型不仅能对变量间的相关性情况进行验证,还能量化其影响,更深层次地评估变量间的关联情况。
多元线性回归模型的建立
(1)模型设定
根据题意,结合模型可基本假设因变量 $Y$ 可以表示为自变量 $X_1, X_2, X_3$ 的线性组合,加上一个随机误差项 $\epsilon$。理论上,总体回归模型可以表示为:

$$
Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \epsilon_i
$$
其中:
$i$ 代表第 $i$ 个观测样本。
$Y_i$ 是第 $i$ 个样本的Y染色体浓度观测值。
$X_{i1}, X_{i2}, X_{i3}$ 分别是第 $i$ 个样本的检测孕周、孕妇BMI和年龄的观测值。
$\beta_0$ 是截距项,表示当所有自变量都为0时,$Y$ 的期望值。
$\beta_1, \beta_2, \beta_3$ 是回归系数。$\beta_j$ 表示在其他自变量保持不变的情况下,$X_j$ 每增加一个单位,$Y$ 的平均变化量。
$\epsilon_i$ 是随机误差项,代表了模型未能解释的所有其他因素对 $Y_i$ 的影响。它满足高斯-马尔可夫假设,即期望为0,方差恒定,且相互独立。
(2)模型拟合:最小二乘法
由于总体系数 $\beta_j$ 无法被直接观测到,需要通过样本数据来估计它们。本节使用**普通最小二乘法**来寻找最佳的系数估计值。该方法的目标是找到一组系数估计值 $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$,使得残差平方和(SSR)最小。样本回归模型是总体模型的估计形式:

$$
\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \hat{\beta}_2 X_{i2} + \hat{\beta}_3 X_{i3}
$$

其中 $\hat{Y}_i$ 是Y染色体浓度的拟合值或预测值。对于每个观测值,残差 $e_i$ 是观测值与拟合值之差:

$$
e_i = Y_i - \hat{Y}_i
$$

最小二乘法的目标是最小化所有残差的平方和(SSR):

$$
\text{SSR} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_{i1} - \hat{\beta}_2 X_{i2} - \hat{\beta}_3 X_{i3})^2
$$

其中 $n$ 是样本量(在本文的案例中,$n=851$)。
为了找到最小化SSR的 $\hat{\beta}_j$,需要对SSR分别求关于 $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$ 的偏导数,并令其等于0。这个过程会得到一组正规方程组。在矩阵形式下,这个过程更为简洁。假设:
1. $\mathbf{Y}$ 是一个 $n \times 1$ 的因变量观测值向量。
2. $\mathbf{X}$ 是一个 $n \times 4$ 的设计矩阵(包含一列全为1的截距项和三列自变量)。
3. $\boldsymbol{\beta}$ 是一个 $4 \times 1$ 的系数向量。
4. $\hat{\boldsymbol{\beta}}$ 是 $\boldsymbol{\beta}$ 的估计向量。

$$
\mathbf{Y} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad
\mathbf{X} = \begin{pmatrix}
1 & X_{11} & X_{12} & X_{13} \\
1 & X_{21} & X_{22} & X_{23} \\
\vdots & \vdots & \vdots & \vdots \\
1 & X_{n1} & X_{n2} & X_{n3}
\end{pmatrix}, \quad
\hat{\boldsymbol{\beta}} = \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \\ \hat{\beta}_3 \end{pmatrix}
$$
SSR可以表示为:

$$
\text{SSR} = (\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}})^T (\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}})
$$

通过求解 $\dfrac{\partial(\text{SSR})}{\partial \hat{\boldsymbol{\beta}}} = 0$,本文得到正规方程的矩阵形式:

$$
(\mathbf{X}^T \mathbf{X})\, \hat{\boldsymbol{\beta}} = \mathbf{X}^T \mathbf{Y}
$$

最终,本文可以解出系数的估计向量 $\hat{\boldsymbol{\beta}}$:

$$
\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}
$$
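正规方程的求解过程可以用如下的数值示意代码表达(Python,变量名为示意;实际建模中也可直接使用statsmodels的OLS获得同样的估计与检验结果):

```python
import numpy as np

def ols_fit(X_raw: np.ndarray, y: np.ndarray) -> np.ndarray:
    """X_raw: n x 3 自变量矩阵(检测孕周、孕妇BMI、年龄),y: n 维Y染色体浓度向量。"""
    n = X_raw.shape[0]
    X = np.column_stack([np.ones(n), X_raw])      # 设计矩阵:加入全1截距列
    # 理论解为 (X^T X)^{-1} X^T y;数值上用最小二乘求解器等价且更稳定
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat                               # [beta0, beta1, beta2, beta3]
```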
本次模型构建旨在量化检测孕周、孕妇BMI及年龄对Y染色体浓度的独立影响。为此,本文采用了双重分析策略:
首先,通过Spearman偏相关分析,在克服数据非正态分布的挑战后,确立了各变量与Y染色体浓度之间独立关系的性质与方向(正相关或负相关)。
随后,构建多元线性回归模型,其目的不仅在于验证偏相关分析的发现,更在于将这些影响进行精确量化,从而建立一个可用的预测方程。
两个模型的结果高度一致:回归系数的符号(正/负)与偏相关系数的方向完全吻合,这为结论的稳健性提供了有力支持。
多元线性回归模型的求解
具体的系数估计值:

$$
\begin{cases}
\hat{\beta}_0 = 0.1384 \\
\hat{\beta}_1\ (\text{检测孕周}) = 0.0002 \\
\hat{\beta}_2\ (\text{孕妇BMI}) = -0.0016 \\
\hat{\beta}_3\ (\text{年龄}) = -0.0010
\end{cases}
$$

将这些估计值代入样本回归方程,得到最终的拟合模型:

$$
\widehat{\text{Y染色体浓度}} = 0.1384 + 0.0002 \times (\text{检测孕周}) - 0.0016 \times (\text{孕妇BMI}) - 0.0010 \times (\text{年龄})
$$

该方程就是对Y染色体浓度进行预测的数学模型。例如,对于一个检测孕周为20周、BMI为25、年龄为30岁的孕妇,其Y染色体浓度的预测值为:

$$
\hat{Y} = 0.1384 + 0.0002 \times 20 - 0.0016 \times 25 - 0.0010 \times 30 = 0.0724
$$
这个过程清晰地展示了如何从理论模型出发,利用OLS方法和样本数据,最终得到一个可用于解释和预测的、具体的数学方程。
模型评估
本文中的多元线性回归模型的 $R^2=0.045$,表明检测孕周、孕妇BMI和年龄这三个变量能共同解释Y染色体浓度约4.5%的变异。这表明模型虽然显著,但解释力有限;F统计量的p值为 $1.66\times10^{-8}$,远低于0.05,表明整个回归模型是高度统计显著的。
本次构建的回归模型在统计学上是高度显著的。模型的F统计量对应的p值($1.66\times10^{-8}$)远小于0.05的显著性水平,这有力地证明了检测孕周、孕妇BMI和年龄这三个自变量联合起来对Y染色体浓度确实存在显著的预测关系,整个方程是成立的,其揭示的规律不太可能由随机误差所致。
从模型系数来看,所有纳入的自变量——检测孕周、孕妇BMI和年龄——均为统计上非常显著的预测因子。具体而言,检测孕周与Y染色体浓度呈显著正相关,而孕妇BMI和年龄则与其呈显著负相关。这意味着,在控制了其他变量后,孕周越长、BMI越低、年龄越轻的孕妇,其血浆中胎儿Y染色体浓度倾向于更高。
然而,模型的实际解释能力非常有限。$R^2$ 仅为0.045,表明该模型只能解释Y染色体浓度总变异的4.5%。这是一个相当低的比例,但在已有的数据下已经是较好的结果了。实际上,在模型的初步探索阶段,我们尝试了多种方法,包括但不限于决策树、XGBoost等非线性回归模型。尽管这些模型在训练集上的效果很好($R^2$ 大于0.9),但在测试集上的效果却很差($R^2$ 小于0),说明即使是功能强大的预测模型也很难在已有数据集上很好地捕捉到变量间的关系。在查找相关医学文献后,发现多元线性回归是解释Y染色体浓度与BMI等指标关系的常见模型。综合以上研究结果,本文认为,多元线性回归在现有数据集上效果虽然并不理想,但仍是解决此问题的最佳方案。数据集上诸多重要但未被捕捉的因素的缺失才是造成效果不良的根本原因。这些未被捕捉的因素可能包括复杂的生物学机制,如胎盘功能状态、母体与胎儿间的个体生物学差异、遗传背景,以及检测技术本身的波动等。
综上所述,该模型成功地识别出了几个影响Y染色体浓度的关键统计学指标及其影响方向,为理解这一生理现象提供了有价值的线索。但其较低的R平方值也明确警示本文,Y染色体浓度是一个受多重复杂因素共同调控的指标,孕妇检测孕周、孕妇BMI、孕妇年龄这三个指标对Y染色体浓度的影响的贡献很小,但显著性很强。
问题二的模型的建立和求解
为解决传统BMI分组在预测检测成功率方面区分度不足的问题,本研究构建了一套数据驱动的监督式分箱。该方法首先将问题转化为一个生存分析框架,通过区间删失方法处理纵向检测数据中“指标达标”事件发生时间的不确定性。随后,本文采用多重插补技术来应对区间删失带来的问题,并在每个插补数据集上拟合带B样条变换的Cox比例风险模型,为每个BMI值预测一个“指标达标孕周”。最后,以该预测孕周为监督信号,通过CART回归树算法以数据驱动的方式寻找最优的BMI分组切点,并结合剪枝与最小样本量等策略保证分箱的稳健性。最后,在每个分组内,使用Kaplan–Meier方法估计生存曲线和推荐时间点(t95),再经跨组保序和半周圆整后得出最终推荐时点。为了分析检测误差对结果的影响,本文通过“模糊区间”分析与“加噪蒙特卡洛”分析两种独立的敏感性分析来评估模型的稳健性,最终确认了模型在面对测量误差时的稳健性与可靠性。
数据探索性分析(EDA)
在正式构建监督式分箱模型之前,本文首先对原始数据进行了探索性分析(EDA),以评估传统BMI分组的预测能力。通过绘制BMI与关键检测指标的散点图并叠加LOWESS平滑曲线,探究BMI与关键检测指标的相关性情况,这为后续建模提供了理论依据。随后,本文将孕妇按照常规的BMI标准(例如[20,28)、[28,32)、[32,36)、[36,40)、40以上)进行分组,并对各组分别进行了KM生存分析,以比较其“指标达标”事件的发生率曲线。分析结果显示,尽管不同分组的KM曲线呈现出一定的分离趋势,但各曲线间区分度不足,甚至存在交叉现象。这一发现明确地揭示了传统分组方法无法有效且稳定地对风险进行分层,从而凸显了采用数据驱动的监督式学习方法来寻找最优风险分割点的必要性。
探索性数据分析结果
对本题所给的原始分组样例进行分析后,得出了孕妇BMI值与关键检测指标之间的关系,如下图所示。图中包含了所有数据点的散点图,以及一条LOWESS平滑拟合曲线。曲线清晰地揭示了BMI与关键检测指标之间存在负相关趋势。即随着BMI的增高,可能导致检测成功的关键生物指标浓度趋于下降,这是后续建模的理论基础。其次,基于传统的BMI分组,绘制了KM生存曲线。这里的“生存”事件可以理解为“未发生检测误差”。
图中各曲线存在一定的分离趋势,表明不同BMI分组的检测误差率确实存在差异。然而,曲线之间的分离度存在交叉,说明简单的BMI分组不足以清晰、稳定地划分风险等级,这凸显了后续采用“监督式分箱”的必要性。
阈值的设定
在本研究中,阈值 $c$ 是定义核心分析事件的基石。它代表了一个关键的临床或技术判断标准,用于判定某次检测的测量值 $y_{ij}$ 是否“达标”。具体而言,当 $y_{ij} \ge c$ 时,本文将其定义为一次成功的“命中”(hit)事件。这个看似简单的二元判定是整个建模流程的起点,它将原始的、纵向的测量值序列,转化为一个标准的生存分析问题。模型的核心目标正是预测每位孕妇的测量值首次“命中”该阈值 $c$ 所需的孕周 $T_i$。通过在时间序列上应用此规则,得以构建出区间删失数据 $[L_i, R_i]$,为后续的Cox风险建模和监督式分箱提供了必需的输入。此外,为了评估模型对该关键参数的稳健性,本文还引入了“模糊阈值”区间 $[c_\ell, c_u]$ 的概念,用以模拟阈值本身存在不确定性的情况,从而检验最终结论的稳定性。
令原始行级观测集合为 $\mathcal{R}=\{(i,j): (\mathrm{pid}_i, t_{ij}, y_{ij}, \mathrm{BMI}_i)\}$,其中 $\mathrm{pid}_i$ 是第 $i$ 位受试者,$i=1,\dots,n$;阈值 $c$ 用于判定“达标”:$\mathrm{hit}_{ij} = \mathbf{1}\{y_{ij} \ge c\}$。对每位受试者的真实达标时间记为 $T_i$,若未观测到达标则视为右删失。观测可表现为:区间删失、左删失或右删失。
给定模糊阈值区间 $[c_\ell, c_u]$,对观测序列定义如下规则:
(1)若存在最早的 $j$ 使得 $y_{ij} > c_u$,则令 $R_i=w_{ij}$,并令 $L_i$ 为 $R_i$ 之前最近一次检测时间点(若无则设为0或预先设定的下界),此时为区间删失。
(2)若全序列中没有 $y_{ij} > c_u$,则视为右删失,$L_i=\max_j w_{ij}$。
多重插补方法(MI)
由于本文无法观测到每位孕妇检测指标首次“达标”的确切孕周,只能确定它发生于某两次检测之间的时间窗口 $[L_i, R_i]$ 内,并且由于很多数据在第一次检测时就已达标,出现严重的左删失情况,因此无法直接应用标准生存模型。为解决此问题,本文采用了多重插补技术。该方法的核心思想是:与其用一个单一值(如区间中点)来估计未知的“达标”时间,不如通过重复抽样生成 $M$ 个合理的“伪”完整数据集。在每个数据集中,本文根据一个预设的概率分布为每个区间删失的样本随机赋一个“达标”时间点 $T_i$。后续的Cox风险模型将在这 $M$ 个插补数据集上分别独立运行,最后将 $M$ 次的分析结果进行汇总,从而得到一个考虑了原始数据不确定性的、更为稳健的最终估计。
为处理“不确定”的达标时间,根据区间化规则,对第 $i$ 位受试者,若存在最早满足 $\mathrm{hit}_{ik}=1$ 的索引 $k$:
(1)若存在 $j^*<k$ 使 $\mathrm{hit}_{ij^*}=0$,则定义区间删失:$L_i=w_{i,j^*},\ R_i=w_{i,k},\ \text{ctype}_i=\text{区间删失}$。
(2)若无前置负例,则视为左删失,取 $L_i = w_{\mathrm{lb}}$(常设为检测下限,例如6周),$R_i = w_{i,k},\ \text{ctype}_i=\text{左删失}$。
(3)若序列中无正例,则右删失,设 $L_i = w_{i,\max},\ R_i = +\infty,\ \text{ctype}_i=\text{右删失}$。
左删失比例与BMI值的关系如图:
然后,对每个左删失样本做 $M$ 次插补,构造完整时间样本:
(1)**均匀插补**:$T_i^{(m)} \sim \mathrm{Unif}(L_i,R_i),\ m=1,\dots,M$;
(2)**截断指数插补**:给定尺度参数 $\theta>0$,密度为

$$
f(t) = \frac{(1/\theta)\, e^{-(t-L_i)/\theta}}{1 - e^{-(R_i-L_i)/\theta}},\quad t\in(L_i,R_i),
$$

其逆变换采样为

$$
t = -\theta \ln\big(1 - U\, (1-e^{-(R_i-L_i)/\theta})\big) + L_i,\quad U\sim\mathrm{Unif}(0,1);
$$

(3)**自适应插补**:先按BMI分组计算每组左删失比例 $r_g$,若 $r_g$ 超过阈值 $\tau$(默认0.6),则对该组采用截断指数插补并调整左界(例如 $L_i\leftarrow 10$ 周)以引入向右偏的插补分布。
对右删失样本取 $T_i=L_i$ 且事件指示 $\delta_i=0$。由此得到 $M$ 个完整数据集 $\mathcal{D}^{(m)}$。
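均匀插补与截断指数插补(逆变换采样)的实现都很简短,下面给出一个Python示意(函数与参数命名为示意):

```python
import numpy as np

def uniform_impute(L: float, R: float, M: int, rng=None) -> np.ndarray:
    """均匀插补:T ~ Unif(L, R),共抽取 M 次。"""
    rng = rng or np.random.default_rng()
    return rng.uniform(L, R, size=M)

def trunc_exp_impute(L: float, R: float, theta: float, M: int, rng=None) -> np.ndarray:
    """截断指数插补:按正文给出的密度,在 (L, R) 内用逆变换法抽样。"""
    rng = rng or np.random.default_rng()
    U = rng.uniform(size=M)
    return L - theta * np.log(1.0 - U * (1.0 - np.exp(-(R - L) / theta)))

# 对每个删失个体调用上述函数,即可得到 M 个“伪”完整数据集所需的插补时间
```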
Cox模型的建立与分位时间预测
在本研究中,Cox比例风险模型是连接孕妇BMI与其检测指标“达标”事件风险的核心预测引擎。由于BMI与“达标”风险之间的关系可能并非简单的线性,本文不直接使用原始BMI值,而是采用B样条对其进行变换。B样条能将BMI灵活地表示为一组分段多项式基函数,从而有效捕捉两者间复杂的非线性模式。在经过多重插补后,本文在每个插补数据集上独立拟合一个Cox模型,该模型以BMI的B样条变换结果作为协变量,用以估计每个BMI值对应的“达标”瞬时风险。通过该模型,能为每位孕妇计算出其个体化的生存函数 $\hat{S}_i(t)$,并从中推导出关键的预测分位时间 $\hat{T}_{i,p}$,即预测该孕妇有 $p$ 概率实现指标达标的孕周。这个预测孕周最终将作为监督式学习的目标,被用于训练后续的回归树,以找出最佳的BMI风险分割点。
在每一插补集 $\mathcal{D}^{(m)}$ 上,使用BMI的样条基向量 $X_i$(B样条)拟合Cox模型:

$$
\lambda_i(t) = \lambda_0(t) \exp(\beta^\top X_i),
$$

估计后得到个体生存函数(在离散时间网格 $\mathcal{T}$ 上):

$$
\widehat S_i^{(m)}(t) = \widehat S_0^{(m)}(t)^{\exp(\beta^\top X_i)},
$$

随即对给定概率 $p$(常用 $p=0.95$),定义第 $p$ 分位预测:

$$
\widehat T_{i,p}^{(m)} = \inf\{t\in\mathcal{T}:\ \widehat S_i^{(m)}(t) \le 1-p\};
$$

跨插补聚合(本文取中位数)得到最终预测:

$$
\widehat T_{i,p} = \operatorname{median}_{m=1}^{M}\ \widehat T_{i,p}^{(m)}.
$$
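以下是在单个插补数据集上拟合“B样条(BMI)+Cox”并求取个体Q95达标孕周的一个简要示意(Python,假设使用lifelines与patsy两个库;数据框 dfm 含插补后的达标时间 T、事件指示 E 与 BMI 三列,列名与样条自由度均为示意性假设):

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from patsy import dmatrix

def fit_cox_q95(dfm: pd.DataFrame, p: float = 0.95) -> pd.Series:
    # BMI 的 B 样条基(自由度取4仅作示意),"- 1" 去掉截距列
    spline = dmatrix("bs(BMI, df=4) - 1", dfm, return_type="dataframe").reset_index(drop=True)
    data = pd.concat([spline, dfm[["T", "E"]].reset_index(drop=True)], axis=1)
    cph = CoxPHFitter().fit(data, duration_col="T", event_col="E")
    surv = cph.predict_survival_function(spline)     # 行为时间网格,列为个体的 S_i(t)

    def first_crossing(s: pd.Series) -> float:
        hit = s.index[s.values <= 1 - p]              # S_i(t) 首次降到 1-p 以下的时间
        return float(hit[0]) if len(hit) else np.inf  # 不可达时记为无穷

    return surv.apply(first_crossing, axis=0)         # 每个个体的 Q95 预测孕周
```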
曲线单调化
为保证BMI增加时预测不减少,令样本按BMI升序排列为 $x_{(1)}\le\dots\le x_{(n)}$,对应预测值为 $y_{(i)}=\widehat T_{(i),p}$。求单调不减序列 $\widetilde y_{(i)}$ 以最小化平方误差:

$$
\widetilde y = \arg\min_{\widetilde y_{(1)}\le\cdots\le\widetilde y_{(n)}} \sum_{i=1}^n \big(y_{(i)} - \widetilde y_{(i)}\big)^2.
$$

由保序回归求解,得到单调化后的预测 $y_i^{\mathrm{mono}}$。
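该单调化步骤可直接用scikit-learn的保序回归实现,示意如下(变量名为示意):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def monotonize(bmi: np.ndarray, q95_pred: np.ndarray) -> np.ndarray:
    """按BMI升序做保序回归,返回与原顺序对应的单调不减预测 y^mono。"""
    order = np.argsort(bmi)
    iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
    y_mono = np.empty_like(q95_pred, dtype=float)
    y_mono[order] = iso.fit_transform(bmi[order], q95_pred[order])
    return y_mono
```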
回归树模型与KM估计
单变量回归树是一种决策树回归模型,用于预测一个连续型目标变量,其输入只有一个特征变量。它通过递归地划分输入变量的取值区间,在每个区间内用一个常数值来进行预测。本文以单变量BMI为自变量拟合回归树,来拟合 $y^{\mathrm{mono}}$。理想上希望将BMI实轴划分为 $K$ 个区间 $\{\mathcal{I}_g\}_{g=1}^K$,使得区间内 $y^{\mathrm{mono}}$ 的方差尽可能小。
通过遍历剪枝参数,生成候选切点集合,并对候选方案施加最小宽度约束(相邻切点间距 $\ge w_{\min}$)与最小样本数约束(每叶子样本数 $\ge n_{\min}$),否则将相邻区间合并。其评价准则是优先选择最终叶子数等于目标 $K=4$ 的候选,并在这些候选中最小化MAE:

$$
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^n \big| y_i - \widehat y_{g(i)} \big|,
$$

其中 $\widehat y_{g}$ 为组内中位数。
在最终分组下对每组做Kaplan–Meier(KM)估计,并估计第 $p$ 分位 $t_{g,p}$。
KM估计是一种用于生存分析的非参数统计方法,主要用于估计某个事件发生前的时间分布。它不依赖于生存时间的特定分布假设,适用于各种类型的生存数据;它能有效处理右删失样本,且可用于比较不同组的生存差异。
对最终分箱的每组 $g$ 进行KM估计 $\widehat S_g(t)$,并求组内第 $p$ 分位:

$$
t_{g,p} = \inf\{t: \widehat S_g(t) \le 1-p\}.
$$

在MI环境下,对每个插补集分别计算 $t_{g,p}^{(m)}$,再跨插补聚合得到最终推荐。
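分箱与组内KM估计可分别借助scikit-learn的回归树与lifelines实现,下面是一个简化示意(假设 bmi、y_mono 为前文得到的数组,dfm 为某一插补数据集且含列 T、E、BMI;正文中的最小区间宽度、剪枝与跨组保序等约束此处从略):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from lifelines import KaplanMeierFitter

def bmi_cutpoints(bmi: np.ndarray, y_mono: np.ndarray, k: int = 4, min_leaf: int = 30):
    """以单变量回归树拟合 y_mono,取内部节点阈值作为BMI切点(共 k-1 个)。"""
    tree = DecisionTreeRegressor(max_leaf_nodes=k, min_samples_leaf=min_leaf)
    tree.fit(bmi.reshape(-1, 1), y_mono)
    return sorted(t for t in tree.tree_.threshold if t > 0)   # 叶结点的阈值为-2,被过滤

def group_t95(dfm: pd.DataFrame, cuts, p: float = 0.95) -> dict:
    """在给定切点下分组,并用KM估计各组的第 p 分位推荐时点 t95。"""
    labels = np.digitize(dfm["BMI"].to_numpy(), bins=cuts)
    out = {}
    for g, sub in dfm.groupby(labels):
        km = KaplanMeierFitter().fit(sub["T"], event_observed=sub["E"])
        sf = km.survival_function_.iloc[:, 0]
        hit = sf.index[sf.values <= 1 - p]
        out[g] = float(hit[0]) if len(hit) else np.nan        # 不可达时返回NaN
    return out
```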
回归树分组与KM估计结果
利用单变量回归树模型,得到最佳BMI切点数值及各组孕妇的最佳检测时点。
最佳BMI分组及检测时点

| BMI组别 | 人数 | 最佳检测周数 |
| --- | --- | --- |
| 29.0及以下 | 32 | 18.0 |
| [29.0, 31.1) | 89 | 18.5 |
| [31.1, 33.2) | 68 | 19.5 |
| 33.2及以上 | 68 | 23.0 |
对此分组结果重新绘制KM生存曲线,如下图所示:
与原始分组的KM曲线相比,此图中的各组曲线分离得更为清晰、层次分明,高风险组的“无误差率”显著低于低风险组。这证明本文的算法成功地找到了能最大化风险差异的BMI阈值。
模型评估
log-rank对数秩检验
在确定分组后,为量化各组别之间的区分度,本文还对相邻两个风险组的KM生存曲线进行了对数秩检验,得到其p值。如下表所示:
对数秩检验p值

| 组别比较 | $p_m$ | $p_f$ |
| --- | --- | --- |
| 29.0及以下 与 [29.0, 31.1) | 0.349 | 0.638 |
| [29.0, 31.1) 与 [31.1, 33.2) | 0.168 | 0.302 |
| [31.1, 33.2) 与 33.2及以上 | 0.093 | 0.051 |
由此可知,p值并非极小,这表明相邻风险组之间在KM生存曲线上的差异并非极其显著,有偶然因素掺杂。
敏感性分析
为探究检测误差对分组结果和最佳 NIPT时点的影响,本文分别进行了两个分析实验,分别测量噪声和模糊阈值对结果的影响。
首先是测量噪声的蒙特卡洛模拟。在每次模拟 $b=1,\dots,B$ 中,对所有行级观测加入独立同分布噪声:

$$
\tilde y_{ij}^{(b)} = y_{ij} + \varepsilon_{ij}^{(b)},\qquad \varepsilon_{ij}^{(b)}\sim\mathcal{N}(0,\sigma^2),\ \sigma=0.01.
$$

对每次扰动数据,重复区间化、MI($M$ 次)、Cox预测、保序单调化、回归树分箱(目标 $K=4$)与组内KM,得到

$$
\big(\mathcal{C}^{(b)},\; t_{1,p}^{(b)},\dots,t_{K,p}^{(b)}\big).
$$

收集各次模拟结果的经验分布,以估计切点与组内推荐时点的不确定性(均值、方差、置信区间、直方图等)。
噪声实验下的小提琴图
该过程等价于研究映射

$$
\Phi:\ \{y_{ij}\} \mapsto (\mathcal{C},\; t_{1,p},\dots,t_{K,p})
$$

在加噪扰动下的分布。若映射对噪声敏感,则输出分布会显示大方差或多峰性。实验结果如图11。
图中每个切点的小提琴形状都非常狭窄,且集中在一个很小的BMI值范围内。每个风险组对应的推荐孕周同样呈现出非常集中的分布。这说明基于本文分箱模型的BMI切点和临床建议(即对不同风险的个体建议不同的复查时间)都非常稳定。这强力证明了本文找到的BMI阈值是数据内在的、稳定的结构性特征,对噪声具有高度的鲁棒性。
然后是模糊阈值实验。给定模糊阈值区间 $[0.039, 0.041]$,对观测序列定义如下规则:
1. 若存在最早的 $j$ 使得 $y_{ij} > c_u$,则令 $R_i=w_{ij}$,并令 $L_i$ 为 $R_i$ 之前最近一次检测时间点(若无则设为0或预先设定的下界),此时为区间删失。
2. 若全序列中没有 $y_{ij} > c_u$,则视为右删失,$L_i=\max_j w_{ij}$。
与精确阈值判定不同,模糊阈值只把超过上界 $c_u$ 的观测视作确定达标,而对处于 $(c_\ell, c_u]$ 的观测不直接判定为达标,从而扩大区间不确定性。
对区间样本采用一次均匀插补:

$$
T_i^{\mathrm{fuzzy}} \sim \mathrm{Unif}(L_i,R_i).
$$

然后在已给定的BMI分箱下分别计算两种规则(精确阈值 vs. 模糊阈值)得到的组内KM估计与第 $p$ 分位 $t_{g,p}^{\text{exact}}$ 与 $t_{g,p}^{\text{fuzzy}}$,以比较阈值模糊性对推荐的影响。
不同风险组的模糊阈值与精确阈值对比结果如图所示。据图分析,即使在考虑了测量误差的更苛刻条件下,不同风险组之间的差异依然显著,且组内KM估计与第 $p$ 分位 $t_{g,p}^{\text{exact}}$、$t_{g,p}^{\text{fuzzy}}$ 的差异较小。这说明模型对于输入数据的检测误差具有良好的鲁棒性。
问题三的模型的建立和求解
根据问题二对原始BMI分组的探究,可知原始分组在本题也不是最优解。由于原始数据中大量的左删失情况,MI+Cox的解决方法效果并不理想,因此本文使用了在高删失下更稳定地估计高分位的AFT模型,另外AFT和半参数的Cox一样支持BMI、年龄、IVF等协变量,符合问题三中要考虑多因素影响的要求。
模型建立
为了将BMI连续变量分割为 $k$ 个有序区间,使得组内监督目标的离散度最小,设数据集 $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^n$,其中 $x_i\in\mathbb{R}$ 为第 $i$ 个个体的BMI,$y_i\in\mathbb{R}$ 为监督目标(如 $\pi_{25}=P(T\le 25)$ 或 $t_{95}$ 等)。我们以“首次达标时间” $T$ 为生存时间,采用“区间删失 + AFT模型 + MI条件插补 + KM估计”的联合策略;在分组层面,以监督式分箱(等价于一维回归树)最小化组内目标的平方误差和,获得分割点与稳定的组间梯度。
区间删失数据处理
对个体 $i$,检测时间序列为 $\{t_{i,1},\ldots,t_{i,m_i}\}$,对应Y浓度为 $\{y_{i,1},\ldots,y_{i,m_i}\}$。首次达标阈值取 $\tau=0.04$。首次达标的区间 $[L_i,R_i]$ 构造为

$$
[L_i,R_i]=
\begin{cases}
[0,\,t_{i,j^\ast}], & \text{左删失:} y_{i,1}\ge \tau,\\
[t_{i,j^\ast-1},\,t_{i,j^\ast}], & \text{区间删失:}\exists\, j^\ast \text{ s.t. } y_{i,j^\ast-1}<\tau\le y_{i,j^\ast},\\
[t_{i,m_i},\,+\infty), & \text{右删失:}\forall j,\; y_{i,j}<\tau.
\end{cases}
$$
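按上述规则为每位孕妇构造删失区间的一个简要实现示意如下(Python,假设纵向数据 long_df 含“孕妇代码”“孕周”“Y染色体浓度”三列,列名为示意):

```python
import numpy as np
import pandas as pd

def build_interval(times: np.ndarray, ys: np.ndarray, tau: float = 0.04):
    """返回 (L, R, 删失类型):左删失 / 区间删失 / 右删失。"""
    order = np.argsort(times)
    t, y = times[order], ys[order]
    hit = np.nonzero(y >= tau)[0]
    if len(hit) == 0:
        return t[-1], np.inf, "right"        # 全程未达标:右删失
    j = hit[0]
    if j == 0:
        return 0.0, t[0], "left"             # 首检即达标:左删失
    return t[j - 1], t[j], "interval"        # 两次检测之间达标:区间删失

intervals = (long_df.groupby("孕妇代码")
             .apply(lambda g: pd.Series(
                 build_interval(g["孕周"].to_numpy(), g["Y染色体浓度"].to_numpy()),
                 index=["L", "R", "ctype"])))
```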
定义删失类型指示 $(\delta_i^{\text{left}},\delta_i^{\text{int}},\delta_i^{\text{right}})\in\{0,1\}^3$,三者互斥且至多一者为1,用于后续似然与插补。基于全体个体的识别结果,可统计三类删失的频数与比例。
区间删失对统计推断的含义
左删失仅给出 $T_i\le R_i$ 的信息;右删失仅给出 $T_i> L_i$ 的信息;区间删失给出 $L_i<T_i\le R_i$ 的信息。
在参数模型中,这三类观测贡献不同的似然项;在非参 KM 框架中,需先通过“条件抽样”将其转化为仅含右删失的样本以便估计阶梯生存曲线。
AFT 加速失效时间模型
AFT模型刻画 $\log T$ 与协变量的线性关系:

$$
\log T_i=\mathbf{x}_i^\top\boldsymbol{\beta}+\sigma\epsilon_i,\qquad \epsilon_i\overset{\text{i.i.d.}}{\sim}F_0,
$$

其中 $\mathbf{x}_i$ 含BMI、年龄与IVF类别等,$F_0$ 取自一族基准分布(本研究候选为对数正态、Weibull、对数逻辑)。令

$$
\mu_i\equiv\mathbf{x}_i^\top\boldsymbol{\beta},\quad
F_i(t)\equiv P(T_i\le t\mid \mathbf{x}_i)=F_0\!\left(\frac{\log t-\mu_i}{\sigma}\right),\quad
S_i(t)=1-F_i(t).
$$
log-normal:$F_0=\Phi$,则 $t_p=\exp\{\mu_i+\sigma \Phi^{-1}(p)\}$,$S_i(t)=1-\Phi\big((\log t-\mu_i)/\sigma\big)$。
Weibull:设形状 $k=1/\sigma$、尺度 $\lambda=\exp(\mu_i)$,$F_i(t)=1-\exp\{-(t/\lambda)^k\}$。
log-logistic:$F_i(t)=\big[1+(t/\lambda)^{-k}\big]^{-1}$。
区间删失AFT似然:
记基准分布的CDF与密度为 $F_0, f_0$。第 $i$ 个体对对数似然的贡献为

$$
\ell_i(\boldsymbol{\beta},\sigma)=
\begin{cases}
\log F_i(R_i), & \delta_i^{\text{left}}=1,\\
\log\big\{F_i(R_i)-F_i(L_i)\big\}, & \delta_i^{\text{int}}=1,\\
\log\big\{1-F_i(L_i)\big\}, & \delta_i^{\text{right}}=1,
\end{cases}
$$

总体对数似然 $\ell=\sum_i \ell_i$。用极大似然估计 $(\hat{\boldsymbol{\beta}},\hat{\sigma})$,并以 $\mathrm{AIC} = 2k-2\ell(\hat{\theta})$ 比较候选分布,$k$ 为自由参数个数。
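本文的实际求解采用R的survreg(见后文“模型求解”部分);作为对该似然结构的补充说明,这里给出对数正态AFT在区间删失数据上的负对数似然及其数值极大化的一个极简Python示意(协变量构成与初值设定均为假设,仅示意似然的三类贡献项):

```python
import numpy as np
from scipy import optimize, stats

def negloglik(params, X, L, R, ctype):
    """params = [beta_0, ..., beta_{p-1}, log_sigma];ctype 取 'left'/'interval'/'right'。"""
    beta, sigma = params[:-1], np.exp(params[-1])
    mu = X @ beta
    F = lambda t: stats.norm.cdf((np.log(np.maximum(t, 1e-8)) - mu) / sigma)
    FL, FR = F(L), F(np.where(np.isinf(R), 1e12, R))          # 右删失的 R=+inf 用大数代替
    ll = np.where(ctype == "left", np.log(FR + 1e-12),         # 左删失:log F(R)
         np.where(ctype == "right", np.log(1 - FL + 1e-12),    # 右删失:log(1 - F(L))
                  np.log(FR - FL + 1e-12)))                    # 区间删失:log(F(R)-F(L))
    return -np.sum(ll)

def fit_lognormal_aft(X, L, R, ctype):
    """极大似然估计 (beta, sigma);AIC = 2k + 2 * 最小负对数似然。"""
    x0 = np.r_[np.zeros(X.shape[1]), 0.0]
    res = optimize.minimize(negloglik, x0, args=(X, L, R, ctype), method="Nelder-Mead")
    return res.x, 2 * (X.shape[1] + 1) + 2 * res.fun
```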
个体层预测量:
分位时间 $t_{90}, t_{95}$ 由 $t_p=\inf\{t:F_i(t)\ge p\}$ 给出;log-normal情形下 $t_p=\exp\{\mu_i+\sigma \Phi^{-1}(p)\}$;固定时点达标概率为 $\pi_{25,i}=P(T_i\le 25)=F_i(25)$。
这些量既用于监督分箱的目标(如 $\pi_{25}$),也用于MI条件插补与推荐时点的兜底(AFT中位 $t_{95}$)。
基于 AFT 拟合结果的 MI 条件插补
令 $F_i$ 为AFT下个体条件CDF。对于非右删失个体,进行 $M$ 次条件抽样:

$$
T_i^{(m)}\sim
\begin{cases}
F_i^{-1}\!\big(U\cdot F_i(R_i)\big), & \text{左删失 } (0,R_i],\\
F_i^{-1}\!\big(F_i(L_i)+U\cdot(F_i(R_i)-F_i(L_i))\big), & \text{区间删失 } (L_i,R_i],
\end{cases}
\qquad U\sim \mathrm{Unif}(0,1).
$$

右删失保持不变(仅记录删失界与删失指示)。每次插补得到仅含右删失的样本,之后据此计算KM曲线与分位数。本文取 $M=200$,以插补中位数与四分位距(IQR)表征不确定性。
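基于个体对数正态条件分布 $F_i$ 的条件抽样可借助scipy的分位数函数实现,示意如下(mu_i、sigma 取自AFT拟合结果;scipy中lognorm的参数化为 s=sigma、scale=exp(mu),函数命名为示意):

```python
import numpy as np
from scipy import stats

def conditional_impute(mu_i: float, sigma: float, L: float, R: float,
                       ctype: str, M: int = 200, rng=None) -> np.ndarray:
    """对单个个体按其条件分布在删失区间内抽取 M 个达标时间;右删失不插补。"""
    rng = rng or np.random.default_rng()
    dist = stats.lognorm(s=sigma, scale=np.exp(mu_i))
    U = rng.uniform(size=M)
    if ctype == "left":                       # T 落在 (0, R]
        return dist.ppf(U * dist.cdf(R))
    if ctype == "interval":                   # T 落在 (L, R]
        FL, FR = dist.cdf(L), dist.cdf(R)
        return dist.ppf(FL + U * (FR - FL))
    return np.full(M, np.nan)                 # 右删失:保留删失信息,不生成事件时间
```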
监督式分箱(回归树)(SSE 准则)
给定分组数 $K$ 与最小叶大小 $\text{MIN\_LEAF}$,在BMI轴上搜索切点集合 $\mathcal{C}=\{c_1<\cdots<c_{K-1}\}$,最小化组内平方误差和:

$$
\min_{\mathcal{C}}\; \sum_{g=1}^K \sum_{i\in \mathcal{I}_g(\mathcal{C})}\big(y_i-\bar{y}_g\big)^2,
\qquad \bar{y}_g=\frac{1}{|\mathcal{I}_g|}\sum_{i\in \mathcal{I}_g} y_i.
$$

使用贪心策略:从全集出发,枚举可行切点,以降幅

$$
\Delta \text{SSE}=\text{SSE}_{\text{parent}}-\text{SSE}_{\text{left}}-\text{SSE}_{\text{right}}
$$

最大者为优,直至达到 $K$ 组、无显著增益($\Delta \text{SSE}<\text{MIN\_GAIN}$)或触达样本量约束;若失败则回退等频分箱。本文以 $y_i=\pi_{25,i}$ 为监督目标,$K=3,\ \text{MIN\_LEAF}=30$。
KM 估计与多重合并
对第 $m$ 次插补样本 $\{(T_i^{(m)},\delta_i^{(m)})\}$,记事件时点序列为 $\{t_j\}$,其KM估计为

$$
\widehat{S}^{(m)}(t)=\prod_{t_j\le t}\left(1-\frac{d_j^{(m)}}{n_j^{(m)}}\right),
$$

其中 $d_j^{(m)}$ 为 $t_j$ 处的事件数,$n_j^{(m)}$ 为风险集大小。多重插补的合并采用逐点中位:

$$
\widehat{S}(t)=\operatorname{median}\big\{\widehat{S}^{(1)}(t),\ldots,\widehat{S}^{(M)}(t)\big\},
$$

并记录25–75%分位带作为不确定性区间。
分位数与推荐时点
KM分位数取

$$
\hat{t}_\alpha^{\text{KM}}=\inf\{t:\widehat{S}(t)\le \alpha\},
$$

AFT分位数取个体分位的组内中位。对第 $k$ 组,推荐时点为

$$
y_k=
\begin{cases}
\text{round}\Big(\operatorname{median}\{\hat{t}_{0.05}^{\text{KM},(m)}\}_{m=1}^M/0.5\Big)\times 0.5, & \text{KM 可用};\\
\text{round}\Big(\operatorname{median}\{\hat{t}_{0.05}^{\text{AFT}}\}/0.5\Big)\times 0.5, & \text{否则},
\end{cases}
$$

其中round表示四舍五入到最接近的0.5周。该规则保证 $P(T>y_k)\approx 0.05$ 的风险控制目标。
模型求解
本节给出似然构造、参数估计、分布选择、监督分箱、MI+KM 合并、KM–AFT 对齐度与推荐生成的完整实现细节与数值结果。
区间删失 AFT 的极大似然与分布选择
以log-normal / Weibull / log-logistic为候选,采用R语言中的survreg函数做区间删失生存回归,在 $\text{Surv}(L_i, R_i, \text{type="interval2"})$ 上求解MLE,并自动剔除全NA协变量或仅单水平的分类变量。
AIC判别:结果为对数正态 AIC=337.4786、Weibull AIC=337.6242,二者接近但以log-normal为优(亦便于与scipy的参数化对齐),并由 $\hat{\mu}_i=\mathbf{x}_i^\top\hat{\boldsymbol{\beta}}$ 与 $\hat{\sigma}$ 导出个体 $t_{90}, t_{95}$ 与 $\pi_{25}=F_i(25)$。
固定时点达标概率:
在log-normal情形,

$$
\pi_{25,i}=P(T_i\le 25)=\Phi\big((\log 25-\hat{\mu}_i)/\hat{\sigma}\big),
$$

它是监督分箱的首选目标(USE_METRIC="pi_25"),能更直接反映“到25周是否已达标”的风险梯度。
监督式分箱的求解与性质
切点搜索:对BMI排序后仅在相邻样本中点处枚举切点,一次 $O(n)$ 扫描即可得到每个候选切点的SSE降幅,整体复杂度为 $O(nK)$。要求每一侧叶子样本数 $\ge \text{MIN\_LEAF}$;若本轮最佳 $\Delta\text{SSE}<\text{MIN\_GAIN}$ 则提前停止并回退等频分箱。最终得到分组结果,如下表所示。
| 组别编号 | BMI范围 |
| --- | --- |
| 0 | [20.70, 31.73] |
| 1 | [31.75, 35.63] |
| 2 | [35.67, 46.88] |
MI 条件插补与 KM 合并的实现
对每个左/区间删失样本,按个体 $F_i$ 进行条件抽样(AFT条件MI);若缺少参数则回退为区间均匀插补;重复 $M=200$ 次。
KM:每次插补对各BMI组分别拟合KM,给出中位曲线与25–75%分位带;再在统一网格 $t\in[0,26]$ 上取逐点中位合并,得到组层的 $\widehat{S}(t)$。
高分位可达性:若删失结构导致KM的 $t_{95}$ 不可达或不稳定,则回退至AFT个体 $t_{95}$ 的组内中位数,保证推荐的稳定输出。
KM分位时间估计结果:

| 组别 | $t_{95}$ 中位 | $t_{95}$ 下四分位 | $t_{95}$ 上四分位 | $t_{90}$ 中位 | $t_{90}$ 下四分位 | $t_{90}$ 上四分位 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 17.520 | 17.049 | 18.218 | 14.456 | 14.114 | 14.836 |
| 1 | / | / | / | 17.675 | 17.177 | 18.145 |
| 2 | / | / | / | / | / | / |

| 组别编号 | BMI值范围 | 推荐最佳检测点 |
| --- | --- | --- |
| 0 | 31.7及以下 | 17.5周 |
| 1 | [31.7, 35.7) | 20.5周 |
| 2 | 35.7及以上 | 23.5周 |
模型评估
目标单调性与对数秩检验
分组配对的对数秩检验(合并p值)

| 配对组 | 卡方值 | 自由度 | 合并p值 | 统计量均值 | 样本数 |
| --- | --- | --- | --- | --- | --- |
| 0 vs 1 | 585.307 | 400 | $3.948\times 10^{-9}$ | 1.511 | 200 |
| 1 vs 2 | 824.746 | 400 | $1.096\times 10^{-31}$ | 2.369 | 200 |
目标的单调性强(Spearman$(\text{BMI},\ \text{pred\_t95})=0.953$);外部生存差异(相邻组)的合并p值极显著(0 vs 1:$3.95\times 10^{-9}$;1 vs 2:$1.10\times 10^{-31}$),证明分箱有效区分了风险层级。
KM–AFT 对齐度的定义与计算
在诊断窗 $[8,24]$ 周内,令 $\widetilde{S}^{\text{KM}}(t)$ 为MI中位KM曲线,$\widetilde{S}^{\text{AFT}}(t)$ 为AFT精确中位曲线。定义

$$
\text{align\_L1}_{8\text{--}24}
=\frac{1}{|G|}\sum_{t\in G}\big|\widetilde{S}^{\text{KM}}(t)-\widetilde{S}^{\text{AFT}}(t)\big|,\qquad
\text{align\_sup}_{8\text{--}24}
=\max_{t\in G}\big|\widetilde{S}^{\text{KM}}(t)-\widetilde{S}^{\text{AFT}}(t)\big|,
$$

其中 $G=\{8, 8.1, \ldots, 24\}$ 为步长 $0.1$ 的网格。数值上,L1越小代表整体越一致,sup越小代表最坏点差异越小。对齐度与左删失率、样本量共同反映模型—数据一致性。
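两项对齐度指标的计算非常直接,示意如下(S_km、S_aft 为可在网格点上取值的函数,例如由阶梯曲线插值得到;命名为示意):

```python
import numpy as np

def alignment(S_km, S_aft, t_lo: float = 8.0, t_hi: float = 24.0, step: float = 0.1):
    """返回诊断窗内的 (align_L1, align_sup)。"""
    grid = np.arange(t_lo, t_hi + 1e-9, step)
    diff = np.abs(np.array([S_km(t) for t in grid]) - np.array([S_aft(t) for t in grid]))
    return diff.mean(), diff.max()
```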
计算结果:
对齐度:
组0:L1 0.00617,sup 0.01898(优秀)
组1:L1 0.02591,sup 0.04716(良好)
组2:L1 0.06069,sup 0.09140(一般,由于样本小且左删失严重)
敏感性分析
为了验证模型结果的稳定性,项目进行了敏感性分析,通过向原始数据注入随机噪声并重复建模过程,观察关键结果(BMI切分点和推荐孕周)的变动情况。
噪声实验下的分布小提琴图
左图展示了在数据有噪声的情况下,两个BMI切分点的分布。可以看出,第一个切分点(粉色)非常稳定,集中在31-34之间;第二个切分点(绿色)波动稍大,但仍稳定在33-36之间。这证明了BMI分组方式是稳健的。
右图展示了各组推荐孕周的分布。可以看出,低BMI组的推荐时间非常稳定(小提琴很“瘦”),而高BMI组的推荐时间不确定性更大(小提琴更“胖”),这与高BMI组样本量少、删失率高的现实情况相符。尽管如此,各组的推荐时间核心区间清晰可辨,证明了最终推荐策略的整体可靠性。
此外,针对这三个不同的BMI分组,绘制出KM生存曲线,验证其敏感性。
从图中可以看出,在所有三个组中,橙色曲线(模糊定义)都位于蓝色曲线(精确定义)的下方,这意味着在模糊定义下,达标时间似乎发生得更早。然而,两条曲线的整体形状、趋势以及置信区间的大部分是重叠的,表明虽然定义不同会导致数值上的轻微差异,但并不会从根本上改变“BMI越高,达标时间越晚”这一核心结论。
因此,这张图证明了模型结果对于达标阈值的微小波动具有良好的稳健性。
问题四的模型的建立和求解
该题的核心任务是为女胎建立一个准确的异常判定方法。与男胎不同,女胎不携带Y染色体,因此无法使用Y染色体浓度作为直接的判断依据。因此,必须综合利用其他多种生物信息学指标,如各关键染色体(13, 18, 21, X)的Z值、GC含量、测序读段数以及孕妇的BMI等个人信息,来构建一个高精度的分类模型,以判断胎儿是否存在21、18或13号染色体的非整倍体异常。
模型建立
考虑到临床应用的严肃性,模型的评估不能仅仅依赖于传统的准确率。在产前检测中,假阴性,即未能检测出实际异常的胎儿,会带来严重的临床后果而错过干预窗口,其代价远高于假阳性,即错误地将正常胎儿标记为异常。因此,本文采纳了更符合临床需求的非对称代价函数作为模型优化的最终目标。
特征工程
在将数据输入模型之前,本文执行了一系列特征工程步骤,以增强原始数据的表达能力并满足模型要求。这些步骤包括:
(1)**缺失值处理:**部分样本的“孕妇BMI”特征存在缺失值。本文采用中位数插补的方法来填充这些缺失值,以保证数据的完整性。
(2)数据清洗 :对数据进行了清洗,并根据文献以及临床经验设定了一个质量控制标准。本文仅在X染色体Z值的绝对值小于2.5的“高置信度”样本上进行后续所有操作。这一步骤排除了14个信号可能不可靠的样本,旨在构建一个在常规情况下更稳定、更可靠的模型。
(3)**交互特征构建:**为了捕捉关键变量之间可能存在的非线性协同效应,本文构建了新的交互特征。具体而言,将各染色体的Z值与X染色体浓度相乘:
$$
z\_score\_chr\_ff = z\_score\_chr \times x\_concentration, \qquad chr \in \{13, 18, 21\},
$$
这些交互特征旨在放大在高胎儿浓度下Z值的信号。
(4)**特征离散化:**观察到X染色体的Z值的绝对值在特定区间有不同的临床意义,因此对其进行分箱处理,将其转化为一个分类特征:
1. 区间:$[0, 2.5),\ [2.5, 3),\ [3, +\infty)$
2. 标签:正常(ZX)、临界(ZX)、异常(ZX)
随后,这个新生成的分类特征通过独热编码转换为多列二进制特征,以便于线性模型和树模型进行学习。
(5)**特征缩放:**由于支持向量机(SVM)对特征的尺度非常敏感,在将其输入SVM模型之前,对所有数值型特征进行了标准化处理,将每个特征 $j$ 转换为:

$$
x'_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j},
$$

其中,$\mu_j$ 和 $\sigma_j$ 分别是训练集中特征 $j$ 的均值和标准差。此步骤确保所有特征具有零均值和单位方差,避免了某些特征因尺度问题在模型训练中占据主导地位。
(6)**最终特征列表:**经过上述所有处理步骤后,最终输入到模型中的完整特征列表是:年龄,孕周,BMI,比对率,重复率,唯一比对读段数,GC 含量,13 号染色体 Z 分数,18 号染色体Z值 ,21 号染色体Z值,X染色体Z值,X染色体浓度,21 号染色体Z值与胎儿 DNA 含量的交互特征,18 号染色体Z值与胎儿DNA含量的交互特征,13号染色体Z值与胎儿DNA含量的交互特征,X 染色体 Z 值分箱后为正常(ZX) 的独热编码特征,X 染色体 Z 值分箱后为临界(ZX) 的独热编码特征,X 染色体 Z值分箱后为异常(ZX) 的独热编码特征。
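上述特征工程流程中第(1)、(3)、(4)步(缺失值插补、交互特征、Z值分箱与独热编码)的一个简化示意如下(Python,列名为示意性假设,以附件实际字段为准;SVM输入的标准化见第(5)步):

```python
import numpy as np
import pandas as pd

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # (1) 缺失值处理:BMI用中位数插补
    out["孕妇BMI"] = out["孕妇BMI"].fillna(out["孕妇BMI"].median())
    # (3) 交互特征:各染色体Z值 × X染色体浓度
    for chr_ in ["13", "18", "21"]:
        out[f"z{chr_}_ff"] = out[f"{chr_}号染色体的Z值"] * out["X染色体浓度"]
    # (4) 特征离散化:|Z_X| 按 [0,2.5)/[2.5,3)/[3,+inf) 分箱后做独热编码
    zx_bin = pd.cut(out["X染色体的Z值"].abs(),
                    bins=[0, 2.5, 3, np.inf],
                    labels=["正常(ZX)", "临界(ZX)", "异常(ZX)"],
                    right=False)
    return pd.concat([out, pd.get_dummies(zx_bin, prefix="ZX分箱")], axis=1)
```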
支持向量机(SVM)
SVM的核心思想是在特征空间中寻找一个能将不同类别样本最大程度分开的最优超平面。对于给定的训练数据集 $D = \{(x_i, y_i)\}_{i=1}^N$,其中 $x_i \in \mathbb{R}^p$ 是 $p$ 维特征向量,$y_i \in \{-1, 1\}$ 是类别标签。软间隔SVM的原始优化问题可以表示为:

$$
\min_{w, b, \xi}\ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^N \xi_i,
$$

约束条件为:

$$
y_i(w^T x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad \forall i=1, \dots, N,
$$

其中,$w \in \mathbb{R}^p$ 是超平面的法向量,$b \in \mathbb{R}$ 是偏置项,$\|w\|^2$ 是正则化项,旨在最大化几何间隔,$C > 0$ 是一个正则化超参数,用于权衡间隔大小与误分类样本的容忍度,$\xi_i$ 是松弛变量,允许部分样本不满足间隔约束。
接着,为了处理非线性可分的数据,SVM使用**核技巧**将原始特征空间映射到一个更高维的希尔伯特空间 $\mathcal{H}$,并在这个高维空间中寻找线性超平面。这是通过一个非线性映射函数 $\phi: \mathbb{R}^p \to \mathcal{H}$ 实现的。用核函数 $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ 来替代在高维空间中的点积运算,从而避免了对映射 $\phi(x)$ 的显式计算。
在本项目中,本文选用了高斯核(RBF核),其定义如下:

$$
K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2),
$$

其中,$\gamma > 0$ 是一个超参数,它定义了单个训练样本对于分类决策的影响范围,$\gamma$ 越小,影响范围越大。
通过求解原始问题的对偶问题(Dual Problem),得到最终的决策函数:

$$
f(x) = \operatorname{sgn} \left( \sum_{i=1}^N \alpha_i y_i K(x_i, x) + b \right),
$$

其中,$\alpha_i$ 是拉格朗日乘子,只有支持向量(Support Vectors)对应的 $\alpha_i$ 才非零。
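与后文集成时使用的“经过概率校准的SVM”相对应,下面给出“RBF核SVM + 概率校准”的一个构建示意(scikit-learn;C、gamma取值仅为占位,实际取值由后文的联合寻优给出):

```python
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def build_calibrated_svm(C: float = 1.0, gamma: float = 0.01):
    """标准化 + RBF核SVM,再用sigmoid方法把间隔输出校准为概率,供加权集成使用。"""
    base = make_pipeline(StandardScaler(), SVC(C=C, gamma=gamma, kernel="rbf"))
    return CalibratedClassifierCV(base, method="sigmoid", cv=5)

# 用法示意:clf = build_calibrated_svm().fit(X_train, y_train)
#           p_svm = clf.predict_proba(X_valid)[:, 1]
```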
极端梯度提升(XGBoost)
XGBoost是一种基于梯度提升决策树算法的高效、可扩展的实现。其构建的是一个由 $K$ 棵决策树组成的加法模型。对于一个样本 $x_i$,其预测值 $\hat{y}_i$ 为:

$$
\hat{y}_i = \sum_{k=1}^K f_k(x_i), \quad f_k \in \mathcal{F},
$$

其中 $\mathcal{F}$ 是所有可能的决策树组成的函数空间。
模型通过最小化一个包含损失函数和正则化项的目标函数来进行训练:

$$
\text{Obj} = \sum_{i=1}^N l(y_i, \hat{y}_i) + \sum_{k=1}^K \Omega(f_k),
$$

其中,$l(y_i, \hat{y}_i)$ 是损失函数,对于二分类问题,通常是对数损失:

$$
l(y_i, \hat{y}_i) = -[y_i \log(p_i) + (1-y_i) \log(1-p_i)],
$$

其中 $p_i = \sigma(\hat{y}_i)$,$\sigma$ 是Sigmoid函数。$\Omega(f)$ 是正则化项,用于控制模型复杂度,防止过拟合:

$$
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^T w_j^2,
$$

其中 $T$ 是树的叶子节点数量,$w_j$ 是第 $j$ 个叶子节点的分数(权重),$\gamma$ 和 $\lambda$ 是正则化超参数。
由于模型是分步迭代训练的,在第 $t$ 轮,本文旨在找到一棵树 $f_t$ 来最小化目标:

$$
\text{Obj}^{(t)} = \sum_{i=1}^N l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t).
$$

通过对损失函数进行二阶泰勒展开,可以近似得到在第 $t$ 轮需要优化的目标:

$$
\text{Obj}^{(t)} \approx \sum_{i=1}^N \Big[l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2\Big] + \Omega(f_t),
$$

其中 $g_i$ 和 $h_i$ 分别是损失函数关于 $\hat{y}_i^{(t-1)}$ 的一阶和二阶梯度。
模型集成与代价优化
本文将两个基模型的概率输出进行线性加权平均,以得到最终的集成概率:

$$
P_{\text{ensemble}}(x) = w \cdot P_{\text{XGB}}(x) + (1-w) \cdot P_{\text{SVM}}(x),
$$

其中,$P_{\text{XGB}}(x)$ 和 $P_{\text{SVM}}(x)$ 分别是XGBoost和经过概率校准的SVM模型对于样本 $x$ 的预测概率,$w \in [0, 1]$ 是分配给XGBoost模型的权重。
基于集成概率,本文使用一个分类阈值 $\tau$ 来做出最终的二分类决策:

$$
\hat{y} = \begin{cases} 1, & \text{if } P_{\text{ensemble}}(x) > \tau, \\ 0, & \text{otherwise}. \end{cases}
$$

在本题中,所有超参数的优化目标是最小化一个自定义的临床代价函数,而非传统的准确率或AUC。该函数定义为:

$$
\text{Cost} = c_{FN} \cdot \text{FN} + c_{FP} \cdot \text{FP},
$$

其中 $\text{FN}$ 和 $\text{FP}$ 分别是假阴性和假阳性的样本数量,而 $c_{FN}$ 和 $c_{FP}$ 是对应的代价权重。在最终模型中,设 $c_{FN}=15$,$c_{FP}=1$。
因此,整个自动化机器学习过程的最终优化问题是:

$$
\min_{\mathbf{H}} \left( 15 \cdot \text{FN}(\mathbf{H}) + 1 \cdot \text{FP}(\mathbf{H}) \right),
$$

其中,超参数集合 $\mathbf{H}$ 包括XGBoost的所有相关参数、SVM的参数($C, \gamma$)、集成权重 $w$ 以及分类阈值 $\tau$。代价函数中的 $\text{FN}(\mathbf{H})$ 和 $\text{FP}(\mathbf{H})$ 是在5折交叉验证(5-fold Cross-Validation)过程中,使用超参数集 $\mathbf{H}$ 所得到的假阴性和假阳性总数。
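临床代价的计算以及“给定集成权重后在验证集上搜索分类阈值”的一个简化示意如下(实际流程中权重 w、阈值 τ 与基模型参数是如上所述在交叉验证内联合寻优的,这里仅演示代价函数与阈值搜索这一环节):

```python
import numpy as np

def clinical_cost(y_true: np.ndarray, y_pred: np.ndarray,
                  c_fn: int = 15, c_fp: int = 1) -> int:
    """非对称临床代价:Cost = c_FN * FN + c_FP * FP。"""
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    return c_fn * fn + c_fp * fp

def search_threshold(y_true, p_xgb, p_svm, w: float = 0.88):
    """给定集成权重 w,在集成概率上网格搜索使临床代价最小的阈值。"""
    p_ens = w * p_xgb + (1 - w) * p_svm
    grid = np.linspace(0.01, 0.99, 99)
    costs = [clinical_cost(y_true, (p_ens > t).astype(int)) for t in grid]
    best = int(np.argmin(costs))
    return grid[best], costs[best]
```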
模型的优化过程(即寻找最优超参数集合 $\mathbf{H}$)是在全体训练样本上进行的,其目标是最小化在所有样本上定义的临床代价函数。
然而,在最终生成报告以评估最优模型的性能时,引入了一个额外的质量控制步骤,以更贴近临床实际应用。对于X染色体Z值绝对值大于等于2.5的样本,将其视为“低置信度”或需要人工复核的样本,并将其从性能评估的数据集中排除。
因此,最终报告中的所有性能指标(如AUC、分类报告、混淆矩阵等)均在满足以下条件的样本子集上计算得出:

$$
\{ (x_i, y_i) \in D \mid |z\_score\_x(i)| < 2.5 \},
$$

其中,$z\_score\_x(i)$ 是样本 $x_i$ 的X染色体Z值。这一步骤旨在评估模型在排除极端异常或不可靠的X染色体信号后,在“高置信度”样本上的真实表现。
综上所述,本项目构建了一个复杂的集成学习系统。它不仅融合了两种强大的机器学习模型,更重要的是,它的整个超参数空间(包括基模型参数和集成参数)都是为了一个明确的、与业务紧密相关的临床代价函数而进行端到端优化的,从而确保模型在实际应用中能够取得最优的效用。
模型求解
最终目标函数公式为:

$$
\text{最小化临床代价(Cost)} = 15 \times \text{FN} + 1 \times \text{FP},
$$

这个目标函数明确指出,漏诊一个异常样本的代价是误诊一个正常样本的15倍。
优化过程与结果分析
**优化目标**:最小化临床代价 $15 \times \mathrm{FN} + 1 \times \mathrm{FP}$。
**得到最佳模型**:进行100次k折交叉验证($k=5$),且每次均保证训练集不会污染验证集,得到在交叉验证中临床代价最小的模型,即最佳模型。最佳模型在验证集上的临床代价为297。
**模型综合判别能力(AUC)**:达到最低代价的这次最佳试验,其对应的AUC分数为**0.8216**,这表明模型在不依赖特定阈值的情况下,具有良好的区分异常和正常样本的总体能力。
最佳超参数组合如下:

| 超参数 | 取值 |
| --- | --- |
| ensemble_w(XGBoost权重) | 0.8768 |
| svm_C | 1.0459 |
| svm_gamma | 0.0136 |
| threshold(分类阈值) | 0.1584 |
| xgb_colsample_bytree | 0.9384 |
| xgb_gamma | 0.2413 |
| xgb_learning_rate | 0.0104 |
| xgb_max_depth | 6 |
| xgb_n_estimators | 479 |
| xgb_subsample | 0.7500 |
混淆矩阵分析
以下是在高置信度样本子集上的最佳模型的具体表现:
| | 预测为正常 | 预测为异常 |
| --- | --- | --- |
| 实际为正常 | 355 | 57 |
| 实际为异常 | 16 | 36 |
依据该矩阵可知,成功检出的异常有36例;发生漏诊的有16例,这是最关键的指标:模型未能识别出这16例异常样本,根据代价函数,这部分产生了 $16\times15=240$ 的代价;发生误诊的有57例,即模型将57例正常样本标记为需要复核的“异常”,这部分产生了 $57\times1=57$ 的代价;成功排除的有355例,模型成功地将这355例正常样本判断为正常。
为了更深入地理解模型的性能,本文对分类报告的各项指标进行解读:
| 类别 | 精确率 | 召回率 | F1分数 | 样本数 |
| --- | --- | --- | --- | --- |
| 正常 | 0.96 | 0.86 | 0.91 | 412 |
| 异常 | 0.39 | 0.69 | 0.50 | 52 |
| 总体准确率 | | | 0.84 | 464 |
针对“异常”样本:
精确率= 0.39 :在所有被模型预测为“异常”的样本中,只有 39% 是真正的异常。这意味着有较多的假阳性,这也是为了降低更昂贵的假阴性所付出的代价。
召回率 = 0.69 :这是本案例的核心指标之一。它表示在所有真实为“异常”的样本中,模型成功“召回”或识别出了其中的 69%。根据代价函数的设计,模型牺牲了一部分精确率,以换取尽可能高的召回率,从而最大程度地避免漏诊。
针对“正常”样本:
精确率 = 0.96 :在所有被模型预测为“正常”的样本中,有 96% 是真正的正常。这是一个非常高的数值,说明模型给出的“正常”判断具有很高的可靠性。
召回率= 0.86 :在所有真实为“正常”的样本中,有 86% 被模型正确识别。另外的 14% 被错误地划分为“异常”(即假阳性)。
综上所述,模型的AUC值为0.8216 ,表明其在区分异常和正常样本方面具有良好的整体能力。混淆矩阵显示,模型在464个高置信度样本中成功识别了36个异常样本,同时将355个正常样本正确分类为正常。尽管存在57个假阳性,但这是为了最大限度地减少16个假阴性所做的权衡。本文成功构建并优化了一个专门针对女胎NIPT数据异常判定的高级集成模型。该模型通过在筛选后的高置信度数据集上,端到端地学习一个15:1的非对称临床代价函数,实现了在“漏诊”和“误诊”之间的高度定制化的平衡,最终达到了297.0的最低临床代价分数。模型的召回率(69%)显著高于精确率(39%),这与设定的“不惜一切代价避免漏诊”的优化目标一致。
模型各特征的特征重要性
模型的评价
模型的优点
使用生存模型自然地处理删失数据,能提供更符合临床需求的预测结果。
通过融合SVM和XGBoost两种强大的机器学习算法,并进行端到端的自动化超参数调优,模型能够捕捉复杂的非线性关系和特征交互,获得了很高的整体判别能力(AUC=0.8216)。
分箱不是基于先验知识(如传统的“偏瘦/正常/超重”),而是完全由数据驱动,以最大化风险区分度为目标。从结果看,监督式分箱后的KM曲线分离度远优于传统分箱,证明了其有效性。
模型的缺点
多元线性回归模型无法捕捉变量之间复杂的交互作用或非线性模式,因此其预测精度通常不如更复杂的机器学习模型。
最小二乘法容易受到极端异常值的影响,导致模型参数估计产生偏差。
附录
文件列表
| 文件名 | 说明 |
| --- | --- |
| p1_literature_based_analysis.py | 文献驱动的基础数据分析(问题一) |
| p1_relationship_analysis.py | 特征相关性分析(问题一) |
| p1_xgboost_analysis.py | XGBoost建模与预测(问题一) |
| p2_bmi_supervised_binning.py | BMI变量监督分箱处理(问题二) |
| p2_eda.py | 探索性数据分析(问题二) |
| p2_noise_grouped_sensitivity_analysis.py | 分组噪声敏感性分析(问题二) |
| p2_plot_sensitivity_trends.py | 敏感性趋势可视化(问题二) |
| p2_fuzzy_interval_modeling.py | 模糊区间建模(问题二) |
| p3_aft.R | 加速失效时间(AFT)模型建模(问题三) |
| p3_bmi_group_plots.py | BMI分组可视化绘图(问题三) |
| p3_bmi_supervised_binning.py | BMI监督分箱处理(问题三) |
| p3_fuzzy_interval_modeling.py | 模糊区间建模(问题三) |
| p3_noise_grouped_sensitivity_analysis.py | 分组噪声敏感性分析(问题三) |
| p4_automl_ensemble_tuning.py | AutoML集成建模与调参(问题四) |
| p4_shap_analysis.py | SHAP模型可解释性分析(问题四) |
代码
p1_literature_based_analysis.py
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 ''' 本脚本根据指定文献的方法,分析Y染色体浓度与孕妇关键特征(年龄、BMI、孕周)之间的关系。 分析流程包括: 1. 孕妇年龄与BMI的相关性分析。 2. Y染色体浓度与孕周的相关性分析。 3. 在不同孕周分组下,校正BMI后,分析Y染色体浓度与孕妇年龄的相关性。 4. 在不同孕周分组下,校正年龄后,分析Y染色体浓度与孕妇BMI的相关性。 新增功能:在进行相关性分析前,会进行正态性检验(Shapiro-Wilk test),并根据检验结果自动选择Pearson或Spearman相关性分析。 ''' import pandas as pdimport numpy as npfrom scipy.stats import pearsonr, shapiro, spearmanrimport redef get_correlation (series1, series2 ): """ 检验两组数据的正态性,并根据结果选择合适的相关性分析方法。 如果两组数据都服从正态分布,则使用Pearson相关系数。 否则,使用Spearman等级相关系数。 """ if len (series1) < 3 or len (series2) < 3 : return np.nan, np.nan, "数据不足 (样本量<3)" shapiro_stat1, shapiro_p1 = shapiro(series1) shapiro_stat2, shapiro_p2 = shapiro(series2) alpha = 0.05 if shapiro_p1 > alpha and shapiro_p2 > alpha: corr, p_value = pearsonr(series1, series2) method = "Pearson" else : corr, p_value = spearmanr(series1, series2) method = "Spearman" return corr, p_value, methodtry : df = pd.read_csv('../男胎检测数据_filtered.csv' , encoding='gbk' )except UnicodeDecodeError: df = pd.read_csv('../男胎检测数据_filtered.csv' , encoding='utf-8' ) df['孕周' ] = pd.to_numeric(df['检测孕天数' ], errors='coerce' ) // 7 relevant_cols = ['Y染色体浓度' , '孕周' , '孕妇BMI' , '年龄' ] analysis_df = df[relevant_cols].dropna()for col in ['Y染色体浓度' , '孕妇BMI' , '年龄' ]: analysis_df[col] = pd.to_numeric(analysis_df[col])print ("--- 数据加载和预处理完成 ---" )print (f"处理后总样本数: {len (analysis_df)} " )print ("转换后的孕周(周数)描述性统计:" )print (analysis_df[['孕周' ]].describe()) print ("-" * 50 + "\n" )print ("--- 2. 孕妇年龄与BMI相关性分析 ---" ) age_bmi_grouped = analysis_df.groupby('年龄' )['孕妇BMI' ].agg(['median' , 'count' ]) age_bmi_filtered = age_bmi_grouped[age_bmi_grouped['count' ] >= 5 ] ages = age_bmi_filtered.index mi_medians = age_bmi_filtered['median' ] corr, p_value, method = get_correlation(ages, mi_medians)print (f"孕妇年龄与BMI中值的相关性分析 (样本数>=5的组):" )print (f" - 使用方法: {method} " )print (f" - 相关系数: {corr:.4 f} " )print (f" - p-value: {p_value:.4 f} " )print ("-" * 50 + "\n" )print ("--- 3. Y染色体浓度与孕周相关性分析 ---" ) week_dna_grouped = analysis_df.groupby('孕周' )['Y染色体浓度' ].agg(['median' , 'count' ]) week_dna_filtered = week_dna_grouped[week_dna_grouped['count' ] >= 5 ] weeks = week_dna_filtered.index dna_medians_by_week = week_dna_filtered['median' ] corr, p_value, method = get_correlation(weeks, dna_medians_by_week)print (f"孕周与Y染色体浓度中值的相关性分析 (样本数>=5的组):" )print (f" - 使用方法: {method} " )print (f" - 相关系数: {corr:.4 f} " )print (f" - p-value: {p_value:.4 f} " )print ("-" * 50 + "\n" )print ("--- 4. 
Y染色体浓度与孕妇年龄相关性 (校正BMI) ---" ) analysis_df['cfEB' ] = (analysis_df['Y染色体浓度' ] / analysis_df['孕妇BMI' ]) * 1000 bins = [11 , 14 , 16 , 18 , 20 , 26 ] labels = ['12-14周' , '15-16周' , '17-18周' , '19-20周' , '21-26周' ] analysis_df['孕周分组' ] = pd.cut(analysis_df['孕周' ], bins=bins, right=True , labels=labels)print ("按孕周分组,分析孕妇年龄与cfEB的相关性:" )for group_name, group_df in analysis_df.groupby('孕周分组' ): age_cfeb_grouped = group_df.groupby('年龄' )['cfEB' ].agg(['median' , 'count' ]) age_cfeb_filtered = age_cfeb_grouped[age_cfeb_grouped['count' ] >= 5 ] print (f"\n孕周组: {group_name} " ) if len (age_cfeb_filtered) < 2 : print (" - 数据不足,无法进行相关性分析。" ) continue ages_in_group = age_cfeb_filtered.index cfeb_medians = age_cfeb_filtered['median' ] corr, p_value, method = get_correlation(ages_in_group, cfeb_medians) print (f" - 使用方法: {method} " ) if pd.isna(corr): continue print (f" - 相关系数: {corr:.4 f} " ) print (f" - p-value: {p_value:.4 f} " )print ("-" * 50 + "\n" )print ("--- 5. Y染色体浓度与孕妇BMI相关性 (校正年龄) ---" ) analysis_df['cfEA' ] = (analysis_df['Y染色体浓度' ] / analysis_df['年龄' ]) * 1000 print ("按孕周分组,分析孕妇BMI与cfEA的相关性:" )for group_name, group_df in analysis_df.groupby('孕周分组' ): bmi_cfea_grouped = group_df.groupby('孕妇BMI' )['cfEA' ].agg(['median' , 'count' ]) bmi_cfea_filtered = bmi_cfea_grouped[bmi_cfea_grouped['count' ] >= 5 ] print (f"\n孕周组: {group_name} " ) if len (bmi_cfea_filtered) < 2 : print (" - 数据不足,无法进行相关性分析。" ) continue bmis_in_group = bmi_cfea_filtered.index cfea_medians = bmi_cfea_filtered['median' ] corr, p_value, method = get_correlation(bmis_in_group, cfea_medians) print (f" - 使用方法: {method} " ) if pd.isna(corr): continue print (f" - 相关系数: {corr:.4 f} " ) print (f" - p-value: {p_value:.4 f} " )print ("-" * 50 + "\n" )
p1_relationship_analysis.py
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 import pandas as pdimport statsmodels.api as smimport reimport pingouin as pgfrom scipy.stats import shapiroimport numpy as nptry : df = pd.read_csv('../男胎检测数据_filtered.csv' , encoding='gbk' )except UnicodeDecodeError: df = pd.read_csv('../男胎检测数据_filtered.csv' , encoding='utf-8' )def clean_gestational_week (gw_str ): if isinstance (gw_str, str ): match = re.search(r'\d+' , gw_str) if match : return int (match .group(0 )) try : return int (gw_str) except (ValueError, TypeError): return None df['检测孕周_cleaned' ] = df['检测孕天数' ].apply(clean_gestational_week) relevant_cols = ['Y染色体浓度' , '检测孕周_cleaned' , '孕妇BMI' , '年龄' ] analysis_df = df[relevant_cols].dropna() analysis_df['Y染色体浓度' ] = pd.to_numeric(analysis_df['Y染色体浓度' ]) analysis_df['检测孕周_cleaned' ] = pd.to_numeric(analysis_df['检测孕周_cleaned' ]) analysis_df['孕妇BMI' ] = pd.to_numeric(analysis_df['孕妇BMI' ]) analysis_df['年龄' ] = pd.to_numeric(analysis_df['年龄' ])print ("--- 正在进行特征工程 ---" ) alpha = 0.05 print ("--- 正态性检验 (Shapiro-Wilk) ---" )print ("原假设 (H0): 数据服从正态分布" )print (f"显著性水平 (alpha) = {alpha} \n" ) all_normal = True for column in ['Y染色体浓度' , '检测孕周_cleaned' , '孕妇BMI' , '年龄' ]: stat, p_value = shapiro(analysis_df[column]) print (f"变量: {column} " ) print (f" - 检验统计量: {stat:.4 f} " ) print (f" - p-value: {p_value:.4 f} " ) if p_value > alpha: print (f" - 结论: p > {alpha} ,不能拒绝原假设,数据可视为服从正态分布。" ) else : all_normal = False print (f" - 结论: p <= {alpha} ,拒绝原假设,数据不服从正态分布。" ) print ("-" * 30 )print ("\n--- 偏相关系数分析 ---" )if not all_normal: print ("*** 警告: 由于部分或全部数据未通过正态性检验,将使用Spearman方法进行相关性分析。***" ) print ("--- Spearman 偏相关系数 (基于秩次) ---" ) partial_corr_df = analysis_df[['Y染色体浓度' , '检测孕周_cleaned' , '孕妇BMI' , '年龄' ]].rank().pcorr()else : print ("--- Pearson 偏相关系数 ---" ) partial_corr_df = analysis_df[['Y染色体浓度' , '检测孕周_cleaned' , '孕妇BMI' , '年龄' ]].pcorr()print (partial_corr_df)print ("\n" ) Y = analysis_df['Y染色体浓度' ] X_engineered = analysis_df[['检测孕周_cleaned' , '孕妇BMI' , '年龄' ]] X_engineered = sm.add_constant(X_engineered) model_engineered = sm.OLS(Y, X_engineered).fit()print ("\n--- 改造后的线性回归模型摘要 ---" )print (model_engineered.summary())print ("\n" )print ("--- 结果解读 ---" ) r_squared = model_engineered.rsquared_adj f_pvalue = model_engineered.f_pvalue coefficients = model_engineered.params p_values = model_engineered.pvaluesprint (f"调整后的R平方 (Adj. R-squared): {r_squared:.4 f} " )print (f"F统计量 p值: {f_pvalue:.4 f} " )print ("\n系数及其p值:" )for var in coefficients.index: print (f" {var} : {coefficients[var]:.4 f} (p-value: {p_values[var]:.4 f} )" )print ("\n" )print (f"显著性水平 (alpha) = {alpha} " )if f_pvalue < alpha: print ("整体模型在统计上是显著的 (F检验 p-value < 0.05)。" )else : print ("整体模型在统计上不显著 (F检验 p-value >= 0.05)。" )print ("\n各系数显著性:" )for var in p_values.index: if var == 'const' : continue if p_values[var] < alpha: print (f"- 特征 '{var} ' 在统计上是显著的。" ) else : print (f"- 特征 '{var} ' 在统计上不显著。" )
p1_xgboost_analysis.py
import pandas as pd
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.inspection import PartialDependenceDisplay
import re

def main():
    print("--- 1. 开始加载和准备数据 ---")
    try:
        df = pd.read_csv('../男胎检测数据_filtered.csv', encoding='gbk')
    except UnicodeDecodeError:
        df = pd.read_csv('../男胎检测数据_filtered.csv', encoding='utf-8')

    def clean_gestational_week(gw_str):
        if isinstance(gw_str, str):
            match = re.search(r'\d+', gw_str)
            if match:
                return int(match.group(0))
        try:
            return int(gw_str)
        except (ValueError, TypeError):
            return None

    df['检测孕周_cleaned'] = df['检测孕天数'].apply(clean_gestational_week)
    relevant_cols = ['Y染色体浓度', '检测孕周_cleaned', '孕妇BMI', '年龄']
    analysis_df = df[relevant_cols].dropna()
    X = analysis_df[['检测孕周_cleaned', '孕妇BMI', '年龄']]
    Y = analysis_df['Y染色体浓度']
    print("数据加载和准备完成。")

    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False

    xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)
    param_grid = {
        'n_estimators': [100, 150, 200],
        'max_depth': [3, 4, 5],
        'learning_rate': [0.05, 0.1],
        'subsample': [0.8, 0.9],
        'colsample_bytree': [0.8, 0.9]
    }

    print("\n--- 3. 使用GridSearchCV进行自动调参 ---")
    grid_search = GridSearchCV(
        estimator=xgb_model,
        param_grid=param_grid,
        cv=5,
        scoring='r2',
        verbose=1,
        n_jobs=-1
    )
    grid_search.fit(X, Y)
    print(f"\n找到的最佳参数: {grid_search.best_params_}")
    print(f"使用最佳参数在交叉检验中的最佳R²分数: {grid_search.best_score_:.4f}")

    xgb_model_tuned = grid_search.best_estimator_
    print("\n--- 4. 最佳模型已在全部数据上完成训练 ---")
    print("最终模型已准备好用于分析。")

    print("\n--- 5. 分析特征重要性 ---")
    importances = xgb_model_tuned.feature_importances_
    feature_names = X.columns
    importance_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': importances
    }).sort_values(by='Importance', ascending=False)
    print("各特征重要性排序:")
    print(importance_df)

    plt.figure(figsize=(10, 6))
    sns.barplot(x='Importance', y='Feature', data=importance_df)
    plt.title('特征重要性排序 (正则化XGBoost)')
    plt.show()

    print("\n--- 6. 生成所有特征的部分依赖图 ---")
    features_to_plot = [0, 1, 2]
    display = PartialDependenceDisplay.from_estimator(
        xgb_model_tuned,
        X,
        features_to_plot,
        feature_names=feature_names,
        n_jobs=3,
        grid_resolution=30,
    )
    display.figure_.suptitle('所有特征对Y染色体浓度的部分依赖性 (正则化XGBoost)', size=16)
    plt.subplots_adjust(top=0.9)
    plt.show()
    print("\n分析完成。")

if __name__ == '__main__':
    main()
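需要说明的是,上述脚本虽然导入了 cross_val_score,但并未实际调用。若希望在网格调参之外再检查调参后模型的稳定性,可按下面的片段补充一次 K 折交叉验证(假设性示意:函数名 report_cv_r2 为新增示例,xgb_model_tuned、X、Y 沿用上文 main() 中的变量,并非正文已有实现)。

import numpy as np
from sklearn.model_selection import cross_val_score

def report_cv_r2(model, X, Y, k=5):
    # 对给定模型做 k 折交叉验证,汇报 R² 的均值与标准差
    scores = cross_val_score(model, X, Y, cv=k, scoring='r2', n_jobs=-1)
    print(f"{k}折交叉验证 R²: mean={np.mean(scores):.4f}, std={np.std(scores):.4f}")
    return scores

# 用法示例(在 main() 中调参完成后调用):
# report_cv_r2(xgb_model_tuned, X, Y, k=5)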
p2_bmi_supervised_binning.py
916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 """ 生存导向监督分箱(BMI)— 支持“区间删失 + 多重插补(MI)”与自适应插补、log-rank 显著性 - 固定为4组 + 可复现 + 单调化 + 稳健切分 + 切点持久化 - 支持 time-dependent AUC(scikit-survival,不装则跳过) - 新增:对左删失占比高的 BMI 段,自适应采用“向右偏重的插补(trunc_exp)并把左界抬到 10 周” - 新增:相邻 BMI 组的 log-rank 检验(基于 MI 多次插补,Fisher 法合并 p 值),导出 CSV 运行: python p2_bmi_supervised_binning.py 输出目录: ./outputs_binning """ import osimport jsonimport randomimport warningsimport numpy as npimport pandas as pdimport matplotlib.pyplot as plt plt.rcParams['font.sans-serif' ] = ['SimHei' ] plt.rcParams['axes.unicode_minus' ] = False from lifelines import CoxPHFitter, KaplanMeierFitterfrom lifelines.utils import k_fold_cross_validationfrom lifelines.statistics import logrank_testfrom patsy import dmatrixfrom sklearn.tree import DecisionTreeRegressorfrom sklearn.isotonic import IsotonicRegressiontry : from sksurv.metrics import cumulative_dynamic_auc from sksurv.util import Surv as SKSurv HAS_SK_SURV = True except Exception: HAS_SK_SURV = False warnings.warn("未安装 scikit-survival,将跳过 time-dependent AUC。pip install scikit-survival" )try : from scipy.stats import chi2 HAS_SCIPY = True except Exception: HAS_SCIPY = False warnings.warn("未安装 SciPy,将无法用 Fisher 法合并 p 值。pip install scipy" ) SEED = 42 def set_global_seed (seed: int = 42 ): os.environ["PYTHONHASHSEED" ] = str (seed) random.seed(seed) np.random.seed(seed) set_global_seed(SEED) PATIENT_CSV = "./eda_outputs/patient_level_summary.csv" RAW_CSV = "../男胎检测数据_filtered.csv" OUTDIR = "./outputs_binning" TIME_LOWER, TIME_UPPER = 10.0 , 30.0 N_TIME_POINTS = 801 P_LIST = [0.90 , 0.95 ] MAIN_P = 0.95 MONOTONE_Q90 = True REQUIRED_BINS = 4 MIN_SAMPLES_PER_BIN = 40 MIN_BIN_WIDTH = 2.0 PLOT_MIN_SAMPLES = 20 PLOT_EVEN_IF_SMALL = True TREE_BASE_MAX_DEPTH = 5 TREE_BASE_MIN_SAMPLES_LEAF = 10 TREE_RANDOM_STATE = SEED SPLINE_DF = 4 SPLINE_DEGREE = 3 COX_PENALIZER = None COX_PENALIZER_GRID = [0.0 , 0.02 , 0.05 , 0.1 , 0.2 ] CV_FOLDS = 5 USE_INTERVAL_MI = True MI_M = 20 MI_SAMPLING = "uniform" CV_ONCE_FOR_MI = True DETECTION_LOWER_BOUND = 6.0 ADAPTIVE_ENABLE = True LEFT_CENSOR_RATE_THRESHOLD = 0.6 ADAPTIVE_METHOD_HIGH = "trunc_exp" ADAPTIVE_LB_HIGH = 10.0 ADAPTIVE_M = 20 EARLY_WINDOW = (10.0 , 14.0 ) COL_PATIENT = "孕妇代码" COL_GA_DAYS = "检测孕天数" COL_DATE = "检测日期" COL_Y_CONC = "Y染色体浓度" COL_BMI = "孕妇BMI" COL_HEIGHT = "身高" COL_WEIGHT = "体重" THRESH_Y_FRAC = 0.04 def to_num (s ): return pd.to_numeric(s, errors="coerce" )def derive_week (s_days ): return to_num(s_days) / 7.0 def compute_bmi (h_cm, w_kg ): h_m = to_num(h_cm) / 100.0 w = to_num(w_kg) return w / (h_m ** 2 )def parse_y_as_fraction (s ): if s.dtype == object : s_str = s.astype(str ).str .strip() if s_str.str .contains("%" ).any (): vals = pd.to_numeric(s_str.str .replace("%" ,"" ).str .replace("," ,"" ), errors="coerce" ) return vals / 100.0 y = pd.to_numeric(s, errors="coerce" ) finite = y[np.isfinite(y)] if finite.size >= 10 : q95 = np.nanpercentile(finite, 95 ) if 1.0 < q95 <= 100.0 : return y / 100.0 return ydef safe_min (a ): a = np.asarray(a) return float (np.nanmin(a)) if np.isfinite(a).any () else np.nandef safe_max (a ): a = np.asarray(a) return float (np.nanmax(a)) if 
np.isfinite(a).any () else np.nandef build_patient_interval_from_raw (raw_csv, detection_lower_bound=DETECTION_LOWER_BOUND ): df = pd.read_csv(raw_csv) if COL_GA_DAYS not in df.columns or COL_PATIENT not in df.columns or COL_Y_CONC not in df.columns: raise ValueError("原始CSV缺少必要列(孕妇代码/检测孕天数/Y染色体浓度)。" ) df["孕周" ] = derive_week(df[COL_GA_DAYS]) if COL_DATE in df.columns: df[COL_DATE] = pd.to_datetime(df[COL_DATE], errors="coerce" ) df["Y_frac" ] = parse_y_as_fraction(df[COL_Y_CONC]) df["有效测量" ] = ~df["Y_frac" ].isna() df["达标" ] = df["有效测量" ] & (df["Y_frac" ] >= THRESH_Y_FRAC) if COL_BMI in df.columns: df["BMI_num" ] = to_num(df[COL_BMI]) else : df["BMI_num" ] = np.nan if df["BMI_num" ].isna().any () and (COL_HEIGHT in df.columns) and (COL_WEIGHT in df.columns): calc = compute_bmi(df[COL_HEIGHT], df[COL_WEIGHT]) df.loc[df["BMI_num" ].isna(), "BMI_num" ] = calc[df["BMI_num" ].isna()] sort_cols = ["孕周" ] + ([COL_DATE] if COL_DATE in df.columns else []) df = df.sort_values(sort_cols) rows = [] for pid, g in df.groupby(COL_PATIENT, dropna=False ): bmi = g["BMI_num" ].dropna().iloc[0 ] if g["BMI_num" ].notna().any () else np.nan g_valid = g[g["有效测量" ]].copy() times = g_valid["孕周" ].values hits = g_valid["达标" ].values.astype(bool ) if len (times) == 0 : L, R, ctype = np.nan, np.nan, "missing" else : pos_idx = np.where(hits)[0 ] if len (pos_idx) > 0 : first_pos_i = int (pos_idx[0 ]) neg_before = np.where(~hits[:first_pos_i])[0 ] if len (neg_before) > 0 : L = float (times[neg_before[-1 ]]) R = float (times[first_pos_i]) if R < L: L, R = R, L ctype = "interval" else : L = float (detection_lower_bound) R = float (times[first_pos_i]) if R < L: L = max (R - 1e-3 , detection_lower_bound) ctype = "left" else : last_valid_week = float (times[-1 ]) L, R, ctype = last_valid_week, np.inf, "right" rows.append({ "patient_id" : pid, "BMI" : bmi, "L" : L, "R" : R, "ctype" : ctype }) df_int = pd.DataFrame(rows) return df_intdef _rand_trunc_expon (L, R, scale=2.0 ): if not np.isfinite(R): R = TIME_UPPER U = random.random() denom = 1.0 - np.exp(-(R - L) / scale) if denom <= 1e-12 : return float ((L + R) / 2.0 ) t = -scale * np.log(1 - U * denom) + L return float (min (max (t, L + 1e-6 ), R))def sample_time_from_interval (L, R, left_lower_bound=DETECTION_LOWER_BOUND, method="uniform" ): if not np.isfinite(R): return np.nan if not np.isfinite(L): L = float (left_lower_bound) L_eff = float (L); R_eff = float (R) if R_eff <= L_eff: return float (R_eff) if method == "uniform" : return float (np.random.uniform(L_eff, R_eff)) else : return _rand_trunc_expon(L_eff, R_eff, scale=2.0 )def multiple_imputations_from_intervals (df_int, M=MI_M, method=MI_SAMPLING, left_lower_bound=DETECTION_LOWER_BOUND ): dfs = [] base = df_int.copy() keep_mask = base["ctype" ].isin(["interval" , "left" , "right" ]) base = base[keep_mask].copy() for m in range (M): rows = [] for _, r in base.iterrows(): ctype = r["ctype" ]; BMI = r["BMI" ]; pid = r["patient_id" ]; L, R = r["L" ], r["R" ] if ctype in ["interval" , "left" ] and np.isfinite(R): t = sample_time_from_interval(L, R, left_lower_bound=left_lower_bound, method=method) rows.append({"patient_id" : pid, "BMI" : BMI, "time" : t, "event" : 1 }) elif ctype == "right" and np.isfinite(L): rows.append({"patient_id" : pid, "BMI" : BMI, "time" : float (L), "event" : 0 }) else : continue df_m = pd.DataFrame(rows) df_m = df_m[(df_m["time" ] >= 6 ) & (df_m["time" ] <= 40 )] df_m["event" ] = df_m["event" ].astype(int ) dfs.append(df_m.reset_index(drop=True )) return dfsdef bins_from_cuts (x, cuts ): if 
len (cuts) == 0 : bins = [-np.inf, np.inf] else : bins = [-np.inf] + cuts + [np.inf] labels = [] for i in range (len (bins)-1 ): a, b = bins[i], bins[i+1 ] labels.append(f"[{round (a,1 ) if np.isfinite(a) else '-inf' } , {round (b,1 ) if np.isfinite(b) else '+inf' } )" ) idx = np.digitize(x, bins[1 :-1 ], right=False ) return pd.Series([labels[i] for i in idx], index=x.index), bins, labelsdef parse_label_left (label: str ) -> float : left = label.split("," )[0 ][1 :].strip() return float (left.replace("-inf" , "-1e9" ))def compute_left_censor_rates (df_int, labels_series ): groups = [] for g, idxs in labels_series.groupby(labels_series).groups.items(): sub = df_int.loc[idxs] n = len (sub) if n == 0 : rate = np.nan else : rate = float ((sub["ctype" ] == "left" ).sum ()) / n groups.append({"group" : g, "n" : n, "left_censor_rate" : rate}) res = pd.DataFrame(groups).sort_values("group" , key=lambda s: s.map (parse_label_left)).reset_index(drop=True ) return resdef multiple_imputations_adaptive (df_int, labels_series, left_rate_df, threshold=LEFT_CENSOR_RATE_THRESHOLD, M=ADAPTIVE_M, method_default=MI_SAMPLING, method_high=ADAPTIVE_METHOD_HIGH, lb_default=DETECTION_LOWER_BOUND, lb_high=ADAPTIVE_LB_HIGH ): left_rate_map = dict (zip (left_rate_df["group" ], left_rate_df["left_censor_rate" ])) dfs = [] for m in range (M): rows = [] for idx, r in df_int.iterrows(): if r["ctype" ] not in ["interval" , "left" , "right" ]: continue group = labels_series.at[idx] if idx in labels_series.index else None rate = left_rate_map.get(group, 0.0 ) use_high = (rate is not None ) and (np.isfinite(rate)) and (rate >= threshold) method = method_high if use_high else method_default lb = lb_high if use_high else lb_default pid = r["patient_id" ]; BMI = r["BMI" ]; L, R, ct = r["L" ], r["R" ], r["ctype" ] if ct in ["interval" , "left" ] and np.isfinite(R): t = sample_time_from_interval(L, R, left_lower_bound=lb, method=method) rows.append({"patient_id" : pid, "BMI" : BMI, "time" : t, "event" : 1 }) elif ct == "right" and np.isfinite(L): rows.append({"patient_id" : pid, "BMI" : BMI, "time" : float (L), "event" : 0 }) else : continue df_m = pd.DataFrame(rows) df_m = df_m[(df_m["time" ] >= 6 ) & (df_m["time" ] <= 40 )] df_m["event" ] = df_m["event" ].astype(int ) dfs.append(df_m.reset_index(drop=True )) return dfsdef km_quantile_time (durations, events, target_S ): kmf = KaplanMeierFitter() kmf.fit(durations=durations, event_observed=events) sf = kmf.survival_function_.reset_index().rename(columns={"KM_estimate" :"S" ,"timeline" :"t" }) sf = sf.sort_values("t" ) hit = sf[sf["S" ] <= target_S] return float (hit["t" ].iloc[0 ]) if len (hit) else np.nandef per_group_km_quantiles (df, group_col, p_list ): rows = [] for g, sub in df.dropna(subset=["time" ,"BMI" ]).groupby(group_col): res = {"group" : str (g), "n" : len (sub)} if len (sub) >= 20 : for p in p_list: res[f"KM_t{int (p*100 )} " ] = km_quantile_time(sub["time" ].values, sub["event" ].values, target_S=1.0 - p) else : for p in p_list: res[f"KM_t{int (p*100 )} " ] = np.nan rows.append(res) return pd.DataFrame(rows)def group_km_recommendations_from_imputations (imputed_sets, cuts_final, p=MAIN_P ): label_vals = {} for df_m in imputed_sets: labels, _, _ = bins_from_cuts(df_m["BMI" ], cuts_final) rec = per_group_km_quantiles(df_m.assign(group=labels.values), "group" , [p]) col = f"KM_t{int (p*100 )} " for _, r in rec.iterrows(): label_vals.setdefault(r["group" ], []).append(r[col]) rows = [] for g, L in label_vals.items(): arr = np.array([v for v in L if np.isfinite(v)]) 
rows.append({"group" : g, f"KM_t{int (p*100 )} _MI_med" : (float (np.median(arr)) if len (arr) > 0 else np.nan)}) out = pd.DataFrame(rows) out = out.sort_values("group" , key=lambda s: s.map (parse_label_left)).reset_index(drop=True ) return outdef predict_tp_from_cox (cph, df_feat, p=0.90 , time_grid=None ): if time_grid is None : time_grid = np.linspace(TIME_LOWER, TIME_UPPER, N_TIME_POINTS) surv = cph.predict_survival_function(df_feat, times=time_grid) target_S = 1.0 - p t_list = [] for col in surv.columns: s = surv[col].values hit = np.where(s <= target_S)[0 ] t_list.append(np.nan if len (hit) == 0 else time_grid[hit[0 ]]) return pd.Series(t_list, index=df_feat.index)def make_bmi_spline (bmi_series, bmi_center, df=SPLINE_DF, degree=SPLINE_DEGREE ): Xs = dmatrix( f"bs(BMI_centered, df={df} , degree={degree} , include_intercept=False)" , {"BMI_centered" : (bmi_series - bmi_center).values}, return_type='dataframe' ) Xs.index = bmi_series.index return Xsdef extract_tree_cuts_1d (tree: DecisionTreeRegressor ): thr = tree.tree_.threshold cuts = sorted ([float (t) for t in thr if t != -2.0 ]) return cutsdef apply_min_width (cuts, min_width=2.0 ): if not cuts: return [] kept = [cuts[0 ]] for c in cuts[1 :]: if c - kept[-1 ] >= min_width: kept.append(c) return keptdef merge_small_bins (df_labels, label_col, min_samples=40 ): labels_order = sorted (df_labels[label_col].unique(), key=parse_label_left) counts = df_labels[label_col].value_counts().to_dict() def neighbors (idx ): left = labels_order[idx-1 ] if idx-1 >= 0 else None right = labels_order[idx+1 ] if idx+1 < len (labels_order) else None return left, right changed = True while changed: changed = False smalls = [lab for lab in labels_order if counts.get(lab, 0 ) < min_samples] for lab in smalls: if lab not in labels_order: continue idx = labels_order.index(lab) L, R = neighbors(idx) if L is None and R is None : continue cL = counts.get(L, -1 ) cR = counts.get(R, -1 ) target = L if (cL >= cR and L is not None ) else (R if R is not None else L) df_labels.loc[df_labels[label_col] == lab, label_col] = target counts[target] = counts.get(target, 0 ) + counts.get(lab, 0 ) counts[lab] = 0 labels_order.remove(lab) changed = True break return df_labels[label_col]def canonicalize_from_labels (x_series, labels_series ): uniq_labels = pd.unique(labels_series) lefts = sorted (set (parse_label_left(l) for l in uniq_labels)) cuts_final = [v for v in lefts if v > -1e8 ] labels_final, bins_edges, labels_text = bins_from_cuts(x_series, cuts_final) return cuts_final, labels_final, bins_edges, labels_textdef evaluate_bins_mae (bmi, y_true, labels ): df_tmp = pd.DataFrame({"BMI" : bmi, "y" : y_true, "label" : labels}) preds = df_tmp.groupby("label" )["y" ].median().to_dict() y_hat = df_tmp["label" ].map (preds).astype(float ) mae = np.nanmean(np.abs (df_tmp["y" ].values - y_hat.values)) sizes = df_tmp["label" ].value_counts().to_dict() return mae, sizesdef fit_cox_with_cv_penalizer (cox_df, penalizer_grid, duration_col="time" , event_col="event" , k=5 ): best_pen = None best_c = -np.inf results = [] for pen in penalizer_grid: cph_try = CoxPHFitter(penalizer=float (pen)) try : cph_try.fit(cox_df, duration_col=duration_col, event_col=event_col, show_progress=False ) scores = k_fold_cross_validation( cph_try, cox_df, duration_col=duration_col, event_col=event_col, k=k, scoring_method="concordance_index" ) mean_c = float (np.mean(scores)) except Exception: mean_c = np.nan results.append((float (pen), mean_c)) if np.isfinite(mean_c) and mean_c > best_c: best_c = 
mean_c; best_pen = float (pen) if best_pen is None : best_pen = 0.0 return best_pen, resultsdef time_dependent_auc_curve (df, lower=10.0 , upper=30.0 , n_points=40 , kfold=5 , penalizer=0.05 ): if not HAS_SK_SURV: return None , None , None df = df.dropna(subset=["BMI" ,"time" ,"event" ]).copy() df = df[(df["time" ] >= 6 ) & (df["time" ] <= 40 )] if len (df) < 50 : warnings.warn("样本过少,跳过 AUC 评估。" ) return None , None , None BMI_CENTER = df["BMI" ].mean() X = dmatrix("bs(BMI_centered, df=4, degree=3, include_intercept=False)" , {"BMI_centered" : (df["BMI" ] - BMI_CENTER).values}, return_type='dataframe' ) X.index = df.index y = df[["time" ,"event" ]].copy() idx = np.arange(len (df)); np.random.shuffle(idx) folds = np.array_split(idx, kfold) times_global = np.linspace(lower, upper, n_points) auc_matrix = []; used_fold = 0 for i in range (kfold): test_idx = folds[i]; train_idx = np.setdiff1d(idx, test_idx) if len (train_idx) < 20 or len (test_idx) < 20 : continue cph = CoxPHFitter(penalizer=penalizer) train_df = pd.concat([y.iloc[train_idx].reset_index(drop=True ), X.iloc[train_idx].reset_index(drop=True )], axis=1 ) try : cph.fit(train_df, duration_col="time" , event_col="event" , show_progress=False ) except Exception as e: warnings.warn(f"Cox 拟合失败 fold {i+1 } : {e} " ); continue risk_test = cph.predict_partial_hazard(X.iloc[test_idx]).values.ravel() y_train = SKSurv.from_arrays(event=y.iloc[train_idx]["event" ].astype(bool ).values, time=y.iloc[train_idx]["time" ].values) y_test = SKSurv.from_arrays(event=y.iloc[test_idx]["event" ].astype(bool ).values, time=y.iloc[test_idx]["time" ].values) tmin = float (np.min (y.iloc[test_idx]["time" ].values)) tmax = float (np.max (y.iloc[test_idx]["time" ].values)) mask = (times_global > tmin + 1e-8 ) & (times_global < tmax - 1e-8 ) if not np.any (mask): continue times_fold = times_global[mask] try : auc_t, _ = cumulative_dynamic_auc(y_train, y_test, risk_test, times_fold) except ValueError: eps = 1e-3 mask2 = (times_global > tmin + eps) & (times_global < tmax - eps) if not np.any (mask2): continue times_fold = times_global[mask2] auc_t, _ = cumulative_dynamic_auc(y_train, y_test, risk_test, times_fold) mask = mask2 fold_vec = np.full_like(times_global, np.nan, dtype=float ) fold_vec[mask] = auc_t auc_matrix.append(fold_vec); used_fold += 1 if used_fold == 0 : warnings.warn("无可用折进行 AUC 评估。" ); return None , None , None auc_matrix = np.vstack(auc_matrix) mean_auc = np.nanmean(auc_matrix, axis=0 ) std_auc = np.nanstd(auc_matrix, axis=0 ) plt.figure(figsize=(7.5 ,4.5 )) plt.plot(times_global, mean_auc, lw=2 , label="Mean AUC(t)" ) lo = np.where(np.isfinite(mean_auc - std_auc), mean_auc - std_auc, np.nan) hi = np.where(np.isfinite(mean_auc + std_auc), mean_auc + std_auc, np.nan) plt.fill_between(times_global, np.nan_to_num(lo, nan=0.5 ), np.nan_to_num(hi, nan=1.0 ), alpha=0.2 , label="±1 SD" ) plt.ylim(0.5 , 1.0 ); plt.xlabel("孕周(周)" ); plt.ylabel("AUC(t)" ) plt.title("Time-dependent AUC(Cox + BMI 样条,K折,自动裁剪时域)" ) plt.grid(alpha=0.3 ); plt.legend() plt.tight_layout(); plt.savefig(os.path.join(OUTDIR, "auc_curve.png" ), dpi=180 ); plt.close() auc_df = pd.DataFrame({"t" : times_global, "auc_mean" : mean_auc, "auc_sd" : std_auc}) auc_df.to_csv(os.path.join(OUTDIR, "auc_curve.csv" ), index=False , encoding="utf-8-sig" ) return times_global, mean_auc, std_aucdef main (): os.makedirs(OUTDIR, exist_ok=True ) meta_path_prev = os.path.join(OUTDIR, "bmi_supervised_bins_cuts.json" ) if os.path.exists(meta_path_prev): try : with open (meta_path_prev, "r" , 
encoding="utf-8" ) as f: prev = json.load(f) print (f"[INFO] 上一次运行 seed={prev.get('seed' )} , cuts_final={prev.get('chosen' ,{} ).get('cuts_final')}" ) except Exception as e: print ("[WARN] 读取上一次 JSON 失败:" , e) df_pred = None mode_used = "exact" df_int = None if USE_INTERVAL_MI and os.path.exists(RAW_CSV): print ("[INFO] 使用区间删失 + 多重插补(MI)路径..." ) mode_used = "interval_mi" df_int = build_patient_interval_from_raw(RAW_CSV, detection_lower_bound=DETECTION_LOWER_BOUND) out_int_csv = os.path.join(OUTDIR, "patient_level_intervals.csv" ) df_int.to_csv(out_int_csv, index=False , encoding="utf-8-sig" ) print (f"[INFO] 已保存区间数据: {out_int_csv} (n={len (df_int)} )" ) set_global_seed(SEED) imputed_sets = multiple_imputations_from_intervals( df_int, M=MI_M, method=MI_SAMPLING, left_lower_bound=DETECTION_LOWER_BOUND ) print (f"[INFO] 生成多重插补数据集 M={len (imputed_sets)} " ) agg_preds = None penalizer_used = None df_pred_first = None for m_idx, df_m in enumerate (imputed_sets): if len (df_m) == 0 : continue BMI_CENTER = df_m["BMI" ].mean() X_spline = make_bmi_spline(df_m["BMI" ], BMI_CENTER, df=SPLINE_DF, degree=SPLINE_DEGREE) cox_df = pd.concat([df_m[["time" ,"event" ]].reset_index(drop=True ), X_spline.reset_index(drop=True )], axis=1 ) if COX_PENALIZER is None : if (m_idx == 0 ) or (not CV_ONCE_FOR_MI): best_pen, pen_cv = fit_cox_with_cv_penalizer( cox_df, COX_PENALIZER_GRID, duration_col="time" , event_col="event" , k=CV_FOLDS ) penalizer_used = float (best_pen) print (f"[MI {m_idx+1 } ] Cox penalizer CV: {pen_cv} -> best={best_pen} " ) else : best_pen = penalizer_used else : best_pen = float (COX_PENALIZER) if m_idx == 0 : print (f"[MI {m_idx+1 } ] Cox penalizer fixed to: {best_pen} " ) cph = CoxPHFitter(penalizer=best_pen) cph.fit(cox_df, duration_col="time" , event_col="event" , show_progress=False ) time_grid = np.linspace(TIME_LOWER, TIME_UPPER, N_TIME_POINTS) pred_cols = {} for p in P_LIST: pred_cols[f"pred_t{int (p*100 )} " ] = predict_tp_from_cox(cph, X_spline, p=p, time_grid=time_grid) df_pred_m = pd.concat([df_m.reset_index(drop=True ), pd.DataFrame(pred_cols).reset_index(drop=True )], axis=1 ) df_pred_m["BMI" ] = df_m["BMI" ].values if agg_preds is None : agg_preds = df_pred_m[["patient_id" ,"BMI" ] + [f"pred_t{int (p*100 )} " for p in P_LIST]].copy() for p in P_LIST: agg_preds[f"pred_t{int (p*100 )} _list" ] = agg_preds[f"pred_t{int (p*100 )} " ].apply(lambda x: [x]) else : tmp = df_pred_m.set_index("patient_id" ) base = agg_preds.set_index("patient_id" ) for p in P_LIST: vals = tmp[f"pred_t{int (p*100 )} " ] lst = base[f"pred_t{int (p*100 )} _list" ] base[f"pred_t{int (p*100 )} _list" ] = [ (lst.iloc[i] + [vals.iloc[i]]) if (i < len (vals)) else lst.iloc[i] for i in range (len (lst)) ] agg_preds = base.reset_index() if m_idx == 0 : df_pred_first = df_pred_m.copy() try : scores = k_fold_cross_validation(cph, cox_df, duration_col="time" , event_col="event" , k=CV_FOLDS, scoring_method="concordance_index" ) print (f"[MI {m_idx+1 } ] {CV_FOLDS} -fold C-index: mean={np.mean(scores):.3 f} , std={np.std(scores):.3 f} " ) except Exception as e: print ("[MI] 交叉验证跳过:" , e) if agg_preds is None : raise RuntimeError("多重插补生成的预测为空,请检查数据。" ) for p in P_LIST: lists = agg_preds[f"pred_t{int (p*100 )} _list" ] agg_preds[f"pred_t{int (p*100 )} " ] = lists.apply( lambda L: np.nanmedian([v for v in L if np.isfinite(v)]) if isinstance (L, list ) and len (L)>0 else np.nan ) df_pred = pd.DataFrame({ "patient_id" : agg_preds["patient_id" ], "BMI" : agg_preds["BMI" ] }) for p in P_LIST: df_pred[f"pred_t{int (p*100 )} " 
] = agg_preds[f"pred_t{int (p*100 )} " ].values df_pred["time" ] = df_pred_first["time" ].values df_pred["event" ] = df_pred_first["event" ].values else : print ("[INFO] 使用原始(把首次达标视为精确事件)的路径..." ) mode_used = "exact" if os.path.exists(PATIENT_CSV): df_pat = pd.read_csv(PATIENT_CSV) if "patient_id" not in df_pat.columns: if "孕妇代码" in df_pat.columns: df_pat = df_pat.rename(columns={"孕妇代码" :"patient_id" }) else : df_pat["patient_id" ] = np.arange(len (df_pat)) if "BMI" not in df_pat.columns and "BMI_num" in df_pat.columns: df_pat = df_pat.rename(columns={"BMI_num" :"BMI" }) else : df_int_tmp = build_patient_interval_from_raw(RAW_CSV, detection_lower_bound=DETECTION_LOWER_BOUND) exact_rows = [] for _, r in df_int_tmp.iterrows(): if r["ctype" ] in ["interval" , "left" ] and np.isfinite(r["R" ]): exact_rows.append({"patient_id" : r["patient_id" ], "BMI" : r["BMI" ], "time" : r["R" ], "event" : 1 }) elif r["ctype" ] == "right" and np.isfinite(r["L" ]): exact_rows.append({"patient_id" : r["patient_id" ], "BMI" : r["BMI" ], "time" : r["L" ], "event" : 0 }) df_pat = pd.DataFrame(exact_rows) needed = ["patient_id" ,"BMI" ,"time" ,"event" ] for k in needed: if k not in df_pat.columns: raise ValueError(f"数据缺列:{k} " ) df = df_pat[needed].copy().dropna(subset=["BMI" ,"time" ,"event" ]) df = df[(df["time" ] >= 6 ) & (df["time" ] <= 40 )] df["event" ] = df["event" ].astype(int ) BMI_CENTER = df["BMI" ].mean() X_spline_train = make_bmi_spline(df["BMI" ], BMI_CENTER, df=SPLINE_DF, degree=SPLINE_DEGREE) cox_df = pd.concat([df[["time" ,"event" ]].reset_index(drop=True ), X_spline_train.reset_index(drop=True )], axis=1 ) if COX_PENALIZER is None : best_pen, pen_cv = fit_cox_with_cv_penalizer( cox_df, COX_PENALIZER_GRID, duration_col="time" , event_col="event" , k=CV_FOLDS ) penalizer = float (best_pen) print (f"Cox penalizer CV: {pen_cv} -> best={best_pen} " ) else : penalizer = float (COX_PENALIZER) print (f"Cox penalizer fixed to: {penalizer} " ) cph = CoxPHFitter(penalizer=penalizer) cph.fit(cox_df, duration_col="time" , event_col="event" , show_progress=False ) try : scores = k_fold_cross_validation(cph, cox_df, duration_col="time" , event_col="event" , k=CV_FOLDS, scoring_method="concordance_index" ) print (f"{CV_FOLDS} -fold C-index: mean={np.mean(scores):.3 f} , std={np.std(scores):.3 f} " ) except Exception: pass time_grid = np.linspace(TIME_LOWER, TIME_UPPER, N_TIME_POINTS) pred = {} for p in P_LIST: pred[f"pred_t{int (p*100 )} " ] = predict_tp_from_cox(cph, X_spline_train, p=p, time_grid=time_grid) df_pred = pd.concat([df.reset_index(drop=True ), pd.DataFrame(pred).reset_index(drop=True )], axis=1 ) main_col = f"pred_t{int (MAIN_P*100 )} " x_bmi = df_pred["BMI" ].values order = np.argsort(x_bmi) iso = IsotonicRegression(increasing=True , out_of_bounds="clip" ) y95_raw = df_pred[main_col].fillna(TIME_UPPER).values y95_sorted = iso.fit_transform(x_bmi[order], y95_raw[order]) y95_fit = np.empty_like(y95_sorted); y95_fit[order] = y95_sorted df_pred["t_main_mono" ] = y95_fit if MONOTONE_Q90 and "pred_t90" in df_pred.columns: y90_raw = df_pred["pred_t90" ].fillna(TIME_UPPER).values y90_sorted = iso.fit_transform(x_bmi[order], y90_raw[order]) y90_fit = np.empty_like(y90_sorted); y90_fit[order] = y90_sorted df_pred["t90_mono" ] = y90_fit else : df_pred["t90_mono" ] = df_pred.get("pred_t90" , pd.Series(np.nan, index=df_pred.index)) y_target = df_pred["t_main_mono" ].values X = df_pred[["BMI" ]].values base_tree = DecisionTreeRegressor( max_depth=TREE_BASE_MAX_DEPTH, min_samples_leaf=TREE_BASE_MIN_SAMPLES_LEAF, 
random_state=TREE_RANDOM_STATE ) base_tree.fit(X, y_target) path = base_tree.cost_complexity_pruning_path(X, y_target) ccp_alphas = np.unique(path.ccp_alphas) candidates = [] for alpha in ccp_alphas: tree = DecisionTreeRegressor( max_depth=TREE_BASE_MAX_DEPTH, min_samples_leaf=TREE_BASE_MIN_SAMPLES_LEAF, ccp_alpha=float (alpha), random_state=TREE_RANDOM_STATE ) tree.fit(X, y_target) cuts_raw = extract_tree_cuts_1d(tree) cuts = apply_min_width(cuts_raw, MIN_BIN_WIDTH) labels_series, _, _ = bins_from_cuts(df_pred["BMI" ], cuts) df_lab = pd.DataFrame({"label" : labels_series}) labels_merged = merge_small_bins(df_lab, "label" , MIN_SAMPLES_PER_BIN) cuts_final, labels_final, bins_edges_final, _ = canonicalize_from_labels(df_pred["BMI" ], labels_merged) mae, sizes = evaluate_bins_mae(df_pred["BMI" ].values, y_target, labels_final.values) n_bins_final = len (set (labels_final.values)) candidates.append({ "alpha" : float (alpha), "cuts_raw" : cuts_raw, "cuts_after_minwidth" : cuts, "cuts_final" : cuts_final, "bins_edges_final" : [None if not np.isfinite(b) else float (b) for b in bins_edges_final], "n_bins_final" : n_bins_final, "mae" : float (mae), "sizes" : {k:int (v) for k,v in sizes.items()}, "labels_series_final" : labels_final.copy() }) def dist_to_required (n ): return abs (n - REQUIRED_BINS) exact = [c for c in candidates if c["n_bins_final" ] == REQUIRED_BINS] if exact: chosen = sorted (exact, key=lambda c: (c["mae" ]))[0 ] else : chosen = sorted (candidates, key=lambda c: (dist_to_required(c["n_bins_final" ]), c["mae" ]))[0 ] df_pred["BMI_bin_supervised" ] = chosen["labels_series_final" ].values def enforce_bins_and_min_samples (): df_lab2 = pd.DataFrame({"label" : df_pred["BMI_bin_supervised" ]}) labels_merged2 = merge_small_bins(df_lab2, "label" , max (MIN_SAMPLES_PER_BIN, PLOT_MIN_SAMPLES)) cuts_final2, labels_final2, bins_edges_final2, _ = canonicalize_from_labels(df_pred["BMI" ], labels_merged2) n_bins2 = len (set (labels_final2.values)) return labels_final2, cuts_final2, bins_edges_final2, n_bins2 labels_final2, cuts_final2, bins_edges_final2, n_bins2 = enforce_bins_and_min_samples() if n_bins2 != REQUIRED_BINS: alt = None for cnd in sorted (candidates, key=lambda c: (dist_to_required(c["n_bins_final" ]), c["mae" ])): df_pred["BMI_bin_supervised" ] = cnd["labels_series_final" ].values labels_final2, cuts_final2, bins_edges_final2, n_bins2 = enforce_bins_and_min_samples() if n_bins2 == REQUIRED_BINS: alt = (cnd, labels_final2, cuts_final2, bins_edges_final2); break if alt is not None : chosen, labels_final2, cuts_final2, bins_edges_final2 = alt df_pred["BMI_bin_supervised" ] = labels_final2.values chosen["cuts_final" ] = cuts_final2 chosen["bins_edges_final" ] = [None if not np.isfinite(b) else float (b) for b in bins_edges_final2] n_bins_final = len (set (df_pred["BMI_bin_supervised" ])) print (f"[INFO] 最终组数: {n_bins_final} (目标 {REQUIRED_BINS} )" ) km_tab = per_group_km_quantiles(df_pred, "BMI_bin_supervised" , P_LIST) pred_tab = df_pred.groupby("BMI_bin_supervised" )[["pred_t90" ,"pred_t95" ,"t_main_mono" ,"t90_mono" ]].median().reset_index() \ .rename(columns={"pred_t90" :"Cox_pred_t90_median" , "pred_t95" :"Cox_pred_t95_median" , "t_main_mono" :"Cox_t_main_mono_median" , "t90_mono" :"Cox_t90_mono_median" }) summary = km_tab.merge(pred_tab, left_on="group" , right_on="BMI_bin_supervised" , how="left" ).drop(columns=["BMI_bin_supervised" ]) summary = summary.sort_values("group" , key=lambda s: s.map (parse_label_left)).reset_index(drop=True ) rec = summary["Cox_t_main_mono_median" 
].apply(lambda x: np.nan if pd.isna(x) else round (x*2 )/2 ).values for i in range (1 , len (rec)): if not np.isnan(rec[i-1 ]) and (np.isnan(rec[i]) or rec[i] < rec[i-1 ]): rec[i] = rec[i-1 ] summary["recommended_week" ] = rec adaptive_outputs = {} if ADAPTIVE_ENABLE and (df_int is not None ): labels_int, _, _ = bins_from_cuts(df_int["BMI" ], chosen["cuts_final" ]) left_rates = compute_left_censor_rates(df_int, labels_int) left_rates.to_csv(os.path.join(OUTDIR, "left_censor_rates_by_group.csv" ), index=False , encoding="utf-8-sig" ) print ("\n[INFO] 左删失比例(按最终 BMI 组):" ) print (left_rates.to_string(index=False )) set_global_seed(SEED) imps_adapt = multiple_imputations_adaptive( df_int, labels_int, left_rates, threshold=LEFT_CENSOR_RATE_THRESHOLD, M=ADAPTIVE_M, method_default=MI_SAMPLING, method_high=ADAPTIVE_METHOD_HIGH, lb_default=DETECTION_LOWER_BOUND, lb_high=ADAPTIVE_LB_HIGH ) rec_adapt = group_km_recommendations_from_imputations(imps_adapt, chosen["cuts_final" ], p=MAIN_P) rec_vals = rec_adapt[f"KM_t{int (MAIN_P*100 )} _MI_med" ].values.copy() for i in range (1 , len (rec_vals)): if np.isfinite(rec_vals[i-1 ]) and (np.isnan(rec_vals[i]) or rec_vals[i] < rec_vals[i-1 ]): rec_vals[i] = rec_vals[i-1 ] rec_vals = np.array([np.nan if np.isnan(v) else round (v*2 )/2 for v in rec_vals]) rec_adapt["recommended_week_adaptive" ] = rec_vals rec_adapt.to_csv(os.path.join(OUTDIR, "recommendations_adaptive.csv" ), index=False , encoding="utf-8-sig" ) print ("\n[INFO] 自适应插补后的组内 KM Q95(插补中位数)与推荐:" ) print (rec_adapt.to_string(index=False )) adaptive_outputs["left_rates" ] = left_rates adaptive_outputs["rec_adapt" ] = rec_adapt ordered_groups = sorted (rec_adapt["group" ].tolist(), key=parse_label_left) pairs = [(ordered_groups[i], ordered_groups[i+1 ]) for i in range (len (ordered_groups)-1 )] rows_lr = [] for gL, gR in pairs: p_list = [] for df_m in imps_adapt: labels_m, _, _ = bins_from_cuts(df_m["BMI" ], chosen["cuts_final" ]) subL = df_m[labels_m.values == gL] subR = df_m[labels_m.values == gR] if len (subL) < 5 or len (subR) < 5 : continue try : res = logrank_test(subL["time" ], subR["time" ], event_observed_A=subL["event" ], event_observed_B=subR["event" ]) p_list.append(float (res.p_value)) except Exception: continue if len (p_list) == 0 : p_fisher = np.nan; p_median = np.nan else : p_median = float (np.median(p_list)) if HAS_SCIPY: stat = -2.0 * np.sum (np.log(np.clip(p_list, 1e-300 , 1.0 ))) df_chi = 2 * len (p_list) p_fisher = float (1.0 - chi2.cdf(stat, df_chi)) else : p_fisher = np.nan rows_lr.append({"group_left" : gL, "group_right" : gR, "n_imputations" : len (p_list), "p_median" : p_median, "p_fisher" : p_fisher}) df_lr = pd.DataFrame(rows_lr) df_lr.to_csv(os.path.join(OUTDIR, "logrank_adjacent_adaptive.csv" ), index=False , encoding="utf-8-sig" ) print ("\n[INFO] 相邻 BMI 组 log-rank(自适应 MI,Fisher 合并 p):" ) print (df_lr.to_string(index=False )) out_csv = os.path.join(OUTDIR, "bmi_supervised_bins_summary.csv" ) summary.to_csv(out_csv, index=False , encoding="utf-8-sig" ) print (f"\n已保存分箱汇总:{out_csv} " ) print (summary.to_string(index=False )) bmi_min = float (df_pred["BMI" ].min ()); bmi_max = float (df_pred["BMI" ].max ()) meta = { "seed" : SEED, "mode" : mode_used, "time_grid" : {"lower" : TIME_LOWER, "upper" : TIME_UPPER, "n_points" : N_TIME_POINTS}, "main_quantile" : MAIN_P, "monotone_q90" : MONOTONE_Q90, "required_bins" : REQUIRED_BINS, "min_samples_per_bin" : MIN_SAMPLES_PER_BIN, "min_bin_width" : MIN_BIN_WIDTH, "tree_base" : {"max_depth" : TREE_BASE_MAX_DEPTH, "min_samples_leaf" : 
TREE_BASE_MIN_SAMPLES_LEAF, "random_state" : TREE_RANDOM_STATE}, "interval_mi" : { "use_interval_mi" : USE_INTERVAL_MI, "mi_M" : MI_M, "mi_sampling" : MI_SAMPLING, "cv_once_for_mi" : CV_ONCE_FOR_MI, "detection_lower_bound" : DETECTION_LOWER_BOUND, "adaptive" : { "enabled" : ADAPTIVE_ENABLE, "left_censor_rate_threshold" : LEFT_CENSOR_RATE_THRESHOLD, "method_high" : ADAPTIVE_METHOD_HIGH, "lb_high" : ADAPTIVE_LB_HIGH, "adaptive_M" : ADAPTIVE_M } }, "cox" : {"spline_df" : SPLINE_DF, "spline_degree" : SPLINE_DEGREE}, "data" : {"bmi_min" : bmi_min, "bmi_max" : bmi_max, "n" : int (len (df_pred))}, "chosen" : { "cuts_final" : [round (x, 6 ) for x in chosen.get("cuts_final" , [])], "bin_edges_final" : chosen.get("bins_edges_final" ), "groups" : summary["group" ].tolist(), "group_sizes" : summary["n" ].fillna(0 ).astype(int ).tolist(), "recommended_week" : summary["recommended_week" ].tolist() } } if ADAPTIVE_ENABLE and (df_int is not None ) and ("left_rates" in adaptive_outputs): meta["adaptive_results" ] = { "left_censor_rates" : adaptive_outputs["left_rates" ].to_dict(orient="list" ), "recommendations_adaptive" : adaptive_outputs["rec_adapt" ].to_dict(orient="list" ) } with open (os.path.join(OUTDIR, "bmi_supervised_bins_cuts.json" ), "w" , encoding="utf-8" ) as f: json.dump(meta, f, ensure_ascii=False , indent=2 ) print (f"已保存切点与参数:{os.path.join(OUTDIR, 'bmi_supervised_bins_cuts.json' )} " ) plt.figure(figsize=(9 ,5.5 )) order_idx = np.argsort(df_pred["BMI" ].values) bmi_sorted = df_pred["BMI" ].values[order_idx] q90_sorted = df_pred[("t90_mono" if MONOTONE_Q90 else "pred_t90" )].values[order_idx] q95_sorted = df_pred[main_col].fillna(TIME_UPPER).values[order_idx] iso_g = IsotonicRegression(increasing=True , out_of_bounds='clip' ) q95_mono_sorted = iso_g.fit_transform(bmi_sorted, q95_sorted) for c in chosen["cuts_final" ]: plt.axvline(c, color="red" , ls="--" , alpha=0.6 ) plt.plot(bmi_sorted, q90_sorted, lw=2 , label=f"{'Q90 单调化' if MONOTONE_Q90 else 'Q90 原始' } (BMI) - Cox" ) plt.plot(bmi_sorted, q95_mono_sorted, lw=2 , label=f"Q{int (MAIN_P*100 )} (BMI) - Cox 单调化" ) plt.xlabel("BMI" ); plt.ylabel("预测达标周数" ) plt.title("预测分位时间 vs BMI(Cox/MI)与监督分箱切点" ) plt.legend(); plt.grid(alpha=0.3 ); plt.tight_layout() plt.savefig(os.path.join(OUTDIR, "qcurves_with_cuts.png" ), dpi=200 ) plt.close() plt.figure(figsize=(8 ,4 )) plt.hist(df_pred["BMI" ], bins=20 , color="#8fbcd4" , alpha=0.85 , edgecolor="#333" ) for c in chosen["cuts_final" ]: plt.axvline(c, color="red" , ls="--" , alpha=0.7 ) plt.xlabel("BMI" ); plt.ylabel("频数" ); plt.title("BMI 分布与切点" ) plt.tight_layout(); plt.savefig(os.path.join(OUTDIR, "bmi_hist_with_cuts.png" ), dpi=200 ); plt.close() for p_key, arr in [("Q90" , df_pred.get("pred_t90" , pd.Series(np.nan)).values), (f"Q{int (MAIN_P*100 )} " , df_pred[main_col].values)]: vals = arr.copy() n = np.isfinite(vals).sum () if n == 0 : print (f"{p_key} 边界夹住比例: 数据全 NaN" ) continue at_low = np.nanmean(vals <= TIME_LOWER + 1e-8 ) at_up = np.nanmean(vals >= TIME_UPPER - 1e-8 ) print (f"{p_key} 边界夹住比例: at_lower={at_low:.3 f} , at_upper={at_up:.3 f} (n={n} )" ) kmf = KaplanMeierFitter() plt.figure(figsize=(9.5 ,6.2 )) for label, sub in df_pred.groupby("BMI_bin_supervised" ): sub = sub.dropna(subset=["time" ]) if len (sub) == 0 : continue lab_txt = f"{label} (n={len (sub)} )" kmf.fit(durations=sub["time" ].values, event_observed=sub["event" ].values, label=lab_txt) if len (sub) < PLOT_MIN_SAMPLES and PLOT_EVEN_IF_SMALL: kmf.plot(ci_show=False , lw=1.8 , ls="--" , alpha=0.95 ) else : 
kmf.plot(ci_show=True , ci_alpha=0.15 , lw=2.2 ) plt.xlabel("孕周(周)" ); plt.ylabel("未达标概率 S(t)" ) plt.title("Kaplan–Meier(监督分箱后的 BMI 组) — 基于MI近似" ) plt.grid(alpha=0.2 ); plt.tight_layout() plt.savefig(os.path.join(OUTDIR, "km_by_supervised_bins.png" ), dpi=200 ) plt.close() try : df_learn_for_auc = df_pred[["BMI" ,"time" ,"event" ]].copy() times, mean_auc, std_auc = time_dependent_auc_curve(df_learn_for_auc, lower=10.0 , upper=30.0 , n_points=40 , kfold=5 , penalizer=0.05 ) if times is not None : print (f"[INFO] 已生成 AUC 曲线(点数={len (times)} ),结果输出至 {OUTDIR} /auc_curve.png 与 auc_curve.csv" ) else : print ("[INFO] 跳过 AUC 曲线(未安装 scikit-survival 或样本不足)" ) except Exception as e: warnings.warn(f"AUC 评估阶段出现异常(不影响主流程):{e} " ) print ("\n推荐结果(用于报告):" ) cols = ["group" ,"n" ] + [f"KM_t{int (p*100 )} " for p in P_LIST] + \ ["Cox_pred_t90_median" ,"Cox_pred_t95_median" ,"Cox_t_main_mono_median" ,"Cox_t90_mono_median" ,"recommended_week" ] print (summary[cols].to_string(index=False ))if __name__ == "__main__" : main()
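上述脚本较长,这里用一个可独立运行的最小示意概括其核心流程(假设性示例:数据由随机数生成,样本量与区间宽度等参数均为示意,并简化为所有个体最终都达标、不含右删失):先把每位孕妇整理为区间删失记录 (L, R],在区间内均匀抽样完成多重插补,再对每个插补数据集拟合 Kaplan–Meier 曲线并读取 S(t) ≤ 0.05 对应的 t95,最后取各次插补结果的中位数作为组内推荐时点的近似。

import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(42)
n, M = 300, 20
true_t = rng.normal(15, 3, n).clip(10, 30)           # 模拟真实首次达标孕周
L = np.maximum(10.0, true_t - rng.uniform(1, 3, n))  # 区间左端:最后一次未达标孕周
R = true_t + rng.uniform(0, 2, n)                    # 区间右端:首次观测到达标的孕周

t95_list = []
for _ in range(M):
    t_imp = rng.uniform(L, R)                        # 每次插补:在区间内均匀抽一个达标时间
    kmf = KaplanMeierFitter().fit(t_imp, event_observed=np.ones(n))
    sf = kmf.survival_function_
    hit = sf[sf.iloc[:, 0] <= 0.05]                  # S(t) ≤ 0.05 即 95% 个体已达标
    if len(hit):
        t95_list.append(float(hit.index[0]))

print(f"M={M} 次插补得到的 t95 中位数: {np.median(t95_list):.2f} 周")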
p2_eda.py
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 """ NIPT Step1: 数据准备与 EDA(修正:Y 浓度按“比例”处理,阈值=0.04) - 直接读取 ../男胎检测数据_filtered.csv - 统一把“Y染色体浓度”解析为比例(0–1),若似乎是百分数(0–100),自动 /100 - 达标判定阈值使用 0.04(即 4%) - 产出: - outputs/patient_level_summary.csv - outputs/km_by_bmi.png - outputs/longitudinal_examples.png - outputs/scatter_lowess_by_bmi.png """ import osimport warningsfrom typing import Optional , Tuple import numpy as npimport pandas as pdimport matplotlib.pyplot as pltimport seaborn as snsfrom lifelines import KaplanMeierFitterfrom statsmodels.nonparametric.smoothers_lowess import lowess warnings.filterwarnings("ignore" ) CSV_PATH = "../男胎检测数据_filtered.csv" OUTPUT_DIR = "./eda_outputs" THRESH_Y_FRAC = 0.04 RANDOM_SEED = 42 N_TRAJ = 30 np.random.seed(RANDOM_SEED) COL_PATIENT = "孕妇代码" COL_GA_DAYS = "检测孕天数" COL_DATE = "检测日期" COL_Y_CONC = "Y染色体浓度" COL_BMI = "孕妇BMI" COL_HEIGHT = "身高" COL_WEIGHT = "体重" def ensure_dir (p: str ): if not os.path.exists(p): os.makedirs(p, exist_ok=True )def to_numeric_series (s: pd.Series ) -> pd.Series: return pd.to_numeric(s, errors="coerce" )def derive_week_from_days (s_days: pd.Series ) -> pd.Series: days = to_numeric_series(s_days) return days / 7.0 def compute_bmi (height_cm: pd.Series, weight_kg: pd.Series ) -> pd.Series: h_m = to_numeric_series(height_cm) / 100.0 w = to_numeric_series(weight_kg) bmi = w / (h_m ** 2 ) return bmidef parse_y_as_fraction (s: pd.Series ) -> Tuple [pd.Series, str ]: """ 将 Y 染色体浓度统一解析为“比例”(0–1)。 规则: - 如果字符串中带有 %,去掉 % 并 /100 - 否则转为数值;若数值的高分位(如95分位)>1 且 <=100,判为百分数,/100 - 其余情况按比例使用 返回:y_frac, scale_note """ s_raw = s.copy() if s_raw.dtype == object : s_str = s_raw.astype(str ).str .strip() if s_str.str .contains("%" ).any (): vals = pd.to_numeric(s_str.str .replace("%" , "" ).str .replace("," , "" ), errors="coerce" ) return vals / 100.0 , "parsed_from_percent_symbol" y = pd.to_numeric(s_raw, errors="coerce" ) finite = y[np.isfinite(y)] scale_note = "as_fraction" if finite.size >= 10 : q95 = np.nanpercentile(finite, 95 ) if (q95 > 1.0 ) and (q95 <= 100.0 ): y = y / 100.0 scale_note = "auto_div_100_from_percent_range" return y, scale_notedef mark_failure_from_y (y_frac: pd.Series ) -> pd.Series: return 
y_frac.isna()def prepare_row_level (df_raw: pd.DataFrame ) -> pd.DataFrame: df = df_raw.copy() if COL_GA_DAYS not in df.columns: raise ValueError("缺少列:检测孕天数" ) df["孕周" ] = derive_week_from_days(df[COL_GA_DAYS]) if COL_DATE in df.columns: df[COL_DATE] = pd.to_datetime(df[COL_DATE], errors="coerce" ) y_frac, scale_note = parse_y_as_fraction(df[COL_Y_CONC]) df["Y_frac" ] = y_frac df["测序失败" ] = mark_failure_from_y(df["Y_frac" ]) df["有效测量" ] = ~df["测序失败" ] df["达标" ] = (df["有效测量" ]) & (df["Y_frac" ] >= THRESH_Y_FRAC) if COL_BMI in df.columns: df["BMI_num" ] = to_numeric_series(df[COL_BMI]) else : df["BMI_num" ] = np.nan if df["BMI_num" ].isna().any (): if (COL_HEIGHT in df.columns) and (COL_WEIGHT in df.columns): bmi_calc = compute_bmi(df[COL_HEIGHT], df[COL_WEIGHT]) df.loc[df["BMI_num" ].isna(), "BMI_num" ] = bmi_calc[df["BMI_num" ].isna()] yfin = df["Y_frac" ][df["有效测量" ]] print ("\n===== Y 浓度单位与分布(已统一为“比例”0–1)=====" ) print (f"单位推断: {scale_note} " ) if len (yfin): desc = yfin.describe(percentiles=[0.1 , 0.25 , 0.5 , 0.75 , 0.9 , 0.95 ]) print (desc.to_string()) print (f"以阈值 {THRESH_Y_FRAC:.3 f} 判定达标的测次比例: {float ((df['达标' ]).mean()):.3 f} " ) else : print ("有效测量为空,检查原始数据。" ) return dfdef aggregate_patient_level (df: pd.DataFrame ) -> pd.DataFrame: sort_cols = ["孕周" ] if COL_DATE in df.columns: sort_cols.append(COL_DATE) df_sorted = df.sort_values(sort_cols) rows = [] for pid, g in df_sorted.groupby(COL_PATIENT, dropna=False ): bmi_values = g["BMI_num" ].dropna() bmi = bmi_values.iloc[0 ] if len (bmi_values) else np.nan n_total = len (g) n_valid = int (g["有效测量" ].sum ()) n_fail = n_total - n_valid ga_all = g["孕周" ].dropna() earliest_week = ga_all.min () if len (ga_all) else np.nan latest_week = ga_all.max () if len (ga_all) else np.nan g_valid = g[g["有效测量" ]] first_ge4_week = np.nan if len (g_valid): hit = g_valid[g_valid["达标" ]] if len (hit): first_ge4_week = hit["孕周" ].iloc[0 ] if pd.notna(first_ge4_week): event = 1 time = first_ge4_week else : event = 0 time = latest_week rows.append({ COL_PATIENT: pid, "BMI" : bmi, "n_records" : n_total, "n_valid" : n_valid, "n_fail" : n_fail, "earliest_week" : earliest_week, "latest_week" : latest_week, "event" : event, "time" : time, "all_failed" : int (n_valid == 0 ), }) df_pat = pd.DataFrame(rows) bmi_clean = df_pat["BMI" ].dropna() if len (bmi_clean): q1, q3 = np.percentile(bmi_clean, [25 , 75 ]) iqr = q3 - q1 mild_low, mild_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr extreme_low, extreme_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr else : mild_low = mild_high = extreme_low = extreme_high = np.nan df_pat["BMI_outlier_mild" ] = (df_pat["BMI" ] < mild_low) | (df_pat["BMI" ] > mild_high) df_pat["BMI_outlier_extreme" ] = (df_pat["BMI" ] < extreme_low) | (df_pat["BMI" ] > extreme_high) print ("\n===== 患者级事件与删失情况 =====" ) print (f"总人数: {len (df_pat)} , 有事件(首次达标)人数: {int (df_pat['event' ].sum ())} " f"({df_pat['event' ].mean():.3 f} ), 全失败人数: {int (df_pat['all_failed' ].sum ())} " ) return df_patdef make_bmi_bins (df_pat_or_rows: pd.DataFrame, colname: str = "BMI" , bins: Optional [list ] = None , labels: Optional [list ] = None ) -> pd.Series: if bins is None : bins = [20 , 28 , 32 , 36 , 40 , np.inf] if labels is None : labels = ["[20,28)" , "[28,32)" , "[32,36)" , "[36,40)" , "[40,+)" ] return pd.cut(df_pat_or_rows[colname], bins=bins, labels=labels, right=False , include_lowest=True )def set_cn_font (): try : plt.rcParams["font.sans-serif" ] = ["SimHei" , "Noto Sans CJK SC" , "Microsoft YaHei" ] plt.rcParams["axes.unicode_minus" ] = False except Exception: pass def 
plot_km_by_bmi (df_pat: pd.DataFrame, outdir: str ): set_cn_font() df_pat = df_pat.copy() df_pat["BMI_bin" ] = make_bmi_bins(df_pat, colname="BMI" ) kmf = KaplanMeierFitter() plt.figure(figsize=(9 , 6 )) any_group_plotted = False for label, sub in df_pat.groupby("BMI_bin" , dropna=False ): sub = sub.dropna(subset=["time" ]) if len (sub) < 5 : continue kmf.fit(durations=sub["time" ].values, event_observed=sub["event" ].values, label=str (label)) kmf.plot(ci_show=True , ci_alpha=0.15 , lw=2 ) any_group_plotted = True plt.title("Kaplan–Meier(事件=首次 Y≥4%)按 BMI 分层" ) plt.xlabel("孕周(周)" ) plt.ylabel("未达标概率 S(t)" ) plt.grid(alpha=0.2 ) plt.tight_layout() fp = os.path.join(outdir, "km_by_bmi.png" ) if any_group_plotted: plt.savefig(fp, dpi=200 ) plt.close()def plot_longitudinal_examples (df_rows: pd.DataFrame, outdir: str , n_patients: int = 30 ): set_cn_font() dfv = df_rows[df_rows["有效测量" ]].copy() counts = dfv.groupby(COL_PATIENT).size() eligible_ids = counts[counts >= 2 ].index if len (eligible_ids) == 0 : return sample_ids = np.random.choice(eligible_ids, size=min (n_patients, len (eligible_ids)), replace=False ) sub = dfv[dfv[COL_PATIENT].isin(sample_ids)].copy() plt.figure(figsize=(10 , 7 )) for pid, g in sub.groupby(COL_PATIENT): g = g.sort_values("孕周" ) plt.plot(g["孕周" ], g["Y_frac" ], marker="o" , ms=3 , lw=1 , alpha=0.6 ) plt.axhline(THRESH_Y_FRAC, color="red" , ls="--" , lw=1.5 , alpha=0.8 , label=f"阈值 {THRESH_Y_FRAC:.2 %} " ) plt.title(f"随机抽样 {len (sample_ids)} 位孕妇的纵向 Y 浓度轨迹(比例)" ) plt.xlabel("孕周(周)" ) plt.ylabel("Y 染色体浓度(比例)" ) plt.grid(alpha=0.2 ) plt.tight_layout() fp = os.path.join(outdir, "longitudinal_examples.png" ) plt.savefig(fp, dpi=200 ) plt.close()def plot_scatter_lowess_by_bmi (df_rows: pd.DataFrame, outdir: str ): set_cn_font() dfv = df_rows[df_rows["有效测量" ]].copy() dfv = dfv.dropna(subset=["孕周" , "Y_frac" , "BMI_num" ]) dfv = dfv[(dfv["孕周" ] >= 6 ) & (dfv["孕周" ] <= 35 )] dfv = dfv.rename(columns={"BMI_num" : "BMI" }) dfv["BMI_bin" ] = make_bmi_bins(dfv, colname="BMI" ) g = sns.FacetGrid(dfv, col="BMI_bin" , col_wrap=3 , sharex=True , sharey=True , height=3.2 ) def facet_scatter (data, color, **kwargs ): plt.scatter(data["孕周" ], data["Y_frac" ], s=8 , alpha=0.35 , color=color) x = data["孕周" ].values y = data["Y_frac" ].values if len (x) >= 20 : fitted = lowess(y, x, frac=0.3 , return_sorted=True ) plt.plot(fitted[:, 0 ], fitted[:, 1 ], color="black" , lw=2 ) plt.axhline(THRESH_Y_FRAC, color="red" , ls="--" , lw=1.2 , alpha=0.8 ) g.map_dataframe(facet_scatter) g.set_axis_labels("孕周(周)" , "Y 浓度(比例)" ) g.fig.subplots_adjust(top=0.88 ) g.fig.suptitle("Y 浓度 vs 孕周(按 BMI 分层,比例刻度)" ) fp = os.path.join(outdir, "scatter_lowess_by_bmi.png" ) plt.savefig(fp, dpi=200 ) plt.close()def summarize_and_save (df_pat: pd.DataFrame, outdir: str ): print ("\n===== 患者级汇总(前几行) =====" ) print (df_pat.head()) print ("\n===== 每位孕妇检测次数分布 =====" ) print (df_pat["n_records" ].describe()) print (df_pat["n_records" ].value_counts().head(10 )) print ("\n===== 每位孕妇测序失败次数分布 =====" ) print (df_pat["n_fail" ].describe()) print (df_pat["n_fail" ].value_counts().head(10 )) print ("\n===== 最早/最晚检测孕周 =====" ) print (df_pat[["earliest_week" , "latest_week" ]].describe()) print ("\n===== BMI 分布 =====" ) print (df_pat["BMI" ].describe()) n_mild = int (df_pat["BMI_outlier_mild" ].sum ()) n_ext = int (df_pat["BMI_outlier_extreme" ].sum ()) print (f"轻度离群(1.5*IQR)人数: {n_mild} , 极端离群(3*IQR)人数: {n_ext} " ) out_csv = os.path.join(outdir, "patient_level_summary.csv" ) df_pat.to_csv(out_csv, index=False , 
encoding="utf-8-sig" ) print (f"\n已保存患者级汇总:{out_csv} " )def main (): ensure_dir(OUTPUT_DIR) df_raw = pd.read_csv(CSV_PATH) for col in [COL_PATIENT, COL_GA_DAYS, COL_Y_CONC]: if col not in df_raw.columns: raise ValueError(f"未找到列:{col} " ) df_rows = prepare_row_level(df_raw) df_pat = aggregate_patient_level(df_rows) summarize_and_save(df_pat, OUTPUT_DIR) plot_km_by_bmi(df_pat, OUTPUT_DIR) plot_longitudinal_examples(df_rows, OUTPUT_DIR, n_patients=N_TRAJ) plot_scatter_lowess_by_bmi(df_rows, OUTPUT_DIR) print (f"\n图像已保存至目录:{OUTPUT_DIR} " ) print (" - km_by_bmi.png" ) print (" - longitudinal_examples.png" ) print (" - scatter_lowess_by_bmi.png" )if __name__ == "__main__" : main()
p2_noise_grouped_sensitivity_analysis.py
"""敏感性分析(分组):评估观测噪声对BMI分箱切点和最终推荐周数的影响。"""
import pandas as pd
import numpy as np
from lifelines import KaplanMeierFitter, CoxPHFitter
from sklearn.tree import DecisionTreeRegressor
from sklearn.isotonic import IsotonicRegression
from patsy import dmatrix
import os
import random
import warnings

warnings.filterwarnings("ignore")

N_BOOTSTRAPS = 50
NOISE_STD_DEV = 0
CFFDNA_THRESHOLD = 0.04
RAW_DATA_FILE = "C:/Users/yezf8/Documents/Y3Repo/25C题/男胎检测数据_filtered.csv"
OUTPUT_DIR = "C:/Users/yezf8/Documents/Y3Repo/25C题/problem2/sensitivity_analysis_outputs"
COL_PATIENT, COL_GA_DAYS, COL_Y_CONC, COL_BMI = "孕妇代码", "检测孕天数", "Y染色体浓度", "孕妇BMI"
SEED = 42
REQUIRED_BINS = 4
MIN_SAMPLES_PER_BIN = 20
MI_M = 10
MAIN_P = 0.95
DETECTION_LOWER_BOUND = 10.0

def set_global_seed(seed: int):
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)

def build_patient_intervals(df_raw, detection_lower_bound):
    df = df_raw.copy()
    df["孕周"] = df[COL_GA_DAYS] / 7.0
    df["达标"] = df[COL_Y_CONC] >= CFFDNA_THRESHOLD
    rows = []
    for pid, g in df.groupby(COL_PATIENT):
        bmi = g[COL_BMI].iloc[0]
        g = g.sort_values("孕周")
        times, hits = g["孕周"].values, g["达标"].values
        pos_idx = np.where(hits)[0]
        if len(pos_idx) > 0:
            first_pos_i = int(pos_idx[0])
            neg_before = np.where(~hits[:first_pos_i])[0]
            R = float(times[first_pos_i])
            L = float(times[neg_before[-1]]) if len(neg_before) > 0 else float(detection_lower_bound)
            ctype = "interval" if len(neg_before) > 0 else "left"
        else:
            L, R, ctype = float(times[-1]), np.inf, "right"
        rows.append({"patient_id": pid, "BMI": bmi, "L": L, "R": R, "ctype": ctype})
    return pd.DataFrame(rows)

def sample_time_from_interval(L, R, lower_bound):
    if not np.isfinite(R):
        return np.nan
    L_eff = float(L) if np.isfinite(L) else float(lower_bound)
    return np.random.uniform(L_eff, float(R))

def multiple_imputations(df_int, M, lower_bound):
    dfs = []
    for _ in range(M):
        rows = []
        for _, r in df_int.iterrows():
            if r["ctype"] in ["interval", "left"]:
                t = sample_time_from_interval(r["L"], r["R"], lower_bound)
                rows.append({"patient_id": r["patient_id"], "BMI": r["BMI"], "time": t, "event": 1})
            elif r["ctype"] == "right":
                rows.append({"patient_id": r["patient_id"], "BMI": r["BMI"], "time": r["L"], "event": 0})
        df_m = pd.DataFrame(rows).dropna()
        df_m["event"] = df_m["event"].astype(int)
        dfs.append(df_m)
    return dfs

def get_cox_predictions(imputed_sets, p_list):
    if not imputed_sets:
        return None
    agg_preds = None
    for m_idx, df_m in enumerate(imputed_sets):
        if len(df_m) < 20:
            continue
        BMI_CENTER = df_m["BMI"].mean()
        X_spline = dmatrix("bs(BMI_centered, df=4, degree=3, include_intercept=False)",
                           {"BMI_centered": (df_m["BMI"] - BMI_CENTER).values}, return_type='dataframe')
        cox_df = pd.concat([df_m[["time", "event"]].reset_index(drop=True), X_spline.reset_index(drop=True)], axis=1)
        cph = CoxPHFitter(penalizer=0.05)
        try:
            cph.fit(cox_df, duration_col="time", event_col="event", show_progress=False, robust=True)
        except Exception:
            continue
        time_grid = np.linspace(DETECTION_LOWER_BOUND, 35.0, 200)
        surv = cph.predict_survival_function(X_spline, times=time_grid)
        df_pred_m = df_m[["patient_id", "BMI"]].copy()
        for p in p_list:
            target_S, t_preds = 1.0 - p, []
            for col in surv.columns:
                s = surv[col].values
                hit = np.where(s <= target_S)[0]
                t_preds.append(time_grid[hit[0]] if len(hit) > 0 else np.nan)
            df_pred_m[f"pred_t{int(p*100)}"] = t_preds
        if agg_preds is None:
            agg_preds = df_pred_m
        else:
            current_preds = df_pred_m.rename(columns={f"pred_t{int(p*100)}": f"pred_t{int(p*100)}_{m_idx}" for p in p_list})
            agg_preds = agg_preds.merge(current_preds, on=["patient_id", "BMI"], how="left")
    if agg_preds is None:
        return None
    for p in p_list:
        pred_cols = [col for col in agg_preds.columns if col.startswith(f"pred_t{int(p*100)}")]
        agg_preds[f"pred_t{int(p*100)}_final"] = agg_preds[pred_cols].median(axis=1)
    return agg_preds[["patient_id", "BMI"] + [f"pred_t{int(p*100)}_final" for p in p_list]]

def get_supervised_cuts(df_pred, y_col, n_bins, min_samples):
    df_pred = df_pred.dropna(subset=["BMI", y_col])
    if len(df_pred) < min_samples * n_bins:
        return []
    iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
    y_mono = iso.fit_transform(df_pred["BMI"], df_pred[y_col])
    tree = DecisionTreeRegressor(max_leaf_nodes=n_bins, min_samples_leaf=min_samples, random_state=SEED)
    tree.fit(df_pred[["BMI"]], y_mono)
    return sorted([t for t in tree.tree_.threshold if t != -2.0])

def get_group_recommendations(df_with_groups, group_col):
    recommendations = {}
    for name, group in df_with_groups.groupby(group_col):
        if len(group) < 10:
            continue
        kmf = KaplanMeierFitter()
        kmf.fit(group['time'], group['event'])
        sf = kmf.survival_function_.reset_index()
        hit = sf[sf["KM_estimate"] <= (1 - MAIN_P)]
        rec = hit["timeline"].iloc[0] if len(hit) > 0 else group['time'].max()
        recommendations[name] = rec
    return recommendations

def run_single_simulation(df_raw, noise_std):
    set_global_seed(random.randint(0, 100000))
    df_noisy = df_raw.copy()
    df_noisy[COL_Y_CONC] += np.random.normal(0, noise_std, size=len(df_noisy))
    df_int = build_patient_intervals(df_noisy, DETECTION_LOWER_BOUND)
    imputed_sets = multiple_imputations(df_int, M=MI_M, lower_bound=DETECTION_LOWER_BOUND)
    if not imputed_sets:
        return None
    df_pred = get_cox_predictions(imputed_sets, [MAIN_P])
    if df_pred is None:
        return None
    y_col = f"pred_t{int(MAIN_P*100)}_final"
    cuts = get_supervised_cuts(df_pred, y_col, REQUIRED_BINS, MIN_SAMPLES_PER_BIN)
    if len(cuts) != REQUIRED_BINS - 1:
        return None
    df_m1 = imputed_sets[0]
    bin_edges = [-np.inf] + cuts + [np.inf]
    df_m1["group"] = pd.cut(df_m1["BMI"], bins=bin_edges, labels=range(REQUIRED_BINS))
    recs = get_group_recommendations(df_m1, "group")
    if len(recs) != REQUIRED_BINS:
        return None
    return {"cuts": cuts, "recommendations": [recs.get(i, np.nan) for i in range(REQUIRED_BINS)]}

if __name__ == "__main__":
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    set_global_seed(SEED)
    try:
        df_raw = pd.read_csv(RAW_DATA_FILE, encoding='gbk')
    except UnicodeDecodeError:
        df_raw = pd.read_csv(RAW_DATA_FILE, encoding='utf-8')
    df_raw = df_raw[[COL_PATIENT, COL_GA_DAYS, COL_Y_CONC, COL_BMI]].dropna()
    for col in [COL_GA_DAYS, COL_Y_CONC, COL_BMI]:
        df_raw[col] = pd.to_numeric(df_raw[col], errors='coerce')
    df_raw = df_raw.dropna()
    results = []
    print(f"Starting {N_BOOTSTRAPS} grouped sensitivity simulations (for {REQUIRED_BINS} groups)...")
    for i in range(N_BOOTSTRAPS):
        print(f"  Running simulation {i+1}/{N_BOOTSTRAPS}...")
        sim_result = run_single_simulation(df_raw, NOISE_STD_DEV)
        if sim_result:
            results.append(sim_result)
    print("Simulations complete.")
    if not results:
        print("No valid simulation results were obtained. The process might be too unstable or parameters too strict.")
    else:
        df_results = pd.DataFrame(results)
        df_cuts = pd.DataFrame(df_results["cuts"].tolist(), columns=[f"Cut_{i+1}" for i in range(REQUIRED_BINS - 1)])
        df_recs = pd.DataFrame(df_results["recommendations"].tolist(), columns=[f"Group_{i+1}_Rec" for i in range(REQUIRED_BINS)])
        output_csv_path = os.path.join(OUTPUT_DIR, "grouped_sensitivity_results_4_groups.csv")
        pd.concat([df_cuts, df_recs], axis=1).to_csv(output_csv_path, index=False)
        print(f"Saved detailed simulation results to {output_csv_path}")
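下面用一个极简的玩具示例(数据为假设值,与附件数据无关)演示上述 build_patient_intervals 将单个孕妇的纵向观测映射为删失区间 (L, R, ctype) 的规则:

import numpy as np

def classify(times, concentrations, threshold=0.04, lower_bound=10.0):
    """按 build_patient_intervals 的规则返回 (L, R, ctype)。"""
    times = np.asarray(times, dtype=float)
    hits = np.asarray(concentrations, dtype=float) >= threshold
    pos = np.where(hits)[0]
    if len(pos) == 0:                                   # 从未达标 -> 右删失
        return float(times[-1]), np.inf, "right"
    first = int(pos[0])
    neg_before = np.where(~hits[:first])[0]
    if len(neg_before) == 0:                            # 首次检测即达标 -> 左删失
        return float(lower_bound), float(times[first]), "left"
    return float(times[neg_before[-1]]), float(times[first]), "interval"

print(classify([12, 14, 16], [0.02, 0.03, 0.05]))       # (14.0, 16.0, 'interval')
print(classify([13, 15], [0.06, 0.07]))                 # (10.0, 13.0, 'left')
print(classify([12, 18], [0.01, 0.02]))                 # (18.0, inf, 'right')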
p2_plot_sensitivity_trends.py
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

INPUT_CSV = "C:/Users/yezf8/Documents/Y3Repo/25C题/problem2/sensitivity_analysis_outputs/grouped_sensitivity_results_4_groups.csv"
OUTPUT_DIR = "C:/Users/yezf8/Documents/Y3Repo/25C题/problem2/sensitivity_analysis_outputs"
N_GROUPS = 3

if __name__ == "__main__":
    if not os.path.exists(INPUT_CSV):
        print(f"Error: Input file not found at {INPUT_CSV}")
        exit()
    df = pd.read_csv(INPUT_CSV)
    cut_cols = [f"Cut_{i+1}" for i in range(N_GROUPS - 1)]
    rec_cols = [f"Group_{i+1}_Rec" for i in range(N_GROUPS)]
    cuts_melted = pd.melt(df, value_vars=cut_cols, var_name='Cut_Point', value_name='BMI_Value')
    recs_melted = pd.melt(df, value_vars=rec_cols, var_name='Recommendation_Group', value_name='Gestational_Week')
    plt.style.use('seaborn-v0_8-whitegrid')
    try:
        plt.rcParams['font.sans-serif'] = ['SimHei']
        plt.rcParams['axes.unicode_minus'] = False
    except:
        print("SimHei font not found, using default font.")
    fig1, ax1 = plt.subplots(figsize=(10, 12))
    sns.violinplot(data=cuts_melted, x='Cut_Point', y='BMI_Value', ax=ax1, hue='Cut_Point', palette="pastel", legend=False)
    ax1.set_title('不同切点的BMI值分布 (3组)', fontsize=16)
    ax1.set_xlabel('BMI 切点', fontsize=12)
    ax1.set_ylabel('BMI 值', fontsize=12)
    plt.tight_layout()
    output_path_cuts = os.path.join(OUTPUT_DIR, "sensitivity_violin_cuts_3_groups.png")
    plt.savefig(output_path_cuts, dpi=150)
    print(f"Saved cut-off violin plot to {output_path_cuts}")
    plt.close(fig1)
    fig2, ax2 = plt.subplots(figsize=(10, 12))
    sns.violinplot(data=recs_melted, x='Recommendation_Group', y='Gestational_Week', ax=ax2, hue='Recommendation_Group', palette="pastel", legend=False)
    ax2.set_title('不同组别的推荐孕周分布 (3组)', fontsize=16)
    ax2.set_xlabel('推荐组别', fontsize=12)
    ax2.set_ylabel('推荐孕周', fontsize=12)
    plt.xticks(rotation=45)
    plt.tight_layout()
    output_path_recs = os.path.join(OUTPUT_DIR, "sensitivity_violin_recs_3_groups.png")
    plt.savefig(output_path_recs, dpi=150)
    print(f"Saved recommendation violin plot to {output_path_recs}")
    plt.close(fig2)
p2_fuzzy_interval_modeling.py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
import os
import json

FUZZY_LOWER_BOUND = 0.039
FUZZY_UPPER_BOUND = 0.041
CFFDNA_THRESHOLD = 0.04
N_GROUPS = 4
RAW_DATA_FILE = "C:/Users/yezf8/Documents/Y3Repo/25C题/男胎检测数据_filtered.csv"
OUTPUT_DIR = "C:/Users/yezf8/Documents/Y3Repo/25C题/problem2/sensitivity_analysis_outputs"
CUTS_FILE = "C:/Users/yezf8/Documents/Y3Repo/25C题/problem2/outputs_binning/bmi_supervised_bins_cuts.json"
COL_PATIENT, COL_GA_DAYS, COL_Y_CONC, COL_BMI = "孕妇代码", "检测孕天数", "Y染色体浓度", "孕妇BMI"

def get_event_data(df, threshold):
    patient_results = []
    for _, group in df.groupby(COL_PATIENT):
        group = group.sort_values(by=COL_GA_DAYS)
        weeks = group[COL_GA_DAYS] / 7.0
        hits = group[COL_Y_CONC] >= threshold
        duration = weeks[hits.idxmax()] if hits.any() else weeks.max()
        observed = hits.any()
        patient_results.append({'patient_id': group[COL_PATIENT].iloc[0], 'BMI': group[COL_BMI].iloc[0], 'duration': duration, 'observed': observed})
    return pd.DataFrame(patient_results)

def get_fuzzy_event_data(df, lower_b, upper_b):
    patient_results = []
    for _, group in df.groupby(COL_PATIENT):
        group = group.sort_values(by=COL_GA_DAYS)
        weeks = group[COL_GA_DAYS] / 7.0
        concentrations = group[COL_Y_CONC]
        L, R, status = 0, np.inf, 'right_censored'
        above_indices = np.where(concentrations > upper_b)[0]
        if len(above_indices) > 0:
            first_above_idx = above_indices[0]
            R = weeks.iloc[first_above_idx]
            possible_L_indices = np.where(weeks < R)[0]
            L = weeks.iloc[possible_L_indices[-1]] if len(possible_L_indices) > 0 else 0
            status = 'interval'
        else:
            L = weeks.max()
        patient_results.append({'patient_id': group[COL_PATIENT].iloc[0], 'BMI': group[COL_BMI].iloc[0], 'L': L, 'R': R, 'status': status})
    df_intervals = pd.DataFrame(patient_results)
    imputed_times = []
    for _, row in df_intervals.iterrows():
        if row['status'] == 'interval':
            imputed_time = np.random.uniform(row['L'], row['R'])
            imputed_times.append({'patient_id': row['patient_id'], 'BMI': row['BMI'], 'duration': imputed_time, 'observed': True})
        elif row['status'] == 'right_censored':
            imputed_times.append({'patient_id': row['patient_id'], 'BMI': row['BMI'], 'duration': row['L'], 'observed': False})
    return pd.DataFrame(imputed_times)

def get_q95_recommendation(df):
    if df.empty or not df['observed'].any():
        return np.nan
    kmf = KaplanMeierFitter()
    kmf.fit(df['duration'], df['observed'])
    sf = kmf.survival_function_.reset_index()
    hit = sf[sf["KM_estimate"] <= 0.05]
    return hit["timeline"].iloc[0] if len(hit) > 0 else df['duration'].max()

if __name__ == "__main__":
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    try:
        with open(CUTS_FILE, 'r') as f:
            cuts = json.load(f)['chosen']['cuts_final']
    except FileNotFoundError:
        print(f"Error: Cuts file not found at {CUTS_FILE}. Cannot perform grouped analysis.")
        exit()
    bin_edges = [-np.inf] + cuts + [np.inf]
    labels = [f'Group {i+1}: {bin_edges[i]:.1f} <= BMI < {bin_edges[i+1]:.1f}' for i in range(N_GROUPS)]
    try:
        df_raw = pd.read_csv(RAW_DATA_FILE, encoding='gbk')
    except:
        df_raw = pd.read_csv(RAW_DATA_FILE, encoding='utf-8')
    df_raw = df_raw[[COL_PATIENT, COL_GA_DAYS, COL_Y_CONC, COL_BMI]].dropna()
    for col in [COL_GA_DAYS, COL_Y_CONC, COL_BMI]:
        df_raw[col] = pd.to_numeric(df_raw[col], errors='coerce')
    df_raw = df_raw.dropna()
    df_baseline = get_event_data(df_raw, CFFDNA_THRESHOLD)
    df_fuzzy = get_fuzzy_event_data(df_raw, FUZZY_LOWER_BOUND, FUZZY_UPPER_BOUND)
    df_baseline['group'] = pd.cut(df_baseline['BMI'], bins=bin_edges, labels=labels)
    df_fuzzy['group'] = pd.cut(df_fuzzy['BMI'], bins=bin_edges, labels=labels)
    fig, axes = plt.subplots(N_GROUPS, 1, figsize=(10, 18), sharex=True)
    fig.suptitle('Kaplan-Meier Curves by BMI Group: Exact vs. Fuzzy Threshold', fontsize=16)
    results_summary = []
    for i, group_label in enumerate(labels):
        ax = axes[i]
        baseline_group = df_baseline[df_baseline['group'] == group_label]
        fuzzy_group = df_fuzzy[df_fuzzy['group'] == group_label]
        rec_baseline, rec_fuzzy = np.nan, np.nan
        if not baseline_group.empty:
            kmf_b = KaplanMeierFitter().fit(baseline_group['duration'], baseline_group['observed'], label='Exact Threshold')
            kmf_b.plot(ax=ax, ci_show=True)
            rec_baseline = get_q95_recommendation(baseline_group)
        if not fuzzy_group.empty:
            kmf_f = KaplanMeierFitter().fit(fuzzy_group['duration'], fuzzy_group['observed'], label='Fuzzy Interval')
            kmf_f.plot(ax=ax, ci_show=True)
            rec_fuzzy = get_q95_recommendation(fuzzy_group)
        ax.set_title(group_label)
        ax.set_ylabel('Probability of Not Reaching Threshold')
        ax.grid(True, linestyle='--')
        ax.legend()
        results_summary.append({'Group': group_label, 'Q95_Week_Exact': rec_baseline, 'Q95_Week_Fuzzy': rec_fuzzy})
    axes[-1].set_xlabel('Gestational Week')
    plt.tight_layout(rect=[0, 0.03, 1, 0.96])
    output_path_png = os.path.join(OUTPUT_DIR, "fuzzy_interval_comparison_by_group.png")
    plt.savefig(output_path_png, dpi=150)
    print(f"Saved grouped comparison plot to {output_path_png}")
    df_summary = pd.DataFrame(results_summary)
    output_path_csv = os.path.join(OUTPUT_DIR, "fuzzy_interval_summary_by_group.csv")
    df_summary.to_csv(output_path_csv, index=False)
    print(f"Saved summary table to {output_path_csv}")
    print("\nSummary Table:")
    print(df_summary.to_string())
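按上述三个问题二脚本中声明的输入/输出路径推断(此处仅为合理的运行顺序说明,并非原文明示):p2_noise_grouped_sensitivity_analysis.py 先生成 grouped_sensitivity_results_4_groups.csv,p2_plot_sensitivity_trends.py 再读取该文件绘制切点与推荐孕周的小提琴图;p2_fuzzy_interval_modeling.py 则依赖问题二分箱步骤输出的 bmi_supervised_bins_cuts.json 中的 BMI 切点。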
p3_aft.R
RAW_CSV <- "../男胎检测数据_filtered.csv"
OUTDIR <- "outputs_joint_r"
THRESH <- 0.04
TIME_MIN <- 6.0
TIME_MAX <- 40.0
PRED_P <- c(0.90, 0.95)
PI_TIME <- 25.0
SEED <- 114514
set.seed(SEED)
if (!dir.exists(OUTDIR)) dir.create(OUTDIR, recursive = TRUE, showWarnings = FALSE)
suppressPackageStartupMessages({
  library(survival)
  library(jsonlite)
})

logit <- function(p, eps = 1e-3) {
  p <- pmax(pmin(p, 1 - eps), eps)
  log(p) - log(1 - p)
}

safe_read_csv <- function(file_path) {
  candidates <- c(file_path, file.path("..", basename(file_path)), file.path(".", basename(file_path)))
  sel <- NULL
  for (p in candidates) if (file.exists(p)) { sel <- p; break }
  if (is.null(sel)) stop("找不到数据文件:", paste(candidates, collapse = " | "))
  ok <- FALSE; df <- NULL
  if (requireNamespace("data.table", quietly = TRUE)) {
    try({ df <- data.table::fread(sel, encoding = "UTF-8", showProgress = FALSE); ok <- TRUE }, silent = TRUE)
    if (!ok) try({ df <- data.table::fread(sel, encoding = "GB18030", showProgress = FALSE); ok <- TRUE }, silent = TRUE)
  }
  if (!ok) try({ df <- read.csv(sel, stringsAsFactors = FALSE, fileEncoding = "UTF-8"); ok <- TRUE }, silent = TRUE)
  if (!ok) try({ df <- read.csv(sel, stringsAsFactors = FALSE); ok <- TRUE }, silent = TRUE)
  if (!ok) stop("CSV 读取失败(UTF-8/GB18030/默认编码均失败)")
  as.data.frame(df)
}

find_col <- function(pattern, cols) {
  m <- grep(pattern, cols, ignore.case = TRUE, value = TRUE)
  if (length(m) > 0) m[1] else NULL
}

as_frac <- function(x) {
  if (is.numeric(x)) return(x)
  xs <- trimws(as.character(x))
  if (any(grepl("%", xs, fixed = TRUE))) {
    v <- as.numeric(gsub("[%,]", "", xs)); return(v / 100)
  } else {
    v <- suppressWarnings(as.numeric(xs))
    f <- v[is.finite(v)]
    if (length(f) >= 10) {
      q95 <- suppressWarnings(as.numeric(stats::quantile(f, 0.95, na.rm = TRUE)))
      if (is.finite(q95) && q95 > 1 && q95 <= 100) return(v / 100)
    }
    return(v)
  }
}

ivf_to_cat <- function(x) {
  v <- trimws(as.character(x))
  v[v == ""] <- NA
  v <- gsub("(", "(", v, fixed = TRUE)
  v <- gsub(")", ")", v, fixed = TRUE)
  v_low <- tolower(v)
  is_art <- grepl("\\bivf\\b", v_low) | grepl("\\biui\\b", v_low) | grepl("试管", v) |
    grepl("人工授精", v) | grepl("体外受精", v) | grepl("辅助生殖", v) | grepl("\\bart\\b", v_low)
  out <- ifelse(is.na(v), "natural",
                ifelse(is_art, "art",
                       ifelse(grepl("自然受孕", v) | grepl("自然", v), "natural", "natural")))
  out
}

cat("=== Interval-censor AFT: 数据读取 ===\n")
df <- safe_read_csv(RAW_CSV)
cat(sprintf("数据维度: %d 行 %d 列\n", nrow(df), ncol(df)))
col_patient <- find_col("孕妇代码|patient", names(df))
col_ga <- find_col("检测孕天数|GA|day", names(df))
col_y <- find_col("Y染色体浓度|Y.*浓度|concentration", names(df))
if (is.null(col_patient) || is.null(col_ga) || is.null(col_y)) {
  stop("未找到必要列(孕妇代码/检测孕天数/Y染色体浓度)")
}
dat <- data.frame(
  patient_id = as.character(df[[col_patient]]),
  t_week = as.numeric(df[[col_ga]]) / 7,
  Y_frac = as_frac(df[[col_y]]),
  stringsAsFactors = FALSE
)
if ("孕妇BMI" %in% names(df)) {
  BMI_raw <- suppressWarnings(as.numeric(df[["孕妇BMI"]]))
} else {
  hcol <- find_col("身高|height", names(df))
  wcol <- find_col("体重|weight", names(df))
  if (!is.null(hcol) && !is.null(wcol)) {
    hh <- suppressWarnings(as.numeric(df[[hcol]]))
    ww <- suppressWarnings(as.numeric(df[[wcol]]))
    BMI_raw <- ww / ((hh / 100)^2)
  } else {
    BMI_raw <- rep(NA_real_, nrow(df))
  }
}
dat$BMI <- BMI_raw
acol <- find_col("年龄|age", names(df))
dat$Age <- if (!is.null(acol)) suppressWarnings(as.numeric(df[[acol]])) else NA_real_
col_ivf <- if ("IVF妊娠" %in% names(df)) "IVF妊娠" else find_col("IVF|试管|人工|IUI", names(df))
dat$IVF_row <- if (!is.null(col_ivf)) ivf_to_cat(df[[col_ivf]]) else "natural"
dat <- dat[is.finite(dat$t_week) & dat$t_week >= TIME_MIN & dat$t_week <= TIME_MAX, ]
dat$valid <- is.finite(dat$Y_frac)

get_interval <- function(d) {
  d <- d[order(d$t_week), ]
  y <- d$Y_frac
  t <- d$t_week
  ok <- which(is.finite(y))
  if (length(ok) == 0) return(c(L = NA, R = NA, type = "none"))
  y <- y[ok]; t <- t[ok]
  cross <- which(y >= THRESH)
  if (length(cross) == 0) {
    return(c(L = t[length(t)], R = Inf, type = "right"))
  } else {
    j <- cross[1]
    if (j == 1) {
      return(c(L = 0, R = t[1], type = "left"))
    } else {
      prev_neg <- max(which(y[1:(j - 1)] < THRESH))
      L <- t[prev_neg]; R <- t[j]
      return(c(L = L, R = R, type = "interval"))
    }
  }
}

iv_rows <- by(dat, dat$patient_id, get_interval)
iv_dt <- do.call(rbind, lapply(names(iv_rows), function(pid) {
  z <- iv_rows[[pid]]
  data.frame(patient_id = pid, L = as.numeric(z["L"]), R = as.numeric(z["R"]),
             type = as.character(z["type"]), stringsAsFactors = FALSE)
}))
iv_dt$L <- as.numeric(iv_dt$L); iv_dt$R <- as.numeric(iv_dt$R)
cat("区间类型计数:\n"); print(table(iv_dt$type, useNA = "ifany"))
event_out <- iv_dt[iv_dt$type %in% c("left", "interval", "right"), c("patient_id", "L", "R", "type")]
out_intervals_csv <- file.path(OUTDIR, "event_intervals.csv")
write.csv(event_out, out_intervals_csv, row.names = FALSE)
cat("已导出区间删失明细到:", out_intervals_csv, "(n=", nrow(event_out), ")\n", sep = "")

mode_char <- function(v) {
  v <- v[!is.na(v) & v != ""]
  if (!length(v)) return("natural")
  tab <- sort(table(v), decreasing = TRUE)
  names(tab)[1]
}

covs_num <- stats::aggregate(cbind(BMI, Age) ~ patient_id, data = dat,
                             FUN = function(z) suppressWarnings(median(as.numeric(z), na.rm = TRUE)))
covs_ivf <- stats::aggregate(IVF_row ~ patient_id, data = dat, FUN = mode_char)
names(covs_ivf)[2] <- "IVF_cat"
covs <- merge(covs_num, covs_ivf, by = "patient_id", all = TRUE)
if (any(!is.finite(covs$BMI))) {
  med_bmi <- median(covs$BMI[is.finite(covs$BMI)], na.rm = TRUE); if (!is.finite(med_bmi)) med_bmi <- 25
  covs$BMI[!is.finite(covs$BMI)] <- med_bmi
}
if (any(!is.finite(covs$Age))) {
  med_age <- median(covs$Age[is.finite(covs$Age)], na.rm = TRUE); if (!is.finite(med_age)) med_age <- 30
  covs$Age[!is.finite(covs$Age)] <- med_age
}
covs$IVF_cat[is.na(covs$IVF_cat) | covs$IVF_cat == ""] <- "natural"
covs$IVF_cat <- factor(covs$IVF_cat, levels = c("natural", "art"))
df_ic <- merge(iv_dt, covs, by = "patient_id", all.x = TRUE)
Surv_ic <- with(df_ic, Surv(L, R, type = "interval2"))
use_terms <- c()
if (any(is.finite(df_ic$BMI))) use_terms <- c(use_terms, "BMI")
if (any(is.finite(df_ic$Age))) use_terms <- c(use_terms, "Age")
use_ivf <- length(levels(droplevels(df_ic$IVF_cat))) >= 2
if (use_ivf) use_terms <- c(use_terms, "IVF_cat")
form <- as.formula(paste0("Surv_ic ~ ", ifelse(length(use_terms) == 0, "1", paste(use_terms, collapse = " + "))))
cat("AFT 公式:", deparse(form), "\n", sep = "")
DIST <- Sys.getenv("AFT_DIST", unset = "auto")
cands <- c("lognormal", "weibull", "loglogistic")
fit_one <- function(dist_name) {
  try(survreg(form, data = df_ic, dist = dist_name, na.action = na.omit), silent = TRUE)
}
fits <- list()
if (DIST == "auto") {
  for (d in cands) {
    obj <- fit_one(d)
    if (!inherits(obj, "try-error") && inherits(obj, "survreg")) fits[[d]] <- obj else message("[WARN] survreg 拟合失败:", d)
  }
  if (!length(fits)) stop("所有 AFT 分布拟合均失败;请检查数据。")
  aics <- sapply(fits, AIC)
  cat("各分布 AIC:\n"); print(aics)
  best_name <- names(which.min(aics))
} else {
  obj <- fit_one(DIST)
  if (inherits(obj, "try-error") || !inherits(obj, "survreg")) stop("指定分布拟合失败:", DIST)
  fits[[DIST]] <- obj; best_name <- DIST
}
fit <- fits[[best_name]]
cat(sprintf("选择分布: %s(AIC=%.1f)\n", best_name, AIC(fit)))
newdat <- covs[, c("patient_id", "BMI", "Age", "IVF_cat")]
if (!use_ivf) newdat$IVF_cat <- NULL
if ("IVF_cat" %in% names(newdat)) {
  newdat$IVF_cat <- factor(newdat$IVF_cat, levels = levels(df_ic$IVF_cat))
}
pred_q <- function(fit_obj, p_vec, newdata) {
  out <- do.call(cbind, lapply(p_vec, function(pp) as.numeric(predict(fit_obj, newdata = newdata, type = "quantile", p = pp))))
  colnames(out) <- paste0("t", as.integer(p_vec * 100))
  out
}
Q <- pred_q(fit, PRED_P, newdata = newdat)
pi_at_t <- function(fit_obj, t_scalar, newdata) {
  mu <- as.numeric(predict(fit_obj, newdata = newdata, type = "lp"))
  sc <- fit_obj$scale
  dist <- fit_obj$dist
  psurvreg(t_scalar, mean = mu, scale = sc, distribution = dist)
}
PI <- pi_at_t(fit, PI_TIME, newdat)
res <- data.frame(
  patient_id = newdat$patient_id,
  pred_t90 = Q[, "t90"],
  pred_t95 = Q[, "t95"],
  pi_25 = as.numeric(PI),
  BMI = covs$BMI[match(newdat$patient_id, covs$patient_id)],
  stringsAsFactors = FALSE
)
if (use_ivf) res$IVF_cat <- covs$IVF_cat[match(newdat$patient_id, covs$patient_id)]
out_csv <- file.path(OUTDIR, "joint_tdcox_preds.csv")
write.csv(res, out_csv, row.names = FALSE)
cat("已保存预测到:", out_csv, sprintf("(n=%d)\n", nrow(res)))
mu_lp <- as.numeric(predict(fit, newdata = newdat, type = "lp"))
sc <- as.numeric(fit$scale)
dist <- fit$dist
param_dt <- data.frame(
  patient_id = newdat$patient_id,
  BMI = covs$BMI[match(newdat$patient_id, covs$patient_id)],
  Age = covs$Age[match(newdat$patient_id, covs$patient_id)],
  IVF_cat = if ("IVF_cat" %in% names(res)) as.character(res$IVF_cat) else NA_character_,
  dist = dist,
  mu = NA_real_, sigma = NA_real_, shape = NA_real_, scale = NA_real_,
  t90 = Q[, "t90"], t95 = Q[, "t95"],
  stringsAsFactors = FALSE
)
if (dist == "lognormal") {
  param_dt$mu <- mu_lp
  param_dt$sigma <- sc
} else if (dist == "weibull") {
  param_dt$shape <- 1 / sc
  param_dt$scale <- exp(mu_lp)
} else if (dist == "loglogistic") {
  param_dt$shape <- 1 / sc
  param_dt$scale <- exp(mu_lp)
}
param_csv <- file.path(OUTDIR, "aft_params_by_patient.csv")
write.csv(param_dt, param_csv, row.names = FALSE)
cat("已导出个体分布参数到:", param_csv, " dist=", dist, "\n", sep = "")
info <- list(
  formula = deparse(form),
  selected_dist = dist,
  scale_sigma = sc,
  coef = as.list(coef(fit)),
  AIC_selected = AIC(fit),
  AIC_candidates = lapply(names(fits), function(nm) list(dist = nm, AIC = AIC(fits[[nm]])))
)
info_json <- file.path(OUTDIR, "aft_model_info.json")
writeLines(jsonlite::toJSON(info, pretty = TRUE, auto_unbox = TRUE), info_json)
cat("已写出模型信息:", info_json, "\n", sep = "")
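对选中的 log-normal 分布而言,survreg 的线性预测 μ 与尺度 σ 满足 log T ~ N(μ, σ²),因此导出到 aft_params_by_patient.csv 的个体参数可以直接复核分位点与 π(25)。下面给出一个最小示意(μ、σ 为假设值,仅作演示,与附件数据无关):

from scipy.stats import norm
import numpy as np

mu, sigma = np.log(14.0), 0.35                      # 假设的个体参数,仅作演示
t95 = np.exp(mu + sigma * norm.ppf(0.95))           # 满足 P(T <= t95) = 0.95 的孕周
pi_25 = norm.cdf((np.log(25.0) - mu) / sigma)       # π(25) = P(T <= 25)
print(round(float(t95), 2), round(float(pi_25), 3))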
p3_bmi_group_plots.py
"""
图表:
- fig_bmi_panels.png        面板图
- fig_mi_km_curves.png      区间删失 MI+KM 曲线(AFT 条件插补优先;否则 Uniform)
- fig_aft_curves_exact.png  AFT 精确曲线(基于 survreg 参数)
"""
import os, math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm, lognorm, weibull_min, fisk

sns.set_theme(style="whitegrid")
plt.rcParams.update({"font.size": 12, "axes.labelsize": 12, "axes.titlesize": 13,
                     "legend.fontsize": 10, "figure.titlesize": 14, "axes.unicode_minus": False})
for f in ["SimHei", "Microsoft YaHei", "Noto Sans CJK SC", "WenQuanYi Zen Hei", "PingFang SC", "Arial"]:
    plt.rcParams["font.sans-serif"] = [f, "DejaVu Sans", "Arial"]; break

OUT_DIR = "outputs_binning"; os.makedirs(OUT_DIR, exist_ok=True)
GROUP_CSV = os.path.join(OUT_DIR, "bmi_groups.csv")
INTERVAL_CSV = os.path.join("outputs_joint_r", "event_intervals.csv")
AFT_PARAM_CSV = os.path.join("outputs_joint_r", "aft_params_by_patient.csv")
RECOMMEND_CSV = os.path.join(OUT_DIR, "recommendations_by_group.csv")
PRED_CSV = os.path.join("outputs_joint_r", "joint_tdcox_preds.csv")
T_MIN, T_MAX, DT = 0.0, 26.0, 0.1
MI_M, MI_SEED = 200, 20240922
PAL_LINE = sns.color_palette("tab10", 10)
PAL_BAR = sns.color_palette("Set2", 8)

def require_file(path, desc):
    if not os.path.exists(path):
        raise FileNotFoundError(f"缺少 {desc}: {path}")

def load_data():
    require_file(GROUP_CSV, "bmi_groups.csv")
    groups = pd.read_csv(GROUP_CSV)
    preds = pd.read_csv(PRED_CSV) if os.path.exists(PRED_CSV) else pd.DataFrame()
    df = groups.merge(preds, on="patient_id", how="left", suffixes=("", "_pred"))
    for col in ["pred_t95", "pred_t90", "pi_25"]:
        c2 = f"{col}_pred"
        if c2 in df.columns and col not in df.columns:
            df.rename(columns={c2: col}, inplace=True)
    return df, groups, preds

def make_group_labels(groups_df: pd.DataFrame, decimals=1, include_n=True):
    stat = groups_df.groupby("group_idx").agg(n=("patient_id", "size"), bmin=("BMI", "min"), bmax=("BMI", "max")).reset_index().sort_values("group_idx")
    labels = {}; G = stat["group_idx"].tolist()
    for _, r in stat.iterrows():
        g = int(r["group_idx"]); n = int(r["n"]); lo = round(float(r["bmin"]), decimals); hi = round(float(r["bmax"]), decimals)
        if g == G[0]:
            txt = f"≤ {hi:.{decimals}f}"
        elif g == G[-1]:
            txt = f"≥ {lo:.{decimals}f}"
        else:
            txt = f"{lo:.{decimals}f}–{hi:.{decimals}f}"
        if include_n:
            txt = f"{txt} (n={n})"
        labels[g] = f"组 {g} {txt}"
    return labels

def plot_panels(df_all, df_groups):
    size_tab = df_groups["group_idx"].value_counts().sort_index().reset_index()
    size_tab.columns = ["group_idx", "n"]
    if os.path.exists(INTERVAL_CSV):
        inter = pd.read_csv(INTERVAL_CSV)
        left_rate = inter.merge(df_groups[["patient_id", "group_idx"]], on="patient_id", how="inner") \
            .groupby("group_idx")["type"].apply(lambda s: np.mean(s.values == "left")).reset_index() \
            .rename(columns={"type": "left_censor_rate"})
    else:
        left_rate = pd.DataFrame({"group_idx": sorted(df_groups["group_idx"].unique()), "left_censor_rate": np.nan})
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    ax = axes[0, 0]; sns.barplot(data=size_tab, x="group_idx", y="n", ax=ax, color=PAL_BAR[0]); ax.set_title("BMI 组样本量"); ax.set_xlabel("BMI 组"); ax.set_ylabel("样本数")
    for p in ax.patches:
        h = p.get_height(); ax.annotate(f"{h:.0f}", (p.get_x() + p.get_width()/2, h), ha="center", va="bottom", fontsize=10)
    ax = axes[0, 1]; sns.barplot(data=left_rate, x="group_idx", y="left_censor_rate", ax=ax, color=PAL_BAR[1]); ax.set_title("左删失比例"); ax.set_ylim(0, 1)
    for p in ax.patches:
        h = p.get_height()
    ax = axes[1, 0]
    if "pi_25" in df_all.columns:
        sns.boxplot(data=df_all, x="group_idx", y="pi_25", ax=ax, color=PAL_BAR[2], width=0.5, fliersize=2)
        sns.stripplot(data=df_all, x="group_idx", y="pi_25", ax=ax, color="#444", size=2, alpha=0.45, jitter=0.22)
        ax.set_title("pi_25 组间分布"); ax.set_ylim(0, 1.02)
    else:
        ax.axis("off"); ax.text(0.5, 0.5, "缺少 pi_25 列", ha="center", va="center")
    ax = axes[1, 1]
    if "pred_t95" in df_all.columns:
        sns.boxplot(data=df_all, x="group_idx", y="pred_t95", ax=ax, color=PAL_BAR[3], width=0.5, fliersize=2)
        sns.stripplot(data=df_all, x="group_idx", y="pred_t95", ax=ax, color="#444", size=2, alpha=0.45, jitter=0.22)
        ax.set_title("pred_t95(周)组间分布"); ax.set_ylim(T_MIN - 0.5, T_MAX + 0.5)
    else:
        ax.axis("off"); ax.text(0.5, 0.5, "缺少 pred_t95 列", ha="center", va="center")
    fig.tight_layout(); fig.savefig(os.path.join(OUT_DIR, "fig_bmi_panels.png"), dpi=150); plt.close(fig)
    print("[OK] 已保存:fig_bmi_panels.png")

def aft_objs_from_row(r):
    d = str(r["dist"]).strip().lower()
    if d == "lognormal" and np.isfinite(r["mu"]) and np.isfinite(r["sigma"]) and r["sigma"] > 0:
        s = float(r["sigma"]); sc = np.exp(float(r["mu"]))
        return (lambda x: lognorm.cdf(np.maximum(x, 1e-9), s=s, scale=sc),
                lambda u: lognorm.ppf(np.clip(u, 1e-12, 1 - 1e-12), s=s, scale=sc))
    if d == "weibull" and np.isfinite(r["shape"]) and np.isfinite(r["scale"]) and r["shape"] > 0 and r["scale"] > 0:
        c = float(r["shape"]); sc = float(r["scale"])
        return (lambda x: weibull_min.cdf(np.maximum(x, 1e-9), c=c, scale=sc),
                lambda u: weibull_min.ppf(np.clip(u, 1e-12, 1 - 1e-12), c=c, scale=sc))
    if d == "loglogistic" and np.isfinite(r["shape"]) and np.isfinite(r["scale"]) and r["shape"] > 0 and r["scale"] > 0:
        c = float(r["shape"]); sc = float(r["scale"])
        return (lambda x: fisk.cdf(np.maximum(x, 1e-9), c=c, scale=sc),
                lambda u: fisk.ppf(np.clip(u, 1e-12, 1 - 1e-12), c=c, scale=sc))
    return None, None

def mi_km_curves(groups_df):
    if not os.path.exists(INTERVAL_CSV):
        return None
    inter = pd.read_csv(INTERVAL_CSV)
    t_grid = np.arange(T_MIN, T_MAX + 1e-9, DT)
    rng = np.random.default_rng(MI_SEED)
    x = inter.merge(groups_df[["patient_id", "group_idx"]], on="patient_id", how="inner")
    aft = pd.read_csv(AFT_PARAM_CSV) if os.path.exists(AFT_PARAM_CSV) else pd.DataFrame()
    if not aft.empty:
        aft = aft[["patient_id", "dist", "mu", "sigma", "shape", "scale"]]
        x = x.merge(aft, on="patient_id", how="left")

    def km_from_rc(times, events):
        ord_idx = np.argsort(times); t = times[ord_idx]; e = events[ord_idx].astype(int)
        uniq = np.unique(t[e == 1])
        if uniq.size == 0:
            return np.array([]), np.array([])
        n_at = len(t); S = 1.0; S_vals = []
        for u in uniq:
            d = int(np.sum((t == u) & (e == 1))); c = int(np.sum((t == u) & (e == 0)))
            if n_at > 0:
                S *= (1.0 - d / n_at)
            S_vals.append(S); n_at -= (d + c)
        return uniq, np.array(S_vals, dtype=float)

    groups = sorted(x["group_idx"].unique())
    S_store = {g: [] for g in groups}
    for _ in range(MI_M):
        rows = []
        for _, r in x.iterrows():
            L = float(r["L"]); R = r["R"]; typ = r["type"]
            if typ == "right" or (isinstance(R, float) and (np.isinf(R))):
                rows.append((r["group_idx"], L, 0)); continue
            cdf, ppf = aft_objs_from_row(r)
            if typ == "left":
                Rt = float(R)
                if cdf is not None:
                    u = rng.uniform(0.0, max(cdf(Rt), 1e-12))
                    t = float(ppf(u))
                else:
                    t = rng.uniform(1e-6, Rt)
                rows.append((r["group_idx"], t, 1))
            else:
                Lt = float(L); Rt = float(R)
                if cdf is not None:
                    uL, uR = cdf(Lt), cdf(Rt)
                    if not (np.isfinite(uL) and np.isfinite(uR)) or uR <= uL + 1e-12:
                        t = 0.5 * (Lt + Rt) if Rt > Lt else Lt
                    else:
                        u = rng.uniform(uL, uR); t = float(ppf(u))
                else:
                    t = Lt if Rt <= Lt else rng.uniform(Lt, Rt)
                rows.append((r["group_idx"], t, 1))
        imp = pd.DataFrame(rows, columns=["group_idx", "time", "event"])
        for g in groups:
            gi = imp[imp["group_idx"] == g]
            if gi.empty:
                S_store[g].append(np.ones_like(t_grid)); continue
            times = gi["time"].values.astype(float); events = gi["event"].values.astype(int)
            ut, Sv = km_from_rc(times, events)
            idx = np.searchsorted(ut, t_grid, side="right") - 1; idx = np.clip(idx, -1, len(Sv) - 1)
            Sg = np.ones_like(t_grid, dtype=float); m = idx >= 0; Sg[m] = Sv[idx[m]]; S_store[g].append(Sg)
    curves = []
    for g in groups:
        S_arr = np.vstack(S_store[g]); S_med = np.nanmedian(S_arr, axis=0); S_q25 = np.nanpercentile(S_arr, 25, axis=0); S_q75 = np.nanpercentile(S_arr, 75, axis=0)
        curves.append(pd.DataFrame({"group_idx": g, "t": t_grid, "S_med": S_med, "S_q25": S_q25, "S_q75": S_q75}))
    return pd.concat(curves, ignore_index=True)

def read_aft_params():
    return pd.read_csv(AFT_PARAM_CSV) if os.path.exists(AFT_PARAM_CSV) else None

def aft_S_of_t(dist, params, t_grid):
    t = np.maximum(t_grid, 1e-9)
    if dist == "lognormal":
        return 1 - norm.cdf((np.log(t) - params["mu"]) / params["sigma"])
    if dist == "weibull":
        return np.exp(- (t / params["scale"]) ** params["shape"])
    if dist == "loglogistic":
        return 1.0 / (1.0 + (t / params["scale"]) ** params["shape"])
    raise ValueError("unknown dist")

def aft_curves_from_params(groups_df, aft_df):
    if aft_df is None or aft_df.empty:
        return None
    aft_df = aft_df.drop(columns=[c for c in ["group_idx"] if c in aft_df.columns]) \
        .merge(groups_df[["patient_id", "group_idx"]], on="patient_id", how="left")
    t_grid = np.arange(T_MIN, T_MAX + 1e-9, DT)
    curves = []
    for g in sorted(aft_df["group_idx"].dropna().unique()):
        sub = aft_df[aft_df["group_idx"] == g]; S_mat = []
        for _, r in sub.iterrows():
            d = str(r["dist"]).strip().lower()
            if d == "lognormal" and np.isfinite(r["mu"]) and np.isfinite(r["sigma"]) and r["sigma"] > 0:
                S = aft_S_of_t("lognormal", {"mu": float(r["mu"]), "sigma": float(r["sigma"])}, t_grid)
            elif d == "weibull" and np.isfinite(r["shape"]) and np.isfinite(r["scale"]) and r["shape"] > 0 and r["scale"] > 0:
                S = aft_S_of_t("weibull", {"shape": float(r["shape"]), "scale": float(r["scale"])}, t_grid)
            elif d == "loglogistic" and np.isfinite(r["shape"]) and np.isfinite(r["scale"]) and r["shape"] > 0 and r["scale"] > 0:
                S = aft_S_of_t("loglogistic", {"shape": float(r["shape"]), "scale": float(r["scale"])}, t_grid)
            else:
                continue
            S_mat.append(S)
        if not S_mat:
            continue
        S_arr = np.vstack(S_mat); S_med = np.nanmedian(S_arr, axis=0); S_q25 = np.nanpercentile(S_arr, 25, axis=0); S_q75 = np.nanpercentile(S_arr, 75, axis=0)
        curves.append(pd.DataFrame({"group_idx": int(g), "t": t_grid, "S_med": S_med, "S_q25": S_q25, "S_q75": S_q75}))
    return pd.concat(curves, ignore_index=True) if curves else None

def plot_mi_km(groups_df, km_curves_df, rec_map):
    if km_curves_df is None or km_curves_df.empty:
        return
    labels = make_group_labels(groups_df, include_n=True)
    fig, ax = plt.subplots(figsize=(9, 6))
    for i, g in enumerate(sorted(km_curves_df["group_idx"].unique())):
        gd = km_curves_df[km_curves_df["group_idx"] == g]; col = PAL_LINE[i % len(PAL_LINE)]
        ax.step(gd["t"], gd["S_med"], where="post", color=col, lw=2, label=labels.get(int(g), f"组{g}"))
        ax.fill_between(gd["t"], gd["S_q25"], gd["S_q75"], color=col, alpha=0.18, step="post")
        if int(g) in rec_map:
            v = rec_map[int(g)]; ax.axvline(v, color=col, ls="--", lw=1.2); ax.text(v, 0.06, "推荐", color=col, fontsize=9, ha="right", va="bottom", rotation=90)
    ax.set_xlim(T_MIN, T_MAX); ax.set_ylim(0, 1.02); ax.set_xlabel("孕周(周)"); ax.set_ylabel("S(t) = P(T > t)")
    ax.set_title("区间删失 MI+KM 生存曲线(AFT 条件插补)"); ax.legend(title="BMI 组(范围)"); fig.tight_layout()
    fig.savefig(os.path.join(OUT_DIR, "fig_mi_km_curves.png"), dpi=150); plt.close(fig)

def plot_aft(groups_df, aft_curves_df, rec_map):
    if aft_curves_df is None or aft_curves_df.empty:
        return
    labels = make_group_labels(groups_df, include_n=True)
    fig, ax = plt.subplots(figsize=(9, 6))
    for i, g in enumerate(sorted(aft_curves_df["group_idx"].unique())):
        gd = aft_curves_df[aft_curves_df["group_idx"] == g]; col = PAL_LINE[i % len(PAL_LINE)]
        ax.plot(gd["t"], gd["S_med"], color=col, lw=2, label=labels.get(int(g), f"组{g}"))
        ax.fill_between(gd["t"], gd["S_q25"], gd["S_q75"], color=col, alpha=0.18)
        if int(g) in rec_map:
            v = rec_map[int(g)]; ax.axvline(v, color=col, ls="--", lw=1.2); ax.text(v, 0.06, "推荐", color=col, fontsize=9, ha="right", va="bottom", rotation=90)
    ax.set_xlim(T_MIN, T_MAX); ax.set_ylim(0, 1.02); ax.set_xlabel("孕周(周)"); ax.set_ylabel("S(t) = P(T > t)")
    ax.set_title("AFT 组生存曲线(基于 survreg 精确参数)"); ax.legend(title="BMI 组(范围)"); fig.tight_layout()
    fig.savefig(os.path.join(OUT_DIR, "fig_aft_curves_exact.png"), dpi=150); plt.close(fig)

def main():
    df_all, groups, _ = load_data(); plot_panels(df_all, groups)
    rec_map = {}
    if os.path.exists(RECOMMEND_CSV):
        rec = pd.read_csv(RECOMMEND_CSV)
        rec_map = {int(r["group_idx"]): float(r["recommended_week"]) for _, r in rec.iterrows() if np.isfinite(r.get("recommended_week", np.nan))}
    km_curves = mi_km_curves(groups)
    plot_mi_km(groups, km_curves, rec_map)
    aft_df = read_aft_params(); aft_curves = aft_curves_from_params(groups, aft_df)
    plot_aft(groups, aft_curves, rec_map)
    print("[OK] 三张图已生成到 outputs_binning/")

if __name__ == "__main__":
    main()
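上面 km_from_rc(以及下一脚本 p3_bmi_supervised_binning.py 中的 _km_from_right_censored)实现的都是标准的 Kaplan–Meier 乘积极限估计:对插补后的右删失样本,记 $t_i$ 为各事件时刻、$d_i$ 为该时刻的达标人数、$n_i$ 为该时刻前仍处于风险集中的人数,则

$$\hat S(t)=\prod_{t_i\le t}\Bigl(1-\frac{d_i}{n_i}\Bigr),$$

组内推荐时点 t95 取满足 $\hat S(t)\le 0.05$ 的最早时刻。
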
p3_bmi_supervised_binning.py
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 """ BMI 监督分箱 + 评估 + 每组推荐时点(标准档,KM(MI) 优先) - 改进:MI 插补支持“基于 AFT 条件分布”的半参数插补(优先采用);若缺少参数文件则回退 Uniform。 - 继续输出:KM t95 的中位与 IQR、AFT 个体 t95 的中位与 IQR、AFT 与 KM 的对齐度(8–24 周)。 """ import os, sys, mathimport numpy as npimport pandas as pdimport jsontry : from scipy.stats import chi2, norm, lognorm, weibull_min, fisk SCIPY_OK = True except Exception: SCIPY_OK = False PRED_CSV = os.path.join("outputs_joint_r" , "joint_tdcox_preds.csv" ) INTERVALS_CSV = os.path.join("outputs_joint_r" , "event_intervals.csv" ) AFT_PARAM_CSV = os.path.join("outputs_joint_r" , "aft_params_by_patient.csv" ) OUT_DIR = "outputs_binning" os.makedirs(OUT_DIR, exist_ok=True ) TARGET_K = 3 MIN_LEAF = 30 MIN_GAIN = 1e-5 USE_METRIC = "pi_25" N_QUANT_CANDIDATES = 400 MI_M = 200 MI_SEED = 114514 Q_T95 = 0.05 Q_T90 = 0.10 USE_AFT_CONDITIONAL_MI = True RECOMMEND_ROUND_STEP = 0.5 T_MIN, T_MAX, DT = 0.0 , 26.0 , 0.1 ALIGN_LO, ALIGN_HI = 8.0 , 24.0 def _sse (x: np.ndarray ) -> float : if len (x) == 0 : return 0.0 mu = float (np.mean(x)); return float (((x - mu) ** 2 ).sum ())def _round_to_step (x: float , step: float = 0.5 ): if x is None or not np.isfinite(x): return float ("nan" ) return round (x / step) * stepdef _load_predictions (path=PRED_CSV ): if not os.path.exists(path): print (f"[ERROR] 预测文件不存在: {path} " ); sys.exit(1 ) df = pd.read_csv(path) metric = USE_METRIC if USE_METRIC in df.columns else None if metric is None : for alt in ["pi_25" ,"pred_t95" ,"pred_t90" ]: if alt in df.columns: print (f"[WARN] 未找到 {USE_METRIC} ;回退到 {alt} 作为监督目标" ); metric = alt; break if metric is None : print ("[ERROR] 无可用监督目标列" ); sys.exit(1 ) needed = 
{"patient_id" ,"BMI" ,metric} if not needed.issubset(df.columns): print (f"[ERROR] 缺列: {needed - set (df.columns)} " ); sys.exit(1 ) df = df[np.isfinite(df["BMI" ]) & np.isfinite(df[metric])].copy() keep = ["patient_id" ,"BMI" ,"pred_t90" ,"pred_t95" ,"pi_25" ]; keep=[c for c in keep if c in df.columns] df = df[list (dict .fromkeys(["patient_id" ,"BMI" ,metric] + keep))].copy() df.rename(columns={metric:"target" }, inplace=True ) if "pred_t95" in df.columns: rho = pd.Series(df["BMI" ]).corr(pd.Series(df["pred_t95" ]), method="spearman" ) print (f"[DIAG] Spearman: BMI vs pred_t95 = {rho:.3 f} " ) if "pi_25" in df.columns: rho = pd.Series(df["BMI" ]).corr(pd.Series(df["pi_25" ]), method="spearman" ) print (f"[DIAG] Spearman: BMI vs pi_25 = {rho:.3 f} " ) print (f"[INFO] 使用监督目标: {USE_METRIC} ;有效样本数: n={len (df)} " ) return df, USE_METRICdef _best_split_one_leaf (df_leaf: pd.DataFrame, min_leaf=MIN_LEAF ): x = df_leaf["BMI" ].values qs = np.linspace(0.02 , 0.98 , max (2 , int (N_QUANT_CANDIDATES))) cuts = np.unique(np.quantile(x, qs)) if cuts.size == 0 : return None , float ("-inf" ) base = _sse(df_leaf["target" ].values) best_gain = float ("-inf" ); best_cut=None for c in cuts: left = df_leaf[df_leaf["BMI" ] <= c]; right= df_leaf[df_leaf["BMI" ] > c] if len (left) < min_leaf or len (right) < min_leaf: continue gain = base - (_sse(left["target" ].values) + _sse(right["target" ].values)) if gain > best_gain: best_gain=gain; best_cut=float (c) return best_cut, best_gaindef _greedy_supervised (df: pd.DataFrame, target_k=TARGET_K, min_leaf=MIN_LEAF, min_gain=MIN_GAIN ): leaves=[df.copy()]; cuts=[] while len (leaves) < target_k: best_idx=None ; best_cut=None ; best_gain=float ("-inf" ) for i, leaf in enumerate (leaves): cut, gain = _best_split_one_leaf(leaf, min_leaf=min_leaf) if cut is not None and gain > best_gain: best_idx=i; best_cut=cut; best_gain=gain if best_cut is None or best_gain < min_gain: break leaf = leaves.pop(best_idx) left = leaf[leaf["BMI" ] <= best_cut].copy(); right= leaf[leaf["BMI" ] > best_cut].copy() leaves.extend([left,right]); cuts.append(best_cut) cuts=sorted (cuts) labels=[] for b in df["BMI" ].values: g=0 for c in cuts: if b>c: g+=1 labels.append(g) out=df.copy(); out["group_idx" ]=labels ok=True for g in sorted (out["group_idx" ].unique()): if int ((out["group_idx" ]==g).sum ()) < MIN_LEAF: ok=False ; break return out, cuts, okdef _fallback_equal_frequency (df: pd.DataFrame, k=TARGET_K ): n=len (df); if k<=1 or n<=1 : out=df.copy(); out["group_idx" ]=0 ; edges=np.array([df["BMI" ].min (), df["BMI" ].max ()]); return out, edges df_sorted=df.sort_values("BMI" ).reset_index(drop=True ); b=df_sorted["BMI" ].values idxs=sorted (set ([int (round (i*n/k)) for i in range (1 ,k) if 0 <int (round (i*n/k))<n])) cuts=[] for idx in idxs: lb=b[idx-1 ]; rb=b[idx] if rb>lb: cuts.append((lb+rb)/2.0 ) cuts=sorted (set (cuts)) labels=[] for v in df["BMI" ].values: g=0 for c in cuts: if v>c: g+=1 labels.append(g) out=df.copy(); out["group_idx" ]=labels edges=np.array([df["BMI" ].min ()]+cuts+[df["BMI" ].max ()], dtype=float ) return out, edgesdef _group_labels_from_bmi (df_g: pd.DataFrame ): gstats=df_g.groupby("group_idx" ).agg(bmin=("BMI" ,"min" ), bmax=("BMI" ,"max" )).reset_index().sort_values("group_idx" ) lut={int (r["group_idx" ]): f"[{r['bmin' ]:.2 f} , {r['bmax' ]:.2 f} ]" for _, r in gstats.iterrows()} return df_g["group_idx" ].map (lut), lutdef _describe_groups (df_g: pd.DataFrame ): return df_g.groupby("group_idx" ).agg( n=("BMI" ,"size" ), bmi_min=("BMI" ,"min" ), 
bmi_max=("BMI" ,"max" ), bmi_med=("BMI" ,"median" ), target_mean=("target" ,"mean" ), target_med=("target" ,"median" ) ).reset_index().sort_values("group_idx" )def _read_aft_params (): if not (USE_AFT_CONDITIONAL_MI and os.path.exists(AFT_PARAM_CSV)): return None df = pd.read_csv(AFT_PARAM_CSV) for c in ["mu" ,"sigma" ,"shape" ,"scale" ]: if c in df.columns: df[c]=pd.to_numeric(df[c], errors="coerce" ) keep = ["patient_id" ,"dist" ,"mu" ,"sigma" ,"shape" ,"scale" ] keep = [c for c in keep if c in df.columns] return df[keep].copy()def _km_from_right_censored (times: np.ndarray, events: np.ndarray ): ord_idx=np.argsort(times); t=times[ord_idx]; e=events[ord_idx].astype(int ) uniq_times=np.unique(t[e==1 ]) if uniq_times.size==0 : return np.array([]), np.array([]) n_at=len (t); S=1.0 ; S_vals=[] for u in uniq_times: d=int (np.sum ((t==u)&(e==1 ))); c=int (np.sum ((t==u)&(e==0 ))) if n_at>0 : S*=(1.0 - d/n_at) S_vals.append(S); n_at -= (d+c) return uniq_times, np.array(S_vals, dtype=float )def _km_quantile (uniq_times: np.ndarray, S_vals: np.ndarray, alpha: float ): if uniq_times.size==0 : return float ("nan" ) idx=np.where(S_vals<=alpha)[0 ] if idx.size==0 : return float ("nan" ) return float (uniq_times[idx[0 ]])def _aft_dist_objs (row ): """返回 (cdf, ppf) 两个可调用对象,用于该个体的分布。""" dist=str (row["dist" ]).strip().lower() if dist=="lognormal" and np.isfinite(row["mu" ]) and np.isfinite(row["sigma" ]) and row["sigma" ]>0 : s=float (row["sigma" ]); sc=np.exp(float (row["mu" ])) def cdf (x ): return lognorm.cdf(np.maximum(x,1e-9 ), s=s, scale=sc) def ppf (u ): return lognorm.ppf(np.clip(u, 1e-12 , 1 -1e-12 ), s=s, scale=sc) return cdf, ppf if dist=="weibull" and np.isfinite(row["shape" ]) and np.isfinite(row["scale" ]) and row["shape" ]>0 and row["scale" ]>0 : c=float (row["shape" ]); sc=float (row["scale" ]) def cdf (x ): return weibull_min.cdf(np.maximum(x,1e-9 ), c=c, scale=sc) def ppf (u ): return weibull_min.ppf(np.clip(u,1e-12 ,1 -1e-12 ), c=c, scale=sc) return cdf, ppf if dist=="loglogistic" and np.isfinite(row["shape" ]) and np.isfinite(row["scale" ]) and row["shape" ]>0 and row["scale" ]>0 : c=float (row["shape" ]); sc=float (row["scale" ]) def cdf (x ): return fisk.cdf(np.maximum(x,1e-9 ), c=c, scale=sc) def ppf (u ): return fisk.ppf(np.clip(u,1e-12 ,1 -1e-12 ), c=c, scale=sc) return cdf, ppf return None , None def _mi_km_summary_and_curves (intervals_df: pd.DataFrame, groups_df: pd.DataFrame, aft_params: pd.DataFrame = None ): """ 返回: - km_summary: 各组 KM t95/t90 的 MI 中位数与 IQR - km_curves: 各组的 S_med/S_q25/S_q75 曲线(用于对齐度) 说明: - 若提供 aft_params,则对 left/interval 使用 AFT 条件分布插补; 否则使用 Uniform(L,R) 或 (0,R) 插补。 """ rng=np.random.default_rng(MI_SEED) df = intervals_df.merge(groups_df[["patient_id" ,"group_idx" ]], on="patient_id" , how="inner" ) if aft_params is not None : df = df.merge(aft_params, on="patient_id" , how="left" ) groups=sorted (df["group_idx" ].unique()) t_grid=np.arange(T_MIN, T_MAX+1e-9 , DT) S_store={g:[] for g in groups}; t95_list={g:[] for g in groups}; t90_list={g:[] for g in groups} for _ in range (MI_M): imp_rows=[] for _, r in df.iterrows(): L=float (r["L" ]); R=r["R" ]; typ=r["type" ] if typ=="right" or (isinstance (R,float ) and (math.isinf(R) or np.isinf(R))): imp_rows.append((r["patient_id" ], r["group_idx" ], L, 0 )) continue if aft_params is not None and pd.notna(r.get("dist" , np.nan)): cdf, ppf = _aft_dist_objs(r) else : cdf, ppf = None , None if typ=="left" : Rt=float (R) if cdf is not None and ppf is not None : u_low = 0.0 u_high = float (cdf(Rt)) if not 
np.isfinite(u_high) or u_high <= 1e-12 : t = max (1e-6 , Rt*0.5 ) else : u = rng.uniform(u_low, max (u_high, u_low+1e-12 )) t = float (ppf(u)) else : t = rng.uniform(1e-6 , Rt) imp_rows.append((r["patient_id" ], r["group_idx" ], t, 1 )) else : Lt=float (L); Rt=float (R) if cdf is not None and ppf is not None : u_low = float (cdf(Lt)) u_high = float (cdf(Rt)) if not (np.isfinite(u_low) and np.isfinite(u_high)) or u_high <= u_low + 1e-12 : t = 0.5 *(Lt+Rt) if Rt>Lt else Lt else : u = rng.uniform(u_low, u_high) t = float (ppf(u)) else : t = Lt if Rt<=Lt else rng.uniform(Lt, Rt) imp_rows.append((r["patient_id" ], r["group_idx" ], t, 1 )) imp_df=pd.DataFrame(imp_rows, columns=["patient_id" ,"group_idx" ,"time" ,"event" ]) for g in groups: gi=imp_df[imp_df["group_idx" ]==g] if gi.empty: S_store[g].append(np.ones_like(t_grid)); t95_list[g].append(np.nan); t90_list[g].append(np.nan); continue times=gi["time" ].values.astype(float ); events=gi["event" ].values.astype(int ) ut, Sv = _km_from_right_censored(times, events) t95_list[g].append(_km_quantile(ut, Sv, Q_T95)) t90_list[g].append(_km_quantile(ut, Sv, Q_T90)) idx=np.searchsorted(ut, t_grid, side="right" )-1 ; idx=np.clip(idx, -1 , len (Sv)-1 ) Sg=np.ones_like(t_grid, dtype=float ); m=idx>=0 ; Sg[m]=Sv[idx[m]]; S_store[g].append(Sg) rows=[]; curves=[] for g in groups: arr95=np.array(t95_list[g], dtype=float ); arr90=np.array(t90_list[g], dtype=float ) rows.append({ "group_idx" : g, "KM_t95_MI_med" : float (np.nanmedian(arr95)), "KM_t95_q25" : float (np.nanpercentile(arr95,25 )) if np.isfinite(arr95).any () else np.nan, "KM_t95_q75" : float (np.nanpercentile(arr95,75 )) if np.isfinite(arr95).any () else np.nan, "KM_t90_MI_med" : float (np.nanmedian(arr90)), "KM_t90_q25" : float (np.nanpercentile(arr90,25 )) if np.isfinite(arr90).any () else np.nan, "KM_t90_q75" : float (np.nanpercentile(arr90,75 )) if np.isfinite(arr90).any () else np.nan, }) S_arr=np.vstack(S_store[g]); S_med=np.nanmedian(S_arr,axis=0 ); S_q25=np.nanpercentile(S_arr,25 ,axis=0 ); S_q75=np.nanpercentile(S_arr,75 ,axis=0 ) curves.append(pd.DataFrame({"group_idx" : g, "t" : t_grid, "S_med" : S_med, "S_q25" : S_q25, "S_q75" : S_q75})) km_summary=pd.DataFrame(rows).sort_values("group_idx" ) km_curves=pd.concat(curves, ignore_index=True ) return km_summary, km_curvesdef _aft_S_of_t (dist: str , params: dict , t_grid: np.ndarray ) -> np.ndarray: t = np.maximum(t_grid, 1e-9 ) d = str (dist).strip().lower() if d == "lognormal" : mu = float (params["mu" ]); sigma = float (params["sigma" ]) z = (np.log(t) - mu) / sigma if SCIPY_OK: return 1.0 - norm.cdf(z) else : return 0.5 * np.erfc(z / np.sqrt(2.0 )) elif d == "weibull" : shape = float (params["shape" ]); scale = float (params["scale" ]) return np.exp(- (t / scale) ** shape) elif d == "loglogistic" : shape = float (params["shape" ]); scale = float (params["scale" ]) return 1.0 / (1.0 + (t / scale) ** shape) else : raise ValueError("unknown dist" )def _aft_curves_from_params (groups_df: pd.DataFrame, aft_df: pd.DataFrame ): """ 由个体 AFT 参数计算每组的中位生存曲线与 25–75% 分位带。 返回列:group_idx, t, S_med, S_q25, S_q75 """ if aft_df is None or aft_df.empty: return None use_cols = [c for c in ["patient_id" ,"dist" ,"mu" ,"sigma" ,"shape" ,"scale" ] if c in aft_df.columns] a = aft_df[use_cols].copy() a = a.merge(groups_df[["patient_id" ,"group_idx" ]], on="patient_id" , how="left" ) t_grid = np.arange(T_MIN, T_MAX+1e-9 , DT) curves=[] for g in sorted (a["group_idx" ].dropna().unique()): sub = a[a["group_idx" ]==g] S_mat=[] for _, r in sub.iterrows(): d = str 
(r["dist" ]).strip().lower() try : if d=="lognormal" and np.isfinite(r.get("mu" , np.nan)) and np.isfinite(r.get("sigma" , np.nan)) and r["sigma" ]>0 : S = _aft_S_of_t("lognormal" , {"mu" : float (r["mu" ]), "sigma" : float (r["sigma" ])}, t_grid) elif d=="weibull" and np.isfinite(r.get("shape" , np.nan)) and np.isfinite(r.get("scale" , np.nan)) and r["shape" ]>0 and r["scale" ]>0 : S = _aft_S_of_t("weibull" , {"shape" : float (r["shape" ]), "scale" : float (r["scale" ])}, t_grid) elif d=="loglogistic" and np.isfinite(r.get("shape" , np.nan)) and np.isfinite(r.get("scale" , np.nan)) and r["shape" ]>0 and r["scale" ]>0 : S = _aft_S_of_t("loglogistic" , {"shape" : float (r["shape" ]), "scale" : float (r["scale" ])}, t_grid) else : continue except Exception: continue S_mat.append(S) if not S_mat: continue S_arr = np.vstack(S_mat) S_med = np.nanmedian(S_arr, axis=0 ) S_q25 = np.nanpercentile(S_arr, 25 , axis=0 ) S_q75 = np.nanpercentile(S_arr, 75 , axis=0 ) curves.append(pd.DataFrame({"group_idx" : int (g), "t" : t_grid, "S_med" : S_med, "S_q25" : S_q25, "S_q75" : S_q75})) return pd.concat(curves, ignore_index=True ) if curves else None def _alignment_metrics (km_curves: pd.DataFrame, aft_curves: pd.DataFrame, lo=ALIGN_LO, hi=ALIGN_HI ): if km_curves is None or aft_curves is None : return None res=[] for g in sorted (set (km_curves["group_idx" ]).intersection(set (aft_curves["group_idx" ]))): km_g=km_curves[km_curves["group_idx" ]==g]; aft_g=aft_curves[aft_curves["group_idx" ]==g] grid=np.intersect1d(km_g["t" ].values, aft_g["t" ].values) mask=(grid>=lo)&(grid<=hi); grid=grid[mask] if grid.size==0 : res.append({"group_idx" : int (g), "align_L1_8_24" : np.nan, "align_sup_8_24" : np.nan}) continue km_med=km_g.set_index("t" ).loc[grid, "S_med" ].values aft_med=aft_g.set_index("t" ).loc[grid, "S_med" ].values diff=np.abs (km_med - aft_med) res.append({"group_idx" : int (g), "align_L1_8_24" : float (np.mean(diff)), "align_sup_8_24" : float (np.max (diff))}) return pd.DataFrame(res).sort_values("group_idx" )def _make_recommendations (km_summary: pd.DataFrame, df_groups: pd.DataFrame, left_censor: pd.DataFrame = None , aft_q: pd.DataFrame = None , align_df: pd.DataFrame = None ): if "group_label" in df_groups.columns: label_map = df_groups.groupby("group_idx" )["group_label" ].agg(lambda s: s.dropna().iloc[0 ] if s.dropna().size>0 else None ).to_dict() else : _, lut = _group_labels_from_bmi(df_groups); label_map = lut sizes = df_groups["group_idx" ].value_counts().sort_index() work = km_summary.copy() if aft_q is not None : work = work.merge(aft_q, on="group_idx" , how="left" ) if align_df is not None : work = work.merge(align_df, on="group_idx" , how="left" ) recs=[] for _, row in work.sort_values("group_idx" ).iterrows(): g=int (row["group_idx" ]) t95_km = float (row["KM_t95_MI_med" ]) if np.isfinite(row["KM_t95_MI_med" ]) else float ("nan" ) t95_fill = float (row["AFT_t95_med" ]) if "AFT_t95_med" in row.index and np.isfinite(row["AFT_t95_med" ]) else np.nan if np.isfinite(t95_km): rec_raw=t95_km; note="KM_t95_MI_med" elif np.isfinite(t95_fill): rec_raw=t95_fill; note="fallback: AFT group median t95" else : rec_raw=float ("nan" ); note="no t95 available" rec=_round_to_step(rec_raw, RECOMMEND_ROUND_STEP) recs.append({ "group_idx" : g, "group_label" : label_map.get(g, "" ), "n" : int (sizes.get(g, 0 )), "KM_t95_MI_med" : t95_km, "KM_t95_q25" : float (row.get("KM_t95_q25" , np.nan)), "KM_t95_q75" : float (row.get("KM_t95_q75" , np.nan)), "AFT_t95_med" : float (row.get("AFT_t95_med" , np.nan)), 
"AFT_t95_q25" : float (row.get("AFT_t95_q25" , np.nan)), "AFT_t95_q75" : float (row.get("AFT_t95_q75" , np.nan)), "align_L1_8_24" : float (row.get("align_L1_8_24" , np.nan)), "align_sup_8_24" : float (row.get("align_sup_8_24" , np.nan)), "recommended_week" : rec, "notes" : note }) rec_df=pd.DataFrame(recs).sort_values("group_idx" ) if left_censor is not None and {"group_idx" ,"left_censor_rate" }.issubset(left_censor.columns): rec_df=rec_df.merge(left_censor, on="group_idx" , how="left" ) out_csv=os.path.join(OUT_DIR, "recommendations_by_group.csv" ) rec_df.to_csv(out_csv, index=False , encoding="utf-8-sig" ) print (f"[OK] 已保存推荐与指标:{out_csv} " ) return rec_dfdef main (): print ("[INFO] 读取 R 侧预测:" , PRED_CSV) df, _ = _load_predictions(PRED_CSV) sup_df, sup_cuts, sup_ok = _greedy_supervised(df, TARGET_K, MIN_LEAF, MIN_GAIN) print (f"[INFO] 监督分箱 cutpoints(BMI): {', ' .join(f'{c:.3 f} ' for c in sorted (sup_cuts))} " if sup_cuts else "[INFO] 单叶" ) print (_describe_groups(sup_df).to_string(index=False )) final_df = sup_df.copy() used_fallback = False if (sup_df["group_idx" ].nunique() < TARGET_K) or (not sup_ok): print ("[WARN] 监督分箱未达标,回退等频分箱" ) final_df, _ = _fallback_equal_frequency(df, k=TARGET_K) print (_describe_groups(final_df).to_string(index=False )) used_fallback = True final_df["group_label" ], _ = _group_labels_from_bmi(final_df) keep = ["patient_id" ,"BMI" ,"group_idx" ,"group_label" ,"target" ] + [c for c in ["pi_25" ,"pred_t95" ,"pred_t90" ] if c in df.columns] final_df[keep].sort_values(["group_idx" ,"BMI" ,"patient_id" ]).to_csv(os.path.join(OUT_DIR, "bmi_groups.csv" ), index=False , encoding="utf-8-sig" ) cuts_for_json = [] uniq_groups = sorted (final_df["group_idx" ].unique()) K = len (uniq_groups) if (not used_fallback) and (len (sup_cuts) == max (0 , K - 1 )): cuts_for_json = sorted (map (float , sup_cuts)) else : gstats = final_df.groupby("group_idx" ).agg(bmin=("BMI" ,"min" ), bmax=("BMI" ,"max" )).sort_index() est = [] for g in range (K - 1 ): left_max = float (gstats.loc[g, "bmax" ]) right_min = float (gstats.loc[g + 1 , "bmin" ]) if np.isfinite(left_max) and np.isfinite(right_min): est.append(0.5 * (left_max + right_min)) cuts_for_json = sorted (est) cuts_obj = {"chosen" : {"cuts_final" : cuts_for_json, "k" : int (K), "source" : ("supervised" if not used_fallback else "equal_freq" )}} with open (os.path.join(OUT_DIR, "bmi_supervised_bins_cuts.json" ), "w" , encoding="utf-8" ) as f: json.dump(cuts_obj, f, ensure_ascii=False , indent=2 ) print (f"[OK] 已写出分箱 cuts JSON: {os.path.join(OUT_DIR, 'bmi_supervised_bins_cuts.json' )} " ) if not os.path.exists(INTERVALS_CSV): print (f"[ERROR] 缺少 {INTERVALS_CSV} " ); sys.exit(1 ) intervals = pd.read_csv(INTERVALS_CSV) if not {"patient_id" ,"L" ,"R" ,"type" }.issubset(intervals.columns): print ("[ERROR] event_intervals.csv 缺列" ); sys.exit(1 ) left_censor = intervals.merge(final_df[["patient_id" ,"group_idx" ]], on="patient_id" , how="inner" ) \ .groupby("group_idx" )["type" ].apply(lambda s: np.mean(s.values=="left" )).reset_index() \ .rename(columns={"type" :"left_censor_rate" }) left_censor.to_csv(os.path.join(OUT_DIR,"left_censor_by_group.csv" ), index=False , encoding="utf-8-sig" ) aft_params_for_mi = _read_aft_params() if aft_params_for_mi is None : print ("[WARN] 未找到 AFT 参数或关闭了 AFT 条件插补,改用 Uniform MI。" ) km_summary, km_curves = _mi_km_summary_and_curves(intervals, final_df[["patient_id" ,"group_idx" ]], aft_params=aft_params_for_mi) aft_q = None if os.path.exists(AFT_PARAM_CSV): a = pd.read_csv(AFT_PARAM_CSV) if "patient_id" in 
a.columns: a2 = a.merge(final_df[["patient_id" ,"group_idx" ]], on="patient_id" , how="left" ) if "t95" in a2.columns: g = a2.groupby("group_idx" )["t95" ] aft_q = g.median().rename("AFT_t95_med" ).to_frame() aft_q["AFT_t95_q25" ] = g.quantile(0.25 ) aft_q["AFT_t95_q75" ] = g.quantile(0.75 ) aft_q = aft_q.reset_index().sort_values("group_idx" ) aft_curves = None if os.path.exists(AFT_PARAM_CSV): try : aft_df_full = pd.read_csv(AFT_PARAM_CSV) aft_curves = _aft_curves_from_params(final_df[["patient_id" ,"group_idx" ]], aft_df_full) except Exception as e: print (f"[WARN] 计算 AFT 曲线失败:{e} " ) aft_curves = None align_df = _alignment_metrics(km_curves, aft_curves, lo=ALIGN_LO, hi=ALIGN_HI) if aft_curves is not None else None if align_df is not None and not align_df.empty: print ("[OK] KM–AFT 对齐度已计算并并入 recommendations_by_group.csv" ) else : print ("[WARN] 未生成对齐度(缺少 AFT 曲线或无可比区间)" ) km_summary.to_csv(os.path.join(OUT_DIR,"km_quantiles_by_group.csv" ), index=False , encoding="utf-8-sig" ) _make_recommendations(km_summary, final_df, left_censor=left_censor, aft_q=aft_q, align_df=align_df)if __name__ == "__main__" : main()
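上述 main() 写出的 bmi_supervised_bins_cuts.json 是后续两份敏感性分析脚本(p3_fuzzy_interval_modeling.py 与 p3_noise_grouped_sensitivity_analysis.py)读取 BMI 分组切点的接口文件。下面给出一个最小的读取示意(假设该 JSON 位于当前工作目录;注释中的切点数值为虚构示例,并非实际运行结果):

import json
import numpy as np

# 文件结构与 main() 中的 cuts_obj 一致,例如(数值为虚构):
# {"chosen": {"cuts_final": [30.1, 33.8], "k": 3, "source": "supervised"}}
with open("bmi_supervised_bins_cuts.json", "r", encoding="utf-8") as f:
    chosen = json.load(f)["chosen"]
edges = [-np.inf] + list(chosen["cuts_final"]) + [np.inf]  # k 个 BMI 区间的边界
print(chosen["k"], chosen["source"], edges)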
p3_fuzzy_interval_modeling.py
import os, json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
try:
    from scipy.stats import lognorm, weibull_min, fisk
    SCIPY_OK = True
except Exception:
    SCIPY_OK = False

RAW_DATA_FILE = "C:/Users/yezf8/Documents/Y3Repo/25C题/男胎检测数据_filtered.csv"
OUTPUT_DIR = "C:/Users/yezf8/Documents/Y3Repo/25C题/problem3/sensitivity_analysis_outputs"
CUTS_FILE = "C:/Users/yezf8/Documents/Y3Repo/25C题/problem3/outputs_binning/bmi_supervised_bins_cuts.json"
BINS_DIR = os.path.dirname(CUTS_FILE)
GROUPS_CSV = os.path.join(BINS_DIR, "bmi_groups.csv")
DEFAULT_K = 3
AFT_PARAM_CSV = "C:/Users/yezf8/Documents/Y3Repo/25C题/problem3/outputs_joint_r/aft_params_by_patient.csv"
COL_PATIENT, COL_GA_DAYS, COL_Y_CONC, COL_BMI = "孕妇代码", "检测孕天数", "Y染色体浓度", "孕妇BMI"
CFFDNA_THRESHOLD = 0.04
FUZZY_LOWER_BOUND = 0.039
FUZZY_UPPER_BOUND = 0.041
os.makedirs(OUTPUT_DIR, exist_ok=True)


def safe_read_csv(path):
    try:
        return pd.read_csv(path, encoding="gbk")
    except Exception:
        return pd.read_csv(path, encoding="utf-8")


def get_event_data(df, threshold):
    rows = []
    for _, g in df.groupby(COL_PATIENT):
        g = g.sort_values(COL_GA_DAYS)
        weeks = g[COL_GA_DAYS].values / 7.0
        y = g[COL_Y_CONC].values
        hit_idx = np.where(y >= threshold)[0]
        if hit_idx.size > 0:
            i = int(hit_idx[0])
            duration = float(weeks[i])
            observed = True
        else:
            duration = float(weeks.max())
            observed = False
        rows.append({"patient_id": g[COL_PATIENT].iloc[0], "BMI": float(g[COL_BMI].iloc[0]),
                     "duration": duration, "observed": int(observed)})
    return pd.DataFrame(rows)


def _aft_dist_objs(row):
    if not SCIPY_OK:
        return None, None
    dist = str(row.get("dist", "")).strip().lower()
    if dist == "lognormal" and np.isfinite(row.get("mu", np.nan)) and np.isfinite(row.get("sigma", np.nan)) and float(row["sigma"]) > 0:
        s = float(row["sigma"]); sc = np.exp(float(row["mu"]))
        def cdf(x): return lognorm.cdf(np.maximum(x, 1e-9), s=s, scale=sc)
        def ppf(u): return lognorm.ppf(np.clip(u, 1e-12, 1 - 1e-12), s=s, scale=sc)
        return cdf, ppf
    if dist == "weibull" and np.isfinite(row.get("shape", np.nan)) and np.isfinite(row.get("scale", np.nan)) and float(row["shape"]) > 0 and float(row["scale"]) > 0:
        c = float(row["shape"]); sc = float(row["scale"])
        def cdf(x): return weibull_min.cdf(np.maximum(x, 1e-9), c=c, scale=sc)
        def ppf(u): return weibull_min.ppf(np.clip(u, 1e-12, 1 - 1e-12), c=c, scale=sc)
        return cdf, ppf
    if dist == "loglogistic" and np.isfinite(row.get("shape", np.nan)) and np.isfinite(row.get("scale", np.nan)) and float(row["shape"]) > 0 and float(row["scale"]) > 0:
        c = float(row["shape"]); sc = float(row["scale"])
        def cdf(x): return fisk.cdf(np.maximum(x, 1e-9), c=c, scale=sc)
        def ppf(u): return fisk.ppf(np.clip(u, 1e-12, 1 - 1e-12), c=c, scale=sc)
        return cdf, ppf
    return None, None


def get_fuzzy_event_data(df, lower_b, upper_b, aft_params=None):
    rows = []
    for _, g in df.groupby(COL_PATIENT):
        g = g.sort_values(COL_GA_DAYS)
        t = (g[COL_GA_DAYS].values / 7.0).astype(float)
        y = g[COL_Y_CONC].values.astype(float)
        above = np.where(y > upper_b)[0]
        if above.size > 0:
            j = int(above[0])
            R = float(t[j])
            L = float(t[j - 1]) if j - 1 >= 0 else 0.0
            rows.append({"patient_id": g[COL_PATIENT].iloc[0], "BMI": float(g[COL_BMI].iloc[0]),
                         "L": L, "R": R, "ctype": "interval"})
        else:
            rows.append({"patient_id": g[COL_PATIENT].iloc[0], "BMI": float(g[COL_BMI].iloc[0]),
                         "L": float(t.max()), "R": np.inf, "ctype": "right"})
    iv = pd.DataFrame(rows)
    if aft_params is not None and len(aft_params):
        iv = iv.merge(aft_params, on="patient_id", how="left")
    imputed = []
    rng = np.random.default_rng(114514)
    for _, r in iv.iterrows():
        if r["ctype"] == "interval" and np.isfinite(r["R"]):
            cdf, ppf = _aft_dist_objs(r) if ("dist" in iv.columns and pd.notna(r.get("dist", np.nan))) else (None, None)
            Lt, Rt = float(r["L"]), float(r["R"])
            if cdf and ppf:
                u_lo, u_hi = float(cdf(Lt)), float(cdf(Rt))
                if (not np.isfinite(u_lo)) or (not np.isfinite(u_hi)) or u_hi <= u_lo + 1e-12:
                    t = 0.5 * (Lt + Rt) if Rt > Lt else Lt
                else:
                    u = float(rng.uniform(u_lo, u_hi))
                    t = float(ppf(u))
            else:
                t = Lt if Rt <= Lt else float(rng.uniform(Lt, Rt))
            imputed.append({"patient_id": r["patient_id"], "BMI": r["BMI"], "duration": t, "observed": 1})
        else:
            imputed.append({"patient_id": r["patient_id"], "BMI": r["BMI"], "duration": r["L"], "observed": 0})
    return pd.DataFrame(imputed)


def get_q95_recommendation(df_grp):
    if df_grp.empty:
        return np.nan
    kmf = KaplanMeierFitter().fit(df_grp["duration"], event_observed=df_grp["observed"])
    sf = kmf.survival_function_.reset_index()
    hit = sf[sf["KM_estimate"] <= 0.05]
    return float(hit["timeline"].iloc[0]) if len(hit) > 0 else float(df_grp["duration"].max())


def resolve_bin_edges_and_labels(df_raw):
    try:
        with open(CUTS_FILE, "r", encoding="utf-8") as f:
            cuts_obj = json.load(f)["chosen"]
        cuts = cuts_obj.get("cuts_final", [])
        K = int(cuts_obj.get("k", len(cuts) + 1))
        edges = [-np.inf] + list(cuts) + [np.inf]
        labels = [f"Group {i+1}" for i in range(K)]
        print(f"[INFO] 采用 cuts JSON(K={K}):{cuts}")
        return edges, labels
    except FileNotFoundError:
        print(f"[WARN] 找不到 cuts JSON:{CUTS_FILE}")
    if os.path.exists(GROUPS_CSV):
        try:
            gdf = safe_read_csv(GROUPS_CSV)
            if {"BMI", "group_idx"}.issubset(gdf.columns):
                gstats = gdf.groupby("group_idx").agg(bmin=("BMI", "min"), bmax=("BMI", "max")).sort_index()
                est = []
                for g in range(gstats.shape[0] - 1):
                    left_max = float(gstats.iloc[g]["bmax"])
                    right_min = float(gstats.iloc[g + 1]["bmin"])
                    if np.isfinite(left_max) and np.isfinite(right_min):
                        est.append(0.5 * (left_max + right_min))
                K = gstats.shape[0]
                edges = [-np.inf] + est + [np.inf]
                labels = [f"Group {i+1}" for i in range(K)]
                print(f"[INFO] 采用 bmi_groups.csv 推回 cuts(K={K}):{[round(c, 3) for c in est]}")
                return edges, labels
        except Exception as e:
            print(f"[WARN] 解析 bmi_groups.csv 失败:{e}")
    bmi = pd.to_numeric(df_raw["孕妇BMI"], errors="coerce").dropna()
    if len(bmi) < DEFAULT_K:
        k = 2
    else:
        k = DEFAULT_K
    qs = np.linspace(0, 1, k + 1)[1:-1]
    cuts = sorted(bmi.quantile(qs).unique().tolist())
    edges = [-np.inf] + cuts + [np.inf]
    labels = [f"Group {i+1}" for i in range(k)]
    print(f"[INFO] 采用等频分箱(K={k}):{[round(c, 3) for c in cuts]}")
    return edges, labels


if __name__ == "__main__":
    df_raw = safe_read_csv(RAW_DATA_FILE)
    df_raw = df_raw[[COL_PATIENT, COL_GA_DAYS, COL_Y_CONC, COL_BMI]].dropna()
    for c in [COL_GA_DAYS, COL_Y_CONC, COL_BMI]:
        df_raw[c] = pd.to_numeric(df_raw[c], errors="coerce")
    df_raw = df_raw.dropna()

    aft_params = None
    if os.path.exists(AFT_PARAM_CSV):
        try:
            a = pd.read_csv(AFT_PARAM_CSV)
            keep = [c for c in ["patient_id", "dist", "mu", "sigma", "shape", "scale"] if c in a.columns]
            if "patient_id" in keep:
                aft_params = a[keep].copy()
                print(f"[INFO] Using AFT conditional imputation for fuzzy intervals (n={len(aft_params)})")
        except Exception as e:
            print(f"[WARN] Failed to read AFT params; fallback to Uniform imputation: {e}")

    bin_edges, labels = resolve_bin_edges_and_labels(df_raw)
    df_exact = get_event_data(df_raw, CFFDNA_THRESHOLD)
    df_fuzzy = get_fuzzy_event_data(df_raw, FUZZY_LOWER_BOUND, FUZZY_UPPER_BOUND, aft_params=aft_params)
    df_exact["group"] = pd.cut(df_exact["BMI"], bins=bin_edges, labels=labels)
    df_fuzzy["group"] = pd.cut(df_fuzzy["BMI"], bins=bin_edges, labels=labels)

    fig, axes = plt.subplots(len(labels), 1, figsize=(9, 3.0 * len(labels)), sharex=True)
    if len(labels) == 1:
        axes = [axes]
    fig.suptitle("Q3: KM curves by BMI groups (Exact 4% vs Fuzzy [3.9%, 4.1%])", fontsize=13)
    summary = []
    for i, glb in enumerate(labels):
        ax = axes[i]
        g0 = df_exact[df_exact["group"] == glb]
        g1 = df_fuzzy[df_fuzzy["group"] == glb]
        rec0 = np.nan; rec1 = np.nan
        if not g0.empty:
            KaplanMeierFitter().fit(g0["duration"], g0["observed"], label="Exact 4%").plot(ax=ax, ci_show=True)
            rec0 = get_q95_recommendation(g0)
        if not g1.empty:
            KaplanMeierFitter().fit(g1["duration"], g1["observed"], label="Fuzzy [3.9%, 4.1%]").plot(ax=ax, ci_show=True)
            rec1 = get_q95_recommendation(g1)
        ax.set_title(str(glb))
        ax.set_ylabel("Survival S(t)")
        ax.grid(True, ls="--", alpha=0.5)
        ax.legend()
        summary.append({"Group": glb, "t95_exact": rec0, "t95_fuzzy": rec1})
    axes[-1].set_xlabel("Gestational age (weeks)")

    out_png = os.path.join(OUTPUT_DIR, "fuzzy_interval_comparison_by_group.png")
    plt.tight_layout(rect=[0, 0.03, 1, 0.96])
    plt.savefig(out_png, dpi=150)
    print(f"[OK] 已保存对比图:{out_png}")
    out_csv = os.path.join(OUTPUT_DIR, "fuzzy_interval_summary_by_group.csv")
    pd.DataFrame(summary).to_csv(out_csv, index=False, encoding="utf-8-sig")
    print(f"[OK] 已保存汇总表:{out_csv}")
p3_noise_grouped_sensitivity_analysis.py
"""
问题三:分组敏感性分析(蒙特卡洛)
- 给 Y 测量值加入小幅高斯噪声,多次重复:
  1) 重新学习监督分箱(保序回归 + 决策树)得到 BMI cuts;
  2) 对每组做 KM,取 t95 作为推荐周数;
- 输出:每次实验的 cuts 与各组推荐的 CSV;并绘制小提琴图(cuts 与推荐)。
"""
import os, json, random, warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter, CoxPHFitter
from patsy import dmatrix
from sklearn.isotonic import IsotonicRegression
from sklearn.tree import DecisionTreeRegressor
try:
    from scipy.stats import lognorm, weibull_min, fisk
    SCIPY_OK = True
except Exception:
    SCIPY_OK = False
warnings.filterwarnings("ignore")

RAW_DATA_FILE = "C:/Users/yezf8/Documents/Y3Repo/25C题/男胎检测数据_filtered.csv"
OUTPUT_DIR = "C:/Users/yezf8/Documents/Y3Repo/25C题/problem3/sensitivity_analysis_outputs"
CUTS_FILE = "C:/Users/yezf8/Documents/Y3Repo/25C题/problem3/outputs_binning/bmi_supervised_bins_cuts.json"
AFT_PARAM_CSV = "C:/Users/yezf8/Documents/Y3Repo/25C题/problem3/outputs_joint_r/aft_params_by_patient.csv"
COL_PATIENT, COL_GA_DAYS, COL_Y_CONC, COL_BMI = "孕妇代码", "检测孕天数", "Y染色体浓度", "孕妇BMI"
N_BOOTSTRAPS = 50
NOISE_STD_DEV = 0.0005
SEED = 42
CFFDNA_THRESHOLD = 0.04
DETECTION_LOWER_BOUND = 6.0
MI_M = 10
P_MAIN = 0.95
MIN_SAMPLES_PER_BIN = 20
TREE_PENALTY = 0.05
os.makedirs(OUTPUT_DIR, exist_ok=True)


def set_seed(s):
    os.environ["PYTHONHASHSEED"] = str(s)
    random.seed(s)
    np.random.seed(s)


def safe_read_csv(path):
    try:
        return pd.read_csv(path, encoding="gbk")
    except Exception:
        return pd.read_csv(path, encoding="utf-8")


def build_patient_intervals(df):
    df2 = df.copy()
    df2["孕周"] = df2[COL_GA_DAYS] / 7.0
    df2["达标"] = df2[COL_Y_CONC] >= CFFDNA_THRESHOLD
    out = []
    for pid, g in df2.groupby(COL_PATIENT):
        g = g.sort_values("孕周")
        t = g["孕周"].values.astype(float)
        y = g["达标"].values.astype(bool)
        bmi = float(g[COL_BMI].iloc[0])
        pos = np.where(y)[0]
        if pos.size > 0:
            j = int(pos[0])
            R = float(t[j])
            if j > 0:
                L = float(t[:j][~y[:j]].max()) if (~y[:j]).any() else DETECTION_LOWER_BOUND
                ctype = "interval" if (~y[:j]).any() else "left"
            else:
                L = DETECTION_LOWER_BOUND; ctype = "left"
        else:
            L, R, ctype = float(t.max()), np.inf, "right"
        out.append({"patient_id": pid, "BMI": bmi, "L": L, "R": R, "ctype": ctype})
    return pd.DataFrame(out)


def sample_time_from_interval(L, R, lb):
    if not np.isfinite(R):
        return np.nan
    L_eff = float(L) if np.isfinite(L) else float(lb)
    return np.random.uniform(L_eff, float(R)) if R > L_eff else L_eff


def _aft_dist_objs(row):
    if not SCIPY_OK:
        return None, None
    dist = str(row.get("dist", "")).strip().lower()
    if dist == "lognormal" and np.isfinite(row.get("mu", np.nan)) and np.isfinite(row.get("sigma", np.nan)) and float(row["sigma"]) > 0:
        s = float(row["sigma"]); sc = np.exp(float(row["mu"]))
        def cdf(x): return lognorm.cdf(np.maximum(x, 1e-9), s=s, scale=sc)
        def ppf(u): return lognorm.ppf(np.clip(u, 1e-12, 1 - 1e-12), s=s, scale=sc)
        return cdf, ppf
    if dist == "weibull" and np.isfinite(row.get("shape", np.nan)) and np.isfinite(row.get("scale", np.nan)) and float(row["shape"]) > 0 and float(row["scale"]) > 0:
        c = float(row["shape"]); sc = float(row["scale"])
        def cdf(x): return weibull_min.cdf(np.maximum(x, 1e-9), c=c, scale=sc)
        def ppf(u): return weibull_min.ppf(np.clip(u, 1e-12, 1 - 1e-12), c=c, scale=sc)
        return cdf, ppf
    if dist == "loglogistic" and np.isfinite(row.get("shape", np.nan)) and np.isfinite(row.get("scale", np.nan)) and float(row["shape"]) > 0 and float(row["scale"]) > 0:
        c = float(row["shape"]); sc = float(row["scale"])
        def cdf(x): return fisk.cdf(np.maximum(x, 1e-9), c=c, scale=sc)
        def ppf(u): return fisk.ppf(np.clip(u, 1e-12, 1 - 1e-12), c=c, scale=sc)
        return cdf, ppf
    return None, None


def multiple_imputations(iv_df, M, lb):
    dfs = []
    rng = np.random.default_rng(SEED)
    for _ in range(M):
        rows = []
        for _, r in iv_df.iterrows():
            ctype = r["ctype"]
            if ctype in ("left", "interval"):
                if ("dist" in iv_df.columns) and pd.notna(r.get("dist", np.nan)):
                    cdf, ppf = _aft_dist_objs(r)
                else:
                    cdf, ppf = (None, None)
                if ctype == "left":
                    Rt = float(r["R"])
                    if cdf and ppf:
                        u_lo, u_hi = 0.0, float(cdf(Rt))
                        if not np.isfinite(u_hi) or u_hi <= 1e-12:
                            t = max(1e-6, Rt * 0.5)
                        else:
                            u = float(rng.uniform(u_lo, u_hi))
                            t = float(ppf(u))
                    else:
                        t = float(rng.uniform(1e-6, Rt))
                    rows.append({"patient_id": r["patient_id"], "BMI": r["BMI"], "time": t, "event": 1})
                else:
                    Lt, Rt = float(r["L"]), float(r["R"])
                    if cdf and ppf:
                        u_lo, u_hi = float(cdf(Lt)), float(cdf(Rt))
                        if (not np.isfinite(u_lo)) or (not np.isfinite(u_hi)) or u_hi <= u_lo + 1e-12:
                            t = 0.5 * (Lt + Rt) if Rt > Lt else Lt
                        else:
                            u = float(rng.uniform(u_lo, u_hi))
                            t = float(ppf(u))
                    else:
                        t = Lt if Rt <= Lt else float(rng.uniform(Lt, Rt))
                    rows.append({"patient_id": r["patient_id"], "BMI": r["BMI"], "time": t, "event": 1})
            else:
                rows.append({"patient_id": r["patient_id"], "BMI": r["BMI"], "time": float(r["L"]), "event": 0})
        d = pd.DataFrame(rows).dropna()
        d["event"] = d["event"].astype(int)
        dfs.append(d)
    return dfs


def get_cox_predictions(imputed_sets, p_list):
    agg = None
    for m, d in enumerate(imputed_sets):
        if len(d) < 20:
            continue
        bmi_c = d["BMI"].mean()
        X = dmatrix("bs(BMIc, df=4, degree=3, include_intercept=False)",
                    {"BMIc": (d["BMI"] - bmi_c).values}, return_type="dataframe")
        cph = CoxPHFitter(penalizer=TREE_PENALTY)
        try:
            cph.fit(pd.concat([d[["time", "event"]].reset_index(drop=True), X.reset_index(drop=True)], axis=1),
                    duration_col="time", event_col="event", robust=True, show_progress=False)
        except Exception:
            continue
        grid = np.linspace(DETECTION_LOWER_BOUND, 35.0, 200)
        S = cph.predict_survival_function(X, times=grid)
        dfp = d[["patient_id", "BMI"]].copy()
        for p in p_list:
            target = 1.0 - p
            t_pred = []
            for col in S.columns:
                s = S[col].values
                idx = np.where(s <= target)[0]
                t_pred.append(float(grid[idx[0]]) if idx.size > 0 else np.nan)
            dfp[f"pred_t{int(p*100)}"] = t_pred
        if agg is None:
            agg = dfp
        else:
            for p in p_list:
                agg = agg.merge(dfp[["patient_id", f"pred_t{int(p*100)}"]].rename(
                    columns={f"pred_t{int(p*100)}": f"pred_t{int(p*100)}_{m}"}), on="patient_id", how="left")
    if agg is None:
        return None
    for p in p_list:
        cols = [c for c in agg.columns if c.startswith(f"pred_t{int(p*100)}")]
        agg[f"pred_t{int(p*100)}_final"] = agg[cols].median(axis=1)
    keep = ["patient_id", "BMI"] + [f"pred_t{int(p*100)}_final" for p in p_list]
    return agg[keep]


def get_supervised_cuts(df_pred, y_col, n_bins, min_leaf):
    d = df_pred.dropna(subset=["BMI", y_col]).copy()
    if len(d) < n_bins * min_leaf:
        return []
    iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
    y_mono = iso.fit_transform(d["BMI"].values, d[y_col].values)
    tree = DecisionTreeRegressor(max_leaf_nodes=n_bins, min_samples_leaf=min_leaf, random_state=SEED)
    tree.fit(d[["BMI"]].values, y_mono)
    cuts = sorted([t for t in tree.tree_.threshold if t != -2.0])
    cuts = [float(c) for c in cuts if np.isfinite(c)]
    return cuts


def get_group_recommendations(df_with_groups):
    recs = []
    for gid, g in df_with_groups.groupby("group"):
        if len(g) < 10:
            recs.append(np.nan); continue
        kmf = KaplanMeierFitter().fit(g["time"], g["event"])
        sf = kmf.survival_function_.reset_index()
        hit = sf[sf["KM_estimate"] <= (1.0 - P_MAIN)]
        recs.append(float(hit["timeline"].iloc[0]) if len(hit) > 0 else float(g["time"].max()))
    return recs


def run_one(df_raw, noise_std, req_bins, min_leaf, aft_params=None):
    d = df_raw.copy()
    d[COL_Y_CONC] = d[COL_Y_CONC] + np.random.normal(0, noise_std, size=len(d))
    iv = build_patient_intervals(d)
    if aft_params is not None and len(aft_params):
        iv = iv.merge(aft_params, on="patient_id", how="left")
    imps = multiple_imputations(iv, MI_M, DETECTION_LOWER_BOUND)
    if not imps:
        return None
    df_pred = get_cox_predictions(imps, [P_MAIN])
    if df_pred is None:
        return None
    y_col = f"pred_t{int(P_MAIN*100)}_final"
    cuts = get_supervised_cuts(df_pred, y_col, req_bins, min_leaf)
    if len(cuts) != req_bins - 1:
        return None
    edges = [-np.inf] + cuts + [np.inf]
    dm = imps[0].copy()
    dm["group"] = pd.cut(dm["BMI"], bins=edges, labels=range(req_bins))
    recs = get_group_recommendations(dm)
    if len(recs) != req_bins:
        return None
    return {"cuts": cuts, "recs": recs}


if __name__ == "__main__":
    set_seed(SEED)
    try:
        with open(CUTS_FILE, "r", encoding="utf-8") as f:
            cuts_obj = json.load(f)["chosen"]
        REQUIRED_BINS = int(cuts_obj.get("k", len(cuts_obj.get("cuts_final", [])) + 1))
    except Exception as e:
        print(f"[WARN] 读取 cuts 失败,将使用 3 组默认: {e}")
        REQUIRED_BINS = 3

    df_raw = safe_read_csv(RAW_DATA_FILE)
    df_raw = df_raw[[COL_PATIENT, COL_GA_DAYS, COL_Y_CONC, COL_BMI]].dropna()
    for c in [COL_GA_DAYS, COL_Y_CONC, COL_BMI]:
        df_raw[c] = pd.to_numeric(df_raw[c], errors="coerce")
    df_raw = df_raw.dropna()

    aft_params = None
    if os.path.exists(AFT_PARAM_CSV):
        try:
            a = pd.read_csv(AFT_PARAM_CSV)
            keep = [c for c in ["patient_id", "dist", "mu", "sigma", "shape", "scale"] if c in a.columns]
            if "patient_id" in keep:
                aft_params = a[keep].copy()
                print(f"[INFO] Loaded AFT params for MI: n={len(aft_params)}")
        except Exception as e:
            print(f"[WARN] Failed to read AFT params, fallback to Uniform MI: {e}")

    runs = []
    print(f"[INFO] 开始蒙特卡洛:N={N_BOOTSTRAPS}, 组数={REQUIRED_BINS}, 噪声σ={NOISE_STD_DEV}")
    ok = 0
    for i in range(N_BOOTSTRAPS):
        print(f" - 运行 {i+1}/{N_BOOTSTRAPS} ...")
        res = run_one(df_raw, NOISE_STD_DEV, REQUIRED_BINS, MIN_SAMPLES_PER_BIN, aft_params=aft_params)
        if res is not None:
            runs.append(res); ok += 1
    print(f"[INFO] 完成:有效实验 {ok}/{N_BOOTSTRAPS}")
    if not runs:
        print("[ERROR] 无有效结果,参数可能过严。")
        raise SystemExit(0)

    df_cuts = pd.DataFrame([r["cuts"] for r in runs], columns=[f"Cut_{i+1}" for i in range(REQUIRED_BINS - 1)])
    df_recs = pd.DataFrame([r["recs"] for r in runs], columns=[f"Group_{i+1}_Rec" for i in range(REQUIRED_BINS)])
    out_csv = os.path.join(OUTPUT_DIR, f"grouped_sensitivity_results_{REQUIRED_BINS}g.csv")
    pd.concat([df_cuts, df_recs], axis=1).to_csv(out_csv, index=False, encoding="utf-8-sig")
    print(f"[OK] 已保存模拟结果:{out_csv}")

    fig1, ax1 = plt.subplots(figsize=(9, 5.2))
    parts = ax1.violinplot([df_cuts[c].dropna().values for c in df_cuts.columns],
                           positions=np.arange(1, len(df_cuts.columns) + 1), showextrema=False)
    pastel_colors = ["#FFD1DC", "#C1E1C1", "#FDFD96", "#AEC6CF", "#FFB347", "#E6E6FA", "#B5EAD7", "#FFDAC1", "#C7CEEA"]
    for i, b in enumerate(parts["bodies"]):
        b.set_facecolor(pastel_colors[i % len(pastel_colors)])
        b.set_edgecolor("white")
        b.set_alpha(0.9)
    ax1.set_xticks(np.arange(1, len(df_cuts.columns) + 1))
    ax1.set_xticklabels(df_cuts.columns)
    ax1.set_ylabel("BMI cut value")
    ax1.set_title("Q3: Distribution of BMI cutpoints under noise (violin plot)")
    ax1.grid(True, ls="--", alpha=0.5)
    out_png1 = os.path.join(OUTPUT_DIR, f"violin_cuts_{REQUIRED_BINS}g.png")
    plt.tight_layout(); plt.savefig(out_png1, dpi=150)
    print(f"[OK] 已保存小提琴图(cuts):{out_png1}")

    fig2, ax2 = plt.subplots(figsize=(9, 5.2))
    parts2 = ax2.violinplot([df_recs[c].dropna().values for c in df_recs.columns],
                            positions=np.arange(1, len(df_recs.columns) + 1), showextrema=False)
    for i, b in enumerate(parts2["bodies"]):
        b.set_facecolor(pastel_colors[i % len(pastel_colors)])
        b.set_edgecolor("white")
        b.set_alpha(0.9)
    ax2.set_xticks(np.arange(1, len(df_recs.columns) + 1))
    ax2.set_xticklabels(df_recs.columns)
    ax2.set_ylabel("Recommended week (t95, KM)")
    ax2.set_title("Q3: Distribution of recommended weeks (t95, KM) under noise (violin plot)")
    ax2.grid(True, ls="--", alpha=0.5)
    out_png2 = os.path.join(OUTPUT_DIR, f"violin_recs_{REQUIRED_BINS}g.png")
    plt.tight_layout(); plt.savefig(out_png2, dpi=150)
    print(f"[OK] 已保存小提琴图(推荐):{out_png2}")
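上述脚本将每次加噪实验得到的切点与各组推荐周数写入 grouped_sensitivity_results_{K}g.csv(列名为 Cut_i 与 Group_i_Rec)。下面给出一个读取该结果并量化稳健性的最小示意(假设 K=3,文件位于当前工作目录;数值依赖实际运行结果):

import pandas as pd

# 各切点与各组推荐周数在多次加噪实验中的均值、标准差与四分位数,
# 标准差与四分位距越小,说明分组切点与推荐时点对测量误差越不敏感
res = pd.read_csv("grouped_sensitivity_results_3g.csv")
summary = res.describe().T[["mean", "std", "25%", "75%"]]
print(summary)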
p4_automl_ensemble_tuning.py
import pandas as pd
import numpy as np
import xgboost as xgb
import optuna
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV
import matplotlib.pyplot as plt
import seaborn as sns
import os
import time

plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

X_GLOBAL = None
Y_GLOBAL = None
N_SPLITS_CV = 5


def save_best_results_callback(study, trial):
    """Callback to save the results of the best trial."""
    if study.best_trial == trial:
        best_params = trial.params
        best_w = best_params['ensemble_w']
        best_threshold = best_params['threshold']
        oof_xgb_proba = np.array(trial.user_attrs['oof_xgb_proba'])
        oof_svm_proba = np.array(trial.user_attrs['oof_svm_proba'])
        ensemble_proba = best_w * oof_xgb_proba + (1 - best_w) * oof_svm_proba
        y_pred = (ensemble_proba > best_threshold).astype(int)
        report_dict = classification_report(Y_GLOBAL, y_pred, target_names=['Normal', 'Abnormal'],
                                            output_dict=True, zero_division=0)
        cm = confusion_matrix(Y_GLOBAL, y_pred)
        study.set_user_attr('best_trial_results', {
            'report_dict': report_dict,
            'confusion_matrix': cm.tolist()
        })


def objective(trial):
    """Optuna objective function to minimize clinical cost."""
    xgb_params = {
        'objective': 'binary:logistic',
        'eval_metric': 'logloss',
        'use_label_encoder': False,
        'scale_pos_weight': (Y_GLOBAL == 0).sum() / (Y_GLOBAL == 1).sum(),
        'random_state': 42,
        'n_estimators': trial.suggest_int('xgb_n_estimators', 100, 500),
        'max_depth': trial.suggest_int('xgb_max_depth', 3, 8),
        'learning_rate': trial.suggest_float('xgb_learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('xgb_subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('xgb_colsample_bytree', 0.6, 1.0),
        'gamma': trial.suggest_float('xgb_gamma', 0, 0.5)
    }
    svm_params = {
        'kernel': 'rbf',
        'C': trial.suggest_float('svm_C', 1e-2, 1e3, log=True),
        'gamma': trial.suggest_float('svm_gamma', 1e-4, 1e-1, log=True),
        'probability': False,
        'random_state': 42
    }
    ensemble_w = trial.suggest_float('ensemble_w', 0, 1)
    threshold = trial.suggest_float('threshold', 0.01, 0.99)

    skf = StratifiedKFold(n_splits=N_SPLITS_CV, shuffle=True, random_state=42)
    oof_xgb_proba = np.zeros(len(X_GLOBAL))
    oof_svm_proba = np.zeros(len(X_GLOBAL))
    for train_idx, val_idx in skf.split(X_GLOBAL, Y_GLOBAL):
        X_train, X_val = X_GLOBAL.iloc[train_idx], X_GLOBAL.iloc[val_idx]
        y_train, y_val = Y_GLOBAL.iloc[train_idx], Y_GLOBAL.iloc[val_idx]
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_val_scaled = scaler.transform(X_val)
        xgb_model = xgb.XGBClassifier(**xgb_params)
        xgb_model.fit(X_train, y_train)
        oof_xgb_proba[val_idx] = xgb_model.predict_proba(X_val)[:, 1]
        svc_model = SVC(**svm_params)
        calibrated_svc = CalibratedClassifierCV(svc_model, method='isotonic', cv=3)
        calibrated_svc.fit(X_train_scaled, y_train)
        oof_svm_proba[val_idx] = calibrated_svc.predict_proba(X_val_scaled)[:, 1]

    ensemble_proba = ensemble_w * oof_xgb_proba + (1 - ensemble_w) * oof_svm_proba
    y_pred = (ensemble_proba > threshold).astype(int)
    auc = roc_auc_score(Y_GLOBAL, ensemble_proba)
    trial.set_user_attr('auc', auc)
    trial.set_user_attr('oof_xgb_proba', oof_xgb_proba.tolist())
    trial.set_user_attr('oof_svm_proba', oof_svm_proba.tolist())
    cm = confusion_matrix(Y_GLOBAL, y_pred)
    FN = cm[1, 0]
    FP = cm[0, 1]
    cost = 15 * FN + 1 * FP
    return cost


def run_automl_tuning(file_path, output_dir):
    global X_GLOBAL, Y_GLOBAL
    start_time = time.time()
    print("Starting Final AutoML Ensemble Tuning Process...")
    df = pd.read_csv(file_path)
    rename_dict = {
        '检测孕天数': 'gestational_week', '年龄': 'age', '孕妇BMI': 'bmi',
        '在参考基因组上比对的比例': 'alignment_ratio', '重复读段的比例': 'duplication_ratio',
        '唯一比对的读段数': 'unique_reads', 'GC含量': 'gc_content',
        '13号染色体的Z值': 'z_score_13', '18号染色体的Z值': 'z_score_18',
        '21号染色体的Z值': 'z_score_21', 'X染色体的Z值': 'z_score_x',
        'X染色体浓度': 'x_concentration'
    }
    df.rename(columns=rename_dict, inplace=True)
    print(f"Original sample count: {len(df)}")
    df = df[df['z_score_x'].abs() < 2.5].reset_index(drop=True)
    print(f"Sample count after filtering (abs(z_score_x) < 2.5): {len(df)}")
    df['abnormal'] = df['染色体的非整倍体'].notna().astype(int)
    if 'bmi' in df.columns and df['bmi'].isnull().any():
        df['bmi'].fillna(df['bmi'].median(), inplace=True)
    df['z21_x_ff'] = df['z_score_21'] * df['x_concentration']
    df['z18_x_ff'] = df['z_score_18'] * df['x_concentration']
    df['z13_x_ff'] = df['z_score_13'] * df['x_concentration']
    bins = [-np.inf, 2.5, 3, np.inf]
    labels = ['Normal_ZX', 'Borderline_ZX', 'Abnormal_ZX']
    df['z_score_x_binned'] = pd.cut(abs(df['z_score_x']), bins=bins, labels=labels, right=False)
    ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    zx_binned_encoded = ohe.fit_transform(df[['z_score_x_binned']])
    zx_binned_df = pd.DataFrame(zx_binned_encoded, columns=ohe.get_feature_names_out(['z_score_x_binned']))
    df = pd.concat([df.reset_index(drop=True), zx_binned_df], axis=1)
    feature_cols = [
        'age', 'gestational_week', 'bmi', 'alignment_ratio', 'duplication_ratio',
        'unique_reads', 'gc_content', 'z_score_13', 'z_score_18', 'z_score_21',
        'z_score_x', 'x_concentration', 'z21_x_ff', 'z18_x_ff', 'z13_x_ff'
    ]
    feature_cols.extend(ohe.get_feature_names_out(['z_score_x_binned']))
    X_GLOBAL = df[feature_cols]
    Y_GLOBAL = df['abnormal']

    study = optuna.create_study(direction='minimize')
    study.optimize(objective, n_trials=100, timeout=1200, callbacks=[save_best_results_callback])
    print(f"Best trial found with cost: {study.best_value}")
    print(f"Best parameters: {study.best_params}")
    print("Generating report from the best trial...")
    best_results = study.user_attrs.get('best_trial_results')
    if not best_results:
        print("Could not find saved results for the best trial. Exiting.")
        return
    report_dict = best_results['report_dict']
    cm = np.array(best_results['confusion_matrix'])
    report_df = pd.DataFrame(report_dict).transpose()
    cm_df = pd.DataFrame(cm, index=['Actual Normal', 'Actual Abnormal'],
                         columns=['Predicted Normal', 'Predicted Abnormal'])
    os.makedirs(output_dir, exist_ok=True)
    report_df.to_csv(os.path.join(output_dir, 'classification_report.csv'))
    cm_df.to_csv(os.path.join(output_dir, 'confusion_matrix.csv'))
    best_trial_auc = study.best_trial.user_attrs.get('auc', 'N/A')

    report_path = os.path.join(output_dir, 'automl_ensemble_report.md')
    with open(report_path, 'w', encoding='utf-8') as f:
        f.write("# AutoML 集成模型优化报告 (代价函数: 15*FN + 1*FP)\n\n")
        f.write("## 1. 概述\n")
        f.write("在加载数据后,首先排除了所有`abs(z_score_x) >= 2.5`的样本。后续所有的模型训练、优化和评估都在这个高置信度的样本子集上进行。\n")
        f.write("使用Optuna库对XGBoost和SVM的集成模型进行端到端的超参数优化。优化的目标是最小化临床代价函数:`Cost = 15 * FN + 1 * FP`。\n")
        f.write(f"总共执行了 **{len(study.trials)}** 次试验。报告中的所有指标均来自代价最小的那一次特定试验。\n\n")
        f.write("## 2. 优化结果\n")
        f.write(f"**最小临床代价**: {study.best_value:.4f}\n")
        f.write(f"**最佳试验对应的AUC**: {best_trial_auc:.4f}\n\n")
        f.write("### 找到的最佳超参数组合:\n")
        for key, value in study.best_params.items():
            f.write(f"- **{key}**: `{value}`\n")
        f.write("\n## 3. 最佳试验的性能详情\n")
        f.write("### 分类报告\n")
        f.write(report_df.to_markdown() + "\n\n")
        f.write("### 混淆矩阵 (计数)\n")
        f.write(cm_df.to_markdown() + "\n\n")
        f.write("\n")

    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Normal', 'Abnormal'], yticklabels=['Normal', 'Abnormal'])
    plt.title('Confusion Matrix of The Best Trial')
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.tight_layout()
    plt.savefig(os.path.join(output_dir, 'confusion_matrix_counts.png'))
    plt.close()
    print(f"AutoML tuning finished in {time.time() - start_time:.2f} seconds.")
    print(f"Final results saved to folder: {output_dir}")


if __name__ == '__main__':
    CWD = 'C:\\Users\\yezf8\\Documents\\Y3Repo\\25C题'
    DATA_FILE = os.path.join(CWD, '女胎检测数据_filtered.csv')
    OUTPUT_DIR = os.path.join(CWD, 'problem4', 'results_automl_ensemble')
    run_automl_tuning(DATA_FILE, OUTPUT_DIR)
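为便于理解 objective() 中非对称临床代价(漏诊代价为误报的 15 倍)的计算方式,下面给出一个最小的示意;其中的标签与预测均为虚构数据,仅用于演示同一代价公式:

import numpy as np
from sklearn.metrics import confusion_matrix

# 虚构的真实标签与预测标签,仅作演示
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])
cm = confusion_matrix(y_true, y_pred)   # 行为真实类别,列为预测类别
FN, FP = cm[1, 0], cm[0, 1]             # 本例中漏诊 1 例、误报 1 例
cost = 15 * FN + 1 * FP                 # 与 objective() 中的代价函数一致
print(FN, FP, cost)                     # 输出: 1 1 16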
p4_shap_analysis.py
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
import seaborn as sns
import os
import shap
import time

plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False


def run_shap_analysis(file_path, output_dir):
    start_time = time.time()
    print("Starting SHAP analysis on the entire sample set (this may take a very long time)...")
    df = pd.read_csv(file_path)
    rename_dict = {
        '检测孕天数': 'gestational_week', '年龄': 'age', '孕妇BMI': 'bmi',
        '在参考基因组上比对的比例': 'alignment_ratio', '重复读段的比例': 'duplication_ratio',
        '唯一比对的读段数': 'unique_reads', 'GC含量': 'gc_content',
        '13号染色体的Z值': 'z_score_13', '18号染色体的Z值': 'z_score_18',
        '21号染色体的Z值': 'z_score_21', 'X染色体的Z值': 'z_score_x',
        'X染色体浓度': 'x_concentration'
    }
    df.rename(columns=rename_dict, inplace=True)
    original_sample_count = len(df)
    df = df[df['z_score_x'].abs() < 2.5].reset_index(drop=True)
    print(f"Original sample count: {original_sample_count}")
    print(f"Sample count after filtering (abs(z_score_x) < 2.5): {len(df)}")
    df['abnormal'] = df['染色体的非整倍体'].notna().astype(int)
    if 'bmi' in df.columns and df['bmi'].isnull().any():
        df['bmi'].fillna(df['bmi'].median(), inplace=True)
    df['z21_x_ff'] = df['z_score_21'] * df['x_concentration']
    df['z18_x_ff'] = df['z_score_18'] * df['x_concentration']
    df['z13_x_ff'] = df['z_score_13'] * df['x_concentration']
    bins = [-np.inf, 2.5, 3, np.inf]
    labels = ['Normal_ZX', 'Borderline_ZX', 'Abnormal_ZX']
    df['z_score_x_binned'] = pd.cut(abs(df['z_score_x']), bins=bins, labels=labels, right=False)
    ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    zx_binned_encoded = ohe.fit_transform(df[['z_score_x_binned']])
    zx_binned_df = pd.DataFrame(zx_binned_encoded, columns=ohe.get_feature_names_out(['z_score_x_binned']))
    df = pd.concat([df.reset_index(drop=True), zx_binned_df], axis=1)
    feature_cols = [
        'age', 'gestational_week', 'bmi', 'alignment_ratio', 'duplication_ratio',
        'unique_reads', 'gc_content', 'z_score_13', 'z_score_18', 'z_score_21',
        'z_score_x', 'x_concentration', 'z21_x_ff', 'z18_x_ff', 'z13_x_ff'
    ]
    feature_cols.extend(ohe.get_feature_names_out(['z_score_x_binned']))
    X = df[feature_cols]
    y = df['abnormal']

    best_params = {
        'xgb_n_estimators': 479, 'xgb_max_depth': 6, 'xgb_learning_rate': 0.010359914293101614,
        'xgb_subsample': 0.7500268146879681, 'xgb_colsample_bytree': 0.9383752680015073,
        'xgb_gamma': 0.24133743945925995, 'svm_C': 1.0458842497938852,
        'svm_gamma': 0.013559691602731587, 'ensemble_w': 0.876834598235878,
        'threshold': 0.1584494525854912
    }
    best_xgb_params = {k.replace('xgb_', ''): v for k, v in best_params.items() if k.startswith('xgb_')}
    best_xgb_params.update({
        'objective': 'binary:logistic', 'eval_metric': 'logloss', 'use_label_encoder': False,
        'scale_pos_weight': (y == 0).sum() / (y == 1).sum(), 'random_state': 42
    })
    best_svm_params = {k.replace('svm_', ''): v for k, v in best_params.items() if k.startswith('svm_')}
    best_svm_params.update({'kernel': 'rbf', 'probability': False, 'random_state': 42})
    best_w = best_params['ensemble_w']

    xgb_model = xgb.XGBClassifier(**best_xgb_params)
    xgb_model.fit(X, y)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    svc_model = SVC(**best_svm_params)
    calibrated_svc = CalibratedClassifierCV(svc_model, method='isotonic', cv=5)
    calibrated_svc.fit(X_scaled, y)

    def ensemble_predict_proba(X_input):
        if not isinstance(X_input, pd.DataFrame):
            X_input_df = pd.DataFrame(X_input, columns=X.columns)
        else:
            X_input_df = X_input
        xgb_proba = xgb_model.predict_proba(X_input_df)[:, 1]
        svm_proba = calibrated_svc.predict_proba(scaler.transform(X_input_df))[:, 1]
        return best_w * xgb_proba + (1 - best_w) * svm_proba

    print("Calculating SHAP values for the ensemble (this will be very slow)...")
    background_data = shap.sample(X, 50)
    explainer = shap.KernelExplainer(ensemble_predict_proba, background_data)
    shap_values = explainer.shap_values(X)

    os.makedirs(output_dir, exist_ok=True)
    print("Generating SHAP summary plot (beeswarm)...")
    shap.summary_plot(shap_values, X, show=False)
    plt.savefig(os.path.join(output_dir, 'shap_summary_beeswarm.png'), bbox_inches='tight')
    plt.close()
    print("Generating SHAP feature importance bar plot...")
    shap.summary_plot(shap_values, X, plot_type="bar", show=False)
    plt.savefig(os.path.join(output_dir, 'shap_feature_importance_bar.png'), bbox_inches='tight')
    plt.close()
    print(f"SHAP analysis finished in {time.time() - start_time:.2f} seconds.")
    print(f"SHAP plots saved to folder: {output_dir}")


if __name__ == '__main__':
    CWD = 'C:\\Users\\yezf8\\Documents\\Y3Repo\\25C题'
    DATA_FILE = os.path.join(CWD, '女胎检测数据_filtered.csv')
    OUTPUT_DIR = os.path.join(CWD, 'problem4', 'results_automl_ensemble')
    run_shap_analysis(DATA_FILE, OUTPUT_DIR)