基于生存分析的NIPT时间选择与胎儿异常判定
作者:余明羲,叶子枫,吴悠
日期:2025年9月7日
摘要
无创产前检测(NIPT)因其高准确性和低风险在胎儿染色体异常筛查中得到广泛应用,但检测结果受孕周、BMI、胎儿性别等多因素影响。本文基于大规模孕妇检测数据,构建数学模型,系统分析了孕妇特征与胎儿性染色体浓度的关系,并提出数据驱动的BMI分组与最佳检测时点推荐方法。
对于问题一,本文首先对Y染色体浓度、孕周、BMI等数据进行了数据预处理和可视化分析,为后续模型的建立提供可靠的数据支持。随后,由于各变量均未通过Shapiro–Wilk检验,使用**Spearman偏相关系数**分析Y染色体浓度与孕妇各生理指标间的相关性。为验证并量化影响,进一步采用**多元线性回归模型**,用**最小二乘法**拟合得到最终的模型,并进行残差分析以及显著性检验,计算得到决定系数 $R^2=0.045$,F统计量的p值为 $1.66\times10^{-8}$,表明整个回归模型是高度统计显著的。
对于问题二,本文采用结合 B样条 的半参数Cox比例风险模型 ,并为每个个体预测其95%成功率的达标孕周(Q95时间点)。由于临床数据删失严重,本文将每位孕妇的纵向观测数据转化为生存分析中的标准区间数据,并使用了多重插补(MI)缓解Cox比例风险模型无法处理的 左删失 问题。然后,本文利用保序回归 对此“BMI-Q95预测周数”关系进行单调化处理 ,以此为监督信号,通过回归树 算法寻找最优的BMI分组切点,得到4个分组,并结合剪枝与最小样本量等策略保证分组的稳健性。最后,在每个分组内,使用Kaplan–Meier 方法估计生存曲线和推荐检测时间点(t95),通过敏感性分析确认了模型在面对测量误差时的稳健性与可靠性。
对于问题三,为综合考虑多种因素对Y染色体浓度达标比例的影响,本文在问题二的生存分析框架下的标准区间数据的基础上,拟合包含BMI、年龄等协变量的区间删失 加速失效时间模型 (AFT),最佳分布为log-normal 。随后,本文对每个个体的首次达标时间分布进行初步估计,得到个体层的精确分布参数与分位点(Q90/Q95)及 π(25)=P(T≤25)。与问题二类似,为最大化组间差异,本文采用回归树 进行分组;在得到的3个分组内,采用基于 AFT 条件分布的多重插补(MI) ,再结合KM 方法估计生存曲线和推荐检测时间点(t95),当 KM 结果不可靠时则回退至 AFT 模型的组内中位预测。
对于问题四,本文首先完成特征工程以增强原始数据。随后,考虑到临床上漏诊付出代价较高,本文以 最小化非对称临床代价函数 为唯一优化目标,构建了端到端集成学习模型 。该方法融合了XGBoost 和核SVM 两种模型的优势,并通过100次K折交叉验证 对所有超参数进行联合寻优。在验证集上,训练得到的最佳模型的临床代价为297 ,AUC 分数为 0.8216 ,表明模型具有良好的区分与预测能力。最后,本文使用SHAP 分析各特征的特征重要性并将结果可视化,增强了模型的可解释性及临床应用能力。
关键词: 多元线性回归, Cox模型, AFT模型, KM模型, 回归树, 集成学习模型
问题重述
问题背景
无创产前检测(NIPT,Non-invasive Prenatal Testing)是一种用于筛查胎儿染色体异常的技术,其通过采集母体血液、检测其中胎儿的游离DNA片段、分析胎儿染色体是否存在异常三个步骤来鉴定胎儿的健康状况,近年来在产前检测领域得到了广泛应用。NIPT能够在早期检测出唐氏综合征、爱德华氏综合征、帕陶氏综合征等胎儿染色体异常情况,这三种综合征分别对应胎儿21号、18号、13号染色体的异常。该方法具有较高的准确性和较低的风险,是临床上推荐使用的产前筛查方法。
尽管NIPT在临床中表现出较高的精确度,其结果仍受多种因素影响,尤其是胎儿性染色体浓度的测定——男胎的Y染色体或女胎的X染色体的游离DNA片段的测定比例。此外,不同孕妇的孕期、体重指数(BMI)以及胎儿的性别等因素都可能对检测结果产生影响。其中对于男胎,由于其携带Y染色体,其浓度的变化受到孕妇孕周数和BMI等因素的显著影响。
为了提高NIPT检测的准确性并降低潜在风险,临床实践中通常根据孕妇的BMI值对其进行分组,并根据不同群体的特点选择最佳的检测时点。然而,由于每个孕妇的个体差异,简单的经验分组并不能适用于所有情况。因此,如何在不同BMI组别中合理选择检测时点,确保检测准确性并尽量减少胎儿异常风险,是一个亟待解决的问题。
本研究旨在通过数学建模,分析胎儿性染色体浓度与孕妇的孕期、BMI等因素的关系,探讨如何根据不同孕妇群体的特点,优化NIPT的检测时点,以提高检测的准确性并降低早期发现异常所带来的潜在风险。同时,本文将考虑不同因素对检测误差的影响,以制定更加科学合理的分组策略和检测时点选择标准。
问题要求
本题以无创产前检测(NIPT)为背景,利用附件提供的孕妇检测数据,包括孕周、BMI、身高、体重、染色体浓度、Z值、GC含量等数据,用来研究胎儿染色体浓度的变化规律及其影响因素,建立数学模型,分析胎儿性染色体浓度与孕妇特征的关系,并确定不同孕妇群体的最佳检测时点,此外再针对女胎提出异常判定方法,旨在为临床优化NIPT检测策略提供科学依据。
问题1: 依据表格有关男胎的Y染色体浓度数据和孕妇特征数据,研究Y染色体浓度与孕周、BMI等变量的相关性,建立定量关系模型,并对模型进行显著性检验,验证关系是否可靠。
问题2: BMI是影响男胎Y染色体浓度最早达标时间的主要因素,需要按BMI对男胎孕妇进行分组并确定最佳检测时点(浓度首次大于等于4%),从而最小化潜在风险,并分析检测误差影响。
问题3: 在问题二中BMI分组的基础上,综合考虑身高、体重、年龄等多种因素,以及检测误差和达标比例(浓度达到或超过4%的比例),再结合男胎孕妇的BMI,适当调整分组情况以及每组的最佳NIPT时点,使得孕妇潜在风险达到最小化,并分析检测误差对结果的影响。
问题4: 针对女胎,为了明确女胎异常的判定方法,需结合21号、18号、13号染色体的Z值、GC含量、读段数及其比例、X染色体相关指标、孕妇BMI等特征,输出一个可用于判定女胎是否异常的模型或规则。
问题分析
问题一分析
针对问题一,本文选取孕妇的检测孕周、BMI、年龄三个指标,并对其与Y染色体浓度的偏相关性进行分析。首先通过Shapiro–Wilk检验(W检验)确认了所有变量均不服从正态分布,因此本文选用非参数的**Spearman偏相关分析**替代Pearson偏相关分析,在控制其他变量后,即可得出Y染色体浓度与孕妇各特征指标的偏相关性正负情况。其次,为验证并量化影响,进一步采用**多元线性回归模型**,并使用**最小二乘法**来寻找最佳的系数估计值,由此得到最终的拟合模型。最后,对该回归模型进行评估,计算 $R^2$ 与F统计量的p值,分别评判该模型的可解释性与显著性。
问题二分析
针对问题二,需要基于孕妇BMI对其分组,并为其推荐一个能以高概率成功检测到Y染色体的最优孕周。由于数据上严重的左删失 问题,本文将每位孕妇的纵向观测数据转化为生存分析框架 下的标准区间数据,明确标识出删失类型,并采用了多重插补(MI)技术,对删失区间内的真实达标时间进行合理的随机插补,以构建可供分析的完整数据集。然后,拟合结合 B样条 的半参数Cox比例风险模型 ,以灵活捕捉BMI对达标风险的非线性影响,并为每个个体预测其95%成功率的达标孕周(Q95时间点)。为确保业务逻辑的合理性,本文利用保序回归 对此“BMI-Q95预测周数”关系进行单调化处理 ,然后以此为监督信号,通过CART回归树 算法以数据驱动的方式寻找最优的BMI分组切点,并结合剪枝与最小样本量等策略保证分组的稳健性。最后,在每个分组内,使用Kaplan–Meier 方法估计生存曲线和推荐时间点(t95),再经跨组保序和半周圆整后得出最终推荐时点。为了分析检测误差对结果的影响,本文通过“模糊阈值”分析与“加噪蒙特卡洛 ”分析两种独立的敏感性分析来评估模型的稳健性,最终确认了模型在面对测量误差时的稳健性与可靠性。
问题三分析
针对问题三,需要在问题二的基础上考虑多种因素(如身高、体重、年龄等)的影响,按孕妇BMI对其进行分组,并为每组推荐一个最佳检测时点。在问题二的生存分析框架下的标准区间数据的基础上,首先拟合包含BMI、年龄等协变量的**区间删失加速失效时间模型**(AFT),比较得出最佳分布为log-normal,对每个个体的首次达标时间分布进行初步估计,得到个体层的精确分布参数与分位点(Q90/Q95)及 π(25)=P(T≤25)。其次,基于AFT模型的预测结果,利用CART回归树以 π(25) 为监督信号(最小化SSE)进行分组,以最大化组间差异;在每个分组内,为在高删失背景下稳健估计组内生存曲线,采用基于AFT条件分布的**多重插补(MI)**,再结合KM方法估计生存曲线和推荐时间点(t95),当KM结果不可靠时则回退至AFT模型的组内中位预测。此外,与问题二同理,本文通过“模糊区间”分析与“加噪蒙特卡洛”分析两种独立的敏感性分析来评估模型的稳健性,最终确认了模型在面对测量误差时的稳健性与可靠性。
问题四分析
针对女胎非整倍体异常的判定问题,核心挑战在于假阴性(漏诊)的临床代价远高于假阳性(误诊)。为应对此挑战,本文在进行特征工程 后,构建了一个以最小化非对称临床代价函数 为唯一优化目标的端到端集成学习模型 。该方法融合了XGBoost 和核SVM 两种模型的优势,并通过100次5折交叉验证对包括模型参数、集成权重和分类阈值在内的所有超参数进行联合寻优。此策略确保了最终模型的所有决策都精确地服务于“不惜代价降低漏诊率”这一核心临床需求,而非追求传统的准确率或AUC指标,从而在经过数据质控筛选后的高置信度样本上,实现了临床效用最大化的智能诊断。
符号说明
| 符号 | 说明 | 单位 |
| --- | --- | --- |
| $Y$ | Y染色体浓度 | / |
| $X_1$ | 孕妇检测孕周 | 周 |
| $X_2$ | 孕妇BMI值 | $kg/m^2$ |
| $X_3$ | 孕妇年龄 | 岁 |
| $\beta$ | 回归系数 | / |
| $\epsilon$ | 随机误差 | / |
| $T_i$ | 第 $i$ 位受试者的达标时间 | 周 |
| $L_i$ | 达标时间左边界阈值 | 周 |
| $R_i$ | 达标时间右边界阈值 | 周 |
数据分析
本题旨在探究孕妇的各项生理指标(如孕周、年龄、BMI)与胎儿游离性染色体浓度之间的关系。数据集包含了数千份胎儿的母体血浆样本检测记录。本题所提供的孕妇一系列特征数据具有明确的分布特点及规律,本节以男胎检测数据为例进行分析。
首先,从妊娠方式来看,绝大多数孕妇为自然受孕(98.5%),通过体外受精(IVF)或宫腔内人工授精(IUI)等辅助生殖技术受孕的孕妇比例极低,表明样本主要代表了普遍的自然受孕人群。其次,在胎儿健康状况方面,96.5%的胎儿被记录为健康,这为后续分析提供了一个相对均质的基线。
年龄、BMI、孕周以及Y染色体浓度等指标的分布揭示了样本的关键构成。通过分布柱状图可知,在年龄分布上,25-30岁年龄段的孕妇构成了最大的群体(超过400人),其次是30-35岁年龄段,整体呈现以青壮年育龄女性为主的典型分布。然而,在BMI方面,数据显示出本研究孕妇的一个显著特点:绝大多数孕妇(近770人)的BMI在27-37之间,少部分低于27或高于37。这表明本研究的样本群体主要由高BMI孕妇构成,为深入探究BMI对检测指标的影响提供了充足的数据支持。检测孕周无论按周还是按天排列,都呈现出明显的多峰形态,峰值大致出现在90天(约13周)、110天(约16周)和150天(约21周)附近。这表明样本的采集并非在孕期内均匀分布,而是集中在几个关键的临床检查时间点。作为本研究的核心因变量,Y染色体浓度的分布呈现出典型的严重右偏态。绝大多数样本的浓度值集中在较低的区间(0.05-0.10),仅有少数样本具有非常高的浓度值。X染色体的浓度分布呈现较标准的正态分布,峰值在0.05左右,但此数据对本题研究男胎的帮助不大。
此外,本文对不同指标的特殊值进行了整理与探究。通过四张箱线图,直观地揭示了研究孕妇队列中关键变量的分布特征。它表明,该研究的样本主要由年龄集中在26-32岁的青壮年女性构成,但一个显著的特点是孕妇的身体质量指数(BMI)普遍偏高,且存在大量高值异常点。更重要的是,作为核心指标的Y染色体浓度呈现出典型的严重右偏态分布,即绝大多数样本的浓度值都集中在较低的区间,仅有少数样本具有非常高的浓度。
最后,从数据采集的时间趋势来看,从2023年1月至2024年5月,每月的检测数量呈现出一定的周期性波动,峰值出现在2023年春夏季。同时,孕妇的平均BMI在不同月份间也存在小幅波动,但未显示出与检测量同步的明确趋势。这些时间维度的信息有助于理解数据采集的背景,并评估潜在的时间混杂效应。此外,对孕妇的检测抽血次数分析显示,绝大多数孕妇仅进行1-2次检测,也为模型的构建提供了数据结构信息。
数据预处理
本题所提供的数据存在格式不统一、读取不方便、数据不合理等问题,在建模之前,需要先进行数据预处理,以确保分析结果更加科学、合理,模型建构更加稳定。本文对原始数据集进行了系统化的调整,包括对日期格式的统一、孕周格式的转换以及对无效数据的剔除。该流程旨在提升数据的一致性、完整性与可用性,从而减少噪声与偏差对模型性能的干扰。
日期格式的统一
原始数据中,不同的日期存在格式不统一的情况——"末次月经"的数据的年月日被“/”分开,而“检测日期”的数据的年月日被直接拼接在一起。为了读取数据更方便,本文将日期字段的数据都改成了“某年某月某日”的格式。具体示例如下:
日期格式统一示例

| 原始数据 | 改后数据 |
| --- | --- |
| 2023/5/20 或 20230520 | 2023年5月20日 |
孕周格式的转换
附件表格中的“检测孕周”字段的数据存在“周+天”的混合表示。本文将其全部换算成天数,方便数据的读取与比较。举例如下:
孕周格式转换示例:例如,记作“13w+6”(即13周又6天)的检测孕周换算为 $13\times 7+6=97$ 天。
初步的统计分析显示:样本覆盖的孕周主要集中在12周至20周之间,中位孕周约为14周;孕妇年龄分布广泛,平均年龄约31岁;BMI的中位数为 $23.5\,kg/m^2$,但存在部分高BMI($>30\,kg/m^2$)的样本,呈现右偏态分布。关键指标Y染色体浓度的原始数值分布极不均匀,同样呈现明显的右偏态,说明大部分样本浓度值较低,少数样本浓度非常高,为后续的统计建模提供了明确思路。
唯一比对读段数的筛选
在无创产前检测中,“唯一比对的读段数”是衡量数据有效性的重要指标,其能够唯一映射到参考基因组某一位置,反映了测序读段在参考基因组上的有效比对数量,还可有效减少因重复序列或错误比对导致的假阳性结构变异信号。本文对该指标进行了两个层面的筛选——读段数范围的界定以及读段数异常值的剔除。
(1)唯一比对读段数范围的界定
本文采用文献检索法,根据多篇方法学和临床研究,唯一比对读段数不应低于一个最低阈值,否则可能会因被检测基因片段过少而导致检测不准确。因此,依据文献内容,本文规定在检测第21号、18号、13号染色体浓度时,将 $0.15\times\text{覆盖度}\approx 3$ M条作为常用NIPT测序量的最低阈值。据此剔除掉了所有唯一比对读段数低于300万条的数据,避免因测序量不足或比对效率低导致的浓度估计偏差。
(2)唯一比对读段数异常值的剔除
附件中数据存在“唯一比对读段数”大于“原始读段数”的情况。因为前者是后者的一个子集,所以这在正常情况下是不可能发生的。因此,本文剔除了所有存在此情况的孕妇数据,一共71条。部分异常值数据如下:
唯一比对读段数异常值部分示例

| 序号 | 孕妇代码 | 原始读段数 | 唯一比对的读段数 |
| --- | --- | --- | --- |
| 690 | A169 | 2132408 | 4395037 |
| 695 | A171 | 2879248 | 3626619 |
| 696 | A171 | 3636973 | 3737311 |
| 698 | A171 | 3439440 | 3745489 |
经过上述步骤,数据集在时间格式、孕周表示及测序质量方面均实现了统一与优化。此外,本文对BMI值与身高、体重是否匹配进行了检验,发现全部匹配,无需调整。上述处理显著降低了因数据格式不一致或低质量样本引入的偏差风险,为后续的相关性分析、分组策略制定及数学模型构建奠定了坚实基础。
问题一的模型的建立和求解
数据整理与目标分析
本节旨在探究男胎样本中,哪些因素对母亲血浆中的胎儿Y染色体浓度产生影响。本文选取了孕周、孕妇BMI和孕妇年龄三个关键变量,希望评估它们与Y染色体浓度之间的独立关系。
针对预处理后的男胎检测数据,实行进一步的整理。对孕周数,仅提取周数作为数值,舍弃不满一周的天数;对需研究的四个变量,定义Y染色体浓度为 $Y$,检测孕周数为 $X_1$,孕妇BMI为 $X_2$,孕妇年龄为 $X_3$。由于需要判定变量间纯净的相关性,首先考虑使用偏相关系数进行判定。但是,标准的偏相关系数内使用的是Pearson相关系数,而使用Pearson相关分析的前提条件是所有变量均服从正态分布。经过Shapiro–Wilk检验(W检验),发现数据并不满足正态性,则放弃Pearson相关,转而采用更适合的非参数方法——Spearman偏相关分析。使用该方法,可以计算每个变量与其他所有变量之间的偏相关系数,从而分析变量之间的相关性强弱与正负。为了给出相应的关系模型,本节采用了多元线性回归,并对其显著性进行更深层次的分析,验证这三个因素对Y染色体浓度的独立影响。
为探究孕周、孕妇BMI及年龄对Y染色体浓度的独立影响,本研究构建了系统的统计分析模型。首先,通过正态性检验(Shapiro–Wilk检验)评估数据分布特性,结果显示所有关键变量均不服从正态分布。基于此,本文采用非参数的Spearman偏相关分析来衡量各变量间的单调关系强度。同时,为量化各因素的综合影响并建立预测模型,本文运用普通最小二乘法构建了多元线性回归模型,最终得到一个能够解释Y染色体浓度变化的数学方程。
正态性检验
Pearson相关系数的有效性依赖于变量服从正态分布的假设。为检验此假设,本文对四个核心变量采用Shapiro–Wilk检验进行正态性分析。
原假设 $H_0$:变量的样本数据来自于一个正态分布的总体。
备择假设 $H_1$:变量的样本数据不来自于一个正态分布的总体。
检验结果如下表:

Shapiro–Wilk检验结果

| 变量 | 检验p值 | 是否服从正态分布 |
| --- | --- | --- |
| Y染色体浓度 | <0.0001 | 否(p<0.05) |
| 检测孕周 | <0.0001 | 否(p<0.05) |
| 孕妇BMI | <0.0001 | 否(p<0.05) |
| 年龄 | <0.0001 | 否(p<0.05) |
由此可知,所有变量的p值均远小于0.05的显著性水平。因此,本文拒绝了所有变量服从正态分布的假设。由于数据不满足正态性假设,使用Pearson偏相关分析可能会导致结果的偏差。因此,本文选择Spearman偏相关分析作为替代方法。Spearman相关是基于等级的非参数检验,它不要求数据服从特定的分布,对于非线性的关系也更为稳健,是处理当前数据的更优选择。
Spearman偏相关分析
Spearman偏相关分析结合了Spearman秩相关与偏相关的思想,既能处理非正态、非线性数据,又能在分析两个变量关系时控制其他变量的干扰。它不要求数据服从正态分布,适用于像本题数据一样分布未知的数据。
(1)**数据排序**:将所有涉及的变量 $(Y,X_1,X_2,X_3)$ 的原始数据转换为等级。
(2)**计算偏相关系数**:当需要评估多个变量之间的独立性关系时,最系统的方法是使用**逆矩阵法**一次性计算所有偏相关系数。
构建相关系数矩阵:首先,本文为所有涉及的变量 $(Y,X_1,X_2,X_3)$ 创建一个零阶Spearman相关系数矩阵 $\mathbf{C}$。
$$
\mathbf{C} = \begin{pmatrix}
1 & r_{YX_1} & r_{YX_2} & r_{YX_3} \\
r_{X_1Y} & 1 & r_{X_1X_2} & r_{X_1X_3} \\
r_{X_2Y} & r_{X_2X_1} & 1 & r_{X_2X_3} \\
r_{X_3Y} & r_{X_3X_1} & r_{X_3X_2} & 1
\end{pmatrix}
$$
计算逆矩阵:计算该相关系数矩阵 $\mathbf{C}$ 的逆矩阵,记为 $\mathbf{P} = \mathbf{C}^{-1}$。
计算偏相关系数:任意两个变量 $i$ 和 $j$ 在控制了集合中所有其他变量后的偏相关系数,可以通过逆矩阵 $\mathbf{P}$ 的元素计算得出:

$$
r_{ij \cdot Z} = - \frac{p_{ij}}{\sqrt{p_{ii}\, p_{jj}}}
$$
例如,要计算Y染色体浓度($Y$)与检测孕周($X_1$)在控制了BMI($X_2$)和年龄($X_3$)后的偏相关系数 $r_{YX_1 \cdot X_2X_3}$,本文使用以下公式:

$$
r_{YX_1 \cdot X_2X_3} = - \frac{p_{YX_1}}{\sqrt{p_{YY}\, p_{X_1X_1}}}
$$
这种方法为计算多变量控制下的偏相关系数提供了一个系统性的框架。其通过对相关矩阵求逆,一次性获得在控制其他变量影响后的全部成对相关系数,计算效率高、结果结构化且对称一致,适合多变量和高维数据分析;结合Spearman秩相关矩阵时,还能兼具抗异常值和非正态分布的稳健性。
(3)**假设检验**:对计算出的每个偏相关系数进行显著性检验。
原假设 $H_0$:两个变量在控制了协变量后不相关,即偏相关系数 $\rho = 0$。
备择假设 $H_1$:两个变量在控制了协变量后存在相关性,即 $\rho \neq 0$。
p值的计算基于t分布,其统计量和自由度的计算如下:
t统计量:$t = r \sqrt{\dfrac{n - k - 2}{1 - r^2}}$;
自由度:$df = n - k - 2$,其中 $n$ 是样本量,$k$ 是控制变量的数量。
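上述逆矩阵法及其显著性检验可以用很简短的代码实现。下面给出一个示意性的Python实现(其中DataFrame的列名、函数命名均为示意性假设,并非附件中的原始字段名):

```python
import numpy as np
import pandas as pd
from scipy import stats

def spearman_partial_corr(df: pd.DataFrame) -> pd.DataFrame:
    """逆矩阵法:先取秩得到零阶Spearman相关矩阵C,再由P=C^{-1}计算全部偏相关系数。"""
    ranked = df.rank()                                  # (1) 数据排序:转换为等级
    C = np.corrcoef(ranked.to_numpy(), rowvar=False)    # 等级上的Pearson相关即Spearman相关
    P = np.linalg.inv(C)                                # (2) 逆矩阵 P = C^{-1}
    denom = np.sqrt(np.outer(np.diag(P), np.diag(P)))
    partial = -P / denom                                # r_{ij·Z} = -p_ij / sqrt(p_ii * p_jj)
    np.fill_diagonal(partial, 1.0)
    return pd.DataFrame(partial, index=df.columns, columns=df.columns)

def partial_corr_pvalue(r: float, n: int, k: int) -> float:
    """(3) 假设检验:基于t分布计算单个偏相关系数的双侧p值,k为控制变量个数。"""
    t = r * np.sqrt((n - k - 2) / (1 - r ** 2))
    return 2 * stats.t.sf(abs(t), df=n - k - 2)

# 用法示意:mat = spearman_partial_corr(df[["Y染色体浓度", "检测孕周", "孕妇BMI", "年龄"]])
```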
使用Spearman方法,计算了每个变量与其他所有变量之间的偏相关系数。结果矩阵如下:
Spearman偏相关系数矩阵(基于秩次)

| | Y染色体浓度 | 检测孕周 | 孕妇BMI | 年龄 |
| --- | --- | --- | --- | --- |
| Y染色体浓度 | 1.000000 | 0.086422 | -0.140745 | -0.095451 |
| 检测孕周 | 0.086422 | 1.000000 | 0.145435 | -0.013050 |
| 孕妇BMI | -0.140745 | 0.145435 | 1.000000 | 0.027144 |
| 年龄 | -0.095451 | -0.013050 | 0.027144 | 1.000000 |
结果显示:
(1)Y染色体浓度与检测孕周($r=0.086$):在控制了孕妇BMI和年龄后,Y染色体浓度与检测孕周存在微弱的**正相关**关系。
(2)Y染色体浓度与孕妇BMI($r=-0.141$):在控制了孕周和年龄后,Y染色体浓度与孕妇BMI存在弱的**负相关**关系。这是三个因素中相关性最强的一个。
(3)Y染色体浓度与孕妇年龄($r=-0.095$):在控制了孕周和BMI后,Y染色体浓度与孕妇年龄存在微弱的**负相关**关系。
通过对相关性系数的分析与对比,得出结论:孕周与Y染色体浓度呈正相关,在BMI和年龄相近的情况下,孕周越长,Y染色体浓度越高;孕妇BMI与Y染色体浓度呈负相关。在孕周和年龄相近的情况下,孕妇BMI越高,Y染色体浓度越低,这可能与“稀释效应”有关;孕妇年龄与Y染色体浓度呈负相关。在孕周和BMI相近的情况下,孕妇年龄越大,Y染色体浓度越低。
多元线性回归模型的建立与求解
根据上述分析,为量化孕周、孕妇BMI、孕妇年龄三个自变量与Y染色体浓度的关系模型,本文采用多元线性回归,其能够同时考虑多个自变量对因变量的影响,在控制其他因素的情况下量化各变量的独立贡献,从而更全面、准确地解释因变量的变化规律;其可通过回归系数、显著性检验和拟合优度等指标评估模型的统计可靠性与预测能力。本题中,该模型不仅能对变量间的相关性情况进行验证,还能量化其影响,更深层次地评估变量间的关联情况。
多元线性回归模型的建立
(1)模型设定
根据题意,结合模型可基本假设因变量 $Y$ 可以表示为自变量 $X_1, X_2, X_3$ 的线性组合,加上一个随机误差项 $\epsilon$。理论上,总体回归模型可以表示为:

$$
Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \epsilon_i
$$
其中:
$i$ 代表第 $i$ 个观测样本。
$Y_i$ 是第 $i$ 个样本的Y染色体浓度观测值。
$X_{i1}, X_{i2}, X_{i3}$ 分别是第 $i$ 个样本的检测孕周、孕妇BMI和年龄的观测值。
$\beta_0$ 是截距项,表示当所有自变量都为0时,$Y$ 的期望值。
$\beta_1, \beta_2, \beta_3$ 是回归系数。$\beta_j$ 表示在其他自变量保持不变的情况下,$X_j$ 每增加一个单位,$Y$ 的平均变化量。
$\epsilon_i$ 是随机误差项,代表了模型未能解释的所有其他因素对 $Y_i$ 的影响。它满足高斯-马尔可夫假设,即期望为0,方差恒定,且相互独立。
(2)模型拟合:最小二乘法
由于总体系数 $\beta_j$ 无法被直接观测到,需要通过样本数据来估计它们。本节使用**普通最小二乘法**来寻找最佳的系数估计值。该方法的目标是找到一组系数估计值 $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$,使得残差平方和(SSR)最小。样本回归模型是总体模型的估计形式:

$$
\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \hat{\beta}_2 X_{i2} + \hat{\beta}_3 X_{i3}
$$

其中 $\hat{Y}_i$ 是Y染色体浓度的拟合值或预测值。对于每个观测值,残差 $e_i$ 是观测值与拟合值之差:

$$
e_i = Y_i - \hat{Y}_i
$$

最小二乘法的目标是最小化所有残差的平方和(SSR):

$$
\text{SSR} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_{i1} - \hat{\beta}_2 X_{i2} - \hat{\beta}_3 X_{i3})^2
$$

其中 $n$ 是样本量(在本文的案例中,$n=851$)。
为了找到最小化SSR的 $\hat{\beta}_j$,需要对SSR分别求关于 $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$ 的偏导数,并令其等于0。这个过程会得到一组正规方程组。在矩阵形式下,这个过程更为简洁。假设:
1. $\mathbf{Y}$ 是一个 $n \times 1$ 的因变量观测值向量。
2. $\mathbf{X}$ 是一个 $n \times 4$ 的设计矩阵(包含一列全为1的截距项和三列自变量)。
3. $\boldsymbol{\beta}$ 是一个 $4 \times 1$ 的系数向量。
4. $\hat{\boldsymbol{\beta}}$ 是 $\boldsymbol{\beta}$ 的估计向量。

$$
\mathbf{Y} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad
\mathbf{X} = \begin{pmatrix}
1 & X_{11} & X_{12} & X_{13} \\
1 & X_{21} & X_{22} & X_{23} \\
\vdots & \vdots & \vdots & \vdots \\
1 & X_{n1} & X_{n2} & X_{n3}
\end{pmatrix}, \quad
\hat{\boldsymbol{\beta}} = \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \\ \hat{\beta}_3 \end{pmatrix}
$$
SSR可以表示为:

$$
\text{SSR} = (\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}})^T (\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}})
$$

通过求解 $\dfrac{\partial(\text{SSR})}{\partial \hat{\boldsymbol{\beta}}} = 0$,本文得到正规方程的矩阵形式:

$$
(\mathbf{X}^T \mathbf{X})\, \hat{\boldsymbol{\beta}} = \mathbf{X}^T \mathbf{Y}
$$

最终,本文可以解出系数的估计向量 $\hat{\boldsymbol{\beta}}$:

$$
\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}
$$
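正规方程的求解过程可以用如下的数值示意代码表达(Python,变量名为示意;实际建模中也可直接使用statsmodels的OLS获得同样的估计与检验结果):

```python
import numpy as np

def ols_fit(X_raw: np.ndarray, y: np.ndarray) -> np.ndarray:
    """X_raw: n x 3 自变量矩阵(检测孕周、孕妇BMI、年龄),y: n 维Y染色体浓度向量。"""
    n = X_raw.shape[0]
    X = np.column_stack([np.ones(n), X_raw])      # 设计矩阵:加入全1截距列
    # 理论解为 (X^T X)^{-1} X^T y;数值上用最小二乘求解器等价且更稳定
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat                               # [beta0, beta1, beta2, beta3]
```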
本次模型构建旨在量化检测孕周、孕妇BMI及年龄对Y染色体浓度的独立影响。为此,本文采用了双重分析策略:
首先,通过Spearman偏相关分析,在克服数据非正态分布的挑战后,确立了各变量与Y染色体浓度之间独立关系的性质与方向(正相关或负相关)。
随后,构建多元线性回归模型,其目的不仅在于验证偏相关分析的发现,更在于将这些影响进行精确量化,从而建立一个可用的预测方程。
两个模型的结果高度一致:回归系数的符号(正/负)与偏相关系数的方向完全吻合,这为结论的稳健性提供了有力支持。
多元线性回归模型的求解
具体的系数估计值:

$$
\begin{cases}
\hat{\beta}_0 = 0.1384 \\
\hat{\beta}_1\ (\text{检测孕周}) = 0.0002 \\
\hat{\beta}_2\ (\text{孕妇BMI}) = -0.0016 \\
\hat{\beta}_3\ (\text{年龄}) = -0.0010
\end{cases}
$$

将这些估计值代入样本回归方程,得到最终的拟合模型:

$$
\widehat{\text{Y染色体浓度}} = 0.1384 + 0.0002 \times (\text{检测孕周}) - 0.0016 \times (\text{孕妇BMI}) - 0.0010 \times (\text{年龄})
$$

该方程就是对Y染色体浓度进行预测的数学模型。例如,对于一个检测孕周为20周、BMI为25、年龄为30岁的孕妇,其Y染色体浓度的预测值为:

$$
\hat{Y} = 0.1384 + 0.0002 \times 20 - 0.0016 \times 25 - 0.0010 \times 30 = 0.0724
$$
这个过程清晰地展示了如何从理论模型出发,利用OLS方法和样本数据,最终得到一个可用于解释和预测的、具体的数学方程。
模型评估
本文中的多元线性回归模型的 $R^2=0.045$,表明检测孕周、孕妇BMI和年龄这三个变量能共同解释Y染色体浓度约4.5%的变异。这表明模型虽然显著,但解释力有限;F统计量的p值为 $1.66\times10^{-8}$,远低于0.05,表明整个回归模型是高度统计显著的。
本次构建的回归模型在统计学上是高度显著的。模型的F统计量对应的p值($1.66\times10^{-8}$)远小于0.05的显著性水平,这有力地证明了检测孕周、孕妇BMI和年龄这三个自变量联合起来对Y染色体浓度确实存在显著的预测关系,整个方程是成立的,其揭示的规律不太可能由随机误差所致。
从模型系数来看,所有纳入的自变量——检测孕周、孕妇BMI和年龄——均为统计上非常显著的预测因子。具体而言,检测孕周与Y染色体浓度呈显著正相关,而孕妇BMI和年龄则与其呈显著负相关。这意味着,在控制了其他变量后,孕周越长、BMI越低、年龄越轻的孕妇,其血浆中胎儿Y染色体浓度倾向于更高。
然而,模型的实际解释能力非常有限。$R^2$ 仅为0.045,表明该模型只能解释Y染色体浓度总变异的4.5%。这是一个相当低的比例,但在已有的数据下已经是较好的结果了。实际上,在模型的初步探索阶段,我们尝试了多种方法,包括但不限于决策树、XGBoost等非线性回归模型。尽管这些模型在训练集上的效果很好($R^2$ 大于0.9),但在测试集上的效果却很差($R^2$ 小于0),说明即使是功能强大的预测模型也很难在已有数据集上很好地捕捉到变量间的关系。在查找相关医学文献后,发现多元线性回归是解释Y染色体浓度与BMI等指标关系的常见模型。综合以上研究结果,本文认为,多元线性回归在现有数据集上效果虽然并不理想,但仍是解决此问题的最佳方案。数据集上诸多重要但未被捕捉的因素的缺失才是造成效果不良的根本原因。这些未被捕捉的因素可能包括复杂的生物学机制,如胎盘功能状态、母体与胎儿间的个体生物学差异、遗传背景,以及检测技术本身的波动等。
综上所述,该模型成功地识别出了几个影响Y染色体浓度的关键统计学指标及其影响方向,为理解这一生理现象提供了有价值的线索。但其较低的R平方值也明确警示本文,Y染色体浓度是一个受多重复杂因素共同调控的指标,孕妇检测孕周、孕妇BMI、孕妇年龄这三个指标对Y染色体浓度的影响的贡献很小,但显著性很强。
问题二的模型的建立和求解
为解决传统BMI分组在预测检测成功率方面区分度不足的问题,本研究构建了一套数据驱动的监督式分箱。该方法首先将问题转化为一个生存分析框架,通过区间删失方法处理纵向检测数据中“指标达标”事件发生时间的不确定性。随后,本文采用多重插补技术来应对区间删失带来的问题,并在每个插补数据集上拟合带B样条变换的Cox比例风险模型,为每个BMI值预测一个“指标达标孕周”。最后,以该预测孕周为监督信号,通过CART回归树算法以数据驱动的方式寻找最优的BMI分组切点,并结合剪枝与最小样本量等策略保证分箱的稳健性。最后,在每个分组内,使用Kaplan–Meier方法估计生存曲线和推荐时间点(t95),再经跨组保序和半周圆整后得出最终推荐时点。为了分析检测误差对结果的影响,本文通过“模糊区间”分析与“加噪蒙特卡洛”分析两种独立的敏感性分析来评估模型的稳健性,最终确认了模型在面对测量误差时的稳健性与可靠性。
数据探索性分析(EDA)
在正式构建监督式分箱模型之前,本文首先对原始数据进行了探索性分析(EDA),以评估传统BMI分组的预测能力。通过绘制BMI与关键检测指标的散点图并叠加LOWESS平滑曲线,探究BMI与关键检测指标的相关性情况,这为后续建模提供了理论依据。随后,本文将孕妇按照常规的BMI标准(例如[20,28)、[28,32)、[32,36)、[36,40)、40以上)进行分组,并对各组分别进行了KM生存分析,以比较其“指标达标”事件的发生率曲线。分析结果显示,尽管不同分组的KM曲线呈现出一定的分离趋势,但各曲线间区分度不足,甚至存在交叉现象。这一发现明确地揭示了传统分组方法无法有效且稳定地对风险进行分层,从而凸显了采用数据驱动的监督式学习方法来寻找最优风险分割点的必要性。
探索性数据分析结果
对本题所给的原始分组样例进行分析后,得出了孕妇BMI值与关键检测指标之间的关系,如下图所示。图中包含了所有数据点的散点图,以及一条LOWESS平滑拟合曲线。曲线清晰地揭示了BMI与关键检测指标之间存在负相关趋势。即随着BMI的增高,可能导致检测成功的关键生物指标浓度趋于下降,这是后续建模的理论基础。其次,基于传统的BMI分组,绘制了KM生存曲线。这里的“生存”事件可以理解为“未发生检测误差”。
图中各曲线存在一定的分离趋势,表明不同BMI分组的检测误差率确实存在差异。然而,曲线之间的分离度存在交叉,说明简单的BMI分组不足以清晰、稳定地划分风险等级,这凸显了后续采用“监督式分箱”的必要性。
阈值的设定
在本研究中,阈值 $c$ 是定义核心分析事件的基石。它代表了一个关键的临床或技术判断标准,用于判定某次检测的测量值 $y_{ij}$ 是否“达标”。具体而言,当 $y_{ij} \ge c$ 时,本文将其定义为一次成功的“命中”(hit)事件。这个看似简单的二元判定是整个建模流程的起点,它将原始的、纵向的测量值序列,转化为一个标准的生存分析问题。模型的核心目标正是预测每位孕妇的测量值首次“命中”该阈值 $c$ 所需的孕周 $T_i$。通过在时间序列上应用此规则,得以构建出区间删失数据 $[L_i, R_i]$,为后续的Cox风险建模和监督式分箱提供了必需的输入。此外,为了评估模型对该关键参数的稳健性,本文还引入了“模糊阈值”区间 $[c_\ell, c_u]$ 的概念,用以模拟阈值本身存在不确定性的情况,从而检验最终结论的稳定性。
令原始行级观测集合为 $\mathcal{R}=\{(i,j): (\mathrm{pid}_i, t_{ij}, y_{ij}, \mathrm{BMI}_i)\}$,其中 $\mathrm{pid}_i$ 是第 $i$ 位受试者,$i=1,\dots,n$;阈值 $c$ 用于判定“达标”:$\mathrm{hit}_{ij} = \mathbf{1}\{y_{ij} \ge c\}$。对每位受试者的真实达标时间记为 $T_i$,若未观测到达标则视为右删失。观测可表现为:区间删失、左删失或右删失。
给定模糊阈值区间 $[c_\ell, c_u]$,对观测序列定义如下规则:
(1)若存在最早的 $j$ 使得 $y_{ij} > c_u$,则令 $R_i=w_{ij}$,并令 $L_i$ 为 $R_i$ 之前最近一次检测时间点(若无则设为0或预先设定的下界),此时为区间删失。
(2)若全序列中没有 $y_{ij} > c_u$,则视为右删失,$L_i=\max_j w_{ij}$。
多重插补方法(MI)
由于本文无法观测到每位孕妇检测指标首次“达标”的确切孕周,只能确定它发生于某两次检测之间的时间窗口 $[L_i, R_i]$ 内,并且由于很多数据在第一次检测时就已达标,出现严重的左删失情况,因此无法直接应用标准生存模型。为解决此问题,本文采用了多重插补技术。该方法的核心思想是:与其用一个单一值(如区间中点)来估计未知的“达标”时间,不如通过重复抽样生成 $M$ 个合理的“伪”完整数据集。在每个数据集中,本文根据一个预设的概率分布为每个区间删失的样本随机赋一个“达标”时间点 $T_i$。后续的Cox风险模型将在这 $M$ 个插补数据集上分别独立运行,最后将 $M$ 次的分析结果进行汇总,从而得到一个考虑了原始数据不确定性的、更为稳健的最终估计。
为处理“不确定”的达标时间,根据区间化规则,对第 $i$ 位受试者,若存在最早满足 $\mathrm{hit}_{ik}=1$ 的索引 $k$:
(1)若存在 $j^*<k$ 使 $\mathrm{hit}_{ij^*}=0$,则定义区间删失:$L_i=w_{i,j^*},\ R_i=w_{i,k},\ \text{ctype}_i=\text{区间删失}$。
(2)若无前置负例,则视为左删失,取 $L_i = w_{\mathrm{lb}}$(常设为检测下限,例如6周),$R_i = w_{i,k},\ \text{ctype}_i=\text{左删失}$。
(3)若序列中无正例,则右删失,设 $L_i = w_{i,\max},\ R_i = +\infty,\ \text{ctype}_i=\text{右删失}$。
左删失比例与BMI值的关系如图:
然后,对每个左删失样本做 $M$ 次插补,构造完整时间样本:
(1)**均匀插补**:$T_i^{(m)} \sim \mathrm{Unif}(L_i,R_i),\ m=1,\dots,M$;
(2)**截断指数插补**:给定尺度参数 $\theta>0$,密度为

$$
f(t) = \frac{(1/\theta)\, e^{-(t-L_i)/\theta}}{1 - e^{-(R_i-L_i)/\theta}},\quad t\in(L_i,R_i),
$$

其逆变换采样为

$$
t = -\theta \ln\big(1 - U\, (1-e^{-(R_i-L_i)/\theta})\big) + L_i,\quad U\sim\mathrm{Unif}(0,1);
$$

(3)**自适应插补**:先按BMI分组计算每组左删失比例 $r_g$,若 $r_g$ 超过阈值 $\tau$(默认0.6),则对该组采用截断指数插补并调整左界(例如 $L_i\leftarrow 10$ 周)以引入向右偏的插补分布。
对右删失样本取 $T_i=L_i$ 且事件指示 $\delta_i=0$。由此得到 $M$ 个完整数据集 $\mathcal{D}^{(m)}$。
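均匀插补与截断指数插补(逆变换采样)的实现都很简短,下面给出一个Python示意(函数与参数命名为示意):

```python
import numpy as np

def uniform_impute(L: float, R: float, M: int, rng=None) -> np.ndarray:
    """均匀插补:T ~ Unif(L, R),共抽取 M 次。"""
    rng = rng or np.random.default_rng()
    return rng.uniform(L, R, size=M)

def trunc_exp_impute(L: float, R: float, theta: float, M: int, rng=None) -> np.ndarray:
    """截断指数插补:按正文给出的密度,在 (L, R) 内用逆变换法抽样。"""
    rng = rng or np.random.default_rng()
    U = rng.uniform(size=M)
    return L - theta * np.log(1.0 - U * (1.0 - np.exp(-(R - L) / theta)))

# 对每个删失个体调用上述函数,即可得到 M 个“伪”完整数据集所需的插补时间
```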
Cox模型的建立与分位时间预测
在本研究中,Cox比例风险模型是连接孕妇BMI与其检测指标“达标”事件风险的核心预测引擎。由于BMI与“达标”风险之间的关系可能并非简单的线性,本文不直接使用原始BMI值,而是采用B样条对其进行变换。B样条能将BMI灵活地表示为一组分段多项式基函数,从而有效捕捉两者间复杂的非线性模式。在经过多重插补后,本文在每个插补数据集上独立拟合一个Cox模型,该模型以BMI的B样条变换结果作为协变量,用以估计每个BMI值对应的“达标”瞬时风险。通过该模型,能为每位孕妇计算出其个体化的生存函数 $\hat{S}_i(t)$,并从中推导出关键的预测分位时间 $\hat{T}_{i,p}$,即预测该孕妇有 $p$ 概率实现指标达标的孕周。这个预测孕周最终将作为监督式学习的目标,被用于训练后续的回归树,以找出最佳的BMI风险分割点。
在每一插补集 $\mathcal{D}^{(m)}$ 上,使用BMI的样条基向量 $X_i$(B样条)拟合Cox模型:

$$
\lambda_i(t) = \lambda_0(t) \exp(\beta^\top X_i),
$$

估计后得到个体生存函数(在离散时间网格 $\mathcal{T}$ 上):

$$
\widehat S_i^{(m)}(t) = \widehat S_0^{(m)}(t)^{\exp(\beta^\top X_i)},
$$

随即对给定概率 $p$(常用 $p=0.95$),定义第 $p$ 分位预测:

$$
\widehat T_{i,p}^{(m)} = \inf\{t\in\mathcal{T}:\ \widehat S_i^{(m)}(t) \le 1-p\};
$$

跨插补聚合(本文取中位数)得到最终预测:

$$
\widehat T_{i,p} = \operatorname{median}_{m=1}^{M}\ \widehat T_{i,p}^{(m)}.
$$
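以下是在单个插补数据集上拟合“B样条(BMI)+Cox”并求取个体Q95达标孕周的一个简要示意(Python,假设使用lifelines与patsy两个库;数据框 dfm 含插补后的达标时间 T、事件指示 E 与 BMI 三列,列名与样条自由度均为示意性假设):

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from patsy import dmatrix

def fit_cox_q95(dfm: pd.DataFrame, p: float = 0.95) -> pd.Series:
    # BMI 的 B 样条基(自由度取4仅作示意),"- 1" 去掉截距列
    spline = dmatrix("bs(BMI, df=4) - 1", dfm, return_type="dataframe").reset_index(drop=True)
    data = pd.concat([spline, dfm[["T", "E"]].reset_index(drop=True)], axis=1)
    cph = CoxPHFitter().fit(data, duration_col="T", event_col="E")
    surv = cph.predict_survival_function(spline)     # 行为时间网格,列为个体的 S_i(t)

    def first_crossing(s: pd.Series) -> float:
        hit = s.index[s.values <= 1 - p]              # S_i(t) 首次降到 1-p 以下的时间
        return float(hit[0]) if len(hit) else np.inf  # 不可达时记为无穷

    return surv.apply(first_crossing, axis=0)         # 每个个体的 Q95 预测孕周
```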
曲线单调化
为保证BMI增加时预测不减少,令样本按BMI升序排列为 $x_{(1)}\le\dots\le x_{(n)}$,对应预测值为 $y_{(i)}=\widehat T_{(i),p}$。求单调不减序列 $\widetilde y_{(i)}$ 以最小化平方误差:

$$
\widetilde y = \arg\min_{\widetilde y_{(1)}\le\cdots\le\widetilde y_{(n)}} \sum_{i=1}^n \big(y_{(i)} - \widetilde y_{(i)}\big)^2.
$$

由保序回归求解,得到单调化后的预测 $y_i^{\mathrm{mono}}$。
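该单调化步骤可直接用scikit-learn的保序回归实现,示意如下(变量名为示意):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def monotonize(bmi: np.ndarray, q95_pred: np.ndarray) -> np.ndarray:
    """按BMI升序做保序回归,返回与原顺序对应的单调不减预测 y^mono。"""
    order = np.argsort(bmi)
    iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
    y_mono = np.empty_like(q95_pred, dtype=float)
    y_mono[order] = iso.fit_transform(bmi[order], q95_pred[order])
    return y_mono
```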
回归树模型与KM估计
单变量回归树是一种决策树回归模型,用于预测一个连续型目标变量,其输入只有一个特征变量。它通过递归地划分输入变量的取值区间,在每个区间内用一个常数值来进行预测。本文以单变量BMI为自变量拟合回归树,来拟合 $y^{\mathrm{mono}}$。理想上希望将BMI实轴划分为 $K$ 个区间 $\{\mathcal{I}_g\}_{g=1}^K$,使得区间内 $y^{\mathrm{mono}}$ 的方差尽可能小。
通过遍历剪枝参数,生成候选切点集合,并对候选方案施加最小宽度约束(相邻切点间距 $\ge w_{\min}$)与最小样本数约束(每叶子样本数 $\ge n_{\min}$),否则将相邻区间合并。其评价准则是优先选择最终叶子数等于目标 $K=4$ 的候选,并在这些候选中最小化MAE:

$$
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^n \big| y_i - \widehat y_{g(i)} \big|,
$$

其中 $\widehat y_{g}$ 为组内中位数。
在最终分组下对每组做Kaplan–Meier(KM)估计,并估计第 $p$ 分位 $t_{g,p}$。
KM估计是一种用于生存分析的非参数统计方法,主要用于估计某个事件发生前的时间分布。它不依赖于生存时间的特定分布假设,适用于各种类型的生存数据;它能有效处理右删失样本,且可用于比较不同组的生存差异。
对最终分箱的每组 $g$ 进行KM估计 $\widehat S_g(t)$,并求组内第 $p$ 分位:

$$
t_{g,p} = \inf\{t: \widehat S_g(t) \le 1-p\}.
$$

在MI环境下,对每个插补集分别计算 $t_{g,p}^{(m)}$,再跨插补聚合得到最终推荐。
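分箱与组内KM估计可分别借助scikit-learn的回归树与lifelines实现,下面是一个简化示意(假设 bmi、y_mono 为前文得到的数组,dfm 为某一插补数据集且含列 T、E、BMI;正文中的最小区间宽度、剪枝与跨组保序等约束此处从略):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from lifelines import KaplanMeierFitter

def bmi_cutpoints(bmi: np.ndarray, y_mono: np.ndarray, k: int = 4, min_leaf: int = 30):
    """以单变量回归树拟合 y_mono,取内部节点阈值作为BMI切点(共 k-1 个)。"""
    tree = DecisionTreeRegressor(max_leaf_nodes=k, min_samples_leaf=min_leaf)
    tree.fit(bmi.reshape(-1, 1), y_mono)
    return sorted(t for t in tree.tree_.threshold if t > 0)   # 叶结点的阈值为-2,被过滤

def group_t95(dfm: pd.DataFrame, cuts, p: float = 0.95) -> dict:
    """在给定切点下分组,并用KM估计各组的第 p 分位推荐时点 t95。"""
    labels = np.digitize(dfm["BMI"].to_numpy(), bins=cuts)
    out = {}
    for g, sub in dfm.groupby(labels):
        km = KaplanMeierFitter().fit(sub["T"], event_observed=sub["E"])
        sf = km.survival_function_.iloc[:, 0]
        hit = sf.index[sf.values <= 1 - p]
        out[g] = float(hit[0]) if len(hit) else np.nan        # 不可达时返回NaN
    return out
```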
回归树分组与KM估计结果
利用单变量回归树模型,得到最佳BMI切点数值及各组孕妇的最佳检测时点。
最佳BMI分组及检测时点

| BMI组别 | 人数 | 最佳检测周数 |
| --- | --- | --- |
| 29.0及以下 | 32 | 18.0 |
| [29.0, 31.1) | 89 | 18.5 |
| [31.1, 33.2) | 68 | 19.5 |
| 33.2及以上 | 68 | 23.0 |
对此分组结果重新绘制KM生存曲线,如下图所示:
与原始分组的KM曲线相比,此图中的各组曲线分离得更为清晰、层次分明,高风险组的“无误差率”显著低于低风险组。这证明本文的算法成功地找到了能最大化风险差异的BMI阈值。
模型评估
log-rank对数秩检验
在确定分组后,为量化各组别之间的区分度,本文还对相邻两个风险组的KM生存曲线进行了对数秩检验,得到其p值。如下表所示:
对数秩检验p值

| 组别比较 | $p_m$ | $p_f$ |
| --- | --- | --- |
| 29.0及以下 与 [29.0, 31.1) | 0.349 | 0.638 |
| [29.0, 31.1) 与 [31.1, 33.2) | 0.168 | 0.302 |
| [31.1, 33.2) 与 33.2及以上 | 0.093 | 0.051 |
由此可知,p值并非极小,这表明相邻风险组之间在KM生存曲线上的差异并非极其显著,有偶然因素掺杂。
敏感性分析
为探究检测误差对分组结果和最佳 NIPT时点的影响,本文分别进行了两个分析实验,分别测量噪声和模糊阈值对结果的影响。
首先是测量噪声的蒙特卡洛模拟。在每次模拟 $b=1,\dots,B$ 中,对所有行级观测加入独立同分布噪声:

$$
\tilde y_{ij}^{(b)} = y_{ij} + \varepsilon_{ij}^{(b)},\qquad \varepsilon_{ij}^{(b)}\sim\mathcal{N}(0,\sigma^2),\ \sigma=0.01.
$$

对每次扰动数据,重复区间化、MI($M$ 次)、Cox预测、保序单调化、回归树分箱(目标 $K=4$)与组内KM,得到

$$
\big(\mathcal{C}^{(b)},\; t_{1,p}^{(b)},\dots,t_{K,p}^{(b)}\big).
$$

收集各次模拟结果的经验分布,以估计切点与组内推荐时点的不确定性(均值、方差、置信区间、直方图等)。
噪声实验下的小提琴图
该过程等价于研究映射

$$
\Phi:\ \{y_{ij}\} \mapsto (\mathcal{C},\; t_{1,p},\dots,t_{K,p})
$$

在加噪扰动下的分布。若映射对噪声敏感,则输出分布会显示大方差或多峰性。实验结果如图11。
图中每个切点的小提琴形状都非常狭窄,且集中在一个很小的BMI值范围内。每个风险组对应的推荐孕周同样呈现出非常集中的分布。这说明基于本文分箱模型的BMI切点和临床建议(即对不同风险的个体建议不同的复查时间)都非常稳定。这强力证明了本文找到的BMI阈值是数据内在的、稳定的结构性特征,对噪声具有高度的鲁棒性。
然后是模糊阈值实验。给定模糊阈值区间 $[0.039, 0.041]$,对观测序列定义如下规则:
1. 若存在最早的 $j$ 使得 $y_{ij} > c_u$,则令 $R_i=w_{ij}$,并令 $L_i$ 为 $R_i$ 之前最近一次检测时间点(若无则设为0或预先设定的下界),此时为区间删失。
2. 若全序列中没有 $y_{ij} > c_u$,则视为右删失,$L_i=\max_j w_{ij}$。
与精确阈值判定不同,模糊阈值只把超过上界 $c_u$ 的观测视作确定达标,而对处于 $(c_\ell, c_u]$ 的观测不直接判定为达标,从而扩大区间不确定性。
对区间样本采用一次均匀插补:

$$
T_i^{\mathrm{fuzzy}} \sim \mathrm{Unif}(L_i,R_i).
$$

然后在已给定的BMI分箱下分别计算两种规则(精确阈值 vs. 模糊阈值)得到的组内KM估计与第 $p$ 分位 $t_{g,p}^{\text{exact}}$ 与 $t_{g,p}^{\text{fuzzy}}$,以比较阈值模糊性对推荐的影响。
不同风险组的模糊阈值与精确阈值对比结果如图所示。据图分析,即使在考虑了测量误差的更苛刻条件下,不同风险组之间的差异依然显著,且组内KM估计与第 $p$ 分位 $t_{g,p}^{\text{exact}}$、$t_{g,p}^{\text{fuzzy}}$ 的差异较小。这说明模型对于输入数据的检测误差具有良好的鲁棒性。
问题三的模型的建立和求解
根据问题二对原始BMI分组的探究,可知原始分组在本题也不是最优解。由于原始数据中大量的左删失情况,MI+Cox的解决方法效果并不理想,因此本文使用了在高删失下更稳定地估计高分位的AFT模型,另外AFT和半参数的Cox一样支持BMI、年龄、IVF等协变量,符合问题三中要考虑多因素影响的要求。
模型建立
为了将BMI连续变量分割为 $k$ 个有序区间,使得组内监督目标的离散度最小,设数据集 $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^n$,其中 $x_i\in\mathbb{R}$ 为第 $i$ 个个体的BMI,$y_i\in\mathbb{R}$ 为监督目标(如 $\pi_{25}=P(T\le 25)$ 或 $t_{95}$ 等)。我们以“首次达标时间” $T$ 为生存时间,采用“区间删失 + AFT模型 + MI条件插补 + KM估计”的联合策略;在分组层面,以监督式分箱(等价于一维回归树)最小化组内目标的平方误差和,获得分割点与稳定的组间梯度。
区间删失数据处理
对个体 $i$,检测时间序列为 $\{t_{i,1},\ldots,t_{i,m_i}\}$,对应Y浓度为 $\{y_{i,1},\ldots,y_{i,m_i}\}$。首次达标阈值取 $\tau=0.04$。首次达标的区间 $[L_i,R_i]$ 构造为

$$
[L_i,R_i]=
\begin{cases}
[0,\,t_{i,j^\ast}], & \text{左删失:} y_{i,1}\ge \tau,\\
[t_{i,j^\ast-1},\,t_{i,j^\ast}], & \text{区间删失:}\exists\, j^\ast \text{ s.t. } y_{i,j^\ast-1}<\tau\le y_{i,j^\ast},\\
[t_{i,m_i},\,+\infty), & \text{右删失:}\forall j,\; y_{i,j}<\tau.
\end{cases}
$$
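按上述规则为每位孕妇构造删失区间的一个简要实现示意如下(Python,假设纵向数据 long_df 含“孕妇代码”“孕周”“Y染色体浓度”三列,列名为示意):

```python
import numpy as np
import pandas as pd

def build_interval(times: np.ndarray, ys: np.ndarray, tau: float = 0.04):
    """返回 (L, R, 删失类型):左删失 / 区间删失 / 右删失。"""
    order = np.argsort(times)
    t, y = times[order], ys[order]
    hit = np.nonzero(y >= tau)[0]
    if len(hit) == 0:
        return t[-1], np.inf, "right"        # 全程未达标:右删失
    j = hit[0]
    if j == 0:
        return 0.0, t[0], "left"             # 首检即达标:左删失
    return t[j - 1], t[j], "interval"        # 两次检测之间达标:区间删失

intervals = (long_df.groupby("孕妇代码")
             .apply(lambda g: pd.Series(
                 build_interval(g["孕周"].to_numpy(), g["Y染色体浓度"].to_numpy()),
                 index=["L", "R", "ctype"])))
```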
定义删失类型指示 $(\delta_i^{\text{left}},\delta_i^{\text{int}},\delta_i^{\text{right}})\in\{0,1\}^3$,三者互斥且至多一者为1,用于后续似然与插补。基于全体个体的识别结果,可统计三类删失的频数与比例。
区间删失对统计推断的含义
左删失仅给出 $T_i\le R_i$ 的信息;右删失仅给出 $T_i> L_i$ 的信息;区间删失给出 $L_i<T_i\le R_i$ 的信息。
在参数模型中,这三类观测贡献不同的似然项;在非参 KM 框架中,需先通过“条件抽样”将其转化为仅含右删失的样本以便估计阶梯生存曲线。
AFT 加速失效时间模型
AFT模型刻画 $\log T$ 与协变量的线性关系:

$$
\log T_i=\mathbf{x}_i^\top\boldsymbol{\beta}+\sigma\epsilon_i,\qquad \epsilon_i\overset{\text{i.i.d.}}{\sim}F_0,
$$

其中 $\mathbf{x}_i$ 含BMI、年龄与IVF类别等,$F_0$ 取自一族基准分布(本研究候选为对数正态、Weibull、对数逻辑)。令

$$
\mu_i\equiv\mathbf{x}_i^\top\boldsymbol{\beta},\quad
F_i(t)\equiv P(T_i\le t\mid \mathbf{x}_i)=F_0\!\left(\frac{\log t-\mu_i}{\sigma}\right),\quad
S_i(t)=1-F_i(t).
$$
log-normal:$F_0=\Phi$,则 $t_p=\exp\{\mu_i+\sigma \Phi^{-1}(p)\}$,$S_i(t)=1-\Phi\big((\log t-\mu_i)/\sigma\big)$。
Weibull:设形状 $k=1/\sigma$、尺度 $\lambda=\exp(\mu_i)$,$F_i(t)=1-\exp\{-(t/\lambda)^k\}$。
log-logistic:$F_i(t)=\big[1+(t/\lambda)^{-k}\big]^{-1}$。
区间删失AFT似然:
记基准分布的CDF与密度为 $F_0, f_0$。第 $i$ 个体对对数似然的贡献为

$$
\ell_i(\boldsymbol{\beta},\sigma)=
\begin{cases}
\log F_i(R_i), & \delta_i^{\text{left}}=1,\\
\log\big\{F_i(R_i)-F_i(L_i)\big\}, & \delta_i^{\text{int}}=1,\\
\log\big\{1-F_i(L_i)\big\}, & \delta_i^{\text{right}}=1,
\end{cases}
$$

总体对数似然 $\ell=\sum_i \ell_i$。用极大似然估计 $(\hat{\boldsymbol{\beta}},\hat{\sigma})$,并以 $\mathrm{AIC} = 2k-2\ell(\hat{\theta})$ 比较候选分布,$k$ 为自由参数个数。
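本文的实际求解采用R的survreg(见后文“模型求解”部分);作为对该似然结构的补充说明,这里给出对数正态AFT在区间删失数据上的负对数似然及其数值极大化的一个极简Python示意(协变量构成与初值设定均为假设,仅示意似然的三类贡献项):

```python
import numpy as np
from scipy import optimize, stats

def negloglik(params, X, L, R, ctype):
    """params = [beta_0, ..., beta_{p-1}, log_sigma];ctype 取 'left'/'interval'/'right'。"""
    beta, sigma = params[:-1], np.exp(params[-1])
    mu = X @ beta
    F = lambda t: stats.norm.cdf((np.log(np.maximum(t, 1e-8)) - mu) / sigma)
    FL, FR = F(L), F(np.where(np.isinf(R), 1e12, R))          # 右删失的 R=+inf 用大数代替
    ll = np.where(ctype == "left", np.log(FR + 1e-12),         # 左删失:log F(R)
         np.where(ctype == "right", np.log(1 - FL + 1e-12),    # 右删失:log(1 - F(L))
                  np.log(FR - FL + 1e-12)))                    # 区间删失:log(F(R)-F(L))
    return -np.sum(ll)

def fit_lognormal_aft(X, L, R, ctype):
    """极大似然估计 (beta, sigma);AIC = 2k + 2 * 最小负对数似然。"""
    x0 = np.r_[np.zeros(X.shape[1]), 0.0]
    res = optimize.minimize(negloglik, x0, args=(X, L, R, ctype), method="Nelder-Mead")
    return res.x, 2 * (X.shape[1] + 1) + 2 * res.fun
```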
个体层预测量:
分位时间 $t_{90}, t_{95}$ 由 $t_p=\inf\{t:F_i(t)\ge p\}$ 给出;log-normal情形下 $t_p=\exp\{\mu_i+\sigma \Phi^{-1}(p)\}$;固定时点达标概率为 $\pi_{25,i}=P(T_i\le 25)=F_i(25)$。
这些量既用于监督分箱的目标(如 $\pi_{25}$),也用于MI条件插补与推荐时点的兜底(AFT中位 $t_{95}$)。
基于 AFT 拟合结果的 MI 条件插补
令 $F_i$ 为AFT下个体条件CDF。对于非右删失个体,进行 $M$ 次条件抽样:

$$
T_i^{(m)}\sim
\begin{cases}
F_i^{-1}\!\big(U\cdot F_i(R_i)\big), & \text{左删失 } (0,R_i],\\
F_i^{-1}\!\big(F_i(L_i)+U\cdot(F_i(R_i)-F_i(L_i))\big), & \text{区间删失 } (L_i,R_i],
\end{cases}
\qquad U\sim \mathrm{Unif}(0,1).
$$

右删失保持不变(仅记录删失界与删失指示)。每次插补得到仅含右删失的样本,之后据此计算KM曲线与分位数。本文取 $M=200$,以插补中位数与四分位距(IQR)表征不确定性。
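基于个体对数正态条件分布 $F_i$ 的条件抽样可借助scipy的分位数函数实现,示意如下(mu_i、sigma 取自AFT拟合结果;scipy中lognorm的参数化为 s=sigma、scale=exp(mu),函数命名为示意):

```python
import numpy as np
from scipy import stats

def conditional_impute(mu_i: float, sigma: float, L: float, R: float,
                       ctype: str, M: int = 200, rng=None) -> np.ndarray:
    """对单个个体按其条件分布在删失区间内抽取 M 个达标时间;右删失不插补。"""
    rng = rng or np.random.default_rng()
    dist = stats.lognorm(s=sigma, scale=np.exp(mu_i))
    U = rng.uniform(size=M)
    if ctype == "left":                       # T 落在 (0, R]
        return dist.ppf(U * dist.cdf(R))
    if ctype == "interval":                   # T 落在 (L, R]
        FL, FR = dist.cdf(L), dist.cdf(R)
        return dist.ppf(FL + U * (FR - FL))
    return np.full(M, np.nan)                 # 右删失:保留删失信息,不生成事件时间
```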
监督式分箱(回归树)(SSE 准则)
给定分组数 $K$ 与最小叶大小 $\text{MIN\_LEAF}$,在BMI轴上搜索切点集合 $\mathcal{C}=\{c_1<\cdots<c_{K-1}\}$,最小化组内平方误差和:

$$
\min_{\mathcal{C}}\; \sum_{g=1}^K \sum_{i\in \mathcal{I}_g(\mathcal{C})}\big(y_i-\bar{y}_g\big)^2,
\qquad \bar{y}_g=\frac{1}{|\mathcal{I}_g|}\sum_{i\in \mathcal{I}_g} y_i.
$$

使用贪心策略:从全集出发,枚举可行切点,以降幅

$$
\Delta \text{SSE}=\text{SSE}_{\text{parent}}-\text{SSE}_{\text{left}}-\text{SSE}_{\text{right}}
$$

最大者为优,直至达到 $K$ 组、无显著增益($\Delta \text{SSE}<\text{MIN\_GAIN}$)或触达样本量约束;若失败则回退等频分箱。本文以 $y_i=\pi_{25,i}$ 为监督目标,$K=3,\ \text{MIN\_LEAF}=30$。
KM 估计与多重合并
对第 $m$ 次插补样本 $\{(T_i^{(m)},\delta_i^{(m)})\}$,记事件时点序列为 $\{t_j\}$,其KM估计为

$$
\widehat{S}^{(m)}(t)=\prod_{t_j\le t}\left(1-\frac{d_j^{(m)}}{n_j^{(m)}}\right),
$$

其中 $d_j^{(m)}$ 为 $t_j$ 处的事件数,$n_j^{(m)}$ 为风险集大小。多重插补的合并采用逐点中位:

$$
\widehat{S}(t)=\operatorname{median}\big\{\widehat{S}^{(1)}(t),\ldots,\widehat{S}^{(M)}(t)\big\},
$$

并记录25–75%分位带作为不确定性区间。
分位数与推荐时点
KM分位数取

$$
\hat{t}_\alpha^{\text{KM}}=\inf\{t:\widehat{S}(t)\le \alpha\},
$$

AFT分位数取个体分位的组内中位。对第 $k$ 组,推荐时点为

$$
y_k=
\begin{cases}
\text{round}\Big(\operatorname{median}\{\hat{t}_{0.05}^{\text{KM},(m)}\}_{m=1}^M/0.5\Big)\times 0.5, & \text{KM 可用};\\
\text{round}\Big(\operatorname{median}\{\hat{t}_{0.05}^{\text{AFT}}\}/0.5\Big)\times 0.5, & \text{否则},
\end{cases}
$$

其中round表示四舍五入到最接近的0.5周。该规则保证 $P(T>y_k)\approx 0.05$ 的风险控制目标。
模型求解
本节给出似然构造、参数估计、分布选择、监督分箱、MI+KM 合并、KM–AFT 对齐度与推荐生成的完整实现细节与数值结果。
区间删失 AFT 的极大似然与分布选择
以log-normal / Weibull / log-logistic为候选,采用R语言中的survreg函数做区间删失生存回归,在 $\text{Surv}(L_i, R_i, \text{type="interval2"})$ 上求解MLE,并自动剔除全NA协变量或仅单水平的分类变量。
AIC判别:结果为对数正态 AIC=337.4786、Weibull AIC=337.6242,二者接近但以log-normal为优(亦便于与scipy的参数化对齐),并由 $\hat{\mu}_i=\mathbf{x}_i^\top\hat{\boldsymbol{\beta}}$ 与 $\hat{\sigma}$ 导出个体 $t_{90}, t_{95}$ 与 $\pi_{25}=F_i(25)$。
固定时点达标概率:
在log-normal情形,

$$
\pi_{25,i}=P(T_i\le 25)=\Phi\big((\log 25-\hat{\mu}_i)/\hat{\sigma}\big),
$$

它是监督分箱的首选目标(USE_METRIC="pi_25"),能更直接反映“到25周是否已达标”的风险梯度。
监督式分箱的求解与性质
切点搜索:对BMI排序后仅在相邻样本中点处枚举切点,一次 $O(n)$ 扫描即可得到每个候选切点的SSE降幅,整体复杂度为 $O(nK)$。要求每一侧叶子样本数 $\ge \text{MIN\_LEAF}$;若本轮最佳 $\Delta\text{SSE}<\text{MIN\_GAIN}$ 则提前停止并回退等频分箱。最终得到分组结果,如下表所示。
| 组别编号 | BMI范围 |
| --- | --- |
| 0 | [20.70, 31.73] |
| 1 | [31.75, 35.63] |
| 2 | [35.67, 46.88] |
MI 条件插补与 KM 合并的实现
对每个左/区间删失样本,按个体 $F_i$ 进行条件抽样(AFT条件MI);若缺少参数则回退为区间均匀插补;重复 $M=200$ 次。
KM:每次插补对各BMI组分别拟合KM,给出中位曲线与25–75%分位带;再在统一网格 $t\in[0,26]$ 上取逐点中位合并,得到组层的 $\widehat{S}(t)$。
高分位可达性:若删失结构导致KM的 $t_{95}$ 不可达或不稳定,则回退至AFT个体 $t_{95}$ 的组内中位数,保证推荐的稳定输出。
KM分位时间估计结果:

| 组别 | $t_{95}$ 中位 | $t_{95}$ 下四分位 | $t_{95}$ 上四分位 | $t_{90}$ 中位 | $t_{90}$ 下四分位 | $t_{90}$ 上四分位 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 17.520 | 17.049 | 18.218 | 14.456 | 14.114 | 14.836 |
| 1 | / | / | / | 17.675 | 17.177 | 18.145 |
| 2 | / | / | / | / | / | / |

| 组别编号 | BMI值范围 | 推荐最佳检测点 |
| --- | --- | --- |
| 0 | 31.7及以下 | 17.5周 |
| 1 | [31.7, 35.7) | 20.5周 |
| 2 | 35.7及以上 | 23.5周 |
模型评估
目标单调性与对数秩检验
分组配对的对数秩检验(合并p值)

| 配对组 | 卡方值 | 自由度 | 合并p值 | 统计量均值 | 样本数 |
| --- | --- | --- | --- | --- | --- |
| 0 vs 1 | 585.307 | 400 | $3.948\times 10^{-9}$ | 1.511 | 200 |
| 1 vs 2 | 824.746 | 400 | $1.096\times 10^{-31}$ | 2.369 | 200 |
目标的单调性强(Spearman$(\text{BMI},\ \text{pred\_t95})=0.953$);外部生存差异(相邻组)的合并p值极显著(0 vs 1:$3.95\times 10^{-9}$;1 vs 2:$1.10\times 10^{-31}$),证明分箱有效区分了风险层级。
KM–AFT 对齐度的定义与计算
在诊断窗 $[8,24]$ 周内,令 $\widetilde{S}^{\text{KM}}(t)$ 为MI中位KM曲线,$\widetilde{S}^{\text{AFT}}(t)$ 为AFT精确中位曲线。定义

$$
\text{align\_L1}_{8\text{--}24}
=\frac{1}{|G|}\sum_{t\in G}\big|\widetilde{S}^{\text{KM}}(t)-\widetilde{S}^{\text{AFT}}(t)\big|,\qquad
\text{align\_sup}_{8\text{--}24}
=\max_{t\in G}\big|\widetilde{S}^{\text{KM}}(t)-\widetilde{S}^{\text{AFT}}(t)\big|,
$$

其中 $G=\{8, 8.1, \ldots, 24\}$ 为步长 $0.1$ 的网格。数值上,L1越小代表整体越一致,sup越小代表最坏点差异越小。对齐度与左删失率、样本量共同反映模型—数据一致性。
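两项对齐度指标的计算非常直接,示意如下(S_km、S_aft 为可在网格点上取值的函数,例如由阶梯曲线插值得到;命名为示意):

```python
import numpy as np

def alignment(S_km, S_aft, t_lo: float = 8.0, t_hi: float = 24.0, step: float = 0.1):
    """返回诊断窗内的 (align_L1, align_sup)。"""
    grid = np.arange(t_lo, t_hi + 1e-9, step)
    diff = np.abs(np.array([S_km(t) for t in grid]) - np.array([S_aft(t) for t in grid]))
    return diff.mean(), diff.max()
```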
计算结果:
对齐度:
组0:L1 0.00617,sup 0.01898(优秀)
组1:L1 0.02591,sup 0.04716(良好)
组2:L1 0.06069,sup 0.09140(一般,由于样本小且左删失严重)
敏感性分析
为了验证模型结果的稳定性,项目进行了敏感性分析,通过向原始数据注入随机噪声并重复建模过程,观察关键结果(BMI切分点和推荐孕周)的变动情况。
噪声实验下的分布小提琴图
左图展示了在数据有噪声的情况下,两个BMI切分点的分布。可以看出,第一个切分点(粉色)非常稳定,集中在31-34之间;第二个切分点(绿色)波动稍大,但仍稳定在33-36之间。这证明了BMI分组方式是稳健的。
右图展示了各组推荐孕周的分布。可以看出,低BMI组的推荐时间非常稳定(小提琴很“瘦”),而高BMI组的推荐时间不确定性更大(小提琴更“胖”),这与高BMI组样本量少、删失率高的现实情况相符。尽管如此,各组的推荐时间核心区间清晰可辨,证明了最终推荐策略的整体可靠性。
此外,针对这三个不同的BMI分组,绘制出KM生存曲线,验证其敏感性。
从图中可以看出,在所有三个组中,橙色曲线(模糊定义)都位于蓝色曲线(精确定义)的下方,这意味着在模糊定义下,达标时间似乎发生得更早。然而,两条曲线的整体形状、趋势以及置信区间的大部分是重叠的,表明虽然定义不同会导致数值上的轻微差异,但并不会从根本上改变“BMI越高,达标时间越晚”这一核心结论。
因此,这张图证明了模型结果对于达标阈值的微小波动具有良好的稳健性。
问题四的模型的建立和求解
该题的核心任务是为女胎建立一个准确的异常判定方法。与男胎不同,女胎不携带Y染色体,因此无法使用Y染色体浓度作为直接的判断依据。因此,必须综合利用其他多种生物信息学指标,如各关键染色体(13, 18, 21, X)的Z值、GC含量、测序读段数以及孕妇的BMI等个人信息,来构建一个高精度的分类模型,以判断胎儿是否存在21、18或13号染色体的非整倍体异常。
模型建立
考虑到临床应用的严肃性,模型的评估不能仅仅依赖于传统的准确率。在产前检测中,假阴性,即未能检测出实际异常的胎儿,会带来严重的临床后果而错过干预窗口,其代价远高于假阳性,即错误地将正常胎儿标记为异常。因此,本文采纳了更符合临床需求的非对称代价函数作为模型优化的最终目标。
特征工程
在将数据输入模型之前,本文执行了一系列特征工程步骤,以增强原始数据的表达能力并满足模型要求。这些步骤包括:
(1)**缺失值处理:**部分样本的“孕妇BMI”特征存在缺失值。本文采用中位数插补的方法来填充这些缺失值,以保证数据的完整性。
(2)数据清洗 :对数据进行了清洗,并根据文献以及临床经验设定了一个质量控制标准。本文仅在X染色体Z值的绝对值小于2.5的“高置信度”样本上进行后续所有操作。这一步骤排除了14个信号可能不可靠的样本,旨在构建一个在常规情况下更稳定、更可靠的模型。
(3)**交互特征构建:**为了捕捉关键变量之间可能存在的非线性协同效应,本文构建了新的交互特征。具体而言,将各染色体的Z值与X染色体浓度相乘:
$$
z\_score\_chr\_ff = z\_score\_chr \times x\_concentration, \qquad chr \in \{13, 18, 21\},
$$
这些交互特征旨在放大在高胎儿浓度下Z值的信号。
(4)**特征离散化:**观察到X染色体的Z值的绝对值在特定区间有不同的临床意义,因此对其进行分箱处理,将其转化为一个分类特征:
1. 区间:$[0, 2.5),\ [2.5, 3),\ [3, +\infty)$
2. 标签:正常(ZX)、临界(ZX)、异常(ZX)
随后,这个新生成的分类特征通过独热编码转换为多列二进制特征,以便于线性模型和树模型进行学习。
(5)**特征缩放:**由于支持向量机(SVM)对特征的尺度非常敏感,在将其输入SVM模型之前,对所有数值型特征进行了标准化处理,将每个特征 $j$ 转换为:

$$
x'_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j},
$$

其中,$\mu_j$ 和 $\sigma_j$ 分别是训练集中特征 $j$ 的均值和标准差。此步骤确保所有特征具有零均值和单位方差,避免了某些特征因尺度问题在模型训练中占据主导地位。
(6)**最终特征列表:**经过上述所有处理步骤后,最终输入到模型中的完整特征列表是:年龄,孕周,BMI,比对率,重复率,唯一比对读段数,GC 含量,13 号染色体 Z 分数,18 号染色体Z值 ,21 号染色体Z值,X染色体Z值,X染色体浓度,21 号染色体Z值与胎儿 DNA 含量的交互特征,18 号染色体Z值与胎儿DNA含量的交互特征,13号染色体Z值与胎儿DNA含量的交互特征,X 染色体 Z 值分箱后为正常(ZX) 的独热编码特征,X 染色体 Z 值分箱后为临界(ZX) 的独热编码特征,X 染色体 Z值分箱后为异常(ZX) 的独热编码特征。
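上述特征工程流程中第(1)、(3)、(4)步(缺失值插补、交互特征、Z值分箱与独热编码)的一个简化示意如下(Python,列名为示意性假设,以附件实际字段为准;SVM输入的标准化见第(5)步):

```python
import numpy as np
import pandas as pd

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # (1) 缺失值处理:BMI用中位数插补
    out["孕妇BMI"] = out["孕妇BMI"].fillna(out["孕妇BMI"].median())
    # (3) 交互特征:各染色体Z值 × X染色体浓度
    for chr_ in ["13", "18", "21"]:
        out[f"z{chr_}_ff"] = out[f"{chr_}号染色体的Z值"] * out["X染色体浓度"]
    # (4) 特征离散化:|Z_X| 按 [0,2.5)/[2.5,3)/[3,+inf) 分箱后做独热编码
    zx_bin = pd.cut(out["X染色体的Z值"].abs(),
                    bins=[0, 2.5, 3, np.inf],
                    labels=["正常(ZX)", "临界(ZX)", "异常(ZX)"],
                    right=False)
    return pd.concat([out, pd.get_dummies(zx_bin, prefix="ZX分箱")], axis=1)
```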
支持向量机(SVM)
SVM的核心思想是在特征空间中寻找一个能将不同类别样本最大程度分开的最优超平面。对于给定的训练数据集 $D = \{(x_i, y_i)\}_{i=1}^N$,其中 $x_i \in \mathbb{R}^p$ 是 $p$ 维特征向量,$y_i \in \{-1, 1\}$ 是类别标签。软间隔SVM的原始优化问题可以表示为:

$$
\min_{w, b, \xi}\ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^N \xi_i,
$$

约束条件为:

$$
y_i(w^T x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad \forall i=1, \dots, N,
$$

其中,$w \in \mathbb{R}^p$ 是超平面的法向量,$b \in \mathbb{R}$ 是偏置项,$\|w\|^2$ 是正则化项,旨在最大化几何间隔,$C > 0$ 是一个正则化超参数,用于权衡间隔大小与误分类样本的容忍度,$\xi_i$ 是松弛变量,允许部分样本不满足间隔约束。
接着,为了处理非线性可分的数据,SVM使用**核技巧**将原始特征空间映射到一个更高维的希尔伯特空间 $\mathcal{H}$,并在这个高维空间中寻找线性超平面。这是通过一个非线性映射函数 $\phi: \mathbb{R}^p \to \mathcal{H}$ 实现的。用核函数 $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ 来替代在高维空间中的点积运算,从而避免了对映射 $\phi(x)$ 的显式计算。
在本项目中,本文选用了高斯核(RBF核),其定义如下:

$$
K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2),
$$

其中,$\gamma > 0$ 是一个超参数,它定义了单个训练样本对于分类决策的影响范围,$\gamma$ 越小,影响范围越大。
通过求解原始问题的对偶问题(Dual Problem),得到最终的决策函数:

$$
f(x) = \operatorname{sgn} \left( \sum_{i=1}^N \alpha_i y_i K(x_i, x) + b \right),
$$

其中,$\alpha_i$ 是拉格朗日乘子,只有支持向量(Support Vectors)对应的 $\alpha_i$ 才非零。
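与后文集成时使用的“经过概率校准的SVM”相对应,下面给出“RBF核SVM + 概率校准”的一个构建示意(scikit-learn;C、gamma取值仅为占位,实际取值由后文的联合寻优给出):

```python
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def build_calibrated_svm(C: float = 1.0, gamma: float = 0.01):
    """标准化 + RBF核SVM,再用sigmoid方法把间隔输出校准为概率,供加权集成使用。"""
    base = make_pipeline(StandardScaler(), SVC(C=C, gamma=gamma, kernel="rbf"))
    return CalibratedClassifierCV(base, method="sigmoid", cv=5)

# 用法示意:clf = build_calibrated_svm().fit(X_train, y_train)
#           p_svm = clf.predict_proba(X_valid)[:, 1]
```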
极端梯度提升(XGBoost)
XGBoost是一种基于梯度提升决策树算法的高效、可扩展的实现。其构建的是一个由 $K$ 棵决策树组成的加法模型。对于一个样本 $x_i$,其预测值 $\hat{y}_i$ 为:

$$
\hat{y}_i = \sum_{k=1}^K f_k(x_i), \quad f_k \in \mathcal{F},
$$

其中 $\mathcal{F}$ 是所有可能的决策树组成的函数空间。
模型通过最小化一个包含损失函数和正则化项的目标函数来进行训练:

$$
\text{Obj} = \sum_{i=1}^N l(y_i, \hat{y}_i) + \sum_{k=1}^K \Omega(f_k),
$$

其中,$l(y_i, \hat{y}_i)$ 是损失函数,对于二分类问题,通常是对数损失:

$$
l(y_i, \hat{y}_i) = -[y_i \log(p_i) + (1-y_i) \log(1-p_i)],
$$

其中 $p_i = \sigma(\hat{y}_i)$,$\sigma$ 是Sigmoid函数。$\Omega(f)$ 是正则化项,用于控制模型复杂度,防止过拟合:

$$
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^T w_j^2,
$$

其中 $T$ 是树的叶子节点数量,$w_j$ 是第 $j$ 个叶子节点的分数(权重),$\gamma$ 和 $\lambda$ 是正则化超参数。
由于模型是分步迭代训练的,在第 $t$ 轮,本文旨在找到一棵树 $f_t$ 来最小化目标:

$$
\text{Obj}^{(t)} = \sum_{i=1}^N l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t).
$$

通过对损失函数进行二阶泰勒展开,可以近似得到在第 $t$ 轮需要优化的目标:

$$
\text{Obj}^{(t)} \approx \sum_{i=1}^N \Big[l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2\Big] + \Omega(f_t),
$$

其中 $g_i$ 和 $h_i$ 分别是损失函数关于 $\hat{y}_i^{(t-1)}$ 的一阶和二阶梯度。
模型集成与代价优化
本文将两个基模型的概率输出进行线性加权平均,以得到最终的集成概率:

$$
P_{\text{ensemble}}(x) = w \cdot P_{\text{XGB}}(x) + (1-w) \cdot P_{\text{SVM}}(x),
$$

其中,$P_{\text{XGB}}(x)$ 和 $P_{\text{SVM}}(x)$ 分别是XGBoost和经过概率校准的SVM模型对于样本 $x$ 的预测概率,$w \in [0, 1]$ 是分配给XGBoost模型的权重。
基于集成概率,本文使用一个分类阈值 $\tau$ 来做出最终的二分类决策:

$$
\hat{y} = \begin{cases} 1, & \text{if } P_{\text{ensemble}}(x) > \tau, \\ 0, & \text{otherwise}. \end{cases}
$$

在本题中,所有超参数的优化目标是最小化一个自定义的临床代价函数,而非传统的准确率或AUC。该函数定义为:

$$
\text{Cost} = c_{FN} \cdot \text{FN} + c_{FP} \cdot \text{FP},
$$

其中 $\text{FN}$ 和 $\text{FP}$ 分别是假阴性和假阳性的样本数量,而 $c_{FN}$ 和 $c_{FP}$ 是对应的代价权重。在最终模型中,设 $c_{FN}=15$,$c_{FP}=1$。
因此,整个自动化机器学习过程的最终优化问题是:

$$
\min_{\mathbf{H}} \left( 15 \cdot \text{FN}(\mathbf{H}) + 1 \cdot \text{FP}(\mathbf{H}) \right),
$$

其中,超参数集合 $\mathbf{H}$ 包括XGBoost的所有相关参数、SVM的参数($C, \gamma$)、集成权重 $w$ 以及分类阈值 $\tau$。代价函数中的 $\text{FN}(\mathbf{H})$ 和 $\text{FP}(\mathbf{H})$ 是在5折交叉验证(5-fold Cross-Validation)过程中,使用超参数集 $\mathbf{H}$ 所得到的假阴性和假阳性总数。
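临床代价的计算以及“给定集成权重后在验证集上搜索分类阈值”的一个简化示意如下(实际流程中权重 w、阈值 τ 与基模型参数是如上所述在交叉验证内联合寻优的,这里仅演示代价函数与阈值搜索这一环节):

```python
import numpy as np

def clinical_cost(y_true: np.ndarray, y_pred: np.ndarray,
                  c_fn: int = 15, c_fp: int = 1) -> int:
    """非对称临床代价:Cost = c_FN * FN + c_FP * FP。"""
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    return c_fn * fn + c_fp * fp

def search_threshold(y_true, p_xgb, p_svm, w: float = 0.88):
    """给定集成权重 w,在集成概率上网格搜索使临床代价最小的阈值。"""
    p_ens = w * p_xgb + (1 - w) * p_svm
    grid = np.linspace(0.01, 0.99, 99)
    costs = [clinical_cost(y_true, (p_ens > t).astype(int)) for t in grid]
    best = int(np.argmin(costs))
    return grid[best], costs[best]
```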
模型的优化过程(即寻找最优超参数集合 $\mathbf{H}$)是在全体训练样本上进行的,其目标是最小化在所有样本上定义的临床代价函数。
然而,在最终生成报告以评估最优模型的性能时,引入了一个额外的质量控制步骤,以更贴近临床实际应用。对于X染色体Z值绝对值大于等于2.5的样本,将其视为“低置信度”或需要人工复核的样本,并将其从性能评估的数据集中排除。
因此,最终报告中的所有性能指标(如AUC、分类报告、混淆矩阵等)均在满足以下条件的样本子集上计算得出:

$$
\{ (x_i, y_i) \in D \mid |z\_score\_x(i)| < 2.5 \},
$$

其中,$z\_score\_x(i)$ 是样本 $x_i$ 的X染色体Z值。这一步骤旨在评估模型在排除极端异常或不可靠的X染色体信号后,在“高置信度”样本上的真实表现。
综上所述,本项目构建了一个复杂的集成学习系统。它不仅融合了两种强大的机器学习模型,更重要的是,它的整个超参数空间(包括基模型参数和集成参数)都是为了一个明确的、与业务紧密相关的临床代价函数而进行端到端优化的,从而确保模型在实际应用中能够取得最优的效用。
模型求解
最终目标函数公式为:

$$
\text{最小化临床代价(Cost)} = 15 \times \text{FN} + 1 \times \text{FP},
$$

这个目标函数明确指出,漏诊一个异常样本的代价是误诊一个正常样本的15倍。
优化过程与结果分析
**优化目标**:最小化临床代价 $15 \times \mathrm{FN} + 1 \times \mathrm{FP}$。
**得到最佳模型**:进行100次k折交叉验证($k=5$),且每次均保证训练集不会污染验证集,得到在交叉验证中临床代价最小的模型,即最佳模型。最佳模型在验证集上的临床代价为297。
**模型综合判别能力(AUC)**:达到最低代价的这次最佳试验,其对应的AUC分数为**0.8216**,这表明模型在不依赖特定阈值的情况下,具有良好的区分异常和正常样本的总体能力。
最佳超参数组合如下:

| 超参数 | 取值 |
| --- | --- |
| ensemble_w(XGBoost权重) | 0.8768 |
| svm_C | 1.0459 |
| svm_gamma | 0.0136 |
| threshold(分类阈值) | 0.1584 |
| xgb_colsample_bytree | 0.9384 |
| xgb_gamma | 0.2413 |
| xgb_learning_rate | 0.0104 |
| xgb_max_depth | 6 |
| xgb_n_estimators | 479 |
| xgb_subsample | 0.7500 |
混淆矩阵分析
以下是在高置信度样本子集上的最佳模型的具体表现:
| | 预测为正常 | 预测为异常 |
| --- | --- | --- |
| 实际为正常 | 355 | 57 |
| 实际为异常 | 16 | 36 |
依据该矩阵可知,成功检出的异常有36例;发生漏诊的有16例,这是最关键的指标:模型未能识别出这16例异常样本,根据代价函数,这部分产生了 $16\times15=240$ 的代价;发生误诊的有57例,即模型将57例正常样本标记为需要复核的“异常”,这部分产生了 $57\times1=57$ 的代价;成功排除的有355例,模型成功地将这355例正常样本判断为正常。
为了更深入地理解模型的性能,本文对分类报告的各项指标进行解读:
| 类别 | 精确率 | 召回率 | F1分数 | 样本数 |
| --- | --- | --- | --- | --- |
| 正常 | 0.96 | 0.86 | 0.91 | 412 |
| 异常 | 0.39 | 0.69 | 0.50 | 52 |
| 总体准确率 | | | 0.84 | 464 |
针对“异常”样本:
精确率= 0.39 :在所有被模型预测为“异常”的样本中,只有 39% 是真正的异常。这意味着有较多的假阳性,这也是为了降低更昂贵的假阴性所付出的代价。
召回率 = 0.69 :这是本案例的核心指标之一。它表示在所有真实为“异常”的样本中,模型成功“召回”或识别出了其中的 69%。根据代价函数的设计,模型牺牲了一部分精确率,以换取尽可能高的召回率,从而最大程度地避免漏诊。
针对“正常”样本:
精确率 = 0.96 :在所有被模型预测为“正常”的样本中,有 96% 是真正的正常。这是一个非常高的数值,说明模型给出的“正常”判断具有很高的可靠性。
召回率= 0.86 :在所有真实为“正常”的样本中,有 86% 被模型正确识别。另外的 14% 被错误地划分为“异常”(即假阳性)。
综上所述,模型的AUC值为0.8216 ,表明其在区分异常和正常样本方面具有良好的整体能力。混淆矩阵显示,模型在464个高置信度样本中成功识别了36个异常样本,同时将355个正常样本正确分类为正常。尽管存在57个假阳性,但这是为了最大限度地减少16个假阴性所做的权衡。本文成功构建并优化了一个专门针对女胎NIPT数据异常判定的高级集成模型。该模型通过在筛选后的高置信度数据集上,端到端地学习一个15:1的非对称临床代价函数,实现了在“漏诊”和“误诊”之间的高度定制化的平衡,最终达到了297.0的最低临床代价分数。模型的召回率(69%)显著高于精确率(39%),这与设定的“不惜一切代价避免漏诊”的优化目标一致。
模型各特征的特征重要性
模型的评价
模型的优点
使用生存模型自然地处理删失数据,能提供更符合临床需求的预测结果。
通过融合SVM和XGBoost两种强大的机器学习算法,并进行端到端的自动化超参数调优,模型能够捕捉复杂的非线性关系和特征交互,获得了很高的整体判别能力(AUC=0.8216)。
分箱不是基于先验知识(如传统的“偏瘦/正常/超重”),而是完全由数据驱动,以最大化风险区分度为目标。从结果看,监督式分箱后的KM曲线分离度远优于传统分箱,证明了其有效性。
模型的缺点
多元线性回归模型无法捕捉变量之间复杂的交互作用或非线性模式,因此其预测精度通常不如更复杂的机器学习模型。
最小二乘法容易受到极端异常值的影响,导致模型参数估计产生偏差。
附录
文件列表
| 文件名 | 说明 |
| --- | --- |
| p1_literature_based_analysis.py | 文献驱动的基础数据分析(问题一) |
| p1_relationship_analysis.py | 特征相关性分析(问题一) |
| p1_xgboost_analysis.py | XGBoost建模与预测(问题一) |
| p2_bmi_supervised_binning.py | BMI变量监督分箱处理(问题二) |
| p2_eda.py | 探索性数据分析(问题二) |
| p2_noise_grouped_sensitivity_analysis.py | 分组噪声敏感性分析(问题二) |
| p2_plot_sensitivity_trends.py | 敏感性趋势可视化(问题二) |
| p2_fuzzy_interval_modeling.py | 模糊区间建模(问题二) |
| p3_aft.R | 加速失效时间(AFT)模型建模(问题三) |
| p3_bmi_group_plots.py | BMI分组可视化绘图(问题三) |
| p3_bmi_supervised_binning.py | BMI监督分箱处理(问题三) |
| p3_fuzzy_interval_modeling.py | 模糊区间建模(问题三) |
| p3_noise_grouped_sensitivity_analysis.py | 分组噪声敏感性分析(问题三) |
| p4_automl_ensemble_tuning.py | AutoML集成建模与调参(问题四) |
| p4_shap_analysis.py | SHAP模型可解释性分析(问题四) |
代码
p1_literature_based_analysis.py
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 ''' 本脚本根据指定文献的方法,分析Y染色体浓度与孕妇关键特征(年龄、BMI、孕周)之间的关系。 分析流程包括: 1. 孕妇年龄与BMI的相关性分析。 2. Y染色体浓度与孕周的相关性分析。 3. 在不同孕周分组下,校正BMI后,分析Y染色体浓度与孕妇年龄的相关性。 4. 在不同孕周分组下,校正年龄后,分析Y染色体浓度与孕妇BMI的相关性。 新增功能:在进行相关性分析前,会进行正态性检验(Shapiro-Wilk test),并根据检验结果自动选择Pearson或Spearman相关性分析。 ''' import pandas as pdimport numpy as npfrom scipy.stats import pearsonr, shapiro, spearmanrimport redef get_correlation (series1, series2 ): """ 检验两组数据的正态性,并根据结果选择合适的相关性分析方法。 如果两组数据都服从正态分布,则使用Pearson相关系数。 否则,使用Spearman等级相关系数。 """ if len (series1) < 3 or len (series2) < 3 : return np.nan, np.nan, "数据不足 (样本量<3)" shapiro_stat1, shapiro_p1 = shapiro(series1) shapiro_stat2, shapiro_p2 = shapiro(series2) alpha = 0.05 if shapiro_p1 > alpha and shapiro_p2 > alpha: corr, p_value = pearsonr(series1, series2) method = "Pearson" else : corr, p_value = spearmanr(series1, series2) method = "Spearman" return corr, p_value, methodtry : df = pd.read_csv('../男胎检测数据_filtered.csv' , encoding='gbk' )except UnicodeDecodeError: df = pd.read_csv('../男胎检测数据_filtered.csv' , encoding='utf-8' ) df['孕周' ] = pd.to_numeric(df['检测孕天数' ], errors='coerce' ) // 7 relevant_cols = ['Y染色体浓度' , '孕周' , '孕妇BMI' , '年龄' ] analysis_df = df[relevant_cols].dropna()for col in ['Y染色体浓度' , '孕妇BMI' , '年龄' ]: analysis_df[col] = pd.to_numeric(analysis_df[col])print ("--- 数据加载和预处理完成 ---" )print (f"处理后总样本数: {len (analysis_df)} " )print ("转换后的孕周(周数)描述性统计:" )print (analysis_df[['孕周' ]].describe()) print ("-" * 50 + "\n" )print ("--- 2. 孕妇年龄与BMI相关性分析 ---" ) age_bmi_grouped = analysis_df.groupby('年龄' )['孕妇BMI' ].agg(['median' , 'count' ]) age_bmi_filtered = age_bmi_grouped[age_bmi_grouped['count' ] >= 5 ] ages = age_bmi_filtered.index mi_medians = age_bmi_filtered['median' ] corr, p_value, method = get_correlation(ages, mi_medians)print (f"孕妇年龄与BMI中值的相关性分析 (样本数>=5的组):" )print (f" - 使用方法: {method} " )print (f" - 相关系数: {corr:.4 f} " )print (f" - p-value: {p_value:.4 f} " )print ("-" * 50 + "\n" )print ("--- 3. Y染色体浓度与孕周相关性分析 ---" ) week_dna_grouped = analysis_df.groupby('孕周' )['Y染色体浓度' ].agg(['median' , 'count' ]) week_dna_filtered = week_dna_grouped[week_dna_grouped['count' ] >= 5 ] weeks = week_dna_filtered.index dna_medians_by_week = week_dna_filtered['median' ] corr, p_value, method = get_correlation(weeks, dna_medians_by_week)print (f"孕周与Y染色体浓度中值的相关性分析 (样本数>=5的组):" )print (f" - 使用方法: {method} " )print (f" - 相关系数: {corr:.4 f} " )print (f" - p-value: {p_value:.4 f} " )print ("-" * 50 + "\n" )print ("--- 4. 
Y染色体浓度与孕妇年龄相关性 (校正BMI) ---" ) analysis_df['cfEB' ] = (analysis_df['Y染色体浓度' ] / analysis_df['孕妇BMI' ]) * 1000 bins = [11 , 14 , 16 , 18 , 20 , 26 ] labels = ['12-14周' , '15-16周' , '17-18周' , '19-20周' , '21-26周' ] analysis_df['孕周分组' ] = pd.cut(analysis_df['孕周' ], bins=bins, right=True , labels=labels)print ("按孕周分组,分析孕妇年龄与cfEB的相关性:" )for group_name, group_df in analysis_df.groupby('孕周分组' ): age_cfeb_grouped = group_df.groupby('年龄' )['cfEB' ].agg(['median' , 'count' ]) age_cfeb_filtered = age_cfeb_grouped[age_cfeb_grouped['count' ] >= 5 ] print (f"\n孕周组: {group_name} " ) if len (age_cfeb_filtered) < 2 : print (" - 数据不足,无法进行相关性分析。" ) continue ages_in_group = age_cfeb_filtered.index cfeb_medians = age_cfeb_filtered['median' ] corr, p_value, method = get_correlation(ages_in_group, cfeb_medians) print (f" - 使用方法: {method} " ) if pd.isna(corr): continue print (f" - 相关系数: {corr:.4 f} " ) print (f" - p-value: {p_value:.4 f} " )print ("-" * 50 + "\n" )print ("--- 5. Y染色体浓度与孕妇BMI相关性 (校正年龄) ---" ) analysis_df['cfEA' ] = (analysis_df['Y染色体浓度' ] / analysis_df['年龄' ]) * 1000 print ("按孕周分组,分析孕妇BMI与cfEA的相关性:" )for group_name, group_df in analysis_df.groupby('孕周分组' ): bmi_cfea_grouped = group_df.groupby('孕妇BMI' )['cfEA' ].agg(['median' , 'count' ]) bmi_cfea_filtered = bmi_cfea_grouped[bmi_cfea_grouped['count' ] >= 5 ] print (f"\n孕周组: {group_name} " ) if len (bmi_cfea_filtered) < 2 : print (" - 数据不足,无法进行相关性分析。" ) continue bmis_in_group = bmi_cfea_filtered.index cfea_medians = bmi_cfea_filtered['median' ] corr, p_value, method = get_correlation(bmis_in_group, cfea_medians) print (f" - 使用方法: {method} " ) if pd.isna(corr): continue print (f" - 相关系数: {corr:.4 f} " ) print (f" - p-value: {p_value:.4 f} " )print ("-" * 50 + "\n" )
p1_relationship_analysis.py
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 import pandas as pdimport statsmodels.api as smimport reimport pingouin as pgfrom scipy.stats import shapiroimport numpy as nptry : df = pd.read_csv('../男胎检测数据_filtered.csv' , encoding='gbk' )except UnicodeDecodeError: df = pd.read_csv('../男胎检测数据_filtered.csv' , encoding='utf-8' )def clean_gestational_week (gw_str ): if isinstance (gw_str, str ): match = re.search(r'\d+' , gw_str) if match : return int (match .group(0 )) try : return int (gw_str) except (ValueError, TypeError): return None df['检测孕周_cleaned' ] = df['检测孕天数' ].apply(clean_gestational_week) relevant_cols = ['Y染色体浓度' , '检测孕周_cleaned' , '孕妇BMI' , '年龄' ] analysis_df = df[relevant_cols].dropna() analysis_df['Y染色体浓度' ] = pd.to_numeric(analysis_df['Y染色体浓度' ]) analysis_df['检测孕周_cleaned' ] = pd.to_numeric(analysis_df['检测孕周_cleaned' ]) analysis_df['孕妇BMI' ] = pd.to_numeric(analysis_df['孕妇BMI' ]) analysis_df['年龄' ] = pd.to_numeric(analysis_df['年龄' ])print ("--- 正在进行特征工程 ---" ) alpha = 0.05 print ("--- 正态性检验 (Shapiro-Wilk) ---" )print ("原假设 (H0): 数据服从正态分布" )print (f"显著性水平 (alpha) = {alpha} \n" ) all_normal = True for column in ['Y染色体浓度' , '检测孕周_cleaned' , '孕妇BMI' , '年龄' ]: stat, p_value = shapiro(analysis_df[column]) print (f"变量: {column} " ) print (f" - 检验统计量: {stat:.4 f} " ) print (f" - p-value: {p_value:.4 f} " ) if p_value > alpha: print (f" - 结论: p > {alpha} ,不能拒绝原假设,数据可视为服从正态分布。" ) else : all_normal = False print (f" - 结论: p <= {alpha} ,拒绝原假设,数据不服从正态分布。" ) print ("-" * 30 )print ("\n--- 偏相关系数分析 ---" )if not all_normal: print ("*** 警告: 由于部分或全部数据未通过正态性检验,将使用Spearman方法进行相关性分析。***" ) print ("--- Spearman 偏相关系数 (基于秩次) ---" ) partial_corr_df = analysis_df[['Y染色体浓度' , '检测孕周_cleaned' , '孕妇BMI' , '年龄' ]].rank().pcorr()else : print ("--- Pearson 偏相关系数 ---" ) partial_corr_df = analysis_df[['Y染色体浓度' , '检测孕周_cleaned' , '孕妇BMI' , '年龄' ]].pcorr()print (partial_corr_df)print ("\n" ) Y = analysis_df['Y染色体浓度' ] X_engineered = analysis_df[['检测孕周_cleaned' , '孕妇BMI' , '年龄' ]] X_engineered = sm.add_constant(X_engineered) model_engineered = sm.OLS(Y, X_engineered).fit()print ("\n--- 改造后的线性回归模型摘要 ---" )print (model_engineered.summary())print ("\n" )print ("--- 结果解读 ---" ) r_squared = model_engineered.rsquared_adj f_pvalue = model_engineered.f_pvalue coefficients = model_engineered.params p_values = model_engineered.pvaluesprint (f"调整后的R平方 (Adj. R-squared): {r_squared:.4 f} " )print (f"F统计量 p值: {f_pvalue:.4 f} " )print ("\n系数及其p值:" )for var in coefficients.index: print (f" {var} : {coefficients[var]:.4 f} (p-value: {p_values[var]:.4 f} )" )print ("\n" )print (f"显著性水平 (alpha) = {alpha} " )if f_pvalue < alpha: print ("整体模型在统计上是显著的 (F检验 p-value < 0.05)。" )else : print ("整体模型在统计上不显著 (F检验 p-value >= 0.05)。" )print ("\n各系数显著性:" )for var in p_values.index: if var == 'const' : continue if p_values[var] < alpha: print (f"- 特征 '{var} ' 在统计上是显著的。" ) else : print (f"- 特征 '{var} ' 在统计上不显著。" )
p1_xgboost_analysis.py
import pandas as pd
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.inspection import PartialDependenceDisplay
import re

def main():
    print("--- 1. 开始加载和准备数据 ---")
    try:
        df = pd.read_csv('../男胎检测数据_filtered.csv', encoding='gbk')
    except UnicodeDecodeError:
        df = pd.read_csv('../男胎检测数据_filtered.csv', encoding='utf-8')

    def clean_gestational_week(gw_str):
        if isinstance(gw_str, str):
            match = re.search(r'\d+', gw_str)
            if match:
                return int(match.group(0))
        try:
            return int(gw_str)
        except (ValueError, TypeError):
            return None

    df['检测孕周_cleaned'] = df['检测孕天数'].apply(clean_gestational_week)
    relevant_cols = ['Y染色体浓度', '检测孕周_cleaned', '孕妇BMI', '年龄']
    analysis_df = df[relevant_cols].dropna()
    X = analysis_df[['检测孕周_cleaned', '孕妇BMI', '年龄']]
    Y = analysis_df['Y染色体浓度']
    print("数据加载和准备完成。")

    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False

    xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)
    param_grid = {
        'n_estimators': [100, 150, 200],
        'max_depth': [3, 4, 5],
        'learning_rate': [0.05, 0.1],
        'subsample': [0.8, 0.9],
        'colsample_bytree': [0.8, 0.9]
    }

    print("\n--- 3. 使用GridSearchCV进行自动调参 ---")
    grid_search = GridSearchCV(
        estimator=xgb_model,
        param_grid=param_grid,
        cv=5,
        scoring='r2',
        verbose=1,
        n_jobs=-1
    )
    grid_search.fit(X, Y)
    print(f"\n找到的最佳参数: {grid_search.best_params_}")
    print(f"使用最佳参数在交叉检验中的最佳R²分数: {grid_search.best_score_:.4f}")

    xgb_model_tuned = grid_search.best_estimator_
    print("\n--- 4. 最佳模型已在全部数据上完成训练 ---")
    print("最终模型已准备好用于分析。")

    print("\n--- 5. 分析特征重要性 ---")
    importances = xgb_model_tuned.feature_importances_
    feature_names = X.columns
    importance_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': importances
    }).sort_values(by='Importance', ascending=False)
    print("各特征重要性排序:")
    print(importance_df)

    plt.figure(figsize=(10, 6))
    sns.barplot(x='Importance', y='Feature', data=importance_df)
    plt.title('特征重要性排序 (正则化XGBoost)')
    plt.show()

    print("\n--- 6. 生成所有特征的部分依赖图 ---")
    features_to_plot = [0, 1, 2]
    display = PartialDependenceDisplay.from_estimator(
        xgb_model_tuned,
        X,
        features_to_plot,
        feature_names=feature_names,
        n_jobs=3,
        grid_resolution=30,
    )
    display.figure_.suptitle('所有特征对Y染色体浓度的部分依赖性 (正则化XGBoost)', size=16)
    plt.subplots_adjust(top=0.9)
    plt.show()
    print("\n分析完成。")

if __name__ == '__main__':
    main()
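需要说明的是,上述脚本虽然导入了 cross_val_score,但并未实际调用。若希望在网格调参之外再检查调参后模型的稳定性,可按下面的片段补充一次 K 折交叉验证(假设性示意:函数名 report_cv_r2 为新增示例,xgb_model_tuned、X、Y 沿用上文 main() 中的变量,并非正文已有实现)。

import numpy as np
from sklearn.model_selection import cross_val_score

def report_cv_r2(model, X, Y, k=5):
    # 对给定模型做 k 折交叉验证,汇报 R² 的均值与标准差
    scores = cross_val_score(model, X, Y, cv=k, scoring='r2', n_jobs=-1)
    print(f"{k}折交叉验证 R²: mean={np.mean(scores):.4f}, std={np.std(scores):.4f}")
    return scores

# 用法示例(在 main() 中调参完成后调用):
# report_cv_r2(xgb_model_tuned, X, Y, k=5)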
p2_bmi_supervised_binning.py
916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 """ 生存导向监督分箱(BMI)— 支持“区间删失 + 多重插补(MI)”与自适应插补、log-rank 显著性 - 固定为4组 + 可复现 + 单调化 + 稳健切分 + 切点持久化 - 支持 time-dependent AUC(scikit-survival,不装则跳过) - 新增:对左删失占比高的 BMI 段,自适应采用“向右偏重的插补(trunc_exp)并把左界抬到 10 周” - 新增:相邻 BMI 组的 log-rank 检验(基于 MI 多次插补,Fisher 法合并 p 值),导出 CSV 运行: python p2_bmi_supervised_binning.py 输出目录: ./outputs_binning """ import osimport jsonimport randomimport warningsimport numpy as npimport pandas as pdimport matplotlib.pyplot as plt plt.rcParams['font.sans-serif' ] = ['SimHei' ] plt.rcParams['axes.unicode_minus' ] = False from lifelines import CoxPHFitter, KaplanMeierFitterfrom lifelines.utils import k_fold_cross_validationfrom lifelines.statistics import logrank_testfrom patsy import dmatrixfrom sklearn.tree import DecisionTreeRegressorfrom sklearn.isotonic import IsotonicRegressiontry : from sksurv.metrics import cumulative_dynamic_auc from sksurv.util import Surv as SKSurv HAS_SK_SURV = True except Exception: HAS_SK_SURV = False warnings.warn("未安装 scikit-survival,将跳过 time-dependent AUC。pip install scikit-survival" )try : from scipy.stats import chi2 HAS_SCIPY = True except Exception: HAS_SCIPY = False warnings.warn("未安装 SciPy,将无法用 Fisher 法合并 p 值。pip install scipy" ) SEED = 42 def set_global_seed (seed: int = 42 ): os.environ["PYTHONHASHSEED" ] = str (seed) random.seed(seed) np.random.seed(seed) set_global_seed(SEED) PATIENT_CSV = "./eda_outputs/patient_level_summary.csv" RAW_CSV = "../男胎检测数据_filtered.csv" OUTDIR = "./outputs_binning" TIME_LOWER, TIME_UPPER = 10.0 , 30.0 N_TIME_POINTS = 801 P_LIST = [0.90 , 0.95 ] MAIN_P = 0.95 MONOTONE_Q90 = True REQUIRED_BINS = 4 MIN_SAMPLES_PER_BIN = 40 MIN_BIN_WIDTH = 2.0 PLOT_MIN_SAMPLES = 20 PLOT_EVEN_IF_SMALL = True TREE_BASE_MAX_DEPTH = 5 TREE_BASE_MIN_SAMPLES_LEAF = 10 TREE_RANDOM_STATE = SEED SPLINE_DF = 4 SPLINE_DEGREE = 3 COX_PENALIZER = None COX_PENALIZER_GRID = [0.0 , 0.02 , 0.05 , 0.1 , 0.2 ] CV_FOLDS = 5 USE_INTERVAL_MI = True MI_M = 20 MI_SAMPLING = "uniform" CV_ONCE_FOR_MI = True DETECTION_LOWER_BOUND = 6.0 ADAPTIVE_ENABLE = True LEFT_CENSOR_RATE_THRESHOLD = 0.6 ADAPTIVE_METHOD_HIGH = "trunc_exp" ADAPTIVE_LB_HIGH = 10.0 ADAPTIVE_M = 20 EARLY_WINDOW = (10.0 , 14.0 ) COL_PATIENT = "孕妇代码" COL_GA_DAYS = "检测孕天数" COL_DATE = "检测日期" COL_Y_CONC = "Y染色体浓度" COL_BMI = "孕妇BMI" COL_HEIGHT = "身高" COL_WEIGHT = "体重" THRESH_Y_FRAC = 0.04 def to_num (s ): return pd.to_numeric(s, errors="coerce" )def derive_week (s_days ): return to_num(s_days) / 7.0 def compute_bmi (h_cm, w_kg ): h_m = to_num(h_cm) / 100.0 w = to_num(w_kg) return w / (h_m ** 2 )def parse_y_as_fraction (s ): if s.dtype == object : s_str = s.astype(str ).str .strip() if s_str.str .contains("%" ).any (): vals = pd.to_numeric(s_str.str .replace("%" ,"" ).str .replace("," ,"" ), errors="coerce" ) return vals / 100.0 y = pd.to_numeric(s, errors="coerce" ) finite = y[np.isfinite(y)] if finite.size >= 10 : q95 = np.nanpercentile(finite, 95 ) if 1.0 < q95 <= 100.0 : return y / 100.0 return ydef safe_min (a ): a = np.asarray(a) return float (np.nanmin(a)) if np.isfinite(a).any () else np.nandef safe_max (a ): a = np.asarray(a) return float (np.nanmax(a)) if 
np.isfinite(a).any () else np.nandef build_patient_interval_from_raw (raw_csv, detection_lower_bound=DETECTION_LOWER_BOUND ): df = pd.read_csv(raw_csv) if COL_GA_DAYS not in df.columns or COL_PATIENT not in df.columns or COL_Y_CONC not in df.columns: raise ValueError("原始CSV缺少必要列(孕妇代码/检测孕天数/Y染色体浓度)。" ) df["孕周" ] = derive_week(df[COL_GA_DAYS]) if COL_DATE in df.columns: df[COL_DATE] = pd.to_datetime(df[COL_DATE], errors="coerce" ) df["Y_frac" ] = parse_y_as_fraction(df[COL_Y_CONC]) df["有效测量" ] = ~df["Y_frac" ].isna() df["达标" ] = df["有效测量" ] & (df["Y_frac" ] >= THRESH_Y_FRAC) if COL_BMI in df.columns: df["BMI_num" ] = to_num(df[COL_BMI]) else : df["BMI_num" ] = np.nan if df["BMI_num" ].isna().any () and (COL_HEIGHT in df.columns) and (COL_WEIGHT in df.columns): calc = compute_bmi(df[COL_HEIGHT], df[COL_WEIGHT]) df.loc[df["BMI_num" ].isna(), "BMI_num" ] = calc[df["BMI_num" ].isna()] sort_cols = ["孕周" ] + ([COL_DATE] if COL_DATE in df.columns else []) df = df.sort_values(sort_cols) rows = [] for pid, g in df.groupby(COL_PATIENT, dropna=False ): bmi = g["BMI_num" ].dropna().iloc[0 ] if g["BMI_num" ].notna().any () else np.nan g_valid = g[g["有效测量" ]].copy() times = g_valid["孕周" ].values hits = g_valid["达标" ].values.astype(bool ) if len (times) == 0 : L, R, ctype = np.nan, np.nan, "missing" else : pos_idx = np.where(hits)[0 ] if len (pos_idx) > 0 : first_pos_i = int (pos_idx[0 ]) neg_before = np.where(~hits[:first_pos_i])[0 ] if len (neg_before) > 0 : L = float (times[neg_before[-1 ]]) R = float (times[first_pos_i]) if R < L: L, R = R, L ctype = "interval" else : L = float (detection_lower_bound) R = float (times[first_pos_i]) if R < L: L = max (R - 1e-3 , detection_lower_bound) ctype = "left" else : last_valid_week = float (times[-1 ]) L, R, ctype = last_valid_week, np.inf, "right" rows.append({ "patient_id" : pid, "BMI" : bmi, "L" : L, "R" : R, "ctype" : ctype }) df_int = pd.DataFrame(rows) return df_intdef _rand_trunc_expon (L, R, scale=2.0 ): if not np.isfinite(R): R = TIME_UPPER U = random.random() denom = 1.0 - np.exp(-(R - L) / scale) if denom <= 1e-12 : return float ((L + R) / 2.0 ) t = -scale * np.log(1 - U * denom) + L return float (min (max (t, L + 1e-6 ), R))def sample_time_from_interval (L, R, left_lower_bound=DETECTION_LOWER_BOUND, method="uniform" ): if not np.isfinite(R): return np.nan if not np.isfinite(L): L = float (left_lower_bound) L_eff = float (L); R_eff = float (R) if R_eff <= L_eff: return float (R_eff) if method == "uniform" : return float (np.random.uniform(L_eff, R_eff)) else : return _rand_trunc_expon(L_eff, R_eff, scale=2.0 )def multiple_imputations_from_intervals (df_int, M=MI_M, method=MI_SAMPLING, left_lower_bound=DETECTION_LOWER_BOUND ): dfs = [] base = df_int.copy() keep_mask = base["ctype" ].isin(["interval" , "left" , "right" ]) base = base[keep_mask].copy() for m in range (M): rows = [] for _, r in base.iterrows(): ctype = r["ctype" ]; BMI = r["BMI" ]; pid = r["patient_id" ]; L, R = r["L" ], r["R" ] if ctype in ["interval" , "left" ] and np.isfinite(R): t = sample_time_from_interval(L, R, left_lower_bound=left_lower_bound, method=method) rows.append({"patient_id" : pid, "BMI" : BMI, "time" : t, "event" : 1 }) elif ctype == "right" and np.isfinite(L): rows.append({"patient_id" : pid, "BMI" : BMI, "time" : float (L), "event" : 0 }) else : continue df_m = pd.DataFrame(rows) df_m = df_m[(df_m["time" ] >= 6 ) & (df_m["time" ] <= 40 )] df_m["event" ] = df_m["event" ].astype(int ) dfs.append(df_m.reset_index(drop=True )) return dfsdef bins_from_cuts (x, cuts ): if 
len (cuts) == 0 : bins = [-np.inf, np.inf] else : bins = [-np.inf] + cuts + [np.inf] labels = [] for i in range (len (bins)-1 ): a, b = bins[i], bins[i+1 ] labels.append(f"[{round (a,1 ) if np.isfinite(a) else '-inf' } , {round (b,1 ) if np.isfinite(b) else '+inf' } )" ) idx = np.digitize(x, bins[1 :-1 ], right=False ) return pd.Series([labels[i] for i in idx], index=x.index), bins, labelsdef parse_label_left (label: str ) -> float : left = label.split("," )[0 ][1 :].strip() return float (left.replace("-inf" , "-1e9" ))def compute_left_censor_rates (df_int, labels_series ): groups = [] for g, idxs in labels_series.groupby(labels_series).groups.items(): sub = df_int.loc[idxs] n = len (sub) if n == 0 : rate = np.nan else : rate = float ((sub["ctype" ] == "left" ).sum ()) / n groups.append({"group" : g, "n" : n, "left_censor_rate" : rate}) res = pd.DataFrame(groups).sort_values("group" , key=lambda s: s.map (parse_label_left)).reset_index(drop=True ) return resdef multiple_imputations_adaptive (df_int, labels_series, left_rate_df, threshold=LEFT_CENSOR_RATE_THRESHOLD, M=ADAPTIVE_M, method_default=MI_SAMPLING, method_high=ADAPTIVE_METHOD_HIGH, lb_default=DETECTION_LOWER_BOUND, lb_high=ADAPTIVE_LB_HIGH ): left_rate_map = dict (zip (left_rate_df["group" ], left_rate_df["left_censor_rate" ])) dfs = [] for m in range (M): rows = [] for idx, r in df_int.iterrows(): if r["ctype" ] not in ["interval" , "left" , "right" ]: continue group = labels_series.at[idx] if idx in labels_series.index else None rate = left_rate_map.get(group, 0.0 ) use_high = (rate is not None ) and (np.isfinite(rate)) and (rate >= threshold) method = method_high if use_high else method_default lb = lb_high if use_high else lb_default pid = r["patient_id" ]; BMI = r["BMI" ]; L, R, ct = r["L" ], r["R" ], r["ctype" ] if ct in ["interval" , "left" ] and np.isfinite(R): t = sample_time_from_interval(L, R, left_lower_bound=lb, method=method) rows.append({"patient_id" : pid, "BMI" : BMI, "time" : t, "event" : 1 }) elif ct == "right" and np.isfinite(L): rows.append({"patient_id" : pid, "BMI" : BMI, "time" : float (L), "event" : 0 }) else : continue df_m = pd.DataFrame(rows) df_m = df_m[(df_m["time" ] >= 6 ) & (df_m["time" ] <= 40 )] df_m["event" ] = df_m["event" ].astype(int ) dfs.append(df_m.reset_index(drop=True )) return dfsdef km_quantile_time (durations, events, target_S ): kmf = KaplanMeierFitter() kmf.fit(durations=durations, event_observed=events) sf = kmf.survival_function_.reset_index().rename(columns={"KM_estimate" :"S" ,"timeline" :"t" }) sf = sf.sort_values("t" ) hit = sf[sf["S" ] <= target_S] return float (hit["t" ].iloc[0 ]) if len (hit) else np.nandef per_group_km_quantiles (df, group_col, p_list ): rows = [] for g, sub in df.dropna(subset=["time" ,"BMI" ]).groupby(group_col): res = {"group" : str (g), "n" : len (sub)} if len (sub) >= 20 : for p in p_list: res[f"KM_t{int (p*100 )} " ] = km_quantile_time(sub["time" ].values, sub["event" ].values, target_S=1.0 - p) else : for p in p_list: res[f"KM_t{int (p*100 )} " ] = np.nan rows.append(res) return pd.DataFrame(rows)def group_km_recommendations_from_imputations (imputed_sets, cuts_final, p=MAIN_P ): label_vals = {} for df_m in imputed_sets: labels, _, _ = bins_from_cuts(df_m["BMI" ], cuts_final) rec = per_group_km_quantiles(df_m.assign(group=labels.values), "group" , [p]) col = f"KM_t{int (p*100 )} " for _, r in rec.iterrows(): label_vals.setdefault(r["group" ], []).append(r[col]) rows = [] for g, L in label_vals.items(): arr = np.array([v for v in L if np.isfinite(v)]) 
rows.append({"group" : g, f"KM_t{int (p*100 )} _MI_med" : (float (np.median(arr)) if len (arr) > 0 else np.nan)}) out = pd.DataFrame(rows) out = out.sort_values("group" , key=lambda s: s.map (parse_label_left)).reset_index(drop=True ) return outdef predict_tp_from_cox (cph, df_feat, p=0.90 , time_grid=None ): if time_grid is None : time_grid = np.linspace(TIME_LOWER, TIME_UPPER, N_TIME_POINTS) surv = cph.predict_survival_function(df_feat, times=time_grid) target_S = 1.0 - p t_list = [] for col in surv.columns: s = surv[col].values hit = np.where(s <= target_S)[0 ] t_list.append(np.nan if len (hit) == 0 else time_grid[hit[0 ]]) return pd.Series(t_list, index=df_feat.index)def make_bmi_spline (bmi_series, bmi_center, df=SPLINE_DF, degree=SPLINE_DEGREE ): Xs = dmatrix( f"bs(BMI_centered, df={df} , degree={degree} , include_intercept=False)" , {"BMI_centered" : (bmi_series - bmi_center).values}, return_type='dataframe' ) Xs.index = bmi_series.index return Xsdef extract_tree_cuts_1d (tree: DecisionTreeRegressor ): thr = tree.tree_.threshold cuts = sorted ([float (t) for t in thr if t != -2.0 ]) return cutsdef apply_min_width (cuts, min_width=2.0 ): if not cuts: return [] kept = [cuts[0 ]] for c in cuts[1 :]: if c - kept[-1 ] >= min_width: kept.append(c) return keptdef merge_small_bins (df_labels, label_col, min_samples=40 ): labels_order = sorted (df_labels[label_col].unique(), key=parse_label_left) counts = df_labels[label_col].value_counts().to_dict() def neighbors (idx ): left = labels_order[idx-1 ] if idx-1 >= 0 else None right = labels_order[idx+1 ] if idx+1 < len (labels_order) else None return left, right changed = True while changed: changed = False smalls = [lab for lab in labels_order if counts.get(lab, 0 ) < min_samples] for lab in smalls: if lab not in labels_order: continue idx = labels_order.index(lab) L, R = neighbors(idx) if L is None and R is None : continue cL = counts.get(L, -1 ) cR = counts.get(R, -1 ) target = L if (cL >= cR and L is not None ) else (R if R is not None else L) df_labels.loc[df_labels[label_col] == lab, label_col] = target counts[target] = counts.get(target, 0 ) + counts.get(lab, 0 ) counts[lab] = 0 labels_order.remove(lab) changed = True break return df_labels[label_col]def canonicalize_from_labels (x_series, labels_series ): uniq_labels = pd.unique(labels_series) lefts = sorted (set (parse_label_left(l) for l in uniq_labels)) cuts_final = [v for v in lefts if v > -1e8 ] labels_final, bins_edges, labels_text = bins_from_cuts(x_series, cuts_final) return cuts_final, labels_final, bins_edges, labels_textdef evaluate_bins_mae (bmi, y_true, labels ): df_tmp = pd.DataFrame({"BMI" : bmi, "y" : y_true, "label" : labels}) preds = df_tmp.groupby("label" )["y" ].median().to_dict() y_hat = df_tmp["label" ].map (preds).astype(float ) mae = np.nanmean(np.abs (df_tmp["y" ].values - y_hat.values)) sizes = df_tmp["label" ].value_counts().to_dict() return mae, sizesdef fit_cox_with_cv_penalizer (cox_df, penalizer_grid, duration_col="time" , event_col="event" , k=5 ): best_pen = None best_c = -np.inf results = [] for pen in penalizer_grid: cph_try = CoxPHFitter(penalizer=float (pen)) try : cph_try.fit(cox_df, duration_col=duration_col, event_col=event_col, show_progress=False ) scores = k_fold_cross_validation( cph_try, cox_df, duration_col=duration_col, event_col=event_col, k=k, scoring_method="concordance_index" ) mean_c = float (np.mean(scores)) except Exception: mean_c = np.nan results.append((float (pen), mean_c)) if np.isfinite(mean_c) and mean_c > best_c: best_c = 
mean_c; best_pen = float (pen) if best_pen is None : best_pen = 0.0 return best_pen, resultsdef time_dependent_auc_curve (df, lower=10.0 , upper=30.0 , n_points=40 , kfold=5 , penalizer=0.05 ): if not HAS_SK_SURV: return None , None , None df = df.dropna(subset=["BMI" ,"time" ,"event" ]).copy() df = df[(df["time" ] >= 6 ) & (df["time" ] <= 40 )] if len (df) < 50 : warnings.warn("样本过少,跳过 AUC 评估。" ) return None , None , None BMI_CENTER = df["BMI" ].mean() X = dmatrix("bs(BMI_centered, df=4, degree=3, include_intercept=False)" , {"BMI_centered" : (df["BMI" ] - BMI_CENTER).values}, return_type='dataframe' ) X.index = df.index y = df[["time" ,"event" ]].copy() idx = np.arange(len (df)); np.random.shuffle(idx) folds = np.array_split(idx, kfold) times_global = np.linspace(lower, upper, n_points) auc_matrix = []; used_fold = 0 for i in range (kfold): test_idx = folds[i]; train_idx = np.setdiff1d(idx, test_idx) if len (train_idx) < 20 or len (test_idx) < 20 : continue cph = CoxPHFitter(penalizer=penalizer) train_df = pd.concat([y.iloc[train_idx].reset_index(drop=True ), X.iloc[train_idx].reset_index(drop=True )], axis=1 ) try : cph.fit(train_df, duration_col="time" , event_col="event" , show_progress=False ) except Exception as e: warnings.warn(f"Cox 拟合失败 fold {i+1 } : {e} " ); continue risk_test = cph.predict_partial_hazard(X.iloc[test_idx]).values.ravel() y_train = SKSurv.from_arrays(event=y.iloc[train_idx]["event" ].astype(bool ).values, time=y.iloc[train_idx]["time" ].values) y_test = SKSurv.from_arrays(event=y.iloc[test_idx]["event" ].astype(bool ).values, time=y.iloc[test_idx]["time" ].values) tmin = float (np.min (y.iloc[test_idx]["time" ].values)) tmax = float (np.max (y.iloc[test_idx]["time" ].values)) mask = (times_global > tmin + 1e-8 ) & (times_global < tmax - 1e-8 ) if not np.any (mask): continue times_fold = times_global[mask] try : auc_t, _ = cumulative_dynamic_auc(y_train, y_test, risk_test, times_fold) except ValueError: eps = 1e-3 mask2 = (times_global > tmin + eps) & (times_global < tmax - eps) if not np.any (mask2): continue times_fold = times_global[mask2] auc_t, _ = cumulative_dynamic_auc(y_train, y_test, risk_test, times_fold) mask = mask2 fold_vec = np.full_like(times_global, np.nan, dtype=float ) fold_vec[mask] = auc_t auc_matrix.append(fold_vec); used_fold += 1 if used_fold == 0 : warnings.warn("无可用折进行 AUC 评估。" ); return None , None , None auc_matrix = np.vstack(auc_matrix) mean_auc = np.nanmean(auc_matrix, axis=0 ) std_auc = np.nanstd(auc_matrix, axis=0 ) plt.figure(figsize=(7.5 ,4.5 )) plt.plot(times_global, mean_auc, lw=2 , label="Mean AUC(t)" ) lo = np.where(np.isfinite(mean_auc - std_auc), mean_auc - std_auc, np.nan) hi = np.where(np.isfinite(mean_auc + std_auc), mean_auc + std_auc, np.nan) plt.fill_between(times_global, np.nan_to_num(lo, nan=0.5 ), np.nan_to_num(hi, nan=1.0 ), alpha=0.2 , label="±1 SD" ) plt.ylim(0.5 , 1.0 ); plt.xlabel("孕周(周)" ); plt.ylabel("AUC(t)" ) plt.title("Time-dependent AUC(Cox + BMI 样条,K折,自动裁剪时域)" ) plt.grid(alpha=0.3 ); plt.legend() plt.tight_layout(); plt.savefig(os.path.join(OUTDIR, "auc_curve.png" ), dpi=180 ); plt.close() auc_df = pd.DataFrame({"t" : times_global, "auc_mean" : mean_auc, "auc_sd" : std_auc}) auc_df.to_csv(os.path.join(OUTDIR, "auc_curve.csv" ), index=False , encoding="utf-8-sig" ) return times_global, mean_auc, std_aucdef main (): os.makedirs(OUTDIR, exist_ok=True ) meta_path_prev = os.path.join(OUTDIR, "bmi_supervised_bins_cuts.json" ) if os.path.exists(meta_path_prev): try : with open (meta_path_prev, "r" , 
encoding="utf-8" ) as f: prev = json.load(f) print (f"[INFO] 上一次运行 seed={prev.get('seed' )} , cuts_final={prev.get('chosen' ,{} ).get('cuts_final')}" ) except Exception as e: print ("[WARN] 读取上一次 JSON 失败:" , e) df_pred = None mode_used = "exact" df_int = None if USE_INTERVAL_MI and os.path.exists(RAW_CSV): print ("[INFO] 使用区间删失 + 多重插补(MI)路径..." ) mode_used = "interval_mi" df_int = build_patient_interval_from_raw(RAW_CSV, detection_lower_bound=DETECTION_LOWER_BOUND) out_int_csv = os.path.join(OUTDIR, "patient_level_intervals.csv" ) df_int.to_csv(out_int_csv, index=False , encoding="utf-8-sig" ) print (f"[INFO] 已保存区间数据: {out_int_csv} (n={len (df_int)} )" ) set_global_seed(SEED) imputed_sets = multiple_imputations_from_intervals( df_int, M=MI_M, method=MI_SAMPLING, left_lower_bound=DETECTION_LOWER_BOUND ) print (f"[INFO] 生成多重插补数据集 M={len (imputed_sets)} " ) agg_preds = None penalizer_used = None df_pred_first = None for m_idx, df_m in enumerate (imputed_sets): if len (df_m) == 0 : continue BMI_CENTER = df_m["BMI" ].mean() X_spline = make_bmi_spline(df_m["BMI" ], BMI_CENTER, df=SPLINE_DF, degree=SPLINE_DEGREE) cox_df = pd.concat([df_m[["time" ,"event" ]].reset_index(drop=True ), X_spline.reset_index(drop=True )], axis=1 ) if COX_PENALIZER is None : if (m_idx == 0 ) or (not CV_ONCE_FOR_MI): best_pen, pen_cv = fit_cox_with_cv_penalizer( cox_df, COX_PENALIZER_GRID, duration_col="time" , event_col="event" , k=CV_FOLDS ) penalizer_used = float (best_pen) print (f"[MI {m_idx+1 } ] Cox penalizer CV: {pen_cv} -> best={best_pen} " ) else : best_pen = penalizer_used else : best_pen = float (COX_PENALIZER) if m_idx == 0 : print (f"[MI {m_idx+1 } ] Cox penalizer fixed to: {best_pen} " ) cph = CoxPHFitter(penalizer=best_pen) cph.fit(cox_df, duration_col="time" , event_col="event" , show_progress=False ) time_grid = np.linspace(TIME_LOWER, TIME_UPPER, N_TIME_POINTS) pred_cols = {} for p in P_LIST: pred_cols[f"pred_t{int (p*100 )} " ] = predict_tp_from_cox(cph, X_spline, p=p, time_grid=time_grid) df_pred_m = pd.concat([df_m.reset_index(drop=True ), pd.DataFrame(pred_cols).reset_index(drop=True )], axis=1 ) df_pred_m["BMI" ] = df_m["BMI" ].values if agg_preds is None : agg_preds = df_pred_m[["patient_id" ,"BMI" ] + [f"pred_t{int (p*100 )} " for p in P_LIST]].copy() for p in P_LIST: agg_preds[f"pred_t{int (p*100 )} _list" ] = agg_preds[f"pred_t{int (p*100 )} " ].apply(lambda x: [x]) else : tmp = df_pred_m.set_index("patient_id" ) base = agg_preds.set_index("patient_id" ) for p in P_LIST: vals = tmp[f"pred_t{int (p*100 )} " ] lst = base[f"pred_t{int (p*100 )} _list" ] base[f"pred_t{int (p*100 )} _list" ] = [ (lst.iloc[i] + [vals.iloc[i]]) if (i < len (vals)) else lst.iloc[i] for i in range (len (lst)) ] agg_preds = base.reset_index() if m_idx == 0 : df_pred_first = df_pred_m.copy() try : scores = k_fold_cross_validation(cph, cox_df, duration_col="time" , event_col="event" , k=CV_FOLDS, scoring_method="concordance_index" ) print (f"[MI {m_idx+1 } ] {CV_FOLDS} -fold C-index: mean={np.mean(scores):.3 f} , std={np.std(scores):.3 f} " ) except Exception as e: print ("[MI] 交叉验证跳过:" , e) if agg_preds is None : raise RuntimeError("多重插补生成的预测为空,请检查数据。" ) for p in P_LIST: lists = agg_preds[f"pred_t{int (p*100 )} _list" ] agg_preds[f"pred_t{int (p*100 )} " ] = lists.apply( lambda L: np.nanmedian([v for v in L if np.isfinite(v)]) if isinstance (L, list ) and len (L)>0 else np.nan ) df_pred = pd.DataFrame({ "patient_id" : agg_preds["patient_id" ], "BMI" : agg_preds["BMI" ] }) for p in P_LIST: df_pred[f"pred_t{int (p*100 )} " 
] = agg_preds[f"pred_t{int (p*100 )} " ].values df_pred["time" ] = df_pred_first["time" ].values df_pred["event" ] = df_pred_first["event" ].values else : print ("[INFO] 使用原始(把首次达标视为精确事件)的路径..." ) mode_used = "exact" if os.path.exists(PATIENT_CSV): df_pat = pd.read_csv(PATIENT_CSV) if "patient_id" not in df_pat.columns: if "孕妇代码" in df_pat.columns: df_pat = df_pat.rename(columns={"孕妇代码" :"patient_id" }) else : df_pat["patient_id" ] = np.arange(len (df_pat)) if "BMI" not in df_pat.columns and "BMI_num" in df_pat.columns: df_pat = df_pat.rename(columns={"BMI_num" :"BMI" }) else : df_int_tmp = build_patient_interval_from_raw(RAW_CSV, detection_lower_bound=DETECTION_LOWER_BOUND) exact_rows = [] for _, r in df_int_tmp.iterrows(): if r["ctype" ] in ["interval" , "left" ] and np.isfinite(r["R" ]): exact_rows.append({"patient_id" : r["patient_id" ], "BMI" : r["BMI" ], "time" : r["R" ], "event" : 1 }) elif r["ctype" ] == "right" and np.isfinite(r["L" ]): exact_rows.append({"patient_id" : r["patient_id" ], "BMI" : r["BMI" ], "time" : r["L" ], "event" : 0 }) df_pat = pd.DataFrame(exact_rows) needed = ["patient_id" ,"BMI" ,"time" ,"event" ] for k in needed: if k not in df_pat.columns: raise ValueError(f"数据缺列:{k} " ) df = df_pat[needed].copy().dropna(subset=["BMI" ,"time" ,"event" ]) df = df[(df["time" ] >= 6 ) & (df["time" ] <= 40 )] df["event" ] = df["event" ].astype(int ) BMI_CENTER = df["BMI" ].mean() X_spline_train = make_bmi_spline(df["BMI" ], BMI_CENTER, df=SPLINE_DF, degree=SPLINE_DEGREE) cox_df = pd.concat([df[["time" ,"event" ]].reset_index(drop=True ), X_spline_train.reset_index(drop=True )], axis=1 ) if COX_PENALIZER is None : best_pen, pen_cv = fit_cox_with_cv_penalizer( cox_df, COX_PENALIZER_GRID, duration_col="time" , event_col="event" , k=CV_FOLDS ) penalizer = float (best_pen) print (f"Cox penalizer CV: {pen_cv} -> best={best_pen} " ) else : penalizer = float (COX_PENALIZER) print (f"Cox penalizer fixed to: {penalizer} " ) cph = CoxPHFitter(penalizer=penalizer) cph.fit(cox_df, duration_col="time" , event_col="event" , show_progress=False ) try : scores = k_fold_cross_validation(cph, cox_df, duration_col="time" , event_col="event" , k=CV_FOLDS, scoring_method="concordance_index" ) print (f"{CV_FOLDS} -fold C-index: mean={np.mean(scores):.3 f} , std={np.std(scores):.3 f} " ) except Exception: pass time_grid = np.linspace(TIME_LOWER, TIME_UPPER, N_TIME_POINTS) pred = {} for p in P_LIST: pred[f"pred_t{int (p*100 )} " ] = predict_tp_from_cox(cph, X_spline_train, p=p, time_grid=time_grid) df_pred = pd.concat([df.reset_index(drop=True ), pd.DataFrame(pred).reset_index(drop=True )], axis=1 ) main_col = f"pred_t{int (MAIN_P*100 )} " x_bmi = df_pred["BMI" ].values order = np.argsort(x_bmi) iso = IsotonicRegression(increasing=True , out_of_bounds="clip" ) y95_raw = df_pred[main_col].fillna(TIME_UPPER).values y95_sorted = iso.fit_transform(x_bmi[order], y95_raw[order]) y95_fit = np.empty_like(y95_sorted); y95_fit[order] = y95_sorted df_pred["t_main_mono" ] = y95_fit if MONOTONE_Q90 and "pred_t90" in df_pred.columns: y90_raw = df_pred["pred_t90" ].fillna(TIME_UPPER).values y90_sorted = iso.fit_transform(x_bmi[order], y90_raw[order]) y90_fit = np.empty_like(y90_sorted); y90_fit[order] = y90_sorted df_pred["t90_mono" ] = y90_fit else : df_pred["t90_mono" ] = df_pred.get("pred_t90" , pd.Series(np.nan, index=df_pred.index)) y_target = df_pred["t_main_mono" ].values X = df_pred[["BMI" ]].values base_tree = DecisionTreeRegressor( max_depth=TREE_BASE_MAX_DEPTH, min_samples_leaf=TREE_BASE_MIN_SAMPLES_LEAF, 
random_state=TREE_RANDOM_STATE ) base_tree.fit(X, y_target) path = base_tree.cost_complexity_pruning_path(X, y_target) ccp_alphas = np.unique(path.ccp_alphas) candidates = [] for alpha in ccp_alphas: tree = DecisionTreeRegressor( max_depth=TREE_BASE_MAX_DEPTH, min_samples_leaf=TREE_BASE_MIN_SAMPLES_LEAF, ccp_alpha=float (alpha), random_state=TREE_RANDOM_STATE ) tree.fit(X, y_target) cuts_raw = extract_tree_cuts_1d(tree) cuts = apply_min_width(cuts_raw, MIN_BIN_WIDTH) labels_series, _, _ = bins_from_cuts(df_pred["BMI" ], cuts) df_lab = pd.DataFrame({"label" : labels_series}) labels_merged = merge_small_bins(df_lab, "label" , MIN_SAMPLES_PER_BIN) cuts_final, labels_final, bins_edges_final, _ = canonicalize_from_labels(df_pred["BMI" ], labels_merged) mae, sizes = evaluate_bins_mae(df_pred["BMI" ].values, y_target, labels_final.values) n_bins_final = len (set (labels_final.values)) candidates.append({ "alpha" : float (alpha), "cuts_raw" : cuts_raw, "cuts_after_minwidth" : cuts, "cuts_final" : cuts_final, "bins_edges_final" : [None if not np.isfinite(b) else float (b) for b in bins_edges_final], "n_bins_final" : n_bins_final, "mae" : float (mae), "sizes" : {k:int (v) for k,v in sizes.items()}, "labels_series_final" : labels_final.copy() }) def dist_to_required (n ): return abs (n - REQUIRED_BINS) exact = [c for c in candidates if c["n_bins_final" ] == REQUIRED_BINS] if exact: chosen = sorted (exact, key=lambda c: (c["mae" ]))[0 ] else : chosen = sorted (candidates, key=lambda c: (dist_to_required(c["n_bins_final" ]), c["mae" ]))[0 ] df_pred["BMI_bin_supervised" ] = chosen["labels_series_final" ].values def enforce_bins_and_min_samples (): df_lab2 = pd.DataFrame({"label" : df_pred["BMI_bin_supervised" ]}) labels_merged2 = merge_small_bins(df_lab2, "label" , max (MIN_SAMPLES_PER_BIN, PLOT_MIN_SAMPLES)) cuts_final2, labels_final2, bins_edges_final2, _ = canonicalize_from_labels(df_pred["BMI" ], labels_merged2) n_bins2 = len (set (labels_final2.values)) return labels_final2, cuts_final2, bins_edges_final2, n_bins2 labels_final2, cuts_final2, bins_edges_final2, n_bins2 = enforce_bins_and_min_samples() if n_bins2 != REQUIRED_BINS: alt = None for cnd in sorted (candidates, key=lambda c: (dist_to_required(c["n_bins_final" ]), c["mae" ])): df_pred["BMI_bin_supervised" ] = cnd["labels_series_final" ].values labels_final2, cuts_final2, bins_edges_final2, n_bins2 = enforce_bins_and_min_samples() if n_bins2 == REQUIRED_BINS: alt = (cnd, labels_final2, cuts_final2, bins_edges_final2); break if alt is not None : chosen, labels_final2, cuts_final2, bins_edges_final2 = alt df_pred["BMI_bin_supervised" ] = labels_final2.values chosen["cuts_final" ] = cuts_final2 chosen["bins_edges_final" ] = [None if not np.isfinite(b) else float (b) for b in bins_edges_final2] n_bins_final = len (set (df_pred["BMI_bin_supervised" ])) print (f"[INFO] 最终组数: {n_bins_final} (目标 {REQUIRED_BINS} )" ) km_tab = per_group_km_quantiles(df_pred, "BMI_bin_supervised" , P_LIST) pred_tab = df_pred.groupby("BMI_bin_supervised" )[["pred_t90" ,"pred_t95" ,"t_main_mono" ,"t90_mono" ]].median().reset_index() \ .rename(columns={"pred_t90" :"Cox_pred_t90_median" , "pred_t95" :"Cox_pred_t95_median" , "t_main_mono" :"Cox_t_main_mono_median" , "t90_mono" :"Cox_t90_mono_median" }) summary = km_tab.merge(pred_tab, left_on="group" , right_on="BMI_bin_supervised" , how="left" ).drop(columns=["BMI_bin_supervised" ]) summary = summary.sort_values("group" , key=lambda s: s.map (parse_label_left)).reset_index(drop=True ) rec = summary["Cox_t_main_mono_median" 
].apply(lambda x: np.nan if pd.isna(x) else round (x*2 )/2 ).values for i in range (1 , len (rec)): if not np.isnan(rec[i-1 ]) and (np.isnan(rec[i]) or rec[i] < rec[i-1 ]): rec[i] = rec[i-1 ] summary["recommended_week" ] = rec adaptive_outputs = {} if ADAPTIVE_ENABLE and (df_int is not None ): labels_int, _, _ = bins_from_cuts(df_int["BMI" ], chosen["cuts_final" ]) left_rates = compute_left_censor_rates(df_int, labels_int) left_rates.to_csv(os.path.join(OUTDIR, "left_censor_rates_by_group.csv" ), index=False , encoding="utf-8-sig" ) print ("\n[INFO] 左删失比例(按最终 BMI 组):" ) print (left_rates.to_string(index=False )) set_global_seed(SEED) imps_adapt = multiple_imputations_adaptive( df_int, labels_int, left_rates, threshold=LEFT_CENSOR_RATE_THRESHOLD, M=ADAPTIVE_M, method_default=MI_SAMPLING, method_high=ADAPTIVE_METHOD_HIGH, lb_default=DETECTION_LOWER_BOUND, lb_high=ADAPTIVE_LB_HIGH ) rec_adapt = group_km_recommendations_from_imputations(imps_adapt, chosen["cuts_final" ], p=MAIN_P) rec_vals = rec_adapt[f"KM_t{int (MAIN_P*100 )} _MI_med" ].values.copy() for i in range (1 , len (rec_vals)): if np.isfinite(rec_vals[i-1 ]) and (np.isnan(rec_vals[i]) or rec_vals[i] < rec_vals[i-1 ]): rec_vals[i] = rec_vals[i-1 ] rec_vals = np.array([np.nan if np.isnan(v) else round (v*2 )/2 for v in rec_vals]) rec_adapt["recommended_week_adaptive" ] = rec_vals rec_adapt.to_csv(os.path.join(OUTDIR, "recommendations_adaptive.csv" ), index=False , encoding="utf-8-sig" ) print ("\n[INFO] 自适应插补后的组内 KM Q95(插补中位数)与推荐:" ) print (rec_adapt.to_string(index=False )) adaptive_outputs["left_rates" ] = left_rates adaptive_outputs["rec_adapt" ] = rec_adapt ordered_groups = sorted (rec_adapt["group" ].tolist(), key=parse_label_left) pairs = [(ordered_groups[i], ordered_groups[i+1 ]) for i in range (len (ordered_groups)-1 )] rows_lr = [] for gL, gR in pairs: p_list = [] for df_m in imps_adapt: labels_m, _, _ = bins_from_cuts(df_m["BMI" ], chosen["cuts_final" ]) subL = df_m[labels_m.values == gL] subR = df_m[labels_m.values == gR] if len (subL) < 5 or len (subR) < 5 : continue try : res = logrank_test(subL["time" ], subR["time" ], event_observed_A=subL["event" ], event_observed_B=subR["event" ]) p_list.append(float (res.p_value)) except Exception: continue if len (p_list) == 0 : p_fisher = np.nan; p_median = np.nan else : p_median = float (np.median(p_list)) if HAS_SCIPY: stat = -2.0 * np.sum (np.log(np.clip(p_list, 1e-300 , 1.0 ))) df_chi = 2 * len (p_list) p_fisher = float (1.0 - chi2.cdf(stat, df_chi)) else : p_fisher = np.nan rows_lr.append({"group_left" : gL, "group_right" : gR, "n_imputations" : len (p_list), "p_median" : p_median, "p_fisher" : p_fisher}) df_lr = pd.DataFrame(rows_lr) df_lr.to_csv(os.path.join(OUTDIR, "logrank_adjacent_adaptive.csv" ), index=False , encoding="utf-8-sig" ) print ("\n[INFO] 相邻 BMI 组 log-rank(自适应 MI,Fisher 合并 p):" ) print (df_lr.to_string(index=False )) out_csv = os.path.join(OUTDIR, "bmi_supervised_bins_summary.csv" ) summary.to_csv(out_csv, index=False , encoding="utf-8-sig" ) print (f"\n已保存分箱汇总:{out_csv} " ) print (summary.to_string(index=False )) bmi_min = float (df_pred["BMI" ].min ()); bmi_max = float (df_pred["BMI" ].max ()) meta = { "seed" : SEED, "mode" : mode_used, "time_grid" : {"lower" : TIME_LOWER, "upper" : TIME_UPPER, "n_points" : N_TIME_POINTS}, "main_quantile" : MAIN_P, "monotone_q90" : MONOTONE_Q90, "required_bins" : REQUIRED_BINS, "min_samples_per_bin" : MIN_SAMPLES_PER_BIN, "min_bin_width" : MIN_BIN_WIDTH, "tree_base" : {"max_depth" : TREE_BASE_MAX_DEPTH, "min_samples_leaf" : 
TREE_BASE_MIN_SAMPLES_LEAF, "random_state" : TREE_RANDOM_STATE}, "interval_mi" : { "use_interval_mi" : USE_INTERVAL_MI, "mi_M" : MI_M, "mi_sampling" : MI_SAMPLING, "cv_once_for_mi" : CV_ONCE_FOR_MI, "detection_lower_bound" : DETECTION_LOWER_BOUND, "adaptive" : { "enabled" : ADAPTIVE_ENABLE, "left_censor_rate_threshold" : LEFT_CENSOR_RATE_THRESHOLD, "method_high" : ADAPTIVE_METHOD_HIGH, "lb_high" : ADAPTIVE_LB_HIGH, "adaptive_M" : ADAPTIVE_M } }, "cox" : {"spline_df" : SPLINE_DF, "spline_degree" : SPLINE_DEGREE}, "data" : {"bmi_min" : bmi_min, "bmi_max" : bmi_max, "n" : int (len (df_pred))}, "chosen" : { "cuts_final" : [round (x, 6 ) for x in chosen.get("cuts_final" , [])], "bin_edges_final" : chosen.get("bins_edges_final" ), "groups" : summary["group" ].tolist(), "group_sizes" : summary["n" ].fillna(0 ).astype(int ).tolist(), "recommended_week" : summary["recommended_week" ].tolist() } } if ADAPTIVE_ENABLE and (df_int is not None ) and ("left_rates" in adaptive_outputs): meta["adaptive_results" ] = { "left_censor_rates" : adaptive_outputs["left_rates" ].to_dict(orient="list" ), "recommendations_adaptive" : adaptive_outputs["rec_adapt" ].to_dict(orient="list" ) } with open (os.path.join(OUTDIR, "bmi_supervised_bins_cuts.json" ), "w" , encoding="utf-8" ) as f: json.dump(meta, f, ensure_ascii=False , indent=2 ) print (f"已保存切点与参数:{os.path.join(OUTDIR, 'bmi_supervised_bins_cuts.json' )} " ) plt.figure(figsize=(9 ,5.5 )) order_idx = np.argsort(df_pred["BMI" ].values) bmi_sorted = df_pred["BMI" ].values[order_idx] q90_sorted = df_pred[("t90_mono" if MONOTONE_Q90 else "pred_t90" )].values[order_idx] q95_sorted = df_pred[main_col].fillna(TIME_UPPER).values[order_idx] iso_g = IsotonicRegression(increasing=True , out_of_bounds='clip' ) q95_mono_sorted = iso_g.fit_transform(bmi_sorted, q95_sorted) for c in chosen["cuts_final" ]: plt.axvline(c, color="red" , ls="--" , alpha=0.6 ) plt.plot(bmi_sorted, q90_sorted, lw=2 , label=f"{'Q90 单调化' if MONOTONE_Q90 else 'Q90 原始' } (BMI) - Cox" ) plt.plot(bmi_sorted, q95_mono_sorted, lw=2 , label=f"Q{int (MAIN_P*100 )} (BMI) - Cox 单调化" ) plt.xlabel("BMI" ); plt.ylabel("预测达标周数" ) plt.title("预测分位时间 vs BMI(Cox/MI)与监督分箱切点" ) plt.legend(); plt.grid(alpha=0.3 ); plt.tight_layout() plt.savefig(os.path.join(OUTDIR, "qcurves_with_cuts.png" ), dpi=200 ) plt.close() plt.figure(figsize=(8 ,4 )) plt.hist(df_pred["BMI" ], bins=20 , color="#8fbcd4" , alpha=0.85 , edgecolor="#333" ) for c in chosen["cuts_final" ]: plt.axvline(c, color="red" , ls="--" , alpha=0.7 ) plt.xlabel("BMI" ); plt.ylabel("频数" ); plt.title("BMI 分布与切点" ) plt.tight_layout(); plt.savefig(os.path.join(OUTDIR, "bmi_hist_with_cuts.png" ), dpi=200 ); plt.close() for p_key, arr in [("Q90" , df_pred.get("pred_t90" , pd.Series(np.nan)).values), (f"Q{int (MAIN_P*100 )} " , df_pred[main_col].values)]: vals = arr.copy() n = np.isfinite(vals).sum () if n == 0 : print (f"{p_key} 边界夹住比例: 数据全 NaN" ) continue at_low = np.nanmean(vals <= TIME_LOWER + 1e-8 ) at_up = np.nanmean(vals >= TIME_UPPER - 1e-8 ) print (f"{p_key} 边界夹住比例: at_lower={at_low:.3 f} , at_upper={at_up:.3 f} (n={n} )" ) kmf = KaplanMeierFitter() plt.figure(figsize=(9.5 ,6.2 )) for label, sub in df_pred.groupby("BMI_bin_supervised" ): sub = sub.dropna(subset=["time" ]) if len (sub) == 0 : continue lab_txt = f"{label} (n={len (sub)} )" kmf.fit(durations=sub["time" ].values, event_observed=sub["event" ].values, label=lab_txt) if len (sub) < PLOT_MIN_SAMPLES and PLOT_EVEN_IF_SMALL: kmf.plot(ci_show=False , lw=1.8 , ls="--" , alpha=0.95 ) else : 
kmf.plot(ci_show=True , ci_alpha=0.15 , lw=2.2 ) plt.xlabel("孕周(周)" ); plt.ylabel("未达标概率 S(t)" ) plt.title("Kaplan–Meier(监督分箱后的 BMI 组) — 基于MI近似" ) plt.grid(alpha=0.2 ); plt.tight_layout() plt.savefig(os.path.join(OUTDIR, "km_by_supervised_bins.png" ), dpi=200 ) plt.close() try : df_learn_for_auc = df_pred[["BMI" ,"time" ,"event" ]].copy() times, mean_auc, std_auc = time_dependent_auc_curve(df_learn_for_auc, lower=10.0 , upper=30.0 , n_points=40 , kfold=5 , penalizer=0.05 ) if times is not None : print (f"[INFO] 已生成 AUC 曲线(点数={len (times)} ),结果输出至 {OUTDIR} /auc_curve.png 与 auc_curve.csv" ) else : print ("[INFO] 跳过 AUC 曲线(未安装 scikit-survival 或样本不足)" ) except Exception as e: warnings.warn(f"AUC 评估阶段出现异常(不影响主流程):{e} " ) print ("\n推荐结果(用于报告):" ) cols = ["group" ,"n" ] + [f"KM_t{int (p*100 )} " for p in P_LIST] + \ ["Cox_pred_t90_median" ,"Cox_pred_t95_median" ,"Cox_t_main_mono_median" ,"Cox_t90_mono_median" ,"recommended_week" ] print (summary[cols].to_string(index=False ))if __name__ == "__main__" : main()
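上述脚本较长,这里用一个可独立运行的最小示意概括其核心流程(假设性示例:数据由随机数生成,样本量与区间宽度等参数均为示意,并简化为所有个体最终都达标、不含右删失):先把每位孕妇整理为区间删失记录 (L, R],在区间内均匀抽样完成多重插补,再对每个插补数据集拟合 Kaplan–Meier 曲线并读取 S(t) ≤ 0.05 对应的 t95,最后取各次插补结果的中位数作为组内推荐时点的近似。

import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(42)
n, M = 300, 20
true_t = rng.normal(15, 3, n).clip(10, 30)           # 模拟真实首次达标孕周
L = np.maximum(10.0, true_t - rng.uniform(1, 3, n))  # 区间左端:最后一次未达标孕周
R = true_t + rng.uniform(0, 2, n)                    # 区间右端:首次观测到达标的孕周

t95_list = []
for _ in range(M):
    t_imp = rng.uniform(L, R)                        # 每次插补:在区间内均匀抽一个达标时间
    kmf = KaplanMeierFitter().fit(t_imp, event_observed=np.ones(n))
    sf = kmf.survival_function_
    hit = sf[sf.iloc[:, 0] <= 0.05]                  # S(t) ≤ 0.05 即 95% 个体已达标
    if len(hit):
        t95_list.append(float(hit.index[0]))

print(f"M={M} 次插补得到的 t95 中位数: {np.median(t95_list):.2f} 周")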
p2_eda.py
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 """ NIPT Step1: 数据准备与 EDA(修正:Y 浓度按“比例”处理,阈值=0.04) - 直接读取 ../男胎检测数据_filtered.csv - 统一把“Y染色体浓度”解析为比例(0–1),若似乎是百分数(0–100),自动 /100 - 达标判定阈值使用 0.04(即 4%) - 产出: - outputs/patient_level_summary.csv - outputs/km_by_bmi.png - outputs/longitudinal_examples.png - outputs/scatter_lowess_by_bmi.png """ import osimport warningsfrom typing import Optional , Tuple import numpy as npimport pandas as pdimport matplotlib.pyplot as pltimport seaborn as snsfrom lifelines import KaplanMeierFitterfrom statsmodels.nonparametric.smoothers_lowess import lowess warnings.filterwarnings("ignore" ) CSV_PATH = "../男胎检测数据_filtered.csv" OUTPUT_DIR = "./eda_outputs" THRESH_Y_FRAC = 0.04 RANDOM_SEED = 42 N_TRAJ = 30 np.random.seed(RANDOM_SEED) COL_PATIENT = "孕妇代码" COL_GA_DAYS = "检测孕天数" COL_DATE = "检测日期" COL_Y_CONC = "Y染色体浓度" COL_BMI = "孕妇BMI" COL_HEIGHT = "身高" COL_WEIGHT = "体重" def ensure_dir (p: str ): if not os.path.exists(p): os.makedirs(p, exist_ok=True )def to_numeric_series (s: pd.Series ) -> pd.Series: return pd.to_numeric(s, errors="coerce" )def derive_week_from_days (s_days: pd.Series ) -> pd.Series: days = to_numeric_series(s_days) return days / 7.0 def compute_bmi (height_cm: pd.Series, weight_kg: pd.Series ) -> pd.Series: h_m = to_numeric_series(height_cm) / 100.0 w = to_numeric_series(weight_kg) bmi = w / (h_m ** 2 ) return bmidef parse_y_as_fraction (s: pd.Series ) -> Tuple [pd.Series, str ]: """ 将 Y 染色体浓度统一解析为“比例”(0–1)。 规则: - 如果字符串中带有 %,去掉 % 并 /100 - 否则转为数值;若数值的高分位(如95分位)>1 且 <=100,判为百分数,/100 - 其余情况按比例使用 返回:y_frac, scale_note """ s_raw = s.copy() if s_raw.dtype == object : s_str = s_raw.astype(str ).str .strip() if s_str.str .contains("%" ).any (): vals = pd.to_numeric(s_str.str .replace("%" , "" ).str .replace("," , "" ), errors="coerce" ) return vals / 100.0 , "parsed_from_percent_symbol" y = pd.to_numeric(s_raw, errors="coerce" ) finite = y[np.isfinite(y)] scale_note = "as_fraction" if finite.size >= 10 : q95 = np.nanpercentile(finite, 95 ) if (q95 > 1.0 ) and (q95 <= 100.0 ): y = y / 100.0 scale_note = "auto_div_100_from_percent_range" return y, scale_notedef mark_failure_from_y (y_frac: pd.Series ) -> pd.Series: return 
y_frac.isna()def prepare_row_level (df_raw: pd.DataFrame ) -> pd.DataFrame: df = df_raw.copy() if COL_GA_DAYS not in df.columns: raise ValueError("缺少列:检测孕天数" ) df["孕周" ] = derive_week_from_days(df[COL_GA_DAYS]) if COL_DATE in df.columns: df[COL_DATE] = pd.to_datetime(df[COL_DATE], errors="coerce" ) y_frac, scale_note = parse_y_as_fraction(df[COL_Y_CONC]) df["Y_frac" ] = y_frac df["测序失败" ] = mark_failure_from_y(df["Y_frac" ]) df["有效测量" ] = ~df["测序失败" ] df["达标" ] = (df["有效测量" ]) & (df["Y_frac" ] >= THRESH_Y_FRAC) if COL_BMI in df.columns: df["BMI_num" ] = to_numeric_series(df[COL_BMI]) else : df["BMI_num" ] = np.nan if df["BMI_num" ].isna().any (): if (COL_HEIGHT in df.columns) and (COL_WEIGHT in df.columns): bmi_calc = compute_bmi(df[COL_HEIGHT], df[COL_WEIGHT]) df.loc[df["BMI_num" ].isna(), "BMI_num" ] = bmi_calc[df["BMI_num" ].isna()] yfin = df["Y_frac" ][df["有效测量" ]] print ("\n===== Y 浓度单位与分布(已统一为“比例”0–1)=====" ) print (f"单位推断: {scale_note} " ) if len (yfin): desc = yfin.describe(percentiles=[0.1 , 0.25 , 0.5 , 0.75 , 0.9 , 0.95 ]) print (desc.to_string()) print (f"以阈值 {THRESH_Y_FRAC:.3 f} 判定达标的测次比例: {float ((df['达标' ]).mean()):.3 f} " ) else : print ("有效测量为空,检查原始数据。" ) return dfdef aggregate_patient_level (df: pd.DataFrame ) -> pd.DataFrame: sort_cols = ["孕周" ] if COL_DATE in df.columns: sort_cols.append(COL_DATE) df_sorted = df.sort_values(sort_cols) rows = [] for pid, g in df_sorted.groupby(COL_PATIENT, dropna=False ): bmi_values = g["BMI_num" ].dropna() bmi = bmi_values.iloc[0 ] if len (bmi_values) else np.nan n_total = len (g) n_valid = int (g["有效测量" ].sum ()) n_fail = n_total - n_valid ga_all = g["孕周" ].dropna() earliest_week = ga_all.min () if len (ga_all) else np.nan latest_week = ga_all.max () if len (ga_all) else np.nan g_valid = g[g["有效测量" ]] first_ge4_week = np.nan if len (g_valid): hit = g_valid[g_valid["达标" ]] if len (hit): first_ge4_week = hit["孕周" ].iloc[0 ] if pd.notna(first_ge4_week): event = 1 time = first_ge4_week else : event = 0 time = latest_week rows.append({ COL_PATIENT: pid, "BMI" : bmi, "n_records" : n_total, "n_valid" : n_valid, "n_fail" : n_fail, "earliest_week" : earliest_week, "latest_week" : latest_week, "event" : event, "time" : time, "all_failed" : int (n_valid == 0 ), }) df_pat = pd.DataFrame(rows) bmi_clean = df_pat["BMI" ].dropna() if len (bmi_clean): q1, q3 = np.percentile(bmi_clean, [25 , 75 ]) iqr = q3 - q1 mild_low, mild_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr extreme_low, extreme_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr else : mild_low = mild_high = extreme_low = extreme_high = np.nan df_pat["BMI_outlier_mild" ] = (df_pat["BMI" ] < mild_low) | (df_pat["BMI" ] > mild_high) df_pat["BMI_outlier_extreme" ] = (df_pat["BMI" ] < extreme_low) | (df_pat["BMI" ] > extreme_high) print ("\n===== 患者级事件与删失情况 =====" ) print (f"总人数: {len (df_pat)} , 有事件(首次达标)人数: {int (df_pat['event' ].sum ())} " f"({df_pat['event' ].mean():.3 f} ), 全失败人数: {int (df_pat['all_failed' ].sum ())} " ) return df_patdef make_bmi_bins (df_pat_or_rows: pd.DataFrame, colname: str = "BMI" , bins: Optional [list ] = None , labels: Optional [list ] = None ) -> pd.Series: if bins is None : bins = [20 , 28 , 32 , 36 , 40 , np.inf] if labels is None : labels = ["[20,28)" , "[28,32)" , "[32,36)" , "[36,40)" , "[40,+)" ] return pd.cut(df_pat_or_rows[colname], bins=bins, labels=labels, right=False , include_lowest=True )def set_cn_font (): try : plt.rcParams["font.sans-serif" ] = ["SimHei" , "Noto Sans CJK SC" , "Microsoft YaHei" ] plt.rcParams["axes.unicode_minus" ] = False except Exception: pass def 
plot_km_by_bmi (df_pat: pd.DataFrame, outdir: str ): set_cn_font() df_pat = df_pat.copy() df_pat["BMI_bin" ] = make_bmi_bins(df_pat, colname="BMI" ) kmf = KaplanMeierFitter() plt.figure(figsize=(9 , 6 )) any_group_plotted = False for label, sub in df_pat.groupby("BMI_bin" , dropna=False ): sub = sub.dropna(subset=["time" ]) if len (sub) < 5 : continue kmf.fit(durations=sub["time" ].values, event_observed=sub["event" ].values, label=str (label)) kmf.plot(ci_show=True , ci_alpha=0.15 , lw=2 ) any_group_plotted = True plt.title("Kaplan–Meier(事件=首次 Y≥4%)按 BMI 分层" ) plt.xlabel("孕周(周)" ) plt.ylabel("未达标概率 S(t)" ) plt.grid(alpha=0.2 ) plt.tight_layout() fp = os.path.join(outdir, "km_by_bmi.png" ) if any_group_plotted: plt.savefig(fp, dpi=200 ) plt.close()def plot_longitudinal_examples (df_rows: pd.DataFrame, outdir: str , n_patients: int = 30 ): set_cn_font() dfv = df_rows[df_rows["有效测量" ]].copy() counts = dfv.groupby(COL_PATIENT).size() eligible_ids = counts[counts >= 2 ].index if len (eligible_ids) == 0 : return sample_ids = np.random.choice(eligible_ids, size=min (n_patients, len (eligible_ids)), replace=False ) sub = dfv[dfv[COL_PATIENT].isin(sample_ids)].copy() plt.figure(figsize=(10 , 7 )) for pid, g in sub.groupby(COL_PATIENT): g = g.sort_values("孕周" ) plt.plot(g["孕周" ], g["Y_frac" ], marker="o" , ms=3 , lw=1 , alpha=0.6 ) plt.axhline(THRESH_Y_FRAC, color="red" , ls="--" , lw=1.5 , alpha=0.8 , label=f"阈值 {THRESH_Y_FRAC:.2 %} " ) plt.title(f"随机抽样 {len (sample_ids)} 位孕妇的纵向 Y 浓度轨迹(比例)" ) plt.xlabel("孕周(周)" ) plt.ylabel("Y 染色体浓度(比例)" ) plt.grid(alpha=0.2 ) plt.tight_layout() fp = os.path.join(outdir, "longitudinal_examples.png" ) plt.savefig(fp, dpi=200 ) plt.close()def plot_scatter_lowess_by_bmi (df_rows: pd.DataFrame, outdir: str ): set_cn_font() dfv = df_rows[df_rows["有效测量" ]].copy() dfv = dfv.dropna(subset=["孕周" , "Y_frac" , "BMI_num" ]) dfv = dfv[(dfv["孕周" ] >= 6 ) & (dfv["孕周" ] <= 35 )] dfv = dfv.rename(columns={"BMI_num" : "BMI" }) dfv["BMI_bin" ] = make_bmi_bins(dfv, colname="BMI" ) g = sns.FacetGrid(dfv, col="BMI_bin" , col_wrap=3 , sharex=True , sharey=True , height=3.2 ) def facet_scatter (data, color, **kwargs ): plt.scatter(data["孕周" ], data["Y_frac" ], s=8 , alpha=0.35 , color=color) x = data["孕周" ].values y = data["Y_frac" ].values if len (x) >= 20 : fitted = lowess(y, x, frac=0.3 , return_sorted=True ) plt.plot(fitted[:, 0 ], fitted[:, 1 ], color="black" , lw=2 ) plt.axhline(THRESH_Y_FRAC, color="red" , ls="--" , lw=1.2 , alpha=0.8 ) g.map_dataframe(facet_scatter) g.set_axis_labels("孕周(周)" , "Y 浓度(比例)" ) g.fig.subplots_adjust(top=0.88 ) g.fig.suptitle("Y 浓度 vs 孕周(按 BMI 分层,比例刻度)" ) fp = os.path.join(outdir, "scatter_lowess_by_bmi.png" ) plt.savefig(fp, dpi=200 ) plt.close()def summarize_and_save (df_pat: pd.DataFrame, outdir: str ): print ("\n===== 患者级汇总(前几行) =====" ) print (df_pat.head()) print ("\n===== 每位孕妇检测次数分布 =====" ) print (df_pat["n_records" ].describe()) print (df_pat["n_records" ].value_counts().head(10 )) print ("\n===== 每位孕妇测序失败次数分布 =====" ) print (df_pat["n_fail" ].describe()) print (df_pat["n_fail" ].value_counts().head(10 )) print ("\n===== 最早/最晚检测孕周 =====" ) print (df_pat[["earliest_week" , "latest_week" ]].describe()) print ("\n===== BMI 分布 =====" ) print (df_pat["BMI" ].describe()) n_mild = int (df_pat["BMI_outlier_mild" ].sum ()) n_ext = int (df_pat["BMI_outlier_extreme" ].sum ()) print (f"轻度离群(1.5*IQR)人数: {n_mild} , 极端离群(3*IQR)人数: {n_ext} " ) out_csv = os.path.join(outdir, "patient_level_summary.csv" ) df_pat.to_csv(out_csv, index=False , 
encoding="utf-8-sig" ) print (f"\n已保存患者级汇总:{out_csv} " )def main (): ensure_dir(OUTPUT_DIR) df_raw = pd.read_csv(CSV_PATH) for col in [COL_PATIENT, COL_GA_DAYS, COL_Y_CONC]: if col not in df_raw.columns: raise ValueError(f"未找到列:{col} " ) df_rows = prepare_row_level(df_raw) df_pat = aggregate_patient_level(df_rows) summarize_and_save(df_pat, OUTPUT_DIR) plot_km_by_bmi(df_pat, OUTPUT_DIR) plot_longitudinal_examples(df_rows, OUTPUT_DIR, n_patients=N_TRAJ) plot_scatter_lowess_by_bmi(df_rows, OUTPUT_DIR) print (f"\n图像已保存至目录:{OUTPUT_DIR} " ) print (" - km_by_bmi.png" ) print (" - longitudinal_examples.png" ) print (" - scatter_lowess_by_bmi.png" )if __name__ == "__main__" : main()
p2_noise_grouped_sensitivity_analysis.py
"""敏感性分析(分组):评估观测噪声对BMI分箱切点和最终推荐周数的影响。"""
import pandas as pd
import numpy as np
from lifelines import KaplanMeierFitter, CoxPHFitter
from sklearn.tree import DecisionTreeRegressor
from sklearn.isotonic import IsotonicRegression
from patsy import dmatrix
import os
import random
import warnings

warnings.filterwarnings("ignore")

N_BOOTSTRAPS = 50
NOISE_STD_DEV = 0
CFFDNA_THRESHOLD = 0.04
RAW_DATA_FILE = "C:/Users/yezf8/Documents/Y3Repo/25C题/男胎检测数据_filtered.csv"
OUTPUT_DIR = "C:/Users/yezf8/Documents/Y3Repo/25C题/problem2/sensitivity_analysis_outputs"
COL_PATIENT, COL_GA_DAYS, COL_Y_CONC, COL_BMI = "孕妇代码", "检测孕天数", "Y染色体浓度", "孕妇BMI"
SEED = 42
REQUIRED_BINS = 4
MIN_SAMPLES_PER_BIN = 20
MI_M = 10
MAIN_P = 0.95
DETECTION_LOWER_BOUND = 10.0

def set_global_seed(seed: int):
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)

def build_patient_intervals(df_raw, detection_lower_bound):
    df = df_raw.copy()
    df["孕周"] = df[COL_GA_DAYS] / 7.0
    df["达标"] = df[COL_Y_CONC] >= CFFDNA_THRESHOLD
    rows = []
    for pid, g in df.groupby(COL_PATIENT):
        bmi = g[COL_BMI].iloc[0]
        g = g.sort_values("孕周")
        times, hits = g["孕周"].values, g["达标"].values
        pos_idx = np.where(hits)[0]
        if len(pos_idx) > 0:
            first_pos_i = int(pos_idx[0])
            neg_before = np.where(~hits[:first_pos_i])[0]
            R = float(times[first_pos_i])
            L = float(times[neg_before[-1]]) if len(neg_before) > 0 else float(detection_lower_bound)
            ctype = "interval" if len(neg_before) > 0 else "left"
        else:
            L, R, ctype = float(times[-1]), np.inf, "right"
        rows.append({"patient_id": pid, "BMI": bmi, "L": L, "R": R, "ctype": ctype})
    return pd.DataFrame(rows)

def sample_time_from_interval(L, R, lower_bound):
    if not np.isfinite(R):
        return np.nan
    L_eff = float(L) if np.isfinite(L) else float(lower_bound)
    return np.random.uniform(L_eff, float(R))

def multiple_imputations(df_int, M, lower_bound):
    dfs = []
    for _ in range(M):
        rows = []
        for _, r in df_int.iterrows():
            if r["ctype"] in ["interval", "left"]:
                t = sample_time_from_interval(r["L"], r["R"], lower_bound)
                rows.append({"patient_id": r["patient_id"], "BMI": r["BMI"], "time": t, "event": 1})
            elif r["ctype"] == "right":
                rows.append({"patient_id": r["patient_id"], "BMI": r["BMI"], "time": r["L"], "event": 0})
        df_m = pd.DataFrame(rows).dropna()
        df_m["event"] = df_m["event"].astype(int)
        dfs.append(df_m)
    return dfs

def get_cox_predictions(imputed_sets, p_list):
    if not imputed_sets:
        return None
    agg_preds = None
    for m_idx, df_m in enumerate(imputed_sets):
        if len(df_m) < 20:
            continue
        BMI_CENTER = df_m["BMI"].mean()
        X_spline = dmatrix("bs(BMI_centered, df=4, degree=3, include_intercept=False)",
                           {"BMI_centered": (df_m["BMI"] - BMI_CENTER).values}, return_type='dataframe')
        cox_df = pd.concat([df_m[["time", "event"]].reset_index(drop=True), X_spline.reset_index(drop=True)], axis=1)
        cph = CoxPHFitter(penalizer=0.05)
        try:
            cph.fit(cox_df, duration_col="time", event_col="event", show_progress=False, robust=True)
        except Exception:
            continue
        time_grid = np.linspace(DETECTION_LOWER_BOUND, 35.0, 200)
        surv = cph.predict_survival_function(X_spline, times=time_grid)
        df_pred_m = df_m[["patient_id", "BMI"]].copy()
        for p in p_list:
            target_S, t_preds = 1.0 - p, []
            for col in surv.columns:
                s = surv[col].values
                hit = np.where(s <= target_S)[0]
                t_preds.append(time_grid[hit[0]] if len(hit) > 0 else np.nan)
            df_pred_m[f"pred_t{int(p*100)}"] = t_preds
        if agg_preds is None:
            agg_preds = df_pred_m
        else:
            current_preds = df_pred_m.rename(columns={f"pred_t{int(p*100)}": f"pred_t{int(p*100)}_{m_idx}" for p in p_list})
            agg_preds = agg_preds.merge(current_preds, on=["patient_id", "BMI"], how="left")
    if agg_preds is None:
        return None
    for p in p_list:
        pred_cols = [col for col in agg_preds.columns if col.startswith(f"pred_t{int(p*100)}")]
        agg_preds[f"pred_t{int(p*100)}_final"] = agg_preds[pred_cols].median(axis=1)
    return agg_preds[["patient_id", "BMI"] + [f"pred_t{int(p*100)}_final" for p in p_list]]

def get_supervised_cuts(df_pred, y_col, n_bins, min_samples):
    df_pred = df_pred.dropna(subset=["BMI", y_col])
    if len(df_pred) < min_samples * n_bins:
        return []
    iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
    y_mono = iso.fit_transform(df_pred["BMI"], df_pred[y_col])
    tree = DecisionTreeRegressor(max_leaf_nodes=n_bins, min_samples_leaf=min_samples, random_state=SEED)
    tree.fit(df_pred[["BMI"]], y_mono)
    return sorted([t for t in tree.tree_.threshold if t != -2.0])

def get_group_recommendations(df_with_groups, group_col):
    recommendations = {}
    for name, group in df_with_groups.groupby(group_col):
        if len(group) < 10:
            continue
        kmf = KaplanMeierFitter()
        kmf.fit(group['time'], group['event'])
        sf = kmf.survival_function_.reset_index()
        hit = sf[sf["KM_estimate"] <= (1 - MAIN_P)]
        rec = hit["timeline"].iloc[0] if len(hit) > 0 else group['time'].max()
        recommendations[name] = rec
    return recommendations

def run_single_simulation(df_raw, noise_std):
    set_global_seed(random.randint(0, 100000))
    df_noisy = df_raw.copy()
    df_noisy[COL_Y_CONC] += np.random.normal(0, noise_std, size=len(df_noisy))
    df_int = build_patient_intervals(df_noisy, DETECTION_LOWER_BOUND)
    imputed_sets = multiple_imputations(df_int, M=MI_M, lower_bound=DETECTION_LOWER_BOUND)
    if not imputed_sets:
        return None
    df_pred = get_cox_predictions(imputed_sets, [MAIN_P])
    if df_pred is None:
        return None
    y_col = f"pred_t{int(MAIN_P*100)}_final"
    cuts = get_supervised_cuts(df_pred, y_col, REQUIRED_BINS, MIN_SAMPLES_PER_BIN)
    if len(cuts) != REQUIRED_BINS - 1:
        return None
    df_m1 = imputed_sets[0]
    bin_edges = [-np.inf] + cuts + [np.inf]
    df_m1["group"] = pd.cut(df_m1["BMI"], bins=bin_edges, labels=range(REQUIRED_BINS))
    recs = get_group_recommendations(df_m1, "group")
    if len(recs) != REQUIRED_BINS:
        return None
    return {"cuts": cuts, "recommendations": [recs.get(i, np.nan) for i in range(REQUIRED_BINS)]}

if __name__ == "__main__":
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    set_global_seed(SEED)
    try:
        df_raw = pd.read_csv(RAW_DATA_FILE, encoding='gbk')
    except UnicodeDecodeError:
        df_raw = pd.read_csv(RAW_DATA_FILE, encoding='utf-8')
    df_raw = df_raw[[COL_PATIENT, COL_GA_DAYS, COL_Y_CONC, COL_BMI]].dropna()
    for col in [COL_GA_DAYS, COL_Y_CONC, COL_BMI]:
        df_raw[col] = pd.to_numeric(df_raw[col], errors='coerce')
    df_raw = df_raw.dropna()
    results = []
    print(f"Starting {N_BOOTSTRAPS} grouped sensitivity simulations (for {REQUIRED_BINS} groups)...")
    for i in range(N_BOOTSTRAPS):
        print(f"  Running simulation {i+1}/{N_BOOTSTRAPS}...")
        sim_result = run_single_simulation(df_raw, NOISE_STD_DEV)
        if sim_result:
            results.append(sim_result)
    print("Simulations complete.")
    if not results:
        print("No valid simulation results were obtained. The process might be too unstable or parameters too strict.")
    else:
        df_results = pd.DataFrame(results)
        df_cuts = pd.DataFrame(df_results["cuts"].tolist(), columns=[f"Cut_{i+1}" for i in range(REQUIRED_BINS - 1)])
        df_recs = pd.DataFrame(df_results["recommendations"].tolist(), columns=[f"Group_{i+1}_Rec" for i in range(REQUIRED_BINS)])
        output_csv_path = os.path.join(OUTPUT_DIR, "grouped_sensitivity_results_4_groups.csv")
        pd.concat([df_cuts, df_recs], axis=1).to_csv(output_csv_path, index=False)
        print(f"Saved detailed simulation results to {output_csv_path}")
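下面用一个极简的玩具示例(数据为假设值,与附件数据无关)演示上述 build_patient_intervals 将单个孕妇的纵向观测映射为删失区间 (L, R, ctype) 的规则:

import numpy as np

def classify(times, concentrations, threshold=0.04, lower_bound=10.0):
    """按 build_patient_intervals 的规则返回 (L, R, ctype)。"""
    times = np.asarray(times, dtype=float)
    hits = np.asarray(concentrations, dtype=float) >= threshold
    pos = np.where(hits)[0]
    if len(pos) == 0:                                   # 从未达标 -> 右删失
        return float(times[-1]), np.inf, "right"
    first = int(pos[0])
    neg_before = np.where(~hits[:first])[0]
    if len(neg_before) == 0:                            # 首次检测即达标 -> 左删失
        return float(lower_bound), float(times[first]), "left"
    return float(times[neg_before[-1]]), float(times[first]), "interval"

print(classify([12, 14, 16], [0.02, 0.03, 0.05]))       # (14.0, 16.0, 'interval')
print(classify([13, 15], [0.06, 0.07]))                 # (10.0, 13.0, 'left')
print(classify([12, 18], [0.01, 0.02]))                 # (18.0, inf, 'right')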
p2_plot_sensitivity_trends.py
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

INPUT_CSV = "C:/Users/yezf8/Documents/Y3Repo/25C题/problem2/sensitivity_analysis_outputs/grouped_sensitivity_results_4_groups.csv"
OUTPUT_DIR = "C:/Users/yezf8/Documents/Y3Repo/25C题/problem2/sensitivity_analysis_outputs"
N_GROUPS = 3

if __name__ == "__main__":
    if not os.path.exists(INPUT_CSV):
        print(f"Error: Input file not found at {INPUT_CSV}")
        exit()
    df = pd.read_csv(INPUT_CSV)
    cut_cols = [f"Cut_{i+1}" for i in range(N_GROUPS - 1)]
    rec_cols = [f"Group_{i+1}_Rec" for i in range(N_GROUPS)]
    cuts_melted = pd.melt(df, value_vars=cut_cols, var_name='Cut_Point', value_name='BMI_Value')
    recs_melted = pd.melt(df, value_vars=rec_cols, var_name='Recommendation_Group', value_name='Gestational_Week')
    plt.style.use('seaborn-v0_8-whitegrid')
    try:
        plt.rcParams['font.sans-serif'] = ['SimHei']
        plt.rcParams['axes.unicode_minus'] = False
    except:
        print("SimHei font not found, using default font.")
    fig1, ax1 = plt.subplots(figsize=(10, 12))
    sns.violinplot(data=cuts_melted, x='Cut_Point', y='BMI_Value', ax=ax1, hue='Cut_Point', palette="pastel", legend=False)
    ax1.set_title('不同切点的BMI值分布 (3组)', fontsize=16)
    ax1.set_xlabel('BMI 切点', fontsize=12)
    ax1.set_ylabel('BMI 值', fontsize=12)
    plt.tight_layout()
    output_path_cuts = os.path.join(OUTPUT_DIR, "sensitivity_violin_cuts_3_groups.png")
    plt.savefig(output_path_cuts, dpi=150)
    print(f"Saved cut-off violin plot to {output_path_cuts}")
    plt.close(fig1)
    fig2, ax2 = plt.subplots(figsize=(10, 12))
    sns.violinplot(data=recs_melted, x='Recommendation_Group', y='Gestational_Week', ax=ax2, hue='Recommendation_Group', palette="pastel", legend=False)
    ax2.set_title('不同组别的推荐孕周分布 (3组)', fontsize=16)
    ax2.set_xlabel('推荐组别', fontsize=12)
    ax2.set_ylabel('推荐孕周', fontsize=12)
    plt.xticks(rotation=45)
    plt.tight_layout()
    output_path_recs = os.path.join(OUTPUT_DIR, "sensitivity_violin_recs_3_groups.png")
    plt.savefig(output_path_recs, dpi=150)
    print(f"Saved recommendation violin plot to {output_path_recs}")
    plt.close(fig2)
p2_fuzzy_interval_modeling.py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
import os
import json

FUZZY_LOWER_BOUND = 0.039
FUZZY_UPPER_BOUND = 0.041
CFFDNA_THRESHOLD = 0.04
N_GROUPS = 4
RAW_DATA_FILE = "C:/Users/yezf8/Documents/Y3Repo/25C题/男胎检测数据_filtered.csv"
OUTPUT_DIR = "C:/Users/yezf8/Documents/Y3Repo/25C题/problem2/sensitivity_analysis_outputs"
CUTS_FILE = "C:/Users/yezf8/Documents/Y3Repo/25C题/problem2/outputs_binning/bmi_supervised_bins_cuts.json"
COL_PATIENT, COL_GA_DAYS, COL_Y_CONC, COL_BMI = "孕妇代码", "检测孕天数", "Y染色体浓度", "孕妇BMI"

def get_event_data(df, threshold):
    patient_results = []
    for _, group in df.groupby(COL_PATIENT):
        group = group.sort_values(by=COL_GA_DAYS)
        weeks = group[COL_GA_DAYS] / 7.0
        hits = group[COL_Y_CONC] >= threshold
        duration = weeks[hits.idxmax()] if hits.any() else weeks.max()
        observed = hits.any()
        patient_results.append({'patient_id': group[COL_PATIENT].iloc[0], 'BMI': group[COL_BMI].iloc[0], 'duration': duration, 'observed': observed})
    return pd.DataFrame(patient_results)

def get_fuzzy_event_data(df, lower_b, upper_b):
    patient_results = []
    for _, group in df.groupby(COL_PATIENT):
        group = group.sort_values(by=COL_GA_DAYS)
        weeks = group[COL_GA_DAYS] / 7.0
        concentrations = group[COL_Y_CONC]
        L, R, status = 0, np.inf, 'right_censored'
        above_indices = np.where(concentrations > upper_b)[0]
        if len(above_indices) > 0:
            first_above_idx = above_indices[0]
            R = weeks.iloc[first_above_idx]
            possible_L_indices = np.where(weeks < R)[0]
            L = weeks.iloc[possible_L_indices[-1]] if len(possible_L_indices) > 0 else 0
            status = 'interval'
        else:
            L = weeks.max()
        patient_results.append({'patient_id': group[COL_PATIENT].iloc[0], 'BMI': group[COL_BMI].iloc[0], 'L': L, 'R': R, 'status': status})
    df_intervals = pd.DataFrame(patient_results)
    imputed_times = []
    for _, row in df_intervals.iterrows():
        if row['status'] == 'interval':
            imputed_time = np.random.uniform(row['L'], row['R'])
            imputed_times.append({'patient_id': row['patient_id'], 'BMI': row['BMI'], 'duration': imputed_time, 'observed': True})
        elif row['status'] == 'right_censored':
            imputed_times.append({'patient_id': row['patient_id'], 'BMI': row['BMI'], 'duration': row['L'], 'observed': False})
    return pd.DataFrame(imputed_times)

def get_q95_recommendation(df):
    if df.empty or not df['observed'].any():
        return np.nan
    kmf = KaplanMeierFitter()
    kmf.fit(df['duration'], df['observed'])
    sf = kmf.survival_function_.reset_index()
    hit = sf[sf["KM_estimate"] <= 0.05]
    return hit["timeline"].iloc[0] if len(hit) > 0 else df['duration'].max()

if __name__ == "__main__":
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    try:
        with open(CUTS_FILE, 'r') as f:
            cuts = json.load(f)['chosen']['cuts_final']
    except FileNotFoundError:
        print(f"Error: Cuts file not found at {CUTS_FILE}. Cannot perform grouped analysis.")
        exit()
    bin_edges = [-np.inf] + cuts + [np.inf]
    labels = [f'Group {i+1}: {bin_edges[i]:.1f} <= BMI < {bin_edges[i+1]:.1f}' for i in range(N_GROUPS)]
    try:
        df_raw = pd.read_csv(RAW_DATA_FILE, encoding='gbk')
    except:
        df_raw = pd.read_csv(RAW_DATA_FILE, encoding='utf-8')
    df_raw = df_raw[[COL_PATIENT, COL_GA_DAYS, COL_Y_CONC, COL_BMI]].dropna()
    for col in [COL_GA_DAYS, COL_Y_CONC, COL_BMI]:
        df_raw[col] = pd.to_numeric(df_raw[col], errors='coerce')
    df_raw = df_raw.dropna()
    df_baseline = get_event_data(df_raw, CFFDNA_THRESHOLD)
    df_fuzzy = get_fuzzy_event_data(df_raw, FUZZY_LOWER_BOUND, FUZZY_UPPER_BOUND)
    df_baseline['group'] = pd.cut(df_baseline['BMI'], bins=bin_edges, labels=labels)
    df_fuzzy['group'] = pd.cut(df_fuzzy['BMI'], bins=bin_edges, labels=labels)
    fig, axes = plt.subplots(N_GROUPS, 1, figsize=(10, 18), sharex=True)
    fig.suptitle('Kaplan-Meier Curves by BMI Group: Exact vs. Fuzzy Threshold', fontsize=16)
    results_summary = []
    for i, group_label in enumerate(labels):
        ax = axes[i]
        baseline_group = df_baseline[df_baseline['group'] == group_label]
        fuzzy_group = df_fuzzy[df_fuzzy['group'] == group_label]
        rec_baseline, rec_fuzzy = np.nan, np.nan
        if not baseline_group.empty:
            kmf_b = KaplanMeierFitter().fit(baseline_group['duration'], baseline_group['observed'], label='Exact Threshold')
            kmf_b.plot(ax=ax, ci_show=True)
            rec_baseline = get_q95_recommendation(baseline_group)
        if not fuzzy_group.empty:
            kmf_f = KaplanMeierFitter().fit(fuzzy_group['duration'], fuzzy_group['observed'], label='Fuzzy Interval')
            kmf_f.plot(ax=ax, ci_show=True)
            rec_fuzzy = get_q95_recommendation(fuzzy_group)
        ax.set_title(group_label)
        ax.set_ylabel('Probability of Not Reaching Threshold')
        ax.grid(True, linestyle='--')
        ax.legend()
        results_summary.append({'Group': group_label, 'Q95_Week_Exact': rec_baseline, 'Q95_Week_Fuzzy': rec_fuzzy})
    axes[-1].set_xlabel('Gestational Week')
    plt.tight_layout(rect=[0, 0.03, 1, 0.96])
    output_path_png = os.path.join(OUTPUT_DIR, "fuzzy_interval_comparison_by_group.png")
    plt.savefig(output_path_png, dpi=150)
    print(f"Saved grouped comparison plot to {output_path_png}")
    df_summary = pd.DataFrame(results_summary)
    output_path_csv = os.path.join(OUTPUT_DIR, "fuzzy_interval_summary_by_group.csv")
    df_summary.to_csv(output_path_csv, index=False)
    print(f"Saved summary table to {output_path_csv}")
    print("\nSummary Table:")
    print(df_summary.to_string())
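按上述三个问题二脚本中声明的输入/输出路径推断(此处仅为合理的运行顺序说明,并非原文明示):p2_noise_grouped_sensitivity_analysis.py 先生成 grouped_sensitivity_results_4_groups.csv,p2_plot_sensitivity_trends.py 再读取该文件绘制切点与推荐孕周的小提琴图;p2_fuzzy_interval_modeling.py 则依赖问题二分箱步骤输出的 bmi_supervised_bins_cuts.json 中的 BMI 切点。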
p3_aft.R
RAW_CSV <- "../男胎检测数据_filtered.csv"
OUTDIR <- "outputs_joint_r"
THRESH <- 0.04
TIME_MIN <- 6.0
TIME_MAX <- 40.0
PRED_P <- c(0.90, 0.95)
PI_TIME <- 25.0
SEED <- 114514
set.seed(SEED)
if (!dir.exists(OUTDIR)) dir.create(OUTDIR, recursive = TRUE, showWarnings = FALSE)
suppressPackageStartupMessages({
  library(survival)
  library(jsonlite)
})

logit <- function(p, eps = 1e-3) {
  p <- pmax(pmin(p, 1 - eps), eps)
  log(p) - log(1 - p)
}

safe_read_csv <- function(file_path) {
  candidates <- c(file_path, file.path("..", basename(file_path)), file.path(".", basename(file_path)))
  sel <- NULL
  for (p in candidates) if (file.exists(p)) { sel <- p; break }
  if (is.null(sel)) stop("找不到数据文件:", paste(candidates, collapse = " | "))
  ok <- FALSE; df <- NULL
  if (requireNamespace("data.table", quietly = TRUE)) {
    try({ df <- data.table::fread(sel, encoding = "UTF-8", showProgress = FALSE); ok <- TRUE }, silent = TRUE)
    if (!ok) try({ df <- data.table::fread(sel, encoding = "GB18030", showProgress = FALSE); ok <- TRUE }, silent = TRUE)
  }
  if (!ok) try({ df <- read.csv(sel, stringsAsFactors = FALSE, fileEncoding = "UTF-8"); ok <- TRUE }, silent = TRUE)
  if (!ok) try({ df <- read.csv(sel, stringsAsFactors = FALSE); ok <- TRUE }, silent = TRUE)
  if (!ok) stop("CSV 读取失败(UTF-8/GB18030/默认编码均失败)")
  as.data.frame(df)
}

find_col <- function(pattern, cols) {
  m <- grep(pattern, cols, ignore.case = TRUE, value = TRUE)
  if (length(m) > 0) m[1] else NULL
}

as_frac <- function(x) {
  if (is.numeric(x)) return(x)
  xs <- trimws(as.character(x))
  if (any(grepl("%", xs, fixed = TRUE))) {
    v <- as.numeric(gsub("[%,]", "", xs)); return(v / 100)
  } else {
    v <- suppressWarnings(as.numeric(xs))
    f <- v[is.finite(v)]
    if (length(f) >= 10) {
      q95 <- suppressWarnings(as.numeric(stats::quantile(f, 0.95, na.rm = TRUE)))
      if (is.finite(q95) && q95 > 1 && q95 <= 100) return(v / 100)
    }
    return(v)
  }
}

ivf_to_cat <- function(x) {
  v <- trimws(as.character(x))
  v[v == ""] <- NA
  v <- gsub("(", "(", v, fixed = TRUE)
  v <- gsub(")", ")", v, fixed = TRUE)
  v_low <- tolower(v)
  is_art <- grepl("\\bivf\\b", v_low) | grepl("\\biui\\b", v_low) | grepl("试管", v) |
    grepl("人工授精", v) | grepl("体外受精", v) | grepl("辅助生殖", v) | grepl("\\bart\\b", v_low)
  out <- ifelse(is.na(v), "natural",
                ifelse(is_art, "art",
                       ifelse(grepl("自然受孕", v) | grepl("自然", v), "natural", "natural")))
  out
}

cat("=== Interval-censor AFT: 数据读取 ===\n")
df <- safe_read_csv(RAW_CSV)
cat(sprintf("数据维度: %d 行 %d 列\n", nrow(df), ncol(df)))
col_patient <- find_col("孕妇代码|patient", names(df))
col_ga <- find_col("检测孕天数|GA|day", names(df))
col_y <- find_col("Y染色体浓度|Y.*浓度|concentration", names(df))
if (is.null(col_patient) || is.null(col_ga) || is.null(col_y)) {
  stop("未找到必要列(孕妇代码/检测孕天数/Y染色体浓度)")
}
dat <- data.frame(
  patient_id = as.character(df[[col_patient]]),
  t_week = as.numeric(df[[col_ga]]) / 7,
  Y_frac = as_frac(df[[col_y]]),
  stringsAsFactors = FALSE
)
if ("孕妇BMI" %in% names(df)) {
  BMI_raw <- suppressWarnings(as.numeric(df[["孕妇BMI"]]))
} else {
  hcol <- find_col("身高|height", names(df))
  wcol <- find_col("体重|weight", names(df))
  if (!is.null(hcol) && !is.null(wcol)) {
    hh <- suppressWarnings(as.numeric(df[[hcol]]))
    ww <- suppressWarnings(as.numeric(df[[wcol]]))
    BMI_raw <- ww / ((hh / 100)^2)
  } else {
    BMI_raw <- rep(NA_real_, nrow(df))
  }
}
dat$BMI <- BMI_raw
acol <- find_col("年龄|age", names(df))
dat$Age <- if (!is.null(acol)) suppressWarnings(as.numeric(df[[acol]])) else NA_real_
col_ivf <- if ("IVF妊娠" %in% names(df)) "IVF妊娠" else find_col("IVF|试管|人工|IUI", names(df))
dat$IVF_row <- if (!is.null(col_ivf)) ivf_to_cat(df[[col_ivf]]) else "natural"
dat <- dat[is.finite(dat$t_week) & dat$t_week >= TIME_MIN & dat$t_week <= TIME_MAX, ]
dat$valid <- is.finite(dat$Y_frac)

get_interval <- function(d) {
  d <- d[order(d$t_week), ]
  y <- d$Y_frac
  t <- d$t_week
  ok <- which(is.finite(y))
  if (length(ok) == 0) return(c(L = NA, R = NA, type = "none"))
  y <- y[ok]; t <- t[ok]
  cross <- which(y >= THRESH)
  if (length(cross) == 0) {
    return(c(L = t[length(t)], R = Inf, type = "right"))
  } else {
    j <- cross[1]
    if (j == 1) {
      return(c(L = 0, R = t[1], type = "left"))
    } else {
      prev_neg <- max(which(y[1:(j - 1)] < THRESH))
      L <- t[prev_neg]; R <- t[j]
      return(c(L = L, R = R, type = "interval"))
    }
  }
}

iv_rows <- by(dat, dat$patient_id, get_interval)
iv_dt <- do.call(rbind, lapply(names(iv_rows), function(pid) {
  z <- iv_rows[[pid]]
  data.frame(patient_id = pid, L = as.numeric(z["L"]), R = as.numeric(z["R"]),
             type = as.character(z["type"]), stringsAsFactors = FALSE)
}))
iv_dt$L <- as.numeric(iv_dt$L); iv_dt$R <- as.numeric(iv_dt$R)
cat("区间类型计数:\n"); print(table(iv_dt$type, useNA = "ifany"))
event_out <- iv_dt[iv_dt$type %in% c("left", "interval", "right"), c("patient_id", "L", "R", "type")]
out_intervals_csv <- file.path(OUTDIR, "event_intervals.csv")
write.csv(event_out, out_intervals_csv, row.names = FALSE)
cat("已导出区间删失明细到:", out_intervals_csv, "(n=", nrow(event_out), ")\n", sep = "")

mode_char <- function(v) {
  v <- v[!is.na(v) & v != ""]
  if (!length(v)) return("natural")
  tab <- sort(table(v), decreasing = TRUE)
  names(tab)[1]
}

covs_num <- stats::aggregate(cbind(BMI, Age) ~ patient_id, data = dat,
                             FUN = function(z) suppressWarnings(median(as.numeric(z), na.rm = TRUE)))
covs_ivf <- stats::aggregate(IVF_row ~ patient_id, data = dat, FUN = mode_char)
names(covs_ivf)[2] <- "IVF_cat"
covs <- merge(covs_num, covs_ivf, by = "patient_id", all = TRUE)
if (any(!is.finite(covs$BMI))) {
  med_bmi <- median(covs$BMI[is.finite(covs$BMI)], na.rm = TRUE); if (!is.finite(med_bmi)) med_bmi <- 25
  covs$BMI[!is.finite(covs$BMI)] <- med_bmi
}
if (any(!is.finite(covs$Age))) {
  med_age <- median(covs$Age[is.finite(covs$Age)], na.rm = TRUE); if (!is.finite(med_age)) med_age <- 30
  covs$Age[!is.finite(covs$Age)] <- med_age
}
covs$IVF_cat[is.na(covs$IVF_cat) | covs$IVF_cat == ""] <- "natural"
covs$IVF_cat <- factor(covs$IVF_cat, levels = c("natural", "art"))
df_ic <- merge(iv_dt, covs, by = "patient_id", all.x = TRUE)
Surv_ic <- with(df_ic, Surv(L, R, type = "interval2"))
use_terms <- c()
if (any(is.finite(df_ic$BMI))) use_terms <- c(use_terms, "BMI")
if (any(is.finite(df_ic$Age))) use_terms <- c(use_terms, "Age")
use_ivf <- length(levels(droplevels(df_ic$IVF_cat))) >= 2
if (use_ivf) use_terms <- c(use_terms, "IVF_cat")
form <- as.formula(paste0("Surv_ic ~ ", ifelse(length(use_terms) == 0, "1", paste(use_terms, collapse = " + "))))
cat("AFT 公式:", deparse(form), "\n", sep = "")
DIST <- Sys.getenv("AFT_DIST", unset = "auto")
cands <- c("lognormal", "weibull", "loglogistic")
fit_one <- function(dist_name) {
  try(survreg(form, data = df_ic, dist = dist_name, na.action = na.omit), silent = TRUE)
}
fits <- list()
if (DIST == "auto") {
  for (d in cands) {
    obj <- fit_one(d)
    if (!inherits(obj, "try-error") && inherits(obj, "survreg")) fits[[d]] <- obj else message("[WARN] survreg 拟合失败:", d)
  }
  if (!length(fits)) stop("所有 AFT 分布拟合均失败;请检查数据。")
  aics <- sapply(fits, AIC)
  cat("各分布 AIC:\n"); print(aics)
  best_name <- names(which.min(aics))
} else {
  obj <- fit_one(DIST)
  if (inherits(obj, "try-error") || !inherits(obj, "survreg")) stop("指定分布拟合失败:", DIST)
  fits[[DIST]] <- obj; best_name <- DIST
}
fit <- fits[[best_name]]
cat(sprintf("选择分布: %s(AIC=%.1f)\n", best_name, AIC(fit)))
newdat <- covs[, c("patient_id", "BMI", "Age", "IVF_cat")]
if (!use_ivf) newdat$IVF_cat <- NULL
if ("IVF_cat" %in% names(newdat)) {
  newdat$IVF_cat <- factor(newdat$IVF_cat, levels = levels(df_ic$IVF_cat))
}
pred_q <- function(fit_obj, p_vec, newdata) {
  out <- do.call(cbind, lapply(p_vec, function(pp) as.numeric(predict(fit_obj, newdata = newdata, type = "quantile", p = pp))))
  colnames(out) <- paste0("t", as.integer(p_vec * 100))
  out
}
Q <- pred_q(fit, PRED_P, newdata = newdat)
pi_at_t <- function(fit_obj, t_scalar, newdata) {
  mu <- as.numeric(predict(fit_obj, newdata = newdata, type = "lp"))
  sc <- fit_obj$scale
  dist <- fit_obj$dist
  psurvreg(t_scalar, mean = mu, scale = sc, distribution = dist)
}
PI <- pi_at_t(fit, PI_TIME, newdat)
res <- data.frame(
  patient_id = newdat$patient_id,
  pred_t90 = Q[, "t90"],
  pred_t95 = Q[, "t95"],
  pi_25 = as.numeric(PI),
  BMI = covs$BMI[match(newdat$patient_id, covs$patient_id)],
  stringsAsFactors = FALSE
)
if (use_ivf) res$IVF_cat <- covs$IVF_cat[match(newdat$patient_id, covs$patient_id)]
out_csv <- file.path(OUTDIR, "joint_tdcox_preds.csv")
write.csv(res, out_csv, row.names = FALSE)
cat("已保存预测到:", out_csv, sprintf("(n=%d)\n", nrow(res)))
mu_lp <- as.numeric(predict(fit, newdata = newdat, type = "lp"))
sc <- as.numeric(fit$scale)
dist <- fit$dist
param_dt <- data.frame(
  patient_id = newdat$patient_id,
  BMI = covs$BMI[match(newdat$patient_id, covs$patient_id)],
  Age = covs$Age[match(newdat$patient_id, covs$patient_id)],
  IVF_cat = if ("IVF_cat" %in% names(res)) as.character(res$IVF_cat) else NA_character_,
  dist = dist,
  mu = NA_real_, sigma = NA_real_, shape = NA_real_, scale = NA_real_,
  t90 = Q[, "t90"], t95 = Q[, "t95"],
  stringsAsFactors = FALSE
)
if (dist == "lognormal") {
  param_dt$mu <- mu_lp
  param_dt$sigma <- sc
} else if (dist == "weibull") {
  param_dt$shape <- 1 / sc
  param_dt$scale <- exp(mu_lp)
} else if (dist == "loglogistic") {
  param_dt$shape <- 1 / sc
  param_dt$scale <- exp(mu_lp)
}
param_csv <- file.path(OUTDIR, "aft_params_by_patient.csv")
write.csv(param_dt, param_csv, row.names = FALSE)
cat("已导出个体分布参数到:", param_csv, " dist=", dist, "\n", sep = "")
info <- list(
  formula = deparse(form),
  selected_dist = dist,
  scale_sigma = sc,
  coef = as.list(coef(fit)),
  AIC_selected = AIC(fit),
  AIC_candidates = lapply(names(fits), function(nm) list(dist = nm, AIC = AIC(fits[[nm]])))
)
info_json <- file.path(OUTDIR, "aft_model_info.json")
writeLines(jsonlite::toJSON(info, pretty = TRUE, auto_unbox = TRUE), info_json)
cat("已写出模型信息:", info_json, "\n", sep = "")
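对选中的 log-normal 分布而言,survreg 的线性预测 μ 与尺度 σ 满足 log T ~ N(μ, σ²),因此导出到 aft_params_by_patient.csv 的个体参数可以直接复核分位点与 π(25)。下面给出一个最小示意(μ、σ 为假设值,仅作演示,与附件数据无关):

from scipy.stats import norm
import numpy as np

mu, sigma = np.log(14.0), 0.35                      # 假设的个体参数,仅作演示
t95 = np.exp(mu + sigma * norm.ppf(0.95))           # 满足 P(T <= t95) = 0.95 的孕周
pi_25 = norm.cdf((np.log(25.0) - mu) / sigma)       # π(25) = P(T <= 25)
print(round(float(t95), 2), round(float(pi_25), 3))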
p3_bmi_group_plots.py
"""
图表:
- fig_bmi_panels.png        面板图
- fig_mi_km_curves.png      区间删失 MI+KM 曲线(AFT 条件插补优先;否则 Uniform)
- fig_aft_curves_exact.png  AFT 精确曲线(基于 survreg 参数)
"""
import os, math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm, lognorm, weibull_min, fisk

sns.set_theme(style="whitegrid")
plt.rcParams.update({"font.size": 12, "axes.labelsize": 12, "axes.titlesize": 13,
                     "legend.fontsize": 10, "figure.titlesize": 14, "axes.unicode_minus": False})
for f in ["SimHei", "Microsoft YaHei", "Noto Sans CJK SC", "WenQuanYi Zen Hei", "PingFang SC", "Arial"]:
    plt.rcParams["font.sans-serif"] = [f, "DejaVu Sans", "Arial"]; break

OUT_DIR = "outputs_binning"; os.makedirs(OUT_DIR, exist_ok=True)
GROUP_CSV = os.path.join(OUT_DIR, "bmi_groups.csv")
INTERVAL_CSV = os.path.join("outputs_joint_r", "event_intervals.csv")
AFT_PARAM_CSV = os.path.join("outputs_joint_r", "aft_params_by_patient.csv")
RECOMMEND_CSV = os.path.join(OUT_DIR, "recommendations_by_group.csv")
PRED_CSV = os.path.join("outputs_joint_r", "joint_tdcox_preds.csv")
T_MIN, T_MAX, DT = 0.0, 26.0, 0.1
MI_M, MI_SEED = 200, 20240922
PAL_LINE = sns.color_palette("tab10", 10)
PAL_BAR = sns.color_palette("Set2", 8)

def require_file(path, desc):
    if not os.path.exists(path):
        raise FileNotFoundError(f"缺少 {desc}: {path}")

def load_data():
    require_file(GROUP_CSV, "bmi_groups.csv")
    groups = pd.read_csv(GROUP_CSV)
    preds = pd.read_csv(PRED_CSV) if os.path.exists(PRED_CSV) else pd.DataFrame()
    df = groups.merge(preds, on="patient_id", how="left", suffixes=("", "_pred"))
    for col in ["pred_t95", "pred_t90", "pi_25"]:
        c2 = f"{col}_pred"
        if c2 in df.columns and col not in df.columns:
            df.rename(columns={c2: col}, inplace=True)
    return df, groups, preds

def make_group_labels(groups_df: pd.DataFrame, decimals=1, include_n=True):
    stat = groups_df.groupby("group_idx").agg(n=("patient_id", "size"), bmin=("BMI", "min"), bmax=("BMI", "max")).reset_index().sort_values("group_idx")
    labels = {}; G = stat["group_idx"].tolist()
    for _, r in stat.iterrows():
        g = int(r["group_idx"]); n = int(r["n"]); lo = round(float(r["bmin"]), decimals); hi = round(float(r["bmax"]), decimals)
        if g == G[0]:
            txt = f"≤ {hi:.{decimals}f}"
        elif g == G[-1]:
            txt = f"≥ {lo:.{decimals}f}"
        else:
            txt = f"{lo:.{decimals}f}–{hi:.{decimals}f}"
        if include_n:
            txt = f"{txt} (n={n})"
        labels[g] = f"组 {g} {txt}"
    return labels

def plot_panels(df_all, df_groups):
    size_tab = df_groups["group_idx"].value_counts().sort_index().reset_index()
    size_tab.columns = ["group_idx", "n"]
    if os.path.exists(INTERVAL_CSV):
        inter = pd.read_csv(INTERVAL_CSV)
        left_rate = inter.merge(df_groups[["patient_id", "group_idx"]], on="patient_id", how="inner") \
            .groupby("group_idx")["type"].apply(lambda s: np.mean(s.values == "left")).reset_index() \
            .rename(columns={"type": "left_censor_rate"})
    else:
        left_rate = pd.DataFrame({"group_idx": sorted(df_groups["group_idx"].unique()), "left_censor_rate": np.nan})
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    ax = axes[0, 0]; sns.barplot(data=size_tab, x="group_idx", y="n", ax=ax, color=PAL_BAR[0]); ax.set_title("BMI 组样本量"); ax.set_xlabel("BMI 组"); ax.set_ylabel("样本数")
    for p in ax.patches:
        h = p.get_height(); ax.annotate(f"{h:.0f}", (p.get_x() + p.get_width()/2, h), ha="center", va="bottom", fontsize=10)
    ax = axes[0, 1]; sns.barplot(data=left_rate, x="group_idx", y="left_censor_rate", ax=ax, color=PAL_BAR[1]); ax.set_title("左删失比例"); ax.set_ylim(0, 1)
    for p in ax.patches:
        h = p.get_height()
    ax = axes[1, 0]
    if "pi_25" in df_all.columns:
        sns.boxplot(data=df_all, x="group_idx", y="pi_25", ax=ax, color=PAL_BAR[2], width=0.5, fliersize=2)
        sns.stripplot(data=df_all, x="group_idx", y="pi_25", ax=ax, color="#444", size=2, alpha=0.45, jitter=0.22)
        ax.set_title("pi_25 组间分布"); ax.set_ylim(0, 1.02)
    else:
        ax.axis("off"); ax.text(0.5, 0.5, "缺少 pi_25 列", ha="center", va="center")
    ax = axes[1, 1]
    if "pred_t95" in df_all.columns:
        sns.boxplot(data=df_all, x="group_idx", y="pred_t95", ax=ax, color=PAL_BAR[3], width=0.5, fliersize=2)
        sns.stripplot(data=df_all, x="group_idx", y="pred_t95", ax=ax, color="#444", size=2, alpha=0.45, jitter=0.22)
        ax.set_title("pred_t95(周)组间分布"); ax.set_ylim(T_MIN - 0.5, T_MAX + 0.5)
    else:
        ax.axis("off"); ax.text(0.5, 0.5, "缺少 pred_t95 列", ha="center", va="center")
    fig.tight_layout(); fig.savefig(os.path.join(OUT_DIR, "fig_bmi_panels.png"), dpi=150); plt.close(fig)
    print("[OK] 已保存:fig_bmi_panels.png")

def aft_objs_from_row(r):
    d = str(r["dist"]).strip().lower()
    if d == "lognormal" and np.isfinite(r["mu"]) and np.isfinite(r["sigma"]) and r["sigma"] > 0:
        s = float(r["sigma"]); sc = np.exp(float(r["mu"]))
        return (lambda x: lognorm.cdf(np.maximum(x, 1e-9), s=s, scale=sc),
                lambda u: lognorm.ppf(np.clip(u, 1e-12, 1 - 1e-12), s=s, scale=sc))
    if d == "weibull" and np.isfinite(r["shape"]) and np.isfinite(r["scale"]) and r["shape"] > 0 and r["scale"] > 0:
        c = float(r["shape"]); sc = float(r["scale"])
        return (lambda x: weibull_min.cdf(np.maximum(x, 1e-9), c=c, scale=sc),
                lambda u: weibull_min.ppf(np.clip(u, 1e-12, 1 - 1e-12), c=c, scale=sc))
    if d == "loglogistic" and np.isfinite(r["shape"]) and np.isfinite(r["scale"]) and r["shape"] > 0 and r["scale"] > 0:
        c = float(r["shape"]); sc = float(r["scale"])
        return (lambda x: fisk.cdf(np.maximum(x, 1e-9), c=c, scale=sc),
                lambda u: fisk.ppf(np.clip(u, 1e-12, 1 - 1e-12), c=c, scale=sc))
    return None, None

def mi_km_curves(groups_df):
    if not os.path.exists(INTERVAL_CSV):
        return None
    inter = pd.read_csv(INTERVAL_CSV)
    t_grid = np.arange(T_MIN, T_MAX + 1e-9, DT)
    rng = np.random.default_rng(MI_SEED)
    x = inter.merge(groups_df[["patient_id", "group_idx"]], on="patient_id", how="inner")
    aft = pd.read_csv(AFT_PARAM_CSV) if os.path.exists(AFT_PARAM_CSV) else pd.DataFrame()
    if not aft.empty:
        aft = aft[["patient_id", "dist", "mu", "sigma", "shape", "scale"]]
        x = x.merge(aft, on="patient_id", how="left")

    def km_from_rc(times, events):
        ord_idx = np.argsort(times); t = times[ord_idx]; e = events[ord_idx].astype(int)
        uniq = np.unique(t[e == 1])
        if uniq.size == 0:
            return np.array([]), np.array([])
        n_at = len(t); S = 1.0; S_vals = []
        for u in uniq:
            d = int(np.sum((t == u) & (e == 1))); c = int(np.sum((t == u) & (e == 0)))
            if n_at > 0:
                S *= (1.0 - d / n_at)
            S_vals.append(S); n_at -= (d + c)
        return uniq, np.array(S_vals, dtype=float)

    groups = sorted(x["group_idx"].unique())
    S_store = {g: [] for g in groups}
    for _ in range(MI_M):
        rows = []
        for _, r in x.iterrows():
            L = float(r["L"]); R = r["R"]; typ = r["type"]
            if typ == "right" or (isinstance(R, float) and (np.isinf(R))):
                rows.append((r["group_idx"], L, 0)); continue
            cdf, ppf = aft_objs_from_row(r)
            if typ == "left":
                Rt = float(R)
                if cdf is not None:
                    u = rng.uniform(0.0, max(cdf(Rt), 1e-12))
                    t = float(ppf(u))
                else:
                    t = rng.uniform(1e-6, Rt)
                rows.append((r["group_idx"], t, 1))
            else:
                Lt = float(L); Rt = float(R)
                if cdf is not None:
                    uL, uR = cdf(Lt), cdf(Rt)
                    if not (np.isfinite(uL) and np.isfinite(uR)) or uR <= uL + 1e-12:
                        t = 0.5 * (Lt + Rt) if Rt > Lt else Lt
                    else:
                        u = rng.uniform(uL, uR); t = float(ppf(u))
                else:
                    t = Lt if Rt <= Lt else rng.uniform(Lt, Rt)
                rows.append((r["group_idx"], t, 1))
        imp = pd.DataFrame(rows, columns=["group_idx", "time", "event"])
        for g in groups:
            gi = imp[imp["group_idx"] == g]
            if gi.empty:
                S_store[g].append(np.ones_like(t_grid)); continue
            times = gi["time"].values.astype(float); events = gi["event"].values.astype(int)
            ut, Sv = km_from_rc(times, events)
            idx = np.searchsorted(ut, t_grid, side="right") - 1; idx = np.clip(idx, -1, len(Sv) - 1)
            Sg = np.ones_like(t_grid, dtype=float); m = idx >= 0; Sg[m] = Sv[idx[m]]; S_store[g].append(Sg)
    curves = []
    for g in groups:
        S_arr = np.vstack(S_store[g]); S_med = np.nanmedian(S_arr, axis=0); S_q25 = np.nanpercentile(S_arr, 25, axis=0); S_q75 = np.nanpercentile(S_arr, 75, axis=0)
        curves.append(pd.DataFrame({"group_idx": g, "t": t_grid, "S_med": S_med, "S_q25": S_q25, "S_q75": S_q75}))
    return pd.concat(curves, ignore_index=True)

def read_aft_params():
    return pd.read_csv(AFT_PARAM_CSV) if os.path.exists(AFT_PARAM_CSV) else None

def aft_S_of_t(dist, params, t_grid):
    t = np.maximum(t_grid, 1e-9)
    if dist == "lognormal":
        return 1 - norm.cdf((np.log(t) - params["mu"]) / params["sigma"])
    if dist == "weibull":
        return np.exp(- (t / params["scale"]) ** params["shape"])
    if dist == "loglogistic":
        return 1.0 / (1.0 + (t / params["scale"]) ** params["shape"])
    raise ValueError("unknown dist")

def aft_curves_from_params(groups_df, aft_df):
    if aft_df is None or aft_df.empty:
        return None
    aft_df = aft_df.drop(columns=[c for c in ["group_idx"] if c in aft_df.columns]) \
        .merge(groups_df[["patient_id", "group_idx"]], on="patient_id", how="left")
    t_grid = np.arange(T_MIN, T_MAX + 1e-9, DT)
    curves = []
    for g in sorted(aft_df["group_idx"].dropna().unique()):
        sub = aft_df[aft_df["group_idx"] == g]; S_mat = []
        for _, r in sub.iterrows():
            d = str(r["dist"]).strip().lower()
            if d == "lognormal" and np.isfinite(r["mu"]) and np.isfinite(r["sigma"]) and r["sigma"] > 0:
                S = aft_S_of_t("lognormal", {"mu": float(r["mu"]), "sigma": float(r["sigma"])}, t_grid)
            elif d == "weibull" and np.isfinite(r["shape"]) and np.isfinite(r["scale"]) and r["shape"] > 0 and r["scale"] > 0:
                S = aft_S_of_t("weibull", {"shape": float(r["shape"]), "scale": float(r["scale"])}, t_grid)
            elif d == "loglogistic" and np.isfinite(r["shape"]) and np.isfinite(r["scale"]) and r["shape"] > 0 and r["scale"] > 0:
                S = aft_S_of_t("loglogistic", {"shape": float(r["shape"]), "scale": float(r["scale"])}, t_grid)
            else:
                continue
            S_mat.append(S)
        if not S_mat:
            continue
        S_arr = np.vstack(S_mat); S_med = np.nanmedian(S_arr, axis=0); S_q25 = np.nanpercentile(S_arr, 25, axis=0); S_q75 = np.nanpercentile(S_arr, 75, axis=0)
        curves.append(pd.DataFrame({"group_idx": int(g), "t": t_grid, "S_med": S_med, "S_q25": S_q25, "S_q75": S_q75}))
    return pd.concat(curves, ignore_index=True) if curves else None

def plot_mi_km(groups_df, km_curves_df, rec_map):
    if km_curves_df is None or km_curves_df.empty:
        return
    labels = make_group_labels(groups_df, include_n=True)
    fig, ax = plt.subplots(figsize=(9, 6))
    for i, g in enumerate(sorted(km_curves_df["group_idx"].unique())):
        gd = km_curves_df[km_curves_df["group_idx"] == g]; col = PAL_LINE[i % len(PAL_LINE)]
        ax.step(gd["t"], gd["S_med"], where="post", color=col, lw=2, label=labels.get(int(g), f"组{g}"))
        ax.fill_between(gd["t"], gd["S_q25"], gd["S_q75"], color=col, alpha=0.18, step="post")
        if int(g) in rec_map:
            v = rec_map[int(g)]; ax.axvline(v, color=col, ls="--", lw=1.2); ax.text(v, 0.06, "推荐", color=col, fontsize=9, ha="right", va="bottom", rotation=90)
    ax.set_xlim(T_MIN, T_MAX); ax.set_ylim(0, 1.02); ax.set_xlabel("孕周(周)"); ax.set_ylabel("S(t) = P(T > t)")
    ax.set_title("区间删失 MI+KM 生存曲线(AFT 条件插补)"); ax.legend(title="BMI 组(范围)"); fig.tight_layout()
    fig.savefig(os.path.join(OUT_DIR, "fig_mi_km_curves.png"), dpi=150); plt.close(fig)

def plot_aft(groups_df, aft_curves_df, rec_map):
    if aft_curves_df is None or aft_curves_df.empty:
        return
    labels = make_group_labels(groups_df, include_n=True)
    fig, ax = plt.subplots(figsize=(9, 6))
    for i, g in enumerate(sorted(aft_curves_df["group_idx"].unique())):
        gd = aft_curves_df[aft_curves_df["group_idx"] == g]; col = PAL_LINE[i % len(PAL_LINE)]
        ax.plot(gd["t"], gd["S_med"], color=col, lw=2, label=labels.get(int(g), f"组{g}"))
        ax.fill_between(gd["t"], gd["S_q25"], gd["S_q75"], color=col, alpha=0.18)
        if int(g) in rec_map:
            v = rec_map[int(g)]; ax.axvline(v, color=col, ls="--", lw=1.2); ax.text(v, 0.06, "推荐", color=col, fontsize=9, ha="right", va="bottom", rotation=90)
    ax.set_xlim(T_MIN, T_MAX); ax.set_ylim(0, 1.02); ax.set_xlabel("孕周(周)"); ax.set_ylabel("S(t) = P(T > t)")
    ax.set_title("AFT 组生存曲线(基于 survreg 精确参数)"); ax.legend(title="BMI 组(范围)"); fig.tight_layout()
    fig.savefig(os.path.join(OUT_DIR, "fig_aft_curves_exact.png"), dpi=150); plt.close(fig)

def main():
    df_all, groups, _ = load_data(); plot_panels(df_all, groups)
    rec_map = {}
    if os.path.exists(RECOMMEND_CSV):
        rec = pd.read_csv(RECOMMEND_CSV)
        rec_map = {int(r["group_idx"]): float(r["recommended_week"]) for _, r in rec.iterrows() if np.isfinite(r.get("recommended_week", np.nan))}
    km_curves = mi_km_curves(groups)
    plot_mi_km(groups, km_curves, rec_map)
    aft_df = read_aft_params(); aft_curves = aft_curves_from_params(groups, aft_df)
    plot_aft(groups, aft_curves, rec_map)
    print("[OK] 三张图已生成到 outputs_binning/")

if __name__ == "__main__":
    main()
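上面 km_from_rc(以及下一脚本 p3_bmi_supervised_binning.py 中的 _km_from_right_censored)实现的都是标准的 Kaplan–Meier 乘积极限估计:对插补后的右删失样本,记 $t_i$ 为各事件时刻、$d_i$ 为该时刻的达标人数、$n_i$ 为该时刻前仍处于风险集中的人数,则

$$\hat S(t)=\prod_{t_i\le t}\Bigl(1-\frac{d_i}{n_i}\Bigr),$$

组内推荐时点 t95 取满足 $\hat S(t)\le 0.05$ 的最早时刻。
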
p3_bmi_supervised_binning.py
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 """ BMI 监督分箱 + 评估 + 每组推荐时点(标准档,KM(MI) 优先) - 改进:MI 插补支持“基于 AFT 条件分布”的半参数插补(优先采用);若缺少参数文件则回退 Uniform。 - 继续输出:KM t95 的中位与 IQR、AFT 个体 t95 的中位与 IQR、AFT 与 KM 的对齐度(8–24 周)。 """ import os, sys, mathimport numpy as npimport pandas as pdimport jsontry : from scipy.stats import chi2, norm, lognorm, weibull_min, fisk SCIPY_OK = True except Exception: SCIPY_OK = False PRED_CSV = os.path.join("outputs_joint_r" , "joint_tdcox_preds.csv" ) INTERVALS_CSV = os.path.join("outputs_joint_r" , "event_intervals.csv" ) AFT_PARAM_CSV = os.path.join("outputs_joint_r" , "aft_params_by_patient.csv" ) OUT_DIR = "outputs_binning" os.makedirs(OUT_DIR, exist_ok=True ) TARGET_K = 3 MIN_LEAF = 30 MIN_GAIN = 1e-5 USE_METRIC = "pi_25" N_QUANT_CANDIDATES = 400 MI_M = 200 MI_SEED = 114514 Q_T95 = 0.05 Q_T90 = 0.10 USE_AFT_CONDITIONAL_MI = True RECOMMEND_ROUND_STEP = 0.5 T_MIN, T_MAX, DT = 0.0 , 26.0 , 0.1 ALIGN_LO, ALIGN_HI = 8.0 , 24.0 def _sse (x: np.ndarray ) -> float : if len (x) == 0 : return 0.0 mu = float (np.mean(x)); return float (((x - mu) ** 2 ).sum ())def _round_to_step (x: float , step: float = 0.5 ): if x is None or not np.isfinite(x): return float ("nan" ) return round (x / step) * stepdef _load_predictions (path=PRED_CSV ): if not os.path.exists(path): print (f"[ERROR] 预测文件不存在: {path} " ); sys.exit(1 ) df = pd.read_csv(path) metric = USE_METRIC if USE_METRIC in df.columns else None if metric is None : for alt in ["pi_25" ,"pred_t95" ,"pred_t90" ]: if alt in df.columns: print (f"[WARN] 未找到 {USE_METRIC} ;回退到 {alt} 作为监督目标" ); metric = alt; break if metric is None : print ("[ERROR] 无可用监督目标列" ); sys.exit(1 ) needed = 
{"patient_id" ,"BMI" ,metric} if not needed.issubset(df.columns): print (f"[ERROR] 缺列: {needed - set (df.columns)} " ); sys.exit(1 ) df = df[np.isfinite(df["BMI" ]) & np.isfinite(df[metric])].copy() keep = ["patient_id" ,"BMI" ,"pred_t90" ,"pred_t95" ,"pi_25" ]; keep=[c for c in keep if c in df.columns] df = df[list (dict .fromkeys(["patient_id" ,"BMI" ,metric] + keep))].copy() df.rename(columns={metric:"target" }, inplace=True ) if "pred_t95" in df.columns: rho = pd.Series(df["BMI" ]).corr(pd.Series(df["pred_t95" ]), method="spearman" ) print (f"[DIAG] Spearman: BMI vs pred_t95 = {rho:.3 f} " ) if "pi_25" in df.columns: rho = pd.Series(df["BMI" ]).corr(pd.Series(df["pi_25" ]), method="spearman" ) print (f"[DIAG] Spearman: BMI vs pi_25 = {rho:.3 f} " ) print (f"[INFO] 使用监督目标: {USE_METRIC} ;有效样本数: n={len (df)} " ) return df, USE_METRICdef _best_split_one_leaf (df_leaf: pd.DataFrame, min_leaf=MIN_LEAF ): x = df_leaf["BMI" ].values qs = np.linspace(0.02 , 0.98 , max (2 , int (N_QUANT_CANDIDATES))) cuts = np.unique(np.quantile(x, qs)) if cuts.size == 0 : return None , float ("-inf" ) base = _sse(df_leaf["target" ].values) best_gain = float ("-inf" ); best_cut=None for c in cuts: left = df_leaf[df_leaf["BMI" ] <= c]; right= df_leaf[df_leaf["BMI" ] > c] if len (left) < min_leaf or len (right) < min_leaf: continue gain = base - (_sse(left["target" ].values) + _sse(right["target" ].values)) if gain > best_gain: best_gain=gain; best_cut=float (c) return best_cut, best_gaindef _greedy_supervised (df: pd.DataFrame, target_k=TARGET_K, min_leaf=MIN_LEAF, min_gain=MIN_GAIN ): leaves=[df.copy()]; cuts=[] while len (leaves) < target_k: best_idx=None ; best_cut=None ; best_gain=float ("-inf" ) for i, leaf in enumerate (leaves): cut, gain = _best_split_one_leaf(leaf, min_leaf=min_leaf) if cut is not None and gain > best_gain: best_idx=i; best_cut=cut; best_gain=gain if best_cut is None or best_gain < min_gain: break leaf = leaves.pop(best_idx) left = leaf[leaf["BMI" ] <= best_cut].copy(); right= leaf[leaf["BMI" ] > best_cut].copy() leaves.extend([left,right]); cuts.append(best_cut) cuts=sorted (cuts) labels=[] for b in df["BMI" ].values: g=0 for c in cuts: if b>c: g+=1 labels.append(g) out=df.copy(); out["group_idx" ]=labels ok=True for g in sorted (out["group_idx" ].unique()): if int ((out["group_idx" ]==g).sum ()) < MIN_LEAF: ok=False ; break return out, cuts, okdef _fallback_equal_frequency (df: pd.DataFrame, k=TARGET_K ): n=len (df); if k<=1 or n<=1 : out=df.copy(); out["group_idx" ]=0 ; edges=np.array([df["BMI" ].min (), df["BMI" ].max ()]); return out, edges df_sorted=df.sort_values("BMI" ).reset_index(drop=True ); b=df_sorted["BMI" ].values idxs=sorted (set ([int (round (i*n/k)) for i in range (1 ,k) if 0 <int (round (i*n/k))<n])) cuts=[] for idx in idxs: lb=b[idx-1 ]; rb=b[idx] if rb>lb: cuts.append((lb+rb)/2.0 ) cuts=sorted (set (cuts)) labels=[] for v in df["BMI" ].values: g=0 for c in cuts: if v>c: g+=1 labels.append(g) out=df.copy(); out["group_idx" ]=labels edges=np.array([df["BMI" ].min ()]+cuts+[df["BMI" ].max ()], dtype=float ) return out, edgesdef _group_labels_from_bmi (df_g: pd.DataFrame ): gstats=df_g.groupby("group_idx" ).agg(bmin=("BMI" ,"min" ), bmax=("BMI" ,"max" )).reset_index().sort_values("group_idx" ) lut={int (r["group_idx" ]): f"[{r['bmin' ]:.2 f} , {r['bmax' ]:.2 f} ]" for _, r in gstats.iterrows()} return df_g["group_idx" ].map (lut), lutdef _describe_groups (df_g: pd.DataFrame ): return df_g.groupby("group_idx" ).agg( n=("BMI" ,"size" ), bmi_min=("BMI" ,"min" ), 
bmi_max=("BMI" ,"max" ), bmi_med=("BMI" ,"median" ), target_mean=("target" ,"mean" ), target_med=("target" ,"median" ) ).reset_index().sort_values("group_idx" )def _read_aft_params (): if not (USE_AFT_CONDITIONAL_MI and os.path.exists(AFT_PARAM_CSV)): return None df = pd.read_csv(AFT_PARAM_CSV) for c in ["mu" ,"sigma" ,"shape" ,"scale" ]: if c in df.columns: df[c]=pd.to_numeric(df[c], errors="coerce" ) keep = ["patient_id" ,"dist" ,"mu" ,"sigma" ,"shape" ,"scale" ] keep = [c for c in keep if c in df.columns] return df[keep].copy()def _km_from_right_censored (times: np.ndarray, events: np.ndarray ): ord_idx=np.argsort(times); t=times[ord_idx]; e=events[ord_idx].astype(int ) uniq_times=np.unique(t[e==1 ]) if uniq_times.size==0 : return np.array([]), np.array([]) n_at=len (t); S=1.0 ; S_vals=[] for u in uniq_times: d=int (np.sum ((t==u)&(e==1 ))); c=int (np.sum ((t==u)&(e==0 ))) if n_at>0 : S*=(1.0 - d/n_at) S_vals.append(S); n_at -= (d+c) return uniq_times, np.array(S_vals, dtype=float )def _km_quantile (uniq_times: np.ndarray, S_vals: np.ndarray, alpha: float ): if uniq_times.size==0 : return float ("nan" ) idx=np.where(S_vals<=alpha)[0 ] if idx.size==0 : return float ("nan" ) return float (uniq_times[idx[0 ]])def _aft_dist_objs (row ): """返回 (cdf, ppf) 两个可调用对象,用于该个体的分布。""" dist=str (row["dist" ]).strip().lower() if dist=="lognormal" and np.isfinite(row["mu" ]) and np.isfinite(row["sigma" ]) and row["sigma" ]>0 : s=float (row["sigma" ]); sc=np.exp(float (row["mu" ])) def cdf (x ): return lognorm.cdf(np.maximum(x,1e-9 ), s=s, scale=sc) def ppf (u ): return lognorm.ppf(np.clip(u, 1e-12 , 1 -1e-12 ), s=s, scale=sc) return cdf, ppf if dist=="weibull" and np.isfinite(row["shape" ]) and np.isfinite(row["scale" ]) and row["shape" ]>0 and row["scale" ]>0 : c=float (row["shape" ]); sc=float (row["scale" ]) def cdf (x ): return weibull_min.cdf(np.maximum(x,1e-9 ), c=c, scale=sc) def ppf (u ): return weibull_min.ppf(np.clip(u,1e-12 ,1 -1e-12 ), c=c, scale=sc) return cdf, ppf if dist=="loglogistic" and np.isfinite(row["shape" ]) and np.isfinite(row["scale" ]) and row["shape" ]>0 and row["scale" ]>0 : c=float (row["shape" ]); sc=float (row["scale" ]) def cdf (x ): return fisk.cdf(np.maximum(x,1e-9 ), c=c, scale=sc) def ppf (u ): return fisk.ppf(np.clip(u,1e-12 ,1 -1e-12 ), c=c, scale=sc) return cdf, ppf return None , None def _mi_km_summary_and_curves (intervals_df: pd.DataFrame, groups_df: pd.DataFrame, aft_params: pd.DataFrame = None ): """ 返回: - km_summary: 各组 KM t95/t90 的 MI 中位数与 IQR - km_curves: 各组的 S_med/S_q25/S_q75 曲线(用于对齐度) 说明: - 若提供 aft_params,则对 left/interval 使用 AFT 条件分布插补; 否则使用 Uniform(L,R) 或 (0,R) 插补。 """ rng=np.random.default_rng(MI_SEED) df = intervals_df.merge(groups_df[["patient_id" ,"group_idx" ]], on="patient_id" , how="inner" ) if aft_params is not None : df = df.merge(aft_params, on="patient_id" , how="left" ) groups=sorted (df["group_idx" ].unique()) t_grid=np.arange(T_MIN, T_MAX+1e-9 , DT) S_store={g:[] for g in groups}; t95_list={g:[] for g in groups}; t90_list={g:[] for g in groups} for _ in range (MI_M): imp_rows=[] for _, r in df.iterrows(): L=float (r["L" ]); R=r["R" ]; typ=r["type" ] if typ=="right" or (isinstance (R,float ) and (math.isinf(R) or np.isinf(R))): imp_rows.append((r["patient_id" ], r["group_idx" ], L, 0 )) continue if aft_params is not None and pd.notna(r.get("dist" , np.nan)): cdf, ppf = _aft_dist_objs(r) else : cdf, ppf = None , None if typ=="left" : Rt=float (R) if cdf is not None and ppf is not None : u_low = 0.0 u_high = float (cdf(Rt)) if not 
np.isfinite(u_high) or u_high <= 1e-12 : t = max (1e-6 , Rt*0.5 ) else : u = rng.uniform(u_low, max (u_high, u_low+1e-12 )) t = float (ppf(u)) else : t = rng.uniform(1e-6 , Rt) imp_rows.append((r["patient_id" ], r["group_idx" ], t, 1 )) else : Lt=float (L); Rt=float (R) if cdf is not None and ppf is not None : u_low = float (cdf(Lt)) u_high = float (cdf(Rt)) if not (np.isfinite(u_low) and np.isfinite(u_high)) or u_high <= u_low + 1e-12 : t = 0.5 *(Lt+Rt) if Rt>Lt else Lt else : u = rng.uniform(u_low, u_high) t = float (ppf(u)) else : t = Lt if Rt<=Lt else rng.uniform(Lt, Rt) imp_rows.append((r["patient_id" ], r["group_idx" ], t, 1 )) imp_df=pd.DataFrame(imp_rows, columns=["patient_id" ,"group_idx" ,"time" ,"event" ]) for g in groups: gi=imp_df[imp_df["group_idx" ]==g] if gi.empty: S_store[g].append(np.ones_like(t_grid)); t95_list[g].append(np.nan); t90_list[g].append(np.nan); continue times=gi["time" ].values.astype(float ); events=gi["event" ].values.astype(int ) ut, Sv = _km_from_right_censored(times, events) t95_list[g].append(_km_quantile(ut, Sv, Q_T95)) t90_list[g].append(_km_quantile(ut, Sv, Q_T90)) idx=np.searchsorted(ut, t_grid, side="right" )-1 ; idx=np.clip(idx, -1 , len (Sv)-1 ) Sg=np.ones_like(t_grid, dtype=float ); m=idx>=0 ; Sg[m]=Sv[idx[m]]; S_store[g].append(Sg) rows=[]; curves=[] for g in groups: arr95=np.array(t95_list[g], dtype=float ); arr90=np.array(t90_list[g], dtype=float ) rows.append({ "group_idx" : g, "KM_t95_MI_med" : float (np.nanmedian(arr95)), "KM_t95_q25" : float (np.nanpercentile(arr95,25 )) if np.isfinite(arr95).any () else np.nan, "KM_t95_q75" : float (np.nanpercentile(arr95,75 )) if np.isfinite(arr95).any () else np.nan, "KM_t90_MI_med" : float (np.nanmedian(arr90)), "KM_t90_q25" : float (np.nanpercentile(arr90,25 )) if np.isfinite(arr90).any () else np.nan, "KM_t90_q75" : float (np.nanpercentile(arr90,75 )) if np.isfinite(arr90).any () else np.nan, }) S_arr=np.vstack(S_store[g]); S_med=np.nanmedian(S_arr,axis=0 ); S_q25=np.nanpercentile(S_arr,25 ,axis=0 ); S_q75=np.nanpercentile(S_arr,75 ,axis=0 ) curves.append(pd.DataFrame({"group_idx" : g, "t" : t_grid, "S_med" : S_med, "S_q25" : S_q25, "S_q75" : S_q75})) km_summary=pd.DataFrame(rows).sort_values("group_idx" ) km_curves=pd.concat(curves, ignore_index=True ) return km_summary, km_curvesdef _aft_S_of_t (dist: str , params: dict , t_grid: np.ndarray ) -> np.ndarray: t = np.maximum(t_grid, 1e-9 ) d = str (dist).strip().lower() if d == "lognormal" : mu = float (params["mu" ]); sigma = float (params["sigma" ]) z = (np.log(t) - mu) / sigma if SCIPY_OK: return 1.0 - norm.cdf(z) else : return 0.5 * np.erfc(z / np.sqrt(2.0 )) elif d == "weibull" : shape = float (params["shape" ]); scale = float (params["scale" ]) return np.exp(- (t / scale) ** shape) elif d == "loglogistic" : shape = float (params["shape" ]); scale = float (params["scale" ]) return 1.0 / (1.0 + (t / scale) ** shape) else : raise ValueError("unknown dist" )def _aft_curves_from_params (groups_df: pd.DataFrame, aft_df: pd.DataFrame ): """ 由个体 AFT 参数计算每组的中位生存曲线与 25–75% 分位带。 返回列:group_idx, t, S_med, S_q25, S_q75 """ if aft_df is None or aft_df.empty: return None use_cols = [c for c in ["patient_id" ,"dist" ,"mu" ,"sigma" ,"shape" ,"scale" ] if c in aft_df.columns] a = aft_df[use_cols].copy() a = a.merge(groups_df[["patient_id" ,"group_idx" ]], on="patient_id" , how="left" ) t_grid = np.arange(T_MIN, T_MAX+1e-9 , DT) curves=[] for g in sorted (a["group_idx" ].dropna().unique()): sub = a[a["group_idx" ]==g] S_mat=[] for _, r in sub.iterrows(): d = str 
(r["dist" ]).strip().lower() try : if d=="lognormal" and np.isfinite(r.get("mu" , np.nan)) and np.isfinite(r.get("sigma" , np.nan)) and r["sigma" ]>0 : S = _aft_S_of_t("lognormal" , {"mu" : float (r["mu" ]), "sigma" : float (r["sigma" ])}, t_grid) elif d=="weibull" and np.isfinite(r.get("shape" , np.nan)) and np.isfinite(r.get("scale" , np.nan)) and r["shape" ]>0 and r["scale" ]>0 : S = _aft_S_of_t("weibull" , {"shape" : float (r["shape" ]), "scale" : float (r["scale" ])}, t_grid) elif d=="loglogistic" and np.isfinite(r.get("shape" , np.nan)) and np.isfinite(r.get("scale" , np.nan)) and r["shape" ]>0 and r["scale" ]>0 : S = _aft_S_of_t("loglogistic" , {"shape" : float (r["shape" ]), "scale" : float (r["scale" ])}, t_grid) else : continue except Exception: continue S_mat.append(S) if not S_mat: continue S_arr = np.vstack(S_mat) S_med = np.nanmedian(S_arr, axis=0 ) S_q25 = np.nanpercentile(S_arr, 25 , axis=0 ) S_q75 = np.nanpercentile(S_arr, 75 , axis=0 ) curves.append(pd.DataFrame({"group_idx" : int (g), "t" : t_grid, "S_med" : S_med, "S_q25" : S_q25, "S_q75" : S_q75})) return pd.concat(curves, ignore_index=True ) if curves else None def _alignment_metrics (km_curves: pd.DataFrame, aft_curves: pd.DataFrame, lo=ALIGN_LO, hi=ALIGN_HI ): if km_curves is None or aft_curves is None : return None res=[] for g in sorted (set (km_curves["group_idx" ]).intersection(set (aft_curves["group_idx" ]))): km_g=km_curves[km_curves["group_idx" ]==g]; aft_g=aft_curves[aft_curves["group_idx" ]==g] grid=np.intersect1d(km_g["t" ].values, aft_g["t" ].values) mask=(grid>=lo)&(grid<=hi); grid=grid[mask] if grid.size==0 : res.append({"group_idx" : int (g), "align_L1_8_24" : np.nan, "align_sup_8_24" : np.nan}) continue km_med=km_g.set_index("t" ).loc[grid, "S_med" ].values aft_med=aft_g.set_index("t" ).loc[grid, "S_med" ].values diff=np.abs (km_med - aft_med) res.append({"group_idx" : int (g), "align_L1_8_24" : float (np.mean(diff)), "align_sup_8_24" : float (np.max (diff))}) return pd.DataFrame(res).sort_values("group_idx" )def _make_recommendations (km_summary: pd.DataFrame, df_groups: pd.DataFrame, left_censor: pd.DataFrame = None , aft_q: pd.DataFrame = None , align_df: pd.DataFrame = None ): if "group_label" in df_groups.columns: label_map = df_groups.groupby("group_idx" )["group_label" ].agg(lambda s: s.dropna().iloc[0 ] if s.dropna().size>0 else None ).to_dict() else : _, lut = _group_labels_from_bmi(df_groups); label_map = lut sizes = df_groups["group_idx" ].value_counts().sort_index() work = km_summary.copy() if aft_q is not None : work = work.merge(aft_q, on="group_idx" , how="left" ) if align_df is not None : work = work.merge(align_df, on="group_idx" , how="left" ) recs=[] for _, row in work.sort_values("group_idx" ).iterrows(): g=int (row["group_idx" ]) t95_km = float (row["KM_t95_MI_med" ]) if np.isfinite(row["KM_t95_MI_med" ]) else float ("nan" ) t95_fill = float (row["AFT_t95_med" ]) if "AFT_t95_med" in row.index and np.isfinite(row["AFT_t95_med" ]) else np.nan if np.isfinite(t95_km): rec_raw=t95_km; note="KM_t95_MI_med" elif np.isfinite(t95_fill): rec_raw=t95_fill; note="fallback: AFT group median t95" else : rec_raw=float ("nan" ); note="no t95 available" rec=_round_to_step(rec_raw, RECOMMEND_ROUND_STEP) recs.append({ "group_idx" : g, "group_label" : label_map.get(g, "" ), "n" : int (sizes.get(g, 0 )), "KM_t95_MI_med" : t95_km, "KM_t95_q25" : float (row.get("KM_t95_q25" , np.nan)), "KM_t95_q75" : float (row.get("KM_t95_q75" , np.nan)), "AFT_t95_med" : float (row.get("AFT_t95_med" , np.nan)), 
"AFT_t95_q25" : float (row.get("AFT_t95_q25" , np.nan)), "AFT_t95_q75" : float (row.get("AFT_t95_q75" , np.nan)), "align_L1_8_24" : float (row.get("align_L1_8_24" , np.nan)), "align_sup_8_24" : float (row.get("align_sup_8_24" , np.nan)), "recommended_week" : rec, "notes" : note }) rec_df=pd.DataFrame(recs).sort_values("group_idx" ) if left_censor is not None and {"group_idx" ,"left_censor_rate" }.issubset(left_censor.columns): rec_df=rec_df.merge(left_censor, on="group_idx" , how="left" ) out_csv=os.path.join(OUT_DIR, "recommendations_by_group.csv" ) rec_df.to_csv(out_csv, index=False , encoding="utf-8-sig" ) print (f"[OK] 已保存推荐与指标:{out_csv} " ) return rec_dfdef main (): print ("[INFO] 读取 R 侧预测:" , PRED_CSV) df, _ = _load_predictions(PRED_CSV) sup_df, sup_cuts, sup_ok = _greedy_supervised(df, TARGET_K, MIN_LEAF, MIN_GAIN) print (f"[INFO] 监督分箱 cutpoints(BMI): {', ' .join(f'{c:.3 f} ' for c in sorted (sup_cuts))} " if sup_cuts else "[INFO] 单叶" ) print (_describe_groups(sup_df).to_string(index=False )) final_df = sup_df.copy() used_fallback = False if (sup_df["group_idx" ].nunique() < TARGET_K) or (not sup_ok): print ("[WARN] 监督分箱未达标,回退等频分箱" ) final_df, _ = _fallback_equal_frequency(df, k=TARGET_K) print (_describe_groups(final_df).to_string(index=False )) used_fallback = True final_df["group_label" ], _ = _group_labels_from_bmi(final_df) keep = ["patient_id" ,"BMI" ,"group_idx" ,"group_label" ,"target" ] + [c for c in ["pi_25" ,"pred_t95" ,"pred_t90" ] if c in df.columns] final_df[keep].sort_values(["group_idx" ,"BMI" ,"patient_id" ]).to_csv(os.path.join(OUT_DIR, "bmi_groups.csv" ), index=False , encoding="utf-8-sig" ) cuts_for_json = [] uniq_groups = sorted (final_df["group_idx" ].unique()) K = len (uniq_groups) if (not used_fallback) and (len (sup_cuts) == max (0 , K - 1 )): cuts_for_json = sorted (map (float , sup_cuts)) else : gstats = final_df.groupby("group_idx" ).agg(bmin=("BMI" ,"min" ), bmax=("BMI" ,"max" )).sort_index() est = [] for g in range (K - 1 ): left_max = float (gstats.loc[g, "bmax" ]) right_min = float (gstats.loc[g + 1 , "bmin" ]) if np.isfinite(left_max) and np.isfinite(right_min): est.append(0.5 * (left_max + right_min)) cuts_for_json = sorted (est) cuts_obj = {"chosen" : {"cuts_final" : cuts_for_json, "k" : int (K), "source" : ("supervised" if not used_fallback else "equal_freq" )}} with open (os.path.join(OUT_DIR, "bmi_supervised_bins_cuts.json" ), "w" , encoding="utf-8" ) as f: json.dump(cuts_obj, f, ensure_ascii=False , indent=2 ) print (f"[OK] 已写出分箱 cuts JSON: {os.path.join(OUT_DIR, 'bmi_supervised_bins_cuts.json' )} " ) if not os.path.exists(INTERVALS_CSV): print (f"[ERROR] 缺少 {INTERVALS_CSV} " ); sys.exit(1 ) intervals = pd.read_csv(INTERVALS_CSV) if not {"patient_id" ,"L" ,"R" ,"type" }.issubset(intervals.columns): print ("[ERROR] event_intervals.csv 缺列" ); sys.exit(1 ) left_censor = intervals.merge(final_df[["patient_id" ,"group_idx" ]], on="patient_id" , how="inner" ) \ .groupby("group_idx" )["type" ].apply(lambda s: np.mean(s.values=="left" )).reset_index() \ .rename(columns={"type" :"left_censor_rate" }) left_censor.to_csv(os.path.join(OUT_DIR,"left_censor_by_group.csv" ), index=False , encoding="utf-8-sig" ) aft_params_for_mi = _read_aft_params() if aft_params_for_mi is None : print ("[WARN] 未找到 AFT 参数或关闭了 AFT 条件插补,改用 Uniform MI。" ) km_summary, km_curves = _mi_km_summary_and_curves(intervals, final_df[["patient_id" ,"group_idx" ]], aft_params=aft_params_for_mi) aft_q = None if os.path.exists(AFT_PARAM_CSV): a = pd.read_csv(AFT_PARAM_CSV) if "patient_id" in 
a.columns: a2 = a.merge(final_df[["patient_id" ,"group_idx" ]], on="patient_id" , how="left" ) if "t95" in a2.columns: g = a2.groupby("group_idx" )["t95" ] aft_q = g.median().rename("AFT_t95_med" ).to_frame() aft_q["AFT_t95_q25" ] = g.quantile(0.25 ) aft_q["AFT_t95_q75" ] = g.quantile(0.75 ) aft_q = aft_q.reset_index().sort_values("group_idx" ) aft_curves = None if os.path.exists(AFT_PARAM_CSV): try : aft_df_full = pd.read_csv(AFT_PARAM_CSV) aft_curves = _aft_curves_from_params(final_df[["patient_id" ,"group_idx" ]], aft_df_full) except Exception as e: print (f"[WARN] 计算 AFT 曲线失败:{e} " ) aft_curves = None align_df = _alignment_metrics(km_curves, aft_curves, lo=ALIGN_LO, hi=ALIGN_HI) if aft_curves is not None else None if align_df is not None and not align_df.empty: print ("[OK] KM–AFT 对齐度已计算并并入 recommendations_by_group.csv" ) else : print ("[WARN] 未生成对齐度(缺少 AFT 曲线或无可比区间)" ) km_summary.to_csv(os.path.join(OUT_DIR,"km_quantiles_by_group.csv" ), index=False , encoding="utf-8-sig" ) _make_recommendations(km_summary, final_df, left_censor=left_censor, aft_q=aft_q, align_df=align_df)if __name__ == "__main__" : main()
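上述 main() 写出的 bmi_supervised_bins_cuts.json 是后续两份敏感性分析脚本(p3_fuzzy_interval_modeling.py 与 p3_noise_grouped_sensitivity_analysis.py)读取 BMI 分组切点的接口文件。下面给出一个最小的读取示意(假设该 JSON 位于当前工作目录;注释中的切点数值为虚构示例,并非实际运行结果):

import json
import numpy as np

# 文件结构与 main() 中的 cuts_obj 一致,例如(数值为虚构):
# {"chosen": {"cuts_final": [30.1, 33.8], "k": 3, "source": "supervised"}}
with open("bmi_supervised_bins_cuts.json", "r", encoding="utf-8") as f:
    chosen = json.load(f)["chosen"]
edges = [-np.inf] + list(chosen["cuts_final"]) + [np.inf]  # k 个 BMI 区间的边界
print(chosen["k"], chosen["source"], edges)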
p3_fuzzy_interval_modeling.py
import os, json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
try:
    from scipy.stats import lognorm, weibull_min, fisk
    SCIPY_OK = True
except Exception:
    SCIPY_OK = False

RAW_DATA_FILE = "C:/Users/yezf8/Documents/Y3Repo/25C题/男胎检测数据_filtered.csv"
OUTPUT_DIR = "C:/Users/yezf8/Documents/Y3Repo/25C题/problem3/sensitivity_analysis_outputs"
CUTS_FILE = "C:/Users/yezf8/Documents/Y3Repo/25C题/problem3/outputs_binning/bmi_supervised_bins_cuts.json"
BINS_DIR = os.path.dirname(CUTS_FILE)
GROUPS_CSV = os.path.join(BINS_DIR, "bmi_groups.csv")
DEFAULT_K = 3
AFT_PARAM_CSV = "C:/Users/yezf8/Documents/Y3Repo/25C题/problem3/outputs_joint_r/aft_params_by_patient.csv"
COL_PATIENT, COL_GA_DAYS, COL_Y_CONC, COL_BMI = "孕妇代码", "检测孕天数", "Y染色体浓度", "孕妇BMI"
CFFDNA_THRESHOLD = 0.04
FUZZY_LOWER_BOUND = 0.039
FUZZY_UPPER_BOUND = 0.041
os.makedirs(OUTPUT_DIR, exist_ok=True)


def safe_read_csv(path):
    try:
        return pd.read_csv(path, encoding="gbk")
    except Exception:
        return pd.read_csv(path, encoding="utf-8")


def get_event_data(df, threshold):
    rows = []
    for _, g in df.groupby(COL_PATIENT):
        g = g.sort_values(COL_GA_DAYS)
        weeks = g[COL_GA_DAYS].values / 7.0
        y = g[COL_Y_CONC].values
        hit_idx = np.where(y >= threshold)[0]
        if hit_idx.size > 0:
            i = int(hit_idx[0])
            duration = float(weeks[i])
            observed = True
        else:
            duration = float(weeks.max())
            observed = False
        rows.append({"patient_id": g[COL_PATIENT].iloc[0], "BMI": float(g[COL_BMI].iloc[0]),
                     "duration": duration, "observed": int(observed)})
    return pd.DataFrame(rows)


def _aft_dist_objs(row):
    if not SCIPY_OK:
        return None, None
    dist = str(row.get("dist", "")).strip().lower()
    if dist == "lognormal" and np.isfinite(row.get("mu", np.nan)) and np.isfinite(row.get("sigma", np.nan)) and float(row["sigma"]) > 0:
        s = float(row["sigma"]); sc = np.exp(float(row["mu"]))
        def cdf(x): return lognorm.cdf(np.maximum(x, 1e-9), s=s, scale=sc)
        def ppf(u): return lognorm.ppf(np.clip(u, 1e-12, 1 - 1e-12), s=s, scale=sc)
        return cdf, ppf
    if dist == "weibull" and np.isfinite(row.get("shape", np.nan)) and np.isfinite(row.get("scale", np.nan)) and float(row["shape"]) > 0 and float(row["scale"]) > 0:
        c = float(row["shape"]); sc = float(row["scale"])
        def cdf(x): return weibull_min.cdf(np.maximum(x, 1e-9), c=c, scale=sc)
        def ppf(u): return weibull_min.ppf(np.clip(u, 1e-12, 1 - 1e-12), c=c, scale=sc)
        return cdf, ppf
    if dist == "loglogistic" and np.isfinite(row.get("shape", np.nan)) and np.isfinite(row.get("scale", np.nan)) and float(row["shape"]) > 0 and float(row["scale"]) > 0:
        c = float(row["shape"]); sc = float(row["scale"])
        def cdf(x): return fisk.cdf(np.maximum(x, 1e-9), c=c, scale=sc)
        def ppf(u): return fisk.ppf(np.clip(u, 1e-12, 1 - 1e-12), c=c, scale=sc)
        return cdf, ppf
    return None, None


def get_fuzzy_event_data(df, lower_b, upper_b, aft_params=None):
    rows = []
    for _, g in df.groupby(COL_PATIENT):
        g = g.sort_values(COL_GA_DAYS)
        t = (g[COL_GA_DAYS].values / 7.0).astype(float)
        y = g[COL_Y_CONC].values.astype(float)
        above = np.where(y > upper_b)[0]
        if above.size > 0:
            j = int(above[0])
            R = float(t[j])
            L = float(t[j - 1]) if j - 1 >= 0 else 0.0
            rows.append({"patient_id": g[COL_PATIENT].iloc[0], "BMI": float(g[COL_BMI].iloc[0]),
                         "L": L, "R": R, "ctype": "interval"})
        else:
            rows.append({"patient_id": g[COL_PATIENT].iloc[0], "BMI": float(g[COL_BMI].iloc[0]),
                         "L": float(t.max()), "R": np.inf, "ctype": "right"})
    iv = pd.DataFrame(rows)
    if aft_params is not None and len(aft_params):
        iv = iv.merge(aft_params, on="patient_id", how="left")
    imputed = []
    rng = np.random.default_rng(114514)
    for _, r in iv.iterrows():
        if r["ctype"] == "interval" and np.isfinite(r["R"]):
            cdf, ppf = _aft_dist_objs(r) if ("dist" in iv.columns and pd.notna(r.get("dist", np.nan))) else (None, None)
            Lt, Rt = float(r["L"]), float(r["R"])
            if cdf and ppf:
                u_lo, u_hi = float(cdf(Lt)), float(cdf(Rt))
                if (not np.isfinite(u_lo)) or (not np.isfinite(u_hi)) or u_hi <= u_lo + 1e-12:
                    t = 0.5 * (Lt + Rt) if Rt > Lt else Lt
                else:
                    u = float(rng.uniform(u_lo, u_hi))
                    t = float(ppf(u))
            else:
                t = Lt if Rt <= Lt else float(rng.uniform(Lt, Rt))
            imputed.append({"patient_id": r["patient_id"], "BMI": r["BMI"], "duration": t, "observed": 1})
        else:
            imputed.append({"patient_id": r["patient_id"], "BMI": r["BMI"], "duration": r["L"], "observed": 0})
    return pd.DataFrame(imputed)


def get_q95_recommendation(df_grp):
    if df_grp.empty:
        return np.nan
    kmf = KaplanMeierFitter().fit(df_grp["duration"], event_observed=df_grp["observed"])
    sf = kmf.survival_function_.reset_index()
    hit = sf[sf["KM_estimate"] <= 0.05]
    return float(hit["timeline"].iloc[0]) if len(hit) > 0 else float(df_grp["duration"].max())


def resolve_bin_edges_and_labels(df_raw):
    try:
        with open(CUTS_FILE, "r", encoding="utf-8") as f:
            cuts_obj = json.load(f)["chosen"]
        cuts = cuts_obj.get("cuts_final", [])
        K = int(cuts_obj.get("k", len(cuts) + 1))
        edges = [-np.inf] + list(cuts) + [np.inf]
        labels = [f"Group {i+1}" for i in range(K)]
        print(f"[INFO] 采用 cuts JSON(K={K}):{cuts}")
        return edges, labels
    except FileNotFoundError:
        print(f"[WARN] 找不到 cuts JSON:{CUTS_FILE}")
    if os.path.exists(GROUPS_CSV):
        try:
            gdf = safe_read_csv(GROUPS_CSV)
            if {"BMI", "group_idx"}.issubset(gdf.columns):
                gstats = gdf.groupby("group_idx").agg(bmin=("BMI", "min"), bmax=("BMI", "max")).sort_index()
                est = []
                for g in range(gstats.shape[0] - 1):
                    left_max = float(gstats.iloc[g]["bmax"])
                    right_min = float(gstats.iloc[g + 1]["bmin"])
                    if np.isfinite(left_max) and np.isfinite(right_min):
                        est.append(0.5 * (left_max + right_min))
                K = gstats.shape[0]
                edges = [-np.inf] + est + [np.inf]
                labels = [f"Group {i+1}" for i in range(K)]
                print(f"[INFO] 采用 bmi_groups.csv 推回 cuts(K={K}):{[round(c, 3) for c in est]}")
                return edges, labels
        except Exception as e:
            print(f"[WARN] 解析 bmi_groups.csv 失败:{e}")
    bmi = pd.to_numeric(df_raw["孕妇BMI"], errors="coerce").dropna()
    if len(bmi) < DEFAULT_K:
        k = 2
    else:
        k = DEFAULT_K
    qs = np.linspace(0, 1, k + 1)[1:-1]
    cuts = sorted(bmi.quantile(qs).unique().tolist())
    edges = [-np.inf] + cuts + [np.inf]
    labels = [f"Group {i+1}" for i in range(k)]
    print(f"[INFO] 采用等频分箱(K={k}):{[round(c, 3) for c in cuts]}")
    return edges, labels


if __name__ == "__main__":
    df_raw = safe_read_csv(RAW_DATA_FILE)
    df_raw = df_raw[[COL_PATIENT, COL_GA_DAYS, COL_Y_CONC, COL_BMI]].dropna()
    for c in [COL_GA_DAYS, COL_Y_CONC, COL_BMI]:
        df_raw[c] = pd.to_numeric(df_raw[c], errors="coerce")
    df_raw = df_raw.dropna()

    aft_params = None
    if os.path.exists(AFT_PARAM_CSV):
        try:
            a = pd.read_csv(AFT_PARAM_CSV)
            keep = [c for c in ["patient_id", "dist", "mu", "sigma", "shape", "scale"] if c in a.columns]
            if "patient_id" in keep:
                aft_params = a[keep].copy()
                print(f"[INFO] Using AFT conditional imputation for fuzzy intervals (n={len(aft_params)})")
        except Exception as e:
            print(f"[WARN] Failed to read AFT params; fallback to Uniform imputation: {e}")

    bin_edges, labels = resolve_bin_edges_and_labels(df_raw)
    df_exact = get_event_data(df_raw, CFFDNA_THRESHOLD)
    df_fuzzy = get_fuzzy_event_data(df_raw, FUZZY_LOWER_BOUND, FUZZY_UPPER_BOUND, aft_params=aft_params)
    df_exact["group"] = pd.cut(df_exact["BMI"], bins=bin_edges, labels=labels)
    df_fuzzy["group"] = pd.cut(df_fuzzy["BMI"], bins=bin_edges, labels=labels)

    fig, axes = plt.subplots(len(labels), 1, figsize=(9, 3.0 * len(labels)), sharex=True)
    if len(labels) == 1:
        axes = [axes]
    fig.suptitle("Q3: KM curves by BMI groups (Exact 4% vs Fuzzy [3.9%, 4.1%])", fontsize=13)
    summary = []
    for i, glb in enumerate(labels):
        ax = axes[i]
        g0 = df_exact[df_exact["group"] == glb]
        g1 = df_fuzzy[df_fuzzy["group"] == glb]
        rec0 = np.nan; rec1 = np.nan
        if not g0.empty:
            KaplanMeierFitter().fit(g0["duration"], g0["observed"], label="Exact 4%").plot(ax=ax, ci_show=True)
            rec0 = get_q95_recommendation(g0)
        if not g1.empty:
            KaplanMeierFitter().fit(g1["duration"], g1["observed"], label="Fuzzy [3.9%, 4.1%]").plot(ax=ax, ci_show=True)
            rec1 = get_q95_recommendation(g1)
        ax.set_title(str(glb))
        ax.set_ylabel("Survival S(t)")
        ax.grid(True, ls="--", alpha=0.5)
        ax.legend()
        summary.append({"Group": glb, "t95_exact": rec0, "t95_fuzzy": rec1})
    axes[-1].set_xlabel("Gestational age (weeks)")

    out_png = os.path.join(OUTPUT_DIR, "fuzzy_interval_comparison_by_group.png")
    plt.tight_layout(rect=[0, 0.03, 1, 0.96])
    plt.savefig(out_png, dpi=150)
    print(f"[OK] 已保存对比图:{out_png}")
    out_csv = os.path.join(OUTPUT_DIR, "fuzzy_interval_summary_by_group.csv")
    pd.DataFrame(summary).to_csv(out_csv, index=False, encoding="utf-8-sig")
    print(f"[OK] 已保存汇总表:{out_csv}")
p3_noise_grouped_sensitivity_analysis.py
"""
问题三:分组敏感性分析(蒙特卡洛)
- 给 Y 测量值加入小幅高斯噪声,多次重复:
  1) 重新学习监督分箱(保序回归 + 决策树)得到 BMI cuts;
  2) 对每组做 KM,取 t95 作为推荐周数;
- 输出:每次实验的 cuts 与各组推荐的 CSV;并绘制小提琴图(cuts 与推荐)。
"""
import os, json, random, warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter, CoxPHFitter
from patsy import dmatrix
from sklearn.isotonic import IsotonicRegression
from sklearn.tree import DecisionTreeRegressor
try:
    from scipy.stats import lognorm, weibull_min, fisk
    SCIPY_OK = True
except Exception:
    SCIPY_OK = False
warnings.filterwarnings("ignore")

RAW_DATA_FILE = "C:/Users/yezf8/Documents/Y3Repo/25C题/男胎检测数据_filtered.csv"
OUTPUT_DIR = "C:/Users/yezf8/Documents/Y3Repo/25C题/problem3/sensitivity_analysis_outputs"
CUTS_FILE = "C:/Users/yezf8/Documents/Y3Repo/25C题/problem3/outputs_binning/bmi_supervised_bins_cuts.json"
AFT_PARAM_CSV = "C:/Users/yezf8/Documents/Y3Repo/25C题/problem3/outputs_joint_r/aft_params_by_patient.csv"
COL_PATIENT, COL_GA_DAYS, COL_Y_CONC, COL_BMI = "孕妇代码", "检测孕天数", "Y染色体浓度", "孕妇BMI"
N_BOOTSTRAPS = 50
NOISE_STD_DEV = 0.0005
SEED = 42
CFFDNA_THRESHOLD = 0.04
DETECTION_LOWER_BOUND = 6.0
MI_M = 10
P_MAIN = 0.95
MIN_SAMPLES_PER_BIN = 20
TREE_PENALTY = 0.05
os.makedirs(OUTPUT_DIR, exist_ok=True)


def set_seed(s):
    os.environ["PYTHONHASHSEED"] = str(s)
    random.seed(s)
    np.random.seed(s)


def safe_read_csv(path):
    try:
        return pd.read_csv(path, encoding="gbk")
    except Exception:
        return pd.read_csv(path, encoding="utf-8")


def build_patient_intervals(df):
    df2 = df.copy()
    df2["孕周"] = df2[COL_GA_DAYS] / 7.0
    df2["达标"] = df2[COL_Y_CONC] >= CFFDNA_THRESHOLD
    out = []
    for pid, g in df2.groupby(COL_PATIENT):
        g = g.sort_values("孕周")
        t = g["孕周"].values.astype(float)
        y = g["达标"].values.astype(bool)
        bmi = float(g[COL_BMI].iloc[0])
        pos = np.where(y)[0]
        if pos.size > 0:
            j = int(pos[0])
            R = float(t[j])
            if j > 0:
                L = float(t[:j][~y[:j]].max()) if (~y[:j]).any() else DETECTION_LOWER_BOUND
                ctype = "interval" if (~y[:j]).any() else "left"
            else:
                L = DETECTION_LOWER_BOUND; ctype = "left"
        else:
            L, R, ctype = float(t.max()), np.inf, "right"
        out.append({"patient_id": pid, "BMI": bmi, "L": L, "R": R, "ctype": ctype})
    return pd.DataFrame(out)


def sample_time_from_interval(L, R, lb):
    if not np.isfinite(R):
        return np.nan
    L_eff = float(L) if np.isfinite(L) else float(lb)
    return np.random.uniform(L_eff, float(R)) if R > L_eff else L_eff


def _aft_dist_objs(row):
    if not SCIPY_OK:
        return None, None
    dist = str(row.get("dist", "")).strip().lower()
    if dist == "lognormal" and np.isfinite(row.get("mu", np.nan)) and np.isfinite(row.get("sigma", np.nan)) and float(row["sigma"]) > 0:
        s = float(row["sigma"]); sc = np.exp(float(row["mu"]))
        def cdf(x): return lognorm.cdf(np.maximum(x, 1e-9), s=s, scale=sc)
        def ppf(u): return lognorm.ppf(np.clip(u, 1e-12, 1 - 1e-12), s=s, scale=sc)
        return cdf, ppf
    if dist == "weibull" and np.isfinite(row.get("shape", np.nan)) and np.isfinite(row.get("scale", np.nan)) and float(row["shape"]) > 0 and float(row["scale"]) > 0:
        c = float(row["shape"]); sc = float(row["scale"])
        def cdf(x): return weibull_min.cdf(np.maximum(x, 1e-9), c=c, scale=sc)
        def ppf(u): return weibull_min.ppf(np.clip(u, 1e-12, 1 - 1e-12), c=c, scale=sc)
        return cdf, ppf
    if dist == "loglogistic" and np.isfinite(row.get("shape", np.nan)) and np.isfinite(row.get("scale", np.nan)) and float(row["shape"]) > 0 and float(row["scale"]) > 0:
        c = float(row["shape"]); sc = float(row["scale"])
        def cdf(x): return fisk.cdf(np.maximum(x, 1e-9), c=c, scale=sc)
        def ppf(u): return fisk.ppf(np.clip(u, 1e-12, 1 - 1e-12), c=c, scale=sc)
        return cdf, ppf
    return None, None


def multiple_imputations(iv_df, M, lb):
    dfs = []
    rng = np.random.default_rng(SEED)
    for _ in range(M):
        rows = []
        for _, r in iv_df.iterrows():
            ctype = r["ctype"]
            if ctype in ("left", "interval"):
                if ("dist" in iv_df.columns) and pd.notna(r.get("dist", np.nan)):
                    cdf, ppf = _aft_dist_objs(r)
                else:
                    cdf, ppf = (None, None)
                if ctype == "left":
                    Rt = float(r["R"])
                    if cdf and ppf:
                        u_lo, u_hi = 0.0, float(cdf(Rt))
                        if not np.isfinite(u_hi) or u_hi <= 1e-12:
                            t = max(1e-6, Rt * 0.5)
                        else:
                            u = float(rng.uniform(u_lo, u_hi))
                            t = float(ppf(u))
                    else:
                        t = float(rng.uniform(1e-6, Rt))
                    rows.append({"patient_id": r["patient_id"], "BMI": r["BMI"], "time": t, "event": 1})
                else:
                    Lt, Rt = float(r["L"]), float(r["R"])
                    if cdf and ppf:
                        u_lo, u_hi = float(cdf(Lt)), float(cdf(Rt))
                        if (not np.isfinite(u_lo)) or (not np.isfinite(u_hi)) or u_hi <= u_lo + 1e-12:
                            t = 0.5 * (Lt + Rt) if Rt > Lt else Lt
                        else:
                            u = float(rng.uniform(u_lo, u_hi))
                            t = float(ppf(u))
                    else:
                        t = Lt if Rt <= Lt else float(rng.uniform(Lt, Rt))
                    rows.append({"patient_id": r["patient_id"], "BMI": r["BMI"], "time": t, "event": 1})
            else:
                rows.append({"patient_id": r["patient_id"], "BMI": r["BMI"], "time": float(r["L"]), "event": 0})
        d = pd.DataFrame(rows).dropna()
        d["event"] = d["event"].astype(int)
        dfs.append(d)
    return dfs


def get_cox_predictions(imputed_sets, p_list):
    agg = None
    for m, d in enumerate(imputed_sets):
        if len(d) < 20:
            continue
        bmi_c = d["BMI"].mean()
        X = dmatrix("bs(BMIc, df=4, degree=3, include_intercept=False)",
                    {"BMIc": (d["BMI"] - bmi_c).values}, return_type="dataframe")
        cph = CoxPHFitter(penalizer=TREE_PENALTY)
        try:
            cph.fit(pd.concat([d[["time", "event"]].reset_index(drop=True), X.reset_index(drop=True)], axis=1),
                    duration_col="time", event_col="event", robust=True, show_progress=False)
        except Exception:
            continue
        grid = np.linspace(DETECTION_LOWER_BOUND, 35.0, 200)
        S = cph.predict_survival_function(X, times=grid)
        dfp = d[["patient_id", "BMI"]].copy()
        for p in p_list:
            target = 1.0 - p
            t_pred = []
            for col in S.columns:
                s = S[col].values
                idx = np.where(s <= target)[0]
                t_pred.append(float(grid[idx[0]]) if idx.size > 0 else np.nan)
            dfp[f"pred_t{int(p*100)}"] = t_pred
        if agg is None:
            agg = dfp
        else:
            for p in p_list:
                agg = agg.merge(dfp[["patient_id", f"pred_t{int(p*100)}"]].rename(
                    columns={f"pred_t{int(p*100)}": f"pred_t{int(p*100)}_{m}"}), on="patient_id", how="left")
    if agg is None:
        return None
    for p in p_list:
        cols = [c for c in agg.columns if c.startswith(f"pred_t{int(p*100)}")]
        agg[f"pred_t{int(p*100)}_final"] = agg[cols].median(axis=1)
    keep = ["patient_id", "BMI"] + [f"pred_t{int(p*100)}_final" for p in p_list]
    return agg[keep]


def get_supervised_cuts(df_pred, y_col, n_bins, min_leaf):
    d = df_pred.dropna(subset=["BMI", y_col]).copy()
    if len(d) < n_bins * min_leaf:
        return []
    iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
    y_mono = iso.fit_transform(d["BMI"].values, d[y_col].values)
    tree = DecisionTreeRegressor(max_leaf_nodes=n_bins, min_samples_leaf=min_leaf, random_state=SEED)
    tree.fit(d[["BMI"]].values, y_mono)
    cuts = sorted([t for t in tree.tree_.threshold if t != -2.0])
    cuts = [float(c) for c in cuts if np.isfinite(c)]
    return cuts


def get_group_recommendations(df_with_groups):
    recs = []
    for gid, g in df_with_groups.groupby("group"):
        if len(g) < 10:
            recs.append(np.nan); continue
        kmf = KaplanMeierFitter().fit(g["time"], g["event"])
        sf = kmf.survival_function_.reset_index()
        hit = sf[sf["KM_estimate"] <= (1.0 - P_MAIN)]
        recs.append(float(hit["timeline"].iloc[0]) if len(hit) > 0 else float(g["time"].max()))
    return recs


def run_one(df_raw, noise_std, req_bins, min_leaf, aft_params=None):
    d = df_raw.copy()
    d[COL_Y_CONC] = d[COL_Y_CONC] + np.random.normal(0, noise_std, size=len(d))
    iv = build_patient_intervals(d)
    if aft_params is not None and len(aft_params):
        iv = iv.merge(aft_params, on="patient_id", how="left")
    imps = multiple_imputations(iv, MI_M, DETECTION_LOWER_BOUND)
    if not imps:
        return None
    df_pred = get_cox_predictions(imps, [P_MAIN])
    if df_pred is None:
        return None
    y_col = f"pred_t{int(P_MAIN*100)}_final"
    cuts = get_supervised_cuts(df_pred, y_col, req_bins, min_leaf)
    if len(cuts) != req_bins - 1:
        return None
    edges = [-np.inf] + cuts + [np.inf]
    dm = imps[0].copy()
    dm["group"] = pd.cut(dm["BMI"], bins=edges, labels=range(req_bins))
    recs = get_group_recommendations(dm)
    if len(recs) != req_bins:
        return None
    return {"cuts": cuts, "recs": recs}


if __name__ == "__main__":
    set_seed(SEED)
    try:
        with open(CUTS_FILE, "r", encoding="utf-8") as f:
            cuts_obj = json.load(f)["chosen"]
        REQUIRED_BINS = int(cuts_obj.get("k", len(cuts_obj.get("cuts_final", [])) + 1))
    except Exception as e:
        print(f"[WARN] 读取 cuts 失败,将使用 3 组默认: {e}")
        REQUIRED_BINS = 3

    df_raw = safe_read_csv(RAW_DATA_FILE)
    df_raw = df_raw[[COL_PATIENT, COL_GA_DAYS, COL_Y_CONC, COL_BMI]].dropna()
    for c in [COL_GA_DAYS, COL_Y_CONC, COL_BMI]:
        df_raw[c] = pd.to_numeric(df_raw[c], errors="coerce")
    df_raw = df_raw.dropna()

    aft_params = None
    if os.path.exists(AFT_PARAM_CSV):
        try:
            a = pd.read_csv(AFT_PARAM_CSV)
            keep = [c for c in ["patient_id", "dist", "mu", "sigma", "shape", "scale"] if c in a.columns]
            if "patient_id" in keep:
                aft_params = a[keep].copy()
                print(f"[INFO] Loaded AFT params for MI: n={len(aft_params)}")
        except Exception as e:
            print(f"[WARN] Failed to read AFT params, fallback to Uniform MI: {e}")

    runs = []
    print(f"[INFO] 开始蒙特卡洛:N={N_BOOTSTRAPS}, 组数={REQUIRED_BINS}, 噪声σ={NOISE_STD_DEV}")
    ok = 0
    for i in range(N_BOOTSTRAPS):
        print(f" - 运行 {i+1}/{N_BOOTSTRAPS} ...")
        res = run_one(df_raw, NOISE_STD_DEV, REQUIRED_BINS, MIN_SAMPLES_PER_BIN, aft_params=aft_params)
        if res is not None:
            runs.append(res); ok += 1
    print(f"[INFO] 完成:有效实验 {ok}/{N_BOOTSTRAPS}")
    if not runs:
        print("[ERROR] 无有效结果,参数可能过严。")
        raise SystemExit(0)

    df_cuts = pd.DataFrame([r["cuts"] for r in runs], columns=[f"Cut_{i+1}" for i in range(REQUIRED_BINS - 1)])
    df_recs = pd.DataFrame([r["recs"] for r in runs], columns=[f"Group_{i+1}_Rec" for i in range(REQUIRED_BINS)])
    out_csv = os.path.join(OUTPUT_DIR, f"grouped_sensitivity_results_{REQUIRED_BINS}g.csv")
    pd.concat([df_cuts, df_recs], axis=1).to_csv(out_csv, index=False, encoding="utf-8-sig")
    print(f"[OK] 已保存模拟结果:{out_csv}")

    fig1, ax1 = plt.subplots(figsize=(9, 5.2))
    parts = ax1.violinplot([df_cuts[c].dropna().values for c in df_cuts.columns],
                           positions=np.arange(1, len(df_cuts.columns) + 1), showextrema=False)
    pastel_colors = ["#FFD1DC", "#C1E1C1", "#FDFD96", "#AEC6CF", "#FFB347", "#E6E6FA", "#B5EAD7", "#FFDAC1", "#C7CEEA"]
    for i, b in enumerate(parts["bodies"]):
        b.set_facecolor(pastel_colors[i % len(pastel_colors)])
        b.set_edgecolor("white")
        b.set_alpha(0.9)
    ax1.set_xticks(np.arange(1, len(df_cuts.columns) + 1))
    ax1.set_xticklabels(df_cuts.columns)
    ax1.set_ylabel("BMI cut value")
    ax1.set_title("Q3: Distribution of BMI cutpoints under noise (violin plot)")
    ax1.grid(True, ls="--", alpha=0.5)
    out_png1 = os.path.join(OUTPUT_DIR, f"violin_cuts_{REQUIRED_BINS}g.png")
    plt.tight_layout(); plt.savefig(out_png1, dpi=150)
    print(f"[OK] 已保存小提琴图(cuts):{out_png1}")

    fig2, ax2 = plt.subplots(figsize=(9, 5.2))
    parts2 = ax2.violinplot([df_recs[c].dropna().values for c in df_recs.columns],
                            positions=np.arange(1, len(df_recs.columns) + 1), showextrema=False)
    for i, b in enumerate(parts2["bodies"]):
        b.set_facecolor(pastel_colors[i % len(pastel_colors)])
        b.set_edgecolor("white")
        b.set_alpha(0.9)
    ax2.set_xticks(np.arange(1, len(df_recs.columns) + 1))
    ax2.set_xticklabels(df_recs.columns)
    ax2.set_ylabel("Recommended week (t95, KM)")
    ax2.set_title("Q3: Distribution of recommended weeks (t95, KM) under noise (violin plot)")
    ax2.grid(True, ls="--", alpha=0.5)
    out_png2 = os.path.join(OUTPUT_DIR, f"violin_recs_{REQUIRED_BINS}g.png")
    plt.tight_layout(); plt.savefig(out_png2, dpi=150)
    print(f"[OK] 已保存小提琴图(推荐):{out_png2}")
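上述脚本将每次加噪实验得到的切点与各组推荐周数写入 grouped_sensitivity_results_{K}g.csv(列名为 Cut_i 与 Group_i_Rec)。下面给出一个读取该结果并量化稳健性的最小示意(假设 K=3,文件位于当前工作目录;数值依赖实际运行结果):

import pandas as pd

# 各切点与各组推荐周数在多次加噪实验中的均值、标准差与四分位数,
# 标准差与四分位距越小,说明分组切点与推荐时点对测量误差越不敏感
res = pd.read_csv("grouped_sensitivity_results_3g.csv")
summary = res.describe().T[["mean", "std", "25%", "75%"]]
print(summary)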
p4_automl_ensemble_tuning.py
import pandas as pd
import numpy as np
import xgboost as xgb
import optuna
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV
import matplotlib.pyplot as plt
import seaborn as sns
import os
import time

plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

X_GLOBAL = None
Y_GLOBAL = None
N_SPLITS_CV = 5


def save_best_results_callback(study, trial):
    """Callback to save the results of the best trial."""
    if study.best_trial == trial:
        best_params = trial.params
        best_w = best_params['ensemble_w']
        best_threshold = best_params['threshold']
        oof_xgb_proba = np.array(trial.user_attrs['oof_xgb_proba'])
        oof_svm_proba = np.array(trial.user_attrs['oof_svm_proba'])
        ensemble_proba = best_w * oof_xgb_proba + (1 - best_w) * oof_svm_proba
        y_pred = (ensemble_proba > best_threshold).astype(int)
        report_dict = classification_report(Y_GLOBAL, y_pred, target_names=['Normal', 'Abnormal'],
                                            output_dict=True, zero_division=0)
        cm = confusion_matrix(Y_GLOBAL, y_pred)
        study.set_user_attr('best_trial_results', {
            'report_dict': report_dict,
            'confusion_matrix': cm.tolist()
        })


def objective(trial):
    """Optuna objective function to minimize clinical cost."""
    xgb_params = {
        'objective': 'binary:logistic',
        'eval_metric': 'logloss',
        'use_label_encoder': False,
        'scale_pos_weight': (Y_GLOBAL == 0).sum() / (Y_GLOBAL == 1).sum(),
        'random_state': 42,
        'n_estimators': trial.suggest_int('xgb_n_estimators', 100, 500),
        'max_depth': trial.suggest_int('xgb_max_depth', 3, 8),
        'learning_rate': trial.suggest_float('xgb_learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('xgb_subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('xgb_colsample_bytree', 0.6, 1.0),
        'gamma': trial.suggest_float('xgb_gamma', 0, 0.5)
    }
    svm_params = {
        'kernel': 'rbf',
        'C': trial.suggest_float('svm_C', 1e-2, 1e3, log=True),
        'gamma': trial.suggest_float('svm_gamma', 1e-4, 1e-1, log=True),
        'probability': False,
        'random_state': 42
    }
    ensemble_w = trial.suggest_float('ensemble_w', 0, 1)
    threshold = trial.suggest_float('threshold', 0.01, 0.99)

    skf = StratifiedKFold(n_splits=N_SPLITS_CV, shuffle=True, random_state=42)
    oof_xgb_proba = np.zeros(len(X_GLOBAL))
    oof_svm_proba = np.zeros(len(X_GLOBAL))
    for train_idx, val_idx in skf.split(X_GLOBAL, Y_GLOBAL):
        X_train, X_val = X_GLOBAL.iloc[train_idx], X_GLOBAL.iloc[val_idx]
        y_train, y_val = Y_GLOBAL.iloc[train_idx], Y_GLOBAL.iloc[val_idx]
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_val_scaled = scaler.transform(X_val)
        xgb_model = xgb.XGBClassifier(**xgb_params)
        xgb_model.fit(X_train, y_train)
        oof_xgb_proba[val_idx] = xgb_model.predict_proba(X_val)[:, 1]
        svc_model = SVC(**svm_params)
        calibrated_svc = CalibratedClassifierCV(svc_model, method='isotonic', cv=3)
        calibrated_svc.fit(X_train_scaled, y_train)
        oof_svm_proba[val_idx] = calibrated_svc.predict_proba(X_val_scaled)[:, 1]

    ensemble_proba = ensemble_w * oof_xgb_proba + (1 - ensemble_w) * oof_svm_proba
    y_pred = (ensemble_proba > threshold).astype(int)
    auc = roc_auc_score(Y_GLOBAL, ensemble_proba)
    trial.set_user_attr('auc', auc)
    trial.set_user_attr('oof_xgb_proba', oof_xgb_proba.tolist())
    trial.set_user_attr('oof_svm_proba', oof_svm_proba.tolist())
    cm = confusion_matrix(Y_GLOBAL, y_pred)
    FN = cm[1, 0]
    FP = cm[0, 1]
    cost = 15 * FN + 1 * FP
    return cost


def run_automl_tuning(file_path, output_dir):
    global X_GLOBAL, Y_GLOBAL
    start_time = time.time()
    print("Starting Final AutoML Ensemble Tuning Process...")
    df = pd.read_csv(file_path)
    rename_dict = {
        '检测孕天数': 'gestational_week', '年龄': 'age', '孕妇BMI': 'bmi',
        '在参考基因组上比对的比例': 'alignment_ratio', '重复读段的比例': 'duplication_ratio',
        '唯一比对的读段数': 'unique_reads', 'GC含量': 'gc_content',
        '13号染色体的Z值': 'z_score_13', '18号染色体的Z值': 'z_score_18',
        '21号染色体的Z值': 'z_score_21', 'X染色体的Z值': 'z_score_x',
        'X染色体浓度': 'x_concentration'
    }
    df.rename(columns=rename_dict, inplace=True)
    print(f"Original sample count: {len(df)}")
    df = df[df['z_score_x'].abs() < 2.5].reset_index(drop=True)
    print(f"Sample count after filtering (abs(z_score_x) < 2.5): {len(df)}")
    df['abnormal'] = df['染色体的非整倍体'].notna().astype(int)
    if 'bmi' in df.columns and df['bmi'].isnull().any():
        df['bmi'].fillna(df['bmi'].median(), inplace=True)
    df['z21_x_ff'] = df['z_score_21'] * df['x_concentration']
    df['z18_x_ff'] = df['z_score_18'] * df['x_concentration']
    df['z13_x_ff'] = df['z_score_13'] * df['x_concentration']
    bins = [-np.inf, 2.5, 3, np.inf]
    labels = ['Normal_ZX', 'Borderline_ZX', 'Abnormal_ZX']
    df['z_score_x_binned'] = pd.cut(abs(df['z_score_x']), bins=bins, labels=labels, right=False)
    ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    zx_binned_encoded = ohe.fit_transform(df[['z_score_x_binned']])
    zx_binned_df = pd.DataFrame(zx_binned_encoded, columns=ohe.get_feature_names_out(['z_score_x_binned']))
    df = pd.concat([df.reset_index(drop=True), zx_binned_df], axis=1)
    feature_cols = [
        'age', 'gestational_week', 'bmi', 'alignment_ratio', 'duplication_ratio',
        'unique_reads', 'gc_content', 'z_score_13', 'z_score_18', 'z_score_21',
        'z_score_x', 'x_concentration', 'z21_x_ff', 'z18_x_ff', 'z13_x_ff'
    ]
    feature_cols.extend(ohe.get_feature_names_out(['z_score_x_binned']))
    X_GLOBAL = df[feature_cols]
    Y_GLOBAL = df['abnormal']

    study = optuna.create_study(direction='minimize')
    study.optimize(objective, n_trials=100, timeout=1200, callbacks=[save_best_results_callback])
    print(f"Best trial found with cost: {study.best_value}")
    print(f"Best parameters: {study.best_params}")
    print("Generating report from the best trial...")
    best_results = study.user_attrs.get('best_trial_results')
    if not best_results:
        print("Could not find saved results for the best trial. Exiting.")
        return
    report_dict = best_results['report_dict']
    cm = np.array(best_results['confusion_matrix'])
    report_df = pd.DataFrame(report_dict).transpose()
    cm_df = pd.DataFrame(cm, index=['Actual Normal', 'Actual Abnormal'],
                         columns=['Predicted Normal', 'Predicted Abnormal'])
    os.makedirs(output_dir, exist_ok=True)
    report_df.to_csv(os.path.join(output_dir, 'classification_report.csv'))
    cm_df.to_csv(os.path.join(output_dir, 'confusion_matrix.csv'))
    best_trial_auc = study.best_trial.user_attrs.get('auc', 'N/A')

    report_path = os.path.join(output_dir, 'automl_ensemble_report.md')
    with open(report_path, 'w', encoding='utf-8') as f:
        f.write("# AutoML 集成模型优化报告 (代价函数: 15*FN + 1*FP)\n\n")
        f.write("## 1. 概述\n")
        f.write("在加载数据后,首先排除了所有`abs(z_score_x) >= 2.5`的样本。后续所有的模型训练、优化和评估都在这个高置信度的样本子集上进行。\n")
        f.write("使用Optuna库对XGBoost和SVM的集成模型进行端到端的超参数优化。优化的目标是最小化临床代价函数:`Cost = 15 * FN + 1 * FP`。\n")
        f.write(f"总共执行了 **{len(study.trials)}** 次试验。报告中的所有指标均来自代价最小的那一次特定试验。\n\n")
        f.write("## 2. 优化结果\n")
        f.write(f"**最小临床代价**: {study.best_value:.4f}\n")
        f.write(f"**最佳试验对应的AUC**: {best_trial_auc:.4f}\n\n")
        f.write("### 找到的最佳超参数组合:\n")
        for key, value in study.best_params.items():
            f.write(f"- **{key}**: `{value}`\n")
        f.write("\n## 3. 最佳试验的性能详情\n")
        f.write("### 分类报告\n")
        f.write(report_df.to_markdown() + "\n\n")
        f.write("### 混淆矩阵 (计数)\n")
        f.write(cm_df.to_markdown() + "\n\n")
        f.write("\n")

    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Normal', 'Abnormal'], yticklabels=['Normal', 'Abnormal'])
    plt.title('Confusion Matrix of The Best Trial')
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.tight_layout()
    plt.savefig(os.path.join(output_dir, 'confusion_matrix_counts.png'))
    plt.close()
    print(f"AutoML tuning finished in {time.time() - start_time:.2f} seconds.")
    print(f"Final results saved to folder: {output_dir}")


if __name__ == '__main__':
    CWD = 'C:\\Users\\yezf8\\Documents\\Y3Repo\\25C题'
    DATA_FILE = os.path.join(CWD, '女胎检测数据_filtered.csv')
    OUTPUT_DIR = os.path.join(CWD, 'problem4', 'results_automl_ensemble')
    run_automl_tuning(DATA_FILE, OUTPUT_DIR)
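为便于理解 objective() 中非对称临床代价(漏诊代价为误报的 15 倍)的计算方式,下面给出一个最小的示意;其中的标签与预测均为虚构数据,仅用于演示同一代价公式:

import numpy as np
from sklearn.metrics import confusion_matrix

# 虚构的真实标签与预测标签,仅作演示
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])
cm = confusion_matrix(y_true, y_pred)   # 行为真实类别,列为预测类别
FN, FP = cm[1, 0], cm[0, 1]             # 本例中漏诊 1 例、误报 1 例
cost = 15 * FN + 1 * FP                 # 与 objective() 中的代价函数一致
print(FN, FP, cost)                     # 输出: 1 1 16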
p4_shap_analysis.py
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
import seaborn as sns
import os
import shap
import time

plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False


def run_shap_analysis(file_path, output_dir):
    start_time = time.time()
    print("Starting SHAP analysis on the entire sample set (this may take a very long time)...")
    df = pd.read_csv(file_path)
    rename_dict = {
        '检测孕天数': 'gestational_week', '年龄': 'age', '孕妇BMI': 'bmi',
        '在参考基因组上比对的比例': 'alignment_ratio', '重复读段的比例': 'duplication_ratio',
        '唯一比对的读段数': 'unique_reads', 'GC含量': 'gc_content',
        '13号染色体的Z值': 'z_score_13', '18号染色体的Z值': 'z_score_18',
        '21号染色体的Z值': 'z_score_21', 'X染色体的Z值': 'z_score_x',
        'X染色体浓度': 'x_concentration'
    }
    df.rename(columns=rename_dict, inplace=True)
    original_sample_count = len(df)
    df = df[df['z_score_x'].abs() < 2.5].reset_index(drop=True)
    print(f"Original sample count: {original_sample_count}")
    print(f"Sample count after filtering (abs(z_score_x) < 2.5): {len(df)}")
    df['abnormal'] = df['染色体的非整倍体'].notna().astype(int)
    if 'bmi' in df.columns and df['bmi'].isnull().any():
        df['bmi'].fillna(df['bmi'].median(), inplace=True)
    df['z21_x_ff'] = df['z_score_21'] * df['x_concentration']
    df['z18_x_ff'] = df['z_score_18'] * df['x_concentration']
    df['z13_x_ff'] = df['z_score_13'] * df['x_concentration']
    bins = [-np.inf, 2.5, 3, np.inf]
    labels = ['Normal_ZX', 'Borderline_ZX', 'Abnormal_ZX']
    df['z_score_x_binned'] = pd.cut(abs(df['z_score_x']), bins=bins, labels=labels, right=False)
    ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    zx_binned_encoded = ohe.fit_transform(df[['z_score_x_binned']])
    zx_binned_df = pd.DataFrame(zx_binned_encoded, columns=ohe.get_feature_names_out(['z_score_x_binned']))
    df = pd.concat([df.reset_index(drop=True), zx_binned_df], axis=1)
    feature_cols = [
        'age', 'gestational_week', 'bmi', 'alignment_ratio', 'duplication_ratio',
        'unique_reads', 'gc_content', 'z_score_13', 'z_score_18', 'z_score_21',
        'z_score_x', 'x_concentration', 'z21_x_ff', 'z18_x_ff', 'z13_x_ff'
    ]
    feature_cols.extend(ohe.get_feature_names_out(['z_score_x_binned']))
    X = df[feature_cols]
    y = df['abnormal']

    best_params = {
        'xgb_n_estimators': 479, 'xgb_max_depth': 6, 'xgb_learning_rate': 0.010359914293101614,
        'xgb_subsample': 0.7500268146879681, 'xgb_colsample_bytree': 0.9383752680015073,
        'xgb_gamma': 0.24133743945925995, 'svm_C': 1.0458842497938852,
        'svm_gamma': 0.013559691602731587, 'ensemble_w': 0.876834598235878,
        'threshold': 0.1584494525854912
    }
    best_xgb_params = {k.replace('xgb_', ''): v for k, v in best_params.items() if k.startswith('xgb_')}
    best_xgb_params.update({
        'objective': 'binary:logistic', 'eval_metric': 'logloss', 'use_label_encoder': False,
        'scale_pos_weight': (y == 0).sum() / (y == 1).sum(), 'random_state': 42
    })
    best_svm_params = {k.replace('svm_', ''): v for k, v in best_params.items() if k.startswith('svm_')}
    best_svm_params.update({'kernel': 'rbf', 'probability': False, 'random_state': 42})
    best_w = best_params['ensemble_w']

    xgb_model = xgb.XGBClassifier(**best_xgb_params)
    xgb_model.fit(X, y)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    svc_model = SVC(**best_svm_params)
    calibrated_svc = CalibratedClassifierCV(svc_model, method='isotonic', cv=5)
    calibrated_svc.fit(X_scaled, y)

    def ensemble_predict_proba(X_input):
        if not isinstance(X_input, pd.DataFrame):
            X_input_df = pd.DataFrame(X_input, columns=X.columns)
        else:
            X_input_df = X_input
        xgb_proba = xgb_model.predict_proba(X_input_df)[:, 1]
        svm_proba = calibrated_svc.predict_proba(scaler.transform(X_input_df))[:, 1]
        return best_w * xgb_proba + (1 - best_w) * svm_proba

    print("Calculating SHAP values for the ensemble (this will be very slow)...")
    background_data = shap.sample(X, 50)
    explainer = shap.KernelExplainer(ensemble_predict_proba, background_data)
    shap_values = explainer.shap_values(X)

    os.makedirs(output_dir, exist_ok=True)
    print("Generating SHAP summary plot (beeswarm)...")
    shap.summary_plot(shap_values, X, show=False)
    plt.savefig(os.path.join(output_dir, 'shap_summary_beeswarm.png'), bbox_inches='tight')
    plt.close()
    print("Generating SHAP feature importance bar plot...")
    shap.summary_plot(shap_values, X, plot_type="bar", show=False)
    plt.savefig(os.path.join(output_dir, 'shap_feature_importance_bar.png'), bbox_inches='tight')
    plt.close()
    print(f"SHAP analysis finished in {time.time() - start_time:.2f} seconds.")
    print(f"SHAP plots saved to folder: {output_dir}")


if __name__ == '__main__':
    CWD = 'C:\\Users\\yezf8\\Documents\\Y3Repo\\25C题'
    DATA_FILE = os.path.join(CWD, '女胎检测数据_filtered.csv')
    OUTPUT_DIR = os.path.join(CWD, 'problem4', 'results_automl_ensemble')
    run_shap_analysis(DATA_FILE, OUTPUT_DIR)