TeachCLIP Data Flow Visualization: The Video Feature Aggregation Process

1. Input: Frame Features Extracted by ViT

Shape: [B=1, N, D]

These are the raw features output by the CLIP Visual Encoder (ViT). Each row represents one frame of the video.
Corresponding code: visual_output = self.clip.encode_image(video)

2. Temporal Modeling (Transformer x 4 Layers)

Shape: [B, N, D] ↔ [N, B, D]

The Transformer class in module_cross.py handles the temporal interaction between frames.
Corresponding code:

# Sequential type: Time Transformer Encoder
seq_length = visual_output.size(1)
position_ids = torch.arange(seq_length, dtype=torch.long, device=visual_output.device)
position_ids = position_ids.unsqueeze(0).expand(visual_output.size(0), -1)
frame_position_embeddings = self.frame_position_embeddings(position_ids)
visual_output = visual_output + frame_position_embeddings

# Additive attention mask: 0 for valid frames, -1e6 for padded frames
extended_video_mask = (1.0 - video_mask.unsqueeze(1)) * -1000000.0
extended_video_mask = extended_video_mask.expand(-1, video_mask.size(1), -1)
visual_output = visual_output.permute(1, 0, 2)  # NLD -> LND
visual_output = self.transformerClip(visual_output, extended_video_mask)
visual_output = visual_output.permute(1, 0, 2)  # LND -> NLD
# Residual connection; visual_output_original is assumed to be saved earlier (its assignment is not shown in this excerpt)
visual_output = visual_output + visual_output_original
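
To make the mask construction concrete, here is a minimal sketch with toy sizes. The helper name build_additive_mask, the variable toy_video_mask, and the sizes B=1, N=4 are assumptions for illustration only and are not part of the TeachCLIP code. It shows how a 0/1 frame mask of shape [B, N] becomes an additive attention mask of shape [B, N, N].

```python
import torch


def build_additive_mask(video_mask: torch.Tensor) -> torch.Tensor:
    """Turn a 0/1 frame mask [B, N] into an additive attention mask [B, N, N]."""
    # Valid frames (1) map to 0; padded frames (0) map to a large negative value,
    # so they receive almost no attention once the mask is added to the logits.
    extended = (1.0 - video_mask.unsqueeze(1)) * -1000000.0  # [B, 1, N]
    # Repeat the same row for every query frame.
    return extended.expand(-1, video_mask.size(1), -1)  # [B, N, N]


# Hypothetical toy input: one video with 4 frame slots, the last one is padding.
toy_video_mask = torch.tensor([[1.0, 1.0, 1.0, 0.0]])
mask = build_additive_mask(toy_video_mask)
print(mask.shape)  # torch.Size([1, 4, 4])
```

Every query row carries the same penalty on the padded column, which is why a single [B, 1, N] tensor can simply be expanded.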

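Note: according to the configuration file (msrvtt-7k.yaml), 4 Transformer blocks are stacked here.
Steps within a single layer: Pre-LN → Self-Attention → Residual → Pre-LN → MLP (QuickGELU) → Residual.
Initialization trick: to keep activations from blowing up in a deep stack, the output projection weights of the Attention and MLP sub-layers are scaled by $1/\sqrt{2L}$ at initialization (about 0.35 for L = 4).
The panels below show the internals of layer 1, followed by the final output after all 4 layers.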
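
As a quick arithmetic check of the scaling factor quoted above, the following sketch evaluates $1/\sqrt{2L}$ for L = 4. The function name output_proj_scale is a hypothetical helper, not something from the TeachCLIP code.

```python
import math


def output_proj_scale(num_layers: int) -> float:
    """Scaling factor 1 / sqrt(2L) for a stack of L Transformer layers."""
    return 1.0 / math.sqrt(2 * num_layers)


scale = output_proj_scale(4)
print(round(scale, 4))  # 0.3536
```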

2.1 Positional Encoding

2.2 Layer 1: Self-Attention Details

2.3 Layer 1: FFN (Feed-Forward Network) & Residual

Architecture note (Pre-LN): this model uses the Pre-LN structure, which differs from the "Add & Norm" (Post-LN) arrangement of the original Transformer.
Flow: Input $\xrightarrow{Norm}$ FFN $\xrightarrow{Add}$ Output
That is: $x = x + \text{FFN}(\text{LayerNorm}(x))$
The MLP (Multi-Layer Perceptron) here is the FFN part of a standard Transformer.
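
The Pre-LN ordering can be sketched as a small PyTorch module. This is an illustrative reconstruction under assumptions, not the class from module_cross.py: the names PreLNBlock and QuickGELU, the 4x hidden expansion in the MLP, and the toy sizes (D=8, 2 heads) are all chosen for the example, and the attention mask is omitted for brevity.

```python
import torch
from torch import nn


class QuickGELU(nn.Module):
    """Sigmoid-based GELU approximation: x * sigmoid(1.702 * x)."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(1.702 * x)


class PreLNBlock(nn.Module):
    """One Pre-LN Transformer block: LayerNorm is applied before each sub-layer."""

    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head)
        self.ln_2 = nn.LayerNorm(d_model)
        # Assumed 4x hidden expansion; the real width comes from the model config.
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            QuickGELU(),
            nn.Linear(d_model * 4, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape [N, B, D] (sequence first), matching the permute in section 2.
        h = self.ln_1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                 # residual around self-attention
        x = x + self.mlp(self.ln_2(x))   # x = x + FFN(LayerNorm(x))
        return x


block = PreLNBlock(d_model=8, n_head=2)
x = torch.randn(4, 1, 8)  # N=4 frames, B=1 video, D=8 channels
y = block(x)
print(y.shape)  # torch.Size([4, 1, 8])
```

Because the normalization sits inside each residual branch, the skip path carries x through unchanged, which is the property that distinguishes Pre-LN from Post-LN.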

2.4 Layers 2-4: Repeated Processing (Final Output)

The data then passes through 3 more Transformer blocks of the same structure...

3. Frame Weight Prediction (Frame Weight Predict Module V2)

Shape: [B, N, 1]

Starting from the Transformer output features, an MLP predicts an importance weight for each frame.
Corresponding code:

# Frame Weight Predict Module V2
Frameweight = self.frameLinear(visual_output)
Frameweight = torch.nn.functional.relu(Frameweight)
Frameweight = self.frameLinear2(Frameweight)

Formula: $W_{raw} = \text{Linear}_2(\text{ReLU}(\text{Linear}_1(V_{trans})))$
where:

$V_{trans}$: features output by the Transformer (Shape: [N, D] per video)
$\text{Linear}_1$: linear layer (D → D)
$\text{ReLU}$: activation function $\max(0, x)$
$\text{Linear}_2$: linear layer (D → 1), which outputs the raw weight score of each frame
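
The two-layer predictor described above can be sketched as a standalone module. The class name FrameWeightPredictor and the toy sizes are assumptions for illustration; only the attribute names frameLinear and frameLinear2 and the layer shapes (D → D, D → 1) come from the excerpt.

```python
import torch
from torch import nn


class FrameWeightPredictor(nn.Module):
    """Linear (D -> D), ReLU, Linear (D -> 1): one raw score per frame."""

    def __init__(self, d_model: int):
        super().__init__()
        self.frameLinear = nn.Linear(d_model, d_model)
        self.frameLinear2 = nn.Linear(d_model, 1)

    def forward(self, visual_output: torch.Tensor) -> torch.Tensor:
        # visual_output: [B, N, D]; the linear layers act on the last dimension only.
        hidden = torch.relu(self.frameLinear(visual_output))
        return self.frameLinear2(hidden)  # [B, N, 1]


predictor = FrameWeightPredictor(d_model=8)
raw_weights = predictor(torch.randn(1, 4, 8))  # B=1, N=4, D=8 (toy sizes)
print(raw_weights.shape)  # torch.Size([1, 4, 1])
```

The scores are unbounded at this point; they only become a distribution over frames after the softmax in the next section.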

3.1 Linear 1 & ReLU

3.2 Linear 2 (Raw Weights)

4. Weighted Aggregation

Shape: [B, D]

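The predicted weights are normalized with a Softmax and then used to compute a weighted sum of the L2-normalized frame features.
Corresponding code: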

visual_output = visual_output / visual_output.norm(dim=-1, keepdim=True)
# ...
if return_fine:
    visual_output_adapt = (visual_output * Frameweight.div(0.1).softmax(1)).sum(1)

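Formula: $V_{agg} = \sum_{i=1}^{N} \alpha_i \cdot \hat{v}_i$
where:

$\hat{v}_i = \frac{v_i}{\|v_i\|_2}$: the L2-normalized feature of frame $i$
$\alpha_i = \frac{\exp(w_i / 0.1)}{\sum_{j=1}^{N} \exp(w_j / 0.1)}$: the final weight of frame $i$, a softmax over the $N$ frames with Temperature=0.1, where $w_i$ is the raw weight score from section 3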
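
The aggregation step can be tried out in isolation with the sketch below. The function name aggregate_frames, the toy raw scores, and the comparison against temperature 1.0 are assumptions added for illustration; the normalization and the softmax over dimension 1 follow the excerpt.

```python
import torch


def aggregate_frames(frame_feats: torch.Tensor,
                     raw_weights: torch.Tensor,
                     temperature: float = 0.1):
    """Softmax-weighted sum of L2-normalized frame features.

    frame_feats: [B, N, D], raw_weights: [B, N, 1].
    Returns the aggregated vector [B, D] and the frame weights [B, N, 1].
    """
    normed = frame_feats / frame_feats.norm(dim=-1, keepdim=True)
    alpha = (raw_weights / temperature).softmax(dim=1)  # normalize over frames
    return (normed * alpha).sum(dim=1), alpha


feats = torch.randn(1, 4, 8)  # toy sizes: B=1, N=4, D=8
raw = torch.tensor([[[0.2], [0.1], [0.0], [-0.1]]])  # hypothetical raw scores

video_vec, alpha = aggregate_frames(feats, raw)            # temperature 0.1
_, alpha_soft = aggregate_frames(feats, raw, temperature=1.0)

print(video_vec.shape)          # torch.Size([1, 8])
print(alpha.squeeze(-1))        # sharply peaked on the first frame
print(alpha_soft.squeeze(-1))   # close to uniform
```

Dividing by 0.1 multiplies the gaps between raw scores by 10 before the softmax, so small differences in predicted importance turn into a clearly peaked weighting.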

4.1 Feature Normalization (L2 Norm)

4.2 Weight Softmax (Temp=0.1)

4.3 Weighted Sum Result

5. Final L2 Normalization

Shape: [B, D]

The aggregated video vector is L2-normalized to obtain the final vector in the semantic space.
Corresponding code:

visual_output_adapt = visual_output_adapt / visual_output_adapt.norm(dim=-1, keepdim=True)

Formula: $V_{final} = \frac{V_{agg}}{\|V_{agg}\|_2}$
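
A small worked example may help show why this last step is needed even though the frame features were already normalized: a softmax-weighted sum of unit vectors generally has a norm below 1 unless all weighted frames point in the same direction. The two-dimensional vectors and equal weights below are hypothetical values chosen to make the arithmetic easy to follow.

```python
import torch

# Two orthogonal unit-length "frame features" with equal weights (toy values).
unit_frames = torch.tensor([[1.0, 0.0], [0.0, 1.0]])  # [N=2, D=2]
alpha = torch.tensor([[0.5], [0.5]])                   # [N=2, 1]

agg = (unit_frames * alpha).sum(dim=0)  # [0.5, 0.5]
print(agg.norm().item())                # about 0.7071, no longer unit length

final = agg / agg.norm(dim=-1, keepdim=True)
print(final.norm().item())              # about 1.0
```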