# Editing a YAML section with Python regex: `.*?` plus a generic tail pattern swallows neighboring sections

**Date**: 2026-05-17
**Scenario**: Adding a lane to robot-lifecycle.yaml (inserting a branch lane after warehouse) by replacing the lanes section with Python `re.sub`. The rule was:
```python
re.sub(r'# 5 个主泳道.*?\n(?:  - \{.*\}\n)+', new_lanes + '\n', content, flags=re.S)
```
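Result: the YAML shrank from 186 lines to 45. **The regex swallowed the entire nodes / edges / sidebar / notes sections.** The new lane was written, but every other business definition disappeared. There was no git history to restore from, so about 150 lines of configuration had to be rewritten by hand.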

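## Direct cause

The combination of lazy `.*?`, `re.S` (DOTALL), and the tail pattern `(?:  - \{.*\}\n)+`:
- `.*?` looks for the shortest match, scanning forward from `# 5 个主泳道`
- the `+` group starts matching at the first `  - {...}\n` line
- the `.*` inside `\{.*\}` is greedy, and under `re.S` it also matches newlines, so it runs to the **last `}\n` in the file**, crossing the nodes, edges, and notes sections (whose lines are also `  - {...}`). The greedy `+` by itself would have stopped at the first line that is not a `  - {...}` item, such as a blank line or a section key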

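In this YAML file, nodes, edges, and notes all use the single-line `  - {...}` format, exactly the same shape as lanes. The regex has no way to tell whether a given `  - {...}` line is a lane or a node.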
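
The over-match can be reproduced on a miniature file. The sketch below uses a made-up stand-in for robot-lifecycle.yaml (the ids are hypothetical) and compares the original pattern with a variant whose item pattern uses `[^\n]*`, so it cannot leave the current line. That variant is only an illustration, not the fix that was actually applied.

```python
import re

# Hypothetical miniature of the real file; section names follow the note, ids are made up.
content = (
    "# 5 个主泳道\n"
    "lanes:\n"
    "  - {id: factory}\n"
    "  - {id: warehouse}\n"
    "\n"
    "nodes:\n"
    "  - {id: n1, lane: factory}\n"
    "  - {id: n2, lane: warehouse}\n"
    "\n"
    "sidebar:\n"
    "  title: demo\n"
)

# Original pattern: under re.S the greedy `.*` inside the braces crosses newlines.
bad = re.search(r'# 5 个主泳道.*?\n(?:  - \{.*\}\n)+', content, flags=re.S)

# Illustrative variant: `[^\n]*` keeps each item on a single line.
ok = re.search(r'# 5 个主泳道.*?\n(?:  - \{[^\n]*\}\n)+', content, flags=re.S)

print(len(bad.group(0).splitlines()), 'lines matched by the original pattern')
print(len(ok.group(0).splitlines()), 'lines matched by the line-bounded variant')
```

The original pattern runs through the `nodes:` section up to the last `}` line, while the line-bounded variant stops at the end of the lanes list.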

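## Underlying root cause

**A regex was used to replace one "business-level section", but a regex has no notion of section boundaries**:
- YAML sections are delimited by top-level keys (such as `lanes:` / `nodes:` / `edges:`), while the list items are homogeneous `  - {...}` lines
- a regex sees only character patterns, not semantic structure
- `.*?` plus a homogeneous list pattern is a classic trap: **shortest-possible head match + greedy repeated tail = every same-shaped line across sections gets swallowed**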

## How the YAML should have been edited

In order of preference:

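1. **YAML parser** (PyYAML / js-yaml): load → modify the dict → dump. Structure-aware, so it cannot swallow a neighboring section. **Best option.**
   ```python
   import yaml
   with open('f.yaml', encoding='utf-8') as f:
       data = yaml.safe_load(f)
   # `...` stands for the remaining lane fields
   data['lanes'].insert(2, {'id': 'branch', ...})
   with open('f.yaml', 'w', encoding='utf-8') as f:
       yaml.dump(data, f, allow_unicode=True, sort_keys=False)  # keep key order
   ```
   Drawback: dump reflows the formatting, drops comments, and does not preserve the inline `{...}` style. If the original file relies heavily on hand-written formatting (such as the inline lane definitions in this case), this option does not apply.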

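2. **Section-bounded regex**: hard-code explicit anchors for the start and the end of the section:
   ```python
   import re

   # End anchor: the marker line that opens the next section
   content = re.sub(
       r'(lanes:\n)(?:  - \{.*?\}\n)+(\n# 节点)',  # "\n# 节点" cuts the match off
       # a function replacement keeps backslashes in new_lane_lines literal
       lambda m: m.group(1) + new_lane_lines + m.group(2),
       content,
   )
   ```
   Key point: **never let `.*?` cross the section-end marker**. There is no `re.S` here, so `.` cannot cross a line break.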

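3. **Line-level Python processing** (split + index + reassemble):
   ```python
   lines = content.splitlines()
   start = lines.index('lanes:')
   # Find the next non-empty, non-indented line (next top-level key); fall back to EOF
   end = next((i for i in range(start + 1, len(lines))
               if lines[i] and not lines[i].startswith((' ', '\t'))), len(lines))
   new_lines = lines[:start + 1] + lane_def_lines + lines[end:]
   content = '\n'.join(new_lines) + '\n'
   ```
   Clunky but safe: there is no `.*` that can overrun the section. Note that a blank separator line just before the next section falls inside the replaced range.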

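4. **Exact-string replacement with the Edit tool**: fine for small changes (a line or two), but CJK full-width punctuation easily causes match failures (this project has hit that repeatedly).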
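
As a rough sketch of option 3 applied to the original task (insert one lane after warehouse), the hypothetical helper `insert_item_after` below touches only lines inside the named section. The function name, its parameters, and the sample ids are assumptions for illustration.

```python
def insert_item_after(content, section_key, marker, new_line):
    """Insert new_line after the first item containing marker, inside one section only."""
    lines = content.splitlines()
    start = lines.index(section_key)  # raises ValueError if the section is missing
    # The section ends at the next non-empty, non-indented line, or at EOF.
    end = next(
        (i for i in range(start + 1, len(lines))
         if lines[i] and not lines[i].startswith((' ', '\t'))),
        len(lines),
    )
    for i in range(start + 1, end):
        if marker in lines[i]:
            lines.insert(i + 1, new_line)
            return '\n'.join(lines) + '\n'
    raise ValueError(f'{marker!r} not found inside {section_key!r}')


# Made-up sample: "warehouse" also appears in the nodes section, which must stay untouched.
sample = (
    "lanes:\n"
    "  - {id: factory}\n"
    "  - {id: warehouse}\n"
    "\n"
    "nodes:\n"
    "  - {id: n1, lane: warehouse}\n"
)
result = insert_item_after(sample, 'lanes:', 'id: warehouse', '  - {id: branch}')
print(result)
```

Because the search range is cut at the next top-level key before any matching happens, a same-shaped line in `nodes:` can never be selected.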

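## Never do this

- **Use `.*?` with a repeated, multi-line, homogeneous list pattern**, unless it is paired with a section-end anchor
- **Overwrite in one shot without a dry run** (before a regex rewrite, run `re.findall` first to see the matched range and confirm it covers only the target section)
- **Make large edits in a working tree with no git commit** (even for a file generated seconds ago, run `git add -N` and make a `cp` backup before touching it; `git add -N` only records the path without storing content, so the `cp` copy is the actual backup)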
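
One possible way to enforce the backup step in code is sketched here. `rewrite_with_backup` and its `max_lost_lines` threshold are hypothetical and not part of the original workflow. The idea is to copy the file first and to refuse a rewrite that drops far more lines than expected, as the 186-to-45 shrink would have.

```python
import shutil
import tempfile
from pathlib import Path


def rewrite_with_backup(path, transform, max_lost_lines=5):
    """Back up path to <name>.bak, then write transform(old_text) unless too many lines vanish."""
    path = Path(path)
    old = path.read_text(encoding='utf-8')
    new = transform(old)
    lost = len(old.splitlines()) - len(new.splitlines())
    if lost > max_lost_lines:
        raise ValueError(f'rewrite would drop {lost} lines; refusing to write')
    backup = path.with_name(path.name + '.bak')
    shutil.copy2(path, backup)  # copy before overwriting
    path.write_text(new, encoding='utf-8')
    return backup


# Demo on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'demo.yaml'
    target.write_text('lanes:\n  - {id: a}\n', encoding='utf-8')

    backup = rewrite_with_backup(target, lambda s: s + '  - {id: b}\n')
    backup_text = backup.read_text(encoding='utf-8')
    target_text = target.read_text(encoding='utf-8')

    try:
        # A transform that wipes the file is rejected before anything is written.
        rewrite_with_backup(target, lambda s: '', max_lost_lines=1)
        refused = False
    except ValueError:
        refused = True
```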

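## Engineering safeguard

**Before using a regex to change section-level configuration, list the matches first (for example with `re.findall(pattern, content)`) to see how many there are and how long each one is, and confirm that only the target section is covered**: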

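```python
import re

# finditer + group(0) yields the full match even when the pattern has capture groups
# (re.findall would return only the groups). Use the same flags as the planned re.sub.
matches = [m.group(0) for m in re.finditer(pattern, content, flags=re.S)]
print(f'matches: {len(matches)}')
for m in matches:
    print(f'  len={len(m)} starts="{m[:30]}..." ends="...{m[-30:]}"')
# Run re.sub only after checking this output by eye
```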

If a single match is longer than a few hundred characters or spans more than 5 lines, **it has almost certainly swallowed too much**.
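
The visual check can also be turned into a hard guard. The sketch below is an assumption, not part of the original workflow: a hypothetical `guarded_sub` raises when any match spans more than `max_lines` lines and calls `re.sub` only if every match is within the limit. The sample text and both patterns are made up for the demo.

```python
import re


def guarded_sub(pattern, repl, content, flags=0, max_lines=5):
    """Apply re.sub only if no match spans more than max_lines lines."""
    for m in re.finditer(pattern, content, flags):
        n_lines = len(m.group(0).splitlines())
        if n_lines > max_lines:
            raise ValueError(
                f'match at offset {m.start()} spans {n_lines} lines; refusing to substitute'
            )
    return re.sub(pattern, repl, content, flags=flags)


sample = (
    "lanes:\n"
    "  - {id: a}\n"
    "\n"
    "nodes:\n"
    "  - {id: n1}\n"
    "  - {id: n2}\n"
)
greedy = r'lanes:\n(?:  - \{.*\}\n)+'        # `.*` crosses lines under re.S
bounded = r'lanes:\n(?:  - \{[^\n]*\}\n)+'   # each item stays on one line
new_section = "lanes:\n  - {id: a}\n  - {id: b}\n"

try:
    guarded_sub(greedy, new_section, sample, flags=re.S, max_lines=3)
    blocked = False
except ValueError:
    blocked = True

fixed = guarded_sub(bounded, new_section, sample, flags=re.S, max_lines=3)
```

The threshold has to be chosen per file: it should sit a little above the size of the section being replaced.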

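## Similar "looks precise, actually crosses boundaries" traps

- HTML/XML: matching nested divs with `<div>.*?</div>` (the match ends at the wrong closing tag, whether the nearest or the farthest)
- Markdown: matching a section with `^##.*?^##` (in DOTALL mode the match runs across sections)
- SQL: `BEGIN.*?END` spanning multiple transactions

General rule: **homogeneous repeated structure + `.*?` = unreliable**. Either use a structural parser or add an explicit section-end anchor.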
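
The HTML case is easy to check on toy strings (the markup below is made up): neither the lazy nor the greedy form picks out exactly one balanced element.

```python
import re

nested = '<div>outer <div>inner</div> tail</div>'
siblings = '<div>a</div><div>b</div>'

# Lazy: starts at the outer opening tag but ends at the inner closing tag.
lazy = re.search(r'<div>.*?</div>', nested).group(0)

# Greedy: runs to the last closing tag and merges two sibling elements.
greedy = re.search(r'<div>.*</div>', siblings).group(0)

print(lazy)
print(greedy)
```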