|
@@ -1,72 +1,167 @@
|
|
|
# Data Extraction Rule System Design Document
|
|
|
|
|
|
|
|
-> Version: 1.0.0
|
|
|
|
|
-> Date: 2026-01-22
|
|
|
|
|
|
|
+> Version: 2.0.0
|
|
|
|
|
+> Date: 2026-01-23
|
|
|
> Author: AI Assistant (Claude Opus 4.5)
|
|
|
|
|
|
|
|
## 1. Overview
|
|
|
|
|
|
|
|
### 1.1 Background
|
|
|
|
|
|
|
|
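-When generating pre-assessment reports for power engineering projects, users need to extract specific data from multiple source documents (PDF, Word, Excel), process it according to rules (direct extraction, AI extraction, AI summarization, and so on), and produce a structured report.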
|
|
|
|
|
|
|
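+When generating pre-assessment reports for power engineering projects, users need to extract specific data from multiple source documents (PDF, Word, Excel) and produce a standardized report.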
|
|
|
|
|
|
|
|
-The extraction rules, currently compiled by hand, are kept as a table with the following columns:
|
|
|
|
|
|
|
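+**Core pain points**:
+- Reports for different engineering projects share a similar structure but contain different data
+- Data is copied and pasted by hand from multiple documents, which is slow and error-prone
+- Reports of the same type require the same work to be repeated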
|
|
|
|
|
|
|
|
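-- Source data/file
-- Specific chapter/content within the source data/file
-- Value extraction rule
-- To be provided/remarks
-
-The goal of this system is to **make this manual process visual and structured**: users configure extraction rules in the UI, and the system runs the extraction tasks automatically.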
|
|
|
|
|
|
|
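+**Solution**:
+The user uploads a **completed, real report** as an example, marks the **variables** in it (such as the project name or the approval date), and configures a **data source** for each variable. For later projects of the same type, the user only replaces the source files, and the system extracts the data and generates the new report automatically.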
|
|
|
|
|
|
|
|
### 1.2 Core Concepts
|
|
|
|
|
|
|
|
-| Concept | Description |
-| ------- | ----------- |
-| **Project** | A report generation task containing multiple source documents and extraction rules |
-| **SourceDocument** | A document used in the project, linked to an already parsed Document |
-| **ExtractRule** | Configuration describing how to extract data from a source document |
-| **ExtractResult** | The value extracted by a rule, which later rules can reference |
|
|
|
|
|
|
|
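+| Concept | Description |
+| ------- | ----------- |
+| **Template** | Created from a real report; contains the variable definitions and the source file configuration |
+| **SourceFile** (source file definition) | A type of source file the template requires, identified by an alias (e.g. "可研批复", the feasibility study approval) |
+| **Variable** | Content in the report that is replaced dynamically, bound to a specific location in the document |
+| **Generation** (generation task) | One run that uses a template to generate a new report |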
|
|
|
|
|
|
|
|
### 1.3 Design Principles
|
|
|
|
|
|
|
|
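-1. **Data traceability**: every extracted value can be traced back to its exact location in the source document
-2. **Flexible configuration**: supports combinations of multiple source types and extraction methods
-3. **Reusable**: rule configurations can be saved as templates and applied to similar projects
-4. **Incremental**: supports step-by-step extraction, manual confirmation, and correction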
|
|
|
|
|
|
|
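+1. **Example-driven**: templates are created from real reports, so what you see is what you get
+2. **Data traceability**: every extracted value can be traced back to its exact location in the source document
+3. **Flexible sources**: the number and types of source files are defined by the user
+4. **Configure once, reuse many times**: once created, a template can generate any number of new reports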
|
|
|
|
|
+
|
|
|
|
|
+---
|
|
|
|
|
+
|
|
|
|
|
+## 2. User Workflow
|
|
|
|
|
+
|
|
|
|
|
+### 2.1 Workflow Overview
|
|
|
|
|
+
|
|
|
|
|
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                    First time: create a template                │
+├─────────────────────────────────────────────────────────────────┤
+│                                                                 │
+│  1. Upload an example report (a real, complete report)          │
+│     "襄阳连云110kV预评价报告.docx"                              │
+│                                                                 │
+│  2. Add source files (user-defined number and aliases)          │
+│     ├── 可研批复.pdf      alias: "可研批复"                     │
+│     ├── 站址报告.docx     alias: "站址报告"                     │
+│     └── 概算表.xlsx       alias: "概算表"                       │
+│                                                                 │
+│  3. Mark variables in the example report                        │
+│     Select the text "襄阳连云 110kV 输变电工程"                 │
+│     → set as variable "project_name"                            │
+│     → source: AI extraction from page 1 of "可研批复"           │
+│                                                                 │
+│  4. Save the template                                           │
+│                                                                 │
+└─────────────────────────────────────────────────────────────────┘
+                              ↓
+┌─────────────────────────────────────────────────────────────────┐
+│   Second time onward: generate a new report from the template   │
+├─────────────────────────────────────────────────────────────────┤
+│                                                                 │
+│  1. Select a template                                           │
+│                                                                 │
+│  2. Upload the new project's source files (matched by alias)    │
+│     ├── 武汉东湖批复.pdf    → "可研批复"                        │
+│     ├── 武汉站址报告.docx   → "站址报告"                        │
+│     └── 武汉概算.xlsx       → "概算表"                          │
+│                                                                 │
+│  3. Click [Generate]                                            │
+│     The system automatically:                                   │
+│     - extracts data from the new files                          │
+│     - replaces the variables in the template                    │
+│     - generates the new report                                  │
+│                                                                 │
+│  4. Preview, confirm, download                                  │
+│     "武汉东湖110kV预评价报告.docx"                              │
+│                                                                 │
+└─────────────────────────────────────────────────────────────────┘
+```
|
|
|
|
|
+
|
|
|
|
|
+### 2.2 Variable Marking Interaction
|
|
|
|
|
+
|
|
|
|
|
+The user works in the document editor:
|
|
|
|
|
+
|
|
|
|
|
+```
+┌─────────────────────────────────────────────────────────────────┐
+│  📄 Example report editor                                       │
+├─────────────────────────────────────────────────────────────────┤
+│                                                                 │
+│  1. Project Overview                                            │
+│                                                                 │
+│  1.1 Project name: 襄阳连云 110kV 输变电工程                    │
+│                    ═════════════════════════ ← user selects this│
+│                                │                                │
+│                                ▼ context menu                   │
+│         ┌────────────────────────────────────────────┐          │
+│         │ 📌 Set as variable                         │          │
+│         │                                            │          │
+│         │ Variable name: project_name                │          │
+│         │ Display name: 工程名称                     │          │
+│         │                                            │          │
+│         │ Current value: 襄阳连云 110kV 输变电工程   │          │
+│         │                                            │          │
+│         │ Data source:                               │          │
+│         │ ● Extract from a source file               │          │
+│         │   ├─ Source file: [可研批复 ▼]             │          │
+│         │   ├─ Locate by: [page number ▼]            │          │
+│         │   ├─ Page range: 1-2                       │          │
+│         │   └─ Method: [AI extraction ▼]             │          │
+│         │       └─ Target: 工程项目名称              │          │
+│         │                                            │          │
+│         │ ○ Manual input                             │          │
+│         │ ○ Reference other variables                │          │
+│         │ ○ Fixed value (unchanged)                  │          │
+│         │                                            │          │
+│         │                        [Cancel]  [OK]      │          │
+│         └────────────────────────────────────────────┘          │
+│                                                                 │
+│  1.2 Construction unit: [$construction_unit]                    │
+│      ↑ marked variables are shown highlighted                   │
+│                                                                 │
+│  1.3 Approval date: [$approval_date]                            │
+│                                                                 │
+└─────────────────────────────────────────────────────────────────┘
+```
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
-## 2. System Architecture
|
|
|
|
|
|
|
+## 3. System Architecture
|
|
|
|
|
|
|
|
-### 2.1 Overall Architecture Diagram
|
|
|
|
|
|
|
+### 3.1 Overall Architecture Diagram
|
|
|
|
|
|
|
|
```
|
|
|
┌─────────────────────────────────────────────────────────────────────────────┐
|
|
|
-│                              Frontend (Vue.js)                              │
|
|
|
|
|
|
|
+│                         Frontend (Vue.js / Flutter)                         │
|
|
|
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
|
|
|
-│  │   Projects   │  │  Documents   │  │    Rules     │  │  Extraction  │        │
|
|
|
|
|
|
|
+│  │  Templates   │  │  Doc Editor  │  │  Variables   │  │  Generation  │        │
|
|
|
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
|
|
|
└─────────────────────────────────────────────────────────────────────────────┘
|
|
|
│
|
|
|
▼
|
|
|
┌─────────────────────────────────────────────────────────────────────────────┐
|
|
|
-│ Gateway Service │
|
|
|
|
|
|
|
+│ API Gateway │
|
|
|
└─────────────────────────────────────────────────────────────────────────────┘
|
|
|
│
|
|
|
- ┌──────────────────┼──────────────────┐
|
|
|
|
|
- ▼ ▼ ▼
|
|
|
|
|
|
|
+ ┌────────────────────────────┼────────────────────────────┐
|
|
|
|
|
+ ▼ ▼ ▼
|
|
|
┌──────────────────────┐ ┌──────────────────────┐ ┌──────────────────────┐
|
|
|
-│ extract-service │ │ document-service │ │ ai-service │
|
|
|
|
|
-│  (new)               │    │  (existing)          │    │  (existing)          │
|
|
|
|
|
|
|
+│ template-service │ │ document-service │ │ ai-service │
|
|
|
|
|
+│ (template/generation)│    │   (documents)        │    │   (AI extraction)    │
|
|
|
│ ┌────────────────┐ │ │ ┌────────────────┐ │ │ ┌────────────────┐ │
|
|
|
-│ │ ProjectService │ │ │ │ DocumentService│ │ │ │ DeepSeekClient │ │
|
|
|
|
|
-│ │ RuleService │ │ │ │ ElementService │ │ │ │ AIService │ │
|
|
|
|
|
-│ │ ExecuteService │ │ │ └────────────────┘ │ │ └────────────────┘ │
|
|
|
|
|
|
|
+│ │ TemplateService│ │ │ │ DocumentService│ │ │ │ DeepSeekClient │ │
|
|
|
|
|
+│  │ VariableService│  │    │  │ ElementService │  │    │  │AIExtractService│  │
|
|
|
|
|
+│ │ GenerationSvc │ │ │ └────────────────┘ │ │ └────────────────┘ │
|
|
|
│ └────────────────┘ │ └──────────────────────┘ └──────────────────────┘
|
|
|
└──────────────────────┘
|
|
|
- │ │ │
|
|
|
|
|
- └──────────────────┴──────────────────┘
|
|
|
|
|
|
|
+ │ │ │
|
|
|
|
|
+ └────────────────────────────┴────────────────────────────┘
|
|
|
│
|
|
|
▼
|
|
|
┌─────────────────────────────────────────────────────────────────────────────┐
|
|
@@ -74,1677 +169,525 @@
|
|
|
└─────────────────────────────────────────────────────────────────────────────┘
|
|
|
```
|
|
|
|
|
|
|
|
-### 2.2 Module Responsibilities
|
|
|
|
|
|
|
+### 3.2 Module Responsibilities
|
|
|
|
|
|
|
|
-| Module | Responsibility | Location |
-| ------ | -------------- | -------- |
-| **extract-service** | Project management, rule configuration, extraction execution (**new**) | `backend/extract-service` |
-| **document-service** | Document management, element storage (existing) | `backend/document-service` |
-| **parse-service** | Document parsing, structured extraction (existing) | `backend/parse-service` |
-| **ai-service** | AI extraction, summarization, polishing (existing) | `backend/ai-service` |
-| **graph-service** | Data sources, knowledge graph (existing) | `backend/graph-service` |
|
|
|
|
|
|
|
+| Module | Responsibility | Location |
+| ------ | -------------- | -------- |
+| **template-service** | Template management, variable configuration, report generation (refactored from extract-service) | `backend/extract-service` |
+| **document-service** | Document management, element storage (existing) | `backend/document-service` |
+| **parse-service** | Document parsing, structured extraction (existing) | `backend/parse-service` |
+| **ai-service** | AI extraction, summarization (existing) | `backend/ai-service` |
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
-## 3. Database Design
|
|
|
|
|
|
|
+## 4. Database Design
|
|
|
|
|
|
|
|
-### 3.1 ER Diagram
|
|
|
|
|
|
|
+### 4.1 ER Diagram
|
|
|
|
|
|
|
|
```
|
|
|
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
|
|
|
-│ projects │ │ source_documents│ │ extract_rules │
|
|
|
|
|
|
|
+│ templates │ │ source_files │ │ variables │
|
|
|
├─────────────────┤ ├─────────────────┤ ├─────────────────┤
|
|
|
-│ id │◄──────│ project_id │ │ id │
|
|
|
|
|
-│ user_id │ │ id │◄──────│ project_id │
|
|
|
|
|
-│ name │ │ document_id │ │ source_doc_id │
|
|
|
|
|
-│ description │ │ alias │ │ target_field_key│
|
|
|
|
|
-│ status │ │ doc_type │ │ target_field_name│
|
|
|
|
|
-│ config │ │ metadata │ │ rule_index │
|
|
|
|
|
-│ created_at │ │ created_at │ │ source_type │
|
|
|
|
|
-│ updated_at │ └─────────────────┘ │ source_config │
|
|
|
|
|
-└─────────────────┘ │ extract_type │
|
|
|
|
|
- │ extract_config │
|
|
|
|
|
-┌─────────────────┐ │ status │
|
|
|
|
|
-│ extract_results │ │ extracted_value │
|
|
|
|
|
-├─────────────────┤ │ value_type │
|
|
|
|
|
-│ id │ │ metadata │
|
|
|
|
|
-│ rule_id │◄────────────────────────────────│ created_at │
|
|
|
|
|
-│ project_id │ │ updated_at │
|
|
|
|
|
-│ extracted_value │ └─────────────────┘
|
|
|
|
|
-│ value_type │
|
|
|
|
|
-│ source_content │ ┌─────────────────┐
|
|
|
|
|
-│ confidence │ │ rule_templates │
|
|
|
|
|
-│ status │ ├─────────────────┤
|
|
|
|
|
-│ metadata │ │ id │
|
|
|
|
|
-│ created_at │ │ user_id │
|
|
|
|
|
-│ confirmed_at │ │ name │
|
|
|
|
|
-│ confirmed_by │ │ description │
|
|
|
|
|
-└─────────────────┘ │ rules_snapshot │
|
|
|
|
|
- │ doc_type_pattern│
|
|
|
|
|
- │ created_at │
|
|
|
|
|
- └─────────────────┘
|
|
|
|
|
-```
|
|
|
|
|
-
|
|
|
|
|
-### 3.2 Table Definitions
|
|
|
|
|
-
|
|
|
|
|
-#### 3.2.1 projects (project table)
|
|
|
|
|
|
|
+│ id │◄──────│ template_id │ │ id │
|
|
|
|
|
+│ user_id │ │ id │ │ template_id │──────►│
|
|
|
|
|
+│ name            │       │ alias           │◄──────│source_file_alias│       │
|
|
|
|
|
+│ description │ │ description │ │ name │ │
|
|
|
|
|
+│ base_document_id│ │ file_types │ │ display_name │ │
|
|
|
|
|
+│ status │ │ required │ │ location │ │
|
|
|
|
|
+│ config │ │ example_doc_id │ │ example_value │ │
|
|
|
|
|
+│ create_time     │       │ display_order   │       │ source_type     │       │
|
|
|
|
|
+│ update_time     │       └─────────────────┘       │ source_config   │       │
|
|
|
|
|
+└─────────────────┘ │ extract_type │ │
|
|
|
|
|
+ │ │ extract_config │ │
|
|
|
|
|
+ │ ┌─────────────────┐ │ display_order │ │
|
|
|
|
|
+ │ │ generations │ └─────────────────┘ │
|
|
|
|
|
+ │ ├─────────────────┤ │
|
|
|
|
|
+ └──────►│ template_id │ │
|
|
|
|
|
+ │ id │◄─────────────────────────────────────────┘
|
|
|
|
|
+            │       │ user_id         │              (linked via template_id)
|
|
|
|
|
+ │ name │
|
|
|
|
|
+ │ source_file_map │ ← JSONB: {"可研批复": "doc_123", ...}
|
|
|
|
|
+ │ variable_values │ ← JSONB: {"project_name": {...}, ...}
|
|
|
|
|
+ │ output_doc_id │
|
|
|
|
|
+ │ status │
|
|
|
|
|
+            │       │ create_time     │
|
|
|
|
|
+ │ completed_at │
|
|
|
|
|
+ └─────────────────┘
|
|
|
|
|
+```
|
|
|
|
|
+
|
|
|
|
|
+### 4.2 Table Definitions
|
|
|
|
|
+
|
|
|
|
|
+#### 4.2.1 templates (template table)
|
|
|
|
|
|
|
|
```sql
|
|
|
-CREATE TABLE projects (
|
|
|
|
|
- id VARCHAR(32) PRIMARY KEY,
|
|
|
|
|
- user_id VARCHAR(32) NOT NULL,
|
|
|
|
|
- name VARCHAR(255) NOT NULL COMMENT '项目名称',
|
|
|
|
|
- description TEXT COMMENT '项目描述',
|
|
|
|
|
- status VARCHAR(32) DEFAULT 'draft' COMMENT '状态: draft/extracting/completed/archived',
|
|
|
|
|
- config JSONB COMMENT '项目配置',
|
|
|
|
|
- created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
|
|
|
|
|
- updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
|
|
|
|
|
-
|
|
|
|
|
- INDEX idx_user_id (user_id),
|
|
|
|
|
- INDEX idx_status (status)
|
|
|
|
|
|
|
+CREATE TABLE templates (
+    id VARCHAR(36) PRIMARY KEY,
+    user_id VARCHAR(36) NOT NULL,
+    name VARCHAR(255) NOT NULL, -- template name
+    description TEXT, -- template description
+    base_document_id VARCHAR(36) NOT NULL, -- document ID of the example report
+    status VARCHAR(32) DEFAULT 'draft', -- status: draft/published/archived
+    config JSONB, -- template configuration
+    is_public BOOLEAN DEFAULT FALSE, -- whether the template is public
+    use_count INT DEFAULT 0, -- usage count
+    create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+    update_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+    create_by VARCHAR(36),
+    update_by VARCHAR(36)
|
|
|
);
|
|
|
|
|
|
|
|
-COMMENT ON TABLE projects IS '数据提取项目';
|
|
|
|
|
-```
|
|
|
|
|
-
|
|
|
|
|
-**config 字段结构**:
|
|
|
|
|
|
|
+CREATE INDEX idx_templates_user_id ON templates(user_id);
|
|
|
|
|
+CREATE INDEX idx_templates_status ON templates(status);
|
|
|
|
|
|
|
|
-```json
|
|
|
|
|
-{
|
|
|
|
|
- "outputFormat": "docx", // 输出格式
|
|
|
|
|
- "autoExtract": false, // 是否自动执行提取
|
|
|
|
|
- "notifyOnComplete": true, // 完成时通知
|
|
|
|
|
- "aiModel": "deepseek-chat" // 使用的AI模型
|
|
|
|
|
-}
|
|
|
|
|
|
|
+COMMENT ON TABLE templates IS 'Report templates';
|
|
|
```
|
|
|
|
|
|
|
|
-#### 3.2.2 source_documents (source document table)
|
|
|
|
|
|
|
+#### 4.2.2 source_files (source file definition table)
|
|
|
|
|
|
|
|
```sql
|
|
|
-CREATE TABLE source_documents (
|
|
|
|
|
- id VARCHAR(32) PRIMARY KEY,
|
|
|
|
|
- project_id VARCHAR(32) NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
|
|
|
|
|
- document_id VARCHAR(32) NOT NULL COMMENT '关联的 Document ID',
|
|
|
|
|
- alias VARCHAR(128) NOT NULL COMMENT '文档别名,如"可研批复"',
|
|
|
|
|
- doc_type VARCHAR(32) NOT NULL COMMENT '文档类型: pdf/docx/xlsx',
|
|
|
|
|
|
|
+CREATE TABLE source_files (
+    id VARCHAR(36) PRIMARY KEY,
+    template_id VARCHAR(36) NOT NULL REFERENCES templates(id) ON DELETE CASCADE,
+    alias VARCHAR(100) NOT NULL, -- file alias, e.g. "可研批复"
+    description TEXT, -- file description
+    file_types JSONB, -- allowed file types: ["pdf", "docx"]
+    required BOOLEAN DEFAULT TRUE, -- whether the file is required
+    example_document_id VARCHAR(36), -- example file used when the template was created
|
|
|
    display_order INT DEFAULT 0, -- display order
|
|
|
- metadata JSONB COMMENT '元数据',
|
|
|
|
|
- created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
|
|
|
|
|
-
|
|
|
|
|
- INDEX idx_project_id (project_id),
|
|
|
|
|
- UNIQUE (project_id, alias)
|
|
|
|
|
|
|
+ create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
|
|
|
|
|
+
|
|
|
|
|
+ UNIQUE (template_id, alias)
|
|
|
);
|
|
|
|
|
|
|
|
-COMMENT ON TABLE source_documents IS '项目来源文档';
|
|
|
|
|
-```
|
|
|
|
|
-
|
|
|
|
|
-**metadata 字段结构**:
|
|
|
|
|
|
|
+CREATE INDEX idx_source_files_template ON source_files(template_id);
|
|
|
|
|
|
|
|
-```json
|
|
|
|
|
-{
|
|
|
|
|
- "fileName": "鄂电司发展〔2024〕124号...批复.pdf",
|
|
|
|
|
- "fileSize": 1024000,
|
|
|
|
|
- "pageCount": 18,
|
|
|
|
|
- "parseStatus": "completed"
|
|
|
|
|
-}
|
|
|
|
|
|
|
+COMMENT ON TABLE source_files IS 'Source file definitions';
|
|
|
```
|
|
|
|
|
|
|
|
-#### 3.2.3 extract_rules (extraction rule table)
|
|
|
|
|
|
|
+#### 4.2.3 variables (variable table)
|
|
|
|
|
|
|
|
```sql
|
|
|
-CREATE TABLE extract_rules (
|
|
|
|
|
- id VARCHAR(32) PRIMARY KEY,
|
|
|
|
|
- project_id VARCHAR(32) NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
|
|
|
|
|
- source_doc_id VARCHAR(32) COMMENT '来源文档ID(可为空,表示引用/固定/手动)',
|
|
|
|
|
-
|
|
|
|
|
- -- 目标字段
|
|
|
|
|
- target_field_key VARCHAR(128) NOT NULL COMMENT '目标字段Key(程序用)',
|
|
|
|
|
- target_field_name VARCHAR(255) NOT NULL COMMENT '目标字段名称(显示用)',
|
|
|
|
|
- target_field_group VARCHAR(128) COMMENT '字段分组',
|
|
|
|
|
- rule_index INT NOT NULL COMMENT '规则顺序',
|
|
|
|
|
-
|
|
|
|
|
- -- 来源配置
|
|
|
|
|
- source_type VARCHAR(32) NOT NULL COMMENT '来源类型: document/self_reference/fixed/manual',
|
|
|
|
|
- source_config JSONB NOT NULL COMMENT '来源配置',
|
|
|
|
|
-
|
|
|
|
|
- -- 提取配置
|
|
|
|
|
- extract_type VARCHAR(32) NOT NULL COMMENT '提取类型: direct/ai_extract/ai_summarize/ocr',
|
|
|
|
|
|
|
+CREATE TABLE variables (
+    id VARCHAR(36) PRIMARY KEY,
+    template_id VARCHAR(36) NOT NULL REFERENCES templates(id) ON DELETE CASCADE,
+
+    -- Variable identity
+    name VARCHAR(100) NOT NULL, -- variable name (used by code)
+    display_name VARCHAR(200) NOT NULL, -- display name
+    variable_group VARCHAR(100), -- variable group
+
+    -- Location in the example report
+    location JSONB NOT NULL, -- location in the document
+    -- location structure:
+    -- {
+    --   "element_id": "elem_001",
+    --   "type": "text" | "table_cell" | "paragraph",
+    --   "start_offset": 10,
+    --   "end_offset": 35,
+    --   "row_index": 2,        -- table row
+    --   "col_index": 1         -- table column
+    -- }
+
+    -- Example value (the value in the original document)
+    example_value TEXT, -- example value
+    value_type VARCHAR(32) DEFAULT 'text', -- value type: text/date/number/table
+
+    -- Data source
+    source_file_alias VARCHAR(100), -- source file alias
+    source_type VARCHAR(32) NOT NULL, -- source type: document/manual/reference/fixed
+    source_config JSONB, -- source configuration
+    -- Example source_config (document type):
+    -- {
+    --   "location": {
+    --     "type": "page",
+    --     "pageStart": 1,
+    --     "pageEnd": 2
+    --   }
+    -- }
+
+    -- Extraction method
+    extract_type VARCHAR(32), -- extraction type: direct/ai_extract/ai_summarize
|
|
|
    extract_config JSONB, -- extraction configuration
|
|
|
-
|
|
|
|
|
- -- 结果
|
|
|
|
|
- status VARCHAR(32) DEFAULT 'pending' COMMENT '状态: pending/extracting/extracted/confirmed/error',
|
|
|
|
|
- extracted_value TEXT COMMENT '提取出的值',
|
|
|
|
|
- value_type VARCHAR(32) DEFAULT 'text' COMMENT '值类型: text/table/image/list',
|
|
|
|
|
- error_message TEXT COMMENT '错误信息',
|
|
|
|
|
-
|
|
|
|
|
- -- 元数据
|
|
|
|
|
- metadata JSONB COMMENT '元数据',
|
|
|
|
|
- created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
|
|
|
|
|
- updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
|
|
|
|
|
-
|
|
|
|
|
- INDEX idx_project_id (project_id),
|
|
|
|
|
- INDEX idx_status (status),
|
|
|
|
|
- INDEX idx_target_field_key (target_field_key),
|
|
|
|
|
- UNIQUE (project_id, target_field_key)
|
|
|
|
|
-);
|
|
|
|
|
-
|
|
|
|
|
-COMMENT ON TABLE extract_rules IS '数据提取规则';
|
|
|
|
|
-```
|
|
|
|
|
-
|
|
|
|
|
-#### 3.2.4 extract_results (extraction result table)
|
|
|
|
|
-
|
|
|
|
|
-```sql
|
|
|
|
|
-CREATE TABLE extract_results (
|
|
|
|
|
- id VARCHAR(32) PRIMARY KEY,
|
|
|
|
|
- rule_id VARCHAR(32) NOT NULL REFERENCES extract_rules(id) ON DELETE CASCADE,
|
|
|
|
|
- project_id VARCHAR(32) NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
|
|
|
|
|
-
|
|
|
|
|
- -- 提取结果
|
|
|
|
|
- extracted_value TEXT NOT NULL COMMENT '提取出的值',
|
|
|
|
|
- value_type VARCHAR(32) DEFAULT 'text' COMMENT '值类型',
|
|
|
|
|
-
|
|
|
|
|
- -- 来源追溯
|
|
|
|
|
- source_content TEXT COMMENT '来源原文内容',
|
|
|
|
|
- source_location JSONB COMMENT '来源位置信息',
|
|
|
|
|
-
|
|
|
|
|
- -- 质量评估
|
|
|
|
|
- confidence DECIMAL(5,4) COMMENT 'AI提取的置信度 0-1',
|
|
|
|
|
-
|
|
|
|
|
- -- 状态
|
|
|
|
|
- status VARCHAR(32) DEFAULT 'extracted' COMMENT '状态: extracted/confirmed/rejected/modified',
|
|
|
|
|
-
|
|
|
|
|
- -- 人工处理
|
|
|
|
|
- modified_value TEXT COMMENT '人工修正后的值',
|
|
|
|
|
- confirmed_at TIMESTAMP COMMENT '确认时间',
|
|
|
|
|
- confirmed_by VARCHAR(32) COMMENT '确认人',
|
|
|
|
|
-
|
|
|
|
|
- -- 元数据
|
|
|
|
|
- metadata JSONB COMMENT '元数据(AI输出、处理日志等)',
|
|
|
|
|
- created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
|
|
|
|
|
-
|
|
|
|
|
- INDEX idx_rule_id (rule_id),
|
|
|
|
|
- INDEX idx_project_id (project_id),
|
|
|
|
|
- INDEX idx_status (status)
|
|
|
|
|
|
|
+    -- Example extract_config (ai_extract type):
+    -- {
+    --   "targetDescription": "提取工程名称",
+    --   "fieldType": "text",
+    --   "expectedFormat": "XX市XX工程"
+    -- }
+
+    display_order INT DEFAULT 0, -- display order
+    create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+    update_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+
+    UNIQUE (template_id, name)
|
|
|
);
|
|
|
|
|
|
|
|
-COMMENT ON TABLE extract_results IS '提取结果历史';
|
|
|
|
|
-```
|
|
|
|
|
-
|
|
|
|
|
-**source_location 字段结构**:
|
|
|
|
|
|
|
+CREATE INDEX idx_variables_template ON variables(template_id);
|
|
|
|
|
|
|
|
-```json
|
|
|
|
|
-{
|
|
|
|
|
- "documentId": "doc_001",
|
|
|
|
|
- "documentAlias": "可研批复",
|
|
|
|
|
- "locationType": "page",
|
|
|
|
|
- "pageStart": 1,
|
|
|
|
|
- "pageEnd": 2,
|
|
|
|
|
- "elementIds": ["elem_001", "elem_002"],
|
|
|
|
|
- "chapterPath": ["1", "建设必要性"],
|
|
|
|
|
- "textPreview": "本项目建设必要性主要体现在..."
|
|
|
|
|
-}
|
|
|
|
|
|
|
+COMMENT ON TABLE variables IS 'Template variables';
|
|
|
```
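As an illustration of how the `location` offsets could be applied when a report is generated, here is a minimal Python sketch. It is not the service's implementation. It assumes that `start_offset` and `end_offset` are zero-based, end-exclusive character offsets into the text of the element named by `element_id` (the table definition does not state this), it handles only offset-based locations and ignores `row_index`/`col_index`, and the function name `apply_variables` and the `elements` dictionary are hypothetical.

```python
from collections import defaultdict


def apply_variables(elements, variables, values):
    """Return a copy of `elements` with every variable span replaced.

    elements:  {element_id: text of that element}   (hypothetical shape)
    variables: rows of the `variables` table as dicts with "name" and "location"
    values:    {variable name: replacement text}
    Assumption: offsets are zero-based and end-exclusive.
    """
    by_element = defaultdict(list)
    for var in variables:
        by_element[var["location"]["element_id"]].append(var)

    result = dict(elements)
    for element_id, element_vars in by_element.items():
        text = result[element_id]
        # Work from the rightmost span to the leftmost so that a replacement
        # of a different length never shifts an offset that is still pending.
        element_vars.sort(key=lambda v: v["location"]["start_offset"], reverse=True)
        for var in element_vars:
            loc = var["location"]
            text = text[:loc["start_offset"]] + values[var["name"]] + text[loc["end_offset"]:]
        result[element_id] = text
    return result


if __name__ == "__main__":
    source = "工程名称:襄阳连云 110kV 输变电工程"
    old = "襄阳连云 110kV 输变电工程"
    start = source.index(old)
    variables = [{
        "name": "project_name",
        "location": {"element_id": "elem_001", "type": "text",
                     "start_offset": start, "end_offset": start + len(old)},
    }]
    filled = apply_variables({"elem_001": source}, variables,
                             {"project_name": "武汉东湖 110kV 输变电工程"})
    print(filled["elem_001"])
```

Replacing from right to left matters because the new value is rarely the same length as the example value; going left to right would invalidate the stored offsets of later variables in the same element.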
|
|
|
|
|
|
|
|
-#### 3.2.5 rule_templates (rule template table)
|
|
|
|
|
|
|
+#### 4.2.4 generations (generation task table)
|
|
|
|
|
|
|
|
```sql
|
|
|
-CREATE TABLE rule_templates (
|
|
|
|
|
- id VARCHAR(32) PRIMARY KEY,
|
|
|
|
|
- user_id VARCHAR(32) NOT NULL,
|
|
|
|
|
- name VARCHAR(255) NOT NULL COMMENT '模板名称',
|
|
|
|
|
- description TEXT COMMENT '模板描述',
|
|
|
|
|
-
|
|
|
|
|
- -- 模板内容
|
|
|
|
|
- rules_snapshot JSONB NOT NULL COMMENT '规则配置快照',
|
|
|
|
|
- doc_type_pattern JSONB COMMENT '适用的文档类型模式',
|
|
|
|
|
-
|
|
|
|
|
- -- 统计
|
|
|
|
|
- use_count INT DEFAULT 0 COMMENT '使用次数',
|
|
|
|
|
-
|
|
|
|
|
- -- 元数据
|
|
|
|
|
- is_public BOOLEAN DEFAULT FALSE COMMENT '是否公开',
|
|
|
|
|
- tags JSONB COMMENT '标签',
|
|
|
|
|
- created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
|
|
|
|
|
- updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
|
|
|
|
|
-
|
|
|
|
|
- INDEX idx_user_id (user_id),
|
|
|
|
|
- INDEX idx_is_public (is_public)
|
|
|
|
|
|
|
+CREATE TABLE generations (
+    id VARCHAR(36) PRIMARY KEY,
+    template_id VARCHAR(36) NOT NULL REFERENCES templates(id),
+    user_id VARCHAR(36) NOT NULL,
+
+    name VARCHAR(255), -- task name
+
+    -- Source file mapping: alias → document ID
+    source_file_map JSONB NOT NULL, -- source file mapping
+    -- Example: {"可研批复": "doc_123", "站址报告": "doc_456"}
+
+    -- Variable extraction results
+    variable_values JSONB, -- variable values
+    -- Example:
+    -- {
+    --   "project_name": {
+    --     "value": "武汉东湖 110kV 输变电工程",
+    --     "confidence": 0.96,
+    --     "source_preview": "...",
+    --     "status": "extracted"
+    --   }
+    -- }
+
+    -- Generated document
+    output_document_id VARCHAR(36), -- output document ID
+
+    status VARCHAR(32) DEFAULT 'pending', -- status: pending/extracting/review/completed/error
+    error_message TEXT, -- error message
+
+    create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+    completed_at TIMESTAMP -- completion time
|
|
|
);
|
|
|
|
|
|
|
|
-COMMENT ON TABLE rule_templates IS '提取规则模板';
|
|
|
|
|
|
|
+CREATE INDEX idx_generations_template ON generations(template_id);
|
|
|
|
|
+CREATE INDEX idx_generations_user ON generations(user_id);
|
|
|
|
|
+CREATE INDEX idx_generations_status ON generations(status);
|
|
|
|
|
+
|
|
|
|
|
+COMMENT ON TABLE generations IS 'Report generation tasks';
|
|
|
```
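Before a generation task starts extracting, its `source_file_map` presumably has to be checked against the template's `source_files` rows. The following Python sketch shows one way such a check might look. The function name `validate_source_file_map` and the `doc_types` lookup (document ID to file extension) are hypothetical and not part of the design above.

```python
def validate_source_file_map(source_files, source_file_map, doc_types):
    """Return a list of problems; an empty list means the mapping is usable.

    source_files:    rows of the `source_files` table as dicts
                     ("alias", "required", "file_types")
    source_file_map: {alias: document_id}, as stored in `generations`
    doc_types:       {document_id: file extension}  (hypothetical lookup)
    """
    problems = []
    known_aliases = {sf["alias"] for sf in source_files}

    for sf in source_files:
        alias = sf["alias"]
        doc_id = source_file_map.get(alias)
        if doc_id is None:
            # Optional files may be left out; required ones may not.
            if sf.get("required", True):
                problems.append(f"missing required source file: {alias}")
            continue
        allowed = sf.get("file_types")
        if allowed and doc_types.get(doc_id) not in allowed:
            problems.append(f"wrong file type for {alias}: {doc_types.get(doc_id)}")

    # Aliases the template does not define are most likely typos.
    for alias in source_file_map:
        if alias not in known_aliases:
            problems.append(f"unknown alias: {alias}")

    return problems
```

Returning all problems at once, rather than failing on the first, lets the UI show the user every file that still needs attention.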
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
-## 4. Core Data Structures
|
|
|
|
|
-
|
|
|
|
|
-### 4.1 source_config Detailed Design
|
|
|
|
|
-
|
|
|
|
|
-#### 4.1.1 From a document (document)
|
|
|
|
|
-
|
|
|
|
|
-```java
|
|
|
|
|
-/**
|
|
|
|
|
- * 文档来源配置
|
|
|
|
|
- */
|
|
|
|
|
-@Data
|
|
|
|
|
-public class DocumentSourceConfig {
|
|
|
|
|
- /** 来源文档ID(source_documents 表的 ID) */
|
|
|
|
|
- private String sourceDocId;
|
|
|
|
|
-
|
|
|
|
|
- /** 文档别名(便于显示) */
|
|
|
|
|
- private String documentAlias;
|
|
|
|
|
-
|
|
|
|
|
- /** 定位方式 */
|
|
|
|
|
- private LocationConfig location;
|
|
|
|
|
-}
|
|
|
|
|
|
|
+## 5. Variable Source Types
|
|
|
|
|
|
|
|
-/**
|
|
|
|
|
- * 定位配置
|
|
|
|
|
- */
|
|
|
|
|
-@Data
|
|
|
|
|
-public class LocationConfig {
|
|
|
|
|
- /**
|
|
|
|
|
- * 定位类型
|
|
|
|
|
- * - page: 按页码
|
|
|
|
|
- * - chapter: 按章节
|
|
|
|
|
- * - element: 按元素ID
|
|
|
|
|
- * - excel_cell: 按Excel单元格
|
|
|
|
|
- * - full_document: 全文档
|
|
|
|
|
- */
|
|
|
|
|
- private String type;
|
|
|
|
|
-
|
|
|
|
|
- // === 按页码定位 ===
|
|
|
|
|
- private Integer pageStart;
|
|
|
|
|
- private Integer pageEnd;
|
|
|
|
|
-
|
|
|
|
|
- // === 按章节定位 ===
|
|
|
|
|
- /** 章节路径,如 ["3", "5", "3", "3"] 表示 3.5.3.3 */
|
|
|
|
|
- private List<String> chapterPath;
|
|
|
|
|
- /** 章节标题关键词 */
|
|
|
|
|
- private String chapterTitle;
|
|
|
|
|
-
|
|
|
|
|
- // === 按段落过滤 ===
|
|
|
|
|
- /** 段落范围 [start, end],1-based */
|
|
|
|
|
- private List<Integer> paragraphRange;
|
|
|
|
|
- /** 段落关键词过滤 */
|
|
|
|
|
- private String paragraphKeyword;
|
|
|
|
|
-
|
|
|
|
|
- // === 按元素ID定位 ===
|
|
|
|
|
- /** 直接指定的 DocumentElement ID 列表 */
|
|
|
|
|
- private List<String> elementIds;
|
|
|
|
|
-
|
|
|
|
|
- // === Excel 定位 ===
|
|
|
|
|
- /** Sheet 名称 */
|
|
|
|
|
- private String sheetName;
|
|
|
|
|
- /** 单元格范围,如 "A1:C10" 或 "1.5.1"(自定义格式) */
|
|
|
|
|
- private String cellRef;
|
|
|
|
|
-}
|
|
|
|
|
-```
|
|
|
|
|
|
|
+### 5.1 Source Type Overview
|
|
|
|
|
|
|
|
-**示例:按页码定位**
|
|
|
|
|
|
|
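+| Source type | Description | Typical use |
+| ----------- | ----------- | ----------- |
+| `document` | Extracted from a source file | Information that must come from a document, such as the project name or approval date |
+| `manual` | Entered manually | Information that cannot be obtained automatically, such as contact persons or special remarks |
+| `reference` | References other variables | Combines already extracted values, e.g. "《{project_name}可行性研究报告》" |
+| `fixed` | Fixed value | Fixed text that does not vary between projects |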
|
|
|
|
|
|
|
|
-```json
|
|
|
|
|
-{
|
|
|
|
|
- "sourceDocId": "sd_001",
|
|
|
|
|
- "documentAlias": "可研批复",
|
|
|
|
|
- "location": {
|
|
|
|
|
- "type": "page",
|
|
|
|
|
- "pageStart": 1,
|
|
|
|
|
- "pageEnd": 2,
|
|
|
|
|
- "paragraphKeyword": "(一)建设必要性"
|
|
|
|
|
- }
|
|
|
|
|
-}
|
|
|
|
|
-```
|
|
|
|
|
-
|
|
|
|
|
-**示例:按章节定位**
|
|
|
|
|
|
|
+### 5.2 document Type Configuration
|
|
|
|
|
|
|
|
```json
|
|
|
{
|
|
|
- "sourceDocId": "sd_002",
|
|
|
|
|
- "documentAlias": "站址报告",
|
|
|
|
|
- "location": {
|
|
|
|
|
- "type": "chapter",
|
|
|
|
|
- "chapterPath": ["3", "5", "3", "3"],
|
|
|
|
|
- "chapterTitle": "区域地质及地震概况"
|
|
|
|
|
|
|
+ "source_file_alias": "可研批复",
|
|
|
|
|
+ "source_type": "document",
|
|
|
|
|
+ "source_config": {
|
|
|
|
|
+ "location": {
|
|
|
|
|
+ "type": "page", // page | chapter | element
|
|
|
|
|
+ "pageStart": 1,
|
|
|
|
|
+ "pageEnd": 2,
|
|
|
|
|
+      "paragraphKeyword": null    // optional: filter by paragraph keyword
|
|
|
|
|
+ }
|
|
|
|
|
+ },
|
|
|
|
|
+ "extract_type": "ai_extract",
|
|
|
|
|
+ "extract_config": {
|
|
|
|
|
+ "targetDescription": "从批复文件中提取工程项目的完整名称",
|
|
|
|
|
+ "fieldType": "text",
|
|
|
|
|
+ "expectedFormat": "XX市XX工程",
|
|
|
|
|
+    "examples": ["襄阳连云 110kV 输变电工程"]
|
|
|
}
|
|
|
}
|
|
|
```
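To make the `location` part of a `document` source concrete, the sketch below selects the parsed elements that an extractor would receive for a `page` location. It is an illustration under assumptions, not the actual service code: each element is assumed to be a dict with a 1-based `page` number and a `text` field, the page range is assumed to be inclusive, and `select_elements` is a hypothetical name. The `chapter` and `element` location types are deliberately left out.

```python
def select_elements(elements, location):
    """Pick the elements covered by a "page" location.

    elements: list of dicts with "page" (1-based) and "text"  (assumed shape)
    location: the "location" object from source_config
    """
    if location["type"] != "page":
        raise NotImplementedError(f"location type not covered by this sketch: {location['type']}")

    start, end = location["pageStart"], location["pageEnd"]
    # Assumption: pageStart and pageEnd are both inclusive.
    selected = [e for e in elements if start <= e["page"] <= end]

    # Optional narrowing step: keep only paragraphs containing the keyword.
    keyword = location.get("paragraphKeyword")
    if keyword:
        selected = [e for e in selected if keyword in e["text"]]
    return selected
```

The selected text would then be handed to whichever extraction method `extract_type` names, for example as the context of an AI extraction prompt built from `extract_config`.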
|
|
|
|
|
|
|
|
-**示例:按Excel定位**
|
|
|
|
|
|
|
+### 5.3 manual Type Configuration
|
|
|
|
|
|
|
|
```json
|
|
|
{
|
|
|
- "sourceDocId": "sd_003",
|
|
|
|
|
- "documentAlias": "法规模板",
|
|
|
|
|
- "location": {
|
|
|
|
|
- "type": "excel_cell",
|
|
|
|
|
- "sheetName": "变电站扩建页",
|
|
|
|
|
- "cellRef": "1.5.1"
|
|
|
|
|
|
|
+ "source_type": "manual",
|
|
|
|
|
+ "source_config": {
|
|
|
|
|
+ "placeholder": "请输入项目联系人姓名",
|
|
|
|
|
+ "required": true,
|
|
|
|
|
+ "defaultValue": "",
|
|
|
|
|
+ "inputType": "text", // text | textarea | date | number
|
|
|
|
|
+ "validation": {
|
|
|
|
|
+ "maxLength": 50,
|
|
|
|
|
+ "pattern": "^[\\u4e00-\\u9fa5]{2,10}$"
|
|
|
|
|
+ }
|
|
|
}
|
|
|
}
|
|
|
```
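The `required`, `defaultValue`, and `validation` fields suggest a server-side check of manually entered values. A minimal Python sketch of such a check follows. The function name `validate_manual_input` and the error messages are hypothetical, and the sketch assumes that `pattern` must match the whole value.

```python
import re


def validate_manual_input(value, source_config):
    """Check a manually entered value against a `manual` source_config.

    Returns a list of error messages; an empty list means the value is valid.
    """
    # Fall back to the configured default when nothing was entered.
    value = value or source_config.get("defaultValue", "")
    if not value:
        return ["value is required"] if source_config.get("required") else []

    errors = []
    rules = source_config.get("validation") or {}

    max_length = rules.get("maxLength")
    if max_length is not None and len(value) > max_length:
        errors.append(f"value is longer than {max_length} characters")

    pattern = rules.get("pattern")
    # Assumption: the pattern has to match the entire value.
    if pattern and re.fullmatch(pattern, value) is None:
        errors.append("value does not match the required pattern")

    return errors
```

With the configuration shown above, a two-character Chinese name such as "张三" would pass, while an empty value or a Latin-script name would be rejected.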
|
|
|
|
|
|
|
|
-#### 4.1.2 Reference to extracted fields (self_reference)
|
|
|
|
|
-
|
|
|
|
|
-```java
|
|
|
|
|
-/**
|
|
|
|
|
- * 自引用来源配置
|
|
|
|
|
- */
|
|
|
|
|
-@Data
|
|
|
|
|
-public class SelfReferenceSourceConfig {
|
|
|
|
|
- /** 引用的字段Key */
|
|
|
|
|
- private String referenceFieldKey;
|
|
|
|
|
-
|
|
|
|
|
- /** 引用的字段名称(便于显示) */
|
|
|
|
|
- private String referenceFieldName;
|
|
|
|
|
-
|
|
|
|
|
- /** 多字段引用(用于组合) */
|
|
|
|
|
- private List<String> referenceFieldKeys;
|
|
|
|
|
-
|
|
|
|
|
- /** 组合模板,如 "{project_name}可行性研究报告" */
|
|
|
|
|
- private String combineTemplate;
|
|
|
|
|
-
|
|
|
|
|
- /** 转换规则 */
|
|
|
|
|
- private TransformConfig transform;
|
|
|
|
|
-}
|
|
|
|
|
-
|
|
|
|
|
-/**
|
|
|
|
|
- * 转换配置
|
|
|
|
|
- */
|
|
|
|
|
-@Data
|
|
|
|
|
-public class TransformConfig {
|
|
|
|
|
- /** 转换类型: replace/format/substring */
|
|
|
|
|
- private String type;
|
|
|
|
|
-
|
|
|
|
|
- // === replace 类型 ===
|
|
|
|
|
- private String searchText;
|
|
|
|
|
- private String replaceText;
|
|
|
|
|
-
|
|
|
|
|
- // === format 类型 ===
|
|
|
|
|
- private String formatPattern; // 如日期格式 "yyyy年MM月dd日"
|
|
|
|
|
-
|
|
|
|
|
- // === substring 类型 ===
|
|
|
|
|
- private Integer startIndex;
|
|
|
|
|
- private Integer endIndex;
|
|
|
|
|
-}
|
|
|
|
|
-```
|
|
|
|
|
-
|
|
|
|
|
-**示例:引用并替换**
|
|
|
|
|
|
|
+### 5.4 reference Type Configuration
|
|
|
|
|
|
|
|
```json
|
|
|
{
|
|
|
- "referenceFieldKey": "project_overview",
|
|
|
|
|
- "referenceFieldName": "项目概述",
|
|
|
|
|
- "transform": {
|
|
|
|
|
- "type": "replace",
|
|
|
|
|
- "searchText": "XX项目",
|
|
|
|
|
- "replaceText": "{project_name}"
|
|
|
|
|
|
|
+ "source_type": "reference",
|
|
|
|
|
+ "source_config": {
|
|
|
|
|
+ "referenceVariables": ["project_name", "design_unit", "report_date"],
|
|
|
|
|
+ "combineTemplate": "《{project_name}可行性研究报告》由{design_unit}于{report_date}编制",
|
|
|
|
|
+ "transform": {
|
|
|
|
|
+ "type": "format",
|
|
|
|
|
+ "formatPattern": "{0}"
|
|
|
|
|
+ }
|
|
|
}
|
|
|
}
|
|
|
```
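A `reference` variable is resolved by filling `combineTemplate` with values that were already extracted. The Python sketch below shows one possible way to do that. The name `render_reference` is hypothetical, placeholders are assumed to have the form `{variable_name}`, and the `transform` object is ignored because its semantics are not specified in this section.

```python
import re

# Matches placeholders such as {project_name}.
_PLACEHOLDER = re.compile(r"\{(\w+)\}")


def render_reference(source_config, values):
    """Fill combineTemplate with already extracted variable values.

    source_config: the "source_config" object of a reference variable
    values:        {variable name: extracted value}
    Raises KeyError if a placeholder is not declared or has no value yet.
    """
    declared = set(source_config["referenceVariables"])

    def substitute(match):
        name = match.group(1)
        if name not in declared:
            raise KeyError(f"{name} is not listed in referenceVariables")
        return str(values[name])

    return _PLACEHOLDER.sub(substitute, source_config["combineTemplate"])
```

Because a reference variable depends on other variables, the generation step would need to resolve `document`, `manual`, and `fixed` variables first, or order the variables by their dependencies.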
|
|
|
|
|
|
|
|
-**示例:多字段组合**
|
|
|
|
|
|
|
+### 5.5 fixed Type Configuration
|
|
|
|
|
|
|
|
```json
|
|
|
{
|
|
|
- "referenceFieldKeys": ["project_name", "design_unit", "report_date"],
|
|
|
|
|
- "combineTemplate": "《{project_name}可行性研究报告》由{design_unit}于{report_date}编制"
|
|
|
|
|
-}
|
|
|
|
|
-```
|
|
|
|
|
-
|
|
|
|
|
-#### 4.1.3 Fixed content (fixed)
|
|
|
|
|
-
|
|
|
|
|
-```java
|
|
|
|
|
-/**
|
|
|
|
|
- * 固定内容配置
|
|
|
|
|
- */
|
|
|
|
|
-@Data
|
|
|
|
|
-public class FixedSourceConfig {
|
|
|
|
|
- /** 固定文本内容 */
|
|
|
|
|
- private String content;
|
|
|
|
|
-
|
|
|
|
|
- /** 内容类型 */
|
|
|
|
|
- private String contentType; // text/html/markdown
|
|
|
|
|
-}
|
|
|
|
|
-```
|
|
|
|
|
-
|
|
|
|
|
-**示例**
|
|
|
|
|
-
|
|
|
|
|
-```json
|
|
|
|
|
-{
|
|
|
|
|
- "content": "本报告依据《电力建设工程预算编制办法》(2018版)编制。",
|
|
|
|
|
- "contentType": "text"
|
|
|
|
|
-}
|
|
|
|
|
-```
|
|
|
|
|
-
|
|
|
|
|
-#### 4.1.4 Manual input (manual)
|
|
|
|
|
-
|
|
|
|
|
-```java
|
|
|
|
|
-/**
|
|
|
|
|
- * 手动输入配置
|
|
|
|
|
- */
|
|
|
|
|
-@Data
|
|
|
|
|
-public class ManualSourceConfig {
|
|
|
|
|
- /** 输入提示 */
|
|
|
|
|
- private String placeholder;
|
|
|
|
|
-
|
|
|
|
|
- /** 是否必填 */
|
|
|
|
|
- private Boolean required;
|
|
|
|
|
-
|
|
|
|
|
- /** 默认值 */
|
|
|
|
|
- private String defaultValue;
|
|
|
|
|
-
|
|
|
|
|
- /** 输入类型 */
|
|
|
|
|
- private String inputType; // text/textarea/date/number/select
|
|
|
|
|
-
|
|
|
|
|
- /** 选项列表(inputType=select时) */
|
|
|
|
|
- private List<String> options;
|
|
|
|
|
-
|
|
|
|
|
- /** 校验规则 */
|
|
|
|
|
- private ValidationConfig validation;
|
|
|
|
|
-}
|
|
|
|
|
-```
|
|
|
|
|
-
|
|
|
|
|
-**示例**
|
|
|
|
|
-
|
|
|
|
|
-```json
|
|
|
|
|
-{
|
|
|
|
|
- "placeholder": "请输入项目联系人姓名",
|
|
|
|
|
- "required": true,
|
|
|
|
|
- "inputType": "text",
|
|
|
|
|
- "validation": {
|
|
|
|
|
- "maxLength": 50,
|
|
|
|
|
- "pattern": "^[\\u4e00-\\u9fa5]{2,10}$"
|
|
|
|
|
|
|
+  "source_type": "fixed",
+  "source_config": {
+    "fixedValue": "本报告依据《电力建设工程预算编制办法》(2018版)编制。"
|
|
|
}
|
|
}
|
|
|
}
|
|
}
|
|
|
```
|
|
```
|
|
|
|
|
|
|
|
-### 4.2 extract_config 详细设计
|
|
|
|
|
-
|
|
|
|
|
-#### 4.2.1 直接提取(direct)
|
|
|
|
|
-
|
|
|
|
|
-```java
|
|
|
|
|
-/**
|
|
|
|
|
- * 直接提取配置
|
|
|
|
|
- */
|
|
|
|
|
-@Data
|
|
|
|
|
-public class DirectExtractConfig {
|
|
|
|
|
- /** 是否去除首尾空白 */
|
|
|
|
|
- private Boolean trimWhitespace = true;
|
|
|
|
|
-
|
|
|
|
|
- /** 是否移除换行符 */
|
|
|
|
|
- private Boolean removeLineBreaks = false;
|
|
|
|
|
-
|
|
|
|
|
- /** 是否合并连续空格 */
|
|
|
|
|
- private Boolean mergeSpaces = true;
|
|
|
|
|
-
|
|
|
|
|
- /** 保留的HTML标签(如需保留格式) */
|
|
|
|
|
- private List<String> preserveTags;
|
|
|
|
|
-}
|
|
|
|
|
-```
|
|
|
|
|
|
|
+---
|
|
|
|
|
|
|
|
-#### 4.2.2 AI提取(ai_extract)
|
|
|
|
|
-
|
|
|
|
|
-```java
|
|
|
|
|
-/**
|
|
|
|
|
- * AI 字段提取配置
|
|
|
|
|
- */
|
|
|
|
|
-@Data
|
|
|
|
|
-public class AIExtractConfig {
|
|
|
|
|
- /** 提取目标描述 */
|
|
|
|
|
- private String targetDescription;
|
|
|
|
|
-
|
|
|
|
|
- /** 字段类型 */
|
|
|
|
|
- private String fieldType; // text/date/number/person/org/location
|
|
|
|
|
-
|
|
|
|
|
- /** 预期格式描述 */
|
|
|
|
|
- private String expectedFormat;
|
|
|
|
|
-
|
|
|
|
|
- /** 示例值 */
|
|
|
|
|
- private List<String> examples;
|
|
|
|
|
-
|
|
|
|
|
- /** 是否返回多个结果 */
|
|
|
|
|
- private Boolean multipleResults = false;
|
|
|
|
|
-
|
|
|
|
|
- /** 自定义提示词(高级) */
|
|
|
|
|
- private String customPrompt;
|
|
|
|
|
-}
|
|
|
|
|
-```
|
|
|
|
|
|
|
+## 6. Extraction types
|
|
|
|
|
|
|
|
-**示例:提取工程名称**
|
|
|
|
|
|
|
+### 6.1 Extraction type overview
|
|
|
|
|
|
|
|
-```json
|
|
|
|
|
-{
|
|
|
|
|
- "targetDescription": "从批复文件中提取工程项目的完整名称",
|
|
|
|
|
- "fieldType": "text",
|
|
|
|
|
- "expectedFormat": "XX市XX工程",
|
|
|
|
|
- "examples": ["襄阳连云220千伏输变电工程", "武汉东湖110千伏输变电工程"]
|
|
|
|
|
-}
|
|
|
|
|
-```
|
|
|
|
|
|
|
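+| Extraction type | Description | Typical use |
+| -------- | ---- | -------- |
+| `direct` | Direct extraction | Use the located content as-is |
+| `ai_extract` | AI field extraction | Extract a specific field from a passage of text |
+| `ai_summarize` | AI summarization | Summarize and condense the content |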
|
|
|
|
|
|
|
|
-**示例:提取日期**
|
|
|
|
|
|
|
+### 6.2 `direct` configuration
|
|
|
|
|
|
|
|
```json
|
|
```json
|
|
|
{
|
|
{
|
|
|
- "targetDescription": "提取可研报告的批复日期",
|
|
|
|
|
- "fieldType": "date",
|
|
|
|
|
- "expectedFormat": "YYYY年MM月DD日",
|
|
|
|
|
- "examples": ["2024年5月15日", "2023年12月1日"]
|
|
|
|
|
-}
|
|
|
|
|
-```
|
|
|
|
|
-
|
|
|
|
|
-#### 4.2.3 AI总结(ai_summarize)
|
|
|
|
|
-
|
|
|
|
|
-```java
|
|
|
|
|
-/**
|
|
|
|
|
- * AI 总结/提炼配置
|
|
|
|
|
- */
|
|
|
|
|
-@Data
|
|
|
|
|
-public class AISummarizeConfig {
|
|
|
|
|
- /** 总结提示词 */
|
|
|
|
|
- private String summarizePrompt;
|
|
|
|
|
-
|
|
|
|
|
- /** 总结维度/角度 */
|
|
|
|
|
- private List<String> focusPoints;
|
|
|
|
|
-
|
|
|
|
|
- /** 总结规则 */
|
|
|
|
|
- private List<String> rules;
|
|
|
|
|
-
|
|
|
|
|
- /** 输出风格 */
|
|
|
|
|
- private String style; // formal/concise/detailed/bullet_points
|
|
|
|
|
-
|
|
|
|
|
- /** 最大字数 */
|
|
|
|
|
- private Integer maxLength;
|
|
|
|
|
-
|
|
|
|
|
- /** 是否保留关键数据 */
|
|
|
|
|
- private Boolean preserveKeyData = true;
|
|
|
|
|
-
|
|
|
|
|
- /** 引用的上下文字段(作为参考) */
|
|
|
|
|
- private List<String> contextFieldKeys;
|
|
|
|
|
|
|
+  "extract_type": "direct",
+  "extract_config": {
+    "trimWhitespace": true,
+    "removeLineBreaks": false,
+    "mergeSpaces": true
+  }
|
|
|
}
|
|
}
|
|
|
```
|
|
```
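
The three `direct` flags can be read as a small text clean-up pipeline. The sketch below is one possible interpretation, not the system's confirmed behavior: the class name `DirectTextCleaner`, the order in which the flags are applied, and the choice to replace a line break with a single space are all assumptions.

```java
/** Hypothetical helper (not part of the design): applies the three direct-extraction flags. */
final class DirectTextCleaner {
    static String clean(String raw, boolean trimWhitespace,
                        boolean removeLineBreaks, boolean mergeSpaces) {
        String text = raw;
        if (removeLineBreaks) {
            // \R matches \r\n, \r or \n as one unit; a space keeps adjacent words apart.
            text = text.replaceAll("\\R", " ");
        }
        if (mergeSpaces) {
            // Collapse runs of spaces and tabs; line breaks are not touched here.
            text = text.replaceAll("[ \\t]+", " ");
        }
        if (trimWhitespace) {
            // Remove leading and trailing whitespace (Java 11+).
            text = text.strip();
        }
        return text;
    }

    public static void main(String[] args) {
        // Mirrors the defaults in the JSON above: trim = true, removeLineBreaks = false, merge = true.
        System.out.println(clean("  220 kV    substation\n  project  ", true, false, true));
    }
}
```

For Chinese source text, which normally has no spaces between words, replacing a line break with an empty string instead of a space may be the better choice.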
|
|
|
|
|
|
|
|
-**示例:总结建设必要性**
|
|
|
|
|
|
|
+### 6.3 `ai_extract` configuration
|
|
|
|
|
|
|
|
```json
|
|
```json
|
|
|
{
|
|
{
|
|
|
- "summarizePrompt": "请对以下内容进行总结,重点描述项目建设的必要性",
|
|
|
|
|
- "focusPoints": ["建设背景", "现状问题", "建设目的"],
|
|
|
|
|
- "rules": [
|
|
|
|
|
- "使用正式的工程报告语言",
|
|
|
|
|
- "保留关键的数据和指标",
|
|
|
|
|
- "控制在200字以内"
|
|
|
|
|
- ],
|
|
|
|
|
- "style": "formal",
|
|
|
|
|
- "maxLength": 200
|
|
|
|
|
|
|
+  "extract_type": "ai_extract",
+  "extract_config": {
+    "targetDescription": "从批复文件中提取可研批复的日期",
+    "fieldType": "date",
+    "expectedFormat": "YYYY年MM月DD日",
+    "examples": ["2024年5月15日", "2023年12月1日"]
+  }
|
|
|
}
|
|
}
|
|
|
```
|
|
```
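
Because an AI-extracted value is free text, a caller may want to check that a `date` field really parses before accepting it. The following sketch shows one way to do that with `java.time`; it is an illustration only. The class name `ExtractedDateCheck` is hypothetical, and the pattern uses single `M` and `d` letters so that both padded and unpadded values (as in the `examples` above) are accepted. The CJK date characters are written as Unicode escapes to keep the source ASCII.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

/** Hypothetical helper (not part of the design): validates an extracted date string. */
final class ExtractedDateCheck {
    // Year, U+5E74, month, U+6708, day, U+65E5. Non-ASCII characters act as literals in the pattern.
    private static final DateTimeFormatter CN_DATE =
            DateTimeFormatter.ofPattern("yyyy\u5e74M\u6708d\u65e5");

    /** Returns the parsed date, or empty if the text does not match the expected shape. */
    static Optional<LocalDate> parse(String extracted) {
        try {
            return Optional.of(LocalDate.parse(extracted.strip(), CN_DATE));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("2024\u5e745\u670815\u65e5").isPresent());
        System.out.println(parse("2024-05-15").isPresent());
    }
}
```

A value that fails to parse could be flagged for manual correction in the generation task rather than silently written into the report.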
|
|
|
|
|
|
|
|
-**示例:带提炼规则的总结**
|
|
|
|
|
|
|
+### 6.4 `ai_summarize` configuration
|
|
|
|
|
|
|
|
```json
|
|
```json
|
|
|
{
|
|
{
|
|
|
- "summarizePrompt": "以工程选址的角度,总结站址的地质条件",
|
|
|
|
|
- "focusPoints": ["地质构造", "地震烈度", "岩土条件", "地下水情况"],
|
|
|
|
|
- "rules": [
|
|
|
|
|
- "先概述整体地质环境",
|
|
|
|
|
- "重点说明对工程的影响",
|
|
|
|
|
- "给出适宜性评价"
|
|
|
|
|
- ],
|
|
|
|
|
- "style": "formal",
|
|
|
|
|
- "maxLength": 300,
|
|
|
|
|
- "contextFieldKeys": ["project_location", "project_type"]
|
|
|
|
|
-}
|
|
|
|
|
-```
|
|
|
|
|
-
|
|
|
|
|
-#### 4.2.4 OCR识别(ocr)
|
|
|
|
|
-
|
|
|
|
|
-```java
|
|
|
|
|
-/**
|
|
|
|
|
- * OCR 识别配置
|
|
|
|
|
- */
|
|
|
|
|
-@Data
|
|
|
|
|
-public class OcrExtractConfig {
|
|
|
|
|
- /** OCR 后是否进行 AI 提取 */
|
|
|
|
|
- private Boolean aiPostProcess = true;
|
|
|
|
|
-
|
|
|
|
|
- /** AI 后处理配置 */
|
|
|
|
|
- private AIExtractConfig aiConfig;
|
|
|
|
|
-
|
|
|
|
|
- /** 图像预处理 */
|
|
|
|
|
- private ImagePreprocessConfig preprocess;
|
|
|
|
|
-}
|
|
|
|
|
-
|
|
|
|
|
-/**
|
|
|
|
|
- * 图像预处理配置
|
|
|
|
|
- */
|
|
|
|
|
-@Data
|
|
|
|
|
-public class ImagePreprocessConfig {
|
|
|
|
|
- private Boolean deskew = true; // 纠偏
|
|
|
|
|
- private Boolean denoise = true; // 去噪
|
|
|
|
|
- private Boolean binarize = false; // 二值化
|
|
|
|
|
- private Integer contrast = 0; // 对比度调整
|
|
|
|
|
-}
|
|
|
|
|
-```
|
|
|
|
|
-
|
|
|
|
|
----
|
|
|
|
|
-
|
|
|
|
|
-## 五、核心实体类设计
|
|
|
|
|
-
|
|
|
|
|
-### 5.1 新增模块结构
|
|
|
|
|
-
|
|
|
|
|
-```
|
|
|
|
|
-backend/extract-service/
|
|
|
|
|
-├── pom.xml
|
|
|
|
|
-└── src/main/java/com/lingyue/extract/
|
|
|
|
|
- ├── ExtractServiceApplication.java
|
|
|
|
|
- ├── config/
|
|
|
|
|
- │ └── ExtractConfig.java
|
|
|
|
|
- ├── controller/
|
|
|
|
|
- │ ├── ProjectController.java
|
|
|
|
|
- │ ├── SourceDocumentController.java
|
|
|
|
|
- │ ├── ExtractRuleController.java
|
|
|
|
|
- │ └── ExtractExecuteController.java
|
|
|
|
|
- ├── dto/
|
|
|
|
|
- │ ├── request/
|
|
|
|
|
- │ │ ├── CreateProjectRequest.java
|
|
|
|
|
- │ │ ├── UpdateProjectRequest.java
|
|
|
|
|
- │ │ ├── AddSourceDocumentRequest.java
|
|
|
|
|
- │ │ ├── CreateRuleRequest.java
|
|
|
|
|
- │ │ ├── UpdateRuleRequest.java
|
|
|
|
|
- │ │ ├── BatchCreateRulesRequest.java
|
|
|
|
|
- │ │ ├── ExecuteRulesRequest.java
|
|
|
|
|
- │ │ └── ConfirmResultRequest.java
|
|
|
|
|
- │ ├── response/
|
|
|
|
|
- │ │ ├── ProjectDetailResponse.java
|
|
|
|
|
- │ │ ├── RuleListResponse.java
|
|
|
|
|
- │ │ ├── ExtractPreviewResponse.java
|
|
|
|
|
- │ │ ├── ExecuteProgressResponse.java
|
|
|
|
|
- │ │ └── ExtractResultResponse.java
|
|
|
|
|
- │ └── config/
|
|
|
|
|
- │ ├── SourceConfig.java
|
|
|
|
|
- │ ├── DocumentSourceConfig.java
|
|
|
|
|
- │ ├── SelfReferenceSourceConfig.java
|
|
|
|
|
- │ ├── FixedSourceConfig.java
|
|
|
|
|
- │ ├── ManualSourceConfig.java
|
|
|
|
|
- │ ├── LocationConfig.java
|
|
|
|
|
- │ ├── ExtractConfig.java
|
|
|
|
|
- │ ├── DirectExtractConfig.java
|
|
|
|
|
- │ ├── AIExtractConfig.java
|
|
|
|
|
- │ ├── AISummarizeConfig.java
|
|
|
|
|
- │ └── OcrExtractConfig.java
|
|
|
|
|
- ├── entity/
|
|
|
|
|
- │ ├── Project.java
|
|
|
|
|
- │ ├── SourceDocument.java
|
|
|
|
|
- │ ├── ExtractRule.java
|
|
|
|
|
- │ ├── ExtractResult.java
|
|
|
|
|
- │ └── RuleTemplate.java
|
|
|
|
|
- ├── repository/
|
|
|
|
|
- │ ├── ProjectRepository.java
|
|
|
|
|
- │ ├── SourceDocumentRepository.java
|
|
|
|
|
- │ ├── ExtractRuleRepository.java
|
|
|
|
|
- │ ├── ExtractResultRepository.java
|
|
|
|
|
- │ └── RuleTemplateRepository.java
|
|
|
|
|
- ├── service/
|
|
|
|
|
- │ ├── ProjectService.java
|
|
|
|
|
- │ ├── SourceDocumentService.java
|
|
|
|
|
- │ ├── ExtractRuleService.java
|
|
|
|
|
- │ ├── ExtractExecuteService.java
|
|
|
|
|
- │ ├── ContentLocatorService.java
|
|
|
|
|
- │ ├── AIExtractService.java
|
|
|
|
|
- │ └── RuleTemplateService.java
|
|
|
|
|
- └── executor/
|
|
|
|
|
- ├── ExtractExecutor.java
|
|
|
|
|
- ├── DirectExtractExecutor.java
|
|
|
|
|
- ├── AIExtractExecutor.java
|
|
|
|
|
- ├── AISummarizeExecutor.java
|
|
|
|
|
- └── OcrExtractExecutor.java
|
|
|
|
|
-```
|
|
|
|
|
-
|
|
|
|
|
-### 5.2 实体类定义
|
|
|
|
|
-
|
|
|
|
|
-#### 5.2.1 Project.java
|
|
|
|
|
-
|
|
|
|
|
-```java
|
|
|
|
|
-package com.lingyue.extract.entity;
|
|
|
|
|
-
|
|
|
|
|
-import com.baomidou.mybatisplus.annotation.TableField;
|
|
|
|
|
-import com.baomidou.mybatisplus.annotation.TableName;
|
|
|
|
|
-import com.lingyue.common.domain.entity.SimpleModel;
|
|
|
|
|
-import com.lingyue.common.mybatis.PostgreSqlJsonbTypeHandler;
|
|
|
|
|
-import io.swagger.v3.oas.annotations.media.Schema;
|
|
|
|
|
-import lombok.Data;
|
|
|
|
|
-import lombok.EqualsAndHashCode;
|
|
|
|
|
-
|
|
|
|
|
-/**
|
|
|
|
|
- * 数据提取项目实体
|
|
|
|
|
- *
|
|
|
|
|
- * @author lingyue
|
|
|
|
|
- * @since 2026-01-22
|
|
|
|
|
- */
|
|
|
|
|
-@EqualsAndHashCode(callSuper = true)
|
|
|
|
|
-@Data
|
|
|
|
|
-@TableName(value = "projects", autoResultMap = true)
|
|
|
|
|
-@Schema(description = "数据提取项目")
|
|
|
|
|
-public class Project extends SimpleModel {
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "用户ID")
|
|
|
|
|
- @TableField("user_id")
|
|
|
|
|
- private String userId;
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "项目名称")
|
|
|
|
|
- @TableField("name")
|
|
|
|
|
- private String name;
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "项目描述")
|
|
|
|
|
- @TableField("description")
|
|
|
|
|
- private String description;
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "状态", example = "draft/extracting/completed/archived")
|
|
|
|
|
- @TableField("status")
|
|
|
|
|
- private String status = "draft";
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "项目配置")
|
|
|
|
|
- @TableField(value = "config", typeHandler = PostgreSqlJsonbTypeHandler.class)
|
|
|
|
|
- private Object config;
|
|
|
|
|
-
|
|
|
|
|
- // ===== 状态常量 =====
|
|
|
|
|
- public static final String STATUS_DRAFT = "draft";
|
|
|
|
|
- public static final String STATUS_EXTRACTING = "extracting";
|
|
|
|
|
- public static final String STATUS_COMPLETED = "completed";
|
|
|
|
|
- public static final String STATUS_ARCHIVED = "archived";
|
|
|
|
|
-}
|
|
|
|
|
-```
|
|
|
|
|
-
|
|
|
|
|
-#### 5.2.2 SourceDocument.java
|
|
|
|
|
-
|
|
|
|
|
-```java
|
|
|
|
|
-package com.lingyue.extract.entity;
|
|
|
|
|
-
|
|
|
|
|
-import com.baomidou.mybatisplus.annotation.TableField;
|
|
|
|
|
-import com.baomidou.mybatisplus.annotation.TableName;
|
|
|
|
|
-import com.lingyue.common.domain.entity.SimpleModel;
|
|
|
|
|
-import com.lingyue.common.mybatis.PostgreSqlJsonbTypeHandler;
|
|
|
|
|
-import io.swagger.v3.oas.annotations.media.Schema;
|
|
|
|
|
-import lombok.Data;
|
|
|
|
|
-import lombok.EqualsAndHashCode;
|
|
|
|
|
-
|
|
|
|
|
-/**
|
|
|
|
|
- * 来源文档实体
|
|
|
|
|
- * 项目中用到的文档,关联已解析的 Document
|
|
|
|
|
- *
|
|
|
|
|
- * @author lingyue
|
|
|
|
|
- * @since 2026-01-22
|
|
|
|
|
- */
|
|
|
|
|
-@EqualsAndHashCode(callSuper = true)
|
|
|
|
|
-@Data
|
|
|
|
|
-@TableName(value = "source_documents", autoResultMap = true)
|
|
|
|
|
-@Schema(description = "来源文档")
|
|
|
|
|
-public class SourceDocument extends SimpleModel {
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "项目ID")
|
|
|
|
|
- @TableField("project_id")
|
|
|
|
|
- private String projectId;
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "关联的 Document ID")
|
|
|
|
|
- @TableField("document_id")
|
|
|
|
|
- private String documentId;
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "文档别名")
|
|
|
|
|
- @TableField("alias")
|
|
|
|
|
- private String alias;
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "文档类型", example = "pdf/docx/xlsx")
|
|
|
|
|
- @TableField("doc_type")
|
|
|
|
|
- private String docType;
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "显示顺序")
|
|
|
|
|
- @TableField("display_order")
|
|
|
|
|
- private Integer displayOrder = 0;
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "元数据")
|
|
|
|
|
- @TableField(value = "metadata", typeHandler = PostgreSqlJsonbTypeHandler.class)
|
|
|
|
|
- private Object metadata;
|
|
|
|
|
-}
|
|
|
|
|
-```
|
|
|
|
|
-
|
|
|
|
|
-#### 5.2.3 ExtractRule.java
|
|
|
|
|
-
|
|
|
|
|
-```java
|
|
|
|
|
-package com.lingyue.extract.entity;
|
|
|
|
|
-
|
|
|
|
|
-import com.baomidou.mybatisplus.annotation.TableField;
|
|
|
|
|
-import com.baomidou.mybatisplus.annotation.TableName;
|
|
|
|
|
-import com.lingyue.common.domain.entity.SimpleModel;
|
|
|
|
|
-import com.lingyue.common.mybatis.PostgreSqlJsonbTypeHandler;
|
|
|
|
|
-import io.swagger.v3.oas.annotations.media.Schema;
|
|
|
|
|
-import lombok.Data;
|
|
|
|
|
-import lombok.EqualsAndHashCode;
|
|
|
|
|
-
|
|
|
|
|
-/**
|
|
|
|
|
- * 提取规则实体
|
|
|
|
|
- * 描述如何从来源文档中提取数据的配置
|
|
|
|
|
- *
|
|
|
|
|
- * @author lingyue
|
|
|
|
|
- * @since 2026-01-22
|
|
|
|
|
- */
|
|
|
|
|
-@EqualsAndHashCode(callSuper = true)
|
|
|
|
|
-@Data
|
|
|
|
|
-@TableName(value = "extract_rules", autoResultMap = true)
|
|
|
|
|
-@Schema(description = "提取规则")
|
|
|
|
|
-public class ExtractRule extends SimpleModel {
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "项目ID")
|
|
|
|
|
- @TableField("project_id")
|
|
|
|
|
- private String projectId;
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "来源文档ID")
|
|
|
|
|
- @TableField("source_doc_id")
|
|
|
|
|
- private String sourceDocId;
|
|
|
|
|
-
|
|
|
|
|
- // ===== 目标字段 =====
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "目标字段Key(程序用)")
|
|
|
|
|
- @TableField("target_field_key")
|
|
|
|
|
- private String targetFieldKey;
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "目标字段名称(显示用)")
|
|
|
|
|
- @TableField("target_field_name")
|
|
|
|
|
- private String targetFieldName;
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "字段分组")
|
|
|
|
|
- @TableField("target_field_group")
|
|
|
|
|
- private String targetFieldGroup;
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "规则顺序")
|
|
|
|
|
- @TableField("rule_index")
|
|
|
|
|
- private Integer ruleIndex;
|
|
|
|
|
-
|
|
|
|
|
- // ===== 来源配置 =====
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "来源类型", example = "document/self_reference/fixed/manual")
|
|
|
|
|
- @TableField("source_type")
|
|
|
|
|
- private String sourceType;
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "来源配置")
|
|
|
|
|
- @TableField(value = "source_config", typeHandler = PostgreSqlJsonbTypeHandler.class)
|
|
|
|
|
- private Object sourceConfig;
|
|
|
|
|
-
|
|
|
|
|
- // ===== 提取配置 =====
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "提取类型", example = "direct/ai_extract/ai_summarize/ocr")
|
|
|
|
|
- @TableField("extract_type")
|
|
|
|
|
- private String extractType;
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "提取配置")
|
|
|
|
|
- @TableField(value = "extract_config", typeHandler = PostgreSqlJsonbTypeHandler.class)
|
|
|
|
|
- private Object extractConfig;
|
|
|
|
|
-
|
|
|
|
|
- // ===== 结果 =====
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "状态", example = "pending/extracting/extracted/confirmed/error")
|
|
|
|
|
- @TableField("status")
|
|
|
|
|
- private String status = STATUS_PENDING;
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "提取出的值")
|
|
|
|
|
- @TableField("extracted_value")
|
|
|
|
|
- private String extractedValue;
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "值类型", example = "text/table/image/list")
|
|
|
|
|
- @TableField("value_type")
|
|
|
|
|
- private String valueType = "text";
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "错误信息")
|
|
|
|
|
- @TableField("error_message")
|
|
|
|
|
- private String errorMessage;
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "元数据")
|
|
|
|
|
- @TableField(value = "metadata", typeHandler = PostgreSqlJsonbTypeHandler.class)
|
|
|
|
|
- private Object metadata;
|
|
|
|
|
-
|
|
|
|
|
- // ===== 常量 =====
|
|
|
|
|
-
|
|
|
|
|
- // 来源类型
|
|
|
|
|
- public static final String SOURCE_DOCUMENT = "document";
|
|
|
|
|
- public static final String SOURCE_SELF_REFERENCE = "self_reference";
|
|
|
|
|
- public static final String SOURCE_FIXED = "fixed";
|
|
|
|
|
- public static final String SOURCE_MANUAL = "manual";
|
|
|
|
|
-
|
|
|
|
|
- // 提取类型
|
|
|
|
|
- public static final String EXTRACT_DIRECT = "direct";
|
|
|
|
|
- public static final String EXTRACT_AI_EXTRACT = "ai_extract";
|
|
|
|
|
- public static final String EXTRACT_AI_SUMMARIZE = "ai_summarize";
|
|
|
|
|
- public static final String EXTRACT_OCR = "ocr";
|
|
|
|
|
-
|
|
|
|
|
- // 状态
|
|
|
|
|
- public static final String STATUS_PENDING = "pending";
|
|
|
|
|
- public static final String STATUS_EXTRACTING = "extracting";
|
|
|
|
|
- public static final String STATUS_EXTRACTED = "extracted";
|
|
|
|
|
- public static final String STATUS_CONFIRMED = "confirmed";
|
|
|
|
|
- public static final String STATUS_ERROR = "error";
|
|
|
|
|
-}
|
|
|
|
|
-```
|
|
|
|
|
-
|
|
|
|
|
-#### 5.2.4 ExtractResult.java
|
|
|
|
|
-
|
|
|
|
|
-```java
|
|
|
|
|
-package com.lingyue.extract.entity;
|
|
|
|
|
-
|
|
|
|
|
-import com.baomidou.mybatisplus.annotation.TableField;
|
|
|
|
|
-import com.baomidou.mybatisplus.annotation.TableName;
|
|
|
|
|
-import com.lingyue.common.domain.entity.SimpleModel;
|
|
|
|
|
-import com.lingyue.common.mybatis.PostgreSqlJsonbTypeHandler;
|
|
|
|
|
-import io.swagger.v3.oas.annotations.media.Schema;
|
|
|
|
|
-import lombok.Data;
|
|
|
|
|
-import lombok.EqualsAndHashCode;
|
|
|
|
|
-
|
|
|
|
|
-import java.time.LocalDateTime;
|
|
|
|
|
-
|
|
|
|
|
-/**
|
|
|
|
|
- * 提取结果实体
|
|
|
|
|
- * 记录每次提取的详细结果,支持历史追溯
|
|
|
|
|
- *
|
|
|
|
|
- * @author lingyue
|
|
|
|
|
- * @since 2026-01-22
|
|
|
|
|
- */
|
|
|
|
|
-@EqualsAndHashCode(callSuper = true)
|
|
|
|
|
-@Data
|
|
|
|
|
-@TableName(value = "extract_results", autoResultMap = true)
|
|
|
|
|
-@Schema(description = "提取结果")
|
|
|
|
|
-public class ExtractResult extends SimpleModel {
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "规则ID")
|
|
|
|
|
- @TableField("rule_id")
|
|
|
|
|
- private String ruleId;
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "项目ID")
|
|
|
|
|
- @TableField("project_id")
|
|
|
|
|
- private String projectId;
|
|
|
|
|
-
|
|
|
|
|
- // ===== 提取结果 =====
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "提取出的值")
|
|
|
|
|
- @TableField("extracted_value")
|
|
|
|
|
- private String extractedValue;
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "值类型")
|
|
|
|
|
- @TableField("value_type")
|
|
|
|
|
- private String valueType = "text";
|
|
|
|
|
-
|
|
|
|
|
- // ===== 来源追溯 =====
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "来源原文内容")
|
|
|
|
|
- @TableField("source_content")
|
|
|
|
|
- private String sourceContent;
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "来源位置信息")
|
|
|
|
|
- @TableField(value = "source_location", typeHandler = PostgreSqlJsonbTypeHandler.class)
|
|
|
|
|
- private Object sourceLocation;
|
|
|
|
|
-
|
|
|
|
|
- // ===== 质量评估 =====
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "AI提取的置信度 0-1")
|
|
|
|
|
- @TableField("confidence")
|
|
|
|
|
- private Double confidence;
|
|
|
|
|
-
|
|
|
|
|
- // ===== 状态 =====
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "状态", example = "extracted/confirmed/rejected/modified")
|
|
|
|
|
- @TableField("status")
|
|
|
|
|
- private String status = STATUS_EXTRACTED;
|
|
|
|
|
-
|
|
|
|
|
- // ===== 人工处理 =====
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "人工修正后的值")
|
|
|
|
|
- @TableField("modified_value")
|
|
|
|
|
- private String modifiedValue;
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "确认时间")
|
|
|
|
|
- @TableField("confirmed_at")
|
|
|
|
|
- private LocalDateTime confirmedAt;
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "确认人")
|
|
|
|
|
- @TableField("confirmed_by")
|
|
|
|
|
- private String confirmedBy;
|
|
|
|
|
-
|
|
|
|
|
- @Schema(description = "元数据")
|
|
|
|
|
- @TableField(value = "metadata", typeHandler = PostgreSqlJsonbTypeHandler.class)
|
|
|
|
|
- private Object metadata;
|
|
|
|
|
-
|
|
|
|
|
- // ===== 常量 =====
|
|
|
|
|
- public static final String STATUS_EXTRACTED = "extracted";
|
|
|
|
|
- public static final String STATUS_CONFIRMED = "confirmed";
|
|
|
|
|
- public static final String STATUS_REJECTED = "rejected";
|
|
|
|
|
- public static final String STATUS_MODIFIED = "modified";
|
|
|
|
|
-
|
|
|
|
|
- /**
|
|
|
|
|
- * 获取最终值(优先使用修正值)
|
|
|
|
|
- */
|
|
|
|
|
- public String getFinalValue() {
|
|
|
|
|
- return modifiedValue != null ? modifiedValue : extractedValue;
|
|
|
|
|
- }
|
|
|
|
|
-}
|
|
|
|
|
-```
|
|
|
|
|
-
|
|
|
|
|
----
|
|
|
|
|
-
|
|
|
|
|
-## 六、核心服务设计
|
|
|
|
|
-
|
|
|
|
|
-### 6.1 ContentLocatorService(内容定位服务)
|
|
|
|
|
-
|
|
|
|
|
-负责根据 `LocationConfig` 从文档中定位并提取内容。
|
|
|
|
|
-
|
|
|
|
|
-```java
|
|
|
|
|
-package com.lingyue.extract.service;
|
|
|
|
|
-
|
|
|
|
|
-import com.lingyue.document.entity.DocumentElement;
|
|
|
|
|
-import com.lingyue.extract.dto.config.LocationConfig;
|
|
|
|
|
-import java.util.List;
|
|
|
|
|
-
|
|
|
|
|
-/**
|
|
|
|
|
- * 内容定位服务
|
|
|
|
|
- * 根据定位配置从文档中提取内容
|
|
|
|
|
- *
|
|
|
|
|
- * @author lingyue
|
|
|
|
|
- * @since 2026-01-22
|
|
|
|
|
- */
|
|
|
|
|
-public interface ContentLocatorService {
|
|
|
|
|
-
|
|
|
|
|
- /**
|
|
|
|
|
- * 根据定位配置获取文档元素
|
|
|
|
|
- *
|
|
|
|
|
- * @param documentId 文档ID
|
|
|
|
|
- * @param location 定位配置
|
|
|
|
|
- * @return 匹配的文档元素列表
|
|
|
|
|
- */
|
|
|
|
|
- List<DocumentElement> locateElements(String documentId, LocationConfig location);
|
|
|
|
|
-
|
|
|
|
|
- /**
|
|
|
|
|
- * 根据定位配置获取文本内容
|
|
|
|
|
- *
|
|
|
|
|
- * @param documentId 文档ID
|
|
|
|
|
- * @param location 定位配置
|
|
|
|
|
- * @return 提取的文本内容
|
|
|
|
|
- */
|
|
|
|
|
- String locateContent(String documentId, LocationConfig location);
|
|
|
|
|
-
|
|
|
|
|
- /**
|
|
|
|
|
- * 按页码定位
|
|
|
|
|
- */
|
|
|
|
|
- List<DocumentElement> locateByPage(String documentId, int pageStart, int pageEnd, String keyword);
|
|
|
|
|
-
|
|
|
|
|
- /**
|
|
|
|
|
- * 按章节定位
|
|
|
|
|
- */
|
|
|
|
|
- List<DocumentElement> locateByChapter(String documentId, List<String> chapterPath, String chapterTitle);
|
|
|
|
|
-
|
|
|
|
|
- /**
|
|
|
|
|
- * 按元素ID定位
|
|
|
|
|
- */
|
|
|
|
|
- List<DocumentElement> locateByElementIds(String documentId, List<String> elementIds);
|
|
|
|
|
-
|
|
|
|
|
- /**
|
|
|
|
|
- * Excel 单元格定位
|
|
|
|
|
- */
|
|
|
|
|
- String locateExcelCell(String documentId, String sheetName, String cellRef);
|
|
|
|
|
-}
|
|
|
|
|
-```
|
|
|
|
|
-
|
|
|
|
|
-### 6.2 ExtractExecuteService(提取执行服务)
|
|
|
|
|
-
|
|
|
|
|
-负责协调执行提取任务。
|
|
|
|
|
-
|
|
|
|
|
-```java
|
|
|
|
|
-package com.lingyue.extract.service;
|
|
|
|
|
-
|
|
|
|
|
-import com.lingyue.extract.dto.response.ExecuteProgressResponse;
|
|
|
|
|
-import com.lingyue.extract.dto.response.ExtractResultResponse;
|
|
|
|
|
-import com.lingyue.extract.entity.ExtractResult;
|
|
|
|
|
-import com.lingyue.extract.entity.ExtractRule;
|
|
|
|
|
-
|
|
|
|
|
-import java.util.List;
|
|
|
|
|
-
|
|
|
|
|
-/**
|
|
|
|
|
- * 提取执行服务
|
|
|
|
|
- *
|
|
|
|
|
- * @author lingyue
|
|
|
|
|
- * @since 2026-01-22
|
|
|
|
|
- */
|
|
|
|
|
-public interface ExtractExecuteService {
|
|
|
|
|
-
|
|
|
|
|
- /**
|
|
|
|
|
- * 执行单条规则
|
|
|
|
|
- *
|
|
|
|
|
- * @param ruleId 规则ID
|
|
|
|
|
- * @return 提取结果
|
|
|
|
|
- */
|
|
|
|
|
- ExtractResult executeRule(String ruleId);
|
|
|
|
|
-
|
|
|
|
|
- /**
|
|
|
|
|
- * 执行指定规则列表
|
|
|
|
|
- *
|
|
|
|
|
- * @param ruleIds 规则ID列表
|
|
|
|
|
- * @return 提取结果列表
|
|
|
|
|
- */
|
|
|
|
|
- List<ExtractResult> executeRules(List<String> ruleIds);
|
|
|
|
|
-
|
|
|
|
|
- /**
|
|
|
|
|
- * 执行项目的所有规则
|
|
|
|
|
- *
|
|
|
|
|
- * @param projectId 项目ID
|
|
|
|
|
- * @return 提取结果列表
|
|
|
|
|
- */
|
|
|
|
|
- List<ExtractResult> executeProject(String projectId);
|
|
|
|
|
-
|
|
|
|
|
- /**
|
|
|
|
|
- * 异步执行项目(后台任务)
|
|
|
|
|
- *
|
|
|
|
|
- * @param projectId 项目ID
|
|
|
|
|
- * @return 任务ID
|
|
|
|
|
- */
|
|
|
|
|
- String executeProjectAsync(String projectId);
|
|
|
|
|
-
|
|
|
|
|
- /**
|
|
|
|
|
- * 获取执行进度
|
|
|
|
|
- *
|
|
|
|
|
- * @param taskId 任务ID
|
|
|
|
|
- * @return 进度信息
|
|
|
|
|
- */
|
|
|
|
|
- ExecuteProgressResponse getProgress(String taskId);
|
|
|
|
|
-
|
|
|
|
|
- /**
|
|
|
|
|
- * 预览提取(不保存)
|
|
|
|
|
- *
|
|
|
|
|
- * @param rule 规则配置
|
|
|
|
|
- * @return 预览结果
|
|
|
|
|
- */
|
|
|
|
|
- ExtractResultResponse preview(ExtractRule rule);
|
|
|
|
|
-
|
|
|
|
|
- /**
|
|
|
|
|
- * 重新执行规则
|
|
|
|
|
- *
|
|
|
|
|
- * @param ruleId 规则ID
|
|
|
|
|
- * @return 新的提取结果
|
|
|
|
|
- */
|
|
|
|
|
- ExtractResult retryRule(String ruleId);
|
|
|
|
|
-}
|
|
|
|
|
-```
|
|
|
|
|
-
|
|
|
|
|
-### 6.3 AIExtractService(AI提取服务)
|
|
|
|
|
-
|
|
|
|
|
-封装 AI 提取和总结的逻辑。
|
|
|
|
|
-
|
|
|
|
|
-```java
|
|
|
|
|
-package com.lingyue.extract.service;
|
|
|
|
|
-
|
|
|
|
|
-import com.lingyue.extract.dto.config.AIExtractConfig;
|
|
|
|
|
-import com.lingyue.extract.dto.config.AISummarizeConfig;
|
|
|
|
|
-
|
|
|
|
|
-/**
|
|
|
|
|
- * AI 提取服务
|
|
|
|
|
- *
|
|
|
|
|
- * @author lingyue
|
|
|
|
|
- * @since 2026-01-22
|
|
|
|
|
- */
|
|
|
|
|
-public interface AIExtractService {
|
|
|
|
|
-
|
|
|
|
|
- /**
|
|
|
|
|
- * AI 字段提取
|
|
|
|
|
- *
|
|
|
|
|
- * @param content 原文内容
|
|
|
|
|
- * @param config 提取配置
|
|
|
|
|
- * @return 提取结果(包含值和置信度)
|
|
|
|
|
- */
|
|
|
|
|
- AIExtractResult extract(String content, AIExtractConfig config);
|
|
|
|
|
-
|
|
|
|
|
- /**
|
|
|
|
|
- * AI 内容总结
|
|
|
|
|
- *
|
|
|
|
|
- * @param content 原文内容
|
|
|
|
|
- * @param config 总结配置
|
|
|
|
|
- * @param context 上下文字段值(可选)
|
|
|
|
|
- * @return 总结结果
|
|
|
|
|
- */
|
|
|
|
|
- AISummarizeResult summarize(String content, AISummarizeConfig config,
|
|
|
|
|
- Map<String, String> context);
|
|
|
|
|
-
|
|
|
|
|
- /**
|
|
|
|
|
- * AI 提取结果
|
|
|
|
|
- */
|
|
|
|
|
- @Data
|
|
|
|
|
- class AIExtractResult {
|
|
|
|
|
- private String value;
|
|
|
|
|
- private Double confidence;
|
|
|
|
|
- private String reasoning;
|
|
|
|
|
- }
|
|
|
|
|
-
|
|
|
|
|
- /**
|
|
|
|
|
- * AI 总结结果
|
|
|
|
|
- */
|
|
|
|
|
- @Data
|
|
|
|
|
- class AISummarizeResult {
|
|
|
|
|
- private String summary;
|
|
|
|
|
- private List<String> keyPoints;
|
|
|
|
|
- }
|
|
|
|
|
|
|
+  "extract_type": "ai_summarize",
+  "extract_config": {
+    "summarizePrompt": "请对以下内容进行总结,重点描述项目建设的必要性",
+    "focusPoints": ["建设背景", "现状问题", "建设目的"],
+    "rules": ["使用正式的工程报告语言", "保留关键的数据和指标"],
+    "style": "formal",
+    "maxLength": 300
+  }
|
|
|
}
|
|
}
|
|
|
```
|
|
```
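
The `ai_summarize` fields have to be turned into a single prompt before a model can be called. The document does not define that prompt format, so the sketch below is purely illustrative: the class name `SummarizePromptBuilder`, the line labels, and the `---` separator are assumptions, and no model call is made.

```java
import java.util.List;

/** Hypothetical helper (not part of the design): flattens an ai_summarize config into a prompt. */
final class SummarizePromptBuilder {
    static String build(String summarizePrompt, List<String> focusPoints, List<String> rules,
                        String style, int maxLength, String content) {
        StringBuilder sb = new StringBuilder(summarizePrompt).append('\n');
        if (!focusPoints.isEmpty()) {
            sb.append("Focus points: ").append(String.join(", ", focusPoints)).append('\n');
        }
        // Number the rules so the model can be asked to follow each one.
        for (int i = 0; i < rules.size(); i++) {
            sb.append("Rule ").append(i + 1).append(": ").append(rules.get(i)).append('\n');
        }
        sb.append("Style: ").append(style).append('\n');
        sb.append("Maximum length: ").append(maxLength).append(" characters\n");
        // The located source content goes last, after a separator line.
        sb.append("---\n").append(content);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(build(
                "Summarize the necessity of the project.",
                List.of("background", "current problems", "goals"),
                List.of("Use formal engineering-report language", "Keep key figures"),
                "formal", 300, "<located source text>"));
    }
}
```

Since a model may ignore the length hint, `maxLength` would likely also need to be enforced on the returned summary.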
|
|
|
|
|
|
|
|
---
|
|
---
|
|
|
|
|
|
|
|
-## 七、API 接口设计
|
|
|
|
|
-
|
|
|
|
|
-### 7.1 项目管理 API
|
|
|
|
|
-
|
|
|
|
|
-| 方法 | 路径 | 描述 |
|
|
|
|
|
-| ------ | ----------------------------------------- | ------------ |
|
|
|
|
|
-| POST | `/api/v1/extract/projects` | 创建项目 |
|
|
|
|
|
-| GET | `/api/v1/extract/projects` | 查询项目列表 |
|
|
|
|
|
-| GET | `/api/v1/extract/projects/{id}` | 获取项目详情 |
|
|
|
|
|
-| PUT | `/api/v1/extract/projects/{id}` | 更新项目 |
|
|
|
|
|
-| DELETE | `/api/v1/extract/projects/{id}` | 删除项目 |
|
|
|
|
|
-| POST | `/api/v1/extract/projects/{id}/archive` | 归档项目 |
|
|
|
|
|
-
|
|
|
|
|
-### 7.2 来源文档 API
|
|
|
|
|
-
|
|
|
|
|
-| 方法 | 路径 | 描述 |
|
|
|
|
|
-| ------ | -------------------------------------------------------- | ---------------- |
|
|
|
|
|
-| POST | `/api/v1/extract/projects/{projectId}/documents` | 添加来源文档 |
|
|
|
|
|
-| GET | `/api/v1/extract/projects/{projectId}/documents` | 获取来源文档列表 |
|
|
|
|
|
-| PUT | `/api/v1/extract/projects/{projectId}/documents/{id}` | 更新来源文档 |
|
|
|
|
|
-| DELETE | `/api/v1/extract/projects/{projectId}/documents/{id}` | 移除来源文档 |
|
|
|
|
|
-| POST | `/api/v1/extract/projects/{projectId}/documents/batch` | 批量添加来源文档 |
|
|
|
|
|
-
|
|
|
|
|
-### 7.3 提取规则 API
|
|
|
|
|
-
|
|
|
|
|
-| 方法 | 路径 | 描述 |
|
|
|
|
|
-| ------ | ------------------------------------------------------------- | ------------ |
|
|
|
|
|
-| POST | `/api/v1/extract/projects/{projectId}/rules` | 创建规则 |
|
|
|
|
|
-| GET | `/api/v1/extract/projects/{projectId}/rules` | 获取规则列表 |
|
|
|
|
|
-| GET | `/api/v1/extract/projects/{projectId}/rules/{id}` | 获取规则详情 |
|
|
|
|
|
-| PUT | `/api/v1/extract/projects/{projectId}/rules/{id}` | 更新规则 |
|
|
|
|
|
-| DELETE | `/api/v1/extract/projects/{projectId}/rules/{id}` | 删除规则 |
|
|
|
|
|
-| POST | `/api/v1/extract/projects/{projectId}/rules/batch` | 批量创建规则 |
|
|
|
|
|
-| PUT | `/api/v1/extract/projects/{projectId}/rules/reorder` | 调整规则顺序 |
|
|
|
|
|
-| POST | `/api/v1/extract/projects/{projectId}/rules/{id}/duplicate` | 复制规则 |
|
|
|
|
|
-
|
|
|
|
|
-### 7.4 提取执行 API
|
|
|
|
|
-
|
|
|
|
|
-| 方法 | 路径 | 描述 |
|
|
|
|
|
-| ---- | ------------------------------------------------ | ---------------- |
|
|
|
|
|
-| POST | `/api/v1/extract/projects/{projectId}/execute` | 执行项目所有规则 |
|
|
|
|
|
-| POST | `/api/v1/extract/rules/{ruleId}/execute` | 执行单条规则 |
|
|
|
|
|
-| POST | `/api/v1/extract/rules/batch-execute` | 批量执行规则 |
|
|
|
|
|
-| POST | `/api/v1/extract/rules/{ruleId}/preview` | 预览提取结果 |
|
|
|
|
|
-| POST | `/api/v1/extract/rules/{ruleId}/retry` | 重新执行规则 |
|
|
|
|
|
-| GET | `/api/v1/extract/tasks/{taskId}/progress` | 获取任务进度 |
|
|
|
|
|
-
|
|
|
|
|
-### 7.5 提取结果 API
|
|
|
|
|
-
|
|
|
|
|
-| 方法 | 路径 | 描述 |
|
|
|
|
|
-| ---- | ------------------------------------------------------------ | ------------------ |
|
|
|
|
|
-| GET | `/api/v1/extract/projects/{projectId}/results` | 获取项目所有结果 |
|
|
|
|
|
-| GET | `/api/v1/extract/rules/{ruleId}/results` | 获取规则的结果历史 |
|
|
|
|
|
-| POST | `/api/v1/extract/results/{id}/confirm` | 确认结果 |
|
|
|
|
|
-| POST | `/api/v1/extract/results/{id}/reject` | 拒绝结果 |
|
|
|
|
|
-| PUT | `/api/v1/extract/results/{id}/modify` | 修正结果 |
|
|
|
|
|
-| POST | `/api/v1/extract/projects/{projectId}/results/confirm-all` | 批量确认 |
|
|
|
|
|
-
|
|
|
|
|
-### 7.6 规则模板 API
|
|
|
|
|
-
|
|
|
|
|
-| 方法 | 路径 | 描述 |
|
|
|
|
|
-| ------ | --------------------------------------------------------- | ------------------ |
|
|
|
|
|
-| POST | `/api/v1/extract/templates` | 创建模板 |
|
|
|
|
|
-| GET | `/api/v1/extract/templates` | 获取模板列表 |
|
|
|
|
|
-| GET | `/api/v1/extract/templates/{id}` | 获取模板详情 |
|
|
|
|
|
-| DELETE | `/api/v1/extract/templates/{id}` | 删除模板 |
|
|
|
|
|
-| POST | `/api/v1/extract/templates/{id}/apply` | 应用模板到项目 |
|
|
|
|
|
-| POST | `/api/v1/extract/projects/{projectId}/save-as-template` | 保存项目规则为模板 |
|
|
|
|
|
|
|
+## 7. API design
+
+### 7.1 Template management API
+
+| Method | Path | Description |
+| ---- | ---- | ---- |
+| POST | `/api/v1/templates` | Create a template |
+| GET | `/api/v1/templates` | List templates |
+| GET | `/api/v1/templates/{id}` | Get template details |
+| PUT | `/api/v1/templates/{id}` | Update a template |
+| DELETE | `/api/v1/templates/{id}` | Delete a template |
+| POST | `/api/v1/templates/{id}/publish` | Publish a template |
+| POST | `/api/v1/templates/{id}/duplicate` | Duplicate a template |
+
+### 7.2 Source file definition API
+
+| Method | Path | Description |
+| ---- | ---- | ---- |
+| POST | `/api/v1/templates/{templateId}/source-files` | Add a source file definition |
+| GET | `/api/v1/templates/{templateId}/source-files` | List source file definitions |
+| PUT | `/api/v1/templates/{templateId}/source-files/{id}` | Update a source file definition |
+| DELETE | `/api/v1/templates/{templateId}/source-files/{id}` | Delete a source file definition |
+| POST | `/api/v1/templates/{templateId}/source-files/reorder` | Reorder source file definitions |
+
+### 7.3 Variable API
+
+| Method | Path | Description |
+| ---- | ---- | ---- |
+| POST | `/api/v1/templates/{templateId}/variables` | Create a variable |
+| GET | `/api/v1/templates/{templateId}/variables` | List variables |
+| GET | `/api/v1/templates/{templateId}/variables/{id}` | Get variable details |
+| PUT | `/api/v1/templates/{templateId}/variables/{id}` | Update a variable |
+| DELETE | `/api/v1/templates/{templateId}/variables/{id}` | Delete a variable |
+| POST | `/api/v1/templates/{templateId}/variables/{id}/preview` | Preview the extraction result |
+
+### 7.4 Generation task API
+
+| Method | Path | Description |
+| ---- | ---- | ---- |
+| POST | `/api/v1/generations` | Create a generation task |
+| GET | `/api/v1/generations` | List generation tasks |
+| GET | `/api/v1/generations/{id}` | Get task details |
+| POST | `/api/v1/generations/{id}/execute` | Run extraction |
+| GET | `/api/v1/generations/{id}/progress` | Get execution progress |
+| PUT | `/api/v1/generations/{id}/variables/{varName}` | Modify a variable value |
+| POST | `/api/v1/generations/{id}/confirm` | Confirm and generate the document |
+| GET | `/api/v1/generations/{id}/download` | Download the generated document |
|
|
|
|
|
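
The generation task endpoints in 7.4 can be called from any HTTP client. The following is a minimal sketch rather than the project's actual client: it only builds `java.net.http.HttpRequest` objects for three of the endpoints and sends nothing. The class name `GenerationRequests`, the base URL, and the assumption that task IDs and variable names are URL-safe are illustrative and do not come from the design.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Illustrative request builder for the generation task endpoints (nothing is sent). */
final class GenerationRequests {
    // Hypothetical base URL such as "http://localhost:8080"; not defined by the design.
    private final String baseUrl;

    GenerationRequests(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    /** POST /api/v1/generations/{id}/execute: start extraction. */
    HttpRequest execute(String generationId) {
        return HttpRequest.newBuilder(taskUri(generationId, "/execute"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    /** GET /api/v1/generations/{id}/progress: read execution progress. */
    HttpRequest progress(String generationId) {
        return HttpRequest.newBuilder(taskUri(generationId, "/progress"))
                .GET()
                .build();
    }

    /**
     * PUT /api/v1/generations/{id}/variables/{varName}: overwrite one value.
     * jsonBody is assumed to be already-serialized JSON such as {"value": "..."}.
     */
    HttpRequest setVariableValue(String generationId, String varName, String jsonBody) {
        return HttpRequest.newBuilder(taskUri(generationId, "/variables/" + varName))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(jsonBody, StandardCharsets.UTF_8))
                .build();
    }

    // Assumes generationId and the suffix contain only URL-safe characters.
    private URI taskUri(String generationId, String suffix) {
        return URI.create(baseUrl + "/api/v1/generations/" + generationId + suffix);
    }
}
```

A caller would pass the resulting request to `HttpClient.send` with a body handler of its choice; authentication and error handling are left out of this sketch.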

---

## 8. Core Flows

-### 8.1 Complete Workflow
-
-```
-┌──────────────────────────────────────────────────────────────────────────────┐
-│                             User operation flow                              │
-└──────────────────────────────────────────────────────────────────────────────┘
-                                       │
-          ┌────────────────────────────┼────────────────────────────┐
-          ▼                            ▼                            ▼
-┌────────────────────┐       ┌────────────────────┐       ┌────────────────────┐
-│ 1. Create          │       │ 2. Upload documents│       │ 3. Configure rules │
-│    project         │──────►│    and link them   │──────►│    (or a template) │
-└────────────────────┘       └────────────────────┘       └────────────────────┘
-                                                                    │
-          ┌─────────────────────────────────────────────────────────┘
-          ▼
-┌────────────────────┐       ┌────────────────────┐       ┌────────────────────┐
-│ 4. Run extraction  │──────►│ 5. Review results, │──────►│ 6. Confirm or      │
-│    (with preview)  │       │    trace sources   │       │    correct results │
-└────────────────────┘       └────────────────────┘       └────────────────────┘
-                                                                    │
-          ┌─────────────────────────────────────────────────────────┘
-          ▼
-┌────────────────────┐       ┌────────────────────┐
-│ 7. Export data or  │──────►│ 8. Save as a       │
-│    generate report │       │    rule template   │
-└────────────────────┘       └────────────────────┘
-```
-
-### 8.2 Rule Execution Flow
-
-```
-┌──────────────────────────────────────────────────────────────────────────────┐
-│                     ExtractExecuteService.executeRule()                      │
-└──────────────────────────────────────────────────────────────────────────────┘
-                                       │
-                                       ▼
-                           ┌───────────────────────┐
-                           │ 1. Load rule config   │
-                           │    ExtractRule        │
-                           └───────────────────────┘
-                                       │
-                                       ▼
-                           ┌───────────────────────┐
-                           │ 2. Fetch source text  │
-                           │    by sourceType      │
-                           └───────────────────────┘
-                                       │
-          ┌────────────────────────────┼────────────────────────────┐
-          ▼                            ▼                            ▼
-┌────────────────────┐       ┌────────────────────┐       ┌────────────────────┐
-│ document           │       │ self_reference     │       │ fixed/manual       │
-│                    │       │                    │       │                    │
-│ ContentLocator     │       │ Look up already    │       │ Read the configured│
-│ Service            │       │ extracted values   │       │ value directly     │
-└────────────────────┘       └────────────────────┘       └────────────────────┘
-          │                            │                            │
-          └────────────────────────────┼────────────────────────────┘
-                                       │
-                                       ▼
-                           ┌───────────────────────┐
-                           │ 3. Run extraction     │
-                           │    by extractType     │
-                           └───────────────────────┘
-                                       │
-          ┌────────────────────────────┼────────────────────────────┐
-          ▼                            ▼                            ▼
-┌────────────────────┐       ┌────────────────────┐       ┌────────────────────┐
-│ direct             │       │ ai_extract         │       │ ai_summarize       │
-│                    │       │                    │       │                    │
-│ DirectExtract      │       │ AIExtract          │       │ AISummarize        │
-│ Executor           │       │ Executor           │       │ Executor           │
-└────────────────────┘       └────────────────────┘       └────────────────────┘
-          │                            │                            │
-          └────────────────────────────┼────────────────────────────┘
-                                       │
-                                       ▼
-                           ┌───────────────────────┐
-                           │ 4. Save the result    │
-                           │    ExtractResult      │
-                           └───────────────────────┘
-                                       │
-                                       ▼
-                           ┌───────────────────────┐
-                           │ 5. Update rule status │
-                           │    and extracted_value│
-                           └───────────────────────┘
-```
-
-### 8.3 AI Extraction Prompt Design
-
-#### 8.3.1 Field Extraction Prompt
-
-```text
-You are a professional document information extraction assistant. Extract the specified information from the document content below.
-
-## Extraction target
-{targetDescription}
-
-## Field type
-{fieldType}
-
-## Expected format
-{expectedFormat}
-
-## Examples
-{examples}
-
-## Document content
-```
-
-{content}
-
-```
-
-## Output requirements
-Output the extracted value directly, without any explanation. If the value cannot be extracted, output "[unable to extract]".
-If the content contains several candidate values, choose the most accurate one.
-
-Extraction result:
-```
-
-#### 8.3.2 Content Summarization Prompt
-
-```text
-You are a professional engineering report writing assistant. Summarize and distill the content below.
-
-## Summary requirements
-{summarizePrompt}
-
-## Focus dimensions
-{focusPoints}
-
-## Summary rules
-{rules}
-
-## Output style
-{style}
-
-## Length limit
-At most {maxLength} characters
-
-## Context
-{contextInfo}
-
-## Source content
-```
-
-{content}
-
-```
-
-## Output requirements
-Output the summary directly, in formal engineering-report language.
-
-Summary result:
+### 8.1 Template Creation Flow
+
+```
+1. The user uploads a sample report
+   POST /api/v1/parse/upload
+   → returns document_id
+
+2. Create the template
+   POST /api/v1/templates
+   {
+     "name": "110kV Power Transmission and Transformation Project Pre-evaluation Template",
+     "baseDocumentId": "doc_001"
+   }
+   → returns template_id
+
+3. Add source file definitions
+   POST /api/v1/templates/{templateId}/source-files
+   {
+     "alias": "可研批复",
+     "description": "Feasibility study approval document",
+     "fileTypes": ["pdf", "docx"],
+     "required": true,
+     "exampleDocumentId": "doc_002"
+   }
+
+4. Create variables (the frontend calls this API as the user works in the editor)
+   POST /api/v1/templates/{templateId}/variables
+   {
+     "name": "project_name",
+     "displayName": "Project name",
+     "location": {
+       "elementId": "elem_001",
+       "type": "text",
+       "startOffset": 10,
+       "endOffset": 35
+     },
+     "exampleValue": "Xiangyang Lianyun 110kV Power Transmission and Transformation Project",
+     "sourceFileAlias": "可研批复",
+     "sourceType": "document",
+     "sourceConfig": { ... },
+     "extractType": "ai_extract",
+     "extractConfig": { ... }
+   }
+
+5. Publish the template
+   POST /api/v1/templates/{templateId}/publish
+```
+
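
Step 4 binds a variable to a span of text through `location.startOffset` and `location.endOffset`. The design does not spell out how these offsets are interpreted, so the sketch below assumes a half-open range `[startOffset, endOffset)` counted in Java `char` units within the text of the element identified by `elementId`. `VariableSpan` is a hypothetical helper name, not part of the design.

```java
/** Hypothetical helper: swaps the text span bound to a variable for a new value. */
final class VariableSpan {
    private VariableSpan() {
    }

    /**
     * Replaces elementText[startOffset, endOffset) with newValue.
     * Assumption: offsets are char indexes and the end offset is exclusive.
     */
    static String replace(String elementText, int startOffset, int endOffset, String newValue) {
        if (startOffset < 0 || endOffset < startOffset || endOffset > elementText.length()) {
            throw new IllegalArgumentException(
                    "offsets out of range: " + startOffset + ".." + endOffset);
        }
        // Keep everything before and after the span; only the span itself changes.
        return elementText.substring(0, startOffset)
                + newValue
                + elementText.substring(endOffset);
    }
}
```

For example, `VariableSpan.replace("Project: ALPHA substation", 9, 14, "BETA")` returns `"Project: BETA substation"`. If the real offsets turn out to be byte-based or inclusive, the bounds check and the two `substring` calls would need to change accordingly.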
+### 8.2 Report Generation Flow
+
+```
+1. Create a generation task
+   POST /api/v1/generations
+   {
+     "templateId": "tpl_001",
+     "name": "Wuhan Donghu 110kV Pre-evaluation",
+     "sourceFileMap": {
+       "可研批复": "doc_123",
+       "站址报告": "doc_456"
+     }
+   }
+   → returns generation_id
+
+2. Run extraction
+   POST /api/v1/generations/{generationId}/execute
+
+3. Poll progress
+   GET /api/v1/generations/{generationId}/progress
+   → { "total": 10, "completed": 6, "currentVariable": "geology_summary" }
+
+4. Review the extracted values and correct them if needed
+   GET /api/v1/generations/{generationId}
+   PUT /api/v1/generations/{generationId}/variables/project_name
+   { "value": "corrected value" }
+
+5. Confirm and generate the document
+   POST /api/v1/generations/{generationId}/confirm
+
+6. Download
+   GET /api/v1/generations/{generationId}/download
```
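
Before step 2 runs, it may be worth checking that `sourceFileMap` supplies a document for every source file definition marked `required`, and the response in step 3 can be turned into a percentage for display. The following is a minimal sketch under those assumptions. `GenerationChecks` and both method names are hypothetical, and the design does not state where such validation would live.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical pre-flight checks for a generation task. */
final class GenerationChecks {
    private GenerationChecks() {
    }

    /** Returns the required aliases that have no document ID in sourceFileMap. */
    static List<String> missingAliases(List<String> requiredAliases,
                                       Map<String, String> sourceFileMap) {
        List<String> missing = new ArrayList<>();
        for (String alias : requiredAliases) {
            String documentId = sourceFileMap.get(alias);
            if (documentId == null || documentId.isBlank()) {
                missing.add(alias);
            }
        }
        return missing;
    }

    /** Converts the progress counters into a whole percentage between 0 and 100. */
    static int percentComplete(int total, int completed) {
        if (total <= 0) {
            return 0; // nothing to do yet, so report 0 rather than divide by zero
        }
        int done = Math.max(0, Math.min(completed, total));
        return done * 100 / total;
    }
}
```

With the progress response shown in step 3 (`total` 10, `completed` 6), `percentComplete(10, 6)` yields 60. An empty list from `missingAliases` means every required alias, such as 可研批复 (the feasibility study approval), has a document mapped to it.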

---

-## 9. Integration with Existing Systems
-
-### 9.1 Dependencies on Existing Services
-
-| Service | Purpose | Invocation |
-| ------- | ------- | ---------- |
-| `DocumentService` | Fetch document information | Feign Client |
-| `DocumentElementService` | Fetch document elements | Feign Client |
-| `DeepSeekClient` | AI extraction and summarization | Direct call |
-| `WordStructuredExtractionService` | Word document parsing | Via parse-service |
-| `DataSourceService` | Optional: register results as data sources | Feign Client |
-
-### 9.2 Event Integration
-
-```java
-/**
- * Listens for the document-parsed event
- * and automatically updates the status of the source documents.
- */
-@EventListener
-public void onDocumentParsed(DocumentParsedEvent event) {
-    // Find the source document records linked to this document
-    List<SourceDocument> sourceDocs = sourceDocumentService
-            .findByDocumentId(event.getDocumentId());
-
-    // Update the parse status
-    for (SourceDocument sourceDoc : sourceDocs) {
-        sourceDoc.updateMetadata("parseStatus", "completed");
-        sourceDocumentService.update(sourceDoc);
-    }
-}
-```
-
-### 9.3 Relationship to DataSource
-
-Extraction results can optionally be registered as a `DataSource` for use by other modules (such as template rendering):
-
-```java
-/**
- * Registers an extraction result as a data source.
- */
-public DataSource registerAsDataSource(ExtractResult result, String userId) {
-    CreateDataSourceRequest request = new CreateDataSourceRequest();
-    request.setName(result.getRule().getTargetFieldName());
-    request.setType("text");
-    request.setSourceType("extract_result");
-    request.setConfig(Map.of(
-        "extractResultId", result.getId(),
-        "projectId", result.getProjectId()
-    ));
-
-    return dataSourceService.create(userId, request);
-}
-```
-
----
+## 9. Integration with Existing Systems

-## 10. Error Handling and Logging
+### 9.1 Reusing Existing Services

-### 10.1 Error Code Definitions
+| Existing service | What is reused |
+| -------- | -------- |
+| document-service | Document storage, `DocumentElement` structure |
+| parse-service | Document parsing, structured extraction |
+| ai-service | DeepSeek API calls, AI extraction |
+| graph-service | Optional: register variables as data sources |

-| Error code | Description |
-| --------------- | ---------------- |
-| `EXTRACT_001` | Project not found |
-| `EXTRACT_002` | Source document not found |
-| `EXTRACT_003` | Invalid rule configuration |
-| `EXTRACT_004` | Document parsing not finished |
-| `EXTRACT_005` | Content location failed |
-| `EXTRACT_006` | AI extraction failed |
-| `EXTRACT_007` | Referenced field has not been extracted |
-| `EXTRACT_008` | Circular reference |
+### 9.2 Refactoring extract-service

-### 10.2 Logging Conventions
+The existing `extract-service` code needs to be refactored; the old concepts map to the new ones as follows:

-```java
-// Rule execution log
-log.info("Starting extraction rule: ruleId={}, projectId={}, targetField={}",
-        rule.getId(), rule.getProjectId(), rule.getTargetFieldKey());
-
-// AI call log
-log.info("AI extraction: ruleId={}, contentLength={}, extractType={}",
-        ruleId, content.length(), extractType);
-
-// Result log
-log.info("Extraction finished: ruleId={}, status={}, valueLength={}, confidence={}",
-        ruleId, status, value.length(), confidence);
-```
+| Old concept | New concept | Notes |
+| ------ | ------ | ---- |
+| Project | Template | Templates replace projects |
+| SourceDocument | SourceFile | A source file **definition**, not a concrete file |
+| ExtractRule | Variable | A variable, bound to a position in the document |
+| ExtractResult | Generation.variable_values | Variable values within a generation task |
+| - | Generation | New: generation task |

---

-## 11. Performance Optimization Suggestions
+## 10. Future Extensions

-### 11.1 Batch Execution Optimization
+### 10.1 Template Marketplace

-1. **Parallel execution**: rules with no dependencies on each other can run in parallel
-2. **Batched AI calls**: multiple extraction requests can be merged into one batch request
-3. **Cached content location**: cache the result when the same location criteria are applied to the same document
+- Users can make a template public
+- Other users can create their own templates based on public ones
+- Template ratings and usage statistics

-### 11.2 Dependency Analysis
+### 10.2 Batch Generation

-```java
-/**
- * Analyzes rule dependencies and builds the execution order.
- */
-public List<List<String>> buildExecutionOrder(List<ExtractRule> rules) {
-    // 1. Build the dependency graph
-    Map<String, Set<String>> dependencyGraph = buildDependencyGraph(rules);
-
-    // 2. Topologically sort, identifying groups of rules that can run in parallel
-    return topologicalSort(dependencyGraph);
-}
-```
-
----
+- Upload multiple sets of source files
+- Generate multiple reports in one run
+- Generation task queue management

-## 12. Future Extensions
+### 10.3 Version Control

-### 12.1 Planned Features
-
-1. **Rule recommendation**: automatically recommend common rules based on document type
-2. **Smart location**: AI-assisted identification of sections and content positions
-3. **Batch projects**: batch processing of multiple projects of the same type
-4. **Version comparison**: version management and comparison of rule configurations
-5. **Collaborative editing**: multiple users configuring rules together
-
-### 12.2 Integration Extensions
-
-1. **graph-service integration**: build a knowledge graph from extraction results
-2. **Report generation integration**: use extraction results directly for report generation
-3. **Approval workflow integration**: extraction results take effect only after approval
-
----
-
-## Appendix A: Configuration Examples
-
-### A.1 Complete Rule Configuration Example
-
-```json
-{
-  "projectId": "proj_001",
-  "targetFieldKey": "project_name",
-  "targetFieldName": "Project name",
-  "targetFieldGroup": "Basic information",
-  "ruleIndex": 1,
-  "sourceType": "document",
-  "sourceConfig": {
-    "sourceDocId": "sd_001",
-    "documentAlias": "可研批复",
-    "location": {
-      "type": "page",
-      "pageStart": 1,
-      "pageEnd": 1
-    }
-  },
-  "extractType": "ai_extract",
-  "extractConfig": {
-    "targetDescription": "Extract the full name of the engineering project from the approval document",
-    "fieldType": "text",
-    "expectedFormat": "XX City XX Project",
-    "examples": ["Xiangyang Lianyun 220 kV Power Transmission and Transformation Project"]
-  }
-}
-```
-
-### A.2 Complex Reference Rule Example
-
-```json
-{
-  "projectId": "proj_001",
-  "targetFieldKey": "report_summary",
-  "targetFieldName": "Report summary",
-  "ruleIndex": 50,
-  "sourceType": "self_reference",
-  "sourceConfig": {
-    "referenceFieldKeys": ["project_name", "construction_unit", "project_location", "investment_amount"],
-    "combineTemplate": "The Feasibility Study Report for {project_name} was prepared by {construction_unit}; the project is located in {project_location}, with an estimated total investment of {investment_amount} (unit: 10,000 yuan)."
-  },
-  "extractType": "direct",
-  "extractConfig": {}
-}
-```
+- Template version management
+- Generation tasks linked to a template version
+- Version comparison and rollback

---