|
|
@@ -0,0 +1,1751 @@
|
|
|
+# Data Extraction Rule System Design Document
+
+> Version: 1.0.0
+> Date: 2026-01-22
+> Author: AI Assistant (Claude Opus 4.5)
|
|
|
+
|
|
|
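+## 1. Overview
+
+### 1.1 Background
+
+When generating pre-assessment reports for power engineering projects, users need to extract specific data from multiple source documents (PDF, Word, Excel), process it according to rules (direct extraction, AI extraction, AI summarization, and so on), and produce a structured report.
+
+The extraction rules are currently compiled by hand as a table with the following columns:
+
+- Source data/file
+- Specific section/content within the source data/file
+- Value extraction rule
+- To be provided / remarks
+
+The goal of this system is to **make this manual workflow visual and structured**: users configure extraction rules in the UI, and the system runs the extraction tasks automatically.
+
+### 1.2 Core Concepts
+
+| Concept | Description |
+| ------------------ | -------------------------------------------------------------------------------- |
+| **Project** | A report generation task containing multiple source documents and extraction rules |
+| **SourceDocument** | A document used in the project, linked to an already parsed Document |
+| **ExtractRule** | Configuration describing how to extract data from a source document |
+| **ExtractResult** | The value produced by running a rule; it can be referenced by later rules |
+
+### 1.3 Design Principles
+
+1. **Traceability**: every extracted value can be traced back to a specific location in its source document
+2. **Flexible configuration**: source types and extraction methods can be combined freely
+3. **Reusability**: a rule configuration can be saved as a template and applied to similar projects
+4. **Incremental workflow**: extraction can proceed step by step, with manual confirmation and correction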
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## 2. System Architecture
+
+### 2.1 Overall Architecture Diagram
+
+```
+┌─────────────────────────────────────────────────────────────────────────────┐
+│                              Frontend (Vue.js)                              │
+│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │
+│  │   Projects   │  │  Documents   │  │ Rule config  │  │  Execution   │     │
+│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘     │
+└─────────────────────────────────────────────────────────────────────────────┘
+                                     │
+                                     ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│                               Gateway Service                               │
+└─────────────────────────────────────────────────────────────────────────────┘
+                                     │
+           ┌─────────────────────────┼─────────────────────────┐
+           ▼                         ▼                         ▼
+┌──────────────────────┐  ┌──────────────────────┐  ┌──────────────────────┐
+│   extract-service    │  │   document-service   │  │      ai-service      │
+│        (new)         │  │      (existing)      │  │      (existing)      │
+│  ┌────────────────┐  │  │  ┌────────────────┐  │  │  ┌────────────────┐  │
+│  │ ProjectService │  │  │  │ DocumentService│  │  │  │ DeepSeekClient │  │
+│  │ RuleService    │  │  │  │ ElementService │  │  │  │ AIService      │  │
+│  │ ExecuteService │  │  │  └────────────────┘  │  │  └────────────────┘  │
+│  └────────────────┘  │  └──────────────────────┘  └──────────────────────┘
+└──────────────────────┘
+           │                         │                         │
+           └─────────────────────────┴─────────────────────────┘
+                                     │
+                                     ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│                             PostgreSQL + Redis                              │
+└─────────────────────────────────────────────────────────────────────────────┘
+```
|
|
|
+
|
|
|
+### 2.2 Module Responsibilities
+
+| Module | Responsibility | Location |
+| -------------------- | ---------------------------------------------------------------------- | -------------------------- |
+| **extract-service** | Project management, rule configuration, extraction execution (**new**) | `backend/extract-service` |
+| **document-service** | Document management, element storage (existing) | `backend/document-service` |
+| **parse-service** | Document parsing, structured extraction (existing) | `backend/parse-service` |
+| **ai-service** | AI extraction, summarization, polishing (existing) | `backend/ai-service` |
+| **graph-service** | Data sources, knowledge graph (existing) | `backend/graph-service` |
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## 3. Database Design
+
+### 3.1 ER Diagram
+
+```
+┌─────────────────┐       ┌─────────────────┐       ┌────────────────────┐
+│    projects     │       │ source_documents│       │   extract_rules    │
+├─────────────────┤       ├─────────────────┤       ├────────────────────┤
+│ id              │◄──────│ project_id      │       │ id                 │
+│ user_id         │       │ id              │◄──────│ source_doc_id      │
+│ name            │       │ document_id     │       │ project_id         │
+│ description     │       │ alias           │       │ target_field_key   │
+│ status          │       │ doc_type        │       │ target_field_name  │
+│ config          │       │ display_order   │       │ target_field_group │
+│ created_at      │       │ metadata        │       │ rule_index         │
+│ updated_at      │       │ created_at      │       │ source_type        │
+└─────────────────┘       └─────────────────┘       │ source_config      │
+                                                    │ extract_type       │
+┌─────────────────┐                                 │ extract_config     │
+│ extract_results │                                 │ status             │
+├─────────────────┤                                 │ extracted_value    │
+│ id              │                                 │ value_type         │
+│ rule_id         │──► extract_rules.id             │ error_message      │
+│ project_id      │──► projects.id                  │ metadata           │
+│ extracted_value │                                 │ created_at         │
+│ value_type      │                                 │ updated_at         │
+│ source_content  │                                 └────────────────────┘
+│ source_location │       ┌─────────────────┐
+│ confidence      │       │ rule_templates  │
+│ status          │       ├─────────────────┤
+│ modified_value  │       │ id              │
+│ confirmed_at    │       │ user_id         │
+│ confirmed_by    │       │ name            │
+│ metadata        │       │ description     │
+│ created_at      │       │ rules_snapshot  │
+└─────────────────┘       │ doc_type_pattern│
+                          │ use_count       │
+                          │ is_public       │
+                          │ tags            │
+                          │ created_at      │
+                          │ updated_at      │
+                          └─────────────────┘
+```
|
|
|
+
|
|
|
+### 3.2 Table Definitions
+
+#### 3.2.1 projects (project table)
+
+```sql
+CREATE TABLE projects (
+    id          VARCHAR(32)  PRIMARY KEY,
+    user_id     VARCHAR(32)  NOT NULL,
+    name        VARCHAR(255) NOT NULL,            -- project name
+    description TEXT,                             -- project description
+    status      VARCHAR(32)  DEFAULT 'draft',     -- draft/extracting/completed/archived
+    config      JSONB,                            -- project configuration
+    created_at  TIMESTAMP    DEFAULT CURRENT_TIMESTAMP,
+    updated_at  TIMESTAMP    DEFAULT CURRENT_TIMESTAMP
+);
+
+CREATE INDEX idx_projects_user_id ON projects (user_id);
+CREATE INDEX idx_projects_status ON projects (status);
+
+COMMENT ON TABLE projects IS 'Data extraction project';
+```
|
|
|
+
|
|
|
+**`config` field structure**:
+
+```jsonc
+{
+  "outputFormat": "docx",      // output format
+  "autoExtract": false,        // whether to run extraction automatically
+  "notifyOnComplete": true,    // notify on completion
+  "aiModel": "deepseek-chat"   // AI model to use
+}
+```
|
|
|
+
|
|
|
+#### 3.2.2 source_documents (source document table)
+
+```sql
+CREATE TABLE source_documents (
+    id            VARCHAR(32)  PRIMARY KEY,
+    project_id    VARCHAR(32)  NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
+    document_id   VARCHAR(32)  NOT NULL,   -- ID of the linked Document
+    alias         VARCHAR(128) NOT NULL,   -- document alias, e.g. "可研批复" (feasibility study approval)
+    doc_type      VARCHAR(32)  NOT NULL,   -- document type: pdf/docx/xlsx
+    display_order INT          DEFAULT 0,  -- display order
+    metadata      JSONB,                   -- metadata
+    created_at    TIMESTAMP    DEFAULT CURRENT_TIMESTAMP,
+
+    UNIQUE (project_id, alias)
+);
+
+CREATE INDEX idx_source_documents_project_id ON source_documents (project_id);
+
+COMMENT ON TABLE source_documents IS 'Project source document';
+```
|
|
|
+
|
|
|
+**`metadata` field structure**:
|
|
|
+
|
|
|
+```json
|
|
|
+{
|
|
|
+ "fileName": "鄂电司发展〔2024〕124号...批复.pdf",
|
|
|
+ "fileSize": 1024000,
|
|
|
+ "pageCount": 18,
|
|
|
+ "parseStatus": "completed"
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+#### 3.2.3 extract_rules (extraction rule table)
+
+```sql
+CREATE TABLE extract_rules (
+    id                 VARCHAR(32)  PRIMARY KEY,
+    project_id         VARCHAR(32)  NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
+    source_doc_id      VARCHAR(32),            -- source document ID (NULL for self_reference/fixed/manual)
+
+    -- Target field
+    target_field_key   VARCHAR(128) NOT NULL,  -- target field key (used by code)
+    target_field_name  VARCHAR(255) NOT NULL,  -- target field name (used for display)
+    target_field_group VARCHAR(128),           -- field group
+    rule_index         INT          NOT NULL,  -- rule order
+
+    -- Source configuration
+    source_type        VARCHAR(32)  NOT NULL,  -- document/self_reference/fixed/manual
+    source_config      JSONB        NOT NULL,  -- source configuration
+
+    -- Extraction configuration
+    extract_type       VARCHAR(32)  NOT NULL,  -- direct/ai_extract/ai_summarize/ocr
+    extract_config     JSONB,                  -- extraction configuration
+
+    -- Result
+    status             VARCHAR(32)  DEFAULT 'pending',  -- pending/extracting/extracted/confirmed/error
+    extracted_value    TEXT,                            -- extracted value
+    value_type         VARCHAR(32)  DEFAULT 'text',     -- text/table/image/list
+    error_message      TEXT,                            -- error message
+
+    -- Metadata
+    metadata           JSONB,                  -- metadata
+    created_at         TIMESTAMP    DEFAULT CURRENT_TIMESTAMP,
+    updated_at         TIMESTAMP    DEFAULT CURRENT_TIMESTAMP,
+
+    UNIQUE (project_id, target_field_key)
+);
+
+CREATE INDEX idx_extract_rules_project_id ON extract_rules (project_id);
+CREATE INDEX idx_extract_rules_status ON extract_rules (status);
+CREATE INDEX idx_extract_rules_target_field_key ON extract_rules (target_field_key);
+
+COMMENT ON TABLE extract_rules IS 'Data extraction rule';
+```
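Because a rule with `source_type = self_reference` reads the value of another field, it can only run after the rules it references. The sketch below shows one way to derive an execution order: a topological sort (Kahn's algorithm) that falls back to `rule_index` among rules that are ready at the same time. The `RuleOrdering` and `Rule` names, and the idea of passing the referenced keys as a plain list, are assumptions made for illustration; the design document does not prescribe an ordering algorithm.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical helper: orders rules so that referenced fields are extracted first. */
final class RuleOrdering {

    /** Minimal stand-in for an extract_rules row (illustration only). */
    static final class Rule {
        final String key;              // target_field_key
        final int ruleIndex;           // rule_index
        final List<String> dependsOn;  // field keys referenced via self_reference

        Rule(String key, int ruleIndex, List<String> dependsOn) {
            this.key = key;
            this.ruleIndex = ruleIndex;
            this.dependsOn = dependsOn;
        }
    }

    private RuleOrdering() {}

    /** Returns field keys in a valid execution order; throws on unknown or circular references. */
    static List<String> executionOrder(List<Rule> rules) {
        Map<String, Rule> byKey = new HashMap<>();
        for (Rule r : rules) {
            byKey.put(r.key, r);
        }
        Map<String, Integer> pending = new HashMap<>();          // unresolved dependency count
        Map<String, List<String>> dependents = new HashMap<>();  // key -> rules waiting on it
        for (Rule r : rules) {
            pending.put(r.key, r.dependsOn.size());
            for (String dep : r.dependsOn) {
                if (!byKey.containsKey(dep)) {
                    throw new IllegalArgumentException("Unknown field key: " + dep);
                }
                dependents.computeIfAbsent(dep, k -> new ArrayList<>()).add(r.key);
            }
        }
        // Among rules that are ready, the lowest rule_index runs first.
        PriorityQueue<Rule> ready =
                new PriorityQueue<>(Comparator.comparingInt((Rule r) -> r.ruleIndex));
        for (Rule r : rules) {
            if (r.dependsOn.isEmpty()) {
                ready.add(r);
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            Rule next = ready.poll();
            order.add(next.key);
            for (String waiting : dependents.getOrDefault(next.key, List.of())) {
                if (pending.merge(waiting, -1, Integer::sum) == 0) {
                    ready.add(byKey.get(waiting));
                }
            }
        }
        if (order.size() != rules.size()) {
            throw new IllegalStateException("Circular reference among rules");
        }
        return order;
    }
}
```

Rejecting circular references at configuration time, rather than at execution time, would let the UI report the problem before any AI call is made.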
|
|
|
+
|
|
|
+#### 3.2.4 extract_results (extraction result table)
+
+```sql
+CREATE TABLE extract_results (
+    id              VARCHAR(32) PRIMARY KEY,
+    rule_id         VARCHAR(32) NOT NULL REFERENCES extract_rules(id) ON DELETE CASCADE,
+    project_id      VARCHAR(32) NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
+
+    -- Extraction result
+    extracted_value TEXT        NOT NULL,        -- extracted value
+    value_type      VARCHAR(32) DEFAULT 'text',  -- value type
+
+    -- Source tracing
+    source_content  TEXT,                        -- original source text
+    source_location JSONB,                       -- source location information
+
+    -- Quality assessment
+    confidence      DECIMAL(5,4),                -- confidence of the AI extraction, 0-1
+
+    -- Status
+    status          VARCHAR(32) DEFAULT 'extracted',  -- extracted/confirmed/rejected/modified
+
+    -- Manual handling
+    modified_value  TEXT,                        -- manually corrected value
+    confirmed_at    TIMESTAMP,                   -- confirmation time
+    confirmed_by    VARCHAR(32),                 -- confirmed by
+
+    -- Metadata
+    metadata        JSONB,                       -- metadata (AI output, processing log, etc.)
+    created_at      TIMESTAMP   DEFAULT CURRENT_TIMESTAMP
+);
+
+CREATE INDEX idx_extract_results_rule_id ON extract_results (rule_id);
+CREATE INDEX idx_extract_results_project_id ON extract_results (project_id);
+CREATE INDEX idx_extract_results_status ON extract_results (status);
+
+COMMENT ON TABLE extract_results IS 'Extraction result history';
+```
|
|
|
+
|
|
|
+**`source_location` field structure**:
|
|
|
+
|
|
|
+```json
|
|
|
+{
|
|
|
+ "documentId": "doc_001",
|
|
|
+ "documentAlias": "可研批复",
|
|
|
+ "locationType": "page",
|
|
|
+ "pageStart": 1,
|
|
|
+ "pageEnd": 2,
|
|
|
+ "elementIds": ["elem_001", "elem_002"],
|
|
|
+ "chapterPath": ["1", "建设必要性"],
|
|
|
+ "textPreview": "本项目建设必要性主要体现在..."
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+#### 3.2.5 rule_templates (rule template table)
+
+```sql
+CREATE TABLE rule_templates (
+    id               VARCHAR(32)  PRIMARY KEY,
+    user_id          VARCHAR(32)  NOT NULL,
+    name             VARCHAR(255) NOT NULL,       -- template name
+    description      TEXT,                        -- template description
+
+    -- Template content
+    rules_snapshot   JSONB        NOT NULL,       -- snapshot of the rule configuration
+    doc_type_pattern JSONB,                       -- applicable document type pattern
+
+    -- Statistics
+    use_count        INT          DEFAULT 0,      -- number of uses
+
+    -- Metadata
+    is_public        BOOLEAN      DEFAULT FALSE,  -- whether the template is public
+    tags             JSONB,                       -- tags
+    created_at       TIMESTAMP    DEFAULT CURRENT_TIMESTAMP,
+    updated_at       TIMESTAMP    DEFAULT CURRENT_TIMESTAMP
+);
+
+CREATE INDEX idx_rule_templates_user_id ON rule_templates (user_id);
+CREATE INDEX idx_rule_templates_is_public ON rule_templates (is_public);
+
+COMMENT ON TABLE rule_templates IS 'Extraction rule template';
+```
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## 4. Core Data Structures
+
+### 4.1 source_config in Detail
+
+#### 4.1.1 From a document (document)
+
+```java
+import java.util.List;
+import lombok.Data;
+
+/**
+ * Document source configuration.
+ */
+@Data
+public class DocumentSourceConfig {
+    /** Source document ID (ID in the source_documents table). */
+    private String sourceDocId;
+
+    /** Document alias (for display). */
+    private String documentAlias;
+
+    /** How to locate the content. */
+    private LocationConfig location;
+}
+
+/**
+ * Location configuration.
+ */
+@Data
+public class LocationConfig {
+    /**
+     * Location type:
+     * - page: by page number
+     * - chapter: by chapter
+     * - element: by element ID
+     * - excel_cell: by Excel cell
+     * - full_document: the whole document
+     */
+    private String type;
+
+    // === Locate by page ===
+    private Integer pageStart;
+    private Integer pageEnd;
+
+    // === Locate by chapter ===
+    /** Chapter path, e.g. ["3", "5", "3", "3"] means 3.5.3.3. */
+    private List<String> chapterPath;
+    /** Chapter title keyword. */
+    private String chapterTitle;
+
+    // === Filter by paragraph ===
+    /** Paragraph range [start, end], 1-based. */
+    private List<Integer> paragraphRange;
+    /** Paragraph keyword filter. */
+    private String paragraphKeyword;
+
+    // === Locate by element ID ===
+    /** Explicit list of DocumentElement IDs. */
+    private List<String> elementIds;
+
+    // === Excel location ===
+    /** Sheet name. */
+    private String sheetName;
+    /** Cell range, e.g. "A1:C10" or "1.5.1" (custom format). */
+    private String cellRef;
+}
+```
|
|
|
+
|
|
|
+**Example: locate by page**
|
|
|
+
|
|
|
+```json
|
|
|
+{
|
|
|
+ "sourceDocId": "sd_001",
|
|
|
+ "documentAlias": "可研批复",
|
|
|
+ "location": {
|
|
|
+ "type": "page",
|
|
|
+ "pageStart": 1,
|
|
|
+ "pageEnd": 2,
|
|
|
+ "paragraphKeyword": "(一)建设必要性"
|
|
|
+ }
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+**Example: locate by chapter**
|
|
|
+
|
|
|
+```json
|
|
|
+{
|
|
|
+ "sourceDocId": "sd_002",
|
|
|
+ "documentAlias": "站址报告",
|
|
|
+ "location": {
|
|
|
+ "type": "chapter",
|
|
|
+ "chapterPath": ["3", "5", "3", "3"],
|
|
|
+ "chapterTitle": "区域地质及地震概况"
|
|
|
+ }
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+**Example: locate in Excel**
|
|
|
+
|
|
|
+```json
|
|
|
+{
|
|
|
+ "sourceDocId": "sd_003",
|
|
|
+ "documentAlias": "法规模板",
|
|
|
+ "location": {
|
|
|
+ "type": "excel_cell",
|
|
|
+ "sheetName": "变电站扩建页",
|
|
|
+ "cellRef": "1.5.1"
|
|
|
+ }
|
|
|
+}
|
|
|
+```
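`cellRef` accepts both a spreadsheet-style range such as `"A1:C10"` and a custom format such as `"1.5.1"`. The sketch below covers only the first case, turning an A1-style reference into zero-based row and column bounds. The `CellRange` class is hypothetical, and the custom `"1.5.1"` format is deliberately rejected here because this document does not define its semantics.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical value type: zero-based, inclusive bounds of an A1-style cell range. */
final class CellRange {
    private static final Pattern A1 =
            Pattern.compile("([A-Z]+)([1-9][0-9]*)(?::([A-Z]+)([1-9][0-9]*))?");

    final int firstRow;
    final int firstCol;
    final int lastRow;
    final int lastCol;

    private CellRange(int firstRow, int firstCol, int lastRow, int lastCol) {
        this.firstRow = firstRow;
        this.firstCol = firstCol;
        this.lastRow = lastRow;
        this.lastCol = lastCol;
    }

    /** Parses "A1" or "A1:C10"; throws IllegalArgumentException for anything else. */
    static CellRange parse(String cellRef) {
        Matcher m = A1.matcher(cellRef.trim().toUpperCase(Locale.ROOT));
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an A1-style reference: " + cellRef);
        }
        int col1 = columnIndex(m.group(1));
        int row1 = Integer.parseInt(m.group(2)) - 1;
        // A single cell is treated as a range of size 1x1.
        int col2 = m.group(3) == null ? col1 : columnIndex(m.group(3));
        int row2 = m.group(4) == null ? row1 : Integer.parseInt(m.group(4)) - 1;
        return new CellRange(Math.min(row1, row2), Math.min(col1, col2),
                Math.max(row1, row2), Math.max(col1, col2));
    }

    /** Converts column letters to a zero-based index: A -> 0, Z -> 25, AA -> 26. */
    private static int columnIndex(String letters) {
        int n = 0;
        for (int i = 0; i < letters.length(); i++) {
            n = n * 26 + (letters.charAt(i) - 'A' + 1);
        }
        return n - 1;
    }
}
```

A caller could try `CellRange.parse` first and fall back to the custom-format lookup when it throws.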
|
|
|
+
|
|
|
+#### 4.1.2 Reference to an extracted field (self_reference)
+
+```java
+import java.util.List;
+import lombok.Data;
+
+/**
+ * Self-reference source configuration.
+ */
+@Data
+public class SelfReferenceSourceConfig {
+    /** Key of the referenced field. */
+    private String referenceFieldKey;
+
+    /** Name of the referenced field (for display). */
+    private String referenceFieldName;
+
+    /** Multiple referenced fields (for combination). */
+    private List<String> referenceFieldKeys;
+
+    /** Combination template, e.g. "{project_name}可行性研究报告". */
+    private String combineTemplate;
+
+    /** Transformation rule. */
+    private TransformConfig transform;
+}
+
+/**
+ * Transformation configuration.
+ */
+@Data
+public class TransformConfig {
+    /** Transformation type: replace/format/substring. */
+    private String type;
+
+    // === replace ===
+    private String searchText;
+    private String replaceText;
+
+    // === format ===
+    private String formatPattern; // e.g. the date format "yyyy年MM月dd日"
+
+    // === substring ===
+    private Integer startIndex;
+    private Integer endIndex;
+}
+```
|
|
|
+
|
|
|
+**Example: reference and replace**
|
|
|
+
|
|
|
+```json
|
|
|
+{
|
|
|
+ "referenceFieldKey": "project_overview",
|
|
|
+ "referenceFieldName": "项目概述",
|
|
|
+ "transform": {
|
|
|
+ "type": "replace",
|
|
|
+ "searchText": "XX项目",
|
|
|
+ "replaceText": "{project_name}"
|
|
|
+ }
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+**Example: combining multiple fields**
|
|
|
+
|
|
|
+```json
|
|
|
+{
|
|
|
+ "referenceFieldKeys": ["project_name", "design_unit", "report_date"],
|
|
|
+ "combineTemplate": "《{project_name}可行性研究报告》由{design_unit}于{report_date}编制"
|
|
|
+}
|
|
|
+```
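A `combineTemplate` is filled from the final values of the fields listed in `referenceFieldKeys`. The sketch below shows one possible way to do this. The `CombineTemplateRenderer` class, the `{field_key}` placeholder grammar (letters, digits, underscore), and the choice to fail on a missing value are assumptions for illustration and are not specified by this document.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: fills {field_key} placeholders in a combineTemplate. */
final class CombineTemplateRenderer {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{([A-Za-z0-9_]+)\\}");

    private CombineTemplateRenderer() {}

    /**
     * @param template template text containing {field_key} placeholders
     * @param values   final values of already extracted fields, keyed by target_field_key
     * @throws IllegalStateException if a referenced field has no value yet
     */
    static String render(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String key = m.group(1);
            String value = values.get(key);
            if (value == null) {
                throw new IllegalStateException("Referenced field not extracted yet: " + key);
            }
            // quoteReplacement keeps '$' and '\' in extracted values literal.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Failing fast on a missing value keeps a half-filled template from being stored as an extraction result; a lenient variant could leave the placeholder in place instead.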
|
|
|
+
|
|
|
+#### 4.1.3 Fixed content (fixed)
+
+```java
+import lombok.Data;
+
+/**
+ * Fixed content configuration.
+ */
+@Data
+public class FixedSourceConfig {
+    /** Fixed text content. */
+    private String content;
+
+    /** Content type. */
+    private String contentType; // text/html/markdown
+}
+```
|
|
|
+
|
|
|
+**Example**
|
|
|
+
|
|
|
+```json
|
|
|
+{
|
|
|
+ "content": "本报告依据《电力建设工程预算编制办法》(2018版)编制。",
|
|
|
+ "contentType": "text"
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+#### 4.1.4 Manual input (manual)
+
+```java
+import java.util.List;
+import lombok.Data;
+
+/**
+ * Manual input configuration.
+ */
+@Data
+public class ManualSourceConfig {
+    /** Input hint. */
+    private String placeholder;
+
+    /** Whether the field is required. */
+    private Boolean required;
+
+    /** Default value. */
+    private String defaultValue;
+
+    /** Input type. */
+    private String inputType; // text/textarea/date/number/select
+
+    /** Option list (when inputType = select). */
+    private List<String> options;
+
+    /** Validation rules. */
+    private ValidationConfig validation;
+}
+```
|
|
|
+
|
|
|
+**Example**
|
|
|
+
|
|
|
+```json
|
|
|
+{
|
|
|
+ "placeholder": "请输入项目联系人姓名",
|
|
|
+ "required": true,
|
|
|
+ "inputType": "text",
|
|
|
+ "validation": {
|
|
|
+ "maxLength": 50,
|
|
|
+ "pattern": "^[\\u4e00-\\u9fa5]{2,10}$"
|
|
|
+ }
|
|
|
+}
|
|
|
+```
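`ValidationConfig` is referenced by `ManualSourceConfig` but not defined in this section. Judging only from the example above, it carries at least `maxLength` and `pattern`. The sketch below shows how a manual value could be checked against `required`, `maxLength`, and `pattern`; the `ManualInputValidator` class, its parameter list, and the error messages are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: validates a manually entered value. */
final class ManualInputValidator {

    private ManualInputValidator() {}

    /**
     * @param input     the value typed by the user (may be null)
     * @param required  whether an empty value is an error
     * @param maxLength maximum length in Unicode code points, or null for no limit
     * @param pattern   regular expression the whole value must match, or null
     * @return a list of error messages; empty when the value is acceptable
     */
    static List<String> validate(String input, boolean required, Integer maxLength, String pattern) {
        List<String> errors = new ArrayList<>();
        if (input == null || input.isBlank()) {
            if (required) {
                errors.add("value is required");
            }
            return errors; // nothing else to check on an empty value
        }
        // Count code points so that characters outside the BMP count once.
        if (maxLength != null && input.codePointCount(0, input.length()) > maxLength) {
            errors.add("value exceeds maxLength " + maxLength);
        }
        if (pattern != null && !Pattern.matches(pattern, input)) {
            errors.add("value does not match pattern");
        }
        return errors;
    }
}
```

Returning all errors at once, rather than throwing on the first one, lets the UI show every problem with the field in a single pass.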
|
|
|
+
|
|
|
+### 4.2 extract_config in Detail
+
+#### 4.2.1 Direct extraction (direct)
+
+```java
+import java.util.List;
+import lombok.Data;
+
+/**
+ * Direct extraction configuration.
+ */
+@Data
+public class DirectExtractConfig {
+    /** Whether to strip leading and trailing whitespace. */
+    private Boolean trimWhitespace = true;
+
+    /** Whether to remove line breaks. */
+    private Boolean removeLineBreaks = false;
+
+    /** Whether to merge consecutive spaces. */
+    private Boolean mergeSpaces = true;
+
+    /** HTML tags to preserve (when formatting must be kept). */
+    private List<String> preserveTags;
+}
+```
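The three boolean switches of `DirectExtractConfig` describe a small text clean-up pipeline. A minimal sketch of how they might be applied is shown below. The `DirectTextCleaner` class is hypothetical, and two details are assumptions: line breaks are removed without inserting a space (reasonable for Chinese text, less so for English), and `preserveTags` handling is left out.

```java
/** Hypothetical helper: applies the DirectExtractConfig switches to raw text. */
final class DirectTextCleaner {

    private DirectTextCleaner() {}

    static String clean(String raw, boolean trimWhitespace,
                        boolean removeLineBreaks, boolean mergeSpaces) {
        String text = raw;
        if (removeLineBreaks) {
            // \R matches any line break sequence (\n, \r\n, \r, ...).
            text = text.replaceAll("\\R", "");
        }
        if (mergeSpaces) {
            // Collapse runs of spaces and tabs; line breaks are left alone.
            text = text.replaceAll("[ \\t]{2,}", " ");
        }
        if (trimWhitespace) {
            // strip() also removes Unicode whitespace such as the ideographic space.
            text = text.strip();
        }
        return text;
    }
}
```

Removing line breaks before merging spaces means that indentation left over from a wrapped line is collapsed as well.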
|
|
|
+
|
|
|
+#### 4.2.2 AI extraction (ai_extract)
+
+```java
+import java.util.List;
+import lombok.Data;
+
+/**
+ * AI field extraction configuration.
+ */
+@Data
+public class AIExtractConfig {
+    /** Description of what to extract. */
+    private String targetDescription;
+
+    /** Field type. */
+    private String fieldType; // text/date/number/person/org/location
+
+    /** Description of the expected format. */
+    private String expectedFormat;
+
+    /** Example values. */
+    private List<String> examples;
+
+    /** Whether to return multiple results. */
+    private Boolean multipleResults = false;
+
+    /** Custom prompt (advanced). */
+    private String customPrompt;
+}
+```
|
|
|
+
|
|
|
+**Example: extracting the project name**
|
|
|
+
|
|
|
+```json
|
|
|
+{
|
|
|
+ "targetDescription": "从批复文件中提取工程项目的完整名称",
|
|
|
+ "fieldType": "text",
|
|
|
+ "expectedFormat": "XX市XX工程",
|
|
|
+ "examples": ["襄阳连云220千伏输变电工程", "武汉东湖110千伏输变电工程"]
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+**Example: extracting a date**
|
|
|
+
|
|
|
+```json
|
|
|
+{
|
|
|
+ "targetDescription": "提取可研报告的批复日期",
|
|
|
+ "fieldType": "date",
|
|
|
+ "expectedFormat": "YYYY年MM月DD日",
|
|
|
+ "examples": ["2024年5月15日", "2023年12月1日"]
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+#### 4.2.3 AI summarization (ai_summarize)
+
+```java
+import java.util.List;
+import lombok.Data;
+
+/**
+ * AI summarization/distillation configuration.
+ */
+@Data
+public class AISummarizeConfig {
+    /** Summarization prompt. */
+    private String summarizePrompt;
+
+    /** Dimensions/angles to focus on. */
+    private List<String> focusPoints;
+
+    /** Summarization rules. */
+    private List<String> rules;
+
+    /** Output style. */
+    private String style; // formal/concise/detailed/bullet_points
+
+    /** Maximum length in characters. */
+    private Integer maxLength;
+
+    /** Whether to preserve key data. */
+    private Boolean preserveKeyData = true;
+
+    /** Keys of context fields to include as reference. */
+    private List<String> contextFieldKeys;
+}
+```
|
|
|
+
|
|
|
+**Example: summarizing the necessity of construction**
|
|
|
+
|
|
|
+```json
|
|
|
+{
|
|
|
+ "summarizePrompt": "请对以下内容进行总结,重点描述项目建设的必要性",
|
|
|
+ "focusPoints": ["建设背景", "现状问题", "建设目的"],
|
|
|
+ "rules": [
|
|
|
+ "使用正式的工程报告语言",
|
|
|
+ "保留关键的数据和指标",
|
|
|
+ "控制在200字以内"
|
|
|
+ ],
|
|
|
+ "style": "formal",
|
|
|
+ "maxLength": 200
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+**Example: summarization with distillation rules**
|
|
|
+
|
|
|
+```json
|
|
|
+{
|
|
|
+ "summarizePrompt": "以工程选址的角度,总结站址的地质条件",
|
|
|
+ "focusPoints": ["地质构造", "地震烈度", "岩土条件", "地下水情况"],
|
|
|
+ "rules": [
|
|
|
+ "先概述整体地质环境",
|
|
|
+ "重点说明对工程的影响",
|
|
|
+ "给出适宜性评价"
|
|
|
+ ],
|
|
|
+ "style": "formal",
|
|
|
+ "maxLength": 300,
|
|
|
+ "contextFieldKeys": ["project_location", "project_type"]
|
|
|
+}
|
|
|
+```
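An `AISummarizeConfig` eventually has to be turned into the text sent to the model. The sketch below shows one possible assembly of that text from `summarizePrompt`, `focusPoints`, `rules`, and `maxLength`. The `SummarizePromptBuilder` class and the prompt layout (section labels, numbering, the `---` separator) are assumptions for illustration; the actual prompt format used by `ai-service` is not described in this document.

```java
import java.util.List;

/** Hypothetical helper: assembles a summarization prompt from config fields. */
final class SummarizePromptBuilder {

    private SummarizePromptBuilder() {}

    static String build(String summarizePrompt, List<String> focusPoints,
                        List<String> rules, Integer maxLength, String sourceText) {
        StringBuilder sb = new StringBuilder(summarizePrompt).append('\n');
        if (focusPoints != null && !focusPoints.isEmpty()) {
            sb.append("Focus points: ").append(String.join(", ", focusPoints)).append('\n');
        }
        if (rules != null) {
            // Rules are numbered so the model can follow them in order.
            for (int i = 0; i < rules.size(); i++) {
                sb.append(i + 1).append(". ").append(rules.get(i)).append('\n');
            }
        }
        if (maxLength != null) {
            sb.append("Maximum length: ").append(maxLength).append(" characters\n");
        }
        // The located source content follows a separator line.
        sb.append("---\n").append(sourceText);
        return sb.toString();
    }
}
```

Values of the fields named in `contextFieldKeys` could be appended in the same way before the separator.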
|
|
|
+
|
|
|
+#### 4.2.4 OCR recognition (ocr)
+
+```java
+import lombok.Data;
+
+/**
+ * OCR recognition configuration.
+ */
+@Data
+public class OcrExtractConfig {
+    /** Whether to run AI extraction after OCR. */
+    private Boolean aiPostProcess = true;
+
+    /** AI post-processing configuration. */
+    private AIExtractConfig aiConfig;
+
+    /** Image preprocessing. */
+    private ImagePreprocessConfig preprocess;
+}
+
+/**
+ * Image preprocessing configuration.
+ */
+@Data
+public class ImagePreprocessConfig {
+    private Boolean deskew = true;    // deskew
+    private Boolean denoise = true;   // denoise
+    private Boolean binarize = false; // binarize
+    private Integer contrast = 0;     // contrast adjustment
+}
+```
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## 5. Core Entity Class Design
+
+### 5.1 New Module Structure
+
+```
+backend/extract-service/
+├── pom.xml
+└── src/main/java/com/lingyue/extract/
+    ├── ExtractServiceApplication.java
+    ├── config/
+    │   └── ExtractConfig.java
+    ├── controller/
+    │   ├── ProjectController.java
+    │   ├── SourceDocumentController.java
+    │   ├── ExtractRuleController.java
+    │   └── ExtractExecuteController.java
+    ├── dto/
+    │   ├── request/
+    │   │   ├── CreateProjectRequest.java
+    │   │   ├── UpdateProjectRequest.java
+    │   │   ├── AddSourceDocumentRequest.java
+    │   │   ├── CreateRuleRequest.java
+    │   │   ├── UpdateRuleRequest.java
+    │   │   ├── BatchCreateRulesRequest.java
+    │   │   ├── ExecuteRulesRequest.java
+    │   │   └── ConfirmResultRequest.java
+    │   ├── response/
+    │   │   ├── ProjectDetailResponse.java
+    │   │   ├── RuleListResponse.java
+    │   │   ├── ExtractPreviewResponse.java
+    │   │   ├── ExecuteProgressResponse.java
+    │   │   └── ExtractResultResponse.java
+    │   └── config/
+    │       ├── SourceConfig.java
+    │       ├── DocumentSourceConfig.java
+    │       ├── SelfReferenceSourceConfig.java
+    │       ├── FixedSourceConfig.java
+    │       ├── ManualSourceConfig.java
+    │       ├── LocationConfig.java
+    │       ├── ExtractConfig.java
+    │       ├── DirectExtractConfig.java
+    │       ├── AIExtractConfig.java
+    │       ├── AISummarizeConfig.java
+    │       └── OcrExtractConfig.java
+    ├── entity/
+    │   ├── Project.java
+    │   ├── SourceDocument.java
+    │   ├── ExtractRule.java
+    │   ├── ExtractResult.java
+    │   └── RuleTemplate.java
+    ├── repository/
+    │   ├── ProjectRepository.java
+    │   ├── SourceDocumentRepository.java
+    │   ├── ExtractRuleRepository.java
+    │   ├── ExtractResultRepository.java
+    │   └── RuleTemplateRepository.java
+    ├── service/
+    │   ├── ProjectService.java
+    │   ├── SourceDocumentService.java
+    │   ├── ExtractRuleService.java
+    │   ├── ExtractExecuteService.java
+    │   ├── ContentLocatorService.java
+    │   ├── AIExtractService.java
+    │   └── RuleTemplateService.java
+    └── executor/
+        ├── ExtractExecutor.java
+        ├── DirectExtractExecutor.java
+        ├── AIExtractExecutor.java
+        ├── AISummarizeExecutor.java
+        └── OcrExtractExecutor.java
+```
|
|
|
+
|
|
|
+### 5.2 Entity Class Definitions
+
+#### 5.2.1 Project.java
+
+```java
+package com.lingyue.extract.entity;
+
+import com.baomidou.mybatisplus.annotation.TableField;
+import com.baomidou.mybatisplus.annotation.TableName;
+import com.lingyue.common.domain.entity.SimpleModel;
+import com.lingyue.common.mybatis.PostgreSqlJsonbTypeHandler;
+import io.swagger.v3.oas.annotations.media.Schema;
+import lombok.Data;
+import lombok.EqualsAndHashCode;
+
+/**
+ * Data extraction project entity.
+ *
+ * @author lingyue
+ * @since 2026-01-22
+ */
+@EqualsAndHashCode(callSuper = true)
+@Data
+@TableName(value = "projects", autoResultMap = true)
+@Schema(description = "Data extraction project")
+public class Project extends SimpleModel {
+
+    @Schema(description = "User ID")
+    @TableField("user_id")
+    private String userId;
+
+    @Schema(description = "Project name")
+    @TableField("name")
+    private String name;
+
+    @Schema(description = "Project description")
+    @TableField("description")
+    private String description;
+
+    @Schema(description = "Status", example = "draft/extracting/completed/archived")
+    @TableField("status")
+    private String status = "draft";
+
+    @Schema(description = "Project configuration")
+    @TableField(value = "config", typeHandler = PostgreSqlJsonbTypeHandler.class)
+    private Object config;
+
+    // ===== Status constants =====
+    public static final String STATUS_DRAFT = "draft";
+    public static final String STATUS_EXTRACTING = "extracting";
+    public static final String STATUS_COMPLETED = "completed";
+    public static final String STATUS_ARCHIVED = "archived";
+}
+```
|
|
|
+
|
|
|
+#### 5.2.2 SourceDocument.java
+
+```java
+package com.lingyue.extract.entity;
+
+import com.baomidou.mybatisplus.annotation.TableField;
+import com.baomidou.mybatisplus.annotation.TableName;
+import com.lingyue.common.domain.entity.SimpleModel;
+import com.lingyue.common.mybatis.PostgreSqlJsonbTypeHandler;
+import io.swagger.v3.oas.annotations.media.Schema;
+import lombok.Data;
+import lombok.EqualsAndHashCode;
+
+/**
+ * Source document entity.
+ * A document used in the project, linked to an already parsed Document.
+ *
+ * @author lingyue
+ * @since 2026-01-22
+ */
+@EqualsAndHashCode(callSuper = true)
+@Data
+@TableName(value = "source_documents", autoResultMap = true)
+@Schema(description = "Source document")
+public class SourceDocument extends SimpleModel {
+
+    @Schema(description = "Project ID")
+    @TableField("project_id")
+    private String projectId;
+
+    @Schema(description = "ID of the linked Document")
+    @TableField("document_id")
+    private String documentId;
+
+    @Schema(description = "Document alias")
+    @TableField("alias")
+    private String alias;
+
+    @Schema(description = "Document type", example = "pdf/docx/xlsx")
+    @TableField("doc_type")
+    private String docType;
+
+    @Schema(description = "Display order")
+    @TableField("display_order")
+    private Integer displayOrder = 0;
+
+    @Schema(description = "Metadata")
+    @TableField(value = "metadata", typeHandler = PostgreSqlJsonbTypeHandler.class)
+    private Object metadata;
+}
+```
|
|
|
+
|
|
|
+#### 5.2.3 ExtractRule.java
+
+```java
+package com.lingyue.extract.entity;
+
+import com.baomidou.mybatisplus.annotation.TableField;
+import com.baomidou.mybatisplus.annotation.TableName;
+import com.lingyue.common.domain.entity.SimpleModel;
+import com.lingyue.common.mybatis.PostgreSqlJsonbTypeHandler;
+import io.swagger.v3.oas.annotations.media.Schema;
+import lombok.Data;
+import lombok.EqualsAndHashCode;
+
+/**
+ * Extraction rule entity.
+ * Configuration describing how to extract data from a source document.
+ *
+ * @author lingyue
+ * @since 2026-01-22
+ */
+@EqualsAndHashCode(callSuper = true)
+@Data
+@TableName(value = "extract_rules", autoResultMap = true)
+@Schema(description = "Extraction rule")
+public class ExtractRule extends SimpleModel {
+
+    @Schema(description = "Project ID")
+    @TableField("project_id")
+    private String projectId;
+
+    @Schema(description = "Source document ID")
+    @TableField("source_doc_id")
+    private String sourceDocId;
+
+    // ===== Target field =====
+
+    @Schema(description = "Target field key (used by code)")
+    @TableField("target_field_key")
+    private String targetFieldKey;
+
+    @Schema(description = "Target field name (used for display)")
+    @TableField("target_field_name")
+    private String targetFieldName;
+
+    @Schema(description = "Field group")
+    @TableField("target_field_group")
+    private String targetFieldGroup;
+
+    @Schema(description = "Rule order")
+    @TableField("rule_index")
+    private Integer ruleIndex;
+
+    // ===== Source configuration =====
+
+    @Schema(description = "Source type", example = "document/self_reference/fixed/manual")
+    @TableField("source_type")
+    private String sourceType;
+
+    @Schema(description = "Source configuration")
+    @TableField(value = "source_config", typeHandler = PostgreSqlJsonbTypeHandler.class)
+    private Object sourceConfig;
+
+    // ===== Extraction configuration =====
+
+    @Schema(description = "Extraction type", example = "direct/ai_extract/ai_summarize/ocr")
+    @TableField("extract_type")
+    private String extractType;
+
+    @Schema(description = "Extraction configuration")
+    @TableField(value = "extract_config", typeHandler = PostgreSqlJsonbTypeHandler.class)
+    private Object extractConfig;
+
+    // ===== Result =====
+
+    @Schema(description = "Status", example = "pending/extracting/extracted/confirmed/error")
+    @TableField("status")
+    private String status = STATUS_PENDING;
+
+    @Schema(description = "Extracted value")
+    @TableField("extracted_value")
+    private String extractedValue;
+
+    @Schema(description = "Value type", example = "text/table/image/list")
+    @TableField("value_type")
+    private String valueType = "text";
+
+    @Schema(description = "Error message")
+    @TableField("error_message")
+    private String errorMessage;
+
+    @Schema(description = "Metadata")
+    @TableField(value = "metadata", typeHandler = PostgreSqlJsonbTypeHandler.class)
+    private Object metadata;
+
+    // ===== Constants =====
+
+    // Source types
+    public static final String SOURCE_DOCUMENT = "document";
+    public static final String SOURCE_SELF_REFERENCE = "self_reference";
+    public static final String SOURCE_FIXED = "fixed";
+    public static final String SOURCE_MANUAL = "manual";
+
+    // Extraction types
+    public static final String EXTRACT_DIRECT = "direct";
+    public static final String EXTRACT_AI_EXTRACT = "ai_extract";
+    public static final String EXTRACT_AI_SUMMARIZE = "ai_summarize";
+    public static final String EXTRACT_OCR = "ocr";
+
+    // Statuses
+    public static final String STATUS_PENDING = "pending";
+    public static final String STATUS_EXTRACTING = "extracting";
+    public static final String STATUS_EXTRACTED = "extracted";
+    public static final String STATUS_CONFIRMED = "confirmed";
+    public static final String STATUS_ERROR = "error";
+}
+```
|
|
|
+
|
|
|
+#### 5.2.4 ExtractResult.java
+
+```java
+package com.lingyue.extract.entity;
+
+import com.baomidou.mybatisplus.annotation.TableField;
+import com.baomidou.mybatisplus.annotation.TableName;
+import com.lingyue.common.domain.entity.SimpleModel;
+import com.lingyue.common.mybatis.PostgreSqlJsonbTypeHandler;
+import io.swagger.v3.oas.annotations.media.Schema;
+import lombok.Data;
+import lombok.EqualsAndHashCode;
+
+import java.time.LocalDateTime;
+
+/**
+ * Extraction result entity.
+ * Records the detailed result of every extraction run so that history can be traced.
+ *
+ * @author lingyue
+ * @since 2026-01-22
+ */
+@EqualsAndHashCode(callSuper = true)
+@Data
+@TableName(value = "extract_results", autoResultMap = true)
+@Schema(description = "Extraction result")
+public class ExtractResult extends SimpleModel {
+
+    @Schema(description = "Rule ID")
+    @TableField("rule_id")
+    private String ruleId;
+
+    @Schema(description = "Project ID")
+    @TableField("project_id")
+    private String projectId;
+
+    // ===== Extraction result =====
+
+    @Schema(description = "Extracted value")
+    @TableField("extracted_value")
+    private String extractedValue;
+
+    @Schema(description = "Value type")
+    @TableField("value_type")
+    private String valueType = "text";
+
+    // ===== Source tracing =====
+
+    @Schema(description = "Original source text")
+    @TableField("source_content")
+    private String sourceContent;
+
+    @Schema(description = "Source location information")
+    @TableField(value = "source_location", typeHandler = PostgreSqlJsonbTypeHandler.class)
+    private Object sourceLocation;
+
+    // ===== Quality assessment =====
+
+    @Schema(description = "Confidence of the AI extraction, 0-1")
+    @TableField("confidence")
+    private Double confidence;
+
+    // ===== Status =====
+
+    @Schema(description = "Status", example = "extracted/confirmed/rejected/modified")
+    @TableField("status")
+    private String status = STATUS_EXTRACTED;
+
+    // ===== Manual handling =====
+
+    @Schema(description = "Manually corrected value")
+    @TableField("modified_value")
+    private String modifiedValue;
+
+    @Schema(description = "Confirmation time")
+    @TableField("confirmed_at")
+    private LocalDateTime confirmedAt;
+
+    @Schema(description = "Confirmed by")
+    @TableField("confirmed_by")
+    private String confirmedBy;
+
+    @Schema(description = "Metadata")
+    @TableField(value = "metadata", typeHandler = PostgreSqlJsonbTypeHandler.class)
+    private Object metadata;
+
+    // ===== Constants =====
+    public static final String STATUS_EXTRACTED = "extracted";
+    public static final String STATUS_CONFIRMED = "confirmed";
+    public static final String STATUS_REJECTED = "rejected";
+    public static final String STATUS_MODIFIED = "modified";
+
+    /**
+     * Returns the final value, preferring the manually corrected value when present.
+     */
+    public String getFinalValue() {
+        return modifiedValue != null ? modifiedValue : extractedValue;
+    }
+}
+```
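
The four status values and `getFinalValue()` suggest a small review state machine. The block below is only a sketch of how that could behave, under stated assumptions: `ReviewableResult` is a hypothetical, dependency-free stand-in for `ExtractResult` (no MyBatis-Plus, no Lombok), and the idea that confirming or correcting a result also records the reviewer is an assumption, not something the entity enforces.

```java
import java.time.LocalDateTime;

/** Hypothetical, dependency-free stand-in for ExtractResult (illustration only). */
final class ReviewableResult {
    String extractedValue;
    String modifiedValue;
    String status = "extracted";
    LocalDateTime confirmedAt;
    String confirmedBy;

    ReviewableResult(String extractedValue) {
        this.extractedValue = extractedValue;
    }

    /** Accepts the extracted value unchanged and records who confirmed it. */
    void confirm(String userId) {
        status = "confirmed";
        stampReviewer(userId);
    }

    /** Stores a manual correction; the extracted value is kept for traceability. */
    void modify(String correctedValue, String userId) {
        modifiedValue = correctedValue;
        status = "modified";
        stampReviewer(userId);
    }

    /** Marks the result as rejected; no value is changed. */
    void reject() {
        status = "rejected";
    }

    /** Same rule as ExtractResult.getFinalValue(): the correction wins when present. */
    String finalValue() {
        return modifiedValue != null ? modifiedValue : extractedValue;
    }

    // Assumption: both confirm and modify count as a review by this user.
    private void stampReviewer(String userId) {
        confirmedAt = LocalDateTime.now();
        confirmedBy = userId;
    }
}
```

Keeping `extractedValue` untouched when a correction is stored is what lets the original AI output stay traceable next to the human edit.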
|
|
|
+
|
|
|
+---
+
+## 6. Core Service Design
+
+### 6.1 ContentLocatorService (Content Locator Service)
+
+Locates and extracts content from a document according to a `LocationConfig`.
+
+```java
+package com.lingyue.extract.service;
+
+import com.lingyue.document.entity.DocumentElement;
+import com.lingyue.extract.dto.config.LocationConfig;
+import java.util.List;
+
+/**
+ * Content locator service.
+ * Extracts content from a document according to a location configuration.
+ *
+ * @author lingyue
+ * @since 2026-01-22
+ */
+public interface ContentLocatorService {
+
+    /**
+     * Returns the document elements that match a location configuration.
+     *
+     * @param documentId document ID
+     * @param location   location configuration
+     * @return list of matching document elements
+     */
+    List<DocumentElement> locateElements(String documentId, LocationConfig location);
+
+    /**
+     * Returns the text content that matches a location configuration.
+     *
+     * @param documentId document ID
+     * @param location   location configuration
+     * @return extracted text content
+     */
+    String locateContent(String documentId, LocationConfig location);
+
+    /**
+     * Locates by page range.
+     */
+    List<DocumentElement> locateByPage(String documentId, int pageStart, int pageEnd, String keyword);
+
+    /**
+     * Locates by chapter.
+     */
+    List<DocumentElement> locateByChapter(String documentId, List<String> chapterPath, String chapterTitle);
+
+    /**
+     * Locates by element ID.
+     */
+    List<DocumentElement> locateByElementIds(String documentId, List<String> elementIds);
+
+    /**
+     * Locates an Excel cell.
+     */
+    String locateExcelCell(String documentId, String sheetName, String cellRef);
+}
+```
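
The document does not specify the format of `cellRef` in `locateExcelCell`. Assuming it is an A1-style reference such as `B12`, an implementation would first need to turn it into row and column indexes. The helper below is a minimal sketch of that step; `CellRefSketch` and its zero-based return convention are hypothetical and not part of the interface above.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: converts an A1-style reference into zero-based indexes. */
final class CellRefSketch {
    // Column letters followed by a 1-based row number, e.g. "B12" or "AA1".
    private static final Pattern A1 = Pattern.compile("^([A-Za-z]+)([1-9][0-9]*)$");

    /** Returns {rowIndex, columnIndex}, both zero-based. */
    static int[] parse(String cellRef) {
        Matcher m = A1.matcher(cellRef.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an A1-style cell reference: " + cellRef);
        }
        int column = 0;
        for (char c : m.group(1).toUpperCase(Locale.ROOT).toCharArray()) {
            column = column * 26 + (c - 'A' + 1); // bijective base 26: A=1 .. Z=26
        }
        int row = Integer.parseInt(m.group(2));
        return new int[] {row - 1, column - 1};
    }
}
```

Rejecting malformed references early would let the service report a location failure (`EXTRACT_005`) before touching the spreadsheet.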
|
|
|
+
|
|
|
+### 6.2 ExtractExecuteService (Extraction Execution Service)
+
+Coordinates the execution of extraction tasks.
+
+```java
+package com.lingyue.extract.service;
+
+import com.lingyue.extract.dto.response.ExecuteProgressResponse;
+import com.lingyue.extract.dto.response.ExtractResultResponse;
+import com.lingyue.extract.entity.ExtractResult;
+import com.lingyue.extract.entity.ExtractRule;
+
+import java.util.List;
+
+/**
+ * Extraction execution service.
+ *
+ * @author lingyue
+ * @since 2026-01-22
+ */
+public interface ExtractExecuteService {
+
+    /**
+     * Executes a single rule.
+     *
+     * @param ruleId rule ID
+     * @return extraction result
+     */
+    ExtractResult executeRule(String ruleId);
+
+    /**
+     * Executes the given list of rules.
+     *
+     * @param ruleIds list of rule IDs
+     * @return list of extraction results
+     */
+    List<ExtractResult> executeRules(List<String> ruleIds);
+
+    /**
+     * Executes all rules of a project.
+     *
+     * @param projectId project ID
+     * @return list of extraction results
+     */
+    List<ExtractResult> executeProject(String projectId);
+
+    /**
+     * Executes a project asynchronously (background task).
+     *
+     * @param projectId project ID
+     * @return task ID
+     */
+    String executeProjectAsync(String projectId);
+
+    /**
+     * Returns the execution progress.
+     *
+     * @param taskId task ID
+     * @return progress information
+     */
+    ExecuteProgressResponse getProgress(String taskId);
+
+    /**
+     * Previews an extraction without saving the result.
+     *
+     * @param rule rule configuration
+     * @return preview result
+     */
+    ExtractResultResponse preview(ExtractRule rule);
+
+    /**
+     * Re-executes a rule.
+     *
+     * @param ruleId rule ID
+     * @return the new extraction result
+     */
+    ExtractResult retryRule(String ruleId);
+}
+```
|
|
|
+
|
|
|
+### 6.3 AIExtractService (AI Extraction Service)
+
+Encapsulates the AI extraction and summarization logic.
+
+```java
+package com.lingyue.extract.service;
+
+import com.lingyue.extract.dto.config.AIExtractConfig;
+import com.lingyue.extract.dto.config.AISummarizeConfig;
+import lombok.Data;
+
+import java.util.List;
+import java.util.Map;
+
+/**
+ * AI extraction service.
+ *
+ * @author lingyue
+ * @since 2026-01-22
+ */
+public interface AIExtractService {
+
+    /**
+     * AI field extraction.
+     *
+     * @param content original text
+     * @param config  extraction configuration
+     * @return extraction result (value and confidence)
+     */
+    AIExtractResult extract(String content, AIExtractConfig config);
+
+    /**
+     * AI content summarization.
+     *
+     * @param content original text
+     * @param config  summarization configuration
+     * @param context values of context fields (optional)
+     * @return summarization result
+     */
+    AISummarizeResult summarize(String content, AISummarizeConfig config,
+                                Map<String, String> context);
+
+    /**
+     * AI extraction result.
+     */
+    @Data
+    class AIExtractResult {
+        private String value;
+        private Double confidence;
+        private String reasoning;
+    }
+
+    /**
+     * AI summarization result.
+     */
+    @Data
+    class AISummarizeResult {
+        private String summary;
+        private List<String> keyPoints;
+    }
+}
+```
|
|
|
+
|
|
|
+---
+
+## 7. API Design
+
+### 7.1 Project Management API
+
+| Method | Path                                      | Description         |
+| ------ | ----------------------------------------- | ------------------- |
+| POST   | `/api/v1/extract/projects`                | Create a project    |
+| GET    | `/api/v1/extract/projects`                | List projects       |
+| GET    | `/api/v1/extract/projects/{id}`           | Get project details |
+| PUT    | `/api/v1/extract/projects/{id}`           | Update a project    |
+| DELETE | `/api/v1/extract/projects/{id}`           | Delete a project    |
+| POST   | `/api/v1/extract/projects/{id}/archive`   | Archive a project   |
+
+### 7.2 Source Document API
+
+| Method | Path                                                     | Description                   |
+| ------ | -------------------------------------------------------- | ----------------------------- |
+| POST   | `/api/v1/extract/projects/{projectId}/documents`         | Add a source document         |
+| GET    | `/api/v1/extract/projects/{projectId}/documents`         | List source documents         |
+| PUT    | `/api/v1/extract/projects/{projectId}/documents/{id}`    | Update a source document      |
+| DELETE | `/api/v1/extract/projects/{projectId}/documents/{id}`    | Remove a source document      |
+| POST   | `/api/v1/extract/projects/{projectId}/documents/batch`   | Add source documents in batch |
+
+### 7.3 Extraction Rule API
+
+| Method | Path                                                          | Description           |
+| ------ | ------------------------------------------------------------- | --------------------- |
+| POST   | `/api/v1/extract/projects/{projectId}/rules`                  | Create a rule         |
+| GET    | `/api/v1/extract/projects/{projectId}/rules`                  | List rules            |
+| GET    | `/api/v1/extract/projects/{projectId}/rules/{id}`             | Get rule details      |
+| PUT    | `/api/v1/extract/projects/{projectId}/rules/{id}`             | Update a rule         |
+| DELETE | `/api/v1/extract/projects/{projectId}/rules/{id}`             | Delete a rule         |
+| POST   | `/api/v1/extract/projects/{projectId}/rules/batch`            | Create rules in batch |
+| PUT    | `/api/v1/extract/projects/{projectId}/rules/reorder`          | Reorder rules         |
+| POST   | `/api/v1/extract/projects/{projectId}/rules/{id}/duplicate`   | Duplicate a rule      |
+
+### 7.4 Extraction Execution API
+
+| Method | Path                                             | Description                    |
+| ------ | ------------------------------------------------ | ------------------------------ |
+| POST   | `/api/v1/extract/projects/{projectId}/execute`   | Execute all rules of a project |
+| POST   | `/api/v1/extract/rules/{ruleId}/execute`         | Execute a single rule          |
+| POST   | `/api/v1/extract/rules/batch-execute`            | Execute rules in batch         |
+| POST   | `/api/v1/extract/rules/{ruleId}/preview`         | Preview the extraction result  |
+| POST   | `/api/v1/extract/rules/{ruleId}/retry`           | Re-execute a rule              |
+| GET    | `/api/v1/extract/tasks/{taskId}/progress`        | Get task progress              |
+
+### 7.5 Extraction Result API
+
+| Method | Path                                                         | Description                      |
+| ------ | ------------------------------------------------------------ | -------------------------------- |
+| GET    | `/api/v1/extract/projects/{projectId}/results`               | List all results of a project    |
+| GET    | `/api/v1/extract/rules/{ruleId}/results`                     | Get the result history of a rule |
+| POST   | `/api/v1/extract/results/{id}/confirm`                       | Confirm a result                 |
+| POST   | `/api/v1/extract/results/{id}/reject`                        | Reject a result                  |
+| PUT    | `/api/v1/extract/results/{id}/modify`                        | Correct a result                 |
+| POST   | `/api/v1/extract/projects/{projectId}/results/confirm-all`   | Confirm results in batch         |
+
+### 7.6 Rule Template API
+
+| Method | Path                                                      | Description                      |
+| ------ | --------------------------------------------------------- | -------------------------------- |
+| POST   | `/api/v1/extract/templates`                               | Create a template                |
+| GET    | `/api/v1/extract/templates`                               | List templates                   |
+| GET    | `/api/v1/extract/templates/{id}`                          | Get template details             |
+| DELETE | `/api/v1/extract/templates/{id}`                          | Delete a template                |
+| POST   | `/api/v1/extract/templates/{id}/apply`                    | Apply a template to a project    |
+| POST   | `/api/v1/extract/projects/{projectId}/save-as-template`   | Save project rules as a template |
|
|
|
+---
+
+## 8. Core Workflows
+
+### 8.1 End-to-End Workflow
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│                              User workflow                              │
+└─────────────────────────────────────────────────────────────────────────┘
+                                     │
+          ┌──────────────────────────┼──────────────────────────┐
+          ▼                          ▼                          ▼
+┌───────────────────┐      ┌───────────────────┐      ┌───────────────────┐
+│ 1. Create a       │      │ 2. Upload docs    │      │ 3. Configure      │
+│    project        │─────►│    and link them  │─────►│    rules (may use │
+│                   │      │    to the project │      │    a template)    │
+└───────────────────┘      └───────────────────┘      └───────────────────┘
+                                                                │
+          ┌─────────────────────────────────────────────────────┘
+          ▼
+┌───────────────────┐      ┌───────────────────┐      ┌───────────────────┐
+│ 4. Run extraction │      │ 5. Review results │      │ 6. Confirm or     │
+│    (preview       │─────►│    and trace      │─────►│    correct the    │
+│    available)     │      │    their sources  │      │    results        │
+└───────────────────┘      └───────────────────┘      └───────────────────┘
+                                                                │
+          ┌─────────────────────────────────────────────────────┘
+          ▼
+┌───────────────────┐      ┌───────────────────┐
+│ 7. Export data or │      │ 8. Save as a      │
+│    generate the   │─────►│    rule template  │
+│    report         │      │                   │
+└───────────────────┘      └───────────────────┘
+```
|
|
|
+
|
|
|
+### 8.2 Rule Execution Flow
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│                   ExtractExecuteService.executeRule()                   │
+└─────────────────────────────────────────────────────────────────────────┘
+                                     │
+                                     ▼
+                         ┌───────────────────────┐
+                         │ 1. Load rule config   │
+                         │    (ExtractRule)      │
+                         └───────────────────────┘
+                                     │
+                                     ▼
+                         ┌───────────────────────┐
+                         │ 2. Fetch source text  │
+                         │    by sourceType      │
+                         └───────────────────────┘
+                                     │
+          ┌──────────────────────────┼──────────────────────────┐
+          ▼                          ▼                          ▼
+┌───────────────────┐      ┌───────────────────┐      ┌───────────────────┐
+│ document          │      │ self_reference    │      │ fixed/manual      │
+│                   │      │                   │      │                   │
+│ ContentLocator    │      │ Look up already   │      │ Read the value    │
+│ Service           │      │ extracted values  │      │ from the config   │
+└───────────────────┘      └───────────────────┘      └───────────────────┘
+          │                          │                          │
+          └──────────────────────────┼──────────────────────────┘
+                                     │
+                                     ▼
+                         ┌───────────────────────┐
+                         │ 3. Run the extractor  │
+                         │    by extractType     │
+                         └───────────────────────┘
+                                     │
+          ┌──────────────────────────┼──────────────────────────┐
+          ▼                          ▼                          ▼
+┌───────────────────┐      ┌───────────────────┐      ┌───────────────────┐
+│ direct            │      │ ai_extract        │      │ ai_summarize      │
+│                   │      │                   │      │                   │
+│ DirectExtract     │      │ AIExtract         │      │ AISummarize       │
+│ Executor          │      │ Executor          │      │ Executor          │
+└───────────────────┘      └───────────────────┘      └───────────────────┘
+          │                          │                          │
+          └──────────────────────────┼──────────────────────────┘
+                                     │
+                                     ▼
+                         ┌───────────────────────┐
+                         │ 4. Save the result    │
+                         │    (ExtractResult)    │
+                         └───────────────────────┘
+                                     │
+                                     ▼
+                         ┌───────────────────────┐
+                         │ 5. Update rule status │
+                         │    and extracted_value│
+                         └───────────────────────┘
+```
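
The two branching steps in the flow above (by `sourceType`, then by `extractType`) can be read as two lookup tables. The sketch below shows that reading under stated assumptions: `RuleDispatchSketch`, its nested `Rule` class and the stub handlers are hypothetical, only the `fixed`, `manual` and `direct` branches are wired, and the real executors (`DirectExtractExecutor` and so on) are not reproduced here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of the executeRule() dispatch; not the actual service. */
final class RuleDispatchSketch {

    /** Trimmed-down rule holding only what the dispatch needs (assumed shape). */
    static final class Rule {
        final String sourceType;
        final String extractType;
        final String configuredValue;

        Rule(String sourceType, String extractType, String configuredValue) {
            this.sourceType = sourceType;
            this.extractType = extractType;
            this.configuredValue = configuredValue;
        }
    }

    // Step 2: one content fetcher per sourceType.
    private final Map<String, Function<Rule, String>> fetchers = new HashMap<>();
    // Step 3: one executor per extractType.
    private final Map<String, Function<String, String>> executors = new HashMap<>();

    /** Wires only the branches that need no external service. */
    static RuleDispatchSketch withStubs() {
        RuleDispatchSketch sketch = new RuleDispatchSketch();
        sketch.fetchers.put("fixed", rule -> rule.configuredValue);
        sketch.fetchers.put("manual", rule -> rule.configuredValue);
        sketch.executors.put("direct", Function.identity());
        return sketch;
    }

    /** Steps 2 and 3 of the flow; saving the result (steps 4 and 5) is omitted. */
    String execute(Rule rule) {
        Function<Rule, String> fetcher = fetchers.get(rule.sourceType);
        if (fetcher == null) {
            throw new IllegalArgumentException("EXTRACT_003: unsupported sourceType " + rule.sourceType);
        }
        Function<String, String> executor = executors.get(rule.extractType);
        if (executor == null) {
            throw new IllegalArgumentException("EXTRACT_003: unsupported extractType " + rule.extractType);
        }
        return executor.apply(fetcher.apply(rule));
    }
}
```

A table-driven dispatch like this keeps the source and extraction dimensions independent, so a new `sourceType` would not require touching any executor.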
|
|
|
+
|
|
|
+### 8.3 AI Extraction Prompt Design
+
+#### 8.3.1 Field Extraction Prompt
+
+````text
+You are a professional document information extraction assistant. Extract the specified information from the document content below.
+
+## Extraction target
+{targetDescription}
+
+## Field type
+{fieldType}
+
+## Expected format
+{expectedFormat}
+
+## Examples
+{examples}
+
+## Document content
+```
+{content}
+```
+
+## Output requirements
+Output only the extracted value, with no explanation. If the value cannot be extracted, output "[无法提取]".
+If the content contains several candidate values, choose the most accurate one.
+
+Extraction result:
+````
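
The prompt asks the model to reply with either the bare value or the sentinel `[无法提取]`, so the caller needs a small post-processing step. The following is a minimal sketch of that step; `ExtractReplyParser` is a hypothetical helper, and treating a blank reply the same as the sentinel is an assumption.

```java
import java.util.Optional;

/** Hypothetical post-processing of the model reply; not part of AIExtractService. */
final class ExtractReplyParser {
    /** Sentinel the prompt asks the model to output when nothing can be extracted. */
    static final String UNABLE_TO_EXTRACT = "[无法提取]";

    /** Returns the extracted value, or empty when the reply is blank or the sentinel. */
    static Optional<String> parse(String reply) {
        String trimmed = reply == null ? "" : reply.trim();
        if (trimmed.isEmpty() || trimmed.equals(UNABLE_TO_EXTRACT)) {
            return Optional.empty();
        }
        return Optional.of(trimmed);
    }
}
```

An empty result could then be mapped to error code `EXTRACT_006` instead of being stored as an extracted value.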
|
|
|
+
|
|
|
+#### 8.3.2 Content Summarization Prompt
+
+````text
+You are a professional engineering report writing assistant. Summarize and distill the content below.
+
+## Summary requirements
+{summarizePrompt}
+
+## Focus dimensions
+{focusPoints}
+
+## Summary rules
+{rules}
+
+## Output style
+{style}
+
+## Length limit
+Within {maxLength} characters
+
+## Context information
+{contextInfo}
+
+## Source content
+```
+{content}
+```
+
+## Output requirements
+Output only the summary, written in formal engineering-report language.
+
+Summary:
+````
|
|
|
+
|
|
|
+---
+
+## 9. Integration with Existing Systems
+
+### 9.1 Dependencies on Existing Services
+
+| Service                             | Purpose                                     | Invocation        |
+| ----------------------------------- | ------------------------------------------- | ----------------- |
+| `DocumentService`                   | Fetch document information                  | Feign Client      |
+| `DocumentElementService`            | Fetch document elements                     | Feign Client      |
+| `DeepSeekClient`                    | AI extraction and summarization             | Direct call       |
+| `WordStructuredExtractionService`   | Word document parsing                       | Via parse-service |
+| `DataSourceService`                 | Optional: register results as data sources  | Feign Client      |
+
+### 9.2 Event Integration
+
+```java
+/**
+ * Listens for the document-parsed event and
+ * automatically updates the status of the linked source documents.
+ */
+@EventListener
+public void onDocumentParsed(DocumentParsedEvent event) {
+    // Find the source-document records linked to this document
+    List<SourceDocument> sourceDocs = sourceDocumentService
+            .findByDocumentId(event.getDocumentId());
+
+    // Update the parse status
+    for (SourceDocument sourceDoc : sourceDocs) {
+        sourceDoc.updateMetadata("parseStatus", "completed");
+        sourceDocumentService.update(sourceDoc);
+    }
+}
+```
|
|
|
+
|
|
|
+### 9.3 Relationship with DataSource
+
+An extraction result can optionally be registered as a `DataSource` so that other modules (such as template rendering) can use it:
+
+```java
+/**
+ * Registers an extraction result as a data source.
+ */
+public DataSource registerAsDataSource(ExtractResult result, String userId) {
+    // ExtractResult stores only ruleId, so the rule is loaded first.
+    // extractRuleService is an assumed collaborator, not defined in this document.
+    ExtractRule rule = extractRuleService.getById(result.getRuleId());
+
+    CreateDataSourceRequest request = new CreateDataSourceRequest();
+    request.setName(rule.getTargetFieldName());
+    request.setType("text");
+    request.setSourceType("extract_result");
+    request.setConfig(Map.of(
+            "extractResultId", result.getId(),
+            "projectId", result.getProjectId()
+    ));
+
+    return dataSourceService.create(userId, request);
+}
+```
|
|
|
+
|
|
|
+---
+
+## 10. Error Handling and Logging
+
+### 10.1 Error Codes
+
+| Error code      | Description                                  |
+| --------------- | -------------------------------------------- |
+| `EXTRACT_001`   | Project does not exist                       |
+| `EXTRACT_002`   | Source document does not exist               |
+| `EXTRACT_003`   | Invalid rule configuration                   |
+| `EXTRACT_004`   | Document parsing has not finished            |
+| `EXTRACT_005`   | Content location failed                      |
+| `EXTRACT_006`   | AI extraction failed                         |
+| `EXTRACT_007`   | Referenced field has not been extracted yet  |
+| `EXTRACT_008`   | Circular reference                           |
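
One way to keep these codes consistent across services is to hold them in a single enum. The sketch below is only an illustration: the enum name `ExtractErrorCode` and its constant names are hypothetical, and only the code strings and their meanings come from the table above.

```java
/** Hypothetical enum mirroring the error-code table; constant names are illustrative. */
enum ExtractErrorCode {
    PROJECT_NOT_FOUND("EXTRACT_001", "Project does not exist"),
    SOURCE_DOCUMENT_NOT_FOUND("EXTRACT_002", "Source document does not exist"),
    INVALID_RULE_CONFIG("EXTRACT_003", "Invalid rule configuration"),
    DOCUMENT_NOT_PARSED("EXTRACT_004", "Document parsing has not finished"),
    LOCATION_FAILED("EXTRACT_005", "Content location failed"),
    AI_EXTRACT_FAILED("EXTRACT_006", "AI extraction failed"),
    REFERENCE_NOT_EXTRACTED("EXTRACT_007", "Referenced field has not been extracted yet"),
    CIRCULAR_REFERENCE("EXTRACT_008", "Circular reference");

    private final String code;
    private final String message;

    ExtractErrorCode(String code, String message) {
        this.code = code;
        this.message = message;
    }

    String code() {
        return code;
    }

    String message() {
        return message;
    }

    /** Looks a constant up by its wire code, e.g. "EXTRACT_008". */
    static ExtractErrorCode fromCode(String code) {
        for (ExtractErrorCode candidate : values()) {
            if (candidate.code.equals(code)) {
                return candidate;
            }
        }
        throw new IllegalArgumentException("Unknown error code: " + code);
    }
}
```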
|
|
|
+
|
|
|
+### 10.2 Logging Conventions
+
+```java
+// Rule execution log
+log.info("Starting extraction rule: ruleId={}, projectId={}, targetField={}",
+        rule.getId(), rule.getProjectId(), rule.getTargetFieldKey());
+
+// AI call log
+log.info("AI extraction: ruleId={}, contentLength={}, extractType={}",
+        ruleId, content.length(), extractType);
+
+// Result log
+log.info("Extraction finished: ruleId={}, status={}, valueLength={}, confidence={}",
+        ruleId, status, value.length(), confidence);
+```
|
|
|
+
|
|
|
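+---
+
+## 11. Performance Optimization Recommendations
+
+### 11.1 Batch Execution
+
+1. **Parallel execution**: rules that do not depend on each other can run in parallel
+2. **Batched AI calls**: multiple extraction requests can be merged into a single batch request
+3. **Cached content location**: cache the result when the same document is queried with the same location criteria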
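
The third point, caching content location, could look roughly like the sketch below. It is written under assumptions: `LocatorCacheSketch` is hypothetical, the location criteria are represented by a canonical string `locationKey` (how a `LocationConfig` would be canonicalized is not specified in this document), and the cache is unbounded and never invalidated, which a real service would need to address.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiFunction;

/** Hypothetical in-memory cache in front of a content locator (illustration only). */
final class LocatorCacheSketch {
    // Key: (documentId, locationKey); a List key avoids delimiter collisions.
    private final Map<List<String>, String> cache = new ConcurrentHashMap<>();
    private final BiFunction<String, String, String> locator;

    /** @param locator the expensive lookup, e.g. a call into ContentLocatorService */
    LocatorCacheSketch(BiFunction<String, String, String> locator) {
        this.locator = locator;
    }

    /** Runs the locator at most once per (documentId, locationKey) pair. */
    String locate(String documentId, String locationKey) {
        return cache.computeIfAbsent(
                List.of(documentId, locationKey),
                key -> locator.apply(documentId, locationKey));
    }
}
```

Because many rules in one project tend to read the same page range or chapter, even a per-run cache like this can avoid repeated document queries.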
|
|
|
+
|
|
|
+### 11.2 Dependency Analysis
+
+```java
+/**
+ * Analyzes rule dependencies and builds the execution order.
+ */
+public List<List<String>> buildExecutionOrder(List<ExtractRule> rules) {
+    // 1. Build the dependency graph
+    Map<String, Set<String>> dependencyGraph = buildDependencyGraph(rules);
+
+    // 2. Sort topologically to identify groups of rules that can run in parallel
+    return topologicalSort(dependencyGraph);
+}
+```
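
The document does not show `topologicalSort` itself. One possible shape, sketched here with a level-by-level variant of Kahn's algorithm, groups rules so that each group depends only on earlier groups. The class name `ExecutionOrderSketch`, the graph convention (rule key mapped to the keys it depends on) and the alphabetical ordering inside a group are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical level-based topological sort (Kahn's algorithm); illustration only. */
final class ExecutionOrderSketch {

    /**
     * @param dependencyGraph rule key -> keys of the rules it depends on
     * @return groups of rule keys; a group depends only on rules in earlier groups
     */
    static List<List<String>> topologicalSort(Map<String, Set<String>> dependencyGraph) {
        // Unmet dependency count per rule; TreeMap keeps the output deterministic.
        Map<String, Integer> pending = new TreeMap<>();
        // Reverse edges: rule key -> rules that depend on it.
        Map<String, List<String>> dependents = new HashMap<>();
        for (Map.Entry<String, Set<String>> entry : dependencyGraph.entrySet()) {
            pending.put(entry.getKey(), entry.getValue().size());
            for (String dependency : entry.getValue()) {
                if (!dependencyGraph.containsKey(dependency)) {
                    throw new IllegalArgumentException(
                            entry.getKey() + " references unknown rule " + dependency);
                }
                dependents.computeIfAbsent(dependency, k -> new ArrayList<>()).add(entry.getKey());
            }
        }

        List<List<String>> groups = new ArrayList<>();
        while (!pending.isEmpty()) {
            // Every rule whose dependencies are all scheduled can run in this group.
            List<String> ready = new ArrayList<>();
            for (Map.Entry<String, Integer> entry : pending.entrySet()) {
                if (entry.getValue() == 0) {
                    ready.add(entry.getKey());
                }
            }
            if (ready.isEmpty()) {
                throw new IllegalStateException("EXTRACT_008: circular reference among " + pending.keySet());
            }
            for (String key : ready) {
                pending.remove(key);
                for (String dependent : dependents.getOrDefault(key, List.of())) {
                    pending.merge(dependent, -1, Integer::sum);
                }
            }
            groups.add(ready);
        }
        return groups;
    }
}
```

Rules inside one group have no dependencies on each other, so they are candidates for the parallel execution mentioned in 11.1, while a group with no ready rule signals the circular reference case (`EXTRACT_008`).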
|
|
|
+
|
|
|
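+---
+
+## 12. Future Extensions
+
+### 12.1 Planned Features
+
+1. **Rule recommendation**: automatically recommend common rules based on the document type
+2. **Smart location**: AI-assisted identification of chapters and content positions
+3. **Batch projects**: batch processing of multiple projects of the same type
+4. **Version comparison**: version management and comparison of rule configurations
+5. **Collaborative editing**: multiple users configuring rules together
+
+### 12.2 Integration Extensions
+
+1. **graph-service integration**: build a knowledge graph from the extraction results
+2. **Report generation integration**: feed extraction results directly into report generation
+3. **Approval workflow integration**: extraction results take effect only after approval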
|
|
|
+---
+
+## Appendix A: Configuration Examples
+
+### A.1 Complete Rule Configuration Example
+
+```json
+{
+  "projectId": "proj_001",
+  "targetFieldKey": "project_name",
+  "targetFieldName": "Project name",
+  "targetFieldGroup": "Basic information",
+  "ruleIndex": 1,
+  "sourceType": "document",
+  "sourceConfig": {
+    "sourceDocId": "sd_001",
+    "documentAlias": "Feasibility study approval",
+    "location": {
+      "type": "page",
+      "pageStart": 1,
+      "pageEnd": 1
+    }
+  },
+  "extractType": "ai_extract",
+  "extractConfig": {
+    "targetDescription": "Extract the full name of the engineering project from the approval document",
+    "fieldType": "text",
+    "expectedFormat": "XX City XX Project",
+    "examples": ["襄阳连云220千伏输变电工程"]
+  }
+}
+```
+
+### A.2 Complex Reference Rule Example
+
+```json
+{
+  "projectId": "proj_001",
+  "targetFieldKey": "report_summary",
+  "targetFieldName": "Report summary",
+  "ruleIndex": 50,
+  "sourceType": "self_reference",
+  "sourceConfig": {
+    "referenceFieldKeys": ["project_name", "construction_unit", "project_location", "investment_amount"],
+    "combineTemplate": "The feasibility study report for {project_name} was prepared by {construction_unit}. The project is located in {project_location}, with an estimated total investment of {investment_amount} (in units of CNY 10,000)."
+  },
+  "extractType": "direct",
+  "extractConfig": {}
+}
+```
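
The `combineTemplate` in A.2 implies a substitution step over previously extracted values. A minimal sketch of that step follows, under assumptions: `CombineTemplateSketch` is a hypothetical helper, placeholders are taken to be `{fieldKey}` with letters, digits and underscores, and a missing value is reported with error code `EXTRACT_007`.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical renderer for self_reference rules with a combineTemplate. */
final class CombineTemplateSketch {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{([A-Za-z0-9_]+)\\}");

    /** Replaces every {fieldKey} with its extracted value in a single pass. */
    static String render(String combineTemplate, Map<String, String> extractedValues) {
        Matcher matcher = PLACEHOLDER.matcher(combineTemplate);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String fieldKey = matcher.group(1);
            String value = extractedValues.get(fieldKey);
            if (value == null) {
                throw new IllegalStateException(
                        "EXTRACT_007: referenced field has not been extracted yet: " + fieldKey);
            }
            // quoteReplacement keeps '$' and '\' in values from being read as group references.
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

A single regex pass means a value that itself contains braces is never substituted a second time, which a chain of `String.replace` calls would not guarantee.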
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+> End of document
|