FastGPT Extension: Building an Olympic Athlete Knowledge Graph with Hybrid Retrieval, Part 1: Data Construction Pipeline

Time: 2025-03-07 07:11:29

1. Data Extraction and Preprocessing

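Goal: scrape information on Turkish shooting athletes from the official Olympics website.
Toolchain

  • Firecrawl: dynamic web page scraping
  • Unstructured.io: PDF/HTML parsing
  • Mistral-7B: information extraction model

# Scrape and clean the data
from camel.tools import FirecrawlScraper, TextCleaner

scraper = FirecrawlScraper(api_key="fc_123")  # placeholder API key
raw_data = scraper.scrape(url="olympics.com/tr/shooting")

cleaner = TextCleaner()
structured_text = cleaner.clean(
    raw_data,
    chunk_strategy="section",  # split into chunks by section
    keep_headers=True          # preserve the heading structure
)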
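
To make the "section" chunking strategy concrete, here is a minimal standalone sketch of what such a step might do. It assumes the scraped page arrives as Markdown-like text with `#` headings; `chunk_by_section` is a hypothetical helper written for illustration, not part of any library used above.

```python
import re
from typing import Dict, List

# A heading is one to six '#' characters followed by a space and a title.
HEADING = re.compile(r"^#{1,6}\s+\S")


def chunk_by_section(text: str, keep_headers: bool = True) -> List[Dict[str, str]]:
    """Split Markdown-like text into one chunk per heading-delimited section.

    Sections whose body is empty are skipped. Text that appears before the
    first heading becomes a chunk with an empty header.
    """
    chunks: List[Dict[str, str]] = []
    header = ""
    body: List[str] = []

    def flush() -> None:
        # Emit the section collected so far, if it has any content.
        content = "\n".join(body).strip()
        if not content:
            return
        if keep_headers and header:
            content = f"{header}\n{content}"
        chunks.append({"header": header, "text": content})

    for line in text.splitlines():
        if HEADING.match(line):
            flush()
            header = line.strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks


sample = "# Shooting\nIntro text.\n## Yusuf Dikeç\nSilver medal.\n"
for chunk in chunk_by_section(sample):
    print(repr(chunk["text"]))
```

Keeping the heading inside each chunk means the embedding of a short body such as "Silver medal." still carries the athlete's name, which tends to help retrieval later.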

Example information extraction result

{
  "athlete": "Yusuf Dikeç",
  "nationality": "Turkey",
  "event": "10m Air Pistol",
  "medal": "Silver",
  "game": {"year":2024, "location":"Paris"}
}
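
Because the record above comes from a language model, it may be worth validating before it reaches the vector store or the graph. The following is a minimal sketch, assuming the model returns a JSON string with exactly the fields shown; `MedalRecord`, `parse_record`, and the set of allowed medal values are hypothetical names introduced here.

```python
import json
from dataclasses import dataclass
from typing import Optional

REQUIRED_FIELDS = {"athlete", "nationality", "event", "game"}
VALID_MEDALS = {"Gold", "Silver", "Bronze"}  # assumed vocabulary


@dataclass(frozen=True)
class MedalRecord:
    athlete: str
    nationality: str
    event: str
    medal: Optional[str]
    year: int
    location: str


def parse_record(raw: str) -> MedalRecord:
    """Parse and validate one extracted record, raising ValueError on bad input."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_FIELDS.difference(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    medal = data.get("medal")
    if medal is not None and medal not in VALID_MEDALS:
        raise ValueError(f"unexpected medal value: {medal!r}")
    game = data["game"]
    return MedalRecord(
        athlete=data["athlete"].strip(),
        nationality=data["nationality"].strip(),
        event=data["event"].strip(),
        medal=medal,
        year=int(game["year"]),
        location=game["location"].strip(),
    )


raw = """{
  "athlete": "Yusuf Dikeç",
  "nationality": "Turkey",
  "event": "10m Air Pistol",
  "medal": "Silver",
  "game": {"year": 2024, "location": "Paris"}
}"""
print(parse_record(raw))
```

Flattening the nested `game` object into `year` and `location` keeps the record easy to pass as query parameters in the later steps.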

2. Vectorization

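Tech stack

  • Mistral Embed: generates 1024-dimensional vectors
  • Qdrant: vector database storage

from camel.embeddings import MistralEmbed
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

embedder = MistralEmbed(model="large-v2")
qdrant = QdrantClient(host="localhost", port=6333)

# Generate one vector per text chunk
vectors = [embedder.encode(text) for text in structured_text]

# Create the collection on first run; the vector size is read from the data.
# Cosine distance is an assumption; pick the metric that matches your model.
if not qdrant.collection_exists("olympic_docs"):
    qdrant.create_collection(
        collection_name="olympic_docs",
        vectors_config=VectorParams(size=len(vectors[0]), distance=Distance.COSINE),
    )

# Store each vector with its source text as payload
qdrant.upsert(
    collection_name="olympic_docs",
    points=[
        PointStruct(id=idx, vector=vec, payload={"text": text})
        for idx, (vec, text) in enumerate(zip(vectors, structured_text))
    ],
)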
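
One caveat with the storage step above: IDs taken from `enumerate` depend on chunk order, so re-running the pipeline after the source page changes can overwrite unrelated points. Qdrant accepts UUIDs as point IDs as well as unsigned integers, so one possible alternative is to derive the ID from the chunk text. This is a sketch only; `chunk_id` and the namespace value are hypothetical.

```python
import uuid

# Hypothetical namespace so IDs from this pipeline do not collide with others.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "olympic_docs")


def chunk_id(text: str) -> str:
    """Return a deterministic UUID string for a text chunk.

    Whitespace is normalized first, so cosmetic reformatting of the source
    page does not produce a new ID.
    """
    normalized = " ".join(text.split())
    return str(uuid.uuid5(NAMESPACE, normalized))


first = chunk_id("Yusuf Dikeç won silver in Paris.")
second = chunk_id("Yusuf   Dikeç won silver\nin Paris.")
print(first == second)
```

With IDs like these, upserting the same chunk twice updates one point instead of creating a duplicate.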

3. Knowledge Graph Construction

Neo4j node and relationship modeling

// Node definitions
CREATE (:Athlete {
  id: "ATH_TR_001",
  name: "Yusuf Dikeç",
  nationality: "Turkey"
});

CREATE (:Event {
  id: "EVT_10MAP",
  discipline: "10m Air Pistol"
});

CREATE (:Game {
  id: "OG_2024",
  year: 2024,
  location: "Paris"
});

// Relationships (each statement ends with a semicolon so the script runs one statement at a time)
MATCH (a:Athlete {id:"ATH_TR_001"}), (e:Event {id:"EVT_10MAP"})
CREATE (a)-[:WON_MEDAL {
  type: "Silver",
  score: 243.7
}]->(e);

MATCH (e:Event {id:"EVT_10MAP"}), (g:Game {id:"OG_2024"})
CREATE (e)-[:BELONGS_TO]->(g);
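
The statements above hard-code a single athlete. To load many extracted records, one option is a parameterized query built from the extraction JSON. The sketch below is illustrative only: it uses `MERGE` instead of `CREATE` so that re-imports do not duplicate nodes, and it matches nodes on `name`, `discipline`, and `year`/`location` because the extracted record carries no `id` field. `medal_query` is a hypothetical helper.

```python
from typing import Any, Dict, Tuple

# Same node labels and relationship types as the model above.
MERGE_MEDAL = """
MERGE (a:Athlete {name: $athlete})
  ON CREATE SET a.nationality = $nationality
MERGE (e:Event {discipline: $event})
MERGE (g:Game {year: $year, location: $location})
MERGE (a)-[r:WON_MEDAL]->(e)
  SET r.type = $medal
MERGE (e)-[:BELONGS_TO]->(g)
""".strip()


def medal_query(record: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Return a Cypher query and its parameters for one extracted record."""
    game = record["game"]
    params = {
        "athlete": record["athlete"],
        "nationality": record["nationality"],
        "event": record["event"],
        "medal": record["medal"],
        "year": int(game["year"]),
        "location": game["location"],
    }
    return MERGE_MEDAL, params


record = {
    "athlete": "Yusuf Dikeç",
    "nationality": "Turkey",
    "event": "10m Air Pistol",
    "medal": "Silver",
    "game": {"year": 2024, "location": "Paris"},
}
query, params = medal_query(record)
print(query)
print(params)
```

The returned pair could then be passed to a driver session, for example `session.run(query, params)` with the official Neo4j Python driver. Passing values as parameters rather than formatting them into the query string avoids quoting problems with names such as "Dikeç".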

Index optimization

CREATE INDEX FOR (a:Athlete) ON (a.nationality);
CREATE INDEX FOR (g:Game) ON (g.year);

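4. Data Persistence

| Storage type | Technology | Example data |
| --- | --- | --- |
| Raw text | MongoDB (sharded cluster) | Original HTML/PDF documents |
| Vector data | Qdrant (distributed deployment) | 1024-dimensional vectors + text metadata |
| Graph data | Neo4j (causal cluster) | Node and relationship network |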