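1. Data Extraction and Preprocessing
Goal: scrape information about Turkish shooting athletes from the official Olympics website
Toolchain:
- Firecrawl: scraping of dynamic web pages
- Unstructured.io: PDF/HTML parsing
- Mistral-7B: information extraction model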
# Data scraping and cleaning
from camel.tools import FirecrawlScraper, TextCleaner

scraper = FirecrawlScraper(api_key="fc_123")  # placeholder API key
raw_data = scraper.scrape(url="https://olympics.com/tr/shooting")

cleaner = TextCleaner()
structured_text = cleaner.clean(
    raw_data,
    chunk_strategy="section",  # split by section
    keep_headers=True,         # preserve the heading structure
)
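To make the `chunk_strategy="section"` and `keep_headers=True` options more concrete, here is a minimal sketch of section-based chunking in plain Python. It assumes the scraped page arrives as Markdown-like text with `#` headings; the function name `chunk_by_section` and the chunk layout are illustrative and are not part of any library mentioned above.

```python
import re
from typing import Dict, List, Tuple

# Matches Markdown-style headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_section(text: str, keep_headers: bool = True) -> List[Dict[str, str]]:
    """Split text into one chunk per heading-delimited section.

    Hypothetical helper: an assumption about what section chunking does,
    not the behaviour of a specific cleaner implementation.
    """
    sections: List[Tuple[str, List[str]]] = []
    current_header, current_lines = "", []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the section collected so far.
            sections.append((current_header, current_lines))
            current_header, current_lines = match.group(2).strip(), []
        else:
            current_lines.append(line)
    sections.append((current_header, current_lines))

    chunks: List[Dict[str, str]] = []
    for header, lines in sections:
        body = "\n".join(lines).strip()
        if not body:
            continue  # skip empty sections (e.g. nothing before the first heading)
        chunk_text = f"{header}\n{body}" if keep_headers and header else body
        chunks.append({"header": header, "text": chunk_text})
    return chunks


sample = "# Athletes\nYusuf Dikeç competes for Turkey.\n\n# Events\n10m Air Pistol"
chunks = chunk_by_section(sample)
```

Keeping the heading inside each chunk gives the downstream extraction model some context about where a sentence came from, which is presumably the point of `keep_headers=True`.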
Example extraction result:
{
  "athlete": "Yusuf Dikeç",
  "nationality": "Turkey",
  "event": "10m Air Pistol",
  "medal": "Silver",
  "game": {"year": 2024, "location": "Paris"}
}
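Because model output is not guaranteed to be well formed, it can help to validate each extracted record before it moves on to later stages. The following is a minimal sketch using only the standard library; the `MedalRecord` class, the `parse_record` function, and the set of accepted medal values are assumptions made for illustration.

```python
import json
from dataclasses import dataclass

# Assumed set of valid medal labels; adjust if the extractor emits other values.
VALID_MEDALS = {"Gold", "Silver", "Bronze"}


@dataclass(frozen=True)
class MedalRecord:
    """Flattened, typed view of one extraction result (hypothetical schema)."""
    athlete: str
    nationality: str
    event: str
    medal: str
    year: int
    location: str


def parse_record(raw: str) -> MedalRecord:
    """Parse the extractor's JSON output and reject malformed records."""
    data = json.loads(raw)
    medal = data["medal"]
    if medal not in VALID_MEDALS:
        raise ValueError(f"unexpected medal value: {medal!r}")
    game = data["game"]
    return MedalRecord(
        athlete=data["athlete"].strip(),
        nationality=data["nationality"].strip(),
        event=data["event"].strip(),
        medal=medal,
        year=int(game["year"]),
        location=game["location"].strip(),
    )


example = """
{
  "athlete": "Yusuf Dikeç",
  "nationality": "Turkey",
  "event": "10m Air Pistol",
  "medal": "Silver",
  "game": {"year": 2024, "location": "Paris"}
}
"""
record = parse_record(example)
```

A missing key raises `KeyError` and an unknown medal raises `ValueError`, so bad records fail early instead of ending up in the vector store or the graph.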
2. Vectorization
Tech stack:
- Mistral Embed: generates 1024-dimensional vectors
- Qdrant: vector database storage
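from camel.embeddings import MistralEmbed
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

embedder = MistralEmbed(model="large-v2")
qdrant = QdrantClient(host="localhost", port=6333)

# Generate one vector per text chunk
vectors = [embedder.encode(text) for text in structured_text]

# Create the collection on first use, sized to the embedding dimension
# (cosine distance is an assumption; pick the metric that suits the model)
if not qdrant.collection_exists("olympic_docs"):
    qdrant.create_collection(
        collection_name="olympic_docs",
        vectors_config=VectorParams(size=len(vectors[0]), distance=Distance.COSINE),
    )

# Store each vector with its source text as payload
qdrant.upsert(
    collection_name="olympic_docs",
    points=[
        PointStruct(id=idx, vector=vec, payload={"text": text})
        for idx, (vec, text) in enumerate(zip(vectors, structured_text))
    ],
)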
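Once the vectors are stored, retrieval comes down to ranking points by similarity to a query vector. The sketch below shows that idea with a brute-force cosine ranking in pure Python over toy 3-dimensional vectors. It is only an illustration of the principle: a vector database such as Qdrant uses an approximate nearest-neighbour index rather than a full scan, and the names `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], points: List[Dict], k: int = 3) -> List[Tuple[int, float]]:
    """Return the (id, score) pairs of the k points most similar to the query."""
    scored = [(p["id"], cosine_similarity(query, p["vector"])) for p in points]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy data in the same id/vector/payload shape as the points stored above.
points = [
    {"id": 0, "vector": [0.9, 0.1, 0.0], "payload": {"text": "athlete profile"}},
    {"id": 1, "vector": [0.0, 1.0, 0.0], "payload": {"text": "venue description"}},
    {"id": 2, "vector": [0.7, 0.7, 0.0], "payload": {"text": "event results"}},
]
results = top_k([1.0, 0.0, 0.0], points, k=2)
```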
3. Knowledge Graph Construction
Modeling nodes and relationships in Neo4j:
// Node definitions
CREATE (:Athlete {
  id: "ATH_TR_001",
  name: "Yusuf Dikeç",
  nationality: "Turkey"
});
CREATE (:Event {
  id: "EVT_10MAP",
  discipline: "10m Air Pistol"
});
CREATE (:Game {
  id: "OG_2024",
  year: 2024,
  location: "Paris"
});
// Relationships (each statement is terminated so that MATCH does not directly follow CREATE)
MATCH (a:Athlete {id: "ATH_TR_001"}), (e:Event {id: "EVT_10MAP"})
CREATE (a)-[:WON_MEDAL {
  type: "Silver",
  score: 243.7
}]->(e);
MATCH (e:Event {id: "EVT_10MAP"}), (g:Game {id: "OG_2024"})
CREATE (e)-[:BELONGS_TO]->(g);
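In a pipeline, these statements would usually be generated from extraction results rather than written by hand. The sketch below shows one possible way to turn an extracted record into a parameterized Cypher statement. It swaps `CREATE` for `MERGE` so that re-running the load does not duplicate nodes, and it keys nodes on name, discipline, and year instead of the hand-assigned ids above; both choices are assumptions, as are the names `UPSERT_MEDAL`, `to_cypher_params`, and `missing_params`.

```python
import re
from typing import Any, Dict, Set

# Parameterized statement; $placeholders are filled in by the database driver.
UPSERT_MEDAL = """
MERGE (a:Athlete {name: $athlete})
  SET a.nationality = $nationality
MERGE (e:Event {discipline: $event})
MERGE (g:Game {year: $year, location: $location})
MERGE (e)-[:BELONGS_TO]->(g)
MERGE (a)-[m:WON_MEDAL]->(e)
  SET m.type = $medal
"""


def to_cypher_params(record: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one extraction result into the parameters used by UPSERT_MEDAL."""
    game = record["game"]
    return {
        "athlete": record["athlete"],
        "nationality": record["nationality"],
        "event": record["event"],
        "medal": record["medal"],
        "year": int(game["year"]),
        "location": game["location"],
    }


def missing_params(query: str, params: Dict[str, Any]) -> Set[str]:
    """Return placeholders that appear in the query but have no value."""
    return set(re.findall(r"\$(\w+)", query)) - set(params)


record = {
    "athlete": "Yusuf Dikeç",
    "nationality": "Turkey",
    "event": "10m Air Pistol",
    "medal": "Silver",
    "game": {"year": 2024, "location": "Paris"},
}
params = to_cypher_params(record)
```

With the official Neo4j Python driver, the statement and parameters could then be passed to something like `session.run(UPSERT_MEDAL, params)`. Passing values as parameters rather than formatting them into the query string avoids quoting problems with names such as "Dikeç".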
Index optimization:
CREATE INDEX FOR (a:Athlete) ON (a.nationality);
CREATE INDEX FOR (g:Game) ON (g.year);
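These two indexes pay off for lookups that filter on `nationality` and `year`. As a rough illustration, the sketch below builds such a lookup as a parameterized query over the graph model above. The function name `medalist_query` and the returned column aliases are hypothetical.

```python
from typing import Any, Dict, Tuple

# Filters on the two indexed properties: Athlete.nationality and Game.year.
MEDALISTS_BY_COUNTRY_AND_YEAR = """
MATCH (a:Athlete {nationality: $nationality})-[m:WON_MEDAL]->(e:Event)
      -[:BELONGS_TO]->(g:Game {year: $year})
RETURN a.name AS athlete, m.type AS medal, e.discipline AS event
"""


def medalist_query(nationality: str, year: int) -> Tuple[str, Dict[str, Any]]:
    """Return the query text and its parameters for one country and one year."""
    if not nationality:
        raise ValueError("nationality must not be empty")
    return MEDALISTS_BY_COUNTRY_AND_YEAR, {"nationality": nationality, "year": int(year)}


query, params = medalist_query("Turkey", 2024)
```

For the sample data above, this query would be expected to return the row for Yusuf Dikeç with the Silver medal in 10m Air Pistol.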
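4. Data Persistence
| Storage type | Technology | Example data |
|---|---|---|
| Raw text | MongoDB (sharded cluster) | Original HTML/PDF documents |
| Vector data | Qdrant (distributed deployment) | 1024-dimensional vectors plus text metadata |
| Graph data | Neo4j (causal cluster) | Nodes and relationship network |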