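Project overview
Welcome! kg-gen helps you extract knowledge graphs from any plain text using AI. It can process both small and large text inputs, and it can also handle messages in a conversation format.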
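Why generate knowledge graphs? kg-gen is useful if you want to:
- Create a graph to assist with RAG (Retrieval-Augmented Generation)
- Create synthetic graph data for model training and testing
- Structure any text into a graph
- Analyze the relationships between concepts in your source text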
We support both API-based and local model providers via LiteLLM, including OpenAI, Ollama, Anthropic, Gemini, Deepseek, and others. We also use DSPy for structured output generation.
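Try it out by running the scripts in tests/. Instructions for running our KG benchmark, MINE, are in MINE/.
Read the paper: KGGen: Extracting Knowledge Graphs from Plain Text with Language Models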
Quick start
Install the module:
pip install kg-gen
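Then import and use kg-gen. You can provide your text input in one of two formats:
- A single string
- A list of message objects (each with a role and content)
Here are some example snippets: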
from kg_gen import KGGen
# Initialize KGGen with optional configuration
kg = KGGen(
    model="openai/gpt-4o",  # Default model
    temperature=0.0,        # Default temperature
    api_key="YOUR_API_KEY"  # Optional if set in environment
)
# EXAMPLE 1: Single string with context
text_input = "Linda is Josh's mother. Ben is Josh's brother. Andrew is Josh's father."
graph_1 = kg.generate(
    input_data=text_input,
    context="Family relationships"
)
# Output:
# entities={'Linda', 'Ben', 'Andrew', 'Josh'}
# edges={'is brother of', 'is father of', 'is mother of'}
# relations={('Ben', 'is brother of', 'Josh'),
# ('Andrew', 'is father of', 'Josh'),
# ('Linda', 'is mother of', 'Josh')}
# EXAMPLE 2: Large text with chunking and clustering
with open('large_text.txt', 'r') as f:
    large_text = f.read()
# Example input text:
# """
# Neural networks are a type of machine learning model. Deep learning is a subset of machine learning
# that uses multiple layers of neural networks. Supervised learning requires training data to learn
# patterns. Machine learning is a type of AI technology that enables computers to learn from data.
# AI, also known as artificial intelligence, is related to the broader field of artificial intelligence.
# Neural nets (NN) are commonly used in ML applications. Machine learning (ML) has revolutionized
# many fields of study.
# ...
# """
graph_2 = kg.generate(
    input_data=large_text,
    chunk_size=5000,  # Process text in chunks of 5000 chars
    cluster=True      # Cluster similar entities and relations
)
# Output:
# entities={'neural networks', 'deep learning', 'machine learning', 'AI', 'artificial intelligence',
# 'supervised learning', 'unsupervised learning', 'training data', ...}
# edges={'is type of', 'requires', 'is subset of', 'uses', 'is related to', ...}
# relations={('neural networks', 'is type of', 'machine learning'),
# ('deep learning', 'is subset of', 'machine learning'),
# ('supervised learning', 'requires', 'training data'),
# ('machine learning', 'is type of', 'AI'),
# ('AI', 'is related to', 'artificial intelligence'), ...}
# entity_clusters={
# 'artificial intelligence': {'AI', 'artificial intelligence'},
# 'machine learning': {'machine learning', 'ML'},
# 'neural networks': {'neural networks', 'neural nets', 'NN'}
# ...
# }
# edge_clusters={
# 'is type of': {'is type of', 'is a type of', 'is a kind of'},
# 'is related to': {'is related to', 'is connected to', 'is associated with'},
# ...
# }
# EXAMPLE 3: Messages array
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
]
graph_3 = kg.generate(input_data=messages)
# Output:
# entities={'Paris', 'France'}
# edges={'has capital'}
# relations={('France', 'has capital', 'Paris')}
# EXAMPLE 4: Combining multiple graphs
text1 = "Linda is Joe's mother. Ben is Joe's brother."
# Input text 2 states that Joseph also goes by Joe
text2 = "Andrew is Joseph's father. Judy is Andrew's sister. Joseph also goes by Joe."
graph4_a = kg.generate(input_data=text1)
graph4_b = kg.generate(input_data=text2)
# Combine the graphs
combined_graph = kg.aggregate([graph4_a, graph4_b])
# Optionally cluster the combined graph
clustered_graph = kg.cluster(
    combined_graph,
    context="Family relationships"
)
# Output:
# entities={'Linda', 'Ben', 'Andrew', 'Joe', 'Joseph', 'Judy'}
# edges={'is mother of', 'is father of', 'is brother of', 'is sister of'}
# relations={('Linda', 'is mother of', 'Joe'),
# ('Ben', 'is brother of', 'Joe'),
# ('Andrew', 'is father of', 'Joe'),
# ('Judy', 'is sister of', 'Andrew')}
# entity_clusters={
# 'Joe': {'Joe', 'Joseph'},
# ...
# }
# edge_clusters={ ... }
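Once a graph has been generated, the relation triples are usually the easiest part to work with downstream. The following is a minimal sketch, assuming the relations are available as a set of (subject, predicate, object) tuples as in the example outputs above. The helper names `build_adjacency` and `facts_about` are hypothetical and are not part of kg-gen.

```python
from collections import defaultdict

# Relation triples copied from the Example 1 output above.
relations = {
    ("Ben", "is brother of", "Josh"),
    ("Andrew", "is father of", "Josh"),
    ("Linda", "is mother of", "Josh"),
}


def build_adjacency(triples):
    """Index triples by subject: subject -> sorted list of (predicate, object)."""
    adjacency = defaultdict(list)
    for subject, predicate, obj in triples:
        adjacency[subject].append((predicate, obj))
    return {subject: sorted(pairs) for subject, pairs in adjacency.items()}


def facts_about(entity, triples):
    """Return every triple that mentions the entity, rendered as short sentences."""
    return sorted(f"{s} {p} {o}" for s, p, o in triples if entity in (s, o))


adjacency = build_adjacency(relations)
print(adjacency["Linda"])  # [('is mother of', 'Josh')]
print(facts_about("Josh", relations))
```

A lookup like `facts_about` is one simple way to pull the facts around an entity into a prompt when the graph is used to assist RAG.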
Features
Chunking large texts
For long texts, you can specify a chunk_size parameter to process the text in chunks:
graph = kg.generate(
    input_data=large_text,
    chunk_size=5000  # Process in chunks of 5000 characters
)
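To make the idea of a character-based chunk size concrete, here is a minimal sketch of one way to split text into chunks of at most `chunk_size` characters while keeping sentences whole where possible. This is an illustration only: the function `chunk_text` is hypothetical, and kg-gen's own chunking strategy may differ.

```python
import re


def chunk_text(text, chunk_size):
    """Split text into chunks of at most chunk_size characters.

    Sentences are kept whole where possible; a single sentence longer than
    chunk_size is cut at the character limit.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        # Hard-split a sentence that cannot fit into one chunk on its own.
        while len(sentence) > chunk_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:chunk_size])
            sentence = sentence[chunk_size:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Neural networks are a type of machine learning model. "
    "Deep learning is a subset of machine learning. "
    "Supervised learning requires training data."
)
for chunk in chunk_text(sample, 60):
    print(len(chunk), chunk)
```

Splitting on sentence boundaries keeps each statement intact within a chunk, at the cost of chunks that are somewhat shorter than the limit.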
Clustering similar entities and relations
You can cluster similar entities and relations either during generation or afterwards:
# During generation
graph = kg.generate(
    input_data=text,
    cluster=True,
    context="Optional context to guide clustering"
)
# Or after generation
clustered_graph = kg.cluster(
    graph,
    context="Optional context to guide clustering"
)
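The cluster mappings shown in the example outputs (a representative name mapped to its set of aliases) can be used to rewrite relations in canonical form. The sketch below assumes that shape for `entity_clusters` and `edge_clusters`. It does not show how kg-gen decides which names belong together; the helpers `canonical_lookup` and `apply_clusters` are hypothetical.

```python
def canonical_lookup(clusters):
    """Invert {representative: {aliases}} into {alias: representative}."""
    lookup = {}
    for representative, members in clusters.items():
        for member in members:
            lookup[member] = representative
    return lookup


def apply_clusters(relations, entity_clusters, edge_clusters):
    """Rewrite each triple so that aliases are replaced by their representative."""
    entity_of = canonical_lookup(entity_clusters)
    edge_of = canonical_lookup(edge_clusters)
    return {
        (entity_of.get(s, s), edge_of.get(p, p), entity_of.get(o, o))
        for s, p, o in relations
    }


entity_clusters = {
    "machine learning": {"machine learning", "ML"},
    "neural networks": {"neural networks", "neural nets", "NN"},
}
edge_clusters = {"is type of": {"is type of", "is a type of", "is a kind of"}}
relations = {
    ("neural nets", "is a kind of", "ML"),
    ("NN", "is type of", "machine learning"),
    ("deep learning", "is subset of", "machine learning"),
}

merged = apply_clusters(relations, entity_clusters, edge_clusters)
print(sorted(merged))
```

Because the result is a set, triples that become identical after renaming collapse into one, which is the practical benefit of clustering: fewer, denser nodes and edges.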
Aggregating multiple graphs
You can combine multiple graphs using the aggregate method:
graph1 = kg.generate(input_data=text1)
graph2 = kg.generate(input_data=text2)
combined_graph = kg.aggregate([graph1, graph2])
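Conceptually, combining graphs can be pictured as a set union over entities, edges, and relations. The Example 4 output is consistent with that reading, but the source does not state how aggregate is implemented, so treat the following as a sketch under that assumption. `SimpleGraph` is a hypothetical stand-in rather than kg-gen's graph type, and the two small graphs are hand-written for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleGraph:
    """Hypothetical stand-in for a generated graph (not kg-gen's own class)."""
    entities: set = field(default_factory=set)
    edges: set = field(default_factory=set)
    relations: set = field(default_factory=set)


def aggregate(graphs):
    """Union the entities, edges, and relations of every graph in the list."""
    combined = SimpleGraph()
    for graph in graphs:
        combined.entities |= graph.entities
        combined.edges |= graph.edges
        combined.relations |= graph.relations
    return combined


graph_a = SimpleGraph(
    entities={"Linda", "Ben", "Joe"},
    edges={"is mother of", "is brother of"},
    relations={("Linda", "is mother of", "Joe"), ("Ben", "is brother of", "Joe")},
)
graph_b = SimpleGraph(
    entities={"Andrew", "Joseph", "Judy"},
    edges={"is father of", "is sister of"},
    relations={
        ("Andrew", "is father of", "Joseph"),
        ("Judy", "is sister of", "Andrew"),
    },
)

combined = aggregate([graph_a, graph_b])
print(sorted(combined.entities))
```

A plain union leaves 'Joe' and 'Joseph' as separate entities, which is why clustering the combined graph afterwards is a natural next step.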
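Message array processing
When processing message arrays, kg-gen:
- Preserves the role information of each message
- Maintains message order and boundaries
- Can extract entities and relations:
  - Between concepts mentioned in messages
  - Between speakers (roles) and concepts
  - Across multiple messages in a conversation
For example, given this conversation: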
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
]
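The generated graph might include entities such as:
- "France"
- "Paris"
and a relation such as ('France', 'has capital', 'Paris'), as in the Example 3 output above.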
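One plausible way to keep roles, order, and message boundaries visible to a model is to render the conversation as role-tagged lines before extraction. The sketch below assumes that approach; it is not taken from kg-gen's source, and `messages_to_text` is a hypothetical helper.

```python
def messages_to_text(messages):
    """Render a list of {"role", "content"} dicts as role-tagged lines, in order."""
    lines = []
    for index, message in enumerate(messages):
        missing = {"role", "content"} - message.keys()
        if missing:
            raise ValueError(f"message {index} is missing keys: {sorted(missing)}")
        # One line per message keeps the boundary between speakers explicit.
        lines.append(f"{message['role']}: {message['content']}")
    return "\n".join(lines)


conversation = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
print(messages_to_text(conversation))
```

Validating the keys up front gives a clear error for malformed messages instead of a failure deep inside generation.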
API reference
KGGen class
Constructor parameters
- model: str = "openai/gpt-4o" - Model to use for generation
- temperature: float = 0.0 - Sampling temperature for the model
- api_key: Optional[str] = None - API key for model access
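Rather than hard-coding a key, the constructor arguments can be assembled from the environment. The sketch below builds a keyword-argument dictionary using only the three parameters listed above. The helper `kggen_settings` is hypothetical, and `OPENAI_API_KEY` is assumed here as the variable name for the default OpenAI model; other providers use different variables.

```python
import os


def kggen_settings(model="openai/gpt-4o", temperature=0.0, env_var="OPENAI_API_KEY"):
    """Build keyword arguments for KGGen, omitting api_key when none is configured."""
    settings = {"model": model, "temperature": temperature}
    api_key = os.environ.get(env_var)
    if api_key:
        settings["api_key"] = api_key
    return settings


settings = kggen_settings()
print(sorted(settings))
# With kg-gen installed, the settings would be passed as: KGGen(**settings)
```

Leaving `api_key` out when the variable is unset matches the note in the quick start that the key is optional if it is set in the environment.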
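generate() method parameters
- input_data: str or list of message dicts - Text input to process
- model: Optional[str] - Override the default model
- api_key: Optional[str] - Override the default API key
- context: str = "" - Description of the data context
- chunk_size: Optional[int] - Size of the text chunks to process
- cluster: bool = False - Whether to cluster the graph after generation
- temperature: Optional[float] - Override the default temperature
- output_folder: Optional path for saving partial progress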
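The optional model, api_key, and temperature arguments are described as overrides of the constructor defaults. A minimal sketch of that fallback pattern is shown below, assuming that an argument left as None means "use the default". The function `resolve_call_settings` is hypothetical and only illustrates the pattern, not kg-gen's internals.

```python
def resolve_call_settings(defaults, model=None, temperature=None, api_key=None):
    """Return per-call settings, falling back to defaults for arguments left as None."""
    overrides = {"model": model, "temperature": temperature, "api_key": api_key}
    resolved = dict(defaults)
    for name, value in overrides.items():
        # Compare against None so that a legitimate 0.0 temperature still applies.
        if value is not None:
            resolved[name] = value
    return resolved


defaults = {"model": "openai/gpt-4o", "temperature": 0.7, "api_key": None}
print(resolve_call_settings(defaults, temperature=0.0))
```

Testing `is not None` instead of truthiness matters here: a temperature of 0.0 is falsy in Python but is a perfectly valid override.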
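cluster() method parameters
- graph - The graph to cluster
- context: str = "" - Description of the data context
- model: Optional[str] - Override the default model
- temperature: Optional[float] - Override the default temperature
- api_key: Optional[str] - Override the default API key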
aggregate() method parameters
- graphs: list of graphs - List of graphs to combine
Project link
http://github.com/stair-lab/kg-gen
(Source: GitHubStore)