Simple CSC: A Chinese Spelling Correction Tool Based on Large Language Models

Project Overview

Turn a Chinese large language model into a Chinese spelling correction model in one step!

This repository provides the implementation of the paper "A Simple yet Effective Training-free Prompt-free Approach to Chinese Spelling Correction Based on Large Language Models".

Requirements

  • torch>=2.0.1

  • transformers>=4.27.0

  • xformers==0.0.21

  • accelerate

  • bitsandbytes

  • sentencepiece

  • pypinyin

  • pypinyin-dict

  • opencc-python-reimplemented

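  • modelscope (optional, for downloading models from ModelScope)

  • streamlit (optional, for the demo application)

  • uvicorn (optional, for the RESTful API server)

  • fastapi (optional, for the RESTful API server)

  • loguru (optional, for the RESTful API server)

  • sse_starlette (optional, for the RESTful API server)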
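
Before running anything, it can help to see which of the packages above are already present. The following is a minimal sketch that uses only the standard library; the helper name installed_versions is hypothetical and not part of this repository, and it only reports versions rather than enforcing the version bounds listed above.

```python
from importlib import metadata

# Distribution names copied from the requirements list above.
REQUIRED = [
    "torch", "transformers", "xformers", "accelerate", "bitsandbytes",
    "sentencepiece", "pypinyin", "pypinyin-dict", "opencc-python-reimplemented",
]
OPTIONAL = ["modelscope", "streamlit", "uvicorn", "fastapi", "loguru", "sse_starlette"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for group, names in (("required", REQUIRED), ("optional", OPTIONAL)):
        for name, version in installed_versions(names).items():
            print(f"[{group}] {name}: {version or 'NOT INSTALLED'}")
```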

Installation

You can set up the environment by running the following command:

bash scripts/set_enviroment.sh

This automatically creates a virtual environment and installs the required packages.

For better performance, you can install flash-attn:

pip install flash-attn --no-build-isolation

Usage

Model Preparation

If the model is not found in the local cache, the code automatically downloads it from the Hugging Face model hub.

Python API

We provide a simple Python API for correction:

from lmcsc import LMCorrector

corrector = LMCorrector(
    model="Qwen/Qwen2.5-0.5B",
    config_path="configs/default_config.yaml",
)

outputs = corrector("完善农产品上行发展机智。")
print(outputs)
# [('完善农产品上行发展机制。',)]
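
To see which characters a correction actually changed, the returned sentence can be compared with the input. The sketch below uses difflib from the standard library; the helper changed_spans is hypothetical and not part of lmcsc. The outputs value is hard-coded to the shape printed above, read here (as an assumption) as one tuple of hypotheses per input sentence, so the snippet runs without loading a model.

```python
from difflib import SequenceMatcher


def changed_spans(source, corrected):
    """Return (start_index_in_source, original_text, corrected_text) per edit."""
    matcher = SequenceMatcher(a=source, b=corrected, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            spans.append((i1, source[i1:i2], corrected[j1:j2]))
    return spans


# Hard-coded stand-in for `corrector(source)`, copied from the example output.
source = "完善农产品上行发展机智。"
outputs = [("完善农产品上行发展机制。",)]
best = outputs[0][0]

print(changed_spans(source, best))
# [(10, '智', '制')]
```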

Streaming mode is also supported:

outputs = corrector("完善农产品上行发展机智。", stream=True)
for output in outputs:
    print(output[0][0], end="\r", flush=True)
print()
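
When only the final result matters but progress should still be shown, the stream can be drained by a small helper. This is a sketch under assumptions: fake_stream is a stand-in generator that imitates the output[0][0] shape used above, and consume is a hypothetical helper, not an lmcsc function. With a loaded model, corrector(text, stream=True) would take the place of fake_stream().

```python
def fake_stream():
    """Stand-in for corrector(text, stream=True); yields growing hypotheses."""
    for partial in ("完善", "完善农产品上行", "完善农产品上行发展机制。"):
        yield [(partial,)]


def consume(stream, on_update=None):
    """Drain a stream, report each partial result, and return the last one."""
    latest = None
    for output in stream:
        latest = output[0][0]  # current best hypothesis for the first input
        if on_update is not None:
            on_update(latest)
    return latest


final = consume(
    fake_stream(),
    on_update=lambda text: print(text, end="\r", flush=True),
)
print()
print(final)
```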

RESTful API Server and Client Calls

We also provide a RESTful API server for the corrector.

python api_server.py \
    --model "Qwen/Qwen2.5-0.5B" \
    --host 127.0.0.1 \
    --port 8000 \
    --workers 1

You can use curl to test the RESTful API server.

# Default mode
curl -X POST 'http://127.0.0.1:8000/correction' -H 'Content-Type: application/json' -d '{"input": "完善农产品上行发展机智。"}'
# > {"id":"","object":"correction","choices":[{"index":0,"message":{"content":"完善农产品上行发展机制。"}}],"created":1727058762}

# Streaming mode
curl -X POST 'http://127.0.0.1:8000/correction' -H 'Content-Type: application/json' -d '{"input": "完善农产品上行发展机智。", "stream": "True"}'
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"完善"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"完善农产品"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"完善农产品上"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"完善农产品上行"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"完善农产品上行发展"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"完善农产品上行发展机制"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"完善农产品上行发展模式。"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"完善农产品上行发展机制。"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"完善农产品上行发展机制。"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"完善农产品上行发展机制。"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"完善农产品上行发展机制。"},"index":0}],"created":1727058762}
# > data: [DONE]

# Correction with context
curl -X POST 'http://127.0.0.1:8000/correction' -H 'Content-Type: application/json' -d '{"input": "未挨前兆", "contexts": "患者提问:", "stream": "True"}'
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"未"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"未挨"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"胃癌前"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"胃癌前兆"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"胃癌前兆"},"index":0}],"created":1727058762}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"胃癌前兆"},"index":0}],"created":1727058763}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"胃癌前兆"},"index":0}],"created":1727058763}
# > data: {"id":"","object":"correction.chunk","choices":[{"delta":{"content":"胃癌前兆"},"index":0}],"created":1727058763}
# > data: [DONE]
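
The same calls can be made from Python. The following is a minimal client sketch using only the standard library. The function names (build_request, parse_sse_line, correct, correct_stream) are hypothetical; the URL, JSON fields, and response layout are taken from the curl examples above, and anything the server accepts beyond those fields is not assumed.

```python
import json
import urllib.request

# Host and port from the server command above.
API_URL = "http://127.0.0.1:8000/correction"


def build_request(text, contexts=None, stream=False):
    """Build a POST request using the JSON fields shown in the curl examples."""
    payload = {"input": text}
    if contexts is not None:
        payload["contexts"] = contexts
    if stream:
        payload["stream"] = "True"  # the curl examples send the string "True"
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_sse_line(line):
    """Return the hypothesis carried by one 'data: ...' line, or None."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    body = line[len("data:"):].strip()
    if body == "[DONE]":
        return None
    return json.loads(body)["choices"][0]["delta"]["content"]


def correct(text, contexts=None):
    """Non-streaming call; returns the corrected sentence."""
    with urllib.request.urlopen(build_request(text, contexts)) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]


def correct_stream(text, contexts=None):
    """Streaming call; yields each hypothesis as the server sends it."""
    request = build_request(text, contexts, stream=True)
    with urllib.request.urlopen(request) as response:
        for raw in response:
            content = parse_sse_line(raw.decode("utf-8"))
            if content is not None:
                yield content
```

In the sample output above, each chunk carries the whole hypothesis so far, and later chunks can revise earlier text (未挨 becomes 胃癌). A client built on correct_stream should therefore overwrite the displayed text with each chunk rather than append to it.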

Demo Application

We provide a demo application for our method. To run the demo:

  1. Make sure the streamlit package is installed.

  2. Run the following command:

streamlit run demo.py

By default, the demo uses Qwen/Qwen2.5-0.5B, which can run on a V100 GPU with 32 GB of memory. You can switch to another model in the demo's sidebar, or by changing default_model in configs/demo_app_config.yaml.

The sidebar also lets you adjust the n_beam, alpha, and use_faithfulness_reward parameters.

The sidebar provides several examples, including one long sentence of 1,866 characters.

Project Link

http://github.com/Jacob-Zhou/simple-csc


(Source: GitHubStore)
