Kreuzberg: a one-stop text extraction tool for PDFs, images, office documents, and other file formats

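Project overview

Kreuzberg is a modern Python library for extracting text from documents, designed to be simple and efficient. It provides a unified async interface for extracting text from many file formats, including PDFs, images, and office documents.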

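Highlights

Simple and hassle-free: a clean API that works without complex configuration
Local processing: no external API calls or cloud dependencies
Resource efficient: lightweight processing with no GPU requirement
Format support: broad coverage of document, image, and text formats
Modern Python: built with async/await, type hints, and current best practices

Kreuzberg was created to meet the text extraction needs of RAG (retrieval-augmented generation) applications, but it suits any text extraction use case. Unlike many commercial solutions that require API calls or complex setup, Kreuzberg focuses on local processing with minimal dependencies.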

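Features

Universal text extraction: extracts text from PDFs (both searchable and scanned), images, office documents, and more
Smart processing: automatic OCR for scanned documents, encoding detection for text files
Modern Python design:

Async-first API built on anyio
Comprehensive type hints for better IDE support
Detailed error handling with context information

Production ready:

Robust error handling
Detailed debugging information
Memory-efficient processing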

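Installation

1. Install the Python package

pip install kreuzberg

2. Install the system dependencies

Kreuzberg requires two system-level dependencies:

Pandoc: used for document format conversion
Tesseract OCR: used for optical character recognition on images and PDFs

Install each one by following its own installation guide.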
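Before running the first extraction, it can help to confirm that both tools are reachable. The following is a minimal sketch, not part of Kreuzberg itself; the helper name find_missing_tools is hypothetical, and the executable names "pandoc" and "tesseract" are assumed to be the command names on your platform.

```python
import shutil

# Assumption: these are the usual command names for the two tools.
# A given platform or package manager may install them under other names.
REQUIRED_TOOLS = ("pandoc", "tesseract")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing system dependencies:", ", ".join(missing))
    else:
        print("Pandoc and Tesseract were both found on PATH.")
```

The check only confirms that an executable with that name exists on PATH; it does not verify the installed version.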

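Architecture

Kreuzberg is designed as a high-level async abstraction on top of existing open-source tools. It integrates:

PDF processing:

pdfium2 for searchable PDFs
Tesseract OCR for scanned content

Document conversion:

Pandoc for many document and markup formats
python-pptx for PowerPoint files
html-to-markdown for HTML content
Excel spreadsheet conversion

Text processing:

Smart encoding detection
Markdown and plain text handling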

Supported formats

Document formats

PDF (.pdf, both searchable and scanned documents)
Microsoft Word (.docx, .doc)
PowerPoint presentations (.pptx)
OpenDocument Text (.odt)
Rich Text Format (.rtf)
EPUB (.epub)
DocBook XML (.dbk, .xml)
FictionBook (.fb2)
LaTeX (.tex, .latex)
Typst (.typ)

Markup and text formats

HTML (.html, .htm)
Plain text (.txt) and Markdown (.md, .markdown)
reStructuredText (.rst)
Org-mode (.org)
DokuWiki (.txt)
Pod (.pod)
Man pages (.1, .2, etc.)

Data and research formats

Excel spreadsheets (.xlsx)
CSV (.csv) and TSV (.tsv) files
Jupyter Notebooks (.ipynb)
BibTeX (.bib) and BibLaTeX (.bib)
CSL-JSON (.json)
EndNote XML (.xml)
RIS (.ris)
JATS XML (.xml)

Image formats

JPEG (.jpg, .jpeg, .pjpeg)
PNG (.png)
TIFF (.tiff, .tif)
BMP (.bmp)
GIF (.gif)
WebP (.webp)
JPEG 2000 (.jp2, .jpx, .jpm, .mj2)
Portable Anymap (.pnm)
Portable Bitmap (.pbm)
Portable Graymap (.pgm)
Portable Pixmap (.ppm)

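Usage

Kreuzberg provides a simple, async-first API for text extraction. The library exports two main functions:

extract_file(): extracts text from a file (accepts a string path or a pathlib.Path)
extract_bytes(): extracts text from bytes (accepts a bytes object)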

Quick start

from pathlib import Path

from kreuzberg import extract_file, extract_bytes


# Basic file extraction
async def extract_document():
    # Extract from a PDF file
    pdf_result = await extract_file("document.pdf")
    print(f"PDF text: {pdf_result.content}")

    # Extract from an image
    img_result = await extract_file("scan.png")
    print(f"Image text: {img_result.content}")

    # Extract from Word document
    docx_result = await extract_file(Path("document.docx"))
    print(f"Word text: {docx_result.content}")
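The quick-start coroutine awaits its three files one after another. Because the API is async, several extractions can also be awaited together. The sketch below shows one way to do that with the standard library's asyncio.gather; the helper name extract_many is hypothetical, and the extractor is passed in as a parameter so the snippet does not depend on Kreuzberg being installed.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable


async def extract_many(
    extract: Callable[[str], Awaitable[Any]],
    paths: Iterable[str],
) -> list[str]:
    """Run one extraction per path concurrently and collect the text.

    `extract` is any coroutine function that takes a path and returns an
    object with a `.content` attribute, such as kreuzberg.extract_file.
    """
    results = await asyncio.gather(*(extract(path) for path in paths))
    return [result.content for result in results]


# With Kreuzberg installed, a call could look like this:
#
#     from kreuzberg import extract_file
#     texts = asyncio.run(extract_many(extract_file, ["document.pdf", "scan.png"]))
```

asyncio.gather returns results in the same order as the input paths, so each text can be matched back to its file by position.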

Handling uploaded files

from kreuzberg import extract_bytes


async def process_upload(file_content: bytes, mime_type: str):
    """Process uploaded file content with known MIME type."""
    result = await extract_bytes(file_content, mime_type=mime_type)
    return result.content


# Example usage with different file types
# (the three byte strings are passed in by the caller)
async def handle_uploads(pdf_bytes: bytes, image_bytes: bytes, docx_bytes: bytes):
    # Process PDF upload
    pdf_result = await extract_bytes(pdf_bytes, mime_type="application/pdf")

    # Process image upload
    img_result = await extract_bytes(image_bytes, mime_type="image/jpeg")

    # Process Word document upload
    docx_result = await extract_bytes(
        docx_bytes,
        mime_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    )
    return pdf_result, img_result, docx_result
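extract_bytes() needs a MIME type, which an upload handler does not always receive. When only the original filename is known, the standard library's mimetypes module can supply a guess. This is a minimal sketch; the helper name guess_upload_mime_type and the fallback value are illustrative choices, not part of Kreuzberg.

```python
import mimetypes


def guess_upload_mime_type(
    filename: str,
    default: str = "application/octet-stream",
) -> str:
    """Guess a MIME type from the filename extension.

    Falls back to `default` when the extension is missing or unknown.
    """
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type or default


# Possible use together with the upload helper above:
#
#     mime_type = guess_upload_mime_type(upload_name)
#     text = await process_upload(file_content, mime_type)
```

The guess is based on the extension alone and says nothing about the actual bytes, so a mislabeled upload can still fail during extraction.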

Advanced features

PDF processing options

from kreuzberg import extract_file


async def process_pdf():
    # Force OCR for PDFs with embedded images or scanned content
    result = await extract_file("document.pdf", force_ocr=True)

    # Process a scanned PDF (automatically uses OCR)
    scanned = await extract_file("scanned.pdf")
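One possible policy on top of the force_ocr option is to try a normal extraction first and retry with OCR only when very little text comes back, for example for a PDF whose pages are mostly embedded images. The sketch below assumes that policy; the helper name extract_with_ocr_fallback and the min_chars threshold are hypothetical, and the extractor is injected so the snippet runs without Kreuzberg installed.

```python
from typing import Any, Awaitable, Callable


async def extract_with_ocr_fallback(
    extract: Callable[..., Awaitable[Any]],
    path: str,
    min_chars: int = 20,
) -> Any:
    """Extract normally, then retry with force_ocr=True if the text is short.

    `extract` is a coroutine function such as kreuzberg.extract_file that
    accepts a path and an optional `force_ocr` keyword argument.
    `min_chars` is an arbitrary illustrative threshold.
    """
    result = await extract(path)
    if len(result.content.strip()) >= min_chars:
        return result
    # Too little text came back, so run the extraction again with OCR forced.
    return await extract(path, force_ocr=True)
```

A fixed character threshold is a blunt heuristic, and a suitable value depends on the documents being processed.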

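The extraction result object

All extraction functions return an ExtractionResult containing:

content: the extracted text (a string)
mime_type: the output format ("text/plain", or "text/markdown" for Pandoc conversions)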

from kreuzberg import ExtractionResult, extract_file


async def process_document(path: str) -> tuple[str, str]:
    # Access as a named tuple
    result: ExtractionResult = await extract_file(path)
    print(f"Content: {result.content}")
    print(f"Format: {result.mime_type}")

    # Or unpack as a tuple
    content, mime_type = await extract_file(path)
    return content, mime_type
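The two access styles shown above are ordinary named-tuple behavior. The stand-in below illustrates this without running an extraction; ResultSketch is a hypothetical class that mirrors the two documented fields and is not Kreuzberg's own ExtractionResult.

```python
from typing import NamedTuple


class ResultSketch(NamedTuple):
    """Stand-in with the two documented fields; not Kreuzberg's own class."""

    content: str
    mime_type: str


result = ResultSketch(content="Hello", mime_type="text/plain")

# Attribute access and tuple unpacking read the same underlying values.
content, mime_type = result
same_values = (result.content, result.mime_type) == (content, mime_type)
```

Attribute access is usually the more readable choice in application code, since it does not depend on remembering the field order.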

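Error handling

Kreuzberg provides comprehensive error handling through several exception types, all of which inherit from KreuzbergError. Each exception carries context information that helps with debugging.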

from kreuzberg import extract_file
from kreuzberg.exceptions import (
    ValidationError,
    ParsingError,
    OCRError,
    MissingDependencyError,
)


async def safe_extract(path: str) -> str:
    try:
        result = await extract_file(path)
        return result.content

    except ValidationError as e:
        # Input validation issues
        # - Unsupported or undetectable MIME types
        # - Missing files
        # - Invalid input parameters
        print(f"Validation failed: {e}")

    except OCRError as e:
        # OCR-specific issues
        # - Tesseract processing failures
        # - Image conversion problems
        print(f"OCR failed: {e}")

    except MissingDependencyError as e:
        # System dependency issues
        # - Missing Tesseract OCR
        # - Missing Pandoc
        # - Incompatible versions
        print(f"Dependency missing: {e}")

    except ParsingError as e:
        # General processing errors
        # - PDF parsing failures
        # - Format conversion issues
        # - Encoding problems
        print(f"Processing failed: {e}")

    return ""


# Example error contexts (wrapped in a coroutine, since await is only
# valid inside an async function)
async def show_error_context():
    try:
        result = await extract_file("document.xyz")
    except ValidationError as e:
        # Error will include context:
        # ValidationError: Unsupported mime type
        # Context: {
        #    "file_path": "document.xyz",
        #    "supported_mimetypes": ["application/pdf", ...]
        # }
        print(e)

    try:
        result = await extract_file("scan.jpg")
    except OCRError as e:
        # Error will include context:
        # OCRError: OCR failed with a non-0 return code
        # Context: {
        #    "file_path": "scan.jpg",
        #    "tesseract_version": "5.3.0"
        # }
        print(e)

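All exceptions provide:

A descriptive error message
Relevant context in the context attribute
A string representation that includes both the message and the context
Proper exception chaining for debugging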
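To make these properties concrete, here is a small sketch of an exception base class that carries a context dictionary and includes it in its string form. It illustrates the general pattern only; SketchError, SketchValidationError, and describe are hypothetical names, and this is not Kreuzberg's actual implementation.

```python
import json
from typing import Any, Optional


class SketchError(Exception):
    """Hypothetical base error that carries a context dict."""

    def __init__(self, message: str, context: Optional[dict] = None) -> None:
        super().__init__(message)
        self.message = message
        self.context = context or {}

    def __str__(self) -> str:
        text = f"{type(self).__name__}: {self.message}"
        if self.context:
            text += f"\nContext: {json.dumps(self.context)}"
        return text


class SketchValidationError(SketchError):
    """Hypothetical subclass, caught by handlers for the base class too."""


def describe(error: SketchError) -> dict[str, Any]:
    """Flatten an error into a dict suitable for structured logging."""
    return {"type": type(error).__name__, "message": error.message, **error.context}
```

Because every subclass shares one base, a single except clause on the base class can act as a catch-all while more specific handlers remain available.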

Project link

https://github.com/Goldziher/kreuzberg


(Text: GitHubStore)
