Zerox: High-Precision OCR That Converts PDF, DOCX, and Images to Markdown Using Vision Models
General Introduction
Zerox is an open-source project that converts PDFs, DOCX files, images, and other documents to Markdown using vision models. Developed by the getomni-ai team, it provides a simple and efficient OCR (optical character recognition) solution. Zerox is available for both Node and Python and uses graphicsmagick and ghostscript to turn PDF pages into images. Given a file path and an OpenAI API key, it converts documents to Markdown quickly, including documents with complex layouts such as tables and charts.

Feature List
- Converts PDF, DOCX, images, and other file formats
- Available for both Node and Python
- Efficient OCR processing using vision models
- Pulls in graphicsmagick and ghostscript automatically for PDF-to-image processing (manual installation may still be needed on some systems)
- Accepts both file paths and URLs as input
- Offers a variety of optional parameters, such as concurrency, page orientation correction, and error handling mode
- Supports pre-processing and post-processing callback functions
- Can save conversion results to a specified directory
Usage Guide
Installation
Node version
- Install Node.js and npm
- Run the following command:
npm install zerox
- Make sure that graphicsmagick and ghostscript are installed on your system. If they are not, run the following commands:
sudo apt-get update
sudo apt-get install -y graphicsmagick ghostscript
Python version
- Install Python and pip
- Run the following command:
pip install zerox
- Make sure that graphicsmagick and ghostscript are installed on your system. If they are not, run the following commands:
sudo apt-get update
sudo apt-get install -y graphicsmagick ghostscript
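Before running a conversion, it can help to confirm that both tools are actually on the PATH. The following is a minimal Python sketch, assuming the usual executable names gm (graphicsmagick) and gs (ghostscript). The helper name missing_tools is hypothetical and is not part of zerox.

```python
import shutil


def missing_tools(tools=("gm", "gs"), which=shutil.which):
    """Return the executables from `tools` that cannot be found on PATH.

    `gm` and `gs` are the usual command names for graphicsmagick and
    ghostscript (an assumption; adjust them if your system differs).
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("graphicsmagick and ghostscript were found on PATH")
```

The `which` argument exists only so the lookup can be swapped out in tests; in normal use the default is enough.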
Usage
Node version
- Import the zerox module:
import { zerox } from "zerox";
- Use the file path for conversion:
const result = await zerox({
  filePath: "path/to/file.pdf",
  openaiAPIKey: process.env.OPENAI_API_KEY,
});
- Use the URL for conversion:
const result = await zerox({
  filePath: "https://example.com/file.pdf",
  openaiAPIKey: process.env.OPENAI_API_KEY,
});
Python version
- Import the zerox module:
from zerox import zerox
- Use the file path for conversion:
result = zerox(
    file_path="path/to/file.pdf",
    openai_api_key="your_openai_api_key"
)
- Use the URL for conversion:
result = zerox(
    file_path="https://example.com/file.pdf",
    openai_api_key="your_openai_api_key"
)
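Because the same parameter accepts either a local path or a URL, a caller may want to tell the two apart, for example to check that a local file exists before starting a conversion. Here is a minimal sketch, assuming that only http and https URLs count as remote. is_remote is a hypothetical helper, not a zerox function, and this is not necessarily how zerox itself makes the distinction.

```python
from urllib.parse import urlparse


def is_remote(file_path: str) -> bool:
    """Return True when `file_path` looks like an http(s) URL."""
    parsed = urlparse(file_path)
    # A remote input needs both a web scheme and a host name.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


if __name__ == "__main__":
    for candidate in ("https://example.com/file.pdf", "path/to/file.pdf"):
        kind = "URL" if is_remote(candidate) else "local path"
        print(f"{candidate}: {kind}")
```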
Main features and workflow
- File conversion: Provide the path or URL of a file and call the zerox function. It converts the file and returns the text in Markdown format.
- Concurrent processing: Set the concurrency parameter to control how many pages are processed at the same time, which improves processing efficiency.
- Page orientation correction: This feature is enabled by default so that the converted text is oriented correctly.
- Error handling mode: Set the errorMode parameter to choose whether errors are ignored or thrown.
- Pre- and post-processing callbacks: Callback functions let you perform custom actions before and after each page is processed.
- Save results: Set the outputDir parameter to save the conversion result to the specified directory.
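The interplay of concurrency, error mode, and callbacks described above can be sketched in a few lines of Python. This is an illustration of the general pattern only, not zerox's implementation: process_pages, fake_convert, and the callback signatures are all hypothetical, and a semaphore is just one common way to cap the number of pages in flight.

```python
import asyncio


async def process_pages(pages, convert, concurrency=10, error_mode="IGNORE",
                        on_pre_process=None, on_post_process=None):
    """Convert pages with at most `concurrency` conversions in flight.

    `convert` is a hypothetical async function that turns one page into
    Markdown. With error_mode "IGNORE" a failed page yields None; any
    other value re-raises the error.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(page_number, page):
        async with semaphore:  # limits how many pages run at once
            if on_pre_process:
                await on_pre_process(page_number)
            try:
                markdown = await convert(page)
            except Exception:
                if error_mode != "IGNORE":
                    raise
                markdown = None  # skip the failed page and continue
            if on_post_process:
                await on_post_process(page_number, markdown)
            return markdown

    tasks = [run_one(number, page) for number, page in enumerate(pages, start=1)]
    # gather returns results in the same order as the input pages
    return await asyncio.gather(*tasks)


async def fake_convert(page):
    """Stand-in for a real OCR call: fails on empty pages."""
    if not page:
        raise ValueError("empty page")
    return f"# {page}"


if __name__ == "__main__":
    print(asyncio.run(process_pages(["a", "", "c"], fake_convert, concurrency=2)))
```

With error_mode set to "IGNORE", the empty second page comes back as None while the other pages still convert. Any other value lets the first failure propagate to the caller.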
Sample code
Node version
import { zerox } from "zerox";

const result = await zerox({
  filePath: "path/to/file.pdf",
  openaiAPIKey: process.env.OPENAI_API_KEY,
  cleanup: true,
  concurrency: 10, // number of pages processed at the same time
  correctOrientation: true, // enabled by default
  errorMode: "IGNORE", // ignore errors instead of throwing them
  maintainFormat: false,
  maxRetries: 1,
  maxTesseractWorkers: -1,
  model: "gpt-4o-mini",
  // Callbacks run after and before each page is processed; replace the empty bodies with custom logic.
  onPostProcess: async ({ page, progressSummary }) => {},
  onPreProcess: async ({ imagePath, pageNumber }) => {},
  outputDir: "output", // directory where the conversion result is saved
  pagesToConvertAsImages: -1,
});
Python version
from zerox import zerox

result = zerox(
    file_path="path/to/file.pdf",
    openai_api_key="your_openai_api_key",
    cleanup=True,
    concurrency=10,
    correct_orientation=True,
    error_mode="IGNORE",
    maintain_format=False,
    max_retries=1,
    max_tesseract_workers=-1,
    model="gpt-4o-mini",
    on_post_process=lambda page, progress_summary: None,
    on_pre_process=lambda image_path, page_number: None,
    output_dir="output",
    pages_to_convert_as_images=-1,
)
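To illustrate what saving results to an output directory might involve, here is a small sketch that writes per-page Markdown to a single file. It assumes the converted pages are available as a plain list of strings, which is a hypothetical shape rather than zerox's documented return type. The helper save_markdown and the default file name result.md are likewise assumptions for illustration.

```python
from pathlib import Path


def save_markdown(pages, output_dir, file_name="result.md"):
    """Join per-page Markdown strings and write them to output_dir/file_name."""
    target_dir = Path(output_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the directory if needed
    target = target_dir / file_name
    # Skip pages that produced no text (for example, ignored failures).
    text = "\n\n".join(page for page in pages if page)
    target.write_text(text, encoding="utf-8")
    return target


if __name__ == "__main__":
    path = save_markdown(["# Page 1", None, "# Page 3"], "output")
    print("Saved to", path)
```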
Supported file types
Document-to-image conversion uses a combination of libreoffice and graphicsmagick. For non-image, non-PDF files, libreoffice first converts the file to PDF, and the PDF is then converted to images.
[
  "pdf", // Portable Document Format
  "doc", // Microsoft Word 97-2003
  "docx", // Microsoft Word 2007-2019
  "odt", // OpenDocument Text
  "ott", // OpenDocument Text Template
  "rtf", // Rich Text Format
  "txt", // Plain Text
  "html", // HTML Document
  "htm", // HTML Document (alternative extension)
  "xml", // XML Document
  "wps", // Microsoft Works Word Processor
  "wpd", // WordPerfect Document
  "xls", // Microsoft Excel 97-2003
  "xlsx", // Microsoft Excel 2007-2019
  "ods", // OpenDocument Spreadsheet
  "ots", // OpenDocument Spreadsheet Template
  "csv", // Comma-Separated Values
  "tsv", // Tab-Separated Values
  "ppt", // Microsoft PowerPoint 97-2003
  "pptx", // Microsoft PowerPoint 2007-2019
  "odp", // OpenDocument Presentation
  "otp", // OpenDocument Presentation Template
];
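The routing rule described above (non-image, non-PDF files go through libreoffice first) can be expressed as a simple extension check. The sketch below copies the extensions from the list above, minus "pdf". The names OFFICE_EXTENSIONS and needs_pdf_conversion are hypothetical and do not come from zerox.

```python
from pathlib import Path

# Extensions copied from the list above, minus "pdf", which needs no conversion.
OFFICE_EXTENSIONS = {
    "doc", "docx", "odt", "ott", "rtf", "txt", "html", "htm", "xml",
    "wps", "wpd", "xls", "xlsx", "ods", "ots", "csv", "tsv",
    "ppt", "pptx", "odp", "otp",
}


def needs_pdf_conversion(file_path: str) -> bool:
    """Return True when the file would go through libreoffice before imaging."""
    extension = Path(file_path).suffix.lower().lstrip(".")
    return extension in OFFICE_EXTENSIONS


if __name__ == "__main__":
    for name in ("report.docx", "scan.pdf", "photo.png"):
        print(name, needs_pdf_conversion(name))
```

The check is case-insensitive, so report.DOCX is treated the same as report.docx. PDFs and images return False because they skip the libreoffice step.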