    Document Extraction and Cleaning

    Total: 67 articles
    Chunkr: An All-in-One Service for Document Ingestion with Vision Models and Intelligent Chunking by Paragraph Hierarchy

    Overview: Chunkr is a self-hosted API for converting PDF, PPTX, DOCX, and Excel files into data suitable for RAG (retrieval-augmented generation) and LLM (large language model) use. The project is by Lumina...
    Tags: Latest AI tools, AI Java Open Source Project, OCR, Document Extraction and Cleaning
    7 months ago
    0 / 1.4K
    OmniParse: Extract Any Unstructured Data from Documents and Multimedia and Parse It into Structured Data

    Overview: OmniParse is a powerful data parsing and optimization platform designed to convert any unstructured data into structured, actionable data optimized for GenAI (generative AI) frameworks. Whether processing documents, tables, images, video, audio files or...
    Tags: Latest AI tools, AI Java Open Source Project, Document Extraction and Cleaning
    8 months ago
    0 / 1.4K
    ExtractThinker: Extract and Classify Documents into Structured Data to Streamline Document Processing

    Overview: ExtractThinker is a flexible document intelligence tool that uses large language models (LLMs) to extract and classify structured data from documents, providing a seamless, ORM-like document processing workflow. It supports multiple document loaders, including Tess...
    Tags: Latest AI tools, AI Java Open Source Project, Document Extraction and Cleaning
    7 months ago
    0 / 1.4K
    Outlines: Generate Structured Text Output via Regular Expressions, JSON, or Pydantic Models

    Overview: Outlines is an open-source library developed by dottxt-ai that aims to enhance large language model (LLM) applications through structured text generation. The library supports integration with multiple models, including OpenAI, transformers...
    Tags: Latest AI tools, AI Java Open Source Project, Document Extraction and Cleaning
    5 months ago
    0 / 1.3K
    pdf2htmlEX: Lossless PDF-to-HTML Conversion That Preserves Text Formatting, Suited to Academic Papers and Magazine Layouts

    Overview: pdf2htmlEX is an open-source tool designed to convert PDF files to HTML by analyzing the PDF's content and using HTML + CSS to faithfully reproduce its visual appearance, converting PDF documents into browser...
    Tags: Latest AI tools, AI Java Open Source Project, Document Extraction and Cleaning
    8 months ago
    0 / 1.3K
    Vision Parse: Intelligent Conversion of PDF Documents to Markdown Format Using Vision Language Models

    Overview: Vision Parse is a revolutionary document processing tool that cleverly combines state-of-the-art vision language models (VLMs) to intelligently convert PDF documents into high-quality Markdown...
    Tags: Latest AI tools, AI Java Open Source Project, Document Extraction and Cleaning
    7 months ago
    0 / 1.3K
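    TextIn: Universal Document Conversion, PDF to Markdown Tool

    Overview: TextIn is a professional PDF-to-Markdown tool designed to help users efficiently convert PDF documents to Markdown. It supports multiple file formats, is simple to use, converts quickly, and preserves the original PDF's formatting and content...
    Tags: Latest AI tools, Document Extraction and Cleaning
    8 months ago
    0 / 1.3K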
    NV Ingest: Parse Documents in Complex Formats and Extract Multimodal Data as Metadata and Text

    Overview: NV Ingest (NVIDIA Ingest) is a set of early-access microservices designed to parse hundreds of thousands of complex, messy, unstructured PDFs and other enterprise documents. It converts these documents into metadata and text for embedding into retrieval...
    Tags: Latest AI tools, AI Java Open Source Project, Document Extraction and Cleaning
    6 months ago
    0 / 1.3K
    Zerox: Convert PDF, DOCX, and Images to Markdown with High-Precision OCR from Vision Models

    Overview: Zerox is an open-source project designed to convert PDF, DOCX, images, and other documents to Markdown using vision models. Developed by the getomni-ai team, it provides a simple and efficient OCR (optical character recognition) solution. Ze...
    Tags: Latest AI tools, AI Java Open Source Project, Document Extraction and Cleaning
    6 months ago
    0 / 1.3K
    E2M: Convert Multiple File Formats to Markdown for Easy Document Format Unification

    Overview: E2M (Everything to Markdown) is an open-source Python library designed to convert multiple file formats to Markdown. It supports formats including doc, docx, epub, html, htm, u...
    Tags: Latest AI tools, AI Java Open Source Project, Document Extraction and Cleaning
    7 months ago
    0 / 1.3K
    SemHash: Fast Semantic Text Deduplication for More Efficient Data Cleaning

    Overview: SemHash is a lightweight, flexible tool for deduplicating datasets by semantic similarity. It combines Model2Vec's fast embedding generation with Vicinity's efficient ANN (approximate nearest neighbor) similarity search. SemHa...
    Tags: Latest AI tools, AI Java Open Source Project, Document Extraction and Cleaning
    6 months ago
    0 / 1.2K
    ViTLP: Extract Structured Data from PDFs with Complex Layouts Using a Visually Guided Generative Text-Layout Pre-trained Model

    Overview: ViTLP (Visually Guided Generative Text-Layout Pre-training for Document Intelligence) is an open-source project that aims to...
    Tags: Latest AI tools, OCR, Document Extraction and Cleaning
    8 months ago
    0 / 1.2K
    LlamaParse: High-Quality Document Parsing and Data Extraction Service from LlamaIndex (1,000 Free Pages per Day)

    Overview: LlamaParse is a powerful document parsing tool that processes complex documents such as PDFs, PowerPoint files, Word documents, and spreadsheets and converts them into structured data. LlamaParse offers a variety of ways to use...
    Tags: Latest AI tools, AI Open Services, Document Extraction and Cleaning
    6 months ago
    0 / 1.2K
    Yek: Read Text Files from a Git Repository and Quickly Chunk Them for Use with Large Language Models

    Overview: Yek is a fast Rust-based tool that reads text files in a repository or directory, then chunks and serializes them for use by large language models (LLMs). By default it uses .gitignore rules to skip unwanted files, and uses...
    Tags: Latest AI tools, AI Java Open Source Project, Document Extraction and Cleaning
    6 months ago
    0 / 1.2K
    ScrapeGraphAI: Web Scraping from a Single Prompt, an Intelligent Web Content Extraction Tool That Needs No Hand-Written Rules

    Overview: ScrapeGraphAI is an innovative Python web scraping library that combines large language models (LLMs) with direct graph logic to create scraping pipelines for websites and local documents. What makes the tool distinctive is its balance of simplicity and powerful features...
    Tags: Latest AI tools, AI Java Open Source Project, Document Extraction and Cleaning
    6 months ago
    0 / 1.2K
    Parseur: Automated Document Data Extraction That Pulls Structured Text from All Types of Documents

    Overview: Parseur is a leading AI data extraction tool designed to help users automatically extract text data from PDFs, emails, and other documents. With Parseur, users can easily convert unstructured data into structured data and send it to various applications...
    Tags: Latest AI tools, Document Extraction and Cleaning
    6 months ago
    0 / 1.1K
    Firecrawl MCP Server: Firecrawl-Based Web Crawler MCP Service

    Overview: Firecrawl MCP Server is an open-source tool developed by MendableAI, implemented on the Model Context Protocol (MCP), with Firecrawl A...
    Tags: Latest AI tools, AI Java Open Source Project, MCP services, Document Extraction and Cleaning
    4 months ago
    0 / 1.1K
    Trieve: A Full-Service RAG Cloud Infrastructure for Search, Recommendations, and Analytics

    Overview: Trieve is an all-in-one infrastructure developed by Devflow, Inc., designed for search, recommendations, RAG (retrieval-augmented generation), and analytics. The platform is offered through an API, supports self-hosting, and works with AWS, GCP, K...
    Tags: Latest AI tools, AI Open Services, Document Extraction and Cleaning
    8 months ago
    0 / 1.1K
    olmOCR: Convert PDF Documents to Text, with Recognition of Tables, Formulas, and Handwriting

    Overview: olmOCR is an open-source tool developed by the AllenNLP team at the Allen Institute for Artificial Intelligence (AI2), focused on converting PDF files to...
    Tags: Latest AI tools, AI Java Open Source Project, Document Extraction and Cleaning
    5 months ago
    0 / 1.1K
    Copyright © 2025 Sharenet 