Crawl4LLM: An Efficient Web Crawling Tool for LLM Pretraining

Overview: Crawl4LLM is an open-source project jointly developed by Tsinghua University and Carnegie Mellon University that focuses on making web crawling for large language model (LLM) pretraining more efficient. It significantly reduces wasted crawling by intelligently selecting high-quality web pages, and claims that work that originally required crawling 1...
5mos ago
dsRAG: A Retrieval Engine for Unstructured Data and Complex Queries

Overview: dsRAG is a high-performance retrieval engine designed to handle complex queries over unstructured data. It performs particularly well on challenging queries against dense text such as financial reports, legal documents, and academic papers. dsRAG employs three key approaches to improve performance: language...
5mos ago