Project Overview

Crawl4LLM runs (simulated) crawls over the ClueWeb22 dataset, rating and selecting documents to build pretraining data for LLMs.
Prerequisites
Request access to the ClueWeb22 dataset.

Install the following Python dependencies:
numpy
tqdm
fasttext
pyyaml
wandb
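A one-line install, assuming pip inside a fresh virtual environment:

pip install numpy tqdm fasttext pyyaml wandb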
Download the DCLM fastText classifier to fasttext_scorers/.
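A minimal download sketch, assuming the classifier is the one published on Hugging Face as mlfoundations/fasttext-oh-eli5 (the repository id is an assumption; use whatever source the project points to):

from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# Assumption: the DCLM fastText classifier is hosted in this Hugging Face repo.
hf_hub_download(
    repo_id="mlfoundations/fasttext-oh-eli5",
    filename="openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train.bin",
    local_dir="fasttext_scorers",
)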
To run a (simulated) crawl, first create a YAML configuration file under configs/, and then run:

python crawl.py crawl --config <path_to_your_config_file>
Crawl4LLM
Create a YAML file in configs/ with the following content:
cw22_root_path: <path_to_clueweb22_a>
seed_docs_file: seed.txt
output_dir: crawl_results/seed_10k_crawl_20m_dclm_fasttext
num_selected_docs_per_iter: 10000
num_workers: 16 # set to a number that fits your machine
save_state_every: -1 # set to a positive number to save the state (queue & visited set) of the crawler every certain steps
max_num_docs: 20000000
selection_method: dclm_fasttext_score
order: desc # desc for descending, asc for ascending
wandb: true # set to false to disable wandb logging
wandb_project: crawler
wandb_run_name: seed_10k_crawl_20m_dclm_fasttext
rating_methods:
  - type: length
  - type: fasttext_score
    rater_name: dclm_fasttext_score
    model_path: fasttext_scorers/openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train.bin
All scorers listed under rating_methods assign a score to each document. In the configuration above, we define a length scorer, which scores a document by its length, and a fasttext_score scorer, which scores documents with the DCLM fastText model. The final ranking is determined by selection_method, here set to dclm_fasttext_score, the rater_name of the fasttext_score scorer.
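For intuition, the fastText scorer boils down to one model prediction per document. A minimal sketch, assuming the classifier's positive label is __label__hq (the label name is an assumption); this is illustrative, not the repository's actual implementation:

import fasttext

# Load the DCLM classifier downloaded in the prerequisites (path from the config above).
model = fasttext.load_model(
    "fasttext_scorers/openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train.bin"
)

def dclm_fasttext_score(text: str) -> float:
    # fastText predicts on a single line of text, so strip newlines first.
    labels, probs = model.predict(text.replace("\n", " "))
    # Assumption: '__label__hq' marks the high-quality class.
    return float(probs[0]) if labels[0] == "__label__hq" else 1.0 - float(probs[0])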
Baseline Crawlers
Random Crawler
cw22_root_path: <path_to_clueweb22_a>
seed_docs_file: seed.txt
output_dir: crawl_results/seed_10k_crawl_20m_random
num_selected_docs_per_iter: 10000
num_workers: 16
save_state_every: -1
max_num_docs: 20000000
selection_method: random_score
order: desc
wandb: true
wandb_project: crawler
wandb_run_name: seed_10k_crawl_20m_random
rating_methods:
  - type: random_score
Indegree-Based Crawler
cw22_root_path: <path_to_clueweb22_a>
seed_docs_file: seed.txt
output_dir: crawl_results/seed_10k_crawl_20m_indegree
num_selected_docs_per_iter: 10000
num_workers: 16
save_state_every: -1
max_num_docs: 20000000
selection_method: inlink_count
order: desc
wandb: true
wandb_project: crawler
wandb_run_name: seed_10k_crawl_20m_indegree
rating_methods:
  - type: inlink_count
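For intuition, an indegree-based crawler prioritizes pages by how many already-seen pages link to them. A minimal sketch of such a counter, assuming the crawler can enumerate each fetched document's outlinks (illustrative only, not the repository's code):

from collections import Counter

inlink_count: Counter[str] = Counter()

def record_outlinks(outlinks: list[str]) -> None:
    # Every observed link to a document raises that document's indegree.
    inlink_count.update(outlinks)

def inlink_score(doc_id: str) -> int:
    # With order: desc, documents with more observed inlinks are crawled first.
    return inlink_count[doc_id]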
Pretraining and Evaluation
After the crawl finishes, the IDs of the crawled documents are placed in the output_dir set in the configuration file. Run the following command to fetch the document texts:
python fetch_docs.py --input_dir <document_ids_dir> --output_dir <document_texts_dir> --num_workers <num_workers>
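For example, to fetch the texts for the Crawl4LLM run configured above (the output path is illustrative):

python fetch_docs.py --input_dir crawl_results/seed_10k_crawl_20m_dclm_fasttext --output_dir crawl_results/seed_10k_crawl_20m_dclm_fasttext_texts --num_workers 16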
You can then run LLM pretraining and evaluation with the DCLM framework.
Misc
Browse the Data
Run the following command to print a document and its links by its ID:
python access_data.py <path_to_clueweb22> <document_id>
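For example (the document ID below only illustrates the ClueWeb22 ID format and is hypothetical):

python access_data.py <path_to_clueweb22> clueweb22-en0000-00-00000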
Project Link
https://github.com/cxcscmu/Crawl4LLM
(Source: GitHubStore)