Citation
If you use AutoData in research, please cite:
@inproceedings{autodata2025,
  title={AutoData: A Multi-Agent System for Open Web Data Collection},
  author={Ma, Tianyi and Qian, Yiyue and Zhang, Zheyuan and Wang, Zehong and Qian, Xiaoye and Bai, Feifan and Ding, Yifan and Luo, Xuwei and Zhang, Shinan and Murugesan, Keerthiram and others},
  booktitle={NeurIPS},
  year={2025}
}
Agent Roster
| Agent | Responsibility |
|---|---|
| | Owns the LangGraph, routes between squads, and halts when the task is satisfied. |
| | Drafts collection strategies, defines subtasks, and seeds OHCache with plan messages. |
| | Executes LangChain tools (e.g., Perplexity search) plus plugin-specified tools. |
| | Operates the |
| | Consolidates research output into executable Python + testing guidance. |
| | Writes crawler code, typically in the run-scoped |
| | Runs the generated code (pytest, uv run, etc.) and reports errors/logs. |
| | Validates dataset schema, does QA, and reports pass/fail signals. |
| | Optional manual approval gate; disabled automatically with |
Directory Layout
AutoData/
├── autodata/ # source package containing agents, core, tools, plugins
├── configs/ # YAML/TOML/JSON presets for AutoDataConfig
├── docs/, openspec/, evaluation/, tests/
└── outputs/<run_name>/ # generated per run (config, summary, artifacts, cache, checkpoints)
File/folder highlights under each run:
- summary.json – metadata about the plan, code artifacts, validation verdicts.
- results/ – packaged datasets, scripts, markdown reports, zipped deliverables.
- browser/ – browser-use recordings/screenshots (record_video_dir when enabled).
- logs/ – run-specific logs, useful for debugging agent loops.
- cache/ – OHCache artifact store (meta/*.json + artifacts/*).
- checkpoint/ – serialized state snapshots, loadable via python -m autodata.checkpoint.
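To check which of the entries listed above a given run actually produced, a small helper along these lines may be handy. This is only a sketch: `describe_run` and the `outputs/demo` path are hypothetical names chosen for illustration, and which entries exist will depend on the run's settings (for example, whether browser recording was enabled).

```python
from pathlib import Path

# Entry names taken from the run layout described above.
EXPECTED_ENTRIES = ["summary.json", "results", "browser", "logs", "cache", "checkpoint"]


def describe_run(run_dir: Path) -> dict[str, bool]:
    """Map each expected entry name to whether it exists under run_dir."""
    return {name: (run_dir / name).exists() for name in EXPECTED_ENTRIES}


if __name__ == "__main__":
    # "outputs/demo" is a placeholder run directory (assumption, not a guaranteed path).
    for name, present in describe_run(Path("outputs") / "demo").items():
        print(f"{name:<14}{'present' if present else 'missing'}")
```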
Command Quick Reference
| Command | Purpose |
|---|---|
| | Primary entry point; respects CLI overrides and writes outputs. |
| `uv run python -m autodata.checkpoint list\|save` | |
| | Exercise an individual agent with a custom prompt. |
| | Apply formatting and lint rules required by CI. |
| | Execute the unit/integration test suite (see |
| | Ensure browser-use has the Chromium binaries it needs. |
Environment Variables
| Variable | Usage |
|---|---|
| | Consumed automatically by LangChain’s |
| | Enables third-party OpenAI-compatible endpoints. |
| | Unlocks the Perplexity search tool (used by |
| Domain-specific API keys | e.g., |
Supported Arguments
AutoData groups configuration into dataclasses (such as task_config or storage_config). Unless otherwise noted, each field below can be supplied either in your config file or on the CLI with the auto-generated flag --<field-name> (underscores become hyphens). When a field is marked config only, prefer defining it in YAML/TOML/JSON for readability.
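The field-to-flag rule described above (underscores become hyphens) can be illustrated with a short sketch. `ExampleConfig`, its fields, `flag_for`, and `build_parser` are hypothetical names used for illustration only; they are not AutoData's actual schema or parser, which may be generated differently.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class ExampleConfig:
    # Hypothetical fields for illustration; not AutoData's real schema.
    run_name: str = "demo"
    max_steps: int = 10


def flag_for(field_name: str) -> str:
    """Turn a field name into a CLI flag: underscores become hyphens."""
    return "--" + field_name.replace("_", "-")


def build_parser(config_cls) -> argparse.ArgumentParser:
    """Register one flag per dataclass field, typed after the field's default value."""
    parser = argparse.ArgumentParser()
    for f in fields(config_cls):
        parser.add_argument(
            flag_for(f.name),
            dest=f.name,
            type=type(f.default),  # works here because every field has a simple default
            default=f.default,
        )
    return parser


if __name__ == "__main__":
    args = build_parser(ExampleConfig).parse_args(["--run-name", "books", "--max-steps", "25"])
    print(ExampleConfig(**vars(args)))
```

Values given on the command line replace the dataclass defaults in this sketch; how AutoData itself merges a config file with CLI overrides is not shown here.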
Task Configuration (task_config)
| Field | Default | CLI Flag(s) | Description |
|---|---|---|---|
| | | | Location of the configuration file to load. |
| | | | Explicit config format ( |
| | | | Natural-language instruction executed by the Supervisor. |
| | | | Logical name used to derive output folders and checkpoints. |
| | | | Auto-approve HumanAgent prompts for unattended runs. |
| | | | Maximum runtime in seconds before the graph aborts. |
| | | | Execution API: |
| | | | Validate configuration and exit without building the graph. |
| | | | Emit additional initialization logs. |
| | | | Persist the LangGraph diagram to disk. |
Storage Configuration (storage_config)
| Field | Default | CLI Flag | Description |
|---|---|---|---|
| | | | Storage backend ( |
| | | | Root directory that holds every run. |
| | | | Serialization format for summary files. |
| | | | Compression codec ( |
| | | | Connection string if writing to a database backend. |
| | | | Allow reusing an existing run directory. |
| | | | Skip the confirmation prompt when overwrite is enabled. |
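The interaction between the root directory, the run name, and the overwrite setting in the table above might look roughly like the sketch below. `prepare_run_dir` and its parameters are hypothetical and only illustrate the idea of refusing to reuse an existing run directory unless overwrite is enabled; AutoData's actual behavior, including its confirmation prompt, may differ.

```python
from pathlib import Path


def prepare_run_dir(root: Path, run_name: str, overwrite: bool = False) -> Path:
    """Create (or reuse) the directory for one run under the storage root."""
    run_dir = root / run_name
    if run_dir.exists() and not overwrite:
        # Without overwrite, an existing run directory is treated as a conflict.
        raise FileExistsError(f"{run_dir} already exists; enable overwrite to reuse it")
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


if __name__ == "__main__":
    # "outputs" and "demo" are placeholder values (assumptions).
    print(prepare_run_dir(Path("outputs"), "demo", overwrite=True))
```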
Logging (log_config)
| Field | Default | CLI Flag | Description |
|---|---|---|---|
| | | | Enable Prometheus metrics server. |
| | | | Port exposed by the metrics endpoint. |
| | | | Logging verbosity. |
| | | | Optional log file path relative to the run directory. |
Language Model (llm_config)
| Field | Default | CLI Flag | Description |
|---|---|---|---|
| | | | Chat model identifier. |
| | | | Explicit provider name when inference cannot deduce it. |
| | | | Sampling temperature. |
| | | | Custom OpenAI-compatible endpoint (e.g., OpenRouter). |
| | | | Override API key instead of relying on environment variables. |
| | | | Runtime-editable LLM fields ( |
Tool Configuration (tool_config)
| Field | Default | CLI Flag | Description |
|---|---|---|---|
| | | | Override the directory exposed to tool processes. |
| | | | Scratch directory for engineers/tests (defaults to |
| | | | Persistent cache for tool downloads. |
| | | | Model slug passed to the Perplexity API. |
OHCache (ohcache_config)
| Field | Default | CLI Flag | Description |
|---|---|---|---|
| | | | Toggle the OHCache hypergraph + caching layer. |
| | | | Directory where cache metadata and artifacts live. |
| | | | Delete expired cache entries on startup. |
| | | config only | Define template hyperedges (YAML/TOML keeps the structure readable). |
Checkpoints (checkpoint_config)
| Field | Default | CLI Flag | Description |
|---|---|---|---|
| | | | Master switch for checkpoint support. |
| | | | Save checkpoints automatically between agents. |
| | | | Custom directory for checkpoint binaries. |
| | | | Emit human-readable JSON next to binaries. |
| | | | Path to the checkpoint to restore before execution. |
| | | | Retention limit for automatic checkpoint pruning. |
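As a rough illustration of what a retention limit for checkpoint pruning could mean, the sketch below keeps the most recently modified files in a directory and deletes the rest. `prune_checkpoints` is a hypothetical helper, and ordering by modification time is an assumption; AutoData's own pruning logic and checkpoint file naming are not documented here.

```python
from pathlib import Path


def prune_checkpoints(checkpoint_dir: Path, keep: int) -> list[Path]:
    """Delete all but the `keep` most recently modified files; return the deleted paths."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    files = sorted(
        (p for p in checkpoint_dir.iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    stale = files[keep:]  # everything beyond the retention limit
    for path in stale:
        path.unlink()
    return stale
```

Calling `prune_checkpoints(Path("outputs/demo/checkpoint"), keep=5)` would, under these assumptions, leave at most five files in that (placeholder) directory.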
Plugins (plugin_config)
| Field | Default | CLI Flag | Description |
|---|---|---|---|
| | | | List of plugin identifiers (e.g., |
Browser Settings (browser_use_browser_config)
| Field | Default | CLI Flag | Description |
|---|---|---|---|
| | | | Run browser automation without a visible window. |
| | | | Relax browser security features (use cautiously). |
| | | | Custom user agent string. |
| | | | Extra Chromium launch flags (comma-separated). |
| | | | Directory for browser-use session recordings. |
Browser Agent (browser_use_agent_config)
| Field | Default | CLI Flag | Description |
|---|---|---|---|
| | | | Maximum browser agent steps. |
| | | | Cap on actions executed within a single step. |
| | | | Timeout in seconds for LLM calls during browser control. |
| | | | Enable GIF generation (path or |
| | | | Custom filesystem root for browser-use session artifacts. |
Special Thanks
Browser-use – browser automation foundation.
awesome-cursorrules, spec-kit, Cursor, Codex, Claude Code – tooling inspiration.