Citation

If you use AutoData in research, please cite:

@inproceedings{autodata2025,
  title={AutoData: A Multi-Agent System for Open Web Data Collection},
  author={Ma, Tianyi and Qian, Yiyue and Zhang, Zheyuan and Wang, Zehong and Qian, Xiaoye and Bai, Feifan and Ding, Yifan and Luo, Xuwei and Zhang, Shinan and Murugesan, Keerthiram and others},
  booktitle={NeurIPS},
  year={2025}
}

Agent Roster

Agent	Responsibility
`Supervisor`	Owns the LangGraph, routes between squads, and halts when the task is satisfied.
`PlanAgent`	Drafts collection strategies, defines subtasks, and seeds OHCache with plan messages.
`ToolAgent`	Executes LangChain tools (e.g., Perplexity search) plus plugin-specified tools.
`BrowserAgent`	Operates the `browser-use` automation stack to browse, click, and scrape pages.
`BlueprintAgent`	Consolidates research output into executable Python + testing guidance.
`EngineerAgent`	Writes crawler code, typically in the run-scoped `work/` directory.
`TestAgent`	Runs the generated code (pytest, uv run, etc.) and reports errors/logs.
`ValidationAgent`	Validates dataset schema, does QA, and reports pass/fail signals.
`HumanAgent`	Optional manual approval gate; disabled automatically with `--disable-human`.

Directory Layout

AutoData/
├── autodata/           # source package containing agents, core, tools, plugins
├── configs/            # YAML/TOML/JSON presets for AutoDataConfig
├── docs/, openspec/, evaluation/, tests/
└── outputs/<run_name>/ # generated per run (config, summary, artifacts, cache, checkpoints)

File/folder highlights under each run:

summary.json – metadata about the plan, code artifacts, validation verdicts.
results/ – packaged datasets, scripts, markdown reports, zipped deliverables.
browser/ – browser-use recordings/screenshots (record_video_dir when enabled).
logs/ – run-specific logs, useful for debugging agent loops.
cache/ – OHCache artifact store (meta/*.json + artifacts/*).
checkpoint/ – serialized state snapshots, loadable via python -m autodata.checkpoint.

Command Quick Reference

Command	Purpose
`uv run python -m autodata.main --config <file>`	Primary entry point; respects CLI overrides and writes outputs.
`uv run python -m autodata.checkpoint list	save
`uv run python -m dev.testing.debug --agent=PlanAgent --prompt="..."`	Exercise an individual agent with a custom prompt.
`uv run ruff format && uv run ruff check .`	Apply formatting and lint rules required by CI.
`uv run pytest`	Execute the unit/integration test suite (see `README.md` for targeted workflows).
`playwright install && playwright install-deps`	Ensure browser-use has the Chromium binaries it needs.

Environment Variables

Variable	Usage
`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`	Consumed automatically by LangChain’s `init_chat_model`.
`OPENROUTER_API_KEY`, `OPENROUTER_BASE_URL`	Enables third-party OpenAI-compatible endpoints.
`PPLX_API_KEY`	Unlocks the Perplexity search tool (used by `ToolAgent`).
Domain-specific API keys	e.g., `TIINGO_API_KEY`, `SPORTSDATA_API_KEY` when enabling plugins such as `financial` or `sport`.

Supported Arguments

AutoData groups configuration into dataclasses (such as task_config or storage_config). Unless otherwise noted, each field below can be supplied either in your config file or on the CLI with the auto-generated flag --<field-name> (underscores become hyphens). When a field is marked config only, prefer defining it in YAML/TOML/JSON for readability.

Task Configuration (`task_config`)

Field	Default	CLI Flag(s)	Description
`config`	`configs/default.yaml`	`--config`, `--config-path`, `-c`	Location of the configuration file to load.
`config_format`	`null`	`--config-format`	Explicit config format (`yaml`, `json`, `toml`) when AutoData cannot infer it.
`task`	`""`	`--task`	Natural-language instruction executed by the Supervisor.
`run_name`	`null`	`--run-name`	Logical name used to derive output folders and checkpoints.
`disable_human`	`False`	`--disable-human`	Auto-approve HumanAgent prompts for unattended runs.
`task_timeout`	`3600`	`--task-timeout`	Maximum runtime in seconds before the graph aborts.
`execution_strategy`	`"stream"`	`--execution-strategy`	Execution API: `stream`, `run`, `astream`, or `arun`.
`dry_run`	`False`	`--dry-run`	Validate configuration and exit without building the graph.
`verbose`	`False`	`--verbose`	Emit additional initialization logs.
`visualize_graph`	`False`	`--visualize-graph`	Persist the LangGraph diagram to disk.

Storage Configuration (`storage_config`)

Field	Default	CLI Flag	Description
`type`	`"file"`	`--type`	Storage backend (`file`, future database adapters).
`output_dir`	`"./outputs"`	`--output-dir`	Root directory that holds every run.
`file_format`	`"json"`	`--file-format`	Serialization format for summary files.
`compression`	`null`	`--compression`	Compression codec (`gzip`, `bz2`, `lzma`).
`database_url`	`null`	`--database-url`	Connection string if writing to a database backend.
`overwrite`	`True`	`--overwrite / --no-overwrite`	Allow reusing an existing run directory.
`force_overwrite`	`True`	`--force-overwrite / --no-force-overwrite`	Skip the confirmation prompt when overwrite is enabled.

Logging (`log_config`)

Field	Default	CLI Flag	Description
`metrics_enabled`	`True`	`--metrics-enabled / --no-metrics-enabled`	Enable Prometheus metrics server.
`metrics_port`	`9090`	`--metrics-port`	Port exposed by the metrics endpoint.
`log_level`	`"INFO"`	`--log-level`	Logging verbosity.
`log_file`	`null`	`--log-file`	Optional log file path relative to the run directory.

Language Model (`llm_config`)

Field	Default	CLI Flag	Description
`model`	`"gpt-4o"`	`--model`	Chat model identifier.
`model_provider`	`null`	`--model-provider`	Explicit provider name when inference cannot deduce it.
`temperature`	`0.0`	`--temperature`	Sampling temperature.
`base_url`	`null`	`--base-url`	Custom OpenAI-compatible endpoint (e.g., OpenRouter).
`api_key`	`null`	`--api-key`	Override API key instead of relying on environment variables.
`configurable_fields`	`null`	`--configurable-fields`	Runtime-editable LLM fields (`"any"` or comma-separated list).

Tool Configuration (`tool_config`)

Field	Default	CLI Flag	Description
`run_dir`	`null`	`--run-dir`	Override the directory exposed to tool processes.
`work_dir`	`null`	`--work-dir`	Scratch directory for engineers/tests (defaults to `outputs/<run>/work`).
`tools_cache_dir`	`null`	`--tools-cache-dir`	Persistent cache for tool downloads.
`PerplexitySearchToolModel`	`"sonar"`	`--perplexity-search-tool-model`	Model slug passed to the Perplexity API.

OHCache (`ohcache_config`)

Field	Default	CLI Flag	Description
`enable_ohcache`	`False`	`--enable-ohcache / --no-enable-ohcache`	Toggle the OHCache hypergraph + caching layer.
`cache_dir`	`null`	`--cache-dir`	Directory where cache metadata and artifacts live.
`auto_cleanup`	`False`	`--auto-cleanup / --no-auto-cleanup`	Delete expired cache entries on startup.
`hyperedges`	`[]`	config only	Define template hyperedges (YAML/TOML keeps the structure readable).

Checkpoints (`checkpoint_config`)

Field	Default	CLI Flag	Description
`checkpoint_enabled`	`False`	`--checkpoint-enabled / --no-checkpoint-enabled`	Master switch for checkpoint support.
`auto_checkpoint`	`False`	`--auto-checkpoint / --no-auto-checkpoint`	Save checkpoints automatically between agents.
`checkpoint_dir`	`null`	`--checkpoint-dir`	Custom directory for checkpoint binaries.
`export_json`	`False`	`--export-json / --no-export-json`	Emit human-readable JSON next to binaries.
`resume_from`	`null`	`--resume-from`	Path to the checkpoint to restore before execution.
`max_checkpoints`	`null`	`--max-checkpoints`	Retention limit for automatic checkpoint pruning.

Plugins (`plugin_config`)

Field	Default	CLI Flag	Description
`enabled_plugins`	`[]`	`--enabled-plugins`	List of plugin identifiers (e.g., `financial`, `sport`).

Browser Settings (`browser_use_browser_config`)

Field	Default	CLI Flag	Description
`headless`	`True`	`--headless / --no-headless`	Run browser automation without a visible window.
`disable_security`	`False`	`--disable-security / --no-disable-security`	Relax browser security features (use cautiously).
`user_agent`	`null`	`--user-agent`	Custom user agent string.
`args`	`null`	`--args`	Extra Chromium launch flags (comma-separated).
`record_video_dir`	`null`	`--record-video-dir`	Directory for browser-use session recordings.

Browser Agent (`browser_use_agent_config`)

Field	Default	CLI Flag	Description
`max_steps`	`20`	`--max-steps`	Maximum browser agent steps.
`max_actions_per_step`	`50`	`--max-actions-per-step`	Cap on actions executed within a single step.
`llm_timeout`	`null`	`--llm-timeout`	Timeout in seconds for LLM calls during browser control.
`generate_gif`	`null`	`--generate-gif`	Enable GIF generation (path or `true`).
`file_system_path`	`null`	`--file-system-path`	Custom filesystem root for browser-use session artifacts.

Special Thanks

Browser-use – browser automation foundation.
awesome-cursorrules, spec-kit, Cursor, Codex, Claude Code – tooling inspiration.

Citation

Agent Roster

Directory Layout

Command Quick Reference

Environment Variables

Supported Arguments

Task Configuration (task_config)

Storage Configuration (storage_config)

Logging (log_config)

Language Model (llm_config)

Tool Configuration (tool_config)

OHCache (ohcache_config)

Checkpoints (checkpoint_config)

Plugins (plugin_config)

Browser Settings (browser_use_browser_config)

Browser Agent (browser_use_agent_config)