# Citation

If you use AutoData in research, please cite:

```bibtex
@inproceedings{autodata2025,
  title={AutoData: A Multi-Agent System for Open Web Data Collection},
  author={Ma, Tianyi and Qian, Yiyue and Zhang, Zheyuan and Wang, Zehong and Qian, Xiaoye and Bai, Feifan and Ding, Yifan and Luo, Xuwei and Zhang, Shinan and Murugesan, Keerthiram and others},
  booktitle={NeurIPS},
  year={2025}
}
```

# Agent Roster

| Agent | Responsibility |
| --- | --- |
| `Supervisor` | Owns the LangGraph, routes between squads, and halts when the task is satisfied. |
| `PlanAgent` | Drafts collection strategies, defines subtasks, and seeds OHCache with plan messages. |
| `ToolAgent` | Executes LangChain tools (e.g., Perplexity search) plus plugin-specified tools. |
| `BrowserAgent` | Operates the `browser-use` automation stack to browse, click, and scrape pages. |
| `BlueprintAgent` | Consolidates research output into executable Python + testing guidance. |
| `EngineerAgent` | Writes crawler code, typically in the run-scoped `work/` directory. |
| `TestAgent` | Runs the generated code (pytest, uv run, etc.) and reports errors/logs. |
| `ValidationAgent` | Validates dataset schema, does QA, and reports pass/fail signals. |
| `HumanAgent` | Optional manual approval gate; disabled automatically with `--disable-human`. |

# Directory Layout

```
AutoData/
├── autodata/     # source package containing agents, core, tools, plugins
├── configs/      # YAML/TOML/JSON presets for AutoDataConfig
├── docs/, openspec/, evaluation/, tests/
└── outputs//     # generated per run (config, summary, artifacts, cache, checkpoints)
```

File/folder highlights under each run:

- `summary.json` – metadata about the plan, code artifacts, validation verdicts.
- `results/` – packaged datasets, scripts, markdown reports, zipped deliverables.
- `browser/` – browser-use recordings/screenshots (`record_video_dir` when enabled).
- `logs/` – run-specific logs, useful for debugging agent loops.
- `cache/` – OHCache artifact store (`meta/*.json` + `artifacts/*`).
- `checkpoint/` – serialized state snapshots, loadable via `python -m autodata.checkpoint`.

# Command Quick Reference

| Command | Purpose |
| --- | --- |
| `uv run python -m autodata.main --config ` | Primary entry point; respects CLI overrides and writes outputs. |
| `uv run python -m autodata.checkpoint list\|save\|load\|clean` | Inspect or manage checkpoints without kicking off a full task. |
| `uv run python -m dev.testing.debug --agent=PlanAgent --prompt="..."` | Exercise an individual agent with a custom prompt. |
| `uv run ruff format && uv run ruff check .` | Apply formatting and lint rules required by CI. |
| `uv run pytest` | Execute the unit/integration test suite (see `README.md` for targeted workflows). |
| `playwright install && playwright install-deps` | Ensure browser-use has the Chromium binaries it needs. |

# Environment Variables

| Variable | Usage |
| --- | --- |
| `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY` | Consumed automatically by LangChain's `init_chat_model`. |
| `OPENROUTER_API_KEY`, `OPENROUTER_BASE_URL` | Enables third-party OpenAI-compatible endpoints. |
| `PPLX_API_KEY` | Unlocks the Perplexity search tool (used by `ToolAgent`). |
| Domain-specific API keys | e.g., `TIINGO_API_KEY`, `SPORTSDATA_API_KEY` when enabling plugins such as `financial` or `sport`. |

(references-supported-arguments)=
# Supported Arguments

AutoData groups configuration into dataclasses (such as `task_config` or `storage_config`). Unless otherwise noted, each field below can be supplied either in your config file or on the CLI with the auto-generated flag `--` (underscores become hyphens). When a field is marked *config only*, prefer defining it in YAML/TOML/JSON for readability.
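
The flag-naming rule described above (underscores become hyphens) can be sketched in a few lines of Python. This is an illustrative helper only, not AutoData's actual argument parser: the function name `field_to_flags` and the `negatable` parameter are hypothetical, introduced here to mirror the paired `--x / --no-x` spellings that some boolean fields list in the reference tables.

```python
def field_to_flags(field_name: str, negatable: bool = False) -> list[str]:
    """Derive CLI flag spellings from a config field name (hypothetical helper)."""
    # Underscores in the field name become hyphens in the flag.
    stem = field_name.replace("_", "-")
    flags = ["--" + stem]
    if negatable:
        # Some boolean fields are documented with a paired negative form.
        flags.append("--no-" + stem)
    return flags


print(field_to_flags("task_timeout"))
print(field_to_flags("force_overwrite", negatable=True))
```

Not every documented flag follows this rule mechanically (for example, `config` also accepts `--config-path` and `-c`), so treat the tables as the source of truth.
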

## Task Configuration (`task_config`)

| Field | Default | CLI Flag(s) | Description |
| --- | --- | --- | --- |
| `config` | `configs/default.yaml` | `--config`, `--config-path`, `-c` | Location of the configuration file to load. |
| `config_format` | `null` | `--config-format` | Explicit config format (`yaml`, `json`, `toml`) when AutoData cannot infer it. |
| `task` | `""` | `--task` | Natural-language instruction executed by the Supervisor. |
| `run_name` | `null` | `--run-name` | Logical name used to derive output folders and checkpoints. |
| `disable_human` | `False` | `--disable-human` | Auto-approve HumanAgent prompts for unattended runs. |
| `task_timeout` | `3600` | `--task-timeout` | Maximum runtime in seconds before the graph aborts. |
| `execution_strategy` | `"stream"` | `--execution-strategy` | Execution API: `stream`, `run`, `astream`, or `arun`. |
| `dry_run` | `False` | `--dry-run` | Validate configuration and exit without building the graph. |
| `verbose` | `False` | `--verbose` | Emit additional initialization logs. |
| `visualize_graph` | `False` | `--visualize-graph` | Persist the LangGraph diagram to disk. |

## Storage Configuration (`storage_config`)

| Field | Default | CLI Flag | Description |
| --- | --- | --- | --- |
| `type` | `"file"` | `--type` | Storage backend (`file`, future database adapters). |
| `output_dir` | `"./outputs"` | `--output-dir` | Root directory that holds every run. |
| `file_format` | `"json"` | `--file-format` | Serialization format for summary files. |
| `compression` | `null` | `--compression` | Compression codec (`gzip`, `bz2`, `lzma`). |
| `database_url` | `null` | `--database-url` | Connection string if writing to a database backend. |
| `overwrite` | `True` | `--overwrite / --no-overwrite` | Allow reusing an existing run directory. |
| `force_overwrite` | `True` | `--force-overwrite / --no-force-overwrite` | Skip the confirmation prompt when overwrite is enabled. |
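
As a rough illustration of the task and storage fields, the sketch below writes a small JSON config that overrides a few of them and reads it back. The nesting of fields under `task_config` and `storage_config` keys is an assumption (the reference names the groups but does not show the file layout), and the task text and the `example.json` file name are placeholders.

```python
import json
import tempfile
from pathlib import Path

# Assumed layout: one top-level key per configuration group.
config = {
    "task_config": {
        "task": "Collect recent conference paper titles",  # placeholder instruction
        "disable_human": True,   # auto-approve HumanAgent prompts
        "task_timeout": 1800,    # seconds; the documented default is 3600
    },
    "storage_config": {
        "output_dir": "./outputs",  # documented default
        "file_format": "json",      # documented default
        "overwrite": False,         # do not reuse an existing run directory
    },
}

# Write to a temporary directory so the sketch has no side effects on a real checkout.
config_path = Path(tempfile.mkdtemp()) / "example.json"
config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")

# Round-trip the file to confirm it is valid JSON.
loaded = json.loads(config_path.read_text(encoding="utf-8"))
print(loaded["task_config"]["task_timeout"])
```

A file like this would be passed through `--config`, with `--config-format json` available when the format cannot be inferred.
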

## Logging (`log_config`)

| Field | Default | CLI Flag | Description |
| --- | --- | --- | --- |
| `metrics_enabled` | `True` | `--metrics-enabled / --no-metrics-enabled` | Enable Prometheus metrics server. |
| `metrics_port` | `9090` | `--metrics-port` | Port exposed by the metrics endpoint. |
| `log_level` | `"INFO"` | `--log-level` | Logging verbosity. |
| `log_file` | `null` | `--log-file` | Optional log file path relative to the run directory. |

## Language Model (`llm_config`)

| Field | Default | CLI Flag | Description |
| --- | --- | --- | --- |
| `model` | `"gpt-4o"` | `--model` | Chat model identifier. |
| `model_provider` | `null` | `--model-provider` | Explicit provider name when inference cannot deduce it. |
| `temperature` | `0.0` | `--temperature` | Sampling temperature. |
| `base_url` | `null` | `--base-url` | Custom OpenAI-compatible endpoint (e.g., OpenRouter). |
| `api_key` | `null` | `--api-key` | Override API key instead of relying on environment variables. |
| `configurable_fields` | `null` | `--configurable-fields` | Runtime-editable LLM fields (`"any"` or comma-separated list). |

## Tool Configuration (`tool_config`)

| Field | Default | CLI Flag | Description |
| --- | --- | --- | --- |
| `run_dir` | `null` | `--run-dir` | Override the directory exposed to tool processes. |
| `work_dir` | `null` | `--work-dir` | Scratch directory for engineers/tests (defaults to `outputs//work`). |
| `tools_cache_dir` | `null` | `--tools-cache-dir` | Persistent cache for tool downloads. |
| `PerplexitySearchToolModel` | `"sonar"` | `--perplexity-search-tool-model` | Model slug passed to the Perplexity API. |

## OHCache (`ohcache_config`)

| Field | Default | CLI Flag | Description |
| --- | --- | --- | --- |
| `enable_ohcache` | `False` | `--enable-ohcache / --no-enable-ohcache` | Toggle the OHCache hypergraph + caching layer. |
| `cache_dir` | `null` | `--cache-dir` | Directory where cache metadata and artifacts live. |
| `auto_cleanup` | `False` | `--auto-cleanup / --no-auto-cleanup` | Delete expired cache entries on startup. |
| `hyperedges` | `[]` | *config only* | Define template hyperedges (YAML/TOML keeps the structure readable). |

## Checkpoints (`checkpoint_config`)

| Field | Default | CLI Flag | Description |
| --- | --- | --- | --- |
| `checkpoint_enabled` | `False` | `--checkpoint-enabled / --no-checkpoint-enabled` | Master switch for checkpoint support. |
| `auto_checkpoint` | `False` | `--auto-checkpoint / --no-auto-checkpoint` | Save checkpoints automatically between agents. |
| `checkpoint_dir` | `null` | `--checkpoint-dir` | Custom directory for checkpoint binaries. |
| `export_json` | `False` | `--export-json / --no-export-json` | Emit human-readable JSON next to binaries. |
| `resume_from` | `null` | `--resume-from` | Path to the checkpoint to restore before execution. |
| `max_checkpoints` | `null` | `--max-checkpoints` | Retention limit for automatic checkpoint pruning. |

## Plugins (`plugin_config`)

| Field | Default | CLI Flag | Description |
| --- | --- | --- | --- |
| `enabled_plugins` | `[]` | `--enabled-plugins` | List of plugin identifiers (e.g., `financial`, `sport`). |

## Browser Settings (`browser_use_browser_config`)

| Field | Default | CLI Flag | Description |
| --- | --- | --- | --- |
| `headless` | `True` | `--headless / --no-headless` | Run browser automation without a visible window. |
| `disable_security` | `False` | `--disable-security / --no-disable-security` | Relax browser security features (use cautiously). |
| `user_agent` | `null` | `--user-agent` | Custom user agent string. |
| `args` | `null` | `--args` | Extra Chromium launch flags (comma-separated). |
| `record_video_dir` | `null` | `--record-video-dir` | Directory for browser-use session recordings. |
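
The browser settings describe `args` as a comma-separated string of extra Chromium launch flags. A minimal sketch of how such a string could be split into a list is shown below; `split_launch_args` is a hypothetical helper written for illustration, and AutoData's own parsing may differ.

```python
from typing import Optional


def split_launch_args(raw: Optional[str]) -> list[str]:
    """Split a comma-separated string of launch flags into a clean list."""
    if not raw:
        # `null` (None) or an empty string means no extra flags.
        return []
    # Trim whitespace around each flag and drop empty fragments.
    return [part.strip() for part in raw.split(",") if part.strip()]


print(split_launch_args("--disable-gpu, --mute-audio"))
```

One limitation of a plain split: a flag whose value itself contains a comma (for example `--window-size=1280,720`) would be cut in two, so such flags need separate handling.
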

## Browser Agent (`browser_use_agent_config`)

| Field | Default | CLI Flag | Description |
| --- | --- | --- | --- |
| `max_steps` | `20` | `--max-steps` | Maximum browser agent steps. |
| `max_actions_per_step` | `50` | `--max-actions-per-step` | Cap on actions executed within a single step. |
| `llm_timeout` | `null` | `--llm-timeout` | Timeout in seconds for LLM calls during browser control. |
| `generate_gif` | `null` | `--generate-gif` | Enable GIF generation (path or `true`). |
| `file_system_path` | `null` | `--file-system-path` | Custom filesystem root for browser-use session artifacts. |

# Special Thanks

- [Browser-use](https://github.com/browser-use/browser-use) – browser automation foundation.
- [awesome-cursorrules](https://github.com/PatrickJS/awesome-cursorrules.git), [spec-kit](https://github.com/github/spec-kit.git), [Cursor](https://cursor.com), [Codex](https://openai.com/codex/), [Claude Code](https://www.claude.com/product/claude-code) – tooling inspiration.