Welcome to AutoData
AutoData is a multi-agent system that automates how you study a target domain, plan a crawler, generate Python code, and validate the collected dataset. The public release lives at AutoData while these docs mirror the development branch.
Note
Need the CLI switches or configuration schema? Jump to Prerequisites and LLM Provider Setup for the most common workflows.
What AutoData Solves
Modern websites require research, tooling, scripted browsing, and a validation loop before data is trustworthy. AutoData turns that process into a repeatable workflow:
Automated research – the Research Squad (Plan, Tool, Browser, Blueprint agents) explores APIs and webpages, accumulating facts in OHCache.
Code generation – the Development Squad (Engineer, Test, Validation agents) turns blueprints into runnable crawlers and executes them inside the sandboxed work directory.
Supervisor loop – the Supervisor agent arbitrates hand-offs, injects HumanAgent confirmations when `disable_human` is false, and stops when your success criteria are satisfied.
Provenance artifacts – every run writes configs, summaries, browser captures, and checkpoints under `outputs/<run_name>/` so you can re-run or audit later.
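The Supervisor loop can be pictured as a small routing function. This is a minimal sketch under stated assumptions, not AutoData's actual implementation: `RunState`, `next_step`, and the returned step names are hypothetical, and only the `disable_human` flag and the squad names come from the description above.

```python
from dataclasses import dataclass


@dataclass
class RunState:
    """Hypothetical run state; the field names are illustrative only."""
    blueprint_ready: bool = False   # Research Squad has produced a blueprint
    dataset_valid: bool = False     # Development Squad has validated the dataset
    human_confirmed: bool = False   # a human has approved the blueprint


def next_step(state: RunState, disable_human: bool) -> str:
    """Pick the next hand-off for the current state."""
    if state.dataset_valid:
        return "finish"                 # success criteria satisfied
    if not state.blueprint_ready:
        return "research_squad"         # keep researching the target domain
    if not disable_human and not state.human_confirmed:
        return "human"                  # inject a confirmation step
    return "development_squad"          # build and validate the crawler
```

With `disable_human` set to true, the `"human"` branch is never taken and a ready blueprint goes straight to the Development Squad.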
Architecture in Brief
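| Layer | Responsibilities |
|---|---|
| Supervisor Agent | Owns the AutoData graph, routes work, enforces task deadlines, and decides when to finish. |
| Research Squad | Plan, Tool, Browser, and Blueprint agents explore APIs and webpages and accumulate facts in OHCache. |
| Development Squad | Engineer, Test, and Validation agents turn blueprints into runnable crawlers and execute them inside the sandboxed work directory. |
| Shared substrate | OHCache hypergraph for context routing, LangGraph for execution, CheckpointManager for persistence, and PluginSpec for optional domain tuning. |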
See LLM Provider Setup for all tunable knobs exposed by AutoDataConfig.
Execution Lifecycle
Initialize – `uv run python -m autodata.main --config configs/default.yaml` loads the YAML/TOML/JSON config, merges CLI overrides (model, task, log level, etc.), and materializes `outputs/<run_name>/`.
Build graph – `AutoData.build()` constructs the LangGraph with every agent plus optional plugins. If `checkpoint_config.resume_from` is set, state is hydrated before execution.
Research loop – Supervisor dispatches to Research Squad members. OHCache routes only the relevant conversations via hyperedges you can predefine in `ohcache_config.hyperedges`.
Development loop – Blueprint instructions trigger Engineer/Test/Validation. Tool execution happens in `work/` while artifacts land in `results/`.
Finalize – On success AutoData writes `summary.json`, optional checkpoints, browser recordings, cached artifacts, and log files. Clean exits keep directories intact; enabling `auto_checkpoint` persists snapshots during the run.
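The Initialize step (load a config, merge CLI overrides, create the run directory) might look roughly like the sketch below. It illustrates the general pattern only. The helper names `load_config`, `merge_overrides`, and `materialize_run_dir` are hypothetical, and placing `work/` and `results/` directly under the run directory is an assumption. YAML is left out because it needs a third-party parser.

```python
import json
from pathlib import Path


def load_config(path: Path) -> dict:
    """Parse a JSON or TOML config file, chosen by its suffix."""
    suffix = path.suffix.lower()
    if suffix == ".json":
        return json.loads(path.read_text(encoding="utf-8"))
    if suffix == ".toml":
        import tomllib  # standard library from Python 3.11 onward
        return tomllib.loads(path.read_text(encoding="utf-8"))
    raise ValueError(f"unsupported config format: {suffix}")


def merge_overrides(config: dict, overrides: dict) -> dict:
    """Return a copy of config with non-None overrides applied recursively."""
    merged = dict(config)
    for key, value in overrides.items():
        if value is None:
            continue  # the flag was not given on the command line
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged


def materialize_run_dir(root: Path, run_name: str) -> Path:
    """Create root/<run_name>/ with work/ and results/ (assumed layout)."""
    run_dir = root / run_name
    for sub in ("work", "results"):
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    return run_dir
```

Treating `None` as "flag not given" lets the file's value survive unless an override is passed explicitly, and the recursive merge changes one nested key without discarding its siblings.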
Key Innovations
OHCache (Oriented Hypergraph Cache) keeps token budgets predictable by caching artifacts and routing messages by type, not history length.
Checkpoint CLI (`python -m autodata.checkpoint ...`) lets you list, clean, or resume runs without touching the full pipeline.
Plugin surface allows prompt injections and additional LangChain tools per domain (financial, sport, academic, etc.).
uv-first toolchain guarantees reproducible environments (see Prerequisites).
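To make "routing messages by type, not history length" concrete, here is a toy oriented-hypergraph router. It is a sketch of the general idea only, not the OHCache implementation: `Hyperedge`, `TypedRouter`, the agent names, and the message types are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperedge:
    """Directed hyperedge: any source may send this message type to all targets."""
    sources: frozenset
    targets: frozenset
    message_type: str


class TypedRouter:
    """Toy router that delivers by (sender, type) instead of replaying history."""

    def __init__(self, edges):
        self.edges = list(edges)
        self.inbox = defaultdict(list)  # agent name -> delivered payloads

    def publish(self, sender: str, message_type: str, payload: str) -> set:
        """Deliver payload along every matching hyperedge; return the recipients."""
        recipients = set()
        for edge in self.edges:
            if sender in edge.sources and edge.message_type == message_type:
                recipients |= edge.targets
        for agent in recipients:
            self.inbox[agent].append(payload)
        return recipients


edges = [
    Hyperedge(frozenset({"browser", "tool"}), frozenset({"blueprint"}), "finding"),
    Hyperedge(frozenset({"blueprint"}), frozenset({"engineer", "test"}), "blueprint"),
]
router = TypedRouter(edges)
router.publish("browser", "finding", "API returns JSON pages")
router.publish("blueprint", "blueprint", "crawl /items?page=N")
```

Each agent's inbox holds only the payloads routed to it, so the context an agent sees grows with the number of relevant messages rather than with the length of the whole conversation.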
When to Use AutoData
Use AutoData whenever you must collect structured data from the open web with auditability:
Need autonomous research before writing a crawler.
Want generated code you can inspect, test, or adapt.
Require reproducible outputs, cached context, and resumable checkpoints.
Prefer orchestration that plays nicely with OpenAI, Anthropic, Google, or an OpenRouter-compatible provider without code changes.
GETTING STARTED
CONFIGURATION
REFERENCES
- Citation
- Agent Roster
- Directory Layout
- Command Quick Reference
- Environment Variables
- Supported Arguments
  - Task Configuration (`task_config`)
  - Storage Configuration (`storage_config`)
  - Logging (`log_config`)
  - Language Model (`llm_config`)
  - Tool Configuration (`tool_config`)
  - OHCache (`ohcache_config`)
  - Checkpoints (`checkpoint_config`)
  - Plugins (`plugin_config`)
  - Browser Settings (`browser_use_browser_config`)
  - Browser Agent (`browser_use_agent_config`)
- Special Thanks