
A practical walkthrough of building production-ready AI agents for tax form document processing - with agentic architectural patterns you can apply to any compliance-sensitive use case.

If you'd like to follow along via video, watch our walkthrough on YouTube: https://youtu.be/41rLM9aZb18
Building custom agents with OpenAI's Codex SDK is straightforward—you define your tools, write your prompts, and run everything locally in a sandbox. That works great for development, but once you need to run the agent on more than one person's data, share it with a team, or deploy it to production, you quickly realize you need things like isolation between customers, auditability of what the agent did, and controlled access to the resources it can touch.
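For reference, here is roughly what that local development loop looks like. This is a minimal sketch assuming the @openai/codex-sdk thread API; the prompt and file names are illustrative, not taken from the repo.

```typescript
// Minimal local loop: create a Codex client, start a thread, hand it a task.
// The agent plans, runs commands in its local sandbox, and returns a response.
import { Codex } from "@openai/codex-sdk";

async function main() {
  const codex = new Codex(); // reads API credentials from the environment

  const thread = codex.startThread();
  const result = await thread.run(
    "Read w2.json in the working directory and draft a Form 1040 summary."
  );

  console.log(result.finalResponse);
}

main().catch(console.error);
```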
That gap between local prototyping and production is what we wanted to address with the AI TaxMan project—an open-source GitHub repository that shows what production-ready agent infrastructure looks like when you move beyond local development.
Fork & explore the repo: github.com/runloopai/codex-tax-man

AI TaxMan is a Codex SDK-based agent with a basic web frontend that transforms W2 forms into completed Form 1040s. The tax logic itself isn't the most interesting part—what makes this relevant as a reference architecture is how it handles everything around the core task: isolation, auditability, and controlled access.

Here's the core insight: when AI agents handle sensitive data, the execution environment matters as much as the model itself.
Running locally with Codex's sandbox works fine during early-stage development. But local sandboxing falls short once you want to run the agent on multiple customers' data simultaneously, share the agent with teammates, or deploy to production. You need cloud infrastructure, and that's where things typically get complicated.
Isolation means each agent runs in its own sandbox: Customer A's tax documents shouldn't exist in the same environment as Customer B's agent, a fundamental compliance requirement in regulated industries. Auditability means understanding what the agent did when something goes wrong: every file created, every command run, every intermediate decision. Finally, controlled access means explicit boundaries. Agents are unpredictable; they might hit the network, read files they shouldn't, or consume unexpected resources.
Runloop makes all of this easy. We built AI TaxMan on Runloop devboxes because a devbox is essentially a dedicated computer for each agent: isolated compute, an isolated filesystem, permission controls, and secure tunnel access. Each session spins up, does its work, and tears down cleanly, all from a single API call, with no Kubernetes clusters to manage or VM provisioning scripts to write.
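To make that lifecycle concrete, here is a rough sketch of one agent session using Runloop's TypeScript client. The method names, the RUNLOOP_API_KEY variable, and the run-taxman.js entry point are assumptions for illustration; check the Runloop docs linked at the end of this post for the exact API.

```typescript
// One session: create an isolated devbox, run the agent inside it, tear it down.
import Runloop from "@runloop/api-client";

async function processReturn(taxpayerId: string) {
  const client = new Runloop({ bearerToken: process.env.RUNLOOP_API_KEY });

  // One devbox per taxpayer session: dedicated compute and filesystem.
  const devbox = await client.devboxes.create({ name: `taxman-${taxpayerId}` });

  try {
    // Everything the agent does happens inside this box, not on your machine.
    const result = await client.devboxes.executeSync(devbox.id, {
      command: "node run-taxman.js --input /home/user/w2.json",
    });
    console.log(result.stdout);
  } finally {
    // Tear down cleanly so no customer data lingers between sessions.
    await client.devboxes.shutdown(devbox.id);
  }
}
```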

The AI TaxMan repository includes everything you need to run this yourself: the Codex SDK agent, the basic web frontend, the prompt iterations described later in this post, and the Runloop devbox setup.
The agent reasons about what needs to happen and delegates each step to tools. LLMs are good at planning and orchestration but unreliable at arithmetic, so we give the agent calculator tools to handle the math (a sketch of the idea follows below). TaxMan decides the sequence, the tools handle execution, and every operation happens in an isolated, secure space inside the devbox. Every step the agent takes gets logged, so if a return looks wrong, you can trace back through the agent's reasoning and tool calls to find exactly where things went sideways.
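To make the arithmetic point concrete, here is an illustrative sketch of the kind of deterministic calculator helper an agent can shell out to inside the devbox. It is not the repo's actual tool, and the bracket figures are placeholders rather than real tax tables.

```typescript
// calculator.ts - a deterministic helper the agent invokes instead of doing
// arithmetic itself, e.g. `node calculator.js 52000` inside the devbox.
type Bracket = { upTo: number; rate: number };

// Placeholder brackets for illustration only; not real tax tables.
const BRACKETS: Bracket[] = [
  { upTo: 11_000, rate: 0.1 },
  { upTo: 44_725, rate: 0.12 },
  { upTo: Infinity, rate: 0.22 },
];

function taxOwed(taxableIncome: number): number {
  let owed = 0;
  let lower = 0;
  for (const { upTo, rate } of BRACKETS) {
    if (taxableIncome <= lower) break;
    // Only the slice of income that falls inside this bracket is taxed at its rate.
    const taxedInThisBracket = Math.min(taxableIncome, upTo) - lower;
    owed += taxedInThisBracket * rate;
    lower = upTo;
  }
  return Math.round(owed * 100) / 100;
}

// CLI entry point so the agent can call it as a shell command.
const income = Number(process.argv[2]);
if (!Number.isNaN(income)) {
  console.log(taxOwed(income));
}
```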
Once you have a working agent, the next step is to improve it. That requires systematic prompt experimentation in reproducible environments, combined with observability into what's actually happening inside your agent.
The AI TaxMan repository includes six different prompts, each more comprehensive than the last. The initial prompt from part one was basic: it told the agent to process a W2 and calculate taxes, with minimal instructions and no tools.
By the final iteration, we had added detailed step-by-step instructions and explicit tool usage, including the calculator tools described above.
Runloop's reproducible environments let us isolate the prompt as the only variable between runs. Every run is observable on the Runloop platform, with devbox status, execution logs, and debug artifacts clearly visible. You can also interact with the agent directly: run shell commands, inspect output files, and download any logs it produces.
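The manual experiment loop is simple in outline: same task, same clean environment, different prompt each time. Here is a sketch, assuming a runAgentInFreshDevbox() helper along the lines of the lifecycle example above and hypothetical prompt file names.

```typescript
// Run the same task once per prompt variant, each in a freshly provisioned devbox.
import { readFile } from "node:fs/promises";

const PROMPT_FILES = ["prompts/v1.md", "prompts/v2.md", "prompts/v3.md"];

async function runExperiments(
  runAgentInFreshDevbox: (prompt: string) => Promise<string>
) {
  for (const file of PROMPT_FILES) {
    const prompt = await readFile(file, "utf8");
    // A clean devbox per variant means any difference in output comes from
    // the prompt, not leftover state from an earlier run.
    const output = await runAgentInFreshDevbox(prompt);
    console.log(`--- ${file} ---\n${output}\n`);
  }
}
```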

After running experiments manually, we integrated the Weave API from Weights & Biases for automated tracing and observability. Weave helps you iterate, observe, and productionize your AI agents by providing tools to define, track, and debug your application's logic and data flow.
In the trace view, each trace captures the prompt given, every command execution along with the agent's reasoning, and the output from OpenAI's Codex within TaxMan. Running TaxMan on devboxes allows us to connect Weave for granular logging, and those logs are then stored by Weights & Biases for further downstream processing.
As we vary our prompts, Weave traces these outputs and helps us understand how our agent is behaving. This automated monitoring becomes essential as we move from manual experimentation to production-level automated improvement.
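The wiring itself is a small amount of code. A minimal sketch, assuming the Weave TypeScript SDK's init/op API; runTaxman() is a hypothetical stand-in for the agent call, not a function from the repo.

```typescript
// Record each agent run as a Weave trace in a Weights & Biases project.
import * as weave from "weave";

async function main() {
  // Traces land in this project for later inspection.
  await weave.init("codex-tax-man");

  // Wrapping the function logs its inputs (the prompt) and outputs
  // (the agent's result) as a trace every time it is called.
  const tracedRun = weave.op(async function runTaxman(prompt: string) {
    // ... start the devbox, run Codex with this prompt, return the result ...
    return "stub result";
  });

  await tracedRun("Process w2.json and produce a completed Form 1040.");
}

main().catch(console.error);
```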
If you are interested in how Runloop's Benchmarking product can further enhance your agent development process, check out video 3 of the AI TaxMan series, where we learn how to automate the evaluation process. With Weave handling tracing and general observability, Runloop Benchmarks extends the refinement loop by letting you evaluate the full agentic lifecycle with comprehensive data at your fingertips.
Fork the repository to continue exploring enterprise agent development patterns: github.com/runloopai/codex-tax-man
This content is also available in video form on our YouTube channel: https://youtu.be/41rLM9aZb18
AI TaxMan on GitHub: https://github.com/runloopai/codex-tax-man
Runloop Documentation: https://runloop.ai/docs
Runloop Website: https://runloop.ai/