
A practical walkthrough of building production-ready AI agents for tax form document processing - with agentic architectural patterns you can apply to any compliance-sensitive use case.

If you'd like to follow along via video, watch our walkthrough on YouTube: https://youtu.be/41rLM9aZb18
Building custom agents with OpenAI's Codex SDK is straightforward—you define your tools, write your prompts, and run everything locally in a sandbox. That works great for development, but once you need to run the agent on more than one person's data, share it with a team, or deploy it to production, you quickly realize you need things like isolation between customers, auditability of what the agent did, and controlled access to the resources it can touch.
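For reference, here is roughly what that local development loop looks like. This is a minimal sketch assuming the @openai/codex-sdk thread API; the prompt and file names are illustrative, not taken from the repo.

```typescript
// Minimal local loop: create a Codex client, start a thread, hand it a task.
// The agent plans, runs commands in its local sandbox, and returns a response.
import { Codex } from "@openai/codex-sdk";

async function main() {
  const codex = new Codex(); // reads API credentials from the environment

  const thread = codex.startThread();
  const result = await thread.run(
    "Read w2.json in the working directory and draft a Form 1040 summary."
  );

  console.log(result.finalResponse);
}

main().catch(console.error);
```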
That gap between local prototyping and production is what we wanted to address with the AI TaxMan project—an open-source GitHub repository that shows what production-ready agent infrastructure looks like when you move beyond local development.
Fork & explore the repo: github.com/runloopai/codex-tax-man

AI TaxMan is a Codex SDK-based agent with a basic web frontend that transforms W2 forms into completed Form 1040s. The tax logic itself isn't the most interesting part—what makes this relevant as a reference architecture is how it handles everything around the core task: isolation, auditability, and controlled access.

Here's the core insight: when AI agents handle sensitive data, the execution environment matters as much as the model itself.
Running locally with Codex's sandbox works fine during early-stage development. But local sandboxing falls short once you want to run the agent on multiple customers' data simultaneously, share the agent with teammates, or deploy to production. You need cloud infrastructure, and that's where things typically get complicated.
Isolation means each agent runs in its own sandbox: Customer A's tax documents shouldn't exist in the same environment as Customer B's agent, a fundamental compliance requirement in regulated industries. Auditability means understanding what the agent did when something goes wrong: every file created, every command run, every intermediate decision. Finally, controlled access means explicit boundaries. Agents are unpredictable; they might hit the network, read files they shouldn't, or consume unexpected resources.
Runloop makes all of this easy. We built AI TaxMan on Runloop devboxes because a devbox is essentially a dedicated computer for each agent: isolated compute, an isolated filesystem, permission controls, and secure tunnel access. Each session spins up, does its work, and tears down cleanly, all from a single API call, with no Kubernetes clusters to manage or VM provisioning scripts to write.
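To make that lifecycle concrete, here is a rough sketch of one agent session using Runloop's TypeScript client. The method names, the RUNLOOP_API_KEY variable, and the run-taxman.js entry point are assumptions for illustration; check the Runloop docs linked at the end of this post for the exact API.

```typescript
// One session: create an isolated devbox, run the agent inside it, tear it down.
import Runloop from "@runloop/api-client";

async function processReturn(taxpayerId: string) {
  const client = new Runloop({ bearerToken: process.env.RUNLOOP_API_KEY });

  // One devbox per taxpayer session: dedicated compute and filesystem.
  const devbox = await client.devboxes.create({ name: `taxman-${taxpayerId}` });

  try {
    // Everything the agent does happens inside this box, not on your machine.
    const result = await client.devboxes.executeSync(devbox.id, {
      command: "node run-taxman.js --input /home/user/w2.json",
    });
    console.log(result.stdout);
  } finally {
    // Tear down cleanly so no customer data lingers between sessions.
    await client.devboxes.shutdown(devbox.id);
  }
}
```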

The AI TaxMan repository includes everything you need to run this yourself: the Codex SDK agent, the basic web frontend, the prompt iterations described later in this post, and the Runloop devbox setup.
The agent reasons about what needs to happen and delegates each step to tools. LLMs are good at planning and orchestration but unreliable at arithmetic, so we give the agent calculator tools to handle the math (a sketch of the idea follows below). TaxMan decides the sequence, the tools handle execution, and every operation happens in an isolated, secure space inside the devbox. Every step the agent takes gets logged, so if a return looks wrong, you can trace back through the agent's reasoning and tool calls to find exactly where things went sideways.
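To make the arithmetic point concrete, here is an illustrative sketch of the kind of deterministic calculator helper an agent can shell out to inside the devbox. It is not the repo's actual tool, and the bracket figures are placeholders rather than real tax tables.

```typescript
// calculator.ts - a deterministic helper the agent invokes instead of doing
// arithmetic itself, e.g. `node calculator.js 52000` inside the devbox.
type Bracket = { upTo: number; rate: number };

// Placeholder brackets for illustration only; not real tax tables.
const BRACKETS: Bracket[] = [
  { upTo: 11_000, rate: 0.1 },
  { upTo: 44_725, rate: 0.12 },
  { upTo: Infinity, rate: 0.22 },
];

function taxOwed(taxableIncome: number): number {
  let owed = 0;
  let lower = 0;
  for (const { upTo, rate } of BRACKETS) {
    if (taxableIncome <= lower) break;
    // Only the slice of income that falls inside this bracket is taxed at its rate.
    const taxedInThisBracket = Math.min(taxableIncome, upTo) - lower;
    owed += taxedInThisBracket * rate;
    lower = upTo;
  }
  return Math.round(owed * 100) / 100;
}

// CLI entry point so the agent can call it as a shell command.
const income = Number(process.argv[2]);
if (!Number.isNaN(income)) {
  console.log(taxOwed(income));
}
```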
Once you have a working agent, the next step is to improve it. That requires systematic prompt experimentation in reproducible environments, combined with observability into what's actually happening inside your agent.
The AI TaxMan repository includes six different prompts, each more comprehensive than the last. The initial prompt from part one was basic: it told the agent to process a W2 and calculate taxes, with minimal instructions and no tools.
By the final iteration, we had added detailed step-by-step instructions and explicit tool usage, including the calculator tools described above.
Runloop's reproducible environments let us isolate the prompt as the only variable between runs. Every run is observable on the Runloop platform, with devbox status, execution logs, and debug artifacts clearly visible. You can also interact with the agent directly: run shell commands, inspect output files, and download any logs it produces.
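The manual experiment loop is simple in outline: same task, same clean environment, different prompt each time. Here is a sketch, assuming a runAgentInFreshDevbox() helper along the lines of the lifecycle example above and hypothetical prompt file names.

```typescript
// Run the same task once per prompt variant, each in a freshly provisioned devbox.
import { readFile } from "node:fs/promises";

const PROMPT_FILES = ["prompts/v1.md", "prompts/v2.md", "prompts/v3.md"];

async function runExperiments(
  runAgentInFreshDevbox: (prompt: string) => Promise<string>
) {
  for (const file of PROMPT_FILES) {
    const prompt = await readFile(file, "utf8");
    // A clean devbox per variant means any difference in output comes from
    // the prompt, not leftover state from an earlier run.
    const output = await runAgentInFreshDevbox(prompt);
    console.log(`--- ${file} ---\n${output}\n`);
  }
}
```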

After running experiments manually, we integrated the Weave API from Weights & Biases for automated tracing and observability. Weave helps you iterate, observe, and productionize your AI agents by providing tools to define, track, and debug your application's logic and data flow.
In the trace view, each trace captures the prompt given, every command execution along with the agent's reasoning, and the output from OpenAI's Codex within TaxMan. Running TaxMan on devboxes allows us to connect Weave for granular logging, and those logs are then stored by Weights & Biases for further downstream processing.
As we vary our prompts, Weave traces these outputs and helps us understand how our agent is behaving. This automated monitoring becomes essential as we move from manual experimentation to production-level automated improvement.
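The wiring itself is a small amount of code. A minimal sketch, assuming the Weave TypeScript SDK's init/op API; runTaxman() is a hypothetical stand-in for the agent call, not a function from the repo.

```typescript
// Record each agent run as a Weave trace in a Weights & Biases project.
import * as weave from "weave";

async function main() {
  // Traces land in this project for later inspection.
  await weave.init("codex-tax-man");

  // Wrapping the function logs its inputs (the prompt) and outputs
  // (the agent's result) as a trace every time it is called.
  const tracedRun = weave.op(async function runTaxman(prompt: string) {
    // ... start the devbox, run Codex with this prompt, return the result ...
    return "stub result";
  });

  await tracedRun("Process w2.json and produce a completed Form 1040.");
}

main().catch(console.error);
```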
If you are interested in how Runloop's Benchmarking product can further enhance your agent development process, check out video 3 of the AI TaxMan series, where we learn how to automate the evaluation process. With Weave handling tracing and general observability, Runloop Benchmarks extends the refinement loop by letting you evaluate the full agentic lifecycle with comprehensive data at your fingertips.
Fork the repository to continue exploring enterprise agent development patterns: github.com/runloopai/codex-tax-man
This content is also available in video form on our YouTube channel: https://youtu.be/41rLM9aZb18
AI TaxMan on GitHub: https://github.com/runloopai/codex-tax-man
Runloop Documentation: https://runloop.ai/docs
Runloop Website: https://runloop.ai/