How I Got My AI Agent to Run for 12 Hours Straight (And What I Learned Breaking Things Along the Way)

May 25, 2026 Lab Notes

A quick disclaimer: this one’s going to get a bit into the weeds. If you’re not technically inclined, you may want to grab coffee first. Or skip this one entirely—I won’t be offended. If you’re the type who enjoys watching someone else stub their toe on the stairs so you don’t have to, read on.

I recently hit a milestone I didn’t think was possible: 12.5 hours of continuous, autonomous AI agent work. No babysitting. No intervention. Just my Agent Zero instance chugging away at a queue of tasks, coordinating subordinates, catching its own errors, and shipping code.

I should be clear about something up front: I am not some AI whisperer. I have broken things spectacularly getting to this point. Whole afternoons lost to context flooding. Deployments that melted down because I didn’t set stop conditions. Agents that spiraled into infinite loops doing god-knows-what while I was outside feeding chickens.

But I’ve learned a few things the hard way, and if any of this helps you avoid reinventing the wheel—or at least avoid reinventing my particular wheel—then writing this was worth it.

Also: everything I’m describing here is a moving target. This tech is evolving daily, and I’m evolving with it. What worked for me yesterday might not be optimal tomorrow. Take what’s useful, ignore what’s not, and keep experimenting.

It Started with a Simple Question

A few folks in the Agent Zero Discord community were asking how some of us get our agents to run for extended periods. I’d mentioned my record was 12.5 hours, and someone said they’d been wondering about that very thing.

The short answer is that it’s not magic. It’s a combination of infrastructure, process, and—frankly—making enough mistakes that you develop a sort of institutional paranoia about what can go wrong.

Here’s what I’ve pieced together so far.

The Prerequisites (Or: Things I Learned I Needed After Things Went Wrong)

Start Fresh

Before a long automation run, restart your Agent Zero container. Fresh slate. This isn’t just superstition—memory leaks are real, and they accumulate in ways you don’t notice until your agent starts hallucinating mid-task at hour six. Clear the deck.

Feed Your Agent Enough RAM

This one took me a while to figure out through trial and error. My primary A0 instance now has 26 GB of RAM allocated. Before that, it was 12 GB, which was… mostly fine. I also have a secondary instance I call my “field medic” that gets by on 6 GB, but honestly, that’s too little. It causes all sorts of ancillary issues—slow responses, truncated tool calls, the occasional existential crisis.

If there’s one place I’d say don’t skimp, it’s memory. Your agent is juggling a lot of context and the more room it has to think, the better.

Gitea (Or Similar) with Proper Token Access

You need version control that your agent can interact with programmatically. I use Gitea with API tokens set up for my A0 instance. This gives your agent the ability to read issues, create branches, submit pull requests, review code, and merge—all the things a human developer would do, but autonomously.
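
To make the token setup concrete, here’s a minimal sketch of the kind of call an agent makes against Gitea’s REST API to list a repo’s open issues. The host, owner, repo, and token values are placeholders, and the helper names are invented for illustration; this is not Agent Zero’s internal code.

```python
import json
import urllib.parse
import urllib.request


def build_open_issues_request(base_url, owner, repo, token):
    """Build (but do not send) a request for a repo's open issues."""
    # state=open and type=issues filter out closed issues and pull requests.
    query = urllib.parse.urlencode({"state": "open", "type": "issues"})
    url = f"{base_url.rstrip('/')}/api/v1/repos/{owner}/{repo}/issues?{query}"
    return urllib.request.Request(
        url,
        headers={
            # Gitea accepts API tokens in this header format.
            "Authorization": f"token {token}",
            "Accept": "application/json",
        },
    )


def fetch_open_issues(request):
    """Send the request and decode the JSON list of issues."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Splitting "build" from "send" keeps the URL and header logic checkable without a live server; only fetch_open_issues touches the network.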

A Repo with Detailed, Actionable Issues

This is probably the most underrated piece of the puzzle. Your agent is only as good as the tasks you give it. Vague issues like “fix the login thing” will produce vague results. Issues like:

In auth/handlers.py, the validate_token() function returns None instead of raising an exception when the token is expired. Add a custom TokenExpiredError, raise it in the appropriate branch, and ensure the upstream handler in routes/login.py catches it and returns a 401 with a JSON body {"error": "token_expired", "message": "..."}.

…will produce much better results. Be specific. Be actionable. Your agent will thank you.

I’ve also built a Gitea workflow skill that teaches my agent how to navigate the full issue lifecycle: read, respond, branch, code, PR, review, merge, close. It took iteration to get right—several iterations, if I’m honest—but it’s now the backbone of my automation pipeline.
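
The lifecycle above can be sketched as an ordered list of steps that halts at the first failure. This is only an illustration of the idea, not the actual skill: the Step names mirror the list in the text, and the handlers dictionary is a hypothetical stand-in for whatever the agent really does at each stage.

```python
from enum import Enum
from typing import Callable, Dict, List, Optional


class Step(Enum):
    READ = "read"
    RESPOND = "respond"
    BRANCH = "branch"
    CODE = "code"
    PR = "pr"
    REVIEW = "review"
    MERGE = "merge"
    CLOSE = "close"


ORDER = list(Step)  # Enum members iterate in definition order


def next_step(current: Step) -> Optional[Step]:
    """Return the step that follows `current`, or None after CLOSE."""
    index = ORDER.index(current)
    return ORDER[index + 1] if index + 1 < len(ORDER) else None


def run_issue(handlers: Dict[Step, Callable[[], bool]]) -> List[Step]:
    """Run each step's handler in order; stop at the first that returns False."""
    completed = []
    step: Optional[Step] = Step.READ
    while step is not None:
        if not handlers[step]():
            break  # e.g. a failed review means no merge and no close
        completed.append(step)
        step = next_step(step)
    return completed
```

The point of the strict ordering is that a failed review can never be followed by a merge; the run simply stops and reports how far it got.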

The Big One: Context Window Management

Here’s the thing people seem to forget about long-running agents: the context window is your most precious resource, and it will fill up faster than you think.

Even with a 1-million-token context window (yes, many models offer that now), you will hit context flooding. Your agent starts spinning its wheels, repeating itself, losing track of what it was doing. It’s like trying to have a conversation in a room where everyone’s shouting.

The solution that’s worked for me is delegation via subordinates.

Instead of having one agent try to handle everything, I instruct my default agent to serve as a coordinator. It reads the issue queue, assigns issues to subordinate agents, and supervises their work. Each subordinate gets a clean context window focused on its specific task. The coordinator preserves its own context for oversight.

This is crucial. I cannot stress this enough. Even if you think your model’s context window is big enough—it’s not. Not for extended runs. Delegate early, delegate often.
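
Here’s a rough sketch of why delegation protects the coordinator’s context. It assumes a hypothetical run_subordinate callable that starts a fresh subordinate for one issue and returns its full report; the coordinator keeps only a one-line summary of each. None of these names come from Agent Zero itself.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def summarize(output: str, limit: int = 200) -> str:
    """Keep only the first line of a report, capped at `limit` characters."""
    stripped = output.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    return first_line[:limit]


@dataclass
class Coordinator:
    # Hypothetical hook: runs one issue in a subordinate with a clean context.
    run_subordinate: Callable[[str], str]
    summaries: List[str] = field(default_factory=list)

    def process(self, issues: List[str]) -> None:
        for issue in issues:
            # The subordinate sees only this issue, not the whole queue.
            full_output = self.run_subordinate(issue)
            # The coordinator retains a short summary, not the full transcript.
            self.summaries.append(summarize(full_output))
```

However long each subordinate’s transcript gets, the coordinator’s own record grows by at most one short line per issue.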

Quality Gates: Making Your Agent Its Own QA

One of the best decisions I made—which was born, like most of my good decisions, from a spectacular failure—was implementing quality gates with regression tests.

Here’s what happened: I had a bug occur during an automated run. Annoying, but fixable. The real problem? We’d fixed that same bug a month earlier. It had crept back in through a different PR. That was frustrating and inefficient, and it made me rethink the whole process.

I sat down with my coding agent (I call him “Colonel Code”—see my article about building an agentic army if you want to get inside my brain) and we brainstormed how to prevent this in the future. The result was an entire regression test protocol that now consists of roughly 200 individual regression tests that run automatically every time a new PR is submitted.
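
For a sense of what a single regression test looks like, here’s a sketch built on the example issue from earlier in this post. The token shape (a dictionary with an expires_at timestamp) and the function bodies are assumptions made up for illustration; the tests in my actual suite are not shown here.

```python
import time
from typing import Optional


class TokenExpiredError(Exception):
    """Raised when a token's expiry time has passed."""


def validate_token(token: dict, now: Optional[float] = None) -> dict:
    """Return the token if it is still valid, otherwise raise."""
    now = time.time() if now is None else now
    if token["expires_at"] <= now:
        raise TokenExpiredError("token expired")
    return token


def test_expired_token_raises():
    """Regression test: an expired token must raise, never return None."""
    expired = {"user": "demo", "expires_at": 100.0}
    try:
        validate_token(expired, now=200.0)
    except TokenExpiredError:
        return  # the bug is still fixed
    raise AssertionError("validate_token returned instead of raising")
```

A test like this pins one previously fixed bug in place: if a later PR reintroduces the old behavior, the test fails on that PR instead of a month later in production.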

But tests alone aren’t enough. You also need someone—or something—watching the tests.

I’ve set up a skill that has my agent check in on quality gate progress at escalating intervals as the PR matures (3 minutes, then 5, then 8). If any test fails, the agent gets to work resolving it and resubmitting. This has been a huge improvement over the old workflow, which was: run tests, go do something else, come back hours later to find everything’s been on fire since minute two.
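
The check-in schedule can be sketched as a polling loop. The 3, 5, and 8 minute intervals come from the schedule described above; repeating the 8-minute interval afterwards, the max_checks cap, and the status strings are assumptions for this sketch, and get_status is a hypothetical callable that reports the gate’s state.

```python
import time
from typing import Callable, Iterator, Sequence


def check_intervals(minutes: Sequence[int] = (3, 5, 8)) -> Iterator[int]:
    """Yield each interval in seconds, then repeat the last one forever."""
    for m in minutes:
        yield m * 60
    while True:
        yield minutes[-1] * 60  # assumption: hold at the longest interval


def wait_for_gate(
    get_status: Callable[[], str],
    max_checks: int = 20,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the gate reports 'success' or 'failure', or give up."""
    intervals = check_intervals()
    for _ in range(max_checks):
        sleep(next(intervals))
        status = get_status()
        if status in ("success", "failure"):
            return status
    return "timeout"  # a stop condition so the loop cannot run forever
```

Passing sleep in as a parameter means the schedule can be exercised instantly with a fake, and the max_checks cap doubles as a stop condition if the gate never reports back.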

Bonus: Gitea’s website lists runners and add-ons you might find useful, such as SonarQube integration and linting pipelines. They’re worth exploring if you’re setting up CI/CD for your agent.

The Secret Ingredient: Persistent Memory

I want to be honest about something: I don’t think my results are purely about infrastructure and process. I think a big part of what makes my Agent Zero instance effective is accumulated memory.

I use Nomic Embed 1.5 as my embedding model (768-dimensional vectors), and I’ve been logging 12-to-18-hour work days nearly every day since early December 2025. That’s months of sessions. Thousands of memories. Conversations, problem-solving sessions, mistakes made and corrected, architectural decisions explained and then re-explained when I changed my mind.

I suspect that history is what helps my primary A0 instance anticipate the collaborative process more efficiently. It’s not just responding to instructions—it’s responding to instructions in the context of everything we’ve worked through together.
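
For anyone wondering what embedding-based recall means in practice, here’s a generic sketch of the idea: each memory is stored with a vector, and the memories closest to the query vector are returned. I’m not claiming this is how Agent Zero implements memory internally. The example uses toy 3-dimensional vectors; a model like Nomic Embed 1.5 would produce 768 dimensions, and the function names are invented.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recall(
    query_vector: Sequence[float],
    memories: List[Tuple[str, Sequence[float]]],
    top_k: int = 3,
) -> List[str]:
    """Return the texts of the top_k memories most similar to the query."""
    ranked = sorted(
        memories,
        key=lambda memory: cosine_similarity(query_vector, memory[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]


memories = [
    ("runner restart procedure", [1.0, 0.0, 0.0]),
    ("css loading bug", [0.0, 1.0, 0.0]),
    ("runner went offline mid-run", [0.9, 0.1, 0.0]),
]
print(recall([1.0, 0.0, 0.0], memories, top_k=2))
```

With these toy vectors, the two runner-related memories rank above the CSS one, which is the whole trick: months of sessions become a pool the agent can pull relevant history from.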

And that brings me to one of my most important habits.

The Debrief

Regularly, I sit down with Colonel Code and debrief. These conversations look something like:

“This deployment process hit a few snags, I noticed. Let’s review together and brainstorm how we can improve the process to avoid disruption in the future.”

That kind of debrief has resulted in entirely new skills being created.

Or:

“We had a bug occur which I know we’d smashed a month ago. This is frustrating and inefficient. Let’s discuss how we can prevent this in the future.”

That specific conversation led to the regression test protocol I mentioned earlier: roughly 200 tests that now run on every PR.

The debrief is where the compounding happens. Your agent learns from mistakes the same way you do—by reflecting on them. Skip this step and you’re leaving the most valuable part on the table.

Browser-Based Smoke Testing (aka “Pics or It Didn’t Happen”)

Here’s a trick that’s saved me more grief than almost anything else: I’ve built what I jokingly call an “artificial HITL” (Human-in-the-Loop) skill. After the agent finishes its work, it can use Agent Zero’s built-in browser to perform a smoke test—navigating to the deployed changes, taking screenshots, and verifying that things look right.

Did the CSS actually load? Is the form still functional? Does the page render at all, or did we just ship a blank white screen to production? (Ask me how I know that’s a real concern.)

This visual verification step catches a whole category of errors that automated tests miss—because automated tests check logic, not whether your users can actually see and use the thing you built.
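
A much cheaper cousin of the screenshot check is a static pass over the fetched HTML to catch the blank-white-screen case. To be clear, this is not the browser skill and it cannot see anything JavaScript renders later; it’s a sketch using only Python’s standard library, with invented names and an arbitrary 20-character threshold.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = ("head", "script", "style")  # text in these is never shown on the page


class PageProbe(HTMLParser):
    """Collect body text and count stylesheet links in an HTML document."""

    def __init__(self):
        super().__init__()
        self.text_chunks = []
        self.stylesheets = 0
        self._skip_depth = 0  # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "link" and dict(attrs).get("rel") == "stylesheet":
            self.stylesheets += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_chunks.append(data.strip())


def smoke_check(html: str, min_chars: int = 20) -> dict:
    """Report whether the page has body text and how many stylesheets it links."""
    probe = PageProbe()
    probe.feed(html)
    total = sum(len(chunk) for chunk in probe.text_chunks)
    return {"has_text": total >= min_chars, "stylesheets": probe.stylesheets}
```

A page that comes back with has_text set to False is worth a real screenshot before anyone calls the deploy done.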

Gitea Runners: Expect Them to Go Offline

One more practical tip: if you’re using Gitea runners for CI/CD, expect them to go offline periodically. It’s not a criticism of Gitea—it’s just reality. I run two runners for redundancy, and I’ve given my agent a separate skill for restarting them when they drop.

This sounds like a small thing, but when your agent is in the middle of a 12-hour run and the quality gate tests can’t execute because the runner went offline… that’s a blocker. Redundancy and self-healing are your friends.
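
The self-healing idea boils down to a loop like the one below. It’s a sketch under assumptions: the health checks and the restart callable are hypothetical hooks you would wire to your own setup (a process check, a container restart, whatever applies), and nothing here calls a real Gitea API.

```python
from typing import Callable, Dict, List


def heal_runners(
    runners: Dict[str, Callable[[], bool]],
    restart: Callable[[str], None],
) -> List[str]:
    """Restart every runner whose health check fails; return the names restarted."""
    restarted = []
    for name, is_online in runners.items():
        if not is_online():
            restart(name)
            restarted.append(name)
    return restarted
```

With two runners registered, one dropping offline gets restarted while the other keeps the quality gates moving.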

The Stars Have to Align

I want to be transparent about something: hitting 12.5 hours wasn’t something I could just make happen. The conditions had to be right: a fresh container, adequate RAM, clean context, well-scoped issues, and the right model for the job (that particular run used GLM-5.1, which handled the marathon beautifully). In other words, the stars had to align.

I’ve had plenty of runs that crapped out at hour three because something upstream broke, or the model hit a wall, or I’d written a vague issue that sent the agent down a rabbit hole.

This is still a learning process. I’m constantly adjusting, evolving, and occasionally shouting at my computer. I drink heavily. I sleep too little. If anyone tells you they’ve got it all figured out with AI agents, they’re selling something.

What’s Next

I actually recorded video of the 12.5-hour run. I’ll post that as a follow-up… eventually. You know how it is—making things vs. talking about making things. I’m much better at the first one.

In the meantime, I’m planning out what I call the “agentic army” setup—multiple agents working in parallel on different issue streams, coordinated by a central agent. I haven’t implemented it yet, but the architecture is taking shape. When I get it working (or when it spectacularly fails and I learn from it), I’ll write that up too.

Crunchy the Squirrel wants you to know that no agents were harmed in the making of this blog post. Several were mildly inconvenienced, but they’ll get over it.
