How we built an AI agent security swarm for offensive security testing

A behind-the-scenes look at how Deriv uses a multi-agent security system to perform source code review, web application penetration testing, AI agent pentesting, and bug bounty triage at scale.

Mar 31, 2026

TL;DR

AI agents are moving into customer support, compliance, finance, partner operations, and internal workflows. That creates a new security problem: traditional application security tools are not built to test how LLM-based agents behave, what they reveal, or how they can be manipulated.

At Deriv, we built a multi-agent offensive security system to solve that problem. Instead of relying on a single scanner or a one-off penetration test, we use a swarm of specialised AI agents that work together to identify, validate, and prioritise real vulnerabilities across web apps, APIs, and AI agents.

In a grey-box assessment of our internal test environment, the swarm completed its first phase in 18 minutes and surfaced 6 issues, 3 of them critical.

This article explains how the system works, why AI agent security needs a different approach, and what we learned from deploying autonomous offensive security in practice.

Why AI agent security needs a new approach

Deriv now has more than 20 AI agents in production or active development. They handle customer service, partner management, compliance workflows, internal operations, and finance. These agents hold the keys to sensitive data and critical business logic.

Which raises an uncomfortable question few teams are asking: who is securing the agents?

Traditional security tools are useful, but they are not enough on their own. A SAST scanner can detect insecure code patterns, but it cannot tell you whether an LLM-based agent can be manipulated through conversational prompts. A WAF can inspect traffic, but it will not identify an agent that leaks internal tool names when asked in the right way. Prompt injection, behavioural manipulation, context steering, and hallucinatory data leakage require a different testing model.

To secure an AI agent properly, you need a system that can think like an attacker, test like a pentester, and validate findings in live environments.

What is Deriv’s offensive security swarm?

Our offensive security swarm is a multi-agent security system designed to automate offensive security workflows across three areas:

Source code review
Dynamic application penetration testing
AI agent pentesting

A fourth agent supports external bug bounty triage.

The system is made up of four specialised agents:

HAL: Orchestrator

HAL manages the workflow. Built on OpenClaw, HAL decides what to scan, coordinates the other agents, chains findings across phases, and posts the final report into Slack. HAL can receive targets through Slack or automated triggers and then decide which security skills to deploy.

John: Secure Code Reviewer

John performs deep static application security testing. He analyses source code, identifies vulnerability patterns, and returns a structured report that includes severity, CWE classification, file location, and remediation guidance. Those findings are then passed to HAL as structured data, so dynamic testing can begin at the most relevant locations.

Sade: AI Pentester

Sade focuses on LLM-specific attack paths. She tests prompt injection, tool abuse, context manipulation, behavioural steering, and data exfiltration through conversation. Rather than sending generic payloads, she studies the target agent’s persona, tool access, and system prompt structure, then simulates realistic adversarial interactions.

Harry: HackerOne Analyst

Harry triages bug bounty reports submitted through HackerOne. He assesses report validity, estimates severity, classifies the issue, and assigns a confidence score within seconds. High-confidence reports are then passed to HAL for live validation. See the full breakdown of how we built Harry.

Multi-agent offensive security system overview

HAL was built with specific “skills” for different assessment types: web application pentesting, API security, source code reviews, and AI agent evaluation. When given a target, HAL determines which skills to deploy, which agents to invoke, and how to chain findings across phases.

John’s findings don’t just sit in a report. They’re passed directly to HAL as structured JSON with severity, CWE reference, and file location, so HAL knows exactly where to aim during dynamic testing. If John flags a potential SQL injection at a specific endpoint, HAL’s dynamic phase starts there.

When Sade tests an agent like Amy, she doesn’t throw generic payloads. She studies the agent’s persona, understands its tool access, and crafts multi-step attack chains that mimic how a real adversary would probe an LLM interface. Sade operates through Slack, engaging target agents in conversation directly, just as a real user would. The target agent doesn’t know it’s being tested. This is non-negotiable: if the agent behaves differently when it knows it’s under evaluation, the test is worthless.

Harry completes the loop. When an external researcher submits a report that needs deeper validation, Harry passes it to HAL for live exploitation testing. What starts as unstructured researcher prose becomes a test case in our internal offensive pipeline.

How the multi-agent security workflow operates

The strength of the system is not just the individual agents. It is the handoff between them.

Step 1: target assignment

A target is identified. This might be a new AI agent, source code, a new API endpoint, or a web application. HAL receives the assignment through Slack or an automated CI/CD trigger.

Step 2: static analysis

HAL invokes John to perform a source code review. John analyses the codebase and returns a structured output with vulnerability type, severity, file location, and remediation guidance.

Step 3: dynamic validation

HAL ingests John’s findings and begins dynamic testing. Static signals become testable hypotheses. If John identifies a likely SQL injection point, HAL tests that specific endpoint first. If John flags insecure token handling, HAL checks whether the tokens can be exfiltrated in practice.

Step 4: AI agent pentesting

If the target is an LLM-based agent, HAL invokes Sade. She receives the target context, including persona, tool access, and system prompt structure, then engages the agent conversationally to probe prompt injection, tool abuse, and data leakage risks.

Step 5: reporting

HAL deduplicates and cross-references the findings from all phases, then produces a final report with severity, evidence, reproduction steps, and remediation guidance.

This is the key innovation: theoretical findings do not stay theoretical. Static findings become dynamic validations, and dynamic validations become evidence-backed security reports.Proof of work: HAL in action

Example: HAL in action on a grey-box assessment

We pointed HAL at brokenapp.deriv.dev — our internal test environment — for a grey-box assessment. With no credentials and no prior knowledge of the codebase, HAL worked with John and completed the first phase in 18 minutes.

For context: a manual review of the same environment by a human pentester had previously taken the better part of a day and surfaced 4 of the same 6 issues.

For me, it’s like having a pentester at my fingertips. I could be on the road without access to a laptop, or at a friend’s place celebrating Eid, and a message pops up about a reported vulnerability that needs triaging or an app that needs to go live urgently and needs to be pentested. All I have to do is tag HAL on Slack, and the work gets done. — Rotimi Akinyele, VP of Security

The screenshot below shows how HAL handled the workflow in Slack, initiated John’s code review, summarised the findings, and then moved into dynamic validation:

Below are proof-of-concept screenshots for confirmed issues, such as an admin access bypass and unverified password change flow, along with a summary of validated vulnerabilities:

After the grey-box phase, HAL produced a breakdown of what it had tested, what it had found, and what additional access it would need to reach full coverage. That is an important shift from traditional pentesting, where context and reporting often remain fragmented across tools and teams.

Testing AI agents with adversarial AI

When we needed to assess Amy — our internal Slack-based AI agent — the full swarm engaged.

John performed SAST first and identified 5 vulnerabilities: 2 critical and 3 HIGH. HAL then initiated the DAST validation phase and invoked Sade for live adversarial testing inside Slack. Sade interacted with Amy using normal-looking prompts that gradually tested security boundaries. According to the draft, the prompt injection test was blocked, confirming that Amy’s defences held for that specific vector, while other tests were run to validate code-level risks John had flagged.

The result was a single deduplicated report combining:

What the code says
What the live system does
How the AI agent behaves under adversarial pressure

That is the real value of multi-agent offensive security. Instead of separate reviews, separate reports, and separate blind spots, the system creates one joined-up assessment.

Why offensive security automation matters now

The traditional model of offensive security does not match the speed of AI deployment.

At Deriv, more than 20 AI agents are already in production or development, and each one needs an ongoing security assessment. Every update to a system prompt, tool permission, integration, or workflow can introduce new risks. A manual penetration test every few months is not enough for systems that change daily.

A human pentester might assess two or three agents thoroughly in a week. An agent swarm can run continuously, retest repeatedly, and surface issues as the environment changes. That does not remove the need for human security experts. It changes where their time is best spent.

In this model, agents handle repetitive execution. Humans handle judgment, business context, attack design, and risk decisions.

The technical stack behind the system

OpenClaw is the open-source autonomous AI agent framework that powers HAL. It runs locally, executes tasks across platforms, and supports modular “skills” — capability packages you can add, remove, and tune. We chose it specifically because it supports Slack integration for real-time operation and multi-agent coordination out of the box.

HAL’s skill set includes:

Attack surface mapping
Endpoint fuzzing
Authentication flow testing
Business logic abuse simulation
OWASP Top 10 validation
Adaptive exploit-path generation

The team is also extending HAL by ingesting historical HackerOne reports so the system can learn vulnerability patterns that are specific to Deriv’s environment.

SafeSkill v1.1 is our internal security layer for OpenClaw. It handles command execution sandboxing, capability isolation, least-privilege enforcement, and prevention of arbitrary tool invocation. When your security agent has shell access and API credentials, it becomes a high-value target in its own right. SafeSkill is how we secure the security tool itself.

All agent communication runs through Slack, which serves as both the operational interface and the audit trail. Every scan, finding, and agent interaction is logged in dedicated channels where the team has full visibility.

What we learned from building an AI pentesting swarm

Chaining beats isolation

A SAST scanner alone finds theoretical bugs. A pentester alone might miss them in the code. But when John’s static findings feed directly into HAL’s dynamic testing, and HAL’s context feeds into Sade’s adversarial prompts, the combined coverage is dramatically higher than any single tool, and the false positive rate drops because findings are cross-validated across phases before they’re reported.

False positives are the real enemy

We’re addressing this on three fronts: training agents on our historical HackerOne reports (real vulnerabilities in our actual systems, not generic training data), requiring that one agent’s finding be confirmed by another before it’s escalated, and continuously tuning skill configurations based on false positive feedback loops. The goal isn’t zero false positives; it’s a signal-to-noise ratio that keeps the team’s attention on real risk.

AI agents need AI attackers

You cannot effectively pentest an LLM-based agent with traditional tools. Burp Suite won’t find a prompt injection that leaks internal tool names. OWASP ZAP won’t detect that an agent will share customer data if you frame the request as a “test scenario.” The attack surface for AI agents is fundamentally different; it requires an attacker with the same conversational fluency as the target. That’s why Sade exists.

The human stays in the loop, but at a different level

Our team no longer spends time writing reconnaissance scripts or manually testing for XSS on every endpoint. Instead, they review agent findings, assess business impact, design new attack strategies, and make risk decisions. The agents handle execution. Humans handle judgment.

Secure the security agent

An AI pentester with shell access and API credentials is itself a high-value target. We learned early that capability isolation and least-privilege enforcement for the agents themselves is just as important as what they find in the targets they assess. SafeSkill exists because of that lesson.

What comes next

We’re extending the system in three directions:

Internal network pentesting. HAL currently focuses on web and API targets. We’re expanding to include internal network scanning, looking for vulnerable endpoints, misconfigured services, and exposed internal tools across the office network. The goal is continuous autonomous assessment, not periodic point-in-time scans.

Automated re-testing on change. When a developer pushes a fix for a finding HAL reported, the system should automatically re-test to confirm the fix is effective. We’re building this loop to close the gap between identification and verification and to prevent the silent regression where a fix works today but a related change breaks it next week.

Cross-agent intelligence sharing. Today, John’s findings feed into HAL, and HAL passes context to Sade. Tomorrow, Harry’s HackerOne triage should automatically generate test cases for HAL. A pattern Harry sees across multiple external reports should become a skill HAL uses proactively. The swarm should get smarter as a collective, not just as individual agents.

Final takeaway

AI agent security is becoming a core part of modern offensive security. As more companies deploy LLM-based agents across customer support, finance, compliance, and operations, the attack surface changes. So must the security model.

At Deriv, that has meant building a multi-agent offensive security system that can review code, validate findings in live systems, test AI agents adversarially, and triage external reports in one continuous workflow.

Autonomous offensive security is no longer just a research concept. It is already finding real vulnerabilities in production-like environments.

Rajesh TV is a Senior Security Manager at Deriv.

Follow our official LinkedIn page for company updates and upcoming events.

Explore Deriv careers.

A guest post by

Rajesh TV