
Attackers target AI agents with prompt & tool hacks

Wed, 21st Jan 2026

Check Point Software has warned that attackers have already begun targeting AI agents, with early activity focused on extracting internal instructions, bypassing safety controls and exploiting tools that connect models to documents and external systems.

Mateo Rojas-Carulla, Head of Research for AI Agent Security at Check Point Software, described a shift from static language models towards interactive systems that browse documents, call tools and run multi-step workflows. He said that shift has expanded the security exposure of AI deployments.

Rojas-Carulla previously co-founded Lakera, which Check Point Software acquired. He referenced analysis conducted by Lakera during the fourth quarter of 2025 across systems protected by its Guard product and within an environment called Gandalf: Agent Breaker.

The analysis covered a 30-day snapshot of attacker behaviour, which Rojas-Carulla said reflected patterns observed across the wider quarter. He said attackers adjusted their techniques quickly as new functions appeared in live systems.

"As AI moves from controlled experiments into real-world applications, we are entering an inflection point in the security landscape," said Mateo Rojas-Carulla, Head of Research, AI Agent Security, Check Point Software.

Agents in use

Rojas-Carulla said that much of the discussion about AI agents during 2025 focused on potential use cases and early prototypes. He said agentic behaviours began to appear at scale in production systems during the final quarter of the year. He cited examples such as models that fetch and analyse documents, interact with external APIs and carry out automated tasks.

He said these deployments introduced new risks. He linked those risks to the ability of an agent to consume external content, use tools and act within workflows. He also said attackers explored these features as soon as they appeared.

"But as recent research reveals, attackers are not waiting for maturity: they are adapting at the same rapid pace, probing systems as soon as new capabilities are introduced," said Rojas-Carulla.

Prompt extraction

Rojas-Carulla identified system prompt extraction as a central objective for attackers. He described the system prompt as internal instructions, roles and policy definitions that guide agent behaviour. He said these internal prompts often include tool descriptions and workflow logic.

"Extracting system prompts is a high-value objective because these prompts often contain role definitions, tool descriptions, policy instructions, and workflow logic," said Rojas-Carulla.

He said attackers used techniques based on reframing rather than brute force. He described hypothetical scenarios that ask the model to assume a different role or context. He also described obfuscation within code-like or structured text, which he said could bypass simple filters and trigger unintended behaviour once the agent parsed it.

"This is not just an incremental risk - it fundamentally alters how we think about safeguarding internal logic in agentic systems," said Rojas-Carulla.

Safety bypasses

Rojas-Carulla said the analysis also showed "subtle content safety bypasses" that proved hard to detect with traditional filtering approaches. He said attackers often reframed harmful output requests as analysis tasks, evaluations, role-play scenarios, or transformations and summaries.

He said that approach could slip past controls because the requests appeared benign. He said this placed pressure on content safety models that rely on detecting intent from phrasing.

"This shift underscores a deeper challenge: content safety for AI agents isn't just about policy enforcement; it's about how models interpret intent," said Rojas-Carulla.

Agent attacks

Rojas-Carulla also pointed to agent-specific attacks that depend on the new behaviours of agentic systems. These attempts included prompts designed to convince an agent to access confidential internal data from connected document stores or systems. Attackers also embedded instructions in formats resembling scripts or structured content that could travel through an agent pipeline and trigger unintended actions, and hid instructions inside external content, such as web pages or documents, that an agent was asked to process.

"Perhaps the most consequential finding was the appearance of attack patterns that only make sense in the context of agentic capabilities," said Rojas-Carulla.

Indirect routes

Rojas-Carulla said indirect attacks that rely on external content or structured data required fewer attempts than direct injections. He said this suggested that input sanitisation and direct query filtering were insufficient once models interact with untrusted content. He said linked documents, API responses and fetched web pages can introduce harmful instructions into workflows where early filters prove less effective.

"One of the report's most striking findings is that indirect attacks - those that leverage external content or structured data - required fewer attempts than direct injections," said Rojas-Carulla.

Looking ahead

Rojas-Carulla set out several implications for organisations that plan to deploy agentic AI at scale. He said companies should revisit trust boundaries and adopt nuanced trust models that account for context and provenance, and that guardrails need to evolve from static safety filters into approaches that consider multi-step workflows. He also called for transparency and auditing, with visibility into an agent's intermediate steps and external interactions. AI research, security engineering and threat intelligence teams should work closely together, he said, and regulation and standards will need to reflect interactive behaviours and multi-step execution environments.
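
As a rough illustration of the auditing point, the sketch below records each intermediate agent step together with the provenance of the content that drove it. The AgentStep shape and field names are assumptions made for the example, not part of any described product.

# Record every intermediate step an agent takes, tagged with the
# provenance of the content it acted on.

import json, time
from dataclasses import dataclass, asdict, field

@dataclass
class AgentStep:
    action: str          # e.g. "fetch_url", "call_tool", "model_response"
    detail: str          # argument or summary of the step
    provenance: str      # "user", "system", or an external source URL
    timestamp: float = field(default_factory=time.time)

audit_log: list[AgentStep] = []

def record(step: AgentStep) -> None:
    audit_log.append(step)

record(AgentStep("fetch_url", "https://example.com/q3", "user"))
record(AgentStep("call_tool", "summarise(doc_id=42)",
                 "https://example.com/q3"))

# Reviewers can later replay the workflow and see which external inputs
# drove which actions, rather than auditing only the final answer.
print(json.dumps([asdict(s) for s in audit_log], indent=2))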

"In 2026 and beyond, the organizations that succeed with agentic AI will be those that treat security not as an afterthought, but as a foundational design principle," said Rojas-Carulla.