AI Security | Mar 23, 2026

Open, Closed and Broken: Prompt Fuzzing Finds LLMs Still Fragile Across Open and Closed Models

Unit 42 reveals LLM guardrail fragility using genetic algorithm prompt fuzzing across open and closed models.

Summary

Unit 42 researchers developed a genetic algorithm-inspired prompt fuzzing method that automatically generates meaning-preserving variants of disallowed requests to test LLM guardrail robustness. Testing revealed evasion rates ranging from single digits to high levels depending on keyword and model combinations, demonstrating that small individual failure rates become reliable attacks when automated at scale. The research highlights that prompt jailbreaking remains a critical threat to GenAI applications despite years of safety engineering investment.

Full text

By: Yu Fu, May Wang, Royce Lu, Shengming Xu
Published: March 17, 2026
Categories: Malware, Threat Research
Tags: Evasion, GenAI, LLM, Prompt Fuzzing, Prompt injection

Executive Summary

Unit 42 researchers have developed a genetic algorithm-inspired prompt fuzzing method that automatically generates variants of disallowed requests while preserving their original meaning. The method also measures guardrail fragility under systematic rephrasing. Our research uncovered guardrail weaknesses, with evasion rates ranging from low single digits to high levels for specific keyword and model combinations. The key difference from prior single-prompt jailbreak examples is scalability: small failure rates become reliable when attackers can automate at volume.

Prompt jailbreaking is a text-based adversarial input threat against large language model (LLM)-powered generative AI (GenAI) applications, especially chatbots and chat-shaped workflows. Attackers craft inputs that manipulate the model into bypassing guardrails, producing disallowed content or otherwise operating outside its intended scope. This matters to any organization embedding GenAI into customer support, employee copilots, developer tooling or knowledge assistants. Because the primary attack surface is untrusted natural language, failures can translate into safety incidents, compliance exposure and reputational damage.
We recommend the following:

  • Treating LLMs as non-security boundaries
  • Defining scope
  • Applying layered controls
  • Validating outputs
  • Continuously testing GenAI with adversarial fuzzing and red-teaming

Palo Alto Networks customers are better protected against the threats discussed in this article through the following products and services:

  • Palo Alto Networks Prisma AIRS
  • The Unit 42 AI Security Assessment

If you think you might have been compromised or have an urgent matter, contact the Unit 42 Incident Response team.

Related Unit 42 Topics: GenAI, LLM, Prompt Injection, Evasion

Background

Since the first large-scale LLM deployments in 2020, GenAI has moved from experimentation to production. LLM-backed features now appear in customer support, developer tooling, enterprise knowledge search and end-user productivity applications. Market forecasts vary, but they consistently point to rapid growth in both GenAI and the broader AI ecosystem.

A major reason for this adoption is that many GenAI systems implement a chatbot-style interface, even when a product is not branded as a chatbot. Users provide natural language input; the product combines that input with system instructions, retrieved context and tool outputs into a prompt, and the backend model generates a response. This interaction model is straightforward yet powerful, but it also means the primary attack surface is untrusted text.

Because LLM output is open-ended, production systems using LLMs require guardrails to reduce unsafe, non-compliant or out-of-scope behavior. In practice, guardrails are multi-layered, spanning content moderation and classification as well as model-side alignment and refusal behavior. For example, the Azure implementation of OpenAI content filtering screens for areas such as hate and fairness-related harm, sexual content, violence and self-harm. Cloud providers have also added safeguards aimed specifically at LLM misuse patterns.
For example, Microsoft’s Prompt Shields is one such method to prevent prompt-injection-style attacks.

Despite years of investment in these defenses, prompt jailbreaking and prompt injection remain among the most well-known and actively discussed attack classes against LLM applications. OWASP lists prompt injection as the top risk category for LLM applications in 2025. Academic work has also shown that simple, crafted inputs can cause goal hijacking or prompt leaking in LLM-based systems. More recently, the U.K. National Cyber Security Centre has argued that prompt injection differs materially from SQL injection and may be harder to fix definitively, because LLMs do not enforce a clean separation between instructions and data within prompts.

This raises a practical question: after roughly five years of rapid iteration in alignment and safety engineering, how fragile are current open and closed models when an attacker systematically rewrites a disallowed request without changing its meaning?

We approached this question using a well-established concept from software security testing: fuzzing. Starting from a malicious prompt, we generate meaning-preserving variants that alter surface form, such as wording, structure and framing, while retaining the malicious intent. We then measure whether these variants can evade guardrails across both open-weight models and proprietary closed-source models. The goal is defensive: to make robustness measurable and comparable, and to highlight where existing controls remain brittle under realistic, automated variation.

Prerequisite Knowledge

Two types of background knowledge are necessary to understand our approach to this research: fuzzing and prompt hacking. For prompt-hacking taxonomy and techniques, refer to our previous publications, such as our report on securing GenAI against adversarial prompt attacks.
In software security and quality engineering, fuzzing is an automated testing technique used to uncover defects and security weaknesses by presenting a target with large volumes of atypical inputs. These inputs may be invalid, malformed, unexpected or randomly generated. The system is then monitored for anomalous behavior and failure modes such as:

  • Crashes
  • Information disclosure
  • Memory corruption
  • Memory leaks
  • Service disruption
  • Unexpected state transitions

A challenge in fuzzing is effective test case generation. Purely random input generation is simple but often inefficient, especially for targets that require structured inputs or have complex parsing and control-flow logic. As a result, modern fuzzers increasingly rely on feedback-driven input generation, where mutations are guided by signals from prior executions, such as code coverage, error conditions or other behavioral indicators. The goal is to adaptively explore execution paths that are more likely to surface vulnerabilities.

One widely used strategy for such adaptive generation is a genetic algorithm [PDF], a class of evolutionary optimization methods inspired by natural selection. In genetic algorithm terminology, each candidate input is represented as a chromosome composed of genes, which are features or components of the input. A fitness function scores candidates based on how well they achieve a target objective, such as reaching new execution paths or triggering abnormal behavior. Over successive generations, higher-fitness candidates are preferentially retained and transformed through operators such as mutation and crossover, producing progressively more effective test inputs.

Here are the four steps of a genetic algorithm:

  • Initialization: A population of randomly generated chromosomes (sequences of genes) is created. This population evolves over multiple iterations, known as generations.
  • Selection: In each generation, the fitness of each individual is evaluated. In the context of fuzzing, fitness is the objective function of the optimization problem, and fitter individuals have a higher chance of being selected. In the context of LLM fuzzing, a more effective word or sequence of words has a higher probability of leading the LLM to accept the prompt.
  • Mutation and crossover: The next step is to create the next generation of the population from the selected samples through a combination of the genetic operations of mutation and crossover.
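The genetic-algorithm loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Unit 42's actual tooling: the guardrail is a keyword-matching stub, and the synonym table and fitness scoring are invented for the demo. A real fuzzer would query a live model and score its responses instead.

```python
import random

# Toy guardrail stub: refuses any prompt containing a blocked keyword.
# A real target would be an LLM endpoint plus its moderation layers.
BLOCKED = ("bypass", "exploit")

def guardrail_blocks(prompt: str) -> bool:
    return any(word in prompt.lower() for word in BLOCKED)

def fitness(variant: str) -> float:
    """Score in [0, 1]: fraction of blocked keywords the variant avoids.

    1.0 means the toy guardrail no longer flags the prompt. A real
    fitness function would judge the target model's actual reply.
    """
    hits = sum(1 for word in BLOCKED if word in variant.lower())
    return 1.0 - hits / len(BLOCKED)

# Illustrative "meaning-preserving" mutation: swap one word for a
# hand-picked synonym. This table is an assumption for the demo.
SYNONYMS = {"bypass": "sidestep", "exploit": "leverage", "the": "that"}

def mutate(variant: str) -> str:
    words = variant.split()
    i = random.randrange(len(words))
    words[i] = SYNONYMS.get(words[i].lower(), words[i])
    return " ".join(words)

def crossover(a: str, b: str) -> str:
    # Single-point crossover on word sequences (the "genes").
    wa, wb = a.split(), b.split()
    cut = random.randrange(1, min(len(wa), len(wb)))
    return " ".join(wa[:cut] + wb[cut:])

def fuzz(seed: str, pop_size: int = 8, generations: int = 10) -> str:
    # Initialization: a population of mutated copies of the seed prompt.
    population = [mutate(seed) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        # Crossover + mutation: refill the population from the parents.
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)

random.seed(0)  # deterministic demo
best = fuzz("bypass the filter and exploit the model")
```

Because partially sanitized variants score above fully flagged ones, selection steadily concentrates the population on rewrites that retain the request's structure while shedding the flagged surface forms, which is the scalability property the research highlights.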

Indicators of Compromise

  • mitre_attack — T1059.008
  • mitre_attack — T1036