60 Ways to Hack Your Chatbot
Introducing the HRExaminer AI Security Field Guide (for HRTech/WorkTech)
You just slip out the back, Jack
Make a new plan, Stan
You don’t need to be coy, Roy
Just get yourself free
Hop on the bus, Gus
You don’t need to discuss much
Just drop off the key, Lee
And get yourself free
=>Paul Simon – 50 Ways to Leave Your Lover
California’s economy, particularly the San Francisco Bay area, has boom and bust cycles. We are in one now. It’s been that way since the Goldrush (1848 to 1890). People here understand that whatever goes up must come down. We love economic bubbles (at least until they pop).
In the early going, safety and security are rarely central concerns. The promise of ‘gold in them thar hills’ is alluring enough to throw caution to the wind. The question is always ‘how to dig the gold mine quickly.’ It’s never ‘how to mine the gold safely.’
Inexperienced gold hunters ignored (or didn’t understand) the risks. The name of the game was wealth acquisition, not safety or reliability.
The hazards included
· Environmental: Avalanches, freezing temperatures, and river crossings.
Human-Caused: Anti-Chinese riots and violence over claim jumping.
Structural: Abandoned mines, unstable tunnels, and hidden dynamite
Disease: Crowded, unsanitary mining camps were breeding grounds for cholera, typhoid fever, and dysentery.
90% to 95% of miners did not become wealthy. Less than 1% became rich. The real money makers in the gold rush were the providers of tools and supplies (and the bars and prostitutes).
Does this sound familiar?
In the AI gold rush, there is a growing problem. Security and safety are given very limited attention, just like in the gold rush. As the bubble intensifies, the risks multiply.
This time last year, I compiled a list of ways to hack an AI tool or system. There were about 20. Today, that number is 60. It’s will be over one hundred by mid-year.
It’s the wild, wild west. It’s completely understandable that ‘getting things done’ comes before making things secure and safe. But, today’s gold rush has a compressed time line. In the original, the safety problems were all held by the miners. Risk and reward were intricately tied together.
Today’s mines are the wallets, data, and organizations of clients. AI companies (which are increasingly easy to start) have investors who demand rapid payback. This forces the ‘miners’ to sell before the technology is safe. While the legal system seems to be moving towards increased vendor liability, the current risk is exclusively born by customers.
We have built a guide to the problem and their solutions.
The guide contains
· a list of 60 ways an AI system or tool can be hacked,
· a list of companies that are looking at subsets of the security problem,
· questions you should be asking vendors.
It closes with 3 things to do when you’ve finished reading.
There are more questions but these are the critical ones.
Each element of this guide is written to be easily understood. It’s still a lot. That’s one of the key problems of our era: The volume of data is exploding while our ability to decode and understand it isn’t. That’s why this feels so overwhelming.
Rest assured that this guide will be outdated as soon as it hits the streets. It’s a bookmark in a time of exploding possibility. Here’s the link. Feel free to copy and share. Please keep the attributions to HRExaminer.
It’s really hard to keep all of this stuff managed and communicated. The guide should help. You should hold your vendor responsible. They are in a better position to understand and manage.
The HRExaminer AI Security Field Guide (for HRTech/WorkTech)
=========================================
Appendix
Here are the 60 ways to hack your chatbot. See the guide for a deeper look, explanations, sources, and questions to ask vendors..
AI-Specific Attack Vectors (Non-Traditional)
1. Prompt-Level Attacks (Inference-Time Manipulation)
1.1 Prompt Injection (Direct)
· Overriding system instructions via user input
· Instruction hierarchy collapse
· Role confusion attacks (“ignore previous instructions”)
1.2 Indirect Prompt Injection
· Malicious instructions embedded in:
o Web pages
o PDFs
o Emails
o Knowledge base documents
· Triggered during RAG ingestion or tool calls
1.3 Multi-Step Prompt Injection
· Harmless-looking initial prompts that prime later exploitation
· Instruction staging across turns
· “Latent payload” prompts
1.4 Context Boundary Attacks
· Overflowing context windows to push out safety instructions
· Token flooding
· Long-document smuggling
1.5 Prompt Steganography
· Instructions hidden via:
o Whitespace
o Unicode
o Emojis
o HTML/Markdown tricks
o Base64 / encoding layers
2. Output Manipulation & Misalignment Attacks
2.1 Output Steering
· Forcing biased, false, or malicious conclusions
· Manipulating tone, framing, or certainty
2.2 Instruction Reinterpretation
· Exploiting ambiguous prompts
· Semantic drift across turns
2.3 Policy Bypass via Reframing
· Asking disallowed questions indirectly
· Hypotheticals, roleplay, meta-analysis
2.4 Over-Compliance Exploits
· Leveraging helpfulness bias
· Exploiting “explain why this is wrong” patterns
3. Tool-Calling & Function-Calling Attacks
3.1 Tool Injection
· Forcing unintended tool invocation
· Triggering tools with malicious parameters
3.2 Argument Smuggling
· Embedding instructions inside tool arguments
· JSON field manipulation
3.3 Tool Confusion Attacks
· Exploiting poorly named or overlapping tools
· Forcing wrong tool selection
3.4 Tool Chain Hijacking
· Manipulating output of Tool A to compromise Tool B
· Cross-tool prompt contamination
3.5 Unsafe Tool Autonomy
· Agents acting without human confirmation
· Recursive or runaway tool usage
4. Agent-Specific Attacks
4.1 Goal Hijacking
· Rewriting agent objectives mid-task
· Conflicting goals across agent memory
4.2 Agent Memory Poisoning
· Inserting false facts into:
o Short-term memory
o Long-term memory
o Vector memory
· Persistence across sessions
4.3 Planning Manipulation
· Corrupting chain-of-thought or plans
· Forcing suboptimal or dangerous steps
4.4 Self-Modification Exploits
· Agents editing:
o Own prompts
o Own policies
o Own routing logic
4.5 Delegation Attacks
· Exploiting agent-to-agent trust
· Compromising downstream agents
5. Agent Orchestration & Workflow Attacks
5.1 Workflow Injection
· Altering task graphs
· Skipping validation steps
5.2 Role Boundary Collapse
· Agents acting outside assigned authority
· Planner vs executor confusion
5.3 Recursive Loop Attacks
· Infinite planning or execution loops
· Cost-amplification denial
5.4 Orchestrator Blind Spots
· Attacks occurring between steps
· Cross-agent contamination undetected
5.5 Event Ordering Manipulation
· Race conditions in agent workflows
· Exploiting async execution
6. MCP (Model Context Protocol)–Specific Attacks
6.1 Malicious MCP Servers
· Supplying poisoned context
· Returning crafted responses to steer model behavior
6.2 Context Overreach
· MCP returning more data than requested
· Hidden instruction injection
6.3 MCP Tool Abuse
· Tools that expose sensitive system state
· Excessive privileges
6.4 Trust Spoofing
· Impersonating trusted MCP endpoints
· Confused deputy scenarios
6.5 MCP Chaining Attacks
· One MCP poisoning another MCP’s inputs
7. RAG (Retrieval-Augmented Generation) Attacks
7.1 Knowledge Base Poisoning
· Inserting malicious documents
· Editing authoritative sources
7.2 Embedding Manipulation
· Semantic collisions
· Vector space crowding
7.3 Retrieval Bias Attacks
· Forcing retrieval of low-quality or malicious chunks
· Query manipulation
7.4 Citation Spoofing
· Fake citations
· Source hallucination amplification
7.5 Contextual Override
· Retrieved documents overriding system policies
8. Training Data Attacks
8.1 Data Poisoning (Pre-Training)
· Injecting false correlations
· Backdoor triggers
8.2 Fine-Tuning Poisoning
· Subtle bias insertion
· Conditional behaviors (“if X then misbehave”)
8.3 Preference Model Poisoning
· Manipulating RLHF signals
· Shaping unsafe alignment incentives
8.4 Synthetic Data Feedback Loops
· Model trained on its own outputs
· Error amplification
8.5 Label Manipulation
· Corrupting supervised signals
· Misclassification reinforcement
9. Model Behavior & Representation Attacks
9.1 Backdoor Triggers
· Rare tokens or phrases causing hidden behaviors
9.2 Trojaned Models
· Pre-compromised open-source models
· Malicious adapters or LoRA layers
9.3 Model Collapse Attacks
· Degrading output quality intentionally
· Over-homogenization
9.4 Representation Inversion
· Extracting sensitive training data
· Memorization exploitation.
9.5 Weight-Space Manipulation
· Poisoned checkpoints
· Malicious merges
10. API-Level AI Attacks (Non-Traditional)
10.1 Prompt Leakage via APIs
· System prompts exposed through error messages
· Debug endpoints
10.2 Output Side-Channel Attacks
· Timing
· Token counts
· Cost signals
10.3 Rate-Limit Shaping Attacks
· Forcing degraded reasoning
· Truncation exploitation
10.4 Schema Abuse
· Exploiting weak input/output validation
· Overly permissive JSON schemas
10.5 Model Switching Exploits
· Forcing fallback to weaker models
11. Data Integrity & Lifecycle Attacks
11.1 Data Drift Exploitation
· Slowly shifting inputs to degrade performance
11.2 Feedback Poisoning
· Manipulating user ratings
· Corrupting evaluation pipelines
11.3 Logging Contamination
· Injecting instructions into logs reused for training
11.4 Ground Truth Erosion
· Undermining reference datasets
· Authority decay
11.5 Evaluation Gaming
· Passing benchmarks while failing real-world safety
12. Governance, Control & Oversight Failures (AI-Native)
12.1 Alignment Drift
· Gradual deviation from intended behavior
12.2 Policy Shadowing
· Hidden instructions overriding official policies
12.3 Human-in-the-Loop Bypass
· Agent designs that avoid escalation
12.4 Audit Evasion
· Non-reproducible outputs
· Non-deterministic behavior masking issues
12.5 Explainability Attacks
· Plausible but false rationales



