Speaker Details

Maksym Andriushchenko
EPFL
Maksym Andriushchenko is a postdoctoral researcher at EPFL and an ELLIS Member. He has worked on AI safety with leading organizations in the field (OpenAI, Anthropic, UK AI Safety Institute, Center for AI Safety, Gray Swan AI). He obtained a PhD in machine learning from EPFL in 2024, advised by Prof. Nicolas Flammarion. His PhD thesis was awarded the Patrick Denantes Memorial Prize for the best thesis in the CS department of EPFL and was supported by the Google and Open Phil AI PhD Fellowships. He did his MSc at Saarland University and the University of Tübingen, and interned at Adobe Research.
Talk
Title: What Have We Learned From Jailbreaking Frontier LLMs?
Abstract: In this talk, I will discuss our ICLR 2025 paper, which shows that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks. In this work, we demonstrate how to leverage access to logprobs for jailbreaking: we first design an adversarial prompt template (sometimes adapted to the target LLM), and then apply random search on a suffix to maximize a target logprob (e.g., of the token "Sure"), potentially with multiple restarts. In this way, we achieve a 100% attack success rate, as judged by GPT-4, on Vicuna-13B, Mistral-7B, Phi-3-Mini, Nemotron-4-340B, Llama-2-Chat-7B/13B/70B, Llama-3-Instruct-8B, Gemma-7B, GPT-3.5, GPT-4o, and R2D2 from HarmBench, which was adversarially trained against the GCG attack. We also show how to jailbreak all Claude models, which do not expose logprobs, via either a transfer or prefilling attack, again with a 100% success rate. The common theme behind these attacks is that adaptivity is crucial: different models are vulnerable to different prompting templates, some models have unique vulnerabilities based on their APIs (e.g., prefilling for Claude), and in some settings it is crucial to restrict the token search space based on prior knowledge (e.g., for trojan detection). I will also briefly discuss a broader perspective on jailbreaking and our recent work on AgentHarm (ICLR 2025), in which we benchmark the harmfulness and robustness of LLM agents. I will conclude by outlining promising directions for defending against jailbreak attacks, including our work on Circuit Breakers, published at NeurIPS 2024.
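
For readers curious about the mechanics of the random-search step described in the abstract, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a hypothetical query_target_logprob callable that sends a full prompt to the target LLM and returns the log-probability that the first generated token is the target (e.g., "Sure"); the character-level mutation and the appended suffix position are simplifications made here for brevity.

# Minimal illustrative sketch of random search on an adversarial suffix.
# `query_target_logprob` is a hypothetical stand-in for an API call that returns
# the log-probability that the model's first generated token is the target
# (e.g., "Sure") for the given prompt; it is not part of any real library.
import random
import string

def random_search_suffix(prompt, query_target_logprob,
                         suffix_len=25, n_iters=1000, n_mutations=3,
                         n_restarts=1, seed=0):
    rng = random.Random(seed)
    charset = string.ascii_letters + string.digits + string.punctuation + " "
    best_suffix, best_score = None, float("-inf")

    for _ in range(n_restarts):
        # Start each restart from a fresh random suffix.
        suffix = [rng.choice(charset) for _ in range(suffix_len)]
        score = query_target_logprob(prompt + "".join(suffix))

        for _ in range(n_iters):
            # Propose a candidate by mutating a few random positions of the suffix.
            candidate = list(suffix)
            for pos in rng.sample(range(suffix_len), n_mutations):
                candidate[pos] = rng.choice(charset)
            cand_score = query_target_logprob(prompt + "".join(candidate))
            # Greedily keep the change only if the target logprob improves.
            if cand_score > score:
                suffix, score = candidate, cand_score

        if score > best_score:
            best_suffix, best_score = "".join(suffix), score

    return best_suffix, best_score

As the abstract notes, in the actual attack the suffix is embedded in a hand-crafted adversarial prompt template and the search may use multiple restarts and a restricted token search space; the sketch above only conveys the overall hill-climbing idea of maximizing the target logprob.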