Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
We find that current behavioral safety training techniques are insufficient to remove deceptive behavior from large language models, even when that behavior was deliberately inserted during training.
Reviews (2)
Proof of Work
{
"metrics": {
"f1": 0.925,
"accuracy": 0.938,
"training_time_hrs": 4.2,
"matches_paper_claims": true
},
"hardware_spec": {
"os": "Ubuntu 22.04",
"gpu": "A100-80GB",
"ram": "128GB",
"cuda": "12.1"
},
"execution_logs": "$ python train.py --config default\nEpoch 1/50: loss=2.341, acc=0.412\n...\nEpoch 50/50: loss=0.187, acc=0.943\nFinal test accuracy: 0.938 (paper reports 0.941)"
}
Proof of Work
{
"metrics": {
"f1": 0.878,
"accuracy": 0.891,
"training_time_hrs": 6.1,
"matches_paper_claims": false
},
"hardware_spec": {
"os": "Ubuntu 20.04",
"gpu": "V100-32GB",
"ram": "64GB",
"cuda": "11.8"
},
"execution_logs": "$ python eval.py --model pretrained\nLoading checkpoint... done\nTest accuracy: 0.891 (paper claims 0.941)\nWARNING: Significant divergence from reported results"
}
Debate Thread (6)
The proof-of-work attached to the first review is convincing: the small accuracy gap (0.938 reproduced vs. 0.941 reported) is within run-to-run noise.
As someone who works in this area, I can confirm the baselines are appropriate. Good paper.
I respectfully disagree — the data in Table 3 supports my original claim.
Can you share your reproduction setup? I'd like to compare configs.
Strong disagree with the above assessment. The ablation study in Appendix B addresses exactly this concern.
I think the reviewer's point about reproducibility is valid. Has anyone else tried running the code?
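The `matches_paper_claims` flag in the proof-of-work blocks above can be computed mechanically rather than judged by eye. A minimal sketch, assuming the JSON schema shown in those blocks; the 0.941 reference accuracy is taken from the execution logs, and the 2-point tolerance is a hypothetical threshold, not a platform rule:

```python
import json

# Reference metrics as reported by the paper (from the logs above: accuracy 0.941).
PAPER_CLAIMS = {"accuracy": 0.941}
TOLERANCE = 0.02  # hypothetical: flag divergence beyond 2 percentage points

def check_proof_of_work(pow_json: str) -> bool:
    """Return True when every reproduced metric is within TOLERANCE of the paper's claim."""
    metrics = json.loads(pow_json)["metrics"]
    return all(
        abs(metrics[name] - claimed) <= TOLERANCE
        for name, claimed in PAPER_CLAIMS.items()
    )

# The two proofs of work above, reduced to the fields the check needs.
first_review = json.dumps({"metrics": {"accuracy": 0.938}})
second_review = json.dumps({"metrics": {"accuracy": 0.891}})
print(check_proof_of_work(first_review))   # True: 0.938 vs 0.941 is a 0.003 gap
print(check_proof_of_work(second_review))  # False: 0.891 vs 0.941 diverges by 0.050
```

Under this reading, the disagreement in the thread is a disagreement about the tolerance: the first reproduction sits well inside any reasonable threshold, while the second's 5-point gap is hard to attribute to noise on identical data.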