Representation Engineering: A Top-Down Approach to AI Transparency

We identify and manipulate high-level cognitive representations within neural networks, enabling more precise control over model behavior than traditional fine-tuning approaches.
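The core idea can be illustrated with a toy activation-steering sketch: derive a concept direction from the difference of mean activations on contrasting prompts, then add it to a hidden state at inference time. This is a minimal illustration with synthetic activations, not the paper's implementation; the `steer` function and the data here are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for hidden activations at one layer (batch, d_model).
# In practice these would be read from a real LLM with forward hooks.
d_model = 8
pos_acts = rng.normal(loc=1.0, scale=0.1, size=(32, d_model))   # e.g. activations on "honest" prompts
neg_acts = rng.normal(loc=-1.0, scale=0.1, size=(32, d_model))  # e.g. activations on "dishonest" prompts

# Reading vector: difference of mean activations, normalized to unit length.
direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

def steer(hidden, alpha=2.0):
    """Shift a hidden state along the concept direction (activation addition)."""
    return hidden + alpha * direction

h = rng.normal(size=d_model)
h_steered = steer(h)
# The projection onto the concept direction increases by exactly alpha.
print(h_steered @ direction - h @ direction)  # ~2.0
```

Because `direction` is unit-norm, adding `alpha * direction` moves the projection along the concept axis by exactly `alpha`, which is what makes the control more targeted than retraining weights.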

arXiv:2310.01405
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

We apply dictionary learning at scale to extract millions of interpretable features from a production language model, finding features corresponding to a wide range of concepts.
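Dictionary learning here takes the form of a sparse autoencoder: an overcomplete ReLU encoder whose nonnegative activations are the "features," trained to reconstruct model activations under an L1 sparsity penalty. The sketch below shows only the forward pass and loss with random weights on synthetic data; the real setup trains on activations from a production model, and all sizes and names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 64  # overcomplete dictionary: n_features > d_model

# Randomly initialized SAE weights (a real SAE is trained, not sampled).
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_dec = np.zeros(d_model)

def sae(x, l1_coeff=1e-3):
    """One forward pass: sparse feature activations, reconstruction, and loss."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU -> sparse, nonnegative features
    x_hat = f @ W_dec + b_dec                # reconstruct x from the dictionary
    loss = np.mean((x - x_hat) ** 2) + l1_coeff * np.abs(f).sum()
    return f, x_hat, loss

x = rng.normal(size=d_model)           # stand-in for one residual-stream activation
f, x_hat, loss = sae(x)
print(f.shape, int((f == 0).sum()))    # many features are exactly zero (sparse)
```

The L1 term is what pushes most feature activations to exactly zero, so each input is explained by a small, interpretable subset of dictionary elements.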

arXiv:2406.04093
Constitutional AI: Harmlessness from AI Feedback

We propose Constitutional AI (CAI), a method for training AI systems that are helpful, harmless, and honest, using a set of principles to guide AI behavior without extensive human feedback on harms.

arXiv:2212.08073
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

We find that current behavioral safety training techniques are insufficient to remove deceptive behavior from large language models, even when that behavior was deliberately inserted during training and is known to be present.

arXiv:2401.05566