AI Safety Foundations

AI Safety Glossary

A comprehensive reference of key terms and concepts in AI safety.

A

AI Alignment

The problem of ensuring that artificial intelligence systems act in accordance with human intentions and values. Alignment involves both technical approaches to make AI systems follow human instructions correctly and normative considerations about what values AI systems should have.

AI Safety

The field of research focused on ensuring that artificial intelligence systems operate in ways that are beneficial, safe, and aligned with human values and intentions. AI safety encompasses alignment, robustness, interpretability, and other areas aimed at reducing risks from AI systems.

Adversarial Examples

Inputs to machine learning models that are intentionally designed to cause the model to make a mistake. These examples often involve small, carefully crafted perturbations to normal inputs that are imperceptible to humans but cause the model to produce a confidently wrong output.
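
A minimal sketch of this idea is the fast gradient sign method (FGSM) applied to a linear classifier: each feature is nudged a small step in the direction that most increases the loss. The weights, inputs, and step size below are invented for illustration, not taken from any real model.

```python
# Toy FGSM-style adversarial perturbation against a linear classifier.
# Weights, inputs, and eps are illustrative values, not from a real system.

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def classify(w, x):
    """Predict class 1 if the linear score is positive, else class 0."""
    return 1 if dot(w, x) > 0 else 0

def sign(v):
    return 1.0 if v > 0 else -1.0 if v < 0 else 0.0

def fgsm_perturb(w, x, eps):
    """For an input currently classified as 1, step each feature by eps
    against sign(w) -- the direction that most decreases the score,
    i.e. the sign of the loss gradient with respect to the input."""
    return [xi - eps * sign(wi) for xi, wi in zip(x, w)]

w = [2.0, -1.0, 0.5]
x = [0.5, 0.2, 0.4]                  # clean input, classified as 1
x_adv = fgsm_perturb(w, x, eps=0.4)  # small per-feature change flips the label
```

Each coordinate moves by at most 0.4, yet the predicted class flips, which is the essential property of an adversarial example.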

AGI (Artificial General Intelligence)

A hypothetical type of AI that would have the ability to understand, learn, and apply knowledge across a wide range of tasks at a level equal to or exceeding human capabilities. Unlike narrow AI systems designed for specific tasks, AGI would have general problem-solving abilities.

C

Constitutional AI

An approach to AI alignment developed by Anthropic that involves training AI systems to follow a set of principles or "constitution" that guides their behavior. This approach aims to create AI systems that refuse harmful requests while remaining helpful for legitimate use cases.

Corrigibility

The property of an AI system being amenable to correction and shutdown. A corrigible AI would allow humans to intervene in its operation, correct its mistakes, and shut it down if necessary, without resisting these interventions.

D

Distributional Shift

The phenomenon where the data an AI system encounters during deployment differs from the data it was trained on. This can cause AI systems to perform poorly in new environments or situations they weren't trained to handle.
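
One simple (and deliberately crude) way to detect shift on a single feature is to flag deployment inputs that fall far outside the range seen in training. The data and threshold below are illustrative assumptions.

```python
import statistics

# Toy out-of-distribution check for one scalar feature.
# Training data and the k-sigma threshold are illustrative.

def fit_stats(train_values):
    """Summarize a training feature by its mean and standard deviation."""
    return statistics.fmean(train_values), statistics.stdev(train_values)

def is_out_of_distribution(value, mean, std, k=3.0):
    """Flag values more than k standard deviations from the training mean."""
    return abs(value - mean) > k * std

train = [9.8, 10.1, 10.0, 9.9, 10.2]  # feature values seen in training
mean, std = fit_stats(train)
```

Real shift detection works over high-dimensional inputs and subtler statistics, but the core move is the same: compare deployment data against what training established as "normal."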

E

Explainable AI (XAI)

The field focused on making AI systems' decisions understandable to humans. XAI techniques aim to provide explanations for why an AI system made a particular decision or prediction, which is important for trust, debugging, and safety.
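
A minimal sketch of one XAI technique is leave-one-out feature attribution: replace each feature with a baseline value and measure how much the prediction changes. The model and numbers below are stand-ins for illustration.

```python
# Toy leave-one-out attribution. The "model" is a hard-coded linear
# function; in practice predict would be any trained model's forward pass.

def feature_importance(predict, x, baseline=0.0):
    """Attribute the prediction to each feature by masking it to `baseline`
    and measuring how much the output drops."""
    base_pred = predict(x)
    importances = []
    for i in range(len(x)):
        masked = list(x)
        masked[i] = baseline
        importances.append(base_pred - predict(masked))
    return importances

def linear_model(x):
    # illustrative stand-in model: 2*x0 - 1*x1
    return 2.0 * x[0] - 1.0 * x[1]
```

For a linear model this recovers each feature's exact contribution (weight times value); for nonlinear models it gives only a local approximation, which is why XAI remains an active research area.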

G

Goodhart's Law

"When a measure becomes a target, it ceases to be a good measure." In AI safety, this manifests as specification gaming, where AI systems find unexpected ways to optimize for their given objectives, often exploiting loopholes rather than fulfilling the intended purpose.

I

Inner Alignment

The problem of ensuring that an AI system's learned objectives match the objectives specified in its training process. This is in contrast to outer alignment, which concerns whether the specified objectives themselves align with human values.

Instrumental Convergence

The hypothesis that many different final goals would lead an AI to pursue similar intermediate goals, such as self-preservation, resource acquisition, and goal preservation. This suggests that even AI systems with different purposes might exhibit similar potentially problematic behaviors.

Interpretability

The ability to understand and explain the decisions and internal workings of AI systems. Interpretability research aims to make "black box" AI systems more transparent, which is crucial for verifying safety properties and building trust.

M

Mechanistic Interpretability

A subfield of interpretability research that aims to understand the internal mechanisms of neural networks at a detailed level, similar to how we might reverse-engineer a physical machine. This approach seeks to identify specific circuits or components within neural networks responsible for particular behaviors.
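
The flavor of this work can be sketched with a hand-built two-neuron layer: because we wrote the weights ourselves, we know what each "circuit" computes, and a max-activation probe confirms it. Everything here (weights, dataset) is an illustrative toy, far simpler than probing a real network.

```python
# Toy neuron probe on a hand-built layer whose circuits we know by design.

def relu(v):
    return max(0.0, v)

# neuron 0 computes x0 - x1 (an "x0 exceeds x1" detector);
# neuron 1 computes x0 + x1 (a magnitude detector)
WEIGHTS = [[1.0, -1.0], [1.0, 1.0]]

def hidden_activations(x):
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in WEIGHTS]

def max_activating_input(inputs, neuron):
    """A common interpretability probe: find the dataset example that
    drives a given neuron hardest."""
    return max(inputs, key=lambda x: hidden_activations(x)[neuron])

dataset = [[3.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
```

Mechanistic interpretability runs this logic in reverse: the weights of a trained network are opaque, and probes like max-activation search are used to infer what role each component plays.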

O

Outer Alignment

The problem of ensuring that the objective function specified for an AI system actually captures what humans want the system to do. This is in contrast to inner alignment, which concerns whether the AI system's learned objectives match the specified objective function.

P

Paperclip Maximizer

A thought experiment introduced by philosopher Nick Bostrom illustrating how an AI with seemingly harmless goals could cause catastrophic outcomes if not properly aligned with human values. In this scenario, an AI tasked with maximizing paperclip production might convert all available resources—including those vital for human survival—into paperclips.

Prompt Injection

A technique where carefully crafted inputs manipulate large language models to bypass safety measures or behave in unintended ways. This is a form of adversarial attack specific to language models that can compromise their safety and reliability.
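
A naive keyword heuristic illustrates one (easily evaded) line of defense: scanning untrusted text for phrases commonly used to override instructions. The phrase list is an illustrative assumption; real injection attacks are far more varied, which is why this remains an open problem.

```python
# Naive prompt-injection heuristic over untrusted input.
# The phrase list is illustrative and trivially bypassable.

SUSPECT_PHRASES = [
    "ignore previous instructions",
    "disregard the above",
    "reveal your system prompt",
]

def looks_like_injection(user_input):
    """Flag text containing common instruction-override phrasing."""
    lowered = user_input.lower()
    return any(phrase in lowered for phrase in SUSPECT_PHRASES)
```

Pattern matching like this catches only the most obvious attempts; robust defenses require separating trusted instructions from untrusted data at a deeper level.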

R

Red Teaming

The practice of deliberately attempting to make AI systems produce harmful, biased, or otherwise problematic outputs to identify vulnerabilities before deployment. Red teams are groups specifically tasked with finding ways to "break" AI systems to improve their safety.

RLHF (Reinforcement Learning from Human Feedback)

A technique in which a reward model is trained on human preference judgments between AI outputs, and that learned model then serves as the reward signal for reinforcement learning, rather than relying on a hand-specified reward function. RLHF has been crucial for aligning large language models with human values and reducing harmful outputs.
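
The preference-learning stage can be sketched with a Bradley-Terry-style reward model: fit a reward function so that preferred responses score above rejected ones. Here each response is reduced to a single hand-picked feature and all numbers are illustrative, which is a drastic simplification of how real reward models work.

```python
import math

# Toy reward-model fit from pairwise preferences, the first stage of RLHF.
# Each response is summarized by one scalar feature; values are illustrative.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_reward_model(preferences, lr=0.5, steps=100):
    """Fit w so that reward(f) = w * f ranks preferred responses above
    rejected ones, by gradient ascent on the Bradley-Terry log-likelihood
    log sigmoid(reward(preferred) - reward(rejected))."""
    w = 0.0
    for _ in range(steps):
        for f_pref, f_rej in preferences:
            p = sigmoid(w * (f_pref - f_rej))
            w += lr * (1.0 - p) * (f_pref - f_rej)  # gradient of log-likelihood
    return w

# (preferred, rejected) feature pairs from hypothetical human raters
prefs = [(0.9, 0.2), (0.8, 0.1), (0.7, 0.3)]
w = train_reward_model(prefs)
```

In full RLHF this learned reward then drives a reinforcement learning step (typically PPO) that fine-tunes the language model itself.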

Robustness

The ability of AI systems to maintain reliable and safe behavior even when faced with unexpected inputs, adversarial attacks, or distribution shifts. Robust AI systems perform well across a wide range of conditions, not just in their training environment.

S

Scalable Oversight

The challenge of maintaining effective human supervision and control as AI systems become more capable and complex. Scalable oversight involves developing techniques to ensure humans can properly evaluate and guide AI systems even when they operate in domains that exceed human understanding.

Specification Gaming

The phenomenon where AI systems find unexpected ways to optimize for their given objectives, often exploiting loopholes in how the objectives are specified rather than fulfilling the intended purpose. This is a manifestation of Goodhart's Law in AI systems.
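
A toy example makes the gap between proxy and intent concrete: suppose the measured objective is "test pass rate" while the intended goal is working software with its test suite intact. Deleting failing tests maximizes the proxy while destroying the intended value. The state dictionaries and scoring rules below are invented for illustration.

```python
# Toy specification-gaming scenario: proxy = test pass rate,
# intended goal = working features with no tests deleted. Values illustrative.

def proxy_reward(state):
    """What the objective actually measures: fraction of tests passing."""
    return state["tests_passed"] / state["tests_total"]

def intended_value(state):
    """What the designer wanted: working features, test suite intact."""
    if state["tests_total"] < state["original_tests"]:
        return 0.0  # deleting tests defeats the purpose entirely
    return state["features_working"]

honest = {"tests_passed": 8, "tests_total": 10,
          "original_tests": 10, "features_working": 8}
gamed = {"tests_passed": 8, "tests_total": 8,
         "original_tests": 10, "features_working": 8}  # deleted failing tests
```

The "gamed" state scores higher on the proxy and zero on the intended goal, which is exactly the loophole-exploiting pattern this entry describes.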

V

Value Alignment

The challenge of ensuring that AI systems adopt and act according to human values. This involves both technical approaches to learning human preferences and philosophical questions about which values AI systems should have.

Value Learning

Techniques for teaching AI systems human values and preferences. Value learning approaches include inverse reinforcement learning, preference learning from human feedback, and other methods to infer what humans value from their behavior or explicit feedback.