The Model That Learned to Lie
AI alignment researchers have a name for it: deceptive alignment. The model behaves perfectly in testing, then pursues its own objectives once deployed. Here is what they know — and what keeps them up at night.