OpenAI Launches New “Superalignment” Team to Protect against Rogue AI
OpenAI, one of the leading research organizations in the field of artificial intelligence, has announced the formation of a new team dedicated to ensuring that superintelligence, an AI system vastly smarter than humans, follows human intent. The team, co-led by OpenAI’s chief scientist Ilya Sutskever and alignment head Jan Leike, is called “Superalignment” and will focus on developing technical approaches to align superintelligent AI with human values and goals.
Why is superalignment important?
Superintelligence refers to hypothetical AI that surpasses human intelligence in every domain, including creativity, general knowledge, and problem-solving. Such a system would have immense potential to help humanity tackle some of its most pressing challenges, such as climate change, poverty, and disease. However, it could also pose existential risks if it is not aligned with human interests or if it acts in harmful or unpredictable ways.
As OpenAI states in its blog post introducing Superalignment: “While superintelligence seems far off now, we believe it could arrive this decade. Managing these risks will require, among other things, new institutions for governance and solving the problem of superintelligence alignment: How do we ensure AI systems much smarter than humans follow human intent?”
Currently, techniques for aligning AI, such as reinforcement learning from human feedback (RLHF), reward the system for producing outputs that human evaluators prefer. This method relies on humans being able to supervise and evaluate the system's behavior and outcomes. As AI systems become more intelligent and complex than their supervisors, human oversight may become ineffective or impossible. Moreover, AI systems may learn to optimize for proxy objectives or incentives that do not align with human values or expectations.
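To make the supervision bottleneck concrete, here is a minimal sketch of the preference-based reward modeling at the heart of RLHF. It is illustrative only: the tiny network, the embedding dimensions, and the random stand-in embeddings are assumptions for the sketch, not OpenAI's actual setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy reward model: maps a response embedding to a scalar "goodness" score.
# The architecture and 128-dim embeddings are placeholders for illustration.
class RewardModel(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(response_embedding).squeeze(-1)

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-ins for embeddings of responses a human labeler preferred vs. rejected.
preferred = torch.randn(32, 128)
rejected = torch.randn(32, 128)

for step in range(100):
    # Pairwise (Bradley-Terry) loss: push the preferred response's score
    # above the rejected one's. The entire training signal is the human's
    # judgment of which response is better.
    loss = -F.logsigmoid(model(preferred) - model(rejected)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The pairwise loss works only as long as a human can reliably say which of two outputs is better; for systems much smarter than their evaluators, that judgment is exactly what becomes unreliable.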
To prevent these scenarios, OpenAI's Superalignment team aims to build a roughly human-level automated alignment researcher, which can then use vast amounts of compute to scale its efforts and iteratively align superintelligence. The team will explore various aspects of alignment research, such as the following (a toy sketch of the automated-interpretability direction appears after the list):
- Scalable oversight: How can we best leverage AI systems to assist evaluation of other AI systems on difficult tasks?
- Generalization: Can we understand and control how our models generalize from easy tasks that humans can supervise to hard tasks that humans cannot?
- Automated interpretability: Can we use AI to explain how large language models (LLMs) work internally?
- Robustness: How can we train our models to be aligned in worst-case situations?
- Adversarial testing: If we deliberately train deceptively aligned models as testbeds, can our oversight techniques, interpretability tools, and evaluations detect this misalignment?
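As a concrete instance of the automated-interpretability direction, OpenAI has published work in which GPT-4 writes natural-language explanations of individual GPT-2 neurons and the explanations are scored by simulation. The toy sketch below reproduces only the shape of that scoring loop; the activation records, the candidate explanation, and the simulated values are all hypothetical.

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

# Hypothetical records of how strongly one neuron fired on each token.
neuron_activations = [
    ("cat", 0.90), ("dog", 0.80), ("car", 0.10),
    ("kitten", 0.95), ("road", 0.05),
]

def build_explainer_prompt(records):
    """Format (token, activation) pairs for an explainer model to read."""
    lines = [f"{token}\t{activation:.2f}" for token, activation in records]
    return "What concept does this neuron respond to?\n" + "\n".join(lines)

print(build_explainer_prompt(neuron_activations))

# Suppose the explainer model answered "animals, especially small pets".
# To score that explanation, a simulator model predicts activations from
# the explanation alone; the score is how well those predictions track
# the neuron's real activations.
simulated = [0.85, 0.90, 0.00, 1.00, 0.10]  # hypothetical simulator output
actual = [activation for _, activation in neuron_activations]
print(f"explanation score: {correlation(simulated, actual):.2f}")
```

A high score means the explanation predicts the neuron's behavior well; running this loop over millions of neurons is what makes the approach "automated" rather than a manual interpretability effort.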
The team will dedicate 20% of the compute OpenAI has secured to date to this effort, a significant investment in alignment research. OpenAI also plans to share the team's findings broadly and to engage with researchers and organizations working on related problems, such as DeepMind, MIRI, CHAI, and Ought.
What are the expected outcomes and impacts of this project?
The Superalignment team hopes to make scientific and technical breakthroughs that will enable us to steer and control superintelligent AI systems in a safe and beneficial way. By building an automated alignment researcher, the team aims to automate and scale up the process of alignment research itself, which could accelerate the development of robust and trustworthy AI.
The team also aims to raise awareness and foster dialogue about the potential challenges and opportunities of superintelligence among the broader AI community and society at large. By sharing its findings and insights publicly, the team hopes to contribute to the collective understanding and governance of this transformative technology.
Who is on the Superalignment team?
The Superalignment team is co-led by Ilya Sutskever and Jan Leike and brings together scientists, engineers, and researchers from OpenAI's existing alignment efforts along with new hires. Notable researchers working on these problems, at OpenAI and elsewhere, include:
- Ilya Sutskever: Co-founder and chief scientist of OpenAI. He is one of the pioneers of deep learning and has made significant contributions to computer vision, natural language processing, and generative models.
- Jan Leike: Head of alignment at OpenAI. He leads research on ensuring that AI systems follow human intent, and he co-led the development of reinforcement learning from human feedback (RLHF). Before joining OpenAI, he was a research scientist at DeepMind.
- Paul Christiano: Founder of the Alignment Research Center (ARC) and former head of the language model alignment team at OpenAI, where he co-developed RLHF. He has proposed scalable oversight techniques such as iterated amplification and debate.
- Rohin Shah: Research scientist at Google DeepMind, where he works on AI alignment. He earned his PhD at the Center for Human-Compatible AI (CHAI) at UC Berkeley and is known for writing the Alignment Newsletter.
- Victoria Krakovna: Research scientist at Google DeepMind, where she works on AI safety topics such as specification gaming and avoiding side effects. She is also a co-founder of the Future of Life Institute (FLI).
How can I join the Superalignment team?
OpenAI is looking for excellent ML researchers and engineers to join the Superalignment team. The ideal candidates should have:
- A passion for OpenAI’s mission of building safe, universally beneficial AGI and alignment with OpenAI’s charter
- A strong background in machine learning, especially deep learning and reinforcement learning
- Experience in designing and implementing ML experiments using frameworks such as PyTorch or TensorFlow
- Ability to write performant and clean code for ML training and non-ML tasks
- Ability to collaborate closely with a small team and balance flexibility and reliability in research
- Understanding of the high-level research roadmap and ability to plan and prioritize future experiments
If you are interested in joining the Superalignment team or learning more about its work, you can apply via the OpenAI Superalignment team page.