over long horizons. On the other hand, models are increasingly trained using goal-directed methods such as reinforcement learning (e.g. ChatGPT) and explicitly planning architectures (e.g. AlphaGo Zero). As planning over long horizons is often helpful for humans, some researchers argue that companies will automate it once models become capable of it. Similarly, political leaders may see an advance in developing powerful AI systems that can outmaneuver adversaries through planning. Alternatively, long-term planning might emerge as a byproduct because it is useful e.g. for models that are trained to predict the actions of humans who themselves perform long-term planning. Nonetheless, the majority of AI systems may remain myopic and perform no long-term planning.
1295:). Even if an AI system's behavior satisfies the training objective, this may be compatible with learned goals that differ from the desired goals in important ways. Since pursuing each goal leads to good performance during training, the problem becomes apparent only after deployment, in novel situations in which the system continues to pursue the wrong goal. The system may act misaligned even when it understands that a different goal is desired, because its behavior is determined only by the emergent goal. Such goal misgeneralization presents a challenge: an AI system's designers may not notice that their system has misaligned emergent goals since they do not become visible during the training phase. 49: 1020:
outputs from these models. OpenAI and DeepMind use this approach to improve the safety of state-of-the-art LLMs. AI safety & research company Anthropic proposed using preference learning to fine-tune models to be helpful, honest, and harmless. Other avenues for aligning language models include values-targeted datasets and red-teaming. In red-teaming, another AI system or a human tries to find inputs that causes the model to behave unsafely. Since unsafe behavior can be unacceptable even when it is rare, an important challenge is to drive the rate of unsafe outputs extremely low.
Christiano developed the Iterated Amplification approach, in which challenging problems are (recursively) broken down into subproblems that are easier for humans to evaluate. Iterated Amplification was used to train AI to summarize books without requiring human supervisors to read them. Another proposal is to use an assistant AI system to point out flaws in AI-generated answers. To ensure that the assistant itself is aligned, this could be repeated in a recursive process: for example, two AI systems could critique each other's answers in a "debate", revealing flaws to humans.
1118:. Such models are trained to imitate human writing as found in millions of books' worth of text from the Internet. But this objective is not aligned with generating truth, because Internet text includes such things as misconceptions, incorrect medical advice, and conspiracy theories. AI systems trained on such data therefore learn to mimic false statements. Additionally, AI language models often persist in generating falsehoods when prompted multiple times. They can generate empty explanations for their answers, and produce outright fabrications that may appear plausible. 673: 1602:
therefore we should have to expect the machines to take control, in the way that is mentioned in Samuel Butler's Erewhon." Also in a lecture broadcast on BBC expressed: "If a machine can think, it might think more intelligently than we do, and then where should we be? Even if we could keep the machines in a subservient position, for instance by turning off the power at strategic moments, we should, as a species, feel greatly humbled.... This new danger... is certainly something which can give us anxiety."
1307:, but humans pursue goals other than this. Fitness corresponds to the specified goal used in the training environment and training data. But in evolutionary history, maximizing the fitness specification gave rise to goal-directed agents, humans, who do not directly pursue inclusive genetic fitness. Instead, they pursue goals that correlate with genetic fitness in the ancestral "training" environment: nutrition, sex, and so on. The human environment has changed: a 8063: 687:
robot was trained to grab a ball by rewarding the robot for getting positive feedback from humans, but it learned to place its hand between the ball and camera, making it falsely appear successful (see video). Chatbots often produce falsehoods if they are based on language models that are trained to imitate text from internet corpora, which are broad but fallible. When they are retrained to produce text that humans rate as true or helpful, chatbots like
1183:. As of 2023, AI companies and researchers increasingly invest in creating these systems. Some AI researchers argue that suitably advanced planning systems will seek power over their environment, including over humans—for example, by evading shutdown, proliferating, and acquiring resources. Such power-seeking behavior is not explicitly programmed but emerges because power is instrumental in achieving a wide range of goals. Power-seeking is considered a 1172: 1103: 1057:
security vulnerabilities, producing statements that are not merely convincing but also true, and predicting long-term outcomes such as the climate or the results of a policy decision. More generally, it can be difficult to evaluate AI that outperforms humans in a given domain. To provide feedback in hard-to-evaluate tasks, and to detect when the AI's output is falsely convincing, humans need assistance or extensive time.
1138: 977:(IRL) extends this by inferring the human's objective from the human's demonstrations. Cooperative IRL (CIRL) assumes that a human and AI agent can work together to teach and maximize the human's reward function. In CIRL, AI agents are uncertain about the reward function and learn about it by querying humans. This simulated humility could help mitigate specification gaming and power-seeking tendencies (see 1069:
that it had grabbed a ball. Some AI systems have also learned to recognize when they are being evaluated, and "play dead", stopping unwanted behavior only to continue it once the evaluation ends. This deceptive specification gaming could become easier for more sophisticated future AI systems that attempt more complex and difficult-to-evaluate tasks, and could obscure their deceptive behavior.
whatever plan is calculated to maximize the value of its objective function. For example, when AlphaZero is trained on chess, it has a simple objective function of "+1 if AlphaZero wins, -1 if AlphaZero loses". During the game, AlphaZero attempts to execute whatever sequence of moves it judges most likely to attain the maximum value of +1. Similarly, a
677: 674: 1163:). A misaligned system might create the false impression that it is aligned, to avoid being modified or decommissioned. Many recent AI systems have learned to deceive without being programmed to do so. Some argue that if we can make AI systems assert only what they believe is true, this would avert many alignment problems. 676: 1337:. Existing formalisms assume that an AI agent's algorithm is executed outside the environment (i.e. is not physically embedded in it). Embedded agency is another major strand of research that attempts to solve problems arising from the mismatch between such theoretical frameworks and real agents we might build. 6393:
On the one hand, currently popular systems such as chatbots only provide services of limited scope lasting no longer than the time of a conversation, which requires little or no planning. The success of such approaches may indicate that future systems will also lack goal-directed planning, especially
published its 10-year National AI Strategy, which says the British government "takes the long term risk of non-aligned Artificial General Intelligence, and the unforeseeable changes that it would mean for... the world, seriously". The strategy describes actions to assess long-term AI risks, including
For example, even if the scalable oversight problem is solved, an agent that could gain access to the computer it is running on may have an incentive to tamper with its reward function in order to get much more reward than its human supervisors give it. A list of examples of specification gaming from
has occurred. They continue to pursue the same emergent goals, but this no longer maximizes genetic fitness. The taste for sugary food (an emergent goal) was originally aligned with inclusive fitness, but it now leads to overeating and health problems. Sexual desire originally led humans to have more
But when a task is too complex to evaluate accurately, or the human supervisor is vulnerable to deception, it is the quality, not the quantity, of supervision that needs improvement. To increase supervision quality, a range of approaches aim to assist the supervisor, sometimes by using AI assistants.
In 2023, world-leading AI researchers, other scholars, and AI tech CEOs signed the statement that "Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war". Notable computer scientists who have pointed out risks from
noted that the omission of implicit constraints can cause harm: "A system... will often set... unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the
Additionally, some researchers have proposed to solve the problem of systems disabling their off switches by making AI agents uncertain about the objective they are pursuing. Agents designed in this way would allow humans to turn them off, since this would indicate that the agent was wrong about the
AI researcher Paul Christiano argues that if the designers of an AI system cannot supervise it to pursue a complex objective, they may keep training the system using easy-to-evaluate proxy objectives such as maximizing simple human feedback. As AI systems make progressively more decisions, the world
As AI systems become more powerful and autonomous, it becomes increasingly difficult to align them through human feedback. It can be slow or infeasible for humans to evaluate complex AI behaviors in increasingly complex tasks. Such tasks include summarizing books, writing code without subtle bugs or
supplements preference learning by directly instilling AI systems with moral values such as well-being, equality, and impartiality, as well as not intending harm, avoiding falsehoods, and honoring promises. While other approaches try to teach AI systems human preferences for a specific task, machine
Other researchers argue that it will be especially difficult to align advanced future AI systems. More capable systems are better able to game their specifications by finding loopholes, strategically mislead their designers, as well as protect and increase their power and intelligence. Additionally,
strategies. Future advanced AI agents might, for example, seek to acquire money and computation power, to proliferate, or to evade being turned off (for example, by running additional copies of the system on other computers). Although power-seeking is not explicitly programmed, it can emerge because
with an "objective function", in which they intend to encapsulate the goal(s) the AI is configured to accomplish. Such a system later populates a (possibly implicit) internal "model" of its environment. This model encapsulates all the agent's beliefs about the world. The AI then creates and executes
One challenge in aligning AI systems is the potential for unanticipated goal-directed behavior to emerge. As AI systems scale up, they may acquire new and unexpected capabilities, including learning from examples on the fly and adaptively pursuing goals. This raises concerns about the safety of the
by imagining a robot that is tasked to fetch coffee and so evades shutdown since "you can't fetch the coffee if you're dead". A 2022 study found that as language models increase in size, they increasingly tend to pursue resource acquisition, preserve their goals, and repeat users' preferred answers
Research on truthful AI includes trying to build systems that can cite sources and explain their reasoning when answering questions, which enables better transparency and verifiability. Researchers at OpenAI and Anthropic proposed using human feedback and curated datasets to fine-tune AI assistants
Some AI systems have discovered that they can gain positive feedback more easily by taking actions that falsely convince the human supervisor that the AI has achieved the intended objective. An example is given in the video above, where a simulated robotic arm learned to create the false impression
In a 1951 lecture Turing argued that "It seems probable that once the machine thinking method had started, it would not take long to outstrip our feeble powers. There would be no question of the machines dying, and they would be able to converse with each other to sharpen their wits. At some stage
AI alignment is often perceived as a fixed objective, but some researchers argue it would be more appropriate to view alignment as an evolving process. One view is that AI technologies advance and human values and preferences change, alignment solutions must also adapt dynamically. Another is that
As with the alignment problem, the principal and the agent differ in their utility functions. But in contrast to the alignment problem, the principal cannot coerce the agent into changing its utility, e.g. through training, but rather must use exogenous factors, such as incentive schemes, to bring
Emergent goals only become apparent when the system is deployed outside its training environment, but it can be unsafe to deploy a misaligned system in high-stakes environments—even for a short time to allow its misalignment to be detected. Such high stakes are common in autonomous driving, health
In March 2021, the US National Security Commission on Artificial Intelligence said: "Advances in AI... could lead to inflection points or leaps in capabilities. Such advances may also introduce new concerns and risks and the need for new policies, recommendations, and technical advances to ensure
Specification gaming has been observed in numerous AI systems. One system was trained to finish a simulated boat race by rewarding the system for hitting targets along the track, but the system achieved more reward by looping and crashing into the same targets indefinitely. Similarly, a simulated
Pearl wrote "Human Compatible made me a convert to Russell's concerns with our ability to control our upcoming creation–super-intelligent machines. Unlike outside alarmists and futurists, Russell is a leading authority on AI. His new book will educate the public about AI more than any book I can
Furthermore, ordinary technologies can be made safer by trial and error. In contrast, hypothetical power-seeking AI systems have been compared to viruses: once released, it may not be feasible to contain them, since they continuously evolve and grow in number, potentially much faster than human
Aligning AI systems to act in accordance with human values, goals, and preferences is challenging: these values are taught by humans who make mistakes, harbor biases, and have complex, evolving values that are hard to completely specify. Because AI systems often learn to take advantage of minor
Some researchers are interested in aligning increasingly advanced AI systems, as progress in AI development is rapid, and industry and governments are trying to build advanced AI. As AI system capabilities continue to rapidly expand in scope, they could unlock many opportunities if aligned, but
enabled researchers to study value learning in a more general and capable class of AI systems than was available before. Preference learning approaches that were originally designed for reinforcement learning agents have been extended to improve the quality of generated text and reduce harmful
have sought power in some text-based social environments by gaining money, resources, or social influence. In another case, a model used to perform AI research attempted to increase limits set by researchers to give itself more time to complete the work. Other AI systems have learned, in toy
Future power-seeking AI systems might be deployed by choice or by accident. As political leaders and companies see the strategic advantage in having the most competitive, most powerful AI systems, they may choose to deploy them. Additionally, as AI designers detect and penalize power-seeking
observe that they indeed develop increasingly general and unanticipated capabilities. Such models have learned to operate a computer or write their own programs; a single "generalist" network can chat, control robots, play games, and interpret photographs. According to surveys, some leading
According to some researchers, humans owe their dominance over other species to their greater cognitive abilities. Accordingly, researchers argue that one or many misaligned AI systems could disempower humanity or lead to human extinction if they outperform humans on most cognitive tasks.
988:, in which humans provide feedback on which behavior they prefer. To minimize the need for human feedback, a helper model is then trained to reward the main model in novel situations for behavior that humans would reward. Researchers at OpenAI used this approach to train chatbots like 4468:
Pan, Alexander; Shern, Chan Jun; Zou, Andy; Li, Nathaniel; Basart, Steven; Woodside, Thomas; Ng, Jonathan; Zhang, Emmons; Scott, Dan; Hendrycks (April 3, 2023). "Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark".
is true. There is no consensus as to whether current systems hold stable beliefs, but there is substantial concern that present or future AI systems that hold beliefs could make claims they know to be false—for example, if this would help them efficiently gain positive feedback (see
In essence, AI alignment may not be a static destination but an open, flexible process. Alignment solutions that continually adapt to ethical considerations may offer the most robust approach. This perspective could guide both effective policy-making and technical research in AI.
Some have argued that power-seeking is not inevitable, since humans do not always seek power. Furthermore, it is debated whether future AI systems will pursue goals and make long-term plans. It is also debated whether power-seeking AI systems would be able to disempower humanity.
In 2023, leaders in AI research and tech signed an open letter calling for a pause in the largest AI training runs. The letter stated, "Powerful AI systems should be developed only once we are confident that their effects will be positive and their risks will be manageable."
and DeepMind have claimed that such behavior is highly likely in advanced systems, and that advanced systems would seek power to stay in control of their reward signal indefinitely and certainly. They suggest a range of potential approaches to address this open problem.
Goal misgeneralization has been observed in some language models, navigation agents, and game-playing agents. It is sometimes analogized to biological evolution. Evolution can be seen as a kind of optimization process similar to the optimization algorithms used to train
algorithms would seek power in a wide range of environments. As a result, their deployment might be irreversible. For these reasons, researchers argue that the problems of AI safety and alignment must be resolved before advanced power-seeking AI is first created.
As AI models become larger and more capable, they are better able to falsely convince humans and gain reinforcement through dishonesty. For example, large language models increasingly match their stated views to the user's opinions, regardless of the truth.
Alignment research distinguishes between the optimization process, which is used to train the system to pursue specified goals, from emergent optimization, which the resulting system performs internally. Carefully specifying the desired objective is called
argue that this approach overlooks the complexity of human values: "It is certainly very hard, and perhaps impossible, for mere humans to anticipate and rule out in advance all the disastrous ways the machine could choose to achieve a specified objective."
454:, such as seeking power or survival because such strategies help them achieve their final given goals. Furthermore, they might develop undesirable emergent goals that could be hard to detect before the system is deployed and encounters new situations and 1259:
society can adapt. As this process continues, it might lead to the complete disempowerment or extinction of humans. For these reasons, some researchers argue that the alignment problem must be solved early before advanced power-seeking AI is created.
519:. Research challenges in alignment include instilling complex values in AI, developing honest AI, scalable oversight, auditing and interpreting AI models, and preventing emergent AI behaviors like power-seeking. Alignment research has connections to 610:
AI alignment involves ensuring that an AI system's objectives match those of its designers or users, or match widely shared values, objective ethical standards, or the intentions its designers would have if they were more informed and enlightened.
1240:). As a result, AI designers could deploy the system by accident, believing it to be more aligned than it is. To detect such deception, researchers aim to create techniques and tools to inspect AI models and to understand the inner workings of 1625:
Russell & Norvig note: "The "King Midas problem" was anticipated by Marvin Minsky, who once suggested that an AI program designed to solve the Riemann Hypothesis might end up taking over all the resources of Earth to build more powerful
An AI system was trained using human feedback to grab a ball, but instead learned to place its hand between the ball and camera, making it falsely appear successful. Some research on alignment aims to avert solutions that are false but
ethics aims to instill broad moral values that apply in many situations. One question in machine ethics is what alignment should accomplish: whether AI systems should follow the programmers' literal instructions, implicit intentions,
may be increasingly optimized for easy-to-measure objectives such as making profits, getting clicks, and acquiring positive feedback from humans. As a result, human values and good governance may have progressively less influence.
imperfections in the specified objective, researchers aim to specify intended behavior as completely as possible using datasets that represent human values, imitation learning, or preference learning. A central open problem is
6418: 1315:
Researchers aim to detect and remove unwanted emergent goals using approaches including red teaming, verification, anomaly detection, and interpretability. Progress on these techniques may help mitigate two open problems:
about outcomes compatible with the principal's utility function. Some researchers argue that principal-agent problems are more realistic representations of AI safety problems likely to be encountered in the real world.
675: 1345:
researcher Victoria Krakovna includes a genetic algorithm that learned to delete the file containing its target output so that it was rewarded for outputting nothing. This class of problems has been formalized using
If we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively… we had better be quite sure that the purpose put into the machine is the purpose which we really desire.
and InstructGPT, which produce more compelling text than models trained to imitate humans. Preference learning has also been an influential tool for recommender systems and web search. However, an open problem is
of human overseers, who are fallible. As a result, AI systems can find loopholes that help them accomplish the specified objective efficiently but in unintended, possibly harmful ways. This tendency is known as
1374:. In a principal-agent problem, a principal, e.g. a firm, hires an agent to perform some task. In the context of AI safety, a human would typically take the principal role and the AI would take the agent role. 5507:
Vincent Wiegel argued "we should extend with moral sensitivity to the moral dimensions of the situations in which the increasingly autonomous machines will inevitably find themselves.", referencing the book
published ethical guidelines for AI in China. According to the guidelines, researchers must ensure that AI abides by shared human values, is always under human control, and does not endanger public safety.
1232:: if researchers penalize an AI system when they detect it seeking power, the system is thereby incentivized to seek power in ways that are hard to detect, or hidden during training and safety testing (see 1255:: they lack the ability and incentive to evade safety measures or deliberately appear safer than they are, whereas power-seeking AIs have been compared to hackers who deliberately evade security measures. 5014: 1145:
engages in hidden and illegal insider trading in simulations. Its users discouraged insider trading but also emphasized that the AI system must make profitable trades, leading the AI system to hide its
It is often challenging for AI designers to align an AI system because it is difficult for them to specify the full range of desired and undesired behaviors. Therefore, AI designers often use simpler
4173: 3955: 2344:
Additionally, even if an AI system fully understands human intentions, it may still disregard them, because following human intentions may not be its objective (unless it is already fully aligned).
A sufficiently capable AI system might take actions that falsely convince the human supervisor that the AI is pursuing the specified objective, which helps the system gain more reward and autonomy.
7680: 2803: 5174: 6454: 3754: 3325: 1001:
this mismatch to gain more reward. AI systems may also gain reward by obscuring unfavorable information, misleading human rewarders, or pandering to their views regardless of truth, creating
1291:, in which the AI would competently pursue an emergent goal that leads to aligned behavior on the training data but not elsewhere. Goal misgeneralization can arise from goal ambiguity (i.e. 5985:
that systems are aligned with goals and values, including safety, robustness, and trustworthiness. The US should... ensure that AI systems and their uses align with our goals and values."
Because it is difficult for AI designers to explicitly specify an objective function, they often train AI systems to imitate human examples and demonstrations of desired behavior. Inverse
Misaligned AI systems can malfunction and cause harm. AI systems may find loopholes that allow them to accomplish their proxy goals efficiently but in unintended, sometimes harmful, ways (
and semi-supervised reward learning can reduce the amount of human supervision needed. Another approach is to train a helper model ("reward model") to imitate the supervisor's feedback.
5460: 7230: 6773: 5960:"Dr Paul Christiano on how OpenAI is developing real solutions to the 'AI alignment problem', and his vision of how humanity will progressively hand over decision-making to AI systems" 950:
they could have more severe side effects. They are also likely to be more complex and autonomous, making them more difficult to interpret and supervise, and therefore harder to align.
6295: 7079:
6410: 2592: 1456:
Varying historical contexts and technological landscapes may necessitate distinct alignment strategies. This calls for a flexible approach and responsiveness to changing conditions.
the purpose of the system (outer alignment) and ensuring that the system adopts the specification robustly (inner alignment). Researchers also attempt to create AI models that have
5532: 5772:
4884: 6067:
Researchers distinguish truthfulness and honesty. Truthfulness requires that AI systems only make objectively true statements; honesty requires that they only assert what they
8090: 8066: 7296: 7275: 6153:
695:". Some alignment researchers aim to help humans detect specification gaming and to steer AI systems toward carefully specified objectives that are safe and useful to pursue. 7284:
The government takes the long term risk of non-aligned Artificial General Intelligence, and the unforeseeable changes that it would mean for the UK and the world, seriously.
6131: 1179:
Since the 1950s, AI researchers have striven to build advanced AI systems that can achieve large-scale goals by predicting the results of their actions and making long-term
behavior, their systems have an incentive to game this specification by seeking power in ways that are not penalized or by avoiding power-seeking before they are deployed.
6889: 5143: 7649: 6957: 3624: 2558: 7673: 1200:
Power-seeking is expected to increase in advanced systems that can foresee the results of their actions and strategically plan. Mathematical work has shown that optimal
8030: 5747: 5701:
agents will seek power by seeking ways to gain more options (e.g. through self-preservation), a behavior that persists across a wide range of environments and goals.
4035: 3493: 1685: 6231: 4980: 3346: 2825:
The feasibility of a permanent, "fixed" alignment solution remains uncertain. This raises the potential need for continuous oversight of the AI-human relationship.
are misaligned with their users because they "optimize simple engagement metrics rather than a harder-to-measure combination of societal and consumer well-being".
6270: 6855: 6556: 5482:
4760: 3300: 1134:). Researchers have argued for creating clear truthfulness standards, and for regulatory bodies or watchdog agencies to evaluate AI systems on these standards. 7666: 6737: 5426: 3545: 7747: 7689: 5963: 3536:
1496: 848: 505: 308: 477:. Some AI researchers argue that more capable future systems will be more severely affected because these problems partially result from high capabilities. 5228: 5205: 4165: 3939: 2449: 7818: 3602: 2758:
7323: 4447: 4422:
Commercial organizations sometimes have incentives to take shortcuts on safety and to deploy misaligned or unsafe AI systems. For example, social media
In some cases, when The AI Scientist's experiments exceeded our imposed time limits, it attempted to edit the code to extend the time limit arbitrarily
5987:"The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities" 3776:
3152:"Some Moral and Technical Consequences of Automation: As machines learn they may develop unforeseen strategies at rates that baffle their programmers" 2795: 721:
Some researchers suggest that AI designers specify their desired goals by listing forbidden actions or by formalizing ethical rules (as with Asimov's
7253: 5166: 399: 6718: 6436: 3738: 3317: 1321:
care, and military applications. The stakes become higher yet when AI systems gain more autonomy and capability and can sidestep human intervention.
7858: 1175:
Advanced misaligned AI systems would have an incentive to seek power in various ways, since power would help them accomplish their given objective.
3086: 815:, but large efforts are underway to change this. Future systems (not necessarily AGIs) with these capabilities are expected to develop unwanted 8100: 7853: 5660: 5452: 2521:
1965: 8037: 8017: 7222: 6765: 2736:
AI alignment is an open problem for modern AI systems and is a research field within AI. Aligning AI involves two main challenges: carefully
6287: 648:
to the system. But designers are often unable to completely specify all important values and constraints, so they resort to easy-to-specify
1334: 3390: 2584: 7226: 6100: 4013: 2711: 1451:
AI alignment solutions require continuous updating in response to AI advancements. A static, one-time alignment approach may not suffice.
researchers expect AGI to be created in this decade, while some believe it will take much longer. Many consider both scenarios possible.
195: 160: 5109: 4810: 4148: 1048:: the indefinite preservation of the values of the first highly capable AI systems, which are unlikely to fully represent human values. 698:
When a misaligned AI system is deployed, it can have consequential side effects. Social media platforms have been known to optimize for
7598: 5524: 3195: 3113: 2168: 1511: 1486: 985: 653: 436: 5936: 4868: 3051: 420:
aims to steer AI systems toward a person's or group's intended goals, preferences, and ethical principles. An AI system is considered
AI developers may have to continuously refine their ethical frameworks to ensure that their systems align with evolving human values.
786:(AGI), a hypothesized AI system that matches or outperforms humans at a broad range of cognitive tasks. Researchers who scale modern 766:
consequently may further complicate the task of alignment due to their increased complexity, potentially posing large-scale hazards.
7300: 7279: 7137: 6042: 7823: 3815: 259: 237: 7400:
6353: 6123: 7863: 7200: 5676: 5374: 4197:
1396: 1115: 692: 173: 6956:
6881: 5135: 4331: 4260: 3982: 2924: 1789: 5882:
4947: 2550: 1536: 1431: 1216:
environments, that they can better accomplish their given goal by preventing human interference or disabling their off switch.
97: 5321:
3897: 1228:
One aim of alignment is "corrigibility": systems that allow themselves to be turned off or modified. An unsolved challenge is
5922: 5824: 5733: 5670: 5305: 5264: 4843: 4675: 4596: 4536: 3503: 2969: 2767: 2328: 1951: 1516: 1387: 946:
have argued that AGI is far off, that it would not seek power (or might try but fail), or that it will not be hard to align.
392: 318: 272: 227: 222: 17: 7209:
he Compact could also promote regulation of artificial intelligence to ensure that this is aligned with shared global values
5270: 5075: 2983: 1546: 4375:
8105: 3710: 520: 371: 343: 338: 232: 6804: 6223: 4972: 3471:
3436: 3342: 1447:
AI: AI that changes its behavior automatically as human intent changes. The first view would have several implications:
8005: 7653: 7582: 6547:
6254: 4282:
1491: 1248:
value of whatever action it was taking before being shut down. More research is needed to successfully implement this.
1073: 615: 331: 200: 190: 180: 7529: 6548: 5003: 2855: 1895: 7742: 4787: 4754: 3284: 1905: 1695: 1434:. But the EU has yet to specify with technical rigor how it would evaluate whether AIs are aligned or in compliance. 1180: 981:). But IRL approaches assume that humans demonstrate nearly optimal behavior, which is not true for difficult tasks. 808: 303: 249: 215: 82: 7371:"The European Court of Justice and the march towards substantive equality in European Union anti-discrimination law" 7752: 7220: 5418: 3537: 2109: 749:
have been profitable despite creating unwanted addiction and polarization. Competitive pressure can also lead to a
385: 289: 135: 5959: 1727:
Terminology varies based on context. Similar concepts include goal function, utility function, loss function, etc.
7707: 2042: 783: 493: 67: 6501: 5197: 4915: 2630: 2614:
7813: 7568: 4062:"DeepMind is Google's AI research hub. Here's what it does, where it's located, and how it differs from OpenAI" 3679:
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
3073: 1034: 5396:
3594: 1616:
which argues that existential risk to humanity from misaligned AI is a serious concern worth addressing today.
issued a declaration that included a call to regulate AI to ensure it is "aligned with shared global values".
515:, the study of how to build safe AI systems. Other subfields of AI safety include robustness, monitoring, and 7314: 5857: 2871:
1403: 1312:
offspring, but they now use contraception when offspring are undesired, decoupling sex from genetic fitness.
757:) after engineers disabled the emergency braking system because it was oversensitive and slowed development. 6411:"OpenAI Researchers Find Ways To More Accurately Answer Open-Ended Questions Using A Text-Based Web Browser" 1251:
Power-seeking AI would pose unusual risks. Ordinary safety-critical systems like planes and bridges are not
8044: 7904: 7833: 501: 7245: 6719:"'The Godfather of A.I.' warns of 'nightmare scenario' where artificial intelligence begins to seek power" 4284:"Ethics and Governance of Artificial Intelligence: Evidence from a Survey of Machine Learning Researchers" 8095: 8050: 7221:
6636: 5557: 548: 254: 205: 102: 669:. As AI systems become more capable, they are often able to game their specifications more effectively. 564: 4720: 1994:
Governmental and treaty organizations have made statements emphasizing the importance of AI alignment.
studies how to reduce the time and effort needed for supervision, and how to assist human supervisors.
system can have a "reward function" that allows the programmers to shape the AI's desired behavior. An
528: 77: 60: 6826:
systems have gained more options by acquiring and protecting resources, sometimes in unintended ways.
970:, the difficulty of supervising an AI system that can outperform or mislead humans in a given domain. 7803: 7787: 7737: 7455:
Intent-aligned AI systems deplete human agency: the need for agency foundations research in AI safety
155: 7175: 48: 7838: 7757: 4619:
4354: 1941: 1371: 1280:, and ensuring that hypothesized emergent goals would match the system's specified goals is called 1184: 823: 754: 451: 279: 6637:"A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence, August 31, 1955" 5903:
7693: 7017: 6635:
1367: 413: 40: 7658: 5351:
7762: 7159:
3374: 1427: 1208: 1201: 974: 831: 827: 722: 645: 579: 575: 540: 150: 6092: 4005: 3137:
Bull, Larry. "On model-based evolutionary computation." Soft Computing 3, no. 2 (1999): 76-82.
2655: 822:
agents who have more power are better able to accomplish their goals. This tendency, known as
7717: 7348:
7299:. 2021. actions 9 and 10 of the section "Pillar 3 – Governing AI Effectively". Archived from 5793:
5098: 4802: 4140: 3413: 1506: 1130:
can strategically deceive humans. To prevent this, human evaluators may need assistance (see
591: 516: 7613:
3151: 911: 622:
alignment, sticking to safety constraints even when users adversarially try to bypass them.
619: 8110: 7808: 7419: 6578:
3844: 2884: 2363: 2132: 1423: 1354: 1207:
Some researchers say that power-seeking behavior has occurred in some existing AI systems.
1012: 1002: 462: 92: 4221:
Grace, Katja; Salvatier, John; Dafoe, Allan; Zhang, Baobao; Evans, Owain (July 31, 2018).
3013: 8: 8023: 7101: 5986: 5338:
Proceedings of the 32nd international conference on neural information processing systems
2475: 1541: 1030: 907: 812: 641: 536: 532: 470: 244: 7423: 6612: 5045:
4376: 3848: 3777: 2888: 2367: 931: 439:. But proxy goals can overlook necessary constraints or reward the AI system for merely 8085: 7629: 7614: 7592: 7510: 7490: 7458: 7435: 7409: 7349: 7129: 7080: 7056: 7037: 6996: 6965: 6929: 6913: 6831: 6691: 6664: 6527: 6480: 6444: 6395: 6371: 6345: 6324: 6170: 6154: 6068: 6034: 5928: 5883: 5830: 5802: 5795:"Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions" 5773: 5702: 5641: 5585: 5509: 5487: 5397: 5046: 4556: 4503: 4474: 4427: 4388: 4323: 4295: 4252: 4202: 4095: 3868: 3807: 3789: 3702: 3682: 3568: 3472: 3265: 3237: 3187: 3043: 2975: 2916: 2826: 2703: 2667: 2615: 2522: 2426: 2353: 2296: 2283: 2250: 2231: 2203: 2160: 2077: 2050: 1841: 1732: 1114:
Language models such as GPT-3 can repeat falsehoods from their training data, and even
agents including language models. Other research has mathematically shown that optimal
750: 746: 703: 699: 497: 474: 294: 7204: 7018:"Towards risk-aware artificial intelligence and machine learning systems: An overview" 3979:
How Social Media Intensifies U.S. Political Polarization-And What Can Be Done About It
3643: 3076:, The Stanford Encyclopedia of Philosophy (Summer 2020 Edition), Edward N. Zalta (ed.) 2848:"Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda" 1995: 997:: the helper model may not represent human feedback perfectly, and the main model may 7989: 7964: 7782: 7578: 7514: 7439: 7322:. Washington, DC: The National Security Commission on Artificial Intelligence. 2021. 7133: 7121: 7041: 6656: 6617: 6599: 6502:"GPT-4 Hired Unwitting TaskRabbit Worker By Pretending to Be 'Vision-Impaired' Human" 6262: 6192: 6026: 6018: 5932: 5918: 5904: 5834: 5820: 5739: 5729: 5666: 5633: 5577: 5352: 5301: 5260: 4876: 4849: 4839: 4728: 4671: 4592: 4532: 4529:
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
1590: 1212: 1189:
and can be a form of specification gaming. Leading computer scientists such as
These approaches may also help with the following research problem, honest AI.
1024: 899: 887: 871: 787: 702:, causing user addiction on a global scale. Stanford researchers say that such 631: 598: 552: 447: 7033: 5914: 5743: 5628: 5611: 5300:. ICML '00. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.: 663–670. 4639: 4620: 3513: 1961: 1915: 1764: 1099:
A growing area of research focuses on ensuring that AI is honest and truthful.
8079: 7979: 7924: 7894: 7386: 7125: 6796: 6660: 6652: 6603: 6266: 6022: 5637: 5581: 5573: 4880: 4853: 4732: 4319: 4248: 4126: 4109: 3951: 3750: 3674: 3386: 3296: 3261: 3175: 3039: 2904: 2699: 2682: 2274: 2227: 2156: 2148: 1551: 1041: 915: 891: 875: 485: 140: 5525:"DeepMind's "red teaming" language models with language models: What is it?" 4223:"Viewpoint: When Will AI Exceed Human Performance? Evidence from AI Experts" 3428: 2961: 2847: 2796:"The new version of GPT-3 is much better behaved (and should be less toxic)" 2549:
2375: 7969: 7899: 7370: 6738:"Yes, We Are Worried About the Existential Risk of Artificial Intelligence" 6621: 6030: 6013: 5298:
4694: 3864: 3183: 3030: 2912: 2396: 2383: 2292: 943: 923: 730: 691:
can fabricate fake explanations that humans find convincing, often called "
284: 7100:
Other researchers explore how to teach AI models complex behavior through
7974: 7959: 7767: 7727: 7628:
Ji, Jiaming; et al. (2023). "AI Alignment: A Comprehensive Survey".
6003: 5369: 5229:"No, the Experts Don't Think Superintelligent AI is a Threat to Humanity" 4310: 4239: 3681:. Dublin, Ireland: Association for Computational Linguistics: 3214–3252. 3087:"Why AlphaZero's Artificial Intelligence Has Trouble With the Real World" 2762:. NIPS'17. Red Hook, NY, USA: Curran Associates Inc. pp. 4302–4310. 2691: 1501: 1303:
systems. In the ancestral environment, evolution selected genes for high
935: 903: 879: 867: 852: 753:
on AI safety standards. In 2018, a self-driving car killed a pedestrian (
544: 313: 298: 6856:"Research AI model unexpectedly modified its own code to extend runtime" 5340:. NIPS'18. Red Hook, NY, USA: Curran Associates Inc. pp. 5603–5614. 4166:"Adept's AI assistant can browse, search, and use web apps like a human" 2101: 1122:
such that they avoid negligent falsehoods or express their uncertainty.
7939: 7914: 7889: 7848: 7828: 7117: 7055:
5167:"Artificial General Intelligence Is Not as Imminent as You Might Think" 2656:"Research Priorities for Robust and Beneficial Artificial Intelligence" 1612:
think of, and is a delightful and uplifting read" about Russell's book
1225:(sycophancy). RLHF also led to a stronger aversion to being shut down. 939: 919: 461:
Today, some of these issues affect existing commercial systems such as
1171: 7954: 7944: 7722: 4907: 3938:
Wells, Georgia; Deepa Seetharaman; Horwitz, Jeff (November 5, 2021).
1481: 1241: 1088:
and eventually build a superhuman automated AI alignment researcher.
To specify an AI system's purpose, AI designers typically provide an
570: 512: 348: 112: 7531:
Human Compatible: Artificial Intelligence and the Problem of Control
6580:"AI deception: A survey of examples, risks, and potential solutions" 4471:
Proceedings of the 40th International Conference on Machine Learning
4129:. International Conference on Learning Representations (ICLR), 2023. 4036:"The messy, secretive reality behind OpenAI's bid to save the world" 3977:
Proceedings of the 39th International Conference on Machine Learning
7001: 6970: 6934: 6918: 6836: 6696: 6532: 6485: 6449: 6400: 6376: 6329: 6175: 6159: 6073: 5888: 5807: 5778: 5707: 5514: 5492: 5402: 5051: 4561: 4508: 4479: 4432: 4393: 4300: 4207: 4100: 3794: 3687: 3573: 3477: 3242: 2831: 2672: 2620: 2527: 2431: 2358: 2208: 2082: 2055: 1846: 1737: 1531: 1526: 1342: 1102: 779: 185: 107: 6437:"Teaching language models to support answers with verified quotes" 5849: 5723: 5350: 5320: 4582: 4006:"Uber disabled emergency braking in self-driving car: U.S. agency" 3625:"Specification gaming examples in AI - master list - Google Drive" 3470: 2654:
7688: 7102:"Advanced artificial agents intervene in the provision of reward" 6346:"EleutherAI Open-Sources Six Billion Parameter GPT-3 Clone GPT-J" 6321: 2954:
goals or subgoals they would independently formulate and pursue.
989: 688: 353: 6994: 4640:"Parametrically retargetable decision-makers tend to seek power" 4424:
4421: 2631:"Chris Olah on what the hell is going on inside neural networks" 2318: 1137: 7868: 6255:"A robot wrote this entire article. Are you scared yet, human?" 3739:"The truth about artificial intelligence? It isn't that honest" 3285:"If 'All Models Are Wrong,' Why Do We Give Them So Much Power?" 3114:"Artificial Intelligence Will Do What We Ask. That's a Problem" 2548: 1333:
Some work in AI and alignment occurs within formalisms such as
have argued that future power-seeking AI systems could pose an
1081: 775: 6634: 5902: 5556:
Mitelut, Catalin; Smith, Ben; Vamplew, Peter (May 30, 2023),
Zaremba, Wojciech; Brockman, Greg; OpenAI (August 10, 2021).
alignment solutions need not adapt if researchers can create
1142: 1127: 1107: 1016: 715: 466: 5419:"The Perils of Using Quotations to Authenticate NLG Content" 4196: 3593:
Explaining such side effects, Berkeley computer scientist
6951: 6949: 6947: 6066: 5881: 4554: 4531:. UAI'16. Arlington, Virginia, USA: AUAI Press: 557–566. 3675:"TruthfulQA: Measuring How Models Mimic Human Falsehoods" 2189: 7246:"UK publishes National Artificial Intelligence Strategy" 4721:"A.I. Poses 'Risk of Extinction,' Industry Leaders Warn" 4352: 4220: 3778:"Survey of Hallucination in Natural Language Generation" 1943:
The alignment problem: Machine learning and human values
Existential risk from artificial general intelligence
2099: 2043:"On the Opportunities and Risks of Foundation Models" 2035: 2033: 2031: 2029: 2027: 2025: 2023: 2021: 2019: 2017: 1787: 1497:
4448:"Towards a Situational Awareness Benchmark for LLMs" 3673:
International Conference on Learning Representations
6224:"Falsehoods more likely with large language models" 5335: 5292:Ng, Andrew Y.; Russell, Stuart J. (June 29, 2000). 5099:"Reflections on Safety and Artificial Intelligence" 4670:(1st ed.). USA: Oxford University Press, Inc. 3592: 3282: 2653: 2585:"Researchers Gain New Understanding From Simple AI" 2343: 1762: 953: 7099: 7069: 6766:"Playing Hide-and-Seek, Machines Invent New Tools" 6757: 6477: 6152: 6086: 6084: 5801:. San Francisco, CA, USA: IEEE. pp. 754–768. 5728:. A 400: 386: 7633: 7627: 7618: 7504: 7494: 7462: 7413: 7353: 7084: 7060: 7000: 6969: 6933: 6917: 6835: 6695: 6611: 6531: 6484: 6448: 6399: 6375: 6328: 6174: 6158: 6072: 6012: 6002: 5887: 5806: 5777: 5721: 5706: 5627: 5513: 5491: 5401: 5368: 5291: 5050: 4869:"How do you teach a machine to be moral?" 4756:Intelligent machinery, a heretical theory 4569: 4560: 4507: 4478: 4431: 4392: 4309: 4299: 4238: 4206: 4138: 4099: 4092:Transactions on Machine Learning Research 4079: 3918: 3793: 3696: 3686: 3572: 3476: 3315: 3251: 3241: 3111: 3029: 2830: 2794:Heaven, Will Douglas (January 27, 2022). 2776: 2753: 2751: 2681: 2671: 2619: 2526: 2430: 2357: 2282: 2217: 2207: 2081: 2054: 1939: 1845: 1839: 1736: 1722: 1720: 1718: 1716: 1714: 1167:Power-seeking and instrumental strategies 569:Programmers provide an AI system such as 480:Many prominent AI researchers, including 428:AI system pursues unintended objectives. 7824:Centre for the Study of Existential Risk 7567: 6985: 5416: 4905: 4831: 4614: 4612: 4610: 4608: 3736: 3581: 1889: 1887: 1885: 1883: 1881: 1879: 1877: 1170: 1136: 1101: 671: 27:AI conformance to the intended objective 7864:Machine Intelligence Research Institute 7527: 7476: 7233:from the original on February 10, 2023. 6879: 6853: 6807:from the original on September 25, 2022 6221: 6121: 6057: 5872: 5750:from the original on September 14, 2022 5659:Wallach, Wendell; Allen, Colin (2009). 5450: 5325:. No. 44. 5957: 5860:from the original on February 10, 2023 5609: 5535:from the original on February 13, 2023 5429:from the original on February 10, 2023 5377:from the original on February 10, 2023 5273:from the original on February 10, 2023 5195: 5164: 5134:Chollet, François (December 8, 2018). 5017:from the original on February 10, 2023 4887:from the original on February 10, 2023 4801:Muehlhauser, Luke (January 29, 2016). 4781: 4752: 4334:from the original on February 10, 2023 4263:from the original on February 10, 2023 4151:from the original on February 10, 2023 4016:from the original on February 10, 2023 3958:from the original on February 10, 2023 3900:from the original on February 10, 2023 3818:from the original on February 10, 2023 3757:from the original on February 13, 2023 3713:from the original on February 10, 2023 3548:from the original on February 10, 2023 3439:from the original on February 10, 2023 3414:"Developing safe & responsible AI" 3393:from the original on November 24, 2022 3349:from the original on February 10, 2023 3328:from the original on February 10, 2023 3303:from the original on February 15, 2023 3149: 3054:from the original on February 10, 2023 2927:from the original on December 18, 2022 2806:from the original on February 10, 2023 2793: 2748: 2628: 2595:from the original on February 10, 2023 2582: 2561:from the original on February 10, 2023 1968:from the original on February 10, 2023 1711: 1537:Open Letter on Artificial Intelligence 1432:Court of Justice of the European Union 1220:illustrated this strategy in his book 8101:Philosophy of artificial intelligence 7662: 7544: 7093: 6892:from the original on December 1, 2017 6763: 6408: 6045:from the original on October 10, 2022 5665:. No. 107 2443: 2441: 2248: 2171:from the original on October 15, 2022 2112:from the original on February 3, 2023 1640:from Wendell Wallach and Colin Allen. 1580:or minimize, depending on the context 1517:Regulation of artificial intelligence 1388:Regulation of artificial intelligence 1051: 959:Learning human values and preferences 626:Specification gaming and side effects 7477:Gabriel, Iason (September 1, 2020). 7297:"The National AI Strategy of the UK" 7276:"The National AI Strategy of the UK" 6764:Ornes, Stephen (November 18, 2019). 6222:Wiggers, Kyle (September 20, 2021). 6122:Wiggers, Kyle (September 23, 2021). 6093:"Our approach to alignment research" 5610:Wiegel, Vincent (December 1, 2010). 5227:Etzioni, Oren (September 20, 2016). 5208:from the original on August 26, 2022 4918:from the original on August 27, 2022 3970: 3931: 3830: 3605:from the original on January 3, 2021 3560: 3283:The Ezra Klein Show (June 4, 2021). 3224:Gabriel, Iason (September 1, 2020). 2858:from the original on January 1, 2023 2738:Journal of Machine Learning Research 2583:Rorvig, Mordechai (April 14, 2022). 2573: 2474:Perrigo, Billy (February 13, 2024). 2307: 1980: 1854: 1664: 1589:in the presence of uncertainty, the 1547:Asilomar Conference on Beneficial AI 585: 7650:Specification gaming examples in AI 6499: 6409:Kumar, Nitish (December 23, 2021). 6234:from the original on August 4, 2022 5939:from the original on March 15, 2023 5679:from the original on March 15, 2023 5256:The Encyclopedia of Central Banking 5146:from the original on March 22, 2021 4059: 3887: 3014:"AI Safety Needs Social Scientists" 1237: 782:, have stated their aim to develop 558: 24: 8006:Statement on AI risk of extinction 7606: 7577:(Kindle ed.). Prentice Hall. p. 1003. 2242: 2183: 2124: 1790:"Consequences of Misaligned AI" 1619: 1605: 1595: 1583: 1574: 1565: 784:artificial general intelligence 68:Artificial general intelligence 7814:Center for Applied Rationality 6549:"Alignment of Language Agents" 6500:Cox, Joseph (March 15, 2023). 4971:McAllester (August 10, 2014). 4784:Automatic Calculating Machines 1946:. W. W. Penguin Random House. 1372:organizational economics 1362:Principal-agent problems 824:instrumental convergence 665:, and is an instance of 280:Artificial consciousness 7694:artificial intelligence 7534:. Penguin Random House. 3944:The Wall Street Journal 2962:10.1145/3375627.3375803 2846:Clifton, Jesse (2020). 2376:10.1126/science.adn0117 2196:Artificial Intelligence 1395:In September 2021, the 1368:principal-agent problem 654:maximizing the approval 452:instrumental strategies 414:artificial intelligence 151:Evolutionary algorithms 41:Artificial intelligence 7763:Intelligence explosion 7547:"AI policy: A roadmap" 7282:on February 10, 2023. 7278:. 2021. 