ARC Evals is the evaluations project at the Alignment Research Center. Its work assesses whether cutting-edge AI systems could pose catastrophic risks to civilization.
As AI systems become more powerful, it becomes increasingly important to ensure they are safe and aligned with our interests. A growing number of experts are concerned that future AI systems could pose an existential risk to humanity: in one survey of machine learning researchers conducted by AI Impacts, the median respondent reported a 5% chance of an “extremely bad outcome (e.g. human extinction)”. One way to prepare for this is to evaluate current systems and receive warning signs if new risks emerge.
ARC Evals is contributing to the following AI governance approach:
1. Evaluate whether frontier AI systems are capable of doing extremely dangerous things (such as autonomous replication).
2. If they are, require strong guarantees that they won’t.
ARC Evals’ current work focuses primarily on evaluating capabilities (the first step above), in particular a capability they call autonomous replication: the ability of an AI system to survive on a cloud server, obtain money and compute resources, and use those resources to make more copies of itself.
ARC Evals was given early access to OpenAI’s GPT-4 and Anthropic’s Claude to assess them for safety. They determined that these systems are not yet capable of autonomous replication and could complete only “fairly basic steps towards autonomous replication”, but some of the steps they can take are already somewhat alarming. One highly publicised example from ARC Evals’ assessment was that GPT-4 successfully pretended to be a vision-impaired human to convince a TaskRabbit worker to solve a CAPTCHA.
If AI systems could autonomously replicate, what would the risks be?
To address such risks, ARC Evals is also exploring the development of safety standards that could ensure that even systems powerful enough to be dangerous won’t in fact cause harm. These standards could include security against theft by people who would use the system for harm, monitoring so that any surprising and unintended behaviour is quickly noticed and addressed, and sufficient alignment with human interests that the system would not choose to take catastrophic actions (for example, reliably refusing to assist users seeking to use the system for harm).
After investigating ARC Evals' strategy and track record, one of our trusted evaluators, Longview Philanthropy, recommended a grant of $220,000 from its public fund. Longview shared that they thought ARC Evals had among the most promising and direct paths to impact on AI governance: “test models to see if they’re capable of doing extremely dangerous things; if they are, require strong guarantees that they won’t.”
There are a few other positive indicators of the organisation’s cost-effectiveness:
As of July 2023, ARC Evals could make good use of millions of dollars in additional funding over the next 18 months.
Please note that GWWC does not evaluate individual charities. Our recommendations are based on the research of third-party, impact-focused charity evaluators our research team has found to be particularly well-suited to help donors do the most good per dollar, according to their recent evaluator investigations. Our other supported programs are those that align with our charitable purpose — they are working on a high-impact problem and take a reasonably promising approach (based on publicly-available information).
At Giving What We Can, we focus on the effectiveness of an organisation’s work: what the organisation is actually doing and whether its programs are making a big difference. Some others in the charity recommendation space focus instead on the ratio of admin costs to program spending, part of what we’ve termed the “overhead myth.” See why overhead isn’t the full story and learn more about our approach to charity evaluation.