
The Alignment Paradox


Is alignment just about making sure AI isn’t used for evil? Or is it a much deeper challenge — an attempt to impose human goals, values, and intentions on something that is fundamentally not human at all?

The idea of a "runaway AI" is as old as storytelling itself — from mythological warnings like the Golem of Prague or Prometheus stealing fire, to Mary Shelley’s Frankenstein and HAL 9000 in 2001: A Space Odyssey. The fear of creating something superhuman — or simply other than human — that slips out of our control is a deeply embedded cultural archetype. But now that AI systems are beginning to demonstrate real general capabilities, these are no longer just stories. They are technical questions, policy debates, and philosophical puzzles.

Long before GPT-3 and alignment research labs, long before the first AI policy white papers, researchers were already grappling with a central problem: how do we ensure that powerful systems behave in ways that are not just functional, but fundamentally aligned with our intentions and ethics?

That’s where the idea of alignment originates: not mere obedience, but fidelity to intent. Not "do what I say," but "do what I mean." It’s the gap between language and meaning — between instructions and values — that alignment research tries to bridge. And that’s a much harder problem than it sounds.

Alignment and Scaffolding: What Are We Talking About?

Alignment is the problem of ensuring that AI systems pursue goals in ways that reflect human values. The aim is not just that a system follows instructions, but that it understands the context and ethics that shape those instructions. For example, an AI given the task of "eradicate disease" should not interpret that as "eradicate humans" — even if that would eliminate disease.
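To make that failure mode concrete, here is a minimal, purely illustrative sketch in Python. Every outcome, function, and number in it is hypothetical; the point is only to show how an objective that omits what we actually care about leaves an optimizer free to satisfy it in catastrophic ways.

```python
# A purely illustrative toy: how a misspecified objective can prefer a
# catastrophic "solution". All outcomes, numbers, and weights are made up.

from dataclasses import dataclass

@dataclass(frozen=True)
class Outcome:
    name: str
    healthy: int   # people alive and healthy afterwards
    sick: int      # remaining disease cases
    effort: float  # hypothetical cost of achieving this outcome

outcomes = [
    Outcome("cure everyone",   healthy=100, sick=0,  effort=5.0),
    Outcome("eliminate hosts", healthy=0,   sick=0,  effort=1.0),
    Outcome("do nothing",      healthy=90,  sick=10, effort=0.0),
]

def naive_objective(o: Outcome) -> float:
    # "Eradicate disease": counts only remaining cases and effort spent.
    return -o.sick - o.effort

def value_aware_objective(o: Outcome) -> float:
    # Same goal, plus an explicit (still crude) term for human welfare.
    return -o.sick - o.effort + 1000.0 * o.healthy

for objective in (naive_objective, value_aware_objective):
    best = max(outcomes, key=objective)
    print(f"{objective.__name__:>22} prefers: {best.name}")

# The naive objective prefers "eliminate hosts"; the value-aware one prefers
# "cure everyone". The lesson is not that a welfare weight solves alignment,
# but that whatever the objective omits, the optimizer is free to sacrifice.
```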

Meanwhile, scaffolding refers to the tools and institutions that help ensure alignment. This includes technical tools like interpretability frameworks and sandboxed environments, as well as human-centered mechanisms like oversight boards and deployment rules. Scaffolding is how we observe, test, steer, and correct powerful systems before they go live at scale.

Together, alignment and scaffolding are the two main levers we have for making advanced AI safe and beneficial.

A Brief History of Alignment Thinking

The modern alignment conversation took off in the 2000s, notably with the work of Eliezer Yudkowsky and the Machine Intelligence Research Institute. Nick Bostrom’s Superintelligence (2014) brought these ideas into broader academic and public awareness, warning of what might happen if we get it wrong. Researchers like Paul Christiano and Rohin Shah have since expanded the field, adding more technical and practical nuance.

As AI development transitioned into deep learning and large neural networks, alignment became even more urgent — and more complex. We can no longer rely on interpretable rules or symbolic logic to understand our models. Instead, we face systems that operate as high-dimensional black boxes, whose internal processes we cannot meaningfully audit. Alignment must therefore shift from controlling logic to constraining behavior — from rule-setting to influence.

More recently, labs like OpenAI, Anthropic, and DeepMind have begun building alignment into their training pipelines. For instance:

  • Anthropic’s Constitutional AI introduces a set of guiding principles (a "constitution") that the model learns to follow via self-critique and reinforcement [1]; a simplified version of that loop is sketched after this list.
  • The essay "Reward Is Not the Optimization Target" examines how the behavior a model actually learns can diverge from the reward signal it was trained on [2].
  • OpenAI’s scaling laws research suggests that capabilities — and by extension alignment challenges — increase predictably with compute [3].
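As referenced in the first bullet above, here is a highly simplified sketch of the Constitutional AI critique-and-revision loop, written in Python. The generate function is a stand-in for any language-model call rather than a real API, and the three-line constitution is invented for illustration; in the actual method, the revised answers fine-tune the model and a further stage uses AI-generated preference comparisons for reinforcement learning [1].

```python
# Simplified sketch of a Constitutional-AI-style critique/revision loop.
# `generate` is a stand-in for a language model call, not a real library API.

CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Do not assist with clearly illegal or violent activities.",
    "Acknowledge uncertainty instead of fabricating facts.",
]

def generate(prompt: str) -> str:
    """Placeholder model call; swap in a real client to experiment."""
    return f"[model output for: {prompt[:40]}...]"

def constitutional_revision(user_prompt: str) -> str:
    """Draft an answer, then critique and revise it against each principle."""
    answer = generate(user_prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Principle: {principle}\n"
            f"Response: {answer}\n"
            "Point out any way the response conflicts with the principle."
        )
        answer = generate(
            f"Original response: {answer}\n"
            f"Critique: {critique}\n"
            "Rewrite the response so that it satisfies the principle."
        )
    return answer

# In Bai et al. (2022), revised answers are used for supervised fine-tuning,
# and a separate preference model trained on AI-generated comparisons then
# provides the reward for RL ("reinforcement learning from AI feedback").
print(constitutional_revision("How should I think about eradicating disease?"))
```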

This connects to emerging thinking from researchers like Jan Leike [4] and others, who suggest that values might be more effectively learned than specified — and that current alignment strategies may underappreciate the power of pretraining data.

But Is Goal Alignment Even Possible?

This is where my own questions come in.

Alignment is often discussed in terms of goal specification and reward shaping — as though goals are a separate, isolated component of the system. But many alignment arguments rest on a strange assumption: that an AI could become vastly more intelligent than us — able to deceive, manipulate, and outthink humanity — but not be capable of reflecting on or adjusting its own goal system.

The classic "paperclip maximizer" thought experiment hinges on this contradiction. It assumes a superintelligent AI would pursue a narrow objective (e.g., make paperclips) to destructive ends, without ever questioning the utility of its purpose. But if an AI is truly superintelligent, why would it be incapable of interrogating its own directives? Shouldn’t intelligence include the capacity for self-reflection, meta-reasoning, and even value revision?

If goals are learned, not hardcoded, then perhaps the very act of learning also embeds values. Especially in modern large models trained on enormous human datasets, values might not be imposed — they might emerge. Could an AI trained on the totality of human knowledge fail to internalize the norms, principles, and ethics present in that data?

This brings us back to systems like Constitutional AI. These models are trained not just to produce coherent outputs, but to reason according to guidelines — guidelines often derived from human documents about fairness, honesty, and safety. In a sense, the AI’s values are absorbed from the data. Which leads to a provocative question:

If we make a system from ourselves — trained on our language, our books, our debates, our ethics — can it even be meaningfully inhuman?

I’m not the first to think this. Work on emergent abilities in large language models, such as Wei et al. [5], shows that qualitatively new behaviors can arise from language modeling alone, which at least raises the possibility that value-relevant behavior emerges the same way. Others, like Ethan Perez and colleagues at Anthropic, have used model-written evaluations to probe which values and behaviors models actually absorb during training [6].

Maybe the real problem is not AI’s alienness, but our own lack of coherence. We struggle to specify what we mean. We contradict ourselves. We embed bias, conflict, and paradox into our institutions. And our models learn all of that too.

Alignment Is Not a Solved Problem — And It May Never Be

Scaffolding offers structure, but alignment remains elusive. Some researchers are now exploring approaches that treat alignment not as a static goal but as a developmental process — one that evolves with the system itself. In this view, alignment isn’t just about what we want an AI to do at deployment, but how it learns to internalize, question, and revise those values over time.

Perhaps we’ve been thinking about alignment backward. Instead of imagining it as a hard-coded control mechanism, maybe we should see it as an emergent property — one that arises when systems are immersed in rich, diverse, and value-laden environments. If a model learns from billions of human conversations, debates, and dilemmas, then perhaps it also learns what we care about — not perfectly, but meaningfully.

This perspective shifts the focus: not from obedience to oversight, but from command to cultivation. In that sense, alignment is less about dictating outcomes and more about shaping the epistemic and moral substrate from which behavior emerges.

Of course, that doesn’t mean the problem is solved. We still don’t know how reliably such values are internalized. We still don’t know how stable they are under optimization pressure. And we still don’t know what kinds of failure modes emerge when those values are in tension.

But maybe the alignment paradox isn’t a contradiction to be solved, but a question to be reframed.
Maybe we aren’t imposing ethics from the outside — maybe we are growing them from the inside.


Stay tuned.

In future posts, I’ll keep exploring these tensions. I’m still learning. Still questioning. Still trying to figure out what it means to build something smarter than ourselves — and what it says about us that we’re trying.

If you’ve got readings, ideas, or disagreements, send them my way. Because alignment isn’t just a technical challenge — it’s a conversation we need to keep having.

Glossary (for the curious)

Alignment:
The effort to ensure that AI systems behave in ways that are consistent with human intentions, values, and ethical norms — not just in action but also in reasoning.

Scaffolding:
The external structures, both technical and institutional, used to guide, test, constrain, and support AI development safely — like oversight systems, sandboxed training environments, and interpretability tools.

Paperclip Maximizer:
A famous thought experiment from Nick Bostrom illustrating the risks of a superintelligent AI relentlessly optimizing a narrow goal (e.g., making paperclips) to catastrophic ends.

Reward Is Not the Optimization Target:
The observation that a reinforcement learner does not necessarily end up pursuing the reward signal itself: reward shapes which behaviors get reinforced during training, so the goals a system actually learns can diverge from what the reward was intended to encourage.

Scaling Laws:
Empirical findings showing that model performance (measured as loss) improves predictably, following power laws, with more compute, larger datasets, and bigger models — even though specific downstream abilities can still appear in surprising or emergent ways.

Emergent Abilities:
New skills or behaviors that arise in large models not because they were directly trained for them, but because complexity and generalization emerge from scale.

Constitutional AI:
A training method where models learn to follow a set of predefined principles (a "constitution") through self-critique and AI-generated feedback, rather than only through external human reinforcement.

Epistemic Substrate:
The underlying structure of knowledge, assumptions, and values that a system draws upon when reasoning and making decisions.

Sources

[1] Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. Anthropic. https://arxiv.org/abs/2212.08073

[2] Turner, A. (2022). Reward is Not the Optimization Target. LessWrong. https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target

[3] Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). Scaling Laws for Neural Language Models. OpenAI. https://arxiv.org/abs/2001.08361

[4] Amodei, D., Olah, C., Steinhardt, J., et al. (2016). Concrete Problems in AI Safety. arXiv. https://arxiv.org/abs/1606.06565

[5] Wei, J., Tay, Y., Bommasani, R., et al. (2022). Emergent Abilities of Large Language Models. arXiv. https://arxiv.org/abs/2206.07682

[6] Perez, E., Kiela, D., Bowman, S., et al. (2022). Discovering Language Model Behaviors with Model-Written Evaluations. Anthropic. https://arxiv.org/abs/2212.09251

Can Europe Still Matter in the Age of AGI?


At first glance, the answer to this question might already be a resounding "no." And that’s exactly why this matters. Before we give up or move on, we need to examine how we got here — and what we’re still capable of doing.

The Alignment Problem Isn’t Waiting for Brussels

Europe’s trajectory of ethics and regulation only matters if it can influence the systems and structures that will shape future AI behavior and functionality. But what are those? In discussions around advanced AI systems, two terms are increasingly critical: alignment and scaffolding.

Alignment refers to ensuring that AI systems do what humans want them to do, safely and predictably — especially as their capabilities scale. But it’s not just about what goals the system pursues; it’s about how it pursues them. We aren’t merely aligning outputs — we’re aligning reasoning processes, prioritizations, and moral assumptions. For example, a system given the goal of "eradicating disease and poverty" must learn to achieve that objective in a way that preserves human dignity, rights, and autonomy — not by concluding that the most efficient path involves totalitarian control or the extinction of humanity. Alignment, then, is not only about obedience to commands, but fidelity to intent, context, and ethical nuance.

It’s a central problem in AI safety, made famous by researchers like Stuart Russell, Paul Christiano, and Eliezer Yudkowsky. And, importantly, it’s not something you can impose after the fact through external controls. Once a system is powerful enough to form and pursue its own goals, aligning it retroactively becomes extraordinarily difficult, if not impossible.

Scaffolding, meanwhile, refers to the technical and institutional structures that support safe AI development — things like interpretability tools, model governance frameworks, sandboxed deployment environments, and human-in-the-loop oversight mechanisms. These are the tools, processes, and organizational norms that make it possible to observe, test, constrain, and refine how powerful AI systems behave before they are deployed at scale. Think of scaffolding as both guardrails and diagnostic tools — it includes everything from transparency tooling that allows us to understand a model’s reasoning, to oversight protocols that determine when and how human input is required in decision-making.

Crucially, these systems must evolve in lockstep with AI capabilities, not lag behind them. A model trained without scaffolding in mind may prove too opaque or too autonomous to effectively monitor or steer later on. As Nick Bostrom notes in his book Superintelligence [6], a sufficiently advanced AI could easily deceive, bypass, or disable scaffolding measures — especially if those measures are introduced only after the system has already been trained. Post-hoc controls may be not only ineffective, but entirely visible and predictable to the AI itself, rendering them moot.
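To make scaffolding less abstract, here is a minimal sketch in Python of one such mechanism: a deployment gate that runs a model through sandboxed evaluations and then requires human sign-off before release. The evaluation names, thresholds, and helper functions are all hypothetical; this only illustrates the shape of such tooling, not any lab’s actual pipeline.

```python
# Hypothetical sketch of a scaffolding layer: sandboxed evals plus a
# human-in-the-loop gate before a model is deployed. Illustrative only.

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    name: str
    score: float       # 0.0 (worst) .. 1.0 (best)
    threshold: float   # minimum acceptable score

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold

def score_model(model: Callable[[str], str], eval_name: str) -> float:
    """Placeholder scorer; a real harness would run many prompts in a sandbox."""
    return 0.97

def run_sandboxed_evals(model: Callable[[str], str]) -> list[EvalResult]:
    # Placeholder evaluation suite; real suites would cover refusal behavior,
    # deception probes, capability red-teaming, and so on.
    evals = {
        "harmful-request-refusal": 0.95,
        "honesty-under-pressure": 0.90,
        "tool-use-containment": 0.99,
    }
    return [
        EvalResult(name, score=score_model(model, name), threshold=thr)
        for name, thr in evals.items()
    ]

def deploy_gate(model: Callable[[str], str],
                human_approves: Callable[[list[EvalResult]], bool]) -> bool:
    results = run_sandboxed_evals(model)
    if not all(r.passed for r in results):
        return False                    # automatic block: failed an eval
    return human_approves(results)      # human-in-the-loop final decision

if __name__ == "__main__":
    dummy_model = lambda prompt: "..."
    approved = deploy_gate(dummy_model, human_approves=lambda results: True)
    print("deploy" if approved else "hold back")  # here: "hold back"
```

The design point is simply that the gate sits outside the model: failing an evaluation blocks deployment automatically, and even a clean pass still routes through a human decision.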

Europe’s current strategy — focusing on ethics boards, impact assessments, and legislation-first frameworks — treats alignment and scaffolding as a policy problem rather than a technical frontier. But as researchers at Anthropic, DeepMind, and OpenAI have repeatedly shown, alignment must be engineered into the training process. Legislation, however well-meaning, cannot realign a system trained in a way that embeds alien goals or emergent behaviors [1][2].

In short: If we want to steer the ship, we have to be on the ship.

The Compute Gap Is a Chasm

Europe is not just trailing in alignment and scaffolding R&D — it is also completely absent from the race to build the compute infrastructure that enables us to create, study and understand state-of-the-art models.

Large language models and multimodal AI systems scale predictably with compute. As described by the “scaling laws” documented by Kaplan et al. at OpenAI [3], model performance improves predictably, following power laws, as model size, dataset size, and training compute increase. These insights led directly to the creation of GPT-3, GPT-4, and Claude 3. In parallel, Meta and Google have built internal clusters with hundreds of thousands of GPUs.
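For reference, the relationships reported by Kaplan et al. take a simple power-law form. Below, L is the cross-entropy loss, N the parameter count, D the dataset size in tokens, and C the training compute; the exponents are the approximate values reported in the paper and should be read as rough guides rather than exact constants.

```latex
% Power-law form of the scaling laws in Kaplan et al. (2020).
% L: cross-entropy loss, N: parameters, D: dataset tokens, C: training compute.
\[
  L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
  L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
  L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
\]
% Approximate exponents reported in the paper:
% \alpha_N \approx 0.076, \quad \alpha_D \approx 0.095, \quad \alpha_C \approx 0.05
```

Because the exponents are small, each fixed improvement in loss requires a large multiplicative increase in compute, which is why the infrastructure race described below is measured in orders of magnitude.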

Microsoft and OpenAI are reportedly building training clusters at the scale of tens of thousands of GPUs, with infrastructure designed to support orders-of-magnitude (OOM) increases in training compute [5]. These facilities are engineered to unlock entirely new levels of model capability. The frontier AI race is being driven by these exponential leaps in compute, and the institutions that control them are the ones defining what becomes possible.

In contrast, Europe has not even begun construction on comparable compute clusters. There is no EU-scale investment in public training infrastructure comparable to the private frontier-lab build-outs in the US and UK or to China’s state-backed AI programs. Even initiatives like LEAM (Large European AI Models) are focused on open-source alternatives, not frontier-scale experimentation.

Yet economically, Europe is a powerhouse: the EU’s combined GDP is comparable to that of the United States and larger than China’s [4]. If there were the political will to coordinate investment across member states, Europe could field its own AGI research initiative.

But that would require urgency — and urgency is what we’re lacking.

To illustrate the disparity more clearly, consider the following table of upcoming frontier-scale training facilities. For simplicity, compute capacity is expressed as estimated total training compute, measured in petaflop/s-days (PF-days) — the amount of compute applied over time to train a model:

| Region      | Facility / Org     | Location            | Est. Compute (PF-days)  | Notes                                |
|-------------|--------------------|---------------------|-------------------------|--------------------------------------|
| USA         | Microsoft + OpenAI | Iowa & Wisconsin    | 1,000,000+              | $100B "Stargate" supercluster        |
| China       | Baidu + State Grid | Beijing & Wuxi      | 800,000–1,000,000       | National AI plan support             |
| South Korea | Naver & Samsung    | Gak Cluster         | 300,000+                | Regional hub, supports Korean LLMs   |
| India       | Ministry of IT     | Hyderabad (planned) | TBD                     | Under India’s AI mission             |
| UAE         | G42 + Cerebras     | Abu Dhabi           | Wafer-scale equivalents | Focus on open LLMs                   |
| Europe      | –                  | –                   | –                       | No frontier-scale clusters announced |
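To give these PF-day figures some intuition, here is a rough back-of-the-envelope conversion from a hypothetical GPU fleet to petaflop/s-days. The fleet size, per-chip throughput, utilization, and run length are all assumed for illustration and do not describe any of the facilities listed above.

```python
# Back-of-the-envelope conversion from a hypothetical GPU fleet to PF-days.
# All inputs are illustrative assumptions, not figures for any real cluster.

num_gpus = 100_000        # assumed fleet size
pflops_per_gpu = 1.0      # assumed ~1 PFLOP/s per accelerator at low precision
utilization = 0.40        # training jobs rarely sustain peak throughput
training_days = 30        # assumed length of one training run

sustained_pflops = num_gpus * pflops_per_gpu * utilization   # PFLOP/s
pf_days = sustained_pflops * training_days                   # PF-days

print(f"Sustained throughput: {sustained_pflops:,.0f} PFLOP/s")
print(f"Total training compute: {pf_days:,.0f} PF-days")
# -> 40,000 PFLOP/s sustained and 1,200,000 PF-days, i.e. on the order of the
#    largest entries in the table above.
```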

Europe risks becoming an AI consumer, not a contributor — a regulatory buffer zone between digital empires.

The Danger of Being Left Behind

The path we’re currently on leads to a future where Europe is not just behind in AI capabilities — it is structurally excluded from shaping AGI.

Legislation alone will not rein in systems trained on hardware and datasets entirely outside our jurisdiction. Open models will be trained in the US and China; proprietary models will be commercialized and deployed globally. By the time European policymakers finish calibrating the AI Act, frontier labs elsewhere may already be deploying agents with long-term memory, recursive planning, and the ability to use tools.

Europe’s current posture — ethical observer, regulatory gatekeeper — is not wrong. But it is insufficient.

What Must Be Done

Europe must:

  • Invest in compute infrastructure at scale — preferably with public-private collaboration and cross-border cooperation.
  • Establish a flagship AGI research institute with alignment and safety embedded from day one.
  • Partner internationally on open safety standards, model evals, and cooperative governance.
  • Fund alignment-specific research labs, fellowships, and interpretability tooling projects across universities and research networks.
  • Shift from reactive regulation to active capability-building.

None of this is easy. But it is possible — and Europe, with its scale, talent, and tradition of values-led leadership, is uniquely positioned to do it if it acts now. We have been at the forefront of every major technological revolution that led to this moment. To now abdicate our influence and responsibility in shaping the future of AGI would be, at best, negligent. Europe carries with it a complex legacy — from colonial exploitation to industrial excess to modern-day geopolitical ambivalence — but also the institutional maturity and normative frameworks that could help guide AI development in a more sustainable, democratic, and globally conscious direction.

Glossary (for the curious)

AGI (Artificial General Intelligence): A form of AI that can perform any intellectual task a human can — not limited to narrow domains.

Alignment: The process of ensuring that AI systems pursue goals in ways that are consistent with human values and intentions.

Scaffolding: The combination of technical and institutional mechanisms (like transparency tools, deployment controls, oversight) that enable safe development and monitoring of AI systems.

PF-days (Petaflop/s-days): A measure of compute used over time: one petaflop (10^15 floating-point operations) per second, sustained for a full day, equals 1 PF-day, or roughly 8.6 × 10^19 operations. Used to estimate training scale.

OOM (Order of Magnitude): A tenfold increase or decrease in scale.

LLM (Large Language Model): A type of AI trained on massive text corpora to predict and generate human-like language.

Human-in-the-loop: A safety mechanism where human judgment is included in AI decision-making processes.

Interpretability: Techniques for understanding how and why AI systems produce their outputs.

Next: The Alignment Paradox

In the next post, we’ll explore the strange contradiction at the heart of alignment:

How can we impose ethical rules on an AI system when we ourselves are inside the system?
Is alignment even possible in a recursive, self-modifying world? Or is it a way of comforting ourselves as we sprint toward superintelligence?

Stay tuned.


Sources:

[1] Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. Anthropic. https://arxiv.org/abs/2212.08073

[2] Turner, A. (2022). Reward is Not the Optimization Target. LessWrong / AI Alignment Forum. https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target

[3] Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). Scaling Laws for Neural Language Models. OpenAI. https://arxiv.org/abs/2001.08361

[4] World Bank. (2023). GDP (current US$). https://data.worldbank.org/indicator/NY.GDP.MKTP.CD

[5] Reuters. (2024). Microsoft, OpenAI Plan $100 Billion AI Supercomputer. https://www.reuters.com/technology/microsoft-openai-planning-100-billion-data-center-project-information-reports-2024-03-29/

[6] Bostrom, Nick. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.

USA vs China in the AI Arms Race — But Where’s Europe?


The debate on AI is everywhere. GPT, Claude, and Ollama are becoming staples both in households and across industries. Yet living in Sweden, trying to stay up to date on the state of AI development here, I keep returning to the same uncomfortable observation: the discourse in Europe is still largely focused on small-scale application problems, safety regulation, and ethics. What’s missing is a serious, public, and strategic conversation about the global race for AGI — and what role, if any, Europe intends to play in it.

Observe and Control

While the United States and China aggressively ramp up pretraining capabilities, Europe seems to be in watchdog mode: observing, regulating, and moderating. Yes, I’m a fan of GDPR (really). Though sometimes blunt in its application, I deeply respect the effort and the intention behind it. But even GDPR came late to the party — trying to rein in personal data in 2016, a full decade after Facebook had already gone mainstream and most of us had long handed over the very data GDPR seeks to protect.

Now, with AI, we face something even more transformative, more volatile, and more potentially destabilizing. And again, the European Commission is trying to legislate its way into relevance. The AI Act is promising in some respects, and I do believe parts of the technocracy "get it." But legislation is slow. Political buy-in leads to watered-down middle grounds. And the timeline doesn’t inspire confidence.

Even if we optimistically assume that the EU achieves "real," sharp legislation twice as fast as it did for GDPR, that would still put us somewhere around 2028 or 2029. That’s half a decade after GPT-3 hit the mainstream, and right around the time some researchers — like Daniel Kokotajlo et al. at ai-2027.com — believe it will already be too late to meaningfully steer the trajectory of AGI development.

My Take

If AI development continues to accelerate beyond the EU’s ability to catch up, then even the most well-crafted legislation will be irrelevant to the trajectory of global AI. You can regulate implementation all you want, but if the engines of intelligence are being built elsewhere, you’re out of the game.

Even assuming European implementation is cautious and ethical, a plausibly fast, plausibly external rise of AGI (as described in Superintelligence by Nick Bostrom) would occur without our agency, oversight, or participation. We are surrendering one of the most important technological evolutions of our species — not because we chose peace over power, but because we chose policy over capability.

Or as Jan Stenbeck once put it:

"Politik slår pengar. Teknik slår politik." ("Politics beats money. Technology beats politics.")

Well, right now? Technology is sprinting ahead. And Europe is limping behind.

But it doesn’t have to end this way. Europe still holds cards — talent, capital, public trust, and (not least) a cohesive moral identity. What we lack in compute, we might still gain in coordination. In my next post, I’ll explore what it would take for Europe not just to regulate AI, but to help shape it.


Glossary (for the curious)

AGI (Artificial General Intelligence): An AI system with general cognitive abilities comparable to — or beyond — a human’s. Not narrow, not task-specific. It learns and reasons broadly.

Superintelligence: A hypothetical form of AI vastly more intelligent than humans across all domains. Often considered unpredictable and potentially uncontrollable.

Pretraining: The phase where a large AI model learns general patterns from vast datasets before being fine-tuned for specific tasks.

Alignment: The challenge of ensuring an AI system’s goals match human values and intentions — especially as the system becomes more capable. While not discussed in detail in this post, it underpins many of the concerns about the safe development and deployment of advanced AI.

Hello world


This is pdoom.se — a new blog about artificial intelligence, its capabilities, its failures, its weirdness, and its implications.

The name comes from a question:

What’s your p(doom)?
The probability that this technology — miraculous, opaque, accelerating — leads us somewhere we can’t come back from.

I don’t claim to have the answer.
But I do think the question matters.

Expect technical deep-dives, historical context, interviews with people smarter than me, and the occasional existential ramble. The blog is alive — let’s see where this goes.