The Artificiality of Alignment
On the stakes of AI progress and claims about AI's existential risk.
Credulous, breathless coverage of “AI existential risk” (abbreviated “x-risk”) has reached the mainstream. Who could have foreseen that the smallcaps onomatopoeia “ꜰᴏᴏᴍ” — both evocative of and directly derived from children’s cartoons — might show up uncritically in the New Yorker? More than ever, the public discourse about AI and its risks, and about what can or should be done about those risks, is horrendously muddled, conflating speculative future danger with real present-day harms, and, on the technical front, confusing large, “intelligence-approximating” models with algorithmic and statistical decision-making systems.
I see no solution ever coming for alignment. Alignment theory is built on a paradox, so it is not solvable as it currently stands: assuming power-seeking behavior while treating human values as the solution could never lead to a provably "aligned" outcome. Alignment theory would be better called AI psychology, and it will be no more solvable than predicting human behavior.
My perspective is laid out in much more detail here; I'd be interested in any feedback.
https://www.mindprison.cc/p/ai-singularity-the-hubris-trap
"they’re simply different problems from solving extinction" - I totally agree. Ensuring that AI produces useful content and doesn't help bad actors is a different problem than ensuring it doesn't go rogue. And while teaching it human values makes sense for the former it isn't necessary for the latter. There is a different solution to the alignment problem. Just ask this powerful "pre-trained" model about the consequences of the actions proposed by the main model. That raw model is only trained to recreate the data in the training set (like predicting the next token in text) and has not undergone any RLHF. So there is no training incentive for it to purposely give a wrong answer. Evidently, it lacks any volition; it doesn’t desire anything, not even to predict the next token—this is the purpose of the humans who trained it. It 'understands' human concepts but possesses no values. The model can generate text about morality and how various moral systems would perceive the proposed actions, but it lacks morality itself. I can explain more how it would work, the details are also in https://medium.com/@jan.matusiewicz/autonomous-agi-with-solved-alignment-problem-49e6561b8295