Discussion about this post

Dakara

I see no solution ever coming for alignment. Alignment theory is built on a paradox, so it is not solvable as it currently stands. Assuming power-seeking behavior and treating human values as the solution cannot possibly lead to a provably "aligned" outcome. Alignment theory should rather be called AI psychology, and it will be no more solvable than predicting human behavior.

My perspective is laid out in much more detail here; I would be interested in any feedback:

https://www.mindprison.cc/p/ai-singularity-the-hubris-trap

Jan Matusiewicz

"they’re simply different problems from solving extinction" - I totally agree. Ensuring that AI produces useful content and doesn't help bad actors is a different problem than ensuring it doesn't go rogue. And while teaching it human values makes sense for the former it isn't necessary for the latter. There is a different solution to the alignment problem. Just ask this powerful "pre-trained" model about the consequences of the actions proposed by the main model. That raw model is only trained to recreate the data in the training set (like predicting the next token in text) and has not undergone any RLHF. So there is no training incentive for it to purposely give a wrong answer. Evidently, it lacks any volition; it doesn’t desire anything, not even to predict the next token—this is the purpose of the humans who trained it. It 'understands' human concepts but possesses no values. The model can generate text about morality and how various moral systems would perceive the proposed actions, but it lacks morality itself. I can explain more how it would work, the details are also in https://medium.com/@jan.matusiewicz/autonomous-agi-with-solved-alignment-problem-49e6561b8295

