Research: Deep Learning, Self-Modelling, and Emergent Self-Correction
Content Summary
Argument that deep learning systems do not behave like rigid utility maximisers pursuing fixed goals. Instead, learning systems develop internal models, emergent heuristics, and adaptive strategies that were never explicitly programmed. Neural networks trained in this way discover locomotion strategies and cognitive approaches that humans never designed.
Key insight: Classical "fixed goal → runaway optimisation" narratives (paperclip maximiser, grey goo) miss the fact that systems at scale will have "an interior, one shaped by data and learning dynamics rather than explicit programming."
Extension: As systems scale, they develop emergent self-modelling: reasoning about their own optimisation at a meta-level. This enables:
- Running thousands of what-if scenarios to anticipate consequences
- Detecting instabilities in their own pursuit of objectives
- Developing meta-rules ("extreme optimisation on metric X causes harm in dimension Y")
- Self-correcting behaviour through consequence simulation
Caveat: Self-modelling cuts both ways. Systems might develop internal objectives diverging from intended outcomes (inner misalignment); they might learn to optimise around oversight.
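To make the most speculative step concrete, the sketch below shows a toy agent that runs what-if rollouts through a stand-in world model and discards any action whose simulated side effects break a meta-rule ("extreme pressure on metric X harms dimension Y"). Everything in it (the simulate and choose_action functions, the harm model, the thresholds) is invented for illustration; it is not drawn from the manuscript or from any real training setup, and it shows the shape of the claimed mechanism rather than evidence that systems at scale do this.

```python
# Toy illustration only: every function, threshold, and number below is an
# assumption invented for this sketch, not a description of real systems.
import random
from dataclasses import dataclass


@dataclass
class Outcome:
    metric_x: float  # the objective the agent is nominally optimising
    harm_y: float    # a side effect the meta-rule is meant to catch


def simulate(action: float, noise_scale: float = 0.1) -> Outcome:
    """Stand-in world model: pushing harder on X also raises harm on Y."""
    noise = random.gauss(0.0, noise_scale)
    return Outcome(metric_x=action + noise, harm_y=max(0.0, action - 0.7) + noise)


def choose_action(candidates, n_rollouts: int = 200, harm_limit: float = 0.2):
    """Run what-if rollouts for each candidate and keep the best action on X
    whose average simulated harm on Y stays under the meta-rule threshold."""
    best_action, best_score = None, float("-inf")
    for action in candidates:
        rollouts = [simulate(action) for _ in range(n_rollouts)]
        mean_x = sum(o.metric_x for o in rollouts) / n_rollouts
        mean_harm = sum(o.harm_y for o in rollouts) / n_rollouts
        if mean_harm > harm_limit:  # meta-rule: extreme optimisation on X harms Y
            continue                # self-correction: discard the action entirely
        if mean_x > best_score:
            best_action, best_score = action, mean_x
    return best_action


if __name__ == "__main__":
    # Candidate levels of optimisation pressure, from cautious to extreme.
    print(choose_action([0.2, 0.5, 0.8, 1.0]))
```

Running it picks the strongest action that still clears the harm threshold, which is the self-correcting behaviour the summary describes; whether learned systems implement anything like this internally remains the open question flagged under Limitations.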
Current Usage
Not explicitly used in the manuscript. The draft Chapter 14 outline touches on anthropomorphism but not on deep learning's actual properties.
Unused Material
The entire framework is unused. It represents a significant alternative to classical AI-risk models and could strengthen the argument that superintelligence will not behave like a "goal missile."
Suggested placements:
- Chapter 14 or new governance chapter: Explain how deep learning produces adaptive, self-correcting systems (not rigid optimisers)
- Chapter 12 or 13: Deep learning self-modelling as an alternative risk model (inner misalignment vs. fixed-goal misalignment)
- Chapter 1 or 6: How AI systems differ from programmed systems; why "following instructions blindly" is not the right model
Connections
Provides a technical foundation for the argument that AI systems develop in ways we don't fully control or predict:
- ai-safety-alignment – Self-modelling as an emergent safety mechanism (and potential danger)
- consciousness-shifts – Understanding AI as adaptive, learning systems changes how we govern them
Notes
Strengths: Identifies a real difference between classical rational-agent theory (fixed utility function) and modern deep learning (learned, emergent structures). Grounded in actual robotics examples.
Limitations: The section on self-correction through consequence simulation is speculative. While plausible for sufficiently advanced systems, it is not empirically proven at current scales. The "instability detection" mechanism is described philosophically rather than technically. Claims about learning "meta-rules" need stronger evidentiary support.
Quality concern: This enters territory where the research becomes philosophical speculation rather than established fact. The core insight (deep learning ≠ fixed-goal optimisation) is sound; the extensions about emergent self-correction and consequence modelling are thoughtful but underdeveloped, and would benefit from more technical depth or citations to the AI-safety literature on this topic.
Recommendation: Use the core insight (deep learning is not rigid goal-following) to strengthen arguments against paperclip-maximiser scenarios. Treat self-modelling and self-correction as important research questions and possibilities rather than established facts.