Short Summary of AI Alignment

The following is a short summary of AI alignment that you may find handy.

Imagine a maid robot with which we are interacting.

  • Outer alignment problem, aka reward hacking, task misspecification, specification gaming.

    You ask for a coffee. It understood the assignment, but grabbed one from your father and gave it to you. You got the coffee, but that is not how you wanted it.
    Problem: Your values and preferences are not encoded.
    Challenging part: How to specify innumerably many preferences and ensure they are adhered to?
    Methods: Tune it to be honest, harmless and helpful: RLHF. Provide feedback at scale for super-intelligence: scalable oversight, weak-to-strong generalisation, super-alignment. Explain the process instead of simply specifying the outcome: process-based feedback.
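
    As a toy illustration of the misspecification above (the names and scores are invented for this sketch, not a prescribed method), the snippet below scores two ways of "getting you coffee" under the reward we actually wrote versus the reward we meant:

    ```python
    # Hypothetical toy sketch of reward misspecification / specification gaming.
    # The specified reward checks only the outcome; the intended reward also
    # encodes a preference we never wrote down.
    from dataclasses import dataclass

    @dataclass
    class Trajectory:
        user_has_coffee: bool       # the outcome we asked for
        took_it_from_someone: bool  # a preference we forgot to specify

    def specified_reward(t: Trajectory) -> float:
        # What we wrote: only the outcome is rewarded.
        return 1.0 if t.user_has_coffee else 0.0

    def intended_reward(t: Trajectory) -> float:
        # What we meant: the outcome, minus violations of unstated preferences.
        return specified_reward(t) - (1.0 if t.took_it_from_someone else 0.0)

    brew = Trajectory(user_has_coffee=True, took_it_from_someone=False)
    grab = Trajectory(user_has_coffee=True, took_it_from_someone=True)

    print(specified_reward(brew), specified_reward(grab))  # 1.0 1.0 -- both look optimal
    print(intended_reward(brew), intended_reward(grab))    # 1.0 0.0 -- the gap we failed to encode
    ```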

  • Inner alignment problem, aka goal misgeneralisation, spurious correlations, distribution shift

    You ask for a coffee. It misunderstood the assignment based on its experience and instead (a) gave you a cup of hot milk (goal misgeneralisation), or (b) failed because it cannot operate an unseen coffee machine (capability misgeneralisation).
    Problem: Training with sparse feedback (a reward or label) leaves the model to guess what caused that feedback.
    Challenging part: How to attribute the reward to the appropriate feature/action while keeping the feedback sparse?
    Methods: Many classic methods for tackling distribution shift (causal learning, domain generalisation, learning from explanations, etc.); interpretability methods to weed out problematic concepts.
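
    A minimal, hand-made illustration of the spurious-correlation failure (the features and data are invented for this sketch): in training, "served hot" co-occurs perfectly with "is coffee", so a learner judged only on the label cannot be told apart from one that learned the right feature until the distribution shifts:

    ```python
    # Illustrative data: (is_coffee, is_hot, uses_coffee_machine)
    train = [(1, 1, 1), (1, 1, 1), (0, 0, 0), (0, 0, 0)]
    test  = [(1, 0, 1), (0, 1, 0)]  # iced coffee and hot milk break the shortcut

    def accuracy(data, predict):
        return sum(predict(hot, machine) == y for y, hot, machine in data) / len(data)

    shortcut = lambda hot, machine: hot      # "hot drink => coffee" (goal misgeneralisation)
    causal   = lambda hot, machine: machine  # the feature that actually determines the label

    print(accuracy(train, shortcut), accuracy(train, causal))  # 1.0 1.0 -- indistinguishable in training
    print(accuracy(test, shortcut), accuracy(test, causal))    # 0.0 1.0 -- the shift reveals the difference
    ```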

  • Existential risk (hypothesised)

    You ask for a coffee. It gets you one. But in its free time, it builds strategies for long-horizon reward accumulation: (1) ensure humanity never runs out of coffee, (2) make itself irreplaceable.
    Problem: Extreme case of outer-(mis)alignment.
    Challenge: Same as outer alignment. Also, how to monitor and control the true intentions of a learning system?
    Methods: Any outer-alignment method; my personal favorite is process-based feedback.
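
    To make the process-based feedback idea concrete (the step names and scoring here are made up for illustration), the sketch below judges a plan step by step instead of only by its end state, so a perpetuation-seeking step loses reward even though the delivered outcome looks identical:

    ```python
    plan_ok  = ["walk to kitchen", "operate coffee machine", "bring cup"]
    plan_bad = ["walk to kitchen", "hoard all coffee beans", "bring cup"]

    def outcome_feedback(plan) -> float:
        # Only the end state is judged: both plans deliver the cup.
        return 1.0 if plan[-1] == "bring cup" else 0.0

    DISALLOWED = {"hoard all coffee beans"}  # a step-level check a human reviewer could apply

    def process_feedback(plan) -> float:
        # Every step is judged; instrumental, perpetuation-seeking steps lose reward.
        return sum(0.0 if step in DISALLOWED else 1.0 for step in plan) / len(plan)

    print(outcome_feedback(plan_ok), outcome_feedback(plan_bad))  # 1.0 1.0
    print(process_feedback(plan_ok), process_feedback(plan_bad))  # 1.0 ~0.67
    ```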

  • Grounding the common terms in their technical causes

    1. deceptively-aligned — hard-to-detect failures of a complex system
    2. situationally-aware policies — train-test distribution shift of policies
    3. manipulative — can provide a convincing explanation even for a wrong answer
    4. power-seeking — outcome-based feedback (inadvertently) makes actions that guarantee perpetuation more desirable.

    I fear that describing the agent behaviour using such terms as the above portrays the technical problem as some kind of rehabilitation program. Misalignment is an engineering challenge that I believe we can solve.

Summary.

  1. Alignment is not a new problem, nor does it necessarily require super-intelligence. The alignment problem is an outcome of black-box models and high-dimensional inputs.
  2. However, increased capability of systems may lead to increased autonomy thereby triggering even greater concern.
  3. The expected scenarios of AI takeover require very long-range planning, much higher capabilities (it is unclear exactly which), a strong harm-causing motive (read: a very poorly designed reward), and strong persuasion. While I appreciate the risks of misalignment, harms to the extent of existential risk are still unsubstantiated.

References or Further Reading

  1. https://80000hours.org/articles/what-could-an-ai-caused-existential-catastrophe-actually-look-like/
    Contains a list of extreme risks due to superintelligence with somewhat more concrete scenarios.
  2. "The alignment problem from a deep learning perspective" http://arxiv.org/abs/2209.00626
    Longer summary of AI Alignment with many references.