In the context of reinforcement learning, especially in policy-gradient methods, preferences refer to internal scores or parameters associated with each action. These scores represent how much an agent "prefers" one action over another in a given state. Unlike probabilities, preferences are unbounded real numbers and are typically transformed into action probabilities using the softmax function.
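As a minimal sketch of that transformation (assuming NumPy and a hypothetical helper name `softmax_policy`), the mapping from unbounded preferences to action probabilities might look like this:

```python
import numpy as np

def softmax_policy(preferences):
    """Convert unbounded action preferences into a probability distribution."""
    # Subtracting the max preference improves numerical stability and does not
    # change the resulting probabilities.
    shifted = preferences - np.max(preferences)
    exp_prefs = np.exp(shifted)
    return exp_prefs / exp_prefs.sum()

# Example: three actions with preferences [2.0, 0.5, -1.0].
# The highest preference gets the highest probability, but every action
# keeps nonzero probability mass.
print(softmax_policy(np.array([2.0, 0.5, -1.0])))
```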
Preferences are useful because they allow smooth updates to an agent’s behavior. Instead of choosing the best action directly, the agent maintains preferences and adjusts them incrementally based on feedback (rewards). The more successful an action is, the more its preference grows, which makes the agent more likely to choose it in the future but never guarantees it; keeping the policy stochastic encourages continued exploration. A sketch of such an update appears below.
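One common scheme is a gradient-bandit-style update: raise the preference of the chosen action in proportion to the reward and lower the others, so the softmax policy shifts gradually rather than jumping to a single action. This is a sketch under that assumption, reusing the hypothetical `softmax_policy` helper above:

```python
def update_preferences(preferences, chosen_action, reward, step_size=0.1):
    """Nudge preferences toward actions that earned reward, away from the rest."""
    probs = softmax_policy(preferences)  # helper sketched earlier
    new_prefs = preferences.copy()
    for a in range(len(preferences)):
        if a == chosen_action:
            # Chosen action: preference rises with reward, scaled by how
            # "surprising" the choice was (1 - probability).
            new_prefs[a] += step_size * reward * (1.0 - probs[a])
        else:
            # Unchosen actions: preferences fall slightly, keeping the
            # distribution normalized in effect after the next softmax.
            new_prefs[a] -= step_size * reward * probs[a]
    return new_prefs
```

Because the update is proportional to the step size and to the action probabilities, no single outcome can collapse the policy onto one action, which is exactly the "smooth update" property described above.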
Using preferences yields a differentiable policy, which is important for algorithms that use gradient ascent to optimize the policy directly. These methods often include a baseline (such as a value estimate or the running average reward) to reduce the variance of the updates. Overall, preferences are an internal mechanism that captures the agent’s leaning toward specific actions and allows for nuanced, probabilistic decision-making.
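To show how a baseline fits in, here is an illustrative sketch (not a definitive implementation) of a full preference-learning loop on a simple bandit problem, using the running average reward as the baseline and reusing `softmax_policy` from above; the names `run_gradient_bandit` and `bandit_means` are assumptions for this example:

```python
import numpy as np

def run_gradient_bandit(bandit_means, n_steps=1000, step_size=0.1, seed=0):
    """Softmax-over-preferences agent with an average-reward baseline."""
    rng = rng = np.random.default_rng(seed)
    n_actions = len(bandit_means)
    preferences = np.zeros(n_actions)
    avg_reward = 0.0  # running average reward, used as the baseline

    for t in range(1, n_steps + 1):
        probs = softmax_policy(preferences)          # current stochastic policy
        action = rng.choice(n_actions, p=probs)      # sample, don't argmax
        reward = rng.normal(bandit_means[action], 1.0)  # noisy feedback

        avg_reward += (reward - avg_reward) / t      # incremental average
        advantage = reward - avg_reward              # centring reduces variance

        # Gradient-ascent step on the preferences.
        grad = -probs
        grad[action] += 1.0
        preferences += step_size * advantage * grad

    return preferences, softmax_policy(preferences)

prefs, policy = run_gradient_bandit(bandit_means=[1.0, 0.2, -0.5])
print(policy)  # most probability mass should end up on the best action
```

Subtracting the average reward means an action's preference only rises when the outcome was better than usual, which keeps individual updates small and stable without changing what the policy converges toward.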