
Policy

A policy in reinforcement learning defines the agent's behavior — it specifies how the agent chooses actions based on the current state. Formally, a policy is a mapping from states to probability distributions over actions. There are two main types of policies: deterministic, where the action is always the same for a given state, and stochastic, where actions are selected according to a probability distribution.
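As a minimal sketch of the two policy types, the snippet below uses a small discrete state/action space and a made-up preference table as the parameterization; the names and sizes are illustrative assumptions, not part of the original text.

```python
import numpy as np

# Illustrative setup: a tiny discrete state/action space with a
# hypothetical "preferences" table acting as the policy parameters.
n_states, n_actions = 4, 3
rng = np.random.default_rng(0)
preferences = rng.normal(size=(n_states, n_actions))

def deterministic_policy(state: int) -> int:
    # Deterministic: always returns the same action for a given state.
    return int(np.argmax(preferences[state]))

def stochastic_policy(state: int) -> int:
    # Stochastic: samples an action from a softmax distribution
    # over the state's action preferences.
    logits = preferences[state]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(n_actions, p=probs))
```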

Policies can be fixed, provided by a human designer, or learned by the agent. In policy-based methods, the goal is to directly optimize the policy to maximize expected return. This is often done using gradient ascent, where the parameters of the policy are adjusted in the direction that increases expected cumulative reward.
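A minimal REINFORCE-style gradient-ascent update for a tabular softmax policy might look like the sketch below. The parameter table `theta`, the learning rate, and the hand-made episode data are illustrative assumptions, not a definitive implementation.

```python
import numpy as np

n_states, n_actions = 4, 3
theta = np.zeros((n_states, n_actions))   # policy parameters
alpha = 0.1                               # learning rate (assumed)

def action_probs(state: int) -> np.ndarray:
    # Softmax over the state's action logits.
    logits = theta[state]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def reinforce_update(episode):
    """episode: list of (state, action, return_from_that_step) tuples."""
    for state, action, G in episode:
        probs = action_probs(state)
        # Gradient of log pi(a|s) for a softmax policy: one-hot(a) - probs.
        grad_log_pi = -probs
        grad_log_pi[action] += 1.0
        # Gradient ascent: move parameters toward higher expected return.
        theta[state] += alpha * G * grad_log_pi

# Example usage with a hand-made episode of (state, action, return) tuples:
reinforce_update([(0, 2, 1.0), (1, 0, 0.5), (3, 1, -0.2)])
```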

Policies are at the heart of reinforcement learning. Algorithms like REINFORCE, Actor-Critic, and PPO (Proximal Policy Optimization) explicitly learn and refine policies. In contrast, value-based methods derive the policy indirectly by acting greedily with respect to value functions. Regardless of the approach, an effective policy should strike a balance between exploration (trying new actions) and exploitation (using the best-known actions). Learning a good policy is what ultimately enables an agent to succeed in its environment.
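To illustrate how a value-based method derives its policy and balances exploration with exploitation, here is a small epsilon-greedy sketch built on an assumed action-value table `Q`; the table contents and the epsilon value are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 4, 3
Q = rng.normal(size=(n_states, n_actions))  # assumed learned action values
epsilon = 0.1                               # exploration rate (assumed)

def epsilon_greedy(state: int) -> int:
    if rng.random() < epsilon:
        # Exploration: occasionally try a random action.
        return int(rng.integers(n_actions))
    # Exploitation: act greedily with respect to the value estimates.
    return int(np.argmax(Q[state]))
```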
