Reinforcement Learning differs from the other two kinds of machine learning in that an agent interacts with an environment. The agent initially selects actions at random, but after performing an action it receives positive or negative reinforcement (a reward or punishment), and over time it learns to select the actions that earn it the most reward.
The agent selects actions based on a policy.
A policy is a probability function that takes the state (the observed environment) as input and outputs a probability for each valid action, with the probabilities summing to one. As the agent observes the relationship between states, actions, and rewards, it updates its policy function.
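As a minimal sketch of this idea, a softmax over per-action preference scores is one common way to get a valid policy: it maps any vector of scores to probabilities that sum to one, which the agent can then sample from. The preference values below are made up for illustration.

```python
import numpy as np

def softmax_policy(preferences):
    """Turn per-action preference scores into action probabilities.

    `preferences` is a hypothetical array of scores, one per valid action;
    the output is a probability distribution summing to one.
    """
    shifted = np.exp(preferences - np.max(preferences))  # subtract max for numerical stability
    return shifted / shifted.sum()

probs = softmax_policy(np.array([1.0, 2.0, 0.5]))
action = np.random.choice(len(probs), p=probs)  # sample an action from the policy
```

Learning then amounts to nudging the preference scores so that actions which led to reward become more probable.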
There are 2 kinds of value functions: the state-value function and the action-value function.
The state-value function takes a state as input and gives the value, i.e. the expected reward, associated with that state.
The action-value function takes a state and an action as input and gives the expected reward of taking that action in that state.
Along with the policy function, the value function also needs to be updated as the agent's experience grows.
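For small problems both value functions can simply be tables. A sketch below (states, actions, and values are all illustrative) also shows how the two are related: the value of a state is the average of the action values, weighted by how likely the policy is to pick each action.

```python
# Tabular value functions for a toy problem (all names/values are illustrative).
actions = ["left", "right"]

# Action-value function Q: expected return for taking `action` in `state`.
Q = {("s0", "left"): 0.0, ("s0", "right"): 1.0}

def policy(state, action):
    """A simple uniform policy: each of the two actions has probability 0.5."""
    return 0.5

def state_value(state):
    """V(s) = sum over actions a of policy(a | s) * Q(s, a)."""
    return sum(policy(state, a) * Q[(state, a)] for a in actions)

state_value("s0")  # 0.5 under this uniform policy
```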
The interaction between agent and environment may or may not be continuous. That is, a task may have a terminal (end) state, after which a new episode of interaction starts. For example, a self-driving car's episode ends when the car reaches its destination. By contrast, for a robot in a factory the task never ends; the robot is simply switched off when the factory shuts down at night. Rewards may therefore be given only at the end of an episode, and several algorithms account for this delayed reward.
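One standard way of handling delayed reward is to compute the discounted return of an episode: rewards further in the future are multiplied by powers of a discount factor gamma, so a reward received only at the terminal state still propagates value back to earlier steps. A small sketch, with a made-up episode:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G = r1 + gamma*r2 + gamma^2*r3 + ... by working backward."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# An episode where the only reward comes at the terminal state:
discounted_return([0, 0, 0, 1], gamma=0.5)  # 0.5**3 = 0.125
```

The earlier the terminal reward is relative to the current step, the larger its discounted contribution, which is how credit flows back through the episode.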
Most game-playing systems are built using Reinforcement Learning. Some famous examples are Samuel's checkers-playing program, Deep Blue, and AlphaGo (which made headlines when it beat the world champion at the game of Go).
Some well-known Reinforcement Learning algorithms are:
Monte-Carlo
Sarsa
Expected Sarsa
Q-learning
Actor-Critic
TD (lambda)
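To make one of these concrete, here is a minimal sketch of the Q-learning update rule, which adjusts the action-value table toward the received reward plus the discounted value of the best next action. The states, actions, and step size below are illustrative.

```python
def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """One Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    Missing table entries default to 0.0.
    """
    old = Q.get((state, action), 0.0)
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return Q[(state, action)]

Q = {}
q_learning_update(Q, "s0", "a", 1.0, "s1", ["a", "b"])  # 0.1 after one step
```

Repeating such updates over many episodes is what lets the table, and hence the agent's choices, converge toward the highest-reward actions.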
