Markov Decision Process

Anant Dhakad
Last Updated: May 13, 2022


Reinforcement learning is an area of Machine Learning. It enables machines and software agents to automatically select the best behavior in a given situation to improve their efficiency. For the agent to learn its behavior, simple reward feedback is required, known as the reinforcement signal.

Many different algorithms address this problem. A specific type of problem characterizes reinforcement Learning, and all solutions to that problem are classified as Reinforcement Learning algorithms. An agent is meant to choose the best action based on his present state in the situation. The problem becomes a Markov Decision Process when this phase is repeated. 

Markov Decision Process (MDP)

A Markov decision process includes:

  • A collection of potential world states S.
  • A collection of Models.
  • A list of possible actions A.
  • Reward function R(s, a).
  • A policy.


A state is a collection of tokens that represent every possible state for the agent.


A Model (also known as a Transition Model) describes how an action affects a state. T(S, a, S') specifies a transition T that occurs when we are in state S and do action 'a', resulting in state S' (S and S' may be the same). We also construct a probability P(S'|S, a) for stochastic actions (noisy, non-deterministic), which is the chance of reaching a state S' if action 'a' is conducted in state S

The Markov property asserts that the results of an action made in a state are solely determined by that state and not by the previous history.


A course of action A is a collection of all actions that can be taken. The set of actions that can be taken while in state S is defined by A(s).


A Reward is a reward function with a real value. The reward for simply being in the state S is R(s). The reward for being in state S and doing action 'a' is R(S, a). The reward for being in state S, taking action 'a', and ending in state S' is R(S, a, S').


A policy is a Markov Decision Process solution. A mapping from S to 'a' is referred to as a policy. It specifies the 'a' action to be performed while in state S.



















Consider the above grid example.

Agent lives in the cell(1, 3). A 3*4 grid is used in this example. A START state exists in the grid (cell 1,1). The agent's mission is to make his way around the grid until he finds the Gift (cell 4,3). The agent should stay away from the Fire grid at all times (orange color, cell 4,2). In addition, grid no. 2,2 is a blocked grid that works as a barrier, preventing the agent from entering. 

Any of the following actions are available to the agent: L(left), R(right), U(up), D(down)

Walls obstruct the agent's path; if a wall is in the way of the agent's path, the agent remains in the same spot. If the agent, for example, says L in the START grid, he will remain in the START grid.

Objective: To discover the shortest path from START to the Gift.

There are two such paths : (R, R, U, U, R) and (U, U, R, R, R). Let’s consider the first one. This move is now noisy. The targeted action is successful 80 percent of the time. The action agent will move at right angles 20% of the time. If the agent says UP, the chance of traveling U is 0.8, 0.1 for going L, and 0.1 for traveling R (since L and R are right angles to U).

Each time the agent takes a step, it is rewarded with a little reward (can be negative when it can also be termed as punishment, in the above example, entering the Fire can have a bonus of -1). In the end, the agent is given big rewards (good or bad). The goal is to maximize the total sum of rewards.

Constrained Markov Decision Processes

CMDPs (Constrained Markov Decision Processes) are extensions to Markov Decision Process extensions (MDPs). There are three key distinctions between MDPs and CMDPs.

  • Instead of a single cost, several expenses are incurred when an action is taken.
  • Only linear programs are used to solve CMDPs, and dynamic programming is not an option.
  • The final policy is determined by the initial state.


  1. What are the Markov Property and its condition?
    Formally, for a state St to be Markov, the probability of the future state St+1 being s' should be solely determined by the present state St = st, and not on the remainder of the previous states S1 = s1, S2 = s2
    Condition: P[ St+1 | St ] = P[ St+1 | S1 … St ]
  2. Explain briefly the Markov Reward Process (MRP).
    A Markov reward process is a stochastic process that adds a reward rate to each state of a Markov chain or a continuous-time Markov chain. An additional variable keeps track of the reward collected up to this point.
  3. What are value functions? 
    The long-term value of a state or action, i.e., the predicted Return over a state or action, is represented by a value function.
  4. Briefly explain the idea of Reinforcement Learning.
    Reinforcement learning is the training of machine learning models to make a sequence of decisions. The agent learns to achieve a goal in an uncertain, potentially complex environment. 

Key Takeaways

Congratulations on making it this far. In this blog, we saw what the Markov decision process is.

Yet learning never stops, To learn more about reinforcement learning, refer link and there is a lot more to learn. Happy Learning!!

Was this article helpful ?


No comments yet

Be the first to share what you think