POMDP

1 minute read

Published:

Minesweeper as a Markov Process

As promised two weeks ago, I’ve started my attempt to formulate Minesweeper in the language of reinforcement learning (RL). The mathematical backbone of RL is a Markov decision process (MDP), consisting of a state space $\mathcal{S}$, an action space $\mathcal{A}_s$ for each state $s \in \mathcal{S}$, transition probabilities $P_a(s,s’)$ for states $s,s’\in\mathcal{S}$ and action $a \in \mathcal{A}_s$, and the reward functions $R_a(s,s’)$. The goal of the MDP is to find the policy $\pi$ that determines the best action $a$ from any given state, i.e., \begin{equation} \pi^* = \arg\max_\pi \mathbb{E}_{s \sim \pi, a \sim \pi(s)} \left[ \sum_{t=0}^\infty \gamma^t R_a(s_t, a_t) \right] \end{equation} for some discount factor $\gamma \in [0,1]$.

Naïvely, the state of our Minesweeper game is the board configuration, an action would correspond to selecting/revealing cells, and the reward would be whether we have exploded or not. But things are going to get interesting… if our MDP knew the board configuration, the problem would be trivial! The optimal policy is to reveal all cells that aren’t mines. Somehow, we need to obscure the location of the mines from our MDP. Enter the partially observable Markov decision process (POMDP). Amazingly, in this sense the game of Minesweeper is more mathematically complicated than say Go or Chess where the players have perfect information. To reflect the partial observability of Minesweeper, we need to augment our MDP with an observation space $\mathcal{O}$ and base our policy and the corresponding actions on observations from $\mathcal{O}$. This simple distinction already rules out several RL algorithms, e.g. vanilla Q-learning.

To be continued…

As an aside, I love the way Andy Jones explains the policy gradient theorem.

Update from my Cursor Agent

First RL agent (no reward shaping). Model: PPO with CNN state encoder (3-channel obs: revealed, flagged, adj). Trained on Beginner 9×9, 500k steps (25:11, 331 step/s). Result: best=-0.141, mean_r=-0.182, wins=0. Exported to ONNX and wired into RLMS solver dropdown so you can watch it play—it’s bad on purpose, to motivate reward shaping. Next step: add light reward shaping (+0.02 per safe reveal) and retrain.