Solving Blackjack with Q-Learning #


In this tutorial, we’ll explore and solve the Blackjack-v1 environment.

Blackjack is one of the most popular casino card games that is also infamous for being beatable under certain conditions. This version of the game uses an infinite deck (we draw the cards with replacement), so counting cards won’t be a viable strategy in our simulated game. Full documentation can be found at https://gymnasium.farama.org/environments/toy_text/blackjack

Objective: To win, your card sum should be greater than the dealer’s without exceeding 21.

Actions: the agent can choose between two actions:

stand (0): the player takes no more cards

hit (1): the player is dealt another card, but may go over 21 and bust

Approach : To solve this environment by yourself, you can pick your favorite discrete RL algorithm. The presented solution uses Q-learning (a model-free RL algorithm).

Imports and Environment Setup #
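A minimal setup could look like the following. The sab=True option is our choice here (it selects the Sutton & Barto rule set); the environment also works with its default arguments.

```python
import gymnasium as gym

# Create the Blackjack environment. sab=True follows the rules used in
# Sutton & Barto; this is a choice, not a requirement.
env = gym.make("Blackjack-v1", sab=True)

print(env.observation_space)  # Tuple(Discrete(32), Discrete(11), Discrete(2))
print(env.action_space)       # Discrete(2)
```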

Observing the environment #

First of all, we call env.reset() to start an episode. This function resets the environment to a starting position and returns an initial observation. We usually also set done = False. This variable will be useful later to check if a game is terminated (i.e., the player wins or loses).
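For example (the concrete observation values shown are just illustrative):

```python
# Start a new episode. reset() returns the initial observation and an info dict.
done = False
observation, info = env.reset()

print(observation)  # e.g. (16, 9, False)
```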

Note that our observation is a 3-tuple consisting of 3 values:

The player’s current sum

The value of the dealer’s face-up card

A boolean indicating whether the player holds a usable ace (an ace is usable if it counts as 11 without busting)

Executing an action #

After receiving our first observation, we are only going to use the env.step(action) function to interact with the environment. This function takes an action as input and executes it in the environment. Because that action changes the state of the environment, it returns five useful variables to us. These are:

next_state: the observation that the agent will receive after taking the action.

reward: the reward that the agent will receive after taking the action.

terminated: a boolean variable that indicates whether or not the episode has terminated, i.e., reached a terminal state.

truncated: a boolean variable that indicates whether the episode ended by early truncation, i.e., a time limit was reached.

info: a dictionary that might contain additional information about the environment.

The next_state, reward, terminated and truncated variables are self-explanatory, but the info variable requires some additional explanation. This variable contains a dictionary that might have some extra information about the environment, but in the Blackjack-v1 environment you can ignore it. For example, in Atari environments the info dictionary has an ale.lives key that tells us how many lives the agent has left. If the agent has 0 lives, then the episode is over.
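Putting this together, a single interaction step might look like this (here the action is simply sampled at random):

```python
# Sample a random action and execute it in the environment.
action = env.action_space.sample()
next_state, reward, terminated, truncated, info = env.step(action)

print(next_state, reward, terminated, truncated, info)
```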

Note that it is not a good idea to call env.render() in your training loop because rendering slows down training by a lot. Instead, try to build an extra loop to evaluate and showcase the agent after training.

Once terminated = True or truncated = True, we should stop the current episode and begin a new one with env.reset(). If you continue executing actions without resetting the environment, it still responds, but the output won’t be useful for training (it might even be harmful if the agent learns from invalid data).
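As a sketch, one complete episode with random actions, followed by a reset, could look like this:

```python
observation, info = env.reset()
done = False

while not done:
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated  # stop once the episode has ended

observation, info = env.reset()  # begin the next episode
```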

Building an agent #

Let’s build a Q-learning agent to solve Blackjack-v1! We’ll need some functions for picking an action and updating the agent’s action values. To ensure that the agent explores the environment, one possible solution is the epsilon-greedy strategy, where we pick a random action with probability epsilon and the greedy action (the one currently valued best) with probability 1 - epsilon.
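One possible agent is sketched below. The class and method names (BlackjackAgent, get_action, update, decay_epsilon) are our own choices, not anything prescribed by the environment:

```python
from collections import defaultdict

import numpy as np


class BlackjackAgent:
    def __init__(
        self,
        env,
        learning_rate,
        initial_epsilon,
        epsilon_decay,
        final_epsilon,
        discount_factor=0.95,
    ):
        """Set up an empty Q-table and the learning/exploration parameters."""
        self.env = env
        # Maps an observation to an array of action values, defaulting to zeros.
        self.q_values = defaultdict(lambda: np.zeros(env.action_space.n))
        self.lr = learning_rate
        self.discount_factor = discount_factor
        self.epsilon = initial_epsilon
        self.epsilon_decay = epsilon_decay
        self.final_epsilon = final_epsilon

    def get_action(self, obs):
        """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
        if np.random.random() < self.epsilon:
            return self.env.action_space.sample()
        return int(np.argmax(self.q_values[obs]))

    def update(self, obs, action, reward, terminated, next_obs):
        """One Q-learning update for a single transition."""
        # Do not bootstrap from terminal states.
        future_q_value = (not terminated) * np.max(self.q_values[next_obs])
        temporal_difference = (
            reward + self.discount_factor * future_q_value - self.q_values[obs][action]
        )
        self.q_values[obs][action] += self.lr * temporal_difference

    def decay_epsilon(self):
        """Reduce exploration linearly until final_epsilon is reached."""
        self.epsilon = max(self.final_epsilon, self.epsilon - self.epsilon_decay)
```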

To train the agent, we will let the agent play one episode (one complete game is called an episode) at a time and then update its Q-values after each episode. The agent will have to experience a lot of episodes to explore the environment sufficiently.

Now we should be ready to build the training loop.
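Before the loop itself, a possible set of hyperparameters and the agent instantiation. The values are illustrative, chosen for a quick run rather than full convergence (see the note below):

```python
learning_rate = 0.01
n_episodes = 100_000
start_epsilon = 1.0
epsilon_decay = start_epsilon / (n_episodes / 2)  # reduce exploration over time
final_epsilon = 0.1

agent = BlackjackAgent(
    env=env,
    learning_rate=learning_rate,
    initial_epsilon=start_epsilon,
    epsilon_decay=epsilon_decay,
    final_epsilon=final_epsilon,
)
```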

Great, let’s train!
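A training loop along these lines should work. tqdm is an optional progress bar, episode_returns is our own bookkeeping for the plots further below, and the Q-values are updated at every step, which is the standard one-step form of Q-learning:

```python
from tqdm import tqdm  # optional progress bar

episode_returns = []

for episode in tqdm(range(n_episodes)):
    obs, info = env.reset()
    done = False
    episode_return = 0.0

    while not done:
        action = agent.get_action(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)

        # Q-learning update for this transition.
        agent.update(obs, action, reward, terminated, next_obs)

        episode_return += reward
        done = terminated or truncated
        obs = next_obs

    episode_returns.append(episode_return)
    agent.decay_epsilon()
```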

Info: The current hyperparameters are set to quickly train a decent agent. If you want to converge to the optimal policy, try increasing n_episodes by 10x and lowering the learning_rate (e.g., to 0.001).

Visualizing the training #
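One way to do this, assuming the episode_returns list collected in the training loop above: plot a moving average of the per-episode returns (the window size of 500 is arbitrary).

```python
import matplotlib.pyplot as plt
import numpy as np

rolling_length = 500
returns = np.array(episode_returns)
# Moving average over a sliding window of episodes.
moving_avg = np.convolve(returns, np.ones(rolling_length) / rolling_length, mode="valid")

plt.plot(moving_avg)
plt.xlabel("Episode")
plt.ylabel(f"Average return (window = {rolling_length})")
plt.title("Training progress")
plt.show()
```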

Visualizing the policy #
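A simple sketch for inspecting the learned policy: for each (player sum, dealer card) state, show the greedy action (0 = stand, 1 = hit), once without and once with a usable ace. States the agent never visited keep their all-zero Q-values and therefore show up as “stand”.

```python
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

for ax, usable_ace in zip(axes, (False, True)):
    # Rows: player sum 12..21, columns: dealer face-up card 1..10 (ace = 1).
    policy_grid = np.array(
        [
            [
                int(np.argmax(agent.q_values[(player_sum, dealer_card, usable_ace)]))
                for dealer_card in range(1, 11)
            ]
            for player_sum in range(12, 22)
        ]
    )
    ax.imshow(policy_grid, origin="lower", extent=[0.5, 10.5, 11.5, 21.5])
    ax.set_xlabel("Dealer showing")
    ax.set_ylabel("Player sum")
    ax.set_title("Usable ace" if usable_ace else "No usable ace")

plt.tight_layout()
plt.show()
```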

It’s good practice to call env.close() at the end of your script, so that any resources used by the environment are released.

Think you can do better? #

Hopefully this tutorial helped you get a grip on how to interact with Gymnasium environments and set you on a journey to solve many more RL challenges.

It is recommended that you solve this environment by yourself (project-based learning is really effective!). You can apply your favorite discrete RL algorithm or give Monte Carlo ES a try (covered in Sutton & Barto, section 5.3) - this way you can compare your results directly to the book.

Best of fun!

Download Python source code: blackjack_tutorial.py

Download Jupyter notebook: blackjack_tutorial.ipynb
