Last active
June 19, 2025 16:55
Stanford Online / DeepLearning.AI. Unsupervised Learning, Recommender Systems and Reinforcement Learning: Reinforcement Learning, Deep Q-Learning - Lunar Lander
| { | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "# Deep Q-Learning - Lunar Lander\n", | |
| "\n", | |
| "In this assignment, you will train an agent to land a lunar lander safely on a landing pad on the surface of the moon.\n", | |
| "\n", | |
| "\n", | |
| "# Outline\n", | |
| "- [ 1 - Import Packages <img align=\"Right\" src=\"./images/lunar_lander.gif\" width = 60% >](#1)\n", | |
| "- [ 2 - Hyperparameters](#2)\n", | |
| "- [ 3 - The Lunar Lander Environment](#3)\n", | |
| " - [ 3.1 Action Space](#3.1)\n", | |
| " - [ 3.2 Observation Space](#3.2)\n", | |
| " - [ 3.3 Rewards](#3.3)\n", | |
| " - [ 3.4 Episode Termination](#3.4)\n", | |
| "- [ 4 - Load the Environment](#4)\n", | |
| "- [ 5 - Interacting with the Gym Environment](#5)\n", | |
| " - [ 5.1 Exploring the Environment's Dynamics](#5.1)\n", | |
| "- [ 6 - Deep Q-Learning](#6)\n", | |
| " - [ 6.1 Target Network](#6.1)\n", | |
| " - [ Exercise 1](#ex01)\n", | |
| " - [ 6.2 Experience Replay](#6.2)\n", | |
| "- [ 7 - Deep Q-Learning Algorithm with Experience Replay](#7)\n", | |
| " - [ Exercise 2](#ex02)\n", | |
| "- [ 8 - Update the Network Weights](#8)\n", | |
| "- [ 9 - Train the Agent](#9)\n", | |
| "- [ 10 - See the Trained Agent In Action](#10)\n", | |
| "- [ 11 - Congratulations!](#11)\n", | |
| "- [ 12 - References](#12)\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "_**NOTE:** To prevent errors from the autograder, you are not allowed to edit or delete non-graded cells in this lab. Please also refrain from adding any new cells. \n", | |
| "**Once you have passed this assignment** and want to experiment with any of the non-graded code, you may follow the instructions at the bottom of this notebook._" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "<a name=\"1\"></a>\n", | |
| "## 1 - Import Packages\n", | |
| "\n", | |
| "We'll make use of the following packages:\n", | |
| "- `numpy` is a package for scientific computing in Python.\n", | |
| "- `deque` will be our data structure for our memory buffer.\n", | |
| "- `namedtuple` will be used to store the experience tuples.\n", | |
| "- The `gym` toolkit is a collection of environments that can be used to test reinforcement learning algorithms. We should note that in this notebook we are using `gym` version `0.24.0`.\n", | |
| "- `PIL.Image` and `pyvirtualdisplay` are needed to render the Lunar Lander environment.\n", | |
| "- We will use several modules from the `tensorflow.keras` framework for building deep learning models.\n", | |
| "- `utils` is a module that contains helper functions for this assignment. You do not need to modify the code in this file.\n", | |
| "\n", | |
| "Run the cell below to import all the necessary packages." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 1, | |
| "metadata": { | |
| "deletable": false, | |
| "id": "KYbOPKRtfQOr" | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "import time # Used to measure execution time: record the start and end times of\n", | |
| " # a code segment and take the difference\n", | |
| "\n", | |
| "# 'deque' is the data structure for the memory buffer; 'namedtuple' stores the experience tuples\n", | |
| "from collections import deque, namedtuple\n", | |
| "\n", | |
| "# Collection of environments used to test reinforcement learning algorithms\n", | |
| "import gym\n", | |
| "\n", | |
| "# Managing scientific computing with numpy arrays\n", | |
| "import numpy as np\n", | |
| "\n", | |
| "# Import and handle images with PIL library to render Lunar lander\n", | |
| "import PIL.Image\n", | |
| "\n", | |
| "# Get tensorflow framework for building deep learning models\n", | |
| "import tensorflow as tf\n", | |
| "\n", | |
| "# Contains 'helper' functions\n", | |
| "import utils\n", | |
| "\n", | |
| "from pyvirtualdisplay import Display # Needed to render the Lunar Lander environment\n", | |
| "from tensorflow.keras import Sequential # Create a deep learning NN 'Sequential()' model\n", | |
| "from tensorflow.keras.layers import Dense, Input # Create 'Dense()' layers and 'Input()' features of NN\n", | |
| "from tensorflow.keras.losses import MSE # Get 'Mean Squared Error (MSE)' as loss / error function\n", | |
| "from tensorflow.keras.optimizers import Adam # Get 'Adam()' optimizer as GD updating algorithm" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 2, | |
| "metadata": { | |
| "deletable": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "# Set up a virtual display to render the Lunar Lander environment\n", | |
| "# in a non-graphical environment, such as a server.\n", | |
| "# 'visible=0' indicates that the display is not visible\n", | |
| "# 'size=(840, 480)' defines the resolution of the virtual display\n", | |
| "# '.start()' function starts the virtual display\n", | |
| "Display(visible=0, size=(840, 480)).start();\n", | |
| "\n", | |
| "# Set the random seed for TensorFlow, getting the same results every time we run this cell\n", | |
| "tf.random.set_seed(utils.SEED)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "<a name=\"2\"></a>\n", | |
| "## 2 - Hyperparameters\n", | |
| "\n", | |
| "Run the cell below to set the hyperparameters." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 3, | |
| "metadata": { | |
| "deletable": false, | |
| "editable": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "MEMORY_SIZE = 100_000 # size of memory buffer\n", | |
| "GAMMA = 0.995 # discount factor\n", | |
| "ALPHA = 1e-3 # learning rate \n", | |
| "NUM_STEPS_FOR_UPDATE = 4 # perform a learning update every C time steps" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "<a name=\"3\"></a>\n", | |
| "## 3 - The Lunar Lander Environment\n", | |
| "\n", | |
| "In this notebook we will be using [OpenAI's Gym Library](https://www.gymlibrary.dev/). The Gym library provides a wide variety of environments for reinforcement learning. To put it simply, an environment represents a problem or task to be solved. In this notebook, we will try to solve the Lunar Lander environment using reinforcement learning.\n", | |
| "\n", | |
| "The goal of the Lunar Lander environment is to land the lunar lander safely on the landing pad on the surface of the moon. The landing pad is designated by two flag poles and its center is at coordinates `(0,0)`, but the lander is also allowed to land outside of the landing pad. The lander starts at the top center of the environment with a random initial force applied to its center of mass and has infinite fuel. The environment is considered solved if you get `200` points. \n", | |
| "\n", | |
| "<br>\n", | |
| "<br>\n", | |
| "<figure>\n", | |
| " <img src = \"images/lunar_lander.gif\" width = 40%>\n", | |
| " <figcaption style = \"text-align: center; font-style: italic\">Fig 1. Lunar Lander Environment.</figcaption>\n", | |
| "</figure>\n", | |
| "\n", | |
| "\n", | |
| "\n", | |
| "<a name=\"3.1\"></a>\n", | |
| "### 3.1 Action Space\n", | |
| "\n", | |
| "The agent has four discrete actions available:\n", | |
| "\n", | |
| "* Do nothing.\n", | |
| "* Fire right engine.\n", | |
| "* Fire main engine.\n", | |
| "* Fire left engine.\n", | |
| "\n", | |
| "Each action has a corresponding numerical value:\n", | |
| "\n", | |
| "```python\n", | |
| "Do nothing = 0\n", | |
| "Fire right engine = 1\n", | |
| "Fire main engine = 2\n", | |
| "Fire left engine = 3\n", | |
| "```\n", | |
| "\n", | |
| "<a name=\"3.2\"></a>\n", | |
| "### 3.2 Observation Space\n", | |
| "\n", | |
| "The agent's observation space consists of a state vector with 8 variables:\n", | |
| "\n", | |
| "* Its $(x,y)$ coordinates. The landing pad is always at coordinates $(0,0)$.\n", | |
| "* Its linear velocities $(\\dot x,\\dot y)$.\n", | |
| "* Its angle $\\theta$.\n", | |
| "* Its angular velocity $\\dot \\theta$.\n", | |
| "* Two booleans, $l$ and $r$, that represent whether each leg is in contact with the ground or not.\n", | |
| "\n", | |
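| "As a rough illustration of the layout above (not part of the assignment), the 8 entries can be unpacked into named fields. The `LanderState` name and its field names are assumptions made for this sketch; the sample values are the initial state printed in section 5.1 of this notebook.\n", | |
| "\n", | |
| "```python\n", | |
| "from collections import namedtuple\n", | |
| "\n", | |
| "# Hypothetical container: field order follows the list above\n", | |
| "LanderState = namedtuple(\n", | |
| "    'LanderState',\n", | |
| "    ['x', 'y', 'x_dot', 'y_dot', 'theta', 'theta_dot', 'left_leg', 'right_leg'],\n", | |
| ")\n", | |
| "\n", | |
| "# Sample observation (values copied from the reset output in section 5.1)\n", | |
| "obs = [0.00191936, 1.4223009, 0.19439952, 0.5058145, -0.00221732, -0.04403437, 0.0, 0.0]\n", | |
| "state = LanderState(*obs)\n", | |
| "\n", | |
| "# The leg flags arrive as floats (0.0 or 1.0), so convert them before testing\n", | |
| "both_legs_down = bool(state.left_leg) and bool(state.right_leg)\n", | |
| "print(state.y, both_legs_down)\n", | |
| "```\n", | |
| "\n", | |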
| "<a name=\"3.3\"></a>\n", | |
| "### 3.3 Rewards\n", | |
| "\n", | |
| "After every step, a reward is granted. The total reward of an episode is the sum of the rewards for all the steps within that episode.\n", | |
| "\n", | |
| "For each step, the reward:\n", | |
| "- increases as the lander gets closer to the landing pad and decreases as it moves further away.\n", | |
| "- increases as the lander slows down and decreases as it speeds up.\n", | |
| "- decreases the more the lander is tilted (angle not horizontal).\n", | |
| "- is increased by 10 points for each leg that is in contact with the ground.\n", | |
| "- is decreased by 0.03 points each frame a side engine is firing.\n", | |
| "- is decreased by 0.3 points each frame the main engine is firing.\n", | |
| "\n", | |
| "The episode receives an additional reward of -100 points for crashing or +100 points for landing safely.\n", | |
| "\n", | |
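| "To make the bookkeeping concrete, here is a small sketch of how the per-step rewards add up to an episode total. The `step_rewards` values and the `landed_safely` flag are made up for illustration and do not come from a real episode.\n", | |
| "\n", | |
| "```python\n", | |
| "# Hypothetical per-step rewards collected during one episode\n", | |
| "step_rewards = [1.1, -0.3, 0.8, 10.0, 10.0]\n", | |
| "\n", | |
| "# Bonus granted once at the end: +100 for a safe landing, -100 for a crash\n", | |
| "landed_safely = True\n", | |
| "terminal_bonus = 100.0 if landed_safely else -100.0\n", | |
| "\n", | |
| "# The episode's total reward is the plain sum of everything received\n", | |
| "total_reward = sum(step_rewards) + terminal_bonus\n", | |
| "print(round(total_reward, 1))\n", | |
| "\n", | |
| "# The environment counts as solved at 200 points\n", | |
| "print(total_reward >= 200)\n", | |
| "```\n", | |
| "\n", | |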
| "<a name=\"3.4\"></a>\n", | |
| "### 3.4 Episode Termination\n", | |
| "\n", | |
| "An episode ends (i.e., the environment enters a terminal state) if:\n", | |
| "\n", | |
| "* The lunar lander crashes (i.e., the body of the lunar lander comes in contact with the surface of the moon).\n", | |
| "\n", | |
| "* The absolute value of the lander's $x$-coordinate is greater than 1 (i.e., it goes beyond the left or right border).\n", | |
| "\n", | |
| "You can check out the [Open AI Gym documentation](https://www.gymlibrary.dev/environments/box2d/lunar_lander/) for a full description of the environment. " | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "<a name=\"4\"></a>\n", | |
| "## 4 - Load the Environment\n", | |
| "\n", | |
| "We start by loading the `LunarLander-v2` environment from the `gym` library by using the `.make()` method. `LunarLander-v2` is the latest version of the Lunar Lander environment and you can read about its version history in the [Open AI Gym documentation](https://www.gymlibrary.dev/environments/box2d/lunar_lander/#version-history)." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 4, | |
| "metadata": { | |
| "deletable": false, | |
| "id": "ILVMYKewfR0n" | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "# Load 'LunarLander-v2' environment from 'gym' library\n", | |
| "# using the '.make()' method\n", | |
| "env = gym.make('LunarLander-v2')" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Once we load the environment we use the `.reset()` method to reset the environment to the initial state. The lander starts at the top center of the environment and we can render the first frame of the environment by using the `.render()` method." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 5, | |
| "metadata": { | |
| "deletable": false | |
| }, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "image/png": "iVBORw0KGgoAAAANSUhEUgAAAlgAAAGQCAIAAAD9V4nPAAAPk0lEQVR4nO3df2yUdZ4H8CkUoUoR0UW3EDaKgsgqtxwnxpBAbsO5nsetMdhsiEfiedoY43lnzMUYN/bOGDU5fYpRYxpDjD+jXeMF6i1REjDAGX6Uc9AF5eSXUgqClG2hoBWe+2NcddtaWjrzPM88z+v1D32G6Tyf+WSezzvzzHyfVuSAQfq3Wz/cfWTVxWP/em/n+l371u7eveHQod1ff33stt+8ebz7yynn/11J976zfeVHO5Zt3vy7ESOqJky46upf3DKqcuzEMbM//+P/vLv2sf37Py7p3iF9hsVdAJSZ+jv2tXZumFB9ddWIcZeNu37mJYtnzPj7r78+lsvlToXfjBh+TqkLOGv4mDAMc7lcd/fxmTMWnl912SXn/fKs4aMnjJk955o7Sr13SB9BCIPT+dW+U+E3546aVNj89MDKlpamws8jh48ZMayq1AVUDhv51VcdhZ/b9x042LW18POoynOrz6qZNGlmqQuAlBGEMAjfvR0sbHZ+ta/jWOsXX3waeSEVhX9WbfzP6rNqDnZtK2xOGDN75l8sjLwYKG+CEAbhyIndlcOqRp91YWFz6/7/2rZt5Xf/W5GrCHNhqWuo+FMKFuzcsaG1Y0Ph58phI39y9hWXXjqn1DVAmghCGKj6O/bt7Vg/cczswuaRE7sPfvlpe/veeKt6r+XJC86+fP/RDwqbE8ZcPW3a/IqKin5/CfieIISBOtT1yeizLhxVObawuf2L32/b9m6sFX3r/7avae3cUPgGTS6Xm1g9e9q0+fGWBGVEEMKA9Ph08FDXJ61tW44d+zLeqgre2/zkhOrZrZ3rC5sXjr5q6mW/HDmy5N9fhXQQhHB69XfsO3B0y7iqS0cMP7twy67Dq3746WDkep753Pbxu18e39598nhhc2L17N/8ujHyqqAsCUIYkB++HTxwdMuu3esLawcTosebwvPPntLVfWj06J/EWxWUhcq4C4Ckq79jX2vnxp+OnjmsYnjhls873u/z08ET33Qc697/2R/X5Xp9d7TXt0l73tDHHX7kF050t1dXX9B77x9uXT7+Z5OOd7dXjTgvl8tNrJ7961899srvbuvvuQGCEE7r5KnuL459+IuL/rGw2dq58ZNPVoXhqd73rB55UUXFsFGVY3qfuqzoeUvPG/q4w4/8wqjKsXv39nHa873NT/7Lz99v7Vx/6bhf5XK5c0dN2n8sP27czw4f3tPf04PM8x1r6M9v/2l3a+fGUZXnjj/n57lc7uSp7v9tW9r89r/HXVff5s68t+biKTWj/6p65E9zudyxr7/4rGPd8hUPdnW1x10aJJfPCKE/X588drDrD99ttnZu+OgPv4+xnv71+KSw46u9Xd0Hx46dEG9VkHDeEcJpzP3Lf5186bVHTuy6cPSMz9rXvf3f/xF3Rf2ZO/PeSZdcVTls1OETO453dL6z+rG4KwIgFebNunfhjcEVV/xN3IWc3j//w9q/vf63lZUj4y4EgNQZMaLkf1xi6IYN8yykpF3AVAMW3alDt1Knf0aG7XrtyaNbkXXoi7oJTSZ9JEEJIqmzb1vMW8LgV9Jk0EIanSe0D3YF4XhT6TJoKQVDntgO7BvD4z+kyaCEJSZbADugfzeoD0mTQRhKTKEAd0D7NmFfPR0kSfSZPKuAuABPFOJRr6TKIIQjLNRI6GPpNkgpBsMZGjoc+UEUFIypnI0dBnypcgJG1M5GjoM0AShWEYdwmZoM+kybC4C0i/mpqaJ554IgzDF198cf78+XGXAwBRmTVr1muvvRb+uba2tscff3z69OlxV5dO3qlEQ5+B01iwYMHq1avDfrW0tNxzzz3nnXde3MWmigEdDX0GflRdXd327dv7j8Aempuba2tr4y48JQzoaOgz0FN1dXV9ff3Ro0cHFYE/dOLEicbGxjlz5sT9VMqbAR0NfQa+N3Xq1MbGxj
POv9527txZX19/ySWXxP3MypIBHQ19BnK5XG7evHnNzc1FjMAe1q1bV1dXV1VVFfcTLScGdDT0GbJu0aJFLS0tpYvAHpqamhYsWBD3ky4PBnQ09Bmy67777mtra4ssAn+ovb19yZIls/zFmn4Z0NHQZ8ic7xbFJ8HWrVvvv//+mpqauLuSRKEBHQl9hgzpc1F8QqxcuXLx4sVxdyhZQgM6EvoMmTCQRfEJ8dJLL7l4W0FoQEdCnyHlzmBRfBK4eFvOgI6KPkM6DX1RfEJk+eJtoQEdCX2GtCn6oviEyODF20IDOhL6DOlR6kXxSZCpi7eFBnQk9BnSIOJF8UlQuHjb5MmT4+59CYUGdCT0GcpbjIviEyLFF28LDehI6DOUpUQtik+I9F28LTSgI6HPUGaSvCg+Cdrb25966ql0XLwtNKAjoc9QNspoUXwSpODibaEBHQl9hjJQpoviE6J8L94WGtCR0GdIrtQsik+I5ubmurq6MnqPGBrQkdBnSKK0LopPiJaWlvr6+uR/jhga0JHQZ0iWLCyKT462trbGxsbEftc0NKAjoc+lVltb29TUFJbhWRmilsFF8YmSwEM0NKAjoc8l8l3+9VYuZ2WIjkXxiZKcQzQ0oCOhz8XVT/71lvCzMkRhxowZpRvoDFHsh2hoQEdCn4tiUPnXpwSelaHkFi9eXJyBTenFcoiGBnQk9Hkohp5/vSXnrAyl5QJpZSrKQzQ0oCOhz2egFPnXW+xnZSihlStXlvoFRKlFcIiGBnQk9Hngosm/Pjlxmh41NTUHDx6M5WVE6ZToEA0N6Ejo82nFmH+9OXFa3q6//vq4X0KUVnEP0dCAjoQ+/5hE5V9vTpyWn/vvvz/ulw3RKcohGhrQkdDnHhKef31y4rQM+MNJWXbGh2hoQEdCnwvKMf96c+K0Iu4C+jBixIh8Pj9t2rS4CyF+mzdvXr58eXNz86ZNmwZy/zAMKyqS+KpOmYz3uba29uabb164cGHchRTZ/v37l/9J3LVEKnEv5VmzZm3cuDHuKkicAR6iGR/Qkclmn9Oaf316++23C4fbvn374q6l5JL1Ur7tttuef/75uKsg6fo5RLM5oKOXqT5nKv96G+xZmXKUoJfyU089dffdd8ddBeWk9yGaqQEdoyz0OeP511uKT5wm5aW8evXquXPnxl0F5SrFhygRk38DkbITp/EH4aRJk/L5/NixY+MuBBiEI0eOtA/M4cOH4y729OTfmUnHidOYg3DBggXLli2Ltwag1BKbmvKvWMr6rEycQfjggw8+/PDDMRYAJFAEqSn/SmrJkiVBEOzZsyfuQgYqtiBsamryKgSGaFCpKf+i9NZbbwVBsGbNmrgLOb0YgrCqqiqfz1922WXR7xqAKLW0tDQ0NLz88stxF9KfqIPwmmuuef/99yPeKQAxOnToUBAEQRAcP3487lr6MCzKndXV1UlBgKy54IILHnnkka6urueee+7yyy+Pu5yeontH+Oyzz955552R7Q6AZFqxYkUQBO+8807chXwroiBct27dtddeG82+AEi+rVu3BkHwxhtvdHR0xFtJyYNw8uTJ+Xz+nHPOKfWOACg7R48ebWhoCIIgxgsvlDYIb7rppjfffLOkuwAgBV588cUgCD744IPod13CIKyvr3/ooYdK9/gApMzq1asbGho++uijHTt2RLbTUgXhW2+9deONN5bowQFIsV27dgVB8Nxzz3V3d0ewu+IHYXV1dT6fv/jii4v+yABkx8mTJ4MgaGhoaG1tLemOihyEc+bMKYsL6gBQLpqamoIgKN0y9GIG4V133fX0008X8QEBoGD9+vVBEKxdu7bobxCLFoSNjY233357sR4NAHpra2srLLco4seHxQnC9evXX3311UV5KAA4rWeeeSYIgqJ8uXSoQTh16tR8Pj9y5MihlwIAg7J8+fKGhobDhw8PZQHikIKwtrb29ddfH8ojAMAQbdmyJQiCZcuWndnlac48CB955JEHHnjgjH
8dAIroyJEjheUWg7146RkGYXNz8w033HBmvwsApbN06dIgCGpqagb4By4GHYTjxo3L5/MTJ04cfG0AEJGVK1cGQbBz586PP/64/3sOLgjnzZu3atWqIRQGANHZvn17Q0PDCy+8cPz48R+7zyD+Qv0999wjBQEoI1OmTHn22Wfb29sfffTR8ePH33LLLb3vM9B3hEuXLr311luLWh4AROrVV19taGgYNWrUD68GOqAgbGlpmTlzZskKA4DorFu3LgiCTZs27dmzJ3faIJw+fXo+nx8+fHgktQFARD7//PMgCIIg6C8IFy1a9Morr0RWEwBE70e/LPP4449LQQBSr+93hCtWrLjuuusiLgUAotczCMePH5/P5y+66KJYqgGAiP3ZqdH58+cfOHBACgKQHd8H4X333TfAy7IBQGp8e2r0pZde6nO9PQCkW0Uul9uyZcuVV14ZdyUAEIOKMAzjrgEAYjOIi24DQPoIQgAyTRACkGmCEIBME4QAZJogBCDTBCEAmSYIAcg0QQhApglCADJNEAKQaYIQgEwThABkmiAEINMEIQCZJggByDRBCECmCUIAMk0QApBpghCATBOEAGSaIAQg0wQhAJkmCAHINEEIQKYJQgAyTRACkGmCEIBME4QAZJogBCDTBCEAmSYIAcg0QQhApglCADJNEAKQaYIQgEwThABkmiAEINMEIQCZJggByDRBCECmCUIAMk0QApBpghCATBOEAGSaIAQg0wQhAJkmCAHINEEIQKYJQgAyTRACkGmCEIBME4QAZJogBCDTBCEAmSYIAcg0QQhApglCADJNEAKQaYIQgEwThABkmiAEINP+H4bqlqkM8g9eAAAAAElFTkSuQmCC\n", | |
| "text/plain": [ | |
| "<PIL.Image.Image image mode=RGB size=600x400 at 0x7EB475A9E650>" | |
| ] | |
| }, | |
| "execution_count": 5, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "# 'env.reset()' the environment to the initial state (Top center of environment)\n", | |
| "env.reset()\n", | |
| "\n", | |
| "# 'env.render()' the first frame of the environment from RGB array, and display as pillow Image\n", | |
| "PIL.Image.fromarray(env.render(mode='rgb_array'))" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "In order to build our neural network later on we need to know the size of the state vector and the number of valid actions. We can get this information from our environment by using the `.observation_space.shape` and `.action_space.n` attributes, respectively." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 6, | |
| "metadata": { | |
| "deletable": false, | |
| "id": "x3fdqdG4CUu2" | |
| }, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "State s shape (1D vector): (8,)\n", | |
| "Number of actions (scalar): 4\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "# Shape of the state vector s -> (8,), i.e. a 1D vector with 8 elements\n", | |
| "state_size = env.observation_space.shape\n", | |
| "\n", | |
| "# Number of valid actions a -> 4 (a scalar)\n", | |
| "num_actions = env.action_space.n\n", | |
| "\n", | |
| "print('State s shape (1D vector):', state_size)\n", | |
| "print('Number of actions (scalar):', num_actions)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "<a name=\"5\"></a>\n", | |
| "## 5 - Interacting with the Gym Environment\n", | |
| "\n", | |
| "The Gym library implements the standard “agent-environment loop” formalism:\n", | |
| "\n", | |
| "<br>\n", | |
| "<center>\n", | |
| "<video src = \"./videos/rl_formalism.m4v\" width=\"840\" height=\"480\" controls autoplay loop poster=\"./images/rl_formalism.png\"> </video>\n", | |
| "<figcaption style = \"text-align:center; font-style:italic\">Fig 2. Agent-environment Loop Formalism.</figcaption>\n", | |
| "</center>\n", | |
| "<br>\n", | |
| "\n", | |
| "In the standard “agent-environment loop” formalism, an agent interacts with the environment in discrete time steps $t=0,1,2,...$. At each time step $t$, the agent uses a policy $\\pi$ to select an action $A_t$ based on its observation of the environment's state $S_t$. The agent receives a numerical reward $R_t$ and on the next time step, moves to a new state $S_{t+1}$.\n", | |
| "\n", | |
| "<a name=\"5.1\"></a>\n", | |
| "### 5.1 Exploring the Environment's Dynamics\n", | |
| "\n", | |
| "In Open AI's Gym environments, we use the `.step()` method to run a single time step of the environment's dynamics. In the version of `gym` that we are using, the `.step()` method accepts an action and returns four values:\n", | |
| "\n", | |
| "* `observation` (**object**): an environment-specific object representing your observation of the environment. In the Lunar Lander environment this corresponds to a numpy array containing the positions and velocities of the lander as described in section [3.2 Observation Space](#3.2).\n", | |
| "\n", | |
| "\n", | |
| "* `reward` (**float**): amount of reward returned as a result of taking the given action. In the Lunar Lander environment this corresponds to a float of type `numpy.float64` as described in section [3.3 Rewards](#3.3).\n", | |
| "\n", | |
| "\n", | |
| "* `done` (**boolean**): When done is `True`, it indicates the episode has terminated and it’s time to reset the environment. \n", | |
| "\n", | |
| "\n", | |
| "* `info` (**dictionary**): diagnostic information useful for debugging. We won't be using this variable in this notebook but it is shown here for completeness.\n", | |
| "\n", | |
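| "The sketch below shows how these four values are typically consumed in a loop. To keep it runnable without `gym`, it uses a stand-in `ToyEnv` class (an assumption for illustration only) whose `.reset()` and `.step()` mimic the return signature described above; with the real environment you would pass in `env` instead.\n", | |
| "\n", | |
| "```python\n", | |
| "import random\n", | |
| "\n", | |
| "class ToyEnv:\n", | |
| "    # Hypothetical stand-in that ends every episode after 5 steps\n", | |
| "    def reset(self):\n", | |
| "        self.t = 0\n", | |
| "        return [0.0] * 8\n", | |
| "\n", | |
| "    def step(self, action):\n", | |
| "        self.t += 1\n", | |
| "        observation = [0.0] * 8\n", | |
| "        reward = -0.3 if action == 2 else 0.0\n", | |
| "        done = self.t >= 5\n", | |
| "        info = {}\n", | |
| "        return observation, reward, done, info\n", | |
| "\n", | |
| "def run_episode(env, num_actions=4):\n", | |
| "    # Reset, then step with random actions until done is True\n", | |
| "    state = env.reset()\n", | |
| "    total_reward, done = 0.0, False\n", | |
| "    while not done:\n", | |
| "        action = random.randrange(num_actions)\n", | |
| "        state, reward, done, info = env.step(action)\n", | |
| "        total_reward += reward\n", | |
| "    return total_reward\n", | |
| "\n", | |
| "print(run_episode(ToyEnv()))\n", | |
| "```\n", | |
| "\n", | |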
| "To begin an episode, we need to reset the environment to an initial state. We do this by using the `.reset()` method. " | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 7, | |
| "metadata": { | |
| "deletable": false | |
| }, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "[x, y, x’, y’, theta, theta’, l, r] =\n", | |
| " [ 0.00191936 1.4223009 0.19439952 0.5058145 -0.00221732 -0.04403437\n", | |
| " 0. 0. ]\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "# 'env.reset()' the environment and get into initial state (Top center of environment)\n", | |
| "# to begin a new episode, each time.\n", | |
| "# current_state = [x, y, x’, y’, theta, theta’, l, r]\n", | |
| "current_state = env.reset()\n", | |
| "\n", | |
| "print('[x, y, x’, y’, theta, theta’, l, r] =\\n',current_state)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Once the environment is reset, the agent can start taking actions in the environment by using the `.step()` method. Note that the agent can only take one action per time step. \n", | |
| "\n", | |
| "In the cell below you can select different actions and see how the returned values change depending on the action taken. Remember that in this environment the agent has four discrete actions available and we specify them in code by using their corresponding numerical value:\n", | |
| "\n", | |
| "```python\n", | |
| "Do nothing = 0\n", | |
| "Fire right engine = 1\n", | |
| "Fire main engine = 2\n", | |
| "Fire left engine = 3\n", | |
| "```" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 8, | |
| "metadata": { | |
| "deletable": false | |
| }, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<style type=\"text/css\" >\n", | |
| " #T_28da7a48_4c97_11f0_9fc6_0242ac120042 th {\n", | |
| " border: 1px solid grey;\n", | |
| " text-align: center;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042 tbody td {\n", | |
| " border: 1px solid grey;\n", | |
| " text-align: center;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row0_col0 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row1_col1 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row1_col2 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row1_col3 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row1_col4 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row1_col5 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row1_col6 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row1_col7 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row1_col8 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row1_col9 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row1_col10 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row1_col11 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row2_col0 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row3_col1 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row3_col2 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row3_col3 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row3_col4 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row3_col5 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row3_col6 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row3_col7 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row3_col8 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row3_col9 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row3_col10 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row3_col11 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row4_col1 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row4_col2 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row4_col3 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row4_col4 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row4_col5 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row4_col6 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row4_col7 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row4_col8 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row4_col9 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row4_col10 {\n", | |
| " background-color : grey;\n", | |
| " } #T_28da7a48_4c97_11f0_9fc6_0242ac120042row4_col11 {\n", | |
| " background-color : grey;\n", | |
| " }</style><table id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042\" ><thead> <tr> <th class=\"blank level0\" ></th> <th class=\"col_heading level0 col0\" ></th> <th class=\"col_heading level0 col1\" colspan=8>State Vector</th> <th class=\"col_heading level0 col9\" colspan=3>Derived from the State Vector (the closer to zero, the better)</th> </tr> <tr> <th class=\"blank level1\" ></th> <th class=\"col_heading level1 col0\" ></th> <th class=\"col_heading level1 col1\" colspan=2>Coordinate</th> <th class=\"col_heading level1 col3\" colspan=2>Velocity</th> <th class=\"col_heading level1 col5\" colspan=2>Tilting</th> <th class=\"col_heading level1 col7\" colspan=2>Ground contact</th> <th class=\"col_heading level1 col9\" >Distance from landing pad</th> <th class=\"col_heading level1 col10\" >Velocity</th> <th class=\"col_heading level1 col11\" >Tilting Angle (absolute value)</th> </tr> <tr> <th class=\"blank level2\" ></th> <th class=\"col_heading level2 col0\" ></th> <th class=\"col_heading level2 col1\" >X (Horizontal)</th> <th class=\"col_heading level2 col2\" >Y (Vertical)</th> <th class=\"col_heading level2 col3\" >X (Horizontal)</th> <th class=\"col_heading level2 col4\" >Y (Vertical)</th> <th class=\"col_heading level2 col5\" >Angle</th> <th class=\"col_heading level2 col6\" >Angular Velocity</th> <th class=\"col_heading level2 col7\" >Left Leg?</th> <th class=\"col_heading level2 col8\" >Right Leg?</th> <th class=\"col_heading level2 col9\" ></th> <th class=\"col_heading level2 col10\" ></th> <th class=\"col_heading level2 col11\" ></th> </tr></thead><tbody>\n", | |
| " <tr>\n", | |
| " <th id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042level0_row0\" class=\"row_heading level0 row0\" >Current State</th>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row0_col0\" class=\"data row0 col0\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row0_col1\" class=\"data row0 col1\" >0.001919</td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row0_col2\" class=\"data row0 col2\" >1.422301</td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row0_col3\" class=\"data row0 col3\" >0.194400</td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row0_col4\" class=\"data row0 col4\" >0.505814</td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row0_col5\" class=\"data row0 col5\" >-0.002217</td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row0_col6\" class=\"data row0 col6\" >-0.044034</td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row0_col7\" class=\"data row0 col7\" >False</td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row0_col8\" class=\"data row0 col8\" >False</td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row0_col9\" class=\"data row0 col9\" >1.422302</td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row0_col10\" class=\"data row0 col10\" >0.541885</td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row0_col11\" class=\"data row0 col11\" >0.002217</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042level0_row1\" class=\"row_heading level0 row1\" >Action</th>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row1_col0\" class=\"data row1 col0\" >Do nothing</td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row1_col1\" class=\"data row1 col1\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row1_col2\" class=\"data row1 col2\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row1_col3\" class=\"data row1 col3\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row1_col4\" class=\"data row1 col4\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row1_col5\" class=\"data row1 col5\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row1_col6\" class=\"data row1 col6\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row1_col7\" class=\"data row1 col7\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row1_col8\" class=\"data row1 col8\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row1_col9\" class=\"data row1 col9\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row1_col10\" class=\"data row1 col10\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row1_col11\" class=\"data row1 col11\" ></td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042level0_row2\" class=\"row_heading level0 row2\" >Next State</th>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row2_col0\" class=\"data row2 col0\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row2_col1\" class=\"data row2 col1\" >0.003839</td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row2_col2\" class=\"data row2 col2\" >1.433103</td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row2_col3\" class=\"data row2 col3\" >0.194137</td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row2_col4\" class=\"data row2 col4\" >0.480094</td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row2_col5\" class=\"data row2 col5\" >-0.004393</td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row2_col6\" class=\"data row2 col6\" >-0.043519</td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row2_col7\" class=\"data row2 col7\" >False</td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row2_col8\" class=\"data row2 col8\" >False</td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row2_col9\" class=\"data row2 col9\" >1.433108</td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row2_col10\" class=\"data row2 col10\" >0.517860</td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row2_col11\" class=\"data row2 col11\" >0.004393</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042level0_row3\" class=\"row_heading level0 row3\" >Reward</th>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row3_col0\" class=\"data row3 col0\" >1.104326</td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row3_col1\" class=\"data row3 col1\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row3_col2\" class=\"data row3 col2\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row3_col3\" class=\"data row3 col3\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row3_col4\" class=\"data row3 col4\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row3_col5\" class=\"data row3 col5\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row3_col6\" class=\"data row3 col6\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row3_col7\" class=\"data row3 col7\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row3_col8\" class=\"data row3 col8\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row3_col9\" class=\"data row3 col9\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row3_col10\" class=\"data row3 col10\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row3_col11\" class=\"data row3 col11\" ></td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042level0_row4\" class=\"row_heading level0 row4\" >Episode Terminated</th>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row4_col0\" class=\"data row4 col0\" >False</td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row4_col1\" class=\"data row4 col1\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row4_col2\" class=\"data row4 col2\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row4_col3\" class=\"data row4 col3\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row4_col4\" class=\"data row4 col4\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row4_col5\" class=\"data row4 col5\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row4_col6\" class=\"data row4 col6\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row4_col7\" class=\"data row4 col7\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row4_col8\" class=\"data row4 col8\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row4_col9\" class=\"data row4 col9\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row4_col10\" class=\"data row4 col10\" ></td>\n", | |
| " <td id=\"T_28da7a48_4c97_11f0_9fc6_0242ac120042row4_col11\" class=\"data row4 col11\" ></td>\n", | |
| " </tr>\n", | |
| " </tbody></table>" | |
| ], | |
| "text/plain": [ | |
| "<pandas.io.formats.style.Styler at 0x7eb3d592d210>" | |
| ] | |
| }, | |
| "metadata": {}, | |
| "output_type": "display_data" | |
| } | |
| ], | |
| "source": [ | |
| "# Select an action -> Do nothing = 0\n", | |
| "action = 0\n", | |
| "\n", | |
| "# Run a single time step of the environment's dynamics with the given action a.\n", | |
| "next_state, reward, done, _ = env.step(action)\n", | |
| "\n", | |
| "# Display table with values.\n", | |
| "utils.display_table(current_state, action, next_state, reward, done)\n", | |
| "\n", | |
| "# Replace / Overwrite the previous `current_state` s with the New state s', after the action a is taken.\n", | |
| "current_state = next_state" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "In practice, when we train the agent we use a loop to allow the agent to take many consecutive actions during an episode." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "<a name=\"6\"></a>\n", | |
| "## 6 - Deep Q-Learning\n", | |
| "\n", | |
| "In cases where both the state and action space are discrete we can estimate the action-value function iteratively by using the Bellman equation:\n", | |
| "\n", | |
| "$$\n", | |
| "Q_{i+1}(s,a) = R + \\gamma \\max_{a'}Q_i(s',a')\n", | |
| "$$\n", | |
| "\n", | |
| "This iterative method converges to the optimal action-value function $Q^*(s,a)$ as $i\\to\\infty$. This means that the agent just needs to gradually explore the state-action space and keep updating the estimate of $Q(s,a)$ until it converges to the optimal action-value function $Q^*(s,a)$. However, in cases where the state space is continuous it becomes practically impossible to explore the entire state-action space. Consequently, this also makes it practically impossible to gradually estimate $Q(s,a)$ until it converges to $Q^*(s,a)$.\n", | |
| "\n", | |
| "In Deep $Q$-Learning, we solve this problem by using a neural network to estimate the action-value function $Q(s,a)\\approx Q^*(s,a)$. We call this neural network a $Q$-Network and it can be trained by adjusting its weights at each iteration to minimize the mean-squared error in the Bellman equation.\n", | |
| "\n", | |
| "Unfortunately, using neural networks in reinforcement learning to estimate action-value functions has proven to be highly unstable. Luckily, there are a couple of techniques that can be employed to avoid instabilities. These techniques consist of using a ***Target Network*** and ***Experience Replay***. We will explore these two techniques in the following sections." | |
| ] | |
| }, | |
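| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "As a quick worked illustration of the Bellman update above (the numbers are made up for this example and do not come from the Lunar Lander environment): suppose $R = 1$, $\\gamma = 0.9$, and the current estimates for the four actions in the next state are $Q_i(s',\\cdot) = (0,\\ 2,\\ 1,\\ -1)$. Then:\n", | |
| "\n", | |
| "$$\n", | |
| "Q_{i+1}(s,a) = 1 + 0.9 \\cdot \\max(0,\\ 2,\\ 1,\\ -1) = 1 + 0.9 \\cdot 2 = 2.8\n", | |
| "$$\n", | |
| "\n", | |
| "Only the largest next-state estimate enters the update; the other three values have no effect on $Q_{i+1}(s,a)$." | |
| ] | |
| }, | |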
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "<a name=\"6.1\"></a>\n", | |
| "### 6.1 Target Network\n", | |
| "\n", | |
| "We can train the $Q$-Network by adjusting its weights at each iteration to minimize the mean-squared error in the Bellman equation, where the target values are given by:\n", | |
| "\n", | |
| "$$\n", | |
| "y = R + \\gamma \\max_{a'}Q(s',a';w)\n", | |
| "$$\n", | |
| "\n", | |
| "where $w$ are the weights of the $Q$-Network. This means that we are adjusting the weights $w$ at each iteration to minimize the following error:\n", | |
| "\n", | |
| "$$\n", | |
| "\\overbrace{\\underbrace{R + \\gamma \\max_{a'}Q(s',a'; w)}_{\\rm {y~target}} - Q(s,a;w)}^{\\rm {Error}}\n", | |
| "$$\n", | |
| "\n", | |
| "Notice that this poses a problem because the $y$ target is changing on every iteration. Having a constantly moving target can lead to oscillations and instabilities. To avoid this, we can create\n", | |
| "a separate neural network for generating the $y$ targets. We call this separate neural network the **target $\\hat Q$-Network** and it will have the same architecture as the original $Q$-Network. By using the target $\\hat Q$-Network, the above error becomes:\n", | |
| "\n", | |
| "$$\n", | |
| "\\overbrace{\\underbrace{R + \\gamma \\max_{a'}\\hat{Q}(s',a'; w^-)}_{\\rm {y~target}} - Q(s,a;w)}^{\\rm {Error}}\n", | |
| "$$\n", | |
| "\n", | |
| "where $w^-$ and $w$ are the weights of the target $\\hat Q$-Network and $Q$-Network, respectively.\n", | |
| "\n", | |
| "In practice, we will use the following algorithm: every $C$ time steps we will use the target $\\hat Q$-Network to generate the $y$ targets and update the weights of the target $\\hat Q$-Network using the weights of the $Q$-Network. We will update the weights $w^-$ of the target $\\hat Q$-Network using a **soft update**. This means that we will update the weights $w^-$ using the following rule:\n", | |
| " \n", | |
| "$$\n", | |
| "w^-\\leftarrow \\tau w + (1 - \\tau) w^-\n", | |
| "$$\n", | |
| "\n", | |
| "where $\\tau\\ll 1$. By using the soft update, we are ensuring that the target values, $y$, change slowly, which greatly improves the stability of our learning algorithm." | |
| ] | |
| }, | |
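| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "To get a feel for how slowly the soft update moves the target weights, here is a small worked example. The numbers are assumptions chosen for illustration only: a single weight with $w = 1.0$ and $w^- = 0.5$, and $\\tau = 0.001$ (the value of $\\tau$ actually used is set in the `utils` module and is not shown here).\n", | |
| "\n", | |
| "$$\n", | |
| "w^- \\leftarrow \\tau w + (1 - \\tau) w^- = 0.001 \\cdot 1.0 + 0.999 \\cdot 0.5 = 0.5005\n", | |
| "$$\n", | |
| "\n", | |
| "With these values, the target weight moves only $0.1\\%$ of the way from its old value towards $w$ in one update, so the $y$ targets drift gradually instead of jumping with every change in the $Q$-Network." | |
| ] | |
| }, | |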
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "<a name=\"ex01\"></a>\n", | |
| "### Exercise 1\n", | |
| "\n", | |
| "In this exercise you will create the $Q$ and target $\\hat Q$ networks and set the optimizer. Remember that the Deep $Q$-Network (DQN) is a neural network that approximates the action-value function $Q(s,a)\\approx Q^*(s,a)$. It does this by learning how to map states to $Q$ values.\n", | |
| "\n", | |
| "To solve the Lunar Lander environment, we are going to employ a DQN with the following architecture:\n", | |
| "\n", | |
| "* An `Input` layer that takes `state_size` as input, i.e. the shape of the state vector $s$, which has 8 elements.\n", | |
| "\n", | |
| "* A `Dense` layer as HL1 with `64` units / NEURONS and a `ReLU` activation function.\n", | |
| "\n", | |
| "* A `Dense` layer as HL2 with `64` units / NEURONS and a `ReLU` activation function.\n", | |
| "\n", | |
| "* A `Dense` layer as OL with `num_actions` = 4 units / NEURONS / outputs and a `linear` activation function. This will be the output layer (OL) of our Network.\n", | |
| "\n", | |
| "\n", | |
| "In the cell below you should create the $Q$-Network and the target $\\hat Q$-Network using the model architecture described above. Remember that BOTH the $Q$-Network and the target $\\hat Q$-Network HAVE the SAME ARCHITECTURE.\n", | |
| "\n", | |
| "Lastly, you should set `Adam` as the OPTIMIZER (GD) with a learning rate (α) equal to `ALPHA`. Recall that `ALPHA = 0.001` was defined in the [Hyperparameters](#2) section. We should note that for this exercise you should use the already imported packages:\n", | |
| "```python\n", | |
| "from tensorflow.keras.layers import Dense, Input\n", | |
| "from tensorflow.keras.optimizers import Adam\n", | |
| "```" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 9, | |
| "metadata": { | |
| "deletable": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "# EXERCISE 1\n", | |
| "# UNQ_C1\n", | |
| "# GRADED CELL\n", | |
| "\n", | |
| "# Create the Q-Network NN model\n", | |
| "q_network = Sequential([\n", | |
| " \n", | |
| " ### START CODE HERE ###\n", | |
| " \n", | |
| " Input(shape = state_size), # Input layer -> state_size s = (8,_) 1D vector\n", | |
| " # s = [x,y,x',y',theta,theta',r,l]\n", | |
| " Dense(units=64, activation='relu'), # HL1 -> 64 Units / Neurons with 'ReLU' activation function \n", | |
| " # g(z) = 0 (z < 0) and g(z) = z (z >= 0) \n", | |
| " Dense(units=64, activation='relu'), # HL2 -> 64 Units / Neurons with 'ReLU' activation function \n", | |
| " # g(z) = 0 (z < 0) and g(z) = z (z >= 0)\n", | |
| " Dense(units=num_actions, activation='linear'), # OL -> num_actions a = 4 Units / Neurons with 'linear' \n", | |
| " # g(z) = z activation function\n", | |
| " ### END CODE HERE ### \n", | |
| " ])\n", | |
| "\n", | |
| "# Create the target Q^-Network NN model (The SAME architecture as the previous NN model)\n", | |
| "target_q_network = Sequential([\n", | |
| " \n", | |
| " ### START CODE HERE ### \n", | |
| " \n", | |
| " Input(shape = state_size), # Input layer -> state_size s = (8,_) 1D vector\n", | |
| " # s = [x,y,x',y',theta,theta',r,l]\n", | |
| " Dense(units=64, activation='relu'), # HL1 -> 64 Units / Neurons with 'ReLU' activation function \n", | |
| " # g(z) = 0 (z < 0) and g(z) = z (z >= 0) \n", | |
| " Dense(units=64, activation='relu'), # HL2 -> 64 Units / Neurons with 'ReLU' activation function \n", | |
| " # g(z) = 0 (z < 0) and g(z) = z (z >= 0)\n", | |
| " Dense(units=num_actions, activation='linear'), # OL -> num_actions a = 4 Units / Neurons with 'linear' \n", | |
| " # g(z) = z activation function\n", | |
| " ### END CODE HERE ###\n", | |
| " ])\n", | |
| "\n", | |
| "# Create OPTIMIZER (GD) = Adam(alpha = 0.001) function\n", | |
| "### START CODE HERE ###\n", | |
| "\n", | |
| "optimizer = Adam(learning_rate = ALPHA) # Adam optimizer with learning rate (step size) alpha = ALPHA = 0.001\n", | |
| "\n", | |
| "### END CODE HERE ###" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 10, | |
| "metadata": { | |
| "deletable": false | |
| }, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "\u001b[92mAll tests passed!\n", | |
| "\u001b[92mAll tests passed!\n", | |
| "\u001b[92mAll tests passed!\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "# Import ALL (*) modules from ‘public_tests’ for UNIT TEST\n", | |
| "from public_tests import *\n", | |
| "\n", | |
| "# TEST Q NN model\n", | |
| "test_network(q_network)\n", | |
| "\n", | |
| "# TEST target Q^ NN model\n", | |
| "test_network(target_q_network)\n", | |
| "\n", | |
| "# TEST 'Adam()' OPTIMIZER (GD)\n", | |
| "test_optimizer(optimizer, ALPHA) " | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "<details>\n", | |
| " <summary><font size=\"3\" color=\"darkgreen\"><b>Click for hints</b></font></summary>\n", | |
| " \n", | |
| "```python\n", | |
| "# Create the Q-Network\n", | |
| "q_network = Sequential([\n", | |
| " Input(shape=state_size), \n", | |
| " Dense(units=64, activation='relu'), \n", | |
| " Dense(units=64, activation='relu'), \n", | |
| " Dense(units=num_actions, activation='linear'),\n", | |
| " ])\n", | |
| "\n", | |
| "# Create the target Q^-Network\n", | |
| "target_q_network = Sequential([\n", | |
| " Input(shape=state_size), \n", | |
| " Dense(units=64, activation='relu'), \n", | |
| " Dense(units=64, activation='relu'), \n", | |
| " Dense(units=num_actions, activation='linear'), \n", | |
| " ])\n", | |
| "\n", | |
| "optimizer = Adam(learning_rate=ALPHA) \n", | |
| "``` " | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "<a name=\"6.2\"></a>\n", | |
| "### 6.2 Experience Replay\n", | |
| "\n", | |
| "When an agent interacts with the environment, the states, actions, and rewards the agent experiences are sequential by nature. If the agent tries to learn from these consecutive experiences it can run into problems due to the strong correlations between them. To avoid this, we employ a technique known as **Experience Replay** to generate uncorrelated experiences for training our agent. Experience replay consists of storing the agent's experiences (i.e. the states, actions, and rewards the agent receives) in a memory buffer and then sampling a random mini-batch of experiences from the buffer to do the learning. The experience tuples $(S_t, A_t, R_t, S_{t+1})$ will be added to the memory buffer at each time step as the agent interacts with the environment.\n", | |
| "\n", | |
| "For convenience, we will store each experience as a `namedtuple`." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 11, | |
| "metadata": { | |
| "deletable": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "# Define the 'Experience' namedtuple type used to store each experience (s, a, R, s', done)\n", | |
| "experience = namedtuple(\"Experience\", field_names=[\"state\", \"action\", \"reward\", \"next_state\", \"done\"])" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "By using experience replay we avoid problematic correlations, oscillations and instabilities. In addition, experience replay also allows the agent to potentially use the same experience in multiple weight updates, which increases data efficiency." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "<a name=\"7\"></a>\n", | |
| "## 7 - Deep Q-Learning Algorithm with Experience Replay\n", | |
| "\n", | |
| "Now that we know all the techniques that we are going to use, we can put them together to arrive at the Deep Q-Learning Algorithm With Experience Replay.\n", | |
| "<br>\n", | |
| "<br>\n", | |
| "<figure>\n", | |
| " <img src = \"images/deep_q_algorithm.png\" width = 90% style = \"border: thin silver solid; padding: 0px\">\n", | |
| " <figcaption style = \"text-align: center; font-style: italic\">Fig 3. Deep Q-Learning with Experience Replay.</figcaption>\n", | |
| "</figure>" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "<a name=\"ex02\"></a>\n", | |
| "### Exercise 2\n", | |
| "\n", | |
| "In this exercise you will implement line ***12*** of the algorithm outlined in *Fig 3* above and you will also compute the loss between the $y$ targets and the $Q(s,a)$ values. In the cell below, complete the `compute_loss` function by setting the $y$ targets equal to:\n", | |
| "\n", | |
| "$$\n", | |
| "\\begin{equation}\n", | |
| " y_j =\n", | |
| " \\begin{cases}\n", | |
| " R_j & \\text{if episode terminates at step } j+1\\\\\n", | |
| " R_j + \\gamma \\max_{a'}\\hat{Q}(s_{j+1},a') & \\text{otherwise}\\\\\n", | |
| " \\end{cases} \n", | |
| "\\end{equation}\n", | |
| "$$\n", | |
| "\n", | |
| "Here are a couple of things to note:\n", | |
| "\n", | |
| "* The `compute_loss` function takes in a mini-batch of experience tuples. This mini-batch of experience tuples is unpacked to extract the `states s`, `actions a`, `rewards R(s)`, `next_states s'`, and `done_vals` (`True` when the episode ends, `False` otherwise). You should keep in mind that these variables are *TensorFlow Tensors* whose size will depend on the mini-batch size. For example, if the mini-batch size is `64`, then both `rewards` and `done_vals` will be TensorFlow Tensors of shape `(64,)`, i.e. with `64` elements.\n", | |
| "\n", | |
| "\n", | |
| "* Using `if/else` statements to set the $y$ targets will not work when the variables are tensors with many elements. However, notice that you can use the `done_vals` to implement the above in a single line of code. To do this, recall that the `done` variable is a Boolean variable that takes the value `True` when an episode terminates at step $j+1$ and it is `False` otherwise. Taking into account that a Boolean value of `True` has the numerical value of `1` and a Boolean value of `False` has the numerical value of `0`, you can use the factor `(1 - done_vals)` to implement the above in a single line of code. Here's a hint: notice that `(1 - done_vals) = 0` when `done_vals = True = 1` and a value of `(1 - done_vals) = 1` when `done_vals = False = 0`. \n", | |
| "\n", | |
| "Lastly, compute the `loss = MSE(y_true, y_pred)` by calculating the Mean-Squared Error (`MSE`) between the `y_targets` (`y_true`) and the `q_values` (`y_pred`). To calculate the mean-squared error you should use the already imported function `MSE`:\n", | |
| "```python\n", | |
| "from tensorflow.keras.losses import MSE\n", | |
| "```" | |
| ] | |
| }, | |
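| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "The `(1 - done_vals)` trick described above can be checked by hand on a tiny example. Writing $d_j$ for the numerical value of `done` (a symbol introduced here only for this illustration), the two cases collapse into the single expression $y_j = R_j + \\gamma\\,(1 - d_j)\\max_{a'}\\hat{Q}(s_{j+1},a')$. The numbers below are made up: $\\gamma = 0.9$, a non-terminal experience with $R_1 = 1$ and $\\max_{a'}\\hat{Q}(s_2,a') = 2$, and a terminal experience with $R_2 = -100$ and $\\max_{a'}\\hat{Q}(s_3,a') = 5$.\n", | |
| "\n", | |
| "$$\n", | |
| "\\begin{aligned}\n", | |
| "y_1 &= 1 + 0.9 \\cdot (1 - 0) \\cdot 2 = 2.8\\\\\n", | |
| "y_2 &= -100 + 0.9 \\cdot (1 - 1) \\cdot 5 = -100\n", | |
| "\\end{aligned}\n", | |
| "$$\n", | |
| "\n", | |
| "For the terminal experience the factor $(1 - d_j)$ is zero, so the $\\hat{Q}$ estimate is ignored and $y_j = R_j$, exactly as in the first case of the definition." | |
| ] | |
| }, | |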
| { | |
| "cell_type": "code", | |
| "execution_count": 12, | |
| "metadata": { | |
| "deletable": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "# EXERCISE 2\n", | |
| "# UNQ_C2\n", | |
| "# GRADED FUNCTION: compute_loss\n", | |
| "\n", | |
| "def compute_loss(experiences, gamma, q_network, target_q_network):\n", | |
| "    \"\"\" \n", | |
| "    Calculates the loss.\n", | |
| "    \n", | |
| "    Args:\n", | |
| "      experiences: (tuple) tuple of [\"state\", \"action\", \"reward\", \"next_state\", \"done\"] namedtuples\n", | |
| "      gamma: (float) The discount factor.\n", | |
| "      q_network: (tf.keras.Sequential) Keras model for predicting the q_values\n", | |
| "      target_q_network: (tf.keras.Sequential) Keras model for predicting the targets\n", | |
| "          \n", | |
| "    Returns:\n", | |
| "      loss: (TensorFlow Tensor(shape=(), dtype=float32)) the Mean-Squared Error between\n", | |
| "            the y targets and the Q(s,a) values.\n", | |
| "    \"\"\"\n", | |
| "\n", | |
| "    # Unpack the mini-batch of 'experiences' into five tensors, each with one entry per experience\n", | |
| "    # (e.g. 64 entries for a mini-batch of 64 experiences)\n", | |
| "    states, actions, rewards, next_states, done_vals = experiences\n", | |
| "    \n", | |
| "    # Compute max over a' of Q^(s',a'): reduce over the last axis (-1), i.e. over the action columns\n", | |
| "    # [Q_nothing, Q_left, Q_main, Q_right], giving one value per experience -> shape (64,)\n", | |
| "    max_qsa = tf.reduce_max(target_q_network(next_states), axis=-1)\n", | |
| "    \n", | |
| "    ### START CODE HERE ###\n", | |
| "    \n", | |
| "    # Set y = R if episode terminates -> done_vals = True = 1 \n", | |
| "    # -> R + γ max Q^(s',a')[1-done] = R + γ max Q^(s',a')[1-1] = R + 0\n", | |
| "    \n", | |
| "    # otherwise set y = R + γ max Q^(s',a') -> done_vals = False = 0\n", | |
| "    # -> R + γ max Q^(s',a')[1-done] = R + γ max Q^(s',a')[1-0] = R + γ max Q^(s',a')\n", | |
| "    \n", | |
| "    # rewards, max_qsa and done_vals all have shape (64,), and so does y_targets\n", | |
| "    y_targets = rewards + ( gamma * max_qsa * (1 - done_vals) )\n", | |
| "    \n", | |
| "    print('y_targets mini-batch:',y_targets.shape)\n", | |
| "    \n", | |
| "    ### END CODE HERE ###\n", | |
| "    \n", | |
| "    # Get the Q values of all 4 actions for each state -> shape (64, 4)\n", | |
| "    q_values = q_network(states)\n", | |
| "    print('q_values original size:', q_values.shape)\n", | |
| "    \n", | |
| "    # Keep only the Q value of the action actually taken in each experience,\n", | |
| "    # reducing 'q_values' from (64, 4) to (64,) to match 'y_targets' (64,)\n", | |
| "    #               col0       col1       col2       col3\n", | |
| "    # Q = [row0  [Q_action0, Q_action1, Q_action2, Q_action3],\n", | |
| "    #      row1  [Q_action0, Q_action1, Q_action2, Q_action3],\n", | |
| "    #      ...\n", | |
| "    #      row63 [Q_action0, Q_action1, Q_action2, Q_action3] ]\n", | |
| "    \n", | |
| "    # select Q position = [[row 0,  col_action a],\n", | |
| "    #                      [row 1,  col_action a],\n", | |
| "    #                      ...\n", | |
| "    #                      [row 63, col_action a]]\n", | |
| "    \n", | |
| "    # col_action a = 0 -> Do nothing\n", | |
| "    # col_action a = 1 -> Left\n", | |
| "    # col_action a = 2 -> Main\n", | |
| "    # col_action a = 3 -> Right\n", | |
| "    \n", | |
| "    # Row indices 0 .. batch_size - 1\n", | |
| "    row_indices = tf.range(q_values.shape[0])\n", | |
| "    \n", | |
| "    # Column indices = the actions taken, cast to integers\n", | |
| "    col_indices = tf.cast(actions, tf.int32)\n", | |
| "    \n", | |
| "    # Stack the two index vectors as COLUMNS (axis = 1) -> [row, col_action] pairs, shape (64, 2)\n", | |
| "    row_col_indices = tf.stack([row_indices, col_indices], axis=1)\n", | |
| "    \n", | |
| "    # Select the 64 Q values at the [row, col_action] positions/indices\n", | |
| "    q_values = tf.gather_nd(q_values, row_col_indices)\n", | |
| "    \n", | |
| "    ## Example of [row, col_action] stack -> \n", | |
| "    ## print(tf.stack([tf.range(3), tf.cast([0.0, 0.1, 0.2], tf.int32)], axis=1))\n", | |
| "    print('q_values mini - batch:',q_values.shape)\n", | |
| "    \n", | |
| "    ### START CODE HERE ###\n", | |
| "    \n", | |
| "    # Compute the loss = MSE = [1 / m_examples] SUM i=1 -> m ( y_targets(i) - q_values(i) )^2,\n", | |
| "    # where y_true = y_targets and y_pred = q_values\n", | |
| "    loss = MSE(y_targets, q_values) \n", | |
| "    \n", | |
| "    print ('loss / MSE:', loss, '\\n')\n", | |
| "    ### END CODE HERE ### \n", | |
| "    \n", | |
| "    return loss" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 13, | |
| "metadata": { | |
| "deletable": false, | |
| "editable": false | |
| }, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "y_targets mini-batch: (64,)\n", | |
| "q_values original size: (64, 4)\n", | |
| "q_values mini - batch: (64,)\n", | |
| "loss / MSE: tf.Tensor(0.6991737, shape=(), dtype=float32) \n", | |
| "\n", | |
| "y_targets mini-batch: (64,)\n", | |
| "q_values original size: (64, 4)\n", | |
| "q_values mini - batch: (64,)\n", | |
| "loss / MSE: tf.Tensor(0.34327018, shape=(), dtype=float32) \n", | |
| "\n", | |
| "y_targets mini-batch: (64,)\n", | |
| "q_values original size: (64, 4)\n", | |
| "q_values mini - batch: (64,)\n", | |
| "loss / MSE: tf.Tensor(0.0, shape=(), dtype=float32) \n", | |
| "\n", | |
| "y_targets mini-batch: (64,)\n", | |
| "q_values original size: (64, 4)\n", | |
| "q_values mini - batch: (64,)\n", | |
| "loss / MSE: tf.Tensor(1.0, shape=(), dtype=float32) \n", | |
| "\n", | |
| "\u001b[92mAll tests passed!\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "# UNIT TEST \n", | |
| "test_compute_loss(compute_loss)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "<details>\n", | |
| " <summary><font size=\"3\" color=\"darkgreen\"><b>Click for hints</b></font></summary>\n", | |
| " \n", | |
| "```python\n", | |
| "def compute_loss(experiences, gamma, q_network, target_q_network):\n", | |
| "    \"\"\" \n", | |
| "    Calculates the loss.\n", | |
| "    \n", | |
| "    Args:\n", | |
| "      experiences: (tuple) tuple of [\"state\", \"action\", \"reward\", \"next_state\", \"done\"] namedtuples\n", | |
| "      gamma: (float) The discount factor.\n", | |
| "      q_network: (tf.keras.Sequential) Keras model for predicting the q_values\n", | |
| "      target_q_network: (tf.keras.Sequential) Keras model for predicting the targets\n", | |
| "          \n", | |
| "    Returns:\n", | |
| "      loss: (TensorFlow Tensor(shape=(), dtype=float32)) the Mean-Squared Error between\n", | |
| "            the y targets and the Q(s,a) values.\n", | |
| "    \"\"\"\n", | |
| "\n", | |
| "    \n", | |
| "    # Unpack the mini-batch of experience tuples\n", | |
| "    states, actions, rewards, next_states, done_vals = experiences\n", | |
| "    \n", | |
| "    # Compute max Q^(s',a') over the actions a'\n", | |
| "    max_qsa = tf.reduce_max(target_q_network(next_states), axis=-1)\n", | |
| "    \n", | |
| "    # Set y = R if episode terminates, otherwise set y = R + γ max Q^(s',a').\n", | |
| "    y_targets = rewards + (gamma * max_qsa * (1 - done_vals))\n", | |
| "    \n", | |
| "    # Get the q_values of the actions that were actually taken\n", | |
| "    q_values = q_network(states)\n", | |
| "    q_values = tf.gather_nd(q_values, tf.stack([tf.range(q_values.shape[0]),\n", | |
| "                                                tf.cast(actions, tf.int32)], axis=1))\n", | |
| "    \n", | |
| "    # Calculate the loss\n", | |
| "    loss = MSE(y_targets, q_values)\n", | |
| "    \n", | |
| "    return loss\n", | |
| "\n", | |
| "``` \n", | |
| " " | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "<a name=\"8\"></a>\n", | |
| "## 8 - Update the Network Weights\n", | |
| "\n", | |
| "We will use the `agent_learn` function below to implement lines ***12-14*** of the algorithm outlined in [Fig 3](#7). The `agent_learn` function will update the weights of the $Q$ and target $\\hat Q$ networks using a custom training loop. Because we are using a custom training loop, we need to retrieve the gradients via a `tf.GradientTape` instance, which records the operations needed for automatic differentiation, and then call `optimizer.apply_gradients()` to update the weights of our $Q$-Network. Note that we are also using the `@tf.function` decorator to increase performance. Without this decorator our training will take roughly twice as long. If you would like to know more about how to increase performance with `@tf.function` take a look at the [TensorFlow documentation](https://www.tensorflow.org/guide/function).\n", | |
| "\n", | |
| "The last line of this function updates the weights of the target $\\hat Q$-Network using a [soft update](#6.1). If you want to know how this is implemented in code we encourage you to take a look at the `utils.update_target_network` function in the `utils` module." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 14, | |
| "metadata": { | |
| "deletable": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "# The '@tf.function' decorator compiles the function into a TensorFlow graph, which increases performance.\n", | |
| "# Without this decorator our training will take roughly twice as long.\n", | |
| "@tf.function\n", | |
| "def agent_learn(experiences, gamma):\n", | |
| "    \"\"\"\n", | |
| "    Updates the weights of the Q networks.\n", | |
| "    \n", | |
| "    Args:\n", | |
| "      experiences: (tuple) tuple of [\"state\", \"action\", \"reward\", \"next_state\", \"done\"] namedtuples\n", | |
| "      gamma: (float) The discount factor.\n", | |
| "    \n", | |
| "    \"\"\"\n", | |
| "    # Use TensorFlow's GradientTape to record the operations used to compute the loss / cost J,\n", | |
| "    # enabling automatic differentiation (Line 12)\n", | |
| "    with tf.GradientTape() as tape:\n", | |
| "        \n", | |
| "        # Compute the loss / cost J with the 'compute_loss()' function (Line 12)\n", | |
| "        loss = compute_loss(experiences, gamma, q_network, target_q_network)\n", | |
| "\n", | |
| "    # Get the gradients of the loss with respect to the trainable weights W and biases B\n", | |
| "    # of the q_network -> dJ/dW and dJ/dB for each layer (Line 13)\n", | |
| "    gradients = tape.gradient(loss, q_network.trainable_variables)\n", | |
| "    \n", | |
| "    # zip() pairs each gradient with its variable, for example for the first layer:\n", | |
| "    # ( dJ/dW (8,64), W (8,64) ), ( dJ/dB (64,), B (64,) ), and so on for the other layers.\n", | |
| "    # The optimizer then runs one (1) update step on every pair, changing W and B\n", | |
| "    # of the q_network so as to decrease the loss / cost J (Line 13)\n", | |
| "    optimizer.apply_gradients(zip(gradients, q_network.trainable_variables))\n", | |
| "\n", | |
| "    # Soft-update the weights of the target_q_network towards those of the q_network (Line 14)\n", | |
| "    utils.update_target_network(q_network, target_q_network)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "<a name=\"9\"></a>\n", | |
| "## 9 - Train the Agent\n", | |
| "\n", | |
| "We are now ready to train our agent to solve the Lunar Lander environment. In the cell below we will implement the algorithm in [Fig 3](#7) line by line (please note that we have included the same algorithm below for easy reference. This will prevent you from scrolling up and down the notebook):\n", | |
| "\n", | |
| "* **Line 1**: We initialize the `memory_buffer` with a capacity of $N =$ `MEMORY_SIZE`. Notice that we are using a `deque` as the data structure for our `memory_buffer`.\n", | |
| "\n", | |
| "\n", | |
| "* **Line 2**: We skip this line since we already initialized the `q_network` in [Exercise 1](#ex01).\n", | |
| "\n", | |
| "\n", | |
| "* **Line 3**: We initialize the `target_q_network` by setting its weights to be equal to those of the `q_network`.\n", | |
| "\n", | |
| "\n", | |
| "* **Line 4**: We start the outer loop. Notice that we have set $M =$ `num_episodes = 2000 iterations / episodes`. This number is reasonable because the agent should be able to solve the Lunar Lander environment in less than `2000 iterations / episodes` using this notebook's default parameters.\n", | |
| "\n", | |
| "\n", | |
| "* **Line 5**: We use the `.reset()` method to reset the environment to the initial state (top center of the environment) and get the initial `state s`, an 8-element state vector.\n", | |
| "\n", | |
| "\n", | |
| "* **Line 6**: We start the inner loop. Notice that we have set $T =$ `max_num_timesteps = 1000 max time steps t, each episode / iteration i`. This means that the episode will AUTOMATICALLY TERMINATE if the episode hasn't terminated after `1000` time steps t.\n", | |
| "\n", | |
| "\n", | |
| "* **Line 7**: The agent observes the current `state s` and chooses an `action a` using an $\\epsilon$-greedy policy: with probability $\\epsilon$ it takes a random action (exploration), and with probability $1 - \\epsilon$ it takes the action that maximizes $Q(s,a)$ (exploitation). Our agent starts out using a value of $\\epsilon =$ `epsilon = 1`, which yields an $\\epsilon$-greedy policy that is equivalent to the equiprobable random policy. This means that at the beginning of training, the agent never picks the greedy action on purpose and just takes random actions regardless of the observed `state s`. As training progresses, we will decrease the value of $\\epsilon$ slowly towards a minimum value using a given $\\epsilon$-decay rate. We want this minimum value to be close to zero because a value of $\\epsilon = 0$ yields an $\\epsilon$-greedy policy that is equivalent to the greedy policy. This means that towards the end of training, the agent will lean towards selecting the `action a` that it believes (based on its past experiences) will maximize $Q(s,a)$ at `state s`. We will set the minimum $\\epsilon$ value to `0.01` and not exactly 0, so that about 99% of the time the agent picks the action that maximizes $Q(s,a)$ and about 1% of the time it picks a random action, which keeps a little bit of exploration during training. If you want to know how this is implemented in code we encourage you to take a look at the `utils.get_action()` function in the `utils` module.\n", | |
| "\n", | |
| "\n", | |
| "* **Line 8**: We use the `env.step()` method to take the given `action a` in the environment and get the `reward R(s)` and the `next_state s'`. \n", | |
| "\n", | |
| "\n", | |
| "* **Line 9**: We store the `experience(state, action, reward, next_state, done)` tuple in our `memory_buffer`. Notice that we also store the `done (True / False)` variable so that we can keep track of when an episode terminates (`done = True`). This allowed us to set the $y$ targets in [Exercise 2](#ex02).\n", | |
| "\n", | |
| "\n", | |
| "* **Line 10**: We check if the conditions are met to perform a `learning update`. We do this by using our custom `utils.check_update_conditions()` function. This function checks if $C =$ `NUM_STEPS_FOR_UPDATE = 4` time steps t have occurred and if our `memory_buffer` list [ ] has enough ‘experience’ tuples `[ experience1, experience2, ... , experience64, ... ] = [ (s,a,R,s',done)1, (s,a,R,s',done)2, ... , (s,a,R,s',done)64, ... ]` to FILL a `mini-batch`. For example, if the mini-batch size is `64`, then our `memory_buffer` should have `MORE THAN 64 'experience' tuples to be SAMPLED RANDOMLY`, in order to pass the latter condition. If the conditions are met, then the `utils.check_update_conditions()` function will return a BOOLEAN value of `True`, otherwise it will return a BOOLEAN value of `False`.\n", | |
| "\n", | |
| "\n", | |
| "* **Lines 11 - 14**: If `update = True` then we perform a learning UPDATE. The learning UPDATE consists of `SAMPLING a RANDOM mini-batch of 'experience' tuples` from our `memory_buffer = [ experience1, experience2, ... , experience64, ... ] = [ (s,a,R,s',done)1, (s,a,R,s',done)2, ... , (s,a,R,s',done)64, ... ]` (Line 11), setting the **$y$ targets** with the target Q^-network (Line 12), performing a **Gradient Descent step -> 'Adam' optimizer** (Line 13), and **UPDATING the weights W,B** (Line 14) of the Q-network and target Q^-network. We will use the `agent_learn()` function we defined in the previous [Section 8](#8) to perform the latter 3.\n", | |
| "\n", | |
| "\n", | |
| "* **Line 15**: At the end of each time step t (t = 0, 1, ... , 999) of the inner loop, we **SET** the `next_state s'` as our **current** `state s` so that the next iteration of the inner loop **STARTS** from this NEW `state s = s'`. We also **ADD** the NEW `reward` to the `total_points` counter of rewards. In addition, we **check if the episode has reached a terminal state** (i.e. we check if `done = True`). If a terminal state HAS BEEN REACHED, then we `break` out of the inner loop.\n", | |
| "\n", | |
| "\n", | |
| "* **Line 16**: At the **end** of each iteration (i = 0, 1, 2, ... , 1999) of the outer loop, we append the updated `total_points` counter of rewards to the `total_point_history` list [ ], then select the LAST `num_p_av = 100` entries of `total_point_history` and get their `mean`. Then we UPDATE the value of $\\epsilon$ to the MAX of `E_MIN = 0.01` and the DECREASED value `E_DECAY * epsilon = 0.995 * epsilon`, and check if the 'environment' has been solved. We consider the 'environment' SOLVED if the agent receives an average reward `av_latest_points` ≥ 200 points over the last `100` episodes. If the 'environment' HAS NOT BEEN solved, we continue the outer loop and **start** a NEW episode.\n", | |
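The ε-greedy choice in Line 7 and the ε decay in Line 16 can be sketched in plain Python. This is only an illustration under stated assumptions: `get_action` and `get_new_eps` below are stand-alone re-implementations that work on a plain list of Q-values, not the `utils` functions themselves, and the Q-values are made up.

```python
import random

E_MIN = 0.01     # minimum epsilon, as quoted in the text
E_DECAY = 0.995  # per-episode decay rate, as quoted in the text


def get_action(q_values, epsilon, rng):
    """Epsilon-greedy choice over a plain list of Q-values (illustrative helper)."""
    if rng.random() > epsilon:
        # Exploit: index of the largest Q-value.
        return max(range(len(q_values)), key=lambda a: q_values[a])
    # Explore: uniformly random action index.
    return rng.randrange(len(q_values))


def get_new_eps(epsilon):
    """Decay epsilon once, never dropping below E_MIN."""
    return max(E_MIN, E_DECAY * epsilon)


rng = random.Random(0)           # seeded only to make the sketch repeatable
q = [0.1, -0.4, 2.3, 0.7]        # made-up Q-values for the 4 actions

greedy_action = get_action(q, epsilon=0.0, rng=rng)  # p = 1: exploits
random_action = get_action(q, epsilon=1.0, rng=rng)  # p = 0: explores

eps = 1.0
for _ in range(2000):            # one decay per episode
    eps = get_new_eps(eps)

print(greedy_action, random_action, eps)
```

With ε = 1 the comparison `rng.random() > epsilon` can never be true, which is why the agent acts fully at random at the start of training; once the decayed value falls below `E_MIN`, the `max` keeps ε fixed at 0.01.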
| "\n", | |
| "Finally, we wanted to note that we have **included some extra variables to keep track** of the total number of points (sum of rewards) the agent received in each episode, and of the average over the latest episodes. This will help us **determine if the agent has SOLVED** the 'environment' and it will also allow us to `see how our agent performed during training`. We also use the `time.time()` function from the `time` module to measure `HOW LONG` the training cell takes **(Execution_time = end - start)**. \n", | |
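As a rough illustration of this bookkeeping, the sketch below averages the latest episode totals, applies the 200-point criterion, and times the run. The helper name `is_solved` and the episode totals are assumptions made up for the example; the notebook itself uses `np.mean` on `total_point_history[-num_p_av:]`.

```python
import time
from statistics import mean

NUM_P_AV = 100          # averaging window, as in the text
SOLVED_THRESHOLD = 200  # average points required, as in the text


def is_solved(total_point_history, window=NUM_P_AV, threshold=SOLVED_THRESHOLD):
    """Return (average of the last `window` episode totals, solved flag)."""
    latest = total_point_history[-window:]   # shorter slice if fewer episodes exist
    av_latest_points = mean(latest)
    return av_latest_points, av_latest_points >= threshold


start = time.time()

# Made-up episode totals, only to exercise the helper.
history = [-150.0] * 50 + [250.0] * 100
avg, solved = is_solved(history)

tot_time = time.time() - start
print(f"avg={avg:.2f} solved={solved} runtime={tot_time:.4f} s")
```

Because the slice simply returns fewer items early in training, the average is defined from the first episode onward, which matches how the progress line can be printed after episode 1.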
| "\n", | |
| "<br>\n", | |
| "<br>\n", | |
| "<figure>\n", | |
| " <img src = \"images/deep_q_algorithm.png\" width = 90% style = \"border: thin silver solid; padding: 0px\">\n", | |
| " <figcaption style = \"text-align: center; font-style: italic\">Fig 4. Deep Q-Learning with Experience Replay.</figcaption>\n", | |
| "</figure>\n", | |
| "<br>\n", | |
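The replay buffer and the update check described for Lines 9 to 11 can be sketched with the standard library alone. This is a minimal sketch, assuming the condition quoted in the training cell's comments (`(t + 1) % num_steps_upd == 0 and len(memory_buffer) > MINIBATCH_SIZE`); the `Experience` tuple, `sample_minibatch`, and the dummy transitions are illustrative stand-ins, not the notebook's own objects.

```python
import random
from collections import deque, namedtuple

MEMORY_SIZE = 100_000      # buffer capacity N, as in the notebook
MINIBATCH_SIZE = 64        # mini-batch size, as in the text
NUM_STEPS_FOR_UPDATE = 4   # C, as in the text

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


def check_update_conditions(t, num_steps_upd, memory_buffer):
    """True every `num_steps_upd` steps once the buffer holds more than one mini-batch."""
    return (t + 1) % num_steps_upd == 0 and len(memory_buffer) > MINIBATCH_SIZE


def sample_minibatch(memory_buffer, k=MINIBATCH_SIZE, rng=random):
    """Draw k distinct experiences uniformly at random."""
    return rng.sample(list(memory_buffer), k)


memory_buffer = deque(maxlen=MEMORY_SIZE)  # oldest items drop out when full
update_steps = []
for t in range(80):
    # Dummy transition; a real one would come from env.step(action).
    memory_buffer.append(
        Experience(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
    )
    if check_update_conditions(t, NUM_STEPS_FOR_UPDATE, memory_buffer):
        update_steps.append(t)

batch = sample_minibatch(memory_buffer)
print(update_steps, len(batch))
```

In this toy run the first update happens at `t = 67`, the first multiple-of-four step at which the buffer holds more than 64 tuples.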
| "\n", | |
| "**Note:** With this notebook's default parameters, the following cell takes between 10 and 15 minutes to run. " | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 15, | |
| "metadata": { | |
| "deletable": false | |
| }, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "y_targets mini-batch: (64,)\n", | |
| "q_values original size: (64, 4)\n", | |
| "q_values mini - batch: (64,)\n", | |
| "loss / MSE: Tensor(\"Mean:0\", shape=(), dtype=float32) \n", | |
| "\n", | |
| "y_targets mini-batch: (64,)\n", | |
| "q_values original size: (64, 4)\n", | |
| "q_values mini - batch: (64,)\n", | |
| "loss / MSE: Tensor(\"Mean:0\", shape=(), dtype=float32) \n", | |
| "\n", | |
| "Episode 100 | Total point AVG 'reward' of the last 100 iters/episodes: -150.85\n", | |
| "Episode 200 | Total point AVG 'reward' of the last 100 iters/episodes: -106.11\n", | |
| "Episode 300 | Total point AVG 'reward' of the last 100 iters/episodes: -77.256\n", | |
| "Episode 400 | Total point AVG 'reward' of the last 100 iters/episodes: -25.01\n", | |
| "Episode 500 | Total point AVG 'reward' of the last 100 iters/episodes: 159.91\n", | |
| "Episode 534 | Total point AVG 'reward' of the last 100 iters/episodes: 201.37\n", | |
| "\n", | |
| "Environment solved in 534 episodes!\n", | |
| "\n", | |
| "Total Runtime: 734.99 s (12.25 min)\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "# Check the execution time, measuring time taken for\n", | |
| "# a code segment by recording 'start' and 'end' times\n", | |
| "# execution time = end time - start time\n", | |
| "start = time.time()\n", | |
| "\n", | |
| "# i-th iterations / episodes i=[0,1,2,...,1999] at outer loop of Lunar lander environment (Line 4)\n", | |
| "num_episodes = 2000\n", | |
| "\n", | |
| "# Max time steps t=[0,1,2,...,999] elapsed at inner loop, each episode/iteration i. (Line 6)\n", | |
| "max_num_timesteps = 1000\n", | |
| "\n", | |
| "# Initialize empty [] list to store Total updated rewards (1000 step times t), per iteration i\n", | |
| "total_point_history = []\n", | |
| "\n", | |
| "num_p_av = 100 # number of the LAST total points to use for averaging -> '100'\n", | |
| "epsilon = 1.0 # initial ε epsilon value for ε-greedy policy\n", | |
| "\n", | |
| "# Create a memory buffer D with capacity N = MEMORY_SIZE = 100_000 (an integer number;\n", | |
| "# the \"_\" is used just for legibility of large numbers in Python).\n", | |
| "# A 'deque' is a list-like data structure that allows elements to be added and removed\n", | |
| "# efficiently from both ends; with 'maxlen' set, the oldest items are discarded once it is full. (Line 1)\n", | |
| "memory_buffer = deque(maxlen=MEMORY_SIZE)\n", | |
| "\n", | |
| "# Exercise 1 (Line 2)\n", | |
| "\n", | |
| "# Set the target network weights equal to the Q-Network weights (line 3)\n", | |
| "target_q_network.set_weights(q_network.get_weights())\n", | |
| "\n", | |
| "# Start outer loop with M = 2000 Iters / episodes, to solve Lunar Lander environment\n", | |
| "# during these episodes/iters, using notebook's default parameters\n", | |
| "# i = 0, 1, 2, ... , 1999 iters / episodes. (Line 4)\n", | |
| "for i in range(num_episodes):\n", | |
| " \n", | |
| " # Reset the environment to the initial state (Top center of the environment)\n", | |
| " # and get the initial state vector s = [x, y, x’, y’, theta, theta’, l, r] (Line 5)\n", | |
| " state = env.reset()\n", | |
| " \n", | |
| " # Initialize total 'rewards' counter as '0'\n", | |
| " total_points = 0\n", | |
| " \n", | |
| " # Start inner loop: if the i-th episode hasn't terminated after T = 1000 time steps t,\n", | |
| " # it terminates automatically.\n", | |
| " # t = 0, 1, 2, ... , 999 time steps t, per iteration/episode i. (Line 6)\n", | |
| " for t in range(max_num_timesteps):\n", | |
| " \n", | |
| " # From the current state s, choose an action a using an ε-greedy policy, with p = (1 - ε)\n", | |
| " # Starts ε=1.0 -> p=(1 - ε) = 0.0 (0% of times PICK the action that MAXIMIZES Q(s, a) -> 'Exploration')\n", | |
| " # so we take 'RANDOM actions a' primarily. Then DECREASES epsilon ε until:\n", | |
| " # Ends ε=0.01 -> p=(1 - ε) = 0.99 (99% of times PICK the action that MAXIMIZES Q(s, a) -> 'Exploitation')\n", | |
| " # so we PICK 'NOT to take RANDOM actions a' primarily. (Line 7)\n", | |
| " \n", | |
| " # state s vector needs to be the 'right shape' to be the input for the 'q_network'\n", | |
| " # 1D state = [x, y, x’, y’, theta, theta’, l, r] shape (8, ) ->\n", | |
| " # Add a NEW dimension at 'axis = 0' (rows) position, so 'np.expand_dims(state, axis=0)' ->\n", | |
| " # 2D state s = [ [x, y, x’, y’, theta, theta’, l, r] ] shape (1, 8) (Line 7)\n", | |
| " state_qn = np.expand_dims(state, axis=0)\n", | |
| " \n", | |
| " # Input 2D state s = [ [x, y, x’, y’, theta, theta’, l, r] ] shape (1, 8) to NN 'q-network' model\n", | |
| " # 'q_values' output Tensor with shape (1 row, 4 Q-values):\n", | |
| " # [ [ 1 vector/row, each with 4 Q-values related to 4 possible actions a=[0=nothing, 1=left, 2=main, 3=right] ] ] \n", | |
| " # Q(s,a) = [ [Q1, Q2, Q3, Q4] ] 2D tensor vector (Line 7)\n", | |
| " q_values = q_network(state_qn)\n", | |
| " \n", | |
| " # From the current state s, Get an action a using the p = (1 - ε-greedy) policy with p[0-1] (+)\n", | |
| " ## if random.random() > epsilon: # When float random number[0-1) - epsilon[0-1] > 0.0 (+)\n", | |
| " ## return np.argmax(q_values.numpy()[0]) # return the action (index of q_values) [0-3] where Q-value is MAXIMUM\n", | |
| " ## else:\n", | |
| " ## return random.choice(np.arange(4)) # Otherwise when float random number[0-1) - epsilon[0-1] < 0.0 (-)\n", | |
| " ## # return a random action [0-3]\n", | |
| " # action is an integer between [0-3] -> action=0=nothing, action=1=left, action=2=main, action=3=right\n", | |
| " \n", | |
| " # From Q(s,a) = [ [Q1, Q2, Q3, Q4] ] 2D tensor vector CONVERT TO -> 1D numpy vector [Q1, Q2, Q3, Q4] -> \n", | |
| " # with probability (1 - ε) return the index of MAX Q(s,a) = action[0-3], otherwise return a RANDOM action[0-3] (Line 7)\n", | |
| " action = utils.get_action(q_values, epsilon)\n", | |
| " \n", | |
| " # Take (1) step in environment, to perform previous 'action' A, and receive a 'reward' R(S), in the next state S'\n", | |
| " # done = True -> if episode finished because it reached a terminal state\n", | |
| " # done = False -> if episode hasn't finished, i.e. it has NOT reached a terminal state yet. (Line 8)\n", | |
| " next_state, reward, done, _ = env.step(action)\n", | |
| " \n", | |
| " # state = env.reset() -> s = [x, y, x’, y’, theta, theta’, l, r]\n", | |
| " # action = utils.get_action(q_values, epsilon) -> action[0-3]: 0=nothing, 1=left, 2=main, 3=right\n", | |
| " # next_state, reward, done, _ = env.step(action)\n", | |
| " # Store 'experience' tuple = (state,action,reward,state',done) in the 'memory_buffer' list [].\n", | |
| " # We store the 'done' variable (True/False) as well for convenience.\n", | |
| " \n", | |
| " # 'experience' (named tuple) creates a new tuple -> experience(s,a,R,s',done)\n", | |
| " # memory_buffer = [ experience1, experience2, ... , experience64, ... ]\n", | |
| " # memory_buffer = [ (s,a,R,s',done)1, (s,a,R,s',done)2, ... , (s,a,R,s',done)64, ... ]\n", | |
| " # If mini-batch has 64 tuples/examples/rows, our 'memory_buffer'\n", | |
| " # should have more than 64 'experience' tuples in order to pass the latter condition (Line 9)\n", | |
| " memory_buffer.append(experience(state, action, reward, next_state, done))\n", | |
| " \n", | |
| " # ONLY UPDATE the Neural Network, when every C=NUM_STEPS_FOR_UPDATE=4 time steps t have occurred,\n", | |
| " # and when 'memory_buffer' has MORE THAN 64 tuples to be SAMPLED RANDOMLY, \n", | |
| " # because MINIBATCH_SIZE = 64 tuples/examples/rows,\n", | |
| " ## if (t + 1) % num_steps_upd == 0 and len(memory_buffer) > MINIBATCH_SIZE:\n", | |
| " ## return True\n", | |
| " ## else:\n", | |
| " ## return False\n", | |
| " # memory_buffer = [ experience1, experience2, ... , experience64, ... ]\n", | |
| " # memory_buffer = [ (s,a,R,s',done)1, (s,a,R,s',done)2, ... , (s,a,R,s',done)64, ... ]\n", | |
| " # Then update = 'True', otherwise update = 'False' (Line 10)\n", | |
| " update = utils.check_update_conditions(t, NUM_STEPS_FOR_UPDATE, memory_buffer)\n", | |
| " \n", | |
| " # When update = 'True'\n", | |
| " if update:\n", | |
| " \n", | |
| " # SAMPLE RANDOM mini-batch of 'experience' tuples (S,A,R,S',done) from D\n", | |
| " # memory_buffer = [ experience1, experience2, ... , experience64, ... ]\n", | |
| " # memory_buffer = [ (s,a,R,s',done)1, (s,a,R,s',done)2, ... , (s,a,R,s',done)64, ... ] \n", | |
| " ## experiences = random.sample(memory_buffer, k=MINIBATCH_SIZE) # Select k=64 elements randomly from\n", | |
| " ## # 'memory_buffer' list -> [()1,()2,()3,...,()64]\n", | |
| " ## states = tf.convert_to_tensor(np.array([e.state for e in experiences if e is not None]), dtype=tf.float32)\n", | |
| " ## actions = tf.convert_to_tensor(np.array([e.action for e in experiences if e is not None]), dtype=tf.float32)\n", | |
| " ## rewards = tf.convert_to_tensor(np.array([e.reward for e in experiences if e is not None]), dtype=tf.float32)\n", | |
| " ## next_states = tf.convert_to_tensor(np.array([e.next_state for e in experiences if e is not None]), \n", | |
| " ## dtype=tf.float32)\n", | |
| " ##done_vals = tf.convert_to_tensor(np.array([e.done for e in experiences if e is not None]).astype(np.uint8),\n", | |
| " ## dtype=tf.float32)\n", | |
| " ## return (states, actions, rewards, next_states, done_vals) # Each tuple's variable has tensors \n", | |
| " ## # with 64 elements \n", | |
| " # ( [[s]*64],[[a]*64],[[R]*64],[[s']*64],[[done]*64] ) (Line 11)\n", | |
| " experiences = utils.get_experiences(memory_buffer)\n", | |
| " \n", | |
| " # Set the y targets (using the target Q^-network), perform a Gradient Descent step -> 'Adam' optimizer,\n", | |
| " # and UPDATE the Q-network and target Q^-network weights W,B. \n", | |
| " # experiences = ( [[s]*64],[[a]*64],[[R]*64],[[s']*64],[[done]*64] )\n", | |
| " # GAMMA = 0.995 Discount Factor (Lines 12, 13, 14)\n", | |
| " agent_learn(experiences, GAMMA)\n", | |
| " \n", | |
| " # At the end of each inner loop / time step t iteration (t = 0, 1, 2, ..., 999), \n", | |
| " # OVERWRITE/UPDATE/SET the NEW next_state s' as our actual state s -> s = s',\n", | |
| " # using an independent copy of s’, to keep state s' UNMODIFIED. (Line 15)\n", | |
| " state = next_state.copy()\n", | |
| " \n", | |
| " # ADD to the previous accumulated SUM ‘total_points’ counter of ‘rewards’ at time step t, a NEW ‘reward’ -> \n", | |
| " # total_points = total_points (counter init as 0) + reward (line 15)\n", | |
| " total_points += reward\n", | |
| " \n", | |
| " # done = True -> if episode finished because it reached a terminal state\n", | |
| " # done = False -> if episode hasn't finished, i.e. it has NOT reached a terminal state yet.\n", | |
| " # next_state, reward, done, _ = env.step(action)\n", | |
| " # If a 'terminal state' HAS BEEN REACHED (done = 'True'), then we 'BREAK' out the inner loop (Line 15)\n", | |
| " if done:\n", | |
| " break\n", | |
| " \n", | |
| " # At the end of each i-th outer loop iteration (i = 0, 1, 2, ..., 1999),\n", | |
| " # append ‘total_points’ counter of rewards (updated value), at 'total_point_history' list [] \n", | |
| " # after the episode ends (at most 1000 time steps t, each iteration i). (Line 16)\n", | |
| " total_point_history.append(total_points)\n", | |
| " \n", | |
| " # At the end of outer loop, 'total_point_history[-100:]' selects the LAST num_p_av = 100 i-th \n", | |
| " # updated reward points, at 'total_point_history' list []. Then, get the 'mean' of these 100 points,\n", | |
| " # each of them accumulated over at most 1000 time steps t, during inner loop iterations. (Line 16)\n", | |
| " av_latest_points = np.mean(total_point_history[-num_p_av:])\n", | |
| " \n", | |
| " # UPDATE the ε epsilon value, gradually DECREASING the value of ε epsilon towards a MIN value E_MIN = 0.01 \n", | |
| " # using a given ε-decay rate E_DECAY = 0.995. \n", | |
| " # 'utils.get_new_eps(epsilon)' function, returns the 'MAX' value between 'E_MIN = 0.01' and a 'DECREASED epsilon' \n", | |
| " # return max(E_MIN, E_DECAY * epsilon) -> return max(0.01, 0.995 * epsilon). (Line 16)\n", | |
| " epsilon = utils.get_new_eps(epsilon)\n", | |
| " \n", | |
| " # After every iter/Episode (i+1 = 1, 2, 3, ...), print the AVG reward of the LAST [100] iters/episodes: [x.xx]\n", | |
| " # on a progress line that is OVERWRITTEN in place. The 'if' statement below only prints a permanent line\n", | |
| " # when (i+1) % 100 == 0 (e.g. for i+1=1, 1 % 100 = 1 != 0, so episode 1 is NOT printed there).\n", | |
| " # print(\"something\", end=\"\") end=\"\" keeps the next print on the same line. \n", | |
| " # print(f\"\\rEpisode...) \\r character returns the cursor to the beginning of the line, \n", | |
| " # so it OVERWRITES the line, updating information on the same line instead of printing a new one each time (Line 16)\n", | |
| " print(f\"\\rEpisode {i+1} | Total point AVG 'reward' of the last {num_p_av} iters/episodes: {av_latest_points:.2f}\", end=\"\")\n", | |
| " \n", | |
| " # Print every 100 iters/episodes i+1=[100,200,300, ... ,2000] (Line 16) \n", | |
| " if (i+1) % num_p_av == 0:\n", | |
| " \n", | |
| " # Every 100 iters/episodes i+1=[100,200,300, ... ,2000] print AVG reward of LAST [100] iters/episodes: [x.xx] \n", | |
| " # print(f\"\\rEpisode...) \\r character OVERWRITES the line, updating information on the same line (Line 16) \n", | |
| " print(f\"\\rEpisode {i+1} | Total point AVG 'reward' of the last {num_p_av} iters/episodes: {av_latest_points:.2f}\")\n", | |
| "\n", | |
| " # UNTIL we consider that the 'environment' has been SOLVED after some iters/episodes {i+1}, if we get AT LEAST an\n", | |
| " # AVG reward of ‘av_latest_points’>= 200 points, in the LAST 100 episodes/i-th iters, \n", | |
| " # then 'BREAK' out the outer loop (Line 16)\n", | |
| " if av_latest_points >= 200.0:\n", | |
| " \n", | |
| " # If ‘av_latest_points’ >= 200 then print \"Environment SOLVED! after i+1 episodes/iters\" (Line 16)\n", | |
| " print(f\"\\n\\nEnvironment solved in {i+1} episodes!\")\n", | |
| " \n", | |
| " # SAVE the Q-Network -> 'q_network' model as ‘.h5’ tensorflow file, saving architecture, \n", | |
| " # learned weights and learning info, becoming a portable and reusable NN model (Line 16)\n", | |
| " q_network.save('lunar_lander_model.h5')\n", | |
| " \n", | |
| " # And 'BREAK' the outer loop, so STOP computing AVG reward, because ‘av_latest_points’>= 200 points (Line 16)\n", | |
| " break\n", | |
| " \n", | |
| "# Check the execution time, measuring time taken for a code segment by recording 'start' and 'end' times, so \n", | |
| "# this cell of code has a total run or execution_time = (end_time - start_time) seconds \n", | |
| "tot_time = time.time() - start\n", | |
| "\n", | |
| "# Print total execution_time: x.xx seconds (x.xx mins)\n", | |
| "# 1 min -> 60 seconds then tot_time (seconds) / 60 = tot_time (mins)\n", | |
| "print(f\"\\nTotal Runtime: {tot_time:.2f} s ({(tot_time/60):.2f} min)\")" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "We can plot the `total_point_history` list [ ] along with its `moving average` over the episodes/iterations i **to see how our agent improved during TRAINING**. If you want to know about the `different plotting options available` in the `utils.plot_history()` function we encourage you to take a look at the `utils` module." | |
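To make the idea of a moving average over the point history concrete, here is a small pure-Python sketch of a trailing mean. It is an assumption about the general technique only: the window handling inside `utils.plot_history()` is not shown in this notebook, and `moving_average` is a hypothetical helper.

```python
def moving_average(points, window):
    """Trailing mean: entry i averages the last `window` points up to and including i."""
    averages = []
    running_sum = 0.0
    for i, p in enumerate(points):
        running_sum += p
        if i >= window:
            # Drop the point that just left the window.
            running_sum -= points[i - window]
        averages.append(running_sum / min(i + 1, window))
    return averages


smoothed = moving_average([1.0, 2.0, 3.0, 4.0, 5.0], window=3)
print(smoothed)
```

Plotting the raw episode totals together with such a smoothed series makes the training trend easier to read, since individual episode returns are noisy.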
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 16, | |
| "metadata": { | |
| "deletable": false, | |
| "id": "E_EUXxurfe8m", | |
| "scrolled": false | |
| }, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "image/png": "\n", | |
| "text/plain": [ | |
| "<Figure size 720x504 with 1 Axes>" | |
| ] | |
| }, | |
| "metadata": {}, | |
| "output_type": "display_data" | |
| } | |
| ], | |
| "source": [ | |
| "# Plot the total point history list of updated rewards [] VS episodes/iterations i []\n", | |
| "# UNTIL environment is Solved! when AVG reward of the last 100 episodes >= 200, (201.37) \n", | |
| "# after i+1 = 534 episodes/iterations.\n", | |
| "utils.plot_history(total_point_history)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "c_xwgaX5MnYt" | |
| }, | |
| "source": [ | |
| "<a name=\"10\"></a>\n", | |
| "## 10 - See the Trained Agent In Action\n", | |
| "\n", | |
| "Now that we have trained our agent, we can see it in action. We will use the `utils.create_video` function to create a video of our agent interacting with the environment using the trained $Q$-Network. The `utils.create_video` function uses the `imageio` library to create the video. This library produces some warnings that can be distracting, so, to suppress these warnings we run the code below." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 17, | |
| "metadata": { | |
| "deletable": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "# Suppress warnings usually produced by 'imageio' library\n", | |
| "import logging\n", | |
| "logging.getLogger().setLevel(logging.ERROR)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "In the cell below we `create a video of our agent interacting with the Lunar Lander environment 'env'` using the trained `q_network`. The video is saved to the `videos` folder with the given `filename`. We use the `utils.embed_mp4()` function to **embed the video in the Jupyter Notebook so that we can see it here directly without having to download it**.\n", | |
| "\n", | |
| "We should note that `since the lunar lander starts with a random initial force applied to its center of mass, every time you run the cell below you will see a different video`. If the **agent was trained properly**, it should be able to `land the lunar lander in the landing pad every time`, **regardless of the initial force** applied to its center of mass." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 18, | |
| "metadata": { | |
| "deletable": false, | |
| "id": "3Ttb_zLeJKiG" | |
| }, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "\n", | |
| " <video width=\"840\" height=\"480\" controls>\n", | |
| " <source src=\"data:video/mp4;base64,\" type=\"video/mp4\">\n", | |
| " Your browser does not support the video tag.\n", | |
| " </video>" | |
| ], | |
| "text/plain": [ | |
| "<IPython.core.display.HTML object>" | |
| ] | |
| }, | |
| "execution_count": 18, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "# Save the video at 'videos' folder at 'filename' path.\n", | |
| "filename = \"./videos/lunar_lander.mp4\"\n", | |
| "\n", | |
| "# Create a video of our agent interacting with the Lunar Lander environment 'env' using the trained q_network.\n", | |
| "utils.create_video(filename, env, q_network)\n", | |
| "\n", | |
| "# Use 'utils.embed_mp4()' function to embed the video in the Jupyter Notebook, \n", | |
| "# so that we can see it here directly without having to download it\n", | |
| "utils.embed_mp4(filename)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "<a name=\"11\"></a>\n", | |
| "## 11 - Congratulations!\n", | |
| "\n", | |
| "You have successfully used Deep Q-Learning with Experience Replay to train an agent to land a lunar lander safely on a landing pad on the surface of the moon. Congratulations!" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "<a name=\"12\"></a>\n", | |
| "## 12 - References\n", | |
| "\n", | |
| "If you would like to learn more about Deep Q-Learning, we recommend you check out the following papers.\n", | |
| "\n", | |
| "\n", | |
| "* Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).\n", | |
| "\n", | |
| "\n", | |
| "* Lillicrap, T. P., Hunt, J. J., Pritzel, A., et al. Continuous Control with Deep Reinforcement Learning. ICLR (2016).\n", | |
| "\n", | |
| "\n", | |
| "* Mnih, V., Kavukcuoglu, K., Silver, D. et al. Playing Atari with Deep Reinforcement Learning. arXiv e-prints. arXiv:1312.5602 (2013)." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "<details>\n", | |
| " <summary><font size=\"2\" color=\"darkgreen\"><b>Please click here if you want to experiment with any of the non-graded code.</b></font></summary>\n", | |
| " <p><i><b>Important Note: Please only do this when you've already passed the assignment to avoid problems with the autograder.</b></i>\n", | |
| " <ol>\n", | |
| " <li> On the notebook’s menu, click “View” > “Cell Toolbar” > “Edit Metadata”</li>\n", | |
| " <li> Hit the “Edit Metadata” button next to the code cell which you want to lock/unlock</li>\n", | |
| " <li> Set the attribute value for “editable” to:\n", | |
| " <ul>\n", | |
| " <li> “true” if you want to unlock it </li>\n", | |
| " <li> “false” if you want to lock it </li>\n", | |
| " </ul>\n", | |
| " </li>\n", | |
| " <li> On the notebook’s menu, click “View” > “Cell Toolbar” > “None” </li>\n", | |
| " </ol>\n", | |
| " <p> Here's a short demo of how to do the steps above: \n", | |
| " <br>\n", | |
| " <img src=\"https://lh3.google.com/u/0/d/14Xy_Mb17CZVgzVAgq7NCjMVBvSae3xO1\" align=\"center\" alt=\"unlock_cells.gif\">\n", | |
| "</details>" | |
| ] | |
| } | |
| ], | |
| "metadata": { | |
| "accelerator": "GPU", | |
| "colab": { | |
| "collapsed_sections": [], | |
| "name": "TensorFlow - Lunar Lander.ipynb", | |
| "provenance": [] | |
| }, | |
| "kernelspec": { | |
| "display_name": "Python 3", | |
| "language": "python", | |
| "name": "python3" | |
| }, | |
| "language_info": { | |
| "codemirror_mode": { | |
| "name": "ipython", | |
| "version": 3 | |
| }, | |
| "file_extension": ".py", | |
| "mimetype": "text/x-python", | |
| "name": "python", | |
| "nbconvert_exporter": "python", | |
| "pygments_lexer": "ipython3", | |
| "version": "3.7.6" | |
| } | |
| }, | |
| "nbformat": 4, | |
| "nbformat_minor": 1 | |
| } |