Created October 16, 2022 06:30
Define functions to use in training and evaluation
import numpy as np

# Note: env (a Gym-style environment), alpha (learning rate) and gamma (discount factor)
# are assumed to be defined elsewhere in the notebook.

# This is our acting policy (epsilon-greedy), which selects an action for exploration/exploitation during training
def epsilon_greedy(Qtable, state, epsilon):
    # Generate a random number and compare to epsilon; if lower then explore, otherwise exploit
    randnum = np.random.uniform(0, 1)
    if randnum < epsilon:
        action = env.action_space.sample()  # explore
    else:
        action = np.argmax(Qtable[state, :])  # exploit
    return action

# This function updates the Qtable using the SARSA rule.
# It is on-policy: next_action is chosen by the same epsilon-greedy policy.
def update_Q(Qtable, state, action, reward, next_state, next_action):
    # Q(S_t, A_t) = Q(S_t, A_t) + alpha * [R_{t+1} + gamma * Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]
    Qtable[state][action] = Qtable[state][action] + alpha * (reward + gamma * Qtable[next_state][next_action] - Qtable[state][action])
    return Qtable

# This function (greedy) returns the action from the Qtable when we do evaluation
def eval_greedy(Qtable, state):
    action = np.argmax(Qtable[state, :])
    return action