Chess & Shogi with General Reinforcement Learning Algorithm

Chess is a strategy game played between two players on a board and is very popular worldwide. The board has 64 squares arranged in an 8×8 grid, and each player commands a king and his whole army, fighting to win space and ultimately trap the opposing king.

At the end, only one king is left standing, or the match is drawn. You can see all the rules here. Chess is derived from the ancient Indian game Chaturanga.

FIDE organised the first official Chess Olympiad in 1927, with 16 participating countries; by the 29th Chess Olympiad in 1990 there were 127 member countries. The Olympiads were held at irregular intervals until 1950 and have been held regularly every two years since. The development of various machine learning models raised the question of whether a model could learn to think of the next move needed to win the game, and this led researchers to develop programs that are not hard-coded but are actual machine learning models.

Shogi, also known as Japanese Chess or the Game of Generals, is a Japanese strategy board game for two players. The players face each other across a board of rectangles arranged in a grid of 9 ranks (rows) by 9 files (columns), giving an 81-square board. Chaturanga is believed to have been transmitted from India to Japan via China, so the theme remained the same, with some differences in the rules and pieces.

In earlier days, chess and shogi programs were hard-coded and ran on various tree-based search algorithms, but as the computational power of processors grew, machine learning algorithms and models developed at a much faster rate. With these new algorithms, researchers built a variety of models for several games.

In 2015, Google showcased a new AI named AlphaGo, which defeated a professional Go champion 5-0 in its first match. AlphaGo initially learned from games played by professionals. AlphaGo then competed against the legendary Go player Lee Sedol, widely considered the greatest player of the past decade. AlphaGo's 4-1 victory in Seoul, South Korea, in March 2016 was watched by over 200 million people worldwide, and it even invented some moves of its own.

In January 2017, AlphaGo won 60 games in a time-controlled online setting against top international players. After the subsequent Go summit, AlphaGo Zero was revealed. AlphaGo Zero trained by playing against itself, starting from scratch and knowing only the basic rules of the game. In doing so, the program accumulated thousands of years of human knowledge within a period of just a few days and learned to play Go well enough to beat the strongest player in the world, AlphaGo itself.

In late 2017, in the same line of work, another AI named AlphaZero was revealed. AlphaZero learned to play chess and shogi from scratch by playing against itself (using reinforcement learning); it surpassed human-level play and was able to defeat the strongest existing programs in both chess and shogi.

  • Dataset: The first step is to find a large dataset to train and test the model, so we refer to the FICS datasets in PGN format, which contain about 353,712,300 recorded games. With such a big dataset you can train and test your model and get hands-on experience with it.
  • Ways to create a chess or shogi engine
  • How a chess engine works:
    • Brute force: We simply calculate all possible moves after the player has moved a piece. This method takes time to evaluate the moves and uses a lot of computational processing.
  • Hard-coded evaluation: The engine takes the board as input and outputs the move it thinks is the best. Simplified in mathematical form, this is:

f(board) = move

It then evaluates all the pieces on the board and calculates your net advantage minus the opponent's net advantage; this sum is computed by assigning weights to the different pieces (a small code sketch follows the formula below).

f(board) = sum of your advantages − sum of opponent's advantages
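
For illustration, a minimal material-count version of this evaluation could look like the sketch below. It is only an assumption for demonstration: the piece weights are the conventional 1/3/3/5/9 values and the evaluate function is our own, not something taken from an existing engine.

import chess

# Conventional material weights (an illustrative assumption, not a fixed standard).
PIECE_WEIGHTS = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                 chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def evaluate(board, color):
    """f(board) = sum of `color`'s piece values minus the sum of the opponent's."""
    score = 0
    for piece in board.piece_map().values():
        value = PIECE_WEIGHTS[piece.piece_type]
        score += value if piece.color == color else -value
    return score

print(evaluate(chess.Board(), chess.WHITE))  # 0 on the starting position: material is equal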

After this, we can use a decision tree to evaluate future moves: we run a depth-first search (DFS) over candidate move sequences until the outcome at a leaf node is decided; this is also called the min-max strategy. Based on the values returned by the leaf nodes, every non-leaf node takes the path that maximises its chance of winning, as in the sketch below.
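
Building on the evaluate sketch above, a bare-bones depth-first minimax could look like this. It is a hypothetical, depth-limited sketch without alpha-beta pruning, not the exact search used by any particular engine.

def minimax(board, depth, color, maximizing):
    """Depth-first search of the game tree: maximizing nodes take the best child,
    minimizing nodes (the opponent's turns) take the worst."""
    if depth == 0 or board.is_game_over():
        return evaluate(board, color)
    best = float('-inf') if maximizing else float('inf')
    for move in board.legal_moves:
        board.push(move)
        score = minimax(board, depth - 1, color, not maximizing)
        board.pop()
        best = max(best, score) if maximizing else min(best, score)
    return best

def best_move(board, depth=3):
    """Return the legal move with the highest minimax value for the side to move."""
    color = board.turn
    scores = {}
    for move in board.legal_moves:
        board.push(move)
        scores[move] = minimax(board, depth - 1, color, maximizing=False)
        board.pop()
    return max(scores, key=scores.get)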

  • Machine learning by the traditional (supervised) method

Moves are read and converted into a move class. The data from FICS, which is in PGN format, is used; since all the moves come from professionals, we consider them optimal. The position for each move played by the stronger player is recorded.
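
A minimal sketch of that extraction step using the python-chess PGN reader might look as follows; the file name games.pgn is just a placeholder for a downloaded FICS database.

import chess.pgn

def extract_positions(pgn_path, max_games=100):
    """Yield (position FEN, move in UCI) pairs from a PGN file of professional games."""
    with open(pgn_path) as pgn_file:
        for _ in range(max_games):
            game = chess.pgn.read_game(pgn_file)
            if game is None:  # end of file reached
                break
            board = game.board()
            for move in game.mainline_moves():
                yield board.fen(), move.uci()  # the position before the move, and the move played
                board.push(move)

# Example usage: build a small supervised (position -> move) dataset.
dataset = list(extract_positions("games.pgn"))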

  • Machine learning using reinforcement learning

In this methodology we simply build an agent that plays the game: we let the agent know the rules and start training it, but the training here is not done with any kind of dataset; rather, the agent is trained by playing against itself or against a human player.

Here the model takes longer to train because it starts from scratch. The agent is penalised for its moves if it loses in the end. If we keep training only like this, though, the agent would at best reach human level: it can get trapped repeating its current best-known moves and never try new moves that might be better options. To solve this we add an extra randomness parameter: at the start of training the agent should try different moves, so the weight associated with exploration must be set to a higher value.

Even after some training, when the most probable move dominates, there should always be some randomness left in the agent's moves. We now see a vast improvement in the agent's gameplay. So, on a very basic level, we followed the steps below (a rough sketch of this loop appears right after the list):

  • Self-play
  • Retraining the Neural Network
  • Evaluating the Neural Network
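
A rough sketch of how those three steps fit together is shown below. The helper routines self_play_game, retrain and evaluate_vs_best are placeholders standing in for whatever self-play, training and evaluation code the engine actually uses, and the 55% promotion threshold is an illustrative choice borrowed from the AlphaGo Zero setup; the decaying epsilon mirrors the extra randomness discussed above (and the schedule used in play_game further down).

def training_loop(self_play_game, retrain, evaluate_vs_best, network, best_network, iterations=1000):
    """Outer loop: self-play -> retrain the network -> evaluate it against the current best."""
    memory = []
    for k in range(iterations):
        # Exploration rate starts high so early games try many different moves, then decays.
        eps = max(0.05, 1 / (1 + k / 250))

        # 1. Self-play: generate training positions with the current network plus epsilon-randomness.
        memory.extend(self_play_game(network, eps))

        # 2. Retraining: fit the neural network on the accumulated self-play positions.
        retrain(network, memory)

        # 3. Evaluation: promote the new network only if it clearly beats the current best.
        if evaluate_vs_best(network, best_network) > 0.55:
            best_network = network
    return best_network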

The modern Residual_CNN class was used to create the neural network architecture, which defines how to build an instance of the network. It uses a condensed version of the architecture in the AlphaGo Zero paper, i.e. a convolutional layer, followed by many residual layers, then splitting into a value head and a policy head.
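
A condensed Keras sketch of that layout is shown below (separate from the RLC implementation that follows); the filter counts, the number of residual blocks and the 8×8×12 input encoding are illustrative assumptions, not the exact values from the AlphaGo Zero paper.

from tensorflow.keras import layers, Model

def residual_block(x, filters=75):
    """Two convolutions plus a skip connection, as in a standard residual layer."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.Add()([shortcut, y])
    return layers.Activation('relu')(y)

def build_residual_cnn(input_shape=(8, 8, 12), n_moves=4096, n_res_blocks=2):
    """Convolutional layer -> residual tower -> separate value and policy heads."""
    board_input = layers.Input(shape=input_shape)
    x = layers.Conv2D(75, 3, padding='same', activation='relu')(board_input)
    for _ in range(n_res_blocks):
        x = residual_block(x)

    # Value head: a single scalar in [-1, 1] estimating how good the position is.
    v = layers.Conv2D(1, 1, activation='relu')(x)
    v = layers.Flatten()(v)
    value = layers.Dense(1, activation='tanh', name='value_head')(v)

    # Policy head: one score per candidate move (here 64 x 64 = 4096 from-to pairs).
    p = layers.Conv2D(2, 1, activation='relu')(x)
    p = layers.Flatten()(p)
    policy = layers.Dense(n_moves, name='policy_head')(p)

    return Model(inputs=board_input, outputs=[value, policy])

model = build_residual_cnn()
model.summary()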

Implementation:
A neural network for chess implemented in Python using the python-chess library (via the RLC package); the same approach can be applied to shogi.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import inspect
import chess
from chess.pgn import Game
import RLC
from RLC.capture_chess.environment import Board
from RLC.capture_chess.learn import Q_learning
from RLC.capture_chess.agent import Agent

board = Board()
board.board  # display the underlying python-chess board

board.layer_board[0, ::-1, :].astype(int)  # one plane of the layered numeric board representation

board = Board()
agent = Agent(network='conv', gamma=0.1, lr=0.07)
R = Q_learning(agent, board)
R.agent.fix_model()      # keep a frozen copy of the network, used to compute stable Q-targets
R.agent.model.summary()

def network_update(self, minibatch):
    """
    Update the Q-network using samples from the minibatch
    Args:
        minibatch: list
            The minibatch contains the states, moves, rewards and new states.
    Returns:
        td_errors: np.array
            array of temporal difference errors
    """

    # Prepare separate lists
    states, moves, rewards, new_states = [], [], [], []
    td_errors = []
    episode_ends = []
    for sample in minibatch:
        states.append(sample[0])
        moves.append(sample[1])
        rewards.append(sample[2])
        new_states.append(sample[3])

        # Episode end detection
        if np.array_equal(sample[3], sample[3] * 0):
            episode_ends.append(0)
        else:
            episode_ends.append(1)

    # The Q target
    q_target = np.array(rewards) + np.array(episode_ends) * self.gamma * np.max(
        self.fixed_model.predict(np.stack(new_states, axis=0)), axis=1)

    # The Q value for the remaining actions
    q_state = self.model.predict(np.stack(states, axis=0))  # batch x 64 x 64

    # Combine the Q target with the other Q values.
    q_state = np.reshape(q_state, (len(minibatch), 64, 64))
    for idx, move in enumerate(moves):
        td_errors.append(q_state[idx, move[0], move[1]] - q_target[idx])
        q_state[idx, move[0], move[1]] = q_target[idx]
    q_state = np.reshape(q_state, (len(minibatch), 4096))

    # Perform a step of minibatch Gradient Descent.
    self.model.fit(x=np.stack(states, axis=0), y=q_state, epochs=1, verbose=0)

    return td_errors

print(inspect.getsource(R.play_game))

def play_game(self, k, greedy=False, maxiter=25):
    """
    Play a game of capture chess
    Args:
        k: int
            game count, determines epsilon (exploration rate)
        greedy: Boolean
            if greedy, no exploration is done
        maxiter: int
            Maximum amount of steps per game

    Returns:
        Board: the environment's board after the game
    """
    episode_end = False
    turncount = 0

    # Here we determine the exploration rate. k is divided by 250 to slow down the exploration rate decay.
    eps = max(0.05, 1 / (1 + (k / 250))) if not greedy else 0.

    # Play a game of chess
    while not episode_end:
        state = self.env.layer_board
        explore = np.random.uniform(0, 1) < eps  # determine whether to explore
        if explore:
            move = self.env.get_random_action()
            move_from = move.from_square
            move_to = move.to_square
        else:
            action_values = self.agent.get_action_values(np.expand_dims(state, axis=0))
            action_values = np.reshape(np.squeeze(action_values), (64, 64))
            action_space = self.env.project_legal_moves()  # The environment determines which moves are legal
            action_values = np.multiply(action_values, action_space)
            move_from = np.argmax(action_values, axis=None) // 64
            move_to = np.argmax(action_values, axis=None) % 64
            moves = [x for x in self.env.board.generate_legal_moves() if \
                     x.from_square == move_from and x.to_square == move_to]
            if len(moves) == 0:  # If all legal moves have negative action value, explore.
                move = self.env.get_random_action()
                move_from = move.from_square
                move_to = move.to_square
            else:
                move = np.random.choice(moves)  # If there are multiple max-moves, pick a random one.

        episode_end, reward = self.env.step(move)
        new_state = self.env.layer_board
        if len(self.memory) > self.memsize:
            self.memory.pop(0)
            self.sampling_probs.pop(0)
        turncount += 1
        if turncount > maxiter:
            episode_end = True
            reward = 0
        if episode_end:
            new_state = new_state * 0
        self.memory.append([state, (move_from, move_to), reward, new_state])
        self.sampling_probs.append(1)

        self.reward_trace.append(reward)
        self.update_agent(turncount)

    return self.env.board

pgn = R.learn(iters=750)

reward_smooth = pd.DataFrame(R.reward_trace)
reward_smooth.rolling(window=125, min_periods=0).mean().plot(figsize=(16, 9), title='average performance over the last 125 steps')

with open("final_game.pgn", "w") as log:
    log.write(str(pgn))

board.reset()
bl = board.layer_board
bl[6, :, :] = 1 / 10  # Assume we are in move 10
av = R.agent.get_action_values(np.expand_dims(bl, axis=0))

av = av.reshape((64,64))

p = board.board.piece_at(20)  # .symbol()
white_pieces = ['P', 'N', 'B', 'R', 'Q', 'K']
black_piece = ['_', 'p', 'n', 'b', 'r', 'q', 'k']

df = pd.DataFrame(np.zeros((6, 7)))
df.index = white_pieces
df.columns = black_piece

for from_square in range(16):
    for to_square in range(30, 64):
        from_piece = board.board.piece_at(from_square).symbol()
        to_piece = board.board.piece_at(to_square)
        if to_piece:
            to_piece = to_piece.symbol()
        else:
            to_piece = '_'
        df.loc[from_piece, to_piece] = av[from_square, to_square]

df[['_', 'p', 'n', 'b', 'r', 'q']]

Conclusion

Machine learning is changing the world, and from the walkthrough above we can see that building such a model has become quite approachable. Machine learning is a fast-growing technology. After this series of serious demonstrations of capability, the world was filled with questions such as “Is this AI safe for the future?”, “How far can AI go?”, “Will AI lead to job losses?”, “How intelligent will the next AI be?” and “What does AI think of life?” Some of these questions remain unanswered even today, and the true potential of AI is still unknown.

By Gaurav Verma