Question

Reinforcement Learning

What are some advantages and disadvantages of A3C over DQN? What are some potential issues that can be caused by asynchronous updates in A3C?

Answer #1

Asynchronous Advantage Actor-Critic

But what comes after? The same company that is responsible for DQN, DeepMind, more recently introduced the A3C architecture. It is supposed to be faster, simpler, and more robust than DQN, and to achieve better results. But how?

You can figure out the biggest difference by looking at the name of this architecture: Asynchronous Advantage Actor-Critic. In DQN, a single agent (a so-called worker) interacts with a single environment and generates training data. A3C instead launches several workers asynchronously (as many as your CPU can handle) and lets each of them interact with its own instance of the environment. Each worker also trains its own copy of the network and periodically shares its updates with a shared global network.
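To make the asynchronous part concrete, here is a minimal Python sketch (my own illustration, not DeepMind's code): several threads stand in for workers, each with its own random "rollout", pushing gradient updates into a shared parameter vector. The GlobalNet/worker names and the toy squared-error update are assumptions; a real A3C worker computes actor-critic gradients from its own copy of the game environment.

```python
# Sketch of the asynchronous-workers idea behind A3C (not a full actor-critic).
import threading
import numpy as np

class GlobalNet:
    """Shared parameters that every worker reads from and writes to."""
    def __init__(self, n_params):
        self.theta = np.zeros(n_params)
        # The real A3C applies updates lock-free (Hogwild-style), which is why
        # workers can end up pushing gradients computed from stale parameters.
        self.lock = threading.Lock()

    def apply_gradients(self, grad, lr=0.01):
        with self.lock:
            self.theta -= lr * grad

def worker(global_net, worker_id, n_steps=200):
    rng = np.random.default_rng(worker_id)      # each worker has its own "environment"
    for _ in range(n_steps):
        local_theta = global_net.theta.copy()   # pull the latest global parameters
        # Toy rollout: pretend we observed features x and a return target y.
        x = rng.normal(size=local_theta.shape)
        y = rng.normal()
        # Gradient of the squared error between predicted value and return.
        grad = 2.0 * (local_theta @ x - y) * x
        global_net.apply_gradients(grad)        # push the update asynchronously

if __name__ == "__main__":
    net = GlobalNet(n_params=4)
    threads = [threading.Thread(target=worker, args=(net, i)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("final shared parameters:", net.theta)
```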

The Advantage

So why exactly is this better than a traditional DQN? There are multiple reasons. First, by launching more workers asynchronously, you collect much more training data in parallel, which makes data collection faster.

Since every worker also has its own environment instance, you get more diverse, less correlated data, which is known to make the network more robust and to produce better results!

Deep Q-Network

DQN was introduced in two papers: Playing Atari with Deep Reinforcement Learning at NIPS in 2013 and Human-level control through deep reinforcement learning in Nature in 2015. Interestingly, there were only a few papers about DQN between 2013 and 2015. I guess the reason was that people could not reproduce the DQN implementation without the details in the Nature version.

[Image: DQN agent playing Breakout]

DQN overcomes unstable learning mainly through four techniques:

  • Experience Replay
  • Target Network
  • Clipping Rewards
  • Skipping Frames

I will explain each technique one by one.

Experience Replay

Experience Replay was originally proposed in Reinforcement Learning for Robots Using Neural Networks in 1993. A DNN easily overfits to the most recent episodes, and once it is overfitted it is hard to generate diverse experiences. To solve this problem, Experience Replay stores experiences, including state transitions, rewards, and actions, which are the data needed to perform Q-learning, and samples mini-batches from them to update the neural network (a minimal buffer sketch follows the list below). This technique is expected to bring the following benefits:

  • reduces correlation between experiences when updating the DNN
  • increases learning speed with mini-batches
  • reuses past transitions to avoid catastrophic forgetting
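Here is a minimal replay-buffer sketch, assuming transitions of the form (state, action, reward, next_state, done); the class and method names are illustrative, not the original DQN code.

```python
# A simple uniform replay buffer: store transitions, sample random mini-batches.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are discarded first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive steps of the same episode.
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```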

Target Network

In the TD-error calculation, the target values change every time the DNN is updated, and an unstable target makes training difficult. The Target Network technique therefore freezes the parameters of a separate target network and only replaces them with the latest network every few thousand steps.
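The following sketch shows the update schedule, assuming a framework-agnostic network object with get/set_weights; the stub class, the SYNC_EVERY constant, and the fake gradient step are my own illustrative assumptions.

```python
# Keep a frozen copy of the online network and sync it only occasionally.
import copy

class QNetworkStub:
    """Stand-in for a deep Q-network; just holds a parameter dict."""
    def __init__(self):
        self.weights = {"w": 0.0}
    def get_weights(self):
        return copy.deepcopy(self.weights)
    def set_weights(self, weights):
        self.weights = copy.deepcopy(weights)

online_net = QNetworkStub()   # updated at every training step
target_net = QNetworkStub()   # frozen copy used to compute the TD targets
SYNC_EVERY = 10_000           # copy the online weights only every few thousand steps

for step in range(1, 50_001):
    online_net.weights["w"] += 0.001              # pretend gradient update
    if step % SYNC_EVERY == 0:
        # Only here do the target parameters change, keeping the TD targets stable.
        target_net.set_weights(online_net.get_weights())
```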

[Image: the target Q function, shown in the red rectangle, is held fixed]

Clipping Rewards

Each game has a different score scale. For example, in Pong a player gets +1 point for winning a rally and -1 for losing it, whereas in Space Invaders a player gets 10-30 points for defeating an invader. This difference in scale can make training unstable, so the Clipping Rewards technique clips the scores: all positive rewards are set to +1 and all negative rewards to -1.
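As a quick sketch, the clipping rule amounts to keeping only the sign of the raw game score (the function name here is illustrative):

```python
def clip_reward(raw_reward):
    """Keep only the sign of the raw game score, as in DQN's reward clipping."""
    if raw_reward > 0:
        return 1.0
    if raw_reward < 0:
        return -1.0
    return 0.0
```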

Skipping Frames

The ALE can render 60 frames per second, but people do not actually take that many actions per second, and the agent does not need to compute Q-values on every frame. With the Skipping Frames technique, DQN selects an action only every 4 frames (repeating it in between) and uses the past 4 frames as input. This reduces computational cost and lets the agent gather more experience.
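A minimal frame-skip wrapper might look like the sketch below; it assumes an older Gym-style environment whose step(action) returns a 4-tuple, and the wrapper name is my own.

```python
class FrameSkip:
    """Repeat the chosen action for `skip` frames and sum the rewards."""
    def __init__(self, env, skip=4):
        self.env = env
        self.skip = skip

    def step(self, action):
        total_reward = 0.0
        done = False
        for _ in range(self.skip):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```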

Performance

All of the above techniques enable DQN to achieve stable training.

[Image: DQN far outperforms a naive DQN without these techniques]

The Nature version shows how much Experience Replay and the Target Network contribute to stability.

[Image: performance with and without Experience Replay and the Target Network]

Experience Replay is very important in DQN, and the Target Network further improves performance.
