We present an algorithm based on the Optimism in the Face of Uncertainty (OFU) principle which is able to efficiently learn reinforcement learning (RL) problems modeled by a Markov decision process (MDP) with finite state-action space. By evaluating the state-pair difference of the optimal bias function $h^*$, the proposed algorithm achieves a regret bound of $\tilde{O}(\sqrt{SAHT})$ for an MDP with $S$ states and $A$ actions, in the case that an upper bound $H$ on the span of $h^*$, i.e., $\mathrm{sp}(h^*)$, is known. This result outperforms the best previous regret bound $\tilde{O}(HS\sqrt{AT})$ [Bartlett and Tewari, 2009] by a factor of $\sqrt{SH}$. Furthermore, this regret bound matches the lower bound of $\Omega(\sqrt{SAHT})$ [Jaksch et al., 2010] up to a logarithmic factor. As a consequence, we show that there is a near-optimal regret bound of $\tilde{O}(\sqrt{SADT})$ for MDPs with finite diameter $D$, compared to the lower bound of $\Omega(\sqrt{SADT})$ [Jaksch et al., 2010].
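The central quantity in these bounds is the span of the optimal bias function, $\mathrm{sp}(h^*) = \max_s h^*(s) - \min_s h^*(s)$. As a minimal sketch of what this quantity is (not the paper's algorithm), one can approximate $h^*$ for a small average-reward MDP with relative value iteration and read off its span; the toy MDP below is randomly generated, and all names and values are illustrative assumptions:

```python
import numpy as np

# Toy instance: a randomly generated communicating MDP with strictly
# positive transition probabilities, so relative value iteration converges.
S, A = 3, 2                                   # number of states / actions
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(A, S))    # P[a, s] = distribution over next states
r = rng.uniform(0.0, 1.0, size=(S, A))        # bounded rewards r(s, a) in [0, 1]

h = np.zeros(S)
converged = False
for _ in range(10_000):
    # One Bellman backup for the average-reward criterion:
    # (T h)(s) = max_a [ r(s, a) + sum_{s'} P(s' | s, a) h(s') ]
    q = r + np.einsum("ast,t->sa", P, h)      # shape (S, A)
    h_new = q.max(axis=1)
    # The backup drifts by the optimal gain; the per-state drift flattens
    # out exactly when h has converged up to an additive constant.
    drift = h_new - h
    if drift.max() - drift.min() < 1e-10:
        converged = True
        break
    h = h_new - h_new[0]                      # renormalize so h[0] = 0

span = h.max() - h.min()                      # approximation of sp(h*)
print(f"estimated sp(h*) = {span:.6f}")
```

Since the bias function is only defined up to an additive constant, its span is the natural scale-free measure of how much the long-run value differs across states, which is why it (rather than $h^*$ itself) appears in the regret bounds above.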