Wireless systems perform rate adaptation to transmit at the highest possible instantaneous rate. Rate adaptation has become increasingly granular over successive generations of wireless systems. The base station uses the signal-to-interference-plus-noise ratio (SINR) and packet decoding feedback, in the form of acknowledgement/negative acknowledgement (ACK/NACK), to perform rate adaptation. The SINR is used to anchor the rate, a step called Inner Loop Link Adaptation (ILLA), while ACK/NACK feedback is used for fine offset adjustments, called Outer Loop Link Adaptation (OLLA). We cast OLLA as a reinforcement learning problem in the class of Multi-Armed Bandits (MAB), where the different offset values are the arms of the bandit. In OLLA, the probability of packet error increases as the offset value increases, and every user equipment (UE) has a target Block Error Rate (BLER) that must be met to satisfy its Quality of Service (QoS) requirements. For this MAB, we propose a binary-search-based algorithm that achieves a Probably Approximately Correct (PAC) solution using bounds from large deviation theory and confidence bounds. We also discuss why Thompson sampling and Upper Confidence Bound (UCB) based methods do not help meet the target objectives. Finally, we provide simulation results from an LTE system simulator that demonstrate the efficacy of the proposed algorithm.
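The abstract does not spell out the algorithm, so the following is only a minimal sketch of how a PAC binary search over OLLA offsets could look. It is not the authors' method and rests on several assumptions. The BLER is assumed to be non-decreasing in the offset index. The goal is assumed to be the largest (most aggressive) offset whose BLER stays within the target. ACK/NACK outcomes at a fixed offset are treated as i.i.d. Bernoulli samples. Hoeffding's inequality stands in for the paper's large-deviation bounds. The names `binary_search_offset`, `samples_per_test`, `epsilon`, and `delta` are hypothetical.

```python
import math
import random
from typing import Callable


def samples_per_test(epsilon: float, delta_step: float) -> int:
    """Samples needed so the two-sided Hoeffding radius is at most epsilon / 2."""
    return math.ceil(2.0 * math.log(2.0 / delta_step) / epsilon ** 2)


def binary_search_offset(
    sample_nack: Callable[[int], int],
    num_arms: int,
    target_bler: float,
    epsilon: float,
    delta: float,
) -> int:
    """Return the largest offset index judged to meet target_bler, or -1.

    Hypothetical sketch. Assumes BLER is non-decreasing in the offset index and
    sample_nack(k) returns 1 (NACK) or 0 (ACK) i.i.d. for a fixed k. With
    probability at least 1 - delta, every probed arm is classified correctly up
    to the epsilon slack. The returned arm k (if k >= 0) then has
    BLER(k) <= target_bler + epsilon and is no smaller than the largest arm
    whose BLER is <= target_bler.
    """
    # Binary search over num_arms arms probes at most num_arms.bit_length() arms;
    # split the failure budget delta evenly across those probes (union bound).
    max_tests = num_arms.bit_length()
    n = samples_per_test(epsilon, delta / max_tests)

    lo, hi, best = 0, num_arms - 1, -1
    while lo <= hi:
        mid = (lo + hi) // 2
        # Empirical BLER of the probed offset from n fresh ACK/NACK samples.
        bler_hat = sum(sample_nack(mid) for _ in range(n)) / n
        if bler_hat <= target_bler + epsilon / 2:
            best = mid      # judged feasible: try larger (more aggressive) offsets
            lo = mid + 1
        else:
            hi = mid - 1    # judged infeasible: back off to smaller offsets
    return best


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical per-offset BLERs, non-decreasing in the offset index.
    blers = [0.01, 0.02, 0.04, 0.07, 0.09, 0.15, 0.25, 0.40]
    arm = binary_search_offset(
        lambda k: 1 if rng.random() < blers[k] else 0,
        num_arms=len(blers), target_bler=0.10, epsilon=0.02, delta=0.05,
    )
    print(arm)
```

In this sketch, each probe uses enough samples that the empirical BLER lies within `epsilon / 2` of the true value. Comparing against `target_bler + epsilon / 2` therefore never rejects a truly feasible offset and never accepts one whose BLER exceeds `target_bler + epsilon`. Monotonicity then lets the search discard half of the remaining offsets after each probe. A real OLLA loop would collect these samples from live ACK/NACK feedback rather than a simulator callback.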