no code implementations • 22 Apr 2024 • Subhojyoti Mukherjee, Anusha Lalitha, Kousha Kalantari, Aniket Deshmukh, Ge Liu, Yifei Ma, Branislav Kveton
Learning of preference models from human feedback has been central to recent advances in artificial intelligence.
no code implementations • 12 Apr 2024 • Subhojyoti Mukherjee, Ge Liu, Aniket Deshmukh, Anusha Lalitha, Yifei Ma, Branislav Kveton
We design the LLM prompt by adaptively choosing few-shot examples for a given inference query.
no code implementations • 23 Oct 2023 • Subhojyoti Mukherjee, Ruihao Zhu, Branislav Kveton
We propose CODE, a bandit algorithm based on a Constrained Optimal DEsign, that is interpretable and maximally reduces the uncertainty.
no code implementations • 29 Jan 2023 • Subhojyoti Mukherjee, Qiaomin Xie, Josiah Hanna, Robert Nowak
In this paper, we study the problem of optimal data collection for policy evaluation in linear bandits.
no code implementations • 27 May 2022 • Subhojyoti Mukherjee
We provide regret bounds for our algorithms and show that the bounds are comparable to their counterparts from the safe bandit and piecewise i.i.d. settings.
no code implementations • 9 Mar 2022 • Subhojyoti Mukherjee, Josiah P. Hanna, Robert Nowak
This paper studies the problem of data collection for policy evaluation in Markov decision processes (MDPs).
no code implementations • 2 Nov 2021 • Blake Mason, Romain Camilleri, Subhojyoti Mukherjee, Kevin Jamieson, Robert Nowak, Lalit Jain
The threshold value $\alpha$ can either be \emph{explicit} and provided a priori, or \emph{implicit} and defined relative to the optimal function value, i.e. $\alpha = (1-\epsilon)f(x_\ast)$ for a given $\epsilon > 0$, where $f(x_\ast)$ is the maximal function value and is unknown.
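As a concrete reading of the implicit case, the threshold is just a fixed fraction of the (unknown) optimum; a one-line sketch:

```python
def implicit_threshold(f_max, eps):
    """Implicit threshold alpha = (1 - eps) * f(x*) from the snippet above.

    f_max stands in for the unknown maximal value f(x*); in the actual
    problem this quantity is not observed and must be handled by the
    algorithm."""
    return (1 - eps) * f_max
```

For example, with $f(x_\ast)=10$ and $\epsilon=0.05$, the target level set is every $x$ with $f(x) \ge 9.5$.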
no code implementations • 15 Dec 2020 • Subhojyoti Mukherjee, Ardhendu Tripathy, Robert Nowak
Active learning can reduce the number of samples needed to perform a hypothesis test and to estimate the parameters of a model.
no code implementations • 30 May 2019 • Subhojyoti Mukherjee, Odalric-Ambrym Maillard
The second strategy, ImpCPD, makes use of the knowledge of $T$ to achieve the order-optimal regret bound of $\min\big\lbrace O(\sum\limits_{i=1}^{K} \sum\limits_{g=1}^{G}\frac{\log(T/H_{1, g})}{\Delta^{opt}_{i, g}}), O(\sqrt{GT})\big\rbrace$ (where $H_{1, g}$ is the problem complexity), thereby closing an important gap with respect to the lower bound in a specific challenging setting.
no code implementations • 18 Oct 2018 • Samarth Gupta, Shreyas Chaudhari, Subhojyoti Mukherjee, Gauri Joshi, Osman Yağan
We consider a finite-armed structured bandit problem in which mean rewards of different arms are known functions of a common hidden parameter $\theta^*$.
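To make the setting concrete, here is a minimal sketch (not the paper's algorithm) of exploiting the shared structure: estimate $\theta$ by least squares over a candidate grid, then pull the arm that looks best under the estimate. The function names and the grid of candidate parameters are illustrative assumptions.

```python
import random

def structured_bandit(mean_fns, theta_star, thetas, horizon, seed=0):
    """Greedy sketch for a structured bandit: each arm's mean is a known
    function mean_fns[a] of a shared hidden parameter theta.  We estimate
    theta from observed rewards and then play the arm that is best under
    the estimate.  Illustrative only -- not the paper's exact method."""
    rng = random.Random(seed)
    k = len(mean_fns)
    history = []  # (arm, reward) pairs
    for t in range(horizon):
        if t < k:
            arm = t  # initialise with one pull per arm
        else:
            # candidate theta minimising squared prediction error on history
            theta_hat = min(
                thetas,
                key=lambda th: sum((mean_fns[a](th) - r) ** 2
                                   for a, r in history),
            )
            arm = max(range(k), key=lambda a: mean_fns[a](theta_hat))
        reward = mean_fns[arm](theta_star) + rng.gauss(0, 0.1)
        history.append((arm, reward))
    return history
```

The point of the structure is that pulling *any* arm is informative about $\theta^*$, and hence about every other arm, which is what allows regret to be much smaller than in an unstructured bandit.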
no code implementations • 9 Nov 2017 • Subhojyoti Mukherjee, K. P. Naveen, Nandan Sudarsanam, Balaraman Ravindran
We propose a novel variant of the UCB algorithm (referred to as Efficient-UCB-Variance (EUCBV)) for minimizing cumulative regret in the stochastic multi-armed bandit (MAB) setting.
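The key ingredient in variance-aware UCB variants is a confidence bonus that shrinks with the empirical variance of an arm. The sketch below uses UCB-V-style indices as an illustration; the actual EUCBV algorithm additionally employs arm elimination and different constants, so treat this as a generic variance-aware baseline, not the paper's method.

```python
import math
import random

def ucb_v_bandit(means, horizon, seed=0):
    """Illustrative UCB-with-variance bandit loop (UCB-V-style bonus:
    sqrt(2 * var * log t / n) + 3 log t / n).  Returns per-arm pull counts
    and the cumulative pseudo-regret."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    sq_sums = [0.0] * k
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialise
        else:
            def index(i):
                mu = sums[i] / counts[i]
                var = max(sq_sums[i] / counts[i] - mu * mu, 0.0)
                bonus = (math.sqrt(2 * var * math.log(t) / counts[i])
                         + 3 * math.log(t) / counts[i])
                return mu + bonus
            arm = max(range(k), key=index)
        reward = means[arm] + rng.gauss(0, 0.1)
        counts[arm] += 1
        sums[arm] += reward
        sq_sums[arm] += reward * reward
        regret += best - means[arm]
    return counts, regret
```

Because the bonus scales with the estimated variance rather than with a worst-case range, low-variance suboptimal arms are discarded after far fewer pulls.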
no code implementations • 7 Apr 2017 • Subhojyoti Mukherjee, K. P. Naveen, Nandan Sudarsanam, Balaraman Ravindran
In this paper we propose the Augmented-UCB (AugUCB) algorithm for a fixed-budget version of the thresholding bandit problem (TBP), where the objective is to identify a set of arms whose quality is above a threshold.
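For contrast with adaptive methods such as AugUCB, the fixed-budget TBP has a trivial uniform-allocation baseline: spend the budget evenly and report the arms whose empirical mean clears the threshold. This sketch is that baseline, not AugUCB itself (which allocates adaptively using variance estimates and arm elimination).

```python
import random

def threshold_bandit_uniform(means, tau, budget, seed=0):
    """Uniform-allocation baseline for the fixed-budget thresholding
    bandit: pull arms round-robin, then return the set of arm indices
    whose empirical mean is at least the threshold tau."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(budget):
        arm = t % k  # round-robin allocation of the budget
        sums[arm] += means[arm] + rng.gauss(0, 0.1)
        counts[arm] += 1
    return {i for i in range(k) if sums[i] / counts[i] >= tau}
```

Adaptive algorithms improve on this by concentrating pulls on the arms whose means lie close to $\tau$, since those are the only ones that are hard to classify.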