Where do the rewards for robotic reinforcement learning come from? In this blog post, we study how crowdsourced language annotations and videos of humans can be used to learn reward functions scalably and help them generalize more broadly.
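To make the idea concrete, here is a minimal sketch of a language-conditioned reward: score how well an observation matches a natural-language instruction, and hand that score to the RL agent. The random-projection "encoders" below are stand-ins rather than the post's trained models, and all names and dimensions are illustrative assumptions.

```python
# Sketch only: a language-conditioned reward as the cosine similarity
# between an instruction embedding and an observation embedding.
# The encoders are random stand-ins; in practice they would be trained
# on crowdsourced (video, annotation) pairs. Dimensions are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
D_OBS, D_TXT, D_EMB = 64, 32, 16          # toy feature sizes (assumptions)
W_obs = rng.normal(size=(D_OBS, D_EMB))   # stand-in visual encoder
W_txt = rng.normal(size=(D_TXT, D_EMB))   # stand-in language encoder

def embed(x, W):
    z = x @ W
    return z / (np.linalg.norm(z) + 1e-8)  # unit-normalize

def language_reward(frame_feats, instruction_feats):
    """Scalar reward: how well the frame matches the instruction."""
    return float(embed(frame_feats, W_obs) @ embed(instruction_feats, W_txt))

frame = rng.normal(size=D_OBS)        # stand-in for video-frame features
instr = rng.normal(size=D_TXT)        # stand-in for an annotation embedding
print(language_reward(frame, instr))  # the RL agent maximizes this signal
```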
We present an almost-linear time algorithm for the k-medoids problem that matches the prior state of the art in clustering quality. Our solution runs in almost the same time as k-means while keeping the advantages of k-medoids: cluster centers are actual data points, and arbitrary distance metrics are supported.
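For reference, here is a minimal sketch of the objective being solved, using the classic alternating heuristic rather than the post's almost-linear algorithm; this naive version does O(n²) work per iteration, and the function name and parameters are ours.

```python
# Reference implementation of the k-medoids objective, NOT the post's
# almost-linear algorithm: Voronoi-style alternation with O(n^2) work.
import numpy as np

def kmedoids_naive(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)          # assignment step
        new = medoids.copy()
        for c in range(k):                                 # update step: the
            members = np.flatnonzero(labels == c)          # member minimizing
            if members.size == 0:
                continue
            within = D[np.ix_(members, members)]           # total in-cluster
            new[c] = members[within.sum(axis=0).argmin()]  # distance wins
        if np.array_equal(new, medoids):
            break                                          # converged
        medoids = new
    return medoids, labels

X = np.random.default_rng(1).normal(size=(200, 2))
medoids, labels = kmedoids_naive(X, k=3)
print(X[medoids])  # centers are actual data points, unlike k-means
```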
We show that selective classification, where models are allowed to abstain when they are uncertain, can fail to improve, and can even hurt, accuracy on certain subpopulations of the data.
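As a rough illustration of the metrics involved, here is a sketch on synthetic data: abstain below a confidence threshold, then measure coverage and selective accuracy per subgroup. Everything here is a toy assumption rather than the paper's experimental setup; the data is contrived so that one group is overconfident on its mistakes.

```python
# Toy illustration of selective-classification metrics, not the paper's
# setup: abstain below a confidence threshold tau, then compare coverage
# and selective accuracy across two synthetic subgroups. Group 1 is
# deliberately made overconfident on its mistakes.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
group = rng.integers(0, 2, size=n)
labels = rng.integers(0, 2, size=n)
correct = rng.uniform(size=n) < 0.8                  # 80% base accuracy
preds = np.where(correct, labels, 1 - labels)
conf = np.where(correct, rng.uniform(0.7, 1.0, n), rng.uniform(0.5, 0.8, n))
conf = np.where((group == 1) & ~correct,             # group 1: confident
                rng.uniform(0.85, 1.0, n), conf)     # even when wrong

def selective_metrics(mask, tau=0.85):
    keep = (conf >= tau) & mask                      # answered examples only
    return keep.sum() / mask.sum(), (preds[keep] == labels[keep]).mean()

for g in (0, 1):
    cov, acc = selective_metrics(group == g)
    print(f"group {g}: coverage={cov:.2f}, selective accuracy={acc:.2f}")
# Group 0's accuracy rises under abstention; group 1's falls below its base.
```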
By tapping into knowledge stored explicitly in text corpora, retrieval helps tackle the inefficiency, opaqueness, and static nature of large language models.
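As a minimal sketch of the retrieval idea (illustrative only, not any particular system): rank a passage corpus against the query and prepend the best match to the prompt, so the model can consult text that lives outside its weights. Bag-of-words vectors stand in for a learned dense retriever, and the corpus and names below are ours.

```python
# Sketch of retrieval-augmented generation (illustrative, not a specific
# system): score a small passage corpus against the query and prepend the
# top hit to the prompt. Bag-of-words vectors stand in for a dense retriever.
import numpy as np

corpus = [
    "The Eiffel Tower is located in Paris.",
    "k-medoids picks actual data points as cluster centers.",
    "Retrieval lets a model consult an external, updatable corpus.",
]

def toks(text):
    return text.lower().replace(".", " ").replace("?", " ").split()

vocab = sorted({w for doc in corpus for w in toks(doc)})

def embed(text):
    v = np.array([toks(text).count(w) for w in vocab], dtype=float)
    return v / (np.linalg.norm(v) + 1e-8)           # unit-normalize counts

def retrieve(query, k=1):
    scores = [embed(doc) @ embed(query) for doc in corpus]
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

query = "Where is the Eiffel Tower?"
prompt = "\n".join(retrieve(query)) + f"\n\nQ: {query}\nA:"
print(prompt)  # the LM answers from retrieved, updatable text
```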