Neurips 2023, Part 1

Neurips 2023 Wrap-up - Posters (main conference)

Neurips 2023, held in New Orleans once again, has just finished. The last part of the Neurips ritual for me is to pool my notes and write a wrap-up (or more than one), where I can reflect on what I saw at the conference that stood out or that I want to pay attention to.

In a very real way, over the past few years Neurips has set the agenda for what I pay attention to in machine learning over the following year. Sometimes this is made especially easy - like when GPT was released at the same time as the conference. (Or Mistral announcing their platform and pricing this year.)

I’ve selected a few core themes and, within them, a few key posters / papers. This is a tiny subset, filtered through my eyes. The conference this year had an immense amount of research: each poster session hosted about 2k posters and there were 6 of them - every time you stepped into the poster hall, it felt vast. I suspect I’ll need 2 or 3 posts to cover more posters, the main conference and the workshops.

LLMs representing people

  • In-Context Impersonation Reveals Large Language Models’ Strengths and Biases - investigates how the behaviour of large language models changes as you ask them to impersonate a persona, and then evaluates downstream performance. The authors in-context prompt the model to behave like “a 4 year old” or “an ornithologist” and evaluate it on downstream tasks (see the prompt sketch after this list). They find that impersonation affects performance in mostly expected ways: as a 4 year old the model performs worse than as an adult, and as an ornithologist it performs better at characterising bird species. This work left me with lots of open questions about why this happens. Speaking to the authors, it was clear that, like other “why” questions about LLMs, this one is quite difficult to answer. I have a reading list to tackle after this presentation.
  • Inducing Personality - A slightly different take on persona than the work above. This work proposes that language models can be tested using the Big Five personality traits currently used in psychometric tests. It first establishes a personality for a set of models by testing them for openness, conscientiousness, extraversion, agreeableness and neuroticism. It then suggests you can induce a personality through prompting along each dimension, test the responses (after inducing a personality) against the scores for each latent dimension, and compare these against human evaluation of those scores. NIPS page
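
A minimal sketch of what in-context impersonation prompting looks like, in the spirit of the first paper above. `query_llm`, the personas, and the bird question are all illustrative stand-ins for whichever model API and evaluation tasks you actually use.

```python
# Sketch of in-context impersonation prompting and a tiny evaluation loop.
# `query_llm` is a hypothetical callable (prompt -> text); not the authors' code.

PERSONAS = ["a 4 year old", "an adult", "an ornithologist"]

TASK = (
    "Which of the following birds is a bird of prey?\n"
    "A) House sparrow  B) Peregrine falcon  C) Mallard duck\n"
    "Answer with a single letter."
)

def impersonation_prompt(persona: str, task: str) -> str:
    # The core trick: instruct the model to answer *as* the persona.
    return f"If you were {persona}, how would you answer the following?\n\n{task}"

def evaluate_personas(query_llm, expected: str = "B") -> dict[str, bool]:
    # Compare downstream accuracy across personas on the same task.
    results = {}
    for persona in PERSONAS:
        answer = query_llm(impersonation_prompt(persona, TASK))
        results[persona] = answer.strip().upper().startswith(expected)
    return results
```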

Planning

There was a whole workshop on Planning (which I didn’t attend)!

  • Describe, Explain, Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agents - poster - Explores task planning within the world of Minecraft, with some very clever ideas in it. It uses an LLM to plan, and forces the plan to have a code-like structure (which helps with plan coherence and ultimately performance). It then uses a selector to choose which goal (step in the plan) to go for given the current state, in order to improve efficiency (given there are many plans that could work); a rough sketch of the plan-plus-selector idea follows this list. Finally, a goal-conditioned policy executor interacts with the environment to achieve the chosen goal. The environment produces feedback (observations), which is used to produce explanations (for failed goals) in order to try again (re-planning if needed), much like the many approaches to self-reflection.
  • On the Planning Abilities of Large Language Models - A Critical Investigation - This paper and the next were in the spicy category. It posits that LLMs (even GPT-4) have significant difficulty performing actual planning for tasks. Even when the evaluation criteria are relaxed, LLMs struggle without an external validator. A key takeaway from this work is that the many self-critique approaches are a mirage. This is spicy because there is also plenty of literature pointing to LLMs being able to critique their own outputs, and Constitutional AI is very much predicated on the belief that models are better at critiquing output than generating it. The conversation with the authors was super fun, and the core idea was that if you don’t have an external guide (like behaviour cloning in one of the papers linked above), the LLMs struggle.
  • PlanBench - From the same group that wrote the planning abilities critique above, this proposes a benchmark to more accurately evaluate the planning ability of LLMs.
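
A very rough sketch of the “plan as code plus goal selector” idea from the first poster in this list. The goal names, inventory representation, and selection heuristic are all made up for illustration; in the real system the LLM emits the plan and a learned goal-conditioned policy executes each goal.

```python
# Toy illustration: a code-structured plan and a selector that picks the next
# achievable goal given the current state. Not the authors' implementation.
from dataclasses import dataclass

@dataclass
class Goal:
    name: str
    preconditions: list[str]   # items required before attempting this goal
    produces: str              # what completing this goal yields

# An LLM planner would normally emit this code-like plan; here it is hand-written.
PLAN = [
    Goal("chop_tree",   preconditions=[],        produces="log"),
    Goal("craft_plank", preconditions=["log"],   produces="plank"),
    Goal("craft_table", preconditions=["plank"], produces="crafting_table"),
]

def select_goal(plan: list[Goal], inventory: set[str]) -> Goal | None:
    # Selector: pick the first unmet goal whose preconditions the current
    # state already satisfies (a stand-in for the learned selector).
    for goal in plan:
        if goal.produces in inventory:
            continue  # already achieved
        if all(p in inventory for p in goal.preconditions):
            return goal
    return None

# Usage: holding a log, the selector proposes crafting planks next.
print(select_goal(PLAN, {"log"}).name)  # -> "craft_plank"
```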

Misc

  • Orthogonal FT for Diffusion Models - proposes a method for controlling diffusion models via fine-tuning (like DreamBooth, for instance) that is more efficient, based on the premise that what matters in weight matrices is the relative angles (hyperspherical similarity) rather than magnitudes. Based on this, it proposes fine-tuning by finding an orthogonal matrix (via an efficient orthogonal parametrisation) that represents the weight update (a rotation or reflection) of a fine-tune; a minimal sketch is below. It then compares OFT to LoRA (a popular parameter-efficient FT method), and claims positive results over LoRA and DreamBooth while being much more parameter-efficient.
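
A minimal sketch of the orthogonal fine-tuning idea as I understood it: freeze the pretrained weight and learn a rotation/reflection applied to it, parametrised so it stays orthogonal. This is my reading of the poster, not the authors' code; the Cayley parametrisation below is one standard way to get an orthogonal matrix, and the real method uses a more parameter-efficient (e.g. block-diagonal) structure rather than a full matrix.

```python
# Sketch, assuming a Cayley-parametrised orthogonal update on a frozen linear layer.
import torch
import torch.nn as nn

class OFTLinear(nn.Module):
    def __init__(self, pretrained: nn.Linear):
        super().__init__()
        d_out = pretrained.out_features
        self.weight = pretrained.weight          # frozen pretrained weight
        self.weight.requires_grad_(False)
        self.bias = pretrained.bias
        # Trainable parameters used to build a skew-symmetric matrix A (d_out x d_out).
        self.skew = nn.Parameter(torch.zeros(d_out, d_out))

    def orthogonal_update(self) -> torch.Tensor:
        # Cayley transform: A skew-symmetric => R = (I + A)^{-1} (I - A) is orthogonal.
        A = self.skew - self.skew.T
        I = torch.eye(A.shape[0], device=A.device)
        return torch.linalg.solve(I + A, I - A)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        R = self.orthogonal_update()
        W = R @ self.weight                       # rotate/reflect the frozen weight
        return nn.functional.linear(x, W, self.bias)
```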

Code, Code Synthesis or Code Evaluation:

  • Cross Code Eval - Neurips page - Proposes a benchmark for cross-file completion that is much more closely related to real-world tasks than the HumanEval benchmark. The authors present how they control for contamination and curate a dataset of completions that require ‘knowledge’ (often function signatures) that lives in other files. I found this interesting and very needed work, and it was really well explained by the authors.
  • ALGO: Synthesizing algorithmic programs - [Poster](https://nips.cc/media/PosterPDFs/NeurIPS 2023/72044.png?t=1701377425.7068577). Posits that simple (brute force) solutions to LeetCode-style problems are easy, and thus solvable by existing LLMs. This roughly maps to what you should do when tackling one of these problems yourself (find a naive solution first, then optimise). Based on this premise, the authors treat an LLM-written naive solution as an oracle (it gets the correct answer, just not efficiently) and use it to learn optimal solutions (which are harder), by checking candidates against the oracle for feedback; a sketch of the verification loop follows this list.
  • On-the-Fly Adapting Code Summarization on Trainable Cost-Effective Language Models - focused on the generation of comments: by keeping the most helpful examples for comment generation and removing the ‘harmful’ samples, a fine-tuned model performs better on comment generation. To find the helpfulness of a sample, they calculate some pairwise training-dynamics metrics and prune the dataset based on these. The paper proposes using a single or very few canonical examples to prune the dataset. I wasn’t super convinced by this work and I think I might need to revisit it later.
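
A sketch of the ALGO-style verification loop: use an easy brute-force solution as an oracle and check candidate optimised solutions against it on random inputs. The task here (maximum subarray sum) and the hand-written functions are purely illustrative; in the paper both programs would be LLM-generated.

```python
# Oracle-based verification sketch: the brute-force oracle is trusted because it
# is easy to get right; the candidate is only accepted if it matches the oracle.
import random

def brute_force_max_subarray(xs: list[int]) -> int:
    # Oracle: O(n^2), the kind of naive solution an LLM reliably writes.
    return max(sum(xs[i:j]) for i in range(len(xs)) for j in range(i + 1, len(xs) + 1))

def candidate_max_subarray(xs: list[int]) -> int:
    # Candidate: Kadane's algorithm, O(n) - the optimised solution to be verified.
    best = cur = xs[0]
    for x in xs[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

def verify(candidate, oracle, trials: int = 200) -> bool:
    # Feedback signal: does the candidate agree with the oracle on random inputs?
    for _ in range(trials):
        xs = [random.randint(-10, 10) for _ in range(random.randint(1, 20))]
        if candidate(xs) != oracle(xs):
            return False   # disagreement -> ask the LLM to revise the candidate
    return True

print(verify(candidate_max_subarray, brute_force_max_subarray))  # True
```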

Data / Efficiency / Speed:

  • LIMA Paper - NIPS page - from Meta, suggests very little data is actually needed to align a model to a mode of interaction (a.k.a. instruction tuning) and that practically all of a model’s knowledge comes from pre-training. Moreover, it suggests RLHF and human preference data (DPO) are scarcely needed. Interesting connotation: loads of data is needed for appropriate pre-training, but very little data (1k examples) is needed to establish formats, tone, and response mechanics. A neat definition of alignment data and results.
  • Big Little Decoder - NIPS page: proposes running a big and a small decoder together. Rather than emitting candidate next tokens from the small model and validating each one, it simply accepts them based on a confidence score; if the confidence score falls below a threshold, generation continues from the big model using the sequence before x_i (the low-confidence token). This means the little decoder runs ahead of the large model; a simplified sketch follows this list.
  • More to come here in a different post, as there was loads of interesting work covering efficiency and fast inference, including parallel speculative decoding, interesting takes on precision requirements, and various takes on LoRAs.
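
A simplified sketch of the Big Little Decoder fallback idea: the small model keeps generating as long as it is confident, and when its confidence drops below a threshold the big model produces that token instead. `small_next` and `big_next` are hypothetical functions returning `(token, probability)` for a given prefix; the real method also includes a rollback policy that this sketch omits.

```python
# Fallback-only sketch of Big-Little decoding (not the authors' implementation).
def bild_decode(prompt: list[int], small_next, big_next,
                threshold: float = 0.6, max_len: int = 128) -> list[int]:
    seq = list(prompt)
    while len(seq) < max_len:
        token, conf = small_next(seq)
        if conf < threshold:
            # Low confidence: hand this position to the big model, conditioning
            # on the sequence generated so far (before the uncertain token).
            token, _ = big_next(seq)
        seq.append(token)
    return seq
```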

LLM Memory / Context length, updating:

  • Life Long Editing of LLMs - Proposes using a hashmap (of KV cache) to map incorrect generations (an incorrect token) to correct generations, without any changes to weights or any fine-tuning. It does so by literally mapping a resulting token (which is a point in multi-dimensional space) to a new token. To find the new token, a procedure similar to inversion is used, exploring the embedding space for where the correct token ought to be. Once found, a dictionary of old_point : new_point is made, and any token matching old_point + epsilon is replaced with new_point. A toy sketch of the lookup is below.
  • Alternating Updates for Transformers for efficiency - Proposes doubling the representation (the width) of transformer layers without doubling the computation, by applying alternating prediction and correction updates across the two blocks at training time. This feels like a clever hack, but I had tons of questions and didn’t get time with the authors as loads of people were at the poster when I was there. Need to do more reading here.
  • MatFormer - Proposes a matryoshka (the Mat in MatFormer) architecture of layers, where forward passes run through N nested sizes of each layer. This means a final model (with similar inference-time dynamics, because of the shared training) can be fit to a variety of memory and compute constraints by selecting the appropriate size of the transformer layer; a rough sketch of the nesting is below.
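
A toy sketch of the lifelong-editing lookup described above: keep a table of old_point -> new_point in representation space and, at inference time, replace any representation that lands within epsilon of a stored old_point. This is a conceptual illustration (numpy, brute-force nearest key), not the authors' implementation.

```python
# Toy edit table: swap representations that fall within epsilon of a stored edit.
import numpy as np

class EditTable:
    def __init__(self, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.keys: list[np.ndarray] = []    # old points (wrong generations)
        self.values: list[np.ndarray] = []  # corrected points

    def add_edit(self, old_point: np.ndarray, new_point: np.ndarray) -> None:
        self.keys.append(old_point)
        self.values.append(new_point)

    def apply(self, point: np.ndarray) -> np.ndarray:
        # If the incoming representation is within epsilon of a stored edit,
        # return the corrected representation; otherwise pass it through.
        for key, value in zip(self.keys, self.values):
            if np.linalg.norm(point - key) <= self.epsilon:
                return value
        return point

# Usage: once an inversion-like search has found where the correct token's
# representation ought to be, register the edit once and reuse it forever.
table = EditTable(epsilon=0.05)
table.add_edit(np.array([0.1, 0.9]), np.array([0.8, 0.2]))
print(table.apply(np.array([0.11, 0.89])))  # -> corrected point [0.8, 0.2]
```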
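
And a rough sketch of the MatFormer nesting idea as I understood it: a single FFN whose hidden units are ordered so that a prefix of them forms a valid smaller FFN, letting you pick a fraction at deployment time to fit a compute budget. Shapes and the training detail (jointly optimising several nested fractions) are simplified and assumed here.

```python
# Nested-FFN sketch: smaller sub-FFNs are literal prefixes of the full one.
import torch
import torch.nn as nn

class MatryoshkaFFN(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor, fraction: float = 1.0) -> torch.Tensor:
        # Use only the first `fraction` of the hidden units; no extra weights
        # are stored for the smaller variants.
        k = int(self.up.out_features * fraction)
        h = torch.relu(nn.functional.linear(x, self.up.weight[:k], self.up.bias[:k]))
        return nn.functional.linear(h, self.down.weight[:, :k], self.down.bias)

ffn = MatryoshkaFFN()
x = torch.randn(2, 512)
print(ffn(x, fraction=0.25).shape)  # same output shape, a quarter of the FFN compute
```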