A common topic at Reddy lately has been how far we can go with LLM automation. Inference is cheap, so our primary target is getting reliable output without manual review. I want to address what makes this hard and share my model for reasoning about it.
Note: Most of these processes don’t need latency or cost optimization (yet), so I’m ignoring both of these real-world factors for now. Assume that if a method is at all possible, it will be more scalable than manual execution of the same task.
Email Filter
Let’s first walk through a simple binary classification example that may be very familiar to you. If you aren’t familiar with ML basics like FP/FN, scoring, etc., see the Google developer introduction.
I want to filter out spam in my email inbox using a binary classification from an LLM. My eval dataset consists of 100 emails that I’ve manually marked as spam, 100 that I have not, and 100 that I’ve opened and read for more than 20 seconds (and not marked as spam). For simplicity, let’s assume that the non-spam set is distinct from the “important” set even though in reality it would likely be a superset.
Starting with a simple prompt, e.g., “Is this email spam?”, I run it across my dataset of 300 emails. When I compare the LLM scores to my actual data, I can expect something like this:
- Spam emails: 90% accuracy (10% FN)
- Not-spam emails: 70% accuracy (30% FP)
- Important emails: 85% accuracy (15% FP)
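To make this concrete, here’s a minimal sketch of the eval harness. `classify_spam` is a stand-in for the real LLM call, and the three lists are tiny placeholders for the 100-email sets described above:

```python
def classify_spam(email: str) -> bool:
    """Placeholder for the LLM call with the prompt "Is this email spam?"."""
    return "unsubscribe" in email.lower()  # stand-in heuristic so the sketch runs

def error_rate(emails: list[str], is_spam: bool) -> float:
    """Fraction of a single-label set that the classifier gets wrong."""
    wrong = sum(1 for email in emails if classify_spam(email) != is_spam)
    return wrong / len(emails)

# Tiny placeholders for the three 100-email labeled sets.
spam_emails = ["Unsubscribe now to claim your free prize!"]
non_spam_emails = ["Your package ships on Tuesday."]
important_emails = ["Board meeting moved to 9am, please confirm."]

spam_fn = error_rate(spam_emails, is_spam=True)             # missed spam
non_spam_fp = error_rate(non_spam_emails, is_spam=False)    # normal mail flagged as spam
important_fp = error_rate(important_emails, is_spam=False)  # important mail flagged as spam
```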
Now I’ll write a simple scoring function to compare runs. This, together with the dataset it is trained on, dictates what the final model will actually be. I care most about the model producing an FP for an important email, less about FPs for non-spam (but not important) emails, and even less about FNs in the spam dataset. The former may be quite consequential (e.g., an important email from an investor or client is marked as spam) while the latter is just a minor inconvenience. So a scoring function may look like this:
ERROR = 0.1 × SPAM_FN + 0.25 × NON_SPAM_FP + 1.0 × IMPORTANT_FP
This weights the FPs from the important dataset 10 times higher than FNs from the spam dataset and 4x as high as the regular non-spam dataset.
Now as I change my prompt, language model, or parameters, I re-run the eval and compare the ERROR to the previous configuration.
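A minimal sketch of that scoring function, with the per-category rates (0 to 1) from the example run above plugged in:

```python
# Weights encode how much each kind of mistake matters to me.
WEIGHTS = {
    "spam_fn": 0.1,       # spam that slips into the inbox
    "non_spam_fp": 0.25,  # ordinary mail flagged as spam
    "important_fp": 1.0,  # important mail flagged as spam
}

def error_score(spam_fn: float, non_spam_fp: float, important_fp: float) -> float:
    """Weighted error across the three labeled datasets; lower is better."""
    return (
        WEIGHTS["spam_fn"] * spam_fn
        + WEIGHTS["non_spam_fp"] * non_spam_fp
        + WEIGHTS["important_fp"] * important_fp
    )

# Rates from the example run above: 10% FN, 30% FP, 15% FP.
print(error_score(0.10, 0.30, 0.15))  # -> 0.235
```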
There are a few issues to point out even in this simple example. The most important is the dataset.
Because I’ve only trained on my categorization, I cannot assume that this will work effectively for someone else’s definition of “spam”.
Also, are the 100 emails a good representation of the space of all emails in my inbox? If I pulled them all from today and today does not happen to be a good proxy for most days (maybe something unusual happened), then I’ve unknowingly overfit to a single day. Random sampling over a year and scaling up the datasets can help mitigate this, but the risk will always be there.
Next, is the scoring function what I actually want? If I’m aiming for 90% accuracy in the final output, I can hit that target by marking everything as not spam: with the weights as they currently sit, that degenerate classifier scores 92.6% ((0 + 0.25 + 1) / (0.1 + 0.25 + 1) ≈ 0.926), since only the spam set is wrong. This is a tradeoff that I may be willing to make, but a more complex comparison function can try to mitigate it.
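Sanity-checking that number with the same weights:

```python
# Degenerate classifier: mark every email "not spam".
weights = {"spam_fn": 0.1, "non_spam_fp": 0.25, "important_fp": 1.0}
worst_case = sum(weights.values())         # 1.35: every category fully wrong
baseline_error = weights["spam_fn"] * 1.0  # 0.10: only the spam set is wrong (100% FN)
print(1 - baseline_error / worst_case)     # ~0.926, i.e. 92.6% weighted accuracy
```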
HTML Generation from Screenshots
Let’s look at a more complex example: generating HTML from screenshots. The naive approach would be to have the LLM look at a screenshot and output HTML that matches it. However, even with modern models, this rarely works perfectly on the first try. So we might try an iterative approach (sketched in code after the list):
- Generate initial HTML from screenshot
- Render the HTML into a new screenshot
- Use an LLM to identify differences between the target and rendered screenshots
- Have another LLM call edit the HTML based on those differences
- Repeat
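A minimal sketch of that loop, assuming hypothetical helpers for each model or tool call (none of these are real APIs, just wrappers you’d write around an LLM and a headless browser):

```python
from typing import Callable

def screenshot_to_html(
    target: bytes,
    generate_html: Callable[[bytes], str],                # LLM: screenshot -> initial HTML
    render: Callable[[str], bytes],                       # headless browser: HTML -> screenshot
    describe_differences: Callable[[bytes, bytes], str],  # LLM: compare the two screenshots
    edit_html: Callable[[str, str], str],                 # LLM: apply fixes to the HTML
    max_iterations: int = 10,
) -> str:
    """Naive refine-until-it-matches loop; each helper is a separate model/tool call."""
    html = generate_html(target)                          # 1. initial generation
    for _ in range(max_iterations):
        rendered = render(html)                           # 2. render the current HTML
        diffs = describe_differences(target, rendered)    # 3. identify remaining differences
        if not diffs:                                     # empty string -> nothing left to fix
            break
        html = edit_html(html, diffs)                     # 4. edit based on those differences
    return html
```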
This introduces several challenges. First, chaining model calls compounds the chance of errors. If our first call is 90% accurate at identifying differences, and our second is 90% accurate at implementing those fixes, we’re down to 81% accuracy for making the correct change.
But an even bigger issue is that the model often inadvertently breaks something else while making changes to fix differences. This creates a frustrating cycle where each fix potentially introduces new problems, even if both model calls technically worked as intended. This means that on each step, you may have to go back and re-run with a slightly different prompt instead of continuing where the last call left off.
When I implemented the simple version of the algorithm (no branching/reverting), current SOTA models get you something that looks somewhat like the original, but it never gets close to “right.” There’s always at least one or two glaring issues. After 5-10 iterations, it stops making any improvements overall – it’s not 5 steps forward, 3 steps back, instead it becomes 2 steps forward, 2 steps back.
I would imagine that with a high enough temperature, a fully branched algorithm, and a theoretically infinite number of calls, you could make this reliable: taken to infinity, even a random assortment of characters will eventually produce an exact HTML replica of the screenshot. As long as you have a mechanism for diffing the images in a reasonable way, it should be possible. I’m not sure where it sits between 100 calls and infinite calls though.
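As a rough sketch of what “fully branched” could look like: sample several candidate edits per step, score each rendered result against the target with an image diff, and only keep a candidate that actually reduces the diff (reverting otherwise). The propose_edit and image_diff helpers are assumptions here, not part of the original algorithm:

```python
from typing import Callable

def branched_refine(
    target: bytes,
    html: str,
    propose_edit: Callable[[str, bytes, bytes], str],  # high-temperature LLM: candidate edit
    render: Callable[[str], bytes],
    image_diff: Callable[[bytes, bytes], float],       # 0.0 means the screenshots match
    branches: int = 4,
    max_iterations: int = 100,
) -> str:
    """Keep a candidate edit only if it reduces the image diff; otherwise revert."""
    best_html = html
    best_score = image_diff(target, render(best_html))
    for _ in range(max_iterations):
        for _ in range(branches):                      # sample several candidates per step
            candidate = propose_edit(best_html, target, render(best_html))
            score = image_diff(target, render(candidate))
            if score < best_score:                     # accept only strict improvements
                best_html, best_score = candidate, score
        if best_score == 0.0:                          # exact match (only in the limit)
            break
    return best_html
```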
Generate a Function from Unit Tests
Another limitation of current LLMs that’s less obvious but equally important: their cognitive abilities often leave them stuck in local minima. Consider trying to generate a function that passes a set of unit tests. The overall algorithm looks like this:
1. A human writes the test cases
2. The LLM infers the underlying logic, often with the help of a guiding prompt
3. The LLM generates code
4. An automation runs the test cases against the model-generated code
5. Results are shared and we retry starting from (2)
The eval is very straightforward. The data is often small since it’s only the manually written test cases, and for that dataset the functionality must work 100% of the time. If it does not, the model will be fed the failures as well as any error messages to help it understand what exactly went wrong.
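A minimal sketch of that loop; generate_function is a hypothetical wrapper around the LLM call, and the test runner is just exec-and-assert:

```python
from typing import Callable

def run_tests(code: str, tests: list[Callable[[dict], None]]) -> list[str]:
    """Exec the generated code and collect a message for each failing test."""
    namespace: dict = {}
    try:
        exec(code, namespace)                 # load the model-generated function(s)
    except Exception as exc:
        return [f"code failed to load: {exc}"]
    failures = []
    for test in tests:
        try:
            test(namespace)                   # each test asserts against the loaded namespace
        except Exception as exc:
            failures.append(f"{test.__name__}: {exc}")
    return failures

def generate_until_passing(
    spec_prompt: str,
    tests: list[Callable[[dict], None]],
    generate_function: Callable[[str], str],  # hypothetical LLM wrapper: prompt -> code
    max_attempts: int = 10,
) -> str | None:
    prompt = spec_prompt
    for _ in range(max_attempts):
        code = generate_function(prompt)      # steps 2-3: infer the logic, generate code
        failures = run_tests(code, tests)     # step 4: run the human-written tests
        if not failures:
            return code                       # every test passed
        # step 5: feed the failures and errors back, then retry
        prompt = spec_prompt + "\nPrevious attempt failed:\n" + "\n".join(failures)
    return None                               # gave up; often the cycle described below
```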
In simple cases, this works well. But as the complexity of the functionality we need increases, the model will get stuck between two or three possible implementations, unable to find a solution that satisfies all cases. Eventually it will fix one test case only to break another, then flip back to the previous version, creating an endless cycle.
This isn’t solved by larger context windows – it comes down to the model’s ability to hold and manipulate multiple concepts simultaneously while searching for a solution. It might understand each individual requirement but still not be able to write a solution that satisfies all of them at once. Every time we get a new SOTA model, the level of complexity that will actually work goes up.
The rapid advancements we saw initially in 2023 were exciting, but while costs have plummeted, I don’t think models are on any sort of exponential cognitive growth curve. Either way, as an application developer, if you can put a problem into the bucket of “too complex”, it’s a good idea to come back to it with new models to see if it’s been solved.