Uncategorized Archives - Jake Dube

On Evals and Agents

Jake Duth — Sun, 27 Oct 2024 22:17:40 +0000

A common topic at Reddy lately has been how far we can go with LLM automation. Inference is cheap, so our primary target is getting reliable output without manual review. I want to address what makes this hard and my model for reasoning about it.

Note: Most of these processes don’t need latency or cost optimization (yet), so I’m ignoring both of these real-world factors for now. Assume that if a method is at all possible, it will be more scalable than manual execution of the same task.

Email Filter

Let’s walk through a simple binary classification example first, one that may be very familiar to you. If you aren’t familiar with ML basics like FP/FN, scoring, etc., see the Google developer introduction.

I want to filter out spam in my email inbox using a binary ranking from an LLM. My eval dataset consists of 100 emails that I’ve manually marked as spam, 100 that I have not, and 100 that I’ve opened and read for more than 20 seconds (and not marked as spam). For simplicity, let’s assume that the non-spam set is distinct from the “important” set even though in reality it would likely be a superset.

Starting with a simple prompt, e.g., “Is this email spam?”, I run it across my dataset of 300 emails. When I compare the LLM scores to my actual data, I can expect something like this:

Spam emails: 90% accuracy (10% FN)
Not-spam emails: 70% accuracy (30% FP)
Important emails: 85% accuracy (15% FP)

Now I’ll write a simple scoring function to compare runs. This function and the dataset it is trained on dictate what the final model will actually be. I care most about the model having an FP for an important email, less so about FPs on emails that are non-spam but not important, and even less so about FNs in the spam dataset. The first may be quite consequential (e.g., an important email from an investor or client is marked as spam) while the last is just a minor inconvenience. So a scoring function may look like this:

ERROR = (0.1)SPAM_FN + (0.25)NON_SPAM_FP + IMPORTANT_FP

This weights the FPs from the important dataset 10 times higher than FNs from the spam dataset and 4x as high as the regular non-spam dataset.
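As a minimal sketch, the scoring function above could be written in Python like this. The function name, the argument names, and the choice to pass error rates as fractions between 0 and 1 are assumptions for illustration; only the weights come from the formula.

```python
def weighted_error(spam_fn: float, non_spam_fp: float, important_fp: float) -> float:
    """Weighted error for one eval run; each argument is an error rate in [0, 1]."""
    return 0.1 * spam_fn + 0.25 * non_spam_fp + 1.0 * important_fp


# Rates from the example run: 10% FN on spam, 30% FP on non-spam, 15% FP on important.
baseline = weighted_error(spam_fn=0.10, non_spam_fp=0.30, important_fp=0.15)
print(round(baseline, 3))  # 0.235
```

Lower is better, so any new prompt or model configuration would need to score below the previous run to count as an improvement.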

Now as I change my prompt, language model, or parameters, I re-run the eval and compare the ERROR to the previous configuration.

There are a few issues to point out even in this simple example. The most important is the dataset.

Because I’ve only trained on my categorization, I cannot assume that this will work effectively for someone else’s definition of “spam”.

Also, are the 100 emails a good representation of the space of all emails in my inbox? If I pulled them all from today and today does not happen to be a good proxy for most days (maybe something unusual happened), then I’ve unknowingly overtrained on a single day. Random sampling over a year and scaling up the datasets can help to mitigate this, but the risk will always be there.

Next, is the scoring function what I actually want? Suppose I am aiming for 90% accuracy in the final output. If I mark everything as not spam, the scoring function as it currently stands gives me 92.6% weighted accuracy ((0+0.25+1)/(0.1+0.25+1)). This is a tradeoff that I may be willing to make, but a more complex comparison function can try to mitigate it.
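The 92.6% figure can be reproduced with a few lines of Python. This is only a sketch: the dictionary keys and the `weighted_accuracy` helper are names I made up, and "weighted accuracy" here simply means one minus the weighted error divided by the sum of the weights.

```python
WEIGHTS = {"spam_fn": 0.1, "non_spam_fp": 0.25, "important_fp": 1.0}


def weighted_accuracy(error_rates: dict[str, float]) -> float:
    """1 minus the weighted error, normalized by the largest possible weighted error."""
    error = sum(WEIGHTS[name] * rate for name, rate in error_rates.items())
    return 1 - error / sum(WEIGHTS.values())


# "Mark everything as not spam": every spam email is missed, nothing is falsely flagged.
always_not_spam = {"spam_fn": 1.0, "non_spam_fp": 0.0, "important_fp": 0.0}
# The example LLM run from earlier: 10% FN, 30% FP, 15% FP.
example_run = {"spam_fn": 0.10, "non_spam_fp": 0.30, "important_fp": 0.15}

print(f"{weighted_accuracy(always_not_spam):.1%}")  # 92.6%
print(f"{weighted_accuracy(example_run):.1%}")      # 82.6%
```

Under these weights the do-nothing filter outscores the example run, which is exactly the kind of blind spot to look for in a scoring function.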

HTML Generation from Screenshots

Let’s look at a more complex example: generating HTML from screenshots. The naive approach would be to have the LLM look at a screenshot and output HTML that matches it. However, even with modern models, this rarely works perfectly on the first try. So we might try an iterative approach:

Generate initial HTML from screenshot
Render the HTML into a new screenshot
Use an LLM to identify differences between the target and rendered screenshots
Have another LLM call edit the HTML based on those differences
Repeat
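The steps above can be sketched as a short loop. Everything here is hypothetical scaffolding: the four callables stand in for whatever model calls and renderer you use, and the iteration cap is an arbitrary choice.

```python
from typing import Callable


def refine_html(
    target: bytes,
    generate: Callable[[bytes], str],                 # screenshot -> initial HTML
    render: Callable[[str], bytes],                   # HTML -> screenshot
    find_diffs: Callable[[bytes, bytes], list[str]],  # (target, rendered) -> differences
    apply_fixes: Callable[[str, list[str]], str],     # (HTML, differences) -> edited HTML
    max_iterations: int = 10,
) -> str:
    """Simple version of the loop: no branching, no reverting."""
    html = generate(target)
    for _ in range(max_iterations):
        diffs = find_diffs(target, render(html))
        if not diffs:
            break  # rendered output matches the target
        html = apply_fixes(html, diffs)
    return html
```

Note that this version always continues from the latest edit, so a bad fix is never rolled back.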

This introduces several challenges. First, chaining model calls compounds the chance of errors. If our first call is 90% accurate at identifying differences, and our second is 90% accurate at implementing those fixes, we’re down to 81% accuracy for making the correct change.
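The 81% comes from multiplying the per-step accuracies. Here is a small sketch of that arithmetic; the `chain_success` name is made up, and it assumes each step succeeds or fails independently of the others, which real model calls may not.

```python
import math


def chain_success(step_accuracies: list[float]) -> float:
    """Probability that every step in a chain succeeds, assuming independent steps."""
    return math.prod(step_accuracies)


print(round(chain_success([0.9, 0.9]), 2))      # one diff + edit round: 0.81
print(round(chain_success([0.9, 0.9] * 5), 2))  # five rounds all correct: 0.35
```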

But an even bigger issue is that the model often inadvertently breaks something else while making changes to fix differences. This creates a frustrating cycle where each fix potentially introduces new problems, even if both model calls technically worked as intended. This means that on each step, you may have to go back and re-run with a slightly different prompt instead of continuing where the last call left off.

When I implemented the simple version of the algorithm (no branching/reverting), current SOTA models got me something that looked somewhat like the original, but it never got close to “right.” There were always at least one or two glaring issues. After 5-10 iterations, it stopped making any overall improvement: instead of 5 steps forward, 3 steps back, it became 2 steps forward, 2 steps back.

I would imagine that with a high enough temperature, a fully branched algorithm, and a theoretically infinite number of calls, you could make this reliable. This is because even a random assortment of characters will at some point be the exact HTML replica of a screenshot when taken to infinity. As long as you have a mechanism for diffing the images in a reasonable way, it should be possible. I’m not sure where it sits between 100 calls and infinite calls though.

Generate a Function from Unit Tests

Another limitation of current LLMs that’s less obvious but equally important: their cognitive abilities often leave them stuck in local minima. Consider trying to generate a function that passes a set of unit tests. The overall algorithm looks like this:

1. Human writes the test cases
2. The LLM infers the underlying logic, often with a prompt
3. The LLM generates code
4. An automation runs the test cases against the model-generated code
5. Results are shared and we retry starting from (2)

The eval is very straightforward. The data is often small since it’s only the manually written test cases, and for that dataset the functionality must work 100% of the time. If it does not, the model will be fed the failures as well as any error messages to help it understand what exactly went wrong.

In simple cases, this works well. But as the complexity of the functionality we need increases, the model will get stuck between two or three possible implementations, unable to find a solution that satisfies all cases. Eventually it will fix one test case only to break another, then flip back to the previous version, creating an endless cycle.
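Here is a rough sketch of that retry loop, including a guard for the flip-flopping case. The `generate` callable is a stand-in for the model call, the convention that the candidate defines a function named `solve` is my assumption, and `exec` is only reasonable here for trusted or sandboxed code.

```python
from typing import Callable, Optional

# A test case is (args, expected_result) for a function named `solve`.
TestCase = tuple[tuple, object]


def run_tests(source: str, tests: list[TestCase]) -> list[str]:
    """Execute the candidate source and return a list of failure messages."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # only for trusted or sandboxed code
        func = namespace["solve"]
    except Exception as exc:
        return [f"could not load candidate: {exc!r}"]
    failures = []
    for args, expected in tests:
        try:
            got = func(*args)
        except Exception as exc:
            failures.append(f"solve{args} raised {exc!r}")
            continue
        if got != expected:
            failures.append(f"solve{args} returned {got!r}, expected {expected!r}")
    return failures


def generate_until_passing(
    generate: Callable[[list[str]], str],  # failure messages -> new candidate source
    tests: list[TestCase],
    max_attempts: int = 10,
) -> Optional[str]:
    """Retry until all tests pass; give up if the model repeats an earlier candidate."""
    seen: set[str] = set()
    feedback: list[str] = []
    for _ in range(max_attempts):
        source = generate(feedback)
        if source in seen:
            return None  # the model is cycling between implementations
        seen.add(source)
        feedback = run_tests(source, tests)
        if not feedback:
            return source
    return None
```

Exact string matching only catches verbatim repeats; a model that flips between near-identical versions would need a looser comparison.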

This isn’t solved by larger context windows – it’s down to the model’s ability to hold and manipulate multiple concepts simultaneously while searching for a solution. It might understand each individual requirement but still not be able to write a solution that satisfies all of them simultaneously. Every time we get a new SOTA model, the level of complexity that will actually work goes up.

The rapid advancements we saw initially in 2023 were exciting, but while costs have plummeted, I don’t think models are on any sort of exponential cognitive growth curve. Either way, as an application developer, if you can put a problem into the bucket of “too complex”, it’s a good idea to come back to it with new models to see if it’s been solved.

The post On Evals and Agents appeared first on Jake Dube.

A Technical Founder at a Tech Startup

Jake Duth — Fri, 05 Jan 2024 07:35:07 +0000

Six months ago I left the corporate world at Ball Aerospace to found an AI startup with Adam, who I’d met on Twitter only briefly before. We hadn’t even met in person, yet I left one of the most stable 6-figure tech jobs where I had no complaints. It was a hard decision because I loved my work, had great coworkers, excellent benefits, and a reasonable salary. I’d been working next to some of my coworkers for 5 years at that point.

Burnout

Why? I was burnt out. Over the last few years, I’ve learned that burnout for me comes not from overdoing something, but from not doing enough of anything. I need challenges, I can’t live in a mediocre world where everything just works. “The norm” grinds me down, I can’t do it. I live for uncertainty, stress, and tight deadlines. Most people hate when they’re asked to work late on Friday. Even in a corporate job where that wasn’t expected, if someone wanted something ASAP, they were going to get it ASAP. I stayed up coding until 3 AM some days and became known as the guy to GSD and do it fast. I got bonuses and a pat on the back every time, but whatever the word for grossly failing to maximize your results is…that’s what applies here. If you’re like me, save voraciously, then quit while you’re ahead.

Maybe it’s not burnout. Whatever it is, I get lethargic, I don’t want to do anything, and I can focus on anything but work.

My startup cured me of whatever it was. My workload immediately shot up from 40 hours a week (on the dot) to 80 hours being a “break week” and most in the 100-120 range. When you’re at a startup, it’s do or die. Not only did I need to put in a lot of hours, but I also had to context switch faster than ever before.

More Tasks, Less Time

Roughly speaking, I had to be able to do 10 major tasks in a day starting out, then 5 in a day after 3 months, then 2 a day after 6 months. When I was in corporate, we usually measured tasks in multiple days or weeks. Those same tasks are now hours. The primary reason for this falloff over time is also why large companies tend to move so slowly: it becomes increasingly complex to iterate as your software becomes more full-featured.

Yet that’s only part of the picture. If I had to put a number on it, I’d say there’s also a minimum 2x speed improvement you have to apply to yourself anyways. This is because:

You are the product owner. You must be able to make good decisions on scope. You no longer have someone telling you to do this or that. If ‘that’ takes too long, you can just do ‘this’ – especially when ‘that’ is only 20% of the value.
You have to become a faster engineer. Learn things faster. Take longer days. Go to new environments every 3-4 hours. Read documentation when you can’t be coding. Find libraries or open source applications that do 95% of what you need, and then ignore the other 5%. Adam calls me a 10x dev, and I’d say this is why.

Solving Problems Alone

Until we started hiring, I was responsible for every bit of tech 24/7. That also means you get to use whatever tech you feel the most comfortable with which makes you feel like you have inhuman coding capabilities. At my corporate job, I didn’t even have the senior title yet (mostly due to corporate HR BS about how years = experience, but still). At Reddy, that didn’t matter. I was the only technical founder, every technical problem was my problem to solve and boy did I solve a lot of problems.

We landed our first client earlier than most startups which is great, unless you’re the engineer trying to get everything stood up in 25% of the time you thought you’d have. Keep in mind the first time I seriously touched web dev was January 2023, and our pilot product had a real-time simulator using websockets. I can’t remember exactly, but it went something like I learned what websockets were one morning, coded a prototype that day and then had end-users breaking it a day or two later.

Breaking it? Of course. One of the first major bugs I got to work on was when audio streaming over the websockets was crashing the connection. After many hours of beating my head against the wall (remember this was new tech to me, so as always the perceived problem space was significantly bigger than when you know something better), I left to go eat dinner at my family’s house. I talked it over with my dad, who is also a software engineer, and his hunch was that we were hitting bandwidth limitations at our client company’s router. That was with only a handful of people running it in parallel. I spent the evening in Wireshark, which proved him right, and then spent the night fixing it (and a host of other bugs we had seen). That was my first of several all-nighters. At one point Adam thought I was saying midsockets, so now whenever software goes wonky, we just say it’s “midsockets all over again.”

Room to Breathe

Thankfully the midsockets have long since been fixed and the product has become both substantially more elaborate and stable. Best of all, building this product has leveled up my web knowledge so much that I feel more comfortable doing web than what I’d been doing for 3 straight years before. This makes sense, because I was doing 6 months at 3x hours, and because it was all back-to-back, my recall and learning skyrocketed. There’s nothing like a startup to teach you new things and keep your skills sharp.

Skip forward ~4 months. We’re getting ready to pilot another product, and decided it was time to hire. I’ll save the rest of the story for another day, but I think it’s important to say that going from one engineer to two engineers on the team was an incredible relief. If you’re a solo dev, as soon as you can, get someone else. No matter how good you are, having someone else you can turn to or just to keep each other motivated is such a win.

Thankfully 120-hour weeks are largely behind me. Now I’m focused on delivering an increasingly higher quality of work rather than shipping a large mass of features as fast as possible. It’s not really any easier, and it’s still a thousand times more challenging and fun than a 9-5, but it’s nice to work (slightly) less than every waking moment.

I’ll be in this startup for a while, but I can’t imagine wanting to go back to corporate when it’s done.

Ruining January 1

Jake Duth — Tue, 02 Jan 2024 04:02:05 +0000

You probably shouldn’t read this if you made any New Year’s resolutions, but if you didn’t, give yourself a pat on the back. I never commit to anything on Jan 1, and for good reason.

Everyone has an idealistic view of their goal – it’s impossible to know in advance what it’ll feel like when you accomplish it – so we just dream about that day when coming up with resolutions. The excitement and anticipation are enough to get you a few weeks into a challenge (sometimes), but we all know those feelings won’t last.

The point when you’ve had enough and won’t take no for an answer is when you know change will truly come. Whether you’re sick of being broke, fat, or skinny, most people will at some point cross a mental threshold that they’re not willing to go past. The likelihood of that day being Jan 1 is roughly 1:365.

There’s nothing special about the new year. If we didn’t have calendars, we wouldn’t be able to pinpoint the exact day Earth completed a revolution anyways.

The only thing different about 1/1 is the habit of doing some introspection to decide on new goals. However, speaking only for myself, if I have to actually think to know what problem I’m going to solve, that problem wasn’t big enough in the first place. If you’re broke, you’ll be reminded of it every time you buy a meal or pay the bills. If you’re fat or skinny, you’ll be thinking about it whenever you eat a meal or look in the mirror.

Huge, glaring problems are the only things worth trying to set goals for. If it’s not a massive issue, the chance of you sticking it out is low.

It’s also crazy to delay fixing one of these things that you know is super impactful. Why would I wait until January to fix a big problem that I’m well aware of in November or December? If I’m not willing to fix it when I recognize it to be a massive issue, a New Year’s resolution sure isn’t going to push me over the line.

So that’s why I don’t start challenges or plans on 1/1.

By the way, I do have a 2024 business goal for Reddy, but my cofounder and I discussed it weeks ago and have already been executing on it. I have a fitness goal to lose 20 lbs over the next few months, but I’ve already been tracking my calories on Cronometer (previously MFP) and going to the gym for months.

The message I’m trying to get across isn’t that you should just quit your goal a day after you start, but rather that even if you fail, it’s perfectly fine to start on 1/2, 6/1, or even 12/31.

It doesn’t matter what day it is.

Start your challenge when you’ve said enough is enough and are unwilling to do anything but fix the problem.

Clean Web Apps Without an Opinionated JavaScript Framework

Jake Duth — Mon, 30 Oct 2023 00:46:49 +0000

Note that this was originally posted for a tweet while I was still relatively early in our development stage. I’ve single-handedly shipped an app to production and this has worked great, but time will tell if it holds up as our development team and application size grows. I’ll be coming back to this article to update it as we solve pieces.

That said, I’m not pioneering anything. These tools are already used by many applications you use every day.

Jumping to a new framework may be the correct answer in some cases, but I’ve seen a lot of these posts recently and wanted to give another option that I use. Keep in mind that I’ve not done much in React, Vue, Nuxt, Next, etc. I’m a backend developer brought into the world of front-end by necessity. I love designing UIs and improving UX flows and I wish all engineers got the chance to be full-stack.

I can’t (yet) share a major success story about our product’s scalability, speed, etc., but I can share details of how we achieve certain high-level needs in our web apps. For success stories to pique your interest, here’s what OpenUnited said about their choice to switch from React to a more Vanilla codebase.

Our “deliberately simple” frontend means that we use Jinja templates, TailwindCSS, TailwindUI, Hyperscript, plain javascript where needed, and HTMX where it improves the UX. Earlier we had a separate ReactJS frontend and a GraphQL API layer, however such fanciness failed to deliver the expected value, whilst creating complexity/friction… therefore, we now have a deliberately simple frontend. As a result, we have about 50% less code and move way faster.
OpenUnited: https://github.com/OpenUnited/platform

For another well-rounded UI example, see Contexte. Here are some handpicked pieces from the HTMX post about them switching to HTMX from React:

No reduction in the application’s user experience (UX)

They reduced the code base size by 67% (21,500 LOC to 7200 LOC)

They reduced their total JS dependencies by 96% (255 to 9)

They reduced their web build time by 88% (40 seconds to 5)

First load time-to-interactive was reduced by 50-60% (from 2 to 6 seconds to 1 to 2 seconds)

Web application memory usage was reduced by 46% (75MB to 45MB)

A Real World React -> htmx Port: https://htmx.org/essays/a-real-world-react-to-htmx-port/

Clearly, the “Vanilla Framework” doesn’t have to be a monster. You can choose the simplicity of not having an opinionated JS framework and still have a clean codebase. You can even do it without a build step if you so choose.

Components

The core issue that is easy to run into is that there are no components by default so you feel like you have to duplicate code, styles, etc.

So how do you get components? Many backend frameworks already provide a rich templating language for reducing your HTML into sub-components. Learn at least the basics of passing variables into them and how to do formatting. Also, try to make them more generic. If you’re creating a form, e.g., you should only have to create the form template once per project.

I predominantly use Django, so a nice ancillary tool is the django-components package, which allows creation of a single component that can tie CSS, HTML, and JavaScript into one modular piece. It will even smartly insert the rendered output into an HTML document the way you would normally write it if it were a one-off page: the CSS gets collected into its own tag and the JavaScript gets combined into another.

I can now render however many of the component I’d like (even dynamically with HTMX), and the JS/CSS will be consistent for all of them.

Events

Part of a good UI is keeping everything up-to-date. When you add an item in a popup modal form, you expect the table that lists all your items to also update to show the new item. With HTMX, updating arbitrary numbers of other components in the UI when a value changes is done using events. It’s simple, but powerful. When you return a response, add an HX-Trigger response header. That’s as simple as doing this in your view function:

response = ...
response["HX-Trigger"] = "customerAdded"
return response

Then when HTMX handles the response on the client side, it will issue a “customerAdded” event. Anything listening for that event can then trigger an update to its own respective component. You can trigger however many events you need by comma-separating them.
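As a small sketch of the comma-separated case, a helper like the one below could set several events at once. The `add_hx_triggers` name and the `statsChanged` event are made up for illustration. On the listening side, an element can refresh itself with an attribute such as hx-trigger="customerAdded from:body" alongside its hx-get.

```python
def add_hx_triggers(response, *event_names: str):
    """Set the HX-Trigger header so htmx fires each named event on the client."""
    response["HX-Trigger"] = ", ".join(event_names)
    return response


# A plain dict stands in for the response here; Django's HttpResponse
# supports the same response["Header"] = value assignment.
response = add_hx_triggers({}, "customerAdded", "statsChanged")
print(response["HX-Trigger"])  # customerAdded, statsChanged
```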

Summary

We haven’t slain all the monsters yet, but some people have to great effect, and it’s worked marvelously for us up to this point. To uphold KISS, I’m sticking to this design until I have reason to believe otherwise. It is possible to adhere to good patterns without a framework, and it is not necessarily more difficult to build reasonably complex applications.

This is why you can’t ship products quickly.

Jake Duth — Sun, 16 Jul 2023 03:38:29 +0000

Why does it take so long to launch products? Building working MVPs in a matter of days doesn’t have to be a dream, and no, you don’t have to be the world’s fastest developer. You will quickly become a well-paid developer though.

I learned web dev a year ago and already make far more $ than I did after spending years as an Android developer. If I shared the dollar amount I make across all my projects, you probably wouldn’t believe me, but it’s easily found if you’re a good detective.

If you don’t get triggered, you might make an extra $100 over the next week using the principles discussed here.

Skill != speed.

The line of thinking that assumes this is not looking at the entire picture. To build the same thing feature for feature will take the average developer just as long as another average developer, so it’s a nice thought to assume that’s why you’re slower. The first thing that broke this myth for me, though, was when I was watching objectively lower-skilled developers completely crush me on timelines. After mending my ways over the last few months, I can honestly say that there’s a better way for a professional dev. If you know what you’re doing, there’s no reason you shouldn’t be able to deliver software faster than the newcomers to the field.

Ultimately it comes down to two trios. The first is Reuse, Leverage, and Value. Let’s look at them in reverse order.

Value

Not all features are created equal. Recently Dan Kulkov brought up on Twitter how getting a 100% PageSpeed Insights score is not how you bring value to your users. Business-savvy builders get this. Nerdy devs don’t. There’s nothing wrong with being a nerd, but not when it comes to thinking about what your customers really want. Unless you’re selling to other developers, pay attention to what really matters to the user’s outcome. Stop worrying about what is technically complex.

Leverage

Have you tried no code tools? Using GPT to debug an issue? Hiring off-shore talent? These are the sorts of things less technical founders are doing and they’re winning. They’re not worried about whether they’ll be respected by their peers for using TDD from day one. They traded Kubernetes for revenue. Why don’t you?

Reuse

The electrical engineering field figured this out way better than software engineers did. Finding the right software component to plug into your system without bloating it can be hard. Hardware documentation is far more thorough and pedantic than documentation for software. Even so, there’s more than likely a plethora of libraries that have done what you’re trying to build – and as hard as it is to believe, they probably did it better. Why not reuse other people’s code as heavily as you can? Bloat, you say? Who cares, you’re building an MVP. It’s supposed to be fast to build because realistically chances are slim it’s going to work out longer than a few months. If you find product-market fit, it’s not like your market will disappear if you hunker down and build it the “right” way later on.

The next trio is well known. There are three main aspects of an MVP we need to consider:

Time
Quality
Scope

Pick 2…well maybe just 1.

The golden key – mercilessly restricting the scope – will get you further than you might think. But that’s not all. To build an MVP, the quality also needs to be diminished. It hurts your pride, but optimizing software is premature at this point. Improving the UI needs to have a hard line that you’re not going to waste time crossing. To expand on time is to give yourself fewer attempts at the game. If it takes 100 tries to win, you’d better move faster.

In Practice

It took me 3 months to build the main features of GoClose, 48 hours to launch a version of PromptEditor, and about 10 hours to launch an early bird version of Find Market (what I’m calling it right now).

You aren’t going to build Photoshop that fast, but most indie projects could be 80% replicated in a week. Once you’ve built something from scratch in a specific stack, you get a lot faster. I can start a new Django project and get it deployed to Heroku with a custom domain setup in less than an hour.

Don’t waste time writing 20 authentication flows for each social media login. Copy/paste the username and password flow you built on your first app and then change the names in the view.

Cut features religiously. Just focus on one core piece. For Find Market, I decided the most useful thing (that people were already asking me for) was a database of the ideas I had. I didn’t remake a sortable, searchable table – I just used Bootstrap Table. This was my first time using it, so getting that right took up the largest part of the 10 hours tbh. That was the core thing and I wanted to make it as useful as possible.

Then it was just a matter of getting the database models set up. I use Django’s ORM to set up the database, so it’s pretty simple to just define all the fields you need (and I’ve been doing Django consistently for 6 months, so that took me all of 20 minutes to get right).

Next I wanted a way to focus on just the marketing ideas the individual user was working on. So I made another DB table to track “My Ideas” and added a new page to add/remove from that list. Also wanted a way to track work, so I added a few modal popups to handle adding time/money spent per marketing channel.

None of this is hard unless you’re trying to do it for the first time. This is where day 1 ended – besides screenshots + writing a bit of copy on Gumroad. Yes, I used Gumroad to collect payments to save even more time setting up all the Stripe logic.

The last part I needed was graphing. I wanted date range pickers to automatically update a line graph to show hours worked by day + money spent by day. I knew making my own range picker was out of the question given the constraints, so I looked around for packages. Most of them sucked, so then I looked in the browser source for a website that I liked the date picker on and was able to find a lesser-known package that looked awesome. That took probably an hour to set up.

For charts, Chart.js is simple and I’ve used it before. But honestly I think they look pretty bad without a good bit of customization. I decided this time to try Apex charts.

It could’ve been faster.

Knowing what I know now, GoClose could’ve had an MVP in a day. It wouldn’t have had a UI; it would just be a few Zapier calls to send email responses when it saw a new message. The defining feature would be that it understood the entire thread. If I couldn’t figure out how to make that part work in Zapier, I would’ve made a simple API to handle a quick IMAP query that could be accessed from Zapier. This would already be a huge improvement on all the Chrome plugins that just feed the latest email into a ChatGPT prompt.

Action taking

Stop, don’t close the page just yet. It’s time to actually take action. What good does reading this do if you don’t change your behavior?

As homework, identify what the #1 core feature is of what you’re working on. If you’re new to your tech stack, give yourself one week to get that as good as you can. If you’re experienced, launch it within the next two days.

LLM-Based Education Will Change the World

Jake Duth — Thu, 20 Apr 2023 01:26:54 +0000

Now…

If you’re here from Twitter, you’re probably asking:

What’s this graph about?

We’ll get there, hang with me for some context.

Disclaimer

First, let’s be clear.

This isn’t my idea, and this isn’t about me. I built something cool last night, but others have done exceedingly more in this space than I can even imagine. I’m thinking of Sal Khan, Jimmy Wales, and Larry Sanger. These guys have done more for the world’s education than perhaps anyone else has accomplished in history.

What I’m going to propose has to be a community effort to have any chance of working (and may get significant support from Wikipedia).

I hope many of us in the tech community can contribute and also that non-technical folks will in a very short time.

Bonus disclaimer: 50% of the thought behind this occurred in the last 24 hours. And a lot of this is probably wrong. It’s not well-researched (yet). Take it with a grain of salt. Actually don’t, call me out on Twitter so I can learn. I have a lot to learn, and your experience and knowledge helps.

My Idealistic Vision

I’m trying to sum up the vision in a sentence, here’s my current attempt:

Generative LLMs can make education costs approach $0 for everyone.

There are several aspects to this.

Generative
Cost
Everyone

We all know generative AI can answer questions when you’re learning something new. I would argue that if you’re an independent learner, GPT-4 could provide a better technical education than many current systems.

Unfortunately a lot of people don’t learn well that way. Many do better with structure imposed on them by someone else. What if generative AI could provide some of that structure?

Turns out it can. Here’s an example of it making a syllabus for a college course.

Interestingly, we can go in both directions from here.

What does a list of syllabi compose? A list of classes.

What is a syllabus broken down into? Weeks.

What does a list of classes make up? A degree.

What are weeks broken down into? Lectures, readings, and coursework.

Quick note: for brevity, I’m not going to include all the prompts. I’d be happy to share them with anyone who wants them of course, but I think the baseline concept should be pretty straightforward.

Now which of these can AI not generate a baseline version of?

Can it give me a list of degrees for a college? Yes.
Can it give me a list of courses needed for a degree program? Yes.
Can it take that list and make a list of syllabi for each one? Also yes!

Now we’re at the level of granularity most people use ChatGPT for all the time: a specific, niched-down prompt. At this point we can use the LLM itself, agents (e.g., from LangChain), or human data validation. This last part is key, more on that later.

Alright, so everyone knows what the graph is now? You see where this is going?

Here’s a zoomed in version if you’re curious. The left-most node asks GPT-3.5 to list 25 college majors, the middle one represents one of those majors and asks to build a syllabus. The right-hand one represents a syllabus and could be further broken down into weeks, sections, etc. The labels are all automatically generated if you’re wondering why they are the way they are.

That’s what I got to yesterday. I’ll save the details about the tool I used for a Twitter thread. That’s not really what this is about.

Is that exciting? Eh, not really by itself. It’s a for loop and a tree data structure. Something any CS major should be able to build in a day.
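To make the "for loop and a tree" point concrete, here is a minimal sketch in Python. Everything named here is an assumption for illustration: the `LEVELS` hierarchy, the `Node` class, and `fake_generate`, which stands in for a real LLM call so the example runs offline. It is not the tool mentioned above.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    label: str
    level: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical hierarchy, loosely following the breakdown described above.
LEVELS = ["college", "degree", "course", "week"]


def fake_generate(parent_label: str, child_level: str, n: int = 2) -> List[str]:
    """Stand-in for an LLM call (assumption): returns n placeholder labels."""
    return [f"{parent_label} / {child_level} {i + 1}" for i in range(n)]


def expand(node: Node, generate: Callable[[str, str], List[str]] = fake_generate) -> Node:
    """Recursively ask the generator for children until the last level is reached."""
    depth = LEVELS.index(node.level)
    if depth + 1 >= len(LEVELS):
        return node  # leaf level: nothing further to generate
    child_level = LEVELS[depth + 1]
    for label in generate(node.label, child_level):
        child = Node(label=label, level=child_level)
        node.children.append(expand(child, generate))
    return node


def count_nodes(node: Node) -> int:
    """Count this node plus all of its descendants."""
    return 1 + sum(count_nodes(child) for child in node.children)


root = expand(Node("Example College", "college"))
print(count_nodes(root))  # 1 + 2 + 4 + 8 = 15
```

Swapping `fake_generate` for a function that prompts a model and parses the list it returns would be the only change needed to produce real content, which is also the step where validation would have to happen.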

But humor me for a bit. Tell me I’m wrong if you know better.

Can this beat traditional education?

Note that I’m only familiar with traditional Western education, and I would love to hear thoughts from those outside that bubble. Where I live, you usually go to a classroom where a teacher imparts knowledge (hopefully) in a way that the average student in the class can absorb.

There are three problems with this:

Location
The teacher
The “average student”

Location is a big problem because if you didn’t grow up with location-access to a good school and couldn’t travel to a good college…

I think we can all finish that with a negative statement if we’re being honest.

Now I have a lot of respect for teachers, but let’s look at points 2 and 3 together. Teachers are human, and so are students. They both have learning and teaching patterns that can be changed a good bit, but not infinitely.

Thus, in the vast majority of classes, there are some students who can’t keep up. Often that’s because their learning style doesn’t match the teaching style, not just because they’re lazy.

Online learning solves the location problem easily (to a degree; we’ll get to this). It also partly eases the teaching/learning style mismatch thanks to the sheer number of styles currently available.

But having to jump around between “teachers” is not conducive to learning.

AI can be run behind a consistent interface, so you can focus on learning the material rather than on the teacher or their style. It can also:

Generate content for a variety of learning styles on the fly.
Decrease the cost of content.
Make the content more interactive.

And no, I don’t mean just adding a chatbot to a pre-recorded class. That’s already been done.

I’m thinking more like what @synthesischool is doing for elementary education. For example, they build thinking games for kids to play that supplement more traditional assignments. Not only are the games fun, but I imagine they also produce better results.

This is the sort of education the world’s richest men choose for their children (e.g., Elon). What if education at that level of quality could reach those who don’t even have internet access?

AI can already make game development much easier, and we’re now seeing full 3D environment generation start to emerge. I would bet we’re about 3-6 months away from basic games being created in a matter of minutes with a nice, descriptive prompt (or series of prompts).

Here’s another example. @ykdojo shared with me how he’s thinking of building a less-gamified version for learning to code. I don’t think the “new devs” who are using ChatGPT to code are learning much. This could help fix that.

I’m thinking of doing something similar too

cc: @spacesdojo https://t.co/6kF6ngSmtd
— YK aka CS Dojo (@ykdojo) April 6, 2023

A note for any young developers reading this:

Learning to code isn’t easy (for the vast majority of people), so if you feel like it is, then you’re probably not learning as much as you should be.

This doesn’t mean you have to bang your head against the wall 24/7.

What about the 35% without internet?

From a (very quick) Google search, I’m seeing that about 35% of the world is without internet and roughly 10% is without a phone. If nearly everyone without a phone is also without internet, that leaves roughly 25% who have a phone but no connection. That’s an easy 25% win if @internetactvsm can roll something out globally that has sufficient distributed throughput.

It wouldn’t even have to be much and wouldn’t have to be fast. A user could request a topic or “degree” if you will. Then the underlying data would get transferred to them over time and cached on their phone. The mobile app on their end would then recreate the modules from all the packets it received. This is a different use case, but almost exactly what IA has been working on lately (they’re also the guys who made the COVID tracker that had over 600 million users).
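As a rough illustration of the "transfer over time, cache, then recreate" idea, here is a minimal Python sketch. The `Packet` format, `split_module`, and `ModuleCache` are all hypothetical names invented for this example; they are not Internet Activism’s protocol, and a real system would also need integrity checks, persistence, and retransmission.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Packet:
    module_id: str   # which lesson module this chunk belongs to
    seq: int         # position of the chunk within the module
    total: int       # number of chunks in the whole module
    payload: bytes


def split_module(module_id: str, text: str, chunk_size: int = 32) -> List[Packet]:
    """Sender side: cut a text module into small, numbered packets."""
    data = text.encode("utf-8")
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    return [Packet(module_id, seq, len(chunks), chunk) for seq, chunk in enumerate(chunks)]


class ModuleCache:
    """Receiver side: cache chunks in any order and rebuild once complete."""

    def __init__(self) -> None:
        self._chunks: Dict[str, Dict[int, bytes]] = {}

    def receive(self, packet: Packet) -> Optional[str]:
        received = self._chunks.setdefault(packet.module_id, {})
        received[packet.seq] = packet.payload  # a duplicate just overwrites itself
        if len(received) < packet.total:
            return None  # still waiting on more chunks
        data = b"".join(received[i] for i in range(packet.total))
        return data.decode("utf-8")


lesson = "Week 1: variables and types. Week 2: control flow. Week 3: functions."
packets = split_module("intro-programming", lesson, chunk_size=16)
random.Random(0).shuffle(packets)  # simulate out-of-order arrival

cache = ModuleCache()
result = None
for packet in packets + packets[:1]:  # the extra packet simulates a duplicate
    result = cache.receive(packet) or result

print(result == lesson)
```

Because each packet carries its own sequence number and total, the phone never needs a live connection to the sender: chunks can arrive hours apart, in any order, and the module appears once the last one lands.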

So while streaming video content may not be realistic, text formats at least can be sent, even over air gaps. I have personal experience with this as it’s a major part of what I do at my day job, and I know folks who are far smarter and more capable with these technologies than I am. Our flagship product has to robustly handle remote environments where data transfer is never as simple as a cellular data connection. However, I’ve never seen this tech applied to education. If Internet Activism doesn’t make it happen, someone else can and will.

What about the 10% without a phone? I honestly don’t know. I think education should be an option universally. There will likely always be exceptions, but this is a big group of people who deserve more thought.

The Accuracy Problem

That brings me to one of the biggest problems. I don’t believe we’ll have anything meaningful in the near future if we use ChatGPT live to generate educational material. Having conversations to explain things is great, but that needs to be constrained. There’s too much hallucination right now.

The solution? Wikipedia + Quora-style collaborative editing and improvement. GPT can structure a large, usable dataset very quickly. That doesn’t mean that it’s done. Ideally, SMEs at least skim through everything to check for accuracy and completeness.

The Bias Problem

Any global education platform, especially one that is based on an LLM, is exposed to bias issues. It’s a significant problem, but I’m going to hold my thoughts on it for the moment. We should never constrain an individual to the point that they lose self-expression, but at the same time a centralized information center for the globe carries significant risks. Distributing the development, storage, and retrieval of internal knowledge frameworks may help to reduce this issue.

Summary

I hope we can collaboratively build something that harnesses the power of LLM technology to educate ourselves better. A new era has just arrived, and it’s up to us to make the most of it.

I appreciate your desire to make a difference. I truly believe we can build something together that will change the world.

Thanks for reading.

– Jake

The post LLM-Based Education Will Change the World appeared first on Jake Dube.