Comparison of OpenAI fine-tuned models: gpt-3.5-turbo vs. all the rest.
Fine-tuning allows training the AI to handle specific jobs. Some projects may leverage fine-tuning to teach the AI how to treat specific data.
Recently, OpenAI announced that the gpt-3.5-turbo model is available for fine-tuning. Until now, it offered four models for fine-tuning: Ada, Babbage, Curie and Davinci. It also announced that these models will be sunset in favour of gpt-3.5 and gpt-4. While gpt-3.5-turbo is now available, we can still use the older models, which are cheaper and, in some cases, can provide results as good as the more advanced (and expensive) models.
This article will compare prices and results from the fine-tuned legacy models and gpt-3.5-turbo. What’s essential to keep in mind is that gpt-3.5 introduces a new way of fine-tuning. It is just a couple of days old at the time of writing, so the results it provides may not be as promising as expected.
NOTE: Ada and Curie are no longer present on the OpenAI pricing page.
Introduction
In one of the projects I am working on, I faced the problem of determining whether a post from a car forum is valuable, i.e. whether it brings any value or is just the author’s opinion, a greeting or an encouragement. This non-deterministic problem couldn’t be solved with a traditional algorithm, and it’s an excellent task for AI.
The OpenAI API allows fine-tuning a model to handle a specific job. In short, we train the model on what output is expected for a given input, where the input can be fuzzy, but the output translates into a specific result. In my case, an example input would be a post like “Hey, that’s a great idea!”. Such content isn’t meaningful, and I need to skip such posts. Another example could be a post: “No, this isn’t right. Check spark plugs first”. Such a post is meaningful.
Training data
OpenAI’s documentation explains in detail how to prepare a data set. Following their instructions, I prepared a collection of 1,232 posts (1% of all posts in my database) and manually ranked each as Yes or No, depending on whether it is meaningful. Again, by meaningful, I mean a post that brings value, explains something or gives guidance. All posts with opinions, greetings, and other random thoughts are not meaningful (at least from my project’s point of view).
The sample data looks like the following:
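To make this concrete, here is a minimal illustration of what such training records could look like in the legacy prompt/completion JSONL format; the posts, separator and stop conventions below are illustrative, not my actual data set:

```jsonl
{"prompt": "Hey, that's a great idea!\n\n###\n\n", "completion": " No"}
{"prompt": "No, this isn't right. Check spark plugs first\n\n###\n\n", "completion": " Yes"}
{"prompt": "My engine stalls at idle after the timing belt swap - any hints?\n\n###\n\n", "completion": " Yes"}
```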
It took a couple of hours to rank posts manually. The result is that almost 54% of posts are not meaningful, while just 46% contain valuable information.
OpenAI’s documentation suggests a couple of hundred examples for Babbage and Davinci and 50–100 for gpt-3.5-turbo. I used the full set of 1,232 posts for training because the posts vary significantly, and I wanted to ensure the model was well-trained. For gpt-3.5-turbo, I used the first 100 examples and updated them to match the new training schema. The system prompt I used was: “You are an expert who can rank whether the given forum post is valuable from a technical point of view. If it is an opinion, digression, greeting, or encouragement, the post is not valuable, and you answer No. If the post contains a question, a problem, or a solution, then the post is valuable, and you answer Yes.”
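For illustration, a single gpt-3.5-turbo training record in the new chat-based JSONL schema looks roughly like this (wrapped over several lines for readability; in the actual file each record is one line, and the post content here is just one of the examples above):

```jsonl
{"messages": [
  {"role": "system", "content": "You are an expert who can rank whether the given forum post is valuable from a technical point of view. If it is an opinion, digression, greeting, or encouragement, the post is not valuable, and you answer No. If the post contains a question, a problem, or a solution, then the post is valuable, and you answer Yes."},
  {"role": "user", "content": "No, this isn't right. Check spark plugs first"},
  {"role": "assistant", "content": "Yes"}
]}
```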
Models overview
The training data was the same for all models except gpt-3.5-turbo, which requires a different structure. It introduces a system prompt that gives more context to the model. The main difference is that gpt-3.5-turbo requires only 50–100 training examples, whereas the older models are recommended to use several hundred examples.
This should not be confused with ease of preparing a data set. gpt-3.5 is more advanced and introduces a system prompt that gives more control over training the model. Where the old models are simple and the challenge was the number of examples, with the new model even a sufficient number of samples may not give the desired results if the system prompt is wrong.
Models cost comparison
All models are priced per thousand tokens, broken down into three categories: training, input and output.
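To give a feel for how this per-category pricing adds up, here is a rough Python sketch of a project cost estimate; the rates are placeholders rather than OpenAI’s actual prices, so substitute the current values from the pricing page:

```python
# Rough project cost estimate based on per-1K-token pricing.
# The rates below are placeholders, NOT OpenAI's actual prices --
# substitute the current values from the pricing page.
RATES_PER_1K = {
    "training": 0.008,  # hypothetical $/1K training tokens
    "input": 0.012,     # hypothetical $/1K input tokens
    "output": 0.016,    # hypothetical $/1K output tokens
}

def estimate_cost(training_tokens: int, input_tokens: int, output_tokens: int) -> float:
    """Return a rough total cost in USD: training plus usage."""
    return (
        training_tokens / 1000 * RATES_PER_1K["training"]
        + input_tokens / 1000 * RATES_PER_1K["input"]
        + output_tokens / 1000 * RATES_PER_1K["output"]
    )

# Example: 1,232 training posts at ~80 tokens each, then 73,000 posts
# ranked at ~80 input tokens and a single Yes/No output token apiece.
print(f"${estimate_cost(1_232 * 80, 73_000 * 80, 73_000 * 1):.2f}")
```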
There is a significant cost difference between Babbage and the other models, both in training and usage.
While gpt-3.5-turbo is slightly more expensive than Davinci for training, their usage costs are almost the same. gpt-3.5 requires less training data, so overall, it might end up costing about the same as Davinci in production use.
Overview comparison of fine-tuned model outputs
The chart below represents the overview summary of ranking a set of 73,000 posts using each of the models. Based on the training data, the expected distribution would be roughly 54% non-valuable posts and 46% valuable posts.
The models produce similarly distributed output. The Ada model seems to be more restrictive when it comes to ranking posts as valuable. While Babbage and Curie lean toward the expected distribution, Davinci and gpt-3.5 are slightly off, leaning towards an even split between valuable and non-valuable posts.
Detailed comparison: human vs. AI
Let’s look closer at the results. In general, all models deliver similarly distributed results, but a detailed comparison may reveal weaknesses of particular models. I’ve selected ten of the most challenging prompts and cross-compared them against every model (a sketch of how such a comparison can be scripted follows the mistake counts below):
Mistakes per model (the lower the value, the better):
Ada: 6
Babbage: 5
Curie: 3
Davinci: 4
GPT-3.5: 4
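Below is a minimal sketch of how such a cross-comparison can be scripted, assuming the pre-1.0 openai Python package that these models were used with; the fine-tuned model IDs are hypothetical placeholders, and the prompt separator mirrors the illustrative training format shown earlier:

```python
import openai

openai.api_key = "sk-..."  # your API key

# Hypothetical fine-tuned model IDs -- replace with your own.
LEGACY_MODELS = {
    "ada": "ada:ft-personal-2023-08-30",
    "babbage": "babbage:ft-personal-2023-08-30",
    "curie": "curie:ft-personal-2023-08-30",
    "davinci": "davinci:ft-personal-2023-08-30",
}
CHAT_MODEL = "ft:gpt-3.5-turbo-0613:personal::abc123"

# Abbreviated here; use the full system prompt quoted in the Training data section.
SYSTEM_PROMPT = (
    "You are an expert who can rank whether the given forum post is valuable "
    "from a technical point of view."
)

def rank_legacy(model: str, post: str) -> str:
    """Ask a legacy completion model for a single Yes/No token."""
    resp = openai.Completion.create(
        model=model,
        prompt=post + "\n\n###\n\n",  # same separator as in the training records
        max_tokens=1,
        temperature=0,
    )
    return resp["choices"][0]["text"].strip()

def rank_chat(post: str) -> str:
    """Ask the fine-tuned gpt-3.5-turbo model for a Yes/No answer."""
    resp = openai.ChatCompletion.create(
        model=CHAT_MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": post},
        ],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"].strip()

def count_mistakes(posts):
    """posts: list of (content, human_label) pairs; returns mistakes per model."""
    mistakes = {name: 0 for name in [*LEGACY_MODELS, "gpt-3.5"]}
    for content, human_label in posts:
        for name, model_id in LEGACY_MODELS.items():
            if rank_legacy(model_id, content).lower() != human_label.lower():
                mistakes[name] += 1
        if rank_chat(content).lower() != human_label.lower():
            mistakes["gpt-3.5"] += 1
    return mistakes
```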
As expected, Ada, the cheapest and fastest model, provides the worst results.
The second worst, Babbage, while providing results distributed similarly to the more advanced models, fails when dealing with ambiguous prompts.
While the Curie model isn’t supported anymore and is considered legacy, it performs better than Davinci. I am not covering performance tests in this article, but I would like to highlight that Curie also responded faster than Davinci.
Regarding gpt-3.5, the results could be better, but this is a brand-new tool, and I didn’t spend as much time fine-tuning it as I did with the previous models. I see potential in the flexibility of the training. While it didn’t show superiority over the older models, it is more flexible and provides a better interface for training, so with more work the older models shouldn’t be able to compete.
Summary
OpenAI keeps updating its models and adjusting prices. We can benefit from these changes by using the most powerful tools and getting our hands on the discounted models.
In 2024, we will say a final goodbye to the fantastic four, yet there’s still plenty of time to leverage them in some projects, keeping in mind that they will be sunset.
GPT-3.5 fine-tuning is something more than the previous models offered. Its flexibility expands beyond what they provide, yet it’s more challenging and requires more trial and error to truly leverage its potential. My simplified test showed that it cannot be approached the same way as the previous models. With more patience, it should best the other models.
The future is inevitable, and fine-tuning will move towards what gpt-3.5 fine-tuning looks like today.