The launch of gpt-4o-mini this past week sent me down the rabbit hole of the cost of intelligence. Fifteen cents to process a million input tokens, or ‘intelligence too cheap to meter’ as Sam Altman put it in a tweet. In March last year, Jim Fan of Nvidia fame tweeted that it cost him $4.30 to process all seven Harry Potter books, about 2 million tokens; that works out to roughly $2.15 per million tokens then, so the same corpus would now cost 30 cents to process with gpt-4o-mini. Intelligence just got cheaper by a factor of about 15 in a year. Some other comparisons put this number at 20 or even 30 times cheaper over a year.


Keep in mind that this is the cost of processing inputs to the model; output (inference) costs about 60 cents per million tokens for gpt-4o-mini, but last year that number was proportionately higher too. This is not just happening at OpenAI; it is a trend across all service providers billing for LLMs on a pay-per-use basis. Add to this the availability of open source models, where the weights are free and you pay only for compute from a cloud provider. Or you could pick a model small enough to host on your own computer. And today, Meta launched its latest models, Llama 3.1, with permissive open source licenses and the largest model competing with frontier models. Small models are also having their day in the sun, with a flurry of activity just this week and many more launches earlier this year. There are several reports of organizations preferring small-to-medium models fine-tuned to their specific use cases.
As firms experiment with more LLM applications, demand for inference grows, and once you put LLMs into production, costs grow very fast. Open source models and smaller models help reduce these costs. And if you can count on a 10-20x reduction in cost every year, you might not worry as much about inference costs from commercial providers. This trend of smaller, cheaper models matching the capabilities of frontier models released only a few months earlier is counterintuitive. It is driven not just by optimizations in training or inference methods, but also by better training data: synthetic data is often generated by bigger frontier models and used to train smaller models on higher-quality examples. There are even specialized models (such as Nvidia’s Nemotron) optimized for synthetic data generation, and such training is now possible on consumer-grade hardware as well. Another method that any one of us can try, without formal training pipelines, is to simply create examples using a frontier model and have a small model (such as gpt-4o-mini) use those examples as a guide, as in the sketch below.
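To make that concrete, here is a minimal sketch of this ‘frontier model writes the examples, small model does the work’ pattern using the OpenAI Python SDK. The task, prompts, and labels are all illustrative assumptions; only the SDK calls are real.

```python
# A minimal sketch of "poor man's distillation": have a frontier model write a few
# worked examples, then feed them to a small model as few-shot guidance.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TASK = "Classify a customer support ticket as BILLING, TECHNICAL, or OTHER."

# Step 1: ask the frontier model to generate labeled examples (illustrative task).
gen = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"Write 5 short example tickets for this task, each followed "
                   f"by the correct label on its own line.\nTask: {TASK}",
    }],
)
examples = gen.choices[0].message.content

# Step 2: use those examples as a guide for the much cheaper small model.
ticket = "I was charged twice for my subscription this month."
small = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": f"{TASK}\nFollow these examples:\n{examples}"},
        {"role": "user", "content": ticket},
    ],
)
print(small.choices[0].message.content)  # expected: BILLING
```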
Models are getting so small that they can now be embedded into browsers. Think AI in your browser, with no need for internet access and no cost to operate. Watch the video below to learn more about Google Gemini Nano on the desktop. It is also coming to Google’s flagship phones.
Between running frontier models in the cloud as a service and running local models on your laptop or in your browser, there are other options that get you mid-tier performance without buying your own GPUs. Open Neural Network Exchange (ONNX) is one such open-source framework that allows machine learning models to be exchanged between different deep learning frameworks. ONNX lets you run neural networks (the tech that underlies LLMs) on disparate hardware, in browsers, and even distributed across a network. Finally, you can supplement local intelligence with the cloud in a hybrid setup: let the local, on-device models do the easier work and call the more ‘intelligent’ models in the cloud only when needed, as in the sketch below.
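Here is a minimal sketch of that hybrid routing pattern. It assumes a local small model served through an OpenAI-compatible endpoint (Ollama’s default address and a llama3.1 model here, both assumptions) with gpt-4o as the cloud fallback; the escalation heuristic is deliberately naive and purely illustrative.

```python
# Hybrid local/cloud routing: try the cheap on-device model first, and escalate
# to the frontier model only when the local model signals it is unsure.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # assumed local server
cloud = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = ("Answer the question. If you are not confident in your answer, "
          "reply with exactly the single word ESCALATE.")

def ask(question: str) -> str:
    # Cheap first pass on the local small model.
    reply = local.chat.completions.create(
        model="llama3.1",
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": question}],
    ).choices[0].message.content.strip()

    # Fall back to the cloud frontier model only when the local model punts.
    if reply == "ESCALATE":
        reply = cloud.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": question}],
        ).choices[0].message.content
    return reply

print(ask("What is the capital of France?"))
```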
As intelligence gets cheaper, several new approaches can be used with small models to improve their performance. One common approach is to use many LLMs, or many calls to the same LLM, in parallel or sequentially, to overcome the limitations of a single model. Think of it as making a decision after consulting or debating with other people, or as a brainstorming exercise to come up with more creative solutions. Different LLM calls can instantiate ‘agents’ with different specialist roles, replicating how work gets done in an organization. And since these are digital personas, the approach scales, subject only to the cost of intelligence, which is getting cheaper by the day. An incomplete list of references for these approaches follows, with a small sketch of the pattern after it.
Marvin Minsky’s ‘society of mind’ approach becomes possible - https://arxiv.org/abs/2305.17066.
LLMs can act as judges to approximate human evaluations - https://arxiv.org/abs/2306.05685.
A recent effort by OpenAI used GPT-4 to find errors in code generated by GPT-4 itself, primarily to help human evaluators catch more errors.
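And here is a minimal sketch combining two of the ideas above: sample several answers from a cheap model, then have one more LLM call act as a judge to pick the best. Model names, prompts, and the answer-parsing logic are illustrative assumptions, not a reference implementation of any of the cited papers.

```python
# Ensemble-plus-judge pattern: n diverse samples from a small model, one judge call.
from openai import OpenAI

client = OpenAI()

def sample_answers(question: str, n: int = 3) -> list[str]:
    # n independent completions from the same small model; temperature adds diversity.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        temperature=1.0,
        n=n,  # the API returns n completions in a single call
    )
    return [c.message.content for c in resp.choices]

def judge(question: str, answers: list[str]) -> str:
    # One more LLM call acts as the judge and picks the best answer by number.
    numbered = "\n\n".join(f"Answer {i + 1}:\n{a}" for i, a in enumerate(answers))
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   f"Question: {question}\n\n{numbered}\n\n"
                   "Which answer is best? Reply with only its number."}],
    ).choices[0].message.content.strip()
    idx = int("".join(ch for ch in verdict if ch.isdigit()) or "1") - 1
    return answers[max(0, min(idx, len(answers) - 1))]

question = "A bat and a ball cost $1.10 total; the bat costs $1 more. Ball price?"
print(judge(question, sample_answers(question)))
```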
So what? All of the previous discussion showcases capabilities that are available now. Soon, you can imagine a world where your phone or laptop holds the world’s knowledge and can solve problems across many domains at the accuracy of an ‘average human’, but faster and cheaper, and without network connectivity. Add to that the availability of unlimited experts or ‘agents’ to do specialized tasks when needed. What does this unlock?
I don’t have the answers, but I’m working on it. Like, comment, share and subscribe to join me on this journey.