Optimizing Inference Costs in the Cloud
Say you want to run a language model in the cloud on AWS. What instance type should you run it on?
Your decision can have a considerable impact on infrastructure costs, but it’s hard to make a good choice without knowing how instance costs trade off with model performance.
Luckily, SystemsLab can help you find the sweet spots in the tradeoff space.
We ran thousands of llama-bench benchmarks sweeping across a range of language models, quantizations, and instance types, including both GPU and CPU instances. The table below organizes all of that information in a single integrated view.
- Each row is an instance type
- Each column is a model
- Each cell shows the lowest dollar cost per million tokens for that model and instance across all of our tests (see the sketch below for how that cost is derived).
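To make the arithmetic behind each cell concrete, here is a minimal sketch, assuming you know an instance's hourly on-demand price and a sustained generation throughput measured by llama-bench. The function and parameter names here are ours for illustration, not part of SystemsLab.

```python
def cost_per_million_tokens(hourly_price_usd: float, tokens_per_sec: float) -> float:
    """Dollar cost to generate one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000

# e.g. a $1.00/hr instance sustaining 50 tokens/s works out to about $5.56 per million tokens
print(round(cost_per_million_tokens(1.00, 50.0), 2))  # 5.56
```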
We’ve also included controls that let you set minimum requirements on model performance based on your use case. As you adjust the sliders, the cost table updates automatically to show costs only for the models and instances that meet your requirements.
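As a rough illustration of what that filtering does, here is a sketch assuming each benchmark run is a record carrying the instance type, model, model size, measured throughput, and derived cost per million tokens. The record shape and function names are hypothetical, not SystemsLab's actual data model.

```python
from typing import Iterable, NamedTuple

class BenchRecord(NamedTuple):
    instance: str           # e.g. an EC2 instance type
    model: str               # model name + quantization
    params_billion: float    # model size
    tokens_per_sec: float    # measured generation speed
    cost_per_mtok: float     # derived dollar cost per million tokens

def cheapest_cells(records: Iterable[BenchRecord],
                   min_tokens_per_sec: float,
                   min_params_billion: float) -> dict[tuple[str, str], float]:
    """For each (instance, model) pair that meets the minimum requirements,
    keep the lowest cost per million tokens across all matching runs."""
    cells: dict[tuple[str, str], float] = {}
    for r in records:
        if r.tokens_per_sec < min_tokens_per_sec or r.params_billion < min_params_billion:
            continue
        key = (r.instance, r.model)
        cells[key] = min(cells.get(key, float("inf")), r.cost_per_mtok)
    return cells
```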
The lowest overall cost in the table is highlighted.