How do you compare AI model prices? Don't just look at the price per million tokens

When people first compare AI model prices, they usually look at a single number: a few dollars per million tokens. That number is not wrong, but it is easy to miss the point, because mainstream platforms have long priced their models on more than a single "price per 1M tokens".

OpenAI lists input, cached input, output, short context, long context, Batch, Flex, and regional processing separately. Anthropic lists base input, prompt caching, Batch, long context, fast mode, and regional pricing separately. Google Gemini's pricing page likewise lists input, output, context caching, storage, Grounding with Google Search/Maps, and Batch as separate items.

So if you want to know which model is more cost-effective, the right question is not "which is cheapest per million tokens" but "for my use case, which line items will make up the final bill?"

The conclusion first: a price comparison needs at least six checks

A practical comparison breaks the bill into at least these six parts:

First, are input and output priced separately?

Second, is there a separate price for cached input?

Third, is there a Batch discount?

Fourth, does long context jump to a higher rate?

Fifth, are there additional charges for search, grounding, tools, or multimodal input?

Sixth, do prices rise for particular endpoints, regions, modes, or third-party platforms? None of this is speculation: it is the structure the official pricing pages now state explicitly.

The price per million tokens is a starting point, not a conclusion

If you look only at that number, you will usually underestimate the true cost, because what actually inflates many bills is not the standard input price but output, caching, long context, search, or regional premiums.
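To see how the standard input line can be a small part of the total, here is a minimal Python sketch that splits a monthly bill into line items and reports each item's share. Every amount is a hypothetical placeholder, not a figure from any provider's price page.

```python
# Hypothetical monthly bill in USD; none of these figures come from a real price page.
bill = {
    "standard input": 120.0,
    "cached input": 15.0,
    "output": 540.0,
    "long-context premium": 90.0,
    "search / grounding": 175.0,
    "regional uplift": 60.0,
}


def shares(items: dict[str, float]) -> dict[str, float]:
    """Return each line item's share of the total, as a fraction between 0 and 1."""
    total = sum(items.values())
    return {name: amount / total for name, amount in items.items()}


# Print line items from largest to smallest share.
for name, share in sorted(shares(bill).items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} {share:6.1%}")
```

In this made-up bill the standard input line is 12% of the total; output, search, and premiums make up most of it.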

First: check input and output separately, not a single unit price

Almost all mainstream models now price input and output separately, and output is often several times more expensive than input.

OpenAI's official pricing page lists standard prices (per 1M tokens) of $0.75 input, $0.075 cached input, and $4.50 output for GPT-5.4 mini, and $0.20 input, $0.02 cached input, and $1.25 output for GPT-5.4 nano.

Anthropic also lists Claude input and output prices separately: Claude Sonnet 4.5 costs $3 input and $15 output, and Claude Haiku 4.5 costs $1 input and $5 output. Google's Gemini Developer API pricing page likewise lists input and output separately; for example, the paid tier of Gemini 3.1 Flash-Lite Preview is $0.25 for input and $1.50 for output.
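Because input and output are billed at different rates, the cost of a request depends on its token mix, not on one headline number. The Python sketch below uses the standard per-1M-token prices quoted in this article (which should be rechecked against the official pages) and a hypothetical request of 2,000 input tokens and 500 output tokens. The `PRICES` table and `request_cost` function are illustrative names, not part of any provider's SDK.

```python
# Standard (non-cached, non-Batch, short-context) prices in USD per 1M tokens,
# as quoted in this article. Check the official pages before relying on them.
PRICES = {
    "GPT-5.4 mini": {"input": 0.75, "output": 4.50},
    "GPT-5.4 nano": {"input": 0.20, "output": 1.25},
    "Claude Sonnet 4.5": {"input": 3.00, "output": 15.00},
    "Claude Haiku 4.5": {"input": 1.00, "output": 5.00},
    "Gemini 3.1 Flash-Lite Preview": {"input": 0.25, "output": 1.50},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in USD: input and output are billed at separate rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


# Hypothetical request: 2,000 input tokens, 500 output tokens.
for model in PRICES:
    print(f"{model:30s} ${request_cost(model, 2_000, 500):.6f}")
```

Even though this request sends four times as many input tokens as it receives, output accounts for more than half of the cost for every model in the table.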

For long-form generation, look at output first

If your application generates long articles, reports, or code, the output price often matters more than the input price, because the bill is driven less by how much you send than by how much the model generates.

For summarization, RAG, and knowledge-base Q&A, input structure matters more

Conversely, if you summarize large documents, run RAG or knowledge-base Q&A, or carry context across many conversation turns, the split between standard and cached input matters more. This is why a single "price per million tokens" figure is so easy to misread.
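The contrast between the two kinds of workload can be checked with a few lines of arithmetic. This Python sketch uses the Claude Sonnet 4.5 prices quoted in this article ($3 input, $15 output per 1M tokens); the two token mixes are hypothetical examples, not measurements.

```python
# Claude Sonnet 4.5 standard prices quoted in this article, USD per 1M tokens.
INPUT_PRICE, OUTPUT_PRICE = 3.00, 15.00


def cost_split(input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (input cost, output cost) in USD for one request."""
    return input_tokens * INPUT_PRICE / 1e6, output_tokens * OUTPUT_PRICE / 1e6


# Hypothetical token mixes for two kinds of workload.
workloads = {
    "long article generation": (1_500, 4_000),  # short brief, long answer
    "RAG over large documents": (60_000, 400),   # large retrieved context, short answer
}

for name, (tin, tout) in workloads.items():
    cin, cout = cost_split(tin, tout)
    share_out = cout / (cin + cout)
    print(f"{name:26s} input ${cin:.4f}  output ${cout:.4f}  output share {share_out:.0%}")
```

With these assumed mixes, output is over 90% of the article request's cost, while input is over 95% of the RAG request's cost, so each workload should be compared on a different column first.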

Second: cache pricing directly changes the effective unit price

Many people leave caching out of price comparisons entirely, yet it is one of the biggest sources of difference between bills. OpenAI's official documentation states that Prompt Caching can cut input token costs by up to 90%, down to as little as 10% of the standard price.
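To make "effective unit price" concrete, the following Python sketch blends the standard and cached input prices for GPT-5.4 mini quoted in this article ($0.75 and $0.075 per 1M tokens) according to a cache hit ratio. The hit ratio is an assumption you would need to measure from your own traffic, and `effective_input_price` is an illustrative name.

```python
def effective_input_price(base: float, cached: float, hit_ratio: float) -> float:
    """Blended input price per 1M tokens when `hit_ratio` of input tokens are cache hits.

    base:      standard input price (USD per 1M tokens)
    cached:    cached-input price (USD per 1M tokens)
    hit_ratio: fraction of input tokens served from cache, between 0 and 1
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return (1 - hit_ratio) * base + hit_ratio * cached


# GPT-5.4 mini prices quoted in this article: $0.75 input, $0.075 cached input.
for ratio in (0.0, 0.5, 0.8, 0.95):
    price = effective_input_price(0.75, 0.075, ratio)
    print(f"hit ratio {ratio:4.0%}: ${price:.4f} per 1M input tokens")
```

At an 80% hit ratio the effective input price falls from $0.75 to $0.21, which is why two models with similar list prices can end up far apart.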

Anthropic documents prompt-caching multipliers in more detail: cache writes and cache reads have different rates, and they stack with other pricing modifiers. Google Gemini splits context caching into two charges: a price for cached tokens and a storage price.
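A write rate and a read rate mean caching only pays off when a prefix is actually reused. The sketch below assumes multipliers of 1.25x the base input price for a cache write and 0.1x for a cache read, which is our reading of Anthropic's 5-minute cache pricing; treat both as assumptions and confirm them on the current page. It also assumes the cache stays warm between calls and does not model Gemini's separate storage charge. The workload is hypothetical.

```python
# Assumed multipliers relative to the base input price (Anthropic 5-minute cache,
# as we read the pricing page; verify before use).
CACHE_WRITE_MULT = 1.25
CACHE_READ_MULT = 0.10


def prefix_cost(base_price: float, prefix_tokens: int, requests: int, cached: bool) -> float:
    """USD spent on a shared prompt prefix across `requests` calls.

    Assumes the cache stays warm: the first call writes the prefix and every
    later call reads it. Output and non-shared input are ignored.
    """
    if requests < 1:
        raise ValueError("requests must be at least 1")
    per_token = base_price / 1_000_000
    if not cached:
        return requests * prefix_tokens * per_token
    write = prefix_tokens * per_token * CACHE_WRITE_MULT
    reads = (requests - 1) * prefix_tokens * per_token * CACHE_READ_MULT
    return write + reads


# Claude Sonnet 4.5 base input price quoted in this article: $3 per 1M tokens.
# Hypothetical workload: a 20,000-token system prompt reused across 100 calls.
print(f"no cache:   ${prefix_cost(3.00, 20_000, 100, cached=False):.4f}")
print(f"with cache: ${prefix_cost(3.00, 20_000, 100, cached=True):.4f}")
```

With a single call, caching costs more than not caching, because the write premium is never recovered; the savings come entirely from reuse.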

For applications that repeat large prompts, the standard input price is not enough

If your task repeatedly sends the same system prompt, fixed rules, large files, or long background context, caching will almost certainly affect the effective cost. In that case the meaningful comparison is not the standard input price but the effective price once caching is included.

Ignoring cache often leads to the wrong comparison

A model that looks more expensive on paper may end up cheaper if its cache pricing suits your workload better. This is also one reason many companies see large gaps between their trial estimates and their actual bills once they go into production.
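The following Python sketch shows how the ranking can flip. Both models and the request shape are hypothetical: model A has the higher list input price but a 90% cached-input discount, while model B has a lower list price and no cache discount.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    input_price: float   # USD per 1M standard input tokens
    cached_price: float  # USD per 1M cached input tokens (equal to input_price if no discount)
    output_price: float  # USD per 1M output tokens


def cost_per_request(m: Model, input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """USD per request; `cached_tokens` is the part of `input_tokens` billed at the cached rate."""
    uncached = input_tokens - cached_tokens
    return (uncached * m.input_price
            + cached_tokens * m.cached_price
            + output_tokens * m.output_price) / 1_000_000


# Hypothetical models: A looks more expensive but discounts cached input; B does not.
a = Model("A", input_price=2.00, cached_price=0.20, output_price=8.00)
b = Model("B", input_price=1.50, cached_price=1.50, output_price=8.00)

# Hypothetical request: 30,000 input tokens, 25,000 of them a repeated cached prefix, 500 output.
for m in (a, b):
    print(f"model {m.name}: ${cost_per_request(m, 30_000, 25_000, 500):.4f} per request")
```

Under these assumptions model A costs about $0.019 per request and model B about $0.049, even though A's list input price is a third higher.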

Third: Batch discounts are not a small difference; they are often half price

If your workload is not real-time customer service but offline processing, overnight batch jobs, bulk summarization, or large-scale evaluation, compare Batch prices separately.

OpenAI's Batch API documentation and pricing page state that Batch costs 50% less than the standard real-time API, and Google's Gemini Batch API documentation states that Batch costs 50% of the standard interactive API.

Anthropic’s pricing page lists Batch prices as lower than the standard price.

Real-time and offline prices can be far apart

For the same model, the effective price can differ by a factor of two between interactive and Batch modes. So if your process can tolerate asynchronous processing at all, do not compare on standard API prices alone.

For high-volume processing, the Batch price is often the one that matters

Examples include data preprocessing, bulk summarization, content generation, evaluation, and classification. If these tasks can accept slower completion times, Batch usually makes a very visible difference.
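Most real workloads are a mix of real-time and offline jobs, so the useful number is a blended cost. This sketch applies the 50% Batch discount that OpenAI and Google document to whatever share of a workload can run asynchronously; the $1,000 monthly figure and the shares are hypothetical, and Anthropic's Batch rate should be read from its own page.

```python
BATCH_DISCOUNT = 0.50  # OpenAI and Gemini document Batch at 50% of the standard price


def blended_cost(standard_cost: float, async_fraction: float,
                 discount: float = BATCH_DISCOUNT) -> float:
    """Total cost when `async_fraction` of a workload (by cost) moves to Batch.

    standard_cost:  what the whole workload would cost on the real-time API
    async_fraction: share of that workload that can tolerate asynchronous completion
    """
    if not 0.0 <= async_fraction <= 1.0:
        raise ValueError("async_fraction must be between 0 and 1")
    realtime = standard_cost * (1 - async_fraction)
    batch = standard_cost * async_fraction * (1 - discount)
    return realtime + batch


# Hypothetical $1,000/month workload with different shares of offline jobs.
for frac in (0.0, 0.4, 1.0):
    print(f"{frac:4.0%} async: ${blended_cost(1_000, frac):,.2f}")
```

Moving 40% of this hypothetical workload to Batch saves $200 a month; moving all of it halves the bill.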

Fourth: long context is not free, and some models jump in price

A model supporting long context does not mean long context is billed at the base price. OpenAI's official pricing page lists short-context and long-context prices for GPT-5.4 separately, with both input and output priced higher for long context.

Anthropic's pricing page states that premium long-context pricing applies to some Claude models when the 1M-token context beta is used and a request exceeds a certain number of input tokens. Google's Gemini pricing page likewise lists different prices by prompt length: for some models, input, output, and context-caching prices all rise once a prompt exceeds 200k tokens.
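A threshold makes the price a step function of prompt length. The Python sketch below uses the 200k-token threshold mentioned above with hypothetical rates, and it assumes that once the prompt crosses the threshold the whole request is billed at the long-context tier; each provider's exact rule should be checked before relying on this.

```python
from dataclasses import dataclass


@dataclass
class TieredPrice:
    threshold: int       # prompt length (tokens) above which the long-context tier applies
    short_input: float   # USD per 1M input tokens at or below the threshold
    short_output: float  # USD per 1M output tokens at or below the threshold
    long_input: float    # USD per 1M input tokens above the threshold
    long_output: float   # USD per 1M output tokens above the threshold


def tiered_cost(p: TieredPrice, prompt_tokens: int, output_tokens: int) -> float:
    """USD for one request, assuming the whole request moves to the long tier
    once the prompt exceeds the threshold."""
    if prompt_tokens > p.threshold:
        inp, out = p.long_input, p.long_output
    else:
        inp, out = p.short_input, p.short_output
    return (prompt_tokens * inp + output_tokens * out) / 1_000_000


# Hypothetical rates with a 200k-token threshold.
price = TieredPrice(threshold=200_000, short_input=1.25, short_output=10.0,
                    long_input=2.50, long_output=15.0)

for prompt in (150_000, 200_000, 210_000):
    print(f"{prompt:>7,} prompt tokens: ${tiered_cost(price, prompt, 1_000):.4f}")
```

Under these assumed rates, going from 200,000 to 210,000 prompt tokens (5% more) roughly doubles the cost of the request.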

Long context is both a capability and a price dimension

So if your workflow involves RAG, long-document summarization, legal document analysis, or conversations over a large knowledge base, treat context length as a price dimension, not just a capability indicator.

Many newcomers assume "supports long context" means no price increase

This is a very common misunderstanding. A careful comparison first confirms whether the price jumps beyond a given context length.

Fifth: tools, search, and grounding are even easier to overlook than tokens

Many people focus only on the token price and forget that some applications incur charges beyond tokens. Google's Gemini Developer API pricing page lists Grounding with Google Search and Grounding with Google Maps as separate billing items, charged per 1,000 search queries once the free allowance is exceeded. OpenAI's model and pricing pages likewise state that some tool-specific models are billed per tool call, not just by text tokens.
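For a search assistant, the per-query charge can outgrow the token charge. This Python sketch prices one hypothetical day of traffic: token rates are the Gemini 3.1 Flash-Lite Preview prices quoted in this article, while the free allowance of 1,500 queries and the $35 per 1,000 queries rate are hypothetical placeholders, not the current published values. It also assumes billable queries are charged pro rata rather than rounded up to whole blocks.

```python
def token_cost(requests: int, input_tokens: int, output_tokens: int,
               input_price: float, output_price: float) -> float:
    """USD for `requests` identical requests; prices are USD per 1M tokens."""
    return requests * (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def grounding_cost(queries: int, free_queries: int, price_per_1k: float) -> float:
    """USD for grounded search queries billed per 1,000 after a free allowance (pro rata)."""
    billable = max(0, queries - free_queries)
    return billable / 1_000 * price_per_1k


# Hypothetical day: 50,000 requests, each triggering one grounded search query.
tokens = token_cost(50_000, 3_000, 400, input_price=0.25, output_price=1.50)
search = grounding_cost(50_000, free_queries=1_500, price_per_1k=35.0)
print(f"tokens: ${tokens:,.2f}  search: ${search:,.2f}")
```

Under these assumptions the token bill is about $67.50 for the day while search grounding is about $1,697.50, so the per-million-token price barely registers.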

If you are building a search assistant, tokens alone are almost certainly not enough

Once search, grounding, and external tools are part of the workflow, the bill has more than one source. Focusing on a few dollars per million tokens then means overlooking what may be the largest costs.

Multimodal workloads cannot be compared on text-token logic alone

Some models also price audio, images, and video separately. The comparison unit itself may then differ, so the "per million tokens" column alone cannot settle the question.

Sixth: endpoints, regions, modes, and third-party platforms can each add a price difference

The same model can cost more depending on the endpoint, region, or mode. OpenAI's official pricing page states that regional processing endpoints carry a 10% uplift for some GPT-5.4 series models.

Anthropic's pricing page notes that fast mode, data residency, and other pricing modifiers can stack, while Google Cloud's Vertex AI pricing page says partner models on Vertex AI have their own managed API pricing.
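One simple way to account for endpoint and mode modifiers is to treat each as a multiplier on the standard cost. The sketch below assumes modifiers stack multiplicatively, which is a simplification; whether and how a particular pair combines is something to confirm on the provider's page. The 10% uplift comes from the OpenAI statement above; the $2,400 base and the 0.50 second modifier are hypothetical.

```python
from math import prod


def apply_modifiers(standard_cost: float, multipliers: list[float]) -> float:
    """Apply pricing modifiers as successive multipliers (assumed to stack multiplicatively)."""
    return standard_cost * prod(multipliers)


standard = 2_400.0  # hypothetical monthly cost on the standard endpoint
print(f"standard endpoint:        ${standard:,.2f}")
print(f"regional +10%:            ${apply_modifiers(standard, [1.10]):,.2f}")
print(f"regional + 0.50 modifier: ${apply_modifiers(standard, [1.10, 0.50]):,.2f}")
```

Comparing two quotes then means listing every modifier attached to each access method, not just the model name.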

The same model name does not mean the same price

When you compare prices, first confirm that you are comparing the same access method. The provider's own API, cloud-platform hosting, regional endpoints, and Priority/Fast modes are not necessarily priced the same.

Many enterprise miscalculations come from comparing different access methods

The model names match on the surface, but different regions, platforms, and modes can make the actual prices quite different.

Enterprises also need to look at throughput and rate limits, not just the price per request

For an enterprise or a high-traffic product, you also need to check whether the model can keep up, not just what the price list says. Anthropic's rate-limits documentation states that limits are managed as requests per minute (RPM), input tokens per minute (ITPM), and output tokens per minute (OTPM), that usage tiers are adjusted as spending passes certain thresholds, and that cached input affects how rate limits are counted in some cases. So even if two models have similar unit prices, one that makes better use of caching and is less likely to hit limits under your traffic pattern can be worth far more commercially.
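Whichever limit binds first sets your real throughput. This Python sketch computes the maximum requests per minute under RPM, ITPM, and OTPM limits; the limit values and request shape are hypothetical, and whether cache-read tokens count toward ITPM is modeled as a switch because the documentation says this applies only in some cases.

```python
def max_requests_per_minute(rpm: int, itpm: int, otpm: int,
                            input_tokens: int, cached_tokens: int, output_tokens: int,
                            cached_counts_toward_itpm: bool) -> int:
    """Requests per minute allowed by whichever limit binds first.

    rpm/itpm/otpm: requests, input tokens, and output tokens allowed per minute.
    cached_counts_toward_itpm: whether cache-read tokens count against ITPM
    (an assumption to check against the provider's rate-limit docs).
    """
    counted_input = input_tokens if cached_counts_toward_itpm else input_tokens - cached_tokens
    limits = [rpm]
    if counted_input > 0:
        limits.append(itpm // counted_input)
    if output_tokens > 0:
        limits.append(otpm // output_tokens)
    return min(limits)


# Hypothetical limits and request shape (20,000 input tokens, 18,000 of them cached).
LIMITS = dict(rpm=4_000, itpm=400_000, otpm=80_000)
shape = dict(input_tokens=20_000, cached_tokens=18_000, output_tokens=300)

for counts in (True, False):
    n = max_requests_per_minute(**LIMITS, **shape, cached_counts_toward_itpm=counts)
    print(f"cached input counted toward ITPM={counts}: {n} requests/minute")
```

With these numbers, excluding cached reads from ITPM raises the ceiling from 20 to 200 requests per minute, a tenfold difference at the same unit price.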

Similar unit prices do not mean similar throughput

For high-traffic products, stable scaling, the structure of the rate limits, and whether caching helps sustain throughput can matter more than saving a few cents per request.

A mature comparison looks at price and scalability together

Companies are not buying a single request; they are buying the ability to run reliably at scale over time.

The least error-prone method for newcomers

The simplest and least error-prone method is to first divide your tasks into three categories.

High-frequency, standardized tasks

For tasks such as classification, summarization, and title generation, look at input, output, cache, and Batch first, because these tasks benefit most easily from caching or Batch.

Long-form articles, code, and reports

Look at the output price first, because the bill is usually driven not by input but by long outputs.

RAG, search assistants, and long-context analysis

Be sure to count long context, Grounding/Search, context caching, and storage together. Otherwise, all you will see is the ideal price, not the actual price.
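The three categories can be run through one rough estimator that combines the dimensions discussed above. Everything in this Python sketch is a simplifying assumption: the rates are hypothetical, the long-context multiplier is applied to all token charges, the Batch discount and regional uplift apply to tokens only, and search is billed per 1,000 queries without a free allowance. `PriceSheet`, `Request`, and `request_cost` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class PriceSheet:
    """Hypothetical price sheet. Token rates are USD per 1M tokens."""
    input_price: float
    cached_price: float
    output_price: float
    long_threshold: int = 200_000   # prompt tokens above which long-context rates apply
    long_multiplier: float = 2.0    # assumed multiplier on all token rates above the threshold
    batch_discount: float = 0.5     # fraction taken off token charges for Batch requests
    search_per_1k: float = 0.0      # USD per 1,000 grounded search queries
    region_uplift: float = 0.0      # e.g. 0.10 for +10%, applied to token charges only


@dataclass
class Request:
    input_tokens: int
    cached_tokens: int
    output_tokens: int
    searches: int = 0
    batch: bool = False


def request_cost(p: PriceSheet, r: Request) -> float:
    """USD for one request under the simplified assumptions noted on PriceSheet."""
    if not 0 <= r.cached_tokens <= r.input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    mult = p.long_multiplier if r.input_tokens > p.long_threshold else 1.0
    tokens = ((r.input_tokens - r.cached_tokens) * p.input_price
              + r.cached_tokens * p.cached_price
              + r.output_tokens * p.output_price) * mult / 1_000_000
    if r.batch:
        tokens *= 1 - p.batch_discount
    tokens *= 1 + p.region_uplift
    search = r.searches / 1_000 * p.search_per_1k
    return tokens + search


sheet = PriceSheet(input_price=1.00, cached_price=0.10, output_price=5.00,
                   search_per_1k=35.0, region_uplift=0.10)

# One hypothetical workload per task category: (request shape, requests per day).
workloads = {
    "classification via Batch": (Request(1_200, 1_000, 20, batch=True), 100_000),
    "report writing": (Request(2_000, 0, 6_000), 1_000),
    "RAG with search": (Request(250_000, 200_000, 800, searches=1), 2_000),
}

for name, (req, per_day) in workloads.items():
    print(f"{name:26s} ${request_cost(sheet, req) * per_day:,.2f} per day")
```

Under these made-up rates the RAG workload costs the most per day despite far fewer requests than classification, largely because of its long prompts, the long-context multiplier, and the per-query search charge.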

AI model price comparison really is not just about the price per million tokens. At a minimum, check how input and output are billed, whether caching is cheap, whether Batch can halve the price, whether long context triggers a price jump, whether tools or search cost extra, and whether particular endpoints or modes carry a premium. Miss just one of these and you may end up not with the cheapest model but with one that only looks cheap.

Why can't we just look at the price per million tokens?

Because mainstream models usually split billing into at least input, cached input, and output, and some add long-context, Batch, grounding, or regional pricing on top.

Which tasks should start with the output price?

For tasks such as long article generation, reports, and code generation, look at output first: the more content the model returns, the more likely output is to become the main expense.

Why does cache affect the effective price?

Because some platforms charge less for repeated input in large prompts. OpenAI and Anthropic both provide such mechanisms, and Google lists context caching and storage as separate charges.

When is Batch particularly important?

When your tasks can run asynchronously, such as overnight batch jobs, bulk summarization, data preprocessing, and large-scale evaluation, Batch usually cuts the effective cost substantially.

If a model supports long context, is long context billed at the base price?

Not necessarily. OpenAI, Anthropic, and Google all have rules that raise prices above a certain length, so context length is itself a price dimension.

How does this article differ from "Which AI model is cheaper?"

That article is aimed at beginners and focuses on classifying the use case first and then choosing a model; this article focuses on how to read a price list, so its topic is the billing structure rather than model selection.

Data sources and credibility statement

This article is based on the official pricing and feature documentation of mainstream model providers, chiefly OpenAI API Pricing, OpenAI Pricing Docs, Anthropic Claude Pricing, Anthropic Prompt Caching, Gemini Developer API Pricing, and the Gemini Batch API. It addresses how to compare AI model prices along six dimensions (input/output, cache, Batch, long context, grounding, and regional price increases) to help readers read a price list as a complete billing logic rather than a single column.

