How do you check AI token usage? Which backend number matters most?

When people open an AI API backend for the first time, the most common reaction is not "this is so clear" but "why do so many numbers all look important?" Should you look at input, output, and cached tokens, or at spend, quota, and TPM?

The confusion is understandable. Mainstream platforms no longer show a single usage total. They split token usage, fees, caching, rate limits, and project quotas into separate dimensions.

OpenAI's new API Usage Dashboard shows usage and cost and supports inspecting TPM at 1-minute granularity; Anthropic manages spend limits, RPM, ITPM, and OTPM separately; Google Gemini likewise separates quota, system limits, input/output tokens, context caching, and storage.

If you want one sentence to remember first, the simplest version is:

For the bill, look at output first. For long conversations or knowledge bases, look at input and cache first. To find out whether the system is being throttled, check quota, RPM, and TPM first.

This article does not re-explain what input and output are. It addresses a more practical question: with so many numbers in the backend, which one is worth looking at first?

First, a clear distinction: the common backend numbers fall into 4 categories

The numbers you see most often in the backend fall roughly into four categories. Each category answers a different question, so they should not be lumped together.

The first category: Input Tokens

Input Tokens represent the content you send to the model. OpenAI's token documentation treats the tokens in the request as input; Anthropic's rate limits documentation tracks input tokens per minute as its own limit; Google Gemini's pricing page uses input tokens as one of its basic billing fields.

This number is most useful not for checking how much the model returned, but for checking:

Whether you are sending too much context

Whether the system prompt is bloated

Whether too many file fragments are packed in

Whether the conversation history keeps accumulating

In other words, input tokens answer the question: how much did you put in?

The second category: Output Tokens

Output Tokens represent the content the model returns to you. OpenAI's pricing pages for GPT-5.4, mini, and nano all show an output unit price higher than the input price; Anthropic's Sonnet 4.5 / 4.6 and Haiku 4.5 also price output higher than input.

This number usually affects the bill directly, because for many generative tasks the real cost is not in what you ask but in how much the model returns.

So the first thing to look at in the backend is usually not total tokens but output tokens.

The third category: Cached Tokens / Cache Rate / Cache Storage

These numbers show how much of your content is being reused, or how much storage and billing the cache itself accounts for.

OpenAI's pricing page lists cached input separately; Anthropic's pricing page separates Cache Writes from Cache Hits & Refreshes; Gemini lists context caching and storage prices separately.

These numbers are best suited to answering:

Whether you are actually saving on repeated content

Whether your workflow resends the same background every time

Whether the cache is having any effect

Whether the storage cost is worth it, even when the cache is useful

In other words, cache numbers are less about how much you use and more about whether you are using it wisely.
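As a rough way to put a number on "using it wisely", the sketch below computes a cache hit ratio and an estimated saving. It assumes cached tokens are reported as a subset of total input tokens, which may not match how every provider reports them, and the prices are placeholder values for illustration only. Cache storage fees are not included.

```python
def cache_hit_ratio(input_tokens, cached_tokens):
    """Fraction of input tokens served from cache (0.0 when there is no input)."""
    if input_tokens <= 0:
        return 0.0
    return cached_tokens / input_tokens


def cache_savings(cached_tokens, input_price_per_m, cached_price_per_m):
    """Dollars saved by paying the cached rate instead of the full input rate."""
    return cached_tokens / 1_000_000 * (input_price_per_m - cached_price_per_m)


# Hypothetical usage: 50,000 input tokens, 40,000 of them read from cache.
ratio = cache_hit_ratio(input_tokens=50_000, cached_tokens=40_000)

# Hypothetical prices per million tokens: $1.00 normal input, $0.10 cached input.
saved = cache_savings(40_000, input_price_per_m=1.0, cached_price_per_m=0.1)

print(f"cache hit ratio: {ratio:.0%}, estimated saving: ${saved:.4f}")
```

A consistently low ratio on a workflow that resends the same background is a hint that the reusable part is not being cached.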

The fourth category: Quota / Rate Limits / Spend

These numbers do not tell you how much this request will cost. They tell you whether you can keep running.

OpenAI's new Usage Dashboard shows usage data and supports TPM at 1-minute granularity; Anthropic's documentation clearly distinguishes spend limits, RPM, ITPM, and OTPM; Gemini's documentation treats quota and system limits as a separate layer of restrictions.

Keep in mind:

Having a balance does not mean you will not be rate limited

Having a monthly budget does not mean you will not hit a per-minute limit

Normal billing does not mean system throughput is normal

So these numbers answer the question: can the system hold up right now?

If you just want to know "which one is most important", start with this breakdown

When people ask "which backend number is the most important", they usually mean one of three things:

First, which one affects the bill the most.

Second, which one is most likely to get the system throttled.

Third, which one best shows whether I am wasting tokens right now.

These three questions do not share the same answer.

For the bill: output is usually the most important

For most text generation tasks, output tokens are often the most important cost number in the backend: a long reply means many tokens, and each output token usually carries a higher unit price. This structure is visible directly on the official pricing pages of OpenAI and Anthropic.
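To make the effect of the higher output price concrete, here is a minimal sketch that splits one request's cost into its input and output parts. The prices are placeholder values chosen for illustration, not any provider's actual rates, and the function name is hypothetical.

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Return (input_cost, output_cost) in dollars for a single request."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost, output_cost


# Hypothetical prices: $1 per million input tokens, $8 per million output tokens.
in_cost, out_cost = request_cost(
    input_tokens=2_000,
    output_tokens=1_500,
    input_price_per_m=1.0,
    output_price_per_m=8.0,
)
output_share = out_cost / (in_cost + out_cost)

print(f"input ${in_cost:.4f}, output ${out_cost:.4f}, output share {output_share:.0%}")
```

With these made-up numbers the reply is shorter than the prompt, yet it accounts for roughly 86% of the cost of the request.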

Which tasks should watch this first?

If your work is mostly generation-heavy, such as content generation, reports, or long replies, the first thing to look at in the backend is usually not total tokens but the output token count and the output unit price.

Many people assume that because their input is small, the bill should be small too, and then the bill turns out high. The reason is often that the model's replies are too long.

Knowledge bases, long files, RAG: input and cache may matter more

Output is not the most important number in every scenario.

For knowledge bases, long files, RAG, and automated workflows built on a large fixed prompt, the most important numbers are often input tokens and the cache-related numbers.

Gemini's official pricing lists input token count, cached tokens, and cached token storage duration as separate items; Anthropic also states that long-context rate limits are tied to the input side; OpenAI lists cached input separately.

Why this type of task is easy to misjudge

The typical situation is that the question you ask is very short, but behind it sits a large system prompt, conversation history, knowledge fragments, or PDF content.

As a result, what actually balloons in the backend is input, not output. This is why judging cost from the chat screen alone is often misleading. What you should look at is the usage shown in the backend.
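A small sketch of that breakdown may help. The part names and token counts below are invented for illustration; in practice you would obtain the counts from your platform's token counting tools.

```python
def input_breakdown(parts):
    """Return each part's share of the total input tokens."""
    total = sum(parts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in parts.items()}


# Hypothetical token counts for one RAG-style request.
parts = {
    "system_prompt": 1_200,
    "history": 3_000,
    "retrieved_chunks": 5_600,
    "user_question": 200,
}

shares = input_breakdown(parts)
for name, share in shares.items():
    print(f"{name:>16}: {share:.0%}")
```

In this made-up request the visible question is 2% of the input; the rest is context the user never sees on the chat screen.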

Checking whether the system is being throttled: TPM, RPM, and quota matter most

Many people clearly have balance left, yet find that the system slows down, rate limit errors appear, or some requests are blocked. At that point the most important numbers are not input or output but TPM, RPM, and quota.

Anthropic's official documentation defines three rates:

RPM = requests per minute

ITPM = input tokens per minute

OTPM = output tokens per minute

It also says that the API response headers return the current limit, the remaining allowance, and the reset time.

OpenAI provides a TPM view at 1-minute granularity in the new Usage Dashboard.
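As a sketch of how the response headers mentioned above could be read, the helper below pulls the limit, remaining, and reset values for one rate-limit family out of a headers mapping. The header naming pattern (`anthropic-ratelimit-<family>-limit` and so on) is my understanding of Anthropic's convention and should be checked against the current rate limits documentation; the sample values and the 10% warning threshold are made up.

```python
def read_rate_limit(headers, family="requests"):
    """Extract limit, remaining, and reset for one rate-limit family.

    The header names are assumed to follow the pattern
    'anthropic-ratelimit-<family>-limit|remaining|reset'.
    """
    prefix = f"anthropic-ratelimit-{family}"
    lowered = {name.lower(): value for name, value in headers.items()}

    def to_int(value):
        return int(value) if value is not None else None

    return {
        "limit": to_int(lowered.get(f"{prefix}-limit")),
        "remaining": to_int(lowered.get(f"{prefix}-remaining")),
        "reset": lowered.get(f"{prefix}-reset"),
    }


def is_near_limit(info, threshold=0.1):
    """True when the remaining allowance is at or below `threshold` of the limit."""
    if not info["limit"] or info["remaining"] is None:
        return False
    return info["remaining"] / info["limit"] <= threshold


# Made-up sample headers, not captured from a real response.
sample_headers = {
    "anthropic-ratelimit-requests-limit": "50",
    "anthropic-ratelimit-requests-remaining": "4",
    "anthropic-ratelimit-requests-reset": "2025-01-01T00:01:00Z",
}

info = read_rate_limit(sample_headers)
near = is_near_limit(info)
print(info, "near limit:", near)
```

Logging these values alongside each request gives you an early warning before requests start being rejected.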

Why these numbers differ from fees

Balance is an accounting concept, while TPM, RPM, and quota are throughput and limit concepts. Your bill may look perfectly normal while the system is throttled because tokens per minute are too high.

For production products this layer is critical. No matter how good the usage numbers look in the backend, once you hit a TPM or RPM limit, the live user experience can break immediately.
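One way to see the difference between spend and throughput is to track tokens per minute locally. The class below is a minimal sliding-window sketch; it is a local approximation only, since a provider's own accounting may use a different algorithm, and the `TPM_LIMIT` value is hypothetical.

```python
from collections import deque


class TokenRateTracker:
    """Track tokens used within a trailing time window (60 seconds by default)."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp_seconds, tokens), oldest first
        self.total = 0

    def record(self, timestamp, tokens):
        """Add one request's token count at the given timestamp."""
        self.events.append((timestamp, tokens))
        self.total += tokens
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that are window_seconds old or older.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, tokens = self.events.popleft()
            self.total -= tokens

    def tokens_in_window(self, now):
        """Tokens recorded within the trailing window ending at `now`."""
        self._evict(now)
        return self.total


TPM_LIMIT = 80_000  # hypothetical per-minute token limit

tracker = TokenRateTracker()
for ts, tokens in [(0, 30_000), (20, 30_000), (50, 30_000)]:
    tracker.record(ts, tokens)

current = tracker.tokens_in_window(now=50)
over_limit = current > TPM_LIMIT
print(f"tokens in last minute: {current}, over limit: {over_limit}")
```

Three modest requests inside one minute exceed the hypothetical limit here, even though their total cost would be small.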

If you use a thinking/reasoning model, don't just look at the "visible output"

This is a point many advanced users overlook.

For some models, the output recorded in the backend does not necessarily equal the output text you can see. Gemini's official pricing page explicitly labels output as including thinking tokens.

So if the backend shows more output than you expected, do not rush to conclude that the system is broken. In some cases the model has not said too much; thinking tokens are simply counted in the output cost as well.

The most important number here is still output, but you have to interpret it correctly.
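The arithmetic behind this is simple, as the sketch below shows. It assumes billed output is the sum of visible and thinking tokens; the counts are invented, and how each platform reports thinking tokens in its usage data varies.

```python
def billed_output(visible_tokens, thinking_tokens):
    """Output tokens that count toward cost when thinking tokens are billed as output."""
    return visible_tokens + thinking_tokens


def thinking_share(visible_tokens, thinking_tokens):
    """Fraction of billed output that the user never sees as reply text."""
    total = billed_output(visible_tokens, thinking_tokens)
    return thinking_tokens / total if total else 0.0


# Hypothetical request: a 300-token visible reply preceded by 1,700 thinking tokens.
total_output = billed_output(visible_tokens=300, thinking_tokens=1_700)
hidden = thinking_share(visible_tokens=300, thinking_tokens=1_700)

print(f"billed output: {total_output} tokens, hidden share: {hidden:.0%}")
```

A short visible reply with a large billed output is therefore consistent with a reasoning model doing its work, not necessarily a sign of a fault.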

Which three numbers should you look at first in the backend?

If you are new, I recommend looking at these three first:

First: output tokens

This is the number that most often corresponds directly to bill inflation, especially in content generation, reporting, and long-reply scenarios.

Second: input tokens or cached tokens

These tell you whether too much background information, history, or knowledge fragments are being resent every time.

Third: TPM / quota / rate limits

These tell you whether your system can run stably, not just whether you can afford it.

How do you judge whether you are "consuming normally" or "starting to waste"?

You can start with this simple set of criteria:

If output is consistently higher than you expect, you may be letting the model write replies that are too long.

If input is consistently high while users only ask short questions, you are bringing in too much context.

If the cache numbers are low, you may not be caching reusable content.

If TPM or quota is often close to the limit, the scale or pace of your system is starting to hit an operational bottleneck.
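These criteria can be turned into a rough checklist in code. The sketch below is only illustrative: every threshold (1.5x expected output, 10x the user message, 20% cache ratio, 80% of the TPM limit) is an arbitrary assumption you would tune for your own workload, and the finding names are hypothetical.

```python
def diagnose(avg_input, avg_output, expected_output, avg_user_message,
             cache_ratio, peak_tpm, tpm_limit):
    """Return a list of warning labels based on simple, tunable thresholds."""
    findings = []
    if avg_output > 1.5 * expected_output:
        findings.append("output_too_long")
    if avg_input > 10 * avg_user_message:
        findings.append("context_too_heavy")
    if cache_ratio < 0.2:
        findings.append("cache_underused")
    if peak_tpm > 0.8 * tpm_limit:
        findings.append("near_rate_limit")
    return findings


# Hypothetical averages taken from a usage export.
wasteful = diagnose(
    avg_input=12_000, avg_output=900, expected_output=500,
    avg_user_message=60, cache_ratio=0.05,
    peak_tpm=45_000, tpm_limit=50_000,
)
healthy = diagnose(
    avg_input=400, avg_output=450, expected_output=500,
    avg_user_message=60, cache_ratio=0.6,
    peak_tpm=10_000, tpm_limit=50_000,
)
print("wasteful workload:", wasteful)
print("healthy workload:", healthy)
```

The point is less the exact thresholds than checking each of the four dimensions separately instead of staring at one total.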

OpenAI, Anthropic, and Google all provide documentation on usage, pricing, rate limits, or token counting, so you do not need to rely on guesswork. The platforms already give you enough tools to make the call.

How should you read AI token usage? The key is not to focus on the total, but to first decide whether you are looking at cost, throughput, or waste.

For cost, look at output first. For long conversations, knowledge bases, and large prompts, look at input + cache first. To find out whether the system is being throttled, check TPM / RPM / quota first.

Once you follow this order, the backend numbers that seemed chaotic become much clearer.

Is total tokens the most important number in the backend?

Not necessarily. Total tokens tells you only the overall amount. It cannot tell you whether input is too high, output is too high, or caching is poorly set up. To judge cost and waste, break the total apart.

Why are the replies I see short while output tokens are high?

If you use reasoning / thinking features, the platform may count thinking tokens as billed output. Gemini's official pricing page explicitly states that output includes thinking tokens.

I still have balance, so why does the backend show that I hit a limit?

Balance is an accounting concept, while quota / TPM / RPM are throughput and platform restriction concepts. Anthropic's official rate limits documentation clearly separates spend limits from rate limits.

Why can a short question use so much input?

What actually enters the model is not necessarily just the user's sentence. It may also include system prompts, conversation history, search snippets, or long file content.

When is the cache most worth looking at?

Workflows involving knowledge base Q&A, fixed template processes, RAG, long conversations, or large amounts of repeated background are usually where cache-related numbers are most worth checking.

Data source and credibility statement

This article is compiled and written based on the official usage, pricing and limits documents of mainstream AI platforms, focusing on the following sources:

OpenAI｜API Usage Dashboard

OpenAI｜What are tokens and how to count them?

OpenAI｜API Pricing

Anthropic｜Rate limits

Anthropic｜Token counting

Anthropic｜Pricing

Google AI for Developers｜Gemini API pricing

This article is organized around three perspectives: "Backend Monitoring × Bill Interpretation × Traffic Limitation". The goal is not to help you memorize field names, but to give you an order of priorities when reading the backend. That way, whether you are an individual user, a content team, or running a production product, you are less likely to misread the numbers.

