Quick Summary

AWS Bedrock pricing in 2026 ranges from roughly $100/month for lightweight, low-volume workloads to $5,000+/month once Bedrock Agents, Knowledge Bases, and high-throughput inference enter the picture. In this blog, we will walk you through every layer of AWS Bedrock costs, from on-demand token rates and vector storage minimums to agent token amplification and adjacent AWS charges.

Table of Contents

Introduction

Several CTOs end up spending 1.5-2x their initial Amazon Bedrock pricing estimates. The surprising part? The overrun doesn’t come from hidden fees; it comes from AWS Bedrock costs that are difficult to calculate upfront.

That’s the trap with AWS Bedrock pricing. The official Amazon Bedrock pricing page shows clean per-token numbers, but the actual bill tells a very different story.

The bill isn’t just one line item. On-demand inference, provisioned throughput, batch processing, and add-ons like Knowledge Bases and Guardrails all stack up independently, and most teams only connect the dots after the fact.

This guide walks you through how AWS Bedrock pricing actually works in real-world scenarios, what silently inflates your AWS Bedrock cost, and how to take control before it becomes a problem worth explaining to your finance team.

How Amazon Bedrock Pricing Works in 2026?

Amazon Bedrock charges you primarily for one factor: inference. The Amazon Bedrock cost structure is built on tokens processed by foundation models, with input tokens (your prompt, system instructions, and retrieved context) priced separately from output tokens (the model’s response).

Across nearly every model on the platform, output tokens cost three to five times more than input tokens. For instance, Claude Sonnet 4.5 runs $3.00 per million input tokens versus $15.00 per million output tokens, a 5x asymmetry that often matters more than the headline rate.

Understanding AWS Bedrock Token Pricing

A token is roughly 4 characters or three-quarters of a word. AWS Bedrock token pricing is quoted per million tokens for text models, per image for image models, and per minute for audio and video models.

Bedrock offers four pricing models, each with different economics:

  • On-Demand for variable workloads with no commitment
  • Batch Inference at 50% off for asynchronous bulk jobs
  • Provisioned Throughput for reserved hourly capacity at high volume
  • Cross-Region Inference for routing requests across regions without surcharge

On the free tier: Bedrock has no permanent free tier. Inference is pay-as-you-go from the first API call. New AWS accounts created after July 15, 2025, receive $200 in AWS credits ($100 on signup, $100 for completing guided activities) usable across 200+ AWS services, including Bedrock, expiring after six months. Treat this as an evaluation budget and not a long-term cost lever.

AWS Bedrock Pricing Models: 4 Tiers Explained

Picking the right tier inside the AWS Bedrock pricing models is the single highest-leverage decision in your AWS Bedrock cost structure. Here is how each works and when each wins.

1. On-Demand (Standard)

On-demand is the default pricing in the AWS Bedrock cost structure. You pay per token at the published rate with no capacity commitment and no minimum. Bedrock offers two service tier variants under the On-Demand cost: a Priority tier (roughly 75% over the Standard tier) for guaranteed performance, and a Flex tier (roughly 50% discount to the Standard tier) for higher-latency, best-effort delivery.

Best for: Variable workloads, prototypes, development environments, and anything with unpredictable traffic patterns.

Trade-off: AWS enforces throttling limits, so during peak demand, your requests may be rate-limited.

2. Batch Inference

Batch Inference accepts a JSONL file of prompts via Amazon S3, processes them asynchronously, and returns results to your S3 bucket within 24 hours. The discount is substantial: a flat 50% off On-Demand rates for supported models from Anthropic, Meta, Mistral, and Amazon.

Best for: Document summarization pipelines, content generation jobs, evaluation runs, embeddings backfills, and anything else that does not need a synchronous response.

Trade-off: Not real-time. Some models, notably DeepSeek, are excluded from Batch.

3. Provisioned Throughput

Provisioned Throughput reserves dedicated model capacity measured in Model Units (MUs). You pay an hourly rate regardless of actual usage in exchange for guaranteed throughput, no rate limiting, and predictable monthly spend.

Discounts of 15% to 40% are typical with 1-month or 6-month commitments. PT is also the only way to host fine-tuned or imported custom models on Bedrock.

Break-even rule of thumb: PT becomes economical at sustained capacity utilization of roughly 80-85%. Below 60%, on-demand or batch will almost always be cheaper.

Best for: Production workloads with predictable, steady volume on a single model and strict latency requirements.

Trade-off: Capacity wasted during low-traffic hours is capacity you still pay for. Negotiate enterprise PT pricing directly with your AWS account team.

4. Cross-Region Inference

Cross-Region Inference is a routing capability, not a separate billing mode. It lets a request originating in one region automatically use capacity in another region during traffic spikes. Pricing is calculated at the source region’s rate with no cross-region surcharge.

Best for: Production applications that need resilience against regional capacity constraints.

Trade-off: Reduced control over data locality and potential latency variability, depending on the ultimate served request.

AWS Bedrock Token Pricing: Model-by-Model Reference Table

Below are representative On-Demand rates as of early 2026 for popular models in us-east-1. Always verify current pricing on the official AWS Bedrock pricing page before forecasting.

Model Input ($/M tokens) Output ($/M tokens) Context Window
Claude Opus 4.7 $5.00 $25.00 200K
Claude Sonnet 4.6 $3.00 $15.00 200K (1M optional)
Claude Haiku 4.5 $1 $5.00 200K
GPT 5.4 $2.50 $15.00 1050K
Mistral Large 2 $2.00 $6.00 128K
Amazon Nova Pro $0.80 $3.20 128K
Amazon Nova Lite $0.06 $0.24 300K
Amazon Nova Micro $0.035 $0.14 128K
Amazon Titan Text Express $0.20 $0.60 8K
DeepSeek V3.2 $0.62 $1.85 128K

Nova Micro at $0.035 per million output tokens is roughly 143 times cheaper than Claude Opus 4.7 at $5.00 per million input tokens. For simple classification, structured extraction, or routing decisions, defaulting to a frontier model is one of the most expensive habits in production AI.

AWS Bedrock Hidden Costs: 4 Surfaces That Inflate Your Bill

Token rates are the visible part of the bill. These four AWS Bedrock hidden costs are where teams get blindsided.

Surface 1: Customization (Fine-Tuning, Distillation, and Model Storage)

Customizing a foundation model on Bedrock has three cost components:

  • Training: Charged per token processed during training, calculated as dataset size multiplied by epochs. Rates vary by base model and are listed in the AWS console.
  • Storage: Custom model weights are billed monthly at $0.02 to $0.10 per GB.
  • Inference: Custom text-generation models cannot run On-Demand. They require Provisioned Throughput, with at least one MU available without commitment, billed hourly.

Reinforcement Fine-Tuning (RFT) is billed at an hourly training rate. Model Distillation charges synthetic data generation at the teacher model’s On-Demand rate, plus customization pricing for the small model.

Teams that planned these charges upfront as part of an AWS migration services engagement typically budget 20-25% more accurately than those who discovered them post-launch.

Surface 2: AWS Bedrock Knowledge Bases Pricing and Vector Infrastructure

Knowledge Bases for Amazon Bedrock have no separate fee, but the underlying components are billed in full.

The cost everyone gets surprised by inside AWS Bedrock Knowledge Bases pricing: Amazon OpenSearch Serverless, the default vector store, has a minimum baseline of 2 OpenSearch Compute Units (OCUs) at $0.24 per OCU per hour, which works out to roughly $345 per month even with zero query traffic.

The 2026 alternative worth knowing: Amazon S3 Vectors, launched in December 2025, is up to 90% cheaper than OpenSearch Serverless and supports trillions of vectors with sub-second latency. For new Knowledge Bases, S3 Vectors should be your default unless you have a specific OpenSearch dependency.

Other retrieval costs: embedding model inference (input tokens only), Bedrock Data Automation for parsing at $0.010 per page, and Amazon Rerank 1.0 at $1.00 per 1,000 reranking queries.

Surface 3: Agentic Features (Agents, AgentCore, and Guardrails)

Bedrock Agents themselves do not charge a per-invocation fee. The trap is token amplification: a single user query can trigger multiple internal model calls (thinking, searching, tool calling, summarizing), and you pay for every intermediate step. Agent workflows commonly consume 4 to 8 times the tokens you would estimate from looking at the user’s prompt and the final response.

Amazon Bedrock AgentCore is now billed across five separate consumption-based layers:

  • AgentCore Runtime: Per-second vCPU and GB-hours for active resources
  • AgentCore Gateway: Per MCP operation (ListTools, CallTool, Ping) and indexed tools
  • AgentCore Memory: Per raw event for short-term memory, per record processed and retrieved for long-term memory
  • AgentCore Identity: Per OAuth token or API key issued for non-AWS resources
  • AgentCore Policy: Per authorization request, billed at $0.000025 per request

Guardrails charges per text unit (1,000 characters), with content filters and denied topics priced at $0.15 per 1,000 text units each. Bedrock Flows charges $0.035 per 1,000 node transitions, metered daily.

Surface 4: Adjacent AWS Charges

The line items most teams forget to budget:

  • CloudWatch Logging ($0.50 per GB ingested) for prompt and response logging, which compounds quickly with verbose system prompts
  • Amazon S3 for storing batch input and output, custom model artifacts, and Knowledge Base documents
  • AWS KMS for encrypting data at rest, billed per request and per key
  • VPC endpoints for private connectivity, billed hourly per endpoint
  • Data transfer between regions and to the public internet, billed at standard EC2 rates (including for AgentCore Runtime, Gateway, Code Interpreter, and Browser as of November 1, 2025)

On a production bill, these adjacent charges often run 20% to 35% of the inference cost itself. And, handling these costs is not just about working around your Bedrock setup, it involves the AWS services running alongside it. And, if you want a detailed roadmap on AWS cost optimization, you can refer to our blog.

Need help auditing your AWS Bedrock spend?

Bacancy’s certified AWS engineers run cost audits, architecture reviews, and prompt optimizations for production GenAI workloads. Hire AWS developers with Bedrock experience to cut your bill without compromising quality.

Comparing Foundation Models on Bedrock: Cost, Context, and Trade-offs

Claude on AWS Bedrock pricing matches Anthropic’s direct API pricing exactly across Opus 4.5, Sonnet 4.5, and Haiku 4.5. The advantage of running Claude through Bedrock is its governance: VPC isolation, IAM-based access control, KMS encryption, and a contractual guarantee that prompts and outputs are never used to train models.

For regulated workloads, that posture is often worth more than chasing a $0.20 per million token discount elsewhere.

Model Provider Ideal Workload
Anthropic High-stakes reasoning, agentic workflows, code generation
Meta Open-weight workloads, cost-sensitive reasoning
Mistral AI European data residency, multilingual workloads
Amazon High-volume tasks, classification, and embeddings, AWS native pipelines
DeepSeekCost-sensitive reasoning, coding tasks
OpenAIGPT-native apps on AWS, low-cost open-weight inference

AWS Bedrock Token Economics by Workload Pattern

The pricing model that wins depends less on the foundation model and more on where your tokens concentrate. Here are the 4 patterns we see most often, and where to look first when costs run hot.

1. Conversational Chatbot

Where spend concentrates: Input tokens. System prompts, persona instructions, safety rules, and conversation history accumulate across every turn. A chatbot handling 10,000 monthly active users on Claude Haiku 4.5 typically runs $200 to $800 per month in inference, with 60-80% of that bill coming from input.

Optimization lever: Prompt caching. The 1-hour cache TTL launched for Claude Sonnet 4.5, Haiku 4.5, and Opus 4.5 in January 2026 makes this dramatically more effective for stateful conversations. Cache reads cost roughly 90% less than standard input tokens.

2. RAG-Based Document Assistant

Where spend concentrates: Retrieval and long input contexts. Each query stuffs retrieved chunks into the prompt, often 5,000 to 20,000 tokens of context per request, and you pay full input rates on every call.

A typical mid-scale RAG assistant handling 50,000 monthly queries on Claude Sonnet 4.5 with a Knowledge Base will run $1,500 to $3,500 per month for inference, plus the OpenSearch Serverless floor.

Optimization levers: Retrieval reranking to send fewer but better chunks, context compression to shorten what gets passed in, and right-sizing the vector store (S3 Vectors over OpenSearch Serverless for new builds).

3. Agentic Workflow with Tool Calls

Where spending concentrates: A single user request that calls three tools in sequence will incur model invocations for the initial reasoning, each tool selection, each tool result interpretation, and the final synthesis.

Token consumption is typically 5 to 10 times the tokens visible in the user’s prompt and final answer. An agentic workflow handling 5,000 monthly runs on Claude Sonnet 4.5 with Agents and Guardrails commonly lands at $2,500 to $6,000 per month.

Optimization levers: Model cascade (route simple sub-tasks to cheaper models), loop budget caps on agent iteration depth, and structured output schemas to reduce verbose intermediate reasoning.

4. Batch Processing Pipeline

Where spend concentrates: Customization training, model storage, and the volume of input tokens at processing time. A nightly summarization pipeline on Llama 3.3 70B for 100,000 documents runs $200 to $600 per month at Batch rates, but a fine-tuned variant adds training and monthly storage costs on top.

Optimization levers: Distillation (smaller model trained from a frontier teacher), Batch mode (50% discount), and scheduled jobs aligned to business hours.

Building Your Own AWS Bedrock Cost Calculator

The fastest way to size a workload is a back-of-the-envelope AWS Bedrock cost calculator built from four inputs:

  • monthly request volume
  • average input tokens per request (including retrieved context and conversation history)
  • average output tokens per request, and
  • model choice

Multiply request volume by input tokens, divide by 1 million, and multiply by the model’s input rate. Repeat for output tokens. Sum the two.

For agentic workloads, multiply the result by 5 to 8 to account for token amplification. For RAG, add the OpenSearch Serverless or S3 Vectors floor. For Knowledge Bases at scale, layer in embedding model inference and reranking.

AWS Pricing Calculator handles the underlying services well, but does not yet model agent token amplification, so build that multiplier into your spreadsheet.

AWS Bedrock Best Practices for Cost Optimization: 8 Levers That Move the Needle

These cross-cutting AWS Bedrock best practices for cost optimization compound across workload patterns. Most teams already know one or two; the leverage comes from stacking them.

AWS Bedrock Best Practices for Cost Optimization

1. Cascade Routing

Route simple requests to a cheap model first. Escalate to a frontier model only when the cheaper response fails a confidence check. A typical implementation cuts blended cost-per-request by 40-60% without measurable quality loss on tasks like classification, simple Q&A, or routing decisions.

2. Prompt Caching

For repeated context (system prompts, RAG passages, few-shot examples), cached reads cost about 90% less than uncached input tokens. The first cache write costs roughly 25% above standard input rates, so the break-even sits at two cache reads within the TTL window. Track CacheReadInputTokens and CacheWriteInputTokens in CloudWatch to validate savings.

3. Intelligent Prompt Routing

Bedrock’s native Intelligent Prompt Routing automatically routes between models in the same family (for instance, Claude 3) based on prompt complexity. AWS reports cost reductions of up to 30% without quality loss, with no application changes required.

4. Batch Inference for Asynchronous Workloads

Anything not requiring a real-time response should default to Batch: nightly summarization, document enrichment, evaluation runs, embedding backfills, and content moderation queues. The discount is a clean 50% off on On-Demand.

5. Provisioned Throughput Above Break-Even

Move a workload to PT only when sustained utilization on a single model exceeds 80%. Below that, you are paying for idle capacity. Use the 1-month commitment first to validate the pattern before locking into a 6-month term, and negotiate directly with your AWS account team for enterprise discounts.

6. Model Distillation

Train a smaller model on outputs from a frontier teacher. It often delivers 90%+ of the quality at a fraction of the inference cost. Distillation is most effective for narrow, repetitive tasks like extraction, classification, and structured generation, and least effective for open-ended reasoning.

7. Prompt and Context Compression

Prompt compression trims unnecessary words from system instructions and few-shot examples (verbose prompts often have 30-40% input). Context compression summarizes retrieved chunks before they enter the prompt, useful for RAG pipelines where 60%+ of input tokens come from retrieval.

8. Inference-Level Cost Allocation Tags

You cannot optimize what you cannot see. Tag every Bedrock invocation by team, product, environment, and feature. AWS Cost Explorer with the Amazon Bedrock service filter, plus per-model CloudWatch metrics, gives you the granular view needed to identify which workloads are driving spend and which are quietly burning budget on the wrong model.

How AWS Bedrock Compares to OpenAI Direct, Azure OpenAI, and Vertex AI

DimensionAWS Bedrock OpenAI Direct Azure OpenAI Google Vertex AI
Per-token cost (Frontier) Matches Anthropic directly Lowest for the GPT community Slight Microsoft premium Competitive on Gemini
Model breadth 60+ models and 15+ providers OpenAI only OpenAI only Google + selected partners
Governance AWS, VPC, IAM, KMS native Vendor-managed Microsoft trust boundary Google trust boundary
Training-data guarantee No training on customer data No training required (API tier) No training required No training required
Enterprise contracting Single AWS contract Direct OpenAI contract Microsoft EA integration Google Cloud Contract
Ideal buyer AWS-native enterprises OpenAI-first AI teams Microsoft-stack enterprises Google Cloud customer and multimodal-heavy

For a detailed comparison between the two most common starting points, see our blog on Amazon Bedrock vs ChatGPT.

For teams evaluating cloud providers for generative AI workloads, the comparison goes beyond per-token rates. Governance posture, model breadth, data residency, and ecosystem fit usually matter more than $0.50 per million tokens.

When to choose which:

  • Choose AWS Bedrock when your data, compliance posture, and existing infrastructure live in AWS, and when model optionality matters more than picking a single vendor.
  • Choose OpenAI Direct for teams that have standardized on GPT and want the latest models the day they ship.
  • Choose Azure OpenAI for Microsoft-native enterprises where Office 365, Teams, and Active Directory integration are central.
  • Choose Google Vertex AI for Google Cloud customers and for multimodal workloads where Gemini’s native video, audio, and document handling are differentiators.

How Bacancy Helps Teams Manage AWS Bedrock Pricing and Costs?

We begin by understanding how your current AI workloads consume resources. It helps uncover where costs are increasing, whether due to high token usage, over-reliance on premium models, or inefficient API calls. Based on this, the team aligns your architecture with a more cost-efficient strategy.

Here’s how Bacancy actively helps reduce and control costs:

  • Usage Analysis & Cost Mapping: Identify high-cost models, regions, and workflows that impact your budget
  • Smart Model Selection: Match each use case with the most cost-effective model and pricing option
  • Token & Prompt Optimization: Refine prompts and responses to reduce unnecessary token consumption
  • Workflow Efficiency: Eliminate redundant API calls and introduce caching to avoid repeated processing
  • Cost Controls & Alerts: Set budgets, usage limits, and real-time alerts to prevent overspending

What sets this apart is that it doesn’t stop at a one-time audit. As your AI usage grows and pricing evolves, we keep refining the setup, so your costs stay predictable even as you scale.

Conclusion

AWS Bedrock pricing is a layered cost model where the per-token rate is often the smallest line item on a production bill. Knowledge Base infrastructure, agent token amplification, customization storage, and adjacent AWS charges routinely add 30% to 80% on top of the inference cost you projected.

The teams that get this right are the ones who plan for the total cost of ownership before they architect.

Get in touch with our AWS consulting services that help your teams with design, deployment, and optimization of generative AI apps on Amazon Bedrock. Our certified AWS engineers handle model selection, prompt and context optimization, FinOps tagging, and Bedrock architecture reviews so you can scale GenAI without scaling your AWS invoice.

Frequently Asked Questions (FAQs)

Both have different feature sets. For Anthropic models like Claude, AWS Bedrock pricing matches direct Anthropic API pricing exactly. For Meta Llama and other open-weight models, Bedrock can be 10-40% more expensive than alternatives like Groq.

No, Amazon Bedrock has no permanent free tier, and you pay from the first API call. New AWS accounts offer $200 in starter credits ($100 on signup, $100 for guided activities) usable across 200+ AWS services, including Bedrock. Credits expire after six months.

Amazon Nova Micro at $0.035 per million input tokens and $0.14 per million output tokens is among the cheapest text models. Google Gemma 3 4B and Mistral Voxtral Mini are also priced near the floor. For simple classification, routing, or structured extraction, these models deliver acceptable quality at a fraction of frontier-model cost.

Fine-tuning has three cost components: training (per token processed, calculated as dataset size multiplied by epochs), monthly model storage at $0.02 to $0.10 per GB, and inference. Custom text-generation models cannot run On-Demand and require Provisioned Throughput, billed hourly with optional 1-month or 6-month commitments.

No, Knowledge Bases for Amazon Bedrock have no separate fee, but the underlying vector store does. The default OpenSearch Serverless backend has a minimum AWS Bedrock Knowledge Bases pricing floor of roughly $345 per month (2 OCUs at $0.24 per OCU per hour), even with zero query traffic.

AWS provides a generic Pricing Calculator that supports Bedrock inference, but it does not currently model agent token amplification, prompt caching, or Knowledge Base retrieval costs end-to-end. A reliable AWS Bedrock cost calculator is best built as a spreadsheet that combines monthly request volume, average input and output tokens, model rates, agent multipliers, and vector store floors.

Mehul Budasna

Mehul Budasna

Director of Engineering at Bacancy

Cloud engineering leader optimizing scalable, secure, and cost-efficient cloud solutions.

MORE POSTS BY THE AUTHOR
SUBSCRIBE NEWSLETTER

Your Success Is Guaranteed !

We accelerate the release of digital product and guaranteed their success

We Use Slack, Jira & GitHub for Accurate Deployment and Effective Communication.