Quick Summary
Building RAG on Bedrock feels easy until you get hit with production traffic. Here are the five problems we faced building RAG on Bedrock for clients in insurance, wealth management, logistics SaaS, and legal, and how we fixed them.
Introduction
Building a basic RAG system on Amazon Bedrock is simple. Just connect a Knowledge Base to an Amazon S3 source, configure the ingestion pipeline using platform defaults, and route generation through a foundation model over the retrieved context. We do not spend much time on this part of a Bedrock engagement.
But as traffic grows, things get difficult. The document set moves past the small sample used in the demo environment and multiplies in production. Monthly query costs stop tracking user activity and start tracking document count, and the cost line on the dashboard keeps rising.
We have worked on multiple RAG on Bedrock projects. Here are the five problems we faced while building for clients in insurance, financial services, legal, and logistics.
Top 5 Problems We Faced Building RAG on Bedrock
Below, we cover five situations from our RAG on Bedrock builds where the system failed, nearly failed, or required a complete rebuild after the initial implementation had passed testing.
Problem 1: Default chunking broke on a mixed set of long policies and short FAQs
We were ingesting 12,000 insurance policy PDFs into a Bedrock Knowledge Base for an insurance client. The PDFs ranged from 3 pages to 180 pages. We used the default chunking (300 tokens with 20% overlap) and Claude 3 Sonnet for generation.
Two weeks in, our accuracy was around 53%, and we could not move it.
The retriever was pulling 300-token chunks from the middle of long policy documents. The clauses underwriters needed were getting split across two chunks. The half with the topic keywords ranked high in the top 5 results, but the half with the actual conditions ranked lower and dropped out, so the model only saw half of what it needed.
FAQ documents had the opposite problem. They were so short that the 300-token default cut them into single bullet points with no heading. Claude was writing answers anyway, filling in what wasn’t there.
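The clause-splitting failure described above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not Bedrock's actual chunker: it assumes plain fixed-size windows over token positions, and the 600-token document and the clause at tokens 200 to 320 are hypothetical values chosen for illustration.

```python
def chunk_spans(n_tokens, size, overlap_pct):
    """Return (start, end) token spans for fixed-size chunks with overlap."""
    if n_tokens <= 0:
        return []
    overlap = size * overlap_pct // 100
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than the chunk size")
    spans, start = [], 0
    while True:
        spans.append((start, min(start + size, n_tokens)))
        if start + size >= n_tokens:
            break
        start += step
    return spans


def clause_is_intact(spans, clause_start, clause_end):
    """True if at least one chunk contains the whole clause."""
    return any(s <= clause_start and clause_end <= e for s, e in spans)


# Hypothetical 600-token policy with a 120-token clause at tokens 200-320.
default_spans = chunk_spans(600, size=300, overlap_pct=20)
wider_spans = chunk_spans(600, size=300, overlap_pct=35)

print(default_spans, clause_is_intact(default_spans, 200, 320))
print(wider_spans, clause_is_intact(wider_spans, 200, 320))
```

In this toy case, no 300-token chunk at 20% overlap holds the whole clause, while at 35% overlap the second chunk does. Real documents will behave differently depending on where clauses fall, which is why testing against a representative sample matters.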
How we fixed it:
We split the document set across two data sources inside the same Knowledge Base. Each one got a different chunking strategy.
Long policy documents moved to hierarchical chunking, with section headings as boundaries and section titles carried into segment metadata. FAQ documents moved to semantic chunking. On dense legal text, we kept fixed-size chunking but raised the overlap from 20% to 35%.
After re-ingestion, accuracy on the same evaluation set improved by roughly 25 points.
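A sketch of how per-document-type chunking might be expressed when creating data sources. The field names follow the `vectorIngestionConfiguration` shape of the `bedrock-agent` `create_data_source` call as we understand it. The token limits, the breakpoint threshold, the document-type labels, and the helper name are assumptions for illustration, not the values used in the engagement.

```python
# Chunking settings keyed by document type. All numeric values except the
# 35% overlap (from the text above) are illustrative assumptions.
CHUNKING_BY_DOC_TYPE = {
    "policy": {
        "chunkingStrategy": "HIERARCHICAL",
        "hierarchicalChunkingConfiguration": {
            # Parent chunks first, then child chunks.
            "levelConfigurations": [{"maxTokens": 1500}, {"maxTokens": 300}],
            "overlapTokens": 60,
        },
    },
    "faq": {
        "chunkingStrategy": "SEMANTIC",
        "semanticChunkingConfiguration": {
            "maxTokens": 300,
            "bufferSize": 0,
            "breakpointPercentileThreshold": 95,
        },
    },
    "legal": {
        "chunkingStrategy": "FIXED_SIZE",
        "fixedSizeChunkingConfiguration": {"maxTokens": 300, "overlapPercentage": 35},
    },
}


def data_source_request(knowledge_base_id, name, bucket_arn, prefix, doc_type):
    """Build keyword arguments for one data source (hypothetical helper)."""
    return {
        "knowledgeBaseId": knowledge_base_id,
        "name": name,
        "dataSourceConfiguration": {
            "type": "S3",
            "s3Configuration": {"bucketArn": bucket_arn, "inclusionPrefixes": [prefix]},
        },
        "vectorIngestionConfiguration": {
            "chunkingConfiguration": CHUNKING_BY_DOC_TYPE[doc_type]
        },
    }


request = data_source_request(
    "KB_ID_PLACEHOLDER", "policies", "arn:aws:s3:::example-bucket", "policies/", "policy"
)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("bedrock-agent").create_data_source(**request)
```

Keeping the strategy in a lookup table makes it easy to route each S3 prefix to its own data source, so each document type can be re-ingested independently when its chunking settings change.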
Problem 2: Retrieval was working perfectly, but Claude was still hallucinating
We were building a client-facing assistant on Amazon Bedrock for a wealth management firm, covering fund performance, fee structures, and investment mandates from around 2,000 prospectuses and fact sheets. The architecture used a Bedrock Knowledge Base and the RetrieveAndGenerate API with Claude 3 Sonnet. Retrieval was tuned carefully: top-5 segments per query, a re-ranker for relevance, and evaluation scores showing consistent recall.
In one case, a client of the firm asked about a fund that was no longer in the prospectus set, and the system invented a performance history for it. Another client asked about a fund’s fee structure, which varied by share class. The system averaged the share classes into one number and returned that as the answer.
Retrieval was returning the correct segments for each query. The failure was at the next step, when Claude generated the final answer. The system prompt in the RetrieveAndGenerate call told Claude to “answer the question based on the provided documents,” but said nothing about what to do when the documents could not answer it. In that situation, Claude writes an answer anyway, using its training data to fill the gap.
How we fixed it:
We rewrote the system prompt in the RetrieveAndGenerate configuration to instruct Claude on how to handle questions that the retrieved segments cannot answer. We added a secondary Claude call through Bedrock’s InvokeModel API that scored how well the retrieved content actually answered the question. If the score came in below the threshold, the system blocked the primary response and told the user it could not answer from the available documents.
After the rollout, the hallucination rate on the evaluation set fell from the low double digits to under 3%.
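The two-part fix above can be sketched as a prompt template with an explicit refusal path plus a gate around the draft answer. This is a minimal sketch under stated assumptions: the prompt wording, the 0.7 threshold, and the function names are hypothetical, and `judge` stands in for whatever wraps the secondary InvokeModel call. The `$search_results$` placeholder is the one Bedrock Knowledge Base prompt templates use for retrieved content.

```python
REFUSAL = "I can't answer that from the available documents."

# Hypothetical wording; the production prompt is not reproduced here.
PROMPT_TEMPLATE = (
    "Answer the question using only the search results below.\n"
    "If the search results do not contain the answer, reply exactly:\n"
    f"{REFUSAL}\n"
    "Do not average, merge, or estimate values that differ by share class.\n\n"
    "$search_results$"
)


def generation_config(template=PROMPT_TEMPLATE):
    """Build the generationConfiguration block for RetrieveAndGenerate."""
    return {"promptTemplate": {"textPromptTemplate": template}}


def parse_score(judge_text):
    """Read a 0-1 score from the judge's reply; treat anything else as 0."""
    try:
        score = float(judge_text.strip())
    except ValueError:
        return 0.0
    return score if 0.0 <= score <= 1.0 else 0.0


def gated_answer(question, passages, draft_answer, judge, threshold=0.7):
    """Return the draft only if the judge rates the passages as sufficient."""
    judge_prompt = (
        "On a scale from 0 to 1, how completely do these passages answer the "
        "question? Reply with the number only.\n\n"
        f"Question: {question}\n\nPassages:\n" + "\n---\n".join(passages)
    )
    score = parse_score(judge(judge_prompt))
    return draft_answer if score >= threshold else REFUSAL
```

The gate fails closed: an unparseable or out-of-range judge reply is scored as 0, so the user sees the refusal rather than an unverified answer. In a real call, the output of `generation_config()` would sit under `knowledgeBaseConfiguration.generationConfiguration`.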
For engineering teams planning to build RAG on Bedrock, our AWS consulting services provide architecture review and implementation support.
Problem 3: No metadata filter, so every query searched the full document set
We were building a customer-facing assistant for a logistics SaaS provider on Amazon Bedrock to answer queries against shipment records, carrier contracts, and route documentation. New documents were being added for each client every month.
By the time the document count crossed 50,000, monthly query costs had reached the low five figures and were still climbing. Every query was running a similarity search against the full document set, regardless of which client or document category it was meant to cover. An automotive client asking about their own carrier SLAs was pulling segments from retail clients’ shipment records. Retrieval precision was dropping as the document count went up.
Every document already carried client ID, document type, and date range as metadata. But the query path was not using any of it.
How we fixed it:
We applied Bedrock Knowledge Base filter expressions to limit each query to a single client ID and a relevant document category, and made these filters required. We also mapped each user session to the correct client ID, so the filter is set automatically on every query.
Each query was now searching on the order of a hundred relevant documents instead of 50,000. Monthly costs dropped by roughly 80%, and retrieval precision improved as a result.
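A required metadata filter of this kind could look like the sketch below. The `andAll` and `equals` operators are part of the Knowledge Base retrieval filter syntax. The metadata keys `client_id` and `doc_type`, the session dictionary, and the helper names are assumptions for illustration.

```python
def client_filter(client_id, doc_category):
    """Build a filter that requires both client and category to match."""
    if not client_id or not doc_category:
        # Refuse to build an unfiltered query rather than search everything.
        raise ValueError("client_id and doc_category are both required")
    return {
        "andAll": [
            {"equals": {"key": "client_id", "value": client_id}},
            {"equals": {"key": "doc_type", "value": doc_category}},
        ]
    }


def retrieval_config(session, doc_category, top_k=5):
    """Derive the retrieval configuration from the user's session."""
    return {
        "vectorSearchConfiguration": {
            "numberOfResults": top_k,
            "filter": client_filter(session.get("client_id"), doc_category),
        }
    }


config = retrieval_config({"client_id": "acme-auto"}, "carrier_contract")
# With boto3 installed, this could be passed to the runtime client as:
#   boto3.client("bedrock-agent-runtime").retrieve(
#       knowledgeBaseId="...", retrievalQuery={"text": "..."},
#       retrievalConfiguration=config)
```

Raising on a missing client ID is the point of the design: a query with no filter is treated as a bug, not as a fallback to a full-corpus search.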
For a deeper look at how Bedrock pricing works and where costs accumulate, refer to our AWS Bedrock pricing guide.
Problem 4: One Knowledge Base per tenant hit the account limit at 80 customers
On the same logistics SaaS engagement, the architecture had a second problem waiting for us. The initial design used one Bedrock Knowledge Base per customer for strict isolation between tenants. New customers were being added every month, and the document set was growing for each customer.
At around 80 customers, the next customer onboarding failed at the provisioning step. We had hit Bedrock’s hard limit of 100 Knowledge Bases per account per region (the remaining 20 slots were taken by staging, development, and internal Knowledge Bases the team was using for other work). There was no path to add more customers without restructuring the architecture.
The pattern was also expensive. Each Knowledge Base held its own vector index in OpenSearch Serverless, and we were paying for storage and OCU capacity per tenant, whether the tenant queried the assistant or not.
How we fixed it:
We refactored to a single Knowledge Base over a shared OpenSearch Serverless collection, with tenant_id attached as a metadata field on every document at ingest time. Every RetrieveAndGenerate call now passes the user’s tenant_id as a mandatory filter expression before similarity search runs.
We added IAM-level access control so the API gateway resolves each user session to its tenant and injects the value into the call. Client code cannot pass an arbitrary tenant ID.
After the refactor, monthly costs fell by roughly 80%, and we could onboard new customers again without hitting the account limit.
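The shared-Knowledge-Base pattern has two halves: tagging documents at ingest and scoping every call at query time. A minimal sketch of both follows. Bedrock reads document metadata for S3 sources from a sidecar file named `<document>.metadata.json`. The `verified_claims` dictionary, the function names, and the example keys are hypothetical.

```python
import json


def metadata_sidecar(document_key, tenant_id):
    """Return the S3 key and JSON body of a document's metadata sidecar."""
    body = json.dumps({"metadataAttributes": {"tenant_id": tenant_id}})
    return f"{document_key}.metadata.json", body


def tenant_scoped_request(verified_claims, query, kb_id, model_arn):
    """Build RetrieveAndGenerate arguments scoped to the caller's tenant."""
    # The tenant comes from the authenticated session resolved server-side,
    # never from a value supplied in the request body.
    tenant_id = verified_claims["tenant_id"]
    return {
        "input": {"text": query},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {
                        "filter": {"equals": {"key": "tenant_id", "value": tenant_id}}
                    }
                },
            },
        },
    }


key, body = metadata_sidecar("tenants/t-042/contracts/sla.pdf", "t-042")
request = tenant_scoped_request(
    {"tenant_id": "t-042"}, "What is our carrier SLA?", "KB_ID", "MODEL_ARN"
)
```

Because `tenant_scoped_request` takes no tenant argument from the caller, there is no code path where client code can substitute another tenant's ID.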
Problem 5: Removing reranking saved a few seconds and broke complex legal queries
We were building a contract review assistant on Amazon Bedrock for a legal technology firm, covering precedent agreements, standard clauses, and jurisdiction-specific regulations. The architecture used a Bedrock Knowledge Base with the RetrieveAndGenerate API. The original design included a reranker model on top of vector retrieval. To cut latency before launch, the team pulled the reranker.
With this setup, simple queries continued to work. Multi-part questions, the kind lawyers actually ask, did not. A query like “What are the force majeure exceptions under New York law for software delivery delays, and how have they been interpreted post-2020?” required reasoning across four or five distinct document types. The retriever was returning the most semantically similar segments, which clustered around shallow keyword matches. The segments that actually contained the answer were sitting further down the result list and never reached the model.
Citation accuracy collapsed as a result. Lawyers were getting answers that cited regulations the retrieved segments did not support. For a legal product, that creates direct malpractice exposure for the firms using it. Within six weeks, the lawyers stopped trusting the system and were back to doing the research manually.
How we fixed it:
We reintroduced reranking through Bedrock’s rerank API, using Cohere Rerank 3 to reorder retrieved segments by relevance to the actual question. For complex queries, we routed the call to an asynchronous path, where the response returns via polling. We turned on citation links in the response so each cited source linked back to the segment that supported the claim.
Latency rose by a few seconds. Answer quality recovered, and the lawyers started using the platform again within a month.
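The reranking step can be sketched as a request builder and a function that reorders passages from the response. The field names follow the `bedrock-agent-runtime` Rerank request and response shape as we understand it and should be checked against the current API reference. The model ARN is left as a parameter, and the helper names and sample passages are hypothetical.

```python
def rerank_request(query, passages, model_arn, top_n=5):
    """Build keyword arguments for a rerank call over inline text passages."""
    return {
        "queries": [{"type": "TEXT", "textQuery": {"text": query}}],
        "sources": [
            {
                "type": "INLINE",
                "inlineDocumentSource": {
                    "type": "TEXT",
                    "textDocument": {"text": passage},
                },
            }
            for passage in passages
        ],
        "rerankingConfiguration": {
            "type": "BEDROCK_RERANKING_MODEL",
            "bedrockRerankingConfiguration": {
                "numberOfResults": min(top_n, len(passages)),
                "modelConfiguration": {"modelArn": model_arn},
            },
        },
    }


def apply_rerank(passages, response):
    """Reorder passages by the relevance scores in a rerank response."""
    ranked = sorted(
        response["results"], key=lambda r: r["relevanceScore"], reverse=True
    )
    return [passages[r["index"]] for r in ranked]


passages = ["keyword-heavy summary", "clause with the actual exception"]
# Stand-in for a real response, which would come from:
#   boto3.client("bedrock-agent-runtime").rerank(**rerank_request(...))
fake_response = {
    "results": [
        {"index": 1, "relevanceScore": 0.91},
        {"index": 0, "relevanceScore": 0.34},
    ]
}
print(apply_rerank(passages, fake_response))
```

A common arrangement is to retrieve more segments than the model will see and let the reranker choose the final few, so answers that vector search ranks lower still have a chance to reach the model.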
Results at a glance
The table below sums up the five RAG on Bedrock problems we faced in our client engagements. Each row covers the industry, the problem we faced, the solution we implemented, and the outcome that followed.
| Industry | Problem | Solution | Outcome |
|---|---|---|---|
| Insurance | Default chunking broke on a mixed set of long policies and short FAQs | Two data sources in one Knowledge Base, with hierarchical chunking for policies and semantic for FAQs | Accuracy up roughly 25 points |
| Wealth management | Retrieval worked, but the system prompt had no refusal path | Rewrote the prompt with refusal logic, added a sufficiency check via InvokeModel | Hallucination rate dropped to under 3% |
| Logistics SaaS | No metadata filter, so every query searched the full document set | Bedrock filter expressions limited each query by client ID and document category | Monthly query costs cut by roughly 80% |
| Logistics SaaS | One Knowledge Base per tenant hit Bedrock's 100-KB account limit at 80 customers | Single Knowledge Base with tenant_id metadata filtering and IAM-level session scoping | Onboarding ceiling removed, per-tenant costs eliminated |
| Legal Tech | Reranking pulled for latency, complex queries returned shallow matches with bad citations | Restored reranking via Cohere Rerank 3 on an async path | Answer quality recovered |
Conclusion
These are five of the many RAG on Bedrock problems we have handled, and each one changed how we approach the next.
From our experience, we have noticed a pattern. The Bedrock defaults are good enough to get a pilot project running. But the moment the system has to handle production document volumes, complex queries, or multiple tenants, things start breaking. The architecture decisions that get postponed are usually the ones that needed to be made on day one, and they get postponed because the defaults made everything look fine.
So, we changed our approach. We test the default chunking against a representative sample of the document set before we ingest anything at scale. We write refusal instructions into the system prompt before the first user query. We decide on the multi-tenancy pattern based on where the customer count will be in the coming years, not where it is at launch. We add metadata filtering to the query layer before the document count grows past a few thousand, and we keep reranking in the architecture from launch, with an asynchronous path for the queries that need it.
Whether you are planning a Bedrock RAG implementation or already running into one of these problems in production, our AWS developers can come in and work through it with your team.
Frequently Asked Questions (FAQs)
Running RAG on Bedrock involves three cost areas: token-based inference on the foundation model, the vector store behind the Knowledge Base, and add-ons like Guardrails or reranking. OpenSearch Serverless, the default vector store, runs roughly $345 per month in baseline capacity charges before any query traffic. For systems with light query traffic, that fixed cost often exceeds the inference cost.
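One plausible reading of the $345 figure is the minimum compute capacity an OpenSearch Serverless collection keeps running. The sketch below works through that arithmetic. The inputs are assumptions not stated in this article: a floor of 2 OCUs, a price of $0.24 per OCU-hour, and a 720-hour month. Actual prices vary by region and change over time.

```python
def monthly_baseline_cost(ocus=2, price_per_ocu_hour=0.24, hours=720):
    """Fixed monthly cost of keeping the minimum capacity running."""
    return ocus * price_per_ocu_hour * hours


baseline = monthly_baseline_cost()
print(f"Assumed baseline: ${baseline:.2f} per month")
```

Under these assumptions the fixed cost comes out near $345 per month, and it is incurred per collection whether or not anyone queries the assistant.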
Bedrock RAG Evaluation has been generally available since March 2025. It scores context relevance, coverage, correctness, completeness, and faithfulness using an LLM-as-a-judge model. For production RAG on Bedrock builds, pair it with a golden dataset of 100 to 200 question-answer pairs that match your actual query distribution. Run the evaluation on every ingestion cycle.
Claude is the most common generation model in RAG on Bedrock builds because of its strong reasoning over retrieved context. For embeddings, Titan v2 works well on general English content. Cohere Embed performs better on multilingual or domain-specific corpora like legal or medical text. Evaluate two or three options on a 50-pair sample from your actual document set before committing.
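A golden-dataset check can also run as a small local harness alongside the managed evaluation. This is a minimal sketch, not the Bedrock RAG Evaluation API: `answer_fn` and `judge_fn` are hypothetical callables standing in for the RAG pipeline and the judge, and the baseline and tolerance values are placeholders.

```python
def evaluate(golden_pairs, answer_fn, judge_fn):
    """Return the share of golden questions the pipeline answers correctly."""
    if not golden_pairs:
        raise ValueError("golden_pairs must not be empty")
    passed = sum(
        1
        for question, expected in golden_pairs
        if judge_fn(question, expected, answer_fn(question))
    )
    return passed / len(golden_pairs)


def no_regression(pass_rate, baseline, tolerance=0.02):
    """True if the pass rate has not dropped more than the tolerance."""
    return pass_rate >= baseline - tolerance


# Toy stand-ins so the sketch runs without any AWS calls.
golden = [("fee for class A?", "0.5%"), ("fee for class B?", "1.0%")]
answers = {"fee for class A?": "0.5%", "fee for class B?": "0.9%"}
rate = evaluate(
    golden,
    answer_fn=answers.get,
    judge_fn=lambda question, expected, actual: expected == actual,
)
print(rate, no_regression(rate, baseline=0.8))
```

Running a check like this after each ingestion cycle turns a silent quality drop into a failed step that someone has to look at.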