We Made Our AI Agent Blind and Cut Costs 46x
7 min read


A walkthrough of how we built a multi-model AI architecture where Claude orchestrates and Gemini handles vision, cutting costs by 46x while improving reliability.

"That's it, our prototype works"

My co-founder Sugir had just finished creating the instructions for Charlie - our new price quote creation AI Agent. What we didn't know was that in its current form, it was doomed to fail.

Charlie's first steps

An example of the output of Charlie's prototype earlier this year

What Sugir had accomplished was already incredible: a system prompt that distilled 7 years of his experience into an AI assistant able to read a proposal for a trade show booth and pump out the line items of the corresponding price quote.

No prices were provided but 80% of the work was done. And the assistant gave results that were impressively good.

What would later become our agent Charlie was taking its first steps. And its first step was going from a PDF proposal to line items across all relevant categories (electrical, floors, walls, etc. - there are more than a dozen of them) in a matter of minutes.

Diagram: the original single-model workflow. The proposal (floor plans, 3D mockups, elevation plans) goes to Claude Sonnet. Step 1: generate the detailed line items - the complete list of line items and categories for the price quote (this is how far we had gotten). Step 2: generate accurate pricing from those line items - the final quote with all line items priced.

This got us the green light: nice, this is possible, AI is here.

The stumble

With this system prompt in hand, we went to test it out on more projects. Projects that had a bigger floor surface, more intricate architectural needs - essentially making the job of understanding how the floor plans are structured more difficult.

And after a couple of tests, we hit a problem: it seemed like for the same project, with the same instructions, the AI gave us different results every time.

This is not something you want to hear when you're building an AI agent destined for production which people will actually rely on.

So we had to figure something out, make this more reliable.

Can't give you a confidence score, boss

At the time, my intuition was that the limitation we were running into was the vision capacity of the model we were using: Claude Sonnet.

We had chosen Claude for its excellent instruction following. It was also the model that we felt was the most concise, professional, and predictable in terms of conversation.

But this model had one flaw: it didn't have good vision capacity. A look at the LMArena vision leaderboard made it clear Claude Sonnet wasn't shining in that department.

The first solution we went for was to ask it to give us a confidence score in what it was identifying:

"Identify the length of this trade show booth from this floor plan and tell us how confident you are"

We tested this in isolation, meaning we gave it just one floor plan and the simple prompt above.

The results were not consistent. Across multiple runs with the same input, we variously got:

  • correct length, confidence score high
  • wrong length, confidence score high
  • correct length, confidence score low
  • wrong length, confidence score low

👉 So it was clear we couldn't rely on this in our prompting.

Why can't it give a reliable confidence score?

Let's take a detour to understand why it can't give a reliable confidence score.

Models don't read images and PDFs like you and I do. When we send them, those images are broken apart - "tokenized" - into something the model can work with.

This process is a bit opaque: we don't know exactly how it's done, and it varies from company to company and model to model.

What is certain is that once the PDF is loaded, it becomes part of the model's context for the whole conversation.

For example, if the model isn't sure what's inside an image, it can't go back and "zoom in" on it the way a human would.

The vision capacity of a model is a combination of how images/PDFs are tokenized (broken apart) and the ability of the model to understand those tokens.

Vision capacity = how images/PDFs are tokenized + the ability of the model to understand those tokens

Once a model is rolled out and "trained", this capacity doesn't change (unless the tokenization process is modified, which is rare) - so a model is either good or bad at vision at birth.

When we ask the model for a confidence score on what it's viewing, it has no way of knowing how good its own vision is - no reference for what its top vision performance looks like, i.e. how clearly it would normally see things in a high-quality image.

This would be like asking a dog how confident it is that the toy in the image below is green or orange. It's never known anything else, so it will probably say it's green.

Difference between the vision of a human (left) and a dog (right)

Let's switch models

We knew we had to find a solution to increase the performance in vision.

By this time, Gemini 2.0 Pro offered good vision capacity, and Google was about to launch Gemini 2.5 Pro, which - we assumed - would be even better at vision.

So we tested our full prompt with Gemini 2.0 Pro and then with Gemini 2.5 Pro. It was much better at identifying elements on plans and 3D renderings.

BUT the flow and the adherence to instructions were much worse than Claude's. It wasn't following the proper formatting of the line item titles, and it was messing up the sequence of the workflow.

So it seemed like we had to choose between a near-sighted smart model (Claude Sonnet) and a less smart model with 20/20 vision (Gemini 2.5 Pro)

... or did we?

The solution: The best of both worlds

For reliability, we wanted the best of both worlds: the vision capacity of Gemini 2.5 Pro and the instruction adherence and conversational quality of Claude.

Which made me think: as humans, we use models all the time - we ask them questions.

Why not give Charlie the ability to use Gemini 2.5 Pro to view stuff?

Here is how it would go down:

  • User provides the PDF as a message to Charlie (Claude)
  • Charlie can't see well, so he asks Gemini to analyze the document/images on his behalf
  • Gemini provides its answer
  • Charlie gets the answer
  • Charlie continues the process

We made sure to properly write instructions for Gemini so it could actually understand what it was looking at.
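
Here is a minimal sketch of what such a loop can look like with the Anthropic and Google GenAI Python SDKs. This is not Charlie's actual implementation: the tool name analyze_document, the FILES registry, the system prompt, and the model IDs are illustrative assumptions.

```python
import anthropic
from google import genai
from google.genai import types

claude = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set
gemini = genai.Client()         # assumes GOOGLE_API_KEY is set

# Uploaded files live outside the Claude conversation, keyed by filename.
FILES: dict[str, bytes] = {}

# The only "eyes" Charlie gets: a tool that forwards questions to Gemini.
VISION_TOOL = {
    "name": "analyze_document",
    "description": "Ask the vision model a question about an uploaded file. "
                   "This is the only way to learn what a file contains.",
    "input_schema": {
        "type": "object",
        "properties": {
            "filename": {"type": "string"},
            "question": {"type": "string"},
        },
        "required": ["filename", "question"],
    },
}

def analyze_document(filename: str, question: str) -> str:
    """Have Gemini look at the file and answer in plain text for Claude."""
    response = gemini.models.generate_content(
        model="gemini-2.5-pro",
        contents=[
            types.Part.from_bytes(data=FILES[filename], mime_type="application/pdf"),
            question,  # in practice this carries detailed instructions on reading the plans
        ],
    )
    return response.text

def run_turn(messages: list[dict]) -> list[dict]:
    """One user turn: let Claude answer, resolving any vision tool calls on the way."""
    while True:
        response = claude.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=2048,
            system="You are Charlie. You cannot see files; use analyze_document to read them.",
            tools=[VISION_TOOL],
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            return messages  # Claude produced a normal answer for the user
        # Run each vision request on Gemini and hand the answers back to Claude.
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": analyze_document(**block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
```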

We tested and there it was.

The vision was reliable, and we kept the quality of Claude's conversation.

The spin: making Charlie completely blind

This system worked better than all of the previous attempts.

But sometimes Charlie was still trying to look at the documents itself without considering Gemini's output.

Plus, when a line item was wrong, it became difficult to tell whether it was Charlie's fault or Gemini's.

👉 So we stopped passing the files' data to Charlie altogether

When a file is uploaded, we intercept it so the document's data never reaches Charlie. What we do provide is something like: "The user has uploaded the file named: proposal.pdf, it's 3MB"

From then on, the only way for Charlie to know what the uploaded files contain is to ask the vision tool.
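
Concretely, the upload step can look something like the sketch below: the raw bytes are stored for the vision tool (the FILES registry from the earlier sketch), and Claude only ever receives a metadata stub. Names and helpers here are hypothetical.

```python
from pathlib import Path

def register_upload(path: str, files: dict[str, bytes]) -> str:
    """Keep the document's bytes for the vision tool; give Charlie only metadata."""
    data = Path(path).read_bytes()
    filename = Path(path).name
    files[filename] = data
    size_mb = len(data) / (1024 * 1024)
    # This stub is the only trace of the document in Charlie's context.
    return f"The user has uploaded the file named: {filename}, it's {size_mb:.0f}MB"

# Usage: the stub, not the file, goes into the Claude conversation.
# messages.append({"role": "user", "content": register_upload("proposal.pdf", FILES)})
```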

The many benefits of this approach

Hiding the data (in this case, the tokens) of user-uploaded files from Charlie had many benefits.

For reference, here are some data points on what a regular PDF can represent in tokens:

What?                            Token count on Claude    Token count for Gemini
35-page PDF                      60k tokens               10k tokens
Text instructions for Charlie    50k tokens               50k tokens
Note: Gemini is very efficient at tokenizing media (PDF, images, sound, etc...)
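
If you want to sanity-check numbers like these for your own documents, both APIs expose token-counting endpoints. A minimal sketch (the file name and model IDs are illustrative):

```python
import base64
from pathlib import Path

import anthropic
from google import genai
from google.genai import types

pdf_bytes = Path("proposal.pdf").read_bytes()  # hypothetical 35-page proposal

# How many tokens the PDF costs Claude (sent as a document content block).
claude = anthropic.Anthropic()
claude_count = claude.messages.count_tokens(
    model="claude-sonnet-4-20250514",
    messages=[{
        "role": "user",
        "content": [{
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": base64.standard_b64encode(pdf_bytes).decode(),
            },
        }],
    }],
)

# How many tokens the same PDF costs Gemini.
gemini = genai.Client()
gemini_count = gemini.models.count_tokens(
    model="gemini-2.5-pro",
    contents=[types.Part.from_bytes(data=pdf_bytes, mime_type="application/pdf")],
)

print("Claude:", claude_count.input_tokens, "tokens")
print("Gemini:", gemini_count.total_tokens, "tokens")
```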

1. Context pollution

As mentioned above, Claude can't decide to just "not look at the tokens of a file" that is in the conversation. If the file is in the conversation, Claude will consider its tokens. You can try as much prompting as you want: if you keep the file in the context, you'll never be sure Claude isn't using some information from those files. And we had already established that Claude is not as good at vision as Gemini is. So we wanted to avoid a case where Claude thinks it understood something from the plan and ignores what Gemini said.

By removing the tokens of those files altogether, we made sure they would never pollute the context.

2. Decreasing the risk of going over the context window

All models have a context window: once the conversation hits a certain number of total tokens, the oldest tokens are no longer considered. Most providers now throw an error when you try to send a message in a conversation that has gone over this limit.

So assuming a proposal PDF averages 60k tokens for Claude, that's already 30% of the 200k-token context window taken. This meant that if a conversation with a user ran longer than expected, it had a higher risk of hitting that limit.

Note: since we built Charlie, Anthropic has released a Claude with a 1-million-token context window, but the point remains valid.

3. The speed

This one is simple: fewer tokens at every message = faster responses.

The Gemini generation takes a bit of time, but that's more than made up for across the back-and-forth the user has with Charlie - which no longer has to carry those tokens around.

4. The cost

When a message is sent and Charlie answers, we are billed per input token and per output token. In this case, the input tokens are made up of Charlie's instructions + all messages in the conversation (including files).

Most of the time, users send the supporting documents early in the chat. This means that for a regular price quote conversation averaging 20 messages, the documents' tokens are carried along with every message - and billed at every message.

Now that Charlie is blind and only Gemini looks at the files, we calculated that for an average conversation of 20 messages, this setup reduces the cost of document processing by 46 times!
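
As a rough illustration using only the token counts from the table above (the conversation shape and number of vision calls below are assumptions, and the exact 46x figure also depends on the two models' per-token prices), here is the order of magnitude in document tokens alone:

```python
# Token counts from the table above; the conversation shape is illustrative.
CLAUDE_PDF_TOKENS = 60_000   # 35-page proposal as tokenized by Claude
GEMINI_PDF_TOKENS = 10_000   # the same PDF as tokenized by Gemini
MESSAGES = 20                # average messages in a price quote conversation
VISION_CALLS = 3             # assumed number of Gemini look-ups per conversation

# Before: the document's tokens are re-sent (and re-billed) on every Claude call.
document_tokens_before = CLAUDE_PDF_TOKENS * MESSAGES      # 1,200,000 tokens
# After: Claude never sees the document; only Gemini reads it, once per vision call.
document_tokens_after = GEMINI_PDF_TOKENS * VISION_CALLS   # 30,000 tokens

print(document_tokens_before // document_tokens_after)     # ~40x fewer document tokens
```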

The Bigger Lesson

Building Charlie taught us that the best AI architecture isn't about finding the perfect model - it's about orchestrating multiple models to do what they do best.

By making Charlie "blind" and giving him Gemini as his eyes, we didn't just solve a vision problem. We built a system that's 46x more cost-effective, faster, and more reliable than our original approach.

If you're building an AI Agent or some AI automation: don't force one model to do everything. Build systems where models help each other and can do what they do best.

The results will be more robust and you'll cut costs.

The question isn't "which AI model should I use?" It's "which models should work together?"

If you want to talk more about this, hit me up at will@willgyt.com