How to recreate Navan’s receipt scan feature to retrieve the total amount

In this blog post, I’ll show how to perform image recognition to detect the “Total” from of a receipt.

Navan’s impressive receipt upload feature

I’ve been impressed for a while by Navan’s feature to scan a receipt, it feels magical to just upload an image and have your full receipt parsed and most of the time you just need to press “Submit for Review”.

If you don’t know what Navan is or what this feature is about, Navan is an app used to manage company expenses easily (there’s more but that’s what’s relevant here), the feature I’m talking about is when you are about to upload a receipt, you have 2 options:

You can manually enter all the information – very time consuming.
You can take a photo of your receipt – this is really fast and it saves you typing all the details yourself.

How does it work?

What I’m about to describe is how you can recreate this feature using a Machine Learning model called SmolDocling-256M-preview (this one)

We’ll focus on getting the Total amount of the receipt (we’ll skip merchant and date but you should be able to get those as well if you try). The Machine Learning algorithm that we are using is an Image-Text-Text model which takes as input an image and a text prompt, and it outputs plain text. This model has been pre-trained with millions of images and prompts to generate accurate results based on what the model sees in an image, it won’t be 100% perfect but it still can get very good results. Testing what kind of results you get with this specific model is part of the scope of this post.

Is this model big or accurate enough for the task?

I wasn’t fully sure if this model was going to generate good results at the beginning and this is because with the example image the model provides, I tried asking the model “How many fingers do you see in the picture?” and it responded “2.” when you can clearly see 4 fingers in the picture.

I decided to keep going still, since it was a fairly small model (256M parameters) that you can run on less powerful hardware. I ran all these tests locally in my Macbook M1 Pro.

My testing

I tested with 4 receipts and SmolDocling succeeded in all 4 after a few tweaks to the prompt I was using.

Example # 1

The first test that I tried was using a digital receipt from an email. Often I’d just screenshot an email in my phone and upload that to Navan as it is.

Here’s the image:

The query that I used is:

What is the total amount of the charge for this receipt?

The output that I got is:

The total amount of the charge for this receipt is $26.65.

Which is of course correct! ✅ 1/1.

You can adjust the query to get just the number or you can parse the number from the output string, that’s up to the developer to make a call for which option to go with to get the float number.

Example # 2

The second test I tried, this receipt is from probably my favorite clothing brand.

Here’s the image:

The query that I used is the same one as before:

What is the total amount of the charge for this receipt?

The output that I got is:

The total amount of the charge for this receipt is $70.32.

Which is correct as well, SmolDocling is looking good ✅ 2/2

Example # 3

The third test I tried was a receipt from Yard House, a restaurant I like.

Here’s the image:

If I used the same query as with the previous examples:

What is the total amount of the charge for this receipt?

The output that I got is:

 $27.35.

Which is not correct, it’s a value that’s quite close to the bottom of the receipt which may be what the model is looking for but the model is not able to find a “Total” string without a specific instruction.

After I adjusted the query to this:

What is the total amount of the charge for this receipt?  You will see it next to a text that says \"TOTAL\"

The output that I got then is:

$136.77.

Which is correct, I would count this example as also correct even though the prompt needed a tweak. 3/3 ✅

Example # 4

For the next example, I used a receipt from a local Asian supermarket here in Austin.

Here’s the image:

If I used the same first query I had:

What is the total amount of the charge for this receipt?

The output that I got is:

 114.12 AM.

Which is not correct, this time it tried to give me I guess part of a string that also was near the bottom of the receipt. Interesting result but not exactly what we want.

After I adjusted the query to the one that worked for the previous example:

What is the total amount of the charge for this receipt?  You will see it next to a text that says \"TOTAL\"

The output that I got is:

 $23.90

Which is correct, I’d also count this example as correct. 4/4 ✅

Example # 5 (Extra credit)

I wasn’t going to do this one but I went to HEB (a local supermarket) and got a receipt, and after some attempts to get the total from that receipt, this was the most surprising result.

Here’s the image:

When using the first query:

what is the total amount of the charge for this receipt?

The output that I got was:

The total charge for this receipt is $125

Which is incorrect of course, the correct total is $48.49. Then I tried the improved query.

what is the total amount of the charge for this receipt? You will see it next to a text that says \"TOTAL\" or TOTAL SALE.

The output was:

 $120.00.

Which is also incorrect. Then I tried another different prompt, trying to force the model to at least start giving me the correct $48.49 amount with any prompt that may do that:

what is the total amount of the charge for this receipt? You will see it next to a text that says **Total Sale**

The output then was:

 $12,000.

$12k?! I would count this one as fail, so the final accuracy of the model would be 4/5 which translates to 80% which is not amazing but the user would still get other pre-filled data correctly (the vendor for instance) and usually you verify the amounts are correct or in the right ballpark at least before submitting.

I may try to do a test with a larger model later on to see if I can get higher accuracy and to try to get an accurate amount out of this HEB receipt. The main visual difference I see is that this HEB receipt has the total in much smaller font compared to the previous Hana Market one that worked correctly so that may factor in inference being inaccurate.

Inference performance

How long does it take to do inference with this model? Here are the notes:

Clerk receipt - 2.1 seconds
Everlane receipt - 2.7 seconds
Yardhouse receipt - 2.1 seconds
Hana market receipt - 2.4 seconds
HEB receipt - 2.4 seconds

Overall not bad at all for a feature that saves you way more than 3 seconds when using it. And bear in mind this was using my local computer, if you use specialized hardware you will get significant boost in performance, the documentation mentions 0.35s average time per page.

Conclusion:

Can SmolDocling work for a feature similar to what Navan does? I wouldn’t confidently say you can use it in production without more testing, I’d try out a bigger model (maybe at least 512M params) and perform tests with it with various receipts to see what accuracy you can get with various different receipts. If you are trying to get any improvement at all in your product, an 80% prefill rate over 0% is still good but whether to use this model is still highly dependent on your expectations and your use case.

Will it always be accurate? No it won’t but you can still get decent results, I ran all 4 examples with the improved prompt from examples 3 and 4 and they all gave me the total accurately. Unfortunately the HEB receipt wasn’t accurate at all but still, 1/5 is not the end of the world, in the end it would depend on your specific product and how much better an ~80% accuracy is compared to having users typing this information manually.

Another thing to consider is that I didn’t do a ton of exploration, so there might be a prompt that works better and yields a higher accuracy.

If your team needs help with RAG or anything related to LLMs or Machine learning, checkout this page!