This is the story of a project I delivered in 5 weeks for a Barcelona fashion e-commerce store with 3,000 SKUs and roughly 200 customer inquiries per day. By the end, 70% of those inquiries were handled without a human touching the keyboard, and chatbot-assisted sessions converted 35% better than sessions where the bot stayed quiet. I want to walk through the actual architecture, the parts that broke in production, and the trade-offs I would make differently today.
If you are evaluating whether an AI chatbot makes sense for your store, or you want a reference architecture before you spec your own build, this is the write-up I wish I had when I started.
The business problem that justified the build
The store had two support agents working roughly 9:00 to 19:00 Spanish time. During peak sale windows their queue overflowed — a rough audit showed 40% of chats were abandoned because nobody responded within five minutes. Customers were bouncing to competitors. Hiring a third agent would have cost around EUR 2,200/month loaded, and still would not have covered evenings and weekends.
The inquiries were also repetitive: sizing questions, availability, "does this dress come in tall", "what colour goes with these jeans", "when will my order ship". Most of those answers lived somewhere — in the product catalog, in the returns policy, in the order database — but a human had to stitch them together each time.
That is the pattern that justifies building a real chatbot instead of a generic widget: your data already answers most questions, but a human is copy-pasting the answer. Automating that is not magic, it is just plumbing with a language model in the middle.
Why off-the-shelf tools were not enough
Before writing any code I ran the numbers on ready-made options. None of them fit cleanly:
- Shopify chatbot apps (Gorgias, Tidio with AI add-on): decent for generic FAQ, but they cannot reason over the real-time catalog without custom integration. Pricing tier for AI features started around EUR 250/month and scaled fast with message volume.
- ChatGPT-style embedded widgets: no grounding in the product catalog means confident-sounding hallucinations about stock and sizing, which is worse than no bot at all in retail.
- Dialogflow or Voiceflow: intent-based flows work for narrow, predictable journeys. They fall apart the moment a customer phrases something the designer did not predict, which in fashion is roughly every other message.
The decision point: a bot that is actually useful has to read the same database the shop reads, understand product attributes the way a store employee does, and be able to escalate when it is unsure. That meant a custom build on top of the existing Django stack.
The architecture I ended up shipping
The stack is deliberately boring. Every component has a Django-native equivalent or a well-understood library. Nothing in here requires a PhD to maintain six months from now.
- Django 4.2 as the application server (the store was already on Django).
- PostgreSQL with pgvector for product embeddings — no separate vector database, no extra moving part.
- OpenAI GPT-4o for the chat model, with
gpt-4o-miniused for cheap classification tasks (language detection, intent labelling). - Redis for session state and rate limiting.
- Celery for two background jobs: re-embedding the catalog when products change, and generating product descriptions in batch.
- Cloudflare in front for edge caching and basic abuse protection.
A request from the widget on the storefront flows like this: widget sends message to /chat/, Django looks up session history in Redis, runs an embedding search against the product catalog in Postgres, assembles a prompt with the top matches and conversation history, calls OpenAI, streams the answer back. Total round-trip on GPT-4o is about 2.5 seconds including the vector search, and under 1 second if I route the query through gpt-4o-mini.
Indexing 3,000 SKUs with pgvector
The biggest design decision was how to embed products. I tried three chunking strategies before settling on per-product "cards" of roughly 200 tokens each — title, description, key attributes (colour, size range, material, fit), and a few curated reviews if they existed. Anything longer diluted the embedding; anything shorter lost context.
class Product(models.Model):
title = models.CharField(max_length=200)
description = models.TextField()
attributes = models.JSONField(default=dict)
embedding = VectorField(dimensions=1536, null=True)
updated_at = models.DateTimeField(auto_now=True)
def as_embedding_card(self):
parts = [self.title, self.description]
for key in ("color", "size_range", "material", "fit"):
if val := self.attributes.get(key):
parts.append(f"{key}: {val}")
return "\n".join(parts)
Re-embedding is done on save via a Celery task, not inline, so the admin stays responsive. A full re-index of 3,000 SKUs costs around EUR 0.60 at current OpenAI rates and takes about four minutes.
Retrieval is a straight cosine similarity query in Postgres — no separate service. For a query like "linen shirt for summer wedding", the top 8 matches consistently surface the linen and cotton-linen blends, excluding pure cotton and synthetic blends. That kind of nuance is the difference between a bot that feels like it knows the shop and one that recites the SKU list.
Keeping conversations grounded
The single biggest source of early complaints was hallucinated stock. The bot would happily tell a customer that a dress was available in size M when it had sold out an hour earlier. I fixed this with two changes:
- Stock is never in the embedding, only in a fresh lookup at answer time. Embeddings describe what a product is, not its current state.
- The system prompt forbids claims about availability without a tool call. If the bot wants to say a size is in stock, it has to call a
get_stock(sku)function. OpenAI's function-calling API enforces this cleanly.
tools = [{
"type": "function",
"function": {
"name": "get_stock",
"description": "Return current stock for a SKU. Call this before confirming availability.",
"parameters": {
"type": "object",
"properties": {"sku": {"type": "string"}},
"required": ["sku"],
},
},
}]
The moment I added this, hallucinations about stock dropped to zero. The lesson generalises: anything the customer can verify later (price, availability, shipping cost) has to be a tool call, not a free-form generation.
The handoff — when the bot should shut up
The bot is not allowed to improvise on returns, chargebacks, damaged goods, or anything emotional. Those are routed to a human with the full conversation pre-loaded. I trained a small classifier using gpt-4o-mini that runs after every user message — it returns one of normal, escalate, or urgent, and the widget surfaces a "talk to a person" button accordingly.
The escalation prompt is short and explicit: "If the user mentions a damaged item, wrong order, refund, complaint, or expresses frustration, classify as escalate. If they mention fraud, chargeback, or legal threat, classify as urgent." One-line examples in the prompt cover the tricky cases. Total classification cost: about EUR 0.0002 per message, effectively free.
Three things that broke in production
1. Language detection was wrong 8% of the time. The shop has customers writing in Spanish, English, and Russian, often mixed in the same message ("Hola, у вас есть эта jacket в size M?"). My first-pass language detection ran on the first message only and stuck with that label. The fix was to re-classify on every message and pass the current detected language into the system prompt. Most of the weirdness disappeared.
2. The embeddings went stale silently. When a product's description was edited in the Django admin, the save signal fired, but if Celery was backed up, the re-embed could lag by 20+ minutes. During a catalog bulk-edit this caused the bot to recommend products based on old descriptions. Fix: a daily Celery Beat task that does a sanity-check sweep — comparing updated_at on the product versus the embedding's last-generated timestamp, re-embedding anything out of sync.
3. GPT-4o occasionally ignored the system prompt. Rare — maybe 1 in 300 messages — but it would slip into generic assistant mode ("I'm an AI language model..."). I added a post-generation filter that looks for obvious slip phrases and, if found, regenerates with a slightly stronger system prompt. Crude but effective.
The numbers at 5 weeks and at 6 months
- 70% of customer inquiries handled without human intervention.
- 35% higher conversion rate on chatbot-assisted sessions vs. unassisted.
- Average response time: under 3 seconds, down from a 12-minute peak-hour median.
- Support team reduced from 2 full-time to 1 part-time, saving around EUR 2,800/month after subtracting OpenAI costs.
- Product descriptions for 500+ new arrivals generated in the first week via the same infrastructure, cutting the copywriting pipeline from 2 hours to 5 minutes per batch.
- Monthly OpenAI bill: EUR 140-180, varying with traffic and promotions.
The payback window was about six weeks — cheaper than one month of the extra hire they were considering.
What I would do differently today
Three things, with the benefit of hindsight:
Start with gpt-4o-mini, not GPT-4o. The quality gap for retail support is smaller than the pricing gap. I could have cut the OpenAI bill by 70% and only a handful of edge cases would have suffered. Upgrade the model for the specific handful of complex flows that need it.
Skip Celery for embeddings on a catalog this size. Three thousand products is small. A synchronous re-embed on save would have been simpler to reason about. Celery pays off at 50k+ SKUs or when re-embedding takes more than a few seconds.
Build the evaluation set on day one. I bolted on a test suite of 40 representative customer questions about halfway through the project, which should have been step one. Every prompt change should be measurable against a fixed set of inputs — otherwise you are vibing your way through a system prompt.
FAQ — the questions customers actually asked
These are pulled from the live chat logs after six months of running the bot, which is also a nice stress test of the build itself.
How much does a chatbot like this cost to build?
Fixed price for a store in the 1k-5k SKU range with an existing Django or similar stack is EUR 3,000-5,000, delivered in 4-6 weeks. Running cost is typically EUR 100-250/month for OpenAI plus your existing hosting.
Can you do this on Shopify or WooCommerce instead of custom Django?
Yes. Shopify needs a headless or app-based integration, which adds about two weeks. WooCommerce is simpler because the database is directly accessible. The architecture principles — embeddings, tool calls for stock, escalation classifier — transfer 1:1.
How long until the bot pays for itself?
For a shop doing 100+ support tickets a day, typically 6-10 weeks. The math is straightforward: calculate the loaded hourly cost of your support team, multiply by the hours the bot offloads, subtract the OpenAI bill and the hosting delta.
Will the bot make mistakes?
Yes. The target is not zero mistakes — it is fewer mistakes than a tired human agent on a Friday evening, plus an escalation path for anything it is not sure about. With the tool-call pattern above, the mistakes that remain are mostly taste-level ("this colour does not suit this outfit"), not factual.
How do you handle customers who just want a human?
The "talk to a person" button is always visible, not hidden behind menus. The moment it is clicked, the full conversation history is preloaded into the agent's queue. Roughly 15% of conversations end up escalated, which is where the "70% handled" number comes from.
What about GDPR?
Customer chat logs are stored in the store's own Postgres, not on OpenAI's servers. I use OpenAI's API with data-retention disabled for B2C chats. The privacy policy is updated to disclose AI processing, and chat logs get the same retention policy as customer support emails (typically 12 months).
If you are thinking about building this
The architecture above is not tied to fashion. I have reused the same pattern for a real-estate platform (listings instead of SKUs), a wine investment platform (bottles and vintages), and a clinic website (treatments and availability). The three invariants are: embed your domain data in chunks that match how your customer thinks, force factual claims through tool calls, and classify for escalation on every message.
If you want a fixed-price quote for your own store, the fastest path is a 15-minute call where we walk through your catalog size, current support volume, and existing stack. I typically return a scoped proposal within 24 hours.