Most posts about custom AI tools for UK SMEs stop at the architecture diagram. The interesting question for any operator is the one architecture diagrams never answer: did it actually move the number, and at what cost? This piece is a deliberately concrete answer to that question. It tells the story of a single workflow at a single anonymised UK property management firm, the bespoke AI tool we built around it, the numbers it produced ninety days in, and the things we would do differently if we ran the build again.
The client is a regional property management business with around 180 staff and roughly 12,000 tenancies under management across the Midlands and North West. The names and identifying details are removed, but the workflow, the architecture, the metrics, and the lessons are real. If you are weighing up whether a custom AI tool is justified inside your own business, this should give you a defensible shape to compare against.
The bottleneck was inbound tenant email. Around 220 emails landed in the central tenant services inbox every weekday: maintenance requests, payment queries, lease questions, complaints, end-of-tenancy notifications, deposit disputes, and a long tail of edge cases. Three full-time triagers read every email, classified it, copied the relevant detail into Dynamics 365 as a case record, assigned a priority, and routed it to the right team. The average end-to-end triage time per email, measured across a fortnight of timestamped log data, was just under fourteen minutes. The team was permanently behind.
The client had already tried two routes. Microsoft Copilot in Outlook had been rolled out to the triage team for a quarter and rejected for two reasons: it could not write structured case records into Dynamics 365 in the format the downstream workflows expected, and its handling of emails in Polish, Urdu, and Romanian (roughly a fifth of tenant volume in the relevant regions) was inconsistent enough to require human re-checking. A separate proof-of-concept with a generic third-party email automation tool had failed in the demoware-to-production transition we wrote about in "Why AI automation stalls in most UK businesses": demo accuracy on a clean test set did not survive contact with real, messy inbox data.
The brief to us was small and explicit. One workflow, one success metric, one custom build, ninety-day payback target.
The Copilot question is the right place to start, because it is the question every UK SME owner now asks before commissioning a bespoke build. Copilot is excellent inside the Microsoft 365 surface: drafting replies, summarising threads, retrieving context from SharePoint. It is not designed to be the spine of a regulated business process that has to drop structured records into a line-of-business system, provide an auditable trail of decisions, and handle lower-resource languages such as Polish, Urdu, and Romanian at SLA-grade reliability.
None of that is a criticism of Copilot — it is a different product solving a different problem. We covered the underlying decision framework in Copilot vs custom AI tools; the property management case is a textbook instance of the criteria that push a workflow into the bespoke column.
We ran a paid two-week discovery before any code was written. Three workshops with the triage team, two with the head of tenant services, one with the IT lead, and a data review against four weeks of historical email logs. The deliverables were a written success metric, a workflow map, a data quality assessment, and a defined evaluation set of 300 historical emails hand-graded by the senior triager.
The success metric was sharp. Median triage time per email below four minutes across the next thirty operating days post-deployment, on no fewer than 95% of emails handled end-to-end by the tool. One sentence, signed by the head of tenant services. That sentence is the single most important deliverable of any discovery; without it, the project would have drifted into a permanent pilot.
Discovery also surfaced two issues the client had not previously named. Around 30% of inbound emails contained attachments — photos of maintenance issues, scanned tenancy documents, payment screenshots — that needed to be retained, summarised, and attached to the Dynamics case record. And multi-language handling required a deliberate prompt strategy, not just a language-detection step. Both changed the scope before any code was written.
The build is straightforward in shape, which is usually the case for a custom AI tool that actually ships. Inbound emails arrive in the central tenant services mailbox via Exchange Online. A Microsoft Graph webhook fires on receipt and triggers an Azure Function, which extracts the body, headers, and attachments, and constructs a structured prompt for the Claude API. Claude returns a structured JSON response with classification, language, extracted issue summary, recommended priority, confidence score, and a draft holding reply where appropriate.
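To make the structured response concrete, here is a minimal sketch of how the JSON coming back from the model might be validated before anything downstream trusts it. The field names follow the list in the paragraph above, but the exact keys, the label sets, and the `parse_triage_response` helper are assumptions for illustration, not the production schema. The sketch deliberately makes no call to the Graph or Claude APIs.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical label sets, loosely based on the email types named earlier.
CATEGORIES = {"maintenance", "payment", "lease", "complaint",
              "end_of_tenancy", "deposit_dispute", "other"}
PRIORITIES = {"low", "normal", "high", "urgent"}


@dataclass(frozen=True)
class TriageResult:
    classification: str
    language: str
    summary: str
    priority: str
    confidence: float
    draft_reply: Optional[str]


def parse_triage_response(raw: str) -> TriageResult:
    """Parse and validate the model's JSON; raise ValueError on bad values."""
    data = json.loads(raw)  # malformed JSON raises JSONDecodeError, a ValueError
    if data["classification"] not in CATEGORIES:
        raise ValueError(f"unknown classification: {data['classification']!r}")
    if data["priority"] not in PRIORITIES:
        raise ValueError(f"unknown priority: {data['priority']!r}")
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return TriageResult(
        classification=data["classification"],
        language=data["language"],
        summary=data["summary"],
        priority=data["priority"],
        confidence=confidence,
        draft_reply=data.get("draft_reply"),  # optional holding reply
    )
```

Rejecting an out-of-range or unknown value at this boundary means a malformed response becomes a human-review case rather than a bad Dynamics record.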
A routing layer then makes one of three decisions. High-confidence cases are written directly to Dynamics 365 as a case record and assigned to the correct team. Medium-confidence cases are written to Dynamics with a flag for human review. Low-confidence cases — or cases that trip a defined set of escalation rules — are routed to a human triager with the AI's assessment attached as a starting point. Every decision is logged with its prompt, response, and confidence to a separate evaluation store.
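The three-way decision can be sketched in a few lines. The threshold values, the `Route` names, and the shape of the log entry below are assumptions for illustration; the article does not state the real cut-offs or the evaluation store's schema.

```python
from enum import Enum
from typing import Any, Dict


class Route(Enum):
    AUTO_COMMIT = "auto_commit"                  # write to Dynamics, assign team
    COMMIT_WITH_REVIEW = "commit_with_review"    # write to Dynamics, flag for review
    HUMAN_TRIAGE = "human_triage"                # hand to a triager with AI notes


# Hypothetical thresholds; in practice these would be tuned against an evaluation set.
HIGH_CONFIDENCE = 0.90
MEDIUM_CONFIDENCE = 0.70


def choose_route(confidence: float, escalation_triggered: bool) -> Route:
    """Pick one of the three routes from confidence and escalation rules."""
    if escalation_triggered or confidence < MEDIUM_CONFIDENCE:
        return Route.HUMAN_TRIAGE
    if confidence >= HIGH_CONFIDENCE:
        return Route.AUTO_COMMIT
    return Route.COMMIT_WITH_REVIEW


def decision_log_entry(prompt: str, response: str,
                       confidence: float, route: Route) -> Dict[str, Any]:
    """Build the record that would be written to the evaluation store."""
    return {"prompt": prompt, "response": response,
            "confidence": confidence, "route": route.value}
```

Checking the escalation rules before the confidence bands means a rule can never be overridden by a confident model, which is the safer default for complaints and similar cases.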
The architecture is deliberately simple. No vector database, no agent framework, no orchestration layer beyond what an Azure Function gives you for free. The complexity sits where it should — in the prompt, the evaluation harness, and the routing rules. This is the pattern we follow across our AI integration services: the smallest architecture that delivers the workflow, instrumented well enough to be operated for years.
The 300-email evaluation set was scored by the senior triager during discovery against the right answer for each: correct classification, correct language detection, correct priority, correct routing decision. That set was wired into a script that runs automatically against any prompt or model change before it reaches production. The harness was in place from week one of the build, which meant every prompt iteration was measured rather than guessed.
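A harness of this kind can be very small. The sketch below assumes each graded example is a dict holding the email text plus the four graded fields, and that `predict` wraps whatever prompt and model are under test; the names `score` and `passes_gate` are hypothetical, and the real script may be structured differently.

```python
from typing import Callable, Dict, List

# The four graded dimensions named in the text; the key names are assumptions.
FIELDS = ("classification", "language", "priority", "route")


def score(graded: List[Dict[str, str]],
          predict: Callable[[str], Dict[str, str]]) -> Dict[str, float]:
    """Return per-field accuracy of `predict` over the hand-graded set."""
    if not graded:
        raise ValueError("evaluation set is empty")
    correct = {field: 0 for field in FIELDS}
    for example in graded:
        predicted = predict(example["email"])
        for field in FIELDS:
            if predicted.get(field) == example[field]:
                correct[field] += 1
    return {field: correct[field] / len(graded) for field in FIELDS}


def passes_gate(scores: Dict[str, float], baseline: Dict[str, float]) -> bool:
    """Allow a prompt or model change only if no field drops below baseline."""
    return all(scores[field] >= baseline[field] for field in FIELDS)
```

Run before every deployment, a gate like this turns "the new prompt feels better" into a pass or fail against the last accepted scores.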
The first prompt scored 71% on classification accuracy across the harness. The third iteration scored 87%. The seventh, with multi-language examples and explicit edge-case handling, scored 94%. Without the harness, we would have shipped the 71% prompt and called it good. The harness is the dividing line between a custom AI tool that ships and one that stalls.
Build duration was seven weeks against an original five-week estimate, with the slip almost entirely down to the multi-language work that discovery had underweighted. Week one was infrastructure: Azure Function, Graph webhook, Claude API integration, evaluation harness. Weeks two and three were prompt iteration, scored against the harness on every change. Week four was the routing layer and Dynamics integration. Week five was attachment handling. Weeks six and seven were multi-language reliability work and a phased rollout with a manual review safety net.
Deployment was phased over five operating days: 20% of inbound through the tool with every decision human-reviewed, rising to 50% with high-confidence decisions auto-committed, then 100% with the full routing rules running to spec. The rollout caught two prompt issues the harness had not — both edge cases involving forwarded chains absent from the historical evaluation set.
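One common way to send a fixed share of traffic through a new path is to hash a stable message identifier into a bucket. The sketch below assumes that approach; the article does not say how the 20%, 50%, and 100% splits were actually implemented, and the `STAGES` field names are hypothetical.

```python
import hashlib

# The three phases described above, expressed as data (field names are assumptions).
STAGES = [
    {"share": 20, "auto_commit_high_confidence": False},   # every decision reviewed
    {"share": 50, "auto_commit_high_confidence": True},
    {"share": 100, "auto_commit_high_confidence": True},   # full routing rules
]


def bucket(message_id: str) -> int:
    """Map a message ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(message_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(message_id: str, share: int) -> bool:
    """True if this message falls inside the current rollout percentage."""
    return bucket(message_id) < share
```

Because the bucket depends only on the message ID, an email that was in scope at 20% stays in scope at 50% and 100%, which keeps before-and-after comparisons clean.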
Median triage time per email: 14 minutes → 3 minutes (78% reduction)
Emails handled end-to-end by the tool with no human re-touch: 71%
Classification accuracy on production traffic: 93% (audited monthly)
Staff hours reclaimed across the triage team: ~9 hours per operating day
Annualised staff cost reclaimed: ~£71,000
Build and discovery cost: £32,400
Running cost (Claude API + Azure): ~£180 per month
Payback period: approximately 10 weeks measured against reclaimed staff time
The success metric was hit inside the first thirty operating days post-deployment. Median triage on emails handled end-to-end by the tool settled at three minutes against the four-minute target. Two of the three triagers were redeployed to higher-value tenant relationship work; the third now functions as the named operator for the tool, watching the success metric monthly and chairing the operate-phase review.
Two numbers matter more than the headlines. The first is the 71% of emails handled end-to-end by the tool with no human re-touch — that is the figure the staff-hours reclaimed depends on, and it has held steady across ninety days. The second is the running cost: a hundred and eighty pounds a month against a seventy-one thousand pound annualised saving. The headline cost of a custom AI tool is the engineering, not the inference. Once the build is in place, the operating economics are favourable in a way that off-the-shelf per-seat licensing rarely matches at this volume.
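The running-cost comparison is simple enough to check by hand. The sketch below uses only the two figures quoted above; the variable names are illustrative.

```python
# Figures quoted in the text, in GBP.
monthly_running_cost = 180
annualised_saving = 71_000

annual_running_cost = monthly_running_cost * 12            # 2,160 per year
running_cost_share = annual_running_cost / annualised_saving

print(f"Annual running cost: £{annual_running_cost:,}")
print(f"Share of annualised saving: {running_cost_share:.1%}")  # 3.0%
```

Inference and hosting come to about 3% of the reclaimed staff cost, which is why the engineering, not the inference, dominates the total.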
Three things, in honesty. First, we underestimated multi-language complexity in discovery: the workshops should have included a deliberate sampling of non-English emails, with the senior triager translating and grading live. We caught the issue in build, but it cost two weeks. Second, we built the evaluation harness in week one but did not use it to score the human triagers themselves on the same 300 emails until week four; doing that earlier would have given a sharper benchmark from the start. Third, the routing rules were initially documented inside the Azure Function code rather than as a separately reviewable artefact. We rewrote them as a YAML policy file in the operate phase, which is where they should have started.
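The point about keeping routing rules outside the function code can be illustrated with rules held as data and evaluated by a small generic function. The policy below is a hypothetical stand-in, written as the kind of dict a YAML parser would return; the rule names and structure are assumptions, not the client's actual policy file.

```python
from typing import Any, Dict

# Hypothetical policy, shaped like the parsed contents of a policy file.
POLICY: Dict[str, Any] = {
    "escalate_when": [
        {"field": "classification", "equals": "complaint"},
        {"field": "classification", "equals": "deposit_dispute"},
        {"field": "priority", "equals": "urgent"},
    ],
}


def escalation_triggered(result: Dict[str, str], policy: Dict[str, Any]) -> bool:
    """True if any rule in the policy matches the triage result."""
    return any(
        result.get(rule["field"]) == rule["equals"]
        for rule in policy["escalate_when"]
    )
```

With the rules held as data, a change to escalation behaviour becomes a reviewable diff to one file rather than a code change to the function.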
The decision criteria are not subtle. A custom AI tool earns its keep when the workflow requires routing into a proprietary or line-of-business system, structured outputs another system has to consume, multi-language handling beyond what off-the-shelf assistants reliably manage, an auditable trail of decisions for regulatory or contractual reasons, or volume that makes per-seat licensing economically silly. The property management workflow ticks four of those five. Most workflows in regulated UK SMEs — financial services, legal, healthcare, property, professional services — tick at least three.
Off-the-shelf tools remain the right answer for a long tail of workflows that do not need any of the above. Drafting, summarising, brainstorming, contextual retrieval inside Microsoft 365 — Copilot is genuinely the right product for those jobs. The question is not bespoke versus Copilot in the abstract; it is which one matches the shape of the workflow you are trying to change. If yours looks like the property management one — high inbound volume, classification and routing, structured handoff into a system of record, defensible accuracy — a custom AI tool built on the Claude API is usually the right answer, and a paid two-week discovery is the cheapest way to find out for certain. Our how we work page sets out the discovery, build, and operate phases in detail.
If you have a workflow with high inbound volume, structured routing, and accuracy that has to be defensible — book a 30-minute discovery call. We will tell you straight whether a custom AI tool is the right answer for it, or whether Copilot is doing the job already. No sales theatre.