How to fix retail AI analytics pilots that fail
MIT's NANDA initiative studied 300 enterprise AI deployments, surveyed 350 employees, and interviewed 150 leaders. The headline finding:
95% of generative AI pilots at companies are failing.
Specialized vendor-led projects succeed about 67% of the time.
Internally built ones reach roughly half that.
The pattern in retail is the same, just with more expensive consultants in the room.
Picture the final POC review
The vendor team presents pattern detection on syndicated foot traffic data layered over competitor pricing scrapes.
The output looks slick. Then the head of analytics says, in some variation:
My team can do that.
The project goes quiet.
Within a quarter, the budget moves elsewhere.
Nobody calls it a failure. They call it a learning.
The analyst is not wrong
Given what the pilot actually tested, the conclusion was correct.
The pilot was built to fail because it was set up to test the wrong thing.
This piece is about what retail AI analytics pilots have to clear to count as incremental, and why most never come close.
If you are running pilots in a multi-location chain, you probably recognize at least one of the failure modes below.
Why most retail AI analytics pilots die in the conference room
The failure rate is not anecdotal. RAND found that more than 80% of AI projects fail, twice the rate of non-AI IT projects.
MIT's NANDA work pushed the number higher for generative AI specifically.
The retail sub-sector sits at roughly 74% failure, which is bad, but the number tells you less than the reasons behind it.
Most retail analytics pilots fail in a pattern that has nothing to do with model accuracy.
Four conditions tend to be present in every dead POC:
- The pilot ran on public or syndicated data alone
- It never touched the licensed proprietary feeds the retailer pays for
- It never touched internal operational systems (POS, labor, inventory, customer service)
- It never captured the operator judgment that turns a chart into a decision
Plus a fifth condition that quietly kills the pilots that survive the first four: the wrong delivery model.
The output gets pushed into another dashboard that nobody opens.
The point of retail AI analytics is to stop adding tools to the stack and start adding conclusions to the inbox.

Failure mode #1: The public-data-only POC
This is the dominant failure pattern in retail pilots.
The vendor shows up with:
- Foot traffic data from one of the location analytics providers
- Syndicated panel data from Circana or NielsenIQ
- Scraped competitor pricing
- Maybe weather and consumer sentiment
They run pattern detection across that stack and present findings.
Everything they show, your team could have produced. Possibly already did.
That is why the head of analytics ends the meeting the way they do.
We did, like, a small POC with a company who came to us and said, like, we kind of do these ambiguous problem consulting services. And when it came back, I said, my team can do that. Nothing here. While it is faster, it is not incremental. And there is a huge cost difference. I can get a pretty junior analyst to go and do that for me.
That is the VP of strategy at a $12 billion mass retailer describing a pilot they killed.
Read it carefully.
She is not saying the AI was wrong.
She is saying the output existed inside her team's reachable work.
Faster does not equal incremental.
When the inputs are the same as what her team uses, the answer ceiling is the same.
This is the part competitors hate to admit.
The Power BI and Tableau seats your team already pays for are genuinely sufficient for single-source analysis.
She said as much in the same conversation:
My team can use Power BI, they can use Tableau, like they can leverage basic data structuring tools to be able to eke out insights from a single source alone.
The competitive surface here is not your BI stack. It is the layer above it.
A pilot that runs against public data is competing against a junior analyst with a Tableau license and two weeks.
That competition is lost before it starts.
Useful retail strategy work on syndicated data is a solved problem inside most analytics teams.
AI does not change the answer ceiling. It just compresses the time to the same answer.
Failure mode #2: No path to licensed and internal data
The second failure pattern is what happens when the POC team tries to fix failure mode #1 by asking for more data.
They want to add:
- Circana panel data
- The IRI feed
- The company's own POS
- Labor systems
They hit a wall they did not plan for.
The wall has two faces. The first is contractual.
In order to have a third party leverage our data, we need to go to that data provider one by one. Like, we'd have to go renegotiate.
Most retailers license third-party data under agreements that prohibit a third party from touching it.
Letting a new AI vendor query that data means renegotiating with every provider individually.
Nobody has time for that, and nobody wants to spend a quarter of legal cycles to find out whether a POC will be useful.
So the pilot ships without the licensed layer, and the analyst's verdict holds.
The second face of the wall is integration.
The vendor has built their pilot in their environment, not yours.
To touch your internal POS, your labor management system, your inventory feed, your customer service ticketing, you would need to either send data out (which security blocks) or rebuild the pilot inside your environment (which the vendor cannot do without an architecture they probably do not have).
This is where most retail licensed data analytics pilots quietly die.
The findings the analyst would have called incremental, the ones that only appear when licensed third-party data is layered with internal operational data, never get tested.
The pilot delivers what it could test on public inputs, which is exactly what the team already had.
The vendors who get past this are the ones who built their architecture for it from the start.
The Burlington conversation went somewhere interesting when this came up:
We will actually put our agents in your organization. It is in your environment. We never own it, we never touch it, we never see it. You can fit under your own agreements.
That is the deployment model that unblocks licensed data.
It means containerized agents that run inside the customer's cloud: data never leaves, and license agreements stay intact.
If your POC vendor's architecture cannot do this, the pilot is structurally limited to whatever public data they can pull.
You will see the failure mode #1 result.

Failure mode #3: No capture of tribal knowledge
Most pilots assume the data is the gold. It is not.
The gold is what your best operator does in their head when they look at the data, and almost no pilot captures it.
Brad Peters, who founded Birst before starting Scoop, frames it this way:
85 to 95 percent of the context of a question that you ask is not contained in the question itself.
The context lives in the operator's institutional memory.
A senior retail VP looking at a comp store report does not just see the chart.
They see what the comp number means in this region, at this season, against this promotional overlay, given what they remember happened in 2019.
That layer of interpretation is what makes the chart actionable.
It is also why the descriptive vs diagnostic analytics gap matters more than it looks.
- Descriptive output tells you what happened.
- Diagnostic output tells you why.
The reason your team is good at the second part has very little to do with the dashboards they have access to.
It has to do with the pattern recognition their senior people built over fifteen years.
Rocky described the diagnostic load this way: when a store is doing poorly, there are literally 1,000 reasons why that can be. And it usually takes 10 primary hypotheses you have, and a lot of work to figure out what is happening below the surface.
The 1,000 collapse to 10 because of operator judgment.
Without that filter, an AI pilot has to surface all 1,000 and let the user sort it out, which is worse than no tool at all.
The capture problem is solvable.
It is also expensive in operator time, which is why most pilots skip it.
If we took a tape recorder and recorded everything you thought as you looked at your BI reports and described your analyses, we stick that into the system so it can do that on your behalf.
In practice this means structured sessions with the operators whose judgment you want to scale.
The Scoop team has spent 13 hours in stores recording how senior managers walk through reports, and six hours in a single consulting session with a COO laying out screening criteria.
The transcripts get turned into rules:
- What to check first
- What thresholds matter
- When to drill deeper
- What "out of balance" actually means for this business
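To make the idea of turning transcripts into rules concrete, here is a minimal sketch of what an encoded rule set could look like. It is an illustration, not the Scoop implementation: the metric names, thresholds, and drill-down targets are all hypothetical placeholders for whatever the operator sessions actually surface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str                           # what the operator checks
    priority: int                       # lower runs first ("what to check first")
    is_flagged: Callable[[dict], bool]  # threshold test on one store's metrics
    drill_into: str                     # where to look next when the rule fires


# Hypothetical thresholds; real values would come from the operator sessions.
RULES = [
    Rule("labor_out_of_balance", 1,
         lambda m: m["labor_pct_sales"] > 0.14, "schedule by daypart"),
    Rule("comp_sales_decline", 2,
         lambda m: m["comp_sales_change"] < -0.05, "traffic vs. ticket split"),
    Rule("shrink_spike", 3,
         lambda m: m["shrink_pct"] > 0.02, "category-level shrink"),
]


def review_store(metrics: dict) -> list[str]:
    """Apply the rules in the operator's priority order and collect findings."""
    findings = []
    for rule in sorted(RULES, key=lambda r: r.priority):
        if rule.is_flagged(metrics):
            findings.append(f"{rule.name}: drill into {rule.drill_into}")
    return findings


# Hypothetical metrics for one store.
store_0412 = {"labor_pct_sales": 0.16, "comp_sales_change": -0.02, "shrink_pct": 0.03}
print(review_store(store_0412))
```

The design point is that the priority order and the thresholds carry the operator's judgment. The data access alone would not tell the system which of the checks to run first or what counts as out of balance.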
Pilots that skip this step produce findings any decent analyst could have produced.
Pilots that do this step produce conclusions that read like your best operator's weekly review, except they ran across every location while everyone was asleep.
That is what "incremental" looks like.
Failure mode #4: The wrong deployment model
Assume a pilot clears the first three failures. It pulls licensed and internal data.
It captures operator judgment.
The findings are real.
The pilot can still die here, on the last step, because of how the output gets delivered.
The dominant pattern is to bolt a copilot onto the existing BI dashboard.
Another login. Another panel.
Another "go investigate" prompt that assumes the operator has time to investigate.
They do not. They never did.
The whole reason the drift problem is endemic in multi-location retail is that operators are running locations, not running queries.
Even sophisticated companies see this clearly when they look at their own AI rollouts:
We're in pilot mode, and utilization is so spotty because, honestly, it's relying a lot on people's personal ability to learn from videos.
That is Burlington's strategy team describing their copilot rollout.
The same logic applies to any AI tool that requires a user to log in, formulate an intent, and review output.
The friction is small per session and infinite at scale. Adoption dies on it.
The MIT NANDA report identified this as the learning gap: enterprise AI tools that "do not learn from feedback" and demand "too much manual context required each time" lose to flexible consumer tools because the enterprise tool sits in a separate place the user has to visit.
The same pattern explains why the gap between monitoring, which tells you what, and investigation, which tells you why, stays unresolved in most retail organizations.
The monitoring layer pushes alerts.
The investigation layer requires a human to go look. Nobody goes to look.
Pilots that test against this deployment pattern fail even when the analysis is genuinely good.
The investigation has to come to the operator, not the other way around.
A weekly report in the inbox, organized by store, with the flagged findings and the action options, gets read. A dashboard that requires a login does not.
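As a rough sketch of that delivery shape, the snippet below groups findings by store and renders a plain-text body that could be dropped into a weekly email. The field names (`store`, `finding`, `options`) and the sample findings are assumptions made for illustration. Sending the message is left out.

```python
from collections import defaultdict


def build_weekly_report(findings: list[dict]) -> str:
    """Render findings as plain text, organized by store, with action options."""
    by_store = defaultdict(list)
    for item in findings:
        by_store[item["store"]].append(item)

    lines = ["Weekly store review"]
    for store in sorted(by_store):
        lines.append(f"Store {store}")
        for item in by_store[store]:
            lines.append(f"  Finding: {item['finding']}")
            for option in item["options"]:
                lines.append(f"    Option: {option}")
    return "\n".join(lines)


# Hypothetical findings from one weekly run.
sample = [
    {"store": "0412", "finding": "Labor out of balance on weekday evenings",
     "options": ["Shift two closing hours to Saturday", "Review schedule template"]},
    {"store": "0087", "finding": "Comp sales down with flat traffic",
     "options": ["Check promotional overlay against last year"]},
]
print(build_weekly_report(sample))
```

Nothing in this requires the operator to log in or formulate a question. The conclusions and the choices arrive already organized by the unit they manage.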

What incremental actually means: the four-layer test
Here is the reframe to take back to whoever sponsored the pilot. A retail AI pilot is not a feature demo. It is a test of whether a system can produce conclusions your team could not have produced with the same time budget and the same data access. The four layers it has to draw on are public and syndicated data, licensed proprietary data, internal operational data, and tribal operator knowledge. That definition matters because it forces the pilot to compete on the right axis.
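One way to make that definition testable is to compare what the pilot found against what the analytics team found under the same time budget and data access, and count only the difference. The sketch below assumes findings can be normalized to comparable (store, issue) pairs, which is a simplification. The sample findings are hypothetical.

```python
# Findings are normalized to (store, issue) pairs so the two sets are comparable.
Finding = tuple[str, str]


def incremental_findings(pilot: set[Finding], baseline: set[Finding]) -> set[Finding]:
    """Return the pilot findings that the analyst baseline did not also reach."""
    return pilot - baseline


# Hypothetical results from the analyst baseline and from the pilot.
baseline = {("0412", "comp sales decline"), ("0087", "traffic drop")}
pilot = {
    ("0412", "comp sales decline"),
    ("0412", "labor out of balance"),
    ("0087", "traffic drop"),
}

new = incremental_findings(pilot, baseline)
print(sorted(new))
print(f"{len(new)} of {len(pilot)} pilot findings are incremental")
```

If the difference is empty, the pilot was faster at best, and the "my team can do that" verdict stands.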
Why retail pilots fail differently than other verticals
Retail compounds the pilot problem in ways other verticals do not. SKU counts run into the hundreds of thousands. Store-level variance is enormous. Promotional overlays shift weekly. Seasonality is structural. Comp-store comparisons require pristine matching logic. Markdown cadence varies by category and region. None of this is exotic. All of it is hard.
A pilot that works on aggregate retail data and fails at the store level is not a useful pilot. The decisions that matter happen at the store level. The variance that produces underperformance also happens at the store level. This is the whole problem space Domain Intelligence for retail was built to address, and it is also why so many generic AI pilots fail in retail specifically. The model can be excellent and still produce useless output if it cannot reason at the resolution where decisions live.
The pillar piece on why retail store diagnostics take so long goes deeper on the resolution problem and on what changes when investigation happens at the store level on a weekly cadence. The pilot framework above is the operational version of the same argument.
Frequently Asked Questions about why retail AI analytics pilots fail
What is a retail AI analytics pilot supposed to prove?
That a system can produce conclusions your team could not have produced with the same time budget and data access. Faster is not enough. Cheaper is not enough. The output has to be reachable only with the combination of all four layers, including operator judgment encoded into the workflow. If a junior analyst with the same inputs could have gotten there in two weeks, the pilot did not prove incremental value. The retail analytics baseline matters here because that baseline is what the pilot has to beat.
Why do analytics teams typically reject AI pilots?
Because most pilots run on public data, and analytics teams can already produce findings from public data. The rejection is correct given the inputs the team sees. The way to avoid the rejection is to design the pilot so it operates across licensed and internal data with operator judgment encoded in. That changes what the pilot can find and removes the "my team can do that" objection at the structural level. See monitoring vs investigation for the deeper distinction.
What data does a retail AI analytics pilot actually need to access?
All four layers. Public and syndicated, licensed proprietary, internal operational, and tribal operator knowledge. Missing layer 2 means the pilot is competing with what your team already has. Missing layer 3 means it cannot see what happened at the store. Missing layer 4 means the findings will be analytically correct and operationally useless. The retail licensed data problem is usually the bottleneck and the one to negotiate first.
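A layer audit along these lines can be sketched as a simple coverage check. The layer names and the consequences for layers 2 through 4 are taken from the answer above. The function name and data structures are hypothetical, and the text does not state a consequence for a missing layer 1, so the sketch says so rather than inventing one.

```python
LAYERS = {
    1: "public and syndicated",
    2: "licensed proprietary",
    3: "internal operational",
    4: "tribal operator knowledge",
}

# Consequences as described in the text for layers 2-4.
MISSING_CONSEQUENCE = {
    2: "the pilot is competing with what your team already has",
    3: "it cannot see what happened at the store",
    4: "findings will be analytically correct and operationally useless",
}


def audit_layers(available: set[int]) -> list[str]:
    """List each missing layer with the consequence of running without it."""
    gaps = []
    for number, name in sorted(LAYERS.items()):
        if number not in available:
            consequence = MISSING_CONSEQUENCE.get(
                number, "consequence not stated in this piece"
            )
            gaps.append(f"Layer {number} ({name}) missing: {consequence}")
    return gaps


# A public-data-only pilot has layer 1 and nothing else.
for gap in audit_layers({1}):
    print(gap)
```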
How long should a retail AI analytics pilot run?
Twelve to thirteen weeks if it is structured well. Three weeks for the layer audit. Three to four for tribal knowledge capture. Three for first runs and tuning. Three for the incremental comparison test. Shorter pilots compress the tuning phase and produce reports that look wrong on first pass, which gives the budget owner an excuse to kill the pilot before it stabilizes. Longer pilots usually waste time on layer 1 work that did not need that much time to begin with.
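The phase lengths above add up to the twelve-to-thirteen-week range. A small sketch of the arithmetic, with each phase held as a (name, minimum weeks, maximum weeks) tuple, which is just one convenient representation:

```python
# (phase, minimum weeks, maximum weeks), as given in the answer above.
PHASES = [
    ("layer audit", 3, 3),
    ("tribal knowledge capture", 3, 4),
    ("first runs and tuning", 3, 3),
    ("incremental comparison test", 3, 3),
]


def total_weeks(phases: list[tuple[str, int, int]]) -> tuple[int, int]:
    """Return the shortest and longest total pilot length in weeks."""
    shortest = sum(low for _, low, _ in phases)
    longest = sum(high for _, _, high in phases)
    return shortest, longest


print(total_weeks(PHASES))  # (12, 13)
```

The only phase with slack is tribal knowledge capture, which is where the extra week goes.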
What is tribal knowledge and how does it get captured?
Tribal knowledge is the interpretation logic your best operator applies when they look at a report. It includes what thresholds matter, which signals to act on versus ignore, what "out of balance" means for this business, and how to rank hypotheses when the data could mean several things. Capture happens through structured sessions where the operator walks through real reports and explains their reasoning out loud. The transcripts get encoded as rules the system uses to interpret data the same way the operator would. This is closer to agentic analytics than to traditional ML because the system is reasoning with encoded judgment, not just pattern matching.
Can a retail AI analytics pilot use only public data?
Technically yes. Practically the pilot will fail the incremental test. Public data alone produces output your analytics team can produce with the tools they have. The whole point of the pilot is to test what the system can do that your team cannot. Public-data-only pilots are useful as cost-benchmark exercises (does the AI get to the same answer cheaper?) but they do not answer the question that determines whether the system is worth deploying.





