DataSheriff — Product Design Spec

Author: Campana & Schott Consultant, Campana & Schott
Date: April 2, 2026
Status: Ready for Build Day
Build Day Team: TBD (1 builder + 5 non-builder roles)

1. Problem Statement

Pain Point 1: The Triage Tax on Data Engineers (Who feels it: Data / Analytics Teams)

When a business user raises a ticket saying "the numbers look wrong," a data engineer typically spends 90–120 minutes just determining what is actually broken: Which pipeline? Which dataset? Which downstream report? Which team owns the upstream source? For a team handling 30–50 tickets per month, this triage overhead consumes 45–100 hours of senior engineering time monthly, time that could be spent building data products rather than debugging vague complaints.

Pain Point 2: Invisible Supplier Failures (Who feels it: Data Teams + Business Users)

A significant proportion of data quality incidents originate not inside the organisation but in external supplier feeds: a batch file arrives with a renamed schema field, a supplier sends data two days late, a reference dataset contains unexpected nulls. Current monitoring tools (Monte Carlo, Bigeye, Soda) alert the data team, but they cannot identify whether the root cause is internal or external. The result: engineers spend hours investigating their own pipelines before realising Supplier X broke the contract. Meanwhile, business users have already made decisions on bad data. The issue then recurs the following month because nobody formally documented or tracked the supplier failure.

Pain Point 3: The Trust Erosion Loop (Who feels it: Business Users + VP/Director)

When business users encounter data they don't trust, they don't raise a ticket; they either stop using the data entirely or begin maintaining parallel spreadsheets as shadow systems. This erodes the entire value of the data platform investment. As a rough estimate, organisations with persistent data quality issues see 20–30% of business users maintaining unofficial data sources within 18 months. Each shadow spreadsheet represents a compounding cost in analyst time, in decision quality, and in the eventual reconciliation nightmare.

Pain Point 4: The Executive Escalation (Who feels it: VP / Director of Data & Analytics)

Data incidents routinely reach VP level because there is no structured communication channel between the data team and the business. When the business doesn't understand what happened or when it will be fixed, they escalate. A VP managing a team of 8–12 data professionals can expect to be pulled into 5–10 data quality escalations per month, each requiring 30–60 minutes of their time to mediate between technical and business stakeholders, time that should be spent on strategy, not incident management.

The Structural Problem Underneath the Symptoms

The four pain points above share a single root cause: there is no translation layer between the technical world of data pipelines and the business world of reports and decisions. Monitoring tools speak to engineers. Ticketing tools wait passively for humans to act. Neither system explains anomalies in plain language, identifies whether the issue is internal or external, proactively communicates status to the people affected, or builds institutional memory from past incidents. DataSheriff is that missing translation layer — an agent that sits at the boundary between the data team and the business, and manages the full lifecycle of a data incident from detection to resolution.

2. Product Concept

One-sentence pitch

DataSheriff is an AI agent that takes a business user's plain-English question about broken data, identifies whether the root cause lies with an internal team or an external supplier, and closes the loop automatically — raising the right ticket, notifying the right people, and explaining what happened in language anyone can understand.

How it works

Step 1 — Business user asks a question, not raises a ticket. Instead of navigating a ticketing portal, the business user opens DataSheriff's chat interface and types what they actually think: "My Q3 revenue report is $2.1M lower than yesterday — what happened?" This is the natural language of someone who does not know (and should not need to know) how data pipelines work.

Step 2 — Agent clarifies with precision, not open-endedness. DataSheriff asks at most two targeted questions to scope the incident: "Which report — the consolidated P&L or the regional breakdown?" and "Is this affecting all regions or just one?" These are not open-ended prompts — they are structured clarification choices that prevent the vague back-and-forth that wastes engineering time today.

Step 3 — Agent diagnoses using metadata, not raw data. Behind the scenes, the agent queries the anomaly log (pre-seeded or live-connected in production) and the lineage metadata configuration — a YAML file that maps every dataset to its upstream owner, whether that is an internal pipeline team or a named external supplier. The agent reasons over metadata descriptions only; raw client data rows never leave the client environment.

Step 4 — Agent identifies ownership: internal team or named supplier. This is the differentiating moment. The agent traverses the lineage graph and determines: is this a pipeline failure owned by the data engineering team, a model drift issue owned by the analytics team, or a schema change pushed by Supplier ACME in Tuesday's batch? The answer changes everything about who gets the ticket and what the business user is told.

Step 5 — Agent explains in plain English and auto-raises a routed ticket. The business user receives a response such as: "Your revenue figures pull from the ACME Sales Feed. ACME's Tuesday batch contained a schema change — the 'gross_revenue' field was renamed, causing $0 values to be ingested for 847 records. This is a supplier-side issue. Ticket #4821 has been raised with the Supplier Integration team with the affected batch metadata attached. Expected resolution: within 4 business hours." The ticket is pre-scoped, pre-routed, and pre-evidenced. The engineer who receives it knows exactly what to do.

Step 6 — VP dashboard captures the pattern. Every incident is logged. The VP sees a real-time view of open incidents, owner breakdown (internal vs. supplier), resolution time trends, and recurring failure patterns. Over time, the FAQ layer means that common questions — "why does the Monday morning sales figure always look lower?" — are answered directly without ever raising a ticket.

The strategic play

DataSheriff generates recurring revenue through three compounding mechanisms. First, the lineage metadata configuration — the map of datasets to owners and suppliers — is built during the C&S implementation engagement and is deeply client-specific. It cannot be lifted and dropped into a competitor product without re-doing the consulting work. Second, every resolved incident enriches the FAQ layer, making the agent progressively smarter about that client's specific data landscape, supplier relationships, and recurring failure modes. This creates a data moat that grows over time. Third, the incident log becomes an auditable record of supplier SLA performance — a unique asset the client cannot get from any monitoring tool. After 12 months, that log is evidence in supplier contract renegotiations. A client who has used DataSheriff for a year will not cancel it the month before their supplier review.

3. Target Buyer

Primary buyer

Title: VP of Data & Analytics, or Director of Data Engineering / Data Platform (at larger organisations, this may be a Chief Data Officer).

Their pain: They are accountable for data reliability as a business outcome, not a technical metric. They are measured on stakeholder trust, SLA compliance, and the ratio of engineering time spent on product-building versus incident firefighting. They are personally pulled into escalations they should never be involved in. When the CFO calls to ask why the revenue dashboard is wrong, it is their name on the line — not the engineer who wrote the pipeline.

What they do today

Today, this buyer's "solution" is a combination of: (1) a monitoring tool that alerts only the data team, (2) a Jira or ServiceNow instance where business users raise vague, poorly scoped tickets, and (3) a Slack channel where incident triage happens in real time, creating noise for the entire team. Some organisations have appointed a "data steward" or "data operations" role to manage communication between data and business, at a fully loaded cost of €80,000–€120,000 per year for a senior individual. That is the true cost baseline DataSheriff competes against.

Why they would pay

Three specific reasons: (1) Time recovered. If DataSheriff eliminates 60% of triage overhead for a team handling 40 tickets per month at 90–120 minutes each, that is 36–48 hours of senior engineering time returned monthly, worth approximately €6,000–€9,000/month at blended consulting rates. At that rate, the annual license (€24,000–€48,000) pays for itself in roughly 3–8 months. (2) Supplier accountability. No existing tool gives them an auditable record of supplier failures over time. DataSheriff does. That record has material value in contract renegotiations. (3) Stakeholder trust. Business users who receive plain-English explanations within minutes stop building shadow spreadsheets. Rebuilding that trust is worth more than the license cost: it is the return on the original data platform investment.

Secondary beneficiary

The Data Engineering Team. They do not write the check, but they feel the product's impact most immediately. Every ticket they receive is pre-scoped with root cause analysis, correctly routed to the right sub-team, and evidenced with anomaly data and lineage context. They stop spending half their sprint on triage and start spending it on work that matters. They will be the internal champions who push the VP to renew.

4. Architecture

Input Layer

Data enters DataSheriff through three channels. The primary channel is the chat interface — a simple, clean web UI where business users type questions in natural language. There is no form, no ticket category selection, no dropdown — just a text box and a conversation. The secondary channel is the anomaly event stream — in production, webhooks from existing monitoring tools (Monte Carlo, Bigeye) or a lightweight polling job against the client's data warehouse will push anomaly signals into the system before business users notice them. In the hackathon build, this is simulated with pre-seeded anomaly records. The third channel is the lineage metadata config — a YAML file maintained by the C&S implementation team that maps every dataset to its upstream owner (internal team name + contact, or external supplier name + SLA terms). This file is the foundational input that makes supplier identification possible.

Transform Layer (Core IP)

This is where DataSheriff's defensible value lives.

Clarification Agent (Claude, structured output mode): When a business user submits a question, the agent does not immediately attempt a diagnosis. It first identifies what it does not know — specifically, which dataset, which time window, and which report are in scope. It asks at most two targeted questions, generated by a structured prompt that constrains the agent to produce multiple-choice clarifications rather than open-ended questions. This prevents the clarification loop from becoming an interrogation. Temperature is set low (0.2) here to ensure consistent, professional phrasing.
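The two constraints on the Clarification Agent (at most two questions, each offered as multiple choice) can be checked in code after the model responds. The sketch below is one possible shape for that check; the `Clarification` class and `validate_clarifications` helper are hypothetical names introduced for illustration, not part of the spec.

```python
from dataclasses import dataclass

MAX_QUESTIONS = 2  # "at most two targeted questions"


@dataclass(frozen=True)
class Clarification:
    question: str
    choices: tuple  # multiple-choice options shown to the user


def validate_clarifications(items):
    """Reject model output that breaks the clarification constraints."""
    if len(items) > MAX_QUESTIONS:
        raise ValueError(f"expected at most {MAX_QUESTIONS} questions, got {len(items)}")
    for item in items:
        if len(item.choices) < 2:
            raise ValueError(f"question needs at least two choices: {item.question!r}")
    return list(items)


# Example taken from Step 2 of the product concept.
questions = validate_clarifications([
    Clarification("Which report?", ("Consolidated P&L", "Regional breakdown")),
    Clarification("Which regions are affected?", ("All regions", "Just one")),
])
```

Validating after generation keeps the limit enforceable even if the prompt alone does not always hold the model to it.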

Diagnosis Engine (Claude + anomaly log query): Once scoped, the agent queries the anomaly log for matching signals in the relevant dataset and time window. It is provided with the anomaly record as structured context — metric name, expected vs. actual value, timestamp, severity — and asked to generate a plain-English explanation of what the data shows. A system prompt "skill" anchors the agent's output style: it always states what changed, by how much, when it changed, and what downstream reports are affected. It never speculates beyond the available evidence.
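As a rough illustration of the anomaly log query, the sketch below uses an in-memory SQLite table (the hackathon mock data layer). The table name, column names, and the seeded record are assumptions made up for this example; only the fields themselves (metric name, expected vs. actual value, timestamp, severity) come from the description above.

```python
import sqlite3

# In-memory database standing in for the pre-seeded anomaly log.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE anomalies ("
    "dataset TEXT, metric TEXT, expected REAL, actual REAL, "
    "detected_at TEXT, severity TEXT)"
)
# Hypothetical seed record; the values are illustrative only.
conn.execute(
    "INSERT INTO anomalies VALUES (?, ?, ?, ?, ?, ?)",
    ("acme_sales_feed", "gross_revenue_sum", 2100000.0, 0.0,
     "2026-03-31T06:00:00", "high"),
)
conn.commit()


def find_anomalies(conn, dataset, start, end):
    """Return anomaly records for one dataset inside a time window.

    Timestamps are ISO 8601 strings, so string comparison orders them correctly.
    """
    rows = conn.execute(
        "SELECT metric, expected, actual, detected_at, severity "
        "FROM anomalies "
        "WHERE dataset = ? AND detected_at BETWEEN ? AND ? "
        "ORDER BY detected_at",
        (dataset, start, end),
    ).fetchall()
    keys = ("metric", "expected", "actual", "detected_at", "severity")
    return [dict(zip(keys, row)) for row in rows]


records = find_anomalies(
    conn, "acme_sales_feed", "2026-03-30T00:00:00", "2026-04-01T00:00:00"
)
```

Each returned dictionary can be passed to the model as structured context, which keeps raw data rows out of the prompt.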

Lineage Traversal (deterministic, not AI): Ownership identification is handled deterministically, not by the LLM. The agent looks up the affected dataset in the lineage metadata config and extracts the upstream owner. This is a simple key-value lookup — not an AI inference. This is deliberate: ownership assignment must be auditable and reproducible. The LLM is only used to translate the lookup result into natural language ("this dataset's upstream source is Supplier ACME, not an internal pipeline").
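A minimal sketch of that key-value lookup, assuming the YAML lineage config has already been parsed into a plain dictionary. The dataset keys, team names, contact addresses, field names (`owner_type`, `owner`, `contact`), and the `lookup_owner` helper are all illustrative assumptions.

```python
from typing import Optional

# Hypothetical lineage config, as it might look after loading the YAML file.
LINEAGE = {
    "acme_sales_feed": {
        "owner_type": "supplier",
        "owner": "Supplier ACME",
        "contact": "supplier-integration@example.com",
    },
    "regional_revenue_model": {
        "owner_type": "internal",
        "owner": "Analytics Team",
        "contact": "analytics@example.com",
    },
}


def lookup_owner(dataset: str, lineage: dict = LINEAGE) -> Optional[dict]:
    """Plain dictionary lookup: no model call, so the result is reproducible.

    Returns None when the dataset is not documented in the config.
    """
    return lineage.get(dataset)


owner = lookup_owner("acme_sales_feed")
```

The `None` case is left to the caller on purpose; how the agent should behave when ownership cannot be determined is still an open question in section 8.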

FAQ Layer (RAG over incident history): After the first 20–30 incidents are resolved, the system begins answering repeated questions directly from its incident history, without raising new tickets. This uses lightweight RAG: the user's question is embedded, compared against the vector store of resolved incidents, and if a match exceeds a confidence threshold (0.85 cosine similarity), the previous resolution is surfaced with a caveat: "This looks similar to an incident we resolved on March 12. Here is what happened then — does this match your situation?" Temperature is set to 0.3 for factual recall.
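The matching step can be sketched as a cosine-similarity comparison against stored incident embeddings. The vectors below are toy values; in a real build they would come from an embedding model, and the vector store, incident IDs, and `best_faq_match` helper are assumptions for illustration.

```python
import math

SIMILARITY_THRESHOLD = 0.85  # starting estimate from the spec


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_faq_match(question_vec, incidents, threshold=SIMILARITY_THRESHOLD):
    """Return (incident_id, score) for the closest incident, or None.

    `incidents` is a list of (incident_id, embedding) pairs. A match is only
    returned when its score exceeds the threshold.
    """
    best = None
    for incident_id, vec in incidents:
        score = cosine_similarity(question_vec, vec)
        if best is None or score > best[1]:
            best = (incident_id, score)
    if best is not None and best[1] > threshold:
        return best
    return None


# Toy embeddings: INC-1 points almost the same way as the question, INC-2 does not.
resolved = [("INC-1", [1.0, 0.1]), ("INC-2", [0.0, 1.0])]
match = best_faq_match([1.0, 0.0], resolved)
```

When `best_faq_match` returns `None`, the question would fall through to the normal clarification and diagnosis flow.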

Ticket Generation (structured output): The agent generates a ticket as a JSON object: title, description, severity, assigned team, linked dataset, anomaly evidence, and lineage source. This is passed to the ticket engine for routing. The ticket is shown to the user for confirmation before being raised — Human-in-the-Loop is enforced at this step.
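One way the ticket object and the confirmation gate might look is sketched below. The field names mirror the list above; the `Ticket` class, the `raise_ticket` helper, the list-based store, and the example values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Ticket:
    title: str
    description: str
    severity: str
    assigned_team: str
    linked_dataset: str
    anomaly_evidence: dict
    lineage_source: str


def raise_ticket(ticket, user_confirmed, store):
    """Append the ticket (as JSON) to the store only after user confirmation."""
    if not user_confirmed:
        return False
    store.append(json.dumps(asdict(ticket)))
    return True


# Illustrative values based on the ACME example in the product concept.
ticket = Ticket(
    title="ACME Sales Feed: gross_revenue field renamed",
    description="Tuesday batch ingested $0 values for 847 records.",
    severity="high",
    assigned_team="Supplier Integration",
    linked_dataset="acme_sales_feed",
    anomaly_evidence={"metric": "gross_revenue_sum", "actual": 0.0},
    lineage_source="Supplier ACME",
)

store = []
declined = raise_ticket(ticket, user_confirmed=False, store=store)  # nothing stored
accepted = raise_ticket(ticket, user_confirmed=True, store=store)   # ticket stored
```

Keeping the confirmation flag as an explicit argument makes the Human-in-the-Loop step hard to bypass by accident.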

Display Layer

Three views exist:

Business User Chat View: Conversational interface. Clean, minimal, single-column. The agent's responses are formatted with a clear structure: (1) what happened, (2) who owns it, (3) what has been done, (4) expected timeline. A status badge shows the ticket state (Raised → In Progress → Resolved) and updates automatically.
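The status badge implies a small state machine. A possible sketch, assuming the three states only ever move forward; the `TicketState` enum and `advance` helper are hypothetical names.

```python
from enum import Enum


class TicketState(Enum):
    RAISED = "Raised"
    IN_PROGRESS = "In Progress"
    RESOLVED = "Resolved"


# Allowed forward transitions; RESOLVED is terminal in this sketch.
_NEXT = {
    TicketState.RAISED: TicketState.IN_PROGRESS,
    TicketState.IN_PROGRESS: TicketState.RESOLVED,
}


def advance(state):
    """Move a ticket to its next state, or fail if it is already resolved."""
    if state not in _NEXT:
        raise ValueError(f"no transition out of {state.value}")
    return _NEXT[state]


badge = advance(TicketState.RAISED)
```

The badge text shown to the user would be `badge.value`.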

Engineer Ticket View: When a ticket is received by the data team or supplier integration team, it arrives pre-formatted with all context: the original business user question, the anomaly evidence, the lineage determination (internal vs. supplier), and the affected records. No back-and-forth required.

VP Dashboard: A single-screen executive view showing: open incidents (count + severity), resolution time trends (average, P90), owner breakdown as a donut chart (internal team vs. supplier), top recurring failure patterns, and SLA compliance rate. Designed to be glanceable in 30 seconds. No drilling required for the standard weekly review.
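The headline dashboard figures can be derived from the incident log with a few lines of aggregation. The sketch below uses made-up incidents and the nearest-rank definition of P90, which is an assumption; the spec does not say which percentile method to use.

```python
from collections import Counter
from statistics import mean


def p90(values):
    """Nearest-rank 90th percentile (one of several common definitions)."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # integer form of ceil(0.9 * n)
    return ordered[rank - 1]


# Hypothetical resolved incidents: (owner type, resolution time in hours).
incidents = [
    ("supplier", 4.0),
    ("internal", 2.0),
    ("internal", 6.0),
    ("supplier", 30.0),
    ("internal", 4.0),
]

hours = [h for _, h in incidents]
summary = {
    "avg_hours": mean(hours),
    "p90_hours": p90(hours),
    "owner_breakdown": dict(Counter(owner for owner, _ in incidents)),
}
```

The `owner_breakdown` dictionary maps directly onto the internal vs. supplier donut chart.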

Tech Stack (Hackathon)

Component	Technology	Rationale
Frontend	React + Tailwind CSS	Fast to build, clean UI, familiar to most builders
Backend API	Node.js (Express) or Python (FastAPI)	Simple REST endpoints for chat, ticket creation, dashboard data
AI Agent	Claude API (claude-sonnet-4-20250514)	Structured clarification + diagnosis + ticket generation
Mock Data Layer	SQLite with 3 pre-seeded anomaly scenarios	Fully controlled, no external dependencies, instant setup
Lineage Config	YAML file (3 datasets, 2 internal teams, 1 named supplier)	Simple to author, easily extended
Ticket Engine	In-memory store + mock routing logic	No Jira integration needed for demo — fake it convincingly
VP Dashboard	React with Recharts	Pre-populate with historical mock data so dashboard looks lived-in
Deployment	Local + ngrok tunnel	No cloud setup time required; demo runs on a laptop

5. Hackathon Execution Plan

Timeline table

Time	Phase	What Gets Built	Owner
0:00–0:30	Setup	Repo cloned, environment running, YAML lineage config authored with 3 scenarios (2 internal, 1 supplier)	Builder
0:30–1:15	Core Agent	Claude API integrated; clarification + diagnosis prompts working end-to-end for Scenario 1	Builder
1:15–1:45	Lineage + Ownership	Deterministic supplier lookup working; agent correctly identifies "Supplier ACME" vs "Pipeline Team"	Builder
1:45–2:15	Ticket Engine	Ticket JSON generated, displayed to user, confirmed, and "raised" (stored in memory, shown in UI)	Builder
2:15–2:45	VP Dashboard	React dashboard with mock data: open incidents, owner donut chart, resolution time trend	Builder
2:45–3:15	Demo Polish	All three scenarios scripted and tested; chat UI styled; edge cases handled	Builder + Product Owner
3:15–3:45	Pitch Prep	Slide deck finalised, demo script rehearsed, pricing narrative locked	All non-builder roles
3:45–4:00	Buffer	Fix whatever broke. It will have broken.	Everyone

Critical risk and fallback

Biggest risk: The Claude API integration takes longer than expected due to prompt tuning. Getting the agent to produce the right plain-English structure consistently — not too technical, not too vague — requires iteration, and iteration eats time.

Fallback (if not working by the 2:15 mark): Drop the live agent and replace it with a scripted walkthrough. Pre-record or hard-code the three scenarios as fixed question-answer pairs. The demo still shows the full UX — chat interface, ownership identification, ticket creation, VP dashboard — but the "AI" is smoke and mirrors. This is acceptable for a hackathon. The pitch judges capability and business logic, not whether the NLP is live. The VP dashboard and the supplier identification narrative are the things that win the room — and both can be demonstrated with static data.

Role allocation table

Role	Build Day Responsibility
Product Owner	Owns the demo script from minute one. Writes the three scenario narratives (which datasets break, which supplier is at fault, what the business user says). Runs the first end-to-end test at the 2:15 mark. Makes the call on fallback if needed.
Client Advocate	Plays the business user in the live demo. Rehearses the exact typed question, the agent's response, and the reaction line: "That's the meeting that usually takes 3 hours — it just took 45 seconds." Also stress-tests the agent with unexpected inputs during build time to find gaps.
Narrator	Writes and delivers the pitch. Owns the story arc: the problem (the escalation, the blame game, the blind trust in suppliers), the solution reveal, the demo handoff, the business case close. Must be able to explain supplier identification to a non-technical judge in two sentences.
Market Analyst	Builds the competitive slide. Researches Monte Carlo, Bigeye, and ServiceNow pricing in real time. Prepares the "why DataSheriff wins" framing: what those tools do, what they miss, and why the gap is commercial. Also prepares the ROI calculation for the pitch close.
Pricing Strategist	Owns the business model slide and the pricing conversation with judges. Models the implementation + license structure, prepares the ROI math (engineering hours saved × blended rate), and anticipates objections: "why not just use Jira?" and "why not just hire a data steward?"

6. Defensibility & Competitive Moat

Why a client would pay — and keep paying

1. Workflow lock-in, not feature lock-in. Once DataSheriff becomes the communication channel between business users and the data team, it is embedded in how the organisation handles every data incident. Replacing it means retraining business users, migrating incident history, re-routing tickets, and rebuilding the FAQ layer. This is a change management problem, not a tool swap. The switching cost is measured in months of disruption, not hours of migration.

2. The incident log as an appreciating asset. Every resolved incident adds to an auditable record of what broke, when, who owned it, and how long it took to resolve. After 12 months, this log is the client's primary evidence base for supplier SLA conversations. It contains supplier failure frequency, average impact severity, and resolution time by owner. No other tool produces this. A client approaching a supplier contract renewal who has 12 months of DataSheriff incident history has leverage they did not have before. They will not cancel the product the month before that conversation.

3. The FAQ layer compounds. The first month, DataSheriff raises tickets. By month six, the expectation is that it answers 30–40% of incoming questions directly from its incident history, without creating tickets. By month twelve, that share is projected to be closer to 60%. The agent gets demonstrably smarter about each client's specific data landscape over time. This is not a generic SaaS improvement; it is a client-specific improvement that has no value to anyone else. The client is, in effect, training their own instance of the product. They will not abandon that investment.

Pricing model

Model: Implementation fee (one-time) + Annual license (recurring)

Implementation fee: €30,000 – €60,000. This covers discovery (mapping the client's data landscape), authoring the lineage metadata configuration (the YAML file that makes supplier identification work), integration with existing ticketing systems (Jira, ServiceNow), and user onboarding for both business users and data team. For C&S, this is familiar consulting revenue. It funds the engagement and establishes the asset.

Annual license: €24,000 – €48,000 per year (scaled by number of datasets in scope and number of business users). Renewal is driven by value demonstrated in the incident log and FAQ layer, not by a salesperson's relationship.

The math behind the price:

Average data team: 8 engineers at a €1,440 blended day rate (€180/hour over an 8-hour day)
Monthly triage overhead per engineer: ~6 hours (conservative)
Monthly cost of triage: 8 × 6 × (1,440/8) = €8,640/month
DataSheriff eliminates ~60% of triage overhead: €5,184/month saved
Annual saving: €62,208
Annual license: €24,000–€48,000
Net saving to client in Year 1: €14,000–€38,000 after license cost (excluding the one-time implementation fee)
ROI on the license: 30–160%, achieved within the first year

This does not include the value of the supplier accountability log, the reduction in shadow spreadsheets, or the VP's time recovered. Those are upside the client gets to name for themselves in the business case.
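The figures above can be reproduced with a short calculation. The sketch assumes an hourly rate of €180, which is the rate that yields the €8,640 monthly triage cost; the variable names and the `year_one` helper are illustrative, and the one-time implementation fee is left out, as it is in the list above.

```python
ENGINEERS = 8
TRIAGE_HOURS_PER_ENGINEER = 6   # per month
HOURLY_RATE_EUR = 180           # assumption: rate implied by the 8,640 figure
REDUCTION_PERCENT = 60          # share of triage overhead eliminated

monthly_triage_cost = ENGINEERS * TRIAGE_HOURS_PER_ENGINEER * HOURLY_RATE_EUR
monthly_saving = monthly_triage_cost * REDUCTION_PERCENT / 100
annual_saving = monthly_saving * 12


def year_one(license_fee):
    """Return (net saving, ROI on the license) for a given annual license fee."""
    net = annual_saving - license_fee
    return net, net / license_fee


high_fee_net, high_fee_roi = year_one(48_000)  # about 14,208 and 0.30
low_fee_net, low_fee_roi = year_one(24_000)    # about 38,208 and 1.59
```

Changing `HOURLY_RATE_EUR` or `TRIAGE_HOURS_PER_ENGINEER` shows how sensitive the business case is to those two inputs.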

Competitive positioning

DataSheriff is NOT…	Why	What it is instead
A data monitoring tool (Monte Carlo, Bigeye)	Those tools detect anomalies and alert data teams. They do not communicate with business users, identify suppliers, or manage the incident workflow end-to-end.	DataSheriff is a data incident communication and resolution layer
A ticketing system (Jira, ServiceNow)	Those tools passively receive tickets that humans create. They have no intelligence, no diagnosis, and no routing logic based on data lineage.	DataSheriff is an active agent that creates, enriches, and routes tickets autonomously
A chatbot or FAQ tool (Confluence AI, Notion AI)	Those tools answer questions from static documentation. They cannot query live anomaly data, traverse lineage, or identify supplier failures.	DataSheriff reasons over live data events, not static knowledge bases

The closest analogue is PagerDuty for data teams — but where PagerDuty routes alerts between engineers, DataSheriff routes explanations between data systems and business humans, with supplier intelligence that PagerDuty has never attempted.

7. Success Criteria

Hackathon demo (April 17)

Business user types a plain-English question into the chat interface without any coaching or guidance
Agent asks at most two clarifying questions before diagnosing
Agent correctly identifies the root cause as a named supplier failure (not an internal pipeline issue) for the primary scenario
Agent's plain-English explanation is readable and credible to a non-technical judge in the room
A pre-scoped, pre-routed ticket is generated and shown to the user for confirmation before being raised
VP dashboard loads with real-looking data: at least three past incidents, an owner breakdown chart, and a resolution time trend
The full flow — question to explanation to ticket to dashboard — runs in under 3 minutes of live demo time
The team can answer "why not just use Monte Carlo?" without hesitation

"Would a client pay for this?" test

The pain is immediately recognisable. When the Narrator describes the escalation scene — the VP pulled into a meeting because nobody knows if it's an internal failure or a supplier issue — at least one judge should visibly react with recognition. If nobody in the room has felt this pain, the product has the wrong audience.
The demo moment lands. The sentence "That's the meeting that usually takes 3 hours — it just took 45 seconds" must be said, and it must land. If a judge laughs or nods, the product is real.
The supplier identification is seen as novel. Judges should not be able to name a tool that already does this. If they say "but doesn't Monte Carlo do that?", the Narrator must be able to explain clearly why it does not.
The price is anchored to value, not cost. The Pricing Strategist must present the ROI math — not the build cost — and the license fee must feel cheap relative to the engineering hours it replaces.
The recurring revenue story is credible. Judges must be able to see how this generates 8–10x multiples for C&S, not just a one-time implementation fee. The FAQ layer compounding and the incident log lock-in must be explained clearly.

8. Open Questions

1. How granular does the lineage metadata config need to be — and who maintains it? The supplier identification feature depends entirely on the YAML lineage config being accurate and up to date. In the hackathon, this is hand-authored for three scenarios. In production, data landscapes change constantly: new pipelines are added, suppliers are onboarded, schemas evolve. Does the client's data team maintain this file? Does C&S maintain it as a managed service (additional recurring revenue)? Does the agent learn to update it automatically over time? This is a design decision with significant implications for the product's operational model and pricing.

2. What happens when the agent is wrong about ownership? The lineage traversal is deterministic — but the lineage config may be incomplete. If a dataset has an undocumented upstream dependency, the agent will either misattribute the issue (blaming the wrong team) or fail to identify an owner at all. A wrongly attributed ticket — especially one that incorrectly blames a supplier — could damage a client relationship. The team needs to design the fallback behaviour: does the agent say "I cannot determine ownership with confidence" and ask the data team to confirm? Does it default to internal routing when uncertain? This needs a clear decision before the first production deployment.

3. How does the FAQ layer prevent false-positive matches from misleading users? RAG over incident history introduces a subtle risk: a question about a new incident may surface a resolution from a superficially similar but causally unrelated past incident. If a business user is told "this looks like the March 12 incident — Supplier ACME's batch was late" when the current issue is actually an internal dbt model failure, they may delay raising the correct ticket. The confidence threshold (0.85) is a starting estimate. The team needs to decide: what is the human review mechanism for FAQ-layer responses? Should all FAQ answers be flagged as "suggested, not confirmed" until a data engineer reviews? This affects both user trust and resolution time.

4. What is the go-to-market motion — land and expand, or full-platform sale? The product as designed is most powerful when it covers a client's full data landscape — all datasets, all suppliers, all business user groups. But the implementation effort scales with scope. The team needs to decide: do we sell DataSheriff as a full platform from day one (higher ACV, harder sell, longer implementation), or do we land on a single use case — say, the finance reporting pipeline and its two key suppliers — and expand from there (lower initial ACV, faster time-to-value, easier renewal conversation)? The answer changes the sales motion, the implementation pricing, and the onboarding experience design.