Alumni Customer Research Plan (Mom Test-Informed) - AI Evals Course

What To Do Next

Talk to 15-20 alumni. Not a survey. Not 3,000 people. Conversations.

Who to talk to

Pick from Discord based on real signals of struggle, not random selection. Segment across roles (engineer, PM, leader), company stage (startup vs. enterprise), and time since course (3+ months). Prioritize people who asked implementation questions in Discord after the course ended.

High-priority people based on Discord activity:

  1. kindlydefeated (consultant) - Enterprise security blocks all third-party observability tools across multiple clients
  2. 907resident - Fighting organizational resistance to evals adoption with engineering team
  3. teresa_46350 - Building Airtable + Jupyter + AWS Step Functions eval pipeline, wants non-technical people involved
  4. mrbendi25 - Non-ML leader enrolled out of frustration with engineering team; dealing with PII in traces
  5. strickvlzenml - Struggling with tracing instrumentation in complex agentic systems
  6. jackrichardson_23979 (consultant) - Client stuck in whack-a-mole for months, trying to sell them on evals
  7. therobertta - Actively building custom annotation tool, tried to organize a working group
  8. erincode.org_64293 - Org playing "hot potato" with who owns the eval lifecycle
  9. nate1363 - Real experience with team resistance, comparing evals adoption to test suite adoption
  10. wayde_bt - "Hardest thing in my life has been getting people who say they care to actually be involved in error analysis"

How to frame the conversation

Don't mention product ideas. Say: "I'm trying to make the course better for future students. Can I learn about what happened after you took it?"

Questions to ask:

  1. "Walk me through what happened when you went back to work after the course."
  2. "What does your eval workflow look like today?"
  3. "What's the hardest part of your current process?"
  4. "Last time you tried to [do error analysis / build an LLM judge / set up CI for evals], what happened?"
  5. "How are you solving that right now?"
  6. "Who else on your team is involved in evals? How do they learn the process?"
  7. "Have you spent any money or time trying to solve that?"

Do NOT ask: "Would you pay for X?", "What if I offered...", or "What features would you want?" These are Mom Test violations.

What to look for after 15-20 conversations

  • Problems mentioned by 5+ people independently
  • Workarounds people have built (strongest signal)
  • Where money is already being spent
  • Emotional language ("frustrating," "nightmare," "we gave up")

If you find a pattern, do 5 more targeted conversations with people who have that exact problem. Offer to help them manually first (concierge MVP). If they actually follow through on scheduling, the demand is real.


What the Evidence Says So Far

The 176 course reviews are almost useless for product discovery. They're 9-10/10 compliments. Per The Mom Test, compliments are the most dangerous form of data.

The Discord (22K lines of real conversations) is where the signal lives. The analysis combined keyword search with a full semantic pass via Gemini. Here are the consolidated patterns, ranked by evidence strength.

1. Error analysis and annotation tooling is broken (Very High signal)

The loudest signal across both methods. People understand evals are important, but the manual work is so painful it creates resistance and shortcuts.

The pain:

  • chiiz8724: "open coding takes so long. im done with 15 in like 45 mins 🥲"
  • anubha_39005: "creating a dataset and going through the data manually for annotating is the biggest friction in the process. People are reluctant to do this because of the manual efforts."
  • davidh5633: "The resources involved of the development team, product team, and expert annotators throughout the company is quite something."

The workarounds (people spending real time building solutions):

  • Hamel tells students: "I personally use one of the vendors we discuss in this class as a backend database and build a custom annotation interface bespoke to each use case."
  • therobertta built a custom annotation tool and tried to organize a working group around it
  • chiiz8724 built a tool that "color codes axial coding in the trace" with toggleable codes
  • Isaac Flath (TA): "We always end up either using Excel, or creating a custom annotation app because there's unique things for each use case"
  • Teresa runs evals using Airtable + Jupyter in dev, AWS Step Functions/Lambda in prod

People are cobbling together Airtable, spreadsheets, custom FastHTML apps, Jupyter notebooks, and Phoenix. No clean solution exists. Meanwhile, PMs and domain experts (who are critical to error analysis) can't use any of these tools. Teresa: "I'm hesitant to get my non-technical folks using the eval tools that I have seen so far."
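
For a sense of how small these bespoke tools tend to be, here is a minimal sketch of the pattern: a terminal open-coding pass over exported traces that appends labels to a CSV for later axial coding. The file layout and trace fields are assumptions for illustration, not anyone's actual setup.

```python
# Minimal open-coding loop over exported traces (hypothetical file layout).
# Reads traces from a JSONL export, prompts the reviewer for a free-text
# failure code, and appends the annotation to a CSV for later axial coding.
import csv
import json
from pathlib import Path

TRACES = Path("traces.jsonl")        # assumed export: one {"id", "input", "output"} object per line
ANNOTATIONS = Path("annotations.csv")

def already_labeled() -> set[str]:
    """Trace ids that already have an annotation, so reruns skip them."""
    if not ANNOTATIONS.exists():
        return set()
    with ANNOTATIONS.open() as f:
        return {row["trace_id"] for row in csv.DictReader(f)}

def main() -> None:
    done = already_labeled()
    write_header = not ANNOTATIONS.exists()
    with ANNOTATIONS.open("a", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=["trace_id", "code", "notes"])
        if write_header:
            writer.writeheader()
        for line in TRACES.read_text().splitlines():
            trace = json.loads(line)
            if str(trace["id"]) in done:
                continue
            print(f"\n--- trace {trace['id']} ---")
            print("INPUT: ", trace["input"])
            print("OUTPUT:", trace["output"])
            code = input("open code (blank = looks fine, q = quit): ").strip()
            if code.lower() == "q":
                break
            writer.writerow({"trace_id": str(trace["id"]), "code": code, "notes": ""})

if __name__ == "__main__":
    main()
```

A FastHTML or Airtable front-end is the same idea with a friendlier surface for non-technical annotators; the gap is that nothing like this exists off the shelf for each team's trace format.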

2. Organizational resistance and undefined ownership (High signal)

This is a people problem, not a technical one. Multiple threads reveal two distinct sub-problems:

Nobody knows who owns evals:

  • erincode.org_64293: "My org seems to be playing 'hot potato' with whether Engineering or Product or Data or the subject matter expert team are accountable."
  • wayde_bt: "Hardest thing in my life has been getting people who say they care about the AI product I'm building to actually care enough to be involved in error analysis."
  • heyadel_: Struggling to get "PMs and domain experts to actually carve out time to help me annotate."

Engineers resist the process:

  • 907resident is getting pushback for introducing inter-annotator alignment. Multiple people echoed it: "I've encountered the same thing. This could almost just be a workshop in itself."
  • A consultant's client has been "stuck in classic whack-a-mole" for months. The lead developer: "If I got 5 people to look at what the AI was doing wrong, I'd get 5 different answers."
  • nate1363: "Like testsuites in traditional software development, it takes some pain/friction for people to see the benefit."

Context: 42% of companies are now abandoning GenAI projects (up from 17% the prior year, per Economist/S&P data shared in Discord). Alumni are going back to organizations where AI projects are failing.

3. Agentic and multi-turn evaluation is a distinct harder problem (High signal)

As products shift from single-turn to agentic, the eval problem changes qualitatively. Alumni who mastered single-turn evals are hitting new walls.

  • jakelevine: "How do you generate multiturn conversations using the synthetic data approach you suggest?"
  • strickvlzenml: "When instrumenting one's codebase, how do you avoid code being overtaken by all the extra code gunk that tracing tools add?"
  • mrbendi25 built a multi-agent system and "new issues have emerged, each agent has a specific tool it's expected to call before producing the final output" (see the sketch after this list).
  • intellectronica (TA): "Simulating longer, multi-turn interactions is harder. If you can start from real interactions that's especially useful."
  • Hamel: "A big problem is succumbing to an architecture astronaut mentality. Start radically simple and increase complexity when you have justified it."
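
mrbendi25's failure mode (each agent must call its expected tool before producing the final output) is the kind of property a cheap code-based check can catch before any LLM judge gets involved. Below is a hedged sketch; the trace schema and the agent/tool names are assumptions, not any particular tracing vendor's format.

```python
# Code-based eval for a multi-agent trace: assert each agent called its
# expected tool before producing a final output. Trace schema is assumed.
EXPECTED_TOOLS = {
    "retriever_agent": "search_documents",
    "billing_agent": "lookup_account",
}

def check_tool_usage(trace: list[dict]) -> list[str]:
    """Return a list of human-readable failures (empty list = pass)."""
    failures = []
    for step in trace:  # each step: {"agent": str, "tool_calls": [str], "final": bool}
        expected = EXPECTED_TOOLS.get(step["agent"])
        if expected is None:
            continue
        if step["final"] and expected not in step["tool_calls"]:
            failures.append(
                f"{step['agent']} produced a final answer without calling {expected}"
            )
    return failures

# Example usage against a single trace pulled from logs:
trace = [
    {"agent": "retriever_agent", "tool_calls": ["search_documents"], "final": False},
    {"agent": "billing_agent", "tool_calls": [], "final": True},
]
for failure in check_tool_usage(trace):
    print("FAIL:", failure)  # -> billing_agent produced a final answer without calling lookup_account
```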

4. Enterprise security and PII block the entire workflow (Medium-High signal)

Not a niche concern. Appears across healthcare, finance, and enterprise contexts.

  • kindlydefeated (consultant): "Third party products like Langsmith, Braintrust, and similar are definite no-gos in just about every client I've worked with."
  • mrbendi25: "Has anyone implemented tracing in an agentic system where users may expose personal information such as authentication PINs?"
  • zabirauf: "If (due to privacy reason) we can't look at the user data, are there other approaches?"
  • heyadel_: "The only thing we can collect is our employee's internal use of our own AI products."

Workarounds: self-hosted Phoenix, LangSmith masking, LLM-based PII scrubbing, synthetic equivalents. All partial.
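
As one illustration of the scrubbing style of workaround, here is a hedged sketch that redacts obvious PII from a trace before it reaches any third-party tool, with plain regexes standing in for an LLM-based pass. The patterns and field names are assumptions and would miss plenty in practice, which is exactly why these workarounds stay partial.

```python
# Scrub obvious PII from a trace payload before exporting it to a third-party
# observability tool. Regex patterns here are illustrative and incomplete; an
# LLM-based scrubbing pass would replace or follow this step.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PIN": re.compile(r"\b(?:PIN|pin)\s*[:=]?\s*\d{4,6}\b"),
}

def scrub(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text

def scrub_trace(trace: dict) -> dict:
    """Return a copy of the trace with string fields scrubbed."""
    return {k: scrub(v) if isinstance(v, str) else v for k, v in trace.items()}

print(scrub_trace({
    "id": "t-123",
    "input": "My pin: 4821 and my email is jane@example.com",
    "output": "Thanks, Jane.",
}))
# -> {'id': 't-123', 'input': 'My [REDACTED_PIN] and my email is [REDACTED_EMAIL]', 'output': 'Thanks, Jane.'}
```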

5. LLM-as-Judge cost anxiety (Medium signal)

People want the rigor of LLM judges but worry about the bill. This didn't show up in keyword searches because people don't say "struggling." They say "seems expensive."

  • davidh5633: "The LLM Judge evals seem to have the highest cost in terms of human involvement up-front and on-going maintenance."
  • wise_dove_44803: "agree until an enterprise customer sees their cloud / LLM provider bill"
  • labdmitriy: "Isn't LLM-only classification expensive way to find such failure?"

Workarounds: prompt caching, cheaper models, code-based evals where possible, batch APIs at 50% discount.
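
A hedged sketch of the "code-based evals where possible" workaround: run cheap deterministic checks first and send only the ambiguous traces to an LLM judge. The specific checks and the judge stub are assumptions for illustration; the exact provider API is beside the point.

```python
# Tiered evaluation: cheap code-based checks first, LLM judge only when needed.
# llm_judge() is a placeholder for whatever provider/model/batch API you use.
def code_checks(trace: dict) -> str | None:
    """Return a failure reason if a deterministic check fails, else None."""
    output = trace["output"]
    if not output.strip():
        return "empty response"
    if len(output) > 4000:
        return "response exceeds length limit"
    if "I cannot help with that" in output and trace.get("expected_answerable", True):
        return "refused an answerable question"
    return None

def llm_judge(trace: dict) -> str:
    """Placeholder: call a (cheaper) judge model, ideally batched and cached."""
    raise NotImplementedError

def evaluate(traces: list[dict]) -> dict[str, str]:
    verdicts = {}
    needs_judge = []
    for trace in traces:
        failure = code_checks(trace)
        if failure:
            verdicts[trace["id"]] = f"fail ({failure})"  # no LLM spend at all
        else:
            needs_judge.append(trace)
    for trace in needs_judge:                            # only the remainder hits the judge
        verdicts[trace["id"]] = llm_judge(trace)
    return verdicts
```

Routing the remaining traces through a batch API at the 50% discount mentioned above, plus prompt caching, shaves the judge cost further.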


Hypothesis Ranking

These are starting points for Mom Test conversations, not conclusions.

| Rank | Hypothesis | Evidence | Workarounds? | Pay Signal |
|------|------------|----------|--------------|------------|
| 1 | Annotation/error analysis tooling | Very High | Many custom builds | Significant engineering time spent |
| 2 | Evals implementation consulting | High | Informal (Discord, "benevolent dictator") | Bryce York bought course twice, enrolled whole org |
| 3 | Agentic/multi-turn eval tooling | High | Partial custom tracing | Growing as agents go mainstream |
| 4 | Enterprise PII/security-compliant eval platform | Medium-High | Self-hosted Phoenix, masking | Compliance budgets exist |
| 5 | LLM-as-Judge cost optimization | Medium | Caching, cheaper models, batch APIs | Would pay to spend less |
| 6 | Ongoing structured post-course support | Medium | Free Discord | Needs Mom Test validation |