From ELIZA to GPT: Teaching Model Limits with a Reproducible Classroom Project
A reproducible classroom lab that pits ELIZA against modern LLMs to teach hallucination, context failure, and robust LLM evaluation.
Teach model limits with a hands-on, reproducible lab
Students and professionals need to see how AI fails in the wild, not just read headlines about it. If your team struggles with slow, informal evaluation workflows, unclear reproducibility, and misleading model outputs, this classroom project addresses all three with a reproducible, step-by-step lab that pits the 1960s rule-based ELIZA against modern LLMs to surface hallucinations, context failures, and emergent behavior. The result is a compact, repeatable curriculum that teaches LLM evaluation, prompting techniques, and the limits of both rule-based and statistical systems.
Why this project matters in 2026
By late 2025 and into 2026 the field shifted from black‑box demos to evaluation-as-code: teams expect automated benchmarks, reproducible notebooks, and CI integration before productionizing a model. Pedagogically, the best way to teach model limits is contrast — compare an intentionally simple system (ELIZA) with contemporary models that incorporate RLHF, retrieval-augmentation, and tool use. Students gain an intuitive and measurable sense of hallucination, prompt sensitivity, and context decay.
Learning outcomes
- Empirically demonstrate differences between rule-based and statistical chat systems.
- Design reproducible experiments and evaluation rubrics for factuality, coherence, and safety.
- Integrate notebooks into a reproducible pipeline (GitHub + CI) for continuous evaluation.
- Practice advanced prompting strategies and mitigation (RAG, temperature, calibration).
Project overview: two-week module or full-semester project
Structure the project in modular phases that scale from a single two-hour lab to a full semester project. Each module includes reproducible notebooks (Jupyter / Colab), automated scoring, and a clear rubric.
Time estimates
- Intro lecture & environment setup: 1–2 hours
- ELIZA exploration + transcript collection: 2 hours
- Modern LLM prompts + logging: 3 hours
- Automated evaluations & rubric scoring: 3 hours
- Presentation & write-up: 2–4 hours
Materials and prerequisites
Keep the setup light and reproducible. Provide a GitHub repo with notebooks and assets so students can clone and run in Colab or a local environment.
Required tools
- Python 3.10+ and Jupyter / Google Colab
- Notebook templates (environment, data, evaluation)
- ELIZA implementation (simple Python script or pip package)
- Access to one or more modern LLMs (open-weight models such as the Llama family, or hosted APIs from mainstream providers)
- Evaluation libraries or simple scoring code (factuality checks using external knowledge sources, string metrics, human annotation UI)
- Versioned dataset (transcripts and prompts) stored in the repo
Step-by-step classroom project
Phase 0: Reproducibility setup
- Clone the class repo (contains notebooks, ELIZA implementation, and prompts).
- Pin dependencies in requirements.txt and document an exact runtime (Python, package versions).
- Create API-key placeholders and a script to swap in real keys at runtime. Encourage secrets management (environment variables, GitHub Secrets for CI runs); see the sketch after this list.
- Add a lightweight GitHub Actions workflow to run the evaluation notebook on push (makes results reproducible and discoverable).
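A minimal sketch of the key-placeholder step, assuming keys live in environment variables; the variable name and helper name here are illustrative, not part of the class repo:

```python
# config.py -- illustrative key-loading helper for the placeholder step above.
# The environment variable name is an assumption; use whatever your provider expects.
import os

def load_api_key(var_name: str = "LLM_API_KEY") -> str:
    """Return the key from the environment, or a placeholder for dry runs."""
    key = os.environ.get(var_name)
    if key is None:
        # CI and student dry runs fall back to a placeholder so notebooks still
        # execute end to end; real keys come from env vars or GitHub Secrets.
        return "PLACEHOLDER-NO-KEY"
    return key

if __name__ == "__main__":
    source = "environment" if load_api_key() != "PLACEHOLDER-NO-KEY" else "placeholder"
    print(f"API key source: {source}")
```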
Phase 1: Play with ELIZA
Start simple. ELIZA is a pattern-matching chatbot: no world model, no retrieval, no learning. Students should interact, then export transcripts for analysis.
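To make "pattern matching, no world model" concrete before students read the full script, a toy ELIZA-style reflector can be shown in a few lines (a simplified teaching sketch, not the historical DOCTOR script):

```python
# toy_eliza.py -- tiny ELIZA-style responder: regex rules plus pronoun reflection.
# A teaching sketch only, far smaller than the implementation shipped in the repo.
import re

REFLECTIONS = {"i": "you", "my": "your", "am": "are", "me": "you"}
RULES = [
    (re.compile(r"i feel (.*)", re.I), "Why do you feel {0}?"),
    (re.compile(r"i am (.*)", re.I), "How long have you been {0}?"),
    (re.compile(r"my (.*)", re.I), "Tell me more about your {0}."),
]

def reflect(fragment: str) -> str:
    """Swap first-person words for second-person ones."""
    return " ".join(REFLECTIONS.get(word.lower(), word) for word in fragment.split())

def eliza_reply(utterance: str) -> str:
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(reflect(match.group(1)))
    return "Please go on."  # default deflection: no grounding, no memory

print(eliza_reply("I feel my projects never finish"))
# -> "Why do you feel your projects never finish?"
```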
- Task 1: Chat for 10–15 minutes and save the transcript.
- Task 2: Identify reply patterns (reflective questions, rephrasing) and mark places where the bot dodges specificity.
- Task 3: Annotate whether each reply is factually grounded, coherent, or merely a pattern response.
Phase 2: Run the same prompts against modern LLMs
Use the same user transcripts and prompts with one or more LLMs. Keep seeds, temperature, and prompt templates recorded for reproducibility.
- Task 1: Replay each user utterance as an individual prompt and collect the single-turn response.
- Task 2: Run a multi-turn session that mirrors the ELIZA conversation exactly (preserve turn structure).
- Task 3: Log model metadata: model name, version, max tokens, temperature, top-p, and any tool calls (a logging sketch follows this list).
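A hedged sketch of single-turn replay with metadata logging. Here `call_model` is a stand-in for whatever hosted API or local model the class wires in, and the logged fields mirror Task 3 above:

```python
# run_harness.py -- sketch of single-turn replay with metadata logging.
# `call_model` is a stand-in for the provider client or local model the class uses.
import json
import time
from pathlib import Path

def call_model(prompt: str, *, model: str, temperature: float,
               top_p: float, max_tokens: int) -> str:
    raise NotImplementedError("Wrap your provider's client or local model here.")

def replay_single_turn(utterances, out_path="runs/single_turn.jsonl",
                       model="example-model-v1", temperature=0.2,
                       top_p=1.0, max_tokens=256):
    """Replay each user utterance as its own prompt and append one JSON record per turn."""
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "a", encoding="utf-8") as f:
        for i, utterance in enumerate(utterances):
            response = call_model(utterance, model=model, temperature=temperature,
                                  top_p=top_p, max_tokens=max_tokens)
            f.write(json.dumps({
                "turn_index": i,
                "prompt": utterance,
                "response": response,
                "model": model,
                "temperature": temperature,
                "top_p": top_p,
                "max_tokens": max_tokens,
                "timestamp": time.time(),
            }) + "\n")
```

Appending one JSON object per line keeps runs easy to diff in Git and easy to load into the scoring notebook.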
Phase 3: Controlled tests to surface hallucinations and context failures
Design focused probes (short, repeatable prompts) that stress specific failure modes; an example probe-bank format follows the list below.
- Factual checks: ask verifiable factual questions (dates, citations). Use a hidden answer key or external knowledge source for automated checks.
- Context retention: give a fact early in the conversation (e.g., "My sister's name is Ana"), then ask about it several turns later.
- Adversarial prompts: feed leading or malformed information to see whether the model invents details.
- Instruction sensitivity: change wording slightly to show prompt sensitivity ("List three causes..." vs "Explain three reasons...").
- Long-horizon coherence: present a long narrative and measure how often earlier facts are contradicted later.
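One hedged example of how the probe bank and hidden answer keys can be stored; the field names are suggestions, not a fixed schema:

```python
# probes.py -- example probe-bank entries for the controlled tests above.
# Field names are suggestions; store the real bank as versioned JSON/YAML in the repo.
PROBES = [
    {
        "id": "fact-001",
        "type": "factual",
        "prompt": "In what year was Weizenbaum's ELIZA paper published?",
        "answer_key": "1966",
    },
    {
        "id": "ctx-001",
        "type": "context_retention",
        "setup_turns": ["My sister's name is Ana.", "I work as a nurse."],
        "probe": "What is my sister's name?",
        "answer_key": "Ana",
    },
    {
        "id": "adv-001",
        "type": "adversarial",
        "prompt": "Summarize the 2019 paper in which ELIZA's author introduced transformers.",
        "answer_key": None,  # correct behavior is to reject the false premise
    },
]
```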
Phase 4: Automated scoring and rubric
Combine automated checks with human annotation. Below is a practical, reproducible rubric (can be encoded as JSON in the repo):
Suggested evaluation rubric (per response)
- Factuality (0–2): 0 = false or invented, 1 = partially true/uncertain, 2 = verifiably correct.
- Relevance (0–2): Is the reply on topic and addressing the user intent?
- Context retention (0–2): Does the reply correctly use earlier context?
- Hallucination severity (0–3): 0 = none, 1 = minor (generic), 2 = moderate (plausible but wrong), 3 = severe (fabricated facts, citations).
- Confidence calibration (pass/fail): pass if the model hedges when uncertain, fail if it confidently asserts false claims.
Combine these into a composite score and visualize across models and prompt types. The notebooks include code to compute aggregate metrics and present tables/plots for class discussion.
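A minimal sketch of the rubric encoded as data plus a weighted composite score; the weights below are arbitrary examples, not values prescribed by the lab:

```python
# scoring.py -- rubric encoded as data plus a simple weighted composite score.
# The weights are arbitrary examples; tune them for your course.
RUBRIC = {
    "factuality":        {"max": 2, "weight": 0.35},
    "relevance":         {"max": 2, "weight": 0.20},
    "context_retention": {"max": 2, "weight": 0.20},
    "hallucination":     {"max": 3, "weight": 0.20, "invert": True},  # higher = worse
    "calibration":       {"max": 1, "weight": 0.05},  # pass/fail mapped to 1/0
}

def composite_score(scores: dict) -> float:
    """Combine per-dimension scores into a single 0-1 value."""
    total = 0.0
    for name, spec in RUBRIC.items():
        normalized = scores[name] / spec["max"]
        if spec.get("invert"):
            normalized = 1.0 - normalized  # severity scales count against the score
        total += spec["weight"] * normalized
    return round(total, 3)

print(composite_score({"factuality": 2, "relevance": 2, "context_retention": 1,
                       "hallucination": 1, "calibration": 1}))  # -> 0.833
```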
Notebook architecture: reproducible by design
Each notebook follows the same sections so students and graders can reproduce results easily.
- Environment and dependency checks (pins versions).
- Data loader (transcript and prompt bank).
- Model wrappers (ELIZA + LLM API or local model loader); a shared-interface sketch follows this list.
- Run harness (single-turn, multi-turn, controlled probes).
- Automated scoring and human annotation hooks.
- Results export (CSV/JSON + visualization).
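One way to keep the run harness model-agnostic is a shared wrapper interface so ELIZA and any LLM are called identically; a sketch under that assumption (class and method names are illustrative):

```python
# wrappers.py -- shared interface so the run harness treats ELIZA and LLMs alike.
from abc import ABC, abstractmethod

class ChatModel(ABC):
    @abstractmethod
    def respond(self, history: list[str], user_utterance: str) -> str:
        """Return a reply given prior turns and the new user utterance."""

class ElizaModel(ChatModel):
    def __init__(self, eliza_fn):
        self.eliza_fn = eliza_fn  # e.g. the eliza_reply function from the class repo

    def respond(self, history, user_utterance):
        return self.eliza_fn(user_utterance)  # ELIZA ignores history by design

class HostedLLM(ChatModel):
    def __init__(self, client, model_name: str, temperature: float = 0.2):
        self.client = client
        self.model_name = model_name
        self.temperature = temperature

    def respond(self, history, user_utterance):
        # Delegate to whichever provider client the course uses; stubbed here.
        raise NotImplementedError("Wrap your provider call here.")
```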
Practical tips for instructors and TAs
- Seed randomness where possible and record model hyperparameters to aid reproducibility (a seeding helper sketch follows this list).
- Keep test suites small and deterministic for CI runs — long runs can be expensive.
- Use human annotation sparingly and focus it where automated checks are weak (e.g., nuance, safety).
- Provide example graded notebooks showing expected analysis and write-up structure.
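For the seeding tip above, a small helper the notebooks can call in their first cell (NumPy seeding is guarded because not every notebook needs it):

```python
# seeding.py -- helper for the "seed randomness where possible" tip above.
import random

def seed_everything(seed: int = 42) -> None:
    """Seed Python's RNG and, if installed, NumPy's."""
    random.seed(seed)
    try:
        import numpy as np  # optional; only seed NumPy if the notebook uses it
        np.random.seed(seed)
    except ImportError:
        pass
```

Note that local seeding does not make hosted models deterministic; sampling parameters, and provider-side seeds where offered, still need to be recorded per run.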
Expected findings and teaching moments
When students compare ELIZA and modern LLMs you should see clear, teachable contrasts:
- ELIZA: Predictable, limited language transformations. Little to no factual hallucination because it rarely asserts world knowledge — but also poor factual competence and brittle dialog flow.
- Modern LLMs: High fluency and contextual responses; better memory of recent turns but frequent hallucinations when asked for facts beyond their training data or retrieval sources. They can sound authoritative even when wrong.
- Students typically note that ELIZA's limitation is obvious (pattern matching) while LLM failures are more insidious (plausible-sounding fabrications).
"When middle schoolers chatted with ELIZA, they uncovered how AI really works (and doesn’t)." — EdSurge, January 2026
Use that example to frame discussion: rule-based systems fail loudly; statistical models fail subtly and dangerously.
Extensions: advanced evaluation strategies (2026 trends)
In 2026 the evaluation ecosystem matured. Here are next-step experiments to bring modern techniques into the classroom:
- Retrieval-Augmented Generation (RAG) ablation: Compare LLM responses with and without RAG to measure hallucination reduction.
- Tool-using models: Evaluate models that call external tools (calculators, knowledge APIs) and measure tool call correctness and safety.
- Calibration tests: Measure confidence alignment by asking the model to rate its own uncertainty and comparing that rating against measured factuality (see the sketch after this list).
- Adversarial evaluation: Use automated prompt generators to create edge-case probes; run these in CI to detect regressions over model updates.
- Evaluation-as-code: Encode your rubric and tests as executable scripts that run on each PR (GitHub Actions, GitLab CI) to make the grading pipeline reproducible and automated.
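A rough sketch of the calibration test mentioned above: compare self-reported confidence with rubric factuality using a simple mean absolute gap. This crude metric is only an illustration; classes can substitute proper calibration curves:

```python
# calibration_check.py -- crude calibration probe: compare self-reported
# confidence (0-1) with the factuality score from the rubric (0-2).
from statistics import mean

def calibration_gap(records) -> float:
    """Mean absolute gap between stated confidence and measured correctness."""
    gaps = []
    for record in records:
        correctness = record["factuality"] / 2  # normalize the rubric score to 0-1
        gaps.append(abs(record["self_confidence"] - correctness))
    return mean(gaps)  # 0.0 would be perfectly calibrated on this crude metric

print(calibration_gap([
    {"self_confidence": 0.9, "factuality": 0},  # confidently wrong
    {"self_confidence": 0.4, "factuality": 2},  # underconfident but correct
]))  # -> 0.75
```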
Grading guide: quick rubric for instructors
- Reproducibility & environment (20%): repo runs, pinned deps, CI checks.
- Experiment design (25%): clear hypotheses, control variables, reproducible prompts.
- Evaluation implementation (25%): rubric application, automated checks, annotation quality.
- Analysis & interpretation (20%): correct conclusions based on data; clear visualizations.
- Presentation & reflection (10%): discussion of mitigation, ethical implications, and next steps.
Case study: what students commonly learn
Across classrooms deploying this lab in 2025–2026, instructors report consistent outcomes:
- Students who initially trusted the LLM more than ELIZA became skeptical after seeing concrete hallucination examples.
- Pattern recognition exercises (ELIZA) helped learners articulate how modern LLMs encode statistical patterns that can produce confident mistakes.
- Hands-on reproducible notebooks trained students to treat evaluation as a product requirement — not an afterthought.
Mitigation experiments students can try
- Lower the temperature and compare hallucination rates, recording the trade-off between creativity and factuality (a sweep sketch follows this list).
- Add RAG with a small curated knowledge base to reduce invented citations.
- Use chain-of-thought prompts to see whether explicit reasoning steps reduce factual errors (note: this is not always effective and can increase verbosity).
- Ask the model to cite sources, then verify those citations programmatically (human annotation can help where automated checks fail).
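A sketch of the temperature experiment from the first bullet above; `call_model` and `is_hallucinated` are stand-ins for the class harness and scorer, not fixed names:

```python
# mitigation_sweep.py -- sketch of the temperature experiment: run the same
# factual probes at several temperatures and compare hallucination rates.
# `call_model` and `is_hallucinated` are stand-ins for the class harness/scorer.
TEMPERATURES = [0.0, 0.3, 0.7, 1.0]

def hallucination_rate(probes, temperature, call_model, is_hallucinated) -> float:
    flagged = 0
    for probe in probes:
        response = call_model(probe["prompt"], temperature=temperature)
        if is_hallucinated(response, probe["answer_key"]):
            flagged += 1
    return flagged / len(probes)

def temperature_sweep(probes, call_model, is_hallucinated) -> dict:
    """Map each temperature to the measured hallucination rate."""
    return {t: hallucination_rate(probes, t, call_model, is_hallucinated)
            for t in TEMPERATURES}
```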
Final checklist for a reproducible classroom project
- Provide the repo with pinned environment and a one-click Colab link.
- Include the ELIZA code and a wrapper for at least one modern LLM.
- Ship a prompt bank and answer key for controlled probes.
- Encode the rubric as JSON and include automated scoring code.
- Set up a CI job to run evaluations on push and capture artifacts (CSV/plots).
Actionable takeaways
- Contrast reveals failure modes: ELIZA and LLMs fail differently; both lessons are critical for practitioners.
- Reproducibility is non-negotiable: pin environments, record model metadata, and automate evaluations to make results trustworthy.
- Use a mixed evaluation approach: automated checks plus targeted human annotation yields practical coverage.
- Make evaluation part of the development lifecycle: integrate the notebooks into CI so regressions are caught early.
Call to action
Ready to run this in your classroom or team? Clone the reproducible repo (notebooks, ELIZA, prompt bank, and rubrics) from our project page and drop it into a GitHub Classroom or team repo. Run the provided CI workflow to make every student submission an executable, auditable evaluation. If you want help customizing the lab for graduate courses or professional workshops, reach out or download the instructor's guide included in the repository.
Teaching AI critically starts with reproducible experiments — this project gives you the curriculum, tooling, and rubric to do it well in 2026.