CodeCraft

An adaptive coding education platform that decides where AI belongs, and where it doesn't

Capstone thesis · NYU · Two-semester project · Solo design + research

The problem with AI in education right now

Every coding education product is racing to bolt ChatGPT onto its interface and call it personalization. The result is a generation of tools that feel like a chatbot in a wrapper: generative answers to every prompt, no architecture, no taste, and no judgment about when AI should stay out of the way.

I built CodeCraft to argue something different: AI in learning products earns its place by deferring to humans wherever a human-authored answer is better. Not the other way around.

The product is a web application that teaches CS to college freshmen through adaptive challenges, hybrid feedback, and a mastery-based progression model. But the case study underneath it is really about one decision: where AI belongs in a learning loop, and where rules, humans, and structure beat it every time.

What 11 interviews told me

I spent the first phase of the project on user research: eleven semi-structured interviews with NYU CS freshmen and one professor who teaches the intro courses. I went in expecting to find what most CS-ed research already says: students are overwhelmed by abstraction, debugging is frustrating, syntax is a wall.

All of that showed up. But the finding that reshaped the project was different.

45% of students said ChatGPT was actively helping them learn. Another 27% said yes-and-no. The instinct in academia is to treat this as a crisis: students cheating, skills atrophying, the end of real learning. The instinct in edtech is to chase it: wrap GPT-4, ship fast, call it adaptive.

I didn't want to do either. The students weren't using ChatGPT to skip thinking. They were using it to break down concepts, generate practice problems, and debug code in ways their TAs didn't have time for. The problem wasn't AI in learning; it was unstructured AI in learning, with no scaffolding, no metacognition, and no connection to what they actually needed to master.

That became the design thesis: build the AI-mediated learning experience students were already trying to assemble for themselves, but with the structure their classrooms weren't giving them.

The decision that shaped everything: hybrid AI

The single most important architectural choice in CodeCraft is also the least visible one. When a student submits a wrong answer, the system has to decide how to respond. Most products would route everything through an LLM and let it generate feedback on the fly. I didn't.

CodeCraft uses rule-based detection for anything deterministic: syntax errors, compiler issues, and predictable misconceptions like using = instead of ==. These have correct, knowable answers. Generative AI would only introduce variance and risk hallucinated explanations for things that already have a right answer.
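
As a rough illustration (not CodeCraft's actual code), a rule of this kind could look like the Python sketch below. The regexes, the function name check_rules, and the feedback wording are all assumptions made for the example, and the pattern only handles simple C-style conditions without nested parentheses.

```python
import re
from typing import Optional

# Hypothetical patterns: a parenthesized if/while condition (no nested
# parentheses), and a lone "=" that is not part of ==, !=, <=, or >=.
CONDITION = re.compile(r"\b(?:if|while)\s*\(([^()]*)\)")
LONE_EQUALS = re.compile(r"(?<![=!<>])=(?!=)")

# Fixed, human-written feedback: the rule never generates text.
ASSIGNMENT_FEEDBACK = (
    "You used = (assignment) inside a condition. Use == to compare two values."
)


def check_rules(source: str) -> Optional[str]:
    """Return the canned feedback if the known misconception appears, else None."""
    for match in CONDITION.finditer(source):
        if LONE_EQUALS.search(match.group(1)):
            return ASSIGNMENT_FEEDBACK
    return None


if __name__ == "__main__":
    print(check_rules("if (x = 5) { y = 1; }"))
    print(check_rules("if (x == 5) { y = 1; }"))
```

The point of the sketch is that the same input always yields the same explanation, which is the property the rule layer is there to guarantee.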

CodeCraft uses AI as a feedback matcher, not a feedback generator, for the harder cases. When code is syntactically valid but logically wrong, the system embeds the submission and matches it against a database of past incorrect submissions, each tagged with human-authored feedback. The AI's job is similarity search, not authorship. The student gets a real explanation written by a human teacher, surfaced by AI that found the right one.
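
The matching step can be sketched in a few lines of Python. This is a toy under stated assumptions: the embed function below is a token-count stand-in for whatever embedding model a real system would use, and the library entries, the 0.5 threshold, and the name match_feedback are invented for the example.

```python
import math
import re
from collections import Counter
from typing import List, Optional, Tuple


def embed(code: str) -> Counter:
    """Toy stand-in for an embedding model: counts of identifiers, numbers, symbols."""
    return Counter(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical library: past incorrect submissions, each tagged with
# feedback written by a human teacher.
LIBRARY: List[Tuple[str, str]] = [
    (
        "for i in range(1, len(items)): total += items[i]",
        "Your loop starts at index 1, so the first element is skipped.",
    ),
    (
        "def fact(n): return n * fact(n - 1)",
        "This recursion has no base case, so it never stops.",
    ),
]


def match_feedback(
    submission: str, library: List[Tuple[str, str]] = LIBRARY, threshold: float = 0.5
) -> Optional[str]:
    """Return the human-authored feedback of the most similar past mistake.

    Returns None when nothing is similar enough, so the caller can fall back
    to a human instead of inventing an explanation.
    """
    query = embed(submission)
    best_score, best_feedback = max(
        (cosine(query, embed(code)), feedback) for code, feedback in library
    )
    return best_feedback if best_score >= threshold else None


if __name__ == "__main__":
    print(match_feedback("for k in range(1, len(nums)): s += nums[k]"))
```

Note what the function never does: it never writes a sentence. It only ranks existing ones, and it declines to answer below the threshold.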

This split matters because it answers the question every AI-native product has to answer: what does AI actually do here that humans or rules can't do better? Generating explanations is something humans do better. Finding the right human explanation for this specific student's mistake is something AI does better. The architecture follows the answer.

The same logic shapes the personalization engine. Difficulty doesn't adjust through vibes; it adjusts through a learner proficiency model that tracks skill mastery, error patterns, retry counts, and historical performance. The AI's job is to estimate where a student sits relative to their zone of proximal development (Vygotsky's term for what a learner can do with support), not to decide what they should learn.
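
One simple way to picture such a model is a per-skill mastery estimate that moves toward each new piece of evidence. The Python sketch below is an illustration only: the moving-average update, the retry discount, the 0.5 starting prior, and the names ProficiencyModel, update, and next_difficulty are assumptions, not the model CodeCraft specifies.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ProficiencyModel:
    """Per-skill mastery estimates in [0, 1], updated after every attempt."""

    mastery: Dict[str, float] = field(default_factory=dict)
    learning_rate: float = 0.3  # how fast new evidence overrides history

    def update(self, skill: str, correct: bool, retries: int) -> float:
        # A correct answer counts for less the more retries it took;
        # an incorrect answer counts as zero evidence of mastery.
        evidence = (1.0 if correct else 0.0) / (1 + retries)
        prior = self.mastery.get(skill, 0.5)  # unseen skills start in the middle
        self.mastery[skill] = prior + self.learning_rate * (evidence - prior)
        return self.mastery[skill]

    def next_difficulty(self, skill: str, stretch: float = 0.1) -> float:
        # Aim slightly above current mastery: hard enough to stretch,
        # close enough to be reachable.
        return min(1.0, self.mastery.get(skill, 0.5) + stretch)


if __name__ == "__main__":
    model = ProficiencyModel()
    model.update("loops", correct=True, retries=0)
    model.update("loops", correct=False, retries=2)
    print(model.mastery["loops"], model.next_difficulty("loops"))
```

The model only estimates and suggests a difficulty level. Which skills exist and what content sits at each level stays a human decision.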

How the product feels to use

Five features, one shared logic: AI handles inference, humans handle authorship, students see structure.

Growth Tracker: career as the organizing principle

Students set a target role (junior frontend developer, data engineer, ethical hacker) and the dashboard reorganizes around the skills required for that path. Mastery bars appear at the skill, sub-skill, and topic level. Suggested next steps pull from refresher content, debug challenges, and assessments.

The research finding behind this: students kept asking, "Why am I learning this?" Coursework felt disconnected from work. The Growth Tracker doesn't fix the curriculum; it reframes it. Same content, contextual structure.

Assessments with hybrid feedback

The page where the architecture becomes visible. Students answer questions and get immediate feedback that names the mistake, explains why it's wrong, and points to concepts to revisit. The mastery bar updates. The proficiency model recalibrates in the background.

Every part of this screen is doing something specific: the rule engine caught the error type, the matcher found the closest historical mistake, the template assembled the explanation, and the model logged what this means for future questions.

CSphere: peer learning, structured

A discussion forum, but pinned to the topic the student is in. The research showed something contradictory: students said collaborative coding wasn't helpful (54%), but personalized help was indispensable. Reading between those numbers, they wanted to receive help, not give it. CSphere splits the difference. Posts are visible across the cohort, but threaded inside topics, so help arrives in context instead of as a generic forum dump.

Reflections: metacognition with prompts that aren't generic

Five prompt types, each tied to a different metacognitive function: confidence check (2-1-1), role-based reflection, mastery forecast, effort vs. outcome, and a brain dump zone. The AI reads these reflections and uses them as a signal for content tailoring: if a student writes that they're shaky on recursion, the next session weights toward recursion practice.

This is one of the few places AI is doing something close to "generation," and it's still bounded: it reads reflections to update the proficiency model rather than generating reflections for the student.
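
To make "reflection as a signal" concrete, here is a deliberately naive Python sketch. It assumes simple keyword spotting in place of whatever language model reads the reflections, and the cue list, the 2x boost, and the name reweight are hypothetical.

```python
from typing import Dict

# Hypothetical cue words signalling low confidence in a reflection.
UNCERTAINTY_CUES = ("shaky", "confused", "lost", "unsure", "struggl")


def reweight(
    weights: Dict[str, float], reflection: str, boost: float = 2.0
) -> Dict[str, float]:
    """Boost topics the student mentions alongside an uncertainty cue.

    Returns normalized practice weights that sum to 1. Nothing is generated;
    the reflection only shifts how existing content is sampled.
    """
    text = reflection.lower()
    uncertain = any(cue in text for cue in UNCERTAINTY_CUES)
    raw = {
        topic: weight * (boost if uncertain and topic.lower() in text else 1.0)
        for topic, weight in weights.items()
    }
    total = sum(raw.values())
    if total == 0:
        return dict(weights)  # nothing to normalize; leave weights unchanged
    return {topic: weight / total for topic, weight in raw.items()}


if __name__ == "__main__":
    session = reweight(
        {"recursion": 1.0, "loops": 1.0}, "I'm still shaky on recursion."
    )
    print(session)
```

With equal starting weights, the example shifts the next session to roughly two-thirds recursion and one-third loops.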

Debug My Life: bridging classroom and industry

Practice problems disguised as bug tickets. "A mobile app is loading slowly, help profile the sorting method." The student steps into the role of a junior dev triaging a real ticket, with real code, real performance constraints. The research finding here was the loudest one in the interviews: students wanted reasons to care about coding problems. Real-life relatability beat every other engagement lever.

What testing changed

Six participants, think-aloud protocol, ten Nielsen heuristics scored, full task completion tracked. The product passed on most heuristics (aesthetic and minimalist design scored highest at 4.8) but failed hard on user control and freedom (2.3) and recognition rather than recall (3.5).

The fixes were specific:

Explainability for AI-driven recommendations. When the dashboard said "You need to work on nested loops," users questioned why. I added the reasoning inline: "This recommendation is based on your recent performance in loop-related questions, where you showed difficulty with nested logic." The fix isn't decorative; it addresses the exact transparency problem every AI product has to solve, and shipping it inline made the AI feel like a teacher instead of a black box.
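
One way to keep a recommendation and its reasoning from drifting apart is to store the evidence with the recommendation and render both together. The Python sketch below illustrates that idea; the field names and the added counts in the message are assumptions, not the prototype's data model or copy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    """A recommendation that cannot be rendered without its evidence."""

    skill: str             # what the student should work on
    topic: str             # the question family the evidence comes from
    recent_attempts: int   # how many recent questions were considered
    recent_incorrect: int  # how many of those were answered incorrectly

    def render(self) -> str:
        return (
            f"You need to work on {self.skill}. "
            f"This recommendation is based on your recent performance in "
            f"{self.topic} questions ({self.recent_incorrect} of your last "
            f"{self.recent_attempts} answered incorrectly)."
        )


if __name__ == "__main__":
    rec = Recommendation("nested loops", "loop-related", 5, 3)
    print(rec.render())
```

Because the evidence fields are required, a recommendation with no stated reason cannot be constructed in the first place.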

Surfacing Debug My Life on the course homepage. Users actively searched for it. Real-world challenges were the highest-praised feature in interviews, but the IA buried them. Fix: promoted to the main course view alongside Practice and Reflection.

Back navigation everywhere. Multiple users got stuck in Reflections, Growth Tracker, and Refresher Content with no way out. Basic fix, easy to miss when you're the one who built the IA.

The pattern across all the fixes: I'd over-trusted my own mental model of the system. Testing surfaced the gap between "I know how this works" and "a stranger can find their way through it."

What I'd build differently next time

Three things, in order of how much they'd change the product.

Test the AI assumptions, not just the UI. I tested whether students could navigate the interface. I didn't test whether the rule engine catches the right errors, whether the feedback matcher returns the closest mistake, whether the proficiency model converges. The hybrid architecture is the most interesting part of CodeCraft and the least validated. A v2 starts with offline evals on the AI components before any UI testing.
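
An offline eval for the feedback matcher could start as small as the Python sketch below: a labeled set of submissions, each paired with the feedback a teacher says is correct, scored by top-1 accuracy. The function name, the stub matcher, and the sample cases are placeholders for illustration.

```python
from typing import Callable, Iterable, Optional, Tuple


def top1_accuracy(
    matcher: Callable[[str], Optional[str]],
    labeled_cases: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of submissions for which the matcher returns the expected feedback."""
    cases = list(labeled_cases)
    if not cases:
        raise ValueError("need at least one labeled case")
    hits = sum(1 for submission, expected in cases if matcher(submission) == expected)
    return hits / len(cases)


if __name__ == "__main__":
    # Stub matcher standing in for the real one: a fixed lookup table.
    stub = {"sub_a": "fb_1", "sub_b": "fb_2", "sub_c": "fb_9"}.get
    cases = [
        ("sub_a", "fb_1"),
        ("sub_b", "fb_2"),
        ("sub_c", "fb_3"),
        ("sub_d", "fb_4"),
    ]
    print(top1_accuracy(stub, cases))
```

The same harness shape would work for the rule engine (expected rule per submission), which makes it a cheap first step before any UI testing.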

Build the content engine, not just the content shell. CodeCraft has the structure for adaptive content but not the volume of content to make adaptation feel real. A real version needs a content authoring tool for educators (feedback templates, error pattern tags, refresher modules) that scales the human side of the hybrid system. The AI is only as good as the human-authored library it draws from.

Move metacognition out of a separate page. Reflections are siloed when they should be ambient. A v2 weaves the prompt into the moment (after a hard problem, mid-session, at the end of the week) instead of asking students to navigate to a Reflections page. Metacognition should feel like the product noticing, not a homework assignment.

What this project is really about

CodeCraft is an answer to a question I think most AI-native products are getting wrong: where does AI belong in the loop, and where should it stay out?

The honest answer is that most of the value comes from the parts that aren't AI: the proficiency model, the rule engine, the human-authored feedback library, the IA that makes career goals legible, and the metacognitive prompts that make students think about their own thinking. AI is the connective tissue that finds the right human-authored explanation for this specific student at this specific moment.

That's the version of AI-native design I want to keep building. Not generation everywhere, but judgment about where generation earns its place.

CodeCraft, built as the capstone for the M.A. in Learning Technology and Experience Design at NYU Steinhardt. Two semesters, eleven research interviews, six usability test participants, eight concept testers, one final prototype. Full Figma file and thesis available on request.
