
Engineering MusicFirst Assistant: Structured AI for Lesson Planning

How Avatar engineered the MusicFirst Assistant using Symfony, OpenAI’s JSON Schema, and evaluation pipelines to deliver 5× faster lesson creation, 95% accuracy, and 62% lower costs compared to earlier implementations.

Client: MusicFirst
Industry: Education
Date: 8/20/2025

When MusicFirst approached us, their team wanted to explore how artificial intelligence could reduce the time teachers spend preparing lesson content while ensuring quality and reliability. The challenge wasn’t just building an AI integration — it was creating one that worked consistently across a multi-tenant SaaS environment, gave educators full control, and scaled securely for thousands of classrooms.

Technical Approach

Our team integrated OpenAI’s GPT models into the MusicFirst Classroom platform using the openai-php/symfony package. This allowed us to extend MusicFirst’s Symfony-based application with a service layer that could handle AI-powered requests for lesson materials, tasks, assessments, and rubrics.
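
As a sketch of that pattern (the class name and wiring here are illustrative, not MusicFirst's production code), a Symfony service can wrap the client that the openai-php bundle registers in the container:

```php
<?php

namespace App\Service;

use OpenAI\Contracts\ClientContract;

// Hypothetical service-layer wrapper around the injected OpenAI client.
final class LessonContentGenerator
{
    public function __construct(private readonly ClientContract $openAi)
    {
    }

    /** Generate a first draft of lesson material for a given topic. */
    public function generate(string $contentType, string $topic): string
    {
        $response = $this->openAi->chat()->create([
            'model' => 'gpt-4o-mini',
            'messages' => [
                ['role' => 'system', 'content' => "You draft {$contentType} content for music teachers."],
                ['role' => 'user', 'content' => $topic],
            ],
        ]);

        return $response->choices[0]->message->content;
    }
}
```

Controllers and workflow code then depend on this service rather than on the OpenAI client directly, which keeps prompt and model choices in one place.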

Key decisions included:

1. Structured Outputs with JSON Schema


One of the first hurdles was response reliability. For example, if a teacher requested a 10-question quiz, the AI might return 8 questions, or occasionally just 1. To solve this, we implemented OpenAI’s json_schema response type. By defining schema rules (e.g., the questions array must contain exactly 10 items), we forced the model to output valid, structured data every time.

This dramatically reduced errors and made results predictable enough to plug directly into MusicFirst’s grading and task management workflows.
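
For illustration, a request using this approach might look like the following. The field names in the schema are assumptions for the example, not MusicFirst's production schema, and exactly which JSON Schema keywords the API enforces depends on the model and API version:

```php
// $messages holds the chat messages, as in the earlier service sketch.
// Pin the questions array to exactly ten items via minItems/maxItems.
$response = $this->openAi->chat()->create([
    'model' => 'gpt-4o-mini',
    'messages' => $messages,
    'response_format' => [
        'type' => 'json_schema',
        'json_schema' => [
            'name' => 'quiz',
            'strict' => true,
            'schema' => [
                'type' => 'object',
                'properties' => [
                    'questions' => [
                        'type' => 'array',
                        'minItems' => 10, // exactly 10: both bounds set
                        'maxItems' => 10,
                        'items' => [
                            'type' => 'object',
                            'properties' => [
                                'prompt' => ['type' => 'string'],
                                'answer' => ['type' => 'string'],
                            ],
                            'required' => ['prompt', 'answer'],
                            'additionalProperties' => false,
                        ],
                    ],
                ],
                'required' => ['questions'],
                'additionalProperties' => false,
            ],
        ],
    ],
]);

$quiz = json_decode($response->choices[0]->message->content, true);
```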

2. Evaluation Pipelines


We built a feedback loop using auto-graders and custom evaluation scripts. Each run was scored on criteria such as accuracy, completeness, and adherence to schema. Over successive iterations, we were able to improve accuracy rates from 86% to 95%.

Evaluations weren’t just for debugging — they became an internal metric system to validate whether prompt or schema changes were genuinely improving outcomes.
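
A minimal custom grader in that spirit might score a single generated quiz like this (the function and criteria names are ours; a real pipeline would also check answers against reference material):

```php
// Score one generated quiz on schema adherence and completeness.
function gradeQuiz(array $quiz, int $expectedCount): array
{
    $questions = $quiz['questions'] ?? null;

    $scores = [
        // Schema: did the response contain the structure we asked for?
        'schema' => is_array($questions) ? 1.0 : 0.0,
        // Completeness: right number of questions, none with an empty prompt.
        'completeness' => is_array($questions)
            && count($questions) === $expectedCount
            && array_filter($questions, fn ($q) => trim($q['prompt'] ?? '') === '') === []
            ? 1.0 : 0.0,
    ];

    $scores['overall'] = array_sum($scores) / count($scores);

    return $scores;
}
```

Averaging per-criterion scores across a batch of runs yields the kind of accuracy figure tracked above (86% rising to 95%).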

3. Metadata for Observability


Another challenge was separating responses from different environments (development, staging, production) and content types (lesson, assessment, rubric, task). Initially, logs were too noisy to analyze effectively.

We added metadata tagging to every AI request (e.g., environment: stage, content: assessment), which allowed us to filter logs, run precise evaluations, and spot issues. This also gave product managers clearer visibility into how the Assistant was performing in the field.
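
One lightweight way to picture this (the tag names follow the article; the logging call itself is an assumption about implementation) is structured log context on every request:

```php
// Attach environment and content-type tags to each AI request log entry.
$this->logger->info('assistant.request', [
    'environment' => $this->environment,   // e.g. "stage" or "prod"
    'content'     => 'assessment',         // lesson | assessment | rubric | task
    'model'       => 'gpt-4o-mini',
    'tokens'      => $response->usage->totalTokens,
]);
```

With tags like these in place, narrowing an evaluation run down to, say, staging-only assessment generations becomes a log query instead of manual triage.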

4. Text Prompts vs. Structured Prompts


The first version of MusicFirst Assistant used plain text prompts with gpt-3.5-turbo. While functional, the outputs were inconsistent and more costly to run at scale. By moving to structured prompts via JSON Schema and upgrading to gpt-4o-mini, we gained both accuracy and efficiency.

Token usage analysis showed that this change reduced costs by more than 60% while ensuring outputs matched the exact formats teachers required.

5. Performance Considerations


While schema enforcement improved accuracy, it also increased response time for larger outputs. For example, generating more than 10 matching-type questions could cause timeouts. Our solution was to cap auto-generated assessments at 10 questions, which balanced performance with classroom usefulness.

This trade-off shows a key lesson in AI integration: sometimes limiting scope is the best way to improve reliability.
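
The cap itself can be a one-line guard before the request is built (variable names here are illustrative):

```php
// Clamp auto-generated assessments to the 10-question performance cap.
$questionCount = min($requestedCount, 10);
```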


Results in the Real World

Teachers using MusicFirst Assistant now have access to a powerful, optional feature that can cut down lesson prep time dramatically. Instead of manually drafting every task or quiz, teachers can provide a topic or objective, and the Assistant handles the “first draft.” Educators remain in full control—editing, customizing, and tailoring materials to their classroom needs.

5× faster lesson creation
95% accuracy
62% lower cost


Lessons Learned


  • Structure is everything. Using JSON Schema shifted validation from prompts to code, making outputs far more dependable.

  • Evaluation drives iteration. Auto and custom graders helped us refine prompts with measurable improvements instead of guesswork.

  • Metadata matters. Observability at the log level gave us control over a complex, multi-tenant AI deployment.

  • Performance requires trade-offs. Schema validation added overhead, but careful limits kept the system practical for classroom use.

  • Structured prompts beat plain text. Moving from free-form text prompts to JSON Schema with gpt-4o-mini improved both accuracy and cost-efficiency.

Why It Matters

For ed-tech providers, school administrators, or SaaS entrepreneurs, the MusicFirst Assistant demonstrates how AI can be embedded into existing platforms responsibly. Instead of bolting on a chatbot, we engineered a scalable, structured, and evaluable system that aligned with educators’ needs.

If you’re considering AI features in your own platform, the key takeaway is this: focus as much on structure, evaluation, and observability as you do on the model itself.

"In my opinion, THIS is an ideal way for music teachers to use AI tools in their instruction."
James Frankel
Founder & Director, MusicFirst