Opportunities, Challenges, and a Shared Research Agenda
A full-day workshop @ the Festival of Learning 2026
We define small language models (SLMs) as open-weight language models with fewer than 10 billion parameters that can be deployed locally or hosted cheaply. As more model providers release smaller, more efficient models, and as the research community develops better techniques for fine-tuning and deploying them, SLMs are emerging as a promising alternative to large language models (LLMs) for educational applications.
While LLMs have demonstrated impressive natural language processing capabilities, their adoption in education is constrained by privacy concerns, high inference costs, and proprietary restrictions. Because SLMs can be fine-tuned efficiently on high-quality, domain-specific data, they offer a more scalable and accessible path to integrating AI into education.
This full-day workshop aims to explore the potential of SLMs in education. By bringing together researchers from the AIED, EDM, and L@S communities, we hope to foster interdisciplinary collaboration and identify new research directions that leverage the unique strengths of SLMs to improve learning outcomes.
We invite researchers, educators, and industry professionals from the AIED, EDM, and L@S communities (as well as sister communities such as Learning Analytics, Educational Data Science, and Human-Computer Interaction) to submit their original, unpublished work on small language models in educational settings. We are looking for submissions that address, but are not limited to, the following themes:
Accepted submissions will be presented as Lightning Talks (5 minutes + Q&A) or as Posters at the workshop. Attendees will also actively participate in cross-community speed dating and collaborative synthesis sessions to draft a white paper outlining a shared research agenda for SLMs in education.
To allow workshop participants to take advantage of the early-bird registration deadline (May 3, 2026) for the Festival of Learning, we have implemented a two-round submission and review process. The first round has a shorter turnaround so that acceptance notifications go out before the early-bird registration deadline. Interested participants will, however, have until the second-round deadline to submit their papers for consideration for presentation at the workshop.
Please note that a paper not accepted in the first round cannot be resubmitted to the second round, even with substantial revisions. However, we still encourage the authors to attend the workshop and participate in the discussions and activities to draft the joint white paper.
All deadlines are 11:59 PM Anywhere on Earth (AoE).
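For authors converting the deadline to their own time zone, the following is a minimal sketch that assumes AoE is treated as a fixed UTC-12 offset. The helper name `aoe_deadline` and the date used are placeholders for illustration only; they are not actual workshop deadlines.

```python
from datetime import date, datetime, time, timedelta, timezone

# Anywhere on Earth (AoE) corresponds to a fixed UTC-12 offset.
AOE = timezone(timedelta(hours=-12), name="AoE")
# The workshop schedule is given in UTC+9.
WORKSHOP_TZ = timezone(timedelta(hours=9), name="UTC+9")


def aoe_deadline(day: date) -> datetime:
    """Return 11:59 PM AoE on the given calendar day as an aware datetime."""
    return datetime.combine(day, time(23, 59), tzinfo=AOE)


# Placeholder date for illustration only; it is NOT an actual workshop deadline.
example = aoe_deadline(date(2026, 1, 1))

print(example.astimezone(timezone.utc).isoformat())  # 2026-01-02T11:59:00+00:00
print(example.astimezone(WORKSHOP_TZ).isoformat())   # 2026-01-02T20:59:00+09:00
```

In other words, a deadline of 11:59 PM AoE falls at 11:59 UTC on the following calendar day.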
We invite submissions for research papers (5-7 pages, excluding appendices and references) that describe ongoing research, recent advancements, or challenges related to SLMs in education. If accepted, a paper will be presented as either a Lightning Talk or a Poster at the workshop.
All accepted papers will be compiled into the workshop proceedings, which will be submitted to CEUR-WS.org for online publication. Furthermore, all workshop participants will be invited to co-author a white paper outlining a shared research agenda for SLMs in education, which will be submitted to a relevant journal (e.g., IJAIED or JEDM).
The workshop consists of a "context & connection" morning phase and a "consensus & output" afternoon phase. The goal is to stimulate the interchange of ideas across communities, identify current opportunities and challenges, and synthesize a shared research agenda for SLMs in education.
| Time (UTC+9) | Activity |
|---|---|
| 09:00 – 09:15 | Welcome & Opening: Introduction to SLMs in education and workshop goals. |
| 09:15 – 10:15 | Morning Talks: 10-min talk + 5-min Q&A. |
| 10:15 – 10:45 | Coffee/Tea Break & Networking |
| 10:45 – 11:45 | Cross-Community Speed Dating: Rotational 10-minute conversations to mix diverse perspectives. |
| 11:45 – 13:00 | Lunch Break & Networking |
| 13:00 – 14:00 | Afternoon Talks: 10-min talk + 5-min Q&A. |
| 14:00 – 15:15 | Small-Group Synthesis: Groups of 4-5 synthesize the discussions. |
| 15:15 – 15:45 | Coffee/Tea Break & Networking |
| 15:45 – 16:45 | Plenary Synthesis: Groups share findings to outline the white paper. |
| 16:45 – 17:00 | Closing Remarks |
The following papers have been accepted for presentation at the workshop.
Systematic reviews traditionally have taken considerable amounts of human time and energy to complete, in part due to the extensive number of titles and abstracts that must be reviewed for potential inclusion. Recently, researchers have begun to explore how to use large language models (LLMs) to make this process more efficient. However, research to date has shown inconsistent results. We posit this is because prompting alone may not provide sufficient context for the model(s) to perform well. In this study, we fine-tune a small 1.2-billion-parameter open-weight LLM specifically for study screening in the context of a systematic review in which humans rated more than 8,500 titles and abstracts for potential inclusion. Our results showed strong performance improvements from the fine-tuned model, with the weighted F1 score improving by 80.79% compared to the base model. When run on the full dataset of 8,277 studies, the fine-tuned model had 86.40% agreement with the human coder, a 91.18% true positive rate, an 86.38% true negative rate, and perfect agreement across multiple inference runs. Taken together, our results show that there is promise for fine-tuning LLMs for title and abstract screening in large-scale systematic reviews.
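The agreement, true positive rate, and true negative rate reported above are standard confusion-matrix quantities. The following is a minimal sketch of how they can be computed from paired include/exclude decisions; the function name `screening_metrics` and the toy labels are assumptions for illustration and are not the authors' code or data.

```python
from typing import NamedTuple, Sequence


class ScreeningMetrics(NamedTuple):
    agreement: float
    true_positive_rate: float
    true_negative_rate: float


def screening_metrics(human: Sequence[bool], model: Sequence[bool]) -> ScreeningMetrics:
    """Compare model screening decisions against human ones (True = include)."""
    if len(human) != len(model) or len(human) == 0:
        raise ValueError("human and model must be non-empty and of equal length")

    pairs = list(zip(human, model))
    tp = sum(1 for h, m in pairs if h and m)          # both include
    tn = sum(1 for h, m in pairs if not h and not m)  # both exclude
    fn = sum(1 for h, m in pairs if h and not m)      # model misses an include
    fp = sum(1 for h, m in pairs if not h and m)      # model over-includes

    positives = tp + fn
    negatives = tn + fp
    return ScreeningMetrics(
        agreement=(tp + tn) / len(pairs),
        # A rate is undefined (NaN) when its class is absent from the human labels.
        true_positive_rate=tp / positives if positives else float("nan"),
        true_negative_rate=tn / negatives if negatives else float("nan"),
    )


# Toy labels for illustration only; these are NOT data from the study.
human = [True, True, True, False, False, False, False, False]
model = [True, True, False, False, False, False, False, True]
metrics = screening_metrics(human, model)
print(metrics)
```

Overall agreement is the prevalence-weighted average of the two rates, so when included studies are rare it sits close to the true negative rate.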
Competitive debating is a non-formal educational activity that strengthens critical thinking, argumentation, and persuasive communication. Yet providing structured and consistent feedback remains difficult because speeches are delivered orally and adjudicators must evaluate them under time constraints. We present a rubric-aligned feedback framework for Arabic competitive debates that combines task-specific argument mining with an open-weight 9B Arabic small language model. Building on the Munazarat corpus and its hybrid annotation scheme, we convert segment-level predictions into interpretable speech-level indicators, including claim-support coverage, rebuttal density, repetition rate, and rhetorical balance. These indicators, together with selected transcript spans, are used to generate constrained, evidence-grounded feedback for adjudicators and coaches. To improve trustworthiness, we add a lightweight reliability gate that softens strong recommendations under low confidence, and we compute debate winners deterministically from the same structured signals to avoid inconsistent judgments. An expert study with debate coaches and adjudicators suggests that the framework can provide scalable and pedagogically aligned feedback, while also highlighting limitations in uncertainty communication.
Programming courses are an early and demanding test bed for educational language models: they require code generation, code reading, multi-step reasoning across files, fast feedback under classroom latency, and an honest separation between what the model produced and what the student understood. We argue that small language models (SLMs)—open-weight models below ten billion parameters that can run on a laptop, a classroom GPU, or even a browser tab—are well aligned with this setting, provided they are deployed as tool-augmented agents rather than monolithic oracles. We sketch a tool-augmented SLM architecture for an AI-assisted programming tutor that pairs a small, fine-tuned planner with deterministic tools (a code executor, a unit-test runner, a documentation retriever, and a knowledge-tracing module that drives adaptive feedback), and we connect this design to three contemporary threads: minimal from-scratch reproductions of GPT-style models that have made the SLM stack legible to students and instructors; the recent position that SLMs are sufficiently powerful and economically necessary for agentic systems; and in-browser inference engines that remove cloud-side dependence entirely. We outline a pilot evaluation aligned with prior work on knowledge-tracing for programming education and discuss the open questions around fine-tuning data, calibration of tool use, and equitable access that the SLM4ED community is well placed to answer.
Educational Question Generation (EQG) can reduce teacher workload and support adaptive learning, yet most existing systems lack reliable control over question difficulty grounded in pedagogical frameworks or depend on computationally expensive large language models that pose affordability and ownership complexities, limiting accessibility. This paper presents DiffiQG, a difficulty-controlled question generation framework that explicitly conditions generation on Bloom's Taxonomy using lightweight, fine-tuned transformer models. We benchmark multiple Bloom's classification strategies and identify a high-performing classifier (80.5% accuracy) to serve as an automatic evaluation oracle. Publicly available Bloom-annotated datasets are curated and augmented with generated instructional contexts to enable supervised conditional fine-tuning of Flan-T5 models. Experimental results show that a Flan-T5-Base model achieves strong difficulty control (91.3% Bloom alignment accuracy) while maintaining competitive linguistic quality. Higher-order cognitive levels are generated reliably, while mid-range levels remain challenging due to overlapping conceptual boundaries. Human evaluation confirms strong agreement with automatic difficulty metrics. Overall, this work demonstrates that pedagogically grounded difficulty-controlled EQG can be achieved without large-scale models, paving the way to scalable and privacy-preserving personalised tutoring and assessment generation.
Small language models (SLMs) have recently attracted attention in Artificial Intelligence in Education (AIED) because they can be locally deployed with lower computational cost and greater accessibility than large proprietary models. However, their educational value depends not only on efficiency, but also on whether they can provide pedagogically meaningful guidance during tutoring interactions. This exploratory study evaluates how three locally served SLMs (≤ 1.2B parameters), Gemma-3-1B-IT, LLaMA-3.2-1B-Instruct, and Qwen2.5-0.5B-Instruct, use an explanations-and-analogies pedagogical strategy during mathematics tutoring dialogues. Across 54 multi-turn sessions, the models were evaluated on explanation quality, analogy use, correctness, focus, usefulness, and interactional guidance. Results showed consistently limited pedagogical effectiveness across all models, with no interaction rated as fully satisfactory. At the same time, the models exhibited distinct pedagogical failure patterns. LLaMA frequently produced incoherent instructional trajectories, Qwen generated inconsistent or unstable explanations, and Gemma often produced fluent but conceptually misleading tutoring. The findings also suggest that conceptual mathematical tasks such as place-value reasoning may be less compatible with text-only tutoring than more procedural tasks, because they rely heavily on multimodal and visual-spatial scaffolding. Future research should explore more diverse instructional tasks, test additional pedagogical scaffolding to improve reliability, and refine evaluation metrics to better capture responsiveness and conceptual repair.
Large language models are increasingly explored as AI tutors, yet deploying them in K–12 settings raises concerns around privacy, cost, and reliance on proprietary models. Small language models (SLMs) offer a promising alternative, but selecting the right model for a specific educational context remains difficult, particularly when the target domain, such as block-based programming, is largely absent from model training data. We introduce CSTutorBench, a benchmark for evaluating language models as CS tutors in VEX VR, a block-based robotics environment for middle school students. The benchmark comprises 17 scenario-based questions scored against a pedagogical rubric grounded in established tutoring and feedback research, with an LLM-as-judge pipeline for automated evaluation. Preliminary findings across 11 models (4B–120B parameters) reveal that models perform well on surface-level criteria such as vocabulary and tone but struggle with deeper pedagogical behaviors, particularly avoiding answer leakage and engaging with student debugging histories. In our tests, the model family and the instruction-tuning approach predict tutoring quality far better than parameter count, and a targeted prompt revision grounded in educational prompt engineering research improved scores across all models. These results underscore the value of context-specific, pedagogically grounded benchmarks for SLM selection in educational deployment.
Intelligent tutoring systems depend on accurate models of student mastery over individual skills, or knowledge components. Knowledge tracing (KT) is the standard evaluation task: given a student's prior interactions, predict the correctness of their response to a new question. Most KT models represent each interaction as opaque identifiers, discarding the pedagogical signal in question text; recent text-aware methods restore this signal but at a steep cost, since raw tokens fill the context window so quickly that encoders fit only a handful of interactions, and decoders compensate by scaling to multi-billion-parameter backbones. We introduce ModernBERT-KT, which compresses each interaction to two embedding tokens via a frozen sentence encoder and small MLP adapters. This allows a 150M-parameter ModernBERT-base backbone, around 50x smaller than recent LLM-based KT methods, to process up to 4,000 interactions per student in its native 8,192-token window. ModernBERT-KT matches established ID-based and text-aware baselines on XES3G5M and FoundationalASSIST and generalises to questions unseen at training time through content embeddings.
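The context-window arithmetic behind this claim can be sketched as follows. The window size (8,192 tokens) and the two tokens per interaction come from the abstract; the helper `max_interactions` and the figure of 200 raw-text tokens per interaction are hypothetical values chosen only to show the contrast, and the calculation ignores any special or padding tokens.

```python
def max_interactions(context_window: int, tokens_per_interaction: int) -> int:
    """Upper bound on how many interactions fit in one context window."""
    if context_window <= 0 or tokens_per_interaction <= 0:
        raise ValueError("both arguments must be positive")
    return context_window // tokens_per_interaction


CONTEXT_WINDOW = 8192        # native window stated in the abstract
COMPRESSED_TOKENS = 2        # two embedding tokens per interaction (abstract)
RAW_TEXT_TOKENS = 200        # HYPOTHETICAL raw-text length per interaction

compressed_capacity = max_interactions(CONTEXT_WINDOW, COMPRESSED_TOKENS)
raw_capacity = max_interactions(CONTEXT_WINDOW, RAW_TEXT_TOKENS)

print(compressed_capacity)  # 4096
print(raw_capacity)         # 40
```

The bound of 4,096 is consistent with the stated capacity of up to 4,000 interactions per student.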
The deployment of Small Language Models (SLMs) in educational settings offers significant advantages in terms of privacy, cost, and scalability. However, SLMs often struggle with complex vision-based tasks, such as grading handwritten student exams, due to the high computational cost of processing large images and the visual distractions present on a full page. In this paper, we investigate whether cropping student responses using bounding boxes can improve the accuracy and computational efficiency of SLMs on a short-answer grading task. Using a dataset of scanned handwritten responses from the 2025 Australian Physics Olympiad, we evaluate the performance of several models ranging from 4B to 72B parameters under varying conditions of Chain of Thought (CoT) prompting and image cropping. Our results demonstrate that using bounding boxes significantly improves grading accuracy and reduces computational cost (FLOPs) across models, with the largest gains for the smallest models. We conclude that bounding boxes are a crucial pre-processing step for deploying SLMs in large-scale, vision-based educational assessments.