Development of a Language Model for Classification and Analysis of Educational Content in Dentistry

Internship for Engineers/Master II Students: Development of a Language Model for Classification and Analysis of Educational Content in Dentistry

Context

Language models are increasingly used in educational applications, particularly in medicine and dentistry. However, general-purpose models such as GPT-4 can lack precision and adaptability when applied to specialized domains, and they raise concerns about intellectual property, data control, and privacy. Recently, smaller models such as CamemBERT-bio, trained on French medical corpora, have outperformed models such as BERT-7B at classifying medical terms, with computation times and carbon emissions up to 30 times lower (Touchent et al. 2024). These advances open the way to accessible, domain-specific language models that can run locally with limited resources.

This internship is part of a project aimed at training and evaluating specialized language models to classify, cluster, and organize educational content in dentistry, while supporting curriculum analysis and key concept extraction.

Project Objective

Evaluate how well a specialized language model can classify and cluster educational content (e.g., MCQs, glossaries, course chapters), extract key concepts, and organize this content hierarchically according to the competency levels defined by European directives. A comparative analysis of the models' environmental costs will be included in the evaluation metrics.

Project Steps

Classification and Clustering of Educational Content
- Perform exploratory data analysis (MCQs, glossaries, chapters).
- Use supervised classification or unsupervised clustering approaches to group content by themes or competency levels.
- Extract key concepts and identify the most relevant chapters for each theme.
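
As an illustration of the key concept extraction step above, the sketch below ranks the terms of each document by TF-IDF, a classical baseline that is not necessarily the method the project will retain. The corpus, the document names, and the stopword list are hypothetical placeholders, not project data.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus standing in for MCQs, glossary entries, and chapters.
DOCS = {
    "mcq_caries": "Dental caries results from enamel demineralization caused by bacterial acids",
    "glossary_pulp": "The dental pulp contains nerves and blood vessels inside the tooth",
    "chapter_perio": "Periodontitis destroys the periodontal ligament and alveolar bone around the tooth",
}

# Illustrative stopword list (an assumption; a real one would be much longer).
STOPWORDS = {"the", "and", "by", "from", "of", "a", "inside", "around",
             "contains", "results", "caused"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[^\W\d_]+", text.lower()) if t not in STOPWORDS]


def top_terms(docs, k=3):
    """Return the k highest-scoring TF-IDF terms for each document."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for toks in tokens.values() for term in set(toks))
    result = {}
    for name, toks in tokens.items():
        tf = Counter(toks)
        scores = {
            term: (count / len(toks)) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        }
        # Sort by descending score, then alphabetically for a stable order.
        result[name] = sorted(scores, key=lambda t: (-scores[t], t))[:k]
    return result


if __name__ == "__main__":
    for name, terms in top_terms(DOCS).items():
        print(name, terms)
```

Terms shared by several documents (here "dental" and "tooth") receive a lower score, so the terms that remain are those that distinguish one document from the others.
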
Training the Language Model
- Preprocess the data (segmentation, vectorization).
- Adapt existing models (CamemBERT-bio, Mistral) via fine-tuning or Retrieval-Augmented Generation (RAG).
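
The preprocessing and retrieval ideas above can be sketched as follows: a text is segmented into overlapping word windows, each segment is vectorized, and the segment closest to a query is retrieved, which corresponds to the retrieval half of RAG. This is a minimal sketch under stated assumptions: the window size, the overlap, and the bag-of-words vectors are illustrative choices, and a real pipeline would use embeddings from a model such as CamemBERT-bio instead of word counts.

```python
import math
import re
from collections import Counter


def segment(text, size=12, overlap=4):
    """Split text into overlapping word windows (size and overlap are illustrative)."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def vectorize(text):
    """Bag-of-words term counts, used here as a stand-in for a learned embedding."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    q = vectorize(query)
    return sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)[:k]


if __name__ == "__main__":
    # Hypothetical chapter excerpt, not project data.
    chapter = (
        "Fluoride strengthens enamel and reduces the risk of caries in children. "
        "The dental pulp contains nerves and blood vessels that supply the tooth."
    )
    chunks = segment(chapter)
    print(retrieve("What protects enamel against caries?", chunks))
```

The overlap keeps a sentence that straddles a window boundary present in at least one segment, at the cost of some duplicated text.
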
Performance and Impact Evaluation
- Calculate classification and clustering metrics (F1 score, precision, cluster coherence).
- Compare the performance of specialized models with general models like GPT-4.
- Evaluate the environmental impact of the models using indicators such as energy consumption during training and inference, measured with tools such as CodeCarbon or equivalents.
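
To make the classification metrics above concrete, the sketch below computes per-class precision, recall, and F1, then the macro-averaged F1, from two lists of labels. It also includes a back-of-the-envelope energy conversion (mean power times duration), which is a simplifying assumption and not how CodeCarbon measures consumption. The labels and function names are hypothetical; cluster coherence is not covered here.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_scores(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


def energy_kwh(mean_power_watts, duration_seconds):
    """Rough energy estimate: watts x seconds = joules, and 1 kWh = 3.6e6 J."""
    return mean_power_watts * duration_seconds / 3_600_000


if __name__ == "__main__":
    # Hypothetical theme labels for four pieces of content.
    y_true = ["caries", "caries", "perio", "endo"]
    y_pred = ["caries", "perio", "perio", "endo"]
    print(per_class_scores(y_true, y_pred))
    print(round(macro_f1(y_true, y_pred), 3))
    print(energy_kwh(300, 3600))
```

Macro averaging gives each theme the same weight regardless of how many items it contains, which matters when some themes are rare in the corpus.
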
Exploration of Explainability and Latent Representations
- Study the model's internal representations to understand the groupings made and justify the classifications.
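
One simple way to inspect groupings in a latent space, offered here as an assumption and not as the method the project prescribes, is to compute the centroid of each cluster and look at the item closest to it. The item names, the two-dimensional vectors, and the cluster labels below are hypothetical; real embeddings would come from the model and have many more dimensions.

```python
import math
from collections import defaultdict


def centroids(vectors, labels):
    """Return {cluster label: mean vector of the cluster}."""
    groups = defaultdict(list)
    for vec, label in zip(vectors, labels):
        groups[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }


def most_representative(items, vectors, labels):
    """For each cluster, return the item whose vector is closest to the centroid."""
    cents = centroids(vectors, labels)
    best = {}
    for item, vec, label in zip(items, vectors, labels):
        dist = math.dist(vec, cents[label])  # Euclidean distance
        if label not in best or dist < best[label][0]:
            best[label] = (dist, item)
    return {label: item for label, (_, item) in best.items()}


if __name__ == "__main__":
    # Hypothetical content items with toy 2-D "embeddings" and cluster labels.
    items = ["caries MCQ", "enamel glossary", "fluoride chapter",
             "gingivitis MCQ", "periodontitis chapter", "scaling glossary"]
    vectors = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
               [5.0, 5.0], [5.0, 6.0], [5.0, 10.0]]
    labels = [0, 0, 0, 1, 1, 1]
    print(most_representative(items, vectors, labels))
```

Reading the most central item of each cluster gives a quick, human-checkable hint about what the cluster represents before moving to heavier explainability methods.
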

Required Skills

- Knowledge of Natural Language Processing (NLP), machine learning, and clustering.
- Proficiency in Python and deep learning libraries (PyTorch, TensorFlow).
- Interest in environmental impact issues and educational data analysis.

Perspectives

This internship offers the opportunity to develop skills in AI applied to dental education and to contribute to pedagogical innovation. By including a reflection on environmental impact, it follows a responsible and sustainable approach, in line with current expectations in the technology and education sectors.

Contacts

Thomas Grenier (thomas.grenier@creatis.insa-lyon.fr)
Sébastien Valette (sebastien.valette@creatis.insa-lyon.fr)
Raphaël Richert (raphael.richert@insa-lyon.fr)

References

Touchent R, Romary L, de La Clergerie E. 2024. CamemBERT-bio: Leveraging Continual Pre-training for Cost-Effective Models on French Biomedical Data. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Main Conference Proceedings. p. 2692–2701.