Theme Explorer: LLM-augmented Inductive Coding
In a previous post I shared work on developing a rigorous process for automating deductive coding with an LLM (GPT-4). Next, here’s a snapshot of where we’ve got to with LLM-augmented inductive coding, resulting in the Theme Explorer interactive web app.
This work comes out of the Australian Student Voices on AI in Higher Ed project, with the code development led by the brilliant Aneesha Bakharia at UQ and shaped by a multidisciplinary, cross-institutional team. We’ll present this next week at the LAK25 workshop From Data to Discovery: LLMs for Qualitative Analysis in Education. As the title and full-day program signal, we’re witnessing an explosion of interest in what it means to harness LLMs for qualitative research. Education is our specific interest, but it is just one of many domains, spanning the arts and social sciences as well as STEM.
Naturally, this extraordinary acceleration of coding, based on what machines “see” in a text, is regarded with great scepticism and concern by some in the qualitative data analysis (QDA) community, so I hope that we can convene productive dialogues.
For me, as ever, the exciting opportunity is to “augment human intellect” (thanks Doug Engelbart) by enabling analyses that would otherwise be impractical (in terms of human resources) or impossible (beyond human capability). Hence our design requirements were:
- Requirement 1: To maintain the integrity of coded textual extracts: (i) verify against the source data that quotes are verbatim and not hallucinated (sketched in code after this list), and (ii) verify that they are meaningfully classified under the assigned code.
- Requirement 2: To maintain the transparency of the coding: (i) explain the rationale for each code, and (ii) trace every code, whatever level of abstraction, back to its source data.
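To make Requirement 1(i) concrete, here is a minimal sketch of how LLM-returned quotes could be checked as verbatim against the source transcripts. This is not the project’s code (which isn’t yet released); the function names, record keys and the whitespace/quote normalisation are my own assumptions.

```python
import re

def normalise(text: str) -> str:
    """Collapse whitespace and straighten curly quotes so trivial formatting
    differences don't cause a genuinely verbatim quote to be rejected."""
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip().lower()

def is_verbatim(quote: str, transcript: str) -> bool:
    """Requirement 1(i): the quote must appear word-for-word in the source."""
    return normalise(quote) in normalise(transcript)

def verify_quotes(coded_quotes: list[dict], transcripts: dict[str, str]) -> list[dict]:
    """Flag any quote that cannot be found in its claimed source transcript."""
    flagged = []
    for q in coded_quotes:
        source = transcripts.get(q["transcript_id"], "")
        if not is_verbatim(q["quote"], source):
            flagged.append(q)  # candidate hallucination: route to human review
    return flagged
```

String matching only guards against hallucinated text; Requirement 1(ii), that a quote is meaningfully classified under its code, still calls for human judgement (or a further review pass) in the loop.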
This motivated a workflow:
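The workflow itself is shown in the figure above and detailed in the paper. Purely as a hedged illustration of what one of its early inductive-coding steps might look like, here is a single LLM call that asks for themes, each with a rationale, keywords and verbatim supporting quotes, returned as JSON so that later steps can verify and trace them. The prompt wording, the model name, the `extract_themes` function and the JSON keys are my assumptions, not the project’s.

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK; any chat-capable LLM would do

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """You are assisting with inductive qualitative coding.
Read the focus group transcript below and identify the main themes.
Return a JSON object with one key, "themes": a list of objects with keys
"theme", "rationale", "keywords", and "quotes" (verbatim extracts only,
copied exactly from the transcript).

Transcript:
{transcript}
"""

def extract_themes(transcript: str, model: str = "gpt-4o") -> list[dict]:
    """One inductive-coding pass over a single transcript."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(transcript=transcript)}],
        response_format={"type": "json_object"},  # ask for parseable output
    )
    return json.loads(response.choices[0].message.content)["themes"]
```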
Step 5 of the workflow generates an interactive Sankey Flow Diagram:
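For the Sankey view, an interactive library such as Plotly would do the job; the sketch below is only a guess at the sort of rendering code involved (the app’s actual implementation and libraries have not been released), fed with placeholder theme→category links.

```python
import plotly.graph_objects as go

# Placeholder data only: themes (left) flowing into top-level categories (right),
# with link width proportional to the number of supporting quotes.
labels = ["Theme A", "Theme B", "Theme C",        # themes
          "Category 1", "Category 2"]             # top-level categories
links = dict(
    source=[0, 1, 1, 2],   # index into labels (theme end of each flow)
    target=[3, 3, 4, 4],   # index into labels (category end of each flow)
    value=[12, 7, 9, 5],   # e.g. quote counts backing each theme→category link
)

fig = go.Figure(go.Sankey(node=dict(label=labels, pad=20), link=links))
fig.update_layout(title_text="Themes flowing into top-level categories")
fig.show()
```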
Below the Sankey diagram is an interface to support Requirement 2 traceability (a sketch of one possible record structure appears after the list below). Selecting a top-level category to explore displays:
- its rationale and keywords
- the themes from which it was derived
- quotes from each student transcript (fictional student names):
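Supporting that kind of drill-down essentially requires each top-level category to carry its provenance with it. Here is a minimal sketch of the sort of record structure that would make the traceability possible; the field names are my invention, not the released schema.

```python
from dataclasses import dataclass, field

@dataclass
class Quote:
    student: str        # pseudonym, as with the app's fictional student names
    transcript_id: str  # which source transcript the quote came from
    text: str           # verbatim extract, verified against the transcript

@dataclass
class Theme:
    label: str
    rationale: str
    keywords: list[str]
    quotes: list[Quote] = field(default_factory=list)

@dataclass
class Category:
    """Top-level category: everything needed to trace it back to source data."""
    label: str
    rationale: str
    keywords: list[str]
    themes: list[Theme] = field(default_factory=list)

    def all_quotes(self) -> list[Quote]:
        """Flatten the evidence trail: category -> themes -> verbatim quotes."""
        return [q for t in self.themes for q in t.quotes]
```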
We are working towards an open source release; in the meantime, enjoy the paper!
Aneesha Bakharia, Antonette Shibani, Lisa-Angelique Lim, Trish McCluskey and Simon Buckingham Shum (2025). From Transcripts to Themes: A Trustworthy Workflow for Qualitative Analysis Using Large Language Models. Proceedings of the LAK25 Workshop From Data to Discovery: LLMs for Qualitative Analysis in Education (Dublin, Ireland, 4 March 2025), 10 pages. https://ceur-ws.org [Open Access Eprint]
From Transcripts to Themes: A Trustworthy Workflow for Qualitative Analysis Using Large Language Models
Aneesha Bakharia, Antonette Shibani, Lisa-Angelique Lim, Trish McCluskey, Simon Buckingham Shum
We present a novel workflow that leverages Large Language Models (LLMs) to advance qualitative analysis within Learning Analytics, addressing the limitations of existing approaches that fall short in providing theme labels, hierarchical categorization, and supporting evidence, creating a gap in effective sensemaking of learner-generated data. Our approach uses LLMs for inductive analysis from open text, enabling the extraction and description of themes with supporting quotes and hierarchical categories. This trustworthy workflow allows for researcher review and input at every stage, ensuring traceability and verification, key requirements for qualitative analysis. Applied to a focus group dataset on student perspectives on generative AI in higher education, our method demonstrates that LLMs are able to effectively extract quotes and provide labeled interpretable themes compared to traditional topic modeling algorithms. Our proposed workflow provides comprehensive insights into learner behaviors and experiences and offers educators an additional lens to understand and categorize student-generated data according to deeper learning constructs, which can facilitate richer and more actionable insights for Learning Analytics.