Software development projects generate impressive amounts of data. Mining software repositories research aims to extract information from the various artifacts produced during the evolution of a software system and to infer the relationships between them. This course will introduce the methods and tools of mining software repositories and artifacts used by software developers and researchers. The course will be seminar-based and will involve weekly reading and discussion. The project component will be flexible but will likely involve some programming. For further details on the course content, please refer to the Course Outline (pdf). This course is offered by the School of Computer Science at Carleton University.
Seminars are held every Monday from 11:35 AM to 2:25 PM via Zoom Meeting (meeting details are posted on Discord).
Announcements
- Submit your paper review (due 11:59 PM every Sunday; latest by 11:00 AM on Monday).
- Please send me your paper selection list (3-5 papers) by Monday, September 15, 2025.
- We will be using Discord for course communication, news and reminders. Please join here.
- Welcome to COMP5117! Our seminars start on Monday, September 08, 2025.
Content Overview
The course will be adjusted according to students' interests and experience. This is an overview of the kinds of topics the course could cover:
- Mining software repositories
- AI for SE
- SE for AI
- Large language models (LLMs)
- Software development processes
- Software development tools and environments
- Software visualization
- Software maintenance and evolution
- Collaborative development
- Human aspects of software engineering
- Quantitative and qualitative evaluation of software engineering research
Tentative Schedule
It is important to note that this schedule is evolving and will change based on your interests and how the class is progressing.
Monday, September 8 - Introduction
- Introduction to the course.
Presented by Olga Baysal
Monday, September 15 - LLMs for code review.
- Combining Large Language Models with Static Analyzers for Code Review Generation by Imen Jaoua, Oussama Ben Sghaier, Houari Sahraoui. MSR 2025.
Presented by Paul Roode
- Too Noisy To Learn: Enhancing Data Quality for Code Review Comment Generation by Chunhua Liu, Hong Yi Lin, Patanamon Thongtanunam. MSR 2025.
Presented by Fengshou Xu
- How Effective are LLMs for Data Science Coding? A Controlled Experiment by Nathalia Nascimento, Everton Guimaraes, Sai Sanjna Chintakunta, Santhosh Anitha Boominathan. MSR 2025.
Presented by Victor Li
Monday, September 22 - Bugs and smells.
- It’s About Time: An Empirical Study of Date and Time Bugs in Open-Source Python Software by S. Tiwari, S. Chen, A. Joukov, P. Vandervelde, A. Li and R. Padhye. MSR 2025.
Presented by Jeremy Fang
- PyExamine: A Comprehensive, Un-Opinionated Smell Detection Tool for Python by K. Shivashankar and A. Martini. MSR 2025.
Presented by Yeisson Pinilla
- An Empirical Study on Leveraging Images in Automated Bug Report Reproduction by D. Wang, Z. Zhang, S. Feng, W. G. J. Halfond and T. Yu. MSR 2025.
Presented by Xingyu Liu
Monday, September 29 - Commit messages, code links and errors.
- Towards Understanding the Characteristics of Code Generation Errors Made by Large Language Models by Wang, Zhijie and Zhou, Zijie and Song, Da and Huang, Yuheng and Chen, Shengmai and Ma, Lei and Zhang, Tianyi. ICSE 2025.
Presented by Akshidha Unni
- An Empirical Study on Commit Message Generation using LLMs via In-Context Learning by Wu, Yifan and Wang, Yunpeng and Li, Ying and Tao, Wei and Yu, Siyu and Yang, Haowen and Jiang, Wei and Li, Jianguo. ICSE 2025.
Presented by Vineela Pulagam
- Do LLMs Provide Links to Code Similar to what they Generate? A Study with Gemini and Bing CoPilot by D. Bifolco, P. Cassieri, G. Scanniello, M. D. Penta and F. Zampetti. MSR 2025.
Presented by Qijun Han
Monday, October 6 - Security and vulnerabilities.
- Understanding the Response to Open-Source Dependency Abandonment in the npm Ecosystem by C. Miller, M. Jahanshahi, A. Mockus, B. Vasilescu and C. Kastner. ICSE 2025.
Presented by Eric Leblanc
- Leveraging Large Language Models to Detect npm Malicious Packages by N. Zahan, P. Burckhardt, M. Lysenko, F. Aboukhadijeh and L. Williams. ICSE 2025.
Presented by Arfath Khan
- Wolves in the Repository: A Software Engineering Analysis of the XZ Utils Supply Chain Attack by P. Przymus and T. Durieux. MSR 2025.
Presented by Fengshou Xu
Monday, October 13 - NO CLASS (Thanksgiving)
Monday, October 20 - NO CLASS (Fall break)
Monday, October 27 - Developers and teams.
- What Does a Software Engineer Look Like? Exploring Societal Stereotypes in LLMs by M. Bano, H. Gunatilake and R. Hoda. ICSE-SEIS 2025.
Presented by Kanchan Malviya
- Who's Pushing the Code? An Exploration of GitHub Impersonation by Zhang, Yueke and Liang, Anda and Wang, Xiaohan and Wisniewski, Pamela and Zhang, Fengwei and Leach, Kevin and Huang, Yu. ICSE 2025.
Presented by Eric Leblanc
- What Guides Our Choices? Modeling Developers' Trust and Behavioral Intentions Towards GenAI by Choudhuri, Rudrajit and Trinkenreich, Bianca and Pandita, Rahul and Kalliamvakou, Eirini and Steinmacher, Igor and Gerosa, Marco and Sanchez, Christopher and Sarma, Anita. ICSE 2025.
Presented by Akshidha Unni
Monday, November 3 - LLMs for manual annotation, testing, and library migration.
- Can LLMs Replace Manual Annotation of Software Engineering Artifacts? by Ahmed, Toufique and Devanbu, Premkumar and Treude, Christoph and Pradel, Michael. MSR 2025.
Presented by Qijun Han
- Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests by Deljouyi, Amirhossein and Koohestani, Roham and Izadi, Maliheh and Zaidman, Andy. ICSE 2025.
Presented by Vineela Pulagam
- Using LLMs for Library Migration by Mohayeminul Islam, Ajay Jha, May Mahmoud, Ildar Akhmetov, Sarah Nadi. Accepted to ASE 2025 (does not appear in proceedings just yet).
Presented by Xingyu Liu
Monday, November 10 - NO CLASS (I am away at a conference).
Monday, November 17 - Sustainability and energy.
- Energy Consumption Estimation of API-usage in Mobile Apps via Static Analysis by A. A. Bangash, K. Eng, Q. Jamal, K. Ali and A. Hindle. MSR 2023.
Presented by Jeremy Fang
- Language Models in Software Development Tasks: An Experimental Analysis of Energy and Accuracy by N. Alizadeh, B. Belchev, N. Saurabh, P. Kelbert and F. Castor. MSR 2025.
Presented by Kanchan Malviya
- AI-Powered, But Power-Hungry? Energy Efficiency of LLM-Generated Code by L. Solovyeva, S. Weidmann and F. Castor. Forge 2025.
Presented by Victor Li
Monday, November 24 - APIs and data retrieval.
- Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models by Zhang, Kunpeng and Wang, Shuai and Han, Jitao and Zhu, Xiaogang and Li, Xian and Wang, Shaohua and Wen, Sheng. ICSE 2025.
Presented by Arfath Khan
- APIstic: A Large Collection of OpenAPI Metrics by S. Serbout and C. Pautasso. MSR 2024.
Presented by Yeisson Pinilla
- Pneuma: Leveraging LLMs for Tabular Data Representation and Retrieval in an E2E System by Balaka, Muhammad Imam Luthfi and Alexander, David and Wang, Qiming and Gong, Yue and Krisnadhi, Adila and Castro Fernandez, Raul. ACM Management of Data 2025.
Presented by Paul Roode
Monday, December 1 - Project Presentations.
Evaluation
- Weekly paper reviews: 10%
- Class participation and discussion: 20%
- Paper presentation: 10%
- Course Project: 60% (10% project presentation + 50% project report)
Weekly Paper Reviews
Each week you are expected to carefully read two to three papers. In addition, you are to submit a review of one of the papers (you choose which one). However, if you are doing a paper presentation, then you are excused for that week.
Reviews are due by 11:00 AM on the morning of the class. Please email me your review with the subject "[COMP 5117] Paper Review Student_Name".
A review should be about 500-1000 words long (1.5-2 pages), and submitted as a PDF file.
Your review should address the following points:
- What were the primary contributions of the paper as the authors see them?
- What were the main contributions of the paper as you (the reader) see them?
- How does this work move the research forward (or how does the work apply to you)?
- How was the work validated?
- How could this research be extended?
- How could this research be applied in practice?
Class Participation
Each week you are expected to read all presented papers, as well as participate in the class discussion.
Paper Presentations
In a typical week, we will examine two or three research papers. Paper presentations will be done by students.
You will select three to five papers that you want to present, listed in order from your first to your last preference. Please make your selections from the proceedings of MSR 2025, ICSE 2025, or other conferences such as FSE 2025 and ASE 2025. Papers from earlier editions in the 2023-2025 range (MSR 2024, MSR 2023, ICSE 2024, ICSE 2023) are also acceptable. Once you have selected your papers, email me your list of three to five papers. This must be done by Monday, September 15. I will generate a cohesive class schedule once everyone has selected their papers. Each student will be assigned to present one or two papers in class, depending on the class size.
You are then to design a presentation of about 20-25 minutes that is both informative and entertaining. Don't feel limited to just the content of the papers.
You should also come prepared with a set of questions to foster a 15-20 minute discussion session that you will lead to follow the presentation (this is where the other students earn their class participation marks).
When you design your talk, keep in mind that the audience has already read the papers. Remind us of the motivation, the big ideas, the context of the problem being addressed, and how all of this relates to what we've already seen in the course.
Presentations can be prepared using Open Office, PowerPoint, Keynote, or PDF. You must share your slides (PDF only) by uploading them to the Discord server before or after your talk.
Course Project
The project forms an integral part of this course. The projects can be done individually or completed in groups of two students.
You need to come up with an idea of your own that relates to the course material. The project topic will require my approval (via the proposal).
There are three deliverables for your project:
- Project proposal. Before you undertake your project you will need to submit a proposal for approval. The proposal should be short (at most 2 pages, as a PDF in ACM format). The proposal should include a problem statement, the motivation for the project, and a set of objectives you aim to accomplish. I will read these and provide comments. The proposal is not for marks but must be completed in order to pass the course. This will be due on September 22 by 11:59 PM via email.
- Written report. The required length of the written report varies from project to project (8-10 pages, double-column format); all reports must be formatted according to the ACM format (LaTeX users can use the "sigconf" option: \documentclass[sigconf]{acmart} ) and submitted as a PDF. This report will constitute 100% of the project report grade. This will be due on December 12 by 11:59 PM via email.
- Project presentation. Each group will have the opportunity to present their project in class on December 5. This presentation should take the form of a 15-minute (hard maximum) conference-style talk and describe the motivation for your work, what you did, and what you found. If a demo is the best way to describe what you did, feel free to include one in the middle of the talk. Please allocate 3-5 minutes for questions after the project has been presented.
The proposed structure of your presentation:
- Introduction (describe the problem and motivation)
- Research questions
- Methodology: data collection, data cleanup, data mining, data analysis (statistics, machine learning), etc.
- Results (achieved, preliminary, or anticipated)
- Implications (why does this study matter? how can your findings be used?)
- Conclusion (summary, main contributions)
Contact
The best way to get in touch with me is via Discord or email: olga.baysal[at]carleton.ca.
