
Auch, Maximilian ; Balluff, Maximilian ; Mandl, Peter ; Wolff, Christian

Is It Professional or Exploratory? Classifying Repositories Through README Analysis

Auch, Maximilian, Balluff, Maximilian, Mandl, Peter and Wolff, Christian (2025) Is It Professional or Exploratory? Classifying Repositories Through README Analysis. In: Mannion, Mike and Mannisto, Tomi and Maciaszek, Leszek, (eds.) Proceedings of the 20th International Conference on Evaluation of Novel Approaches to Software Engineering April 4-6, 2025, in Porto, Portugal. Scitepress, pp. 457-467. ISBN 978-989-758-742-9.

Publication date of this full text: 29 Jul 2025 04:43
Book chapter
DOI for citing this document: 10.5283/epub.77446


Abstract


This study introduces a new approach to determining whether GitHub repositories are professional or exploratory by analyzing their README.md files. We crawled and manually labeled a dataset of over 200 repositories to evaluate various classification methods. We compared state-of-the-art Large Language Models (LLMs) against traditional Natural Language Processing (NLP) techniques, including term frequency similarity and word embedding-based nearest-neighbors using RoBERTa. The results demonstrate the advantages of LLMs on this classification task. With zero-shot classification without multi-step reasoning, GPT4o had the highest overall accuracy. Few-shot learning showed mixed results across models. Llama 3 (70b) achieved 89.5% accuracy when using multi-step reasoning, though such improvements were not consistent across all models. Our experiments with word probability threshold filtering also showed mixed results. Our findings highlight important considerations regarding the balance between accuracy, processing speed, and operational costs. For time-critical applications, we found that direct prompts without multi-step reasoning provide the most efficient approach, while model size made a smaller contribution. Overall, README.md content proved sufficient for accurate classification in approximately 70% of cases.
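As a rough illustration of the term frequency similarity baseline named in the abstract, the following minimal sketch labels a README with the class of its most similar labeled example, using cosine similarity over raw word counts. It is not the authors' implementation: the tokenization, the 1-nearest-neighbor rule, the function names, and the two hand-written example texts are all assumptions made for this sketch.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase the text, split it into alphanumeric tokens, and count them."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def classify_readme(readme, labeled_examples):
    """Return the label of the most similar labeled README (1-nearest-neighbor)."""
    query = term_frequencies(readme)
    best_label, best_score = None, -1.0
    for text, label in labeled_examples:
        score = cosine_similarity(query, term_frequencies(text))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, hand-written examples; NOT taken from the paper's dataset.
EXAMPLES = [
    ("Installation guide, API documentation, contributing guidelines, "
     "license and release notes.", "professional"),
    ("My homework for the course, just testing some ideas and playing around.",
     "exploratory"),
]

if __name__ == "__main__":
    print(classify_readme("See the documentation and license before "
                          "contributing a release.", EXAMPLES))
```

A real baseline would need far more labeled examples and would likely weight terms (for example with inverse document frequency) rather than rely on raw counts.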





Details

Document type: Book chapter
ISBN: 978-989-758-742-9
Book title: Proceedings of the 20th International Conference on Evaluation of Novel Approaches to Software Engineering April 4-6, 2025, in Porto, Portugal
Publisher: Scitepress
Page range: pp. 457-467
Date: 2025
Institutions: Sprach- und Literatur- und Kulturwissenschaften > Institut für Information und Medien, Sprache und Kultur (I:IMSK) > Lehrstuhl für Medieninformatik (Prof. Dr. Christian Wolff)
Informatik und Data Science > Fachbereich Menschzentrierte Informatik > Lehrstuhl für Medieninformatik (Prof. Dr. Christian Wolff)
Identification number: 10.5220/0013272500003928 (DOI)
Keywords: Classification, LLM, README, Zero-Shot, Few-Shot
Dewey Decimal Classification: 000 Computer science, information science, general works > 004 Computer science
Status: Published
Peer-reviewed: Yes, this version has been peer-reviewed
Created at the University of Regensburg: In part
URN of the Regensburg University Library: urn:nbn:de:bvb:355-epub-774466
Document ID: 77446
