Auch, Maximilian; Balluff, Maximilian; Mandl, Peter; Wolff, Christian

Is It Professional or Exploratory? Classifying Repositories Through README Analysis

Auch, Maximilian, Balluff, Maximilian, Mandl, Peter and Wolff, Christian (2025) Is It Professional or Exploratory? Classifying Repositories Through README Analysis. In: Mannion, Mike and Mannisto, Tomi and Maciaszek, Leszek (eds.) Proceedings of the 20th International Conference on Evaluation of Novel Approaches to Software Engineering, April 4-6, 2025, in Porto, Portugal. Scitepress, pp. 457-467. ISBN 978-989-758-742-9.

Publication date of this full text: 29 Jul 2025 04:43
Book chapter
DOI for citing this document: 10.5283/epub.77446

Published version

License: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International

Abstract

This study introduces a new approach for determining whether GitHub repositories are professional or exploratory by analyzing their README.md files. We crawled and manually labeled a dataset of over 200 repositories to evaluate several classification methods. We compared state-of-the-art Large Language Models (LLMs) against traditional Natural Language Processing (NLP) techniques, including term frequency similarity and word embedding-based nearest neighbors using RoBERTa. The results demonstrate the advantages of LLMs on this classification task. With zero-shot classification and no multi-step reasoning, GPT-4o achieved the highest overall accuracy. Few-shot learning produced mixed results across models. Llama 3 (70b) achieved 89.5% accuracy with multi-step reasoning, though such improvements were not consistent across all models. Our experiments with word probability threshold filtering likewise showed mixed results. Our findings highlight important considerations regarding the balance between accuracy, processing speed, and operational costs. For time-critical applications, direct prompts without multi-step reasoning were the most efficient approach, while model size contributed less. Overall, README.md content proved sufficient for accurate classification in approximately 70% of cases.

Participating institutions

Sprach- und Literatur- und Kulturwissenschaften > Institut für Information und Medien, Sprache und Kultur (I:IMSK) > Lehrstuhl für Medieninformatik (Prof. Dr. Christian Wolff)
Informatik und Data Science > Fachbereich Menschzentrierte Informatik > Lehrstuhl für Medieninformatik (Prof. Dr. Christian Wolff)

Details

Document type: Book chapter
ISBN: 978-989-758-742-9
Book title: Proceedings of the 20th International Conference on Evaluation of Novel Approaches to Software Engineering, April 4-6, 2025, in Porto, Portugal
Publisher: Scitepress
Page range: pp. 457-467
Date: 2025

Identification number: 10.5220/0013272500003928 (DOI)
Keywords: Classification, LLM, README, Zero-Shot, Few-Shot
Dewey Decimal Classification: 000 Computer science, information science, general works > 004 Computer science
Status: Published
Peer-reviewed: Yes, this version has been peer-reviewed
Created at the University of Regensburg: In part
URN of the Regensburg University Library: urn:nbn:de:bvb:355-epub-774466
Document ID: 77446
