| Published version: PDF, 1MB | License: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International |
Is It Professional or Exploratory? Classifying Repositories Through README Analysis
Auch, Maximilian, Balluff, Maximilian, Mandl, Peter and Wolff, Christian (2025)
Is It Professional or Exploratory? Classifying Repositories Through README Analysis.
In: Mannion, Mike and Mannisto, Tomi and Maciaszek, Leszek (eds.)
Proceedings of the 20th International Conference on Evaluation of Novel Approaches to Software Engineering April 4-6, 2025, in Porto, Portugal.
Scitepress, pp. 457-467.
ISBN 978-989-758-742-9.
Publication date of this full text: 29 Jul 2025 04:43
Book chapter
DOI for citing this document: 10.5283/epub.77446
Abstract
This study introduces a new approach to determine whether GitHub repositories are professional or exploratory by analyzing README.md files. We crawled and manually labeled a dataset of over 200 repositories to evaluate various classification methods. We compared state-of-the-art Large Language Models (LLMs) against traditional Natural Language Processing (NLP) techniques, including term frequency similarity and word-embedding-based nearest neighbors using RoBERTa. The results demonstrate the advantages of LLMs on this classification task. With zero-shot classification and no multi-step reasoning, GPT-4o achieved the highest overall accuracy. Few-shot learning produced mixed results across models. Llama 3 (70B) achieved 89.5% accuracy with multi-step reasoning, though such improvements were not consistent across all models. Our experiments with word probability threshold filtering likewise showed mixed results. Our findings highlight important considerations regarding the balance between accuracy, processing speed, and operational costs. For time-critical applications, we found that direct prompts without multi-step reasoning provide the most efficient approach, while model size made a smaller contribution. Overall, README.md content proved sufficient for accurate classification in approximately 70% of cases.
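The abstract names term frequency similarity as one of the traditional NLP baselines. The following is a minimal sketch of how such a baseline could look, assuming a bag-of-words cosine similarity combined with a 1-nearest-neighbor decision. It is not the authors' implementation: the function names, the tokenization rule, and the two toy README snippets in `LABELED_EXAMPLES` are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def classify_readme(readme: str, labeled: list[tuple[str, str]]) -> str:
    """Return the label of the most similar labeled README (1-nearest-neighbor)."""
    query = term_frequencies(readme)
    best_label, _ = max(
        (
            (label, cosine_similarity(query, term_frequencies(text)))
            for text, label in labeled
        ),
        key=lambda pair: pair[1],
    )
    return best_label


# Hypothetical toy examples, NOT taken from the paper's dataset.
LABELED_EXAMPLES = [
    (
        "Installation guide, API documentation, contributing guidelines, "
        "license and release notes.",
        "professional",
    ),
    (
        "My first test project for a university course, just trying things out.",
        "exploratory",
    ),
]

if __name__ == "__main__":
    print(classify_readme("Trying out a small test project for my course", LABELED_EXAMPLES))
```

A realistic baseline would use the full labeled dataset and probably weight terms (for example with TF-IDF) rather than rely on raw counts; the sketch only shows the similarity-then-nearest-label idea.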
Details
| Field | Value |
|---|---|
| Document type | Book chapter |
| ISBN | 978-989-758-742-9 |
| Book title | Proceedings of the 20th International Conference on Evaluation of Novel Approaches to Software Engineering April 4-6, 2025, in Porto, Portugal |
| Publisher | Scitepress |
| Page range | pp. 457-467 |
| Date | 2025 |
| Institutions | Sprach- und Literatur- und Kulturwissenschaften > Institut für Information und Medien, Sprache und Kultur (I:IMSK) > Lehrstuhl für Medieninformatik (Prof. Dr. Christian Wolff); Informatik und Data Science > Fachbereich Menschzentrierte Informatik > Lehrstuhl für Medieninformatik (Prof. Dr. Christian Wolff) |
| Identification number | |
| Keywords | Classification, LLM, README, Zero-Shot, Few-Shot |
| Dewey Decimal Classification | 000 Computer science, information science, general works > 004 Computer science |
| Status | Published |
| Peer-reviewed | Yes, this version has been peer-reviewed |
| Created at the University of Regensburg | Partially |
| URN of the University Library of Regensburg | urn:nbn:de:bvb:355-epub-774466 |
| Document ID | 77446 |