Towards an Automated Classification of Software Libraries

Auch, Maximilian

, Balluff, Maximilian, Mandl, Peter

und Wolff, Christian

(2024) Towards an Automated Classification of Software Libraries. SN Computer Science 5 (4).

Veröffentlichungsdatum dieses Volltextes: 15 Mai 2024 05:42
Artikel
DOI zum Zitieren dieses Dokuments: 10.5283/epub.58268

Veröffentlichte Version
Download ( PDF | 1MB)

Lizenz: Creative Commons Namensnennung 4.0 International

Zusammenfassung

Nowadays, the use of third-party libraries in software is common. At the same time, the number of published libraries continues to increase. An automated classification should help to maintain an overview and identify similar software libraries. This paper investigates if new approaches can be used to classify all software libraries crawled from Apache Maven repositories into defined classes using machine learning. In addition to tags that are not always available or of poor quality, we examine one feature that is always available—the id. Consisting of group-id and artifact-id, the id of an Apache Maven software library contains valuable information that can help in classification. Through a developed preprocessing and an optimized recurrent neural network (RNN), the tokenised ids should allow a classification of most libraries. Furthermore, we present an optimized approach through a hybrid use of id tokens and tags in combination. Based on the dataset including 28,600 labeled entries, a comparison of various approaches was carried out. The RNN achieved a balanced accuracy of 71.36% by training on tokenised ids. A model trained on tags achieved a balanced accuracy of 92%. However, the new hybrid approach, which combines tags and ids, optimizes the result to 94.12%. While a classification on tags achieves a better result than the more general id-based approach, the applicability is limited to software libraries that are tagged. The hybrid approach, on the other hand, takes advantage of the classification results based on tags when these are available, but includes valuable information from the always available ids.

Alternative Links zum Volltext

Verlagexterner Link, öffnet neues Fenster

Beteiligte Einrichtungen

Sprach- und Literatur- und Kulturwissenschaften > Institut für Information und Medien, Sprache und Kultur (I:IMSK) > Lehrstuhl für Medieninformatik (Prof. Dr. Christian Wolff) Informatik und Data Science > Fachbereich Menschzentrierte Informatik > Lehrstuhl für Medieninformatik (Prof. Dr. Christian Wolff)
Browse Publikationen

Details

Dokumentenart	Artikel
Titel eines Journals oder einer Zeitschrift	SN Computer Science
Verlag:	Springer
Open Access Art:	CC-Lizenz
Band:	5
Nummer des Zeitschriftenheftes oder des Kapitels:	4
Datum	27 März 2024
Institutionen	Sprach- und Literatur- und Kulturwissenschaften > Institut für Information und Medien, Sprache und Kultur (I:IMSK) > Lehrstuhl für Medieninformatik (Prof. Dr. Christian Wolff) Informatik und Data Science > Fachbereich Menschzentrierte Informatik > Lehrstuhl für Medieninformatik (Prof. Dr. Christian Wolff)
Stichwörter / Keywords	Software libraries · Java · Classifcation · Machine learning · Recurrent neural network
Dewey-Dezimal-Klassifikation	000 Informatik, Informationswissenschaft, allgemeine Werke > 004 Informatik
Status	Veröffentlicht
Begutachtet	Ja, diese Version wurde begutachtet
An der Universität Regensburg entstanden	Zum Teil
URN der UB Regensburg	urn:nbn:de:bvb:355-epub-582687
Dokumenten-ID	58268

Bibliographische Daten exportieren

Nur für Besitzer und Autoren: Kontrollseite des Eintrags

Downloadstatistik

Weitere Literatur (mittels CORE)

nach oben