Page 30 - SMILESENG
Intl. Summer School on Search- and Machine Learning-based Software Engineering
Fig. 1. Proposed approach to build intelligent assistants for Python developers.
to detect design pattern implementations [6]. Finally, we seek to explore neural code transformers to help the developer integrate library functionality effectively, as they have shown great potential in code generation tasks such as code completion [7].
IV. FIRST RESULTS
We are currently working on the first research objective, using MSR and pattern mining techniques to extract knowledge about the usage of Python libraries in GitHub. Following Vidoni's guidelines for conducting MSR studies [8], we have sampled public repositories² using filters to ensure that the selected repositories are relevant and active. The following criteria were established: a minimum number of commits (1,000), a minimum number of stars (100), and recent activity (last commit after 1 January 2021). As a result, we have obtained a list of 3,347 repositories. PyGithub³ and Requirements Parser⁴ were used to access the repository content and to extract the libraries specified in the requirements.txt file, respectively. Our procedure returned 802 repositories with a valid file, resulting in a set of 3,330 different libraries in use.
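The selection criteria and the library extraction step can be sketched as follows. This is a minimal, standard-library illustration rather than the pipeline used in the study: the repository record layout (`commits`, `stars`, `last_commit`) and the function names are hypothetical, and the small regular-expression parser is a simplified stand-in for Requirements Parser that skips option lines and URL-based requirements.

```python
import re
from datetime import date

# Thresholds taken from the text; everything else here is an assumption.
MIN_COMMITS = 1000
MIN_STARS = 100
LAST_COMMIT_AFTER = date(2021, 1, 1)


def is_relevant(repo):
    """Apply the three sampling filters to one (hypothetical) metadata record."""
    return (
        repo["commits"] >= MIN_COMMITS
        and repo["stars"] >= MIN_STARS
        and repo["last_commit"] > LAST_COMMIT_AFTER
    )


# A distribution name at the start of a requirement line, e.g. "numpy" in "numpy==1.21".
_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*")


def library_names(requirements_text):
    """Return the normalised library names declared in a requirements.txt body."""
    names = set()
    for raw in requirements_text.splitlines():
        if "://" in raw:
            continue  # simplification: URL and VCS requirements are ignored
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):
            continue  # blank lines and pip options such as "-r base.txt"
        match = _NAME.match(line)
        if match:
            names.add(match.group(0).lower().replace("_", "-"))
    return names
```

Lower-casing the names and replacing underscores with hyphens keeps spellings such as `PyYAML` and `pyyaml` from being counted as different libraries; whether the study normalised names this way is not stated in the text.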
² For this step we used GitHub Search: https://seart-ghs.si.usi.ch/
³ https://pygithub.readthedocs.io/ (Accessed: 31/05/2022)
⁴ https://requirements-parser.readthedocs.io/ (Accessed: 31/05/2022)
An initial statistical analysis reveals some interesting insights. The maximum number of libraries used by a single repository is 232, while the median is 8. The mean (15.5), which is almost twice the median, and the standard deviation (24.6), which exceeds the mean, indicate that library usage across repositories follows a right-skewed distribution: many repositories use a small number of libraries, and only a few import many of them. Although the overall number of libraries may seem high, more than half of them appear in only one repository. This suggests that some libraries were written by the repository contributors themselves, without a clear intention of being reused in other projects. Only 32 of the 3,300 libraries (0.97%) appear in 50 or more repositories, although it is worth noting that 10 of them (requests, numpy, sphinx, six, pyyaml, scipy, python-dateutil, jinja2, matplotlib and pytz) are included in 100 repositories or more.
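The descriptive statistics discussed above can be computed from a mapping of repositories to their library sets, as in the following sketch. The function name, the input layout and the toy data are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from collections import Counter
from statistics import mean, median, pstdev


def usage_summary(repo_libs):
    """Summarise library usage given {repository name: set of library names}."""
    sizes = [len(libs) for libs in repo_libs.values()]
    # Number of repositories in which each library appears.
    support = Counter(lib for libs in repo_libs.values() for lib in libs)
    singletons = sum(1 for count in support.values() if count == 1)
    return {
        "max": max(sizes),
        "median": median(sizes),
        "mean": mean(sizes),
        "std": pstdev(sizes),  # assumption: population standard deviation
        "n_libraries": len(support),
        "single_repo_share": singletons / len(support),
    }


# Toy input for illustration only; these are not the study's data.
toy = {
    "repo-a": {"numpy", "scipy"},
    "repo-b": {"numpy"},
    "repo-c": {"numpy", "requests", "six", "pyyaml", "local-helper"},
}
summary = usage_summary(toy)
```

A mean above the median, as in the toy input, is the same signal of right skew that the reported figures show.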
To discover library usage patterns, we have applied a frequent closed itemset mining algorithm called DCI Closed [9], which returns subsets of libraries that frequently appear together. Note that for this part of the study we keep only those libraries appearing in two or more repositories, i.e., 1,237 libraries in 787 repositories. Although few repositories contain large sets of libraries, we have found pairs of libraries that appear together in between 10% and 14% of the repositories. Some frequent combinations are: {numpy, scipy}, {numpy, matplotlib}, {requests, six} and {requests, pyyaml}.
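To make the notion of a frequent closed itemset concrete, the sketch below enumerates them by brute force on a toy input. It is not DCI Closed, which uses a far more efficient search; it only illustrates the definition, namely that an itemset is closed when it equals the intersection of all repositories containing it. The function name, the `max_size` cap and the toy data are assumptions made for illustration.

```python
from itertools import combinations


def closed_frequent_itemsets(transactions, min_count, max_size=3):
    """Brute-force closed frequent itemsets of up to max_size items.

    transactions: list of sets of library names, one set per repository.
    Returns {itemset: relative support}. Closed itemsets larger than
    max_size are not reported.
    """
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            covering = [t for t in transactions if itemset <= t]
            if not covering or len(covering) < min_count:
                continue  # not frequent enough
            # Closure: everything shared by all repositories that contain the itemset.
            closure = frozenset.intersection(*map(frozenset, covering))
            if closure == itemset:
                result[itemset] = len(covering) / n
    return result


# Toy input for illustration only; these are not the study's data.
repos = [
    {"numpy", "scipy", "matplotlib"},
    {"numpy", "scipy"},
    {"numpy", "matplotlib"},
    {"requests", "six"},
    {"requests", "pyyaml"},
]
patterns = closed_frequent_itemsets(repos, min_count=2)
```

On this toy input, {numpy, scipy} and {numpy, matplotlib} are reported as closed pairs, whereas {scipy} alone is not, because every repository that uses scipy also uses numpy.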
V. CONCLUSION AND FUTURE WORK
Our initial results have provided us with useful knowledge regarding the current adoption of Python libraries in GitHub repositories. Our next steps will be directed towards enlarging the dataset and applying other unsupervised techniques to find more patterns and dependencies. In particular, we think the DBSCAN clustering algorithm will be useful to separate libraries with dense usage from those used more sparsely. Then, we will continue our roadmap, focusing on exploiting library metadata (e.g., from PyPI) and temporal patterns. All the collected knowledge will feed our intelligent assistant, which will be implemented as an extension or plug-in for IDEs (VS Code, Eclipse or PyCharm). Such an assistant is expected to give real-time recommendations to developers with the aim of improving their code, automatically generating lines of code that make efficient use of the recommended libraries.
ACKNOWLEDGMENT
Work supported by the Spanish Ministry of Science and Innovation (RED2018-102472-T, PID2020-115832GB-I00) and the Andalusian Regional Government (DOC 00944).
REFERENCES
[1] M. Valiev, B. Vasilescu, and J. Herbsleb, “Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem,” in ACM Joint Meeting on European Softw. Eng. Conf. and Symposium on the Foundations of Softw. Eng., 2018, pp. 644–655.
[2] F. Thung, D. Lo, and J. Lawall, “Automated library recommendation,” in 20th Working Conf. Reverse Engineering, 2013, pp. 182–191.
[3] M. A. Saied, A. Ouni, H. Sahraoui, R. G. Kula, K. Inoue, and D. Lo, “Improving reusability of software libraries through usage pattern mining,” J. Syst. Softw., vol. 145, pp. 164–179, 2018.
[4] P. T. Nguyen, J. Di Rocco, D. Di Ruscio, and M. Di Penta, “CrossRec: Supporting software developers by recommending third-party libraries,” J. Syst. Softw., vol. 161, p. 110460, 2020.
[5] P. T. Nguyen, J. Di Rocco, R. Rubei, C. Di Sipio, and D. Di Ruscio, “DeepLib: Machine translation techniques to recommend upgrades for third-party libraries,” Expert Syst. Appl., vol. 202, p. 117267, 2022.
[6] R. Barbudo, A. Ramírez, F. Servant, and J. R. Romero, “GEML: A grammar-based evolutionary machine learning approach for design-pattern detection,” J. Syst. Softw., vol. 175, p. 110919, 2021.
[7] N. Chirkova and S. Troshin, “Empirical Study of Transformers for Source Code,” in ACM Joint Meeting on European Softw. Eng. Conf. and Symposium on the Foundations of Softw. Eng., 2021, pp. 703–715.
[8] M. Vidoni, “A systematic process for mining software repositories: Results from a systematic literature review,” Inf. Softw. Technol., vol. 144, p. 106791, 2022.
[9] C. Lucchese, S. Orlando, and R. Perego, “DCI Closed: A Fast and Memory Efficient Algorithm to Mine Frequent Closed Itemsets,” in ICDM Workshop on Frequent Itemset Mining Implementations, 2004, pp. 1–9.