Apart from these issues, another question remains: is a model with a good BLEU score actually useful? The only way to answer this is to ask real humans, the real users. Gehrmann et al. [6] come to a similar conclusion and argue for model cards based on expert qualitative analysis. Few fields should find it easier to change their evaluation practices than Software Engineering: software engineers produce the data, the ML libraries, the models, and the metrics, and they are also the final users.
The concrete suggestion (shown in Fig. 1) is to start training with metrics and produce proto models that cover a basic understanding of vocabulary and distributions. The downstream tasks should then be tuned with humans in the loop, who rate various aspects of the specific task (content, quality of language, feedback time, inter-prediction quality, etc.). Rating criteria should be derived from and with the final users, much like in requirements engineering. This pipeline resembles, e.g., CodeBERT [15], which first learns a general model of code and is then fine-tuned for the specific task and language: the BERT core and the code addition would form the proto model, while the downstream task of documentation generation would be tuned via active learning with experts rating samples instead of blind metrics. The pieces for this pipeline are available and tested [16], [17], and the pipeline itself could make a great use case for reinforcement learning and federated learning.
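To make the human-in-the-loop step more concrete, the following minimal Python sketch illustrates one active-learning round in which experts rate generated documentation instead of a proxy metric being computed. All names (ProtoModel, expert_rating, RATING_CRITERIA, active_learning_round) are hypothetical placeholders rather than an existing API; the stub methods only mark where a real proto model and a real rating interface would plug in.

import random

# Criteria agreed on with the final users, as argued above (hypothetical set).
RATING_CRITERIA = ["content", "language_quality", "feedback_time"]


class ProtoModel:
    """Stub for the pre-trained proto model (vocabulary and distributions)."""

    def uncertainty(self, snippet):
        # Placeholder: a real model would return e.g. predictive entropy.
        return random.random()

    def generate(self, snippet):
        # Placeholder: a real model would generate documentation here.
        return f"Documentation for: {snippet}"

    def update_from_feedback(self, feedback):
        # Placeholder: e.g. reinforcement learning with the mean rating as
        # reward, or supervised fine-tuning on the accepted documentation.
        pass


def expert_rating(snippet, doc):
    """Stand-in for a UI in which a developer rates the generated
    documentation from 1 to 5 on each agreed criterion."""
    return {criterion: random.randint(1, 5) for criterion in RATING_CRITERIA}


def active_learning_round(model, pool, budget=20):
    """Pick the snippets the model is least certain about, collect expert
    ratings for the generated documentation, and update the model from
    that feedback instead of from a proxy metric such as BLEU."""
    batch = sorted(pool, key=model.uncertainty, reverse=True)[:budget]
    feedback = []
    for snippet in batch:
        doc = model.generate(snippet)
        feedback.append((snippet, doc, expert_rating(snippet, doc)))
    model.update_from_feedback(feedback)
    return feedback


if __name__ == "__main__":
    pool = ["def add(a, b): return a + b", "def is_even(n): return n % 2 == 0"]
    for snippet, doc, ratings in active_learning_round(ProtoModel(), pool):
        print(snippet, "->", ratings)

In such a loop, the expert ratings take the place of reference-based scores both for model selection and for reporting, which is exactly the shift from proxy metrics to user feedback argued for above.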
IV. CONCLUSION
Following metrics down the rabbit hole has led us into an ML wonderland of free publications, but to outsiders we are just kids in an asylum. If our goal is to build models that are useful to developers and help them in their work, the only metric we really have to maximize is their feedback. No developer tries to write documentation with a certain BLEU score, so we should turn our backs on these proxy metrics. We should trust our users to know what they want, and change our own research to accommodate their needs.
REFERENCES
[1] G. Fraser and A. Arcuri, “Evosuite: automatic test suite generation for object-oriented software,” in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, 2011, pp. 416–419.
[2] “Codexglue: A benchmark dataset and open challenge for code intelligence,” 2020.
[3] GitHub, “GitHub Copilot.” [Online]. Available: https://copilot.github.com/
[4] B. Li, M. Yan, X. Xia, X. Hu, G. Li, and D. Lo, DeepCommenter: A Deep Code Comment Generation Tool with Hybrid Lexical and Syntactical Information. New York, NY, USA: Association for Computing Machinery, 2020, pp. 1571–1575. [Online]. Available: https://doi.org/10.1145/3368089.3417926
[5] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[6] S. Gehrmann, E. Clark, and T. Sellam, “Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text,” arXiv preprint arXiv:2202.06935, 2022.
[7] D. Coughlin, “Correlating automated and human assessments of machine translation quality,” in Proceedings of Machine Translation Summit IX: Papers, 2003.
[8] C. Callison-Burch, M. Osborne, and P. Koehn, “Re-evaluating the role of bleu in machine translation research,” in 11th Conference of the European Chapter of the Association for Computational Linguistics, 2006, pp. 249–256.
[9] G. Doddington, “Automatic evaluation of machine translation quality using n-gram co-occurrence statistics,” in Proceedings of the second international conference on Human Language Technology Research, 2002, pp. 138–145.
[10] A. Eghbali and M. Pradel, “Crystalbleu: Precisely and efficiently measuring the similarity of code,” in 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), 2022. [Online]. Available: https://conf.researchr.org/details/icse-2022/icse-2022-posters/14/CrystalBLEU-Precisely-and-Efficiently-Measuring-the-Similarity-of-Code
[11] R.-M. Karampatsis, H. Babii, R. Robbes, C. Sutton, and A. Janes, “Big code != big vocabulary: Open-vocabulary models for source code,” in 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE). IEEE, 2020, pp. 1073–1085.
[12] H. Babii, A. Janes, and R. Robbes, “Modeling vocabulary for big code machine learning,” arXiv preprint arXiv:1904.01873, 2019.
[13] S. Ren, D. Guo, S. Lu, L. Zhou, S. Liu, D. Tang, N. Sundaresan, M. Zhou, A. Blanco, and S. Ma, “Codebleu: a method for automatic evaluation of code synthesis,” arXiv preprint arXiv:2009.10297, 2020.
[14] T. Sellam, D. Das, and A. P. Parikh, “Bleurt: Learning robust metrics for text generation,” 2020. [Online]. Available: https://arxiv.org/abs/2004.04696
[15] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang et al., “Codebert: A pre-trained model for programming and natural languages,” arXiv preprint arXiv:2002.08155, 2020.
[16] B. Settles, “Active learning literature survey,” 2009.
[17] M. Aledhari, R. Razzak, R. M. Parizi, and F. Saeed, “Federated learning: A survey on enabling technologies, protocols, and applications,” IEEE Access, vol. 8, pp. 140699–140725, 2020.
Fig. 1. Proposed Pipeline for SE Model Training