Software Automation in the Big Data Era: Challenges and Opportunities
2018 Yanqi Lake Meeting Proceedings
Huairou, Beijing, China, 2018.10.11-13

General Planning: Academic and Publishing Committee, Academic Divisions of the Chinese Academy of Sciences
Edited and Published by: Bulletin of the Chinese Academy of Sciences
Editor-in-Chief: Hong Mei (梅宏)
Deputy Editor-in-Chief: Ronghui Su (苏荣辉)
Responsible Editor: Pengfei Li (李鹏飞)
Editors: Xinyong Ma (马新勇), Liuchun Yang (杨柳春), Lan Xiang (向岚), Yanjie Wen (文彦杰), Shuang Liu (刘爽), Xiaojing Sha (沙小晶)

YANQI LAKE MEETING

The mission of the Yanqi Lake Meeting is to focus on emerging disciplines and inter-disciplines, to analyze the growing points and novel issues at the frontiers of science, to accelerate the development of science and the construction of disciplines, to establish a scientific and democratic academic environment, to promote interdisciplinary and international academic exchanges, and to enable the Academic Divisions of the Chinese Academy of Sciences (CASAD) to fully play its guiding role in the development of science and technology and in academic advancement in China.

Contents
1 Introduction
2 Meeting Review
3 Relevant Scientific Publications
  3.1 Relevant Special Topics
  3.2 Meeting News
  3.3 Other Relevant Scientific Publications
  3.4 Invited Talks

1 Introduction

The Yanqi Lake Meeting is a high-end international academic forum sponsored by the Chinese Academy of Sciences, which aims to advance specific research domains through various activities among Chinese and overseas researchers. To focus on specific frontier topics with in-depth discussion, the 2018 Yanqi Lake Meeting is organized in a style similar to a Dagstuhl Seminar or a Shonan Meeting. In particular, the meeting includes three keynote speeches and a number of invited talks and discussions. Its expected outputs include scientific articles and advisory reports.

The major theme of the Yanqi Lake Meeting in Fall 2018 is software automation, which refers to the process of generating software automatically from formal or informal specifications. Software automation (e.g., program synthesis, code completion, program transformation, code recommendation, program repair, and software self-evolution) has long been a dream in computer science, as it can free developers from tedious programming. Furthermore, because software usually evolves in response to changes in requirements or in the environment it runs on, software automation may also free developers from this maintenance burden. Nowadays, the huge volume of software engineering data makes software automation feasible.

Group Photo

In this meeting, Professor Hong Mei, a member of the Chinese Academy of Sciences, organizes the Yanqi Lake Meeting on the theme of software automation: challenges and opportunities in the big data era. The opening ceremony is broadcast live on the Internet with simultaneous interpretation between Chinese and English.
Xinhua Net, Guangming Daily, China Science Journal, China Youth Daily, and other news media are invited to publicize the meeting. According to incomplete statistics, the live broadcast recorded more than 230,000 clicks, and more than 970,000 clicks were recorded on Xinhua Net.

2 Meeting Review

Software Automation in the Big Data Era: Challenges and Opportunities
October 11-13, 2018

General Chair:
Hong Mei, Member of the Chinese Academy of Sciences, Beijing Institute of Technology

Organization Committee:
Dan Hao, Peking University (Software Testing)
He Jiang, Dalian University of Technology (Intelligent Software Engineering)
Ge Li, Peking University (Program Language Processing)
Xiaoxing Ma, Nanjing University (Adaptive Software System)
Xin Peng, Fudan University (Software Maintenance)
Tao Xie, University of Illinois Urbana-Champaign (Intelligent Software Engineering)
Yingfei Xiong, Peking University (Program Repair and Synthesis)
Lu Zhang, Peking University (Software Testing)

Keynote Speaker: Huimin Lin

Huimin Lin received his Ph.D. in Computer Science from the Institute of Software, Chinese Academy of Sciences, in 1986. He is currently a research professor and the director of the Academic Committee, Institute of Software, Chinese Academy of Sciences. He was elected a Member of the Chinese Academy of Sciences in 1999.

Title: On Program Automation

Abstract
Vast amounts of our work have been automated by running various kinds of software on computers. However, software products are still produced by human developers. Writing programs is labour-consuming and error-prone. It has long been a dream to have computers produce software for us. But it is well known that the problem of automatically generating programs from arbitrary specifications is undecidable. Thus, instead of full automation, researchers have aimed at frameworks or platforms that generate programs with some form of interaction with humans. In this talk, I will review some existing approaches to program automation, grouped into two categories: Code Creation and Code Reuse. I will comment on the advantages and limitations of each approach, with the hope of stimulating discussion.

Keynote Speaker: Barry W. Boehm

Barry W. Boehm is the University of Southern California Distinguished Professor of Computer Science, Industrial and Systems Engineering, and Astronautics; the TRW Professor of Software Engineering; and Founding Director of the USC Center for Systems and Software Engineering. He is also the Chief Scientist of the DoD-Stevens-USC Systems Engineering Research Center (SERC). He is a Fellow of the primary professional societies in computing (ACM), aerospace (AIAA), electronics (IEEE), systems engineering (INCOSE), and the Lean Systems Society (LSS), and a member of the U.S. National Academy of Engineering.

Title: Opportunities and Challenges in the Big Data and Software Automation Areas

Abstract
We are in the middle of what has been called the Third Industrial Revolution. The First went from animal and wind power to steam and internal combustion engines. The Second was driven by electrification. And the Third is generally agreed to be driven by computing, communications, and software (CCS), with vast changes in work acceleration and lifestyles.
As we have seen, CCS technology has radically changed business (the world's top companies in market value in 2018 have been Apple, Alphabet/Google, Microsoft, Amazon, Facebook, Tencent, and Alibaba) and people's information technology skills (my grandsons could beat me at most video games when they were 8 and 6 years old). Further opportunities include smart laborsaving devices such as 3D printing, autonomic logistics (microsensor-enhanced devices identifying their maintenance needs and expediting their update or replacement), collaboration technology (identifying stakeholder needs and priorities, identifying likely conflicts, and suggesting mutually satisfactory or win-win solutions), and crowdsourcing (bad driver identification, confidence-based estimation).

Also as we have seen, there are further challenges that go along with the opportunities. Examples include scalability of complex, continuously evolving systems of systems, such as internets of things; ensuring security while undergoing continuous deployment; performing tradeoff analysis among numerous system qualities; early warning of conflicts among autonomous systems; and decision making under uncertainty as new ideas and approaches are proposed.

I will also briefly summarize some of the related research going on at USC and the SERC on win-win collaboration support; model-view-controller-based systems architecting, software generation, and continuous delivery; the associated Parallel Agile extension of our Incremental Commitment Spiral Model (the ICSM book was published in 2014, and its Chinese translation in 2015); and continuous monitoring of software development for vulnerabilities and technical debt.

Keynote Speaker: Dame Wendy Hall

Dame Wendy Hall is Regius Professor of Computer Science and Pro Vice-Chancellor (International Engagement) at the University of Southampton, and is the Executive Director of the Web Science Institute. With Sir Tim Berners-Lee and Sir Nigel Shadbolt, she co-founded the Web Science Research Initiative in 2006 and is the Managing Director of the Web Science Trust, which has a global mission to support the development of research, education, and thought leadership in Web Science. She became a Dame Commander of the British Empire in the 2009 UK New Year's Honours list and is a Fellow of the Royal Society. She has previously been President of the ACM, Senior Vice President of the Royal Academy of Engineering, and a member of the UK Prime Minister's Council for Science and Technology; she was a founding member of the European Research Council, Chair of the European Commission's ISTAG 2010-2012, a member of the Global Commission on Internet Governance, and, until June 2018, a member of the World Economic Forum's Global Futures Council on the Digital Economy. Dame Wendy Hall was co-Chair of the UK government's AI Review, which was published in October 2017, and has recently been announced by the UK government as the first Skills Champion for AI in the UK.

Title: AI Through the Looking Glass

Abstract
Artificial Intelligence is set to transform society in the coming decades in ways that have long been predicted by science fiction writers but are only now becoming feasible because of recent developments in computing technology, machine learning, and the availability of massive amounts of data on which to train the algorithms.
We are still a long way from AI being as powerful as the human brain, but many applications can now outperform human beings, particularly when it comes to analyzing large amounts of data to predict results. This will lead to many jobs being replaced by automated processes and machines, but as with all major technological revolutions there are also amazing opportunities for the development of new companies and the growth of jobs to help us take advantage of everything that the development of AI might bring to society. In this talk, we will discuss how the UK is positioning itself in this brave new world in the light of the recent AI Review that has been undertaken as part of the UK government's industrial strategy. But we must also be very aware of the potential threats to society that such developments might bring and the ethical, accountability, and diversity issues we need to address, including particularly the world of software automation. As Alice found when she went through the looking glass, everything is not always what it first appears to be. If we don't lay the groundwork well now, there is huge potential for chaos and confusion in the future as AI starts to become more dominant in all our lives, which is why we need to take a socio-technical approach to every aspect of the evolution of AI in society.

The 2nd Yanqi-Lake Meeting Report
Software Automation in the Big Data Era: Challenges and Opportunities

Hong Mei (Beijing Institute of Technology, China)
Tao Xie (University of Illinois Urbana-Champaign, USA)
Lu Zhang (Peking University, China)
Dan Hao (Peking University, China)
Oct. 11-13, 2018

We thank all the organization committee members and participants of this meeting.

General Chair: Hong Mei (Beijing Institute of Technology, China)
Organization Committee: Dan Hao (Peking University, China), He Jiang (Dalian University of Technology, China), Ge Li (Peking University, China), Xiaoxing Ma (Nanjing University, China), Xin Peng (Fudan University, China), Tao Xie (University of Illinois at Urbana-Champaign, USA), Yingfei Xiong (Peking University, China), Lu Zhang (Peking University, China)

Software automation, the process of producing software automatically based on formal or informal specifications, has always been a dream of computer scientists. Its purpose is to free developers not only from tedious programming of new software features, but also from the endless manual maintenance of evolving software in its ever-changing environment. Software automation includes, but is not limited to, program synthesis, code completion, program transformation, code recommendation, program repair, and software self-evolution. As an exciting and promising direction, software automation also faces a series of essential challenges, such as vague and diverse requirements in open domains, complex software ecosystems and technology stacks, and diverse technical and business domains. It is even more challenging when software automation needs to handle nonfunctional requirements such as extensibility and safety. Nowadays, "big" software engineering data, characterized by their volume, variety, velocity, and veracity, provide new opportunities for this dream of software automation.
The participants at the Yanqi-Lake Meeting believe that some specific software engineering tasks, e.g., bug fixing, can be fully automated in the near future. Some participants even believe that we are about to witness computers gradually outperforming humans in programming over the coming decades. With software automation, a new type of pair programming may arise: an intelligent assistant hidden within the Integrated Development Environment (IDE) is paired with a human developer to perform daily development tasks. Prof. Premkumar Devanbu from the University of California at Davis suggests that intelligent interaction between the IDE and human developers may be a breakthrough in the coming years. All the participants in the meeting agree that "big" software engineering data play a key role in software automation. Hence, in addition to the "big" data emerging from publicly available sources such as GitHub and Stack Overflow, researchers are also seeking ways to manually label more software engineering data. For example, with support from the China Ministry of Science and Technology, Prof. Minghui Zhou from Peking University and Prof. Gang Yin from National University of Defense Technology have started a project to organize competitions among students on commenting open source code. In this Yanqi-Lake Meeting, the participants also discuss possible ways to classify the capability of software automation into a series of levels. A possible classification is: automated generation of machine code (L1), automated generation of skeleton code and suggestion of the next line of code (L2), automated generation of code fragments (L3), automated generation of design structure (L4), and automated generation of the whole application based on requirements understanding (L5). In summary, software automation is promising, yet faces great challenges, in the big data era. It is time to bring together researchers from various disciplines, including artificial intelligence, software engineering, and programming languages, to advance the research and practice of software automation.
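As a purely illustrative aside (not something presented at the meeting), the sketch below shows one minimal way an IDE-embedded assistant could rank next-token suggestions mined from existing code, in the spirit of the L2 level described above. The toy corpus, the token-level granularity, and the bigram model are assumptions made only for this example.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for mined "big" code data (assumption: whitespace tokens).
CORPUS = [
    "for i in range ( n ) :",
    "for line in open ( path ) :",
    "if x is None :",
    "if x is None : return",
    "return sorted ( items )",
]

def train_bigram_model(snippets):
    """Count how often each token follows another across the corpus."""
    following = defaultdict(Counter)
    for snippet in snippets:
        tokens = snippet.split()
        for current, nxt in zip(tokens, tokens[1:]):
            following[current][nxt] += 1
    return following

def suggest_next(model, context_token, k=3):
    """Rank candidate next tokens by observed frequency, as an assistant might."""
    return [tok for tok, _ in model[context_token].most_common(k)]

if __name__ == "__main__":
    model = train_bigram_model(CORPUS)
    print(suggest_next(model, "is"))    # ['None']
    print(suggest_next(model, "for"))   # ['i', 'line']
```

Real assistants replace the bigram counts with neural language models trained on far larger corpora, but the workflow is the same: mine existing code, learn a statistical model, and surface ranked suggestions inside the IDE for the human developer to accept or reject.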
Participants

- Barry Boehm, University of Southern California
- Wei Ngan Chin, National University of Singapore
- Premkumar Devanbu, University of California at Davis
- Wei Dong, National University of Defense Technology
- Sumit Gulwani, Microsoft
- Wendy Hall, University of Southampton
- Dan Hao, Peking University
- Zhenjiang Hu, National Institute of Informatics, Tokyo
- He Jiang, Dalian University of Technology
- Yu Jiang, Tsinghua University
- Sarfraz Khurshid, University of Texas at Austin
- Bixin Li, Southeast University
- Ge Li, Peking University
- Huimin Lin, Chinese Academy of Sciences
- Ting Liu, Xi'an Jiaotong University
- Yang Liu, Nanyang Technological University
- Jian-Guang Lou, Microsoft Research Asia
- Xiaoxing Ma, Nanjing University
- Hong Mei, Beijing Institute of Technology
- Bruno Oliveira, University of Hong Kong
- Xin Peng, Fudan University
- Abhik Roychoudhury, National University of Singapore
- Danny Tarlow, Google Brain
- Cong Tian, Xidian University
- Meng Wang, University of Bristol
- Ji Wang, National University of Defense Technology
- Zan Wang, Tianjin University
- Tao Xie, University of Illinois Urbana-Champaign
- Yingfei Xiong, Peking University
- Chang Xu, Nanjing University
- Jifeng Xuan, Wuhan University
- Naijun Zhan, Chinese Academy of Sciences
- Hongyu Zhang, Newcastle University
- Lu Zhang, Peking University
- Jianjun Zhao, Kyushu University
- Hao Zhong, Shanghai Jiao Tong University

Program

Thursday (11 October)
09:00-09:30: Opening Speech
09:30-09:45: Group Photo
09:45-11:50: Keynote
11:50-13:30: Lunch
13:30-15:30: Self-introduction
15:30-16:30: Panel
16:30-16:50: Coffee break
16:50-18:50: Invited Talk

Friday (12 October)
08:30-09:15: Invited Talk
09:15-09:45: Panel
09:45-10:00: Coffee break
10:00-11:45: Group Discussion
11:45-12:10: Panel
12:10-13:30: Lunch
13:45-18:00: Brainstorming

Saturday (13 October)
09:00-09:30: Group Discussion Summary Report
09:30-11:00: Group Discussion
11:00-11:20: Coffee break
11:20-11:50: Group Discussion Summary Report
11:50-13:30: Lunch
13:30-14:50: Group Discussion
14:50-16:20: Meeting Summary and Closing

These proceedings summarize the Yanqi-Lake Meeting through its four components: (1) keynote speeches, (2) invited talks, (3) panels, and (4) working groups.

Contents

1. Abstract of Keynotes
(1) On Program Automation, Huimin Lin
(2) Opportunities and Challenges in the Big Data and Software Automation Areas, Barry Boehm
(3) AI through the Looking Glass, Wendy Hall
2. Abstract of Invited Talks
(1) Can Parallelization Be Done by Machine Learning? Zhenjiang Hu
(2) Intelligent Software Engineering: Synergy between AI and Software Engineering, Tao Xie
(3) Automated Program Repair, Abhik Roychoudhury
(4) Towards Intelligent Software Development, Hongyu Zhang
(5) Programming by Examples, Sumit Gulwani
(6) Analysis and Synthesis of Declarative Models, Sarfraz Khurshid
(7) Automating Proofs for Dependable Software, Chin Wei Ngan
(8) Statistical Program Repair, Yingfei Xiong
(9) An Intuitive Explanation about the Gap between Program and Natural Language, Ge Li
3. Panel Summary
(1) Panel 1: Software Automation: Expectation
(2) Panel 2: Roadmap of Software Automation
4. Summary of Working-Group Discussions
(1) (Neural) Program Generation/Synthesis (Rounds 1 and 2)
(2) Community Building: Data Collection, Benchmarking, and Competition for Software Automation, Bridging Research and Practice
(3) Challenges of Program Repair
(4) Human-AI Cooperation (Round 1)
(5) Human-AI Cooperation (Round 2)
(6) Engineering of Intelligence Software
(7) Scoping Software Automation: What Does It Mean and What Can Be Expected in the Near Future, Survey of Software Automation (Round 1)
(8) Scoping Software Automation: What Does It Mean and What Can Be Expected in the Near Future, Survey of Software Automation (Round 2)
(9) Software Automation for IoT Systems

1. Abstract of Keynotes

(1) On Program Automation, Huimin Lin
Vast amounts of our work have been automated by running various kinds of software on computers. However, software products are still produced by human developers. Writing programs is labour-consuming and error-prone. It has long been a dream to have computers produce software for us. But it is well known that the problem of automatically generating programs from arbitrary specifications is undecidable. Thus, instead of full automation, researchers have aimed at frameworks or platforms that generate programs with some form of interaction with humans. In this talk, Huimin Lin reviewed some existing approaches to program automation, grouped into two categories: Code Creation and Code Reuse. He also commented on the advantages and limitations of each approach, with the hope of stimulating discussion.

(2) Opportunities and Challenges in the Big Data and Software Automation Areas, Barry Boehm
We are in the middle of what has been called the Third Industrial Revolution. The First went from animal and wind power to steam and internal combustion engines. The Second was driven by electrification. And the Third is generally agreed to be driven by computing, communications, and software (CCS), with vast changes in work acceleration and lifestyles.
As we have seen, CCS technology has radically changed business (the world's top companies in market value in 2018 have been Apple, Alphabet/Google, Microsoft, Amazon, Facebook, Tencent, and Alibaba) and people's information technology skills. Further opportunities include smart laborsaving devices such as 3D printing, autonomic logistics (microsensor-enhanced devices identifying their maintenance needs and expediting their update or replacement), collaboration technology (identifying stakeholder needs and priorities, identifying likely conflicts, and suggesting mutually satisfactory or win-win solutions), and crowdsourcing (bad driver identification, confidence-based estimation).

(3) AI through the Looking Glass, Wendy Hall
Artificial Intelligence is set to transform society in the coming decades in ways that have long been predicted by science fiction writers but are only now becoming feasible because of recent developments in computing technology, machine learning, and the availability of massive amounts of data on which to train the algorithms. We are still a long way from AI being as powerful as the human brain, but many applications can now outperform human beings, particularly when it comes to analysing large amounts of data to predict results. This will lead to many jobs being replaced by automated processes and machines, but as with all major technological revolutions there are also amazing opportunities for the development of new companies and the growth of jobs to help us take advantage of everything that the development of AI might bring to society. This talk discusses how the UK is positioning itself in this brave new world in the light of the recent AI Review that has been undertaken as part of the UK government's industrial strategy. But we must also be very aware of the potential threats to society that such developments might bring and the ethical, accountability, and diversity issues we need to address, including particularly the world of software automation. As Alice found when she went through the looking glass, everything is not always what it first appears to be. If we don't lay the groundwork well now, there is huge potential for chaos and confusion in the future as AI starts to become more dominant in all our lives, which is why we need to take a socio-technical approach to every aspect of the evolution of AI in society.

2. Abstract of Invited Talks

(1) Can Parallelization Be Done by Machine Learning? Zhenjiang Hu
Program calculation (or the algebra of programming) has been studied for a long time, aiming to provide a theoretical framework for humans to systematically construct programs from specifications. On the other hand, it has recently been shown that machine learning technology can be used for program synthesis. This talk considers automatic parallelization as an example, showing that it might be possible to achieve more if we could combine the two well.

(2) Intelligent Software Engineering: Synergy between AI and Software Engineering, Tao Xie
As an example of exploiting the synergy between AI and software engineering, the field of intelligent software engineering has emerged with various advances in recent years. This field broadly addresses issues of intelligent [software engineering] and [intelligence software] engineering.
The former, intelligent [software engineering], focuses on instilling intelligence in approaches developed to address various software engineering tasks to accomplish high effectiveness and efficiency. The latter, [intelligence software] engineering, focuses on addressing various software engineering tasks for intelligence software, e.g., AI software. This talk discusses recent research and future directions in the field of intelligent software engineering. (3) Automated Program Repair, Abhik Roychoudhury Automated program repair is a promising new technology which seeks to reduce manual effort from the programmer. In the past, program repair has been cast as a search problem, and search based program repair tools try to search for a plausible repair passing all tests, from among a search space. We envision program repair as a specification inference process, rather than a search problem. We show that selective use of symbolic execution can infer specifications about how a given program should be rectified. Conceptually, semantic program repair captures a novel usage of symbolic execution, since instead of navigating search it is used for inferring specifications of intended program behavior. Overall, (semantic) program repair takes one key step towards building (trustworthy) self-healing software for autonomous systems. 3 — 22 — (4) Towards Intelligent Software Development, Hongyu Zhang Software Automation Currently, software development is largely a manual, time-consuming, and error-prone in the Big Data Era: process. In the era of big data and artificial intelligence, we aim towards intelligent Challenges and Opportunities software development. During the development and maintenance of software, a vast amount of data are generated. These data include source code, operation logs, historical failures, performance counters, etc. Various artificial intelligence, machine learning, and data analytics techniques can be utilized to mine these data to automate programming, testing, debugging, and maintenance tasks. As a result, software quality and development productivity could be improved. This talk briefly introduces some recent work of Hongyu Zhang on intelligent software development, including deep learning based code search and log-based fault diagnosis. (5) Programming by Examples, Sumit Gulwani Programming by Examples (PBE) involves synthesizing intended programs in an underlying domain-specific programming language from example-based specifications. This new frontier in AI enables computer users, 99% of whom are non-programmers, to create scripts to automate repetitive tasks. PBE can provide 10-100x productivity increase for data scientists, business users, and developers in various task domains like string/number/date transformations, structured table extraction from logfiles/web pages/semi-structured spreadsheets/PDF/images, transforming JSON from one format to another, repetitive text editing, robotic/business process automation, repetitive code refactoring and formatting. PBE capabilities can be surfaced using GUI-based tools, code editors, or notebooks, and the code can be synthesized in various target languages like Java or even PySpark to facilitate efficient execution on big data. This talk demos some PBE technologies, showcases some latest innovations and forms factors inside different products. 
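To make the example-based synthesis described above concrete, the following is a minimal, hypothetical sketch of enumerative programming by examples over a tiny string-transformation DSL. The primitive set, the pipeline representation, and the examples are assumptions chosen only for illustration; they do not reflect the internals of the PBE systems demoed in the talk.

```python
from itertools import product

# A tiny hypothetical DSL: a program is a short pipeline of primitive string operations.
PRIMITIVES = {
    "lower":      str.lower,
    "upper":      str.upper,
    "strip":      str.strip,
    "first_word": lambda s: s.split()[0] if s.split() else "",
    "initials":   lambda s: "".join(w[0] for w in s.split()),
}

def run(program, text):
    """Apply the pipeline of named primitives left to right."""
    for name in program:
        text = PRIMITIVES[name](text)
    return text

def synthesize(examples, max_length=3):
    """Enumerate pipelines up to max_length and return the first one
    consistent with every (input, output) example, if any."""
    for length in range(1, max_length + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, inp) == out for inp, out in examples):
                return program
    return None

if __name__ == "__main__":
    examples = [("Ada Lovelace", "al"), ("Alan Turing", "at")]
    print(synthesize(examples))   # ('lower', 'initials') is found by enumeration
```

Production PBE engines replace this blind enumeration with deductive search over a carefully designed DSL, plus ranking functions that pick the most likely intended program among the many programs consistent with the examples; that combination is what makes the approach practical for non-programmer end users.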
(6) Analysis and Synthesis of Declarative Models, Sarfraz Khurshid While the general benefits of modeling key elements of software systems are wellknown, the specific value a model provides hinges critically on its correctness. This talk presents a novel approach for analysis and synthesis of models to facilitate the development of correct models. The approach is embodied in a tool-set for the wellknown Alloy modeling language. Experimental results demonstrate that the usefulness of the approach. (7) Automating Proofs for Dependable Software, Chin Wei Ngan This talk briefly outlines the progress Chin et al. made for automating the modular analysis and verification of software. He outlines how they made modular 4 — 23 — extension to his SLEEK prover which allowed it to handle: (i) complex heap-based data structures via separation logic (ii) termination/non-termination reasoning (iii) inference via second-order bi-abduction (iv) reasoning with arrays and multi-party Meeting Proceedings communicating protocol. 2018 (8) Statistical Program Repair, Yingfei Xiong Automated program repair is an important sub problem of software automation and has received a lot of attention in recent years. However, a major problem hindering the current program repair techniques is overfitting: patches passing all the tests are often still incorrect. This talk introduces our recent work in using statistical methods to address the overfitting problem. We develop techniques to learn how developers write programs and patch programs from existing programs and existing patches and apply the learned knowledge in patch generation. Our methods achieve high precision and recall in program repair. (9) An Intuitive Explanation about the Gap between Programs and Natural Languages, Ge Li The problem of generating program code from natural language intent is more difficult than image recognition and natural language translation. An intuitive explanation is that we can view the image recognition as a mapping from a continuous dataset with a clear boundary to a quasi-continuous dataset with quasi-clear boundary, the natural language translation as a mapping between two quasi-continuous datasets with quasi-clear boundaries. The problem of generating code from intent can be viewed as a mapping from a quasi-continuous dataset with an unclear boundary to a discrete data set with unclear boundary. This mapping is more difficult to be expressed by continuous function mapping, which brings difficulties to the application of learning based methods. So collecting more code data in a specific application domain to improve the continuity of the data set and the clarity of the data boundary is a possible way to reduce this difficulty, and then to enhance the possibility of generating code from natural language intent by learning based methods. — 24 — 5 3. Panel Summary (1) Software Automation in the Big Data Era: Challenges and Opportunities Panel 1: Software Automation: Expectation This panel discussion provides an opening discussion for the event to set the tone for the subsequent discussions. In particular, this panel discussion aims to explore the expectations of software automation. Five panelists (Prof. Abhik Roychoudhury, Prof. Ge Li, Dr. Sumit Gulwani, Prof. Hongyu Zhang, and Prof. Xin Peng) participate in the panel discussion moderated by Prof. Tao Xie. Prof. Peng argues that our expectations of software automation may depend on whether it is applied in a lab or real environment. 
In a clean lab environment, we may expect some specific development tasks to be fully automated, but in a real industrial environment of software development, what we can expect may be an intelligent assistant hidden behind the IDE. Dr. Gulwani believes that a key to the success of software automation is to find a suitable way for humans and computers to communicate; with effective communication, many specific domains may be highly automated. Dr. Gulwani also believes that it is important to bring together researchers from AI, software engineering, and programming languages for software automation. Prof. Roychoudhury believes that bug fixing is a promising candidate for automation to a large extent, but is skeptical about fully automating general software development. Prof. Zhang compares software automation with software reuse and argues that one hundred percent software automation may not be a realistic expectation due to the intrinsic complexity of software development, but he also argues that software automation may be realized in specific domains. Prof. Li also believes that full software automation may be achieved only in specific domains. His argument mainly rests on the data that can be collected for software automation: we may have only sparse data for general-purpose software automation. The panelists and the audience are quite interested in data collection and discuss many related issues, such as dealing with legal issues and collecting data beyond code or documentation.

(2) Panel 2: Roadmap of Software Automation
The theme of the second panel discussion is to explore goals in software automation that are possibly achievable in the near future. Seven panelists (Prof. Premkumar Devanbu, Prof. Zhenjiang Hu, Prof. Sarfraz Khurshid, Prof. Tao Xie, Prof. Yang Liu, Prof. Xiaoxing Ma, and Dr. Jian-Guang Lou) participate in the panel discussion moderated by Prof. Lu Zhang. Prof. Devanbu believes that intelligent interaction between development environments and human developers may be a breakthrough in the near future. Prof. Hu argues that automatic parallelization can be an interesting example of software automation. Prof. Khurshid emphasizes the opportunity of software automation in the stage of requirements engineering. Prof. Ma presents his vision of software automation in the context of self-adaptive software. Prof. Liu is more conservative and emphasizes the intrinsic difficulty in fully automatic software development. Prof. Xie emphasizes research directions on intelligent operation of software systems. Dr. Lou talks about a software tool from his software analytics group at Microsoft Research Asia that aims to automatically translate queries in a natural language into SQL statements. The audience has an intensive discussion of the research areas mentioned by the panelists. Operation of software systems is commonly believed to be a promising area of software automation. The panelists and the audience also discuss possible negative impacts that software automation may have on human developers and users.

4. Summary of Working-Group Discussions

(1) (Neural) Program Generation/Synthesis (Rounds 1 and 2)
The discussions of this working group focus on two main parts. The first part is about recent research progress on program generation using deep neural networks (DNNs). The second part is an open discussion on future research directions.
During the first part, the participants first highlight three main ways of applying DNNs to program generation: (1) using end-to-end (or encoder/decoder) DNN models, (2) using DNN models for code search or result ranking, and (3) using a divide-and-conquer methodology with DNN models in one or more steps. Based on this understanding of the existing approaches, the participants further discuss their strengths and weaknesses. For example, DNN-based program generation achieves much better results in domain-specific scenarios (e.g., generating SQL statements from natural language descriptions) than in general scenarios. The participants also discuss some typically difficult scenarios for program generation, such as generating loop statements.

During the second part, the participants mainly discuss four major research directions. First, the participants believe that involving humans in the process of DNN-based program generation would improve both the correctness ratio and the readability of the generated code. Second, the participants agree that research on program generation may be combined with research on program comprehension, since program generation and program comprehension are, to a large extent, inverse problems. Third, the participants discuss the possibility of combining DNN models with domain knowledge in program generation. Finally, the participants also discuss what kinds of programming languages are more suitable for program generation; for example, it may be easier to generate programs written in a declarative language than in a procedural language.

(2) Community Building: Data Collection, Benchmarking, and Competition for Software Automation, Bridging Research and Practice
In pursuing the dream of data-driven software automation, the availability of large-scale and high-quality labelled data sets is very important. The data provide not only the fuel for learning engines but also the meter that tells us how far we have gone. The participants of this session thus focus their discussion on initiating collective efforts for labeling software artifacts.

Currently, with support from the China Ministry of Science and Technology, Prof. Gang Yin et al. from National University of Defense Technology and Prof. Minghui Zhou et al. from Peking University have started a project to organize competitions among students on commenting open source code. Besides the primary goal of engaging and incentivizing students in China to contribute to open source project development, the project is also expected to offer a side product of labelled software artifacts that can be used in software automation.

The participants suggest that, given a code snippet (or coarser-grained code artifacts), one should not only label its functionality but also its design rationale, such as the design pattern used and the decision context. Research opportunities exist for tools supporting efficient code labeling. However, some technical issues still remain, with the following aspects emphasized during the discussions:

- Quality control for the labeling. On one hand, the accuracy of software artifact labeling heavily depends on the domain knowledge owned by the labelers, and such domain knowledge is often scarce among students. On the other hand, different labelers may provide different but all valid labels for the same artifact.
- Scale of the artifacts under labeling.
Competitions and classroom efforts alone may be insufficient to produce a large number of labeled software artifacts. We need to investigate which artifacts/labels are of higher priority for labeling to maximize return on investment given limited initial resources.
- Staleness of the artifacts under labeling. Considering the evolving nature of software, we need to investigate how to keep the labeled artifacts up to date in the long term.
- Automated expansion of the data set. Can we borrow from the machine learning community the idea of Generative Adversarial Nets to expand the data sets?

Finally, the participants discuss how to encourage industrial participation by removing sensitive information from industrial code bases before releasing them to the public, and how to make the labeling task interesting with some question/answer games. The discussions also mention that ICSME 2018 hosted the first challenge on software documentation generation, "to build an automated system that can create, on-demand, reference documentation for a Java class".

(3) Challenges of Program Repair
The working group on challenges of program repair discusses challenges and possible solutions for program repair. The working group starts with an introduction to the current practice of program repair, and then the participants are free to discuss either challenges or their solutions. The discussions raise a few challenges, such as:

- the overfitting problem, which is the largest challenge faced by current program repair approaches;
- the challenges of program repair in educational settings, such as how to grade the assignments and how to provide feedback that best helps the students;
- the repair of performance bugs;
- the limitation of one single popularly used benchmark: many researchers evaluate their approaches on the Defects4J benchmark, and it is not clear whether their approaches overfit the benchmark or not.

The participants also discuss a set of solutions to the challenges, such as those listed below:

- Regarding the problem of overfitting, one possible solution is to look for better target domains where overfitting is not a critical issue. One promising domain is software development based on a reference implementation. This domain covers a lot of development effort that implements a protocol, such as compiler implementation and audio/video player implementation.
- Another solution toward overfitting is to further classify bugs and design different strategies for different types of bugs.
- A possible direction for repairing performance bugs is to borrow compiler optimization techniques or program calculation techniques from the programming language domain.

(4) Human-AI Cooperation (Round 1)
This working group's discussions raise the following points:
- A friendly human-AI interactive environment is essential for software automation. On one hand, within a good interactive environment, AI could seek the help of developers to better focus on certain parts of the software with limited resources, e.g., automatically generating test cases for certain methods/classes. On the other hand, developers can readily employ AI in programming, e.g., interactive code search with AI-assisted query expansion. The preceding scenarios are similar to some hot topics in AI, such as active learning, associate computing search, and interactive data mining.
l Human-AI interaction can contribute to software automation due to the behavior change of developers in software automation. AI could produce software in distinct granularities, and developers could determine whether the generated software works or not, since it is easier for developers to “evaluate” software than to manually “produce” software. l Many concerns are still to be considered as follows. • What is the boundary between AI and human in software automation? • How to predict/explain the behavior of AI in software automation? • Program methodology (e.g., object-oriented programming, data-riented programming) should be considered in Human-AI interaction. 10 — 29 — (5) Human-AI Cooperation (Round 2) The participants of the working group commonly agree on the importance of humanAI cooperation in software automation. The discussions focus on two main aspects. Meeting Proceedings 2018 The first aspect is about the ways that AI communicates with human developers for more effective software development. The first discussed way of human-AI interaction for software development is voice programming: human developers use their voice to command the IDE for programming instead of typing in the code manually. Essentially, this way of human-AI interaction can be viewed as pairing a human developer with a robotic developer during software development. Another way of human-AI interaction for software development is code completion, where an AI-centric recommender working within the IDE provides possible code for human developers to select and adopt/adapt. The second aspect is about what and how researchers can learn knowledge about the need for human-AI interaction. One possible way for learning such knowledge is studies of human behaviors in pair programming, where two human developers work cooperatively to deal with development tasks. Such a study may help reveal what a human developer may need during software development from a peer developer. Thus, the results of these studies may lead to invention of more powerful robotic developers. Another issue to study is to track eye movement of human developers during software and correlate the eye movement data with various development activities. Such studies may help us acquire knowledge about how human developers utilize existing code and human developers connect different code snippets in the code base. The knowledge can be valuable because it may be hard to obtain from code and/or documentation. (6) Engineering of Intelligence Software Intelligence software refers to software systems that have artificial intelligence components such as machine-learned classification models. This working group discusses how software engineering could be used to help such software systems. A lot of discussion focuses on how to help machine learning developers especially neural network developers. The participants with machine learning background discuss how to test and debug a neural work in practice. Also, a published empirical study on TensorFlow bugs (Zhang et al. An Empirical Study on TensorFlow Program Bugs. ISSTA’18) was extensively discussed. This paper identifies four symptoms of TensorFlow bugs and their seven main causes. It also identifies a few challenges for testing and debugging these bugs. The definition of bugs in machine learning is also discussed. One way to define a programming bug in machine learning is that the coded machine learning model does — 30 — 11 not match the mental model of the developers. 
Besides programming bugs, there may be other types of bugs such as bugs in data, bugs in model design, bugs in the underline Software Automation framework, etc. in the Big Data Era: Challenges and Opportunities Challenges for engineering intelligence software beyond machine learning components is also discussed. One is the engineering need for other types of intelligence software such as autotuning or adaptive software. Another one is potential challenges in integrating machine-learned models with traditional software. Finally, the participants also discuss how AutoML may affect the engineering of machine learning software. (7) Scoping Software Automation: What Does It Mean and What Can Be Expected in the Near Future, Survey of Software Automation (Round 1) First, the participants of the working group discuss the scope of software automation. Many of the participants agree that there are some differences on the scope of software automation between the AI and software engineering communities. In the AI community, software automation refers to producing programs automatically; however, in the software engineering community, software automation may include much more activities such as automating the process of requirements elicitation, architectural design, low-level designing, system implementation, system testing, etc. The scope of software automation is much broader in the software engineering community than in the AI community. Second, the participants discuss the problem of how to measure the automation level of software development. A possible way is to measure the automation level from four aspects: (1) the workload, i.e., measuring how much work can be done by machine and how much work can be done by human; (2) the final output, i.e., measuring how many system components can be realized automatically; (3) the process, i.e., measuring the complexity of implementation of the system; (4) the developer, i.e., whether or how much software automation can support to program by the end users such as end-user porgrammers. Third, the participants discuss a possible roadmap of software automation. The participants generally agree that the ultimate goal is to realize end-user-guided automatic programming, and by that time, most software developers would disappear, and only some of them remain for maintaining the system. During this process, the following levels may be achieved one by one: L1: automated generation of machine code from some high-level programming language. L2: programming with a higher-level language, and automatic generation of skeleton 12 — 31 — code based on models, code completion, and code recommendation. L3: automated generation of code fragment from a user intent. L4: automated generation of design structure. Meeting Proceedings L5: automated understanding of requirements. The participants generally agree that the descriptions of L1, L2, and L3 might be suitable, and more discussion is necessary for improving the definition and description of L4 and L5. 2018 Fourth, the participants discuss the possibility of realizing software automation. The participants generally agree that software automation may be achieved in some specific application domains rather than in the general application domain. The research communities may choose some specific application domains to set some milestones, label more data, and carry out some experiments. It is also a feasible way to automate through software reuse. 
In that way, one can build software repositories, classify the software components in these repositories, and then program according to the required functionality by reusing components during software development with system synthesis/composition techniques. In addition, one can try to extract knowledge from the repositories. The participants generally believe that software automation is an inevitable direction of development. However, the open questions of how to achieve this goal and to what extent automation can be accomplished still need further discussion.

(8) Scoping Software Automation: What Does It Mean and What Can Be Expected in the Near Future, Survey of Software Automation (Round 2)
Software automation faces a series of essential challenges, such as vague and diverse requirements in open domains, complex software ecosystems and technology stacks, and diverse technical and business domains. It is even more challenging when software automation needs to handle nonfunctional requirements such as extensibility and safety.

The participants identify some key points for data-driven software automation: (1) human in the loop: software automation may need to be implemented in an iterative and interactive way; (2) learning from history: software automation approaches may benefit from successful past practices, such as prototyping and feedback; (3) knowledge is important: knowledge such as domain knowledge, design knowledge, and API knowledge is essential for successful software automation; (4) combination of different techniques: software automation requires the combination of a series of different techniques, not just deep learning or even AI; for example, coarse-grained software reuse by adaptation and composition and big data analysis for software development are also important techniques.

The participants agree that there is a strong need for a roadmap for software automation, to (1) have a clear understanding of our current position and the next step, (2) set our short-term and long-term targets, and (3) clarify the concrete topics for data collection and benchmarking. One can learn from autonomous-driving levels and set multiple levels for software automation along different dimensions, such as automated generation of machine code (L1), automated generation of skeleton code and suggestion of the next line of code (L2), automated generation of code fragments (L3), automated generation of design structure (L4), and automated generation of the whole application based on requirements understanding (L5).

(9) Software Automation for IoT Systems
First, the participants of the working group agree on what the Internet of Things (IoT) is: IoT is the network of physical devices, vehicles, home appliances, and other items embedded with electronics, software, sensors, actuators, and connectivity, which enables these things to connect, collect, and exchange data, so that they can work together to complete given tasks through real-time monitoring and control.

Second, the participants identify potential applications of IoT in our daily life, e.g., smart transportation, state grid, remote medicine, intelligent logistics, smart factories, and smart houses.

Third, the participants discuss the features of IoT, its relation to software automation, and challenges in the design of IoT.
IoT consists of various subsystems, which are distributed geographically, and heterogonous certainly, with many complicated behaviors, such as behaviors related to communication, mobility, real-time, control, being discrete, and being continuous. In IoT, there is a lot of control software. How to synthesize the control software and how to guarantee their correctness play an important role in the design of IoT. The major issues in the design of IoT include communication, protocol, specification, simulation and testing, specific platform, resource constraints, programming languages, heterogeneity, mobility, interoperability, error diagnosis and monitor, control strategy, controller synthesis, security and privacy, robustness, safety, architecture, modeling, and so on. 14 — 33 — 2018 Meeting Proceedings 3 Relevant Scientific Publications 3.1 Relevant Special Topics................................................................................................... 36 Special Focus on Software Automation, Science China Information Sciences, October 2019, Vol. 62 Editorial................................................................................................................................... 36 Special Focus on Software Automation Review..................................................................................................................................... 37 Evaluation of model checkers by verifying message passing programs Research paper...................................................................................................................... 61 A manual inspection of Defects4J bugs and its implications for automatic program repair Perspective............................................................................................................................. 77 Automated program repair: a step towards software automation Letter....................................................................................................................................... 80 AI-boosted software automation: learning from human pair programmers 3.2 Meeting News................................................................................................................... 83 NSR reports: Challenges and opportunities of software automation discussed at the Yanqi-Lake Meeting, He Jiang, National Science Review, 2019, 6:19, doi:10.1093/nsr/nwy145 — 34 — Software Automation in the Big Data Era: Challenges and Opportunities 3.3 Other Relevant Scientific Publications............................................................................ 84 Hong Mei, Lu Zhang, Can Big Data Bring a Breakthrough for Software Automation? Science China Information Sciences, Vol 61, May 2018, 056101:1-056101:3........................................ 84 Ashwin Kalyan, Abhishek Mohta, Oleksandr Polozov, Dhruv Batra, Prateek Jain, Sumit Gulwani. Neural-Guided Deductive Search for Real-Time Program Synthesis from Examples. ICLR (Poster) 2018............................................................................................................................ 87 Naji Dmeiri, David A. Tomassi, Yichen Wang, Antara Bhowmick, Yen-Chuan Liu, Premkumar T. Devanbu, Bogdan Vasilescu, Cindy Rubio-González. BugSwarm: Mining and Continuously Growing a Dataset of Reproducible Failures and Fixes. ICSE 2019: 339-349........................ 102 Yiling Lou, Lingming Zhang, Dan Hao. History-Driven Build Failure Fixing: How Far Are We? 
ISSTA 2019: 43-54................................................................................................................. 113 Jiajun Jiang, Luyao Ren, Yingfei Xiong, Lingming Zhang. Inferring Program Transformations From Singular Examples via Big Code. ASE 2019: 255-266................................................... 125 Zeyu Sun, Qihao Zhu, Yingfei Xiong, Yican Sun, Lili Mou, Lu Zhang. TreeGen: A Tree-Based Transformer Architecture for Code Generation. AAAI 2020:.................................................... 137 Angello Astorga, P. Madhusudan, Shambwaditya Saha, Shiyu Wang, Tao Xie. Learning Stateful Preconditions Modulo a Test Generator. PLDI 2019: 775-787................................................ 145 3.4 Invited Talks.................................................................................................................... 158 — 35 — 2018 SCIENCE CHINA Information Sciences . EDITORIAL . Meeting Proceedings October 2019, Vol. 62 200100 https://doi.org/10.1007/s11432-019-2627-y Special Focus on Software Automation∗ From October 11th to October 13th 2018, forty researchers from the US, the UK, Canada, Australia, Japan, Singapore and China gathered at Yanqi-Lake to attend Yanqi-Lake Meeting 2018 to discuss software automation in the big data era. Software automation refers to the process of generating software automatically based on formal or informal specifications. Software automation (e.g., program synthesis, code completion, program transformation, code recommendation, program repair, and software selfevolution) used to be a dream in computer science, which can free developers from tedious programming. Following the theme of the Yanqi-Lake Meeting, we provide a special focus on software automation, which includes one review paper, one research paper, one perspective paper, and one letter. The review paper, which is entitled “Evaluation of model checkers by verifying message passing programs”, is on automated software verification. This paper reports an empirical study on how different model checkers perform on verifying message passing software. The research paper is entitled “A manual inspection of Defects4J bugs and its implications for automatic program repair”. This paper investigates the potential of automated software automation via a controlled study on how humans repair software. The perspective paper is entitled “Automated program repair: a step towards software automation”. This paper focuses on a special form of software automation — automated software repair, and discusses the achievements and challenges in this form of software automation. The letter, which is entitled “AI-boosted software automation: learning from human pair programmers”, is on interactive software automation that can be viewed as cooperation between machine programmers and human programmers. This article argues that interactive software automation may borrow ideas from cooperation in paired programming. Overall, the four articles provide a showcase of current research on software automation, including progress, opinions, and outlooks. We hope that the four articles may stimulate further research in this important direction. Guest Editors: Hong MEI Peking University, China Lu ZHANG Peking University, China ∗ Citation Mei H, Zhang L. Special focus on software automation. Sci China Inf Sci, 2019, 62(10): 200100, https://doi. 
org/10.1007/s11432-019-2627-y c Science China Press and Springer-Verlag GmbH Germany, part of Springer Nature 2019  — 36 — info.scichina.com link.springer.com SCIENCE CHINA Information Sciences . REVIEW . Software Automation in the Big Data Era: Challenges and Opportunities October 2019, Vol. 62 200101:1–200101:24 https://doi.org/10.1007/s11432-018-9825-3 Special Focus on Software Automation Evaluation of model checkers by verifying message passing programs Weijiang HONG1,2 , Zhenbang CHEN1* , Hengbiao YU1,2 & Ji WANG1,2* 1 College of Computer, National University of Defense Technology, Changsha 410073, China; State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha 410073, China 2 Received 30 July 2018/Revised 24 October 2018/Accepted 6 March 2019/Published online 3 September 2019 Abstract Benchmarks and evaluation are important for the development of techniques and tools. Studies regarding evaluation of model checkers by large-scale benchmarks are few. The lack of such studies is mainly because of the language difference of existing model checkers and the requirement of intensive labor in building models. In this study, we present a large-scale benchmark for evaluating model checkers whose inputs are concurrent models. The benchmark consists of 2318 models that are generated automatically from real-world message passing interface (MPI) programs. The complexities of the models have been inspected to be well distributed and suitable for evaluating model checkers. Based on the benchmark, we have evaluated five state-of-the-art model checkers, i.e., PAT, FDR, Spin, PRISM, and NuSMV, by verifying the deadlock freedom property. The evaluation results demonstrate the ability and performance difference of these model checkers in verifying message passing programs. Keywords model checker, evaluation, benchmark, MPI, symbolic execution Citation Hong W J, Chen Z B, Yu H B, et al. Evaluation of model checkers by verifying message passing programs. Sci China Inf Sci, 2019, 62(10): 200101, https://doi.org/10.1007/s11432-018-9825-3 1 Introduction Model checking [1] is an effective push-button technique used for verifying concurrent systems. Until now, many model checkers have been developed and successfully applied to verify hardware or software systems. A model checker usually accepts a model M of the system under verification and a critical property ϕ of the system. Then, it tries to explore all the status of M. If no violation of ϕ is detected during exploration, the model checker reports that M satisfies ϕ; otherwise, a counter-example is reported by the model checker which can be used to detect and fix bugs. In this way, model checking provides an automatic verification method. However, model checking suffers from state explosion problem, especially when applied to verify concurrent systems. Although there are several successful studies on using model checking to verify concurrent systems, only a few of them exists for evaluating and comparing model checkers [2]. This is because different model checkers support different modeling languages. In the evaluation, much engineering effort is needed to generate the models with respect to the input languages of different model checkers. However, the existing benchmarks of model checking mainly use manually created models [3], hardware benchmark model1) , or models of classical problems2) . 
In addition, the benchmarks for software model checkers usually consist * Corresponding author (email: zbchen@nudt.edu.cn, wj@nudt.edu.cn) 1) Hardware Model Checking Contest Website. Http://fmv.jku.at/hwmcc17/. 2) Model Checking Contest Website. Https://mcc.lip6.fr/. c Science China Press and Springer-Verlag GmbH Germany, part of Springer Nature 2019  info.scichina.com link.springer.com — 37 — Hong W J, et al. Sci China Inf Sci October 2019 Vol. 62 200101:2 of programs instead of models3) . To the best of our knowledge, large benchmarks consisting of concurrent models automatically extracted from real-world concurrent programs do not exist. Therefore, study on Meeting evaluating model Proceedings checkers over a large-scale benchmark of such models is lacking. In this study, we provide a large-scale model benchmark for evaluating existing model checkers. The models in the benchmark are automatically extracted from real-world message passing programs. Message passing is the current de-factor programming paradigm in high-performance computing (HPC). Message passing interface (MPI)4) plays a crucial role in developing HPC applications. However, due to the complexities such as non-deterministic and non-blocking communications, the development and maintenance of MPI programs are not trivial. In particular, the verification of MPI programs is extremely challenging [4] and the concurrency in MPI programs is used for evaluating the model checker. However, in some of the existing studies (e.g., [5, 6]), they either manually create the models or are difficult to support real-world MPI programs. Hence, it is not suitable for creating a large-scale model benchmark by leveraging them. In our previous study [7], we developed a symbolic verification method, called MPI-SV, for MPI programs. MPI-SV combines symbolic execution [8] and model checking to verify MPI programs. Symbolic execution extracts path-level models from MPI programs, whereas model checking verifies path-level models with respect to critical properties, such as deadlock freedom. The two combined techniques complement each other. Symbolic execution helps to improve code coverage and handle the complex language constructs in the MPI code and extracts the communication models for model checking. By contrast, model checking helps to boost the efficiency of symbolic execution and enlarge the scope of verifiable properties. In principle, we can use the tool of MPI-SV as a benchmark generator to generate path-level models of real-world MPI programs. Based on MPI-SV, we created a benchmark from 10 real-world MPI programs and the benchmark consists of 2318 models. Using the benchmark, we evaluated five state-of-the-art model checkers, i.e., PAT5) , FDR [9], Spin6) , PRISM7) , and NuSMV8) . The evaluation is enabled by generating the input models for different model checkers from the models in the benchmark. The main contributions of this study are as follows: 2018 • A large-scale benchmark for evaluating model checkers is generated. It consists of 2318 models extracted from real-world MPI programs and is justified to be well distributed in complexity. • A comprehensive evaluation of five state-of-the-art model checkers based on the benchmark. • A discussion of lessons learned from creating the benchmark and the evaluation. Structure. The rest of this study is organized as follows. A brief summary of the background of the study, which includes MPI, MPI-SV, and the model checkers, is presented in Section 2. 
Section 3 presents the design, generation, and evaluation of the benchmark. Then, the translation algorithms from the models in the benchmark to the input models of the model checkers are presented in Section 4. The evaluation of the five model checkers and its results are presented in Section 5. Furthermore, Section 6 discusses related work. Finally, Section 7 concludes the study and outlines further research directions.

2 Preliminaries and framework

This section first introduces our framework for generating the benchmark and evaluating model checkers. Then, the key MPI operations, the evaluated model checkers, and MPI-SV are briefly introduced.

3) SV-Comp Website. https://sv-comp.sosy-lab.org/.
4) Message Passing Interface Forum. http://www.mpi-forum.org/docs/.
5) PAT Website. http://pat.comp.nus.edu.sg.
6) Spin Website. http://www.spinroot.com.
7) PRISM Website. http://www.prismmodelchecker.org.
8) NuSMV Website. http://nusmv.fbk.eu.

Table 1  An example of an MPI program

P0: IRecv(*, 1); Recv(1, 1); Barrier
P1: ISend(0, 1); Barrier
P2: ISend(0, 1); Barrier

• Barrier: blocks the process until all the processes have called Barrier, which makes a global synchronization.
• Wait(req): blocks until the operation indicated by req is completed.

Non-blocking operations are frequently used in MPI programs to improve performance. The key non-blocking operations are the following:
• ISend(i, tag, req): sends a message with a tag to the ith process; the sending process returns immediately after the operation is issued, and the parameter req is used to indicate the status of the operation.
• IRecv(i, tag, req): receives a message with the tag from the ith process; the receiving process returns immediately after the operation is issued. Similarly, IRecv(*, tag, req) is the non-blocking wildcard receive.

Some complex MPI operations, such as MPI_Bcast and MPI_Gather, can be implemented by composing these key operations, and MPI-SV supports a wide range of complex MPI operations.

Table 1 shows an example of an MPI program. In the example, the program runs three processes, i.e., P0, P1, and P2. Both P1 and P2 first send a message with tag 1 to P0 and then wait at the barrier for the remaining processes to synchronize. P0 first receives a message with tag 1 from any process in a non-blocking way. Afterward, P0 blocks until a message with tag 1 is received from P1. This program has a deadlock when P0's wildcard receive obtains the message from P1 first, because the subsequent blocking receive from P1 then waits forever: there is no further message from P1. If the wildcard receive obtains the message from P2, no deadlock happens.

2.3 Model checkers

Model checking is one of the most effective automated techniques for verifying the correctness of software and hardware designs. It explores all possible states in a brute-force manner to prove whether a given system model truly satisfies a property. Several effective model checkers have been developed; here, we give a short description of the five model checkers evaluated in this study. We chose these five model checkers because they are publicly available, apply state-of-the-art model checking techniques, and are still under active development or maintenance.
In addition, we require the model checker to provide a command-line interface; otherwise, we do not include it, such as UPPAAL9) . • PAT: Process analysis toolkit (PAT) is a self-contained framework designed to apply state-of-theart model checking techniques for system analysis. The system model is specified by the classic process algebra language communicating sequential process (CSP) [11]. An example of using PAT is as follows. 1 2 3 4 5 6 7 8 9 10 (* channel_definition *) channel C1 1; channel C2 1; (* process_definitions *) P0 = C1!1->Skip; C2!->Skip; P1 = (C1?1->Skip [] C2?1->Skip); (* parallel operation *) P = P0 || P1; 9) UPPAAL Website. Http://www.uppaal.org. — 40 — Hong W J, et al. 11 12 13 Sci China Inf Sci (* assertion *) #assert P deadlockfree; October 2019 Vol. 62 200101:5 Software Automation in the Big Data Era: Challenges and Opportunities As shown in this example, we declare two different channels and processes. Then, we use the parallel composition operation defined in CSP to obtain process P by composing processes P0 and P1 in parallel. Moreover, we can write some assertions (or properties) to be verified by the model checker. In the example, we verify the deadlock freedom using the following model checkers. • Spin: Spin is a model checker for verifying concurrent systems, e.g., data communication protocols. The input language of Spin is process meta language (Promela), which provides a convenient way for modeling concurrent systems. The following is an example of Spin. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 (* variable_definitions *) #define a1 #define b2 chan ch = [1] of {byte}; (* process_definitions *) Proctype A() { ch!a } Proctype B() { ch!b } Proctype C() { if :: ch?a :: ch?b fi } (* parallel operation *) init { atomic { run A(); run B(); run C() } } This example defines three processes run in parallel and one channel. Processes A and B write message a and b, respectively, into channel ch. The first branch in C is executable if and only if the channel contains message a. The second branch is defined in a similar manner. Considering that the channel’s capacity is 1, the message that will be available is non-deterministic, which depends on the writing speeds of the processes. • FDR: FDR [9] is also a model checker used for analyzing programs written in CSP, in particular, machine-readable CSP, named CSPM [12], which combines the operators of CSP with a functional programming language. The syntax of the input language of FDR is similar to that of PAT. An example is given as follows: 1 2 3 4 5 6 7 8 (* variable_definitions *) channel C1 channel C2 channel D (* process_definitions *) P0 = C1 ->D -> SKIP P1 = D ->C2 -> SKIP — 41 — Hong W J, et al. 9 10 11 12 13 14 Sci China Inf Sci October 2019 Vol. 62 200101:6 2018 (* parallel operation *) Q = P0 [|{D}|]Proceedings P1 Meeting (* assertion *) assert Q : [deadlock free[F]] This example illustrates the synchronization between processes. Here, the channels, i.e., C1, C2, and D, can be understood as events. P0 and P1 execute these channels in order. Q forces P0 and P1 to synchronize on event D while any other event can be performed by either process. The property Q, i.e., deadlock freedom, is written in the assert statement. • PRISM: PRISM is a probabilistic model checker that supports the modeling and verification of probabilistic models, including discrete-time Markov chains, continuous-time Markov chains, Markov decision processes, and probabilistic timed automata. 
The following shows an example in Prism. 1 2 3 4 5 6 7 8 9 10 11 12 (* the type of model *) mdp module M (* variable_definitions *) x : [0..2] init 0; (* transition *) [] x=0 -> 0.8: (x’=0) + 0.2: (x’=1); [] x=1 -> (x’=2); [] x=2 -> 0.5: (x’=2) + 0.5: (x’=0); end module This example describes a Markov decision process. Three states are represented by x. Process M can be in one of the three states, i.e., 0, 1, 2. From state 0, the process can move to state 1 with a probability of 0.2 and remain in the same state with a probability of 0.8. From state 1, the process can only move to state 2. Finally, from state 2, the process will either remain there or move back to state 0 with an equal probability. • NuSMV: NuSMV is a symbolic model checker that re-implements and extends the original binary decision diagram-based model checker SMV [13]. NuSMV’s input model is specified by labeled transition system [14]. The following is an example. 1 2 3 4 5 6 7 8 9 10 11 MODULE main VAR request : boolean; machine : {ready, busy}; ASSIGN init(machine) := ready; next(machine) := case machine = ready & request = TRUE TRUE esac; : busy; : {ready, busy}; This example specifies the transition of the machine state between ready and busy. While the current state of the machine is ready and the request is TRUE, its next state is busy; otherwise, the next state is randomly chosen from {ready, busy}. — 42 — Hong W J, et al. MPI programs Sci China Inf Sci October 2019 Vol. 62 200101:7 Violation path Symbolic executor in the Big Data Era: Challenges and Opportunities No State pruner Property No Violation CSP model checker Automation YesSoftware Test case Yes CSP model MPI-SV Figure 2 2.4 Framework of MPI-SV [7]. MPI-SV Figure 2 shows the framework of MPI-SV. The inputs of MPI-SV are an MPI C program, its number of running processes, and a property (e.g., deadlock freedom) to verify. The program will be analyzed using symbolic execution [8]. During symbolic execution, different cases of message matching in MPI communications will be explored systematically. The property will be checked simultaneously to detect property violations. If a path explored by the symbolic executor is terminated and no violation is detected, a CSP model is generated from the communication behavior of the path. The CSP model represents the equivalent communication behavior of the current path soundly and completely by changing only the message matchings of the wildcard receives in the path. Then, the CSP model will be fed into a CSP model checker for verification. In [7], PAT was used as the model checker. If PAT reports a counter-example, a violation of the property is detected and reported; otherwise, the equivalent paths of the current path are pruned because no violation exists in the equivalent behaviors. In this way, MPI-SV combines symbolic execution and model checking to verify MPI programs. In principle, MPI-SV divides the behaviors of an MPI program into different equivalent classes. Each class is modeled by a CSP model and only one path inside the class is explored by a symbolic executor. In the high level, symbolic execution can be considered an extractor for extracting path-level models and model checking is used to verify that the models satisfy the property. Hence, we can use MPI-SV as a benchmark generator to generate the models of behavior equivalence classes of an MPI program under various running processes. 
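The sketch below illustrates this verify-and-prune loop in Python. It is not MPI-SV's code: the helper callables build_csp_model and model_check, and the per-path class_id field, are hypothetical stand-ins for the symbolic executor's model extraction and the CSP model checker (PAT).

# Conceptual sketch only (not MPI-SV's implementation): one path is explored
# per behavior equivalence class, a CSP model is built for that path, the
# model is checked, and the remaining paths of the class are pruned.
def verify(paths, build_csp_model, model_check, prop="deadlock freedom"):
    benchmark_models = []          # side product: the models kept by MPI-SV*
    covered_classes = set()        # equivalence classes already handled
    for path in paths:             # paths as yielded by symbolic execution
        if path["class_id"] in covered_classes:
            continue               # pruned: equivalent behaviors already verified
        model = build_csp_model(path)      # only wildcard matchings left open
        benchmark_models.append(model)
        if model_check(model, prop) == "counterexample":
            return "violation", model, benchmark_models
        covered_classes.add(path["class_id"])
    return "no violation", None, benchmark_models

# Toy usage with stub helpers, just to show the control flow; a real run
# would invoke the symbolic executor and a CSP model checker such as PAT.
toy_paths = [
    {"class_id": 0, "comms": "IRecv(*,1); Recv(1,1); Barrier"},
    {"class_id": 0, "comms": "equivalent path of the same class"},
]
print(verify(toy_paths,
             build_csp_model=lambda p: p["comms"],               # stub
             model_check=lambda m, _prop: "counterexample"))     # stub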
We only need to modify MPI-SV slightly to generate benchmark models during verification, i.e., to emit a benchmark model in the unified syntax (c.f. Subsection 3.2) whenever a CSP model is generated. We use MPI-SV∗ (c.f. Figure 1) to denote this modified version of MPI-SV for benchmark generation.

3 Benchmark

3.1 Benchmark design

Our criteria for selecting MPI programs for benchmark generation are as follows: (1) the programs must be real-world MPI programs; (2) the types and scales of the programs should be diverse; and (3) the programs must be analyzable by MPI-SV. Based on these criteria, we select the MPI programs in Table 2 for producing benchmark models. All the programs are real-world open-source MPI programs and fall into several categories. In numeric calculation, we have Integrate_mw and Diffusion2d from the FEVS benchmark [15]; in addition, an MPI implementation of Gaussian elimination, Gauss_elim [16], and a parallel solver for the heat equation, Heat [17], are typical numeric-calculation applications. For transition and communication behavior, we have DTG from a Ph.D. dissertation [18] and Pingpong, a testing program for communication performance. We also collect MPI programs related to image processing, i.e., Mandelbrot and Image_manip, which draw the Mandelbrot set for a bitmap in parallel and perform image manipulations, respectively; Pingpong, Mandelbrot, and Image_manip are all downloaded from GitHub. Finally, there are two large MPI programs: Kfray10) is a ray-tracing program that can create realistic images, and ClustalW [19] is a popular tool for aligning multiple gene sequences.

Table 2  Programs for benchmark generation

Program        Line of code    Brief description
DTG            90              Dependence transition group
Integrate_mw   181             Integral computing
Diffusion2d    197             Simulation of diffusion equation
Gauss_elim     341             Gaussian elimination
Heat           613             Heat equation solver
Pingpong       220             Comm performance testing
Mandelbrot     268             Mandelbrot set drawing
Image_manip    360             Image manipulation
Kfray          12728           KF-Ray parallel raytracer
ClustalW       23265           Multiple sequence alignment
Total          38263           10 open source MPI programs

Besides the programs themselves, the number of processes with which a program is run is important for model generation, because the number of processes determines the scale of parallelism. Our aim is to obtain models under different scales of parallelism. Hence, we plan to use MPI-SV to analyze each program under different numbers of processes, i.e., 2, 4, 6, 8, and 10, except for the programs developed for a fixed number of processes, such as DTG and Pingpong. In addition, due to the huge path space under larger numbers of processes, we only analyze Diffusion2d under 4 and 6 processes.

3.2 Benchmark generation

As shown in Figure 1, the selected programs are first compiled into LLVM intermediate representation (IR) [20] for symbolic execution. Then, the IR is fed to MPI-SV for model generation. The property we check is deadlock freedom, which is critical for MPI programs. Note that we mutate the IR of each program to diversify the benchmark [21]. Mutants are generated by rewriting a randomly selected receive operation using the following two rules (a small source-level sketch follows the list):
• Replace Recv(i) with if (x > a) Recv(i) else Recv(*).
• Replace Recv(*) with if (x > a) Recv(*) else Recv(j).
Here x is an input variable, a is a randomly generated constant, and j is randomly selected from the range of process identifiers.
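As an illustration only, the following mpi4py-style Python sketch shows the effect of the first rule at the source level; the paper itself applies the rewrites to the LLVM IR of C MPI programs, and the variables x and a here are hypothetical.

# Illustration only: the paper mutates LLVM IR of C MPI programs; this
# mpi4py sketch merely shows what rule 1 amounts to at the source level.
from mpi4py import MPI

comm = MPI.COMM_WORLD

def original_receive():
    # A deterministic receive from process 1 with tag 1.
    return comm.recv(source=1, tag=1)

def mutated_receive(x, a):
    # Rule 1: guard the original receive with an input-dependent branch;
    # on the else-branch the receive becomes a wildcard receive, which
    # introduces additional non-deterministic message matchings.
    if x > a:
        return comm.recv(source=1, tag=1)
    else:
        return comm.recv(source=MPI.ANY_SOURCE, tag=1)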
The mutations for IRecv(i,r) and IRecv(*,r) are similar. The goal of the first rule is to improve the program performance and simplify programming, while the second rule is to make the communication more deterministic. For each program, we generate five mutants if possible or generate as many as the number of receives. The time limit for generating models for a program under a specific number of processes is 1 h. All the models in our benchmark are generated on a server with 32 Xeon 2.5 G cores and 256 G memory. When MPI-SV analyzes an MPI program, as depicted in Subsection 2.4, it generates a model for a terminated path along which no violation of the critical property happens. The communications that occurred along the path are recorded in the model. Formally, if an MPI program is run under n processes, a model M of the program is a set of communication sequences {Proci | 0  i  n − 1}, and each Proci is defined as the syntax in Figure 3, where 0  d, s  n − 1 are process identities, and t, r ∈ N stand for the status of the operation. For example, the MPI program in Table 1 produces the following model, where the statuses t and r are omitted for the sake of simplicity. 10) kf-ray. Https://code.google.com/archive/p/kf-ray/. — 44 — Hong W J, et al. Proc ::= Comm ::= Sci China Inf Sci October 2019 Vol. 62 200101:9 Software Automation in the Big Data Era: IRecv(d,t,r) | IRecv(*,t,r) | Wait(r) Challenges and Opportunities Comm | Proc ; Proc Ssend(d, t) | Recv(s, t) | Recv(*, t) | Barrier | ISend(d,r) | Figure 3 Syntax of a model in benchmark. Table 3 Number of models extracted from different mutants Program o m1 m2 m3 m4 m5 Total DTG 1 2 1 1 1 1 7 Integrate_mw 5 5 67 – – – 77 Diffusion2d 2 2 18 80 3 2 107 Gauss_elim 5 6 – – – – 11 Heat 5 8 6 6 6 6 37 Pingpong 0 28 28 547 28 28 659 Mandelbrot 39 330 327 – – – 696 Image_manip 20 5 5 – – – 30 Kfray 1 5 641 5 – – 652 ClustalW 5 5 5 17 5 5 42 Total 83 396 1098 656 43 42 2318 Proc0 = IRecv(*,1);Recv(1,1);Barrier Proc1 = ISend(0,1);Barrier Proc2 = ISend(0,1);Barrier In addition to the communication sequences, for each model, we record a verification result produced by MPI-SV. A verification result of a model may be deadlock, deadlock-free, or timeout. The result can be used as a reference answer during evaluation. For example, the result of the model extracted from the program in Table 1 is deadlock. Table 3 lists the results of the benchmark generation. The first column shows the names of programs and the remaining columns indicate the number of benchmarks under different mutants, where o represents the original program. In total, 2318 models have been generated from the MPI programs in Table 2. As indicated by Table 3, the number of the models extracted from different programs varies, which is mainly due to the different input symbolizations of the programs. In principle, a model represents an equivalent class of the program’s behaviors, whose control and data dependencies are same, but the matches of wildcard receives are different. From Table 3, we can see that the number of the models extracted from original programs (c.f., column o) is not large, which is also the reason why mutations are carried out. In particular, since no wildcard receives are contained in the program, the number of models extracted from original Pingpong is 0. In contrast, we can extract models from the mutated Pingpong program because the wildcard receives has been generated in the mutation. Table 4 displays the generation results from the perspective of parallelism scales. 
The first column shows the names of the programs, and the remaining columns list the number of benchmark models generated under different numbers of processes. As shown in Table 4, for some programs the number of models under more processes is smaller than that under fewer processes. The reason is that a larger number of processes enlarges the scale of parallelism and complicates the communication along paths, so MPI-SV needs more time to verify the path models and fewer paths are explored within the same time limit. For most programs, the number of extracted models is largest under 4 or 6 processes.

Table 4  Number of models in different scales of parallelism

Program        2     4    5   6    8    10   Total
DTG            –     –    7   –    –    –    7
Integrate_mw   3     3    –   3    59   9    77
Diffusion2d    –     37   –   70   –    –    107
Gauss_elim     2     2    –   2    3    2    11
Heat           13    6    –   6    6    6    37
Pingpong       659   –    –   –    –    –    659
Mandelbrot     48    243  –   169  150  86   696
Image_manip    6     6    –   6    6    6    30
Kfray          292   208  –   138  11   3    652
ClustalW       6     9    –   15   6    6    42
Total          1029  514  7   409  241  118  2318

3.3 Benchmark evaluation

In principle, the communication complexity of each benchmark model determines the complexity of model checking-based verification. Hence, to indicate the validity of our benchmark for evaluating model checkers, we first evaluate the benchmark based on the communication complexity of the models.

The complexity of one process P in a model is determined by two aspects: (a) P's frequency of receiving messages, and (b) P's variety of receiving messages, i.e., the number of processes from which P receives messages. Hence, to measure the complexity of a model, we extract these static features of the model as a matrix M. Each element Mij of the matrix indicates the frequency of receiving messages between the two processes Pi and Pj, i.e., the number of receive operations of Pi that can match a send operation in Pj. For example, the matrix M of the model in Subsection 3.2 is

$$M = \begin{pmatrix} 0 & 2 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}.$$

M01 is 2 because both receive operations of P0, i.e., IRecv(*,1) and Recv(1,1), can match the send operation ISend(0,1) of P1, whereas only IRecv(*,1) can match the send operation ISend(0,1) of P2, so M02 is 1. The number of non-zero values in the ith row of M represents the variety of Pi. For example, the variety of P0 is 2, whereas the variety of P1 and P2 is 0.

Using the matrix of a model, we calculate the complexity of the model based on the notion of the cumulative distribution function (CDF), which is widely used in network complexity evaluation [22]. Given the matrix M of a model, the CDF of M is a function FM(x) : N → R defined as FM(x) := P(Mij ≤ x), where 0 ≤ i ≤ n − 1 and 0 ≤ j ≤ n − 1, and the right-hand side denotes the probability that an element value of M is less than or equal to x. Taking the matrix M shown above as an example, Figure 4 shows the corresponding CDF (a step function with FM(0) = 7/9, FM(1) = 8/9, and FM(2) = 1).

Figure 4  Cumulative distribution function.
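As a concrete check of these definitions, the short Python sketch below recomputes the communication matrix and the CDF for the example model of Subsection 3.2. It is only an illustration under an assumed encoding of the model (per-process lists of operation/peer pairs), not the tooling used in the paper.

from fractions import Fraction

# The example model from Subsection 3.2, encoded as per-process lists of
# (operation, source/destination) pairs; this encoding is only for the sketch.
model = {
    0: [("IRecv", "*"), ("Recv", 1), ("Barrier", None)],
    1: [("ISend", 0), ("Barrier", None)],
    2: [("ISend", 0), ("Barrier", None)],
}

def comm_matrix(model):
    """M[i][j] = number of receive operations of Pi that can match a send of Pj."""
    n = len(model)
    M = [[0] * n for _ in range(n)]
    for i, ops in model.items():
        for op, src in ops:
            if op not in ("Recv", "IRecv"):
                continue
            for j, peer_ops in model.items():
                if j == i:
                    continue
                sends_to_i = any(o in ("Ssend", "ISend") and d == i for o, d in peer_ops)
                if sends_to_i and (src == "*" or src == j):
                    M[i][j] += 1
    return M

def cdf(M, x):
    """F_M(x) = probability that an element of M is <= x."""
    elems = [v for row in M for v in row]
    return Fraction(sum(v <= x for v in elems), len(elems))

M = comm_matrix(model)
print(M)                      # [[0, 2, 1], [0, 0, 0], [0, 0, 0]]
print(cdf(M, 0), cdf(M, 1))   # 7/9 and 8/9, matching Figure 4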
The complexity C of a model can be defined on the basis of the matrix and the CDF as follows:

$$C := C_1 + C_2, \qquad C_1 := 1 + \int_0^{sum} F_M(x)\,\mathrm{d}x, \qquad C_2 := \sum_{0 \le i,\, j \le n-1} \left( 1 - \frac{|M_{ij} - avg|}{\max(M_{ij},\, avg)} \right), \tag{1}$$

where C1 captures the complexity of frequency and variety, and C2 captures the degree of distribution. In particular, the 1 in C1 is a Laplacian smoothing term [23], which prevents C1 from being 0 when all element values are zero; sum and avg stand for the summation and the average of the element values of M, respectively. The complexity calculation ensures the following:
• If the total frequency sum of M is larger, i.e., there are many communications in the model, the model tends to be more complex, which is ensured by C1. The model is also more complex when it contains more processes, which is likewise ensured by C1.
• Under a fixed sum, if the distribution of the frequency values in M is more uniform, i.e., more process pairs participate in the communications, the model also tends to be more complex, which is ensured by C2.

For example, the complexities of the following two matrices M1 and M2 are 4.75 and 8, respectively:

$$M_1 = \begin{pmatrix} 0 & 4 \\ 0 & 0 \end{pmatrix}, \qquad M_2 = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix},$$

while the C1 values of M1 and M2 are both 4.

Table 5  Complexity of programs

Program        2      4      6       8       10
Integrate_mw   3.5    15.25  35.17   113.24  131.82
Diffusion2d    –      48.28  110.67  –       –
Gauss_elim     4.38   28.65  75.39   146.30  240.37
Heat           12.93  42.51  85.80   141.46  211.41
Pingpong       23.75  –      –       –       –
Mandelbrot     10.26  38.10  67.59   97.49   149.37
Image_manip    5.87   18.58  29.65   40.24   50.61
Kfray          25.34  83.61  137.25  182.61  207.53
ClustalW       5.79   60.44  144.47  256.61  421.95

Table 5 shows the average model complexity of each program under different numbers of processes. For each column of Table 5, the complexities of the various programs show a large range of variation, especially when the number of processes is larger, e.g., 10 processes. Therefore, the models of the programs in our benchmark differ from each other with respect to complexity, which indicates that the verification complexities of our benchmark are well distributed.

We also inspect the differences among the models of a single program, as shown in Figure 5. We observe the following:
• No matter what the number of processes is, the box related to Mandelbrot is wide enough to indicate the diversity of the models from Mandelbrot.
• For the models of the remaining programs, diversity appears only in a few cases, i.e., ClustalW under 4 and 6 processes, Diffusion2d under 4 and 6 processes, Kfray under 4, 6, and 8 processes, and Integrate_mw under 8 and 10 processes.
These two observations indicate that the benchmark models of a program under a fixed number of processes tend to have similar complexity.

In addition to the diversity in complexity, we also inspect the benchmark models with respect to the complexity of verification when partial order reduction (POR) [24] is employed. POR is one of the most popular reduction techniques applied in model checking, and its application to concurrent systems improves scalability.
Figure 5  (Color online) Complexity in programs. Complexity under (a) 2 processes; (b) 4 processes; (c) 6 processes; (d) 8 processes; (e) 10 processes.

For MPI programs, if the processes run independently or communicate seldom, POR has a good chance of reducing the state space during verification. If a model has many wildcard receive operations, POR tends to be difficult to apply in its verification. Hence, we collect the average ratio of wildcard receive operations in the models of each program under different numbers of processes. Figure 6 shows the results and illustrates that the average rate of wildcard receives is less than 25% for the majority of the programs, which empirically indicates a large chance of applying POR during model checking. Therefore, our benchmark can also be used to evaluate the design and implementation of POR in a model checker.

Figure 6  (Color online) Percentage of wildcard receive operations.

4 Translation algorithm

To evaluate different model checkers, we need to translate the models in our benchmark into models in the input languages of the model checkers. Algorithm 1 shows the general framework of the translation.

Algorithm 1  Benchmark translation procedure
Input: A benchmark model M, the number of processes n.
Output: A tool-specific model M′.
1: for i ← (0, . . . , n − 1) do
2:   Pi′ := skip;  // the tool-specific model for Pi
3:   for op ← Pi do
4:     if op = Barrier then
5:       Pi′ := Pi′ ; B;
6:     else if op = Wait(req) then
7:       Pi′ := Pi′ ; Wreq?0;
8:     else if op = Ssend(obj, tag) then
9:       Pi′ := Pi′ ; Cobj!0;
10:    else if op = ISend(obj, tag, req) then
11:      Pi′ := Pi′ ; Dobj!0 ; Wreq!0;
12:    else if op = IRecv(obj, tag, req) then
13:      Pr := ROM(M, op, i);
14:      Pi′ := Pi′ || Pr;
15:    else if op = Recv(obj, tag) then
16:      Pr := ROM(M, op, i);
17:      Pi′ := Pi′ ; Pr;
18:    end if
19:  end for
20: end for
21: M′ := || {Pi′ | 0 ≤ i ≤ n − 1};  // all the Pi′ models synchronize on B
22: return M′.

As described in Algorithm 1, given a benchmark model M, for each process Pi of M we construct a tool-specific model Pi′ to simulate Pi. Finally, all the process models are composed in parallel (line 21) to form the tool-specific model for M. For simplicity, we use channel-based notations (e.g., channel read C?x and channel write C!a) and CSP composition operators (e.g., sequential composition ";" and parallel composition "||") to describe the algorithm. Each process Pi is a sequence of MPI operations, and we handle each operation in reverse order.
For Barrier, we compose Pi′ with an event B (a special event for synchronization) to indicate a global synchronization (line 5). Given a Wait operation, we use a channel read Wreq?0, which blocks until a completion message is written to the corresponding channel Wreq by the awaited operation with request req (line 7). When a send operation, i.e., Ssend or ISend, is encountered, we compose Pi′ with Cobj!0 (line 9) or Dobj!0 ; Wreq!0 (line 11), where Cobj is a zero-sized channel, Dobj is a one-sized channel, and Wreq!0 indicates the completion of the operation. The challenging part is to model a receive operation, i.e., Recv or IRecv, especially when it is a wildcard receive. Algorithm 2 builds the model of a receive operation, which is then composed with Pi′ (lines 14 and 17 in Algorithm 1).

Algorithm 2  ROM(M, op, pid)  // Receive operation modeling
Input: benchmark model M, operation op = recv(obj), and process number pid.
Output: the model for the receive operation.
1: matchs := ∅;
2: if obj = ∗ then
3:   for j ← (0, . . . , n − 1) do
4:     matchs := matchs ∪ {send(pid) | send(pid) ∈ Pj};  // send(pid) can be matched with op
5:   end for
6: else if obj = k then
7:   matchs := matchs ∪ {send(pid) | send(pid) ∈ Pk};  // send(pid) can be matched with op
8: end if
9: Pr := [] {Cs?0 | s ∈ matchs};
10: if op = IRecv(req) then
11:   Pr := refine(Pr);
12:   Pr := Pr ; Wreq!0;
13: end if
14: return Pr.

The inputs of Algorithm 2 are a model M, the receive operation op, and the process identity pid. The key idea is to collect the possibly matched send operations and model them with channel read operations. If the receive operation is a wildcard, i.e., obj is "∗", we collect all possibly matched send operations (Ssend or ISend) in all processes (lines 3–5). If obj is k, we collect only the send operations in Pk (line 7). We then create a channel read for each matched send operation and compose these channel reads with the choice operator [] (line 9), representing the possible communications among the processes. To satisfy the requirements of the MPI standard, we need to refine the model (refine at line 11), e.g., to ensure the message receiving rules inside one process; the detailed requirements can be found in [7]. Similar to the non-blocking send operation, if the receive operation is non-blocking, we append Wreq!0 to indicate the completion status (line 12). For example, according to Algorithm 1, we translate the model in Subsection 3.2 into the following CSP model:

Proc0 := ((D1_0?0->Skip [] D2_0?0->Skip); H0_0!0->Skip) || (H0_0?0->Skip; D1_0?0->Skip);
Proc1 := D1_0!0->Skip;
Proc2 := D2_0!0->Skip;
Model := Proc0 || Proc1 || Proc2;

The first three lines represent the models of the three processes in the example; we then combine these process models with the parallel composition operator || to obtain the final model. The two algorithms provide a general framework for modeling an MPI program path. The modeling method is sound and complete [7], which means that the model created by our method includes all the equivalent communication behaviors of the path obtained by changing only the matches of the wildcard receive operations in the path. Moreover, the model is precise, i.e., the MPI program indeed has all the communication behaviors in the model.
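To make the two algorithms concrete, the following Python sketch applies their gist to the running example and prints a PAT-style CSP# model. It is only an illustration, not the authors' implementation: the channel-naming scheme (D1_0, H0) and the shared barrier event B are assumptions of the sketch, refine() is omitted, and the non-blocking receive is sequenced for brevity where Algorithm 1 would compose it in parallel with the rest of the process.

# Illustrative sketch of Algorithms 1-2 (not the paper's implementation).
MODEL = {  # the running example from Subsection 3.2 (statuses omitted)
    0: [("IRecv", "*"), ("Recv", 1), ("Barrier", None)],
    1: [("ISend", 0), ("Barrier", None)],
    2: [("ISend", 0), ("Barrier", None)],
}

def senders_to(model, dst, src):
    """Processes whose sends can match a receive of `dst` from `src` (Algorithm 2)."""
    return [j for j, ops in model.items()
            if j != dst
            and any(op in ("Ssend", "ISend") and obj == dst for op, obj in ops)
            and (src == "*" or src == j)]

def receive_model(model, dst, src, nonblocking):
    # Channel read per matched send, composed with the choice operator [];
    # a non-blocking receive additionally signals completion on an H-channel.
    reads = [f"D{j}_{dst}?0->Skip" for j in senders_to(model, dst, src)]
    pr = reads[0] if len(reads) == 1 else "(" + " [] ".join(reads) + ")"
    return f"({pr}; H{dst}!0->Skip)" if nonblocking else pr

def translate(model):
    procs = {}
    for i, ops in model.items():
        parts = []
        for op, obj in ops:
            if op == "Barrier":
                parts.append("B->Skip")              # simplified global barrier
            elif op == "ISend":
                parts.append(f"D{i}_{obj}!0->Skip")  # one-sized channel write
            elif op == "Ssend":
                parts.append(f"C{i}_{obj}!0->Skip")  # zero-sized channel write
            elif op in ("Recv", "IRecv"):
                # Simplification: sequenced here; Algorithm 1 uses || for IRecv.
                parts.append(receive_model(model, i, obj, op == "IRecv"))
        procs[f"Proc{i}"] = "; ".join(parts)
    return procs

for name, body in translate(MODEL).items():
    print(f"{name} := {body};")
print("Model := " + " || ".join(f"Proc{i}" for i in MODEL) + ";")
print("#assert Model deadlockfree;")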
As indicated by the two algorithms, the channel-based and CSP-like languages are more suitable for modeling our benchmark models, which is determined by the nature of MPI programs, such as message passing and synchronization and parallel execution. For the composition operators, such as choice and — 50 — Hong W J, et al. Sci China Inf Sci October 2019 Vol. 62 200101:15 synchronized parallel composition, we use the specific mechanisms in the modeling languageAutomation of a model Software checker to implement them if the language does not directly support them in syntax. in the Big Data Era: Challenges and Opportunities 5 Evaluation and results This section presents the evaluation of the five state-of-the-art model checkers using the benchmark. We start with describing the experiment setup in Subsection 5.1, then the results are given in Subsections 5.2– 5.4. Finally, we discuss the threats to validity in Subsection 5.5. Based on the benchmark, we evaluate the five model checkers and try to answer the following questions. RQ1: Which one of the model checkers is the most effective and efficient for deadlock freedom verification? RQ2: What is the correctness of the model checkers for verifying deadlock freedom, including the consistency of the results produced by different model checkers and the runtime problems of the model checkers during verification? RQ3: How convenient is the model checker for modeling message passing programs? 5.1 Experimental setup We implemented the translation algorithms presented in Section 4 for the five selected model checkers. The verified property is deadlock freedom, which is supported by all the model checkers. The versions of the model checkers are PAT (v3.4.0), Spin (v6.4.8), FDR (v4.2.3), PRISM (v4.4.0), and NuSMV (v2.6.0). The time threshold for a verification task is one hour. All the verification tasks are carried out on a server with 32 Xeon 2.5 G cores and 256 G memory, and the OS is Ubuntu Linux 14.04. 5.2 Effectiveness and efficiency Table 6 shows the main evaluation results. The first column lists the name of programs and the number of the corresponding models in our benchmark. The evaluated model checkers are displayed in the second column. The column verified models shows the number of successfully verified models by each model checker (both “Seg fault” and “Memory Error” mean there exists a runtime error). The last column average time lists the average time that the model checker used for verifying the models of the program (“–” means the average time is not available because all the models met the runtime error and “Timeout” means all the models failed to be verified within the time threshold). Note that the results where the corresponding model checker performs better are highlighted in bold. The criteria for better performance are verifying more models and using less time. For the 2318 models in the benchmark, PAT, Spin, FDR, PRISM, and NuSMV successfully verify 2308 (99%), 2318 (100%), 777 (34%), 656 (28%), and 489 (21%) models, respectively, in one hour. Hence, PAT and Spin have comparative effectiveness on deadlock freedom verification on our benchmark models. FDR can verify less than half of the models. Both of PRISM and NuSMV have relatively poor results, mainly because of the modeling language support for message passing programs, which will be discussed in Subsection 5.4. Figure 7 shows the percentages of verified models on each program for every model checker. 
Within the timeout threshold, NuSMV and PRISM fail to verify any models of three programs, i.e., Diffusion2d, Pingpong, and Kfray. In addition, FDR fails to verify any models of Pingpong and Kfray. We also inspect the efficiency of the evaluated model checkers. Figure 8(a) shows the time costs (less than 15 s) of the model checkers on all verified models. Considering that NuSMV and PRISM need hundreds of seconds on average for most programs, we do not include NuSMV and PRISM in the figure. As shown in the figure, for the models that can be verified in less than 15 s, PAT and FDR have a comparative efficiency. Although Spin is less efficient than PAT and FDR, the number of exceptional points of Spin is smaller and the time costs are centralized, whereas FDR and PAT have more exception cases. These indicate that Spin has a stable performance. To verify our conclusion further, we inspect the three model checkers on the same verified models (682) whose verification time costs are less than 15 s. Figure 8(b) shows the results. Compared with the results — 51 — Hong W J, et al. 2018 Sci China Inf Sci Table 6 Program Proceedings Model checker Meeting DTG (7) Integrate mw (77) Diffusion2d (107) Gauss elim (11) Heat (37) Pingpong (659) Mandelbrot (696) Image manip (30) Kfray (652) ClustalW (42) — 52 — October 2019 Vol. 62 200101:16 Experimental results Verified models Average time (s) PAT 7 0.29 FDR 7 0.20 SPIN 7 1.17 PRISM 7 1.91 NuSMV 7 0.02 PAT 73 2.13 FDR 75 8.27 SPIN 77 1.82 PRISM 68 126.62 NuSMV 9 15.39 PAT 107 1.05 FDR 2 31.61 SPIN 107 3.18 PRISM 0 (Memory Error) – NuSMV 0 (Seg fault) – PAT 10 5.11 FDR 9 13.12 SPIN 11 5.26 PRISM 2 1.61 NuSMV 4 1468.03 PAT 37 0.30 FDR 24 9.60 SPIN 37 2.24 PRISM 13 2.76 NuSMV 13 5.52 PAT 659 0.31 FDR 0 Timeout SPIN 659 3.36 PRISM 0 (Memory Error) – NuSMV 0 (Seg fault) – PAT 694 0.75 FDR 610 7.41 SPIN 696 1.59 PRISM 530 97.41 NuSMV 428 265.73 PAT 30 0.41 FDR 25 0.67 1.34 SPIN 30 PRISM 30 7.00 NuSMV 22 253.91 PAT 650 3.80 FDR 0 Timeout SPIN 652 3.36 PRISM 0 Timeout NuSMV 0 Timeout PAT 41 10.88 FDR 25 89.68 SPIN 42 5.66 PRISM 6 1.59 NuSMV 6 0.05 Hong W J, et al. October 2019 Vol. 62 200101:17 Sci China Inf Sci PAT Software Automation SPIN in FDRthe Big Data Era: PRISM Challenges and Opportunities NuSMV Verification percentage (%) 100 80 60 40 20 0 lW ta p ni ma ot (Color online) Percentage of verified models. Time cost (s) Time cost (s) us Cl y ra Kf e_ br el ag Im d im ng po nd Ma ng Pi el n2 mw e_ at io s_ us gr at He us ff te Figure 8 Ga Di In G DT Figure 7 Model checker Model checker (a) (b) (Color online) Time costs of model checkers. (a) On the all models; (b) on the same models. in Figure 8, Spin has a more comparative performance, i.e., it has less than 2 s on average on the same models. Similarly, in Figure 8, the time costs of Spin are also centralized on the same models. FDR has the most exceptional points. Figure 9 shows the detailed information of the performance of each model checker on each MPI program except Diffusion2d, Pingpong, and Kfray, which are not verified successfully by all the model checkers. As shown in Figure 9, PAT, FDR, and Spin have a similar result on each program like that on all models shown in Figure 8. For example, of the models that can be verified in less than 15 s, PAT and FDR have a more efficient performance than Spin, and the data distribution of FDR is more widespread, e.g., ClustalW and Heat. 
In addition, in terms of average time cost, we observe that PRISM takes more time to verify most of the models, e.g., those of DTG, Image_manip, and Gauss_elim. Furthermore, NuSMV performs well when the models are simple and verifiable, e.g., DTG and Integrate_mw under 2 processes; however, NuSMV requires much more time to verify complex models, such as Gauss_elim and Mandelbrot under more than 4 processes. For these models, NuSMV needs roughly two orders of magnitude more time than the first three model checkers.

Figure 9  (Color online) Time costs of model checkers on each program. (a) Integrate_mw; (b) ClustalW; (c) DTG; (d) Heat; (e) Image_manip; (f) Mandelbrot; (g) Gauss_elim.

Figure 10 shows the quantitative trend of each model checker in completing verification tasks under the time threshold. The X-axis is the time limit, showing the results within 100 s only; the Y-axis shows the number of verified models. We do not include PRISM and NuSMV in the figure because of their poor performance within 100 s. As the figure indicates, all three model checkers complete most of the tasks within 20 s, and Spin verifies more models than FDR and PAT within the same time threshold.

Figure 10  Trends of completion of verification tasks.

In addition, PAT and Spin verify more than twice as many models as FDR under the time limit, indicating that PAT and Spin have better effectiveness and efficiency than the remaining two. The number of processes determines the scale of parallelism, and models with a larger parallelism scale are more challenging for the model checkers than those with a smaller scale. We therefore inspect the performance of each model checker on programs running under different numbers of processes.
Interestingly, the performance of Spin does not decrease immediately with increasing number of processes. However, the remaining model checkers, especially NuSMV, show a sharp decrease in the performance for large number of programs. This indicates that Spin has the more stable performance, consistent with what Figure 8 indicates. As stated in Subsection 3.3 and Figure 6, our benchmark is suitable for evaluating the design and implementation of POR. The evaluation results of the five model checkers also indicate their supports for POR. Spin and PAT have good support for POR, while the support of FDR for POR is only partial — 55 — Hong W J, et al. Sci China Inf Sci October 2019 Vol. 62 200101:20 2018 and still experimental. Both PRISM and NuSMV are symbolic model checkers and they do not support POR. These facts are also consistent with the evaluation results, i.e., Spin and PAT perform better than Meeting Proceedings the remaining three, and FDR performs better than PRISM and NuSMV. With respect to the experimental results, we summarize the following answer for RQ1. Answer to RQ1: Spin can verify all models in the benchmark within 1 h. Spin and PAT are more effective than the remaining three model checkers. In addition, they show better efficiency than the remaining three model checkers under the same time threshold. The performance of Spin is the most stable among the model checkers. 5.3 Correctness In addition to effectiveness and efficiency, we can also check the correctness of the model checkers using the idea of differential testing [25]. If the results of the model checkers are different on any same model, there must be a problem in the implementation of at least one model checker. According to the verification results, we observe that there is no inconsistency when there are multiple model checkers that can produce a result on a model. This depicts that the implementations of the model checkers are of high quality. During experiments, we also observed that NuSMV and PRISM had crashed many times due to memory problems. We collected the models that report “Segmentation Fault” during the NuSMV verification process, which includes 6 models in ClustalW, 70 models in Diffusion2d, and 60 models in Pingpong. Based on the source code of NuSMV, we located the place and found that the reason is a stack overflow. For the models that cause “Out of Memory Error” in the PRISM verification process, most of them are in the benchmark of Diffusion2d and Pingpong. In addition, FDR sometimes reports runtime errors such as “Can’t Allocate Memory”, “Double Free or Corruption (!Prev)”, and “Corrupted Size vs. Prev size”. We collected the input models of these error cases and reported them to the developers. We expect feedback from them. Answer to RQ2: In case of a successful verification, the model checkers are consistent on the verification results; this indicates the high quality of the implementations. NuSMV, PRISM, and FDR have runtime memory problems when verifying some models. 5.4 Convenience of modeling The procedure of implementing the translation algorithms of Section 4 for each model checker is to use the modeling language of the model checker to model the MPI operations with respect to MPI standard. The convenience of the modeling language also influences the results of effectiveness and efficiency. Channel-based modeling constructs, e.g., channel read, write and emptiness checking, are natural and effective for modeling message passing programs. 
PAT and Spin provide channel constructs in their modeling languages. Therefore, it is not hard for us to use the languages of PAT and Spin to implement the algorithms in Section 4. However, for FDR, PRISM, and NuSMV, we encountered some problems when implementing the translation algorithms. Although both PAT and FDR verify the models in CSP, the channel constructs of PAT are more convenient than those of FDR. The input language of PAT, i.e., CSP#, has channel read, write, and emptiness checking constructs. The emptiness checking is a key to ensure that the completes-before relations [26] required by MPI standard are satisfied. However, the channel constructs of FDR regard channel operations as events, which complicates the modeling. Using more processes and events, we simulate channel operation. Hence, this channel operation adds numerous extra but necessary events, which makes models bigger and complicated. This is also the key reason why FDR performs worse than PAT and Spin. The input language of NuSMV provides the constructs to build a labelled transition system for the system that to be verified. No channel operation construct exists in the language. Using the state-oriented language of NuSMV to model MPI operations is complicated. In addition, if the behaviour of a model is very non-deterministic, the modeling results in a large number of states and transitions. Furthermore, a choice operation is the key to modeling wildcard MPI operations, and every model in the benchmark has — 56 — Hong W J, et al. Sci China Inf Sci October 2019 Vol. 62 200101:21 wildcard operations. However, we encounter a problem for modeling choice operation in NuSMV. In the Software Automation beginning, we used case structure to model choice operation. Consider the following NuSMV model. in the Big Data Era: Challenges and Opportunities 1 2 3 4 5 6 7 8 9 10 MODULE main VAR pc : 1..3; ASSIGN init(pc) := 1; next(pc) := case pc = 1 & cons1 : 2; pc = 1 & cons2 : 3; esac; In this model, pc denotes the current state. If pc is equal to 1 and cons1 holds, the current state will be changed to 2; if pc is equal to 1 and cons2 holds, the current state will be changed to 3. However, if the condition pc is equal to 1 holds and both cons1 and cons2 hold, then the current state will only be changed to 2. This is because the case structure is actually an if-else structure. Therefore, using a random selection method, we solve the problem. If several constraints are satisfied at the same time, we make a random selection. For the case structure problem demonstrated before, the following solution is adopted, where line 10 uses a random selection when both cons1 and cons2 hold. 1 2 3 4 5 6 7 8 9 10 11 MODULE main VAR pc : 1..3; ASSIGN init(pc) := 1; next(pc) := case pc = 1 & cons 1 & not cons2 : 2; pc = 1 & cons 2 & not cons1 : 3; pc = 1 & cons 1 & cons2 :{2, 3}; esac; However, this solution results in two choices. If there are n choices, then we have to define 2n − 1 transitions and states. Hence, the number of choices increases the number of states and transitions rapidly, which is also a reason that NuSMV has a poor performance. The same problem also occurs to PRISM. Answer to RQ3: PAT and Spin provide more convenient constructs for modeling message passing programs. The convenience of modeling directly influences the effectiveness and efficiency of verification. 5.5 Threats to validity There are external and internal threats to the validity of our results. 
The external threats come from the limited number of MPI programs used for generating the benchmark. However, we mitigate these external threats in the following three aspects:
• All the MPI programs are real-world MPI programs, and some of them have been used as benchmark programs in previous studies [27, 28];
• These programs mix blocking and non-blocking operations, and their scales are beyond the capabilities of state-of-the-art static verification tools for MPI programs;
• Although the number of models in our benchmark is limited, as far as we know, the scale of our benchmark is already large (cf. the related work discussion in Section 6).

Figure 12 (Color online) Complexity comparison between model and verification. (a) The trend of model complexity; (b) the trend of verification complexity (Spin). Both panels plot the nine benchmark programs (Integrate_mw, Diffusion2d, Gauss_elim, Heat, Pingpong, Mandelbrot, Image_manip, Kfray, ClustalW) against the process number (1-10); the vertical axis is complexity in (a) and time cost (s) in (b).

In addition, we plan to address this threat further in the future by analyzing more real-world MPI programs and adding more models to the benchmark. Furthermore, the evaluated model checkers with state-based input languages, i.e., NuSMV and PRISM, are naturally not good at modeling message passing programs, which may bias our evaluation results. However, our evaluation empirically validates this insight and provides a quantitative result of how inefficient these model checkers are in verifying message passing programs. In addition, the model checker PRISM is designed for probabilistic systems, whereas our benchmark does not contain probabilistic models, which may also introduce bias.

The internal threats come from our evaluation methods for the benchmark and our implementations of the translation algorithms for the model checkers. We controlled these threats by testing each implementation and drawing the following conclusions:
• The consistency of the results of the different model checkers indicates the high quality of our implementations.
• To check whether the model complexity reflects trends in verification complexity, we compared model complexity with verification complexity. Figure 12(a) shows the trend of model complexity under different process numbers, while Figure 12(b) shows the corresponding verification time of Spin. As demonstrated in Figure 12, the similarity between the trends of the lines in the two figures indicates that the criterion for evaluating model complexity and our modeling are reasonable.
• We empirically validate the complexity evaluation of POR in Subsection 3.3. For Spin, under the same complexity and process number, a model with fewer wildcard receive operations is verified faster. For example, the average complexities of Kfray and Heat under 10 processes are 207.53 and 211.41, respectively. The corresponding wildcard receive percentages are 26% and 5%, respectively, whereas their average verification time costs are 17.95 and 3.58 s, respectively. These results support the validity of our complexity evaluation of POR.

6 Related work

Our work is closely related to the existing benchmark, evaluation, and contest work on model checking. Next, we discuss the related work and make a comparison. BEEM [3] is a benchmark for explicit model checkers.
Inside BEEM, there are 50 parametrized models and their properties (safety or liveness properties) to verify. BEEM has been used by many later studies [29-31] as the benchmark for evaluating LTL model checking. Most models in BEEM are well-known examples and case studies. Unlike the models in BEEM, which are manually created, the models in our benchmark are extracted automatically from MPI programs. Moreover, our benchmark has more models. However, our benchmark is only concerned with verifying the deadlock freedom property.

The Model Checking Contest is an event for evaluating model checkers for Petri nets. The benchmarks of the contest are Petri net models created from representative case studies. The properties are CTL, LTL, and reachability. Similar to BEEM, the models used by the contest are created manually by different model contributors. Kwiatkowska et al. [32] present a benchmark for probabilistic model checking. The benchmark contains thirty models covering four types of probabilistic models, and each model is accompanied by several probabilistic properties. There is also a benchmark for asynchronous concurrent systems [33]. In the same manner as before, the models in that benchmark are created manually.

The hardware model checking competition (HWMCC) has been held annually since 2006. The benchmarks of HWMCC are unified in the AIGER format11). The benchmarks come from hardware design, e.g., they are manually created or randomly generated with respect to hardware designs. Multiple tracks are designed for different types of properties, such as safety and liveness. After more than 10 years of development, the number of models in the HWMCC benchmark is near two thousand. Compared with the benchmark of HWMCC, our benchmark is extracted from real-world MPI software and focuses on deadlock freedom verification. Similar to HWMCC, a software verification competition, SV-Comp, is held each year. The competition aims to evaluate verification tools for software systems. Most benchmarks of SV-Comp are preprocessed C programs, many of which are manually designed. Some benchmarks are also extracted from real-world programs, such as Linux driver programs. Different tracks of SV-Comp exist for different kinds of properties and programs. The concurrency safety track verifies concurrent programs. However, message passing programs are not included in this track, which contains only multi-threaded programs. Compared with the benchmark of SV-Comp, our benchmark aims to evaluate model checkers whose inputs are models instead of code. Moreover, our benchmarks are all generated from real-world MPI programs.

Furthermore, there is work evaluating model checkers by applying them to verify a single system. Frappier et al. [2] compared six model checkers for verifying an information system, aiming to identify the features required of model checkers for verifying information systems. Fifteen properties specified in different formalisms, e.g., LTL and CTL, are verified. Similar to [2], Pamela compared Alloy12) and Spin by using both of them to verify a distributed hash table system. These approaches mainly care about the expressiveness of the modeling and specification languages.
Compared with them, we evaluate model checkers with the large number of models in our benchmark instead of a single system, and we also compare the performance of the model checkers for deadlock freedom verification.

Our study is based on MPI-SV, which is related to the automatic program analysis work for MPI programs. Existing studies of automatically analyzing MPI programs can be divided into dynamic and static ones. Dynamic approaches, e.g., ISP [27] and MOPPER [28], create a model by running an MPI program, which depends on specific inputs. Therefore, test inputs are needed for generating models. However, creating test inputs is also labor-intensive. Static approaches, e.g., TASS [15] and CIVL [6], abstract the whole program into a model for verification, but few of them support non-blocking MPI programs, which are frequently used in real-world MPI programs. Hence, the existing approaches are not appropriate for benchmark generation, which is also the reason we use MPI-SV: it uses symbolic execution to tackle the problem of creating test inputs and supports the analysis of mixed blocking and non-blocking MPI programs.

7 Conclusion and future work

Benchmarks and evaluation are very important for developing model checking techniques. This study creates a benchmark based on MPI-SV for evaluating existing model checkers on verifying message passing programs. In the benchmark, 2318 models are automatically extracted from real-world MPI programs by MPI-SV. Based on the benchmark, we evaluate five state-of-the-art model checkers in three aspects: effectiveness and efficiency, correctness, and convenience of modeling. The evaluation results indicate that Spin is the most effective for verifying deadlock freedom, and that the modeling languages of FDR, PRISM, and NuSMV are not intuitive for modeling message passing programs. Our benchmark can be downloaded at the address13).

The future work lies in two aspects: (1) enlarge the scale of the benchmark by extracting models from more MPI programs and adding more properties for verification; (2) develop translators for more model checkers for evaluation, e.g., Petri net model checkers.

Acknowledgements This work was supported by National Key R&D Program of China (Grant No. 2017YFB1001802) and National Natural Science Foundation of China (Grant Nos. 61472440, 61632015, 61690203, 61532007).

11) AIGER Website. http://fmv.jku.at/aiger/.
12) Alloy Website. http://alloytools.org/.
13) https://github.com/mc-benchmark/mpi-benchmark.

References
1 Clarke E M, Grumberg O, Peled D A. Model Checking. Cambridge: MIT Press, 2001
2 Frappier M, Fraikin B, Chossart R, et al. Comparison of model checking tools for information systems. In: Proceedings of the 12th International Conference on Formal Engineering Methods, 2010. 581–596
3 Pelánek R. BEEM: benchmarks for explicit model checkers. In: Proceedings of the 14th International SPIN Workshop on Model Checking Software, 2007. 263–267
4 Gopalakrishnan G, Kirby R M, Siegel S F, et al. Formal analysis of MPI-based parallel programs. Commun ACM, 2011, 54: 82–91
5 Siegel S F. Verifying parallel programs with MPI-Spin. In: Proceedings of Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2007. 13–14
6 Luo Z Q, Zheng M C, Siegel S F. Verification of MPI programs using CIVL. In: Proceedings of the 24th European MPI Users’ Group Meeting, 2017. 6: 1–11
7 Yu H B, Chen Z B, Fu X J, et al.
Combining symbolic execution and model checking to verify MPI programs. 2018. ArXiv: 1803.06300 8 King J C. Symbolic execution and program testing. Commun ACM, 1976, 19: 385–394 9 Gibson-Robinson T, Armstrong P, Boulgakov A, et al. A modern refinement checker for CSP. In: Proceedings of Tools and Algorithms for the Construction and Analysis of Systems, 2014. 187–201 10 Lattner C. Llvm and clang: next generation compiler technology. In: Proceedings of the BSD Conference, 2008. 1–2 11 Hoare C A R. Communicating Sequential Processes. Upper Saddle River: Prentice-Hall, 1985 12 Scattergood J B. The semantics and implementation of machine-readable CSP. Dissertation for Ph.D. Degree. Oxford: University of Oxford, 1998 13 McMillan K L. Symbolic model checking. Norwell: Kluwer Academic Publishers, 1993 14 Baier C, Katoen J. Principles of Model Checking. Cambridge: MIT Press, 2008 15 Siegel S F, Zirkel T K. TASS: the toolkit for accurate scientific software. Math Comput Sci, 2011, 5: 395–426 16 Xue R N, Liu X Z, Wu M, et al. Mpiwiz: subgroup reproducible replay of mpi applications. In: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009. 251–260 17 Müller M, de Supinski B, Gopalakrishnan G, et al. Dealing with mpi bugs at scale: Best practices, automatic detection, debugging, and formal verification. 2011 18 Vakkalanka S. Efficient dynamic verification algorithms for MPI applications. 2010 19 Thompson J D, Higgins D G, Gibson T J. Clustalw: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 1994, 22: 4673–4680 20 Lattner C, Adve V S. LLVM: a compilation framework for lifelong program analysis & transformation. In: Proceedings of the 2nd IEEE/ACM International Symposium on Code Generation and Optimization (CGO 2004), 2004. 75–88 21 Just R, Jalali D, Inozemtseva L, et al. Are mutants a valid substitute for real faults in software testing? In: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2014. 654–665 22 Newman M E J. The structure and function of complex networks. SIAM Rev, 2003, 45: 167–256 23 Hermann L R. Laplacian-isoparametric grid generation scheme. J Eng Mech Div, 1976, 102: 749–907 24 Godefroid P. Partial-order methods for the verification of concurrent systems — an approach to the state-explosion problem. In: Lecture Notes in Computer Science. Berlin: Springer, 1996 25 McKeeman W M. Differential testing for software. Digit Tech J, 1998, 10: 100–107 26 Vakkalanka S S, Gopalakrishnan G, Kirby R M. Dynamic verification of MPI programs with reductions in presence of split operations and relaxed orderings. In: Proceedings of the 20th International Conference on Computer Aided Verification, 2008. 66–79 27 Vakkalanka S S, Sharma S, Gopalakrishnan G, et al. ISP: a tool for model checking MPI programs. In: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008. 285–286 28 Forejt V, Joshi S, Kroening D, et al. Precise predictive analysis for discovering communication deadlocks in MPI programs. ACM Trans Program Lang Syst, 2017, 39: 1–27 29 Blom S, van de Pol J, Weber M. Ltsmin: distributed and symbolic reachability. In: Proceedings of the 22nd International Conference on Computer Aided Verification, 2010. 354–359 30 Lal A, Reps T W. Reducing concurrent analysis under a context bound to sequential analysis. 
Form Methods Syst Des, 2009, 35: 73–97
31 Laarman A, van de Pol J, Weber M. Boosting multi-core reachability performance with shared hash tables. In: Proceedings of the 10th International Conference on Formal Methods in Computer-Aided Design, 2010. 247–255
32 Kwiatkowska M Z, Norman G, Parker D. The PRISM benchmark suite. In: Proceedings of the 9th International Conference on Quantitative Evaluation of Systems, 2012. 203–204
33 Atiya D A, Catano N, Lüttgen G. Towards a benchmark for model checkers of asynchronous concurrent systems. In: Proceedings of the 5th International Workshop on Automated Verification of Critical Systems (AVOCs), 2005. 98: 142–170

SCIENCE CHINA Information Sciences, October 2019, Vol. 62, 200102:1–200102:16
https://doi.org/10.1007/s11432-018-1465-6
RESEARCH PAPER · Special Focus on Software Automation

A manual inspection of Defects4J bugs and its implications for automatic program repair

Jiajun JIANG1,2, Yingfei XIONG1,2* & Xin XIA3

1 Key Laboratory of High Confidence Software Technologies, Ministry of Education, Institute of Software, Beijing 100871, China;
2 Department of Computer Science and Technology, Peking University, Beijing 100871, China;
3 The Faculty of Information Technology, Monash University, Melbourne 3800, Australia
* Corresponding author (email: xiongyf@pku.edu.cn)

Received 25 October 2018 / Revised 28 January 2019 / Accepted 21 June 2019 / Published online 6 September 2019

Abstract  Automatic program repair techniques, which aim to generate correct patches for real-world defects automatically, have gained a lot of attention in the last decade. Many different techniques and tools have been proposed and developed. However, even the most sophisticated automatic program repair techniques can only repair a small portion of defects while producing a large number of incorrect patches. A possible reason for this low performance is that the test suites of real-world programs are usually too weak to guarantee the behavior of a program. To understand to what extent defects can be fixed with existing test suites, we manually analyzed 50 real-world defects from Defects4J, a large portion (i.e., 82%) of which were correctly fixed. This result suggests that there is much room for current automatic program repair techniques to improve. Furthermore, we summarized seven fault localization and seven patch generation strategies that are useful in localizing and fixing these defects, and compared those strategies with current techniques. The results indicate potential directions for improving automatic program repair in the future.

Keywords  automatic defect repair, fault localization, manual repair, software maintenance, case study

Citation  Jiang J J, Xiong Y F, Xia X. A manual inspection of Defects4J bugs and its implications for automatic program repair. Sci China Inf Sci, 2019, 62(10): 200102, https://doi.org/10.1007/s11432-018-1465-6

1 Introduction

Automatic program repair (APR) techniques, which automatically generate patches for defects in programs and contribute to software automation [1], have gained a lot of attention in the last decade. A typical automatic program repair technique [2–5] takes a program and a set of tests as input, where at least one test is failed by the program, and generates a patch that fixes the defect. Different techniques and tools have been proposed.
These tools generate a patch through techniques such as directed random search [2, 3], templates [6], component-based program synthesis [5, 7, 8], program transformation from examples [9–11], and machine learning [12]; they incorporate fault localization approaches such as spectrum-based fault localization [13, 14], predicate switching [15], and angelic debugging [16]; and they utilize information such as testing results [2, 17], existing patches [6, 9, 12], invariants [18], existing source code [19, 20], bug report text [21], and comments [4].

Despite these efforts, in practice even the most sophisticated automatic program repair techniques can only repair a small portion of defects while producing a large number of incorrect patches. For example, Prophet [12] and Angelix [5], two approaches for the C language, can only fix 14.3% and 12.2% of the defects on the GenProg benchmark [22], while producing incorrect patches for another 22.8% and 22.0% of the defects, respectively. The newest approach for Java, SimFix [20], can only fix 9.5% of the defects on the Defects4J benchmark [23] while producing incorrect patches for another 6.2% of the defects.

An often attributed reason for this low performance, especially the large number of incorrect patches, is that the test suites of real-world programs are usually weak. As studied by Qi et al. [24], and Long and Rinard [12], test suites in real-world programs are often weak, and in a space of patches that pass all tests, there are many more incorrect patches than correct ones. Moreover, Martinez et al. [25], in a large experiment on the Defects4J benchmark, found that the test suites cannot even guarantee the functionality completeness of the program under testing, i.e., when some functional code was deleted, the programs could still pass the test suites. That is also why, by enhancing Defects4J’s test suites, many incorrect patches can be successfully ruled out [26]. As a matter of fact, existing automatic program repair techniques usually rely on existing test suites, which serve as incomplete specifications of the programs under repair. Therefore, it is very difficult for them to distinguish incorrect from correct patches.

Because the performance of current repair techniques is still limited, a question naturally arises: is it possible to repair a large portion of defects with existing test suites? This question is important, because if most of the defects cannot be fixed, we may need to change the problem settings of automatic program repair, e.g., asking the user to provide formal specifications of the programs. On the contrary, if most defects can be fixed, we can focus on improving the current techniques.

To answer this question, we manually analyze 50 defects randomly selected from Defects4J [23], a widely-used benchmark of real-world defects in Java programs, to see to what extent these defects can be fixed. In our analysis, a defect is considered repairable within a given time frame if and only if (1) we could identify a possible root cause1) of the defect, (2) we could generate a patch that tackles the root cause and passes all the tests, and (3) the patch is semantically equivalent to the developer patch.
This study could help us understand the potential of automatic program repair and improve current techniques. If a defect is considered repairable in our analysis, there exists at least one manual process to obtain the patch for the defect. By decomposing and automating the manual process, we can potentially obtain an automatic method to repair the defect. Furthermore, if we find that many more defects can be fixed than by current state-of-the-art approaches, it indicates that current automatic program repair techniques have great potential for improvement.

During the analysis of those defects, we focus on four research questions and obtain the corresponding results listed below:
• RQ1: How many of those defects can be fixed under the existing test suites? In our analysis, 41 (82.0%) of the 50 defects are correctly fixed, while 6 (12.0%) defects are incorrectly fixed because the test suite fails to provide sufficient specifications. Moreover, we fail to generate valid patches for 3 (6%) defects, which require domain knowledge that is difficult to obtain from the program and the tests. Though these numbers come from one manual analysis and may not be generalizable, they provide insight that current APR techniques potentially have room for improvement.
• RQ2: How are those defects located, and what are the implications for future studies? After decomposing the manual analysis process, we summarize seven fault localization strategies that are applied in our manual analysis, along which we compare the most related fault localization techniques with each strategy and identify concrete points where current techniques could be improved.
• RQ3: How are those patches generated, and what are the implications for future studies? Similarly, we summarize seven patch generation strategies from the manual analysis, compare them with related program repair techniques, and propose implications for future study.
• RQ4: What are the inspirations from the manual analysis? According to the analysis, we find that though many strategies have already been explored by current techniques, there is still a lot of room for improvement. Moreover, some strategies may inform new techniques.

To conclude, the main contributions of this paper are a set of fault localization and patch generation strategies learned from the manual analysis, which provide concrete directions for future research.

1) Please note that the root cause of a defect may not be the location of the code to be changed, but explains the reason for a program failure.

2 Background and related work

2.1 Automatic program defect repair

As mentioned in the introduction, in a typical defect repair setting the repair technique takes as input a program and a set of tests, where the program fails at least one test, and produces as output a patch that is expected to repair the defect when applied to the program. Because tests are used as the primary tool to guarantee the correctness of the patches, we call this setting test-based program repair.

A key issue in evaluating the performance of repair tools is how to determine the correctness of the generated patches. In the early studies [2, 6] of automatic repair, a patch is usually considered correct if the patched program passes all the tests.
In recent studies [3–5, 12, 19, 20, 27, 28], a patch is usually considered correct if it is semantically identical to the patch produced by humans. Note that neither approach produces an ideal measurement of correctness: the former may overstate the number of correct patches (because the test suites may be too weak to guarantee correctness) while the latter may understate the number of correct patches (because a defect may be repaired in different ways). However, as studied by Qi et al. [24], the former approach is very imprecise for real-world programs because the test suites are usually weak. Similarly, Smith et al. [29] showed that inadequate test suites lead to overfitting patches and suggested that repair techniques must go beyond testing to characterize the functional correctness of patches. As a result, in this paper we take the latter approach, determining correctness by equivalence with the human patches previously generated by the developers of the programs.

Many defect repair approaches follow a “generate-and-validate” approach, i.e., these approaches first try to locate a likely patch in a large patch space, and then validate the patch using all the tests. There are two main challenges in the repair process. The first is to ensure the correctness of the generated patches. As mentioned above, the tests in real-world programs are often not enough to guarantee the correctness of the generated patches. The second is to generate correct patches for a large number of defects. Because the patches need to be validated against all tests, the number of generated patches cannot be large. Since they must locate a small number of likely patches from the patch space, current approaches cannot support a large patch space. As studied by Long and Rinard [30] and Zhong and Su [31], most defects cannot be fixed by the patch space considered in current approaches. For example, to reduce the search space, some techniques follow predefined templates for patch generation, which are similar to the strategies proposed in this paper. Kim et al. [6] and Tan et al. [32] defined a set of repair patterns and anti-patterns, respectively, to guide patch generation. Similarly, Long et al. [3] and Saha et al. [28] proposed sets of program transformation schemas to constrain the search space of patch generation. However, compared with the strategies derived from our analysis, the templates used in these approaches are mainly syntactic templates derived from the changes, while our strategies try to reason about why the program failed from a developer's point of view and focus more on the process of how the patches can be deduced.

There are also defect repair approaches that use a different problem setting. For example, some approaches assume that there exists a full specification of the program [33, 34], and some approaches consider a concrete class of defects such as memory leaks [35], deadlocks [36], and build failures [37]. These different problem settings are not the focus of our paper.

2.2 Empirical studies on defect repair

There exist several empirical studies on defect repair. Zhong and Su [31] studied real bug fixes by analyzing the commits of five open source projects. They analyzed the distributions of fault locations and modified files. To investigate the complexity of fixing bugs, they analyzed data dependence among faulty lines. More concretely, they analyzed the operations of bug fixes and their frequencies related to APIs.
As another study, Martinez and Monperrus [38] studied the distribution of real bug fixes by analyzing a large number of bug fix transactions in software repositories. To better understand the nature of bug fixes, they classified those bug fixes with different classification models. Besides, Soto et al. [39] analyzed a great deal of bug-fixing commits in Java projects, aiming to provide guidance for future APR approaches. In contrast to our study, their studies focus on the distribution of the characteristics of defects and patches rather than on how these defects are fixed, so it is difficult to derive conclusions on the repairability of the defects from them. In addition, Yang et al. [40] proposed to filter out overfitting patches by enhancing existing test cases, which cannot tell how likely it is that existing bugs can be repaired under the given test cases. On the contrary, in our empirical study, we not only analyze the possibility that defects can be repaired, but also identify several concrete directions to improve existing techniques, which are orthogonal to their work. Similarly, several previous studies [41, 42] also revealed that better test suites lead to more accurate fault localization results. Yang et al. [43] studied the difference in repair results under two statement selection strategies for statement modification, i.e., the suspiciousness-first algorithm (SFA) based on the suspiciousness of statements and the rank-first algorithm (RFA) relying on the rank of statements. Their study is similar to ours in that it proposes implications to guide future studies, but from a different perspective.

There also exist other human-involved studies. Tao et al. [44] conducted a study in which real defects are repaired manually with the help of APR techniques. It is different from ours because they focus on how the generated patches help the developers rather than on how patches can be derived. Several researchers have studied the debugging process of human developers. Lawrance et al. [45] studied how human developers navigate through the debugging process and created a model for predicting the navigation process. LaToza and Myers [46] studied the questions developers ask during debugging. Murphy-Hill et al. [47] studied the factors developers consider during debugging. Different from these studies, our study focuses on analyzing the repairability of defects rather than understanding how human developers behave.

3 Dataset and environment

We conduct our case study on Defects4J [23] (v1.0), which consists of 357 defects from five open source projects: JFreeChart, Closure compiler, Apache commons-Lang, Apache commons-Math, and Joda-Time. Because the whole of Defects4J is too large for manual analysis, we randomly select ten defects from each project, and thus have a dataset of 50 defects. To understand how many defects can be repaired, we analyze each defect in the dataset to determine whether we can locate a correct patch for the defect. Our manual analysis is performed under the following three environment settings, like many existing automatic program repair techniques [2, 4, 5, 8, 20, 48].
• We do not have prior knowledge of the programs under analysis. In other words, we do not know the complete specifications of the programs except the test suites.
• We only rely on the source code of the program to generate the patch, including both implementation code and testing code.
In particular, we have no access to the patch of the defect provided by the developer in the benchmark.
• During the analysis, we can access the Javadoc and comments in the source code, but no extra documents are provided. Besides, we can access the Internet, but cannot search for the bug directly.

In this way, we put ourselves into the same environment setting as most test-based program repair techniques, which mainly depend on the source code and test suite. If we obtain the correct patch for a defect under this setting, it indicates the potential to fix the defect automatically by decomposing and automating the manual repair process.

More concretely, our analysis classifies each defect as repairable or difficult to repair, and the classification is based on the following steps for each defect. The first author of the paper, who is a Ph.D. student with four years' experience in Java programming, performs the manual analysis.
• Under the above manual analysis settings, we try to locate a possible root cause of the defect.
• We generate a candidate patch for the defect and run all tests to validate the patch. If the patch does not pass all the tests, we restart from the first step.
• If the patch passes all tests, we further compare it with the developer's patch. If the two patches are semantically equivalent, we regard the patch as correct and the defect as repairable; otherwise we regard the patch as incorrect and the defect as difficult to repair. More concretely, we determine semantic equivalence by considering all possible system states when entering the patched method and checking whether the system state is equivalent when the method returns. By all possible system states we mean the states that can be reached through any public method with any input allowed by the method signature.
• If we cannot obtain a patch that passes all tests within 5 hours, we stop and consider the defect difficult to repair.

During the analysis, the details are recorded and summarized as strategies. Then, the other two authors further check the validity of the strategies and refine them based on the analysis records until reaching an agreement. The detailed analysis is available on the web2).

4 Methodology for manual analysis

Following the usual design of automatic repair, we view a repair process as two phases: fault localization and patch generation. The former is to identify the root cause of the failure, based on which the latter generates a patch that fixes the failure. We decompose the repair process along these two phases.

In an abstract view, both phases can be seen as locating a solution in a (possibly finite or infinite) space of solutions. In fault localization, the space is the power set of all statements, and we try to locate one statement or a few statements that are the root cause of the defect. In patch generation, the space is all possible patches, and we try to generate a patch that fixes the current defect. To understand how the defects can be repaired, we need to decompose the fault localization and patch generation processes used in our analysis. To provide useful guidance for automatic repair techniques, we assume a model with strategies and try to derive strategies from our analysis. Concretely, both the fault localization and patch generation processes can be viewed as search and rank procedures.
At a more fine-grained level, the analysis process is a series of attempts to apply different strategies to the current problem. A strategy, when applied, either adds (filters out) or increases (decreases) the rank of some solutions in the search space. A strategy is usually associated with a precondition, which must be satisfied before applying the strategy. During the analysis, we always need to simultaneously consider a large set of strategies and determine which of them can be applied. For example, a simple strategy of fault localization is to exclude all statements that are not executed during the failed test execution. This is equivalent to filtering out solutions containing these statements. This strategy can be applied only when there is an executable test that is failed by the program (this precondition is always satisfied under the setting of automatic defect repair). As another example strategy, if we observe a rare statement that breaks usual programming practice, such as if(a=1) rather than if(a==1), we can increase the ranking of this statement among all candidate statements during fault localization.

Under this view, to understand how defects could be fixed under the given test suite, we try to decompose the repair processes into a set of strategies. In total, we identified seven strategies for fault localization and seven strategies for patch generation.

A further observation on the strategies is that the distinction between fault localization and patch generation is not always clear. A strategy can contribute to both phases. For example, the aforementioned strategy on programming practice not only provides guidance on fault localization, but also gives us a solution in patch generation, i.e., changing a=1 to a==1. Therefore, if a strategy contributes to both sub-processes, we classify it based on its main contribution.

2) https://sites.google.com/site/d4jinpection.

5 Results

In this section, we first present the overall result of the analysis and compare it with existing automatic program repair techniques. Then, we introduce the strategies used in the analysis for fault localization and patch generation, respectively, along which we compare the most related techniques with each strategy and identify concrete points to improve current techniques.

Table 1  Comparison of our analysis result with existing automatic repair techniques on our dataset a)

Project   jGenProg  jKali  Nopol  ACS  HDR  ssFix  ELIXIR  JAID  CapGen  SimFix  Manual
Chart     0/4       0/2    1/1    0/0  2/–  1/3    3/1     0/2   2/0     3/0     7/3
Closure   –/–       –/–    –/–    –/–  1/–  0/1    –/–     0/1   –/–     0/0     8/1
Lang      0/0       0/0    0/0    1/0  2/–  1/1    1/0     0/0   1/0     0/1     10/0
Math      2/1       1/1    0/0    3/0  1/–  0/4    1/1     1/0   1/0     1/3     7/2
Time      0/1       0/1    0/0    0/0  0/–  0/1    1/0     0/0   0/0     1/0     9/0
Total     2/6       1/4    1/1    4/0  6/–  2/10   6/2     1/3   4/0     5/4     41/6

a) The results of the first three approaches come from [25] and the results of the others come from the corresponding research papers: ACS [4], HDR [49], ssFix [50], ELIXIR [28], JAID [27], CapGen [19], SimFix [20]. In the table, X/Y denotes that X defects are correctly repaired while Y defects are wrongly repaired. “–” denotes missing data.

5.1 RQ1: manual analysis result of defects

Among the 50 defects we analyzed, we correctly repaired 41 (82%) defects, which are regarded as repairable, while we failed to repair the other 9 (18%) defects, which are regarded as difficult to repair.
Table 1 [4, 19, 20, 25, 27, 28, 49, 50] shows the detailed data per project as well as the comparison with a set of existing program repair approaches. As we can see from the table, existing program repair approaches can only repair a very small portion of the repairable defects, indicating large room for improvement.

Finding 1. A large portion (i.e., 82%) of the defects are correctly fixed in our manual analysis, indicating that most of the defects have great potential to be fixed under the existing test suites.

Among the 9 defects regarded as difficult to repair, we generate incorrect patches for 6 defects while failing to generate valid patches that pass all the tests for 3 defects. We further investigate the 6 incorrect patches and find that in all those cases, the tests in the program do not provide enough information to reveal the full scope of the defect. Without knowing the precise specifications of the programs, we would generate incomplete patches based only on the test suite. For example, a defect from Chart-10 is related to String transformation. According to the failing test, the character “"” in the input should be replaced with its escaped form “&quot;”. We generate a patch to handle this and it passes all the tests. However, in fact there are many other characters that should be replaced, which are not covered by the existing test suite. As a result, we generate an overfitting patch for this defect.

We further investigate why we could not generate a patch for the three remaining defects in our manual analysis. The reason is similar: these defects require domain knowledge either specific to the project or specific to a particular domain with which a developer may not be familiar. Among the three defects, Math-2 is a defect about floating-point precision, where the standard patch changes an inaccurate expression into a mathematically equivalent but more accurate expression. Fixing the defect requires knowledge of accurate arithmetic. Closure-4 and Time-6 are related to the uses of methods and classes in the project, where the buggy code does not correctly interpret the semantics of the called methods, or the preconditions of the called methods are not properly satisfied. Fixing these defects requires knowledge of the project, especially the preconditions and semantics of each method. Lacking the domain knowledge, it is difficult for an average developer to locate the root cause of the three defects.

5.2 RQ2: fault localization strategies and implications

In this subsection, we present the strategies used for fault localization in the manual analysis process, along which we compare the related existing techniques with each strategy and propose implications to inform future studies. The details of the strategies are listed in Table 2. The first column lists the strategy names, the second column briefly describes how these strategies work, and the last column lists the defects to which each strategy is applied in our manual analysis.

Strategy 1. Excluding unexecuted statements. This strategy is very simple: when a statement is not executed in the failed execution, it cannot be the root cause. This strategy is implicitly applied when we try to find the root cause of a defect. Actually, this strategy can be applied to almost all defects.
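As an illustration only (not part of the original study's tooling), a minimal Java sketch of strategy 1 could look as follows, assuming the statements covered by the failing test have already been collected, e.g., by a coverage tool; the identifiers used here are hypothetical.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    class UnexecutedStatementFilter {
        // Keep only the candidate statements that the failing test actually executed;
        // statements not covered by the failing execution cannot be the root cause.
        static List<String> filter(List<String> candidates, Set<String> coveredByFailingTest) {
            List<String> remaining = new ArrayList<>();
            for (String stmt : candidates) {
                if (coveredByFailingTest.contains(stmt)) {
                    remaining.add(stmt);
                }
            }
            return remaining;
        }
    }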
Table 2  Strategies applied to locate the faulty method in our analysis a)
• Excluding unexecuted statements — exclude those statements not executed by the failing test. (Defects: all defects)
• Excluding unlikely candidates — filter all non-related candidates based on their functionalities and complexities. (Defects: L-1, 2, 4, 7, 9; M-5, 10; Ch-2; Cl-9; T-1, 4, 10)
• Stack trace analysis — locate faulty locations based on the stack trace information thrown by failing test cases. (Defects: L-1, 5, 6; M-3, 4, 8; Ch-4, 9; Cl-2; T-2, 5, 7, 8, 10)
• Locating undesirable value changes — locate those statements that change the input values to the final faulty values of failing test cases. (Defects: L-8; Cl-1, 3, 5, 7, 8, 10; T-3, 9)
• Checking programming practice — identify code that obviously violates some programming principles based on previous programming experience. (Defects: L-6, 8; Ch-1, 7, 8)
• Predicate switching — inverse condition statements to get the expected output; the inversed condition statement is the error location. (Defects: L-3; Ch-1, 9; Cl-10)
• Program understanding — understand the logic of the faulty program and the functionalities of relevant objects and methods. (Defects: L-10; M-6, 9; Ch-3; Cl-9; T-3, 9)
a) L, M, Ch, Cl and T denote the Lang, Math, Chart, Closure and Time projects, respectively.

Figure 1 (Color online) The call graph of Chart-2.

Related. This strategy is adopted in almost all fault localization approaches. Some approaches [51, 52] can further exclude statements not related to the failure even if they are executed in failed executions.

Strategy 2. Excluding unlikely candidates. Given a list of possible candidates for root causes, we could examine them one by one and exclude those that are unlikely to contain defects. Though technically this strategy can be applied at different granularities, applying it at the method level is effective in our analysis. That is, given a list of methods invoked during the failed test execution, we examine them one by one and exclude unlikely ones. We find that the following two criteria are effective.
• When a method is a library function or the test itself, it is unlikely to contain the defect.
• In Java, because of the lack of default parameters or the use of polymorphism, it is often the case that a method is just a wrapper of another, whose purpose is only to pass a default argument or to adapt to an interface. When a method is such a simple wrapper method, it is unlikely to contain the defect.

Note that technically the methods excluded by this strategy still have the possibility of containing defects, but their probability is significantly smaller than that of others. This is a very effective strategy in our analysis, as we could locate the faulty method using only this strategy and strategy 1. For example, Figure 1 shows the call graph of defect Chart-2. After excluding library methods (e.g., Double.isNaN) and simple wrapper methods (e.g., iterateDomainBounds(XYDataset)), the only remaining method is iterateDomainBounds(XYDataset,boolean), which turns out to be the faulty method. Apparently, this process does not need to know the specifications of the program and does not even need to understand the full functionality of the relevant methods.
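A minimal, hypothetical sketch of the method-level filtering in strategy 2 is given below; the two helper predicates stand in for the library/wrapper checks described above and are assumptions of this illustration rather than part of the paper's analysis.

    import java.util.ArrayList;
    import java.util.List;

    class UnlikelyCandidateFilter {
        // From the methods invoked by the failing test, drop library methods, test methods,
        // and simple wrapper methods, which are unlikely to contain the defect.
        static List<String> excludeUnlikely(List<String> invokedMethods) {
            List<String> suspects = new ArrayList<>();
            for (String m : invokedMethods) {
                if (isLibraryMethod(m) || isTestMethod(m) || isSimpleWrapper(m)) {
                    continue;
                }
                suspects.add(m);
            }
            return suspects;
        }

        // Hypothetical helpers: a real implementation would inspect packages and method bodies.
        static boolean isLibraryMethod(String m) { return m.startsWith("java.") || m.startsWith("javax."); }
        static boolean isTestMethod(String m)    { return m.endsWith("Test") || m.contains(".test"); }
        static boolean isSimpleWrapper(String m) { return false; /* placeholder: check for a single delegating call */ }
    }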
Related. The most related approach is fault prediction [53, 54], which predicts the probabilities of different software components containing defects based on features of the software components.

Improve. Incorporate richer dynamic information of test failures. Current fault prediction techniques judge whether a given method is correct mostly based on the characteristics of the program itself and do not consider the features of failing test cases. Taking Chart-2 as an example, only a few methods are involved in the execution of the failing test case, which greatly helps to eliminate some candidate locations.

Strategy 3. Stack trace analysis. When an uncaught exception is triggered, the program crashes and the stack trace information is printed. A stack trace lists a sequence of locations in the program where a method is called but has not returned before the point of the crash. Usually the root cause of the fault is close to the locations listed in the stack trace. That is, the ranks of statements near the locations in the stack trace are increased. In our manual analysis, 15 defects are located with the contribution of this strategy. By further combining it with other strategies, we can often locate the root cause of the defect. For example, Figure 2 is a stack trace screenshot of Lang-1, which throws a NumberFormatException.

Figure 2 (Color online) Stack trace of defect Lang-1. Lines 469 and 472 are the real faulty conditions.

The stack trace lists seven candidate faulty locations. Then we can filter the locations based on strategy 2. Among them, the first five locations (APIs and wrappers) and the last location (the test method) are all filtered out. The only possible location is the sixth.

Related. This strategy has been adopted by many fault localization approaches. For example, Wu et al. [55] proposed a fault localization approach mainly based on stack trace information. Wong et al. [56] proposed to combine stack trace analysis with bug reports to enhance the accuracy of fault localization, while Zhong and Mei [57] proposed MiMo, which mines exception-related fix patterns from open-source projects to repair new defects and improves the performance of program repair.

Strategy 4. Locating undesirable value changes. A failed test execution produces an output that is different from the desired output. However, sometimes the desired output has already been constructed during the test execution, but the execution of some statements, S, turns it into an undesirable one. In such cases, S or the statements that S control-depends on are likely to be faulty. Note that the latter should be included because they are the reason why S is executed. In our analysis, those cases frequently occur when testing the optimization component in the Closure project. In a typical such defect, the optimizer changes the input program string into another one that is not semantically equivalent to the original program. In such cases, the statements that make such an undesirable change are ranked high. For instance, the method call removeChild in Closure-1 wrongly deleted the argument a in window.f=function(a){}, so either this method or the statements leading to the call of this method might be faulty.

Related. Within our knowledge, this strategy is not directly adopted by existing fault localization approaches. A loosely related approach, delta debugging proposed by Cleve and Zeller [58], locates the transitions that cause the fault.
However, delta debugging requires (1) a mechanism to determine the test result and (2) a comparable passed test, which do not apply to the bugs solved by this strategy in our analysis. Another related approach mines program invariants to assist fault localization [59]. This is different from ours, as it depends on multiple successful executions while we do not.

Improve. Correctly identify undesirable value changes. To overcome the problem, we need to introduce a new technique that can identify undesirable changes in a test execution. A possible way is to define a partial order between states to measure how close a state is to the desirable state, where a standard test execution should only bring the state closer to the desirable state rather than move it further away.

Strategy 5. Checking programming practice. Though in principle language constructs can be combined in any way to form a program, in practice people only use a small subset of combinations. Basically, a programming practice defines a constraint on the combination, and a piece of code violating the constraint is likely to be faulty. A typical practice, as mentioned before, is that an assignment is unlikely to be used in an “if” condition, and thus a statement like if(a=0) is likely to be faulty3). As another example, the following piece of code comes from Lang-6. In this piece of code, the for loop iteratively accesses the elements in the sequence with an unbounded variable pos. This piece of code violates common programming practice and is likely to be faulty. We find that a violation of programming practice is usually an indication of a fault and is useful in fault localization.

    for (int pt = 0; pt < consumed; pt++) {
        pos += Character.charCount(Character.codePointAt(input, pos));
    }

Related. Static bug detection tools, such as FindBugs [60], check for bad programming practices in the code to detect potential bugs. However, the templates defined in FindBugs mostly do not depend on runtime information. For example, it is hard to determine the faulty code of Lang-6 based on the static features of the program alone; the exception caused by variable pos also helps greatly.

Improve. Incorporate dynamic information of test failures. A typical static bug detection approach simply considers static information, which is not sufficient. For instance, as explained above, without the IndexOutOfBoundsException caused by variable pos, it is hard to determine the faulty code. On the contrary, because pt is restricted by the length of the input string, if we replace variable pos with pt, the exception can be avoided. Therefore, correctly checking such programming practice not only needs to know the common practice patterns but also needs to combine the failure information.

Strategy 6. Predicate switching. This strategy is very similar to the automatic fault localization technique with the same name [15]. In some cases, if we inverse the result of an “if” condition and force the execution to switch to the other branch, the failed test could pass, in which case we may consider the “if” condition to be faulty and increase its rank among candidate locations. For example, the failed test in Lang-3 expects a Double number when the input is 3.40282354e+38, whereas a Float number is returned. Assuming the value of the condition at line 3 to be false, the desired Double number will possibly be returned at line 7. Therefore, we can rank the first if condition higher.
    1   try {
    2       Float f = createFloat(str);
    3       if (...) return f;
    4   } catch (NumberFormatException e) {}
    5   try {
    6       Double d = createDouble(str);
    7       if (...) return d;
    8   } catch (NumberFormatException e) {}

3) Please note that this code convention is useful for C but not for Java, as if(a=0) will cause a type error in Java. We cite this example just for illustration, and this is not a convention we discovered.

Related. As discussed before, this strategy is very similar to the predicate switching approach proposed by Zhang et al. [15]. In fact, automatic predicate switching is even more powerful than what is used in our analysis because of the computer's superb computation ability, and it has been employed by many automatic program repair techniques [3–5, 17].

Strategy 7. Program understanding. The strategies we have seen so far can be applied without a full understanding or specification of the program, and many faults can be located by using only these strategies. However, not all faults can be found relying only on them, and a certain amount of program understanding is required. Program understanding is a complex process, and here we try to describe it in terms of a general logical reasoning process. Given a faulty program, we try to infer likely constraints on program behavior from different sources and check the consistency between them. If constraints inferred from different sources are inconsistent, the related source code is likely to be faulty. Otherwise, the related source code is unlikely to be faulty. Typical sources include the following.
• Implementation code. By interpreting the semantics of the source code, we can infer constraints on how the source code transforms one state into another state.
• Test executions. Basically, each test gives a constraint on the desired output for each test input.
• Identifier names. We often try to infer likely constraints from the names of identifiers. For example, a method named “remove” should reduce the number of items in some container. A variable named “max” should contain the maximum element of some container.
• Comments. Sometimes the comments describe the intended behavior of a piece of code, and constraints can be inferred from the comments.

To understand how this strategy works, let us consider the defect Closure-1, which was introduced in strategy 4.
Using strategy 4, we can isolate the defect to the method removeChild and its callers, and we know the removal is undesired. However, from the name of removeChild, we can infer a constraint that this method should remove an item. Because this semantics is consistent with its implementation code, we know the removal within this method is desired. Therefore, the fault should be in the methods calling removeChild. In other words, removeChild should not be called.

5.3 RQ3: patch generation strategies and implications

In this subsection, we present the strategies used for patch generation in the manual analysis. Table 3 shows the seven strategies we summarized for patch generation. Similarly to Table 2, the first column is the identification of each strategy, the second column briefly describes the strategy, and the last column lists the defects to which the strategy is applied.

Table 3  Strategies used to generate patches in our analysis
• Add NullPointer checker — add a null pointer check before using the object to avoid a NullPointerException. (Defects: M-4; Ch-4; Cl-2)
• Return expected output — return the expected value according to the assertions. (Defects: L-2, 7, 9; M-3, 5, 10; T-1, 3)
• Replace an identifier with a similar one — replace an identifier with another one that has a similar name and the same type in the scope. (Defects: L-6, 8; Ch-7, 8)
• Compare test executions — generate patches by comparing the failed tests with passed tests that have similar test inputs. (Defects: L-2, 5)
• Interpret comments — generate patches by directly interpreting comments written in natural language. (Defects: M-9; Cl-1, 5, 7, 9; T-8, 9)
• Imitate similar code element — imitate the code that is near the error location and has a similar structure. (Defects: L-4, 5; M-6, 8; Ch-1, 2, 7, 9; Cl-3, 8, 10; T-5, 7, 10)
• Fix by program understanding — generate patches by understanding the functionality of the program. (Defects: L-1, 3, 9, 10; M-6, 9; Ch-2, 3; Cl-3, 8; T-1, 2, 4, 10)

Strategy 8. Add NullPointer checker. This strategy is usually used in our manual analysis when a test fails because of a NullPointerException. A typical way to fix such a defect is to surround the statement and all following dependent statements with a guard condition x != null, where x is the variable causing the exception. The following code is the patch for Chart-4.

    1   r = getRendererForDataset(d);
    2   if (isDomainAxis) {
    3       if (r != null) {...} ... }
    4 + if (r != null) {
    5       Collection c = r.getAnnotations();
    6       Iterator i = c.iterator(); ...
    7 + }

In this patch, an exception is thrown at line 5. The patch adds an “if” statement to surround line 5 and all following statements that depend on it. Though a null pointer check is often added to avoid a NullPointerException, the strategy alone usually cannot decide a patch. In this case, we could also change the method getRendererForDataset so as not to return null. We come to this patch by further considering two facts: (1) applying this patch makes all tests pass; (2) there is also a check for variable r at line 3, indicating that returning null is a valid behavior of getRendererForDataset. We use strategy 14 to summarize these considerations, which will be explained later.

Related. This strategy is similar to a template used in the repair approaches PAR [44] and ELIXIR [28], which apply a set of templates to the located statement to generate patches.

Improve. Correctly identify the location of the null pointer check. As discussed before, there is often more than one place to add the null pointer check, and identifying the correct location is the key to avoiding incorrect patches. In our manual analysis process, different strategies are combined to decide the correct location. Similarly, ELIXIR depends on a machine-learned model to determine which template to use, while PAR simply tries different templates one by one. However, neither of them considers the runtime information of failing test cases; in the example of Chart-4, the exception thrown by the failed test case almost decides the desired template.
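The improvement suggested above can be illustrated with a small, hypothetical Java sketch: the stack trace of the NullPointerException thrown by the failing test already points to the application-level statement that should be guarded. The application package prefix is an assumption of this example, not part of the paper's approach.

    class NullCheckLocator {
        // Return the first stack frame inside the application code, i.e., the statement
        // that dereferenced null and is the natural place for an "x != null" guard.
        static StackTraceElement locateGuardPoint(NullPointerException e, String appPackagePrefix) {
            for (StackTraceElement frame : e.getStackTrace()) {
                if (frame.getClassName().startsWith(appPackagePrefix)) {
                    return frame;
                }
            }
            return null;   // no application frame found; fall back to other strategies
        }
    }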
Strategy 9. Return expected output. When programming, we often encounter boundary cases that should be considered separately from the main programming logic, and such boundary cases are easily neglected by developers. A boundary case is typically handled by a statement of the form if (c) return v;, where v is the expected result and c is a condition capturing the boundary case. As a result, if the failed test execution is a boundary case, we may consider patches of the above form. For example, the following code snippet is a failing test from Math-3.

  void testLinearCombination() {
      double[] a = { 1.23456789 };
      double[] b = { 98765432.1 };
      Assert.assertEquals(a[0]*b[0], MathArrays.linearCombination(a, b), 0d);
  }

If we can identify that an array of length one is a boundary case, we can come to the fix of inserting the statement if (len==1) { return a[0]*b[0]; } into the method linearCombination, where the variable len represents the length of the input arrays and a[0]*b[0] is just the expected result. However, this strategy heavily depends on the developer’s experience to decide boundary cases; otherwise the generated patch may overfit to the current test suite.

Related. This is similar to a template in ACS [4].

Improve. Correctly identify boundary cases. ACS can only tackle simple boundary cases, such as comparisons with constants. Because this strategy is usually used together with boundary identification, the repair fails when complex boundaries cannot be identified correctly, such as the boundary case introduced in strategy 11 (Lang-2), which is hard for ACS. As a result, to better utilize this strategy, a powerful boundary identification mechanism is needed.

Strategy 10. Replace an identifier with a similar one. When the names of two identifiers are similar, developers may confuse the two identifiers. As a result, a possible patch is to replace an identifier with another one whose name is similar. Of course, this strategy alone can hardly decide a patch, but it can be used together with other strategies for patch ranking. For example, in defect Lang-6, introduced in strategy 5, we can observe that two variables, pos and pt, have very similar names. In fact, if we replace the last occurrence of pos with pt, the piece of code no longer violates the programming practice. Furthermore, rerunning all tests reveals that this patch passes all the tests. Putting it all together, the correct patch is preferably selected.

Related. Some existing approaches consider replacing variables [3] or methods [3, 6]. Most recent automatic repair techniques further identify the similarity between variables [19, 20, 50] with respect to variable names, types, and the like, which improves the state of the art.

Strategy 11. Compare test executions. It is common that more than one test case exists for a specific method and only one of them fails. By comparing the passed tests and the failed tests, we can often obtain useful information for patch generation. For defect Lang-2, the test inputs of all passed tests do not contain the character “#”, whereas both failed test cases contain it, which suggests that containing “#” is probably a boundary case. Therefore, together with its expected output, IllegalArgumentException, the desired patch can be generated.

Related. Related approaches mine invariants from executions as patch ingredients [18, 27, 34].
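To make the comparison idea of this strategy concrete, here is a small, hypothetical Java sketch that contrasts the string inputs of passing and failing tests and proposes a single distinguishing character as a candidate boundary condition. The class and its data are invented for illustration and are not drawn from any existing repair tool.

  import java.util.List;

  // Toy comparison of passing vs. failing test inputs (strategy 11): report a
  // character that occurs in every failing input but in no passing input, as a
  // candidate boundary condition such as input.contains("#") for Lang-2.
  public final class TestInputComparator {

      static Character proposeBoundaryCharacter(List<String> passingInputs,
                                                List<String> failingInputs) {
          for (char c = 32; c < 127; c++) {          // printable ASCII candidates
              final char candidate = c;
              boolean inAllFailing = failingInputs.stream()
                      .allMatch(s -> s.indexOf(candidate) >= 0);
              boolean inNoPassing = passingInputs.stream()
                      .noneMatch(s -> s.indexOf(candidate) >= 0);
              if (inAllFailing && inNoPassing) {
                  return candidate;                   // distinguishing character found
              }
          }
          return null;                                // no single-character boundary
      }

      public static void main(String[] args) {
          Character c = proposeBoundaryCharacter(
                  List.of("0x1F", "0xFF", "12345"),   // inputs of passing tests
                  List.of("#FF0000", "#CCCCCC"));     // inputs of failing tests
          System.out.println("Candidate boundary: input contains '" + c + "'");
      }
  }

On the invented Lang-2-like inputs in main, the proposed character is ‘#’, matching the boundary case discussed above.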
Improve. Generalize boundary cases from executions. In our analysis, we usually compare one or several successful executions with the failing one to identify the boundary cases related to the current failure, so the identified cases can usually be used directly to generate patches. In contrast, existing approaches depend on a large set of successful executions to mine a set of invariants as fix ingredients, most of which are unrelated to the failure. Furthermore, mining complex boundary cases, such as the example introduced in this strategy, is also hard for existing techniques. Therefore, learning boundary cases from a small set of examples can potentially improve the precision of existing techniques.

Strategy 12. Interpret comments. Program source code may contain comments explaining properties of the program, such as functionality, preconditions, and the like. In particular, Java programs often come with Javadoc comments explaining the method, its parameters, its return value, and the exceptions that might be thrown. These comments often provide important information to guide patch generation. For example, the following method is used to create a DateTimeZone object based on the given hours and minutes (Time-9). The failed test expects an IllegalArgumentException to be thrown for the input of 24 and 0. Again, this is a boundary case where strategy 9 can be applied. However, we still do not know what condition should be used to capture this boundary case. By reading the Javadoc, we learn that hours should be in the range of -23 to +23, and the following patch is straightforward.

    // the offset in hours from UTC, from -23 to +23
    public DateTimeZone forOffsetHM(int hours, int minutes) throws IllegalArgumentException {
  +     if (hours < -23 || hours > 23) throw new IllegalArgumentException();
        ...
    }

Related. Some approaches have adopted natural language processing techniques to analyze comments and other documents written in natural language. For example, ACS [4] analyzes the Javadoc to exclude unlikely variables in an “if” condition, and R2Fix [21] generates patches by analyzing bug reports written in natural language.

Improve. Incorporate dynamic information from the test failure. The depth of automatic comment analysis still cannot match that of our manual analysis. For example, the following patch is generated to fix Closure-9 based on the comment.

    // The DOS command shell will normalize "/" to "\", so we have to wrestle it back.
  + filename = filename.replace("\\", "/");

For current automatic techniques, it is impossible to translate this comment into the corresponding source code. Even if they could parse the natural language, they might still be confused about which character should be replaced. Therefore, we need to associate the comment with runtime information: in this example, only the failed test cases contain the character “\”, which is the very character to be replaced. As a result, more robust natural language understanding is imperative, and incorporating dynamic information into the natural language understanding is needed as well.

Strategy 13. Imitate similar code element. In general, programs with similar functions often have similar structures. When similar code pieces exist near the buggy code, we can generate a patch by imitating the similar code. This strategy is often useful when we find that the program fails to handle some cases but do not know how to handle them without a full specification.
However, if we can find code pieces handling similar cases, we can imitate those code pieces. For example, the following patch comes from Chart-9. According to the failing test, when startIndex is greater than endIndex, no exception should be thrown, which leads us to generate the condition statement if (startIndex > endIndex). However, we still do not know the “if” body. By reading the code nearby, we find that the “if” at line 1 is used to handle a similar case, so we can generate the desired patch by imitating the first “if” condition.

  1   if (endIndex < 0) emptyRange = true;
  2 + if (startIndex > endIndex)
  3 +     emptyRange = true;
  4   if (emptyRange) { ... }

Related. A related strategy adopted by several automatic program repair approaches is to mine fix ingredients from existing source code for patch generation [2, 4, 19, 20, 50, 61].

Improve. More flexible code adaptation. Related approaches either do not perform any code adaptation [2, 4, 61] or perform only elementary variable replacement [19, 20, 50], which is not sufficient to tackle complicated cases. For example, to fix Chart-2, a method call intervalXYData.getXValue(series, item) should be inserted, which does not exist in the similar code; instead, another method call, icd.getValue(row, column), is referred to when generating the patch in our manual analysis. We can see that, besides variables, we also need to transform method calls, and sometimes the cases can be even more complicated. Therefore, to improve current techniques, more powerful code adaptation should be developed.

Strategy 14. Fix by program understanding. Similar to fault localization, this strategy is introduced to capture the case where we generate the patch by understanding the functionality of the program. The process is similar to the fault localization case, but the potential patches become another source for generating constraints. If we find that the constraints generated from a patch are consistent with all other constraints, we rank the corresponding patch higher. As with fault localization, we still lack a full understanding of the program understanding process, and future work is needed to further study it.

5.4 RQ4: inspiration from the analysis

Based on the previous analysis and comparison, we can see that although many of the strategies have already been considered in existing techniques, some of them (e.g., “locating undesirable value changes”) have not been considered by any approach, and some (e.g., “imitate similar code element”) are not applied in the same way or to the same depth as in our analysis, especially regarding the combination of static and dynamic information. This result indicates that existing techniques still have room for improvement.

Finding 2. While existing techniques have already explored strategies similar to some of the strategies we identified, they have the potential to be further improved based on the identified strategies.

By further observation, we can see that many strategies are simple heuristic rules that do not require deep semantic analysis or a full understanding of the program, indicating a high possibility of automating them. Many strategies perform only mechanical operations and can be easily automated. For example, “stack trace analysis” and “locating undesirable value changes” perform only mechanical operations, as introduced previously.
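As a hedged illustration of how mechanical the former can be, the following Java sketch keeps only the frames of a failing test’s exception that belong to the project under analysis, ordered from the throw site outwards. The class and the package-prefix parameter are invented for this example and are not taken from the study or any existing tool.

  import java.util.ArrayList;
  import java.util.List;

  // Sketch of mechanical "stack trace analysis": keep only the frames of the
  // failing test's exception that belong to the project under analysis.
  public final class StackTraceLocalizer {

      static List<StackTraceElement> suspiciousFrames(Throwable failure,
                                                      String projectPackagePrefix) {
          List<StackTraceElement> ranked = new ArrayList<>();
          for (StackTraceElement frame : failure.getStackTrace()) {
              if (frame.getClassName().startsWith(projectPackagePrefix)) {
                  ranked.add(frame);   // frames nearer the throw site rank higher
              }
          }
          return ranked;
      }

      public static void main(String[] args) {
          try {
              Integer.parseInt("not a number");   // provoke a failure for the demo
          } catch (NumberFormatException e) {
              // A real run would use the analyzed project's own package prefix.
              suspiciousFrames(e, "StackTraceLocalizer")
                      .forEach(f -> System.out.println("Suspicious: " + f));
          }
      }
  }

A real tool would, of course, combine such frames with other evidence, e.g., coverage-based suspiciousness scores.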
Some strategies require human experience, but such experience has a high potential to be summarized as heuristic rules. For example, “excluding unlikely candidates” relies on a few heuristic rules to determine whether a method may be faulty. Some simple heuristics used in our analysis, such as excluding library functions, have been listed in strategy 2 and can easily be extended by developers. In fact, only the last strategy in each category, strategies 7 and 14, requires full program understanding.

Additionally, as our results show, no single strategy is effective on a large portion of the defects, and most of the defects require multiple strategies to locate and to repair. For instance, to correctly locate the faulty code of Lang-1, we use not only the “stack trace analysis” strategy but also the “excluding unlikely candidates” strategy. Furthermore, both of the defects explained in strategies 11 and 12 also apply the “return expected output” strategy in addition to the strategy explained for each. This observation calls for studies on combining different fault localization and patch generation approaches.

Finding 3. Many strategies, such as “excluding unlikely candidates”, “locating undesirable value changes”, “add NullPointer checker”, “compare test executions”, and “imitate similar code element”, are simple heuristic rules that require neither deep analysis nor a full understanding of the defects, indicating the possibility of automating these strategies to improve current automatic repair techniques.

Finding 4. No strategy can handle all defects. Combinations of strategies are needed to repair a large portion of defects.

6 Threats to validity

First, we discuss the generalizability of our results. Because the case study involves only 50 defects and 5 projects, they may not be representative of the wide range of defects in different types of projects. As a result, our results on the effectiveness of the strategies may not generalize to a wider range of projects. However, Defects4J [23] is a widely used defect benchmark, and so far no generalizability issue has been reported for it. Furthermore, we sampled the defects evenly among the 5 projects, and the effectiveness of the strategies has been evaluated on all of them. These facts give us a reasonable degree of confidence in the generalizability of our results.

Second, even though we had no prior knowledge about the defects to be analyzed, some basic insights about the projects were implicitly obtained as the analysis went on, which may cause training effects in the subsequent analysis. As a result, when summarizing the defects requiring the two program understanding strategies, we may accidentally miss some defects because the program understanding happened unintentionally. To avoid this problem, we have carefully reviewed the analysis records to ensure that the rest of the defects can be fixed without program understanding. Please also note that the validity of the main findings, including the strategies and the improvements suggested for existing techniques, is not affected by this threat.

Third, as also mentioned in the introduction, our results should not be interpreted as an upper bound on the performance of automatic program repair techniques, because such techniques may be superior to human developers in some aspects as well, e.g., by utilizing their computation power.
In other words, our results show what automatic techniques can potentially do, but not what they cannot do. Fourth, our study should not be interpreted as an understanding of how human debugs. Our manual analysis settings is different from general human debugging and a single analysis session is not enough to answer such a question. In Section 2, we have summarized some related work on that problem. Besides, though a bug can get repaired with different ways and we only depend on the standard patches in Defects4J to determine patches correctness, the final result are not affected because those correct patches are proved correct while those overfitting patches are obvious and definitely incorrect ones after examination. 7 Conclusion and future work In this paper, we analyzed 50 real world defects to identify to what extent they can be fixed under existing test suites, based on which we summarized the fault localization and patch generation strategies used in our analysis, and discussed the potential of them to be automated to improve existing techniques. Our findings suggest that most of these defects can be fixed in our analysis even though without complete specifications and there is potentially a lot of room for current techniques to improve, and the strategies we identified could potentially be automated and combined to improve the performance of automatic program repair, which calls future work on the automation of those strategies and their combinations. Acknowledgements This work was supported by National Key Research and Development Program of China (Grant No. 2017YFB1001803) and National Natural Science Foundation of China (Grant No. 61672045). References 1 Mei H, Zhang L. Can big data bring a breakthrough for software automation? Sci China Inf Sci, 2018, 61: 056101 — 74 — Jiang J J, et al. Sci China Inf Sci October 2019 Vol. 62 200102:15 2 Le Goues C, Nguyen T V, Forrest S, et al. Genprog: a generic method for automatic software repair. Automation IEEE Trans Software Softw Eng, 2012, 38: 54–72 in the Big Data Era: 3 Long F, Rinard M. Staged program repair with condition synthesis. In: Proceedings of the 2015 10th Joint Meeting Challenges and Opportunities on Foundations of Software Engineering. New York: ACM, 2015. 166–178 4 Xiong Y F, Wang J, Yan R F, et al. Precise condition synthesis for program repair. In: Proceedings of the 39th International Conference on Software Engineering. New York: IEEE, 2017. 416–426 5 Mechtaev S, Yi J, Roychoudhury A. Angelix: scalable multiline program patch synthesis via symbolic analysis. In: Proceedings of the 38th International Conference on Software Engineering. New York: ACM, 2016. 691–701 6 Kim D, Nam J, Song J, et al. Automatic patch generation learned from human-written patches. In: Proceedings of the 2013 International Conference on Software Engineering. New York: IEEE, 2013. 802–811 7 Thien Nguyen H D, Qi D, Roychoudhury A, et al. Semfix: program repair via semantic analysis. In: Proceedings of the 2013 International Conference on Software Engineering. New York: IEEE, 2013. 772–781 8 Mechtaev S, Yi J, Roychoudhury A. Directfix: looking for simple program repairs. In: Proceedings of the 37th International Conference on Software Engineering. New York: IEEE, 2015. 448–458 9 Gao Q, Zhang H S, Wang J, et al. Fixing recurring crash bugs via analyzing q&a sites (t). In: Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering, 2015. 307–318 10 Long F, Amidon P, Rinard M. 
Automatic inference of code transforms for patch generation. In: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. New York: ACM, 2017. 727–739 11 Rolim R, Soares G, Dantoni L, et al. Learning syntactic program transformations from examples. In: Proceedings of the 39th International Conference on Software Engineering. New York: IEEE, 2017. 404–415 12 Long F, Rinard M. Automatic patch generation by learning correct code. In: Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. New York: ACM, 2016. 298–312 13 Abreu R, Zoeteweij P, van Gemund A J. On the accuracy of spectrum-based fault localization. In: Proceedings of the Testing: Academic and Industrial Conference Practice and Research Techniques, 2017. 89–98 14 Abreu R, Zoeteweij P, van Gemund A J. Spectrum-based multiple fault localization. In: Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering, 2009. 88–99 15 Zhang X Y, Gupta N, Gupta R. Locating faults through automated predicate switching. In: Proceedings of the 28th International Conference on Software Engineering. New York: ACM, 2006. 272–281 16 Chandra S, Torlak E, Barman S, et al. Angelic debugging. In: Proceedings of the 33rd International Conference on Software Engineering. New York: ACM, 2011. 121–130 17 Marcote S L, Durieux T, Le Berre D. Nopol: automatic repair of conditional statement bugs in java programs. IEEE Trans Softw Eng, 2016, 43: 34–55 18 Perkins J H, Kim S, Larsen S, et al. Automatically patching errors in deployed software. In: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles. New York: ACM, 2009. 87–102 19 Wen M, Chen J J, Wu R X, et al. Context-aware patch generation for better automated program repair. In: Proceedings of the 40th International Conference on Software Engineering. New York: ACM, 2018. 1–11 20 Jiang J J, Xiong Y F, Zhang H Y, et al. Shaping program repair space with existing patches and similar code. In: Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis. New York: ACM, 2018. 298–309 21 Liu C, Yang J Q, Tan L, et al. R2fix: automatically generating bug fixes from bug reports. In: Proceedings of the 2013 IEEE 6th International Conference on Software Testing, Verification and Validation, 2013. 282–291 22 Le Goues C, Dewey-Vogt M, Forrest S, et al. A systematic study of automated program repair: fixing 55 out of 105 bugs for $8 each. In: Proceedings of the 34th International Conference on Software Engineering. New York: IEEE, 2012. 3–13 23 Just R, Jalali D, Ernst M D. Defects4j: a database of existing faults to enable controlled testing studies for java programs. In: Proceedings of International Symposium on Software Testing and Analysis. New York: ACM, 2014. 437–440 24 Qi Z, Long F, Achour S, et al. An analysis of patch plausibility and correctness for generate-and-validate patch generation systems. In: Proceedings of the 2015 International Symposium on Software Testing and Analysis. New York: ACM, 2015. 24–36 25 Martinez M, Durieux T, Sommerard R, et al. Automatic repair of real bugs in java: a large-scale experiment on the Defects4J dataset. Empir Softw Eng, 2017, 22: 1936–1964 26 Xiong Y F, Liu X Y, Zeng M H, et al. Identifying patch correctness in test-based program repair. In: Proceedings of the 40th International Conference on Software Engineering. New York: ACM, 2018. 789–799 27 Chen L S, Pei Y, Furia C A. 
Contract-based program repair without the contracts. In: Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. New York: IEEE, 2017. 637–647 28 Saha R K, Lyu Y J, Yoshida H, et al. Elixir: effective object oriented program repair. In: Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. New York: IEEE, 2017. 648–659 29 Smith E K, Barr E T, Le Goues C, et al. Is the cure worse than the disease? overfitting in automated program repair. In: Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. New York: ACM, 2015. 532–543 30 Long F, Rinard M. An analysis of the search spaces for generate and validate patch generation systems. In: Proceedings of the 38th International Conference on Software Engineering. New York: ACM, 2016. 702–713 31 Zhong H, Su Z D. An empirical study on real bug fixes. In: Proceedings of the 37th International Conference on — 75 — Jiang J J, et al. 2018 Sci China Inf Sci October 2019 Vol. 62 200102:16 Software Engineering. New York: IEEE, 2015. 913–923 32 Tan S H, Yoshida H, Prasad M R, et al. Anti-patterns in search-based program repair. In: Proceedings of the 2016 24th Meeting ACM SIGSOFT International Symposium on Foundations of Software Engineering. New York: ACM, 2016. Proceedings 727–738 33 Dantoni L, Samanta R, Singh R. Qlose: program repair with quantiative objectives. In: Proceedings of International Conference on Computer Aided Verification. Berlin: Springer, 2016. 383–401 34 Wei Y, Pei Y, Furia C A, et al. Automated fixing of programs with contracts. In: Proceedings of the 19th International Symposium on Software Testing and Analysis. New York: ACM, 2010. 61–72 35 Gao Q, Xiong Y F, Mi Y Q, et al. Safe memory-leak fixing for c programs. In: Proceedings of IEEE/ACM 37th IEEE International Conference on Software Engineering, 2015. 459–470 36 Cai Y, Cao L W. Fixing deadlocks via lock pre-acquisitions. In: Proceedings of the 38th International Conference on Software Engineering. New York: ACM, 2016. 1109–1120 37 Hassan F, Wang X Y. Hirebuild: an automatic approach to history-driven repair of build scripts. In: Proceedings of the 40th International Conference on Software Engineering. New York: ACM, 2018. 1078–1089 38 Martinez M, Monperrus M. Mining software repair models for reasoning on the search space of automated program fixing. Empir Softw Eng, 2015, 20: 176–205 39 Soto M, Thung F, Wong C P, et al. A deeper look into bug fixes: patterns, replacements, deletions, and additions. In: Proceedings of the 13th International Workshop on Mining Software Repositories, 2016. 512–515 40 Yang J Q, Zhikhartsev A, Liu Y F, et al. Better test cases for better automated program repair. In: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. New York: ACM, 2017. 831–841 41 Baudry B, Fleurey F, Le Traon Y. Improving test suites for efficient fault localization. In: Proceedings of the 28th International Conference on Software Engineering. New York: ACM, 2006. 82–91 42 Artzi S, Dolby J, Tip F, et al. Directed test generation for effective fault localization. In: Proceedings of the 19th International Symposium on Software Testing and Analysis. New York: ACM, 2010. 49–60 43 Yang D H, Qi Y H, Mao X G. Evaluating the strategies of statement selection in automated program repair. In: Proceedings of International Conference on Software Analysis, Testing, and Evolution. Berlin: Springer, 2018. 33–48 44 Tao Y D, Kim J, Kim S, et al. 
Automatically generated patches as debugging aids: a human study. In: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. New York: ACM, 2014. 64–74 45 Lawrance J, Bogart C, Burnett M, et al. How programmers debug, revisited: an information foraging theory perspective. IEEE Trans Softw Eng, 2013, 39: 197–215 46 LaToza T D, Myers B A. Hard-to-answer questions about code. In: Proceedings of Evaluation and Usability of Programming Languages and Tools. New York: ACM, 2010. 1–6 47 Murphy-Hill E, Zimmermann T, Bird C, et al. The design space of bug fixes and how developers navigate it. IEEE Trans Softw Eng, 2015, 41: 65–81 48 Qi Y H, Mao X G, Lei Y, et al. The strength of random search on automated program repair. In: Proceedings of the 36th International Conference on Software Engineering. New York: ACM, 2014. 254–265 49 Le X B D, Lo D, Le Goues C. History driven program repair. In: Proceedings of IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering. New York: IEEE, 2016. 213–224 50 Xin Q, Reiss S P. Leveraging syntax-related code for automated program repair. In: Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. New York: IEEE, 2017. 660–670 51 Agrawal H, Horgan J R. Dynamic program slicing. In: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. New York: ACM, 1990. 246–256 52 Zhang X Y, Gupta N, Gupta R. Pruning dynamic slices with confidence. In: Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation. New York: ACM, 2006. 169–180 53 Rathore S S, Kumar S. Predicting number of faults in software system using genetic programming. In: Proceedings of International Conference on Soft Computing and Software Engineering, 2015. 62: 303–311 54 Tahir A, MacDonell S G. A systematic mapping study on dynamic metrics and software quality. In: Proceedings of the 2012 IEEE International Conference on Software Maintenance, 2012. 326–335 55 Wu R X, Zhang H Y, Cheung S C, et al. Crashlocator: locating crashing faults based on crash stacks. In: Proceedings of the 2014 International Symposium on Software Testing and Analysis. New York: ACM, 2014. 204–214 56 Wong C P, Xiong Y F, Zhang H Y, et al. Boosting bug-report-oriented fault localization with segmentation and stacktrace analysis. In: Proceedings of the 2014 IEEE International Conference on Software Maintenance and Evolution, 2014. 181–190 57 Zhong H, Mei H. Mining repair model for exception-related bug. J Syst Softw, 2018, 141: 16–31 58 Cleve H, Zeller A. Locating causes of program failures. In: Proceedings of the 27th International Conference on Software Engineering. New York: ACM, 2005. 342–351 59 Le T B, Lo D, Goues C L, et al. A learning-to-rank based fault localization approach using likely invariants. In: Proceedings of the 25th International Symposium on Software Testing and Analysis. New York: ACM, 2016. 177–188 60 Ayewah N, Hovemeyer D, Morgenthaler J D, et al. Using static analysis to find bugs. IEEE Softw, 2008, 25: 22–29 61 Weimer W, Fry Z P, Forrest S. Leveraging program equivalence for adaptive program repair: models and first results. In: Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering, 2013. 356–366 — 76 — SCIENCE CHINA Information Sciences . PERSPECTIVE . Special Focus on Software Automation Software Automation in the Big Data Era: Challenges and Opportunities October 2019, Vol. 
62 200103:1–200103:3 https://doi.org/10.1007/S11432-019-9947-6 Automated program repair: a step towards software automation Abhik ROYCHOUDHURY1* & Yingfei XIONG2,3* 1 School of Computing, National University of Singapore, Singapore 117417, Singapore; 2 Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, Beijing 100871, China; 3 Institute of Software, Department of Computer Science, Peking University, Beijing 100871, China Received 26 February 2019/Revised 7 May 2019/Accepted 23 May 2019/Published online 9 September 2019 Citation Roychoudhury A, Xiong Y F. Automated program repair: a step towards software automation. Sci China Inf Sci, 2019, 62(10): 200103, https://doi.org/10.1007/S11432-019-9947-6 Programming is seen as a problem solving activity, which combines precision with creativity. The program needs to be precise at least to the extent of passing given tests. At the same time, the programmer employs copious creativity in terms of problem solving strategies, algorithm design, data structure choice, or even choice of which libraries to invoke. The recent growth of machine learning techniques and the possibility of applying such techniques to large software repositories raise the question to what extent the various software engineering processes can be automated, which is known as the software automation problem [1]. It is known that for many software engineering projects up to 80% of the time is spent in debugging and fixing errors. This is an unfortunate narrative on the state-of-practice in software development, prompting practitioners to label the situation as a legacy crisis a decade back [2]. Since then, the scale of software has increased, and the use of third party code, or geographically distributed software development has also dramatically increased. It is indeed not an exaggeration to say that today’s software systems are often not monolithic. Instead, they are assembled out of software components written by various geographically distributed teams, legacy software components, and third party software components purchased or acquired for free. In the absence of strong oversight, the challenges of debugging and fixing are exacer- bated. This makes the prospect of automated program repair particularly attractive in future software development. Classic automated repair techniques aim to modify a buggy program to meet a given correctness criterion; the correctness criterion is often given as a test-suite. Classic program repair typically proceeds with three steps: (i) fix localization, (ii) fix representation, and (iii) fix selection, as detailed below. Step 1. Fix localization attempts to find program locations where the code may be changed to achieve the fix. Step 2. A space of candidate patches is represented. The representation is often based on meta-level techniques, such as program transformation operations, and/or grammars to constrain the newly generated code pieces. Step 3. The repair system selects a candidate patch from the space to satisfy the correctness requirement. Typical methods to perform the selection include heuristic search [3] and program synthesis with symbolic execution [4, 5]. In fixing program errors, a key issue is the enunciation of the correctness requirement. Because formal specifications of intended program behavior are typically unavailable, the correctness criteria driving program repair are given by test-suites. 
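To make the three steps and the role of the test-suite concrete, here is a minimal, hypothetical Java sketch of a generate-and-validate repair loop. The CandidatePatch, PatchSpace, and TestSuite interfaces are illustrative assumptions rather than the API of any existing repair system.

  import java.util.List;

  // Minimal sketch of classic test-driven program repair (hypothetical API).
  public final class GenerateAndValidateRepair {

      /** A candidate patch that can be applied to the buggy program. */
      interface CandidatePatch {
          void apply();    // apply the edit to the program under repair
          void revert();   // undo the edit
      }

      /** The patch space enumerated from suspicious locations (steps 1 and 2). */
      interface PatchSpace {
          List<CandidatePatch> candidates();
      }

      /** The weak correctness criterion: a test suite that must pass. */
      interface TestSuite {
          boolean allTestsPass();
      }

      /** Step 3: select the first candidate that satisfies the test suite. */
      static CandidatePatch selectPatch(PatchSpace space, TestSuite tests) {
          for (CandidatePatch patch : space.candidates()) {
              patch.apply();
              if (tests.allTestsPass()) {
                  return patch;        // plausible patch found (may still overfit)
              }
              patch.revert();          // keep searching the space
          }
          return null;                 // no candidate in the space passes all tests
      }
  }

Note that a patch returned by such a loop is only guaranteed to pass the given tests; it is merely plausible.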
This presents a challenge for repair approaches because the generated patches can break tests not appearing in the given test-suite, or even introduce new errors in un-tested or under-tested program functionality. This problem is known as “weak specification”, “weak test-suites”, or “overfitting”, and is often considered one of the most important challenges faced by program repair research [6, 7]. In this study, we discuss two possible directions to address the weak specification problem.

Repair with a reference implementation. Though a formal specification of program behavior is often unavailable, in many situations there exists a reference implementation of the program that demonstrates the desired behavior. In the case of an industrial standard for a program or a protocol, the organization responsible for the standard usually publishes a reference implementation to show how the standard should be implemented, and other companies try to be compatible with the reference implementation. For example, OpenJDK is the reference implementation for the Java programming language, and Oracle JDK and Jikes RVM are compatible with OpenJDK. Similarly, reference implementations usually exist for decoders/encoders of multimedia files, network protocols, encryption algorithms, etc. Different from a formal specification, a reference implementation specifies full execution behavior. Essentially, the reference implementation acts as an informal specification of intended behavior. The program under repair does not have to become fully identical to the reference implementation; only the observable behavior should be equivalent after repair.

To capture such behavior, one possibility is to generate test cases from the reference implementation. One could employ automated test generation methods such as symbolic execution to automatically generate test inputs. Dynamic symbolic execution engines such as KLEE [8] can compute the set of inputs driving execution along a program path as a logical formula called a path condition. Subsequently, the path condition for a random input can be mutated to drive execution along other paths and thereby generate inputs traversing those paths. Symbolic execution can thus be used to systematically generate a comprehensive test-suite from the reference implementation, driving repair. However, symbolic execution engines leverage constraint solving and SMT solvers as their backend, which leads to scalability challenges.

Several possibilities exist to address this problem. First, one can guide symbolic execution to reach specific targets, as embodied by efforts such as [9]. This can help generate test cases stressing un-tested functionality, and subsequently these tests can be used to guide program repair. Second, systematic grey-box fuzzing methods that attempt to achieve enhanced path coverage, such as AFLFast [10], can be used. Grey-box fuzzing methods employ compile-time instrumentation followed by run-time detection of enhanced coverage of control-flow artifacts. With the recent push to make grey-box fuzzing methods systematic, there exists an opportunity to generate abundant test data for driving program repair.
Last but not the least, symbolic execution techniques specifically designed for exposing the behavioral difference between two programs could be developed, such as the one by Mechtaev et al. [11]. Repair with big data. While obtaining the full specification of the program is difficult, estimating the likelihood of certain patches being correct in the represented space may be much easier for many types of bugs. For example, given a bug with an unexpected NullException, adding an “if” check to guard the statement is much more likely to be correct than deleting the respective statement. Therefore, another possible direction is to estimate the likelihood of the patches in the represented space, and select the most likely patch for the current context. Towards this direction, recent program repair techniques often add an additional step compared with the classic techniques: fix prioritization. This step estimates the likelihood of the patches and prioritizes them. After adding this step, the goal of the fix selection step is to select the highest ranked patch that satisfies the (possibly weak) specification, such as passing a given test-suite. Some existing techniques try to employ heuristic rules to rank the patches. These rules include minimizing the changes made by the patch with some measurement of change distance [12], anti-patterns [13], or checking if the execution of the tests change under some expected directions. However, heuristic rules are manually constructed by researchers, and by nature cannot cover all situations, especially when the probabilities of the patches depend on the local context of the project and the specific types of the bugs. The availability of software big-data presents a unique opportunity for this problem. It remains an open question whether we can get useful guidance to estimate the likelihood of patches and correctly prioritize them via mining software big-data. By collecting a corpus of patches and training over the corpus, we may build a model to estimate the likelihood of patches. Identifying the most-likely patches with this model, we may repair bugs with a high probability of correctness. In this research direction, many challenges exist, and yet there are Roychoudhury A, et al. Sci China Inf Sci many opportunities to address such challenges. First, it is challenging to collect high-quality training data (patches for training). While patches can be found in the commit history of software projects, a commit may also add new functionalities, refactor code, or mix several purposes. The current approaches [14] use heuristics to identify bug-fixing commits, such as the number of modified code lines or keywords in commit message, but these heuristics all have limited precision and recall. A possible direction, as studied in a recent study [15], is to watch the development process and automatically identifies commits between a failing build and a passing build. Yet future work is still needed to identify which part of a big commit repairs the bug. Another opportunity is to learn from more sources beyond just patches. For example, existing approaches have utilized program source code [16] and QA web sites [17]. A remaining question is how to combine different sources to achieve the best performance. Second, it is challenging to build a learning model for estimating the likelihood of patches. Existing approaches have applied classic machine learning models as well as deep learning to model the code [18, 19]. 
However, we still lack understanding on how different models perform at different situations. Furthermore, these models usually treat the likelihood estimation procedure as a black box, and cannot utilize the domain knowledge of the program, such as the semantics. These issues remain to be explored in future. Third, it is challenging to identify the most probable patch that meets the weak specification. Though efficient methods exist to quickly identify patches passing the tests in a prioritized space [20, 21], it is often impossible to enumerate all possible patches and sort them by priority. Recent study [18] proposed to decompose a patch into a series of search steps, and instead of estimating the likelihood of patches, the likelihood of choices at each step is estimated. Future work needs to be done to understand how this method can be combined with semantic repair [4]. Acknowledgements This work was partially supported by Singapore’s National Cybersecurity R&D Program (Grant No. NRF2014NCR-NCR001-21), National Key Research and Development Program of China (Grant No. 2017YFB1001803), and National Natural Science Foundation of China (Grant Nos. 61672045, 61529201). October 2019 Vol. 62 200103:3 References Software Automation 1 Mei H, Zhang L. Can big in datathe bring Big a break-through Data Era: for software automation? Sci China Inf Sci, 2018, 61: 056101Challenges and Opportunities 2 Seacord R, Plakosh D, Lewis G. Modernizing Legacy Systems: Software Technologies, Engineering Processes and Business Practices. Boston: Addison Wesley, 2003 3 Weimer W, Nguyen T V, Goues C L, et al. Automatically finding patches using genetic programming. In: Proceedings of ICSE, 2009. 364–374 4 Nguyen H D T, Qi D W, Roychoudhury A, et al. SemFix: program repair via semantic analysis. In: Proceedings of ICSE, 2013. 772–781 5 Mechtaev S, Griggio A, Cimatti A, et al. Symbolic execution with existential second-order constraints. In: Proceedings of ESEC/FSE, 2018. 389–399 6 Qi Z C, Long F, Achour S, et al. An analysis of patch plausibility and correctness for generate-and-validate patch generation systems. In: Proceedings of ISSTA, 2015. 24–36 7 Smith E K, Barr E, Goues C L, et al. Is the cure worse than the disease? overfitting in automated program repair. In: Proceedings of FSE, 2015. 532–543 8 Cadar C, Dunbar D, Engler D. KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs. In: Proceedings of OSDI, 2008. 209–224 9 Marinescu P, Cadar C. Katch: high-coverage testing of software patches. In: Proceedings of ESEC-FSE, 2019. 235–245 10 Böhme M, Pham V T, Roychoudhury A. Coverage based greybox fuzzing as a markov chain. In: Proceedings of CCS, 2016. 489–506 11 Mechtaev S, Nguyen M D, Noller Y, et al. Semantic program repair using a reference implementation. In: Proceedings of ICSE, 2018. 129–139 12 Mechtaev S, Yi J, Roychoudhury A. Directfix: looking for simple program repairs. In: Proceedings of ICSE, 2015. 448–458 13 Tan S H, Yoshida H, Prasad M, et al. Anti-patterns in search-based program repair. In: Proceedings of FSE, 2016. 727–738 14 Just R, Jalali D, Ernst M D. Defects4j: a database of existing faults to enable controlled testing studies for java programs. In: Proceedings of ISSTA, 2014. 437–440 15 Dmeiri N, Tomassi D, Wang Y, et al. Bugswarm: mining and continuously growing a dataset of reproducible failures and fixes. In: Proceedings of ICSE, 2019. 339– 349 16 Xiong Y F, Wang J, Yan R F, et al. Precise condition synthesis for program repair. 
In: Proceedings of ICSE, 2017. 416–426 17 Gao Q, Zhang H S, Wang J, et al. Fixing recurring crash bugs via analyzing Q&A sites (T). In: Proceedings of ASE, 2015. 307–318 18 Xiong Y, Wang B, Fu G R, et al. Learning to synthesize. In: Proceedings of GI, 2018. 37–44 19 Gupta R, Pal S, Kanade A, et al. DeepFix: fixing common C language errors by deep learning. In: Proceedings of AAAI, 2017 20 Mechtaev S, Gao X, Tan S H, et al. Test-equivalence analysis for automatic patch generation. ACM Trans Softw Eng Methodol, 2018, 27: 15 21 Wang B, Xiong Y F, Shi Y Q W, et al. Faster mutation analysis via equivalence modulo states. In: Proceedings of ISSTA, 2017. 295–306 — 79 — 2018 SCIENCE CHINA Information Sciences . LETTER . Meeting Proceedings October 2019, Vol. 62 200104:1–200104:3 https://doi.org/10.1007/s11432-018-9854-3 Special Focus on Software Automation AI-boosted software automation: learning from human pair programmers Xin PENG1,2* , Zhenchang XING3 & Jun SUN4 1 School of Computer Science, Fudan University, Shanghai 201203, China; Shanghai Key Laboratory of Data Science, Fudan University, Shanghai 201203, China; 3 Research School of Computer Science, Australian National University, Acton ACT 2601, Australia; 4 School of Information Systems, Singapore Management University, Singapore 178902, Singapore 2 Received 18 December 2018/Revised 31 January 2019/Accepted 19 March 2019/Published online 3 September 2019 Citation Peng X, Xing Z C, Sun J. AI-boosted software automation: learning from human pair programmers. Sci China Inf Sci, 2019, 62(10): 200104, https://doi.org/10.1007/s11432-018-9854-3 Dear editor, Software automation [1] aims to automatically generate computer programs from formal or informal requirement descriptions. It covers a variety of transformations of different spans, including generating programs from natural-language requirements, requirements specifications, or design specifications. Traditionally software automation is achieved through logical reasoning and rule-based transformation [1]. Although the transformation from high-level programming languages to their executable forms has been fully automated, automatic generation of programs from their requirements is still hard due to the informality, nonoperationality, and incompleteness of the requirements [2]. The progress of software automation can be boosted by the development of big data and AI (artificial intelligence) techniques. For example, open source communities such as GitHub1) host hundreds of millions of projects with various kinds of software development data such as source code, revisions, issues, emails; online forums such as Stack Overflow2) record dozens of millions of questions and answers on a wide range of topics in programming. Moreover, companies such as Google have accumulated billions of lines of code in their repositories to support tens of thousands of devel- opers around the world [3]. On the other hand, some recent studies have revealed that most software is natural, and thus, like natural language, is also likely to be repetitive and predictable [4]. Based on the huge amount of software development data and knowledge, one can naturally expect that most of the common functionalities have been implemented and shared by others, and most of the common problems in software development have been reported and solved by others. 
Based on this assumption, many researchers have explored ways of data-driven intelligent software development, which leverages the big software development data and AI techniques such as deep learning for software automation. To date data-driven intelligent software development has made it possible to automate some specific tasks in software development such as recommending the next API based on code context [5], generating API usage sequences for a given natural language query [6], and generating GUI skeleton from UI design image [7]. These tasks only account for a small part of software development. End to end automated program generation from requirements is possible for specific types of small programs such as data manipulation programs (which can be induced from input-output pairs), but is unrealistic for industry-scale software systems in * Corresponding author (email: pengxin@fudan.edu.cn) 1) https://github.com. 2) https://stackoverflow.com. c Science China Press and Springer-Verlag GmbH Germany, part of Springer Nature 2019  — 80 — info.scichina.com link.springer.com Peng X, et al. Sci China Inf Sci general due to the following challenges. • Creative and uncertain nature of software. Understanding the requirements and generating architecture design of these systems require creativity and to deal with a great deal of uncertainty. Developers often need to think carefully to understand and refine user requirements into concrete functionalities and business logics. They also have to consider a sound software architecture that satisfies the desired non-functional requirements such as performance, reliability, and extendability. Moreover, the requirements and architecture involve a great deal of uncertainty incurred by the ever-changing user requirements and runtime environments. • Domain diversity. The software projects in open source and industrial repositories have high diversity in their business and technical domains. These projects belong to a large variety of business domains (e.g., finance, e-business, education, entertainment) and are built on a large variety of languages, libraries, frameworks, and platforms (e.g., Java, Spring, Android). Although the total amount of software development data is huge, the data that a specific project can learn from may be limited considering the business and implementation diversity. Moreover, different aspects (e.g., different libraries and frameworks, common and project-specific logics) are often interweaved together, even in a small piece of code. • Data quality. The quality of much software development data is questionable due to both essential and accidental factors. Although the principle of separation of concerns is widely accepted, code scattering and tangling phenomena is common in open-source and industrial software systems. On the other hand, the text content (e.g., class/method/variable names, comments) of programs, which is used in tasks like code recommendation and program comprehension, is often not expressed in a consistent and normative way. For example, a recent study by Liu et al. [8] revealed that a large part of the commit messages that are used as references are noisy, for example they may be bot messages generated by tools or trivial messages containing little or redundant information. Therefore, it may be more realistic to expect human-AI cooperative automation for industryscale software systems. 
It means that developers still follow existing development processes such as agile development and an AI assistant behind the scene acts as a pair programmer that provides the required helps when the developers encounter difficulties. The purpose of this kind of cooperative automation is not to replace developers by end to end automation, but achieve better efficiency and October 2019 Vol. 62 200104:2 quality by reducing repetitive work and helping Software Automation novice developers think and work like experienced in the Big Data Era: ones. Challenges and Opportunities To understand the requirements for the AI assistant, let us first consider how human developers work and cooperate. Given a development task, developers need to achieve the required goal (e.g., implementing a new feature or improving an existing one) based on their development knowledge and understanding of the current code context. Usually they can also resort to various development resources such as code bases, API documentation, online forums. The reason why they encounter difficulties in the task usually lies in the gap between the developers’ knowledge and the goal of the task. For example, the developers may not know the calculation principle of offset in the canvas of a text editor or the APIs that can change the color of the text on the canvas. Due to the knowledge gap it is often hard for them to find the required solution even though it can be implied from existing code, documentation, or online discussions. Bridging the knowledge gap therefore becomes the main task of the AI assistant. The way how pair programmers communicate with and help each other can help us further understand how the AI assistant should work. Here are some thoughts that are inspired by the cooperative work of pair programmers. Interactive clarification and explanation. When a pair programmer understands the problem of the other one and provides suggestions, he/she usually needs to interactively clarify the intention and explain the suggested solution. For example, when the pair programmer suggests to use Java StringBuffer to construct the text content read from a file, he/she may explain that StringBuffer is thread safe and this explanation can increase the trust on the suggestion. To recommend the solution for the next step, he/she may clarify the intention, for example, by asking whether to print the text content or show it on the screen. Even when recommending a code fragment the pair programmer may explain the parameters and other implementation details that need to be adapted to the task requirements and local code context. Without this kind of clarification and explanation, it is hard for the AI assistant to understand the goal of the developers, make them trust the suggestions, and help them successfully apply the solutions. Stepwise refinement. Developers often follow a non-sequential order of thinking and editing [5]. For example, a developer may first write the body of a file manipulation functionality and then consider its condition (e.g., whether the file exists). When developers provide suggestions for others — 81 — Peng X, et al. Sci China Inf Sci 2018 they usually understand the problem and develop the solution from a refinement process. 
They could first suggest someProceedings core APIs with sample code Meeting for feedback to clarify the intention, and then determine and complete the implementation details gradually, for example configuring variable values and adding initialization, resource cleaning, and exception handling code. The AI assistant needs to follow a similar process to understand the intention of the developer and suggests the required solution, and avoids to get into the details from every beginning. With the required background knowledge. An implied premise of pair programming is that the pair programmers share the required background knowledge, including both technical and business knowledge. For example, when talking about the thread safety of string APIs the pair programmers need to both understand the underlying technical concepts such as process/thread, thread safety, and buffering and know that thread safety is only meaningful when multiple threads access and modify the string. Similarly, when choosing APIs for reading/writing excel files, the pair programmers need to know concepts like sheet, row, column, cell and their relationships. To efficiently communicate with the developers and provide accurate suggestions, the AI assistant needs to be equipped with knowledge such as API knowledge graph [9] and other technical or business knowledge graphs. On-demand solution granularities or forms. Developers need solutions and suggestions of different granularities or forms in different situations. When an implementation (e.g., a code fragment or a set of files) for a similar functionality can be found there is no reason to suggest the code line by line; instead, the pair programmer could recommend the whole reference implementation and suggest the required modifications. In some cases, the knowledge gap of the developers lies in an API, a technical principle, or the format of a string variable, thus the pair programmer needs to provide different forms of suggestions and solutions such as API recommendation, explanation of technical principles and string format. Therefore, the AI assistant needs to understand the knowledge gap of the developer and provides suggestions and solu- — 82 — October 2019 Vol. 62 200104:3 tions of different granularities or forms in an ondemand way. To conclude, end to end automated program generation is unrealistic for industry-scale software systems due to the creative and uncertain nature of software, diversity of technical and business domains, and data quality issues. A more realistic expectation is human-AI cooperative automation, where an AI assistant acts as a pair programmer and provides the required helps. How the AI assistant should work can be understood by observing the way how pair programmers communicate with and help each other. Some inspirations include interactive clarification and explanation, stepwise refinement, background knowledge, and on-demand solution granularities or forms. Acknowledgements This work was supported by Na- tional Key Research and Development Program of China (Grant No. 2016YFB1000801). References 1 Xu J, Chen D, Lv J, et al. Software Automation (in Chinese). Beijing: Tsinghua University Press, 1994 2 Mei H, Zhang L. Can big data bring a breakthrough for software automation? Sci China Inf Sci, 2018, 61: 056101 3 Potvin R, Levenberg J. Why Google stores billions of lines of code in a single repository. Commun ACM, 2016, 59: 78–87 4 Hindle A, Barr E T, Gabel M, et al. On the naturalness of software. 
Commun ACM, 2016, 59: 122–131 5 Nguyen A T, Nguyen T N. Graph-based statistical language model for code. In: Proceedings of the 37th IEEE/ACM International Conference on Software Engineering (ICSE-15), Florence, 2015. 858–868 6 Gu X D, Zhang H Y, Zhang D M, et al. Deep API learning. In: Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE-16), Seattle, 2016. 631–642 7 Chen C Y, Su T, Meng G Z, et al. From UI design image to GUI skeleton: a neural machine translator to bootstrap mobile GUI implementation. In: Proceedings of the 40th International Conference on Software Engineering (ICSE-18), Gothenburg, 2018. 665–676 8 Liu Z X, Xia X, Hassan A E, et al. Neural-machinetranslation-based commit message generation: how far are we? In: Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE-18), Montpellier, 2018. 373–384 9 Li H W, Li S R, Sun J M, et al. Improving API caveats accessibility by mining API caveats knowledge graph. In: Proceedings of the 34th IEEE International Conference on Software Maintenance and Evolution (ICSME-18), Madrid, 2018. 183–193 MEETING NEWS National Science Review 6: 19, 2019 Software Automation doi: 10.1093/nsr/nwy145 Advance access 23 November 2018Era: inpublication the Big Data Challenges and Opportunities Challenges and opportunities of software automation discussed at the Yanqi-Lake Meeting By He Jiang Scientists attending the Yanqi-Lake Meeting during this fall— a summit sponsored by the Chinese Academy of Sciences— discussed the challenges and opportunities of software automation in the big data era. From 11 to 13 October 2018, nearly 40 distinguished scientists from Australia, Britain, Canada, China, Japan, Singapore and the USA gathered in Yanqi-Lake, Beijing, to exchange ideas on software automation, today and in the future. Software automation—the process of generating software automatically based on formal or informal specifications—used to be a dream of computer scientists. Its purpose is not only to free developers from tedious programming for new features of software, but also free developers from the endless manual maintenance of evolving software under ever-changing environments. Software automation includes, but is not limited to, program synthesis, code completion, program transformation, code recommendation, program repair and software selfevolution. As an emerging and promising direction, software automation also implies a series of essential challenges, including vague and diverse requirements in open domains, complex software ecosystems and software technology stacks, and diversity in technical and business domains. It is even more challenging when software automation is required to handle nonfunctional requirements such as extendibility and safety. Nowadays, the ‘big’ software-engineering data, which are characterized with their volume, variety, velocity and veracity attributes, are driving this dream of software automation to come true. The participants at the Yanqi-Lake Meeting believe that some specific software-engineering tasks, such as bug fixing, will be able to be fully automated in the near future. Some participants even believe that we are about to witness computers gradually outperforming humans in programming in the coming decades. With software automation, a new type of pair programming may arise. 
That is, an intelligent assistant hidden within the Integrated Development Environment (IDE) is paired with a human developer at one workstation to perform daily development tasks. Devanbu, a professor from the University of California at Davis, said that the intelligent interaction between the IDE and human developers may be a breakthrough in the coming years.
All the scientists in the meeting agreed that 'big' software-engineering data play a key role in software automation. Hence, in addition to the 'big' data emerging in publicly available sources such as GitHub and Stack Overflow, scientists are also seeking ways to manually label more software-engineering data. For example, with the support of the China Ministry of Science and Technology, Prof. Minghui Zhou from Peking University and Prof. Gang Yin from the National University of Defense Technology have set up a project to organize competitions among students on labeling open source code.
At this Yanqi-Lake Meeting, the scientists also discussed possible ways to classify the capability of software automation into a series of levels. A possible classification may be the automated generation of machine code (L1), automated generation of skeleton code and suggestion of the next line of code (L2), automated generation of code fragments (L3), automated generation of design structure (L4) and automated generation of the whole application based on requirements understanding (L5).
'Software automation is promising with great challenges under the big data era. It is time to bring up researchers from various disciplines, including artificial intelligence, software engineering, and programming languages, to work together for software automation,' said Prof. Hong Mei, the chair of this Yanqi-Lake Meeting.
He Jiang is a professor at Dalian University of Technology, China.
© The Author(s) 2018. Published by Oxford University Press on behalf of China Science Publishing & Media Ltd. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com

SCIENCE CHINA Information Sciences
PERSPECTIVE
May 2018, Vol. 61 056101:1–056101:3
https://doi.org/10.1007/s11432-017-9355-3

Can big data bring a breakthrough for software automation?
Hong MEI1,2* & Lu ZHANG1
1Key Laboratory on High-Confidence Software Technologies (Ministry of Education), Peking University, Beijing 100871, China;
2Beijing Institute of Technology, Beijing 100081, China
Received 15 December 2017/Accepted 15 January 2018/Published online 9 April 2018
Citation: Mei H, Zhang L. Can big data bring a breakthrough for software automation? Sci China Inf Sci, 2018, 61(5): 056101, https://doi.org/10.1007/s11432-017-9355-3

Software automation [1] aims to automatically generate computer programs from formal or informal requirements. Since it may release programmers from tedious programming tasks, software automation has long been a dream of computer scientists. Practically, since software systems constantly evolve during their life cycles, software automation should cover all development activities related to both generating new code and changing existing code. Compilers for high-level programming languages (e.g., C and FORTRAN) can be viewed as pioneer work in the field of software automation. With compilers, programs written in high-level programming languages can be automatically transformed into their executable forms using some transformation rules.
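To make the transformation-rule view of compilers concrete, here is a minimal illustrative sketch (hypothetical code, not from the article): a tiny arithmetic language is lowered to stack-machine instructions by applying one rule per language construct, the same mechanical step that early compilers automated.

```python
# Illustrative toy: a compiler, in miniature, is a set of transformation rules that
# map a high-level form (an AST) into an executable form (stack-machine code).

from dataclasses import dataclass
from typing import List, Union

@dataclass
class Num:
    value: int

@dataclass
class BinOp:
    op: str            # '+' or '*'
    left: "Expr"
    right: "Expr"

Expr = Union[Num, BinOp]

def compile_expr(e: Expr) -> List[str]:
    """One transformation rule per AST node kind."""
    if isinstance(e, Num):
        return [f"PUSH {e.value}"]
    return compile_expr(e.left) + compile_expr(e.right) + [{"+": "ADD", "*": "MUL"}[e.op]]

def run(code: List[str]) -> int:
    """A trivial stack machine that executes the generated code."""
    stack: List[int] = []
    for instr in code:
        if instr.startswith("PUSH"):
            stack.append(int(instr.split()[1]))
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if instr == "ADD" else a * b)
    return stack.pop()

if __name__ == "__main__":
    ast = BinOp("+", Num(2), BinOp("*", Num(3), Num(4)))   # 2 + 3 * 4
    code = compile_expr(ast)
    print(code)        # ['PUSH 2', 'PUSH 3', 'PUSH 4', 'MUL', 'ADD']
    print(run(code))   # 14
```

The point of the toy is only that, once the requirement is already formal (an AST), generation is mechanical; the challenges discussed next arise precisely when it is not.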
However, to realize software automation in the modern sense, where software systems written in high-level programming languages need to be automatically generated based on their requirements, three major challenges need to be further addressed: informality, non-operationality, and incompleteness.
• Informality. Instead of representing requirements for computers to process (e.g., a formal language), humans tend to represent requirements in a manner for humans to process (e.g., a natural language). To address this challenge, researchers have investigated various specification languages [2] (which can be either formal, semiformal, or graphical) to provide a compromise. However, informality remains challenging because it implies that computers should understand natural languages in an accurate manner.
• Non-operationality. Instead of describing "how to do" in requirements, humans tend to describe only "what to do". To address this challenge, researchers have investigated various declarative languages (e.g., functional languages [3]), where developers can describe only "what to do" requirements and a compiler or an interpreter automatically maps them to their corresponding "how to do" requirements. However, declarative languages can manage only a limited set of such "what to do" requirements. To accommodate a broader scope of "what to do" requirements, a search procedure to synthesize programs satisfying the "what to do" requirements is needed. This kind of program synthesis is very difficult because it needs to search an infinite program space.
• Incompleteness. Instead of describing the full set of requirements, humans tend to explicitly provide a small subset of the requirements, keeping the remaining requirements latent. To address this challenge, researchers have investigated various domain-specific languages [4] for mature domains, where software systems differ from each other on a well-defined set of variation points. Thus, developers can use a domain-specific language to concisely describe a target system, thus alleviating the situation. The difficulty of fully addressing this challenge is that the large number of unknown variation points in general domains poses an intrinsic barrier for humans to design and implement suitable domain-specific languages.
* Corresponding author (email: meih@pku.edu.cn)
© Science China Press and Springer-Verlag GmbH Germany 2018
This article proposes that the vast accumulation of source code and related documentation (i.e., big data in software development) may shed some new insights into software automation. Here, big data in software development at least include software code bases, software revisions, software documents, software issues in issue tracking systems, and development-related emails among developers. Similar to other types of big data, these data are also accumulating at a rapid pace, although the absolute volume is far smaller than that of some typical kinds of big data (e.g., video data). However, due to their complex structures, efficient processing of these data is already a challenge. For informality, the parallel relation between source code and its functionality descriptions in natural languages provides an opportunity to learn how to map descriptions in natural languages to source code.
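As a hedged illustration of this idea (not a technique proposed in the article), the sketch below maps a natural-language request to code by nearest-neighbour retrieval over a toy parallel corpus of (description, code) pairs; the corpus entries are invented, and a real system would learn the mapping from large mined corpora rather than rely on simple token overlap.

```python
# Hypothetical sketch: exploiting parallel (description, code) data by retrieving the
# snippet whose description best matches a natural-language request.

from collections import Counter
import math

CORPUS = [
    ("read a text file into a string",
     "with open(path) as f:\n    text = f.read()"),
    ("write a string to a text file",
     "with open(path, 'w') as f:\n    f.write(text)"),
    ("parse a json string into a dict",
     "import json\nobj = json.loads(s)"),
]

def _vec(text: str) -> Counter:
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def suggest_code(description: str) -> str:
    """Return the code snippet whose description is closest to the request."""
    query = _vec(description)
    _, code = max(CORPUS, key=lambda pair: _cosine(query, _vec(pair[0])))
    return code

if __name__ == "__main__":
    print(suggest_code("how do I read a file into a string"))
```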
For non-operationality, since these data may generally characterize a space of existing software, confining the search space within this software space may both accelerate the search procedure and help produce human-readable code. For incompleteness, since different software systems may share some common functionalities (which may not be specified formally and/or in an explicit form), these common functionalities may provide an opportunity for identifying latent requirements and their implementations in existing code.
Analog. By considering software automation as transforming requirement descriptions into source code, software automation can be viewed as an analog to machine translation [5]. Although various rule-based approaches have been intensively investigated, the vast accumulation of parallel natural language corpora makes data-driven machine translation competitive or even more effective. For software automation, what has been done in data-driven machine translation is mainly suitable to address only informality. To address the other two challenges, invention of new methodologies and/or techniques becomes unavoidable, because the other two challenges do not essentially involve pure translation but more or less involve creation of non-existing source code.
Noticeable research. In general, there are two noticeable lines of research on utilizing existing data to assist software development. First, many software researchers have tried to mine existing data accumulated in previous software development to acquire useful knowledge for software development. Techniques with this focus generally summarize some patterns from existing data and use these patterns as guidelines for future software development. A typical example is defect prediction [6], where various attributes are extracted and a prediction model is built. Typically, existing techniques in this line can acquire patterns with only low accuracy at the current stage. Therefore, it is only feasible to use these patterns to aid human developers, but it is infeasible to perform software automation solely based on these patterns. Second, artificial intelligence researchers have recently started studying models (e.g., neural network models) for learning from big data in software development. Their primary concern is proper treatment of the highly structural information (e.g., source code). For example, tree-based convolutional neural networks [7] are proposed to accommodate complex structures of code. These neural networks are demonstrated to be effective for distinguishing functionalities of code snippets. Conceptually, this research line complements the previous line from the perspective of software automation because automatic code generation relies on a wide range of knowledge, including both knowledge specific to the development tasks and common coding knowledge. However, the accuracy of the existing techniques in this line of research can be competitive in very few tasks.
Expecting a breakthrough. A breakthrough of software automation with the help of big data may occur in the foreseeable future. There may be two criteria that determine whether such a breakthrough (which may also be called data-driven software automation) has occurred. First, source code for a wide range of daily software development tasks can be automatically generated.
In other words, the breakthrough techniques should be able to oversee a large portion of activities for developing software with a typical size and written in a mainstream programming language. Second, the automatically generated code should have comparable or even higher quality than human-written code. In particular, the maintainability of the automatically generated code should be high enough so that human developers can work with it comfortably. In fact, daily software development tasks nowadays are mainly based on mature algorithmic and architectural designs. That is, instead of inventing totally new algorithms and architectures, developers usually adopt ideas and patterns from existing successful algorithms and architectures. Furthermore, with the accumulation of various useful software libraries, there seems to be a trend that daily software development becomes less creative. Thus, daily software development tasks involve more of building with existing libraries than inventing new code.
Approaching the breakthrough. In the following, we focus on discussing where data-driven software automation might replace human developers in the near future. First, it is more practical and viable to apply data-driven software automation during software evolution than during initial software development. During software evolution, the history of a software system is typically a burden for human developers to evolve the system, but for data-driven software automation, the history of the system can be a valuable data source. Thus, the historical data may serve as a benefit instead of a hindrance for data-driven software automation to produce quality code to evolve the system. Second, one particularly promising scenario in data-driven software evolution is functionality transplantation, which migrates the code implementing certain functionalities from one software system to another. Compared with data-driven software evolution in the general sense, the scenario of functionality transplantation explicitly provides developers with the code that will be transplanted. In other words, a technique for functionality transplantation starts with some existing code instead of some (partial) informal specification. Thus, the search procedure can focus on a small search space around the code at hand. Another promising scenario in data-driven software evolution might be bug fixing, where some partial specification represented as test cases is typically available. Since it is convenient to execute the test cases to check the satisfiability of the partial specification, the search procedure in this scenario can be in a much simpler form. Third, data-driven software automation may also provide useful support for developing the initial version of a given software system. Let us consider that developers are required to build a software system or sub-system using existing software libraries. On account of the abundance of software libraries, the typical daily development requirement for developers is to compose application programming interfaces (APIs) from various libraries with some simple logic. Thus, with a knowledge base that stores the up-to-date knowledge about all the known libraries, a technique for data-driven software automation may search in only a limited search space that covers common composition logics to fulfill most daily development tasks in this scenario.
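A minimal sketch of this scenario, under the assumption of a toy knowledge base of known APIs (the functions, names, and example below are illustrative, not a real system): bounded enumeration over common composition pipelines, checked against a single input-output example.

```python
# Hypothetical sketch: compose known library APIs with simple logic by searching a
# small space of pipelines until an input-output example is satisfied.

from itertools import product

API_KNOWLEDGE_BASE = {
    "strip": str.strip,
    "lower": str.lower,
    "upper": str.upper,
    "title": str.title,
}

def synthesize_pipeline(example_in: str, example_out: str, max_len: int = 3):
    """Enumerate API pipelines of bounded length; return the first one meeting the example."""
    names = list(API_KNOWLEDGE_BASE)
    for length in range(1, max_len + 1):
        for combo in product(names, repeat=length):
            value = example_in
            for name in combo:
                value = API_KNOWLEDGE_BASE[name](value)
            if value == example_out:
                return combo
    return None

if __name__ == "__main__":
    # A single example acts as the (partial) specification.
    print(synthesize_pipeline("  hello WORLD  ", "Hello World"))  # ('strip', 'title')
```

The limited search space is what makes the scenario tractable: the knowledge base fixes the vocabulary of operations, and only short, common composition logics are enumerated.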
Our early work called the architecture-based component composition (ABC) approach [8] is an existing attempt at reuse-based software automation. Given some specifications, the ABC approach matches the specifications against existing components and finds the most suitable components. Then, the ABC approach tries to generate glue code to compose the components. Whenever the generation of glue code fails, the ABC approach allows manual generation of glue code. Typically, the glue code is tightly mapped to an underlying mechanism (e.g., a middleware system). Compared with data-driven software automation, ABC does not rely on big data to find suitable components or to generate glue code.
Limit. To conclude, we briefly discuss the limit of data-driven software automation. In our opinion, software development may not become fully automatic solely due to data-driven software automation. What has been discussed so far is to intrinsically reuse existing software rather than invent new software. The main difference between data-driven software automation and traditional software reuse is that what is reused in data-driven software automation is primarily the implicit knowledge buried in code. Reuse of existing knowledge may become a limiting factor for data-driven software automation to be applicable in the scenario of developing software without suitable precedent knowledge. Of course, this scenario may occur with low frequency in a typical software development process.
Acknowledgements This work was supported by National Key Research and Development Program of China (Grant No. 2017YFB1001803).
References
1 Xu J, Chen D, Lv J, et al. Software Automation (in Chinese). Beijing: Tsinghua University Press, 1994
2 Pressman R. Software Engineering: a Practitioner's Approach. Boston: McGraw Hill Press, 2010
3 Hudak P. Conception, evolution, and application of functional programming languages. ACM Comput Surv, 1989, 21: 359–411
4 Mernik M, Heering J, Sloane A M. When and how to develop domain-specific languages. ACM Comput Surv, 2005, 37: 316–344
5 Hutchins W, Somers H. An Introduction to Machine Translation. London: Academic Press, 1992
6 D'Ambros M, Lanza M, Robbes R. Evaluating defect prediction approaches: a benchmark and an extensive comparison. Empir Softw Eng, 2012, 17: 531–577
7 Mou L L, Li G, Zhang L, et al. Convolutional neural networks over tree structures for programming language processing. In: Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI-16), Phoenix, 2016. 1287–1293
8 Mei H, Chang J C, Yang F Q. Software component composition based on ADL and middleware. Sci China Ser F-Inf Sci, 2001, 44: 136–151

Published as a conference paper at ICLR 2018

NEURAL-GUIDED DEDUCTIVE SEARCH FOR REAL-TIME PROGRAM SYNTHESIS FROM EXAMPLES

Ashwin K. Vijayakumar*† & Dhruv Batra
School of Interactive Computing, Georgia Tech, Atlanta, GA 30308, USA
{ashwinkv,dbatra}@gatech.edu
Abhishek Mohta† & Prateek Jain
Microsoft Research India, Bengaluru, Karnataka 560001, India
{t-abmoht,prajain}@microsoft.com
Oleksandr Polozov & Sumit Gulwani
Microsoft Research Redmond, Redmond, WA 98052, USA
{polozov,sumitg}@microsoft.com

ABSTRACT
Synthesizing user-intended programs from a small number of input-output examples is a challenging problem with several important applications like spreadsheet manipulation, data wrangling and code refactoring.
Existing synthesis systems either completely rely on deductive logic techniques that are extensively handengineered or on purely statistical models that need massive amounts of data, and in general fail to provide real-time synthesis on challenging benchmarks. In this work, we propose Neural Guided Deductive Search (NGDS), a hybrid synthesis technique that combines the best of both symbolic logic techniques and statistical models. Thus, it produces programs that satisfy the provided specifications by construction and generalize well on unseen examples, similar to data-driven systems. Our technique effectively utilizes the deductive search framework to reduce the learning problem of the neural component to a simple supervised learning setup. Further, this allows us to both train on sparingly available real-world data and still leverage powerful recurrent neural network encoders. We demonstrate the effectiveness of our method by evaluating on real-world customer scenarios by synthesizing accurate programs with up to 12× speed-up compared to state-of-the-art systems. 1 I NTRODUCTION Automatic synthesis of programs that satisfy a given specification is a classical problem in AI (Waldinger & Lee, 1969), with extensive literature in both machine learning and programming languages communities. Recently, this area has gathered widespread interest, mainly spurred by the emergence of a sub-area – Programming by Examples (PBE) (Gulwani, 2011). A PBE system synthesizes programs that map a given set of example inputs to their specified example outputs. Such systems make many tasks accessible to a wider audience as example-based specifications can be easily provided even by end users without programming skills. See Figure 1 for an example. PBE systems are usually evaluated on three key criteria: (a) correctness: whether the synthesized program Input Output Yann LeCunn Hugo Larochelle Tara Sainath Y LeCunn H Larochelle T Sainath Yoshua Bengio ? ∗ † Figure 1: An example input-output spec; the goal is to learn a program that maps the given inputs to the corresponding outputs and generalizes well to new inputs. Both programs below satisfy the spec: (i) Concat(1st letter of 1st word, 2nd word), (ii) Concat(4th -last letter of 1st word, 2nd word). However, program (i) clearly generalizes better: for instance, its output on “Yoshua Bengio” is “Y Bengio” while program (ii) produces “s Bengio”. Work done during an internship at Microsoft Research. Equal contribution. 1 — 87 — Published as a conference paper at ICLR 2018 2018 satisfies the spec i.e. the provided example input-output mapping, (b) generalization: whether the Meeting Proceedings program produces the desired outputs on unseen inputs, and finally, (c) performance: synthesis time. State-of-the-art PBE systems are either symbolic, based on enumerative or deductive search (Gulwani, 2011; Polozov & Gulwani, 2015) or statistical, based on data-driven learning to induce the most likely program for the spec (Gaunt et al., 2016; Balog et al., 2017; Devlin et al., 2017). Symbolic systems are designed to produce a correct program by construction using logical reasoning and domain-specific knowledge. They also produce the intended program with few input-output examples (often just 1). However, they require significant engineering effort and their underlying search processes struggle with real-time performance, which is critical for user-facing PBE scenarios. 
In contrast, statistical systems do not rely on specialized deductive algorithms, which makes their implementation and training easier. However, they lack in two critical aspects. First, they require a lot of training data and so are often trained using randomly generated tasks. As a result, induced programs can be fairly unnatural and fail to generalize to real-world tasks with a small number of examples. Second, purely statistical systems like RobustFill (Devlin et al., 2017) do not guarantee that the generated program satisfies the spec. Thus, solving the synthesis task requires generating multiple programs with a beam search and post-hoc filtering, which defeats real-time performance. Neural-Guided Deductive Search Motivated by shortcomings of both the above approaches, we propose Neural-Guided Deductive Search (NGDS), a hybrid synthesis technique that brings together the desirable aspects of both methods. The symbolic foundation of NGDS is deductive search (Polozov & Gulwani, 2015) and is parameterized by an underlying domain-specific language (DSL) of target programs. Synthesis proceeds by recursively applying production rules of the DSL to decompose the initial synthesis problem into smaller sub-problems and further applying the same search technique on them. Our key observation I is that most of the deduced sub-problems do not contribute to the final best program and therefore a priori predicting the usefulness of pursuing a particular sub-problem streamlines the search process resulting in considerable time savings. In NGDS, we use a statistical model trained on real-world data to predict a score that corresponds to the likelihood of finding a generalizable program as a result of exploring a sub-problem branch. Our key observation II is that speeding up deductive search while retaining its correctness or generalization requires a close integration of symbolic and statistical approaches via an intelligent controller. It is based on the “branch & bound” technique from combinatorial optimization (Clausen, 1999). The overall algorithm integrates (i) deductive search, (ii) a statistical model that predicts, a priori, the generalization score of the best program from a branch, and (iii) a controller that selects sub-problems for further exploration based on the model’s predictions. Since program synthesis is a sequential process wherein a sequence of decisions (here, selections of DSL rules) collectively construct the final program, a reinforcement learning setup seems more natural. However, our key observation III is that deductive search is Markovian – it generates independent sub-problems at every level. In other words, we can reason about a satisfying program for the sub-problem without factoring in the bigger problem from which it was deduced. This brings three benefits enabling a supervised learning formulation: (a) a dataset of search decisions at every level over a relatively small set of PBE tasks that contains an exponential amount of information about the DSL promoting generalization, (b) such search traces can be generated and used for offline training, (c) we can learn separate models for different classes of sub-problems (e.g. DSL levels or rules), with relatively simpler supervised learning tasks. Evaluation We evaluate NGDS on the string transformation domain, building on top of PROSE, a commercially successful deductive synthesis framework for PBE (Polozov & Gulwani, 2015). 
It represents one of the most widespread and challenging applications of PBE and has shipped in multiple mass-market tools including Microsoft Excel and Azure ML Workbench.1 We train and validate our method on 375 scenarios obtained from real-world customer tasks (Gulwani, 2011; Devlin et al., 2017). Thanks to the Markovian search properties described above, these scenarios generate a dataset of 400, 000+ intermediate search decisions. NGDS produces intended programs on 68% of the scenarios despite using only one input-output example. In contrast, state-of-the-art neural synthesis techniques (Balog et al., 2017; Devlin et al., 2017) learn intended programs from a 1 — 88 — https://microsoft.github.io/prose/impact/ 2 Published as a conference paper at ICLR 2018 Software Automation in the Big Data Era: single example in only 24-36% of scenarios taking ≈ 4× more time. Moreover, NGDS matches the Challenges and Opportunities accuracy of baseline PROSE while providing a speed-up of up to 12× over challenging tasks. Contributions First, we present a branch-and-bound optimization based controller that exploits deep neural network based score predictions to select grammar rules efficiently (Section 3.2). Second, we propose a program synthesis algorithm that combines key traits of a symbolic and a statistical approach to retain desirable properties like correctness, robust generalization, and real-time performance (Section 3.3). Third, we evaluate NGDS against state-of-the-art baselines on real customer tasks and show significant gains (speed-up of up to 12×) on several critical cases (Section 4). 2 BACKGROUND In this section, we provide a brief background on PBE and the PROSE framework, using established formalism from the programming languages community. Domain-Specific Language A program synthesis problem is defined over a domain-specific language (DSL). A DSL is a restricted programming language that is suitable for expressing tasks in a given domain, but small enough to restrict a search space for program synthesis. For instance, typical real-life DSLs with applications in textual data transformations (Gulwani, 2011) often include conditionals, limited forms of loops, and domain-specific operators such as string concatenation, regular expressions, and date/time formatting. DSLs for tree transformations such as code refactoring (Rolim et al., 2017) and data extraction (Le & Gulwani, 2014) include list/data-type processing operators such as Map and Filter, as well as domain-specific matching operators. Formally, a DSL L is specified as a context-free grammar, with each non-terminal symbol N defined by a set of productions. The right-hand side of each production is an application of some operator F (N1 , . . . , Nk ) to some symbols of L. All symbols and operators are strongly typed. Figure 2 shows a subset of the Flash Fill DSL that we use as a running example in this paper. Inductive Program Synthesis The task of inductive program synthesis is characterized by a spec. A spec ϕ is a set of m input-output constraints {σi  ψi }m i=1 , where: • σ, an input state is a mapping of free variables of the desired program P to some correspondingly typed values. At the top level of L, a program (and its expected input state) has only one free variable – the input variable of the DSL (e.g., inputs in Figure 2). Additional local variables are introduced inside L with a let construct. • ψ is an output constraint on the execution result of the desired program P (σi ). 
At the top level of L, when provided by the user, ψ is usually the output example – precisely the expected result of P (σi ). However, other intermediate constraints arise during the synthesis process. For instance, ψ may be a disjunction of multiple allowed outputs. The overall goal of program synthesis is thus: given a spec ϕ, find a program P in the underlying DSL L that satisfies ϕ, i.e., its outputs P (σi ) satisfy all the corresponding constraints ψi . Example 1. Consider the task of formatting a phone number, characterized by the spec ϕ = {inputs : [“(612) 8729128”]}  “612-872-9128”. It has a single input-output example, with an input state σ containing a single variable inputs and its value which is a list with a single input string. The output constraint is simply the desired program result. The program the user is most likely looking for is the one that extracts (a) the part of the input enclosed in the first pair of parentheses, (b) the 7th to 4th characters from the end, and (c) the last 4 characters, and then concatenates all three parts using hyphens. In our DSL, this corresponds to: Concat SubStr0 (RegexPosition(x, “(”, ε , 0), RegexPosition(x, ε, “)” , 0)), ConstStr(“-”), SubStr0 (AbsolutePosition(x, −8), AbsolutePosition(x, −5)), ConstStr(“-”), SubStr0 (AbsolutePosition(x, −5), AbsolutePosition(x, −1))  where ε is an empty regex, SubStr0 (pos1 , pos2 ) is an abbreviation for “let x = std.Kth(inputs, 0) in Substring(x, pos1 , pos2 )”, and · is an abbreviation for std.Pair. However, many other programs in the DSL also satisfy ϕ. For instance, all occurrences of “8” in the output can be produced via a subprogram that simply extracts the last character. Such a program overfits to ϕ and is bound to fail for other inputs where the last character and the 4th one differ. 3 — 89 — Published as a conference paper at ICLR 2018 2018 Meeting Proceedings // Nonterminals @start string transf orm := atom | Concat(atom, transf orm); string atom := ConstStr(s) | let string x = std.Kth(inputs, k) in Substring(x, pp); Tuple pp := std.Pair(pos, pos) | RegexOccurrence(x, r, k); int pos := AbsolutePosition(x, k) | RegexPosition(x, std.Pair(r, r), k); // Terminals @input string[] inputs; string s; int k; Regex r; Figure 2: A subset of the FlashFill DSL (Gulwani, 2011), used as a running example in this paper. Every program takes as input a list of strings inputs, and returns an output string, a concatenation of atoms. Each atom is either a constant or a substring of one of the inputs (x), extracted using some position logic. The RegexOccurrence position logic finds k th occurrence of a regex r in x and returns its boundaries. Alternatively, start and end positions can be selected independently either as absolute indices in x from left or right (AbsolutePosition) or as the k th occurrence of a pair of regexes surrounding the position (RegexPosition). See Gulwani (2011) for an in-depth DSL description. As Example 1 shows, typical real-life problems are severely underspecified. A DSL like FlashFill may contain up to 1020 programs that satisfy a given spec of 1-3 input-output examples (Polozov & Gulwani, 2015). Therefore, the main challenge lies in finding a program that not only satisfies the provided input-output examples but also generalizes to unseen inputs. 
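To illustrate the underspecification problem in ordinary code (hypothetical Python stand-ins, not programs in the paper's DSL), both functions below satisfy the single example of Example 1, yet only one generalizes to a new phone number; distinguishing between them is exactly the job of the ranking step discussed next.

```python
import re

# Two spec-satisfying candidates for {inputs: ["(612) 8729128"]} -> "612-872-9128".

def intended(inputs):
    """Extract the area code, the middle three digits, and the last four digits."""
    x = inputs[0]
    area = re.search(r"\((.*?)\)", x).group(1)   # text inside the first parentheses
    return f"{area}-{x[-7:-4]}-{x[-4:]}"         # 7th..5th chars from the end + last 4

def overfitting(inputs):
    """Reproduce every '8' in the output by copying the last character of the input."""
    x = inputs[0]
    last = x[-1]
    return f"612-{last}72-912{last}"             # constants stitched around x[-1]

if __name__ == "__main__":
    spec_input = ["(612) 8729128"]
    assert intended(spec_input) == overfitting(spec_input) == "612-872-9128"
    unseen = ["(425) 7064312"]
    print(intended(unseen))     # 425-706-4312  (generalizes)
    print(overfitting(unseen))  # 612-272-9122  (overfits the single example)
```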
Thus, the synthesis process usually interleaves search and ranking: the search phase finds a set of spec-satisfying programs in the DSL, from which the ranking phase selects top programs ordered using a domain-specific ranking  → R where Σ is the set of all input states. The ranking function takes as input a function h : L × Σ  (usually σ = inputs in the given spec + any candidate program P ∈ L and a set of input states σ ∈ Σ available unlabeled inputs), and produces a score for P ’s generalization. The implementation of h expresses a subtle balance between program generality, complexity, and behavior on available inputs. For instance, in FlashFill h penalizes overly specific regexes, prefers programs that produce fewer empty outputs, and prioritizes lower Kolmogorov complexity, among other features. In modern PBE systems like PROSE, h is usually learned in a data-driven manner from customer tasks (Singh & Gulwani, 2015; Ellis & Gulwani, 2017). While designing and learning such a ranking is an interesting problem in itself, in this work we assume a black-box access to h. Finally, the problem of inductive program synthesis can be summarized as follows: Problem 1. Given a DSL L, a ranking function h, a spec ϕ = {σi  ψi }m i=1 , optionally a set of unlabeled inputs σu , and a target number of programs K, let σ = σu ∪ {σi }m i=1 . The goal of inductive program synthesis is to find a program set S = {P1 , . . . , PK } ⊂ L such that (a) every program in S satisfies ϕ, and (b) the programs in S generalize best: h(Pi , σ ) ≥ h(P, σ ) for any other P ∈ L that satisfies ϕ. Search Strategy Deductive search strategy for program synthesis, employed by PROSE explores the grammar of L top-down – iteratively unrolling the productions into partial programs starting from the root symbol. Following the divide-and-conquer paradigm, at each step it reduces its synthesis problem to smaller subproblems defined over the parameters of the current production. Formally, given a spec ϕ and a symbol N , PROSE computes the set Learn(N, ϕ) of top programs w.r.t. h using two guiding principles: 1. If N is defined through n productions N := F1 (. . .) | . . . | Fn (. . .), PROSE finds a ϕ-satisfying program set for every Fi , and unites the results, i.e., Learn(N, ϕ) = ∪i Learn(Fi (. . .), ϕ). 2. For a given production N := F (N1 , . . . , Nk ), PROSE spawns off k smaller synthesis problems Learn(Nj , ϕj ), 1 ≤ j ≤ k wherein PROSE deduces necessary and sufficient specs ϕj for each Nj such that every program of type F (P1 , . . . , Pk ), where Pj ∈ Learn(Nj , ϕj ), satisfies ϕ. The deduction logic (called a witness function) is domain-specific for each operator F . PROSE then again recursively solves each subproblem and unites a cross-product of the results. Example 2. Consider a spec ϕ = {“Yann”  “Y.L”} on a transf orm program. Via the first production transf orm := atom, the only ϕ-satisfying program is ConstStr(“Y.L”). The second production on the same level is Concat(atom, transf orm). A necessary & sufficient spec on the atom sub-program is that it should produce some prefix of the output string. Thus, the witness function for the Concat operator produces a disjunctive spec ϕa = {“Yann”  “Y” ∨ “Y.”}. Each — 90 — 4 Published as a conference paper at ICLR 2018 Software Automation in the Big Data Era: of these disjuncts, in turn, induces a corresponding necessary and sufficient suffix spec on the second Challenges and Opportunities parameter: ϕt1 = {“Yann”  “.L”}, and ϕt2 = {“Yann”  “L”}, respectively. 
The disjuncts in ϕa will be recursively satisfied by different program sets: “Y.” can only be produced via an atom path with a ConstStr program, whereas “Y” can also be extracted from the input using many Substring logics (their generalization capabilities vary). Figure 3 shows the resulting search DAG. transf orm Concat(. . .) atom transf orm “Y.L” “Y.L” ConstStr(s) “Y.L” atom “.L” ... atom “.L” ConstStr(s) “Y” ∨ “Y.” “Y.L” Concat(. . .) “.L” ... atom “.” “Y” ∨ “Y.” let x = . . . transf orm “L” atom ... “L” “Y” ∨ “Y.” ... .. . Substring(. . .) “Y” pp (0, 1) ... Figure 3: A portion of the search DAG from Example 2. Only the output parts of the respective specs are shown in each node, their common input state is a single string “Yann”. Dashed arrows show recursive Learn calls on a corresponding DSL symbol. Notice that the above mentioned principles create logical non-determinism due to which we might need to explore multiple alternatives in a search tree. As such non-determinism arises at every level of the DSL with potentially any operator, the search tree (and the resulting search process) is exponential in size. While all the branches of the tree by construction produce programs that satisfy the given spec, most of the branches do not contribute to the overall top-ranked generalizable program. During deductive search, PROSE has limited information about the programs potentially produced from each branch, and cannot estimate their quality, thus exploring the entire tree unnecessarily. Our main contribution is a neural-guided search algorithm that predicts the best program scores from each branch, and allows PROSE to omit branches that are unlikely to produce the desired program a priori. 3 S YNTHESIS A LGORITHM Consider an arbitrary branching moment in the top-down search strategy of PROSE. For example, let N be a nonterminal symbol in L, defined through a set of productions N := F1 (. . .) | . . . | Fn (. . .), and let ϕ be a spec on N , constructed earlier during the recursive descent over L. A conservative way to select the top k programs rooted at N (as defined by the ranking function h), i.e., to compute Learn(N, ϕ), is to learn the top k programs of kind Fi (. . .) for all i ∈ [k] and then select the top k programs overall from the union of program sets learned for each production. Naturally, exploring all the branches for each nonterminal in the search tree is computationally expensive. In this work, we propose a data-driven method to select an appropriate production rule N := Fi (N1 , . . . , Nk ) that would most likely lead to a top-ranked program. To this end, we use the current spec ϕ to determine the “optimal” rule. Now, it might seem unintuitive that even without exploring a production rule and finding the best program in the corresponding program set, we can a priori determine optimality of that rule. However, we argue that by understanding ϕ and its relationship with the ranking function h, we can predict the intended branch in many real-life scenarios. Example 3. Consider a spec ϕ = {“alice”  “alice@iclr.org”, “bob”  “bob@iclr.org”}. While learning a program in L given by Figure 2 that satisfies ϕ, it is clear right at the beginning of the search procedure that the rule transf orm := atom does not apply. This is because any programs derived from transf orm := atom can either extract a substring from the input or return a constant string, both of which fail to produce the desired output. Hence, we should only consider transf orm := Concat(. . 
.), thus significantly reducing the search space. Similarly, consider another spec ϕ = {“alice smith”  “alice”, “bob jones”  “bob”}. In this case, the output appears to be a substring of input, thus selecting transf orm := atom at the beginning of the search procedure is a better option than transf orm := Concat(. . .). However, many such decisions are more subtle and depend on the ranking function h itself. For example, consider a spec ϕ = {“alice liddell”  “al”, “bob ong”  “bo”}. Now, 5 — 91 — 2018 Meeting Input state σ Production ruleProceedings Γ Embedding Output example(s) ψ Char Embedding Char Embedding LSTM for input encoding LSTM for output encoding Two FC layers Predicted score Published as a conference paper at ICLR 2018 Figure 4: LSTM-based model for predicting the score of a candidate production for a given spec ϕ. both transf orm := atom and transf orm := Concat(. . .) may lead to viable programs because the output can be constructed using the first two letters of the input (i.e. a substring atom) or by concatenating the first letters of each word. Hence, the branch that produces the best program is ultimately determined by the ranking function h since both branches generate valid programs. Example 3 shows that to design a data-driven search strategy for branch selection, we need to learn the subtle relationship between ϕ, h, and the candidate branch. Below, we provide one such model. 3.1 P REDICTING THE G ENERALIZATION S CORE As mentioned above, our goal is to predict one or more production rules that for a given spec ϕ will lead to a top-ranked program (as ranked a posteriori by h). Formally, given black-box access to h, we want to learn a function f such that, f (Γ, ϕ) ≈ max P ∈ S(Γ, ϕ) h(P, ϕ), where Γ is a production rule in L, and S(Γ, ϕ) is a program set of all DSL programs derived from the rule Γ that satisfy ϕ. In other words, we want to predict the score of the top-ranked ϕ-satisfying program that is synthesized by unrolling the rule Γ . We assume that the symbolic search of PROSE handles the construction of S(Γ, ϕ) and ensures that programs in it satisfy ϕ by construction. The goal of f is to optimize the score of a program derived from Γ assuming this program is valid. If no program derived from Γ can satisfy ϕ, f should return −∞. Note that, drawing upon observations mentioned in Section 1, we have cast the production selection problem as a supervised learning problem, thus simplifying the learning task as opposed to end-to-end reinforcement learning solution. We have evaluated two models for learning f . The loss function for the prediction is given by:  2 L(f ; Γ, ϕ) = f (Γ, ϕ) − max h(P, ϕ) . P ∈ S(Γ, ϕ) Figure 4 shows a common structure of both models we have evaluated. Both are based on a standard multi-layer LSTM architecture (Hochreiter & Schmidhuber, 1997) and involve (a) embedding the given spec ϕ, (b) encoding the given production rule Γ , and (c) a feed-forward network to output a score f (Γ, ϕ). One model attends over input when it encodes the output, whereas another does not. 3.2 C ONTROLLER FOR B RANCH S ELECTION A score model f alone is insufficient to perfectly predict the branches that should be explored at every level. Consider again a branching decision moment N := F1 (. . .) | . . . | Fn (. . .) in a search process for top k programs satisfying a spec ϕ. One naïve approach to using the predictions of f is to always follow the highest-scored production rule argmaxi f (Fi , ϕ). 
However, this means that any single incorrect decision on the path from the DSL root to the desired program will eliminate that program from the learned program set. If our search algorithm fails to produce the desired program by committing to a suboptimal branch anytime during the search process, then the user may never discover that such a program exists unless they supply additional input-output example. Thus, a branch selection strategy based on the predictions of f must balance a trade-off of performance and generalization. Selecting too few branches (a single best branch in the extreme case) risks committing to an incorrect path early in the search process and producing a suboptimal program or no program at all. Selecting too many branches (all n branches in the extreme case) is no different from baseline PROSE and fails to exploit the predictions of f to improve its performance. Formally, a controller for branch selection at a symbol N := F1 (. . .) | . . . | Fn (. . .) targeting k best programs must (a) predict the expected score of the best program from each program set: — 92 — 6 Published as a conference paper at ICLR 2018 function T HRESHOLD BASED(ϕ, h, k, s1 , . . . , sn ) ∗ 1: Result set S ← [] ∗ 2: i ← argmaxi si 3: for all 1 ≤ i ≤ n do 4: if |si − si∗ | ≤ θ then // Recursive search 5: S ∗ += L EARN(Fi , ϕ, k) 6: return the top k programs of S w.r.t. h Software Automation in the Big Data Era: function B N BBASED(ϕ, h, k, s1 , . . . , sn ) ∗ Challenges 1: Result set S ← []; Program and target k Opportunities ←k 2: 3: 4: 5: 6: 7: 8: Reorder Fi in the descending order of si for all 1 ≤ i ≤ n do Si ← L EARN(Fi , ϕ, k  ) // Recursive search j ← B INARY S EARCH(si+1 , Map(h, Si )) S ∗ = Si∗ ∪ Si [0..j]; k  ← k  − j if k  ≤ 0 then break return S ∗ Figure 5: The controllers for guiding the search process to construct a most generalizable ϕ-satisfying program set S of size k given the f -predicted best scores s1 , . . . , sn of the productions F1 , . . . , Fn . Given: DSL L, ranking function h, controller C from Figure 5 (T HRESHOLD BASED or B N BBASED), symbolic search algorithm L EARN(Production rule Γ , spec ϕ, target k) as in PROSE (Polozov & Gulwani, 2015, Figure 7) with all recursive calls to L EARN replaced with L EARN NGDS function L EARN NGDS(Symbol N := F1 (. . .) | . . . | Fn (. . .), spec ϕ, target number of programs k) 1: if n = 1 then return L EARN (F1 , ϕ, k) 2: Pick a score model f based on depth(N, L) 3: s1 , . . . , sn ← f (F1 , ϕ), . . . , f (Fn , ϕ) 4: return C(ϕ, h, k, s1 , . . . , sn ) Figure 6: Neural-guided deductive search over L, parameterized with a branch selection controller C. si = f (Fi , ϕ) ∀ 1 ≤ i ≤ n, and (b) use the predicted scores si to narrow down the set of productions F1 , . . . , Fn to explore and to obtain the overall result by selecting a subset of generated programs. In this work, we propose and evaluate two controllers. Their pseudocode is shown in Figure 5. Threshold-based: Fix a score threshold θ, and explore those branches whose predicted score differs by at most θ from the maximum predicted score. This is a simple extension of the naïve “argmax” controller discussed earlier that also explores any branches that are predicted “approximately as good as the best one”. When θ = 0, it reduces to the “argmax” one. Branch & Bound: This controller is based on the “branch & bound” technique in combinatorial optimization (Clausen, 1999). Assume the branches Fi are ordered in the descending order of their respective predicted scores si . 
After recursive learning produces its program set Si , the controller proceeds to the next branch only if si+1 exceeds the score of the worst program in Si . Moreover, it reduces the target number of programs to be learned, using si+1 as a lower bound on the scores of the programs in Si . That is, rather than relying blindly on the predicted scores, the controller guides the remaining search process by accounting for the actual synthesized programs as well. 3.3 N EURAL -G UIDED D EDUCTIVE S EARCH We now combine the above components to present our unified algorithm for program synthesis. It builds upon the deductive search of the PROSE system, which uses symbolic PL insights in the form of witness functions to construct and narrow down the search space, and a ranking function h to pick the most generalizable program from the found set of spec-satisfying ones. However, it significantly speeds up the search process by guiding it a priori at each branching decision using the learned score model f and a branch selection controller, outlined in Sections 3.1 and 3.2. The resulting neural-guided deductive search (NGDS) keeps the symbolic insights that construct the search tree ensuring correctness of the found programs, but explores only those branches of this tree that are likely to produce the user-intended generalizable program, thus eliminating unproductive search time. A key idea in NGDS is that the score prediction model f does not have to be the same for all decisions in the search process. It is possible to train separate models for different DSL levels, symbols, or even productions. This allows the model to use different features of the input-output spec for evaluating the fitness of different productions, and also leads to much simpler supervised learning problems. Figure 6 shows the pseudocode of NGDS. It builds upon the deductive search of PROSE, but augments every branching decision on a symbol with some branch selection controller from Section 3.2. We present a comprehensive evaluation of different strategies in Section 4. 7 — 93 — Published as a conference paper at ICLR 2018 2018 Metric Proceedings PROSE Meeting DC1 DC2 DC3 RF1 RF2 RF3 NGDS Accuracy (% of 73) Speed-up (× PROSE) 35.81 1.82 47.38 1.53 62.92 1.42 24.53 0.25 39.72 0.27 56.41 0.30 68.49 1.67 67.12 1.00 Table 1: Accuracy and average speed-up of NGDS vs. baseline methods. Accuracies are computed on a test set of 73 tasks. Speed-up of a method is the geometric mean of its per-task speed-up (ratio of synthesis time of PROSE and of the method) when restricted to a subset of tasks with PROSE’s synthesis time is ≥ 0.5 sec. 4 E VALUATION In this section, we evaluate our NGDS algorithm over the string manipulation domain with a DSL given by Figure 2; see Figure 1 for an example task. We evaluate NGDS, its ablations, and baseline techniques on two key metrics: (a) generalization accuracy on unseen inputs, (b) synthesis time. Dataset. We use a dataset of 375 tasks collected from real-world customer string manipulation problems, split into 65% training, 15% validation, and 20% test data. Some of the common applications found in our dataset include date/time formatting, manipulating addresses, modifying names, automatically generating email IDs, etc. Each task contains about 10 inputs, of which only one is provided as the spec to the synthesis system, mimicking industrial applications. The remaining unseen examples are used to evaluate generalization performance of the synthesized programs. 
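A minimal sketch of that evaluation protocol (illustrative code, not the paper's implementation; `synthesize` and the toy tasks below are placeholders): a task counts as solved only if the program synthesized from the first example matches every held-out input-output pair.

```python
from typing import Callable, List, Tuple

Task = List[Tuple[str, str]]   # ordered (input, output) pairs; the first pair is the spec

def generalization_accuracy(tasks: List[Task],
                            synthesize: Callable[[str, str], Callable[[str], str]]) -> float:
    """Fraction of tasks whose synthesized program matches all held-out examples."""
    solved = 0
    for task in tasks:
        spec_in, spec_out = task[0]                      # only one example is shown to the engine
        program = synthesize(spec_in, spec_out)
        if all(program(i) == o for i, o in task[1:]):    # judged on the unseen examples
            solved += 1
    return solved / len(tasks)

def toy_engine(spec_in: str, spec_out: str) -> Callable[[str], str]:
    # A stand-in "synthesizer" that always guesses: take everything before the first comma.
    return lambda s: s.split(",")[0]

if __name__ == "__main__":
    tasks = [
        [("alpha,beta,charlie", "alpha"), ("x,y,z", "x")],   # the guess happens to work
        [("bob jones,42", "42"), ("ann lee,7", "7")],        # it does not generalize here
    ]
    print(generalization_accuracy(tasks, toy_engine))        # 0.5
```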
After running synthesis of top-1 programs with PROSE on all training tasks, we have collected a dataset of ≈ 400,000 intermediate search decisions, i.e. triples production Γ, spec ϕ, a posteriori best score h(P, ϕ). Baselines. We compare our method against two state-of-the-art neural synthesis algorithms: RobustFill (Devlin et al., 2017) and DeepCoder (Balog et al., 2017). For RobustFill, we use the best-performing Attention-C model and use their recommended DP-Beam Search with a beam size of 100 as it seems to perform the best; Table 3 in Appendix A presents results with different beam sizes. As in the original work, we select the top-1 program ranked according to the generated log-likelihood. DeepCoder is a generic framework that allows their neural predictions to be combined with any program synthesis method. So, for fair comparison, we combine DeepCoder’s predictions with PROSE. We train DeepCoder model to predict a distribution over L’s operators and as proposed, use it to guide PROSE synthesis. Since both RobustFill and DeepCoder are trained on randomly sampled programs and are not optimized for generalization in the real-world, we include their variants trained with 2 or 3 examples (denoted RFm and DCm ) for fairness, although m = 1 example is the most important scenario in real-life industrial usage. Ablations. As mentioned in Section 3, our novel usage of score predictors to guide the search enables us to have multiple prediction models and controllers at various stages of the synthesis process. Here we investigate ablations of our approach with models that specialize in predictions for individual levels in the search process. The model T1 is trained for symbol transf orm (Figure 2) when expanded in the first level. Similarly, P P , P OS refer to models trained for the pp and pos symbol, respectively. Finally, we train all our LSTM-based models with CNTK (Seide & Agarwal, 2016) using Adam (Kingma & Ba, 2014) with a learning rate of 10−2 and a batch size of 32, using early stopping on the validation loss to select the best performing model (thus, 100-600 epochs). We also evaluate three controllers: threshold-based (Thr) and branch-and-bound (BB) controllers given in Figure 5, and a combination of them – branch-and-bound with a 0.2 threshold predecessor (BB0.2 ). In Tables 1 and 2 we denote different model combinations as NGDS(f , C) where f is a symbol-based model and C is a controller. The final algorithm selection depends on its accuracyperformance trade-off. In Table 1, we use NGDS(T1 + P OS, BB), the best performing algorithm on the test set, although NGDS(T1 , BB) performs slightly better on the validation set. Evaluation Metrics. Generalization accuracy is the percentage of test tasks for which the generated program satisfies all unseen inputs in the task. Synthesis time is measured as the wall-clock time taken by a synthesis method to find the correct program, median over 5 runs. We run all the methods on the same machine with 2.3 GHz Intel Xeon processor, 64GB of RAM, and Windows Server 2016. Results. Table 1 presents generalization accuracy as well as synthesis time speed-up of various methods w.r.t. PROSE. As we strive to provide real-time synthesis, we only compare the times for tasks which require PROSE more than 0.5 sec. 
Note that, with one example, NGDS and PROSE are — 94 — 8 Published as a conference paper at ICLR 2018 Method PROSE NGDS(T1 , Thr) NGDS(T1 , BB) NGDS(T1 , BB0.2 ) NGDS(T1 + P P , Thr) NGDS(T1 + P P , BB) NGDS(T1 + P P , BB0.2 ) NGDS(T1 + P OS, Thr) NGDS(T1 + P OS, BB) NGDS(T1 + P OS, BB0.2 ) Validation Software Automation in the Big Data Era: Test Challenges%and Opportunities of branches Accuracy Speed-up Accuracy Speed-up 70.21 59.57 63.83 61.70 59.57 61.70 61.70 61.70 63.83 63.83 1 1.15 1.58 1.03 0.76 1.05 0.72 1.19 1.13 1.19 67.12 67.12 68.49 67.12 67.12 72.60 67.12 67.12 68.49 67.12 1 1.27 1.22 1.22 0.97 0.89 0.86 1.93 1.67 1.73 100.00 62.72 51.78 63.16 56.41 50.22 56.43 55.63 50.44 55.73 Table 2: Accuracies, mean speed-ups, and % of branches taken for different ablations of NGDS. significantly more accurate than RobustFill and DeepCoder. This is natural as those methods are not trained to optimize generalization, but it also highlights advantage of a close integration with a symbolic system (PROSE) that incorporates deep domain knowledge. Moreover, on an average, our method saves more than 50% of synthesis time over PROSE. While DeepCoder with one example speeds up the synthesis even more, it does so at the expense of accuracy, eliminating branches with correct programs in 65% of tasks. Table 2 presents speed-up obtained by variations of our models and controllers. In addition to generalization accuracy and synthesis speed-up, we also show a fraction of branches that were selected for exploration by the controller. Our method obtains impressive speed-up of > 1.5× in 22 cases. One such test case where we obtain 12× speedup is a simple extraction case which is fairly common in Web mining: {“alpha,beta,charlie,delta”  “alpha”}. For such cases, our model determine transf orm := atom to be the correct branch (that leads to the final Substring based program) and hence saves time required to explore the entire Concat operator which is expensive. Another interesting test case where we observe 2.7× speed-up is: {“457 124th St S, Seattle, WA 98111”  “Seattle-WA”}. This test case involves learning a Concat operator initially followed by Substring and RegexPosition operator. Appendix B includes a comprehensive table of NGDS performance on all the validation and test tasks. All the models in Table 2 run without attention. As measured by score flip accuracies (i.e. percentage of correct orderings of branch scores on the same level), attention-based models perform best, achieving 99.57/90.4/96.4% accuracy on train/validation/test, respectively (as compared to 96.09/91.24/91.12% for non-attention models). However, an attention-based model is significantly more computationally expensive at prediction time. Evaluating it dominates the synthesis time and eliminates any potential speed-ups. Thus, we decided to forgo attention in initial NGDS and investigate model compression/binarization in future work. Error Analysis. As Appendix B shows, NGDS is slower than PROSE on some tasks. This occurs when the predictions do not satisfy the constraints of the controller i.e. all the predicted scores are within the threshold or they violate the actual scores during B&B exploration. This leads to NGDS evaluating the LSTM for branches that were previously pruned. This is especially harmful when branches pruned out at the very beginning of the search need to be reconsidered – as it could lead to evaluating the neural network many times. 
While a single evaluation of the network is quick, a search tree involves many evaluations, and when performance of PROSE is already < 1 s, this results in considerable relative slowdown. We provide two examples to illustrate both the failure modes: (a) “41.7114830017,-91.41233825683,41.60762786865,-91.63739013671”  “41.7114830017”. The intended program is a simple substring extraction. However, at depth 1, the predicted score of Concat is much higher than the predicted score of Atom, and thus NGDS explores only the Concat branch. The found Concat program is incorrect because it uses absolute position indexes and does not generalize to other similar extraction tasks. We found this scenario common with punctuation in the output string, which the model considers a strong signal for Concat. (b) “type size = 36: Bartok.Analysis.CallGraphNode type size = 32: Bartok.Analysis.CallGraphNode CallGraphNode”  “36->32”. In this case, NGDS correctly explores only the Concat branch, but the slowdown happens at the pos symbol. 9 — 95 — Published as a conference paper at ICLR 2018 2018 There are many different logics to extract the “36” and “32” substrings. NGDS explores the Meeting Proceedings RelativePosition branch first, but the score of the resulting program is less then the prediction for RegexPositionRelative. Thus, the B&B controller explores both branches anyway, which leads to a relative slowdown caused by the network evaluation time. 5 R ELATED W ORK Neural Program Induction systems synthesize a program by training a new neural network model to map the example inputs to example outputs (Graves et al., 2014; Reed & De Freitas, 2016; Zaremba et al., 2016). Examples include Neural Turing Machines (Graves et al., 2014) that can learn simple programs like copying/sorting, work of Kaiser & Sutskever (2015) that can perform more complex computations like binary multiplications, and more recent work of Cai et al. (2017) that can incorporate recursions. While we are interested in ultimately producing the right output, all these models need to be re-trained for a given problem type, thus making them unsuitable for real-life synthesis of different programs with few examples. Neural Program Synthesis systems synthesize a program in a given L with a pre-learned neural network. Seminal works of Bosnjak et al. (2017) and Gaunt et al. (2016) proposed first producing a high-level sketch of the program using procedural knowledge, and then synthesizing the program by combining the sketch with a neural or enumerative synthesis engine. In contrast, R3NN (Parisotto et al., 2016) and RobustFill (Devlin et al., 2017) systems synthesize the program end-to-end using a neural network; Devlin et al. (2017) show that RobustFill in fact outperforms R3NN. However, RobustFill does not guarantee generation of spec-satisfying programs and often requires more than one example to find the intended program. In fact, our empirical evaluation (Section 4) shows that our hybrid synthesis approach significantly outperforms the purely statistical approach of RobustFill. DeepCoder (Balog et al., 2017) is also a hybrid synthesis system that guides enumerative program synthesis by prioritizing DSL operators according to a spec-driven likelihood distribution on the same. 
However, NGDS differs from DeepCoder in two important ways: (a) it guides the search process at each recursive level in a top-down, goal-oriented enumeration and thus reshapes the search tree, and (b) it is trained on real-world data instead of random programs, thus achieving better generalization.

Symbolic Program Synthesis has been studied extensively in the PL community (Gulwani et al., 2017; Alur et al., 2013), dating back as far as the 1960s (Waldinger & Lee, 1969). Most approaches employ either bottom-up enumerative search (Udupa et al., 2013), constraint solving (Torlak & Bodik, 2013), or inductive logic programming (Lin et al., 2014), and thus scale poorly to real-world industrial applications (e.g., data wrangling). In this work, we build upon deductive search, first studied for synthesis by Manna & Waldinger (1971) and primarily used for program synthesis from formal logical specifications (Puschel et al., 2005; Chaudhari & Damani, 2015). Gulwani (2011) and later Polozov & Gulwani (2015) used it to build PROSE, a commercially successful domain-agnostic system for PBE. While its deductive search guarantees program correctness and also good generalization via an accurate ranking function, it still takes several seconds on complex tasks. Thus, speeding up deductive search requires considerable engineering to develop manual heuristics. NGDS instead integrates neural-driven predictions at each level of deductive search to alleviate this drawback. The work of Loos et al. (2017) is the closest in technique, but it is applied to an automated theorem prover and hence need not be concerned with generalization. In contrast, NGDS guides the search toward generalizable programs while relying on the underlying symbolic engine to generate correct programs.

6 Conclusion

We studied the problem of real-time program synthesis with a small number of input-output examples. For this problem, we proposed a neural-guided system that builds upon PROSE, a state-of-the-art symbolic logic based system. Our system avoids the exhaustive top-down enumerative grammar exploration required by PROSE, thus providing impressive synthesis performance while still retaining the key advantages of a deductive system. That is, compared to existing neural synthesis techniques, our system enjoys the following advantages: a) correctness: programs generated by our system are guaranteed to satisfy the given input-output specification; b) generalization: our system learns the user-intended program with just one input-output example in around 60% of test cases, while existing neural systems learn such a program in only 16% of test cases; c) synthesis time: our system can solve most of the test cases in less than 0.1 s and provides impressive performance gains over both neural and symbolic systems. The key take-home message of this work is that a deep integration of a symbolic deductive inference system with statistical techniques leads to the best of both worlds: we can avoid the extensive engineering effort required by symbolic systems without compromising the quality of the generated programs, while at the same time providing significant gains in performance (measured as synthesis time). For future work, exploring better learning models for production rule selection and applying our technique to more diverse and more powerful grammars are important research directions.
R EFERENCES Rajeev Alur, Rastislav Bodík, Garvit Juniwal, Milo M. K. Martin, Mukund Raghothaman, Sanjit A. Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. Syntaxguided synthesis. In Formal Methods in Computer-Aided Design (FMCAD), pp. 1–8, 2013. Matej Balog, Alexander L Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. DeepCoder: Learning to write programs. In International Conference on Learning Representations (ICLR), 2017. Matko Bosnjak, Tim Rocktäschel, Jason Naradowsky, and Sebastian Riedel. Programming with a differentiable Forth interpreter. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 547–556, 2017. Jonathon Cai, Richard Shin, and Dawn Song. Making neural programming architectures generalize via recursion. In International Conference on Learning Representations (ICLR), 2017. Dipak L Chaudhari and Om Damani. Combining top-down and bottom-up techniques in program derivation. In International Symposium on Logic-Based Program Synthesis and Transformation, pp. 244–258. Springer, 2015. Jens Clausen. Branch and bound algorithms – principles and examples. Department of Computer Science, University of Copenhagen, 1999. Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. RobustFill: Neural program learning under noisy I/O. In International Conference on Machine Learning (ICML), 2017. Kevin Ellis and Sumit Gulwani. Learning to learn programs from examples: Going beyond program structure. In International Joint Conference on Artifical Intelligence (IJCAI), 2017. Alexander L Gaunt, Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, and Daniel Tarlow. TerpreT: A probabilistic programming language for program induction. CoRR, abs/1608.04428, 2016. URL http://arxiv.org/abs/1608.04428. Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. CoRR, abs/1410.5401, 2014. URL http://arxiv.org/abs/1410.5401. Sumit Gulwani. Automating string processing in spreadsheets using input-output examples. In Principles of Programming Languages (POPL), volume 46, pp. 317–330, 2011. Sumit Gulwani and Prateek Jain. Programming by examples: Pl meets ml. In Asian Symposium on Programming Languages and Systems, pp. 3–20. Springer, 2017. Sumit Gulwani, Oleksandr Polozov, and Rishabh Singh. Program synthesis. Foundations and Trends in Programming Languages, 4(1-2):1–119, 2017. doi: 10.1561/2500000010. URL https: //doi.org/10.1561/2500000010. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735– 1780, November 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL http://dx. doi.org/10.1162/neco.1997.9.8.1735. Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. CoRR, abs/1511.08228, 2015. URL http://arxiv.org/abs/1511.08228. 11 — 97 — Published as a conference paper at ICLR 2018 2018 Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Meeting Proceedings Conference on Learning Representations (ICLR), 2014. Vu Le and Sumit Gulwani. FlashExtract: A framework for data extraction by examples. In ACM SIGPLAN Notices, volume 49, pp. 542–553. ACM, 2014. Dianhuan Lin, Eyal Dechter, Kevin Ellis, Joshua Tenenbaum, and Stephen Muggleton. Bias reformulation for one-shot function induction. In Proceedings of the Twenty-first European Conference on Artificial Intelligence, pp. 525–530. IOS Press, 2014. Sarah M. 
Loos, Geoffrey Irving, Christian Szegedy, and Cezary Kaliszyk. Deep network guided proof search. In LPAR-21, 21st International Conference on Logic for Programming, Artificial Intelligence and Reasoning, Maun, Botswana, 7-12th May 2017, pp. 85–105, 2017. Zohar Manna and Richard J. Waldinger. Toward automatic program synthesis. Communications of the ACM, 14(3):151–165, 1971. Emilio Parisotto, Abdel-rahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet Kohli. Neuro-symbolic program synthesis. In International Conference on Learning Representations (ICLR), 2016. Oleksandr Polozov and Sumit Gulwani. FlashMeta: A framework for inductive program synthesis. In International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), pp. 107–126, 2015. Markus Puschel, José MF Moura, Jeremy R Johnson, David Padua, Manuela M Veloso, Bryan W Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, et al. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, 93(2):232–275, 2005. Scott Reed and Nando De Freitas. Neural programmer-interpreters. In International Conference on Learning Representations (ICLR), 2016. Reudismam Rolim, Gustavo Soares, Loris D’Antoni, Oleksandr Polozov, Sumit Gulwani, Rohit Gheyi, Ryo Suzuki, and Björn Hartmann. Learning syntactic program transformations from examples. In International Conference on Software Engineering (ICSE), pp. 404–415, 2017. Frank Seide and Amit Agarwal. CNTK: Microsoft’s open-source deep-learning toolkit. In International Conference on Knowledge Discovery and Data Mining (KDD), pp. 2135–2135, 2016. Rishabh Singh and Sumit Gulwani. Predicting a correct program in programming by example. In Computer-Aided Verification (CAV), 2015. Emina Torlak and Rastislav Bodik. Growing solver-aided languages with Rosette. In Proceedings of the 2013 ACM international symposium on New ideas, new paradigms, and reflections on programming & software, pp. 135–152. ACM, 2013. Abhishek Udupa, Arun Raghavan, Jyotirmoy V. Deshmukh, Sela Mador-Haim, Milo M.K. Martin, and Rajeev Alur. TRANSIT: Specifying protocols with concolic snippets. In Programming Languages Design and Implementation (PLDI), pp. 287–296, 2013. Richard J Waldinger and Richard CT Lee. PROW: A step toward automatic program writing. In International Joint Conference on Artificial Intelligence (IJCAI), pp. 241–252, 1969. Wojciech Zaremba, Tomas Mikolov, Armand Joulin, and Rob Fergus. Learning simple algorithms from examples. In International Conference on Machine Learning (ICML), 2016. — 98 — 12 Published as a conference paper at ICLR 2018 A Software Automation in the Big Data Era: ROBUST F ILL P ERFORMANCE WITH D IFFERENT B EAM S IZES Challenges and Opportunities For our experiments, we implemented RobustFill with the beam size of 100, as it presented a good trade-off between generalization accuracy and performance hit. The following table shows a detailed comparison of RobustFill’s generalization accuracy and performance for different beam sizes and numbers of training examples. Number of examples (m) Beam size %) Accuracy (% Speed-up (× PROSE) 1 10 100 1000 12.4 16.4 17.8 0.45 0.26 0.04 2 10 100 1000 19.2 26.0 28.7 0.47 0.27 0.04 3 10 100 1000 30.1 35.6 39.7 0.53 0.31 0.05 Table 3: Generalization accuracy and performance of RobustFill for different beam sizes and numbers of training examples. 
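The accuracy/latency trade-off in Table 3 reflects how beam decoding scales: the decoder keeps the k highest-scoring partial programs at every step, so the work per step grows roughly linearly with the beam size. The sketch below is a generic beam search over an abstract scoring callback, not RobustFill's actual decoder; score_extensions is assumed to return scored one-token extensions of a partial program.

def beam_search(score_extensions, beam_size, max_len):
    # Each hypothesis is a (tokens, cumulative_log_prob) pair.
    beam = [((), 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beam:
            # score_extensions returns [(next_token, token_log_prob), ...]
            for token, token_logp in score_extensions(tokens):
                candidates.append((tokens + (token,), logp + token_logp))
        if not candidates:
            break
        # Keep only the beam_size best partial programs; the per-step cost
        # scales with beam_size.
        beam = sorted(candidates, key=lambda c: -c[1])[:beam_size]
    return beam

Raising the beam size from 10 to 1000 therefore buys a few points of generalization accuracy at roughly a tenfold cost in decoding time, which is the pattern visible in Table 3.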
B Performance of Best NGDS Model on All Non-Training Tasks

[Table omitted from this reproduction: for each of the 120 validation and test tasks it lists the task number, whether it is a validation or test task, the PROSE synthesis time (s), the NGDS synthesis time (s), the resulting speed-up, and whether PROSE and NGDS each produced a correct program. 42 of the 120 tasks show a speed-up greater than 1x, with speed-ups ranging from 12.85x at the high end down to 0.22x at the low end.]

C ML-Based Ranker

As noted in Section 2, learning a ranking function is an interesting problem in itself and is orthogonal to our work. Since our method can be used along with any accurate ranking function, we assume black-box access to such a high-quality ranker and, specifically, use the state-of-the-art ranking function of PROSE, which involves a significant amount of hand engineering. In this section, we evaluate the performance of our method and of PROSE when employing a competitive ranker learned in a data-driven manner (Gulwani & Jain, 2017). From the table below, it can be observed that when using an ML-based ranking function, our method achieves an average speed-up of roughly 2x over PROSE while maintaining comparable generalization accuracy.

Metric | PROSE | NGDS(T1, BB) | NGDS(T1 + POS, BB)
Accuracy (% of 73) | 65.75 | 65.75 | 64.38
Speed-up (x PROSE) | 1.00 | 2.15 | 2.46

Table 5: Generalization accuracy and speed-up of NGDS variants vs. PROSE, where all methods use a machine-learning-based ranking function from Gulwani & Jain (2017).

2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE)

BugSwarm: Mining and Continuously Growing a Dataset of Reproducible Failures and Fixes

Naji Dmeiri†*, David A. Tomassi†*, Yichen Wang†, Antara Bhowmick†, Yen-Chuan Liu†, Premkumar T.
Devanbu† , Bogdan Vasilescu‡ , Cindy Rubio-González† † University of California, Davis {nddmeiri, datomassi, eycwang, abhowmick, yclliu, ptdevanbu, crubio}@ucdavis.edu ‡ Carnegie Mellon University vasilescu@cmu.edu Abstract—Fault-detection, localization, and repair methods are vital to software quality; but it is difficult to evaluate their generality, applicability, and current effectiveness. Large, diverse, realistic datasets of durably-reproducible faults and fixes are vital to good experimental evaluation of approaches to software quality, but they are difficult and expensive to assemble and keep current. Modern continuous-integration (CI) approaches, like T RAVIS -CI, which are widely used, fully configurable, and executed within custom-built containers, promise a path toward much larger defect datasets. If we can identify and archive failing and subsequent passing runs, the containers will provide a substantial assurance of durable future reproducibility of build and test. Several obstacles, however, must be overcome to make this a practical reality. We describe B UG S WARM, a toolset that navigates these obstacles to enable the creation of a scalable, diverse, realistic, continuously growing set of durably reproducible failing and passing versions of real-world, open-source systems. The B UG S WARM toolkit has already gathered 3,091 fail-pass pairs, in Java and Python, all packaged within fully reproducible containers. Furthermore, the toolkit can be run periodically to detect fail-pass activities, thus growing the dataset continually. Index Terms—Bug Database, Reproducibility, Software Testing, Program Analysis, Experiment Infrastructure I. I NTRODUCTION Software defects have major impacts on the economy, on safety, and on the quality of life. Diagnosis and repair of software defects consumes a great deal of time and money. Defects can be treated more effectively, or avoided, by studying past defects and their repairs. Several software engineering subfields, e.g., program analysis, testing, and automatic program repair, are dedicated to developing tools, models, and methods for finding and repairing defects. These approaches, ideally, should be evaluated on realistic, up-to-date datasets of defects so that potential users have an idea of how well they work. Such datasets should contain fail-pass pairs, consisting of a failing version, which may include a test set that exposes the failure, and a passing version including changes that repair it. Given this, researchers can evaluate the effectiveness of tools that perform fault detection, localization (static or dynamic), or fault repair. Thus, research progress is intimately dependent on high-quality datasets of fail-pass pairs. There are several desirable properties of these datasets of fail-pass pairs. First, scale: enough data to attain statistical * Both authors contributed equally and are ordered alphabetically. 1558-1225/19/$31.00 ©2019 IEEE DOI 10.1109/ICSE.2019.00048 — 102 — significance on tool evaluations. Second, diversity: enough variability in the data to control for factors such as project scale, maturity, domain, language, defect severity, age, etc., while still retaining enough sample size for sufficient experimental power. Third, realism: defects reflecting actual fixes made by real-world programmers to repair real mistakes. 
Fourth, currency: a continuously updated defect dataset, keeping up with changes in languages, platforms, libraries, software function, etc., so that tools can be evaluated on bugs of current interest and relevance. Finally, and most crucially, defect data should be durably reproducible: defect data preserved in a way that supports durable build and behavior reproduction, robust to inevitable changes to libraries, languages, compilers, related dependencies, and even the operating system.1 Some hand-curated datasets (e.g., Siemens test suite [23], the SIR repository [21], Defects4J [24]) provide artifact collections to support controlled experimentation with program analysis and testing techniques. However, these collections are curated by hand, and are necessarily quite limited in scale and diversity; others incorporate small-sized student homeworks [25], which may not reflect development by professionals. Some of these repositories often rely on seeded faults; natural faults, from real programmers, would provide more realism. At time of creation, these are (or rather were) current. However, unless augmented through continuous and expensive manual labor, currency will erode. Finally, to the extent that they have dependencies on particular versions of libraries and operating systems, their future reproducibility is uncertain. The datasets cited above have incubated an impressive array of innovations and are well-recognized for their contribution to research progress. However, we believe that datasets of greater scale, diversity, realism, currency, and durability will lead to even greater progress. The ability to control for covariates, without sacrificing experimental power, will help toolbuilders and empirical researchers obtain results with greater discernment, external validity, and temporal stability. However, how can we build larger defect datasets without heavy manual labor? Finding specific defect occurrences, and creating recompilable and runnable versions of failing and passing software is difficult for all but the most trivial systems: besides 1 While it is impossible to guarantee this in perpetuity, we would like to have some designed-in resistance to change. 339 the source code, one may also need to gather specific versions of libraries, dependencies, operating systems, compilers, and other tools. This process requires a great deal of human effort. Unless this human effort can somehow be automated away, we cannot build large-scale, diverse, realistic datasets of reproducible defects that continually maintain currency. But how can we automate this effort? We believe that the DevOps- and OSS-led innovations in cloud-based continuous integration (CI) hold the key. CI services, like T RAVIS -CI [13], allow open-source projects to outsource integration testing. OSS projects, for various reasons, have need for continuous, automated integration testing. In addition, modern practices such as test-driven development have led to much greater abundance of automated tests. Every change to a project can be intensively and automatically tested off-site, on a cloud-based service; this can be done continually, across languages, dependencies, and runtime platforms. For example, typical G IT H UB projects require that each pull request (PR) be integration tested, and failures fixed, before being vetted or merged by integrators [22, 32]. 
In active projects, the resulting back-and-forth between PR contributors and project maintainers naturally creates many fail-pass pair records in the pull request history and overall project history. Two key technologies underlie this capability: efficient, customizable, container-based virtualization simplifies handling of complex dependencies, and scripted CI servers allows custom automation of build and test procedures. Project maintainers create scripts that define the test environment (platforms, dependencies, etc.) for their projects; using these scripts, the cloud-based CI services construct virtualized runtimes (typically D OCKER containers) to build and run the tests. The CI results are archived in ways amenable to mining and analysis. We exploit precisely these CI archives, and the CI technology, to create an automated, continuously growing, large-scale, diverse dataset of realistic and durably reproducible defects. In this paper, we present B UG S WARM, a CI harvesting toolkit, together with a large, growing dataset of durably reproducible defects. The toolkit enables maintaining currency and augmenting diversity. B UG S WARM exploits archived CI log records to create detailed artifacts, comprising buggy code versions, failing regression tests, and bug fixes. When a successive pair of commits, the first, whose CI log indicates a failed run, and the second, an immediately subsequent passing run, is found, B UG S WARM uses the project’s CI customization scripts to create an artifact: a fully containerized virtual environment, comprising both versions and scripts to gather all requisite tools, dependencies, platforms, OS, etc. B UG S WARM artifacts allow full build and test of pairs of failing/passing runs. Containerization allows these artifacts to be durably reproducible. The large scale and diversity of the projects using CI services allows B UG S WARM to also capture a large, growing, diverse, and current collection of artifacts. Specifically, we make the following contributions: • We present an approach that leverages CI to mine fail-pass pairs in open source projects and automatically attempts to reproduce these pairs in D OCKER containers (Section III). • We show that fail-pass pairs are frequently found in open Software Automation source projects and discuss the challenges in reproducing in the Big Data Era: such pairs (Section IV). • We provide the B UG S WARM dataset 3,091 artifacts, Challenges and of Opportunities for Java and Python, to our knowledge the largest, continuously expanding, durably reproducible dataset of failpass pairs, and describe the general characteristics of the B UG S WARM artifacts (Section IV).2 We provide background and further motivation for B UG S WARM in Section II. We describe limitations and future work in Section V. Finally, we discuss related work in Section VI and conclude in Section VII. II. BACKGROUND AND M OTIVATION Modern OSS development, with CI services, provides an enabling ecosystem of tools and data that support the creation of B UG S WARM. Here we describe the relevant components of this ecosystem and present a motivating example. A. The Open-Source CI Ecosystem G IT and G IT H UB. G IT is central to modern software development. Each project has a repository. Changes are added via a commit, which has a unique identifier, derived with a SHA-1 hash. The project history is a sequence of commits. G IT supports branching. The main development line is usually maintained in a branch called master. 
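As a small illustration of the commit history that is mined later in this paper, the sketch below lists a branch's history as a sequence of commit SHAs. It is not part of the BugSwarm toolkit; it only assumes that git is installed and that repo_dir points to a local clone.

import subprocess

def commit_history(repo_dir, branch="master"):
    # One SHA-1 per line, newest first; --first-parent follows the branch's
    # main line and skips the interior commits of merged branches.
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-list", "--first-parent", branch],
        check=True, capture_output=True, text=True)
    return result.stdout.split()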
G IT H UB is a webbased service hosting G IT repositories. G IT H UB offers forking capabilities, i.e., cloning a repository but maintaining the copy online. G IT H UB supports the pull request (PR) development model: project maintainers decide on a case-by-case basis whether to accept a change. Specifically, a potential contributor forks the original project, makes changes, and then opens a pull request. The maintainers review the PR (and may ask for additional changes) before the request is merged or rejected. T RAVIS -CI Continuous Integration. T RAVIS -CI is the most popular cloud-hosted CI service that integrates with G IT H UB; it can automatically build and test commits or PRs. T RAVIS CI is configured via settings in a .travis.yml file in the project repository, specifying all the environments in which the project should be tested. A T RAVIS -CI build can be initiated by a push event or a pull request event. A push event occurs when changes are pushed to a project’s remote repository on a branch monitored by T RAVIS -CI. A pull request event occurs when a PR is opened and when additional changes are committed to the PR. T RAVIS -CI builds run a separate job for each configuration specified in the .travis.yml file. The build is marked as “passed” when all its jobs pass. D OCKER. D OCKER is a lightweight virtual machine service that provides application isolation, immutability, and customization. An application can be packaged together with code, runtime, system tools, libraries, and OS into an immutable, stand-alone, custom-built, persistent D OCKER image (container), which can be run anytime, anywhere, on any platform that supports D OCKER. In late 2014, T RAVIS -CI 2 The B UG S WARM dataset is available at http://www.bugswarm.org. 340 — 103 — 2018 Meeting Proceedings Fig. 1: Lifecycle of a T RAVIS -CI-built and tested PR began running builds and tests inside D OCKER containers, each customized for a specific run, as specified in the T RAVIS CI .travis.yml files. T RAVIS -CI maintains some of its base images containing a minimal build environment. B UG S WARM harvests these containers to create the dataset. B. Leveraging T RAVIS -CI to Mine and Reproduce Bugs We exploit T RAVIS -CI to create B UG S WARM. Figure 1 depicts the lifecycle of a T RAVIS -CI-built and tested PR. A contributor forks the repository and adds three commits, up to prV1; she then opens a PR, asking that her changes be merged into the original repository. The creation of the PR triggers T RAVIS -CI, which checks whether there are merge conflicts between the PR branch and master when the PR was opened (prV1 and baseV1). If not, T RAVIS -CI creates a temporary branch from the base branch, into which the PR branch is merged to yield temp1. This merge is also referred to as a “phantom” merge because it disappears from the G IT history after some time.3 T RAVIS -CI then generates build scripts from the .travis.yml file and initiates a build, i.e., runs the scripts to compile, build, and test the project. In our example, test failures cause the first build to fail; T RAVIS -CI notifies the contributor and project maintainers, as represented by the dashed arrows in Figure 1. The contributor does her fix and updates the PR with a new commit, which triggers a new build. Again, T RAVIS -CI creates the merge between the PR branch (now at prV2) and the base branch (still at baseV1) to yield temp2. The build fails again; apparently the fix was no good. 
Consequently, the contributor updates the PR by adding a new commit, prV3. A T RAVIS CI build is triggered in which the merge (temp3) between the PR branch (at prV3) and the base branch (now at baseV2) is tested.4 This time, the build passes, and the PR is accepted and merged into the base branch. Each commit is recorded in version control, archiving source code at build-time plus the full build configuration (.travis.yml file). T RAVIS -CI records how each build fared (pass or fail) and archives a build log containing output of the build and test process, including the names of any failing tests. Our core idea is that T RAVIS -CI-built and tested pull 3 “Phantom” merges present special challenges, which are discussed later. 4 T RAVIS -CI creates each phantom merge on a separate temporary branch, but Figure 1 shows the phantom merges on a single branch for simplicity. requests (and regular commits) from G IT H UB, available in large volumes for a variety of languages and platforms, can be used to construct fail-pass pairs. In our example, the version of the code represented by the merge temp2 is “defective,” as documented by test failures in the corresponding T RAVIS -CI build log. The subsequently “fixed” version (no test failures in the build log) is represented by temp3. Therefore, we can extract (1) a failing program version; (2) a subsequent, fixed program version; (3) the fix, i.e., the difference between the two versions; (4) the names of failing tests from the failed build log; (5) a full description of the build configuration. Since each T RAVIS -CI job occurs within a D OCKER container, we can re-capture that specific container image, thus rendering the event durably reproducible. Furthermore, if one could build an automated harvesting system that could continually mine T RAVIS -CI builds and create D OCKER images that could persist these failures and fixes, this promises a way to create a dataset to provide all of our desired data: G IT H UBlevel scale; G IT H UB-level diversity; realism of popular OSS projects; currency via the ability to automatically and periodically augment our dataset with recent events, and finally durable reproducibility via D OCKER images. III. B UG S WARM I NFRASTRUCTURE A. Some Terminology A project’s build history refers to all T RAVIS -CI builds previously triggered. A build may include many jobs; for example, a build for a Python project might include separate jobs to test with Python versions 2.6, 2.7, 3.0, etc. A commit pair is a 2-tuple of G IT commit SHAs that each triggered a T RAVIS -CI build in the same build history. The canonical commit pair consists of a commit whose build fails the tests followed by a fix commit whose build passes the tests. The terms build pair and job pair refer to a 2-tuple of T RAVIS -CI builds or jobs, respectively, from a project’s build history. For a given build, the trigger commit is the commit that, when pushed to the remote repository, caused T RAVIS -CI to start a build. B UG S WARM has four components: PAIR M INER, PAIR F ILTER, R EPRODUCER, and A NALYZER. These components form the pipeline that curates B UG S WARM artifacts and are designed to be relatively independent and general. This section describes the responsibilities and implementation of each component, and a set of supporting tools that facilitate usage of the dataset. B. 
Design Challenges The tooling infrastructure is designed to handle certain specific challenges, listed below, that arise when one seeks to continuously and automatically mine T RAVIS -CI. In each case, we list the tools that actually address the challenges. Pair coherence. Consecutive commits in a G IT history may not correspond to consecutive T RAVIS -CI builds. A build history, which T RAVIS -CI retains as a linear series of builds, must be traversed and transformed into a directed graph so that pairs 341 — 104 — Algorithm 1: PAIR M INER Algorithm Algorithm 2: AssignCommits Algorithm Input: Project slug P Output: Set J of fail-pass job pairs (jf , jp ) 1 J = ∅ B = the list of T RAVIS -CI builds for P ; 2 G = {g | g ⊆ B and ∀b ∈ g belong to the same branch/PR}; 3 foreach g in G do 4 Order the builds in g chronologically; 5 foreach bi ∈ g do 6 if bi is failed and bi+1 is passed then 7 AssignCommits(bi ); 8 AssignCommits(bi+1 ); 9 J = J ∪ {(jf , jp ) | jf ∈ bi and jp ∈ bi+1 and jf has the same configuration as jp }; Input: T RAVIS -CI build B in the Big Data Era: 1 Mark B as “unavailable” by default; 2 Clone the GChallenges IT repository for B; and Opportunities 3 if B is triggered by a push event then 4 Assign trigger commit t from T RAVIS -CI build metadata; 5 if t in G IT history or t in G IT H UB archive then 6 mark B as “available”; 10 Software Automation else if B is triggered by a pull request event then Assign trigger commit t, base commit b, and merge commit m for B from T RAVIS -CI build metadata; 9 if t and b in G IT history or m in G IT H UB archive then 10 mark B as “available”; 7 8 return J of consecutive builds map to pairs of consecutive commits. G IT’s non-linear nature makes this non-trivial. (PAIR M INER) Commit recovery. To reproduce a build, one needs to find the trigger commit. There are several (sub-)challenges here. First, temporary merge commits like temp1,2,3 in Figure 1 are the ones we need to extract, but these are not retained by T RAVIS -CI. Second, G IT’s powerful history-rewriting capabilities allow commits to be erased from history; developers can and do collapse commits like prV1,2,3 into a single commit, thus frustrating the ability to recover the consequent phantom merge commits. (PAIR M INER, PAIR F ILTER) Image recovery. In principle, T RAVIS -CI creates and retains D OCKER images that allow re-creation of build and test events. In practice, these images are not always archived as expected and so must be reconstructed. (PAIR F ILTER, R EPRODUCER) Runtime recovery. Building a specific project version often requires satisfying a large number of software dependencies on tools, libraries, and frameworks; all or some of these may have to be “time-traveled” to an earlier version. (R EPRODUCER) Test flakiness. Even though T RAVIS -CI test behavior is theoretically recoverable via D OCKER images, tests may behave non-deterministically because of concurrency or environmental (e.g., external web service) changes. Such flaky tests lead to flaky builds, which both must be identified for appropriate use in experiments. (R EPRODUCER) Log analysis. Once a pair is recoverable, B UG S WARM tries to determine the exact nature of the failure from the logs, which are not well structured and have different formats for each language, build system, and test toolset combination. Thus the logs must be carefully analyzed to recover the nature of the failure and related metadata (e.g., raised exceptions, failed test names, etc.), so that the pair can be documented. (A NALYZER) C. 
Mining Fail-Pass Pairs PAIR M INER extracts from a project’s G IT and build histories a set of fail-pass job pairs (Algorithm 1). PAIR M INER takes as input a G IT H UB slug and produces a set of failpass job pairs annotated with trigger commit information for each job’s parent build. The PAIR M INER algorithm involves (1) delinearizing the project’s build history, (2) extracting failpass build pairs, (3) assigning commits to each pair, and (4) extracting fail-pass job pairs from each fail-pass build pair. Analyzing build history. PAIR M INER first downloads the project’s entire build history with the T RAVIS -CI API. For each build therein, PAIR M INER notes the branch and (if applicable) the pull request containing the trigger commit, PAIR M INER first resolves the build history into lists of builds that were triggered by commits on the same branch or pull request. PAIR M INER recovers the results of the build and its jobs (passed, failed, errored, or canceled), the .travis.yml configuration of each job, and the unique identifiers of the build and its jobs using the T RAVIS -CI API. Identifying fail-pass build pairs. Using the build and job identifiers, PAIR M INER finds consecutive pairs where the first build failed and the second passed. Builds are considered from all branches, including the main line and any perennials, and both merged and unmerged pull requests. Next, the triggering commits are found, and recovered from G IT history. Finding trigger commits. If the trigger commit was a push event, then PAIR M INER can find its SHA via the T RAVIS -CI API. For pull request triggers, we need to get the pull request and base branch head SHAs, and re-create the phantom merge. Unfortunately, neither the trigger commit nor the base commit are stored by T RAVIS -CI; recreating them is quite a challenge. Fortunately, the commit message of the phantom commit, which is stored by T RAVIS -CI, contains this information; we follow Beller et al. [17] to extract this information. This approach is incomplete but is the best available. T RAVIS -CI creates temporary merges for pull-request builds. While temporary merges may no longer be directly accessible, the information for such builds (the head SHAs and base SHAs of the merges) are accessible through the G IT H UB API. We resort to G IT H UB archives to retrieve the code for the commits that are no longer in G IT history. Even if the trigger commit is recovered from the phantom merge commit, one problem remains: developers might squash together all commits in a pull request, thus erasing the constituent commits of the phantom merge right out of the G IT history. In addition, trigger commits for push event builds 342 — 105 — TABLE I: B UG S WARM’s main metadata attributes can sometimes also be removed from the G IT history by the project personnel. As a result, recreating this merge is not always possible; we later show the proportion for which we Meeting Proceedings were able to reset the repository to the commits in the failpass pairs. The two steps of phantom recovery—first finding the trigger commits and then ensuring that the versions are available in G IT H UB —are described in Algorithm 2. 2018 Extracting fail-pass job pairs. PAIR M INER now has a list of fail-pass build pairs for the project. As described in Section II, each build can have many jobs, one for each supported environment. A build fails if any one of its jobs fails and passes if all of its jobs pass. 
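Stripped of the commit-recovery details, the mining loop of Algorithm 1 can be sketched as follows. This is an illustrative rendering rather than the PairMiner source; it assumes each build record carries hypothetical fields branch, number, state, and jobs, where each job has a state and a config describing its environment.

from collections import defaultdict

def mine_fail_pass_job_pairs(builds):
    # Delinearize the build history: group builds by the branch or pull
    # request that triggered them.
    groups = defaultdict(list)
    for build in builds:
        groups[build["branch"]].append(build)

    job_pairs = []
    for group in groups.values():
        group.sort(key=lambda b: b["number"])           # chronological order
        for failed, passed in zip(group, group[1:]):    # consecutive builds
            if failed["state"] == "failed" and passed["state"] == "passed":
                # Match jobs that ran in the same configuration, so the
                # difference between them is a source change, not an
                # environment change.
                passed_by_config = {str(j["config"]): j for j in passed["jobs"]}
                for job in failed["jobs"]:
                    key = str(job["config"])
                    if job["state"] == "failed" and key in passed_by_config:
                        job_pairs.append((job, passed_by_config[key]))
    return job_pairs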
Given a failing build, PAIR M INER finds pairs of jobs, executed in the same environment, where the first failed and the second passed. Such a pair only occurs when a defective version was fixed via source code patches and not by changes in the execution environment (see Algorithm 1). D. Finding Essential Information for Reproduction Pairs identified by PAIR M INER must be assembled into reproducible containers. To stand a chance of reproducing a job, one must have access to, at a minimum, these essentials: (1) the state of the project at the time the job was executed and (2) the environment in which the job was executed. For each job in the pipeline, PAIR F ILTER checks that these essentials can be obtained. If the project state was deemed recoverable by PAIR M INER, PAIR F ILTER retrieves the original T RAVIS CI log of the job and extracts information about the execution environment. Using timestamps and instance names in the log, PAIR F ILTER determines if the job executed in a D OCKER container and, if so, whether the corresponding image is still accessible. If the log is unavailable, the job was run before T RAVIS -CI started using D OCKER, or the particular image is no longer publicly accessible, then the job is removed from the pipeline. E. Reproducing Fail-Pass Pairs R EPRODUCER checks if each job is durably reproducible. This takes several steps, described below. Generating the job script. travis-build,5 a component of T RAVIS -CI, produces a shell script from a .travis.yml file for running a T RAVIS -CI job. R EPRODUCER then alters the script to reference a specific past version of the project, rather than the latest. Matching the environment. To match the original job’s runtime environment, R EPRODUCER chooses from the set of T RAVIS -CI’s publicly available D OCKER images, from Quay and DockerHub, based on (1) the language of the project, as indicated by its .travis.yml configuration, and (2) a timestamp and instance name in the original job log that indicate when that image was built with D OCKER’s tools. Reverting the project. For project history reversion, R EPRO DUCER clones the project and resets its state using the trigger commit mined by PAIR M INER. If the trigger was on a pull 5 https://github.com/travis-ci/travis-build Attribute Type Project G IT H UB slug, primary language, build system, and test framework Reproducibility Total number of attempts, and number of successful attempts to reproduce pair Pull Request Pull request #, merge timestamp, and branch T RAVIS -CI Job T RAVIS -CI build ID, T RAVIS -CI job ID, number of executed and failed tests, names of the failed tests, trigger commit, and branch name Image Tag Unique image tag (simultaneously serves as a reference to a particular D OCKER image) request, R EPRODUCER re-creates the phantom merge commit using the trigger and base commits mined by PAIR M INER. If any necessary commits were not found during the mining process, R EPRODUCER downloads the desired state of the project directly from a zip archive maintained by G IT H UB.6 Finally, R EPRODUCER plants the state of the project inside the execution environment to reproduce the job. Reproducing the job. R EPRODUCER creates a new D OCKER image, as described in Section III-E, runs the generated job script, and saves the resulting output stream in a log file. R EPRODUCER can run multiple jobs in parallel. R EPRODUCER collects the output logs from all the jobs it attempts to reproduce and sends them to A NALYZER for parsing. F. 
Analyzing Results

Analyzer parses a Travis-CI build log to learn the status of the build (passed, failed, etc.) and the result of running the regression test suite. If there are failing tests, then Analyzer also retrieves their names. A challenge here is that the format of build logs varies substantially with the specific build system and test framework, so parsers must be specialized to each combination. For Java, we support the most popular build systems (Maven [8], Gradle [6], and Ant [1]) and test frameworks (JUnit [7] and testng [12]). For Python, we support the most popular test frameworks (unittest [15], unittest2 [16], nose [9], and pytest [10]). Analyzer has a top-level analyzer that retrieves all language-agnostic items, such as the operating system used for a build, and then delegates further log parsing to language-specific and build-system-specific analyzers that extract information related to running the regression test suite. The extracted attributes (number of tests passed, failed, and skipped; names of the failed tests, if any; build system; and test framework) are used to compare the original Travis-CI log and the reproduced log. If the attributes match, then we say the run is reproducible. Writing a new language-specific analyzer is relatively easy, mostly consisting of regular expressions that capture the output format of various test frameworks.

Footnote 6: GitHub allows one to download a zip archive of the entire project's file structure at a specific commit. Since this approach produces a standalone checkout of a project's history (without any of the Git data stores), Reproducer uses this archive only if a proper clone and reset is not possible.

TABLE II: Mined Fail-Pass Pairs

Push events:
Language | Failed Jobs | All Pairs | Available | Docker | w/Image
Java | 320,918 | 80,804 | 71,036 | 50,885 | 29,817
Python | 778,738 | 115,084 | 103,175 | 65,924 | 37,199
Grand Total | 1,099,656 | 195,888 | 174,211 | 116,809 | 67,016

Pull request events:
Language | Failed Jobs | All Pairs | Available | Docker | w/Image
Java | 250,349 | 63,167 | 24,877 | 20,407 | 9,509
Python | 1,190,186 | 188,735 | 62,545 | 45,878 | 24,740
Grand Total | 1,440,535 | 251,902 | 87,422 | 66,285 | 34,249

[Fig. 2: Frequency of Fail-Pass Pairs. (a) Percent of fail-pass pairs per language; (b) cumulative number of fail-pass pairs; (c) cumulative number of pairs w/Image.]

G. Tools for BugSwarm Users

BugSwarm includes tools to support tasks such as artifact selection, artifact retrieval, and artifact execution.

Artifact selection & retrieval. A given experiment may require artifacts meeting specific criteria. For this reason, each artifact includes metadata as described in Table I. The BugSwarm website provides an at-a-glance view of the metadata for all artifacts. Simple filtering can be done directly via the web interface. For more advanced filtering, we provide a REST API; a Python API is also available. To facilitate retrieval of artifact Docker images, we provide a BugSwarm command line interface that masks the complexities of the Docker ecosystem when using our artifacts.
Given any BugSwarm artifact identifier, the CLI can download the artifact image, start an interactive shell inside the container, and clean up the container after use (footnote 7: https://github.com/BugSwarm/client).

Artifact execution. A typical workflow for experiments with BugSwarm involves copying tools and scripts into a container, running jobs, and then copying out the results. We provide a framework to support this common artifact-processing workflow. The framework can be extended to fit users' specific needs. See the BugSwarm website for example applications.

IV. Experimental Evaluation

Our evaluation is designed to explore the feasibility of automatically creating a large-scale dataset of reproducible bugs and their corresponding fixes. In particular, we answer the following research questions:
RQ1: How often are fail-pass pairs found in OSS projects?
RQ2: What are the challenges in automatically reproducing fail-pass pairs?
RQ3: What are the characteristics of reproducible pairs?

The BugSwarm infrastructure is implemented in Python. Reproducer uses a modified version of the travis-build component from Travis-CI to translate .travis.yml files into shell scripts. The initial Java-specific Analyzer was ported to Python from TravisTorrent's [17] implementation in Ruby; Analyzer has been extended to support JUnit for Java and now also supports Python. BugSwarm requires that a project be hosted on GitHub and use Travis-CI. We randomly selected 335 projects among the 500 GitHub projects with the most Travis-CI builds, for each of Java and Python.

A. Mining Fail-Pass Pairs

We inspected a total of 10,179,558 jobs across 670 projects, of which 2,540,191 are failed jobs. We mined a total of 447,790 fail-pass pairs. As described in Section III-C, pairs can originate from push events or pull request events. Table II shows the breakdown: push events contribute 44% of the fail-pass pairs (195,888) and pull requests contribute 56% (251,902). Note that fail-pass pairs represent an under-approximation of the number of bug-fix commits; BugSwarm pairs do not capture commits that fix a bug whose build is not broken. We calculate the percentage of fail-pass pairs with respect to the total number of successful jobs (potential fixes to a bug) per project. Figure 2a plots a cumulative graph with the results. In general, we find that Java projects have a slightly higher percentage of fail-pass pairs (at most 33%) than Python projects (at most 20%). For example, there are 80 Java projects and 61 Python projects for which at least 10% of the passing jobs fix a build. Figure 2b plots the cumulative number of fail-pass pairs per project.

TABLE III: Reproduced Pairs (reproducible counts are given as fully reproducible + flaky)

Language | Pairs to Reproduce | w/Failed Test | w/Failed Job | Error-Pass | Total Pairs | Unreproducible | Pending
Java | 39,326 | 584 + 15 | 564 + 22 | 626 + 16 | 1,827 | 17,369 | 20,130
Python | 61,939 | 785 + 41 | 387 + 3 | 48 + 0 | 1,264 | 35,126 | 25,549
Grand Total | 101,265 | 1,425 | 976 | 690 | 3,091 | 52,495 | 45,679

[Fig. 3: Reproduced Pairs. (a) Cumulative percentage of reproduced pairs; (b) cumulative number of reproduced pairs; (c) breakdown of reproduced pairs into w/Failed Test, w/Failed Job, and Error-Pass.]
The Java and Python projects with the most pairs have 13,699 and 14,510 pairs, respectively. We run PAIR F ILTER to discard fail-pass pairs that are unlikely to be reproducible. Table II shows the number of failpass pairs after each filter is applied. Specifically, columns “Available” show the pairs we can reset to or which are archived, columns “D OCKER” show the number of remaining pairs that use a D OCKER image, and columns “w/Image” show the number of remaining pairs for which we can locate T RAVIS -CI base images. Figure 2c plots the cumulative number of w/Image pairs, which are passed to R EPRODUCER. A total of 220 Java projects and 233 Python projects have w/Image pairs. RQ1: At most 33% and 22% of all pairs of Java and Python projects, respectively, follow the fail-pass pattern (Figure 2). Among 670 projects, we find a total of 447,490 fail-pass pairs, from which 101,265 pairs may be reproducible. B. Reproducing Fail-Pass Pairs We successfully reproduced 3,091 out of 55,586 attempted pairs (45,679 pairs are pending reproduction due to time constraints). Recall from Section III-C that PAIR M INER mines job pairs. The corresponding number of reproducible unique build pairs is 1,837 (1,061 for Java and 776 for Python). The rest of the paper describes the results in terms of number of job pairs. The 3,091 artifacts belong to 108 Java projects and 52 Python projects. Table IV lists the 5 projects with the most artifacts for each language. We repeated the reproduction process 5 times for each pair to determine its stability. If the pair is reproducible all 5 times, then it is marked as “reproducible.” If the pair is reproduced only sometimes, then it is marked as “flaky.” Otherwise, the pair is said to be “unreproducible.” Numbers for each of these categories can be found in Table III. Figure 3a shows the cumulative percentage of reproduced pairs across projects. We achieve a 100% pair reproduction rate for 10 Java projects and 2 Python projects, at least 50% for 38 Java projects and 50 Python projects, and at least 1 pair is reproducible in 108 Java projects, and 52 Python projects. Figure 3b shows the cumulative number of reproduced pairs. The Java and Python projects with the most reproducible pairs have 361 and 171, respectively. We further classify “reproducible” and “flaky” pairs into three groups: (1) pairs that have failed tests, (2) pairs that do not have failed tests despite a failed build, and (3) pairs whose build finishes with an error. (1) and (2) are labeled failed and (3) errored. This naming convention is from T RAVIS CI [14] and is defined by the part of the job lifecycle that encounters a non-zero exit code. Typically, errored builds have dependency-related issues. Figure 3c shows the breakdown for both Java and Python. We find that 46.1%, 31.6%, and 22.3% of reproducible pairs correspond to each of the above categories, respectively. Surprisingly, only 97 pairs were “flaky.” We suspect a number of unreproducible pairs are indeed flaky but running them 5 times was not sufficient to identify them. We plan to investigate how to grow the number of flaky pairs in B UG S WARM. An initial direction could involve selecting pairs based on keywords in their commit messages (e.g., [27]). Among all the pairs that we attempted to reproduce, most were not reproducible. In other words, the log of the original job and the log produced by R EPRODUCER were different. 
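Concretely, "different" here means that the attributes Analyzer extracts from the two logs do not all agree. A minimal sketch of that comparison, with illustrative attribute names rather than the exact BugSwarm schema:

def same_test_outcome(original_attrs, reproduced_attrs):
    # Attributes extracted from a build log by the analyzer, for example:
    # {"tests_run": 412, "tests_failed": 2, "tests_skipped": 0,
    #  "failed_test_names": {"FooTest#bar"}, "build_system": "Maven"}
    keys = ("tests_run", "tests_failed", "tests_skipped",
            "failed_test_names", "build_system")
    return all(original_attrs.get(k) == reproduced_attrs.get(k) for k in keys)

A job pair is kept only if both the failing and the passing job reproduce the outcome recorded in their original logs, and only pairs that do so in all five repetitions are marked as reproducible rather than flaky.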
To gather information about the causes of unreproducibility, we randomly sampled 100 unreproducible job pairs and manually inspected their 200 logs (two logs per job pair). For this task, we also examined the corresponding 200 original logs produced by Travis-CI to compare the differences between logs and categorize the various sources of unreproducibility. As shown in Table V, we identified 6 sources of unreproducibility. Of the 200 jobs, around 30% are unreproducible due to missing or incompatible dependencies. Another 30% referenced stale URLs or experienced network issues. Exceptions from invoking travis-build when creating build scripts are responsible for another 20%. The rest of the jobs are unreproducible due to project-specific issues, failure to terminate within the time budget, or permission errors. Interestingly, 6 jobs are actually reproducible, but since the corresponding failed or passed job is not reproducible, the entire pair is marked as unreproducible. We have not included unreproducible pairs in this iteration of BugSwarm, but we think these could also be potentially useful to researchers interested in automatically fixing broken builds.

TABLE IV: Top Projects with Artifacts
Java project | # Pairs | Python project | # Pairs
raphw/byte-buddy | 361 | terasolunaorg/guideline | 171
checkstyle/checkstyle | 184 | scikit-learn/scikit-learn | 151
square/okhttp | 104 | numpy/numpy | 145
HubSpot/Baragon | 94 | python/mypy | 114
tananaev/traccar | 59 | marshallward/f90nml | 65

TABLE V: Sources of Unreproducibility
Reason | # Artifacts
Failed to install dependency | 59
URL no longer valid or network issue | 57
Travis-CI command issue | 38
Project-specific issue | 22
Reproducer did not finish | 14
Permission issue | 4
Total | 194

RQ2: Reproducing fail-pass pairs is indeed challenging, with a 5.56% success rate. Based on the manual inspection of 100 unreproducible artifacts, we identified 6 main reasons for unreproducibility, listed in Table V.

C. General Characteristics of BugSwarm Artifacts

We have aggregated statistics on various artifact characteristics. Figure 4a shows the number of artifacts with a given number of changes (additions or deletions). Inserting or removing a line counts as one change; modifying an existing line counts as two changes (an addition and a deletion). Commits with zero changes are possible but rare and are not included in Figure 4a. We report the number of changes of the fixed version with respect to the failing version of the code; e.g., 31% (844) of the artifacts have at most 5 changes and 54% (1,458) have at most 20. Figure 4b shows the number of artifacts with a given number of files changed; e.g., 46% (1,335) of the artifacts have at most 5 changed files. Figure 4c shows the artifacts with a given number of failing tests.

We find that our artifacts are diverse in several aspects: language, build system, test framework, and longevity. Table VI shows the number of reproducible and flaky artifacts for each of these categories.

TABLE VI: Diversity of Artifacts
Type | # Pairs
Language: Java | 1,827
Language: Python | 1,264
Longevity: 2015 | 790
Longevity: 2016 | 989
Longevity: 2017 | 807
Longevity: 2018 | 515
Build System: Maven | 1,675
Build System: Gradle | 86
Build System: Ant | 66
Test Framework: JUnit | 768
Test Framework: unittest | 665
Test Framework: Others | 1,415

The current dataset has over a thousand artifacts each for Java and Python, with a wide range of build systems and testing frameworks in use. Of these, the most common build system is Maven, with 1,675 artifacts, and the most common testing framework is JUnit, with 768.
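For the Maven/JUnit artifacts that dominate the dataset, the language-specific analyzer essentially reduces to regular expressions over the test summary lines in the log. A minimal sketch, assuming the standard Maven Surefire summary format (the real Analyzer handles many more formats and edge cases):

import re

# Matches Surefire/JUnit summary lines such as:
#   Tests run: 12, Failures: 1, Errors: 0, Skipped: 2
SUMMARY = re.compile(
    r"Tests run: (\d+), Failures: (\d+), Errors: (\d+), Skipped: (\d+)")

def summarize_maven_log(log_text):
    totals = {"run": 0, "failures": 0, "errors": 0, "skipped": 0}
    for match in SUMMARY.finditer(log_text):
        run, failures, errors, skipped = map(int, match.groups())
        totals["run"] += run
        totals["failures"] += failures
        totals["errors"] += errors
        totals["skipped"] += skipped
    return totals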
We plan to add support for other languages such as JavaScript and C++ in the near future, which will increase the number of build systems and testing frameworks being used.

Our artifacts represent a variety of software bugs given the diverse set of projects mined. To better understand the types of bugs in BugSwarm, we conduct a manual classification of 320 randomly sampled Maven-based Java artifacts, first described in [30]. The top 10 classification categories are shown in Figure 5a. The classification is not one-to-one; an artifact may fall under multiple categories depending on the bug. To correctly classify an artifact, we examine the source code, diff, commit message, and Travis-CI log. We find that the largest category is logic errors. Examples of logic errors include off-by-one errors and incorrect logical operations.

We also conduct an automatic higher-level classification of artifacts based on the encountered exceptions or runtime errors. We analyze the build logs and search for the names of Java exceptions and Python runtime errors. Figures 5b and 5c show the 10 exceptions/errors for which BugSwarm has the most artifacts. For example, 252 Java artifacts fail with a NullPointerException. An example is shown in Figure 6.

Using the BugSwarm framework presented in Section III-G, we successfully ran the code coverage tool Cobertura [3] and two static analyzers, Google's ErrorProne [5] and SpotBugs [11], on the 320 randomly selected artifacts used in the manual classification [30], with minimal effort.

RQ3: We investigated various characteristics of artifacts, such as the distribution in the size of the diff, location of the diff, and number of failing tests (Figure 4). We also examined the reason for failure (Figure 5). For example, 844 artifacts have between 1 and 5 changes, 1,335 artifacts modify a single file, 845 artifacts have 1 failing test, and the top reason for a build failure is an AssertionError.

Fig. 4: Artifact Characteristics. Histograms (y-axis: Number of Artifacts) of (a) Number of changes, (b) Number of files changed, and (c) Number of failing tests. (Bar data not reproduced.)

Fig. 5: Artifact Classification. Bar charts (x-axis: Number of Artifacts) of (a) Manual Classification of Java Bugs, (b) Most Frequent Java Exceptions, and (c) Most Frequent Python Errors. (Bar data not reproduced.)

protected void loadCommandVerificationSheet(
        SpaceSystem spaceSystem, String sheetName) {
    Sheet sheet = switchToSheet(sheetName, false);
+   if (sheet == null) return;
    int i = 1;
    while (i < sheet.getRows()) {
        // search for a new command definition
        ...
    }
}
Fig. 6: Example of NullPointerException bug and its fix.
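The automatic classification used for Figure 5 boils down to scanning failed build logs for exception and error names. The sketch below shows one way to do this; the regular expressions are illustrative assumptions rather than the exact classifier behind Figure 5.

import re
from collections import Counter

# Illustrative patterns for exception/error names; the BugSwarm
# classifier may use different heuristics.
JAVA_EXCEPTION = re.compile(r"\b([A-Z][A-Za-z0-9]*(?:Exception|Error))\b")
PYTHON_ERROR = re.compile(r"\b([A-Z][A-Za-z0-9]*Error)\b")

def classify_log(log_text, language):
    # Count the exception/error names mentioned in a build log.
    pattern = JAVA_EXCEPTION if language == "java" else PYTHON_ERROR
    return Counter(pattern.findall(log_text))

log = ("Tests run: 3, Failures: 1\n"
       "java.lang.NullPointerException at Foo.bar(Foo.java:12)")
print(classify_log(log, "java"))  # Counter({'NullPointerException': 1})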
D. Performance

PairMiner and Reproducer can be run in the cloud in parallel. The BugSwarm infrastructure provides support to run these as batch tasks on Microsoft Azure [2]. The running time of PairMiner depends on the number of failed jobs to examine, taking between a few minutes and a few hours. Reproduction time varies per project as it depends on the project's build time and the number of tests run. Mining and reproducing the pairs reported in this paper required about 60,000 hours of compute time in Azure. We will continue our effort to mine and reproduce pairs in additional projects.

V. LIMITATIONS AND FUTURE WORK

PairMiner searches for two consecutive failed and passed builds first and then looks for failed and passed job pairs within these two builds. However, failed and passed job pairs can occur between two consecutive failed builds because a build marked as failed requires only one unsuccessful job. In addition, the fail-pass pattern does not guarantee that the difference between the two commits is actually a fix for the failure; the supposed fix could simply delete or revert the buggy code or disable any failing tests. Using only the pattern, PairMiner would also fail to identify a fix for a failure if the fix is committed along with the test cases that expose the fail point. Finally, the fix may not be minimal. We plan to address some of these challenges in the future. In particular, we would like to explore other mining approaches that involve new patterns as well as bug reports. Note that Reproducer is already capable of reproducing any pair of commits that triggered Travis-CI builds, regardless of how these commits are gathered.

Reproducible artifacts may still break later on due to stale URLs, among other reasons. To keep BugSwarm up to date, we periodically test artifacts. We are currently exploring ways to make the artifacts more robust. In the future, we would like to crowdsource the maintainability of BugSwarm.

Thus far, our mining has been "blind." However, it is possible to extend our mining tools to find pairs with specific characteristics (e.g., pairs that have at most 5 changes and a single failed test caused by a NullPointerException). Such guided mining will allow BugSwarm to grow in directions of interest to the research community. Finally, we plan to extend BugSwarm to continuously monitor Travis-CI events for real-time mining and reproducing of new artifacts.
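Guided mining of the kind sketched in the preceding paragraph amounts to filtering mined pairs by their metadata. A minimal illustration follows; the dictionary fields are hypothetical and do not reflect BugSwarm's actual metadata schema.

def guided_filter(pairs, max_changes=5, max_failed_tests=1,
                  exception="NullPointerException"):
    # Keep only fail-pass pairs that match the requested characteristics.
    # Each pair is a dict with hypothetical metadata fields.
    return [p for p in pairs
            if p["num_changes"] <= max_changes
            and p["num_failed_tests"] <= max_failed_tests
            and exception in p["exceptions"]]

pairs = [
    {"num_changes": 3, "num_failed_tests": 1,
     "exceptions": ["NullPointerException"]},
    {"num_changes": 40, "num_failed_tests": 7,
     "exceptions": ["IOException"]},
]
print(len(guided_filter(pairs)))  # 1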
VI. RELATED WORK

Some other defect repositories aim to provide experimental benchmarks for defect location and repair. On the whole, these repositories do not exploit CI and virtualization mechanisms; they generally pre-date the widespread adoption of these techniques. They do not achieve the same scale, diversity, and currency and are not as durably reproducible. The Siemens test suite [23] (7 small C programs and about 130 manually seeded bugs) is among the earliest. BugBench [26] is one of the earliest datasets of real-world bugs. BugBench is limited in scale and diversity, consisting of 7 memory and concurrency bugs found across 10 C/C++ OS projects. Each buggy program version includes failing tests. BegBunch [18] contains two suites to measure the accuracy and scalability of bug detection tools for C. iBugs [19] is a dataset drawn from the 5-year history of the AspectJ compiler with 369 faulty versions of the project. iBugs provides metadata such as the number of methods and classes involved in the bug fix. Unlike BugSwarm, the above datasets were manually constructed. Metadata such as that included in iBugs could be built from BugSwarm artifacts with additional effort.

The Software-artifact Infrastructure Repository (SIR) [21] comprises source code, tests, and defects from OS projects along with needed infrastructure (e.g., automated build and test scripts). Currently, SIR consists of 85 projects in four languages, of which 64 (15 C, 1 C#, 1 C++, and 47 Java) include fault data: real ones; seeded ones; and a combination of real, seeded, and mutated. A project may contain multiple versions, and each version may contain multiple faults, with a total of 680 bugs. SIR provides a useful amount of scale and diversity while archiving sufficient tooling for durable reproducibility. However, since it pre-dates CI and Docker, each defect datum therein is manually assembled. Thus, SIR is difficult to scale up further and requires substantial effort to keep current. BugSwarm already has 3,091 reproducible defects; the automated mining of CI and Docker image artifacts lowers the cost of keeping the dataset growing.

ManyBugs [25] is a benchmark for program repair with 185 defects and fixes from 9 large C projects. Each defect and fix includes tests and is manually categorized. To facilitate the reproduction of these defects, ManyBugs provides virtual machine images (recently extended to use Docker [29]). Unlike BugSwarm, mining and reproducing bugs requires significant manual effort, and thus ManyBugs is not as easy to extend. On the other hand, ManyBugs provides a detailed bug categorization that can be useful for experiments, and its artifacts are collected from C programs, a programming language that BugSwarm does not currently support.

Defects4J [24][4] is a dataset of 395 real, reproducible bugs from 6 large Java projects. Defects4J provides manually constructed scripts for each project's build and test; the entire setup relies on a functioning JVM. Defects4J provides an interface for common tasks and provides support for a number of tools. The Bugs.jar [28] dataset contains 1,158 real, reproducible Java bugs collected from 8 Apache projects by identifying commit messages that reference bug reports. Bugs.jar artifacts are stored on Git branches. By contrast, BugSwarm relies on virtualized, Docker-packaged build and test environments, automatically harvested from the cross-platform Travis-CI archives; thus it is neither limited to Java nor does it require manual assembly of build and test tools. In addition to the test fail-pass pairs, we include build failures and even flaky tests. The above allows BugSwarm to achieve greater scale, diversity, and currency.

Urli et al. [31] describe an approach to mining builds that fail tests from Travis-CI. This work can only handle Maven-based Java builds; these are reproduced directly, without Docker. Their dataset includes 3,552 Maven Java builds for the purpose of automatic repair. Delfim et al. [20] develop Bears, which mines Maven-based Java GitHub projects that use Travis-CI. Bears attempts to reproduce every mined build in the same environment, which does not account for the developer-tailored .travis.yml file, whereas BugSwarm leverages Docker images to match each job's original runtime environment. Compared to BugSwarm, Bears has a similar reproduction success rate of 7% (856 builds).
B EARS pushes artifacts to G IT branches, instead of providing them as D OCKER images, and relies on Maven for building and testing, so new infrastructure must be implemented to include artifacts from other build systems. Our D OCKER-based approach allows other languages and build systems, and reflects our designed-in pursuit of greater diversity and reproducibility. Note that the B UG S WARM toolset supports the creation of fully reproducible packages for any pair of commits for which the T RAVIS -CI builds are archived. There are over 900K projects in G IT H UB that use T RAVIS -CI [13], so our toolkit enables the creation of datasets and ensuing experiments at a scale substantially larger than previous datasets allow. VII. C ONCLUSIONS This paper described B UG S WARM, an approach that leverages CI to mine and reproduce fail-pass pairs of realistic failures and fixes in Java and Python OSS. We have already gathered 3,091 such pairs. We described several exciting future directions to further grow and improve the dataset. We hope B UG S WARM will minimize effort duplication in reproducing bugs from OSS and open new research opportunities to evaluate software tools and conduct large-scale software studies. ACKNOWLEDGMENTS We thank Christian Bird, James A. Jones, Claire Le Goues, Nachi Nagappan, Denys Poshyvanyk, Westley Weimer, and Tao Xie for early feedback on this work. We also thank Saquiba Tariq, Pallavi Kudigrama, and Bohan Xiao for their contributions to improve B UG S WARM and Aditya Thakur for feedback on drafts of this paper. This work was supported by NSF grant CNS-1629976 and a Microsoft Azure Award. 348 — 111 — R EFERENCES [1] Apache Ant. http://ant.apache.org, Accessed 2019. [2] Microsoft Azure. http://azure.microsoft.com, Accessed 2019. Meeting Proceedings [3] Cobertura. https://github.com/cobertura/cobertura/wiki, Accessed 2019. [4] Defects4J. https://github.com/rjust/defects4j, Accessed 2019. [5] Error Prone. https://github.com/google/error-prone, Accessed 2019. [6] Gradle Build Tool. https://gradle.org, Accessed 2019. [7] JUnit Test Framework. https://junit.org/junit5, Accessed 2019. [8] Apache Maven Project. https://maven.apache.org, Accessed 2019. [9] Test Framework nose. http://nose.readthedocs.io/en/latest, Accessed 2019. [10] Test Framework pytest. https://docs.pytest.org/en/latest, Accessed 2019. [11] SpotBugs Bug Descriptions. https://spotbugs.readthedocs.io/en/latest/bugDescriptions.html, Accessed 2019. [12] Test Framework testng. http://testng.org/doc, Accessed 2019. [13] Travis CI. https://travis-ci.org, Accessed 2019. [14] Job Lifecycle. https://docs.travis-ci.com/user/job-lifecycle/#breaking-the-build, Accessed 2019. [15] Test Framework unittest. https://docs.python.org/2/library/unittest.html, Accessed 2019. [16] Test Framework unittest2. https://pypi.python.org/pypi/unittest2, Accessed 2019. [17] M. Beller, G. Gousios, and A. Zaidman. TravisTorrent: Synthesizing Travis CI and GitHub for Full-Stack Research on Continuous Integration. In Proceedings of the 14th International Conference on Mining Software Repositories, MSR 2017, Buenos Aires, Argentina, May 20-28, 2017, pages 447–450, 2017. URL https://doi.org/10.1109/MSR.2017.24. [18] C. Cifuentes, C. Hoermann, N. Keynes, L. Li, S. Long, E. Mealy, M. Mounteney, and B. Scholz. BegBunch: Benchmarking for C Bug Detection Tools. In DEFECTS ’09: Proceedings of the 2nd International Workshop on Defects in Large Software Systems, pages 16–20, 2009. URL http://doi.acm.org/10.1145/1555860.1555866. [19] V. 
Dallmeier and T. Zimmermann. Extraction of Bug Localization Benchmarks from History. In 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE 2007), November 5-9, 2007, Atlanta, Georgia, USA, pages 433–436, 2007. URL http://doi.acm.org/10.1145/1321631.1321702. [20] F. M. Delfim, S. Urli, M. de Almeida Maia, and M. Monperrus. Bears: An Extensible Java Bug Benchmark for Automatic Program Repair Studies. To appear in SANER 2019. [21] H. Do, S. G. Elbaum, and G. Rothermel. Supporting Controlled Experimentation with Testing Techniques: An Infrastructure and its Potential Impact. Empirical Software Engineering, 10(4):405–435, 2005. URL https://doi.org/10.1007/s10664-005-3861-2. [22] G. Gousios, A. Zaidman, M. D. Storey, and A. van Deursen. Work Practices and Challenges in Pull-Based Development: The Integrator’s Perspective. In 37th IEEE/ACM International Conference on Software Engineering, ICSE 2015, Florence, Italy, May 16-24, 2015, Volume 1, pages 358–368, 2015. URL https://doi.org/10.1109/ICSE.2015.55. 2018 [23] M. Hutchins, H. Foster, T. Goradia, and T. J. Ostrand. Experiments of the Effectiveness of Dataflow- and Controlflow-Based Test Adequacy Criteria. In Proceedings of the 16th International Conference on Software Engineering, Sorrento, Italy, May 16-21, 1994., pages 191–200, 1994. URL http://portal.acm.org/citation.cfm?id=257734.257766. [24] R. Just, D. Jalali, and M. D. Ernst. Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs. In International Symposium on Software Testing and Analysis, ISSTA ’14, San Jose, CA, USA - July 21 - 26, 2014, pages 437–440, 2014. URL http://doi.acm.org/10.1145/2610384.2628055. [25] C. Le Goues, N. Holtschulte, E. K. Smith, Y. Brun, P. T. Devanbu, S. Forrest, and W. Weimer. The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs. IEEE Trans. Software Eng., 41(12):1236–1256, 2015. URL https://doi.org/10.1109/TSE.2015.2454513. [26] S. Lu, Z. Li, F. Qin, L. Tan, P. Zhou, and Y. Zhou. Bugbench: Benchmarks for Evaluating Bug Detection Tools. In In Workshop on the Evaluation of Software Defect Detection Tools, 2005. [27] Q. Luo, F. Hariri, L. Eloussi, and D. Marinov. An Empirical Analysis of Flaky Tests. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, (FSE-22), Hong Kong, China, November 16 - 22, 2014, pages 643–653, 2014. URL http://doi.acm.org/10.1145/2635868.2635920. [28] R. K. Saha, Y. Lyu, W. Lam, H. Yoshida, and M. R. Prasad. Bugs.jar: A Large-Scale, Diverse Dataset of Real-World Java Bugs. In Proceedings of the 15th International Conference on Mining Software Repositories, MSR 2018, Gothenburg, Sweden, May 28-29, 2018, pages 10–13, 2018. URL https://doi.org/10.1145/3196398.3196473. [29] C. S. Timperley, S. Stepney, and C. Le Goues. BugZoo: A Platform for Studying Software Bugs. In Proceedings of the 40th International Conference on Software Engineering: Companion Proceeedings, ICSE 2018, Gothenburg, Sweden, May 27 - June 03, 2018, pages 446–447, 2018. URL http://doi.acm.org/10.1145/3183440.3195050. [30] D. A. Tomassi. Bugs in the Wild: Examining the Effectiveness of Static Analyzers at Finding Real-World Bugs. In Proceedings of the 2018 ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2018, Lake Buena Vista, FL, USA, November 04-09, 2018, pages 980–982, 2018. URL https://doi.org/10.1145/3236024.3275439. [31] S. Urli, Z. Yu, L. 
Seinturier, and M. Monperrus. How to Design a Program Repair Bot?: Insights from the Repairnator Project. In Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice, ICSE (SEIP) 2018, Gothenburg, Sweden, May 27 - June 03, 2018, pages 95–104, 2018. URL https://doi.org/10.1145/3183519.3183540.
[32] B. Vasilescu, Y. Yu, H. Wang, P. T. Devanbu, and V. Filkov. Quality and Productivity Outcomes Relating to Continuous Integration in GitHub. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, Bergamo, Italy, August 30 - September 4, 2015, pages 805–816, 2015. URL https://doi.org/10.1145/2786805.2786850.

History-Driven Build Failure Fixing: How Far Are We?

Yiling Lou∗ and Junjie Chen, HCST (Peking University), China, {louyiling,chenjunjie}@pku.edu.cn
Lingming Zhang, UT Dallas, USA, lingming.zhang@utdallas.edu
Dan Hao† and Lu Zhang, HCST (Peking University), China, {haodan,zhanglucs}@pku.edu.cn

ACM Reference Format: Yiling Lou, Junjie Chen, Lingming Zhang, Dan Hao, and Lu Zhang. 2019. History-Driven Build Failure Fixing: How Far Are We?. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA '19), July 15–19, 2019, Beijing, China. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3293882.3330578

ABSTRACT

Build systems are essential for modern software development and maintenance since they are widely used to transform source code artifacts into executable software. Previous work shows that build systems break frequently during software evolution. Therefore, automated build-fixing techniques are in huge demand. In this paper we target a mainstream build system, Gradle, which has become the most widely used build system for Java projects in the open-source community (e.g., GitHub). HireBuild, the state-of-the-art build-fixing tool for Gradle, has been recently proposed to fix Gradle build failures via mining the history of prior fixes. Although HireBuild has been shown to be effective for fixing real-world Gradle build failures, it was evaluated on only a limited set of build failures, and it largely depends on the quality/availability of historical fix information. To investigate the efficacy and limitations of history-driven build fixing, we first construct a new and large build-failure dataset from the Top-1000 GitHub projects. Then, we evaluate HireBuild on the extended dataset both quantitatively and qualitatively. Inspired by the findings of the study, we propose a simplistic new technique that generates potential patches via searching from the present project under test and external resources rather than the historical fix information. According to our experimental results, the simplistic approach based on present information successfully fixes 2X more reproducible build failures than the state-of-art HireBuild based on historical fix information. Furthermore, our results also reveal various findings/guidelines for future advanced build failure fixing.

1 INTRODUCTION

Build systems (e.g., Gradle [4], Ant [2] and Maven [9]) and their corresponding build scripts have been widely used in modern software development to automate the build process. Such build scripts are also frequently updated during software evolution, to be consistent with the changed source code or environment (e.g., third-party libraries and plug-ins). If an inconsistency/bug occurs, a build script may incur build failures.
Build failures occur frequently for both commercial and open-source software systems, and may seriously postpone the other activities in software development. For example, in Google, build failures for Java and C projects occur at frequencies of 28.5% and 38.4%, respectively [57]; on Travis [10], the most popular continuous integration (CI) service, nearly 29% of all the commits suffer from build failures during CI testing [14].

The widespread build-failure problem has gained increasing attention from software engineering researchers, and various studies/techniques on different types of build failures have been conducted/proposed [12, 23, 26, 45, 55, 68]. For example, Al-Kofahi et al. [12] proposed a fault localization approach for Makefile, which collects a dynamic execution trace via the concrete build rules and then computes the suspiciousness of each statement in the Makefile via a ranking algorithm; Macho et al. [45] recently designed three strategies based on the frequently occurring repair types to fix only dependency-related build failures for Maven projects. Among them, HireBuild [23], the state-of-the-art general-purpose build-failure fixing technique proposed at ICSE'18, learns fix patterns from successful fixes in history across projects and generates patches by embodying the learned patterns. Taking 135 previous build-failure fixes as the training set, HireBuild has been shown to be able to successfully fix 11 (46%) of 24 studied real-world build failures, indicating a promising future for history-driven build-failure fixing.

Despite its effectiveness, HireBuild was only evaluated on a limited dataset. Furthermore, as a history-driven technique, its effectiveness relies on the quality and availability of the training data. Therefore, it is unclear whether HireBuild's effectiveness can be generalized to other evaluation datasets. In this paper, to fully understand the efficacy and limitations of the history-driven build-fixing technique HireBuild and facilitate future build-fix studies, we first build a new and large dataset of 375 real-world build failures from the Top-1000 GitHub projects. To our knowledge, this is the largest evaluation in the literature for general-purpose build-failure fixing.

CCS CONCEPTS
• Software and its engineering → Software testing and debugging.

KEYWORDS
Automated Program Repair, Build System, Build Failure Fixing

∗ This work was done when Yiling Lou was a visiting student in UT Dallas.
† Dan Hao is the corresponding author. HCST is short for Key Lab of High Confidence Software Technologies, MoE, Beijing, China.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. ISSTA '19, July 15–19, 2019, Beijing, China. © 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-6224-5/19/07. $15.00. https://doi.org/10.1145/3293882.3330578

Among the collected 375 build failures, 102 of them are currently reproducible.
We thus re-visit the performance of HireBuild on these 102 reproducible build failures. We also perform qualitative manual inspection on the successful and unsuccessful cases of HireBuild to investigate the strengths and limitations of history-driven build-failure fixing. The quantitative results show that HireBuild is able to fix 9 (9%) out of 102 build failures, confirming that HireBuild can indeed commit successful fixes, but at a much lower rate than reported. Meanwhile, based on our qualitative analysis, the build failures fixed by HireBuild usually fall into certain fixed patterns, which actually could also be obtained from present information (i.e., present build code and external resources) rather than historical build-fix information; the unfixed build failures are mainly due to the inflexible design of fix patterns and patch generation rules, making some patching results hard to generalize to new datasets.

Inspired by the findings of our study, we propose a lightweight build-failure fixing technique, HoBuFF (History-oblivious Build Failure Fixing), which does not rely on history data but instead simply utilizes the present information of the build code, build log and external build-related resources. HoBuFF includes two phases: (1) fault localization [15–17, 37, 38, 40, 50, 51, 74, 76, 77], and (2) patch generation. In particular, in the first fault-localization phase, HoBuFF analyzes error logs of the given build failures to extract error information, and then localizes the possible buggy locations via inter-procedural data-flow analysis; in the second patch-generation phase, HoBuFF generates patch candidates by defining three fixing operators and searching for the fixing ingredients both inside and outside the project (i.e., internal and external resources). We then conduct an empirical comparison between HireBuild and HoBuFF on the extended dataset and find that among the 102 reproducible failures, HoBuFF successfully fixes 18 bugs within less time, including 8 of the 9 bugs fixed by HireBuild. We also observe that for the build failures that cannot be fixed, HoBuFF can terminate its execution in minutes, whereas HireBuild may take hours.

The paper makes the following contributions:
• Dataset: A dataset including 375 real-world build failures with 102 reproducible build failures, which is much larger than the state-of-art build-failure datasets and can serve as the benchmark dataset for future build-fix studies.
• Study: An extensive study of state-of-the-art history-driven build-failure fixing (HireBuild) on the extended dataset, with detailed manual inspection of both its strengths and limitations.
• Technique: A novel build-failure fixing technique (HoBuFF) with only present information (i.e., no requirement for historical data), which utilizes lightweight data-flow analysis and queries internal/external resources to perform build fixing.
• Implications: An empirical evaluation of HoBuFF and state-of-the-art HireBuild, which demonstrates that present project information can greatly complement historical build-fix information for automated build failure fixing, and also reveals various findings/guidelines for future advanced build-failure fixing.

2 BACKGROUND
2.1 Build Failure Fixing: Challenges

In this section, we present a build failure in Mockito/db8a3f3 and its manual fixing patch [8] to illustrate the challenges in fixing build failures for Gradle. Table 1 presents the build failure example with an error-information-related segment (i.e., error message) from its build log and its manual fixing patch.

Table 1: Illustration Example

Error message
* What went wrong:
> Execution failed for task ':releaseNeeded'.
> Cannot get property 'needed' on extra properties as it does not exist
* Where:
Script '/gradle/release.gradle' line: 104

Manual Patch
 80  task ("releaseNeeded") {
     ...
 97    if (skippedByCommitMessage or skipEnvVariable) {
 98      ext.needed = false
 99    } else if (forceBintrayUpload or dryRun) {
100      ext.needed = true
101    } else {
102      logger.lifecycle("Criteria not met")
+        ext.needed = false
103    }
104    logger.lifecycle("${ext.needed}")
     }
     ...
110  bintrayUpload {
111    dependsOn releaseNeeded
112    onlyIf { releaseNeeded.needed }

During the execution of task releaseNeeded, the build process terminates at Line 104, because Line 104 tries to access the property ext.needed, which is not defined on the else branch. To fix this failure, developers add an extra statement ext.needed = false (the line marked with "+") after Line 102.

Given the build log containing failure-related information, the first challenge lies in fault localization in the Gradle script. As the error message in Table 1 indicates, Line 104 is the code location where the build failure is triggered and the build process stops. However, according to the manual patch, the fix is added after Line 102. Line 104 reveals the build failure, but it may not be the root cause of the build failure. Therefore, without identifying all potential root-cause statements, we may not fix a build failure. Even if the set of root-cause statements is identified, there is another challenge: how to generate a correct patch. In particular, to fix this example build failure, an automated fixing technique needs to find out the correct value for the error property ext.needed, which requires an understanding of the program. In other cases, when a build failure is caused by using incorrect values of external resources (e.g., a third-party library dependency), an automated fixing technique also requires open knowledge of external resources.

2.2 State-of-the-Art HireBuild

As the state-of-art build-failure fixing technique, HireBuild learns fixing patterns from historical build-failure fixes (i.e., pattern extraction) and generates specific patches by predefined rules and by filling in concrete values in the pattern (i.e., patch generation). In the phase of pattern extraction, HireBuild first requires a training set, which is composed of build fixes collected from the history, and then selects several build fixes from the training set whose error messages are similar to the error message of the given build failure. These selected build fixes are regarded as seed fixes. From these seed fixes, HireBuild extracts patterns and then ranks them with some heuristic strategies. For example, for a given fix which changes statement version = 1.4.0 to version = 1.7.0, HireBuild learns a pattern, which is to "replace the constant value (i.e., 1.4.0) in expression version = 1.4.0 with a new value". Note that this new value will be decided later (in the patch generation phase). Moreover, since HireBuild defines patterns by using only two-level AST expressions (i.e., the current and its parent node expressions), its generated fixes often cover only a small span of script code, e.g., often a variable or a continuous segment of script code.
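To make the pattern-extraction step concrete, the sketch below abstracts a historical fix into a "replace the constant value" template in the spirit described above; it illustrates the idea only and is not HireBuild's implementation.

import re

CONSTANT = re.compile(r"['\"]?\d+(?:\.\d+)*['\"]?")

def abstract_fix(before_stmt, after_stmt):
    # Mask the constant that changed between the buggy and fixed statement,
    # yielding an abstract "replace constant with a new value" pattern.
    before_shape = CONSTANT.sub("<CONST>", before_stmt)
    after_shape = CONSTANT.sub("<CONST>", after_stmt)
    if before_shape == after_shape and before_stmt != after_stmt:
        return {"matches": before_shape,
                "action": "replace constant value with a new value"}
    return None  # the fix does not follow this simple template

print(abstract_fix("version = 1.4.0", "version = 1.7.0"))
# {'matches': 'version = <CONST>', 'action': 'replace constant value with a new value'}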
In the phase of patch generation, HireBuild de�nes several rules for four types of build elements and �lls in concrete values into the abstract part of the patterns for these build elements. The four types of elements are (1) identi�ers (including task names, block names, variable names), (2) names of Gradle plug-ins and thirdparty tools/libraries, (3) �le paths within the project, and (4) version numbers. Then HireBuild ranks the generated patches based on the similarity between patches and seed �xes, applies them to the code locations matching the patterns, and at last validates them in order. However, besides the four types of elements considered in HireBuild, there still exist other elements (e.g., project-speci�c variables), whose concrete values are required in patch generation but are not considered by HireBuild at all. For the example in Table 1, ext.needed (whose value is true or false) does not belong to any of the four types. For these elements, HireBuild does not design speci�c rules for concrete value generation, and thus cannot generate candidate patches for the corresponding build failures. Extended Dataset in the Big Data Era: 110 375 Challenges and Opportunities Statistics #Projects #Bugs #Reproducible Bugs Previous Dataset 54 175 24 102 on Gradle �les) and then in total have 403 build failures, each of which is suited with a failed commit and a successive passed commit. To avoid overlap between this newly constructed dataset and the prior dataset [23], we further remove build failures already in the prior dataset, and �nally have a new dataset of 375 build failures. Among the 375 build bugs, we successfully reproduce 102 build failures. A build bug is reproducible when its failed commit still fails (due to the same reason) and its originally-passed commit still passes. We �nd it more challenging to reproduce build failures than general program bugs since build failures associate tightly with external resources (which di�er among di�erent time stamps), and are also susceptible to internal resources (which di�er among machines/environments): (1) The dependent libraries which were originally missing in repository, now are added to repository, or the dependent libraries which originally contained �aws, now are �xed. In this case, the originally-failed commits no longer fail anymore. (2) The external resource changes of the libraries which were not relevant to the build failure in the past, now also cause the build to fail (but due to a di�erent reason). In this case, the originallypassed commits no longer pass now. (3) The build failures are caused by internal machine resources (e.g., build process crash due to speci�c memory/process status), and are hard to reproduce in a new machine. (4) The build failures are simply �aky (e.g., due to �aky tests [44]), and are hard to reproduce. Table 2 presents the basic information of the extended dataset, which shows the scale of the extended data set compared with that of the prior evaluation dataset. Noted that, there is no overlap between the build failures in the extended dataset and the prior one [23] (since we intentionally removed such overlapped failures). 3 STUDY ON HIREBUILD As a learning-based technique, HireBuild was evaluated on a very small dataset, including the training and testing data, which may incur over�tting. To alleviate this concern, it is important to reevaluate its performance on a di�erent and larger dataset. 
Also, it is essential to manually inspect the cases that HireBuild produces correct �xes or not, to learn its e�cacy and possible limitations. 3.1 An Extended Dataset Why a new dataset is needed? The study of HireBuild [23] uses 135 previous build-failure �xes as training data and a set of 24 reproducible build failures as its evaluation dataset. Such a small data set, especially the evaluation dataset, may bring obvious external threat to validity, and thus the conclusions may not be generalized. To reduce this threat, we conduct a more extensive study on HireBuild by extending the existing dataset [23]. In particular, we collect extra 375 build failures, among which 102 build failures are reproducible, signi�cantly more than those of the prior dataset. In this paper, we evaluate build-failure �xing techniques on the dataset of these 102 build failures. This new dataset is abbreviated as the extended dataset hereinafter. How is the new dataset collected? First, we collect the top-1000 popular Java projects in GitHub [3], and keep only the 411 projects which have been integrated tested in Travis system according to whether the “.travis.yml” �le is in the project. Second, we collect build bugs based on the history data of these projects. For each of these 411 projects, we �rst identify a commit (denoted as VF ) whose build status is failed and whose immediately successive commit (denoted as VP ) is passed. As the failure of VF may come from either the build script or the source code, we keep only the commit VF whose changes from VF to VP occur on Gradle �les alone. Among these 501 resulting commits (which are actually commits containing build bugs), we manually remove those whose modi�cations on Gradle �les do not in�uence the build results (e.g., documentation modi�cations or semantic-equivalent modi�cations 3.2 Research Questions We investigate following research questions for studying HireBuild: • RQ1: How does HireBuild perform on the extended dataset in terms of the number of �xed build failures? • RQ2: Why does HireBuild succeed to �x some build failures? • RQ3: Why does HireBuild fail to �x some build failures? For an unbiased study of HireBuild, we ask for the original implementation of HireBuild from the authors and directly use it for our study. Moreover, we take the setting of HireBuild used in the previous work [23], i.e., using the same training dataset (135 build failures from its dataset) and the setting parameters (e.g., the number of seed build �xes is 5). 3.3 Results and Analysis 3.3.1 RQ1: Number of Fixed Failures by HireBuild. Among 102 reproducible build failures in the extended dataset, HireBuild �xes only 9 of them (9%), which is much lower than the �xing rate reported on the prior dataset (i.e., 46% [23]). In other words, HireBuild does not perform as well as it appears in its original dataset, and may su�er from the over�tting problem. Besides the quantitative analysis, it is also interesting to investigate the performance of HireBuild in details, including its successfully-�xed/un�xed cases. 45 — 115 — ISSTA ’19, July 15–19, 2019, Beijing, China Yiling Lou, Junjie Chen, Lingming Zhang, Dan Hao, and Lu Zhang 2018 Table 3: An Example Successful Patch by HireBuild Unsuccessful Patch Generation. To generate patches, HireBuild embodies rules for only four type of elements in speci�c patch generation. 
However, besides these four types of elements, Gradle scripts may contain other elements, e.g., user-de�ned variables whose values are Boolean or Strings, which do not belong to any of the four types. For example in Table 1, ext.needed (whose value is true or false) does not belong to any of the four types. For these elements, HireBuild does not design speci�c rules for concrete value generation, and thus cannot generate candidate patches for the corresponding build failures. To sum up, HireBuild only embodies patch generation rules for speci�c elements, and thus cannot generate patches for all build failures, even if HireBuild precisely localizes the buggy code. Error Message * What went wrong: Meeting Proceedings > A problem occurred evaluating root project ’:AnimeTaste’ > Couldn’t resolve all dependencies for con�g ’:debugCompile’ > Could not �nd com.afollestad:material-dialogs:0.6.3.1 Patch Generated by HireBuild 30 compile ’com.android.support:support-v4:22.0.0’ 31 compile ’com.android.support:appcompat-v7:22.0.0’ 32 compile ’com.github.johnpersano:supertoasts:1.3.4’ 33 compile ’fr.baloomba:viewpagerindicator:2.4.2’ 34 compile ’com.koushikdutta.async:androidasync:2.1.3’ 35 - compile ’com.afollestad:material-dialogs:0.6.3.1’ 35 + compile ’com.afollestad:material-dialogs:0.8.6.2’ 3.3.2 RQ2: Successfully-fixed Cases. For the 9 �xed build failures, most of them are �xed by using a rigid pattern in a straightforward way, and history information is not very necessary. For example, Table 3 shows a build failure caused by an unresolved thirdlibrary. The error message suggests that the third-library named com.afollestad.material-dialogs with version 0.6.3.1 could not be resolved. To �x this failure, HireBuild learns a pattern based on similar build-failure �xes related to library resolving, which is to update the constant value in the expression starting with keyword compile. However, as also shown by this table, compile is a very common keyword in Gradle scripts so that many expression starts with compile. To further localize the faulty code, HireBuild designs extra ranking rules which assign high priority to the expression sharing similar tokens in error message(i.e., com.afollestad.material-dialogs). To generate patches then, after recognizing the element as a third-library with prede�ned rules, HireBuild searches in the Gradle central repository to �nd proper version values for the library. The �xing process of this example suggests that the patterns learned by HireBuild from historical data (i.e., updating the constant value in the expression starting with keyword compile) actually play a marginal role to the �nal success of �xing. On the contrary, the present information, including the error message, script code itself, and the external resources (e.g., the Gradle central repository), may be su�cient for �xing such failures. 3.4 Enlightenment According to the �ndings of above research questions, we could infer some guidelines for automatically �xing build failures: • Necessity of using historical data. Historical �xing data are not the indispensable factor for build-failure �xing techniques. On the contrary, it is actually more essential to make good use of the present information (i.e., present script code, present build log and internal/external resources). • Feasibility of using present data. Present script code, build log are often available for a given build failure. Besides, lots of program analysis techniques could be adapted to analyze the script code. 
Furthermore, build logs are very well-structured and thus allow plain error information extraction. As for resources, there are o�cial documents and repositories stored in a wellstructured for automated reference. • Analyzing more pa�erns for build code. The patterns extracted by HireBuild are in�exible because it keeps little program information. It implies that, analyzing more �x patterns from build code could help draw pivotal clues for build-failure �xing. • Considering more elements in build code. The limited types of elements in build code considered for patch generation also cause the limitation of HireBuild’s e�cacy, which implies that, a more general/systematic build-failure �xing approach also requires considering more script elements. 3.3.3 RQ3: Unfixed Cases. HireBuild fails to �x 93 failures, because its in�exible pattern generation and application mechanism hampers (1) localizing the faulty code, (2) generating correct patches. Unsuccessful Fault Localization. As introduced in Section 2.2, HireBuild can not distinguish potential faulty code in the process of fault localization. In particular, HireBuild regards all the expressions matching the patterns as faulty code (i.e., the place where to apply the patterns then). In detail, if the learned pattern is “to update the constant value of an expression which starts with version =”, only the expressions starting with “version =” are considered faulty code. This strategy works well only when the faulty code exactly matches the learned patterns. However, the faulty code does not always match the learned patterns in such a rigid way. For the illustration example in Section 2, to �x this build failure, HireBuild needs to learn a pattern “inserting an expression: ext.needed = false”. However, the essential variable ext.needed is named in a project-speci�c way, and HireBuild can hardly learn such a pattern from the training dataset. To sum up, HireBuild generates in�exible patch patterns, which can hardly deal with project-speci�c failures. 4 A NEW TECHNIQUE: HOBUFF Inspired by the �ndings of study on HireBuild, we propose a lightweight build-failure �xing technique, named HoBuFF (Historyoblivious Build Failure Fixing), which does not take historical �xes as input, but utilizes present information in a more exhaustive way. Build-failure Fixing Problem De�nition. At a high level, a Gradle build script can be regarded as a collection of con�gurations, each of which consists of a con�guration element and its value. Most build failures can be attributed to incorrect con�gurations, e.g., assigning a wrong value to a con�guration element or missing a con�guration. More formally, any build script can be denoted as a set C = {c 1 , c 2 , . . . , c n }, where C denotes a build script and c i =< e i ,ψ i > (1  i  n) is a con�guration implicitly or explicitly claimed in C. Here, e i is a con�guration element and ψ i is the value of the element. Supposed that a build failure occurs when using C, the problem of build failure �xing is to generate a new build script 46 — 116 — History-Driven Build Failure Fixing: How Far Are We? ISSTA ’19, July 15–19, 2019, Beijing, China Software Automation in the Big Data Era: Challenges and Opportunities Figure 1: Overview of HoBuFF C + by conducting modi�cations on C so that the build failure will disappear when using C + . More speci�cally, build failure �xing consists of two steps: fault localization and patch generation. 
The former aims to �nd which con�guration is buggy, denoted as c b =< e b ,ψ b > (1  b  n), and localize e b in C, while the latter aims to generate a correct value for e b , denoted as ψ b+ , and update the con�guration c b in C with c b+ =< e b ,ψ b+ >. Overview of HoBuFF. HoBuFF consists of two phases: data�owbased fault localization and search-based patch generation. Figure 1 shows the overview of HoBuFF. In the �rst phase of fault localization, HoBuFF extracts error information by analyzing build logs, and then localizes the potential buggy code by applying lightweight data-�ow analysis (Section 4.1). In the second phase of patch generation, HoBuFF designs �xing operators and searches for the �xing ingredients, so as to generate patch candidates (Section 4.2). For ease of understanding, we use the example presented in Section 2 to illustrate our approach throughout this section. Figure 2: Work�ow of Data�ow-based Fault Localization When a build failure occurs, a build log records the corresponding error information, which is helpful to manually localize the root cause of the build failure. Therefore, the �rst step of HoBuFF for fault localization is to extract the error information from the build log. Based on the extracted error information, HoBuFF then localizes the bug-revealing statement(s), which is the statement(s) in the build script that exposes the build failure during the build process. Finally, HoBuFF traces the root cause from the bug-revealing statement(s) via lightweight inter-procedural data-�ow analysis. Figure 2 shows the work�ow of the fault-localization process on our example. easy to extract with regular expression matching. However, the causes behind build failures can be totally di�erent, and it is impractical to design �xed extraction templates to extract the buggy element names for all of them. Considering element names are always nominal, we �rst conduct POS Tagging [54] (using StanfordCoreNLP [46]) of the statement(s) related to error elements, and take the unusual nouns (e.g., NNP, NN, NNS, NNPS) as the possible names of error elements. Here, we refer the trivial nouns appearing frequently in build log as the usual nouns (e.g., “failure”, “test”, “complication”, “dependency”, “con�guration”, etc.) to reduce the noise. If more than one nouns are adjacent, we combine them together to reduce the number of potential error element names. To handle more cases, we also consider other special tokens which cannot be POS-tagged correctly (i.e., path names, or tokens in quotation mark). After this, we get potential error-element names and potential values for further analysis. In sum, the following speci�c error information will be collected to facilitate the fault-localization process, including: (1) project, (2) task, (3) con�guration element, (4) value of the element, and (5) location of the bug-revealing statement(s). For the illustration example in Table 1, HoBuFF extracts the following information from the error message: (1) project: root project, (2) task: releaseNeeded, (3) con�guration element: needed, (4) value of the element: null, and (5) location of the bug-revealing statement: Line 104. Note that project names occur before task names like “projectName:taskName”. If the project name is missing, “root project” will be used. 4.1.1 Error Information Extraction. A build log tends to record much information involving various stages of a build process, such as initialization and task execution. 
Therefore, HoBuFF �rst parses the build log to extract the message related to the build failure, called error message. Due to the standard form of Gradle build logs, there exist error-indicating headers in the log to mark the error message. As shown in the error message of the illustration example, there are two error-indicating headers: (1) “* What went wrong:” explaining the symptom of the build failure; (2) “* Where:” indicating the location of the bug-revealing statement(s). Following the existing work [23], HoBuFF utilizes the error-indicating headers to extract the error message from a build log. In the error message, Gradle always tries to report in which project and in which task, the build failure occurs, which are reported in standard form and 4.1.2 Bug-Revealing Statement Identification. The bug-revealing statement(s) is responsible to expose the build failure, and may not be the root cause of the failure, but it is actually very important to help identify the root cause. The bug-revealing statement(s) can be identi�ed based on the aforementioned extracted error information. If the last type of information, i.e., location(s) of the bug-revealing statement(s), exists, HoBuFF is able to directly �nd the bug-revealing statement in the build script according to the line number. As shown in the illustration example, the bug-revealing statement is identi�ed at Line 104. Otherwise, HoBuFF uses the other types of information to infer the bug-revealing statement in the build script. More speci�cally, HoBuFF �rst calculates the Levenshtein distance [36] between each element name in the build script 4.1 Data�ow-Based Fault Localization 47 — 117 — ISSTA ’19, July 15–19, 2019, Beijing, China Yiling Lou, Junjie Chen, Lingming Zhang, Dan Hao, and Lu Zhang 2018 and the name of the extracted con�guration element, and then identi�es the statement(s) containing the variable with the smallest Meeting Proceedings distance as the bug-revealing statement(s). If there are more than one statement satisfying the preceding condition, HoBuFF identi�es the ones within the extracted project or task. 4.2.1 Fixing Operators. Given a buggy or missing con�guration c b =< e b ,ψ b >, we de�ne three �xing operators in HoBuFF following existing work for source-code repair [29, 42, 67]: • Update: Update c b by replacing ψ b with the correct value ψ b+ , where ψ b+ is the ingredient. Note that how to de�ne ingredients will be introduced in the following subsection. • Insertion: Insert a con�guration c b , where e b is decided through fault localization, whereas ψ b+ is the ingredient to be introduced. • Deletion: Delete the con�guration c b , where no ingredient is required. In this case, c b = null (c b is removed from C). To avoid syntax problem, we further analyze the data dependencies from c b to delete all the a�ected statements as well. For each root-cause statement, HoBuFF applies these operators in the order of Update-Insertion-Deletion. As the previous study on source-code repair [67] shows, “Update” is the most widelyused operator in manual bug repairs while “Deletion” is the least used �xing operator. In particular, if the root-cause statement is an inserted null expression (e.g., 102: ext.need = null in Figure 2), only “Insertion” can be applied to it since it is an implicit statement. 4.1.3 Root Cause Localization. To identify the root cause in the build script, HoBuFF performs inter-procedural data-�ow analysis to trace backward from a bug-revealing statement(s). 
Note that we only consider data-�ow dependencies (ignoring control-�ow dependencies) to avoid over-approximations [59, 70] for lightweight analysis. First, HoBuFF constructs an inter-procedural control-�ow graph, and annotates it with the data dependency information computed via reaching de�nition analysis [49]. Since Gradle is a Groovy-based domain-speci�c language and it de�nes a set of its own rules to serve as a build tool, its analysis is slightly di�erent from widely-used programming languages such as C++ and Java. Therefore, we perform inter-procedural, context-insensitive, and �eld-sensitive data�ow analysis considering the following speci�c features of Gradle �les: (1) variable de�nitions in one Gradle script �le (or tasks) often use variables de�ned in other Gradle �les (or tasks), thus we perform inter-procedural analysis; (2) there are not intensive function invocations in Gradle �les (e.g., Gradle scripts largely reply on task dependencies/sequences rather than method invocations to implement the build logic), thus we perform context-insensitive analysis; (3) �elds in aggregate structure variables are widely used in Gradle �les (e.g., ext.needed in Table 1), thus we perform �eld-sensitive analysis. Note that if a variable V is used without de�nition for some program path, an implicit statement V = null would be inserted into the location which is not reached by any de�nition of V and is also the closest to the statement using V . Then there should be an edge between the inserted statement and the statement using V . For example, there is missing de�nition in else branch for Line 104 in Table 1, thus we insert statement ext.needed = null in Line 102, and add an edge from Line 102 to Line 104. The constructed data dependencies for the example in Table 1 is shown in Figure 2. Then, based on the graph, HoBuFF identi�es all the statements a�ecting the bug-revealing statement (including the bug-revealing statement) as potential root-cause statements, which is the output of the faultlocalization process. For example, Lines 98, 100, 102, and 104 are identi�ed as the potential root-cause statements of the build failure. 4.2.2 Fixing Ingredients. Both “Update” and “Insertion” require �xing ingredients. We propose a search-based approach to �nd the correct ingredients for these operators. We classify the con�guration elements into two types and decide their values (i.e., ingredients) in di�erent ways. The �rst type of con�guration elements is de�ned within the project, e.g., properties or �les, called internal elements; whereas the second type of con�guration elements is related to external libraries, e.g., thirdlibrary tools and dependencies, called external elements. For internal elements, HoBuFF searches ingredients inside the project, i.e., internal searching in short. For external elements, HoBuFF searches ingredients outside the project, i.e., external searching in short. HoBuFF with internal searching, is to search all the values that are assigned to the same con�guration element within the whole project. For example, there are two ingredients found in this way for the illustration example: (i) ext.needed = false; (ii) ext.needed = true; HoBuFF with external searching, is to search the values from external resources. 
Here we consider three kinds of external resources: (i) Gradle central repository [5] recording most of thirdparty dependencies; (ii) Gradle DSL document [6] recording Gradle types and their corresponding properties and potential values; (iii) Android DSL document [1] recording most of Android-related plugins and their corresponding properties. We also consider Android DSL because Gradle is the o�cial build tool for Android and Gradle build scripts usually have dependencies with Android. Since the external resources are recorded in a well-structured form, HoBuFF is able to collect the information for these external resources in advance. Due to the closure characteristic of Gradle, for each item in the external resources, it can be represented in a sequence like < pre f ix 1 .pre f ix 2 ...pre f ixn : valueType >. For example, from the Android DSL document, we could collect item like , which means in android block and its sub-block lintOptions, there is an element abortOnError, whose value type is Boolean and has two optional values: true or false. Given a searching keyword, HoBuFF tries to match it within collected sequences, and retrieves the value type and pre�xes for 4.2 Search-Based Patch Generation To �x a build failure, HoBuFF uses a search-based approach to generate patch candidates for each root-cause statement, and then applies these patches one by one to the buggy Gradle script. If the build failure disappears when applying some patch, HoBuFF regards this patch as valid. The �xing process is conducted continuously until all patch candidates have been applied or a valid patch is found. In the following, we �rst introduce the components required by search-based patch generation, i.e., �xing operators (Section 4.2.1) and �xing ingredients (Section 4.2.2). Then, we present the overall process of patch candidates generation (Section 4.2.3). 48 — 118 — History-Driven Build Failure Fixing: How Far Are We? ISSTA ’19, July 15–19, 2019, Beijing, China Software Automation Table 5: Build Failure with Non-existing File Table 4: Build Failures Fixed by HoBuFF/HireBuild Build Failure Category Internal Element Related External Element Related Total HoBuFF 8 10 18 HireBuild 0 9 9 in the Big Data Era: Challenges and Opportunities Error Message * What went wrong: > A problem occurred evaluating project ’:app’ > /home/travis/build/yydcdut/PhotoNoter/app/release.properties (No such �le or directory)* Overlap 0 8 8 the keyword. For example, if HoBuFF �nds the buggy element named lint, HoBuFF searches for lint within collected sequences and �nds the related one lintOptions and its value type. Based on the information, HoBuFF could generate several �xing ingredients, such as, android.lintOptions.abortOnError = false and android.lintOptions.abortOnError = true. The transformed Gradle code from these ingredients can refer to Table 6. In this way, a set of ingredients can be collected from external resources. Patch Generated by HoBuFF 55 Properties p = new Properties() 56 - p.load(new FileInputStream(project.file)) (’release.properties’))) 57 - storeFile file(p.storeFile) 58 - storePassword p.storePassword 59 - keyAlias p.keyAlias 60 - keyPassword p.keyPassword 5.3 RQ4: Build-Failure Fixing E�ectiveness Among the 102 reproducible build failures, HoBuFF successfully �xes 18 of them (18%), whereas the state-of-art HireBuild �xes only 9 of them (9%), indicating the superiority of simply using the present project information rather than using the historical �x information for build-failure �xing. 
In addition, among the 9 build failures fixed by HireBuild, 8 are also fixed by HoBuFF, which implies that the new simplistic approach is able to fix most of the build failures that HireBuild fixes. To investigate the contribution of each component of HoBuFF, we further combine the fault localization of HoBuFF with the patch generation of HireBuild, and the fault localization of HireBuild with the patch generation of HoBuFF, on the 18 failures fixed by HoBuFF. The results show that the former fixes only 12 failures while the latter fixes only 8, demonstrating the contribution of each component of HoBuFF. More qualitative analysis of the capability of HoBuFF over HireBuild follows.

4.2.3 Patch Candidate Generation. The input of the patch-candidate-generation process is a set of localized root-cause statements and the output is a list of patch candidates for the build failure. More specifically, the patch-candidate-generation process proceeds in the following three layers (a code sketch of this loop is given below): (1) for each root-cause statement, HoBuFF first decides which fixing operators should be applied according to whether the statement is a null definition or not; if the statement is a null definition, only "Insertion" is considered, otherwise all three fixing operators are applied in the order Update, Insertion, Deletion; (2) for each fixing operator, HoBuFF directly generates a patch candidate if the operator is "Deletion", while it generates fixing ingredients via internal or external searching if the operator is "Insertion" or "Update"; (3) for each fixing ingredient, HoBuFF generates a patch and validates it.

5.3.1 Case Analysis for Successfully-Fixed Failures. To facilitate analysis, we categorize the build failures successfully fixed by HoBuFF according to the configuration locations that the corresponding fixes deal with. Table 4 presents the results for each category, where Columns 2 and 3 show the number of build failures fixed by HoBuFF and HireBuild within each failure category, and Column 4 presents the number of build failures fixed by both techniques.

Internal-element-related failures refer to build failures resulting from wrong values of internal configuration elements. The values of internal configuration elements are often specific to the project, so they usually vary among projects. Properties and files are common internal elements. Table 5 presents such a real build failure in our study and its fix generated by HoBuFF. From the error message, this failure is caused by a non-existing local file. HoBuFF identifies the buggy configuration in Line 56 as the bug-inducing statement based on the file path. As HoBuFF cannot find fixing ingredients for the insertion and update operators on this bug-inducing statement, it generates a patch by applying the deletion operator to Line 56 and its affected code (Lines 57-60). Note that this patch is the same as the corresponding manual patch. According to Table 4, HireBuild does not fix any internal-element-related failures. Because HireBuild learns patches across projects, and different projects define different internal configurations (different element names and different element values), HireBuild cannot learn how to fix this category of failures from the history data. In contrast, HoBuFF is good at fixing these failures, since it performs data-flow analysis for precise localization and internal searching for valid patch generation.
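To make the three-layer process of Section 4.2.3 concrete, the following hypothetical Java sketch enumerates operators per root-cause statement and ingredients per operator, validating one candidate at a time until the build passes. All interfaces and helper names (Fixer, searchIngredients, makePatch, and so on) are invented for illustration and are not part of HoBuFF's actual code.

import java.util.*;

// Hypothetical sketch of the three-layer candidate-generation loop from
// Section 4.2.3: operators per root-cause statement, ingredients per
// operator, one candidate patch per ingredient, validated in order.
public class PatchGenerationSketch {

    enum Operator { UPDATE, INSERTION, DELETION }

    interface Statement { boolean isNullDefinition(); }
    interface Patch { boolean buildSucceedsAfterApplying(); }

    // Assumed helpers standing in for HoBuFF's internal/external search
    // and patch construction; they are not part of any real API.
    interface Fixer {
        List<String> searchIngredients(Statement s, Operator op); // internal + external search
        Patch makePatch(Statement s, Operator op, String ingredient);
        Patch makeDeletion(Statement s);
    }

    static Optional<Patch> fixFailure(List<Statement> rootCauses, Fixer fixer) {
        for (Statement s : rootCauses) {
            // Layer 1: pick operators (null definitions only admit insertion).
            List<Operator> ops = s.isNullDefinition()
                    ? List.of(Operator.INSERTION)
                    : List.of(Operator.UPDATE, Operator.INSERTION, Operator.DELETION);
            for (Operator op : ops) {
                // Layer 2: deletion needs no ingredient; update/insertion do.
                List<Patch> candidates = new ArrayList<>();
                if (op == Operator.DELETION) {
                    candidates.add(fixer.makeDeletion(s));
                } else {
                    for (String ing : fixer.searchIngredients(s, op)) {
                        candidates.add(fixer.makePatch(s, op, ing));
                    }
                }
                // Layer 3: validate each candidate until one makes the build pass.
                for (Patch p : candidates) {
                    if (p.buildSucceedsAfterApplying()) {
                        return Optional.of(p);
                    }
                }
            }
        }
        return Optional.empty(); // no valid patch found
    }
}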
External-element-related failures refer to the failures resulting from improper values of elements in external configurations, such as the options of third-party plug-ins. Among the 10 such build failures successfully fixed by HoBuFF, we further found that the Gradle repository, the Gradle DSL document, and the Android DSL document contribute to 20%, 20%, and 60% of the cases, respectively. Table 6 presents such a real-world build failure in our study and its fix generated by HoBuFF. This failure is caused by a wrong option of the lint component. HoBuFF identifies the buggy configuration in Line 94 as the bug-revealing statement and the root cause. Then, HoBuFF applies external searching and finds that lintOptions owns a set of options (e.g., abortOnError, absolutePaths). Lastly, HoBuFF enumerates all values of these options, since they are of Boolean type, and combines them with the insertion operator to generate candidate patches. Note that although disabling the lint option seems tricky, we observed that the developer(s) did exactly the same "lazy" fix.

5 COMPARISON EVALUATION
5.1 Research Questions
• RQ4: How does HoBuFF perform in terms of the number of fixed build failures?
• RQ5: How does HoBuFF perform in terms of the build-failure fixing time and the number of candidate patches?
While RQ4 focuses on effectiveness, RQ5 studies the time HoBuFF takes and the number of candidate patches validated before a correct patch is found. These measurements are widely used in existing work on program repair [53, 65, 67].
5.2 Implementation, Environment, and Process
To implement HoBuFF, we use the Groovy AST APIs [7] to systematically analyze/modify Gradle build scripts. We conduct all experiments on a computer with 64 Intel Xeon E5 CPU cores, 128 GB memory, and Ubuntu 14.04.1. The tool/dataset can be found on our website: https://sites.google.com/site/hobuff2019. To investigate the performance of HoBuFF, for each build failure we first apply HoBuFF to its corresponding buggy build script and collect the list of patches generated by HoBuFF. For each patch, we follow previous work [23] to validate whether it is correct based on the following two criteria: (1) the build task finishes successfully; and (2) the size of the compiled files is the same as the size of the compiled files generated by the manual patch. Finally, we count the number of build failures that HoBuFF can successfully fix. For each fixed build failure, we also record the time spent by HoBuFF and the number of validated candidate patches. To compare with the state of the art, we record the same measurements for HireBuild.

Table 6: Build Failure with Lint Error
Error Message:
* What went wrong:
> Execution failed for task ':PG_Edit_SDK:lint'
> Lint found errors in the project; aborting build
Patch Generated by HoBuFF:
85   android { ...
94     lintOptions {
   +     abortOnError false
95       disable 'ExtraTranslation'
       } ... }

Table 8: Breakdown for Unsuccessful Fault Localization
Unsuccessful Reason           | Failure Type       | Number (%)
Abstract Error Message (43)   | Test Failure       | 18 (31%)
                              | Compilation Error  | 12 (21%)
                              | Abnormal Process   |  5 (9%)
                              | Startup Failure    |  8 (14%)
Indirect Error Message (6)    | Publish Failure    |  2 (3%)
                              | Runtime Exception  |  4 (7%)
Unrelated Error Message (7)   | File Missing       |  3 (5%)
                              | Dependency Missing |  4 (7%)
Inaccessible Build File (2)   | Remote File        |  2 (3%)

Among the 84 build failures that cannot be fixed by HoBuFF, 58 cannot be fixed due to the early fault-localization phase, whereas 26 cannot be fixed due to the later patch-generation phase. Fault localization limitation. For most unsuccessful cases caused by the limitation of fault localization, we further categorize them according to the unsuccessful reasons and the corresponding build-failure types, as shown in Table 8. We observe that most of the unsuccessful cases (56 out of 58) get stuck in error-information extraction. For 43 cases, the error messages are too abstract to contain detailed information for localization, so HoBuFF cannot extract buggy configuration information at all. This kind of error message often comes from build failures with test failures, compilation errors, abnormal processes, or startup failures. We list an example with its error message and related manual fix in Table 9. From this example, even experienced developers may not be able to find the relation between the error message and the buggy code.
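To illustrate the kind of message-driven extraction involved here, the hypothetical sketch below pulls candidate configuration keywords out of a Gradle error message with a few simple patterns and returns nothing for abstract messages such as "There were failing tests". This is only an illustration of the problem: HoBuFF's actual extraction relies on lightweight NLP techniques rather than a fixed list of regular expressions, and none of the names below belong to any real tool.

import java.util.*;
import java.util.regex.*;

// Illustrative only: a crude keyword extractor for Gradle error messages,
// showing why "abstract" messages yield nothing to localize on.
public class ErrorMessageSketch {

    private static final List<Pattern> HINTS = List.of(
            Pattern.compile("failed to find (.+?) revision ([\\w.]+)"),      // e.g., Build Tools revision
            Pattern.compile("Could not find ([\\w.:-]+)"),                   // missing dependency coordinate
            Pattern.compile("No such property: (\\w+)"),                     // unknown property
            Pattern.compile("(/[\\w./-]+) \\(No such file or directory\\)")  // missing file path
    );

    static List<String> extractKeywords(String errorMessage) {
        List<String> keywords = new ArrayList<>();
        for (Pattern p : HINTS) {
            Matcher m = p.matcher(errorMessage);
            while (m.find()) {
                for (int g = 1; g <= m.groupCount(); g++) keywords.add(m.group(g));
            }
        }
        return keywords; // empty for abstract messages like "There were failing tests."
    }

    public static void main(String[] args) {
        System.out.println(extractKeywords("failed to find Build Tools revision 26.0.2"));
        System.out.println(extractKeywords("There were failing tests. See the report at ..."));
    }
}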
Since HoBuFF cannot extract the buggy configuration from the error message, one potential solution is to utilize extra information, such as stack traces or report files; HoBuFF can be extended with such information in the future. Furthermore, HireBuild also fails to fix these failures, since such fixes can hardly be learned from the history either. For the 6 cases of indirect error messages, caused by runtime exceptions and publish failures, the error messages contain fragmented and indirect information composed in a complex way, from which it is hard to extract error information. For the 7 cases of unrelated error messages, caused by missing files or dependencies, the configuration element reported by the error message is not the real cause of the build failure. Table 10 shows an example. With the error message, HoBuFF identifies Line 7 as the buggy statement and tries to assign a correct value to this dependency. However, the real cause is that the build script does not include a proper central repository (i.e., jcenter()), and mavenCentral does not contain the needed com.android.tools.build:gradle above 2.3.1. HoBuFF considers jcenter as the default repository when searching for third-party dependencies, without considering that the buggy build script fails to include jcenter. To deal with such a build failure, HoBuFF would need to include some extra common-sense rules during root-cause localization. Note that this is the only build failure in our study that is fixed by HireBuild but not by HoBuFF. HireBuild can fix this failure because its training set contains a very similar failure fixed in this way, which implies that history and present information are complementary; we will further improve HoBuFF by considering history information in the future. For the remaining 2 cases, HoBuFF cannot map the buggy configuration in the build script because the bug-revealing statement lies in a remote source file that is not accessible (i.e., not in a local path).
Patch generation limitation. The remaining 26 cases are those whose root causes are correctly located but for which no correct patch is generated in the patch-generation phase. HoBuFF does not fix such

Table 7: Build Failure with Wrong Revision Number
Error Message:
* What went wrong:
> A problem occurred configuring project ':lib'
> failed to find Build Tools revision 26.0.2
Patch Generated by HoBuFF:
2    ext {
3  -   _buildToolsVersion = 26.0.2 }
3  +   _buildToolsVersion = 26.0.3 }
     ...
24   android {
25     buildToolsVersion = _buildToolsVersion }
Patch Generated by HireBuild:
2    ext {
3      _buildToolsVersion = 26.0.2 }
     ...
24   android {
25 -   buildToolsVersion = _buildToolsVersion }
25 +   buildToolsVersion = 26.0.3 }
Table 7 presents another real-world build failure in our study and its fixes generated by both HoBuFF and HireBuild. In particular, the error message shows that the value of "Build Tools revision" is wrong. With this message, HoBuFF finds the buggy configuration in the build file, which is Line 25 in the table, i.e., buildToolsVersion = _buildToolsVersion. buildToolsVersion is an option of the Android plug-in, and it declares the version number of the build tools. However, this line is not the root cause and only triggers the failure. Based on the fault-localization component, HoBuFF identifies the real root cause of this build failure, which is Line 3 (i.e., _buildToolsVersion = 26.0.2). For comparison, we also list the patch generated by HireBuild in the last row, which is different from the manual patch. Although this patch can also fix the build failure (and has been counted as a successful fix), HireBuild introduces dead code (i.e., Line 3, _buildToolsVersion = 26.0.2, becomes dead code) into the build file and thus a bad code smell. Moreover, this case also demonstrates the importance of precise fault localization in build-failure fixing.

5.3.2 Case Analysis for the Remaining Failures. Although HoBuFF outperforms the state-of-the-art HireBuild to a large extent, it does not fix all the build failures. Therefore, in this subsection we investigate the remaining build failures that cannot be fixed by our approach, to learn its limitations in fault localization and patch generation.

Table 9: Example on Failing Test
Error Message:
* What went wrong:
> Execution failed for task ':library:connectedDebugAndroidTest'
> There were failing tests. See the report at file:/home/travis/build/grandcentrix/tray/library/build/reports/androidTests/connected/index.html
Manual Fix:
38 - timeOutInMs 30000
38 + timeOutInMs 300000

Figure 3: Fixing Time (s) (violin plots for HireBuild and HoBuFF)
Table 10: Example on Wrong Mapping
Figure 4: NCP Score (#) (violin plots for HireBuild and HoBuFF)

less than 800 seconds). Moreover, although their fixing times are distributed in a close range, the fixing time of HoBuFF concentrates in a smaller range (i.e., less than 200 seconds) than that of HireBuild, demonstrating the superiority of HoBuFF in terms of fixing time for successful fixes. For the 8 failures fixed by both HoBuFF and HireBuild, HoBuFF takes 118 seconds on average while HireBuild takes longer, 196 seconds. For the remaining unfixed failures, HoBuFF consumes 606 seconds on average and 1110 seconds at most before reporting its incapability. However, HireBuild requires much longer (i.e., 5 hours on average and 14 hours at most) to report its incapability. The reason is that HireBuild's learning process creates many unnecessary patches (shown in the next paragraph). We also analyzed the Number of Candidate Patches (NCP) generated and evaluated before a valid patch is found in the successful repair cases. Similar to existing work on program repair [53], a smaller NCP score indicates that fewer patches are validated during the repair process, which implies higher performance. Figure 4 presents a violin plot of the distribution of NCP scores for HoBuFF and HireBuild.
From this �gure, both techniques distribute in a small range. The NCP scores of HoBuFF are distributed in a smaller range (i.e., 1-5). Such conclusions are consistent with the observations in time consumption. That is, as the number of candidate patches validated before �nding the correct patch is small, HoBuFF can �x the build failure quickly. Besides, we also investigate the generated patches when the corresponding repair technique does not �x a build failure. Since none of these patches are correct, it is also preferable to generate a small number of candidate patches. On average, HoBuFF and HireBuild generate 77 and 270 patches, respectively, further demonstrating the e�ciency of HoBuFF. Error Message * What went wrong: > A problem occurred con�guring root project ’MVPArms’. > Could not resolve all �les for con�guration ’:classpath’. > Could not �nd com.android.tools.build:gradle:2.3.1 Manual Fix 2 repositories { 3 mavenCentral() + jcenter() ... 6 dependencies { 7 classpath com.android.tools.build:gradle:2.3.1 Table 11: Example on Unsuccessful Patch Generation Error Message * What went wrong: > A problem occurred evaluating script. > No such property: pom_name for class: > org.gradle.api.publication.maven.internal.pom Manual Fix 29 repositories { if (project.hasProperty(’pom_name’)) { 30 + repositories.mavenInstaller { 31 + ... 38 + organization 40 ... } 48 - failures because their �xes require to modify plenty of lines in the build �le and the modi�cation involves complex logic, rather than reset the value of con�guration elements. Table 11 shows one example case. In particular, HoBuFF extracts suspicious con�guration < pom_name, null > and tries to �x the value of the property pom_name by update, insertion and deletion operators but the correct patch is not generated at all. The last row shows the manual patch for this build failure, which actually includes a large amount of complex modi�cation (i.e, adding 59 lines and deleting 58 lines). Moreover, such modi�cation requires comprehensive understanding of the project, which is beyond the current capability of HoBuFF. Such a failure whose �xing requires many complex modi�cations is still an open problem in both source-code repair [13, 22, 30– 33, 39, 41, 48, 58, 64, 71–73] and build-failure repair [23, 45], which may be further explored in the future. 5.5 Threats to Validity The threat to internal validity lies in the implementation of the build failure �xing techniques studied in the experimental study. To reduce this threat, we reused the code of HireBuild and used the mature Groovy Parsing APIs to implement HoBuFF. Moreover, the �rst two authors manually reviewed HoBuFF code carefully. The threat to external validity mainly lies in the datasets used. Before this study, Hassan et al. [23] have released a dataset of 175 Gradle build bugs when proposing the state-of-art HireBuild. In their work, 135 bugs have been already used as the training set for HireBuild, and among the left 40 bugs, they used the 24 successfullyreproduced bugs for evaluation. We tried to reproduce the 24 bugs as the process described in Section 3.1 and only successfully reproduced 5 bugs due to various reasons mentioned in Section 3.1. We have applied HoBuFF and HireBuild on the 5 reproducible bugs and they both can successfully �x 3 of them. 
We also build a new dataset, which is the largest dataset of reproducible real build failures from GitHub, to ensure that our experimental results can more 5.4 RQ5: Build-Failure Fixing E�ciency Besides the number of �xed bugs, it is also interesting to study the time spent by a bug-�xing technique. Even if the existing bug�xing techniques (including HireBuild and HoBuFF) can �x the same build failures, a technique with quick response (i.e., �xing a build failure quickly and notifying incapability quickly) is preferable. Therefore, in this subsection we analyze bug-�xing time by considering both successfully-�xed failures and un�xed failures. For the failures successfully �xed, HoBuFF spends 156 seconds on average, indicating the e�ciency of HoBuFF in �xing real-world build failures. We further draw a violin plot to compare the �xing time between HoBuFF and HireBuild in Figure 3. From this �gure, neither technique spends long time in successfully-�xed cases (i.e., 51 — 121 — ISSTA ’19, July 15–19, 2019, Beijing, China Yiling Lou, Junjie Chen, Lingming Zhang, Dan Hao, and Lu Zhang 2018 likely generalize to more build failures in the wild compared to prior work. Furthermore, our large dataset can be viewed as two Meeting di�erent sub-datasets eachProceedings with half of the bugs based on the build timestamp. We observe that the earlier half already includes all the necessary information to design HoBuFF. We then can view the later half as the test set. In this way, HireBuild �xes 7/2 bugs, while HoBuFF �xes 10/8 bugs for the ealier/later subset, further demonstrating the generality of HoBuFF over prior work. The threat to construct validity mainly lies in the metrics used. To reduce the threat, we adopt the widely used metrics in programrepair literature, e.g., the number/ratio of �xed bugs and time cost [67, 69]. Note that following prior work [23], we did not compare generated and manual patches for checking patch correctness since there can be over one solution to �x a build failure (Section 5.2). To further reduce this threat, we manually checked all 18 HoBuFF patches: 10/3 patches are syntactically/semantically equivalent to manual patches, and the rest 5 are all valid alternative patches. code �xing, while the existing techniques target at source code; (2) HoBuFF employs both internal and external searching during patch generation, and also constructs external knowledge graph from o�cial documents/sources for candidate generation; (3) HoBuFF utilizes lightweight NLP techniques and data�ow analysis to reduce the search space for the speci�c build-failure �xing problem. Build System Maintenance. Recently, the research work on build system maintenance mainly (but not limited to) focuses on empirical study of build-failures and build-failure detection/debugging. In particular, Sulír et al. [61] and Tufano et al. [63] investigated build errors from open source projects in Java. Hyunmin et al. [57] investigated build errors at Google. Both studies demonstrated that build failures occur frequently in practice. To facilitate build failure detection, Wolf et al. [19] proposed to predict build failures via social network analysis on developer communication. Tamrawi et al. [62] proposed a build code smell detection approach, which statically analyzes build code via symbolic evaluation. Besides, Adams et al. 
[11] utilized a �exible directed acyclic graph to model dependency graph for a build system, which may ease the understanding of a build system so as to reduce the possibility of build failure occurrence. Few work in the literature studies how to automatically �x a build failure. Al-Kofahi et al [12] proposed a fault localization approach for Make�le, which collects and analyzes dynamic execution trace of build code for precise fault localization. Macho et al. [45] designed three strategies based on frequently occurring repair types to �x only dependency-related build failures for Maven projects. Recently, Hassan and Wang [23] proposed a general-purpose history-based automatic build-failure �xing approach, HireBuild, which learns �xing patterns from historical �xes and feeds them into the buggy script. Our HoBuFF also targets at general-purpose build-failure �xing. However, HireBuild focuses on learning from the history, while HoBuFF searches from the present projects/resources. 6 RELATED WORK Program Repair. Automatic program repair is now attracting increasing research interests. There exist various techniques for �xing general bugs [13, 21, 24, 30–33, 48, 58, 64, 71, 72], concurrent bugs [39, 41, 73], and even tests [20, 60]. This section mainly discusses the closely related Generate&Validate (G&V) techniques: Search-based techniques explore the search space of �x templates and validate them heuristically. GenProg [35], one of the earliest and representative search-based APR techniques, searches for correct patches via genetic programming. To reduce the repair time cost of GenProg, RSRepair [52] searches among all candidates randomly, while AE [66] uses a deterministic repair algorithm. To generate high-quality �x patterns, PAR [29] learns various types of �xing templates via manually reviewing human written patches, and leverages them during candidate patch generation. Recently, more and more search-based techniques have been proposed: HDRepair [34] automatically mines historical data to help search correct patches; Elixir [56] uses machine learning to prioritize patches for faster repair; CapGen [67] utilizes context information to rank mutation operators [27, 75] and patches for fast patch generation; SketchFix [25] and JAID [18] reduce patch generation costs via sketching and meta-program encoding, respectively; SimFix [28] searches for similar code snippets from the current project under test for potential �xes; PraPR [21] recently demonstrates that even simple template-based APR mutators can outperform state-of-the-art APR techniques, and shows that bytecode-level repair can achieve over 10X speedup over existing techniques. Semantics-based APR techniques use constraints to generate correct-by-construction patches via formal veri�cation or speci�cations. SPR [42] leverages mutation operators to generate candidate patches and also applies condition synthesis via symbolic execution. Prophet [43] automatically learns from correct patches to rank candidate patches generated by SPR for faster repair. SemFix [48] derives repair constraints from tests and solves the repair constraints to generate valid patches. Angelix [47] is a more recent lightweight semantics-based repair technique that scales up to large programs. 
HoBuFF can also be categorized as search-based techniques, but is di�erent from existing techniques: (1) HoBuFF targets at build 7 CONCLUSION In this paper, we attempt to investigate the potential strengths and limitations of state-of-the-art history-driven build-failure �xing technique, HireBuild. To this end, we construct a new and large real-world build-failure dataset from Top-1000 GitHub projects. Then, we evaluate HireBuild on the extended dataset with both quantitative and qualitative analysis. Inspired by the �ndings of the study, we propose a history-oblivious technique, HoBuFF, which locates buggy con�gurations through lightweight data�ow analysis and then generates potential patches via searching from the present project under test and external resources (rather than the historical �x information). The experimental results demonstrate that the simplistic approach based on present information successfully �xes 2X more reproducible build failures than the state-of-art HireBuild and is much faster. Furthermore, our results also reveal various �ndings/guidelines for future advanced build failure �xing. ACKNOWLEDGEMENTS This work was partially supported by the National Key Research and Development Program of China under Grant No. 2017YFB1001803, the National Natural Science Foundation of China under Grant Nos. 61872008 and 61861130363, and the National Science Foundation under Grant Nos. CCF-1566589 and CCF-1763906. 52 — 122 — History-Driven Build Failure Fixing: How Far Are We? ISSTA ’19, July 15–19, 2019, Beijing, China Automation [32] Xuan-Bach D Le, Duc-Hiep Chu, Software David Lo, Claire Le Goues, and Willem Visser. 2017. S3: syntax-and semantic-guided repair synthesis via in the Bigprogramming Data byEra: examples. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software and Opportunities Engineering.Challenges ACM, 593–604. [33] Xuan-Bach D Le, Quang Loc Le, David Lo, and Claire Le Goues. 2016. Enhancing automated program repair with deductive veri�cation. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME),. IEEE, 428–432. [34] Xuan-Bach D Le, David Lo, and Claire Le Goues. 2016. History driven automated program repair. In SANER. [35] Claire Le Goues, Michael Dewey-Vogt, Stephanie Forrest, and Westley Weimer. 2012. A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each. In Proceedings of the 34th International Conference on Software Engineering,. IEEE, 3–13. [36] Vladimir I Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, Vol. 10. 707–710. [37] Xia Li, Wei Li, Yuqun Zhang, and Lingming Zhang. 2019. DeepFL: Integrating Multiple Fault Diagnosis Dimensions for Deep Fault Localization. In ISSTA. to appear. [38] Xia Li and Lingming Zhang. 2017. Transforming programs and tests in tandem for fault localization. Proceedings of the ACM on Programming Languages 1, OOPSLA (2017), 92. [39] Huarui Lin, Zan Wang, Shuang Liu, Jun Sun, Dongdi Zhang, and Guangning Wei. 2018. PFix: �xing concurrency bugs based on memory access patterns. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. ACM, 589–600. [40] Yun Lin, Jun Sun, Yinxing Xue, Yang Liu, and Jinsong Dong. 2017. Feedback-based debugging. In Proceedings of the 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE). IEEE, 393–403. [41] Peng Liu, Omer Tripp, and Charles Zhang. 2014. Grail: context-aware �xing of concurrency bugs. 
In Proceedings of the 22nd ACM SIGSOFT international symposium on foundations of software engineering. ACM, 318–329. [42] Fan Long and Martin Rinard. 2015. Staged program repair with condition synthesis. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. ACM, 166–178. [43] Fan Long and Martin Rinard. 2016. Automatic patch generation by learning correct code. ACM SIGPLAN Notices 51, 1 (2016), 298–312. [44] Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and Darko Marinov. 2014. An empirical analysis of �aky tests. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 643–653. [45] Christian Macho, Shane McIntosh, and Martin Pinzger. 2018. Automatically repairing dependency-related build breakage. In 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 106–117. [46] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations. 55–60. [47] Sergey Mechtaev, Jooyong Yi, and Abhik Roychoudhury. 2016. Angelix: Scalable multiline program patch synthesis via symbolic analysis. In Proceedings of the 38th international conference on software engineering. ACM, 691–701. [48] Hoang Duong Thien Nguyen, Dawei Qi, Abhik Roychoudhury, and Satish Chandra. 2013. Sem�x: Program repair via semantic analysis. In Proceedings of the 35th International Conference on Software Engineering (ICSE). IEEE, 772–781. [49] Flemming Nielson, Hanne R Nielson, and Chris Hankin. 2015. Principles of program analysis. Springer. [50] Spencer Pearson, José Campos, René Just, Gordon Fraser, Rui Abreu, Michael D Ernst, Deric Pang, and Benjamin Keller. 2017. Evaluating and improving fault localization. In Proceedings of the 39th International Conference on Software Engineering. IEEE Press, 609–620. [51] Alexandre Perez, Rui Abreu, and Marcelo d’Amorim. 2017. Prevalence of singlefault �xes and its impact on fault localization. In 2017 IEEE International Conference on Software Testing, Veri�cation and Validation (ICST). IEEE, 12–22. [52] Yuhua Qi, Xiaoguang Mao, Yan Lei, Ziying Dai, and Chengsong Wang. 2014. The strength of random search on automated program repair. In Proceedings of the 36th International Conference on Software Engineering. ACM, 254–265. [53] Yuhua Qi, Xiaoguang Mao, Yan Lei, and Chengsong Wang. 2013. Using automated program repair for evaluating the e�ectiveness of fault localization techniques. In Proceedings of the 2013 International Symposium on Software Testing and Analysis. ACM, 191–201. [54] Adwait Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In Conference on Empirical Methods in Natural Language Processing. [55] Thomas Rausch, Waldemar Hummer, Philipp Leitner, and Stefan Schulte. 2017. An empirical analysis of build failures in the continuous integration work�ows of Java-based open-source software. In Proceedings of the 14th International Conference on Mining Software Repositories. IEEE Press, 345–355. [56] Ripon K Saha, Yingjun Lyu, Hiroaki Yoshida, and Mukul R Prasad. 2017. Elixir: E�ective object-oriented program repair. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 648–659. REFERENCES [1] 2019. Android Gradle DSL URL. https://google.github.io/android-gradle-dsl/ [2] 2019. Ant. 
http://ant.apache.org [3] 2019. Github SNAPSHOT. https://www.Github.com,accessonJan$31^{st}$2018. [4] 2019. Gradle. https://gradle.org [5] 2019. Gradle Central Repository URL. https://jcenter.bintray.com [6] 2019. Gradle DSL Document URL. https://docs.gradle.org/current/dsl/index.html [7] 2019. GROOVY AST API. http://docs.groovy-lang.org/2.4.7/html/api/org/ codehaus/groovy/ast/package-summary.html [8] 2019. Manual Patch of Mockito/db8a3f3. https://github.com/mockito/mockito/ compare/db8a3f3�2e4...4752e4fb0772 [9] 2019. Maven. http://maven.apache.org [10] 2019. Travis. https://travis-ci.org [11] Bram Adams, Herman Tromp, Kris De Schutter, and Wolfgang De Meuter. 2007. Design recovery and maintenance of build systems. In Proceedings of the IEEE International Conference on Software Maintenance. IEEE, 114–123. [12] Jafar Al-Kofahi, Hung Viet Nguyen, and Tien N Nguyen. 2014. Fault localization for build code errors in make�les. In Companion Proceedings of the 36th International Conference on Software Engineering. ACM, 600–601. [13] Earl T Barr, Yuriy Brun, Premkumar Devanbu, Mark Harman, and Federica Sarro. 2014. The plastic surgery hypothesis. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 306–317. [14] Moritz Beller, Georgios Gousios, and Andy Zaidman. 2017. Travistorrent: Synthesizing travis ci and github for full-stack research on continuous integration. In Proceedings of the 14th International Conference on Mining Software Repositories. IEEE press, 447–450. [15] José Campos, Rui Abreu, Gordon Fraser, and Marcelo d’Amorim. 2013. Entropybased test generation for improved fault localization. In Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering. IEEE Press, 257–267. [16] Satish Chandra, Emina Torlak, Shaon Barman, and Rastislav Bodik. 2011. Angelic debugging. In Proceedings of the 33rd International Conference on Software Engineering. 121–130. [17] Junjie Chen, Jiaqi Han, Peiyi Sun, Lingming Zhang, Dan Hao, and Lu Zhang. 2019. Compiler Bug Isolation via E�ective Witness Test Program Generation. In Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM. to appear. [18] Liushan Chen, Yu Pei, and Carlo A Furia. 2017. Contract-based program repair without the contracts. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 637–647. [19] Timo Wolf Adrian Schröter Daniela Damian and Thanh Nguyen. [n.d.]. Predicting Build Failures using Social Network Analysis on Developer Communication. ([n. d.]). [20] Brett Daniel, Tihomir Gvero, and Darko Marinov. 2010. On test repair using symbolic execution. In Proceedings of the 19th international symposium on Software testing and analysis. ACM, 207–218. [21] Ali Ghanbari, Samuel Benton, and Lingming Zhang. 2019. Practical Program Repair via Bytecode Mutation. In Proceedings of the ACM SIGSOFT International Symposium on Software Testing and Analysis. to appear. [22] Divya Gopinath, Sarfraz Khurshid, Diptikalyan Saha, and Satish Chandra. 2014. Data-guided repair of selection statements. In Proceedings of the 36th International Conference on Software Engineering. ACM, 243–253. [23] Foyzul Hassan and Xiaoyin Wang. 2018. HireBuild: an automatic approach to history-driven repair of build scripts. In Proceedings of the 40th International Conference on Software Engineering. ACM, 1078–1089. [24] Mei Hong and Lu Zhang. 2018. 
Can big data bring a breakthrough for software automation? Science China Information Sciences 61 (2018), 056101. [25] Jinru Hua, Mengshi Zhang, Kaiyuan Wang, and Sarfraz Khurshid. 2018. Towards practical program repair with on-demand candidate generation. In Proceedings of the 40th International Conference on Software Engineering. ACM, 12–23. [26] Md Rakibul Islam and Minhaz F Zibran. 2017. Insights into continuous integration build failures. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR). IEEE, 467–470. [27] Yue Jia and Mark Harman. 2011. An analysis and survey of the development of mutation testing. IEEE TSE 37, 5 (2011), 649–678. [28] Jiajun Jiang, Yingfei Xiong, Hongyu Zhang, Qing Gao, and Xiangqun Chen. 2018. Shaping Program Repair Space with Existing Patches and Similar Code. (2018). [29] Dongsun Kim, Jaechang Nam, Jaewoo Song, and Sunghun Kim. 2013. Automatic patch generation learned from human-written patches. In Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, 802–811. [30] Anil Koyuncu, Kui Liu, Tegawendé F Bissyandé, Dongsun Kim, Jacques Klein, Martin Monperrus, and Yves Le Traon. 2018. Fixminer: Mining relevant �x patterns for automated program repair. arXiv preprint arXiv:1810.01791 (2018). [31] Xuan-Bach D Le, Duc-Hiep Chu, David Lo, Claire Le Goues, and Willem Visser. 2017. JFIX: semantics-based repair of Java programs via symbolic PathFinder. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis. ACM, 376–379. 53 — 123 — ISSTA ’19, July 15–19, 2019, Beijing, China Yiling Lou, Junjie Chen, Lingming Zhang, Dan Hao, and Lu Zhang 2018 [57] Hyunmin Seo, Caitlin Sadowski, Sebastian Elbaum, Edward Aftandilian, and Robert Bowdidge. 2014. Programmers’ build errors: a case study (at google). In Proceedings Meeting of the 36th International Conference on Software Engineering. ACM, Proceedings 724–734. [58] Edward K Smith, Earl T Barr, Claire Le Goues, and Yuriy Brun. 2015. Is the cure worse than the disease? over�tting in automated program repair. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. ACM, 532–543. [59] Manu Sridharan, Stephen J Fink, and Rastislav Bodik. 2007. Thin slicing. ACM SIGPLAN Notices 42, 6 (2007), 112–122. [60] Andrea Stocco, Rahulkrishna Yandrapally, and Ali Mesbah. 2018. Visual web test repair. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 503–514. [61] Matúš Sulír and Jaroslav Porubän. 2016. A quantitative study of java software buildability. In Proceedings of the 7th International Workshop on Evaluation and Usability of Programming Languages and Tools. ACM, 17–25. [62] Ahmed Tamrawi, Hoan Anh Nguyen, Hung Viet Nguyen, and Tien N Nguyen. 2012. SYMake: a build code analysis and refactoring tool for make�les. In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering. ACM, 366–369. [63] Michele Tufano, Fabio Palomba, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto, Andrea De Lucia, and Denys Poshyvanyk. 2017. There and back again: Can you compile that snapshot? Journal of Software: Evolution and Process 29, 4 (2017), e1838. [64] Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2018. Learning How to Mutate Source Code from Bug-Fixes. arXiv preprint arXiv:1812.10772 (2018). 
[65] Yi Wei, Yu Pei, Carlo A Furia, Lucas S Silva, Stefan Buchholz, Bertrand Meyer, and Andreas Zeller. 2010. Automated �xing of programs with contracts. In Proceedings of the 19th international symposium on Software testing and analysis. ACM, 61–72. [66] Westley Weimer, Zachary P Fry, and Stephanie Forrest. 2013. Leveraging program equivalence for adaptive program repair: Models and �rst results. In Proceedings of the 28th International Conference on the Automated Software Engineering (ASE). IEEE, 356–366. [67] Ming Wen, Junjie Chen, Rongxin Wu, Dan Hao, and Shing-Chi Cheung. 2018. Context-Aware Patch Generation for Better Automated Program Repair. ICSE. [68] Timo Wolf, Adrian Schroter, Daniela Damian, and Thanh Nguyen. 2009. Predicting build failures using social network analysis on developer communication. In Proceedings of the 31st International Conference on Software Engineering. IEEE Computer Society, 1–11. [69] Yingfei Xiong, Jie Wang, Runfa Yan, Jiachen Zhang, Shi Han, Gang Huang, and Lu Zhang. 2017. Precise condition synthesis for program repair. In Proceedings of the 39th International Conference on Software Engineering. IEEE Press, 416–426. [70] Tianyin Xu, Xinxin Jin, Peng Huang, Yuanyuan Zhou, Shan Lu, Long Jin, and Shankar Pasupathy. 2016. Early Detection of Con�guration Errors to Reduce Failure Damage.. In OSDI. 619–634. [71] Jooyong Yi, Umair Z Ahmed, Amey Karkare, Shin Hwei Tan, and Abhik Roychoudhury. 2017. A feasibility study of using automated program repair for introductory programming assignments. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. 740–751. [72] Jooyong Yi, Shin Hwei Tan, Sergey Mechtaev, Marcel Böhme, and Abhik Roychoudhury. 2018. [Journal First] A Correlation Study Between Automated Program Repair and Test-Suite Metrics. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). IEEE, 24–24. [73] Tingting Yu and Michael Pradel. 2018. Pinpointing and repairing performance bottlenecks in concurrent programs. Empirical Software Engineering 23, 5 (2018), 3034–3071. [74] Lingming Zhang, Miryung Kim, and Sarfraz Khurshid. 2011. Localizing failureinducing program edits based on spectrum information. In ICSM. IEEE, 23–32. [75] Lingming Zhang, Tao Xie, Lu Zhang, Nikolai Tillmann, Jonathan De Halleux, and Hong Mei. 2010. Test generation via dynamic symbolic execution for mutation testing. In 2010 IEEE International Conference on Software Maintenance. 1–10. [76] Lingming Zhang, Lu Zhang, and Sarfraz Khurshid. 2013. Injecting mechanical faults to localize developer faults for evolving software. In OOPSLA. 765–784. [77] Mengshi Zhang, Xia Li, Lingming Zhang, and Sarfraz Khurshid. 2017. Boosting spectrum-based fault localization using PageRank. In ISSTA. 261–272. 54 — 124 — 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE) Software Automation in the Big Data Era: Challenges and Opportunities Inferring Program Transformations From Singular Examples via Big Code † Jiajun Jiang† , Luyao Ren† , Yingfei Xiong† , Lingming Zhang‡ Key Laboratory of High Confidence Software Technologies, Ministry of Education (PKU) Department of Computer Science and Technology, EECS, Peking University, Beijing, China ‡ University of Texas at Dallas, USA {jiajun.jiang, rly, xiongyf}@pku.edu.cn, lingming.zhang@utdallas.edu † A key challenge in transformation inference is to decide what can be generalized in the transformation. 
Abstract—Inferring program transformations from concrete program changes has many potential uses, such as applying systematic program edits, refactoring, and automated program repair. Existing work for inferring program transformations usually relies on statistical information over a potentially large set of program-change examples. However, in many practical scenarios we do not have such a large set of program-change examples. In this paper, we address the challenge of inferring a program transformation from one single example. Our core insight is that "big code" can provide effective guidance for the generalization of a concrete change into a program transformation, i.e., code elements appearing in many files are general and should not be abstracted away. We first propose a framework for transformation inference, where programs are represented as hypergraphs to enable fine-grained generalization of transformations. We then design a transformation inference approach, GenPat, that infers a program transformation based on code context and statistics from a big code corpus. We have evaluated GenPat under two distinct application scenarios, systematic editing and program repair. The evaluation on systematic editing shows that GenPat significantly outperforms a state-of-the-art approach, Sydit, with up to 5.5x as many correctly transformed cases. The evaluation on program repair suggests that GenPat has the potential to be integrated into advanced program repair tools: GenPat successfully repaired 19 real-world bugs in the Defects4J benchmark by simply applying transformations inferred from existing patches, 4 of which have never been repaired by any existing technique. Overall, the evaluation results suggest that GenPat is effective for transformation inference and can potentially be adopted for many different applications.
Index Terms—Pattern generation, Program adaptation, Code abstraction

As an example, let us consider the following change:
f(a, b) =⇒ f(g(a), b)
A possible transformation could be the following one: wrapping with g any element that
• is a variable
• has type integer
• has identifier name a
We may also consider making the transformation more general, such as the following one: wrapping with g any element that
• is a variable
• has type integer
We may also consider the context of the change to make the transformation more specific, such as the following one: wrapping with g any element that
• is a variable
• has type integer
• is the first argument of a call to f
Making the transformation too specific may decrease recall, i.e., miss cases that should be transformed. Making the transformation too general may decrease precision, i.e., transform cases that should not be transformed. Therefore, selecting a suitable level of generalization is critical to the quality of the inferred program transformation. A typical method adopted by many existing techniques [11]–[13] is to learn from many examples, where the statistical information from many examples is used to decide which parts should be kept concrete in the transformation and which parts should be abstracted away. In the above example, if there are many change examples that wrap the first argument of f with g and the first arguments have many different names, we know that the last transformation is the desirable one, and information such as the variable name a should be abstracted away. However, such an approach requires many examples as the training set. In practice, we often do not have so many examples.
For example, Genesis [13] uses hundreds of patches for the same type of bugs to generate transformations, while in practice the repetitiveness of patches tends to be tenuous [14], I. I NTRODUCTION Modern program development is often repetitive, where the same changes are applied over and over again in different positions or in different projects, by the same or different developers. Inferring program transformations from change examples could automate the changes of the same type, and has many potential uses such as systematically editing many places in the source code [1], fixing bugs based on patches of recurring bugs [2]–[4], porting commits among forked projects [5], [6], adapting client code for incompatible API changes [7], [8], refactoring [9], [10], etc. ∗ Yingfei Xiong is the corresponding author. This work was partially done when Jiajun Jiang was a visiting student in UT Dallas. 978-1-7281-2508-4/19/$31.00 ©2019 IEEE DOI 10.1109/ASE.2019.00033 255 — 125 — 2018 based systems, including systematic-editing and programrepair systems. and only one or a few patches can be located for many types of bugs. Meeting On the other hand, aProceedings few approaches [12], [15] have tried to reduce the needed number of examples by using predefined rules to decide what part in the concrete changes should be abstracted away in the transformation, i.e., always ignore the variable names and allow to match variables with any name [15]. However, as shown in the next section, predefined rules cannot capture different situations and often fail to produce the desired transformation. In this paper, we address the challenge of inferring a program transformation from one single example. Our core insight is to learn from “big code”, utilizing the statistical information in a large code corpus to guide the generalization from a change example to a transformation. More concretely, the elements that appear in many files are potentially general and should be kept in the transformation in order to capture the transformation for all such instances. Along this line, we first propose a general framework for transformation inference from an example, where a hypergraph is used to represent source code and fine-grained transformation tuning is enabled by selecting elements and their attributes from the hypergraph. We then instantiate the framework with a transformation inference algorithm that fine-tunes the hypergraph information based on code statistics from a big code corpus. We have already implemented our approach as a tool, G EN PAT, and evaluated G EN PAT in two distinct application scenarios. In the first scenario, we employed G EN PAT to perform systematic editing as studied by Meng et al. [15] but with a much larger dataset. The result shows that G EN PAT significantly outperforms state-of-the-art S YDIT with an up to 5.5x improvement in terms of correctly generated transformations. In the second scenario, we explore the potential of using G EN PAT to repair bugs by simply mining and applying fixing patterns from existing patches. Although not designed as a comprehensive and standalone repair technique, G EN PAT successfully fixed 19 bugs in a subset of the commonly used Defects4J [16] benchmark. Particularly, 4 bugs have never been fixed by any existing technique as far as we know. The results suggest that G EN PAT is potentially useful in both systematic editing and program repair and indicate a promising future for adopting G EN PAT in practical systems with program transformations. 
In summary, this paper makes the following contributions: • A framework for transformation inference from a single example by representing code as a hypergraph to allow fine-grained generalization of the transformation. • An algorithm to instantiate the framework by defining the rules for selection based on the code context and statistics in a large code corpus. • An implementation of the proposed technique in Java language, called G EN PAT, which can be publicly accessed at https://github.com/xgdsmileboy/GenPat. • An evaluation with G EN PAT on two distinct practical application scenarios, showing the effectiveness of the proposed framework and calling for future research to integrate G EN PAT for advanced program-transformation- II. R ELATED W ORK In this section, we introduce the most related works to this paper. Existing techniques have explored different strategies for transformation extraction and two categories of transformations have been proposed. The first one is executable transformations, which can be applied directly to modify a code snippet. The second one is abstract transformations, which cannot be applied directly but constrain a space of possible transformation results. In other words, executable transformations are functions, while abstract transformations are binary relations that are not functions. Abstract transformations are useful in guiding other technical processes, such as ranking candidates in automatic program repair [17]–[19]. In the following, we will introduce the most related approaches from these two categories in detail. Also, we will discuss few-shot learning problem in machine learning domain, of which the transformation inference problem can be seen as an instance. A. Executable Transformation Generation As explained in the introduction, the key challenge of transformation inference is to decide how to generalize a change into a transformation. To approach this challenge, existing techniques proposed to utilize different strategies, such as learning from many examples or employing predefined rules. Learning from many examples. Multiple existing techniques learn practical transformations from a set of examples with similar code changes. The basic idea is that shared code elements across change examples are critical parts for the transformation and should be preserved, while the other parts tend to be specific to some individual examples and thus will be discarded. Andersen et al. [20], [21] proposed spdiff, which extracts a set of term replacement patches from each example, and then takes the longest common subpatches as a transformation pattern. Meng et al. [11] proposed L ASE that learns edit scripts from multiple examples. L ASE extracts a set of edit operations from each example, keeps the common operations, and extracts the context of the common operations to form a transformation. Reudismam et al. [12] proposed R EFAZER, which learns syntactic code transformations from examples. R EFAZER takes a program synthesis perspective and searches for a transformation program that is consistent with all the examples. Long et al. [13] proposed Genesis. It infers a template AST from existing patches, which can cover all mined examples. Bavishi et al. [22] proposed to mine repair strategies (or repair patterns) from examples for fixing bugs reported by static analyzers, which clusters similar edit examples for pattern abstraction (i.e., Synthesis of ContextMatchers) via leveraging a DSL for representation. 
Nguyen et al [23] proposed CPAT M INER that aims to mine semantic code change patterns from code corpus and represents patterns as graphs. CPAT M INER also depends on the repetitiveness of 256 — 126 — Automation B. Abstract Transformation Software as Guidance code changes and leverages graph isomorphism for pattern clustering. Similarly, Molderez et al [24] leveraged frequent itemset mining algorithm to learn edit scripts of code changes from histories of open-source repositories, and employed them for code change recommendation. in techniques the Big Era: Recently, a number of existing wereData proposed Challenges and Opportunities to extract transformations from a set of examples for guiding other technical processes. In particular, transformations are often used in automatic program repair to guide the patch generation as the complete search space can be too huge [33]. For example, Xuan et al. [17] and Wen et al. [19] leveraged transformations from historical bug fixes as program repair templates. Similarly, Jiang et al. [18] proposed to use such transformations to refine the search space of patch generation. Also, other researches proposed to use such transformations for patch prioritization [34]. The core insight behind these techniques is that frequently appeared transformations in history bug fixes have higher possibility to repair a bug, and thus can be utilized to refine the patch space. However, these transformations cannot be directly applied and can be much more abstract compared to those executable transformations. As discussed in the introduction, to achieve a good level of generalizability, these approaches require a non-trivial number of examples, which are often difficult to obtain in practice. For example, in the scenario of program synthesis, these approaches have been successfully applied to only the most common bugs [13] or the bugs in student assignments [12], where a large number of patches can be found for the same type of bug. However, in practice, the repetitiveness of patches tend to be tenuous [14], and only one or a few patches can be located for many types of bugs. Inferring transformations with predefined rules. Several approaches rely on predefined rules to infer a suitable transformation. A typical approach is S YDIT, which also infers a transformation from one example, and is similar to our goal. Given a change, S YDIT first selects all related statements that have dependencies with the changed statement, then abstracts away all names (variable name, type name, method name, etc) in the statements and leaves only the structure of the statements. Then the structure is used to match other places and perform the change. However, there are many cases that we may need to abstract away part of the structure or keep some names in the transformation, where S YDIT cannot extract the desirable transformation. As our evaluation will show later, our approach significantly outperforms S YDIT with an up to 5.5x improvement. Approaches relying on multiple examples may also use predefined rules to select the desired transformation if the examples are not enough to ensure the quality of the transformation. For example, R EFAZER employs a set of rules to rank the transformations if the synthesizer found multiple possible transformations. C. Few-Shot Learning Few-shot learning [35] attempts to train a machine-learning model with a very small set of training data, and is often considered a grand challenge in the machine learning domain. 
A typical approach to few-shot learning is to utilize data that are beyond the current task, and train a meta-level model with these data, which can be used as a basis for the few-shot learning task. Our problem is similar to few-shot learning as we try to generalize a transformation from just one example. Also, we learn meta-level information from big code for the transformation inference, where the idea is similar to the approach of few-shot learning. On the other hand, the current few-shot learning techniques are still designed for classic classification problem over feature vectors, and thus cannot be applied to the transformation inference problem since it is not a classification procedure. III. M OTIVATING E XAMPLE In this section, we motivate the problem of transformation inference with an example in the systematic editing scenario [15]. In a typical systematic editing scenario, the programmer would like to perform the same change on a series of places. She would first change one place, and ask the system to extract a transformation from the change, then navigate to the next place and invoke the transformation there. The process is similar to a copy-paste clipboard operation process except that only the transformation is “copied” and “pasted”. Listing 1 shows an example requiring systematic editing. Here “-” denotes deleted code lines and “+” denotes newly introduced code lines. The grayed description on the top gives the detailed information related to the code changes, including the GitHub link of the corresponding commit, fix message and changed classes. Particularly, there are two separate code changes, where the first one at line 68 is the change from which a transformation would be inferred, and the second one at line 35 is the ideal change that we expect the transformation to produce. Defining transformation manually. There are some other approaches that perform code changes with manually defined transformations. For example, Kim et al. [25] manually defined a set of transformations for automatic program repair after analyzing a corpus of human patches. Similarly, Liu and Zhong [26] defined transformations (a.k.a. repair templates) with analyzing code samples from StackOverflow. Molderez and De Roover [27] proposed to refine a manually defined template with a suite of mutation operations, which recommends changes to the templates iteratively. Additionally, to ease the description of transformations, a set of DSLs have been proposed by previous studies [7], [8], [21], [28]–[32] for program migration or API updating. These techniques provide a way for developers to systematically update a current program with manually defined transformations. However, even with the help of DSL, manually defining transformations is not easy, and automatic transformation inference is desirable in many situations. 257 — 127 — 2018 On the other hand, predefined rules hardly meet the divergent requirements of different situations. For example, S YDIT has a predefined rule to abstract away all variable/method/type names and keep only the structure. However, in this case, it would keep the structure of the first and the third arguments, requiring them to take the o.m() form. It would also discard the name of the method call Description. Both are not desirable. Our approach decides how to generalize the change by analyzing the “big code”. We count the number of files where an element appears in a large code corpus. 
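For instance, a minimal sketch of this file-frequency counting might look like the following hypothetical Java code; GenPat's actual statistics are computed over code elements extracted from ASTs rather than over raw text tokens, so everything here (class name, tokenization, threshold) is illustrative only.

import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of the "big code" statistic used for generalization:
// for each candidate element content (approximated here by simple tokens),
// count in how many distinct files of a corpus it appears.
public class FileFrequencySketch {

    static Map<String, Integer> countFileFrequencies(Path corpusRoot) throws IOException {
        Map<String, Integer> fileCount = new HashMap<>();
        List<Path> javaFiles;
        try (Stream<Path> walk = Files.walk(corpusRoot)) {
            javaFiles = walk.filter(p -> p.toString().endsWith(".java"))
                            .collect(Collectors.toList());
        }
        for (Path file : javaFiles) {
            String text = Files.readString(file);
            Set<String> distinct = new HashSet<>(Arrays.asList(text.split("\\W+")));
            for (String token : distinct) {
                fileCount.merge(token, 1, Integer::sum); // count each token once per file
            }
        }
        return fileCount;
    }

    // An element content is considered "general" (kept in the pattern) if it
    // appears in at least `threshold` files; the threshold is illustrative.
    static boolean isGeneral(Map<String, Integer> freq, String content, int threshold) {
        return freq.getOrDefault(content, 0) >= threshold;
    }
}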
If an element appears in many files, it is probably a general element and should be kept in the transformation to transform other sibling instances, otherwise it is probably specific to the current change and should be abstracted away. In this example, testClass.getName() and testClass.getAnnotations() can be seldomly found in the codebase and thus is abstracted away. On the other hand, Description and null are both frequent and thus are kept in the transformation. While the general idea is simple, realizing the idea is not easy and faces multiple challenges: • Abstraction. We need to have a flexible representation of the transformation, where the level of generalization can be adjusted at a fine-grained level. • Match. The representation should be flexible to allow matching code with different attributes (such as the static value type) or different relations (such as data dependency). • Transformation. The matched code pieces should be consistent with the transformation, i.e., when some code pieces are matched, the transformation must be able to be replayed on these code pieces. In the next section, we will propose a framework for transformation inference to address the above challenges. Commit : github.com/junit-team/junit4/commit/75f7892 Message: Removed nascent category implementation Source : src.main.java.org.junit.runner.Description Meeting Proceedings ========== // first case for pattern generation 67 Description createDescription(Class testClass){ 68 return new Description(testClass.getName(),null, 68 + return new Description(testClass.getName(), 69 testClass.getAnnotations()); 70 } // candidate place to apply the above pattern // Sydit failed to apply the above pattern because the // variable ‘‘name’’ cannot match the method ‘‘getName()’’ // while GenPat successfully applies it. 34 Description createDescription(String name, Annotation... annotations){ 35 return new Description(name,null,annotations); 35 + return new Description(name,annotations); 35 } Listing 1. An example that S YDIT fails to apply pattern. As we can analyze from the two examples, a desirable transformation should delete the second argument of a call to Description if it is null. In other words, the first argument testClass.getName() and the third argument testClass.getAnnotations() are specific to the local change and should not be considered as part of the context of the transformation. The challenge is to know which part should be kept in the transformation and which part should be abstracted away, i.e., deciding how to generalize the change. As discussed previously, existing approaches rely on either multiple examples or predefined rules. However, providing multiple examples is often not desirable or feasible. For example, in systematic editing, the examples are provided by the user, and asking the user to provide multiple and preferably diverse examples significantly increases the cost of using this approach. In the scenario of bug repair, for many types of bugs, only one existing patch can be found, and we have to produce a transformation out of the patch. For example, Listing 2 shows a patch that inserts an equality check between two Object arguments into a method returning boolean. From this patch, our approach successfully inferred a transformation and fixed bug Mockito-22 in Defects4J [16] benchmark, which is shown in Listing 3 and has never been fixed by any previous technique. 
However, we found only one such change instance from more than 1 million historical code change examples of open-source Java projects on GitHub from 2011 to 2016. IV. F RAMEWORK OF T RANSFORMATION I NFERENCE In this section, we introduce the framework of transformation inference. Here we consider a transformation as first a pattern to match code pieces and a sequence of modification operations to change the code pieces. To address the challenges mentioned above, we make the following design decisions. • Match. To ensure the code elements could be flexibly matched by their attributes and relations, we abstract source code into a hypergraph, where the nodes are AST nodes with their attributes (called code elements) and the hyperedges are relations among nodes. • Abstraction. We further introduce a match relation between hypergraphs such that a graph can be matched by a more abstract graph with possibly fewer elements and attributes. In this way, we can abstract a hypergraph into a pattern at a fine-grained level by selecting elements and attributes that should be kept in the pattern. • Transformation. To ensure the matched code elements are transformable, we use elements and attributes as the Commit : github.com/clitnak/mcrailo/commit/8e76da8 Message: solved ticket #RAILO-2411 Source : railo-java.railo-core.src.railo.runtime.op.Operator ========== 526 boolean _equalsComplexEL(Object left,Object right,...){ 527+ if(left==right){ 528+ return true; 529+ } 530 if(Decision.isSimpV(left)&&Decision.isSimpV(right)){ Listing 2. Referenced history patch to fix Mockito-22. 12 public static boolean areEqual(Object o1,Object o2){ 13 + if(o1==o2){ 14 + return true; 15 + } 16 if(o1==null||o2==null){ Listing 3. Patch of Mockito-22. 258 — 128 — Software Automation hypergraph match those of the target code elements. Formally, inelements. the Big Data Era: we first define the match between interface between the pattern and the modifications. The modifications specify the elements and attributes that must be matched to make the transformation applicable, and the pattern ensures to match these elements and attributes. Definition 3Challenges (Element Match). Anand elementOpportunities �id, attrs� is said � to match another element �id , attrs� �, if ∀ �name, value� ∈ attrs, �name, value� ∈ attrs� . Now we introduce the design in detail. We start by defining code elements. Intuitively, a code element captures a node in an AST, and the attributes of the AST node that we are interested in. Based on the match between code elements, we define the match between hypergraphs. Definition 4 (Hypergraph Match). A code hypergraph �E, R� matches another code hypergraph �E � , R� � via a mapping match : E → E � such that ∀e ∈ E, e matches match(e) and ∀ �rname, r� ∈ R, ∃ �rname� , r� � ∈ R� , rname = rname� ∧ r ⊆ r� . Definition 1 (Code Element). A (code) element is a pair �id, attrs� where id is an element ID and attrs is a set of attributes, where each attribute is a pair �name, value�. In our current implementation, we mainly consider three attributes, AST node type (such as Statement or Variable), content (such as a+b or >=, which is the string representation of the complete subtree), and static value type (such as String or int). The code element captures a single AST node and its attributes, but not the relation between AST nodes. To capture the relations, we further define code hypergraph as a collection of code elements and their relations. 
We say a code hypergraph g is more abstract than another code hypergraph g � if there exists a match from g to g � . Given a code hypergraph, we can abstract it into a pattern by removing elements and attributes from the hypergraph. The result is ensured to match the original hypergraph. In this way, we turn the generalization problem into a problem of selecting elements and attributes in a hypergraph, where the selected elements, their selected attributes, and their relations form a new hypergraph as a pattern. Please note that our framework also allows to select relations, but in this paper we only consider the selections of elements and attributes and keep all relations among the selected elements. For example, in Figure 1, the red elements, their red attributes, and their relations form a new hypergraph that would match both code snippets. The elements with solid frame are the matched elements while the elements in the dashed frame are not matched. After an element is matched, we can apply the modification operations to the elements. Our framework does not enforce a particular set of modification operations and treats modifications as uninterpreted atomic elements, denoted by set M . Furthermore, we assume the existence of two functions, preIDs and preAttrs. Function preIDs(m) denotes the element IDs that must be matched to make the modification m feasible. Function preAttrs(m, id) returns the attribute name on element id that must be matched to make the modification m feasible, where id ∈ preIDs(m). In our current approach we consider the following types of modifications. � � • insert(id, id , i): inserts an AST subtree rooted at id as th the i child of the element id. th child • insert str(id, str, i): inserts the text str as the i of the element id. � • replace(id, id ): replaces an AST subtree rooted at id with another AST rooted at id� . • replace str(id, str): replaces an AST subtree rooted at id with the text str. � • delete(id, id ): deletes an AST subtree rooted at id from � its parent id . For any modification m of the above modification type, preIDs(m) returns the set of element IDs appearing as the argument, e.g., preIDs(insert(id, id� , i)) = {id, id� }; Definition 2 (Code Hypergraph). A (code) hypergraph is a pair �E, R�, where E is a set of elements and R is a set of hyperedges, where each hyperedge is a pair �rname, r� consisting of a relation name rname and a relation r ⊆ E k for some k, where E k denotes the k-ary Cartesian power of E. The relation r can be either directed or undirected, but in our implementation, we consider mainly three directed relations, the parent relation in an AST, the ancestor relation which is the transitive closure of the parent relation, and the intra-procedural data dependency between l-values in the program. We only consider data dependencies (ignoring control-flow dependencies) to avoid over-approximations [36], [37] for lightweight analysis. Please also note that when the parent relation is included, a hypergraph subsumes an AST. Additionally, the ancestor relation is necessary as it may still guarantee the program structure match even when two nodes do not have direct parent-child relation. For example, Figure 1 shows the code hypergraph of the two code snippets in Listing 1. Each node in the graph represents a code element, where their IDs and attributes are listed. Three types of relations are shown in the graph, the black lines represent the parent relation and the blue lines represent the data dependency relation. 
Note that there is no data dependency between p3 and node p6 while the omitted child node of p3 has data dependency on p6 . For clarity, we ignore the ancestor relation in the figure, e.g., node p1 is the ancestor of node p3 . After we have a code hypergraph, we can define a pattern that matches elements in the graph. Here we treat a pattern uniformly also as a hypergraph. A pattern matches some code elements if both the attributes and the relations on the pattern 259 — 129 —    2018           Meeting Proceedings     $!!!%  (#) (#)       !!!         ! $%                            $!!!%                     "   "                !!!     &'        !!!           ! $%    &'           !            !                          $!!!%             !           $!!!%                        !                &'    !     ! Fig. 1. Transformation instance inferred from the first case in Listing 1 and its matched instance. In the figure, we use “...” to represent omitted code content for simplicity. Besides, the ancestor relations are omitted as well. preAttrs(m, id) always returns “AST node type” for any id ∈ preIDs(m), as we need to keep consistency of the node type to ensure the AST is well-formed. In our running example, the change can be captured by the modification delete(p4 , p2 ), while preIDs requires p4 and p2 to be matched, and preAttrs requires the matched elements have the same AST node types. Finally, we give the definition of a transformation. • Elements. We parse the code and extract the AST nodes. • AST node type, content, parent relation, ancestor relation. We directly obtain them from the AST. • Value type. We apply type analysis in Eclipse JDT to infer the value types of all expressions and parameters. For the rest of the elements (i.e., statements), we set its value type to ⊥. • Data dependency. We perform a simple flow-insensitive intra-procedural define-use analysis [38], [39] to extract data dependency relations. The variables are assumed to have no aliases during the analysis. We assume the change occurs within a method, and consider only the code within the method body in our current implementation. Definition 5 (Transformation). A transformation is a pair �g, m�, � where g is a code hypergraph and m � is a sequence of modifications such that for any m ∈ m, � id ∈ preIDs(m) and attrN ame ∈ preAttrs(m, id), there exists an element �id� , attrs� � in g such that id = id� and attrs� contains attrN ame. Given a code hypergraph g � = �E � , R� �, a transformation �g, m�, � and a match match from g to g � , applying the transformation generates a sequence of modifications m[id � 0 \id�0 , . . . , idn \id�n ] where �id0 , id�0 � , . . . , �idn , id�n � ∈ match. In other words, the element IDs in original sequence of modifications are replaced by the matched element IDs. Then we apply the sequence of modifications to obtain the changed code. B. Extracting Modifications V. T HE G EN PAT A PPROACH Based on the framework, we can now proceed to our approach. Given two code snippets before and after the change, our approach (1) extracts a code hypergraph from the snippet before the change, (2) extracts a sequence of modifications by comparing the two snippets, (3) infers a transformation by selecting elements and attributes from the hypergraph, and (4) matches and applies the transformation when given a new code snippet. In this section, we introduce how we implement the four components. To infer a transformation, we select elements and attributes from the hypergraph. 
Please note that we do not select relations in this paper and consider all relations among the selected elements. Element Selection. Since the definition of the transformation requires the elements in preIDs (i.e., elements corresponding to the modifications) to be included in the transformation, we first select these elements. Next we add elements related to these elements as context. Here we follow the parent relation and the data dependency relation, both forwardly and backwardly, and include all elements that can be reached within k levels of the relations. In this study, we set k = 1 (the default configuration). In the future, we plan to conduct a more thorough investigation of different configurations. In the current implementation, we employed the GumTree algorithm [40] to extract the modifications. Please note that the original GumTree algorithm also returns a “move” operation, which can be combined by a deletion and an insertion using our modification types. C. Inferring Transformations A. Extracting Hypergraphs To extract the hypergraph, we need to extract the elements, their attributes, and their relations. In our current implementation we extract them as follows. 260 — 130 — For example, for the program in Figure 1, we first select p2 and p4 since they are modified. Then following the parent relation we include p1 , p3 , and p5 . Attribute Selection. Same as elements, we first add the attributes required by preAttrs to form a well-formed transformation. In our example we would add the “AST node type” attributes of p2 and p4 . Then we select from other attributes in the selected elements. For the attributes of content and value type, we compute the frequency for a given attribute. That is, we collect the element content and value types from a large code corpus, and then compute the cross-project frequency for a given attribute. If the frequency is larger than a threshold, we select the attribute. In current implementation, we use the following formula to compute the frequency of each attribute. In the experiment, we set the threshold as 0.5%. f req(attr) = node sim = Automation |{e | e ∈ E ∧ Software sameN odeT ype(e, match(e))}| in|E| the Big Data Era: Challenges and Opportunities LCS(tokenize(e), tokenize(match(e))) 1 text sim = |E| Σe∈E |tokenize(e)| score = node sim + text sim In the formulas, sameN odeT ype(e, e ) is used to judge whether element e is with the same node type as e , tokenize(e) is the tokenized sequence for the content of element e, and LCS(s1 , s2 ) computes longest common token sequence between two token lists s1 and s2 . Finally, we use the sum of the two similarities for match ranking. Our intuition for the ranking heuristic is that if the buggy code has more common parts with the pattern code, more confidence can be gained to apply the transformation. As a result, in the formulas we consider both the node-type and token-sequence similarity information since they correspond to code syntax and semantics, respectively. |{f |attr exists in file f }| |{all f iles in dataset}| Finally, we select the attribute of AST node type when the corresponding code element is a statement. This is to avoid inconsistent matching such as matching a statement with a variable. In the example, we select the node type of p1 . VI. E VALUATION To evaluate the effectiveness of G EN PAT, we choose two application scenarios—systematic editing [1], [15] (Section VI-A), and automatic program repair [17]–[19], [41]–[46] (Section VI-B). D. 
Matching and Applying Transformations Now suppose we have a transformation t = �g, m�, � and we would like to apply the transformation to a code snippet sp. We first transform sp to a hypergraph g  =�E  , R �, and then find a match match from g to g  to perform the transformation. In order to find the match, we proceed with the following two steps. 1) Greedily matching each element e in E with all elements in E  by considering only the attributes. 2) Exhaustively checking all possible matching combinations generated in the first step with the relations between elements. In our running example, by considering only the attributes, we can obtain the following mapping. A. Systematic Editing 1) Subjects: We employ two datasets in our evaluation, both collected in existing studies for evaluating systematic editing. The first one is the S YDIT dataset collected by Meng et al. [15]. The second one is the dataset collected by Kreutzer et al. [47], which we call C3. Both datasets contain similar changes collected from commits in open-source projects, where all modifications within a method in a commit are considered as a change. The difference is how they measure similarity: S YDIT uses ChangeDistiller [48] to extract changes for method pairs and requires they share at least one common syntactic edit and their content is at least 40% similar, and C3 represents a code change as a list of edit operations and then clusters the changes by calculating distances over the feature vector of code changes. The S YDIT dataset consists of 56 pairs of similar changes. For each pair, one change is used for pattern extraction and the other one is used to test the extracted transformation. The C3 dataset consists of 218,437 clusters of similar changes, where each cluster may have multiple changes. To unify the format of the two datasets, we randomly select a pair from each cluster of the C3 dataset. We summarize the detailed information of the subjects in Table I. 2) Procedure: In this experiment we use S YDIT as a baseline for comparison, which is a state-of-the-art technique that uses predefined rules for inferring program transformations. For each pair of code changes (va → va , vb → vb ) in the dataset, we apply G EN PAT and S YDIT to extract the transformation from va → va , and apply the transformation match(p1 ) ∈ {n1 }, match(p2 ) ∈ {n2 }, match(p3 ) ∈ {n3 } match(p4 ) ∈ {n4 }, match(p5 ) ∈ {n2 , n3 · · · n7 } Then further considering the relations between elements, we can filter out the extra elements for p5 , forming a valid match. match(p1 ) ∈ {n1 }, match(p2 ) ∈ {n2 } match(p3 ) ∈ {n3 }, match(p4 ) ∈ {n4 }, match(p5 ) ∈ {n5 } Based on this match, we can generate the following transformation on the target snippet. delete(n2 , n4 ) It is possible that multiple matches exist for a target code snippet. In some applications, we would like to find only one match. For example, in program repair, we usually assume that there is only one fault for a failed test. As a result, we need to rank the matches to find the best one. In our current approach we use the similarity between the AST node type and the content attributes to rank the matches. 261 — 131 — 2018 39.1% cases, while on 16.0% cases, the result is syntactically identical to the ground truth. Then we further compare the result of G EN PAT with state-of-the-art S YDIT on the same dataset. Note that we directly borrow the experimental result of S YDIT on the S YDIT dataset as reported in the original paper [15]. 
For the other projects, we successfully ran S YDIT on three projects (S YDIT reported errors on other projects, such as missing dependencies as it requires the projects compilable, encountering exceptions like NullPointerException and IndexOutOfBoundException, etc.). Therefore, we compare the results of G EN PAT and S YDIT on the subset of our experiment dataset where they both apply. The details of the experimental results are listed in Table III. Please also note that S YDIT requires code change pairs for transformation extraction and application coming from the same versions (ref. Section VI-A2: va and vb should come from the same project version). To satisfy this constraint, we select only those pairs in this experiment. In Table III, for each technique, we report both the number (ratio) of cases adapted and the number (ratio) of cases that are transformed with syntactically identical editing (Columns 4-7). Particularly, since the result of S YDIT on the S YDIT dataset is based on the semantic equivalence between the adapted code and the ground truth. For a fair comparison, we also perform a manual inspection on the results of G EN PAT on the S YDIT dataset. However, for the other projects, we compare their results based on syntactic equivalence. From the table we can see that G EN PAT significantly outperforms S YDIT on the numbers of both adapted and (syntactically or semantically) correctly transformed cases. Overall, G EN PAT produces 2.0x (1079/541) the adapted cases and 5.5x (570/103) the correctly transformed cases as S YDIT. If we consider the ratio of false positives, i.e., the cases where a transformation result is produced but not identical to the ground truth, G EN PAT ((1079-570)/1079=47.2%) still significantly outperforms S YDIT ((541-103)/541=81.0%). Moreover, we found that G EN PAT can still achieve a much better result (117 vs 64) even only on the cases where S YDIT can find a match (S YDIT #Adapted). To conclude, G EN PAT significantly outperforms state-of-the-art S YDIT. The results suggest that using predefined rules may produce undesired transformations in many cases, which either cannot match, or incorrectly match the target code. Considering the performance of the tools on different datasets, we can find that on the S YDIT dataset G EN PAT only slightly outperforms S YDIT (49 vs 46 adapted and 40 vs 39 correctly transformed), while on the C3 dataset, G EN PAT significantly outperforms S YDIT (1030 vs 495 adapted and 530 vs 64 correctly transformed). The reason is that the S YDIT dataset has stricter requirements on the similarity of the changes, and thus predefined rules already achieve good performance. On the other hand, C3 contains more diverse pairs such that better transformation inference is needed. Please note that syntactical equivalence may not be a precise measurement as two changes may be syntactically different but TABLE I E VALUATION DATASET FOR Systematic Editing. Meeting Proceedings Dataset Source Project S YDIT [15] C3 [47] junit cobertura jgrapht checkstyle ant fitlibrary drjava eclipsejdt eclipseswt Total #Pairs 56 3,904 2,570 2,490 13,263 25,063 3,199 31,393 73,109 63,446 218,493 to vb . If the transformation can be applied and produces vb∗ , we compare vb∗ with vb� . Since the complete dataset is large, in this experiment, the adapted code vb∗ is considered correct only if it is syntactically identical with the ground truth vb� . 
We also sample a small proportion of the programs that are not syntactically identical to the ground truth and check its semantic equivalence manually. For each pair, we set the timeout as 1 minute. We also need a code corpus for calculating the frequencies of attributes. For simplicity, we use the same corpus of patches as in the second program-repair experiment (Section VI-B1). Please note while this is not an ideal choice for systematic editing, as we will see later, we already achieved significantly better performance than the state-of-the-art technique. TABLE II G EN PAT ON COMPLETE EXPERIMENT DATASET FOR Systematic Editing. Projects Total Pairs #Adapted #Syn-Eq S YDIT junit cobertura jgapht checkstyle ant fitlibrary drjava eclipsejdt eclipseswt 56 3,904 2,570 2,490 13,263 25,063 3,199 31,393 73,109 63,446 49 (87.5%) 1,088 (27.9%) 769 (29.9%) 547 (22.0%) 5,918 (44.6%) 10,428 (41.6%) 922 (28.8%) 11,391 (36.3%) 32,037 (43.8%) 22,218 (35.0%) 27 (48.2%) 412 (10.6%) 305 (11.9%) 226 ( 9.1%) 1,679 (12.7%) 4,398 (17.5%) 374 (11.7%) 4,151 (13.2%) 14,150 (19.4%) 9,206 (14.5%) Total 218,493 85,367 (39.1%) 34,928 (16.0%) NOTE, the ratio in the table denotes the portion of Total Pairs. In the table, S YDIT represents the corresponding dataset. 3) Results: First, we evaluate G EN PAT on the complete dataset as shown in Table I, and the experimental results are listed in Table II. In the table, the second column shows the total number of cases for transformation in each project, and the last column (#Syn-Eq) denotes the number (ratio) that G EN PAT makes a syntactically identical adaptation among all the test cases. We also report the number of cases that G EN PAT can successfully match the generated transformation to the target code shown in the third column (#Adapted). In total, G EN PAT can successfully match and produce a result on 262 — 132 — Software Automation in the Big Data Era: Challenges and Opportunities #Syn-Eq/Sem-Eq #Syn-Eq of G EN PAT TABLE III C OMPARING G EN PAT WITH S YDIT ON Systematic Editing. Dataset S YDIT Projects jgrapht junit C3 cobertura Total Overall #Total Pairs 56 1,314 1,208 1,021 3,543 3,599 #Adapted G EN PAT S YDIT 49(87.5%) 46(82.1%) 354(26.9%) 20(1.5%) 383(31.7%) 240(19.9%) 293(28.7%) 235(23.0%) 1,030(29.1%) 495(14.0%) 1,079(30.0%) 541(15.0%) G EN PAT -/40(71.4%) 211(16.1%) 206(17.1%) 113(11.1%) 530(15.0%) 570(15.8%) S YDIT -/39(69.6%) 6(0.5%) 57(4.7%) 1(0.1%) 64(1.8%) 103(2.9%) in S YDIT #Adapted 7 110 0 117 - In the table, the ratio denotes the portion of Total Pairs, and we use “-” to denote missing data or directly omit “-”. semantically equivalent. To further understand how much of the syntactically different cases can be semantically equivalent, we perform a manual inspection on the transformed results. Since we do not have the detailed result of S YDIT on its own dataset, we randomly choose 20 cases in each project from C3 dataset, where the transformed code is not syntactically identical with the ground truth. As a result, we choose 60 cases for G EN PAT and 54 cases for S YDIT (only 14 cases in project jgrapht, cf. Table III). The results are 11.7% (7/60) semantically correct cases for G EN PAT, while 9.3% (5/54) semantically correct cases for S YDIT. The results suggest that the number of semantically equivalent cases would be slightly higher than the syntactically equivalent cases, and G EN PAT would probably still significantly outperform S YDIT. directly related to our approach. 
The latter two reasons point out future directions to further develop the approach. In other words, with a better implementation our approach may show even better results. B. Automated Program Repair Our second experiment aims to explore the capability of repairing real-world bugs using G EN PAT. In this experiment, we infer transformations from a large dataset of existing patches, and then apply these transformations to repair new bugs. 1) Subjects: We prepare two datasets, one of which is used as a training set for transformation extraction, while the other one is used as the dataset for program repair. For the first dataset, we downloaded more than 2 million code change examples from all open-source Java projects on GitHub corresponding to all their commits from 2011 to 2016. In this process, we leverage a set of keywords for filtering, such as “fix”, “repair”, “bug”, “issue”, “problem”, “error”, etc. Following previous studies [17], [49], we further filter out code change examples involving more than five java files or six lines of source code since they may include benign changes. Moreover, we remove commits in the projects to repair (i.e., Defects4J projects) or their forked projects to avoid using their own patches. As a result, we build a training set consisting of more than 1 million bug-fixing examples, which will be used to extract transformations for program repair. Besides, this dataset is used as the big code corpus for attribute selection as well, where each changed file in each commit is treated as a code file. For the second dataset, we employ a commonly used benchmark Defects4J [16] (v1.4), which consists of 395 realworld bugs from six open-source projects. We select 113 bugs from Defects4J for our experiment. The reason is that G EN PAT is not designed to be a comprehensive and standalone repair tool and is not possible to fix many specific types of bugs (e.g., bugs requiring additional invocations of specific methods only from the current projects). To save experiment time, we filtered these bugs that cannot be fixed, and used the remaining 113 bugs. The details of the benchmark are listed in Table IV. 2) Procedure: As suggested by existing studies [2], [3], [50] that same bug fixes may recursively exist among the historical bug fixes. To avoid repetitive computation, we first We further investigate the reasons why G EN PAT do not produce syntactically or semantically equivalent cases. We randomly sampled 100 cases that are not equivalent to the ground truth. By manually analyzing these cases, we found the following four main reasons. (i) The dominating reason is that the dataset contains noise, where the given code change examples do not conform to the target code. In other words, we cannot obtain the desired code after applying the transformation inferred from the corresponding example. For example, the given code change example is updating a variable runners to fRunners, while the desired change is updating fRunners to runners. It is impossible to infer the latter transformation from the former example. In total, 64% incorrect cases are due to this reason. (ii) Some types of changes are not supported by our implementation. For example, some cases change the method signature, and some cases change two methods at the same time. Both situations are not supported by our current implementation. In total, 27% cases are due to this reason. (iii) Our current modification types do not allow some transformations. 
For example, the desired transformation should insert a statement after some other statements, while our modification operation only allows inserting at an absolute position, i.e., the ith child of the parent, rather than a relative position. In total, 3% cases are due to this reason. (iv) Our inference algorithm does not infer the correct transformation. For example, we may extract a too strong context that cannot match the target code. In total, 6% cases are due to this reason. Note that first two reasons are not 263 — 133 — 2018 is not designed as a comprehensive and standalone repair technique, it still successfully repairs 16 bugs when only considering top-1 plausible patch, even outperforming some recent approaches, such as SketchFix and JAID. When considering top-10 plausible patches, G EN PAT can successfully repair 19 bugs. Moreover, among all the bugs fixed by G EN PAT, 4 bugs have never been fixed by any existing technique as far as we know, such as the example shown in Listing 3. The results demonstrate that it is possible to repair real-world bugs by learning executable program transformations from historical bug fixes directly. Furthermore, the results also suggest that it would be promising to consider integrating G EN PAT when designing advanced APR techniques to repair more bugs, which calls for future research in this direction. TABLE IV E VALUATION BENCHMARK FOR Program Repair. Project Meeting Proceedings Bugs kLoC Tests JFreechart (Chart) Closure compiler (Closure) Apache commons-math (Math) Apache commons-lang (Lang) Joda-Time (Time) Mockito (Mockito) 12 22 34 33 6 6 96 90 85 22 28 45 2,205 7,927 3,602 2,245 4,130 1,457 Total 113 366 21,566 In the table, column“Bugs” denotes the total number of bugs used in our evaluation, column“kLoC” denotes the number of thousands of lines of code, and column “Tests” denotes the total number of test cases for each project. perform a transformation clustering, which collects the same transformations together to form a cluster. In this process, two transformations belong to the same cluster only if they can match each other, and they have the same modifications. As a result, after clustering 689,546 unique transformation clusters are left, which are finally employed for patch generation. Following existing APR techniques [43]–[45], [51], we first leverage an existing fault localization framework [52] to obtain a ranked list of candidate faulty locations. Particularly, we employ the Ochiai [53] spectrum-based fault localization to compute suspicious scores. However, the fault localization result is at the statement level, while G EN PAT matches a code snippet rather than a single line. Therefore, we further apply Method-Level Aggregation [54]–[57] to obtain a ranked list of faulty methods from statement-level results since it has been demonstrated to outperform direct method-level fault localization [54]. Given a faulty method, G EN PAT locates a set of transformations whose attributes can be found in the faulty method. Then transformations will be ranked according to the size of corresponding clusters. Thereafter, G EN PAT tries to apply each transformation to a given faulty method and generates patches. In the matching process, we discard matches that involve no elements in a faulty line in the method. In our experiment, we collect at most 10,000 compilable patches for each faulty method and then rank them with the ranking method introduced in the approach (Section V-D). 
Finally, we validate each candidate patch with the test suites and set a timeout of 5 hours to repair one bug. In this paper, following recent repair work [17]–[19], [44], [45], [58]–[60], we consider a patch as correct only if it is semantically equivalent to the developer’s patch in Defects4J with manual check. 3) Results: In this section, we present the experimental result of G EN PAT on repairing real-world bugs and compare it with state-of-the-art APR techniques that are recently published on SE conferences. The results are shown in Table V. In the table, we listed the number of bugs correctly fixed by each technique when considering top-k (k ∈ {1, 10}) plausible patches. We use “-” to represent those missed data. From the table we can observe that, surprisingly, although G EN PAT TABLE V C OMPARING G EN PAT WITH STATE - OF - THE - ART APR TECHNIQUES . Tech. #Top-1 Pos. #Top-10 Pos. ISSTA’19 ISSTA’18 ICSE’18 ICSE’18 ASE’17 ICSE’17 SANER’16 PraPR [60] SimFix [18] SketchFix [61] CapGen [19] JAID [59] ACS [45] HD-Repair [17] 30 34 9 21 9 18 10 39 22 15 - G EN PAT 16 19 In an investigation of the patches we found that the fixed bugs are often non-trivial and may not be easily fixed by approaches with a predefined search space. For example, Listing 4 shows the patch for Lang-21, which is successfully generated by applying the transformation extracted from the example in Listing 5. It is not easy to predefine a search space to include this constant replacement. 264 cal1.get(Calendar.MINUTE)==cal2.get(Calendar.MINUTE)&& 265 -cal1.get(Calendar.HOUR)==cal2.get(Calendar.HOUR)&& 265 +cal1.get(Calendar.HOUR_OF_DAY)==cal2.get(Calendar. HOUR_OF_DAY)&& 266 cal1.get(Calendar.YEAR)==cal2.get(Calendar.YEAR)&& Listing 4. Patch of Lang-21. Commit : github.com/Cbsoftware/PressureNet/commit/9d00742 Message: Fixing time display bugs, #113 Source : src.ca.cumulonimbus.barometernetwork.BarxxActivity ========== 2459 - if(start.get(Calendar.HOUR)==0&&end.get(Calendar. HOUR)==0){ 2459 + if(start.get(Calendar.HOUR_OF_DAY)==0&&end.get( Calendar.HOUR_OF_DAY)==0){ Listing 5. Referenced history patch to fix Lang-21. Meanwhile, though G EN PAT is promising to repair realworld bugs, it still faces challenges. In our experiment, G EN PAT generates plausible but incorrect patches for other 23 bugs among all 113 bugs. Compared with some state-ofthe-art techniques, such as SimFix and CapGen, the repair precision of G EN PAT is slightly lower. By analyzing those incorrect patches, we found that the reasons for its low precision are mainly threefold. First, though we have already preprocessed the training dataset for transformation extraction, 264 — 134 — Conf. Software Automation applications. On the other hand, in the current implementation, the Big Era: we do not consider the context in information while Data computing the attribute frequencies, which potentially can further improve Challenges and Opportunities the quality of the inferred transformations. Also, there are also other attributes and relations besides those considered in our current implementation, such as the node-position attributes in AST or control-dependency relations, both of which may impact the quality of inferred program transformations. We leave a more thorough investigation to these variations to our future study. there still exist code changes that are not relevant to bug fixes, which may produce incorrect patches. Second, the inferred transformation is too general and can be applied frequently, such as inserting a return statement in an if body. 
Since G EN PAT only expands one level dependency relation, the generated transformation can be applied wherever there is an if statement, and can easily introduce incorrect patches. Third, G EN PAT is not designed to be a standalone repair tool and thus does not include the patch-correctness checking mechanisms that mature tools use. In the future, recent advanced patchcorrectness checking techniques [62], [63] can also be further integrated with G EN PAT to mitigate this issue. ACKNOWLEDGMENT This work was partially supported by the National Key Research and Development Program of China under Grant No.2017YFB1001803, National Natural Science Foundation of China under Grant Nos. 61672045 and 61529201, and National Science Foundation under Grant Nos. CCF-1566589 and CCF-1763906, and Amazon. Special thanks should go to Xia Li (UT Dallas) who shared the big code base with us, making it possible to conduct our large-scale evaluation. VII. T HREATS TO VALIDITY In this section, we discuss the threats to validity of G EN PAT. First, the external threats to the validity fall into the data collection in our evaluation. We employed a subset of the C3 data set, i.e., we choose one pair of similar code changes from each cluster for the experiment, which may cause data selection bias. However, to mitigate this threat, we employed all 218,441 clusters in the data set shown in Table I with a random sample, which leaves us 218,441 pairs of examples. We believe that this big dataset can alleviate the threats. On the other hand, since the dataset is constructed automatically by previous research, which may involve noises as discussed in the previous section. As a consequence, we employed the manually-constructed dataset [15] as well in our evaluation, which can mitigate this issue to some extent. Second, the internal threats to validity are related to the implementation of G EN PAT. To ensure the correctness of its implementation, two authors of the paper collaborate with code review to make sure all functions are properly implemented. However, it is still possible to unintentionally get some implementation bugs involved. To further reduce this threat, we have also released both the source and test code of G EN PAT, as well as the replication package, and invite other researchers to contribute to this promising direction. R EFERENCES [1] M. Kim and D. Notkin, “Discovering and representing systematic code changes,” in 2009 IEEE 31st International Conference on Software Engineering, May 2009, pp. 309–319. [2] S. Kim, K. Pan, and E. E. J. Whitehead, Jr., “Memories of bug fixes,” in FSE, 2006, pp. 35–45. [3] Q. Gao, H. Zhang, J. Wang, and Y. Xiong, “Fixing recurring crash bugs via analyzing Q&A sites,” in ASE, 2015, pp. 307–318. [4] T. T. Nguyen, H. A. Nguyen, N. H. Pham, J. Al-Kofahi, and T. N. Nguyen, “Recurring bug fixes in object-oriented programs,” ser. ICSE, 2010, pp. 315–324. [5] B. Ray and M. Kim, “A case study of cross-system porting in forked projects,” in Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, ser. FSE ’12. New York, NY, USA: ACM, 2012, pp. 53:1–53:11. [6] B. Ray, C. Wiley, and M. Kim, “Repertoire: A cross-system porting analysis tool for forked software projects,” in FSE. New York, NY, USA: ACM, 2012, pp. 8:1–8:4. [7] J. Li, C. Wang, Y. Xiong, and Z. Hu, “SWIN: towards type-safe java program adaptation between apis,” in PEPM, 2015, pp. 91–102. [8] C. Wang, J. Jiang, J. Li, Y. Xiong, X. Luo, L. Zhang, and Z. 
Hu, “Transforming programs between apis with many-to-many mappings,” in ECOOP, 2016, pp. 25:1–25:26. [9] W. F. Opdyke, “Refactoring: An aid in designing application frameworks and evolving object-oriented systems,” in Proc. SOOPPA’90: Symposium on Object-Oriented Programming Emphasizing Practical Applications, 1990. [10] M. Fowler, Refactoring: improving the design of existing code. Addison-Wesley Professional, 2018. [11] N. Meng, M. Kim, and K. S. McKinley, “Lase: Locating and applying systematic edits by learning from examples,” ser. ICSE ’13, 2013, pp. 502–511. [12] R. Rolim, G. Soares, L. D’Antoni, O. Polozov, S. Gulwani, R. Gheyi, R. Suzuki, and B. Hartmann, “Learning syntactic program transformations from examples,” in ICSE, 2017, pp. 404–415. [13] F. Long, P. Amidon, and M. Rinard, “Automatic inference of code transforms for patch generation,” in ESEC/FSE, 2017, pp. 727–739. [14] B. Ray, V. Hellendoorn, S. Godhane, Z. Tu, A. Bacchelli, and P. Devanbu, “On the ”naturalness” of buggy code,” in Proceedings of the 38th International Conference on Software Engineering, ser. ICSE ’16. ACM, 2016, pp. 428–439. [15] N. Meng, M. Kim, and K. S. McKinley, “Systematic editing: Generating program transformations from an example,” in Proceedings of the 32Nd ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI ’11. ACM, 2011, pp. 329–342. VIII. C ONCLUSION In this paper, we propose a framework for transformation inference from a single example by representing code as a hypergraph, which allows fine-grained generalization of transformations with big code. Based on this framework, we further propose a transformation inference algorithm and implement it in a tool called G EN PAT. Finally, we evaluated the effectiveness of G EN PAT in two distinct application scenarios, i.e., systematic editing and automatic program repair. The experimental results show that G EN PAT significantly outperforms the state-of-the-art S YDIT with up to 5.5x correctly transformed cases in the first application. Additionally, although not designed as a comprehensive and standalone repair technique, G EN PAT already shows potentialities in automatic program repair – it successfully fixed 19 bugs in the Defects4J benchmark, 4 of which have never been repaired by any existing technique. In all, the evaluation results suggest that G EN PAT is effective and potentially can be adopted in many different 265 — 135 — 2018 [41] Y. Lou, J. Chen, L. Zhang, D. Hao, and L. Zhang, “History-driven build failure fixing: How far are we?” in ISSTA. New York, NY, USA: ACM, 2019, pp. 43–54. [42] W. Weimer, T. Nguyen, C. Le Goues, and S. Forrest, “Automatically finding patches using genetic programming,” in ICSE, 2009, pp. 364– 374. [43] Q. Xin and S. P. Reiss, “Leveraging syntax-related code for automated program repair,” ser. ASE, 2017. [Online]. Available: http://dl.acm.org/citation.cfm?id=3155562.3155644 [44] R. K. Saha, Y. Lyu, H. Yoshida, and M. R. Prasad, “Elixir: Effective object oriented program repair,” in ASE. IEEE Press, 2017. [Online]. Available: http://dl.acm.org/citation.cfm?id=3155562.3155643 [45] Y. Xiong, J. Wang, R. Yan, J. Zhang, S. Han, G. Huang, and L. Zhang, “Precise condition synthesis for program repair,” in ICSE, 2017. [46] J. Jiang, Y. Xiong, and X. Xia, “A manual inspection of defects4j bugs and its implications for automatic program repair,” Science China Information Sciences, vol. 62, p. 200102, Sep 2019. [47] P. Kreutzer, G. Dotzler, M. Ring, B. M. Eskofier, and M. 
Philippsen, “Automatic clustering of code changes,” in Proceedings of the 13th International Conference on Mining Software Repositories, ser. MSR ’16. ACM, 2016, pp. 61–72. [Online]. Available: http://doi.acm.org/10.1145/2901739.2901749 [48] B. Fluri, M. Wuersch, M. PInzger, and H. Gall, “Change distilling:tree differencing for fine-grained source code change extraction,” IEEE Transactions on Software Engineering, pp. 725–743, Nov 2007. [49] M. Tufano, C. Watson, G. Bavota, M. Di Penta, M. White, and D. Poshyvanyk, “An empirical investigation into learning bug-fixing patches in the wild via neural machine translation,” in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. ACM, 2018, pp. 832–837. [50] H. A. Nguyen, A. T. Nguyen, T. T. Nguyen, T. N. Nguyen, and H. Rajan, “A study of repetitiveness of code changes in software evolution,” in 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), Nov 2013, pp. 180–190. [51] J. Xuan, M. Martinez, F. Demarco, M. Clément, S. Lamelas, T. Durieux, D. Le Berre, and M. Monperrus, “Nopol: Automatic repair of conditional statement bugs in java programs,” TSE, 2017. [52] S. Pearson, J. Campos, R. Just, G. Fraser, R. Abreu, M. D. Ernst, D. Pang, and B. Keller, “Evaluating and improving fault localization,” ser. ICSE ’17, 2017, pp. 609–620. [53] R. Abreu, P. Zoeteweij, and A. J. C. v. Gemund, “An evaluation of similarity coefficients for software fault localization,” ser. PRDC. Washington, DC, USA: IEEE Computer Society, 2006, pp. 39–46. [54] J. Sohn and S. Yoo, “Fluccs: Using code and change metrics to improve fault localization,” ser. ISSTA, New York, NY, USA, 2017, pp. 273–283. [55] D. Zou, J. Liang, Y. Xiong, M. D. Ernst, and L. Zhang, “An empirical study of fault localization families and their combinations,” IEEE Transactions on Software Engineering, 2019. [56] X. Li, W. Li, Y. Zhang, and L. Zhang, “Deepfl: Integrating multiple fault diagnosis dimensions for deep fault localization,” in ISSTA. New York, NY, USA: ACM, 2019, pp. 169–180. [57] J. Chen, J. Han, P. Sun, L. Zhang, D. Hao, and L. Zhang, “Compiler bug isolation via effective witness test program generation,” in ESEC/FSE. New York, NY, USA: ACM, 2019, pp. 223–234. [58] M. Martinez, T. Durieux, R. Sommerard, J. Xuan, and M. Monperrus, “Automatic repair of real bugs in java: A large-scale experiment on the Defects4J dataset,” Empirical Software Engineering, pp. 1–29, 2016. [59] L. Chen, Y. Pei, and C. A. Furia, “Contract-based program repair without the contracts,” in ASE, 2017. [60] A. Ghanbari, S. Benton, and L. Zhang, “Practical program repair via bytecode mutation,” in ISSTA. New York, NY, USA: ACM, 2019, pp. 19–30. [61] J. Hua, M. Zhang, K. Wang, and S. Khurshid, “Towards practical program repair with on-demand candidate generation,” in ICSE, 2018. [62] Y. Xiong, X. Liu, M. Zeng, L. Zhang, and G. Huang, “Identifying patch correctness in test-based program repair,” in ICSE, 2018. [63] S. H. Tan, H. Yoshida, M. R. Prasad, and A. Roychoudhury, “Antipatterns in search-based program repair,” in FSE, 2016. [16] R. Just, D. Jalali, and M. D. Ernst, “Defects4j: A database of existing faults to enable controlled testing studies for java programs,” in ISSTA, 2014, pp. 437–440. [17] X.-B. D. Meeting Le, D. Lo, andProceedings C. Le Goues, “History driven program repair,” in SANER, 2016, pp. 213–224. [18] J. Jiang, Y. Xiong, H. Zhang, Q. Gao, and X. 
Chen, “Shaping program repair space with existing patches and similar code,” in ISSTA, 2018. [19] M. Wen, J. Chen, R. Wu, D. Hao, and S.-C. Cheung, “Context-aware patch generation for better automated program repair,” in ICSE, 2018. [20] J. Andersen and J. L. Lawall, “Generic patch inference,” in 2008 23rd IEEE/ACM International Conference on Automated Software Engineering, Sep. 2008, pp. 337–346. [21] J. Andersen, A. C. Nguyen, D. Lo, J. L. Lawall, and S. Khoo, “Semantic patch inference,” in 2012 Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, Sep. 2012, pp. 382–385. [22] R. Bavishi, H. Yoshida, and M. R. Prasad, “Phoenix: Automated datadriven synthesis of repairs for static analysis violations,” in ESEC/FSE. New York, NY, USA: ACM, 2019, pp. 613–624. [23] H. A. Nguyen, T. N. Nguyen, D. Dig, S. Nguyen, H. Tran, and M. Hilton, “Graph-based mining of in-the-wild, fine-grained, semantic code change patterns,” in ICSE. IEEE Press, 2019, pp. 819–830. [24] T. Molderez, R. Stevens, and C. De Roover, “Mining change histories for unknown systematic edits,” in MSR, May 2017, pp. 248–256. [25] D. Kim, J. Nam, J. Song, and S. Kim, “Automatic patch generation learned from human-written patches,” in ICSE, 2013, pp. 802–811. [26] X. Liu and H. Zhong, “Mining stackoverflow for program repair,” in 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering, 2018, pp. 118–129. [27] T. Molderez and C. De Roover, “Search-based generalization and refinement of code templates,” in Search Based Software Engineering. Cham: Springer International Publishing, 2016, pp. 192–208. [28] Y. Padioleau, J. Lawall, R. R. Hansen, and G. Muller, “Documenting and automating collateral evolutions in linux device drivers,” in Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008, ser. Eurosys ’08. ACM, 2008, pp. 247–260. [29] M. Nita and D. Notkin, “Using twinning to adapt programs to alternative apis,” in Proceedings of the 32Nd ACM/IEEE International Conference on Software Engineering - Volume 1, ser. ICSE ’10. ACM, 2010, pp. 205–214. [30] M. Bravenboer, K. T. Kalleberg, R. Vermaas, and E. Visser, “Stratego/xt 0.17. a language and toolset for program transformation,” Science of Computer Programming, vol. 72, no. 1, pp. 52 – 70, 2008. [31] M. Erwig and D. Ren, “An update calculus for expressing type-safe program updates,” Science of Computer Programming, vol. 67, no. 2, pp. 199 – 222, 2007. [32] J. R. Cordy, “The txl source transformation language,” Science of Computer Programming, vol. 61, no. 3, pp. 190 – 210, 2006, special Issue on The Fourth Workshop on Language Descriptions, Tools, and Applications (LDTA 04). [33] M. Martinez and M. Monperrus, “Mining software repair models for reasoning on the search space of automated program fixing,” Empirical Softw. Engg., pp. 176–205, 2015. [34] F. Long and M. Rinard, “Automatic patch generation by learning correct code,” in Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2016, pp. 298– 312. [35] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” in Advances in Neural Information Processing Systems, 2017, pp. 4077–4087. [36] M. Sridharan, S. J. Fink, and R. Bodik, “Thin slicing,” ACM SIGPLAN Notices, vol. 42, no. 6, pp. 112–122, 2007. [37] T. Xu, X. Jin, P. Huang, Y. Zhou, S. Lu, L. Jin, and S. Pasupathy, “Early detection of configuration errors to reduce failure damage.” in OSDI, 2016, pp. 
619–634. [38] A. Hajnal and I. Forgacs, “A precise demand-driven definition-use chaining algorithm,” in Proceedings of the Sixth European Conference on Software Maintenance and Reengineering, March 2002, pp. 77–86. [39] M. J. Harrold and M. L. Soffa, “Efficient computation of interprocedural definition-use chains,” ACM Trans. Program. Lang. Syst., no. 2, pp. 175– 204, 1994. [40] J. Falleri, F. Morandat, X. Blanc, M. Martinez, and M. Monperrus, “Finegrained and accurate source code differencing,” in ASE, 2014, pp. 313– 324. 266 — 136 — TreeGen: A Tree-Based Transformer Architecture for Code Generation Software Automation in the Big Data Era: Zeyu Sun Qihao Zhu Yingfei Xiong arXiv:1911.09983v2 [cs.LG] 28 Nov 2019 † † † † ‡ and Opportunities † Challenges Yican Sun Lili Mou Lu Zhang Key Laboratory of High Confidence Software Technologies (Peking University), MoE; Software Institute, Peking University, 100871, P. R. China {szy , zhuqh, xiongyf, sycpku, zhanglucs}@pku.edu.cn ‡ University of Alberta, Edmonton, AB, Canada doublepower.mou@gmail.com Abstract A code generation system generates programming language code based on an input natural language description. State-ofthe-art approaches rely on neural networks for code generation. However, these code generators suffer from two problems. One is the long dependency problem, where a code element often depends on another far-away code element. A variable reference, for example, depends on its definition, which may appear quite a few lines before. The other problem is structure modeling, as programs contain rich structural information. In this paper, we propose a novel tree-based neural architecture, TreeGen, for code generation. TreeGen uses the attention mechanism of Transformers to alleviate the longdependency problem, and introduces a novel AST reader (encoder) to incorporate grammar rules and AST structures into the network. We evaluated TreeGen on a Python benchmark, HearthStone, and two semantic parsing benchmarks, ATIS and GEO. TreeGen outperformed the previous state-of-theart approach by 4.5 percentage points on HearthStone, and achieved the best accuracy among neural network-based approaches on ATIS (89.1%) and GEO (89.6%). We also conducted an ablation test to better understand each component of our model. Introduction Code generation is an important artificial intelligence problem that has the potential to significantly boost the productivity of programmers. Given a specification written in natural language, a code generation system translates the specification into an executable program. For example, if a python programmer gives an instruction “initialize a dictionary, Dict”, the code generator is expected to automatically generates “Dict={ }”. With the development deep learning techniques, researchers have applied various neural architectures to this problem, such as sequence-to-sequence (Seq2Seq) models or sequence-to-tree (Seq2Tree) models (Sutskever, Vinyals, and Le 2014; Ling et al. 2016; Yin and Neubig 2017; Rabinovich, Stern, and Klein 2017; Hayati et al. 2018; ∗ ∗† Yingfei Xiong is the corresponding author. The code is available at https://github.com/zysszy/TreeGen c 2020, Association for the Advancement of Artificial Copyright  Intelligence (www.aaai.org). All rights reserved. Sun et al. 2019). Especially, state-of-the-art approaches generate code by predicting a sequence of grammar rules (Yin and Neubig 2017; Rabinovich, Stern, and Klein 2017; Sun et al. 2019). 
That is to say, the system keeps a partial abstract syntax tree (AST) of the already-generated code, and predicts the grammar rule to be used to expand a particular node. The classification of grammar rules faces two main challenges. The first challenge is the long-dependency problem (Bengio, Simard, and Frasconi 1994). A code element may depend on another far-away element. For example, a variable reference statement “if len(a) < Max Length:” at line 100 may depend on a variable definition statement “Max Length = 100” at line 10. The second challenge is the representation of code structures. It is pointed out that the tree-structural information is crucial for modeling code (Mou et al. 2016; Yin and Neubig 2017; Rabinovich, Stern, and Klein 2017; Sun et al. 2019). However, a “flat” neural architecture, such as an RNN, cannot capture structure information well. In this paper, we propose a novel neural architecture, TreeGen, for the code generation. To address the first challenge, TreeGen adopts the recently proposed Transformer architecture (Vaswani et al. 2017), which is capable of capturing long dependencies. However, the original Transformer architecture is not designed for programs, and cannot utilize tree structures, i.e., the second above mentioned challenge. A standard way of utilizing structural information, as in graph- and tree-based convolutional neural networks, is to combine the vector representations of a node and its structural neighbors as the output of a structural convolution sub-layer. However, a standard Transformer architecture does not have such structural convolution sub-layers, and it is not clear where to add them. It is tempting to add structural convolution sub-layers in all the Transformer blocks. Our core conjecture is that when convolving a node and its structural neighbors, the vector representation should mainly contain the information from the original node. As the vector representation of the nodes is processed by more blocks in the decoder of the Transformer, they gradually mix in more information from other nodes and lose their original information. Therefore, we add — 137 — 2018 Meeting Proceedings — 138 — Softmax & Pointer Nd x NL Reader N1 x Conv Character Embedding Gating Self Attention Position Embedding AST Reader Tree Conv Software Automation in the Big Data Era: Challenges Decoder and Opportunities NL Attention Rule Definition Encoding + Gating Self attention Position Embedding Natural Language Description (Input) + N2 x Dense NL Attention AST Attention Depth Embedding Rule Sequence (Generated Code) Tree Path (Query) Figure 2: Overview of the TreeGen. table. We also use position embeddings to encode the information of word positions. In particular, we adopt the variant in Dehghani et al. (2018), and compute the position embedding for the ith word in the bth Transformer block as pb,i [2j] = sin((i + b)/(100002j/d )) (3) pb,i [2j + 1] = sin((i + b)/(100002j/d )) (4) where pi,b [·] indexes a dimension of the vector pi,b , and d is the number of dimensions (i.e., embedding size). A Transformer block learns non-linear features by multi-head attention, which yields a matrix Yb(self) = (self) (self) (self)  , yb,2 , · · · , yb,L ] , where Yb(self) ∈ RL×d . For no[yb,1 tational simplicity, we omit the subscript b. The multi-head layer is computed by Y (self) = concat(head1 , · · · , headH )Wh (5) where H denotes the number of heads and Wh is the weight. 
Gating Mechanism. After the features are computed by self-attention, we further incorporate the information of the character embeddings. This is done by a gating mechanism based on softmax. For the i-th word, we compute a control vector q_i from y_i^{(self)} by a linear transformation. The softmax weight k_i^{(c)} for the character embedding is given by a linear transformation from n_i^{(c)} in Equation 2, and the softmax weight k_i^{(y)} for the Transformer's output is given by another linear transformation from y_i^{(self)}. Then the gate is computed by

    [\alpha_{i,t}^{(y)}, \alpha_{i,t}^{(c)}] = \mathrm{softmax}\{q_i^\top k_i^{(y)},\; q_i^\top k_i^{(c)}\}    (8)

These weights are used to weigh the feature of the Transformer layer, v_i^{(y)}, and the feature of the character embedding, v_i^{(c)}, linearly transformed from y_i^{(self)} and n_i^{(c)}, respectively:

    h_{i,t} = \alpha_{i,t}^{(y)} v_i^{(y)} + \alpha_{i,t}^{(c)} v_i^{(c)}    (9)

Similar to Equation 5, the output of our gating mechanism is Y^{(gate)} = (h_{i,t})_{i,t}, where (\cdot)_{i,t} represents a block matrix with the elements being h_{i,t}.

Word Convolution. Finally, two convolutional layers are applied to the output of the gating mechanism y_1^{(gate)}, \cdots, y_L^{(gate)} to extract the local features around each token, y_1^{(conv,l)}, \cdots, y_L^{(conv,l)}, where l denotes the index of the convolutional layer. y_i^{(conv,l)} is computed by

    y_i^{(conv,l)} = W^{(conv,l)} [y_{i-w}^{(conv,l-1)}; \cdots; y_{i+w}^{(conv,l-1)}]    (10)

where W^{(conv,l)} are the convolution weights, w = (k - 1)/2, and k denotes the window size. In particular, y_i^{(conv,0)} denotes the output of the gating mechanism, y_i^{(gate)}. In these layers, separable convolution (Chollet 2017) is used, because separable convolution has fewer parameters and is thus easier to train. For the first and the last words, we add zero padding. Between these layers, we use the GELU activation function (Hendrycks and Gimpel 2016).

In summary, the NL reader has N_d Transformer blocks of self-attention, the gating mechanism, and word convolution. The natural language description is encoded as features y_1^{(NL)}, y_2^{(NL)}, \cdots, y_L^{(NL)}.
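The softmax-based gate of Equations (8)–(9) can be sketched as follows (a toy NumPy illustration, not the authors' code; the per-head index t is dropped and the five linear maps are invented placeholders).

```python
# Softmax-based gate mixing the Transformer feature with the character embedding.
import numpy as np

def gate(y_self_i, n_char_i, Wq, Wky, Wkc, Wvy, Wvc):
    q  = Wq  @ y_self_i                 # control vector q_i
    ky = Wky @ y_self_i                 # key for the Transformer feature
    kc = Wkc @ n_char_i                 # key for the character embedding
    vy = Wvy @ y_self_i                 # value: Transformer feature
    vc = Wvc @ n_char_i                 # value: character embedding
    logits = np.array([q @ ky, q @ kc])
    weights = np.exp(logits - logits.max())
    a_y, a_c = weights / weights.sum()  # Equation (8)
    return a_y * vy + a_c * vc          # Equation (9): fused feature h_i

d = 8
rng = np.random.default_rng(1)
Ws = [rng.normal(size=(d, d)) * 0.1 for _ in range(5)]
h_i = gate(rng.normal(size=d), rng.normal(size=d), *Ws)   # fused feature for word i
```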
AST Reader

We design an AST reader to model the structure of the partial AST that has been generated so far. Although our programs are generated by predicting a sequence of grammar rules, these rules alone lack a concrete picture of the program and are insufficient for predicting the next rule. Therefore, our AST reader considers heterogeneous information, including the predicted rules and the tree structure. To incorporate such program-specific information, we first represent the code as a sequence of rules, then encode the rules with an attention mechanism, and finally use a tree convolution layer to combine the encoded representation of each node with that of its ancestors.

AST Representation

Rule Sequence Embedding. To encode the rule information, we use the IDs of the rules. Suppose we have a sequence of rules r_1, r_2, \cdots, r_P that have been used to generate the partial AST at a decoding step, where P denotes the length of the sequence. We represent these rules as real-valued vectors r_1, r_2, \cdots, r_P by table-lookup embeddings.

Rule Definition Encoding. The above table-lookup embedding treats a grammar rule as an atomic token and loses the information of the rule's content. To alleviate this problem, we enhance the representation of a rule with an encoding of its definition. Consider a grammar rule i : \alpha \rightarrow \beta_1 \cdots \beta_K, where \alpha is the parent node and \beta_1 \cdots \beta_K are child nodes, which can be either terminal or non-terminal symbols; the index i is the ID of the rule. Similar to Equation 2, we encode the rule content as a vector r_i^{(c)} by a fully-connected layer whose input is the table-lookup embeddings \alpha, \beta_1, \cdots, \beta_K of the respective symbols. The symbol sequence is also padded to a maximum length. Then the rule definition features y_1^{(rule)}, \cdots, y_P^{(rule)} are computed by another fully-connected layer as

    y_i^{(rule)} = W^{(rule)} [r_i; r_i^{(c)}; \alpha]    (11)

where r_i is the table-lookup embedding of the rule r_i, r_i^{(c)} is the content-encoding rule representation, and we emphasize the parent node \alpha again. This is followed by a layer normalization.

Position and Depth Embeddings. Since our AST reader uses self-attention mechanisms, we need to represent the position where a grammar rule is used. We first adopt the position embedding as in Equation 4, representing where a rule appears in the sequence r_1, \cdots, r_P. These position embeddings are denoted by p_1^{(r)}, \cdots, p_P^{(r)}.

However, such a position embedding does not capture the position of a rule in the AST. We further encode this information by a depth embedding. If we expand a symbol \alpha by the rule r : \alpha \rightarrow \beta_1 \cdots \beta_K, we represent the depth of the rule by that of its parent node, i.e., \alpha. In this way, we associate another sequence of table-lookup depth embeddings d_1, \cdots, d_P with the sequence of used grammar rules r_1, \cdots, r_P.

Neural Structure of the AST Reader. The AST reader is also composed of a stack of blocks (N_1 blocks in total). Each block is decomposed into four sub-layers (namely, self-attention, a gating mechanism, NL attention, and tree convolution). We employ a residual connection around each sub-layer except the tree convolution layer, and after each sub-layer we apply a layer normalization.

Self-Attention. To capture the information of the AST, we build a Transformer-like self-attention layer, where the input is the sum of the rule embedding, the position embedding, and the depth embedding, i.e., r_i + d_i + p_i^{(r)}. The self-attention sub-layer extracts features y_1^{(ast-self)}, y_2^{(ast-self)}, \cdots, y_P^{(ast-self)} of the AST input, using the same mechanism as Equations 4, 5, and 6 with different weights, but adding the additional depth embedding to p_i^{(r)}.

Gating Mechanism. We would like to incorporate the content-encoding rule representation y_i^{(rule)} into the Transformer-extracted features. We adopt a gating mechanism as in Equations 8 and 9, and the fused features become y_1^{(ast-g)}, y_2^{(ast-g)}, \cdots, y_P^{(ast-g)} after this sub-layer.

NL Attention. During decoding, we should be informed of the input NL description. This is given by a multi-head NL attention, similar to the Transformer decoder's attention to its encoder (Vaswani et al. 2017). The extracted features are denoted by y_1^{(ast-nl)}, y_2^{(ast-nl)}, \cdots, y_P^{(ast-nl)}.
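As a small illustration of how the AST reader's self-attention input described above could be assembled, the sketch below sums a table-lookup rule embedding, the block-dependent position embedding of Equation (4), and a table-lookup depth embedding for each used rule. The toy vocabulary sizes, rule IDs, and depths are assumptions for the example, not values from the paper.

```python
# Assembling r_i + d_i + p_i^{(r)} as the AST reader's self-attention input.
import numpy as np

d, num_rules, max_depth = 256, 100, 32
rng = np.random.default_rng(2)
rule_table  = rng.normal(size=(num_rules, d)) * 0.02   # rule-ID embeddings
depth_table = rng.normal(size=(max_depth, d)) * 0.02   # depth embeddings

def position_embedding(P, d, b):
    p = np.zeros((P, d))
    for i in range(P):
        for j in range(d // 2):
            angle = (i + b) / (10000 ** (2 * j / d))
            p[i, 2 * j], p[i, 2 * j + 1] = np.sin(angle), np.cos(angle)
    return p

# A partial AST described by the rules used so far and the depth of each
# rule's parent node (both invented for this example).
rule_ids = np.array([0, 1, 2, 3, 4])
depths   = np.array([0, 1, 2, 3, 3])
P, b = len(rule_ids), 1
ast_input = rule_table[rule_ids] + depth_table[depths] + position_embedding(P, d, b)
# ast_input (P, d) is then fed to the self-attention sub-layer of block b.
```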
Tree Convolution. If we considered only the above sub-layers, it would be hard for the reader to combine the information of a node with that of its ancestors. A node can be far away from its ancestors in the rule sequence but close to them in the tree structure, and it is difficult for a traditional Transformer to extract such structural features. We therefore integrate the features of a node with those of its ancestors.

We treat the AST as a graph and use an adjacency matrix M to represent the directed graph: if a node \alpha_i is the parent of \alpha_j, then M_{ji} = 1. Suppose all the nodes are represented by features f_1, \cdots, f_n; their parents' features can then be obtained by multiplication with the adjacency matrix:

    [f_1^{(par)}, \cdots, f_n^{(par)}] = [f_1, \cdots, f_n]\, M    (12)

where f_i^{(par)} denotes the parent feature of the i-th node. For the parent of the root node, we pad it with the feature vector of the root node itself. The tree-based convolution window, applied to the current sub-tree, is given by

    Y^{(tconv,l)} = f\big(W^{(tconv,l)} [Y^{(tconv,l-1)};\; Y^{(tconv,l-1)} M;\; \cdots;\; Y^{(tconv,l-1)} M^{k_t - 1}]\big)    (13)

where W^{(tconv,l)} is the weight of the convolutional layer, k_t denotes the window size (set to 3 in our experiments), and l is the index of the convolutional layer. In particular, Y^{(tconv,0)} = [y_1^{(att)}, y_2^{(att)}, \cdots, y_P^{(att)}], where Y^{(tconv,0)} \in \mathbb{R}^{d \times P}. For the last layer of the AST reader, we add two additional convolution layers. In the equation, f is the activation function; GELU is applied between these layers.

In summary, the AST reader has N_1 blocks of these four sub-layers, and yields the features y_1^{(ast)}, y_2^{(ast)}, \cdots, y_P^{(ast)}.
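The tree convolution of Equations (12)–(13) can be sketched as below (a NumPy illustration, not the authors' implementation). Note the indexing convention: here M[i, j] = 1 when node i is the parent of node j, so that right-multiplication replaces each column with its parent's feature; the toy four-node AST is invented for the example.

```python
# Tree convolution over node, parent, and grandparent features (k_t = 3 hops).
import numpy as np

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def tree_conv_layer(Y, M, W, activation=gelu, k_t=3):
    """Y: (d, P) node features; M: (P, P) adjacency matrix described above."""
    hops = [Y]
    for _ in range(k_t - 1):
        hops.append(hops[-1] @ M)              # Equation (12) applied repeatedly
    stacked = np.concatenate(hops, axis=0)     # (k_t * d, P)
    return activation(W @ stacked)             # Equation (13)

# A 4-node partial AST: 0 is the root, 1 and 2 are children of 0, 3 is a child of 2.
P, d = 4, 8
M = np.zeros((P, P))
parents = {0: 0, 1: 0, 2: 0, 3: 2}             # the root's parent is padded with itself
for child, parent in parents.items():
    M[parent, child] = 1.0

rng = np.random.default_rng(3)
Y0 = rng.normal(size=(d, P))                   # features from the previous sub-layer
W = rng.normal(size=(d, 3 * d)) * 0.1
Y1 = tree_conv_layer(Y0, M, W)                 # (d, P) structurally convolved features
```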
Decoder

Our final component is a decoder that integrates the information of the generated code with the NL description and predicts the next grammar rule. Similar to the AST reader, the decoder uses a stack of blocks (N_2 blocks in total), each with several sub-layers. A residual connection is employed around each sub-layer, followed by a layer normalization.

The decoder takes the non-terminal node to be expanded as a query. Inspired by a previous approach (Sun et al. 2019), the querying node is represented as the path from the root to the node to be expanded. For example, if we are going to expand the node "Assign" in Figure 1, the path is root, Module, body, Assign. We represent the nodes in this path as real-valued vectors and then apply a fully-connected layer like Equation 2 to these vectors; the output for the path (querying node) is q_i^{(path)}.

We then apply two attention layers to integrate the outputs of the AST reader and the NL reader. We first apply an AST attention layer over the output of the AST reader with the queries and extract features f_1^{(tree)}, \cdots, f_P^{(tree)}. In this layer, Q is computed from the queries q_1^{(path)}, \cdots, q_P^{(path)}; K and V are computed from the code features y_1^{(ast)}, \cdots, y_P^{(ast)}. We further integrate the features from the input description. This integration is also implemented with an NL attention, where Q is computed from the features f_1^{(tree)}, \cdots, f_P^{(tree)}, and K and V are computed from the input description features y_1^{(NL)}, \cdots, y_L^{(NL)}. Finally, two fully-connected layers, the first of which has a GELU activation function, extract the features for prediction.

Training and Inference

We predict the next grammar rule, among all possible candidates, by a softmax based on the decoder's last-layer features. We also introduce a pointer network (See, Liu, and Manning 2017) (essentially, an attention) that can directly copy a token a from the NL description. In this case, the resulting grammar rule is \alpha \rightarrow a, where \alpha is the non-terminal symbol to be expanded and a is a terminal symbol. Such a pointer mechanism is helpful for user-defined identifiers (e.g., variable and function names). The choice between softmax rule prediction and the pointer network is made by another gate p_g, also computed from the decoder's last feature. The overall predicted probability of the next grammar rule is

    p(r_i | \cdot) = \begin{cases} p_g \; p(r_i | \cdot) & \text{if } i \in D \\ (1 - p_g) \; \Pr\{\text{copy word } t \text{ at step } i \mid \cdot\} & \text{if } i \in C \end{cases}    (14)

where i denotes the ID of the rule, D is the set of predefined rules, and C denotes the set of rules of the form \alpha \rightarrow a, where a is a terminal token that occurs in the NL description. p_g is the probability of using a predefined rule, and the p(r_i | \cdot) for the predefined rules are computed by two single-layer perceptrons with the sigmoid and softmax activation functions, respectively, whose inputs are the decoder features h^{(dec)}.

The pointer network is computed by

    \xi_t = v^\top \tanh(W_1 h^{(dec)} + W_2 y_t^{(NL)})
    \Pr\{\text{copy word } t \text{ at step } i \mid \cdot\} = \frac{\exp\{\xi_t\}}{\sum_{j=1}^{L} \exp\{\xi_j\}}    (15)

where h^{(dec)} denotes the decoder's last feature.

The model is optimized by minimizing the negative log-likelihood loss against the reference program. Inference starts with a start rule, start : snode \rightarrow root, expanding a special symbol snode to the root symbol. The recursive prediction terminates when every leaf node in the predicted AST is a terminal. During prediction, we use beam search with a size of 5; invalid rules are excluded during beam search.

Evaluation

We evaluated our approach on two types of benchmarks: (1) a Python code generation benchmark, HearthStone, and (2) two semantic parsing benchmarks, ATIS and GEO.

Experiment I: HearthStone

Dataset. We first evaluated our approach on the HearthStone benchmark (Ling et al. 2016). The benchmark contains Python code that implements 665 different cards of HearthStone. Each card is composed of a semi-structured description and a ground-truth Python program. The Python programs have a length of 84 tokens on average. The description comes with several attributes, such as the card name and card type, as well as a natural language description of the functionality of the card. A Python program is mainly decided by the natural language description, while the attributes decide the constants or identifier names. A sample description and its corresponding Python program are shown in Figure 3.

[Figure 3: An example card implementation from HearthStone. Card attributes: NAME: Darkscale Healer; ATK: 4; DEF: 5; COST: 5; DUR: -1; TYPE: Minion; PLAYER: Neutral; RACE: NIL; RARITY: Common; DESCRIPTION: Battlecry: Restore 2 Health to all friendly characters.]

When preprocessing the card description into token sequences, existing approaches adopt two methods. The first (Yin and Neubig 2017; Hayati et al. 2018), called plain preprocessing, treats the whole description as plain text and delimits the tokens by standard separators such as spaces or periods. The second (Rabinovich, Stern, and Klein 2017), called structural preprocessing, treats the description as semi-structured and always treats an attribute as one token.
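The contrast between the two preprocessing styles can be sketched as follows. This is only an illustration under assumed tokenization rules (the exact rules of the cited systems may differ); the card dictionary mirrors the Darkscale Healer example in Figure 3.

```python
# Plain vs. structural preprocessing of a semi-structured card description.
import re

card = {
    "NAME": "Darkscale Healer", "ATK": "4", "DEF": "5", "COST": "5",
    "DUR": "-1", "TYPE": "Minion", "PLAYER": "Neutral", "RACE": "NIL",
    "RARITY": "Common",
    "DESCRIPTION": "Battlecry: Restore 2 Health to all friendly characters.",
}

def plain_preprocess(card):
    """Treat the whole description as plain text; split on standard separators."""
    text = " ".join(f"{k} {v}" for k, v in card.items())
    return re.findall(r"[^\s.:]+", text)

def structural_preprocess(card):
    """Treat the card as semi-structured; each attribute value is one token."""
    tokens = []
    for key, value in card.items():
        tokens.append(key)
        if key == "DESCRIPTION":
            tokens.extend(re.findall(r"[^\s.:]+", value))   # free text stays tokenized
        else:
            tokens.append(value)                            # e.g., "Darkscale Healer"
    return tokens

print(plain_preprocess(card)[:6])        # ['NAME', 'Darkscale', 'Healer', 'ATK', '4', 'DEF']
print(structural_preprocess(card)[:4])   # ['NAME', 'Darkscale Healer', 'ATK', '4']
```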
In this experiment, we consider both methods, and denote the results corresponding to plain preprocessing as TreeGen-A and those corresponding to structural preprocessing as TreeGen-B. We followed the train-dev-test split in Ling et al. (2016); the statistics are listed in Table 2.

Metrics. We measured performance following the metrics in Sun et al. (2019): StrAcc, the percentage of programs that have exactly the same token sequence as the ground truth; the BLEU score, which measures the similarity between the generated code and the reference code at the token level; and Acc+, which is evaluated manually and allows variable renaming on top of StrAcc for every test case.

Settings. For the neural networks, we set the number of NL reader layers N_d = 6, and N_1 = N_2 = 5 for the AST reader and the decoder. The size of all embeddings is 256. The hidden sizes were all set to 256, except in the fully-connected layers, where the first layer has 1,024 dimensions. We applied dropout after each layer (including attention layers, gating mechanism layers, convolutional layers, and fully-connected layers), with a drop rate of 0.15. The model is optimized by Adafactor (Shazeer and Stern 2018) with default parameters.

Overall Results. We show the results in Table 1. In this table, structural preprocessing yields better performance than plain preprocessing. As shown, our model achieves a 6-percentage-point accuracy improvement with plain preprocessing and a 4.5-percentage-point accuracy improvement with structural preprocessing. For the BLEU score, our model also achieves the best results. These boosts in performance indicate that TreeGen successfully alleviates the long-dependency problem and effectively encodes the structural information for code generation.

Time Efficiency. We further evaluated the time cost of our model on HearthStone, and the results show that our model is faster than the previous ones: an epoch takes 18 s on a single Nvidia Titan Xp, compared with 180 s for the CNN (Sun et al. 2019) and 49 s for the RNN (Yin and Neubig 2017).

Location of the Structural Convolution Sub-layer. One of the keys of our approach is to add the structural convolution sub-layers only to part of the Transformer blocks in the decoder. To evaluate whether this design decision is effective, we compared four settings: 1) adding the structural convolution sub-layers to all Transformer blocks (i.e., N1 = 10); 2) adding them to the first 7 blocks of the AST reader (i.e., N1 = 10(7)); 3) adding them to the first 8 blocks of the AST reader (i.e., N1 = 10(8)); and 4) adding them to none of the blocks (i.e., N1 = 0). As Table 1 shows, adding the sub-layer to all Transformer blocks (N1 = 10) significantly outperforms the last setting (N1 = 0), but is slightly worse than the other two settings.

Ablation Test. We ablated our model (TreeGen-B was used) to analyze the contribution of each component; the results are also shown in Table 1. First, we compared our model with the traditional Transformer, i.e., a Transformer without effective structure modeling. We achieved 21 percentage points higher accuracy (the p-value is less than 0.001) and a 12-point higher BLEU score. This result provides strong evidence of the effectiveness of the AST reader in our model and the importance of the structural information.
Next, we replaced the tree convolutional layers in the AST reader with two fully-connected layers, and we removed the char embedding, the rule definition encoding, and the self-attention layers in turn. The experimental results show that the identifier encoding, the alleviation of long dependencies, and the structural information significantly influence the accuracy. Please note that in some cases BLEU increases while StrAcc and Acc+ decrease. Here we consider StrAcc and Acc+ more important, as they guarantee the correctness of the generated programs, and correctness is usually crucial in code generation.

Experiment II: Semantic Parsing

Dataset. We further evaluated our approach on semantic parsing tasks. Our experiments were conducted on two semantic parsing datasets, ATIS and GEO. The input of these datasets is a natural language description, while the output is a short piece of code in lambda calculus. We followed the standard train-dev-test split of these datasets; the statistics are listed in Table 2.

Metrics and Settings. In this task, we follow the evaluation of previous approaches (Dong and Lapata 2016) and use accuracy as the metric, where tree exact match is considered to avoid spurious errors; in other words, the order of the children can be changed within conjunction nodes. We followed all the settings of the HearthStone experiment, except that we changed the embedding size and the hidden sizes to 128.

Results. Table 3 shows the performance of TreeGen. As seen, the accuracy of our approach is slightly worse than that of the traditional approach WKZ14 (Wang, Kwiatkowski, and Zettlemoyer 2014), which is based on a CCG parser and uses a large number of templates; such a traditional approach is hard to generalize to new datasets. Our model, in contrast, was directly adapted from the HearthStone setting and achieved the highest accuracy among all neural models (Dong and Lapata 2016; Rabinovich, Stern, and Klein 2017; Dong and Lapata 2018; Chen, Sun, and Han 2018; Xu et al. 2018; Sun et al. 2019). This experiment shows the effectiveness and generalizability of TreeGen.

Table 1: Performance of our model in comparison with previous state-of-the-art results (StrAcc / Acc+ / BLEU).

Plain preprocessing
  LPN (Ling et al. 2016)                            6.1    –      67.1
  SEQ2TREE (Dong and Lapata 2016)                   1.5    –      53.4
  YN17 (Yin and Neubig 2017)                       16.2   ~18.2   75.8
  ASN (Rabinovich, Stern, and Klein 2017)          18.2    –      77.6
  ReCode (Hayati et al. 2018)                      19.6    –      78.4
  TreeGen-A                                        25.8   25.8    79.3
Structural preprocessing
  ASN+SUPATT (Rabinovich, Stern, and Klein 2017)   22.7    –      79.2
  SZM19 (Sun et al. 2019)                          27.3   30.3    79.6
  TreeGen-B                                        31.8   33.3    80.8
Location of the structural convolution sub-layer
  N1 = 10,    N2 = 0                               25.8   27.3    80.4
  N1 = 10(7), N2 = 0                               27.3   28.8    78.5
  N1 = 10(8), N2 = 0                               25.8   28.8    78.5
  N1 = 0,     N2 = 10                              21.2   22.7    79.6
Ablation test
  Baseline: Transformer                            10.6   12.1    68.0
  - Tree Convolution                               27.3   27.3    80.9
  - Rule Definition Encoding                       27.3   28.8    81.8
  - Char Embedding                                 15.2   18.2    72.9
  - Self-Attention                                 28.8   28.8    81.0
(All StrAcc differences in the ablation block are statistically significant, with p ≤ 0.015.)

Table 2: Statistics of the datasets we used.

                                HS      ATIS    GEO
  # Train                       533     4,434   600
  # Dev                         66      491     –
  # Test                        66      448     280
  Avg. tokens in description    35.0    10.6    7.4
  Max. tokens in description    76.0    48      23
  Avg. tokens in code           83.2    33.9    28.3
  Max. tokens in code           403     113     144

Table 3: Accuracy in semantic parsing (in percent), ATIS / GEO.

Traditional
  ZC07 (Zettlemoyer and Collins 2007)                84.6   86.1
  FUBL (Kwiatkowski et al. 2011)                     82.8   88.6
  KCAZ13 (Kwiatkowski et al. 2013)                   –      89.0
  WKZ14 (Wang, Kwiatkowski, and Zettlemoyer 2014)    91.3   90.4
Neural networks
  SEQ2SEQ (Dong and Lapata 2016)                     84.2   84.6
  SEQ2TREE (Dong and Lapata 2016)                    84.6   87.1
  ASN (Rabinovich, Stern, and Klein 2017)            85.3   85.7
  ASN+SUPATT (Rabinovich, Stern, and Klein 2017)     85.9   87.1
  COARSE2FINE (Dong and Lapata 2018)                 87.7   88.2
  TRANX (Yin and Neubig 2018)                        86.2   88.2
  Seq2Act (Chen, Sun, and Han 2018)                  85.5   88.9
  Graph2Seq (Xu et al. 2018)                         85.9   88.1
  SZM19 (Sun et al. 2019)                            85.0   –
  TreeGen                                            89.1   89.6

Related Work

Code generation has achieved significant progress in recent years. Early approaches were mainly based on templates (Zettlemoyer and Collins 2007; Zettlemoyer and Collins 2005; Kushman and Barzilay 2013; Wang, Kwiatkowski, and Zettlemoyer 2014). With the prosperity of deep learning, the sequence-to-sequence framework has been shown to be effective in various tasks (Sutskever, Vinyals, and Le 2014), and Ling et al. (2016) applied this framework to generate code based on tokens. Unlike natural language, however, code contains much richer structural information; thus, the abstract syntax tree (AST) was used in more recent works (Dong and Lapata 2016; Yin and Neubig 2017; Rabinovich, Stern, and Klein 2017; Hayati et al. 2018; Yin and Neubig 2018). These studies mainly use recurrent neural networks (RNNs), which suffer from the long-dependency problem (Bengio, Simard, and Frasconi 1994). Sun et al. (2019) proposed to use a convolutional neural network (CNN) to handle the long-dependency problem. Our approach addresses this problem with the Transformer's intensive attention mechanism (Vaswani et al. 2017); to incorporate the structural information into the idea of self-attention, we propose a tree-based Transformer architecture for code generation.

Conclusion

In this work, we propose TreeGen for program generation. TreeGen uses the attention mechanism of Transformers to alleviate the long-dependency problem and introduces the AST reader to combine the grammar rules and the AST structure. The evaluation was conducted on a Python dataset, HearthStone, and two semantic parsing datasets, ATIS and GEO. The experimental results show that our model significantly outperforms existing approaches. We also conducted in-depth ablation tests, which suggest that each component of our model plays a significant role.

Acknowledgments

This work is sponsored by the National Key Research and Development Program of China under Grant No. 2017YFB1001803, and the National Natural Science Foundation of China under Grant Nos. 61672045, 61529201, and 61922003. Lili Mou is an Amii Fellow; he is supported by the CCAI Chair Program, and he also thanks AltaML for support.

References

[Bengio, Simard, and Frasconi 1994] Bengio, Y.; Simard, P.; and Frasconi, P. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Networks 5(2):157–166.
[Chen, Sun, and Han 2018] Chen, B.; Sun, L.; and Han, X. 2018. Sequence-to-Action: End-to-End Semantic Graph Generation for Semantic Parsing. In ACL, 766–777.
[Chollet 2017] Chollet, F. 2017. Xception: Deep learning with depthwise separable convolutions. In CVPR, 1251–1258.
[Dehghani et al. 2018] Dehghani, M.; Gouws, S.; Vinyals, O.; Uszkoreit, J.; and Kaiser, Ł. 2018. Universal transformers. arXiv preprint arXiv:1807.03819.
[Dong and Lapata 2016] Dong, L., and Lapata, M. 2016.
Language to Logical Form with Neural Attention. In ACL, 33–43. [Dong and Lapata 2018] Dong, L., and Lapata, M. 2018. Coarse-to-Fine Decoding for Neural Semantic Parsing. In ACL, 731–742. [Hayati et al. 2018] Hayati, S. A.; Olivier, R.; Avvaru, P.; Yin, P.; Tomasic, A.; and Neubig, G. 2018. Retrieval-Based Neural Code Generation. In EMNLP, 925–930. [He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778. [Hendrycks and Gimpel 2016] Hendrycks, D., and Gimpel, K. 2016. Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units. arXiv preprint arXiv:1606.08415. [Kushman and Barzilay 2013] Kushman, N., and Barzilay, R. 2013. Using semantic unification to generate regular expressions from natural language. In NAACL, 826–836. [Kwiatkowski et al. 2011] Kwiatkowski, T.; Zettlemoyer, L.; Goldwater, S.; and Steedman, M. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In EMNLP, 1512–1523. [Kwiatkowski et al. 2013] Kwiatkowski, T.; Choi, E.; Artzi, Y.; and Zettlemoyer, L. 2013. Scaling semantic parsers with on-the-fly ontology matching. In EMNLP, 1545–1556. [Lei Ba, Kiros, and Hinton 2016] Lei Ba, J.; Kiros, J. R.; and Hinton, G. E. 2016. Layer normalization. arXiv preprint arXiv:1607.06450. [Ling et al. 2016] Ling, W.; Blunsom, P.; Grefenstette, E.; Hermann, K. M.; Kočiskỳ, T.; Wang, F.; and Senior, A. 2016. Latent Predictor Networks for Code Generation. In ACL, 599–609. [Mou et al. 2016] Mou, L.; Li, G.; Zhang, L.; Wang, T.; and Jin, Z. 2016. Convolutional neural networks over tree structures for programming language processing. In AAAI, 1287– 1293. — 144 — [Rabinovich, Stern, and Klein 2017] Rabinovich, M.; Stern, M.; and Klein, D. 2017. Abstract Syntax Networks for Code Generation and Semantic Parsing. In ACL, 1139–1149. [See, Liu, and Manning 2017] See, A.; Liu, P. J.; and Manning, C. D. 2017. Get to the point: Summarization with pointer-generator networks. In ACL, 1073–1083. [Shazeer and Stern 2018] Shazeer, N., and Stern, M. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv preprint arXiv:1804.04235. [Sun et al. 2019] Sun, Z.; Zhu, Q.; Mou, L.; Xiong, Y.; Li, G.; and Zhang, L. 2019. A grammar-based structural cnn decoder for code generation. In AAAI, volume 33, 7055– 7062. [Sutskever, Vinyals, and Le 2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In NIPS, 3104–3112. [Vaswani et al. 2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In NIPS, 6000– 6010. [Wang, Kwiatkowski, and Zettlemoyer 2014] Wang, A.; Kwiatkowski, T.; and Zettlemoyer, L. 2014. Morphosyntactic lexical generalization for CCG semantic parsing. In EMNLP, 1284–1295. [Xu et al. 2018] Xu, K.; Wu, L.; Wang, Z.; Yu, M.; Chen, L.; and Sheinin, V. 2018. Exploiting Rich Syntactic Information for Semantic Parsing with Graph-to-Sequence Model. In ACL, 918–924. [Yin and Neubig 2017] Yin, P., and Neubig, G. 2017. A Syntactic Neural Model for General-Purpose Code Generation. In ACL, 440–450. [Yin and Neubig 2018] Yin, P., and Neubig, G. 2018. TRANX: A transition-based neural abstract syntax parser for semantic parsing and code generation. In EMNLP, 7– 12. [Zettlemoyer and Collins 2005] Zettlemoyer, L. S., and Collins, M. 2005. Learning to map sentences to logical form: structured classification with probabilistic categorial grammars. 
In UAI, 658–666.
[Zettlemoyer and Collins 2007] Zettlemoyer, L., and Collins, M. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In EMNLP-CoNLL, 678–687.

Learning Stateful Preconditions Modulo a Test Generator

Angello Astorga, University of Illinois at Urbana-Champaign, USA, aastorg2@illinois.edu
P. Madhusudan, University of Illinois at Urbana-Champaign, USA, madhu@illinois.edu
Shambwaditya Saha, University of Illinois at Urbana-Champaign, USA, ssaha6@illinois.edu
Shiyu Wang, University of Illinois at Urbana-Champaign, USA, shiyuw3@illinois.edu
Tao Xie, University of Illinois at Urbana-Champaign, USA, taoxie@illinois.edu

Abstract

In this paper, we present a novel learning framework for inferring stateful preconditions (i.e., preconditions constraining not only primitive-type inputs but also non-primitive-type object states) modulo a test generator, where the quality of the preconditions is based on their safety and maximality with respect to the test generator. We instantiate the learning framework with a specific learner and test generator to realize a precondition synthesis tool for C#. We use an extensive evaluation to show that the tool is highly effective in synthesizing preconditions for avoiding exceptions as well as synthesizing conditions under which methods commute.

CCS Concepts • Theory of computation → Program specifications; • Software and its engineering → Dynamic analysis; • Computing methodologies → Classification and regression trees.

Keywords Specification Mining, Data-Driven Inference, Synthesis

ACM Reference Format: Angello Astorga, P. Madhusudan, Shambwaditya Saha, Shiyu Wang, and Tao Xie. 2019. Learning Stateful Preconditions Modulo a Test Generator. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '19), June 22–26, 2019, Phoenix, AZ, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3314221.3314641

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. PLDI '19, June 22–26, 2019, Phoenix, AZ, USA. © 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-6712-7/19/06. $15.00. https://doi.org/10.1145/3314221.3314641

1 Introduction

Reliable and robust software needs to function well even when given illegal inputs. One common way to handle illegal inputs is to equip the software with a precondition: any inputs violating the precondition are classified as illegal, and further executions on these inputs beyond the precondition-checking point are prevented. To address the error-proneness and tediousness of manually deriving and specifying preconditions, researchers have proposed various automatic approaches to precondition inference based on static analysis (e.g., [4, 7–9, 12, 23, 24, 26, 34, 36]) or dynamic analysis (e.g., [5, 6, 10, 11, 13, 15, 29, 33]).
Given that static analysis is conservative in nature and often results in many false positives, existing approaches based on dynamic analysis have their advantages, being broadly classified into two major categories: white-box ones [6, 10, 11, 13][29, VPregen] and black-box ones [15, 29, 33][29, PIE]. Both categories learn preconditions from runtime information collected from software execution. For white-box approaches, runtime information typically includes program states in between statements inside the software, whereas for black-box approaches, runtime information typically includes inputs and outputs of the invoked methods defined on the interface of a class. However, existing approaches of precondition generation based on dynamic analysis typically do not tackle two major challenges. First, most of these approaches do not give any guarantee on the quality of the synthesized preconditions. If preconditions are learned passively using feature vectors of states observed on some fixed set of test inputs, the learning is intrinsically incomplete and can lead to overfitting the given test inputs, producing preconditions that are not guaranteed to generalize to unseen test inputs. Certain recent white-box approaches (e.g., [29, VPreGen]) can prove that preconditions are safe with the help of a static verifier, but the required verification is a hard problem to automate, requiring synthesis of inductive loop invariants, etc. In this — 145 — PLDI ’19, June 22–26, 2019, Phoenix, AZ, USA 2018 paper, we explore a different guarantee—to ensure that the synthesizedMeeting precondition is safe (and maximal) with respect Proceedings to a test generator, which is typically a lot more scalable than a static verifier. Second, the target preconditions are stateful by nature for object-oriented programs: the target preconditions constrain not only primitive-type inputs (such as integers and strings) but also non-primitive-type inputs, such as the receiverobject states and object states of a non-primitive-type argument for the method under consideration. To address these two challenges, in this paper, we present a novel active learning framework for testing-assisted inference of stateful preconditions that are guaranteed to be safe and maximal with respect to a given test generator. Safety and maximality are both parameterized with respect to a test generator. We want a precondition that is safe— the test generator cannot find a precondition-allowing (i.e., precondition-satisfying) input whose execution leads to a failure, and a precondition that is maximal—the test generator cannot find an input disallowed by the precondition whose execution does not lead to any failure. We define this requirement through a formal notion of ideal preconditions with respect to a test generator. To synthesize stateful preconditions, our framework includes an abstraction based on observer methods defined for various classes, namely an observer abstraction. This abstraction enables the learned precondition to express abstract properties of non-primitive-type inputs while avoiding revealing implementation details (e.g., primitive-type object fields recursively reachable from an input object along with the heap structure of an input object). Our active learning framework combines a black-box learner with a white-box teacher, with the latter realized using a test generator, in order to learn ideal preconditions. 
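As a rough illustration of this interaction, the schematic Python sketch below alternates between a conjecture from the learner, counterexamples from a testing-based teacher, and relabeling by a conflict resolver. All names here are illustrative placeholders rather than the API of the authors' tool; the teacher stands in for a test generator such as Pex, and the learner for a classifier such as a decision-tree learner over the feature predicates.

```python
# Schematic active-learning loop: learner + testing-based teacher + conflict resolver.

def l_indistinguishable(fv1, fv2, predicates):
    """Two feature vectors are L-indistinguishable if no predicate of the
    logic separates them (here: they agree on every predicate)."""
    return all(p(fv1) == p(fv2) for p in predicates)

def resolve_conflicts(new_samples, negatives, predicates):
    """Relabel a positive feature vector as negative if it cannot be
    distinguished (in the logic) from a known negative one."""
    resolved = []
    for fv, label in new_samples:
        if label == "+" and any(l_indistinguishable(fv, n, predicates) for n in negatives):
            label = "-"
        resolved.append((fv, label))
    return resolved

def learn_precondition(learner, teacher, predicates, max_rounds=100):
    samples = []                        # cumulative labeled feature vectors X
    phi = learner.initial_conjecture()  # e.g., the formula "true"
    for _ in range(max_rounds):
        counterexamples = teacher.counterexamples(phi)   # (feature vector, +/-) pairs
        negatives = [fv for fv, lab in samples + counterexamples if lab == "-"]
        samples += resolve_conflicts(counterexamples, negatives, predicates)
        if all(phi(fv) == (lab == "+") for fv, lab in samples):
            return phi                  # consistent with the updated X: stop and output
        phi = learner.fit(samples)      # e.g., a decision tree over the predicates
    return phi
```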
Working by actively querying a test generator to produce an ideal precondition alleviates the problem of learning a precondition that overfits a particular set of inputs. Learning ideal preconditions in a logic L with the aid of a test generator can then be posed as actively learning formulas using positive and negative feature vectors that the test generator produces in rounds of interaction between the test generator and the learner. However, there are two main issues that need to be addressed when realizing this active learning, mostly due to the fact that inherently the test generator cannot guarantee feature vectors to be positive (but can certify negative feature vectors). First, the test generator can label certain vectors as positive, and later change its mind and label these vectors negative. Second, the limited expressiveness of the logic to state preconditions can also force the exclusion of certain positive inputs from the learned precondition. To address these issues, our framework includes a component named a conflict resolver that effectively relabels positive feature vectors to negative vectors when necessary. The — 146 — Astorga, Madhusudan, Saha, Wang, and Xie resulting learning framework with the conflict resolver can be instantiated for any logic for expressing preconditions using any learner and any test generator in order to learn ideal preconditions. We also prove a convergence result—assuming that the logic expresses finitely many formulas closed under Boolean operations, an ideal precondition expressible in the logic exists, and the learner is able to always produce formulas consistent with samples when they exist, we are guaranteed to converge to an ideal precondition. We instantiate the learning framework with a learner that uses an algorithm for decision-tree learning to synthesize formulas in a particular logic that involves Boolean combinations of predicates and inequalities involving numerical predicates, where the predicates describe both properties of primitive-type inputs and non-primitive-type inputs (such as the receiver objects and non-primitive-type input parameters). This algorithm is a standard machine-learning algorithm for decision-tree learning that uses statistical measures to build trees of small size; these trees correspond to preconditions consistent with the (conflict resolved) counterexamples returned by the test generator in each round, and learning continues till the learner finds an ideal precondition. We also instantiate the framework for two important tasks in specification inference: runtime-failure prevention and conditional-commutativity inference [37]. The former problem asks to synthesize preconditions that avoid runtime exceptions of a single method. The latter problem asks, given two methods, a precondition that ensures that the two methods commute, when called in succession. Inferring preconditions for commutativity is important for programmers to understand when they can reorder calls to methods while preserving behavior equivalence, and also has applications to downstream tools such as program analysis and tools for instrumenting concurrency control [20–22, 41]. 
We implement a prototype of our framework in a tool called Proviso using a learner based on the ID3 classification algorithm [31], a powerful classification algorithm in the machine learning community, and Pex [39], an industrial test generator based on dynamic symbolic execution [19, 35], shipped as IntelliTest in the Microsoft Visual Studio Enterprise Edition since Visual Studio 2015/2017/2019. This paper makes the following main contributions: • A novel formalization for the inference problem of stateful precondition modulo a test generator, called ideal preconditions, that guarantees that it is safe and maximal with respect to the test generator. • A novel active learning framework for inferring ideal stateful preconditions, using a conflict resolver component to adaptively mark positive inputs as negative, when necessary, in order to deal with the incompleteness of the test generator and the inexpressiveness of the logic for expressing preconditions. Learning Stateful Preconditions Modulo a Test Generator [PexMethod] public void PUT-CommutativityAddContains( [PexAssumeUnderTest] ArrayList s1, int x, int y){ DataStructures.ArrayList s2 = new DataStructures.ArrayList(s1); //clone s1 int a1, a2; bool ad1, ad2; //First Interleaving a1 = s1.Add(x); ad1 = s1.Contains(y); //Second Interleaving ad2 = s2.Contains(y); a2 = s2.Add(x); PexAssert.IsTrue(a1 == a2 && ad1 == ad2 && Equals(s1, s2));} Figure 1. Encoding conditional property: Commutativity conditions for methods Contains and Add from the ArrayList class in the .NET Library. • Convergence arguments that the learning framework will eventually synthesize a safe and maximal precondition modulo the test generation, if there is one, when the hypothesis space is finite. • Instantiations of the framework in a tool Proviso for two important tasks in specification inference (preconditions for preventing runtime failures and conditions for commutativity of methods) and using a machinelearning algorithm for decision trees and an industrial test generator (Pex). • An extensive evaluation on various C# classes from well-known benchmarks and open source projects that demonstrates the effectiveness of the proposed framework. 2 An Illustrative Example We next show how our framework is instantiated for the task of conditional-property inference and then illustrate through an example how our approach addresses the precondition synthesis problem. Let us first model the problem of conditional-commutativity inference (finding conditions under which two methods commute) as a problem of precondition synthesis. Consider the parameterized unit test [40] in Figure 1. The method PUT_CommutativityAddContains checks whether the methods of an arraylist, Add and Contains, commute when called with an arraylist s1, and for particular parameter inputs. The method Add(x) returns the index at which x has been added, and Contains(y) returns true if y is in s1 and false otherwise. To check for commutativity, the test method first clones the input arraylist s1 into s2. It then calls the method sequence Add(x) and Contains(y) on s1, and Contains(y) and Add(x) on s2. Finally, it checks whether the return values of the methods and resulting objects s1, s2 are equal. 
If they are PLDI ’19, June 22–26, 2019, Phoenix, AZ, USA Software Automation not, the methods do not commute and hence it raises an in the Big Data Era: exception; it follows that the precondition for the method Challenges toand Opportunities PUT_CommutativityAddContains prevent exceptions (e.g., assertion failure) is precisely the condition under which the two methods Add(x) and Contains(y) commute. To synthesize stateful preconditions, we instantiate our framework by fixing a logic L of octagonal constraints, by fixing a conflict resolver, a component that effectively relabels positive feature vectors to negative ones when necessary (see Section 4.1 for details), by fixing an exact learning engine, decision-tree learning, and by fixing a test generator, Pex. As inputs, our approach takes a method m (e.g., Figure 1) for precondition synthesis and a set of Boolean and integer observer methods in the ArrayList class, Obs B = {Contains(int)} and Obs Z = {Count, IndexOf(int), LastIndexOf(int)}, respectively. Our approach uses these observer methods and primitive parameters of m to generate a feature vector f by applying those methods using various combinations of parameters of m: s1.IndexOf(x), s1.IndexOf(y), [s1.Count(), x, y, s1.LastIndexOf(x), s1.LastIndexOf(y), s1.Contains(x), s1.Contains(y)]. Next we demonstrate how our algorithm proceeds. A set X (initially empty) of cumulative positive and negative feature vectors is maintained. Our algorithm proceeds in rounds: the learner begins by proposing a conjectured precondition, the testing-based teacher generates counterexamples. To generate negative counterexamples, the teacher generates inputs that are allowed by the conjectured precondition but cause the method to fail. To generate positive counterexamples, the teacher generates inputs that are disallowed by the conjectured precondition and do not cause the method to fail. These counterexamples are given to a conflict resolver, which then relabels a positive counterexample c to negative if in X there is a negative counterexample c ′ that is L-indistinguishable from c. The algorithm then checks whether the current conjectured precondition is consistent with the updated set X (i.e., the conjectured precondition allows the positive feature vectors in X and disallow the negative feature vectors in X ): if yes, we stop and output the precondition; otherwise, we proceed to the next round. We elaborate the role of the conflict resolver and the soundness of the preceding technique in the rest of the paper. To illustrate the conflict resolver on this example, we assume that no observer methods are given, and the feature vector is f′ = [x, y]. The learner begins by proposing true, and the testing-based teacher produces negative counterexamples ([0, 0], −), ([10, 10], −), being added to X (which is initially empty). The precondition true is not consistent with X and so we proceed with the next round. The learner next proposes false (as it is consistent with X ). The teacher then generates two positive feature vectors ([0, 0], +) and ([8, 9], +). At this point, we have encountered conflict. X has a negative — 147 — PLDI ’19, June 22–26, 2019, Phoenix, AZ, USA 2018 feature vector ([0, 0], −) and an L-indistinguishable positive vector ([0, Meeting 0], +). The Proceedings conflict resolver relabels ([0, 0], +) to ([0, 0], −). Again, the current conjecture false is not consistent with the updated X and so we proceed. 
This process (in our tool) continues for 4 rounds when the learner ultimately proposes (x  y), which is consistent with all vectors that the test generator returns, and we stop and return (x  y) as the precondition. When the feature vector ( f) mentioned earlier includes all the observer methods, the preceding conflict does not occur, and the learner synthesizes the precondition (x = y ∧ s1.Contains (x)) ∨ (x  y). A crucial aspect here is that the testing-based teacher helps the learner by generating counterexamples that show the conjectured precondition to be unsafe or non-maximal. We terminate only when the learner is able to convince the test generator that the precondition is safe and maximal (modulo the power of the test generator). 3 Problem Formalization of Precondition Synthesis Modulo a Test Generator In this section, we formalize the problem of synthesizing preconditions with the aid of a test generator.  with formal paWe assume that we have a method m(p)  rameters p and assertions in it for which we want to synthesize a precondition. Intuitively, we want the precondition to satisfy two requirements: (a) be safe, in the sense that the method when called with any state allowed by the precondition does not throw an exception (either a runtime exception such as division by zero or an assertion-violating exception), and (b) be maximal, in the sense that it allows as many inputs as possible on which the method does not throw an exception. Since we do not know a priori the precise set of inputs on which the method throws an exception and does not throw exceptions, respectively, we resort to obtaining this information from a test generator. Challenges in defining the problem and framework. Defining the precondition synthesis formally modulo a test generator is complicated by three main aspects of the problem: − Incomplete information of object state: Preconditions can depend on the receiver object state of the method m() for which we are synthesizing the precondition for, and the state of objects that are passed as parameters to m(). We propose a set of observer methods that give properties of these objects, and allow the precondition to state restrictions using these properties. We hence work with feature vectors, which capture the return values of observer methods on objects. However, using observer methods intrinsically introduces incomplete information about the object state: several different input states can have the same feature vector. − Incomplete test generator: Given a method and a precondition for it, the test generator can find input states that the precondition should disallow as the method can throw an exception on these input states and find input states that — 148 — Astorga, Madhusudan, Saha, Wang, and Xie the precondition should allow as the method does not throw any exception on these input states. A feature vector is valid (or invalid) if the method can throw an exception on none (or one) of all input states conforming to the feature vector. However, since we work with an abstraction of input states using feature vectors, we need a test generator to find valid feature vectors and invalid ones. It turns out that given a precondition, a test generator can readily be adapted to find invalid feature vectors, but not valid ones. Consequently, we need to work with a test generator that may mark a feature vector tentatively valid, and then later change its mind and find it invalid. Learning of preconditions hence needs to accommodate such fluctuations. 
− Expressiveness of the logic: The logic used for expressing the precondition may not be expressive enough to distinguish two feature vectors, one being valid and the other being invalid. In other words, there is another level of abstraction caused by the logic, in addition to the abstraction induced by the use of observer methods, and the precondition must be permitted to disallow certain positive feature vectors. Our solution to the preceding challenges involves (1) defining the precondition synthesis problem as synthesizing an ideal precondition (Definition 3.2), where the notion of an ideal precondition accommodates the fluctuations of a test generator, and (2) a framework that synthesizes ideal preconditions using a conflict resolver (Section 4 and Figure 2) that manipulates counterexamples returned by the test generator in each round. We emphasize that the component for synthesizing formulas from the (conflict-resolved) counterexamples is standard, and we can use a variety of learning algorithms from the literature. However, arguing convergence of such learning algorithms in learning ideal preconditions in the presence of the conflict resolver has to be argued anew (Section 4.3). We next formalize the notions of programs, valid and invalid input states, and testing-based teachers (Section 3.1), and then formalize the problem of precondition synthesis modulo a testing-based teacher using the notion of an ideal precondition (Section 3.2). 3.1 Observer Methods, Logic for Preconditions, and Testing-Based Teachers Methods. Let us fix a set of types T, including primitive types and classes. Each type t ∈ T is associated with a data domain D(t) that denotes the set of values that variables of type t range over. In the following, we assume that each variablev has an implicit type t associated with it. In addition, we denote D(t) by simply using D(v).  with formal We assume that we have a target method m(p) parameters p that we want to synthesize a precondition for. Let us also fix a set of pure (i.e., side-effect free) observer methods F = {f 1 (p1 ), . . . , fn (pn )} that return a primitive type. These methods help query properties of the state of the Learning Stateful Preconditions Modulo a Test Generator objects whose class defines these methods. For a method m with input parameters p that we aim to find a precondition for, we allow the precondition to express properties of p using constraints on variables of primitive types in p as well as the return values of observer methods that return a value of primitive type when called with tuples of parameters drawn  from p. We have, apart from the above, other methods for classes (including constructors and mutating methods, i.e., those that mutate the object). The test generator can use these methods to create valid object states, by using method sequences composed of constructors and mutating methods. Let us now define the semantics of the methods abstractly. For any class c, let Sc denote the set of valid states of the object of the class c (Sc can be infinite, of course, and denotes the set of valuations and heaps maintained by the public/private fields in the class). Note that we assume that the set Sc contains valid object states, i.e., reachable states from initial object construction. For each parameter p of type class c, let us denote by D(p) the valid states Sc . 
The semantics of the observer method fi (pi ) is given by a (complete) function fi  : D(pi ) −→ D i , where D i is the data domain for the return primitive type of method fi . Note that the observer methods return properties of the state of the object but do not change the state. Note also that we require these observer methods not to throw exceptions, and hence model their semantics using complete functions.  is given by a partial The semantics of the method m(p)  ⇀ Sc × D, where c is the class that function m : Sc × D(p) m belongs to and D is the data domain for the return type of m (whether it be of primitive type or a class).  Valid and invalid input states. An input state for m(p)  where c is the class that method is pair (s,v) ∈ Sc × D(p) m belongs to and v is a valuation of the parameters in p of method m. Note that the input state contains the receiver object state namely s of m, and the values of the parameters in p (some of which can be object states as well of their respective classes). We say that an input state (s,v) is an invalid input state for m if m throws an exception1 on that input state i.e., m is undefined on (s,v). We say that an input state (s,v) is a valid input state for m if (s,v) is not an invalid input state. Feature vectors. One fundamental aspect of our problem is that the client does not know precisely the internal states of objects (the receiver object state and state of other objects given as parameters), but has incomplete information about them gleaned from the return values of observer methods. We define a feature vector f as a vector of values of the primitive parameters in p and the values of observer methods 1 Note that in this paper, when we say an exception, we refer to an uncaught exception as unexpected program behaviors such as DivideByZeroException. Assertion violation can also cause an uncaught exception to be thrown. PLDI ’19, June 22–26, 2019, Phoenix, AZ, USA Software Automation on the object states (called with various combinations of the Big Data Era:  Typically,in parameters from p). the features are of primitive Challenges types (integer and Boolean in ourand tool). Opportunities Logic for expressing preconditions. The logic L, for ex in this paper is pressing preconditions for a method m(p) quantifier-free first-order logic formulas. Recall that classical first-order logic is defined by a class of functions, relations, and constants. We choose this vocabulary to include the following: (a) the usual vocabulary over the various primitive data domains that the program operates on (Booleans, integers, strings, arrays of integers, etc.), and (b) observer methods as functions. The logic then allows quantifier-free  Note that such a formula φ, formulas with free variables p. when interpreted at a particular program state (which gives meaning to various objects and hence to corresponding observer methods), defines a set of input states—the input states (s,v) such that when observer methods are interpreted using the state s, and input parameters p are interpreted using v, the formula holds. Hence, a logical formula represents a precondition—the set of states that satisfy the formula being interpreted as the precondition. Note that the logic cannot distinguish between two input states that have the same feature vector. We can in fact view logical formulas as defining sets of feature vectors. The logic hence introduces a coarser abstraction of feature vectors (which themselves are abstractions of input states). 
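As a concrete illustration of this abstraction (a sketch under assumed names, not the tool's code), the snippet below maps an input state of the ArrayList example from Section 2 to a feature vector via observer methods and evaluates a candidate precondition over that vector, here the formula (x = y ∧ s1.Contains(x)) ∨ (x ≠ y) synthesized in Section 2. The feature ordering is illustrative; note that any two input states with the same feature vector receive the same verdict.

```python
# Observer abstraction: input state -> feature vector -> precondition verdict.

class ArrayList:
    def __init__(self, items):
        self.items = list(items)
    # observer methods (side-effect free)
    def Count(self):
        return len(self.items)
    def Contains(self, v):
        return v in self.items
    def IndexOf(self, v):
        return self.items.index(v) if v in self.items else -1
    def LastIndexOf(self, v):
        if v not in self.items:
            return -1
        return len(self.items) - 1 - self.items[::-1].index(v)

def feature_vector(s1, x, y):
    """Features built from primitive parameters and observer-method calls."""
    return (s1.Count(), x, y, s1.IndexOf(x), s1.IndexOf(y),
            s1.LastIndexOf(x), s1.LastIndexOf(y), s1.Contains(x), s1.Contains(y))

def precondition(fv):
    """The precondition learned in Section 2: (x = y and s1.Contains(x)) or (x != y)."""
    _, x, y, _, _, _, _, contains_x, _ = fv
    return (x == y and contains_x) or (x != y)

# Two input states with the same feature vector are indistinguishable to the logic.
print(precondition(feature_vector(ArrayList([1, 2, 3]), 2, 2)))   # True
print(precondition(feature_vector(ArrayList([1, 2, 3]), 5, 5)))   # False
```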
For the tool and evaluation in this paper, the logic L is a combination of Boolean logic and octagonal constraints on integers; the observer methods work on more complex datatypes/heaps (e.g., stacks, sets), returning Booleans or integers as output (e.g., whether a stack is empty, the size of a set container).

Testing-based teachers and counterexamples. The general problem of precondition synthesis is to find a precondition expression φ (in the logic L) that captures a maximal set of valid feature vectors (where a valid feature vector is one whose conforming input states are all valid) for the method m. This synthesis problem is clearly undecidable. In fact, checking whether m throws an exception on even a single input state is undecidable. Proving a precondition to be safe requires verification, a hard problem in practice, and current automatic verification techniques do not scale to large code bases. We hence shape the definition of our problem with respect to a test generator, which we call a testing-based teacher (TBT). (We call it a teacher as it teaches a learner the precondition.) A TBT is simply a test generator that generates test input states for m. Ideally, we would like the TBT to be guided to find test input states showing that a given precondition φ is not safe or not maximal, i.e., input states allowed by φ on which m throws exceptions and input states disallowed by φ on which m does not throw an exception (hence property-driven testing tools such as Pex are effective, but not testing tools such as Randoop that generate random inputs). Formally:

Definition 3.1 (Testing-based teacher). A testing-based teacher (TBT) is a function that takes a method m and a precondition φ for m, and generates a finite set of input states for m (that may or may not be allowed by φ) together with whether each is valid or not. □

Note that in our formulation, the TBT is a function; hence, for any method and precondition, we expect the TBT to be deterministic, i.e., it produces the same set of test inputs across rounds for a given precondition. This assumption is not a limitation of our framework, but a way to formalize a TBT. Any testing-based tool can be made deterministic by fixing its random seeds and by fixing configurable bounds such as the number of branches explored. We do not require a TBT to report all (or indeed any) input states. The TBT is incomplete and may not be able to find a counterexample (for safety or maximality), even if one exists.

Given a method m and a precondition φ for it, we can examine the test inputs generated by the TBT to check whether they contain counterexamples. An input state that is allowed by φ but leads m to throw an exception shows that φ is not safe, and is a negative counterexample. An input state that is disallowed by φ and on which m executes without throwing any exception indicates that φ may potentially not be maximal, and we call such an input state a positive counterexample. (As we shall see, such a counterexample does not necessarily indicate that φ is not maximal.)

We are now ready to define the goal of precondition generation parameterized over such a TBT. Roughly speaking, we want to find maximal and safe preconditions expressible in our logic; however, the precise definition is more subtle, as we describe next.

3.2 Precondition Synthesis Modulo a Testing-Based Teacher

Incomplete information. Since the learner learns only with respect to an observer abstraction in terms of feature vectors, we assume that input states returned by the testing-based teacher are immediately converted to feature vectors, where the feature values are obtained by calling the respective observer methods. We also refer to feature vectors as positive or negative counterexamples if the conforming input states are positive or negative, respectively.

For any feature vector f, there are, in general, several input states that conform to f (i.e., those input states whose features are precisely f). Recall that a feature vector f is valid if all input states conforming to it are valid input states; a feature vector f is invalid if it is not valid—i.e., there is some invalid input state whose feature vector is f.

It turns out that the incomplete information the client has about the object state creates many complications. In particular, a testing tool can find invalid feature vectors but cannot find valid feature vectors using test generation. Consider a method m and a precondition φ for it. The precondition defines a set of feature vectors, which in turn define a set of input states. Notice that if we can find an input state that conforms to φ on which the program throws an exception, we can deem the precondition unsafe and declare the feature vector corresponding to that input state invalid. We name such feature vectors negative counterexamples, and a testing tool can find these invalid vectors.

However, notice that an execution of the method on a single input state cannot show φ to be non-maximal. If the testing tool finds a valid input state (s, v) disallowed by φ, we still cannot say that the feature vector corresponding to that input state is valid. The reason is that there may be another invalid input state (s′, v′) that conforms to the same feature vector.

Intuitively, witnessing non-maximality boils down to finding a valid feature vector. This is the same as asking whether there exists a feature vector disallowed by φ such that all input states conforming to the feature vector are valid. The ∃∀ nature of this question is what makes finding counterexamples for maximality hard using test generation. (Even logic-based tools, such as Pex, that use SMT solvers are typically effective/decidable only for ∃* properties, i.e., quantifier-free formulas.) On the other hand, finding an invalid feature vector included by φ asks whether there exists a feature vector allowed by φ such that there exists an invalid input state conforming to the feature vector; this is an ∃∃ question that can be answered using tools such as Pex.

Formalizing precondition generation modulo a testing-based teacher. As explained earlier, for a precondition φ, an invalid input state (allowed by φ) found by a testing-based teacher (TBT) is a witness to the fact that φ is unsafe, i.e., no safe precondition should allow this input state.

Valid input states ((s, v), +) found by the TBT but disallowed by the current precondition indicate that the precondition may potentially not be maximal, as it disallows an input state on which m does not throw an exception. However, we do not want to demand that we find a precondition that definitely allows (s, v). The reason is that such a requirement is too strong, as there may be another input state of the form ((s′, v′), −) that conforms to the same feature vector as (s, v). Another reason is that even if the feature vectors are not the same, the logic may be unable to distinguish between the two vectors. In other words, it may be the case that no precondition expressible in our logic is both safe and allows this positive example (s, v).
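The distinction between the two kinds of counterexamples can be made concrete with a small Python sketch (ours, not the paper's; phi is a predicate over feature vectors, and tbt_results stands for the labeled input states returned by a testing-based teacher):

def classify_counterexamples(phi, tbt_results, featurize):
    # phi: predicate over feature vectors; tbt_results: iterable of
    # ((state, params), is_valid) pairs as produced by a testing-based teacher.
    negatives, positives = [], []
    for (state, params), is_valid in tbt_results:
        f = featurize(state, params)
        if phi(f) and not is_valid:
            # phi admits an exception-raising input: conclusive evidence
            # that phi is unsafe, and f is an invalid feature vector.
            negatives.append(f)
        elif not phi(f) and is_valid:
            # phi excludes a valid input: phi is only *potentially*
            # non-maximal, since another invalid input state may share f.
            positives.append(f)
    return negatives, positives

A negative counterexample is conclusive evidence against safety; a positive one is only suggestive, for the reasons given above.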
We next define the notion of an ideal precondition, which captures both safety and maximality modulo the incomplete information that the client has of the object state and modulo the expressiveness of the logic, with respect to the TBT. First, let us define some terminology: for any two input states (s, v) and (s′, v′), we say that (s, v) is L-indistinguishable from (s′, v′) if there is no formula in the logic L that evaluates to true on one of them and false on the other (note that if the two input states conform to the same feature vector, then they are indistinguishable no matter the logic). In a similar way, we define L-indistinguishability for feature vectors.

Definition 3.2. An ideal precondition for m(p) with respect to a TBT is a precondition φ in the logic L such that φ satisfies the following two conditions:
• Safety wrt TBT: the TBT returns a set that has no invalid input state allowed by φ.
• Maximality wrt TBT: for every valid input state ((s, v), +) returned by the TBT but disallowed by φ, there is some invalid input state ((s′, v′), −) (returned by the TBT when given some precondition) that is L-indistinguishable from (s, v). □

Intuitively, the first condition, safety, demands that the TBT is not able to find any invalid input state allowed by the precondition (i.e., one on which m throws an exception). The second condition states that for any valid input state (s, v) found by the TBT but disallowed by the precondition, there must be some invalid input state (returned by the TBT when given some precondition, not necessarily φ) that is L-indistinguishable from (s, v). A precondition on which the TBT returns the empty set is hence also an ideal precondition. Note that, in general, there may be no unique safe and maximal precondition.

We can now state the precise problem of precondition generation modulo a TBT:

Problem Statement: Given a program with the method m(p) and observer methods, a logic L for expressing preconditions for m, and a testing-based teacher TBT, find an ideal precondition for m(p) with respect to the TBT.

4 The Learning Framework for Synthesizing Preconditions Modulo a Test Generator

In this section, we describe our general learning framework for synthesizing ideal preconditions with respect to a testing-based teacher (TBT). We first describe this framework (Section 4.1) and then discuss multiple ways to instantiate it (Section 4.2). Adapting a TBT to realize this framework is discussed in Section 5. Finally, in Section 4.3, we discuss general conditions under which we can show that our learners and learning framework converge to an ideal precondition with respect to any TBT.

4.1 Framework Overview

[Figure 2: The learning framework for synthesizing ideal preconditions with respect to a TBT.]

Our learning framework, depicted in Figure 2, consists of the following components:
(1) a passive learner (precondition synthesizer) that synthesizes preconditions from positive and negative feature vectors; (2) a TBT, interacting in rounds of communication with the learner, that returns valid/invalid input states; (3) a featurizer that converts valid/invalid input states to positive/negative feature vectors; and (4) a conflict resolver (CRL), which is the main novel component, that resolves conflicts (created by incomplete information) by changing positive feature vectors to negative ones when necessary. We emphasize that one can use any standard passive learner in this framework as long as it finds formulas that are consistent with the set X of labeled feature vectors.

The framework maintains a set X, which contains the accumulated set of (conflict-resolved) positive/negative labeled feature vectors that the TBT has returned. In each round i, the learner proposes a precondition φi that is consistent with this set, and the TBT returns a set of valid and invalid input states. The featurizer, with the help of observer-method calls, converts these input states to positive/negative labeled feature vectors Ci. We add these counterexample feature vectors to X, call the conflict resolver for the logic L, and update X. We then check whether the current conjecture φi is consistent with the updated X—namely, whether φi is true on every positive feature vector and false on every negative feature vector. If it is, then we exit, having found an ideal precondition; if not, we iterate with the precondition synthesizer on the new set X.
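The round-based interaction just described can be summarized by the following schematic Python sketch (ours, not the Proviso implementation); learner, tbt, featurize, and conflict_resolve stand for the components listed above and are assumed to be supplied by an instantiation of the framework:

def is_consistent(phi, X):
    # phi must hold on every positive and fail on every negative feature vector.
    return all(phi(f) == (label == '+') for f, label in X)

def learn_precondition(learner, tbt, featurize, conflict_resolve):
    X = frozenset()                         # accumulated labeled feature vectors
    while True:
        phi = learner(X)                    # round i: propose a consistent conjecture
        counterexamples = frozenset(
            (featurize(s, v), '+' if valid else '-')
            for (s, v), valid in tbt(phi))  # teacher's response, featurized
        X = conflict_resolve(X | counterexamples)
        if is_consistent(phi, X):           # nothing usable refutes the conjecture
            return phi                      # ideal modulo the testing-based teacher
        # otherwise iterate with the enlarged, conflict-resolved sample set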
Conflict resolver. Formally, the conflict resolver, given a set X of positive and negative feature vectors, returns a set of positive and negative feature vectors such that
• the returned set contains every feature vector (in X) that is negative; and
• for any positive feature vector (f, +) in X, if there is a negative feature vector (f′, −) in X such that f and f′ are L-indistinguishable, then the returned set contains the negative feature vector (f, −); otherwise, the returned set contains the positive feature vector (f, +).

To understand why a conflict resolver working as above is a sound way to obtain ideal preconditions, recall the two properties of ideal preconditions in Definition 3.2: safety wrt TBT and maximality wrt TBT. The conflict resolver keeps negative feature vectors as they are (since safety wrt TBT requires that the precondition exclude them). However, when a positive feature vector has a corresponding indistinguishable negative feature vector (returned by the TBT in this round or a previous round), it is clear that no precondition expressible in the logic can include the positive feature vector. Hence the conflict resolver turns it negative, which is allowed by the definition of maximality wrt TBT in the definition of ideal preconditions.

[Figure 3: An example of conflict resolution where positive vectors are made negative. Partitions denote equivalence classes of indistinguishable vectors; points denote positive and negative feature vectors. The shaded region denotes a consistent precondition.]

Figure 3 shows an example of the effect of a conflict resolver: it converts two positive feature vectors to negative ones since they have corresponding negative feature vectors (in X) that are not distinguishable from them. A consistent precondition (shown as the shaded region) consists of some equivalence classes of indistinguishable feature vectors that include the positive vectors and exclude the negative ones, after conflict resolution.

Notice that any set X of counterexamples accumulated during the rounds is a subset of the set of all counterexamples that the TBT returns over all possible preconditions. Hence it is easy to see that if φi is consistent with the conflict-resolved set obtained from X ∪ Ci, then it is in fact ideal: for every positive counterexample disallowed by φi, X contains an indistinguishable negative counterexample (returned by the TBT). Consequently, the precondition returned when the learning framework terminates is ideal.

4.2 Instantiations of the Framework

Our framework can be instantiated by choosing a logic L, choosing any synthesis/learning engine for exactly learning logical expressions in L, and building conflict resolvers for L. We next list several such possibilities.

Logic for preconditions. We can instantiate our framework with the logic LB,Z described below for expressing preconditions. Let us assume that feature vectors consist of a set of Boolean features P = {α1(p), . . . , αn(p)} and a set of integer features N = {r1(p), . . . , rt(p)}. Note that these features all depend on the parameters p and are either Boolean or integer parameters in p, or calls to observer methods (using parameters in p) that return Booleans or integers. The grammar for the logic LB,Z of preconditions that we consider is

φ ::= α(p) | r(p) ≤ c | φ ∨ φ | φ ∧ φ | ¬φ

where α ∈ P, r ∈ N, and c ∈ Z. We also consider certain sublogics of this logic; one of particular interest is discussed in Section 4.3 on convergence, where we require the threshold constants c to come from a finite set of integers B.
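To make the grammar concrete, here is one possible (hypothetical) encoding of LB,Z formulas as Python predicates over feature vectors; the feature names and the example precondition are invented for illustration:

# Atoms of L_{B,Z}: Boolean features alpha(p) and threshold constraints r(p) <= c.
def atom(name):
    return lambda f: bool(f[name])

def at_most(name, c):
    return lambda f: f[name] <= c

def land(*gs): return lambda f: all(g(f) for g in gs)
def lor(*gs):  return lambda f: any(g(f) for g in gs)
def lnot(g):   return lambda f: not g(f)

# Example precondition over a feature vector given as a dict of feature values,
# roughly "!s.IsEmpty() && s.Count() <= 10".
phi = land(lnot(atom('IsEmpty')), at_most('Count', 10))
print(phi({'IsEmpty': False, 'Count': 3}))   # True
print(phi({'IsEmpty': True, 'Count': 0}))    # False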
Learners. By treating the Boolean and integer features as Boolean and integer variables, we can use exact-learning variants of the ID3 algorithm for learning decision trees [25, 31] in order to synthesize preconditions in the logic LB,Z. It is easy to adapt Quinlan's decision tree learning algorithm (which synthesizes small trees using a greedy algorithm guided by statistical measures based on entropy) into an exact learning algorithm [17]. In our evaluation (Section 6), we mainly use such a learner.

A second and more expressive choice is to use passive learners expressed in the syntax-guided synthesis framework (SyGuS [2]). This framework allows a logic to be specified syntactically (using standard logic theories) together with a specification expressing properties of the formula to be synthesized. By making this specification express that the formula is consistent with the set of samples, we obtain a passive learner that synthesizes expressions. The salient feature here is that instead of working with a fixed set of predicates (as in the preceding decision-tree algorithm), predicates are dynamically synthesized based on the samples. Multiple solvers are available for the SyGuS format, which is also the subject of a synthesis competition, and learners based on stochastic search, constraint solving, and combinations with machine learning are known [3, 27, 32]. In fact, one recent tool, PIE [29], is similar to a SyGuS solver and can be used as a passive learner too. In our evaluation, we have tried multiple SyGuS solvers as well as the PIE passive learner.

Conflict resolvers. Conflict resolver algorithms crucially depend on the logic. For the preceding logic LB,Z with Boolean and integer features, any two feature vectors that are different are in fact separable using the logic, as each vector can be isolated from the rest. Consequently, the conflict resolver simply changes a positive feature vector to negative iff the same feature vector also occurs negatively in the set X.

Consider now the same logic LB,Z but where we require the threshold constants c to be bounded, i.e., |c| < b, where b is a fixed natural number. A conflict resolver for this logic needs to turn a positive feature vector f negative iff there is a negative feature vector g that agrees with f on all Boolean features and, for each integer feature, either f and g have the same feature value, or the feature values in f and g are both larger than b, or they are both smaller than −b. The implementation of this algorithm is straightforward; a sketch is given below.
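The two conflict resolvers just described can be sketched in Python as follows (an illustrative sketch, not the Proviso code; representing feature vectors as tuples of (is_boolean, value) pairs is our own convention):

def resolve_exact(X):
    # Unbounded thresholds: distinct vectors are always separable in L_{B,Z},
    # so a positive is flipped only if the identical vector also occurs negatively.
    negatives = {f for f, label in X if label == '-'}
    return {(f, '-') if f in negatives else (f, label) for f, label in X}

def indistinguishable(f, g, b):
    # Bounded thresholds |c| < b: the vectors agree on every Boolean feature,
    # and each integer feature is either equal or saturated on the same side
    # of the window [-b, b].
    for (f_is_bool, fv), (_, gv) in zip(f, g):
        if f_is_bool:
            if fv != gv:
                return False
        elif not (fv == gv or (fv > b and gv > b) or (fv < -b and gv < -b)):
            return False
    return True

def resolve_bounded(X, b):
    negatives = [f for f, label in X if label == '-']
    resolved = set()
    for f, label in X:
        if label == '+' and any(indistinguishable(f, g, b) for g in negatives):
            resolved.add((f, '-'))
        else:
            resolved.add((f, label))
    return resolved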
4.3 Convergence of Learning

We argued earlier that if the learning framework, instantiated with any learner, terminates, then it has computed an ideal precondition. In this section, we consider settings where the learning framework is also convergent (i.e., is guaranteed to terminate). Let us fix a testing-based teacher TBT and assume that there is (at least) one target concept φ* in L such that if C is the set of all counterexamples returned by the TBT (in response to any possible precondition), then φ* is consistent with C. We consider the case when the hypothesis space H of preconditions is finite, i.e., when the number of possible preconditions is finite, and when the logic is closed under Boolean operations. For the logic LB,Z, this finite space naturally occurs when the features are all Boolean or when we fix a certain finite set of constants as thresholds for numerical inequalities, e.g., [−b, +b] for some b ∈ N. We can now show that our learning framework, with any learner that learns consistent formulas, is guaranteed to find an ideal precondition with respect to the TBT.

Theorem 4.1. Let the logic for preconditions be any finite hypothesis space of formulas H that is closed under conjunctions, disjunctions, and negation. Consider any instantiation of the learning framework with any conflict resolver and any learner that always returns a concept consistent with all the given labeled feature vectors, if one exists. Then for any method m(p), the learning framework is guaranteed to terminate and return an ideal precondition for m, provided that m has an ideal precondition expressible in the logic.

Proof gist: We argue that the learner can always return a hypothesis consistent with the samples in each round, and that when it first repeats the conjecture of a hypothesis H, the hypothesis H must be an ideal precondition. The reason for the latter is that when H was first proposed, the teacher returned a set of counterexamples. If the learner later proposes H again, H must be consistent with those counterexamples; this can only happen if, in the interim, the teacher has returned at least one indistinguishable negative counterexample for each positive counterexample disallowed by H. Hence H is ideal. Given that the hypothesis space is finite, the learner must eventually repeat a hypothesis, and hence the framework always converges.

The reason why any consistent learner always finds some logical formula that satisfies the set of (conflict-resolved) samples is as follows. First, let Ĥ denote the set of tightest preconditions; since the logic is closed under Boolean operations, any hypothesis in H is a disjunction of preconditions in Ĥ. Let ≡ be the equivalence relation on the set of input states that relates any two input states not distinguishable by the formulas in H (equivalently, by Ĥ). We know that the conflict resolver ensures that the feature vectors in each equivalence class are pure—there are no positive and negative vectors in the same class. Consequently, the disjunction of the formulas corresponding to the equivalence classes containing the positive samples is consistent with the samples, and it is in H. This concludes the proof. □

5 Construction of a Testing-Based Teacher

In this section, we describe our techniques for adapting a test generator into a testing-based teacher (TBT) that actively tries to find counterexamples to the safety and maximality of given preconditions. We also describe how the featurizer can be implemented.

5.1 Extracting Counterexamples

The first issue is to adapt the test generator to return negative counterexample inputs showing that a precondition is not safe, and positive counterexample inputs showing that a precondition is potentially not maximal. A test generator's goal is slightly different from a TBT's (see Definition 3.1). Given a method m(p) and a precondition φ, the goal of a test generator is to find samples of the form (s, v) allowed by φ and to generate valid and invalid input states, typically trying to find invalid ones.

Extracting negative counterexamples is easy: we keep the same precondition (the precondition needs to be evaluated by calling the various observer methods) and we ask the test generator to find inputs that cause exceptions. Valid and invalid inputs found by the test generator can be returned.

To extract positive counterexamples, we instrument the method as follows:
• replace the precondition φ with its negation ¬φ;
• for every assert statement, insert an assume statement for the same condition right before the assert statement; and
• add an assert(false) statement at the end of the method and before every return statement (if any).
The valid/invalid inputs found by the test generator for the instrumented method are returned (as valid/invalid inputs to the original method). (A schematic sketch of this instrumentation appears at the end of this section.)

5.2 Implementing the Featurizer

To form feature vectors from inputs generated by the test generator, we insert additional statements at the beginning of the method for computing the features. The features are computed by calling the various observer methods and storing their return values in variables of appropriate types. Although in theory we assume that observer methods are pure, this assumption may not always hold in practice. In our evaluation, we manually ensure that the chosen observer methods are pure.
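For intuition, the instrumentation of Section 5.1 for mining positive counterexamples can be pictured with the following Python-flavored sketch (Proviso performs the corresponding rewriting on C# code for Pex; the wrapper, the Rejected class, and the helper names here are invented):

class Rejected(Exception):
    # Plays the role of a violated 'assume': the input is discarded, not reported.
    pass

def instrument_for_positives(m, phi, featurize):
    # Wrap the method m so that a test generator hunting for assertion failures
    # in the wrapper is, in effect, hunting for positive counterexamples to phi.
    def m_prime(state, params):
        if phi(featurize(state, params)):
            raise Rejected()        # keep only inputs disallowed by phi (assume !phi)
        m(state, params)            # if m throws here, the input is invalid, as before
        raise AssertionError(       # reached only by valid inputs that phi disallows
            "positive counterexample candidate")
    return m_prime

The remaining step of the real instrumentation, inserting an assume before every existing assert so that the method's own assertion failures are filtered out rather than reported, is omitted from this sketch.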
Table 1. Statistics of evaluation subjects

Project / Classes                                          #Classes   #LOC
.NET Data structures: Stack, Queue, Set, Map, ArrayList        46    14886
QuickGraph: Undirected Graph, Binary Heap                     319    34196
Lidgren Network: NetOutGoingMessage, BigInteger                59    14042
DSA: Numbers                                                   60     5155
Hola Benchmarks                                                 1      933
Code Contract Benchmarks                                        1      269

6 Evaluation

We prototype an implementation of our framework, called Proviso, for synthesizing preconditions for C# programs. We adapt an industrial test generator, Pex [39], into a testing-based teacher, choose the logic LB,Z over Booleans and integers (introduced in Section 4.2), and adapt a variant of Quinlan's C5.0 decision tree algorithm [25, 31] into an exact learner [17]. The conflict resolver is the one for LB,Z described in Section 4.2. We instantiate the framework for two tasks of specification inference: learning preconditions for preventing runtime failures and learning conditions for method commutativity in data structures. The Proviso tool terminates only when it finds an ideal precondition modulo the test generator.

In our evaluation, we address the following main research question:

RQ: How effectively can Proviso learn truly ideal preconditions?

The purpose of this research question is to investigate how effective our framework is in learning preconditions that are truly ideal—truly safe and truly maximal. In Section 4, we show that our learning algorithm, when it terminates, converges on a safe and maximal precondition with respect to the test generator. However, the learned precondition may be neither safe nor maximal when compared to the ground truth (as determined by a programmer examining the code). This can happen for multiple reasons: ineffectiveness of the test generator in generating counterexamples, lack of observer methods capturing sufficient detail of the objects, and inexpressiveness of the logic for the right preconditions. To answer this research question, we manually inspect all of the cases, derive ground truths to the best of our abilities, and compare them with the preconditions synthesized by Proviso.

Evaluation setup. We evaluate our framework on a combination of small and large projects studied in previous work related to precondition inference [6, 29, 30] and test generation [38, 42]. We consider classes with methods from these projects whose parameters are of primitive types currently supported by our learner (i.e., int, bool) or whose parameters
In total, our evaluation subjects include 105 method pairs for learning commutativity conditions and 121 methods for learning conditions to prevent runtime failures. For each data structure and non-primitive type, we implement abstract equality methods and factory methods. The equality methods compare object states for equivalence, and the factory methods (which Pex exploits) create objects from primitive types. For each method or method pair in our evaluation, we use all and only the public observer methods in the interface of their respective class. Table 2 summarizes our evaluation results, including statistics on our subjects, statistics on learning, and details on validation with respect to Randoop [28] (a test generator) and ground truth. 6.1 RQ: Learning Ideal Preconditions We assess the effectiveness of our framework in two main aspects: one being quality of the learned preconditions while the other being the efficiency of precondition learning. Quality of learned preconditions. We examine in two ways whether the learned preconditions are indeed truly ideal. First, we use another test generator compatible with C# programs, namely Randoop [28], to check whether a precondition is safe. If Randoop can generate an invalid input allowed by the learned precondition, then it is clear that the learned precondition is not truly safe (despite the fact that Pex did not find such input). After this first step, we manually inspect each case where Randoop cannot generate inputs to show unsafety, deriving the truly ideal precondition manually and checking whether it is equivalent to the learned precondition. Our results shown in Table 2 suggest that learning modulo a test generator can be effective in learning truly safe and maximal preconditions, despite the test generator’s incompleteness. Out of the 105 commutativity cases, we find that Proviso can learn 73 (∼70%) truly safe and maximal preconditions. Learning Stateful Preconditions Modulo a Test Generator In addition, Proviso learns 24 other preconditions that are only truly safe. Overall, ∼92% of the preconditions learned by Proviso for the commutativity cases are truly safe. For the 121 exception-prevention cases, we find that Proviso can learn 105 (∼87%) truly safe and maximal preconditions. Proviso learns additional 4 preconditions that are only truly safe. In multiple cases, Proviso does not learn a truly ideal precondition due to lacking appropriate observer methods. For example, a commutativity precondition synthesized for a .NET benchmark involves checking whether a setter and getter on a dictionary commute, and Proviso learns a precondition that is neither truly safe nor truly maximal. However, if we implement an additional observer method ContainsValueAt(x), which returns the value at x, then Proviso learns (s1.ContainsValueAt(x) && s1.ContainsKey(y)) || (!(x == y) && s1.ContainsKey(y)), which is a truly safe and maximal precondition. Another example is the commutativity of methods peek() and pop() in a stack—they commute when the top two elements in the stack are identical. However, this property turns out to be not expressible using the available observer methods and the learned precondition is false. In most cases, however, Proviso does learn truly safe and maximal preconditions that are natural and easy to read. For example, for the commutativity of push(x) and push(y), Proviso learns the precondition x == y, which is indeed truly safe and maximal. A sample of learned preconditions can be found on the Proviso website2 . 
For learning preconditions to prevent runtime failures, Proviso performs very well. Proviso also performs well on the larger open source programs in Lidgren Network, in terms of both correctness and time spent in learning. A particular case of interest in DSA is where Proviso is able to learn a truly ideal precondition [number < 1024] for the toBinary method, which converts an integer to its binary representation (the constant 1024 is discovered by Proviso). For values of 1024 or higher, an integer overflow exception occurs deep in .NET library code.

Efficiency of precondition learning. We also measure the time efficiency of Proviso in learning preconditions. Proviso takes on average ~740 seconds per method/method pair to synthesize preconditions.

6.2 On Empirical Comparison with Related Work

It is hard to provide a fair comparison with the two closely related approaches by Padhi et al. [29] and Gehr et al. [18]. The approach by Gehr et al., strictly speaking, does not learn preconditions. It learns conditions under which two methods have commuted after their execution; the learned conditions are expressed over primitive-type parameters and return values of these two methods (note that, by definition, preconditions for these two methods should not be expressed using their return values). In addition, the learned conditions cannot capture properties of object states. The approach by Padhi et al. [29] learns preconditions while also synthesizing auxiliary predicates. In this case, the languages of the programs are different (ours is for C# while theirs is for OCaml), and a direct tool comparison is hard. However, since our framework allows any passive learner to be plugged in, we plug in the passive learner used by Padhi et al. [29] (PIE) and reproduce our evaluation. The results show that when the feature sets are fixed, Proviso equipped with Padhi et al.'s learner has effectiveness similar to that of Proviso with its default decision-tree learner, but when features are not provided, Proviso equipped with Padhi et al.'s learner takes much longer and even diverges in some cases.

7 Related Work

Black-box approaches. Ernst et al. [15] proposed Daikon for dynamically detecting conjunctive Boolean formulas as likely invariants from black-box executions that gather runtime state information (method-entry states, method-exit states); Daikon, seen as a learning algorithm, learns using only positive counterexamples and, unlike our approach, does not make any guarantees of safety or maximality. Our work is most closely related to three black-box approaches by Padhi et al. [29], Gehr et al. [18], and Sankaranarayanan et al. [33]. The last two approaches [18, 33] rely on generating test inputs by sampling feasible truth assignments of input predicates or assignments satisfying representative formulas in a particular logic, followed by Boolean learning from positive and negative examples to infer preconditions. However, these approaches do not provide any guarantees, unlike our work, where we guarantee that the final learned precondition is both safe and maximal with respect to a testing-based teacher. Padhi et al. [29] proposed a data-driven learning approach based on feature learning, including black-box and white-box components. Its black-box component, PIE, learns a formula from a fixed set of tests.
Its white-box component, VPreGen, uses an iterative refinement algorithm that relies on counterexamples returned by a verifier to learn provably safe preconditions. However, the white-box component does not make any guarantees on maximality as we do. Furthermore, to assure that preconditions are provably safe, inductive loop invariants must be synthesized, further complicating the problem. In our approach, we replace the verifier with a testing-based teacher for practical reasons and handle the accompanying challenges.

Program and expression synthesis. The field of program synthesis deals with the problem of synthesizing expressions that satisfy a specification. One of the most promising approaches to synthesizing expressions is counterexample-guided inductive synthesis (CEGIS) [2], which in fact resembles online learning. In this setting, the target expression is learned in multiple rounds of interaction between a learner and a verifier. In each round, the learner proposes a candidate expression and the verifier checks whether the expression satisfies the specification, returning counterexamples otherwise. In this sense, we can view our algorithm also as a CEGIS algorithm, but one where the verifier is replaced by an incomplete testing-based tool. However, there are technical differences: in program synthesis, the aim is to find a formula that precisely classifies the examples, while in our setting, we are required to learn a classifier that classifies negative examples precisely but is allowed to classify positive examples negatively. Furthermore, we require that a minimal number of positive counterexamples be classified negatively; such maximality constraints are not the norm in program synthesis (indeed, some problems involving maximality have been considered recently [1]).

Decision-tree learning.
Decision-tree learning has been used in several contexts in program synthesis before: in precondition synthesis [33], in invariant synthesis [16, 17], in synthesizing piecewise linear functions [27], etc. Many of these algorithms have had to modify the ID3 algorithm, as in our work, so that the algorithm learns a tree consistent with the samples. The crucial differences between our framework and such previous work are that we dynamically modify the classifications of samples from positive to negative when we discover conflicting counterexamples, and that we ensure maximality of preconditions by learning across rounds using inputs from the testing-based teacher.

8 Conclusion

In this paper, we have presented a novel formalization of the problem of inferring stateful preconditions modulo a test generator. In this formalization, the quality of the precondition is based on its safety and maximality with respect to the test generator. We have further proposed a novel iterative active learning framework for synthesizing stateful preconditions, and a convergence result for finite hypothesis spaces. To assess the effectiveness of our framework, we have instantiated it for two tasks of specification inference and evaluated it on various C# classes from well-known benchmarks and open source projects. Our evaluation results demonstrate the effectiveness of the proposed framework.

Acknowledgment

This work was supported in part by the National Science Foundation under grant nos. CCF-1527395, CNS-1513939, CNS-1564274, and CCF-1816615, and by the GEM fellowship.

References

[1] Aws Albarghouthi, Isil Dillig, and Arie Gurfinkel. 2016. Maximal specification synthesis. In POPL 2016.
[2] Rajeev Alur, Rastislav Bodík, Eric Dallal, Dana Fisman, Pranav Garg, Garvit Juniwal, Hadas Kress-Gazit, P. Madhusudan, Milo M. K. Martin, Mukund Raghothaman, Shambwaditya Saha, Sanjit A. Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. 2015. Syntax-guided synthesis. In Dependable Software Systems Engineering. NATO Science for Peace and Security Series, D: Information and Communication Security, Vol. 40.
[3] Rajeev Alur, Arjun Radhakrishna, and Abhishek Udupa. 2017. Scaling enumerative program synthesis via divide and conquer. In TACAS 2017.
[4] Rajeev Alur, Pavol Černý, P. Madhusudan, and Wonhong Nam. 2005. Synthesis of interface specifications for Java classes. In POPL 2005.
[5] Glenn Ammons, Rastislav Bodík, and James R. Larus. 2002. Mining specifications. In POPL 2002.
[6] Angello Astorga, Siwakorn Srisakaokul, Xusheng Xiao, and Tao Xie. 2018. PreInfer: Automatic inference of preconditions via symbolic analysis. In DSN 2018.
[7] David Brumley, Hao Wang, Somesh Jha, and Dawn Xiaodong Song. 2007. Creating vulnerability signatures using weakest preconditions. In CSF 2007.
[8] Raymond P. L. Buse and Westley Weimer. 2008. Automatic documentation inference for exceptions. In ISSTA 2008.
[9] Satish Chandra, Stephen J. Fink, and Manu Sridharan. 2009. Snugglebug: A powerful approach to weakest preconditions. In PLDI 2009.
[10] Manuel Costa, Miguel Castro, Lidong Zhou, Lintao Zhang, and Marcus Peinado. 2007. Bouncer: Securing software by blocking bad input. In SOSP 2007.
[11] Manuel Costa, Jon Crowcroft, Miguel Castro, Antony Rowstron, Lidong Zhou, Lintao Zhang, and Paul Barham. 2005. Vigilante: End-to-end containment of Internet worms. In SOSP 2005.
[12] Patrick Cousot, Radhia Cousot, Manuel Fähndrich, and Francesco Logozzo. 2013. Automatic inference of necessary preconditions. In VMCAI 2013.
[13] Christoph Csallner, Nikolai Tillmann, and Yannis Smaragdakis. 2008. DySy: Dynamic symbolic execution for invariant inference. In ICSE 2008.
[14] Isil Dillig, Thomas Dillig, Boyang Li, and Ken McMillan. 2013. Inductive invariant generation via abductive inference. In OOPSLA 2013.
[15] Michael D. Ernst, Jake Cockrell, William G. Griswold, and David Notkin. 1999. Dynamically discovering likely program invariants to support program evolution. In ICSE 1999.
[16] P. Ezudheen, Daniel Neider, Deepak D'Souza, Pranav Garg, and P. Madhusudan. 2018. Horn-ICE learning for synthesizing invariants and contracts. In OOPSLA 2018.
[17] Pranav Garg, P. Madhusudan, Daniel Neider, and Dan Roth. 2016. Learning invariants using decision trees and implication counterexamples. In POPL 2016.
[18] Timon Gehr, Dimitar Dimitrov, and Martin T. Vechev. 2015. Learning commutativity specifications. In CAV 2015.
[19] Patrice Godefroid, Nils Klarlund, and Koushik Sen. 2005. DART: Directed automated random testing. In PLDI 2005.
[20] Maurice Herlihy and Eric Koskinen. 2008. Transactional boosting: A methodology for highly-concurrent transactional objects. In PPoPP 2008.
[21] Milind Kulkarni, Donald Nguyen, Dimitrios Prountzos, Xin Sui, and Keshav Pingali. 2011. Exploiting the commutativity lattice. In PLDI 2011.
[22] Milind Kulkarni, Keshav Pingali, Bruce Walter, Ganesh Ramanarayanan, Kavita Bala, and L. Paul Chew. 2007. Optimistic parallelism requires abstractions. In PLDI 2007.
[23] Fan Long, Stelios Sidiroglou-Douskos, Deokhwan Kim, and Martin Rinard. 2014. Sound input filter generation for integer overflow errors. In POPL 2014.
[24] Ravichandhran Madhavan and Raghavan Komondoor. 2011. Null dereference verification via overapproximated weakest precondition analysis. In OOPSLA 2011.
[25] Thomas M. Mitchell. 1997. Machine Learning (1st ed.).
[26] Mangala Gowri Nanda and Saurabh Sinha. 2009. Accurate interprocedural null-dereference analysis for Java. In ICSE 2009.
[27] Daniel Neider, Shambwaditya Saha, and P. Madhusudan. 2016. Synthesizing piece-wise functions by learning classifiers. In TACAS 2016.
[28] Carlos Pacheco and Michael D. Ernst. 2007. Randoop: Feedback-directed random testing for Java. In OOPSLA 2007.
[29] Saswat Padhi, Rahul Sharma, and Todd Millstein. 2016. Data-driven precondition inference with learned features. In PLDI 2016.
[30] Nadia Polikarpova, Carlo A. Furia, Yu Pei, Yi Wei, and Bertrand Meyer. 2013. What good are strong specifications? In ICSE 2013.
[31] J. R. Quinlan. 1986. Induction of decision trees. Machine Learning 1, 1 (1986).
[32] Andrew Reynolds, Viktor Kuncak, Cesare Tinelli, Clark Barrett, and Morgan Deters. 2017. Refutation-based synthesis in SMT. Formal Methods in System Design (2017).
[33] Sriram Sankaranarayanan, Swarat Chaudhuri, Franjo Ivančić, and Aarti Gupta. 2008. Dynamic inference of likely data preconditions over predicates by tree learning. In ISSTA 2008.
[34] Mohamed Nassim Seghir and Daniel Kroening. 2013. Counterexample-guided precondition inference. In ESOP 2013.
[35] Koushik Sen, Darko Marinov, and Gul Agha. 2005. CUTE: A concolic unit testing engine for C. In ESEC/FSE 2005.
[36] Saurabh Sinha, Hina Shah, Carsten Görg, Shujuan Jiang, Mijung Kim, and Mary Jean Harrold. 2009. Fault localization and repair for Java runtime exceptions. In ISSTA 2009.
[37] Calvin Smith, Gabriel Ferns, and Aws Albarghouthi. 2017. Discovering relational specifications. In ESEC/FSE 2017.
[38] Suresh Thummalapenta, Tao Xie, Nikolai Tillmann, Jonathan de Halleux, and Zhendong Su. 2011. Synthesizing method sequences for high-coverage testing. In OOPSLA 2011.
[39] Nikolai Tillmann and Jonathan de Halleux. 2008. Pex: White box test generation for .NET. In TAP 2008.
[40] Nikolai Tillmann and Wolfram Schulte. 2005. Parameterized unit tests. In ESEC/FSE 2005.
[41] W. E. Weihl. 1988. Commutativity-based concurrency control for abstract data types. IEEE Trans. Comput. 37, 12 (1988).
[42] Xusheng Xiao, Sihan Li, Tao Xie, and Nikolai Tillmann. 2013. Characteristic studies of loop problems for structural test generation via symbolic execution. In ASE 2013.

3.4 Invited Talks

During 2018–2019, the meeting organizers gave more than 10 talks related to software automation at conferences and forums in China and abroad, as follows:

(a) On the topic of Data-Driven Software Automation, Tao Xie was invited to give a Technical Briefing at the International Conference on Software Engineering (ICSE 2020). He also gave keynote talks at the International Conference on Engineering of Complex Computer Systems (ICECCS 2019) and at the Apsara Conference. The talk is summarized as follows:

Software automation typically refers to the process of generating software automatically based on formal or informal specifications. In the research community, software automation has been a decades-long dream, in which software developers are freed from tedious programming tasks for constructing the initial version of software, and from expensive software maintenance tasks for evolving the software to future versions in order to catch up with changes in requirements or execution environments. Example software automation technologies include program synthesis, code completion, program transformation, code recommendation, program repair, and software self-evolution. In the past decade, software development, maintenance, and deployment have produced a huge volume of software engineering data, such as source code, version histories, feature specifications, bug reports, test cases, execution traces/logs, and real-world user feedback. These data supply a valuable source of inputs for software automation technologies and offer great potential to substantially boost the technologies' effectiveness and efficiency toward realizing software automation not only in academic settings but also in industrial settings. This talk discusses recent research and future directions in data-driven software automation, an important objective in the increasingly popular fields of software analytics and intelligent software engineering.

(b) Xiong Yingfei was invited to give a thematic report on program synthesis at the summer school of the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2019) and at other conferences on software testing and validation. The report is summarized as follows:

Program synthesis is the task of automatically constructing a program that meets a specified goal. It has many applications, such as end-user programming, optimization, and bug fixing. This talk introduces the basic concepts, methods, and typical applications of program synthesis, as well as pointers for further learning.
This talk covers both the classic approaches that synthesize a program to meet a formal specification and recent approaches that rely on statistical models to infer a likely program for an informally specified goal.

Organizer: The Academic Works and Publications Committee of the Academic Divisions of the Chinese Academy of Sciences
Co-organizer: The People's Government of Beijing Municipality
Executive Organizers: Peking University; Beijing Yanqi Lake International Convention & Exhibition Center
