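Software Engineering Expert Collin McMillan and a Reading of His Top-Conference Paper
Preface
This post introduces the software engineering researcher Collin McMillan and his 2018 top-conference paper "Detecting Speech Act Types in Developer Question/Answer Conversations During Bug Repair".
1. Paper and Author Information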
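Paper title: Detecting Speech Act Types in Developer Question/Answer Conversations During Bug Repair
Authors: Andrew Wood (first author), P. Rodeghero, A. Armaly, Collin McMillan (advisor)
Affiliation: University of Notre Dame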
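The University of Notre Dame, founded in the mid-19th century, is a private Catholic research university located in South Bend, Indiana, USA. It is well regarded across the United States, and its undergraduate program consistently ranks among the top 20 in the country. In the 2018 U.S. News ranking of American universities, Notre Dame placed 18th. [1]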
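About Collin McMillan:
His homepage is at https://www3.nd.edu/~cmc/ . He is genuinely impressive: he is now an associate professor at the University of Notre Dame, and he looks very young.
He currently advises three Ph.D. students, all of whom started in 2017. He has also supervised three other Ph.D. students and one M.S. (Master of Science) student.
He publishes roughly two or more ICSME papers every year, plus at least one paper at a top conference (ICSE, FSE, ASE, and so on).
This year's papers: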
- Wood, A., Rodeghero, P., Armaly, A., McMillan, C., “Detecting Speech Act Types in Developer Question/Answer Conversations During Bug Repair”, in Proc. of the 26th ACM Symposium on the Foundations of Software Engineering (ESEC/FSE’18), Lake Buena Vista, Florida, USA, Nov. 4-9, 2018. [arXiv] [data]
- LeClair, A., Eberhart, Z., McMillan, C., “Adapting Neural Text Classification for Improved Software Categorization”, in Proc. of the 34th IEEE International Conference on Software Maintenance and Evolution (ICSME’18), Madrid, Spain, Sept. 23-29, 2018. [arXiv]
- Armaly, A., Rodeghero, P., McMillan, C., “AudioHighlight: Code Skimming for Blind Programmers”, in Proc. of the 34th IEEE International Conference on Software Maintenance and Evolution (ICSME’18), Madrid, Spain, Sept. 23-29, 2018.
- Krasniqi, R., McMillan, C., “TraceLab Components for Generating Speech Act Types in Developer Question/Answer Conversations”, in Proc. of the 34th IEEE International Conference on Software Maintenance and Evolution, Artifacts (ICSME’18 Artifacts), Madrid, Sept. 23-29, 2018. [data]
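From this list, his current focus is clearly the detection and generation of speech act types.
I have not read many papers yet, but the more I read, the more I realize how many outstanding people there are in the world.
We are but a grain of millet in the vast sea, envying the endlessness of the Yangtze (of heaven and earth). There is no need to feel too much pressure, but keep working hard.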
2. Paper Content
This paper involves bug repair, but it differs from the APR (automated program repair) work I had looked at before. The abstract also contains several technical terms that I did not understand:
1) Wizard of Oz
2) simulated virtual assistant
3) speech act types in the conversations
4) an open coding manual annotation procedure
5) Our automated detection achieved 69% precision and 50% recall.
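To make item 5 concrete, here is a minimal sketch of how precision and recall are computed from a detector's counts. The counts below are hypothetical, chosen only so that the result matches the 69% precision and 50% recall quoted in the abstract; they are not taken from the paper. The F1 value is plain arithmetic on those two numbers, not a figure reported in the source.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw detection counts.

    tp: items correctly detected, fp: items wrongly detected,
    fn: items the detector missed.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts (an assumption for illustration only):
# 69 correct detections, 31 false alarms, 69 missed speech acts.
p, r, f1 = precision_recall_f1(tp=69, fp=31, fn=69)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# prints "precision=0.69 recall=0.50 f1=0.58"
```

Read this way, 69% precision means that about two thirds of the speech acts the detector flags are correct, while 50% recall means it finds only half of the speech acts that are actually present.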
Main contribution: The key application of this work is to advance the state of the art for virtual assistants in software engineering.
Motivation: Virtual assistant technology is growing rapidly, though applications in software engineering are behind those in other areas, largely due to a lack of relevant data and experiments. This paper targets this problem in the area of developer Q/A conversations about bug repair.
The introduction explains the concepts of speech acts and virtual assistants, and gives examples of virtual assistants: Cortana, Siri, Google Now, etc.
I found it vivid and easy to follow.
Specific work:
In this paper, we conduct a Wizard of Oz experiment in the context of bug repair. We then manually annotate the data from this experiment to find the speech act types and build and evaluate a detector for these speech acts in conversations. Our target problem domain is a virtual assistant to help programmers during bug repair. We chose bug repair because it is a common software engineering task, and because, as previous studies have shown, bug repair is a situation in which programmers are likely to ask questions [43, 82]. We recruited 30 professional programmers to fix bugs for two hours each, while providing an interface to a Wizard of Oz simulated virtual assistant. The programmers interacted with the simulated virtual assistant for help on the debugging task. We then manually annotated each conversation with speech act types in an open coding procedure (see Section 5). Finally, we trained a learning algorithm to detect speech acts in the user’s side of the conversations, and evaluated its performance (Sections 7 - 9)
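The last step in the passage above, training a learner to label the user's side of a conversation, can be illustrated with a toy text classifier. The sketch below is a small bag-of-words Naive Bayes model written with the Python standard library only. It is not the authors' detector: the technique is swapped in for illustration, and the utterances and the two labels ("question", "feedback") are invented here and do not come from the paper's corpus or annotation scheme.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data, invented for illustration only.
TRAIN = [
    ("what does this method return", "question"),
    ("where is the config file loaded", "question"),
    ("how do i run the unit tests", "question"),
    ("thanks that fixed the bug", "feedback"),
    ("ok that works now", "feedback"),
    ("great thanks for the help", "feedback"),
]


def tokenize(text):
    """Lowercase the text and split on whitespace."""
    return text.lower().split()


def train(examples):
    """Collect per-label word counts, label frequencies, and the vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(text, model):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior of the label
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(predict("how do i run this", model))
print(predict("thanks that works", model))
```

A real detector would need the annotated conversations and the label set that the paper releases, along with a proper train/test split for measuring precision and recall.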
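3. Notable Features of the Paper
1) My impression is that top-conference papers today have to be backed by solid work, and they usually come with a publicly available tool.
2) Introductions in today's top-conference papers are really well written. This one shows that the field has a history and that people have been working on it for decades, yet it also points out what is still missing, which sets up the authors' own work nicely. That is really cool.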
Today, virtual assistants are possible due to major efforts in understanding human conversation, though these efforts have largely been confined to everyday tasks. While virtual assistants for software engineering have been envisioned for decades [8, 77], progress is limited, largely due to three problems that we target in this paper: 1) there are very few experiments with data released of software engineering conversations, 2) the speech act types that software engineers make are not described in the relevant literature, and 3) there are no algorithms to automatically detect speech acts.
3) For the experiment, the authors actually recruited 30 professional programmers to do the bug repair.
4) How the authors state their contributions:
By releasing this corpus, we contribute one of very few WoZ corpora, which are especially rare in the domain of Software Engineering [79]. We release all data, including conversations, annotations, and our detection algorithm source code via an online appendix (Section 11), to promote reproducibility and assist future research in software engineering virtual agents
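This is the first time I have seen contributions written this way, without directly saying "our contributions include: ...".
Writing it this way is really cool!
5) Tense usage in the following sentence:
We chose bug repair because it is a common software engineering task, and because, as previous studies have shown, bug repair is a situation in which programmers are likely to ask questions [43, 82].
The main verb "chose" is in the past tense, while the verbs inside the two "because" clauses joined by "and" are in the present tense ("is"), since they state general facts. This is worth noting.
6) Their figures look very good.
I do not yet know how to draw figures like these. MATLAB does not seem able to produce them, so I need to look into it further.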
4. Vocabulary
intersection
UK [ˌɪntəˈsekʃn] US [ˌɪntərˈsekʃn]
n. a cutting across; a crossing or meeting; a junction or line of intersection; [math] intersection (of sets)
anatomy
UK [əˈnætəmi] US [əˈnætəmi]
n. dissection; (detailed) analysis; the anatomical structure (of an organism); skeleton
wizard
(in stories) a man with magic powers
a person who is especially good at sth: a computer/financial, etc. wizard
(computing) a program that makes it easy to use another program or perform a task by giving you a series of simple choices
mimic
to copy the way sb speaks, moves, behaves, etc., especially in order to make other people laugh
[VN] to look or behave like sth else
confine
UK [kənˈfaɪn] US [kənˈfaɪn]
vt. to restrict; to limit to; to imprison; to keep under control
n. (usually plural) limits, bounds; borders
envisioned
v. imagined, foresaw (past tense and past participle of envision)
recruited
v. hired; enrolled sb as a new member; enlisted (help) (past tense and past participle of recruit)
corpus
UK [ˈkɔ:pəs] US [ˈkɔ:rpəs]
n. a complete collection of writings; the principal of a fund; [computing] a text corpus; the main body of an organ
corpora
UK [ˈkɔ:pərə] US [ˈkɔ:rpərə]
n. plural of corpus: collections of written (and sometimes spoken) material; the main bodies of things
5. Good Sentences
We release all data, including conversations, annotations, and our detection algorithm source code via an online appendix (Section 11), to promote reproducibility and assist future research in software engineering virtual agents
The phrasing here is really skillful: "release", "including", "via an online appendix", "to promote reproducibility", "assist future research in software engineering virtual agents".
Automated virtual assistants such as Siri, Cortana, and Google Now are claiming an increasing role in computing for everyday tasks.
Note "claiming" and "an increasing role".
As with most studies, this project has a few threats to validity.
This shows how to introduce threats to validity.
We thank and acknowledge the 30 professional developers who participated in this research study. This work is supported in part by the NSF CCF-1452959, CCF-1717607, and CNS-1510329 grants. Any opinions, findings, and conclusions expressed herein are the authors’ and do not necessarily reflect those of the sponsors.
This shows how to write acknowledgements.
How references are formatted:
OpenCSV. 2017. OpenCSV. http://opencsv.sourceforge.net/. (2017). Accessed: 2017-08-20.
Apache Foundation. 2018. Apache Commons IO. https://commons.apache.org/proper/commons-io/. (2018). Accessed: 2018-03-02.
Apple. 2018. Siri. https://www.apple.com/ios/siri/. (2018). Accessed: 2018-03-02
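References
[1] University of Notre Dame (Baidu Baike entry). https://baike.baidu.com/item/圣母大学/8520498?fromtitle=University%20of%20Notre%20Dame&fromid=11173091&fr=aladdin
Closing Poem
The autumn wind is clear, the autumn moon is bright,
Fallen leaves gather and scatter again; the cold crow roosts, then startles once more.
When will we who long for each other meet again? At this hour, on this night, the feeling is hard to bear!
Li Bai, "Three, Five, Seven Words"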