Research

I believe that the real challenge facing today’s scientists isn’t grid computing or programming CPUs: it’s the fact that most scientists don’t know how to develop software efficiently, have no idea how reliable their programs are, and can’t reproduce their results. This isn’t surprising: after a generic first-year course in programming, they are expected to pick up everything else on their own, which is about as fair as showing someone how to differentiate polynomials and then asking them to reinvent tensor calculus.

I also believe that many professional software developers aren’t particularly good at their jobs either, because good working practices are not taught in school. Undergraduate CS programs teach people the syntax of programming languages, and what computers can be used to do; few if any teach the mechanics of software development. Some students pick this up on their own, but most do not.

My research interests therefore grow out of two questions:

  1. Which of the tools and practices used by the best software developers in open source and industry should be taught to computer science undergraduates on the one hand, and scientific researchers on the other?
  2. How can we make them compelling so that CS undergrads, and grad students in science and engineering, will actually adopt them? It isn’t enough to find things that ought to work, or that work for their inventors, or that people will use when their grades depend on them; an innovation can only be counted as a success if people will adopt it voluntarily, and if it makes some measurable difference to their working lives.

I’ve been interested in these problems since the mid-1990s. The sections below describe how I’m trying to find answers.

University of Toronto

I am working with three graduate students at the University of Toronto:

  • Samira Abdi Ashtiani is using information retrieval (IR) techniques to cluster the events in a software project’s history (check-ins, ticket updates, mail messages, etc.) to create coarser-grained “chunks” in order to help developers understand what has happened, and why.
  • Jeremy Handcock is building visualization tools to help developers navigate project histories, and to improve their awareness of what other contributors are doing.
  • Carolyn MacLeod is looking for patterns in the mistakes newcomers make when using modeling tools like SPIN. If such patterns exist, then it may be possible to change training materials or the tools themselves to help people climb these tools’ steep learning curves.

I have some other research projects on the go as well:

  • Thanks to a grant from The MathWorks, the NRC’s Janice Singer and I will be surveying scientists (not just self-identified computational scientists) to find out how, and how well, they actually use computers in their research. I am particularly interested in knowing (a) how many hours per year scientists spend coding, (b) how many CPU hours a year they use running software (either their own or someone else’s), and (c) what tools and working practices they use, and how they learned them.
  • Jordi Cabot and I plan to examine the requirements that the creators of software development portals such as Trac, Rally, Scrumworks, and SourceForge set out to satisfy, and the feature sets they provide, in order to see how many of their differences come from trying to meet different needs, and how many are accidents of implementation.
  • I would like to finish building a reverse test oracle and see what short-term and long-term impact it has on students’ working practices. A reverse test oracle is a system that allows students to submit tests that are run against an instructor’s sample solution to an exercise, without allowing the students to see the sample solution itself. If, for example, a student wants to know whether Part 2 of Question 3 is supposed to be case-sensitive or case-insensitive, she would translate her question into a unit test, submit it to the system, and then check whether it passed or failed. I believe such a tool would encourage test-driven development, but only a real field trial would tell.
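As a rough illustration of the reverse test oracle idea described above, here is a minimal Python sketch. It is not the system under development: the function names, the sample exercise (normalizing a string), and the convention that each student test is a function taking the solution as its argument are all assumptions made for this example. The point it shows is that the student learns only whether each test passed or failed against the hidden solution.

```python
def _reference_normalize(text):
    """Hypothetical instructor solution; students never see this source."""
    return " ".join(text.split()).lower()


def run_student_tests(tests, solution=_reference_normalize):
    """Run student-written tests against the hidden solution.

    Each test is a callable that takes the solution and raises
    (e.g. via assert) if its expectation is not met. Only a
    pass/fail flag per test is reported: no tracebacks and no
    return values that could leak the solution itself.
    """
    report = {}
    for test in tests:
        try:
            test(solution)
            report[test.__name__] = True
        except Exception:
            report[test.__name__] = False
    return report


# A student unsure whether the exercise is case-sensitive encodes
# both readings of the question as tests and submits them.
def check_case_insensitive(solution):
    assert solution("Hello World") == solution("HELLO WORLD")


def check_preserves_case(solution):
    assert solution("Hello") == "Hello"


if __name__ == "__main__":
    print(run_student_tests([check_case_insensitive, check_preserves_case]))
```

With this hypothetical solution, the first test passes and the second fails, which tells the student that the exercise is meant to be case-insensitive without revealing how the instructor implemented it.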

DrProject

DrProject is a software project management portal like SourceForge, but tailored for classroom use. Along with a repository browser, ticketing system, search, milestones, integrated mailing lists, and a wiki, DrProject includes a scripting interface to automate repetitive tasks (which come up frequently when 100 students are working on identical projects), integration with external authentication (so that it can rely on universities’ existing user provisioning systems), and ubiquitous tagging.

DrProject is being used by several courses and research groups at the University of Toronto and elsewhere. My goal now is to use it as a platform for studying ways of integrating modern collaboration tools into the software engineering process. Three projects are currently under way:

  1. Developers increasingly communicate via IM rather than email. However, these conversations are not tied to their projects in the way that email archives are. We are therefore experimenting with ways to integrate Internet Relay Chat (IRC) into DrProject—in particular, useful ways to present chat logs, and ways to hyperlink into and out of those logs.
  2. Students find even the simplest real ticketing system too heavyweight for their needs. At the same time, when presented with a simple to-do list, they all want to add “just one more field”. We are therefore building a new ticketing system for DrProject that will allow users to add fields on the fly, so that each project can grow the ticketing system its members want. While this system is interesting in its own right, our real aim is to reverse engineer actual project workflows from the way users customize different projects’ ticketing systems.
  3. Systems like SourceForge provide chart displays of project statistics, such as the number of tickets in different states over time. Jeremy Handcock is implementing a dashboard using the Flare library to provide a much richer set of visualizations. Again, while this will be useful in its own right, we are primarily interested in what its use will tell us about what developers are thinking.
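To make the second project above more concrete, here is a minimal Python sketch of a ticket store whose fields can be added on the fly, together with a simple tally of which fields different projects chose to add. It is not DrProject's code: the class name `TicketSystem`, its methods, and the `field_usage` helper are hypothetical, and counting added fields is only one simple stand-in for the workflow analysis described.

```python
from collections import Counter


class TicketSystem:
    """Hypothetical per-project ticket store whose schema grows at run time."""

    def __init__(self, project):
        self.project = project
        self.defaults = {"title": None}  # every project starts as a bare to-do list
        self.tickets = []

    def add_field(self, name, default=None):
        """Add a field to this project; existing tickets get the default value."""
        if name in self.defaults:
            raise ValueError(f"field {name!r} already exists")
        self.defaults[name] = default
        for ticket in self.tickets:
            ticket[name] = default

    def create(self, **values):
        """Create a ticket, filling any unspecified field with its default."""
        unknown = set(values) - set(self.defaults)
        if unknown:
            raise KeyError(f"unknown fields: {sorted(unknown)}")
        ticket = {**self.defaults, **values}
        self.tickets.append(ticket)
        return ticket


def field_usage(systems):
    """Count how many projects added each custom field."""
    counts = Counter()
    for system in systems:
        counts.update(name for name in system.defaults if name != "title")
    return counts


if __name__ == "__main__":
    a = TicketSystem("a")
    a.add_field("owner")
    a.add_field("due")
    b = TicketSystem("b")
    b.add_field("owner")
    print(field_usage([a, b]))
```

In this sketch, a field such as `owner` that many projects add independently would stand out in the tally, hinting at a step that those projects' workflows have in common.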

DrProject is an open source project. If you’d like to help out, please get in touch.

Software Carpentry

Outside of tool building, my major interest is Software Carpentry, an open source course on basic software development skills aimed at scientists and engineers. Originally developed with Brent Gorda for Los Alamos National Laboratory, the material was substantially revised in 2005-06 with support from a Python Software Foundation grant; since going live in August 2006, the site has attracted over 120,000 distinct visitors from 70 countries, and the material has been used in Canada, the US, the EU, and elsewhere.

I’ve taught the course at the University of Toronto to a mix of students from computer science, physical and life sciences, engineering, and industry. It will serve as my reality check on the other problems I’m studying: I will only believe I’ve “solved” them when students in a course like this one will actually adopt the tools and techniques I present. We are presently converting the course site to a wiki to make it easier for people to contribute fixes and new content, and plan to provide the examples in MATLAB as well as Python. All of the material is freely available under a Creative Commons license, and help is always welcome.

DemoCamp

DemoCamp is open mike night for tech types, designers, and entrepreneurs. Participants can demo software they have been working on, or give a lightning presentation on any topic likely to inspire, inform, or amuse the audience. It’s held every month or two, with between two and three hundred people taking part each time. You’d be welcome to join us…