Experimentation in Research

Posted on May 5, 2017 by Gabe Parmer

Recently, two of my students (a PhD student and an undergraduate) stated as fact something that I find deeply incorrect and unsettling. This has been a bit of a wake-up call to me that I’m dropping the ball on some fundamental mentoring. This post will correct the (understandably) mistaken idea that we do experiments in our papers to show how well our system does. This is not correct. This post will also address some follow-up questions regarding which comparison cases we should choose for our publications, and how to have higher confidence in your evaluation.

We’re Awesome!

When you’ve just spent a year or two of your time doing a ton of system implementation, creating a theoretical framework around that implementation, and are ready to publish the work, it is understandable that the first inclination is to think “how can we show how awesome our work is!?” Alternatively, when reading others’ papers, it is easy to become disillusioned about the experimental results sections, because one interpretation of them is that the authors are, of course, going to show how awesome their contributions are. A natural conclusion I’ve seen a few people make is that everyone is just showing how good their system is, so why should we do anything other than skim these sections?

I will make three arguments for why you should fight against the urge to view experimental results sections as opportunities for self-aggrandizement.

The Goal: Less Awesome, More Knowledge Creation

A set of perspectives:

Scientific progress. The goal of the scientific evaluation of our systems is not to show how awesome they are. The goal is to understand how they behave, and to validate the contributions in spite of any downsides to the system. We particularly care about how they behave in situations that the world cares about (for example, with relevant applications), as this speaks to the value of the system. A subset of the evaluation will certainly show the situations where the research is superior, but that doesn’t mean you should focus your evaluation only on those situations.

Holistic evaluation. When designing and implementing a system, we have hypotheses about what the contributions of that system will be. That is, how will it add to human knowledge and capability? The evaluation is there to either validate those contributions, or force you to reevaluate your hypotheses. This points out that the fundamental contributions of the paper must also contain the same level of nuance as the experiments. They are there just as much to implicitly state where your system is not making a contribution as they are to state where it provides value. The contributions are generally stated in the positive (what value you add, not where you don’t), and the evaluations validate these claims. However, the evaluations also make the implicit, negative aspects (i.e., trade-offs) of your contributions more explicit.

Motivations matter. Only papers that provide value are accepted into proceedings, so regardless of the motivation for the experimental evaluation section, if we eventually just end up showing how great our system is, why does it matter what the goal is? Because to be part of the scientific community, the path you take in doing your own research matters immensely. Why?

  1. The peer-review system relies on all members evaluating contributions based on their actual value, not salesmanship. If this system falters, then the gap between academic research and corporate whitepapers shrinks, and eventually the public will lose trust in our ability to make real scientific progress, as our one-sided arguments fail to produce actual value.
  2. Reputation is important, and gaining a bad reputation by doing work that has less value than is purported in the paper will taint others’ view of your own research. This will negatively impact peer-review, grant, and industry perceptions of your work.
  3. If you choose to do research in the global community, you should buy into the values of that community. Long-term progress is only possible via thorough, repeatable, scientific evaluations of our work. If the goal is only to push your own work and to gain fame, there are many other venues for that.

Motivations matter. The integrity of our own work and how it is conducted is as important as the research itself.

Evaluation of Applications and Comparisons

An aspect of how we show the trade-offs of our research is which applications we choose to use as microscopes for our systems, and which systems we compare against. Instead of going into too many details and special cases here, some rough guidelines:

Microbenchmarks vs. Applications

It is often not clear to fledgling researchers why we require both microbenchmarks and complete applications, and what value each type of evaluation provides. Roughly: microbenchmarks isolate the cost of individual operations, which lets you explain where time goes and why, while complete applications demonstrate whether those costs actually matter for workloads the world cares about.
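To make the distinction concrete, here is a minimal sketch of what a microbenchmark often looks like: a tight loop that isolates one operation and reports its per-call cost. This is illustrative only; operation_under_test() and the iteration count are hypothetical placeholders, not part of any particular system.

```c
/* Minimal microbenchmark sketch: isolate one operation and report its
 * average per-call cost.  operation_under_test() is a hypothetical
 * stand-in for the primitive (system call, IPC path, lock, ...) whose
 * cost the research claims to improve. */
#include <stdio.h>
#include <time.h>

#define ITERS 1000000

static void operation_under_test(void) { /* the primitive being measured */ }

static double now_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void)
{
	double start = now_ns();
	for (int i = 0; i < ITERS; i++) operation_under_test();
	double end = now_ns();

	printf("avg cost: %.1f ns/op over %d iterations\n", (end - start) / ITERS, ITERS);
	return 0;
}
```

An application-level evaluation, in contrast, runs a realistic workload (a web server, a database, whatever the system is meant to support) on top of the system and measures end-to-end throughput or latency, exposing interactions (caching, scheduling, contention) that a tight loop like this cannot.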

Scientific Evaluation

It is somewhat natural to treat the evaluation that we perform in our papers as the most boring part of the whole research process. We started the research in the first place because we hypothesized that our design had specific benefits, and this evaluation is to validate our implementation of that design. In reality, most of the design and implementation we do in systems is driven by an artistic mix of creativity and past experience, and the evaluation is where we must follow a scientific process¹.

It is also important to understand when doing an evaluation that we (as humans) are somewhat flawed by confirmation bias. When we run an experiment and get a good result, we will tend to see it as bolstering our positive arguments for our system. However, experiments are not that simple; often a result that looks good can hide nuance that requires deeper investigation. So what can we do to prevent results that look good from blinding us to the need for a deeper investigation?

Proper evaluation of our research requires the application of the scientific method.

As computer scientists (especially in systems), we create the world we study, which makes the nature of our research quite different from traditional scientific domains that aim to study physical phenomena. This means that when building our systems, we aren’t necessarily applying the scientific method pervasively. When we move into evaluation, it is important to switch from “engineering mode” into “scientific mode”. The subset of the scientific method I want to focus on is the feedback loop between designing evaluations, forming hypotheses, conducting the evaluation, and interpreting the results (which starts the whole loop again).
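As one small, concrete illustration of that loop (reusing the hypothetical operation_under_test() from the earlier sketch, with made-up numbers): state what you expect a measurement to look like before you run it, then compare the full distribution, not just the average, against that expectation. A tail that diverges from the hypothesis is exactly the kind of thing a “good-looking” mean can hide, and it signals another trip around the loop.

```c
/* Sketch of checking a hypothesis against the distribution of
 * measurements rather than the mean alone (illustrative; the names
 * and the expected cost are made up). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SAMPLES       100000
#define HYPOTHESIS_NS 200.0   /* what we expect a single call to cost */

static void operation_under_test(void) { /* the primitive being measured */ }

static double now_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1e9 + ts.tv_nsec;
}

static int cmp(const void *a, const void *b)
{
	double x = *(const double *)a, y = *(const double *)b;
	return (x > y) - (x < y);
}

int main(void)
{
	static double cost[SAMPLES];

	/* Measure each call individually so the distribution is visible. */
	for (int i = 0; i < SAMPLES; i++) {
		double s = now_ns();
		operation_under_test();
		cost[i] = now_ns() - s;
	}
	qsort(cost, SAMPLES, sizeof(double), cmp);

	double median = cost[SAMPLES / 2];
	double p99    = cost[(SAMPLES * 99) / 100];
	double worst  = cost[SAMPLES - 1];

	printf("median %.0f ns, 99th %.0f ns, max %.0f ns (hypothesis: ~%.0f ns)\n",
	       median, p99, worst, HYPOTHESIS_NS);
	if (p99 > 2 * HYPOTHESIS_NS)
		printf("tail diverges from the hypothesis: investigate before claiming victory\n");
	return 0;
}
```

The specific threshold and percentiles here are arbitrary; the point is that each run either supports the stated hypothesis or sends you back around the loop to refine it.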

This process takes a lot of time, especially if your hypotheses are off and you don’t understand the system as well as you need to. Each iteration takes quite a bit of effort and time, but is required to ensure that your results are measuring and depicting what they should. Because of this, it is important to realize that:

The enemy of proper evaluation is a lack of time.

When I was a PhD student, I aimed to get all of my evaluation done at least two weeks in advance of a deadline, and that tended to leave enough time to iterate if my hypotheses were off, or if there were bugs to fix. However, this buffer should be larger if you have less confidence in the implementation, or if you’re dealing with a system you don’t know well.

TL;DR²

Evaluation of our systems deserves care and attention. Doing it poorly is the fastest way to significantly tarnish your reputation and to have very little impact on the world. It is necessary, as researchers in the global scientific community, to take our own and our community’s scientific integrity seriously. As a graduate student, it is important to be systematic in your evaluation, follow the scientific method, and leave yourself the time so that both of these are possible.


  1. I’ll argue in a later post that we should adopt the scientific method in our investigation of code-bases, and in our debugging. Regardless, the point here is that the scientific method is required in our evaluation of our systems.

  2. I like the irony of having a TL;DR at the end. Deal with it.