Posted on May 5, 2017 by Gabe Parmer
Recently, two of my students (a PhD student and an undergraduate) stated as fact something that I find deeply incorrect and unsettling. This has been a bit of a wake-up call that I’m dropping the ball on some fundamental mentoring. This post will correct the (understandably) mistaken idea that we do experiments in our papers to show how well our system does. It will also address some follow-up questions about which comparison cases we should choose for our publications, and how to have higher confidence in your evaluation.
When you’ve just spent a year or two of your time doing a ton of system implementation, creating a theoretical framework around that implementation, and are ready to publish the work, it is understandable that your first inclination is to think “how can we show how awesome our work is!?” Alternatively, when reading others’ papers, it is easy to become disillusioned with the experimental results sections, because one interpretation of them is that the authors are, of course, going to show how awesome their contributions are. A natural conclusion I’ve seen a few people draw is that everyone is just showing how good their system is, so why should we do anything other than skim these sections?
I will make three arguments for why you should fight against the urge to view experimental results sections as opportunities for self-aggrandizement.
Reviewer Interpretation. Each paper that we submit to a conference typically has between four and eight reviewers, chosen from among the research community of peers, whose job is to vet the submission. These are not the type of people you want to try to sell snake oil to. They are looking at what your contributions are, whether they are backed up by evidence, and whether those contributions are significant enough for inclusion in the proceedings. In short, they care whether your work is valuable and scientifically sound. Work that doesn’t show all the trade-offs involved in your system, and evaluations that show only the positive aspects of the design, will never meet the bar of “scientifically sound”. Reviewers do make mistakes, but you should never submit a paper that relies on the mistakes of reviewers for acceptance.
Reader Perception. When someone takes the time to sit down and read your paper, they are looking to trade their own (valuable) time for knowledge. When they read an experimental section that paints a uniformly rosy picture of the system being introduced, they can and should be skeptical. Systems exist in complex environments that trade off different resources and guarantees. Rarely, if ever, do we implement something that is uniformly an improvement over existing systems.
An astute reader approaches a paper with a honed sense of skeptical analysis. If they’re presented with only a glowingly positive perspective on the contributions of a paper, they will likely feel that they did not attain a complete understanding of the system. They might feel that the time they spent with the system didn’t yield the knowledge they hoped for, as significant questions linger about the work. When I hear industry members talking about academic research, they often complain about the “unrealistic” experimental evaluations. We are not in the business of publishing only to publish. We want to convince people that our techniques are valuable in a well-defined domain.
Scientific progress. Somewhat counter-intuitively, perhaps, just showing why your system is good does not add much to the scientific progress of a community. This is complicated by the fact that papers are generally not accepted when their contributions do not show benefit. “Negative results” have a very poor history of being published in systems, even when they do create knowledge (i.e. when one would expect a positive outcome, but does not find it).
Learning that some technique has a number of positive contributions is important, but you have to ask what knowledge the reader walks away with. The progress of our ideas relies on scientific progress, not salesmanship.
A set of perspectives:
Scientific progress. The goal of the scientific evaluation of our systems is not to show how awesome they are. The goal is to understand how they behave, and to validate the contributions in spite of any down-sides to the system. We particularly care about how they behave in situations that the world cares about (for example, with relevant applications) as this goes toward the value of the system. A subset of the evaluation will certainly show the situations where the research is superior, but that doesn’t mean you should focus your evaluation only on those situations.
Holistic evaluation. When designing and implementing a system, we have hypotheses about what the contributions of that system will be. That is, how will it add to human knowledge and capability? The evaluation is there to either validate those contributions, or force you to reevaluate your hypotheses. This means that the fundamental contributions of the paper must contain the same level of nuance as the experiments. They are there just as much to implicitly state where your system is not making a contribution as they are to state where it provides value. The contributions are generally stated in the positive (what value you add, not where you don’t), and the evaluations validate these claims. However, the evaluations also make the implicit, negative aspects (i.e. trade-offs) of your contributions more explicit.
Motivations matter. Only papers that provide value are accepted into proceedings. Regardless of the motivation for the experimental evaluation section, if we eventually just end up showing how great our system is, why does it matter what the goal is? To be part of the scientific community, the path you take in doing your own research matters immensely. Why?
Motivations matter. The integrity of our own work and how it is conducted is as important as the research itself.
An aspect of how we show the trade-offs of our research is which applications we choose to use as microscopes for our systems, and which systems we compare against. Instead of going into too many details and special cases here, some rough guidelines:
It is often not clear to fledgling researchers why we require both microbenchmarks and complete applications, and what value both types of evaluation provide.
Microbenchmarks. Microbenchmarks provide small insights into the overheads of very specific parts of the system (often the system’s atoms). If a number of operations are provided by the system, microbenchmarks investigate their costs along the relevant dimensions (e.g. cycles/op). As argued in the post on system atoms, they provide an upper bound on what is possible with the system, so this bound must be thoroughly evaluated. Because of this, microbenchmarks must be complete in the sense that they detail the entire space of possibility for the system (i.e. what is the maximum achievable performance). If microbenchmarks include comparisons to comparable systems, they must be sound. However, by definition, microbenchmarks are not relevant on their own. They are analogous to understanding the strength of concrete (the microbenchmark); yet that strength alone doesn’t tell you the level of structural integrity of a bridge (the application).
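To make this concrete, here is a minimal sketch of what a cycles-per-operation microbenchmark might look like on x86 with gcc or clang. The operation_under_test function and the iteration counts are hypothetical placeholders, not part of any real system; in practice you would also pin the process to a core, control frequency scaling, and report distributions (or at least worst cases) rather than only an average.

```c
/*
 * Minimal sketch of a cycles/op microbenchmark (x86, gcc/clang).
 * "operation_under_test" is a hypothetical stand-in for the primitive
 * being measured; replace it with the real operation.
 */
#include <stdio.h>
#include <x86intrin.h>          /* __rdtsc() */

#define WARMUP 1000
#define ITERS  1000000

static volatile unsigned long sink; /* keep the compiler from eliding the work */

static void operation_under_test(void) { sink++; } /* placeholder */

int main(void)
{
	/* Warm caches and branch predictors so we measure steady state. */
	for (int i = 0; i < WARMUP; i++) operation_under_test();

	unsigned long long start = __rdtsc();
	for (int i = 0; i < ITERS; i++) operation_under_test();
	unsigned long long end = __rdtsc();

	printf("avg cycles/op: %llu\n", (end - start) / ITERS);

	return 0;
}
```

Even a sketch this small makes the "upper bound" role of microbenchmarks visible: whatever an application later achieves, it cannot do better than the per-operation cost measured here.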
Applications. Applications, as you may guess, provide the relevance in the evaluation. They demonstrate that the system can be used to provide value within the context of an application that, presumably, the world cares about in some way. These evaluations must also be sound. However, completeness is somewhat nuanced. When evaluating \(N\) applications, you are explicitly looking at only the subset of system behaviors that are exercised during those applications’ execution. In this way, completeness is often not entirely satisfied. One often sees multiple applications and competing systems evaluated within the same paper as an attempt to bolster the completeness of the evaluation. Within the confines of the page limit, this should be a goal. However, there should be an acknowledgment that completeness is not achieved, and a qualitative argument about the bounds of the conclusions that can be drawn from the applications is necessary.
It is somewhat natural to treat the evaluation that we perform in our papers as the most boring part of the whole research process. We started the research in the first place because we hypothesized that our design had specific benefits, and this evaluation is to validate our implementation of that design. In reality, most of the design and implementation we do in systems is driven by an artistic mix of creativity and past experience, and the evaluation is where we must follow a scientific process1.
It is also important to understand, when doing an evaluation, that we (as humans) are prone to confirmation bias. When we run an experiment and get a good result, we will tend to see it as bolstering the positive arguments for our system. However, experiments are not that simple; often a result that looks good hides nuance that requires deeper investigation. So what can we do to prevent results that look good from blinding us to the need for a deeper investigation?
Proper evaluation of our research requires the application of the scientific method.
As computer scientists (especially in systems), we create the world we study, which makes the nature of our research quite different from traditional scientific domains that aim to study physical phenomena. This means that when building our systems, we aren’t necessarily applying the scientific method pervasively. When we move on to evaluation, it is important that we switch from “engineering mode” into “scientific mode”. The subset of the scientific method I want to focus on is the feedback loop between designing experiments, forming hypotheses, conducting the evaluation, and interpreting the results (which starts the whole loop again).
Design the experiment. What aspect of the system are you evaluating, and what is the best test to perform that evaluation? Often you’re attempting to understand the behavior of the system while varying some set of variables. What are those variables? Given an x-axis, a y-axis, and a number of lines, each graph can investigate up to three of those variables. Often there are more variables than that, so how will you break the evaluation into separate graphs? One of the graphs should attempt to summarize the overall contributions, while the rest should study the trade-offs of the system. For each of the graphs, we continue on to the next phase.
Form a hypothesis. This sounds simple, but it is the most important step, and the easiest to overlook. You must ask yourself what you believe the results of the experiment will be, and why. This forces you to deeply contemplate the relevant system effects, and to create a mental model of how they will mutually interact. If your graph is a typical x/y/number-of-lines plot, what should the shape of the lines be? Where are the inflection points? How will the lines react, relative to each other, as the variables change?
Run the experiments. You must be very careful at this stage to ensure that you’re executing the system in a way that is consistent with the intentions of the design. You must modify only the variables that the experiment’s design says should vary; a minimal harness sketch that does this follows these steps.
Interpret the results. Do the results match your hypothesis? If they do not match (in any way), then you must determine why. There are two broad reasons: 1. your hypothesis does not reflect the actual effects and mechanisms of the system, or 2. your experiment exposes a bug somewhere (in the design, in the implementation, or in the measurements). For the former, you need to understand why your mental model of the system is wrong, and go back to square one: redesign the experiment given your new understanding, reform your hypotheses, and iterate. For the latter, it is debugging time.
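As an illustration of this loop, here is a sketch of a simple experiment harness. Everything in it is hypothetical: run_trial stands in for whatever actually exercises your system, and throughput versus message size is an arbitrary example. The point is the structure: exactly one variable is swept, everything else is held fixed, each point is repeated so that variance can be reported, and the output is one row per x-axis point, ready to plot against the curve your hypothesis predicted.

```c
/*
 * Sketch of a simple experiment harness. run_trial() is a hypothetical
 * placeholder for the real workload; only one variable (message size) is
 * swept, the thread count is held fixed, and each point is repeated.
 */
#include <stdio.h>

#define TRIALS         10
#define FIXED_NTHREADS 4    /* held constant for this experiment */

/* Hypothetical measurement; replace with the real system under test. */
static double run_trial(int nthreads, int msg_size)
{
	return 1e6 / (double)(msg_size * nthreads); /* placeholder model */
}

int main(void)
{
	int msg_sizes[] = { 64, 256, 1024, 4096, 16384 };
	int nsizes      = sizeof(msg_sizes) / sizeof(msg_sizes[0]);

	printf("msg_size, mean_throughput, min, max\n");
	for (int i = 0; i < nsizes; i++) {
		double sum = 0.0, min = 0.0, max = 0.0;

		for (int t = 0; t < TRIALS; t++) {
			double v = run_trial(FIXED_NTHREADS, msg_sizes[i]);

			sum += v;
			if (t == 0 || v < min) min = v;
			if (t == 0 || v > max) max = v;
		}
		/* One line per x-axis point, ready to compare against the hypothesis. */
		printf("%d, %f, %f, %f\n", msg_sizes[i], sum / TRIALS, min, max);
	}

	return 0;
}
```

Comparing each row of that output against the shape you predicted before running anything is exactly the "interpret the results" step: a mismatch means either your mental model is wrong or something in the experiment is broken.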
This process takes a lot of time, especially if your hypotheses are off and you don’t understand the system as well as you need to. Each iteration takes quite a bit of effort and time, but is required to ensure that your results are measuring and depicting what they should. Because of this, it is important to realize that:
The enemy of proper evaluation is a lack of time.
When I was a PhD student, I aimed to get all of my evaluation done at least two weeks in advance of a deadline, and that tended to leave enough time to iterate if my hypotheses were off, or if there were bugs to fix. However, this buffer should be larger if you have less confidence in the implementation, or if you’re dealing with a system you don’t know well.
Evaluation of our systems deserves care and attention. Doing it poorly is the fastest way to significantly tarnish your reputation, and to have very little impact on the world. As researchers in the global scientific community, we must take our own and our community’s scientific integrity seriously. As a graduate student, it is important to be systematic in your evaluation, to follow the scientific method, and to leave yourself enough time that both of these are possible.