Report from the chairs

We are happy to report the conclusion of the artifact evaluation (AE) process for papers accepted at SOSP 2019. Here, we share our experiences in organizing the first such effort at SOSP and reflect on key takeaways. Our hope is that this effort serves as a catalyst in making artifact evaluation more common at systems conferences.


Preamble: As this was the inaugural year, AE was voluntary for all accepted papers. However, we were encouraged to see that 23 of the 38 accepted papers (61%) applied to go through the evaluation. Next, to form the artifact evaluation committee (AEC), we reached out to the broader systems community via Twitter and Slack (systems-research). This helped us bring together a team of 42 early-career researchers and graduate students, who volunteered to read the papers and undertake the evaluation. Finally, in terms of artifact badges to be awarded, we pored over recommendations from the ACM and the PL community, and decided on three badges that made sense for systems research.

Evaluation: We designed the evaluation process to be single-blind (i.e., the identities of the evaluators were not revealed to the authors). Every artifact was evaluated by at least two members of the AEC. Unlike the paper review process, we advised AEC members to work with the authors to help them achieve the badges they sought. This required a significant amount of communication between evaluators and authors, as well as accepting several revisions of artifacts and instructions. Due to the single-blind nature of the process, all communication took place via HotCRP comments. The average length of communication, including reviews and comments, was 3456 words per paper. The whole process, starting from authors registering their artifacts, to evaluators familiarizing themselves with the underlying research papers, to verifying artifact functionality, to reproducing the results, to writing the reviews, and to awarding the badges, was completed in the span of 28 days.

Results of AE: Here we highlight the key outcomes from AE. For a more detailed view, check here.

Key Takeaways

1. AE made the papers better. This manifested at two levels. First, the requirements of the badges meant that the artifacts were not only publicly available but also verified to be functional and usable. So, for the first time, a number of SOSP papers will release artifacts that have been externally validated. The second aspect is the accuracy of results. For instance, when AEC members identified a performance mismatch, the authors worked with them to root-cause it to the use of an older version of an external dependency (TensorFlow). In response, the authors revised the numbers in their camera-ready paper.

2. Specialized hardware is not a hindrance to AE. Our effort dispels the conventional wisdom that projects involving custom hardware or large clusters cannot be evaluated. We observed that in such cases, the authors allowed AEC members to access their resources (via ssh) to perform evaluations. This was the case for 6 of the 23 evaluations.

3. Interest in AE is not limited to academic projects. Nearly 40% of the submitted artifacts either originated from industry or had industrial collaborators. Half of the industry papers received all three badges. Even when business concerns did not allow open-sourcing the artifacts, the authors were happy to let AEC members access them privately.

4. AE should be allotted ample time and resources. This SOSP AE was conceived after the paper submission deadline had passed. As a result, we had to make it voluntary and limit its applicability to accepted papers only. We anticipate that integrating the call for AE with the call for papers would allow much broader participation. Also, allocating dedicated hardware resources (or cloud credits) would free AEC members from the burden of scouting for resources on their own.