March 31, 2021

AACR Journals Webinar: Why & How to Deliver Reproducible Analysis

We live in a world of ever-increasing amounts of data, ever more complex research, and novel analysis tools to turn all that data into information. Data analysis that could reasonably be done with spreadsheets and basic statistical tools some years ago now requires far more advanced analysis pipelines, coding and visualization tools, and access to compute infrastructure.
In the context of publishing scientific research, this need for sophisticated – and more difficult to use – tools creates challenges that research teams have to address before they publish their results.

Big Data and Complex Analysis Pipelines Require Packaging and Sharing of Data, Code, Environment and Results

Given the highly complex nature of data analysis, simply describing the method used to analyze the data presented in the paper is no longer enough. Even sharing the code underlying the analysis isn’t sufficient to ensure the level of transparency needed to allow the scientific community to scrutinize the work. Full transparency requires sharing data, code, computational environment and results so others can independently evaluate the study. Sharing the computing environment is a sometimes overlooked but critical aspect: the software versions used to analyze the data may well influence the results.
However, many computational researchers and scientists lack the software development experience to expertly install and use these new analysis tools, and academic research teams often don’t have computational researchers on staff to support or run the analysis for them.
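As a minimal illustration of why software versions matter, the sketch below records the interpreter and package versions that an analysis ran with, so the record can be shared alongside code and data. It is a generic Python example, not part of the Code Ocean platform; the function name `snapshot_environment` and the package list are assumptions made for illustration.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment(packages):
    """Describe the interpreter and the installed versions of `packages`."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Hypothetical package list; replace with what the analysis imports.
    snapshot = snapshot_environment(["numpy", "pandas", "scipy"])
    json.dump(snapshot, sys.stdout, indent=2, sort_keys=True)
```

A record like this only documents the environment. Recreating it on another machine still requires installing exactly those versions, which is the step that packaging the environment itself is meant to remove.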

Making data, code, computing environment and results available is important at several steps during the process of preparing a paper for publication:

Pre-submission review – tasking a lab member with a thorough pre-submission review is asking them to take on a serious amount of extra work, especially when “review” entails everything from critiquing the experimental set-up to setting up a whole analysis pipeline – correct software versions and all – to rerun the analysis. Rerunning the analysis on the raw or low-level processed data to independently verify the results can be especially difficult and time-consuming, but it is also very important: detecting issues at this stage prevents researchers from submitting irreproducible results.
Submission for peer-review – the problem of recreating complex analysis pipelines is repeated during peer-review; reviewers tend to lack the advanced computing skills and/or time required for setting up computing environments that exactly match those of the initial analysis and for crunching the numbers independently to verify the results. However, enabling that level of scrutiny and interaction with the data not only allows a reviewer to gain confidence in the results, but also signals a high commitment to transparency and can increase the impact of the study by increasing its reusability.
Post-publication – even after publication, giving others the opportunity to rerun the analysis can be highly valuable. Unexpected results, for example, can be controversial, and making it possible for skeptical members of the scientific community not just to review the data but to rerun the analysis can stop criticism in its tracks.

Source: AACR Journals Webinar – Code Ocean: A Tool to Support the Delivery of Reproducible Analysis With Dr. Benjamin Haibe-Kains, https://vimeo.com/485555234, based on Roger D. Peng, Stephanie C. Hicks, Reproducible Research: A Retrospective, submitted 23 Jul 2020, https://arxiv.org/abs/2007.12210

We made the argument that sharing is good and more sharing is better. But how can researchers without coding experience easily share all that information with other researchers without coding experience?

Compute Capsules – Share Data, Code, Computing Environment and Results

That’s the problem we solved for you by developing the Code Ocean Compute Capsules™.
Compute Capsules are in essence self-sufficient, secure and executable computing packages that contain code, data, environment, and the associated results. Each Capsule is versioned, open format, exportable, reproducible and interoperable.
A particularly important benefit of Compute Capsules is the ease with which they can be shared and used by others. Since the entire environment is set up once for everybody inside the Capsule, even users with little (or no) coding experience can rerun the analysis within that environment. Adding visualization tools, e.g. Jupyter extensions, makes it easy to visualize data for reviewers and allows them to explore the data interactively.
A researcher preparing their work for publication can put all the data in a Compute Capsule and perform all analysis in that environment. This guarantees that the researcher uses the same software and same versions throughout the entire data analysis process. It also allows them to share the whole package with team members for pre-submission review, with reviewers during the peer review process, and with the scientific community after publication. Capsules provide an easy way of updating results when new data become available, better methods are developed, or bugs are detected and need to be fixed.
Benefits of using Compute Capsules include:

painless installation
well-controlled, transparent software environments – everybody knows exactly what they are using
hassle-free collaboration – Capsules can be shared seamlessly, allowing even those new to the project to get started with data analysis immediately in exactly the same environment, without a lengthy and complicated set-up
guaranteed reproducibility even when many people touch the data and/or code
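
Whatever tool is used, a reviewer or collaborator can check whether a rerun reproduced the shared results. A minimal sketch, assuming the original and rerun outputs sit in two directories, is to compare SHA-256 checksums file by file. This is a generic Python illustration, not a description of how Compute Capsules verify results; the function names and report labels are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _index(root):
    """Map each file's path relative to `root` onto its full path."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): p
        for p in root.rglob("*")
        if p.is_file()
    }


def compare_results(published_dir, rerun_dir):
    """Label each file 'match', 'differs', 'missing', or 'extra'."""
    published = _index(published_dir)
    rerun = _index(rerun_dir)
    report = {}
    for name in sorted(published.keys() | rerun.keys()):
        if name not in rerun:
            report[name] = "missing"  # in the original results only
        elif name not in published:
            report[name] = "extra"  # produced by the rerun only
        elif file_digest(published[name]) == file_digest(rerun[name]):
            report[name] = "match"
        else:
            report[name] = "differs"
    return report
```

A byte-for-byte comparison is deliberately strict: outputs that embed timestamps or depend on floating-point rounding can differ even when the analysis is scientifically equivalent, so numerical results may need a tolerance-based comparison instead.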

Complex analysis requires experience with setting up pipelines as well as a working knowledge of DevOps tasks such as software version control and setting up secure access.
With the Code Ocean platform, these tasks can be taken care of once at the Capsule level rather than forcing everybody who needs to interact with the data to repeat all these steps on their own machines. Data, results, code and the exact environment can easily be made available to all researchers.

Using Compute Capsules therefore makes research more reproducible and transparent while making life easier for scientists who would rather generate and analyze data than labor over software versioning issues.

For more about how Code Ocean Compute Capsules support reproducible data analysis, please click here to listen to the webinar “AACR Journals Webinar – Code Ocean: A Tool to Support the Delivery of Reproducible Analysis with Dr. Benjamin Haibe-Kains” or contact us at info@codeocean.com.