Explore Platform Cloud
This demo tutorial provides an introduction to Seqera Platform, including instructions to:
- Launch, monitor, and optimize the nf-core/rnaseq pipeline
- Select pipeline input data with Data Explorer and Platform datasets
- Perform tertiary analysis of pipeline results with Data Studios
The Platform Community Showcase is a Seqera-managed demonstration workspace with all the resources needed to follow along with this tutorial. All Seqera Cloud users have access to this example workspace by default.
The Launchpad in every Platform workspace allows users to easily create and share Nextflow pipelines that can be executed on any supported infrastructure, including all public clouds and most HPC schedulers. A Launchpad pipeline consists of a pre-configured workflow repository, compute environment, and launch parameters.
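Launchpad pipelines can also be launched programmatically; for example, a minimal sketch with the Seqera Platform CLI (`tw`), assuming an access token is configured:

```bash
# Sketch: launch a pre-configured Launchpad pipeline by name.
# Assumes TOWER_ACCESS_TOKEN is set in the environment.
tw launch nf-core-rnaseq --workspace community/showcase
```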
The Community Showcase contains 15 preconfigured pipelines, including nf-core/rnaseq, a bioinformatics pipeline used to analyze RNA sequencing data.
The workspace also includes three preconfigured AWS Batch compute environments to run Showcase pipelines, and various Platform datasets and public data sources (accessed via Data Explorer) to use as pipeline input.
To skip this Community Showcase demo and start running pipelines on your own infrastructure:
- Set up an organization workspace.
- Create a workspace compute environment for your cloud or HPC compute infrastructure.
- Add pipelines to your workspace.
Launch the nf-core/rnaseq pipeline
This guide is based on version 3.14.0 of the nf-core/rnaseq pipeline. Launch form parameters may differ in other versions.
Navigate to the Launchpad in the `community/showcase` workspace and select Launch next to the `nf-core-rnaseq` pipeline to open the launch form.
The launch form is built from the pipeline's Nextflow parameter schema (`nextflow_schema.json`), which defines the parameters the pipeline accepts.
Parameter selection
Adjust the following Platform-specific options:
- Workflow run name: A unique identifier for the run, pre-filled with a random name. This can be customized.
- Labels: Assign new or existing labels to the run. For example, a project ID or genome version.
nf-core/rnaseq requires a set of parameters to run:
input
Most nf-core pipelines use the `input` parameter in a standardized way to specify an input samplesheet that contains paths to input files (such as FASTQ files) and any additional metadata needed to run the pipeline. Use Browse to select either a file path in cloud storage via Data Explorer, or a pre-loaded Dataset.
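For reference, the nf-core/rnaseq samplesheet is a CSV of the following shape (bucket paths and sample names are illustrative):

```csv
sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,s3://my-bucket/fastq/ctrl1_R1.fastq.gz,s3://my-bucket/fastq/ctrl1_R2.fastq.gz,auto
TREATED_REP1,s3://my-bucket/fastq/treat1_R1.fastq.gz,s3://my-bucket/fastq/treat1_R2.fastq.gz,auto
```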
- See Add data to learn how to add datasets and Data Explorer cloud buckets to your own workspaces.
output
Most nf-core pipelines use the `outdir` parameter in a standardized way to specify where the final results created by the pipeline are published. `outdir` must be unique for each pipeline run; otherwise, your results will be overwritten.
For this tutorial test run, keep the default `outdir` value (`./results`).
For the `outdir` parameter in pipeline runs in your own workspace, select Browse to specify a cloud storage directory using Data Explorer, or manually enter a cloud storage path where pipeline results should be published.
Pipeline-specific parameters
Modify other parameters to customize the pipeline execution through the parameters form. For example, under Read trimming options, set the `trimmer` parameter to `fastp` in the dropdown menu instead of the default `trimgalore`.
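The launch form also accepts parameters as an uploaded YAML or JSON file; as a sketch, the equivalent YAML for this run (paths are illustrative):

```yaml
input: s3://my-bucket/samplesheet.csv
outdir: s3://my-bucket/rnaseq/run-2024-06-01/results
trimmer: fastp
```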
Select Launch to start the run. You're directed to the Runs tab, where your run appears at the top of the list with a submitted status.
View run information
Run details page
As the pipeline runs, run details will populate with parameters, logs, and other important execution details:
View run details
View reports
Most Nextflow pipelines generate reports or output files which are useful to inspect at the end of the pipeline execution. Reports can contain quality control (QC) metrics that are important to assess the integrity of the results.
View run reports
See Reports to configure reports for pipeline runs in your own workspace.
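For pipelines in your own workspace, reports are declared in the repository's `tower.yml` file; a minimal sketch (the path pattern and display title are illustrative):

```yaml
reports:
  "**/multiqc_report.html":
    display: "MultiQC HTML report"
```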
View general information
The run details page includes general information about who executed the run and when, the Git hash and tag used, and additional details about the compute environment and Nextflow version used.
View general run information
View process and task details
Scroll down the page to view:
- The progress of individual pipeline Processes
- Aggregated stats for the run (total walltime, CPU hours)
- Workflow metrics (CPU efficiency, memory efficiency)
- A Task details table for every task in the workflow
The task details table provides further information on every step in the pipeline, including task statuses and metrics:
View task details
Task work directory in Data Explorer
If a task fails, a good place to begin troubleshooting is the task's work directory. Nextflow hash-addresses each task of the pipeline and creates unique directories based on these hashes. Select a task in the task details table to open its details, where you can browse the task's work directory in Data Explorer.
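For illustration, each work directory is named by the task hash and holds Nextflow's standard task artifacts alongside the task outputs (the bucket and hash below are placeholders):

```text
s3://my-bucket/work/4a/9f31e0.../   # <work dir>/<first two hash chars>/<full hash>
├── .command.run   # job wrapper submitted to the executor
├── .command.sh    # the task's shell command
├── .command.out   # captured stdout
├── .command.err   # captured stderr
├── .command.log   # combined execution log
├── .exitcode      # numeric exit status
└── ...            # task output files
```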
View task log and output files
Tertiary analysis
Tertiary analysis of pipeline results is often performed in platforms like Jupyter Notebook or RStudio. Setting up the infrastructure for these platforms, including accessing pipeline data and the necessary bioinformatics packages, can be complex and time-consuming.
Data Studios streamlines the process of creating interactive analysis environments for Platform users. With built-in templates, creating a data studio is as simple as adding and sharing pipelines or datasets.
Analyze RNAseq data in Data Studios
In the Data Studios tab, you can view the status and details of the data studios in the Community Showcase workspace.
Data Studios is used to perform bespoke analysis on the results of upstream workflows. For example, in the Community Showcase workspace we have run the nf-core/rnaseq pipeline to quantify gene expression, followed by nf-core/differentialabundance to derive differential expression statistics. The workspace contains a data studio with these results mounted from cloud storage for further analysis. One of these outputs is an RShiny application, which can be deployed for interactive analysis.
Connect to the RNAseq analysis studio
Select the `rnaseq_to_differentialabundance` data studio. This studio consists of an RStudio environment that uses an existing compute environment available in the showcase workspace. The studio also contains mounted data generated from the nf-core/rnaseq and subsequent nf-core/differentialabundance pipeline runs, directly from AWS S3.
Select Connect to view the running RStudio environment. The `rnaseq_to_differentialabundance` studio includes the necessary R packages for deploying an RShiny application to visualize the RNAseq data.
Deploy the RShiny app in the data studio by selecting the green play button on the last chunk of the R script:
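In effect, that final chunk starts the Shiny app assembled by the earlier chunks; a minimal sketch, assuming the script has built a Shiny app object named `app`:

```r
# Sketch: launch the Shiny app built from the differential
# abundance results loaded earlier in the script.
# `app` is assumed to be a Shiny app object created with shiny::shinyApp().
shiny::runApp(app, launch.browser = TRUE)
```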
Data Studios allows you to specify the resources each studio will use. When creating your own data studios with shared compute environment resources, you must allocate sufficient resources to the compute environment to prevent data studio or pipeline run interruptions.
Explore results
The RShiny app deploys in a separate browser window, providing an interactive interface to the data. Here you can view information about your sample data, perform QC or exploratory analysis, and view the results of differential expression analyses, for example:
- Sample clustering with PCA plots
- Gene expression changes with Volcano plots
Collaborate in the data studio
To share the results of your RNAseq analysis or allow colleagues to perform exploratory analysis, share a link to the data studio: select the options menu for the data studio you want to share, then select Copy data studio URL. With this link, other authenticated users with the Connect role (or greater) can access the session directly.
See Data Studios to learn how to create data studios in your own workspace.
Pipeline optimization
Seqera Platform's task-level resource usage metrics allow you to determine the resources requested for a task and what was actually used. This information helps you fine-tune your configuration more accurately.
However, manually adjusting resources for every task in your pipeline is impractical. Instead, you can leverage the pipeline optimization feature available on the Launchpad.
Pipeline optimization analyzes resource usage data from previous runs to optimize the resource allocation for future runs. After a successful run, optimization becomes available, indicated by the lightbulb icon next to the pipeline turning black.
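Conceptually, the optimized profile amounts to per-process resource directives informed by observed usage, similar in spirit to this Nextflow configuration (process names and values are illustrative):

```groovy
// Illustrative sketch of per-process resource tuning; optimization
// applies settings of this kind to future runs of the pipeline.
process {
    withName: 'FASTQC' {
        cpus   = 2
        memory = 3.GB
    }
    withName: 'STAR_ALIGN' {
        cpus   = 12
        memory = 36.GB
    }
}
```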