1 Introduction

The tidycwl package takes raw Common Workflow Language (CWL) workflows encoded in JSON or YAML and turns the workflow elements into tidy data frames or structured lists. It follows the tidyverse design principles and works together with other packages that share a similar design.

Let’s use a real-world example to see how we can read, parse, and visualize a bioinformatics workflow with tidycwl.

library("tidycwl")

2 Read workflow

To read a CWL workflow into R, use read_cwl_json(), read_cwl_yaml(), or read_cwl(format = ...) depending on the workflow storage format.

flow <- system.file("cwl/sbg/workflow/gatk4-wgs.json", package = "tidycwl") %>%
  read_cwl_json()
flow

Name: Whole Genome Sequencing - BWA + GATK 4.0 (with Metrics) 
Class: Workflow 
CWL Version: sbg:draft-2

The printed summary shows the name, the class (workflow or command line tool), and the CWL version. Currently, tidycwl supports both sbg:draft-2 and v1.0 workflows. As the standard evolves, we plan to add support for newer versions as needed.
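As a minimal sketch of the generic reader, the same bundled file can be read with read_cwl() by stating the format explicitly. Reading the version through `flow$cwlVersion` assumes that the returned object is a list mirroring the top-level fields of the CWL document; that is an assumption about the object's structure rather than something documented here.

```r
library("tidycwl")

# Path to the example workflow bundled with the package (same file as above)
path <- system.file("cwl/sbg/workflow/gatk4-wgs.json", package = "tidycwl")

# Generic reader with the storage format stated explicitly
flow <- read_cwl(path, format = "json")

# Assumption: the object is a list mirroring the CWL document,
# so the top-level `cwlVersion` field can be accessed directly
flow$cwlVersion
```

For a workflow stored as YAML, the same call would use format = "yaml" instead.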

3 Parse workflow

After reading the workflow into R, let’s parse the main components from the CWL.

Besides the type (parse_type()) and metadata (parse_meta()), we are most interested in the core components of a workflow: the inputs, the outputs, and the intermediate steps.

flow %>%
  parse_inputs() %>%
  names()

[1] "sbg:fileTypes"        "label"                "id"                  
[4] "sbg:includeInPorts"   "type"                 "description"         
[7] "sbg:category"         "sbg:toolDefaultValue"

flow %>%
  parse_outputs() %>%
  names()

[1] "source"             "label"              "required"          
[4] "id"                 "sbg:includeInPorts" "type"              
[7] "sbg:fileTypes"

flow %>%
  parse_steps() %>%
  names()

[1] "sbg:x"   "inputs"  "outputs" "run"     "id"      "sbg:y"   "scatter"

Depending on whether these components are represented as YAML/JSON dictionaries or lists in the workflow, the parsed results can be data frames or lists. This is deliberate: at this stage, we want to keep transformations of the original data to a minimum. These raw results are also less convenient to work with than the more granular parsers described next.
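Because the shape of the parsed result depends on the workflow, downstream code may want to check it before relying on column access. The following is a hedged sketch of such a check; the column names `id` and `label` are taken from the names() output above, and the fallback branch is only an illustration.

```r
library("tidycwl")

flow <- read_cwl_json(
  system.file("cwl/sbg/workflow/gatk4-wgs.json", package = "tidycwl")
)
inputs <- parse_inputs(flow)

# The parsed result may be a data frame or a plain list,
# so check before relying on column access
if (is.data.frame(inputs)) {
  # Columns listed in the names() output above
  print(head(inputs[, c("id", "label")]))
} else {
  # Fall back to a structural summary for list-shaped results
  str(inputs, max.level = 1)
}
```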

4 Get parameters

We can use the get_*_*() functions to get the critical parameters, such as the ID, label, or documentation from the parsed inputs, outputs, and steps. For example, use get_steps_label() to get the labels of the steps in the workflow:

flow %>%
  parse_steps() %>%
  get_steps_label()

 [1] "SBG Genome Coverage"                 "SBG Untar fasta"                    
 [3] "Sambamba Merge"                      "SBG FASTQ Quality Adjuster"         
 [5] "Tabix Index"                         "GATK CollectAlignmentSummaryMetrics"
 [7] "SBG Prepare Intervals"               "FastQC"                             
 [9] "BWA INDEX 0.7.17"                    "SBG Pair FASTQs by Metadata"        
[11] "SBG FASTA Indices"                   "Tabix BGZIP"                        
[13] "BWA MEM Bundle 0.7.17"               "GATK HaplotypeCaller"               
[15] "GATK IndexFeatureFile"               "GATK IndexFeatureFile"              
[17] "GATK IndexFeatureFile"               "GATK MergeVcfs"                     
[19] "GATK GenotypeGVCFs"                  "GATK MergeVcfs"                     
[21] "GATK ApplyBQSR"                      "GATK BaseRecalibrator"
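The labels above are not unique: "GATK IndexFeatureFile", for instance, appears three times. A small sketch of pairing each label with its step ID makes such repeats easier to tell apart. Note that get_steps_id() is assumed here from the get_*_*() naming pattern; it is not shown in the text above.

```r
library("tidycwl")

flow <- read_cwl_json(
  system.file("cwl/sbg/workflow/gatk4-wgs.json", package = "tidycwl")
)
steps <- parse_steps(flow)

# get_steps_id() is assumed here from the get_*_*() naming pattern
step_tbl <- data.frame(
  id = get_steps_id(steps),
  label = get_steps_label(steps),
  stringsAsFactors = FALSE
)

# Labels that occur more than once in the workflow
label_counts <- table(step_tbl$label)
label_counts[label_counts > 1]
```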

5 Get graph elements

In many cases, it is useful to construct a graph with the parsed inputs, outputs, and steps from the workflow. The functions get_nodes() and get_edges() can help us tidy the graph nodes and edges into data frames. Each row represents a node or an edge, with each variable representing an attribute of the node or edge.

The function get_graph() is a wrapper that returns both the nodes and the edges in a list:

get_graph(
  flow %>% parse_inputs(),
  flow %>% parse_outputs(),
  flow %>% parse_steps()
) %>% str()

List of 2
 $ nodes:'data.frame':  36 obs. of  3 variables:
  ..$ id   : chr [1:36] "intervals_file" "dbsnp" "mills" "fastq" ...
  ..$ label: chr [1:36] "Target BED" "dbsnp" "Mills" "Fastq" ...
  ..$ group: chr [1:36] "input" "input" "input" "input" ...
 $ edges:'data.frame':  43 obs. of  5 variables:
  ..$ from     : chr [1:43] "SBG_FASTA_Indices" "Sambamba_Merge" "reference" "BWA_MEM_Bundle_0_7_17" ...
  ..$ to       : chr [1:43] "SBG_Genome_Coverage" "SBG_Genome_Coverage" "SBG_Untar_fasta" "Sambamba_Merge" ...
  ..$ port_from: chr [1:43] "fasta_reference" "merged_bam" NA "aligned_reads" ...
  ..$ port_to  : chr [1:43] "fasta" "bam" "input_tar_with_reference" "bams" ...
  ..$ type     : chr [1:43] "step_to_step" "step_to_step" "input_to_step" "step_to_step" ...
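Since the nodes and edges are ordinary data frames, they can be summarized and filtered with the usual tools. The sketch below uses base R only and relies on the column names shown in the str() output above (`group`, `type`, `from`, `to`, `port_to`).

```r
library("tidycwl")

flow <- read_cwl_json(
  system.file("cwl/sbg/workflow/gatk4-wgs.json", package = "tidycwl")
)
g <- get_graph(
  parse_inputs(flow),
  parse_outputs(flow),
  parse_steps(flow)
)

# Number of nodes in each group
table(g$nodes$group)

# Edges that connect a workflow input to a step;
# which() drops any rows where the type is missing
input_edges <- g$edges[which(g$edges$type == "input_to_step"), ]
head(input_edges[, c("from", "to", "port_to")])
```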

6 Visualize workflow

With tidycwl, we can visualize the workflow graph by calling visualize_graph(), which is built on the visNetwork package with an automatic hierarchical layout:

if (rmarkdown::pandoc_available("1.12.3")) {
  get_graph(
    flow %>% parse_inputs(),
    flow %>% parse_outputs(),
    flow %>% parse_steps()
  ) %>% visualize_graph()
}

Users can interact with the visualization by zooming in and out and by dragging the view or individual nodes. The graphical details can be fine-tuned by passing additional arguments to visualize_graph().

The visualizations can be exported as HTML or static images (PNG/JPEG/PDF) with export_html() and export_image().
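As a rough sketch of the HTML export, the widget returned by visualize_graph() can be written to a file. The argument layout used here (widget first, output path as `file`) is an assumption about export_html() and should be checked against the package documentation. The temporary path is a placeholder.

```r
library("tidycwl")

flow <- read_cwl_json(
  system.file("cwl/sbg/workflow/gatk4-wgs.json", package = "tidycwl")
)

# Temporary output path; replace with a real location as needed
file_html <- tempfile(fileext = ".html")

# Same pandoc guard as in the visualization example above
if (rmarkdown::pandoc_available("1.12.3")) {
  widget <- get_graph(
    flow %>% parse_inputs(),
    flow %>% parse_outputs(),
    flow %>% parse_steps()
  ) %>% visualize_graph()

  # Assumption: export_html() takes the widget first
  # and the output path as `file`
  export_html(widget, file = file_html)
  file.exists(file_html)
}
```

The resulting HTML file would then be the input to export_image() when a static image is needed.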

A Grammar for Tidying CWL Workflows

2022-04-01