Intro to Building Workflows

A workflow is a series of executables (apps or applets) that are linked together by dependencies, e.g. one executable's outputs may be another's inputs. It is easiest to create a workflow in the web interface, but you can use the DNAnexus SDK, dx-toolkit, if you want to automate workflow creation or lock down your workflow. This tutorial shows how to do both, step by step.

Simple workflows

A workflow can be created in the DNAnexus platform from a JSON definition in dxworkflow.json, as described in the Workflow Build Process page.

We will build a workflow identical to the one described in the tutorial for Building Workflows in the UI, named BWA MEM + GATK Workflow. The most crucial field in our JSON is stages, which holds a list of executables for the workflow. We'll add two stages to our workflow: the first will run the app BWA-MEM FASTQ Read Mapper and the second will run the app Vendor Human Exome GATK-Lite Pipeline. We'll also specify a name and an output folder. Our dxworkflow.json will look as follows:

{
  "name": "BWA MEM + GATK Workflow",
  "outputFolder": "/results",
  "stages": [
    {
      "id": "align_reads",
      "executable": "app-bwa_mem_fastq_read_mapper",
      "input": {
        "genomeindex_targz": {
          "$dnanexus_link": "file-B6ZY4942J35xX095VZyQBk0v"
        }
      }
    },
    {
      "id": "call_variants",
      "executable": "app-vendor_human_exome_gatk_lite_pipeline",
      "input": {
        "sorted_bam": {
          "$dnanexus_link": {
            "stage": "align_reads",
            "outputField": "sorted_bam"
          }
        },
        "vendor_exome": "agilent_sureselect_human_all_exon_v2"
      }
    }
  ]
}

Each stage in the stages list should have an id, which is a free-form string that is unique within the workflow, and an executable field, which holds either the ID or the name of an app, or the ID of an applet. We can add an input field to a stage if we want to bind an input of that stage to an output (or input) of another stage. For example, the input sorted_bam of our second stage, call_variants, will take its value from the output field sorted_bam of the first stage, align_reads:

{
  "input": {
    "sorted_bam": {
      "$dnanexus_link": {
        "stage": "align_reads",
        "outputField": "sorted_bam"
      }
    }
  }
}
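
The linking rule above can be checked locally before building. The following is a minimal sketch, not part of dx-toolkit: check_stage_links is a hypothetical helper that assumes the workflow definition has the shape shown in this tutorial. It only verifies that stage ids are unique and that every stage link names a stage that exists in the workflow; it does not check that the referenced output field exists on the executable.

```python
def check_stage_links(workflow):
    """Return a list of messages for duplicate stage ids and broken stage links."""
    stage_ids = [stage["id"] for stage in workflow.get("stages", [])]
    problems = []
    # Stage ids must be unique within a workflow.
    for stage_id in sorted(set(stage_ids)):
        if stage_ids.count(stage_id) > 1:
            problems.append(f"duplicate stage id: {stage_id}")
    for stage in workflow.get("stages", []):
        for field, value in stage.get("input", {}).items():
            link = value.get("$dnanexus_link") if isinstance(value, dict) else None
            # Only dict-valued links with a "stage" key refer to another stage;
            # string-valued links (such as file IDs) are skipped.
            if isinstance(link, dict) and "stage" in link and link["stage"] not in stage_ids:
                problems.append(
                    f"{stage['id']}.{field} links to unknown stage {link['stage']}"
                )
    return problems


# Abbreviated copy of the tutorial workflow, used only for illustration.
workflow = {
    "name": "BWA MEM + GATK Workflow",
    "stages": [
        {"id": "align_reads", "executable": "app-bwa_mem_fastq_read_mapper"},
        {
            "id": "call_variants",
            "executable": "app-vendor_human_exome_gatk_lite_pipeline",
            "input": {
                "sorted_bam": {
                    "$dnanexus_link": {"stage": "align_reads", "outputField": "sorted_bam"}
                },
                "vendor_exome": "agilent_sureselect_human_all_exon_v2",
            },
        },
    ],
}

print(check_stage_links(workflow))
```

An empty list means every stage link resolves to a stage defined in the same workflow.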

The input of a stage can also be used to set default values for a field. We selected the file file-B6ZY4942J35xX095VZyQBk0v, which is publicly available in the DNAnexus platform, to be the default reference for the alignment step.

{
  "input": {
    "genomeindex_targz": {
      "$dnanexus_link": "file-B6ZY4942J35xX095VZyQBk0v"
    }
  }
}

TIP: For details on how to link stages' inputs and outputs, please refer to the Binding Input specification.
TIP: The names of the input and output fields of a stage's executable can be found by running the dx describe command.

Before building, we should place our dxworkflow.json in a directory named BWA MEM + GATK Workflow. The directory name does not have to be identical to the name of the workflow, but it is good practice to keep them in sync. We can then create a workflow object on the DNAnexus platform by running the command below:

dx build "BWA MEM + GATK Workflow"
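
When workflow creation is automated, the directory layout that dx build expects can be generated by a script. The following is a minimal sketch that uses only the Python standard library; write_workflow_dir is a hypothetical helper, and the workflow dictionary is an abbreviated copy of the definition above. It writes dxworkflow.json into a directory named after the workflow and does not call dx itself.

```python
import json
import tempfile
from pathlib import Path


def write_workflow_dir(parent, workflow):
    """Create <parent>/<workflow name>/dxworkflow.json and return the directory path."""
    target = Path(parent) / workflow["name"]
    target.mkdir(parents=True, exist_ok=True)
    (target / "dxworkflow.json").write_text(json.dumps(workflow, indent=2) + "\n")
    return target


# Abbreviated workflow definition, used only for illustration.
workflow = {
    "name": "BWA MEM + GATK Workflow",
    "outputFolder": "/results",
    "stages": [
        {"id": "align_reads", "executable": "app-bwa_mem_fastq_read_mapper"},
    ],
}

# A temporary directory keeps the example free of side effects.
with tempfile.TemporaryDirectory() as parent:
    workflow_dir = write_workflow_dir(parent, workflow)
    print(workflow_dir.name)
    print(sorted(p.name for p in workflow_dir.iterdir()))
```

The resulting directory can then be passed to dx build as shown above.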

When we run the workflow, we can pass values to any stage input, or override its default value, by prefixing the input name with the stage id:

dx run -ialign_reads.reads_fastqgz=myreads.fastq.gz \
       -ialign_reads.genomeindex_targz=file-xxxx \
       "BWA MEM + GATK Workflow"

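In a script, the stage-qualified arguments of the dx run command above can be assembled from a mapping. The following is a minimal sketch that only builds and prints the command line; stage_input_args is a hypothetical helper, and file-xxxx is the same placeholder used above.

```python
import shlex


def stage_input_args(stage_inputs):
    """Turn {(stage_id, field): value} into a list of -i<stage>.<field>=<value> arguments."""
    return [
        f"-i{stage}.{field}={value}"
        for (stage, field), value in stage_inputs.items()
    ]


args = (
    ["dx", "run"]
    + stage_input_args(
        {
            ("align_reads", "reads_fastqgz"): "myreads.fastq.gz",
            ("align_reads", "genomeindex_targz"): "file-xxxx",
        }
    )
    + ["BWA MEM + GATK Workflow"]
)

# shlex.join quotes the workflow name so the line can be pasted into a shell.
print(shlex.join(args))
```

Keeping the arguments in a list, rather than a single string, avoids quoting problems if the list is later passed to subprocess.run.
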
Locked workflows

Motivation

In certain situations, it may be desirable to prevent users of our workflow from overriding an input to a particular stage. For example, we may want only a specific reference genome to be used for the workflow.

To achieve that, we can add explicit inputs and outputs to the workflow during creation, with links to inputs and outputs of specific stages. When the workflow is run, the user will be able to pass values only to fields defined in inputs, and all the parameters that are not exposed in this workflow-level I/O interface cannot be changed or overridden.

Locking is also useful when we want to simplify workflow execution and make it clear which inputs users are expected to provide.

This feature also makes the execution of CWL or WDL workflows on our platform more seamless since they too explicitly specify inputs and outputs to workflows.

Building a locked-down workflow

We will create a locked-down version of the workflow above and name it BWA MEM + GATK Workflow (locked). Our workflow will have all inputs locked except for one, reads_fastqgz of the stage align_reads. When locking a workflow, we always define the inputs that are not locked by listing them in the workflow-level inputs field. All the other inputs will be automatically locked, and the users of our workflow will not be able to override their values.

Inputs

To create a locked workflow we first need to add a workflow-level input specification in the inputs field, which may look like this:

{
  "inputs": [
    {
      "name": "reads",
      "class": "file"
    }
  ]
}

In this case the workflow will have only one input, named reads. The full input specification is documented in the IO and Run Specifications.

Stages

Next, we should define which stage or stages will consume that input by adding a link from the stage(s) to the workflow input. We can do this by using the field workflowInputField, as in the example below. If a file is supplied to reads when the workflow is run, it will be directed to reads_fastqgz of the stage align_reads.

{
  "stages": [
    {
      "id": "align_reads",
      "executable": "app-bwa_mem_fastq_read_mapper",
      "input": {
        "reads_fastqgz": {
          "$dnanexus_link": {
            "workflowInputField": "reads"
          }
        },
        "genomeindex_targz": {
          "$dnanexus_link": "file-B6ZY4942J35xX095VZyQBk0v"
        }
      }
    },
    {
      "id": "call_variants",
      "executable": "app-vendor_human_exome_gatk_lite_pipeline",
      "input": {
        "sorted_bam": {
          "$dnanexus_link": {
            "stage": "align_reads",
            "outputField": "sorted_bam"
          }
        },
        "vendor_exome": "agilent_sureselect_human_all_exon_v2"
      }
    }
  ]
}

Notice that the stage input fields genomeindex_targz and vendor_exome are not exposed in the workflow-level inputs, which means that these inputs are locked and the workflow will only be run with the values file-B6ZY4942J35xX095VZyQBk0v and agilent_sureselect_human_all_exon_v2 for these fields, respectively. The same is true for all the other inputs of the executables that are defined in their inputSpec but not included in the stages' input.

If any executables in the workflow have required inputs, they must be specified in the dxworkflow.json. In our example workflow, stages have the following required inputs:

  • align_reads stage: inputs reads_fastqgz and genomeindex_targz
  • call_variants stage: inputs sorted_bam and vendor_exome

reads_fastqgz is exposed as a workflow input in inputs (it is not locked) and the remaining inputs are locked. We have to set the value for all these locked required inputs using stages' input field as above.

If inputs is null or not specified at all, the parameters to stages are not locked down and the workflow accepts inputs provided directly to stages.

Multiple stages can link to the same workflow-level input.
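
To see which stage fields consume each workflow-level input, the definition can be inspected locally. The following is a minimal sketch, assuming the definition has the shape shown in this tutorial; workflow_input_consumers is a hypothetical helper and not part of dx-toolkit.

```python
def workflow_input_consumers(workflow):
    """Map each workflow-level input name to the (stage id, field) pairs that consume it."""
    # Start with every declared input so that unused inputs show up with an empty list.
    consumers = {spec["name"]: [] for spec in workflow.get("inputs") or []}
    for stage in workflow.get("stages", []):
        for field, value in stage.get("input", {}).items():
            link = value.get("$dnanexus_link") if isinstance(value, dict) else None
            if isinstance(link, dict) and "workflowInputField" in link:
                name = link["workflowInputField"]
                consumers.setdefault(name, []).append((stage["id"], field))
    return consumers


# Abbreviated copy of the locked workflow, used only for illustration.
locked_workflow = {
    "inputs": [{"name": "reads", "class": "file"}],
    "stages": [
        {
            "id": "align_reads",
            "executable": "app-bwa_mem_fastq_read_mapper",
            "input": {
                "reads_fastqgz": {"$dnanexus_link": {"workflowInputField": "reads"}},
                "genomeindex_targz": {"$dnanexus_link": "file-B6ZY4942J35xX095VZyQBk0v"},
            },
        },
    ],
}

print(workflow_input_consumers(locked_workflow))
```

A workflow-level input that maps to several pairs is shared by multiple stages; one that maps to an empty list is declared but never consumed.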

Outputs

Optionally, we can also specify workflow-level outputs:

{
  "outputs": [
    {
      "name": "variants",
      "class": "file",
      "outputSource": {
        "$dnanexus_link": {
          "stage": "call_variants",
          "outputField": "variants_vcfgz"
        }
      }
    }
  ]
}

The field outputSource allows us to configure which stage-level outputs will be the outputs of the workflow. This, together with inputs, is especially useful when we want to set a workflow as an executable within another workflow (workflows within workflows is a feature under development).

Full JSON description of a locked-down workflow

Our example dxworkflow.json workflow description will look as follows:

{
  "name": "BWA MEM + GATK Workflow (locked)",
  "outputFolder": "/results",
  "inputs": [
    {
      "name": "reads",
      "class": "file"
    }
  ],
  "stages": [
    {
      "id": "align_reads",
      "executable": "app-bwa_mem_fastq_read_mapper",
      "input": {
        "reads_fastqgz": {
          "$dnanexus_link": {
            "workflowInputField": "reads"
          }
        },
        "genomeindex_targz": {
          "$dnanexus_link": "file-B6ZY4942J35xX095VZyQBk0v"
        }
      }
    },
    {
      "id": "call_variants",
      "executable": "app-vendor_human_exome_gatk_lite_pipeline",
      "input": {
        "sorted_bam": {
          "$dnanexus_link": {
            "stage": "align_reads",
            "outputField": "sorted_bam"
          }
        },
        "vendor_exome": "agilent_sureselect_human_all_exon_v2"
      }
    }
  ],
  "outputs": [
    {
      "name": "variants",
      "class": "file",
      "outputSource": {
        "$dnanexus_link": {
          "stage": "call_variants",
          "outputField": "variants_vcfgz"
        }
      }
    }
  ]
}

We can then build the workflow by running this command on the directory BWA MEM + GATK Workflow (locked) containing the dxworkflow.json:

dx build "BWA MEM + GATK Workflow (locked)"

Running a locked-down workflow in the CLI

To run the workflow, we should pass our FASTQ input file to the workflow-level reads input field:

dx run "BWA MEM + GATK Workflow (locked)" -ireads=my_input_file.fastq.gz

Providing the input file directly to the stage, for example -ialign_reads.reads_fastqgz=my_input_file.fastq.gz, is not possible for locked workflows.
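
A script that launches workflows can apply the same rule before calling dx run. The following is a minimal sketch of such a local pre-check, based only on the behaviour described in this tutorial; rejected_input_names is a hypothetical helper, and the platform remains the authority on which inputs are accepted.

```python
def rejected_input_names(workflow, names):
    """Return the input names that a locked workflow definition does not expose."""
    declared = workflow.get("inputs")
    if declared is None:
        # No workflow-level inputs: the workflow is not locked,
        # so this check has nothing to reject.
        return []
    allowed = {spec["name"] for spec in declared}
    return [name for name in names if name not in allowed]


# Only the field relevant to the check is included here.
locked_workflow = {"inputs": [{"name": "reads", "class": "file"}]}

print(rejected_input_names(locked_workflow, ["reads", "align_reads.reads_fastqgz"]))
```

Here the stage-qualified name is reported as rejected, while the workflow-level name reads is accepted.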

To find out how to run the workflow and what inputs it accepts, we can use this command:

dx run "BWA MEM + GATK Workflow (locked)" --help

Running a locked-down workflow in the UI

In the Web interface, a locked workflow will resemble an app, with inputs on the left side and outputs on the right one.

A list of stages can be viewed by clicking on the black icon in the middle. A locked workflow currently cannot be edited or created in the UI; we can build it in the CLI using dx get and then dx build commands.

Locking down an existing workflow

To lock down an existing workflow, we should get the workflow from the platform by running dx get "BWA MEM + GATK Workflow", add inputs to the downloaded dxworkflow.json, set workflowInputField references from stages to these inputs as explained above, and run dx build again.
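
The edit to the downloaded definition can also be scripted. The following is a minimal sketch, assuming the downloaded dxworkflow.json has the same shape as the examples in this tutorial; lock_down and the keys of its exposed argument are hypothetical names chosen for illustration. It returns a modified copy and leaves running dx get and dx build to the caller.

```python
import copy


def lock_down(workflow, exposed):
    """Return a copy of `workflow` that exposes only the inputs listed in `exposed`.

    Each item of `exposed` is a dict with the keys name, class, stage and field.
    """
    locked = copy.deepcopy(workflow)
    # Declare the workflow-level inputs.
    locked["inputs"] = [{"name": e["name"], "class": e["class"]} for e in exposed]
    stages = {stage["id"]: stage for stage in locked["stages"]}
    for e in exposed:
        # Point the stage field at the workflow-level input.
        stage_input = stages[e["stage"]].setdefault("input", {})
        stage_input[e["field"]] = {"$dnanexus_link": {"workflowInputField": e["name"]}}
    return locked


# Abbreviated copy of the unlocked workflow, used only for illustration.
unlocked = {
    "name": "BWA MEM + GATK Workflow",
    "stages": [
        {
            "id": "align_reads",
            "executable": "app-bwa_mem_fastq_read_mapper",
            "input": {
                "genomeindex_targz": {"$dnanexus_link": "file-B6ZY4942J35xX095VZyQBk0v"}
            },
        },
    ],
}

locked = lock_down(
    unlocked,
    [{"name": "reads", "class": "file", "stage": "align_reads", "field": "reads_fastqgz"}],
)
print(locked["inputs"])
print(locked["stages"][0]["input"]["reads_fastqgz"])
```

Required stage inputs that are not exposed still need values in the stages' input field, as described in the Stages section.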

Next steps

Last edited by Aleksandra Zalcman, 2018-03-23 04:20:26
