Smart Reuse of Job Outputs on DNAnexus

Overview and Example Use Cases

DNAnexus allows organizations to optionally reuse the outputs of earlier jobs that ran the same executable with the same input IDs, even if those outputs are in other projects or organizations. This feature has two primary use cases.

1. Dramatically speed up R&D of workflows

For example, suppose you are developing a workflow with n stages, and at each stage you end up debugging an issue. Assume that each stage takes approximately one hour to develop and run. If you do not reuse outputs as you develop, the process takes 1 + 2 + 3 + ... + n = n(n+1)/2 hours, because every time you fix a stage you must recompute the results of all the stages before it. If you instead reuse results for stages that have matured and are no longer modified, the total development time is just the time it takes to develop and run the pipeline once (in this case n hours). For a 10-stage workflow that is 55 hours versus 10, and the gap widens as workflows get longer.
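The arithmetic in this example can be checked with a short shell sketch. The function names are hypothetical, and the one-hour-per-stage figure is the assumption made in the paragraph above rather than a measured value.

```shell
# Total development hours for an n-stage workflow, assuming one hour per stage
# (the assumption used in the text). Function names are illustrative only.

# Without reuse: fixing stage i means rerunning stages 1..i, so 1 + 2 + ... + n.
hours_without_reuse() {
    echo $(( $1 * ($1 + 1) / 2 ))
}

# With reuse: each stage is computed once, so n hours in total.
hours_with_reuse() {
    echo "$1"
}
```

For a 20-stage workflow, calling hours_without_reuse 20 prints 210, while hours_with_reuse 20 prints 20.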

This feature also saves time when developing forks of existing workflows. For example, suppose you are a developer in an R&D organization and want to modify the last couple of stages of a production workflow in another organization. As long as the new workflow uses the same executable IDs for the unchanged earlier stages, the time required for R&D of the forked version is only that of the modified final stages.

2. Dramatically reduce costs when testing at scale

In production environments, such as a workflow for a clinical test, it is important to test R&D modifications at scale. For example, suppose you are testing a workflow like the forked workflow discussed above. It is a clinical workflow that must be tested on thousands of samples (call that number m) before being vetted to run in production. Suppose also that the whole workflow takes n hours per sample and that the stages you modified account for the last k of those hours. With reuse you save (n - k) * m total compute hours. This can add up to dramatic cost savings as m grows and when k is small.
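The savings formula above can be sketched the same way. This assumes, as the formula does, that k is measured in hours of runtime for the modified stages; the function name and the sample numbers are hypothetical.

```shell
# Compute hours saved when only the last k hours of an n-hour workflow
# must be recomputed for each of m samples. The function name is illustrative.
compute_hours_saved() {
    n=$1; k=$2; m=$3
    echo $(( (n - k) * m ))
}
```

With an 8-hour workflow, 2 hours of modified stages, and 1000 samples, calling compute_hours_saved 8 2 1000 prints 6000.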

Example reuse with WDL

To demonstrate smart reuse, we will use WDL syntax as supported by DNAnexus through our toolkit and dxWDL.

task dupfile {
    File infile

    command { cat ${infile} ${infile} > outfile.txt  }
    output { File outfile = 'outfile.txt' }
}

task headfile {
    File infile

    command { head -10 ${infile} > outfile.txt  }
    output { File outfile = 'outfile.txt' }
}

workflow basic_reuse {
    File infile
    call dupfile { input: infile=infile }
    call headfile { input: infile=dupfile.outfile }
}

The workflow above is a two-step workflow that duplicates a file and then takes the first 10 lines of the duplicate.

Now suppose the user has run the workflow above on some file and simply wants to tweak headfile to output the first 15 lines instead:

task dupfile {
    File infile

    command { cat ${infile} ${infile} > outfile.txt  }
    output { File outfile = 'outfile.txt' }
}

task headfile2 {
    File infile

    command { head -15 ${infile} > outfile.txt  }
    output { File outfile = 'outfile.txt' }
}

workflow basic_reuse_tweaked {
    File infile
    call dupfile { input: infile=infile }
    call headfile2 { input: infile=dupfile.outfile }
}

Here the only differences are that headfile and basic_reuse have been renamed (to headfile2 and basic_reuse_tweaked) and that 10 has been changed to 15. The compilation process automatically detects that dupfile is unchanged while the second stage differs. The generated workflow therefore uses the original executable ID for dupfile and a new one for headfile2.

When basic_reuse_tweaked is executed on the same file with smart reuse enabled, the results from dupfile are reused. A job has already run that specific executable with the same input file on the DNAnexus platform, so the system can reuse its output file.

Specific properties of Smart Reuse

Smart reuse has the following properties:

  • It only applies to jobs run in projects billed to an organization that has smart reuse enabled.
  • A job may reuse results only if a previously run job used exactly the same executable and input IDs (including the function called within the applet).
  • A job that reuses results contains a field called outputReusedFrom that refers to the ID of the job that originally computed the requested outputs. This field never refers to a job that itself reused outputs.
  • A job may reuse results across projects only if the corresponding application's dxapp.json contains "allProjects": "VIEW" in the "access" field.
  • It is applied only to completed jobs executed after the policies are updated for an org.
  • A job must have at least VIEW access to the original job's outputs, and those outputs must still exist on the platform (i.e. they have not since been deleted).
  • Reused jobs are reported as having run for 0s and are correspondingly billed $0.
  • Jobs are assumed to be deterministic in their output.
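The outputReusedFrom field mentioned in the list above can be inspected from the command line. The following is a minimal sketch, not an official recipe: it assumes the DNAnexus toolkit's dx command is installed and logged in, that jq is available for JSON parsing, and that the field appears at the top level of the job's JSON description. The function name reused_from and the ID job-xxxx are placeholders.

```shell
# Print the ID of the job whose outputs were reused by the given job,
# or nothing if the field is absent. Assumes dx (the DNAnexus toolkit CLI)
# is logged in and jq is installed; the function name is illustrative.
reused_from() {
    dx describe "$1" --json | jq -r '.outputReusedFrom // empty'
}

# Example (job-xxxx is a placeholder for a real job ID):
#   reused_from job-xxxx
```

An empty result suggests the job computed its own outputs; a job ID suggests the outputs came from that earlier job.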

Enabling and disabling smart reuse

If you are an administrator of a licensed org and want to enable Smart Reuse, run this command:

dx api org-myorg update '{"policies":{"jobReuse":true}}'

If you plan to use this feature across projects, you must also add "allProjects": "VIEW" to the "access" field of every applet and app configuration involved, as described above.
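As a sketch, the relevant fragment of a dxapp.json might look like the following. All other fields of the file are omitted here, and any entries already present under "access" would need to be kept alongside the new one.

```json
{
  "access": {
    "allProjects": "VIEW"
  }
}
```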

To disable Smart Reuse, run the same dx api command with jobReuse set to false. If you are a licensed customer and cannot run the command above, please contact support@dnanexus.com. If you are interested in this feature and are not a licensed customer, please reach out to sales@dnanexus.com or your account executive for more information.

Last edited by Kurt Jensen, 2018-08-28 20:44:53
