Ready to start running your own custom analyses in the cloud? We'll walk you through the process of packaging an existing executable as an applet that you can run from the website or from the command line to process a file in your project.
Applets and apps are both executables that can be run on the platform. The main difference is that applets are data objects, which live in projects, while apps do not live in a project and can be published to allow other users to run them.
The procedure for creating applets or apps is essentially the same, with the exception of the final build step where you specify whether it should be created as an applet or an app. (You can also transition an applet to an app.) As a result, we may use the two terms interchangeably in tutorials like this one.
In this tutorial, we will upload the fastq_quality_trimmer program from the FASTX-Toolkit to the platform as an applet called "mytrimmer". If we did not upload the program, we would have to run the analysis on a local machine by downloading the file to our local hard drive (1), running the executable locally (2), and then uploading the result back into our project (3), where a new DNAnexus file object is created in our project (4).
Of course, it would be much better if we didn't have to tie up our own computer's CPU and network bandwidth, and could instead perform this analysis in the cloud. The next steps cover how to run the equivalent commands in the cloud.
There are two key CLI tools that will be helpful: dx-app-wizard and dx build. We will run dx-app-wizard to create a skeleton directory for our applet, after which we can fill in the code (1). Next, we will run dx build (2), which will take the generated files and package them as an applet on the platform where it will reside as a data object in your project (3).
To follow along exactly, you should download the following two files.
We will be using the fastq_quality_trimmer found in the excellent FASTX-Toolkit (more details on installation at http://hannonlab.cshl.edu/fastx_toolkit/download.html; the link above is the "Linux (64bit)" binary). You can substitute it with any executable of your own as long as it can run on 64-bit Ubuntu 14.04, and it takes in a local file and outputs another local file.
The FASTQ file provided above contains reads from the small C. elegans sample (the first 25000 reads from SRR070372) also used in the Command-Line Quickstart. You can also substitute your own input file in FASTQ format if you wish.
You can test out the fastq_quality_trimmer on the sample on your local machine. Note that if you are not running 64-bit Linux, you should download the appropriate distribution from the FASTX-Toolkit download page to run it locally. The command we will run is:
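A minimal sketch of that local run, using the same flags as the applet script later in this tutorial. The file names reads.fastq and trimmed_reads.fastq are placeholders (assumptions, not the names of the tutorial files), and the guard simply skips the run when the binary or the input file is absent.

```shell
# Placeholder file names (assumptions); substitute the FASTQ file you downloaded.
input_fastq="reads.fastq"
output_fastq="trimmed_reads.fastq"

# -t 20 is the quality threshold; -Q 33 is commonly used to indicate
# Phred+33 quality encoding. Both flags match the applet script below.
if command -v fastq_quality_trimmer >/dev/null 2>&1 && [ -f "$input_fastq" ]; then
    fastq_quality_trimmer -t 20 -Q 33 -i "$input_fastq" -o "$output_fastq"
else
    echo "Skipping: fastq_quality_trimmer or $input_fastq not found" >&2
fi
```

If the run succeeds, the trimmed reads are written to the output file named above.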
The app wizard is a tool which will generate a local directory and a source code template for your applet after prompting you with questions about the applet you would like to build (e.g. the name, which we will set to "mytrimmer"). We will create a simple applet that will just execute a few shell commands, so we will pick the bash scripting language. For now, don't worry about the rest of the parameters; they are either not required or have safe default values. Just press <ENTER> for the rest of the prompts; other input you should enter is shown in blue below.
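Once the wizard has finished, a quick check like the one below can confirm that the skeleton was generated. It is only a sketch: it looks for the two files this tutorial edits or inspects later (dxapp.json and src/mytrimmer.sh) and assumes you run it from the directory in which you ran dx-app-wizard.

```shell
# Directory name chosen when answering the wizard's "name" prompt.
applet_dir="mytrimmer"
missing=0

# Files referred to later in this tutorial.
for path in dxapp.json src/mytrimmer.sh; do
    if [ -e "$applet_dir/$path" ]; then
        echo "found   $applet_dir/$path"
    else
        echo "missing $applet_dir/$path"
        missing=$((missing + 1))
    fi
done

echo "$missing expected file(s) missing"
```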
Within the platform, executables run on a virtual Linux PC with a stock Ubuntu 14.04 or 12.04 environment. Thus, the FASTX-Toolkit (or your custom executable) is not available there by default. To make it available to your applet, put a copy of the executable inside mytrimmer/resources/usr/bin/.
When you use the DNAnexus build utility to build your applet, any files in the mytrimmer/resources directory will be packaged as part of your applet and will be placed in the root directory of the virtual Linux PC whenever your applet is run in the cloud. Files placed in mytrimmer/resources/usr/bin will therefore be put in /usr/bin in the container and be available in the default path at runtime. (Note: while the mytrimmer/resources subdirectory is unpacked into the root of the virtual filesystem, your applet's executable will later start in /home/dnanexus as its current working directory.)
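The copy step described above can be sketched as follows. The location of the downloaded binary (./fastq_quality_trimmer) is an assumption; adjust it to wherever you unpacked the FASTX-Toolkit.

```shell
applet_dir="mytrimmer"
# Assumed location of the downloaded binary; change this to match your setup.
trimmer_bin="./fastq_quality_trimmer"
dest="$applet_dir/resources/usr/bin/fastq_quality_trimmer"

# Create the directory that will be unpacked into /usr/bin at runtime.
mkdir -p "$applet_dir/resources/usr/bin"

if [ -f "$trimmer_bin" ]; then
    cp "$trimmer_bin" "$dest"
    chmod +x "$dest"   # make sure the packaged copy is executable
else
    echo "Binary not found at $trimmer_bin; copy it into place manually" >&2
fi
```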
If you look inside the main() block in mytrimmer/src/mytrimmer.sh, you'll see that some of the code has been filled in for you. In the code block below, we highlight the crucial lines with pared-down comments changed for the purposes of this tutorial. You should insert the line to run your executable on input_file once it has been downloaded and write your output to output_file.
#!/bin/bash
main() {
    # When the applet is run, the variable "input_file" is already set
    # to the DNAnexus link to the file object. Here, we download it to the
    # job's scratch space
    dx download "$input_file" -o input_file
    # Insert the following line between the download and upload lines
    fastq_quality_trimmer -t 20 -Q 33 -i input_file -o output_file
    # Here, we set the variable "output_file" to be the ID of the
    # uploaded file.
    output_file=$(dx upload output_file --brief)
    # This line reports the uploaded file ID under the output field
    # called "output_file".
    dx-jobutil-add-output output_file "$output_file" --class=file
}
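With the script edited, the applet can be packaged using dx build, as outlined at the start of this tutorial. The sketch below assumes that you are logged in, that a project is selected, and that you run it from the directory containing the mytrimmer folder.

```shell
applet_dir="mytrimmer"

# Build the local directory into an applet in the currently selected project.
if command -v dx >/dev/null 2>&1; then
    dx build "$applet_dir"
else
    echo "dx not found; install the DNAnexus command-line client first" >&2
fi
```

If an applet with the same name already exists in the destination, the build is expected to refuse to replace it; dx build provides an overwrite option (-f) for that case.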
If you run dx describe on the applet, you will see that it has been created with the name "mytrimmer" and has input and output specifications, each expecting one file. It also has a file in its bundledDepends, which contains the fastq_quality_trimmer executable that you previously placed in the mytrimmer/resources directory.
You can also run dx run mytrimmer -h to see automatically generated help for the applet you've just created.
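Both inspection commands mentioned above can be run back to back. This sketch assumes the applet was built into the currently selected project under the name "mytrimmer".

```shell
applet_name="mytrimmer"

if command -v dx >/dev/null 2>&1; then
    # Show the applet's metadata, including its input and output specifications.
    dx describe "$applet_name"
    # Print the automatically generated help for the applet.
    dx run "$applet_name" -h
else
    echo "dx not found; install the DNAnexus command-line client first" >&2
fi
```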
You are now ready to launch the analysis in the cloud, using the "dx run" command (4). Notice that the "mytrimmer" applet resides inside your project in the cloud, so you can really launch it from anywhere. When you launch the analysis, DNAnexus will bring up a new virtual Linux PC to run your code (5).
Upload the sample input file if you haven't already, and then run the applet.
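A minimal sketch of those two steps follows. The file name reads.fastq is a placeholder (an assumption); the input field name input_file comes from the applet's input specification. The -y flag skips the interactive confirmation prompt.

```shell
# Placeholder name (assumption); use the FASTQ file you downloaded earlier.
input_fastq="reads.fastq"

if command -v dx >/dev/null 2>&1 && [ -f "$input_fastq" ]; then
    # Upload the input into the current project, then launch the applet on it.
    dx upload "$input_fastq"
    dx run mytrimmer -iinput_file="$input_fastq" -y
else
    echo "Skipping: dx or $input_fastq not found" >&2
fi
```

If the file is already in your project, skip the upload and pass its name or file ID to -iinput_file instead.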
Once DNAnexus instantiates the virtual execution environment, it will automatically run your code, executing the commands in the shell script you provided when building the applet. The script will run just like on your local computer, downloading the reads to the hard drive of that virtual computer (6), running the FASTX-Toolkit executable (7), then uploading the resulting file (8) back to the project (9).
If you look inside the file mytrimmer/dxapp.json, you will find a JSON hash containing the metadata that the build utility uses to package the applet for the platform. If you wish to change the inputs or outputs of your applet, or request additional execution resources (e.g., network access or more CPU or memory), this is the file to edit before running "dx build". More information on how to request additional resources is available in the DNAnexus documentation. The following shows the parameters automatically set by the app wizard.
{
  "name": "mytrimmer",
  "dxapi": "1.0.0",
  "version": "0.0.1",
  "resources": [],
  "inputSpec": [
    {
      "name": "input_file",
      "class": "file",
      "optional": false
    }
  ],
  "outputSpec": [
    {
      "name": "output_file",
      "class": "file"
    }
  ],
  "runSpec": {
    "interpreter": "bash",
    "file": "src/mytrimmer.sh"
  }
}
When running dx-app-wizard above, you selected the "basic" execution template. Try the other available templates to see simple examples of how to parallelize your execution over multiple machines in the cloud by using additional entry points. You can also choose other programming languages where you can start using some of the DNAnexus client libraries. While the command-line client dx provides quite a bit of functionality, the client libraries can provide a richer experience for programmatically accessing and modifying data on the platform directly in the programming language of your choice.
See the Advanced App Tutorial to get a better understanding of the app directory structure and how to manually modify app inputs, outputs, and metadata.
If you're curious about what happens in the cloud, the Execution Environment Reference is a complete reference to what happens when a job is run.
The Job Lifecycle page describes the progression of a job's states and discusses the reasons a job may fail.
The Developer Portal contains additional resources, including discussion of more advanced concepts (such as parallelizing your workload across multiple instances) and links to example source code for your reference and inspiration.