Job Input and Output

Input

When launching an executable, either from outside of the platform, or from another executable that is already running, an input must be provided as part of the JSON in the API call. This input is the mechanism via which objects can be passed into the executable. If the objects are not in a publicly-viewable location or in a location otherwise made accessible to a job running the executable, then the objects need to be cloned into the workspace as mentioned previously for the job to be able to access them. For subjobs (i.e. jobs created via the API call /job/new), no cloning occurs because they share the same workspace as their parent jobs.

The input to a job must be a hash. The system treats this hash slightly differently, depending on whether the executable has a formally defined input specification.

If there is a formally defined input specification, the hash needs to abide by that. More specifically:

  • The keys of the hash must correspond to names of inputs as defined in the spec.
  • The value for a particular key must be compatible with the class of the input as defined in the spec. The following table summarizes value compatibility:

    Class | JSON syntax for value | Example
    int | numeric literal | 32
    float | numeric literal | 3.14
    string | UTF-8 string | "CFTR, HOXA"
    boolean | true/false | true
    hash | JSON hash | {"hello": "world", "foo": true}
    any data object class (e.g. record or file) | {"$dnanexus_link": objectid} or {"$dnanexus_link": {"project": projectid, "id": objectid}} | {"$dnanexus_link": "file-B2QkQvyK8yjQ48y890400012"} or {"$dnanexus_link": {"project": "project-B1350vfK8yjjKz3q00q00001", "id": "file-B2QkQvyK8yjQ48y890400012"}}
    array of a class (other than hash) | JSON array of the class and/or nested arrays of the class; the array will be flattened before being given as input to the executable | For class "array:int": [1, 2, -4, 104] or [1, [2, -4], [[104]]] (interpreted as the same value). For class "array:file": [{"$dnanexus_link": "file-B2QkQvyK8yjQ48y890400012"}, {"$dnanexus_link": "file-B2p0BZKK8yjg1JYyyQJQ000K"}]

As shown in the examples, data objects must be represented using the special DNAnexus link syntax.
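
The flattening of nested arrays described in the table can be illustrated with a short Python sketch. This is not the platform's implementation; the function name flatten is hypothetical, and the sketch only assumes that any JSON array nested inside an array input is expanded in place.

```python
def flatten(value):
    """Recursively expand nested lists into one flat list (illustrative only)."""
    flat = []
    for item in value:
        if isinstance(item, list):
            # A nested array contributes its elements in order.
            flat.extend(flatten(item))
        else:
            # Scalars and hashes (including DNAnexus links) are kept as-is.
            flat.append(item)
    return flat


print(flatten([1, [2, -4], [[104]]]))
```

Under this reading, the two "array:int" examples in the table produce the same list.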

Job Dependencies

In some cases, the value of an input parameter may not be known at the time the job is created, but will instead be obtained from the output of some other job, which may still be running. In this case, a job-based object reference can be provided. A job-based object reference is a JSON hash with keys "job", referring to a job ID of a job in the same project, and "field", referring to the key of the job’s output hash whose value is expected to contain the value of the input parameter. Job-based object references will be discussed again later.

Note that the target value to which a job-based object reference eventually resolves will be whatever JSON value the other job provided. If it (or any other input field) does not match the input spec of the job using it as input, then the job will fail with failure reason "InputError". If the input spec mandates a type constraint for a particular input, the types associated with the object given as input will also be validated against the particular constraint.

Data Object Dependencies

If a data object in the input of an origin or master job is not yet in the closed state, then it cannot yet be cloned into the temporary workspace of the new job. As a result, the system will wait until the data object is closed before considering the job runnable. This allows you to queue up jobs without having to wait or come back when the input is ready. For example, you can initiate a file upload and run a reads importer app on the file object without having to wait for the file to finish uploading. Once the upload is done, the job running the reads importer will be marked as runnable and will be assigned to a worker in the cloud.

Explicit Dependencies

When creating any new job, you can specify the dependsOn field in the API call to be an array of job and/or data object IDs. This creates an explicit dependency for the newly created job to wait for any mentioned jobs to transition to the done state and for any mentioned data object IDs (and recursively their hidden linked data objects) to transition to the closed state.

This may be useful, for example, if you have two apps called "collectStats" and "aggregateStats". Every day, you launch some number of "collectStats" app executions on separate inputs, each of which writes several sample statistics files to the project. Once a week, you run your "aggregateStats" app, which runs an aggregation script to collect summary statistics for all the sample-level statistics present in the project. If some "collectStats" jobs are currently running, you can still launch your "aggregateStats" job. By setting dependsOn to the list of running "collectStats" job IDs when creating the job that will run "aggregateStats", the resulting job will wait for those jobs to finish without the need for job-based object references in the input.
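
As a rough sketch of the scenario above, the following Python snippet assembles a request body that carries explicit dependencies. Only the dependsOn field is named in this section; the "input" key, the function name build_job_request, and the placeholder job IDs are assumptions made for illustration.

```python
import json


def build_job_request(job_input, depends_on):
    """Return a request body with an input hash and explicit dependencies.

    The "input" key is an assumption; "dependsOn" is the field described above.
    """
    return {"input": job_input, "dependsOn": list(depends_on)}


# Placeholder IDs standing in for "collectStats" jobs that are still running.
running_collect_stats = ["job-xxxx", "job-yyyy"]

request = build_job_request({}, running_collect_stats)
print(json.dumps(request, indent=2))
```

Because the dependency is stated explicitly, the input hash itself does not need to mention the "collectStats" jobs.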

Jobs with No Input Spec

If the executable does not have an input specification or the job in question is a subjob, the hash given as input can contain any values; those values will only be checked for the presence of DNAnexus links and job-based object references. The job will not run until every job-based object reference in its input values (or in values inside arrays) has been resolved and every DNAnexus link points to a closed data object.
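
The check described above can be approximated in Python as follows. This is a sketch under stated assumptions, not the system's logic: closed_objects and finished_jobs are hypothetical sets standing in for state the platform tracks internally, and the sketch also descends into nested hashes, which goes slightly beyond what the paragraph states.

```python
def pending_references(value, closed_objects, finished_jobs):
    """Return the links in value that would still keep a job from running."""
    pending = []
    if isinstance(value, list):
        for item in value:
            pending.extend(pending_references(item, closed_objects, finished_jobs))
    elif isinstance(value, dict):
        link = value.get("$dnanexus_link")
        if isinstance(link, str):
            # Data object link given as a bare object ID.
            if link not in closed_objects:
                pending.append(value)
        elif isinstance(link, dict) and "job" in link:
            # Job-based object reference.
            if link["job"] not in finished_jobs:
                pending.append(value)
        elif isinstance(link, dict) and "id" in link:
            # Data object link given with an explicit project.
            if link["id"] not in closed_objects:
                pending.append(value)
        else:
            # Ordinary hash: look at each of its values.
            for item in value.values():
                pending.extend(pending_references(item, closed_objects, finished_jobs))
    return pending
```

An empty result would correspond to an input in which nothing remains to be waited on.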

Output

Job outputs are treated differently for the following two categories of jobs:

  1. Outputs of jobs created via /job/new
  2. Outputs of jobs created via running an app or applet (see /applet-xxxx/run)

In case #1, the system does not need to perform any checks for the job output, as it is only going to be consulted by other jobs within the same job group. In case #2, the system needs to validate the output (if the app or applet spec formally defines its output), and also clone the output objects into the project context or the parent workspace. In both cases the resulting job output needs to be a JSON hash.

For case #2, the behavior depends on whether the app or applet spec formally defines its outputs. If that is the case, the job output is validated against the "outputs" field of the applet spec, in a way similar to the one for the "inputs", as mentioned in the description of the "run" method. Then, any outputs whose class is a data object class are cloned into the project context or parent workspace (depending on whether the executable was launched by a user from outside of the platform or by another job). If the executable does not have an output spec, the output hash is examined for any links (which can appear anywhere, including the values of hashes or inside arrays) and these are exactly the objects that will be cloned.
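
For an executable without an output spec, the search for links "anywhere" in the output hash could look roughly like the Python sketch below. The function name find_linked_object_ids is hypothetical, and the sketch assumes that only data object links (not job-based object references) identify objects to clone.

```python
def find_linked_object_ids(value):
    """Collect data object IDs from every DNAnexus link nested in value."""
    found = []
    if isinstance(value, dict):
        link = value.get("$dnanexus_link")
        if isinstance(link, str):
            # {"$dnanexus_link": objectid}
            found.append(link)
        elif isinstance(link, dict) and "id" in link:
            # {"$dnanexus_link": {"project": projectid, "id": objectid}}
            found.append(link["id"])
        else:
            # Not a data object link: search the values of the hash.
            for child in value.values():
                found.extend(find_linked_object_ids(child))
    elif isinstance(value, list):
        for child in value:
            found.extend(find_linked_object_ids(child))
    return found
```

Links inside nested hashes and arrays are reported in the order in which they are encountered.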

Job Dependencies

Job output can also contain job-based object references, which will be resolved once the referenced jobs finish running. If a job’s output contains job-based object references, it is placed in the waiting_on_output state until these references are resolved. A job is also implicitly dependent on all of its child jobs before it can be moved into the done state. Thus an origin or master job is done only after all of its descendant jobs are also done. See Job Lifecycle for more information on job states.

Data Object Dependencies

A job can output a data object that is in the closing state. The job will not be marked as done until all of its data object outputs have transitioned to the closed state and all job-based object references have been resolved.

Output Validation Failure Reasons

When job output is validated, a job may fail with failure reason "OutputError" for the following reasons (this list is non-exhaustive):

  • A job-based object reference in the job output did not resolve successfully (invalid job, job ID not found, job not viewable by the job, job is in either the failed or terminated state, field does not exist, field does not contain a valid object link).
  • For case #2 (here "output object" refers to objects pointed to by links in the output):
    • The JSON output of the job is not a hash.
    • An output object does not exist in the workspace.
    • An output object is not a data object (things like users, projects, or jobs are not data objects).
    • For applets that formally define their output:
      • An output object does not satisfy the class constraints.
      • An output object does not satisfy the type constraints.
      • The names of outputs do not match exactly those defined in the spec.
    • An output object is not in the "closing" or "closed" state.

Special Values

DNAnexus links are JSON hashes with a particular syntax that are recognized by the platform. They are used to refer to data objects and to the output fields of jobs. To avoid unexpected behavior, you should not reuse the syntax of these special values in your input and output if you do not wish the platform to treat them differently and potentially overwrite their values.

Data Object Links

Data objects in the system are represented by a mapping with the key-value pair:

  • $dnanexus_link string or mapping Either the data object ID, or a mapping with the key/values:
    • id string Data object ID
    • project string ID of the project or container in which to find the data object

If the project ID is not provided and the object is not found in the project from which the job is launched, then the API server will, if it can, find some copy of the object for which the user has VIEW permission, provided that the project containing that copy is not RESTRICTED.

Examples
{
  "$dnanexus_link": "file-BFY5vKKgqZk89Vzk0Zj00GQb"
}
{
  "$dnanexus_link": {
    "id": "file-BFY5vKKgqZk89Vzk0Zj00GQb",
    "project": "project-B3387KygqZk2YQ12Zjf00001"
  }
}
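
When building inputs programmatically, both forms of the data object link can be produced by a small helper. The Python sketch below uses a hypothetical function name, data_object_link, and simply mirrors the two examples above.

```python
def data_object_link(object_id, project_id=None):
    """Build a DNAnexus link to a data object, optionally pinned to a project."""
    if project_id is None:
        # Short form: the link value is the object ID itself.
        return {"$dnanexus_link": object_id}
    # Long form: the link value is a mapping with "id" and "project".
    return {"$dnanexus_link": {"id": object_id, "project": project_id}}


print(data_object_link("file-BFY5vKKgqZk89Vzk0Zj00GQb"))
print(data_object_link("file-BFY5vKKgqZk89Vzk0Zj00GQb",
                       "project-B3387KygqZk2YQ12Zjf00001"))
```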

Analysis and Job-based Object References

Analysis and job-based object references are references to values, but instead of directly specifying the value, they refer to the output of another job or analysis that may still be in progress. They can be specified in the input or in the output of another job or analysis.

When an analysis or job-based object reference is placed in the input of a job, the job will be in the "waiting_on_input" state until all references have been resolved. Once all its inputs are ready, the system will transition the job to the "runnable" state, at which point it will be placed in the queue to be run in the cloud.

When a job finishes running and reports an analysis or job-based object reference in its output, the job will be placed in the "waiting_on_output" state. The job will be marked as "done" only after all its outputs are ready and any executions it has launched are also done.

Job-based Object Reference

Job-based object references are mappings with the key/values:

  • $dnanexus_link mapping Mapping with the key/values:
    • job string Job ID
    • field string Output field name

Deprecated syntax: job-based object references are also recognized if they are provided as the inner mapping with just the two keys job and field.

Example
{
  "$dnanexus_link": {
    "job": "job-BFY5vKKgqZk89Vzk0Zj00GQb",
    "field": "mappings"
  }
}
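
The relationship between the current and deprecated syntax can be sketched in Python. Both function names are hypothetical, and the normalization shown is only one plausible way a client might upgrade old-style references before sending them.

```python
def job_reference(job_id, field):
    """Build a job-based object reference in the current syntax."""
    return {"$dnanexus_link": {"job": job_id, "field": field}}


def normalize_job_reference(value):
    """Wrap the deprecated bare {"job", "field"} mapping in "$dnanexus_link"."""
    if isinstance(value, dict) and set(value) == {"job", "field"}:
        return {"$dnanexus_link": dict(value)}
    # Anything else, including the current syntax, is returned unchanged.
    return value


deprecated = {"job": "job-BFY5vKKgqZk89Vzk0Zj00GQb", "field": "mappings"}
print(normalize_job_reference(deprecated))
```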

Analysis Reference

Analysis references are links which act much like job-based object references but refer instead to the output of an analysis.

An analysis reference can be specified as a mapping with the key-value pair:

  • $dnanexus_link mapping Mapping with the key/values:
    • analysis string Analysis ID
    • field string The output field name; this is one of the following:
      • stage ID and the stage's output field name, separated by a dot ("."), i.e. <stage ID>.<stage output field>
      • the exported output field name in the workflow that was run (see the section on customizing workflow IO specifications for more information)

Examples
{
  "$dnanexus_link": {
    "analysis": "analysis-BFY5vKKgqZk89Vzk0Zj00GQb",
    "field": "stage-BFY5yq6gqZkF9GGZbkq02Vj6.mappings"
  }
}
{
  "$dnanexus_link": {
    "analysis": "analysis-BFY5vKKgqZk89Vzk0Zj00GQb",
    "field": "mappings"
  }
}

Analysis Stage Reference

The output of an analysis's stage can also be referenced explicitly using a mapping with the key-value pair:

  • $dnanexus_link mapping Mapping with the key/values:
    • analysis string Analysis ID
    • stage string Stage ID
    • field string Output field name of the stage

Example
{
  "$dnanexus_link": {
    "analysis": "analysis-BFY5vKKgqZk89Vzk0Zj00GQb",
    "stage": "stage-BFY5yq6gqZkF9GGZbkq02Vj6",
    "field": "mappings"
  }
}
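
The dotted "<stage ID>.<stage output field>" form of an analysis reference and the explicit stage form appear to carry the same information, so one can be rewritten as the other. The Python sketch below is illustrative only: to_stage_reference is a hypothetical name, and it assumes that stage IDs begin with "stage-" and contain no dot, as in the examples here.

```python
def to_stage_reference(analysis_ref):
    """Rewrite a dotted analysis reference as an explicit stage reference."""
    link = analysis_ref["$dnanexus_link"]
    stage_id, dot, stage_field = link["field"].partition(".")
    if not dot or not stage_id.startswith("stage-"):
        # Treated as an exported workflow output name; leave it unchanged.
        return analysis_ref
    return {
        "$dnanexus_link": {
            "analysis": link["analysis"],
            "stage": stage_id,
            "field": stage_field,
        }
    }
```

A reference whose field is an exported output name, such as "mappings", passes through untouched.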

Array Index References

Sometimes the output of an existing job is in the form of an array of some class of output, and it is desirable to provide a reference to just one of the members of this array when launching another job. In such a situation, the analysis or job-based object reference can be augmented with the integer index field to indicate the element that is referenced.

The syntax for the reference is a mapping with key-value pair:

  • $dnanexus_link mapping Mapping with the key/values:
    • The key-value pairs that are indicated by the analysis or job-based object reference syntax. See above.
    • index integer Index of the array to be linked; this is 0-indexed, so a value of 0 indicates the first element should be used

Example
{
  "$dnanexus_link": {
    "analysis": "analysis-BFY5vKKgqZk89Vzk0Zj00GQb",
    "stage": "stage-BFY5yq6gqZkF9GGZbkq02Vj6",
    "field": "array_of_mappings",
    "index": 1
  }
}
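
To make the effect of the index field concrete, the Python sketch below resolves a job-based object reference against a table of finished job outputs. The job_outputs mapping, the placeholder job ID, and the function name resolve_reference are all hypothetical; the platform performs this resolution itself.

```python
def resolve_reference(reference, job_outputs):
    """Look up the value a job-based object reference points to.

    job_outputs is a hypothetical mapping of job ID to that job's output hash.
    """
    link = reference["$dnanexus_link"]
    value = job_outputs[link["job"]][link["field"]]
    if "index" in link:
        # 0-indexed: an index of 0 selects the first element of the array.
        value = value[link["index"]]
    return value


job_outputs = {"job-xxxx": {"array_of_mappings": ["first", "second", "third"]}}
reference = {"$dnanexus_link": {"job": "job-xxxx",
                                "field": "array_of_mappings",
                                "index": 1}}
print(resolve_reference(reference, job_outputs))  # prints "second"
```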

Symbolic Metadata References

Links can also be used to refer to metadata stored in a particular data object. Both data object links and execution output references can be augmented in this way. A symbolic metadata reference is a mapping with the key-value pair:

  • $dnanexus_link mapping Mapping with key/values:
    • The key-value pairs that are indicated by the link to be augmented. See above for data object and execution reference syntax. See the next Examples section for how to link to another field in the input.
    • metadata string A restricted JavaScript-style syntax for accessing values within hashes and arrays (see below for more details). The three types of metadata that can be accessed are as follows:
      • Name: using the value "name" will resolve the link to the string name of the data object.
      • Properties: using a value starting with "properties" and specifying a single key will resolve the link to the string property value specified by the property key given.
      • Details: using a value starting with "details" will resolve the link to either the entire JSON details of the data object, or to a value stored within it if any keys are provided.

Examples

There are three ways to reference an object, and the examples below show how they work.

  1. Data object ID: you can refer to a data object by ID, optionally providing a project ID as well.
    {
      "$dnanexus_link": {
        "id": "record-B65KqzygqZk7KvKK7VgQ00gp",
        "metadata": "name"
      }
    }
    {
      "$dnanexus_link": {
        "id": "record-B65KqzygqZk7KvKK7VgQ00gp",
        "project": "project-B3387KygqZk2YQ12Zjf00001",
        "metadata": "properties.foo"
      }
    }
    
  2. Input field: in addition to the existing link types, you can also refer to a data object provided in another field of the input. For example, if there are two input fields "genome" and "indexed_genome", you can provide a link for "genome" which resolves to the genome object that was used to create the indexed genome object.
    {
      "$dnanexus_link": {
        "input": "indexed_genome",
        "metadata": "details.genome"
      }
    }
    
  3. Job-based object reference or analysis reference: you can combine this syntax with that of a job-based object or analysis reference so that the metadata will be resolved once the relevant execution finishes.
    {
      "$dnanexus_link": {
        "job": "job-BFxzqYfgqZkFxQyZ1QVQ001k",
        "field": "output_field",
        "metadata": "details.genome"
      }
    }
    {
      "$dnanexus_link": {
        "analysis": "analysis-BFxzq7ygqZk74yg420VQ0005",
        "field": "output_field",
        "metadata": "details.genome"
      }
    }
    

Note: if a default value is provided for an input, it will be used if the symbolic reference cannot be resolved.

Key Notation

There are two ways to specify a key: dot notation and bracket notation.

  • Dot notation: keys may contain only alphanumeric characters and the characters "$" and "_".
  • Bracket notation: any key can be represented, but the brackets must contain either an integer or exactly one quoted string (in single or double quotes).

Below are valid and invalid examples for values for the metadata key.

Valid

  • name
  • properties.foo
  • properties["foo"]
  • properties['foo']
  • properties['f"o"o']
  • properties["hello world!"]
  • details.some_array[2]
  • details.genome.$dnanexus_link

Invalid

  • name[0]
  • properties[foo]
  • properties['foo' + 'bar']
  • properties.hello world!
  • details.thing[3.4]
  • details.foo\.bar
  • details.foo\ bar
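
The rules above can be approximated with a small parser. The following Python sketch is an assumption-laden illustration rather than the platform's grammar: parse_metadata is a hypothetical name, escape sequences inside quoted strings are not handled, and the sketch requires no keys after "name" and exactly one key after "properties", as the descriptions of those metadata types suggest.

```python
import re

# One accessor: .key, [integer], ["key"] or ['key'].
ACCESSOR = re.compile("|".join([
    r"\.([A-Za-z0-9_$]+)",  # dot notation
    r"\[(\d+)\]",           # bracket notation with an integer
    r'\["([^"]*)"\]',       # bracket notation with a double-quoted string
    r"\['([^']*)'\]",       # bracket notation with a single-quoted string
]))
ROOT = re.compile(r"name|properties|details")


def parse_metadata(expr):
    """Return (root, keys) for an accepted expression, or None otherwise."""
    root_match = ROOT.match(expr)
    if root_match is None:
        return None
    root = root_match.group(0)
    pos = root_match.end()
    keys = []
    while pos < len(expr):
        match = ACCESSOR.match(expr, pos)
        if match is None:
            return None
        dot_key, index, double_quoted, single_quoted = match.groups()
        if index is not None:
            keys.append(int(index))
        else:
            candidates = (dot_key, double_quoted, single_quoted)
            keys.append(next(k for k in candidates if k is not None))
        pos = match.end()
    if root == "name" and keys:
        return None
    if root == "properties" and len(keys) != 1:
        return None
    return root, keys


print(parse_metadata("details.some_array[2]"))  # ('details', ['some_array', 2])
print(parse_metadata("properties[foo]"))        # None
```

With these assumptions, the sketch accepts every entry in the Valid list and rejects every entry in the Invalid list.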

Last edited by Thanh-Ha Nguyen, 2016-06-29 22:23:08
