Apollo introduces the following new concepts in the DNAnexus Platform:

  • Spark Application
  • Spark Database
  • Database Explorer
  • Database Query (Cohort)

Spark Application

The Spark application is an extension of the existing app(let) framework. App(let)s already carry a spec for their VM (instance type, OS, packages); this spec has been extended to allow an additional, optional cluster spec with “type=spark”.
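As an illustration, a cluster spec of this kind might sit alongside the usual VM spec in the app(let)'s metadata. The exact field names below (`clusterSpec`, `version`, `initialInstanceCount`) are assumptions for illustration; consult the DNAnexus app metadata reference for the authoritative schema.

```json
{
  "runSpec": {
    "systemRequirements": {
      "*": {
        "instanceType": "mem1_ssd1_x4",
        "clusterSpec": {
          "type": "spark",
          "version": "2.4.0",
          "initialInstanceCount": 3
        }
      }
    }
  }
}
```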

Here are some salient points about these applications:

  • Calling /app(let)-xxx/run on a Spark app creates a Spark cluster (worker VMs plus a master VM).
  • The master VM (where the app shell code runs) acts as the Spark driver node.
  • Code on the master VM can leverage the full Spark infrastructure.
  • Job mechanisms (monitoring, termination, etc.) work the same for Spark apps as for any other app(let) on the Platform.
  • Spark apps use the same “dx” API communication between the master VM and the DNAnexus API servers.
  • A new log collection mechanism gathers logs from all nodes in the cluster.
  • You can monitor a running job through the Spark UI using SSH tunneling.
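As a concrete sketch of the Spark UI point above, the snippet below builds an SSH tunnel command that forwards the Spark driver UI from the master VM to a local port. The port (4040 is the stock Spark driver UI default), the job ID, and the assumption that `dx ssh` passes extra ssh options through are all illustrative, not confirmed DNAnexus behavior.

```python
# Sketch: forwarding the Spark driver UI from a running DNAnexus job to
# localhost via SSH tunneling. Job ID, port, and ssh flags are illustrative.

SPARK_UI_PORT = 4040  # stock Spark driver UI default; may differ per deployment


def tunnel_command(job_id: str, local_port: int = 9000) -> list:
    """Build an ssh command forwarding the Spark UI to a local port.

    Assumes `dx ssh <job-id>` forwards extra options (like -L) to ssh;
    shown here as a plain argument list for illustration only.
    """
    return [
        "dx", "ssh", job_id,
        "-L", f"{local_port}:localhost:{SPARK_UI_PORT}",
    ]


cmd = tunnel_command("job-xxxx")
# With the tunnel up, the Spark UI would be reachable at http://localhost:9000.
```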

Spark Database

Databases hold structured data in tables for later analysis. They are always scoped to (contained inside) a project. Databases and their tables are created from DNAnexus Spark apps.

Databases are created via Spark SQL and are represented as database data objects on the Platform. See the API documentation for details on how to maintain databases. For quick ad-hoc queries against a database, tools such as beeline can be used; see the beeline documentation for details.
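A minimal sketch of how a Spark app might create and query such a database via Spark SQL. The database and table names are invented, and the `spark` session is assumed to be created by the app itself (e.g. via `SparkSession.builder.enableHiveSupport().getOrCreate()`) and passed in.

```python
# Illustrative Spark SQL statements a DNAnexus Spark app might run.
# Database/table names are made up; `spark` is an existing SparkSession.

CREATE_DB = "CREATE DATABASE IF NOT EXISTS genotype_db"
CREATE_TABLE = """
    CREATE TABLE IF NOT EXISTS genotype_db.variants (
        sample_id STRING,
        chrom     STRING,
        pos       BIGINT,
        ref       STRING,
        alt       STRING
    )
"""
QUERY = "SELECT chrom, COUNT(*) AS n FROM genotype_db.variants GROUP BY chrom"


def populate_and_query(spark):
    """Run the DDL and a sample query through an existing SparkSession."""
    spark.sql(CREATE_DB)
    spark.sql(CREATE_TABLE)
    return spark.sql(QUERY)
```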

Database Explorer

A database is visualized through a special type of record called a DatabaseExplorer, whose types field is set to ["DatabaseExplorer"]. To link a database to a Database Explorer, both objects must carry the same tag.
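The sketch below builds the kind of /record/new payload that could represent such a record, linked to a database by a shared tag. The `types` and shared-tag mechanism come from the text above; the project ID, record name, and tag value are illustrative assumptions.

```python
# Sketch of a /record/new API payload for a DatabaseExplorer record.
# The shared tag is what links the record to the database object;
# the project ID, name, and tag value are illustrative.

SHARED_TAG = "my-cohort-browser"  # must also be applied to the database object


def explorer_record_input(project_id: str) -> dict:
    return {
        "project": project_id,
        "name": "Cohort Browser",
        "types": ["DatabaseExplorer"],  # marks the record as an explorer
        "tags": [SHARED_TAG],           # same tag as on the linked database
    }


payload = explorer_record_input("project-xxxx")
```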

Database Query (Cohort)

A query on a database can be stored in a special type of record called a DatabaseQuery. This is mainly used to store cohorts of samples based on the filters users set in the visualization UI. The record can later be used to re-hydrate those filters in the visualization UI, and can also be consumed by analysis apps.
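The round trip described above could look like the sketch below: the UI's filters are stored in a DatabaseQuery record's details and recovered later. The filter schema and the use of the details field for storage are invented for illustration; only the ["DatabaseQuery"] record type and the re-hydration idea come from the text.

```python
# Sketch: storing UI filters in a DatabaseQuery (cohort) record and
# re-hydrating them later. The filter schema is invented for illustration.

def query_record_input(project_id: str, filters: dict) -> dict:
    """Build a /record/new payload holding the cohort's filters."""
    return {
        "project": project_id,
        "types": ["DatabaseQuery"],
        "details": {"filters": filters},
    }


def rehydrate(record_details: dict) -> dict:
    """Recover the stored filters, e.g. for the UI or an analysis app."""
    return record_details["filters"]


filters = {"age": {"min": 40}, "phenotype": "T2D"}  # hypothetical cohort filters
payload = query_record_input("project-xxxx", filters)
```

Re-hydrating then amounts to fetching the record's details and handing the stored filters back to the UI or app.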

Last edited by Elena Duranova, 2018-10-24 22:03:29