This page lists preprocessing steps that you may want to perform before loading the data.
- The raw data is expected to be a set of gVCF files -- one file per sample in the cohort.
- GLnexus is used to harmonize sites across all gVCFs and generate a single pVCF file
containing all harmonized sites and all genotypes for all samples.
dx run app-glnexus -i common.gvcf_manifest=<manifest_file_id> -i common.config=gatk_unfiltered -i common.targets_bed=<bed_target_ranges>
dx run workflow-glnexus -i common.gvcf_manifest=<manifest_file_id> -i common.config=gatk_unfiltered -i common.targets_bed=<bed_target_ranges> -i unify.shards_bed=<bed_genomic_partition_ranges> -i etl.shards=<num_sample_partitions>
The VCF data can include variant annotations. Of particular interest are SnpEff annotations, which
are included in VCFs as INFO/ANN tags. SnpEff annotations, if present, are loaded into Apollo databases.
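To see whether a pVCF already carries SnpEff annotations, one option is to look for an INFO/ANN declaration in the VCF header and split an ANN value into its sub-fields. The following is a minimal sketch, not part of the VCF loader: the function names `open_vcf`, `has_ann_header`, and `parse_ann` are hypothetical, and the sub-field order is assumed to follow the standard 16-field ANN layout that SnpEff writes.

```python
import gzip

# Sub-field names of the ANN tag, in the order assumed from the standard
# ANN annotation format written by SnpEff.
ANN_FIELDS = [
    "Allele", "Annotation", "Annotation_Impact", "Gene_Name", "Gene_ID",
    "Feature_Type", "Feature_ID", "Transcript_BioType", "Rank", "HGVS.c",
    "HGVS.p", "cDNA.pos/cDNA.length", "CDS.pos/CDS.length",
    "AA.pos/AA.length", "Distance", "ERRORS/WARNINGS/INFO",
]


def open_vcf(path):
    """Open a plain-text or gzip/bgzip-compressed VCF in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def has_ann_header(path):
    """Return True if the VCF header declares an INFO/ANN tag."""
    with open_vcf(path) as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # meta-information lines are over
            if line.startswith("##INFO=<ID=ANN,"):
                return True
    return False


def parse_ann(info):
    """Split the ANN entry of an INFO column into one dict per annotation."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            return [
                dict(zip(ANN_FIELDS, annotation.split("|")))
                for annotation in entry[len("ANN="):].split(",")
            ]
    return []  # no ANN entry on this record
```

For example, `has_ann_header("cohort.vcf.gz")` (a placeholder file name) reports whether the header declares ANN, and `parse_ann` can be applied to the INFO column of any data line.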
If desired, you can pre-annotate your VCF data with SnpEff annotations after harmonizing
your data: pass your pVCF to any standard SnpEff annotator. If your pVCF is especially large,
it may be advantageous to rely on the internal annotation step in the VCF loader instead of annotating
the pVCF yourself, because that step annotates the pVCF in a distributed, massively parallel way.
Note that the VCF loader does not persist the intermediate, annotated pVCF as a file, so if you want
access to the annotated file up front, you should annotate it yourself.
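If you choose to annotate the pVCF yourself, a basic SnpEff run takes a genome database name and an input VCF and prints the annotated VCF on standard output. The sketch below wraps that invocation; it assumes a local `snpEff.jar` and Java installation, and the helper names, jar path, genome database name, and memory setting are placeholders to adapt to your setup.

```python
import subprocess


def snpeff_command(jar_path, genome, pvcf_path, memory="8g"):
    """Build the argument list for a basic SnpEff annotation run."""
    # Equivalent shell form: java -Xmx8g -jar snpEff.jar <genome> <input.vcf>
    return ["java", f"-Xmx{memory}", "-jar", jar_path, genome, pvcf_path]


def annotate_pvcf(jar_path, genome, pvcf_path, out_path):
    """Run SnpEff and save the annotated VCF it prints on stdout to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(
            snpeff_command(jar_path, genome, pvcf_path),
            stdout=out,
            check=True,  # raise if SnpEff exits with an error
        )
```

A call such as `annotate_pvcf("snpEff.jar", "GRCh38.99", "cohort.vcf", "cohort.ann.vcf")` would then produce a file you can keep and inspect before loading; all four arguments here are illustrative.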
Figure. VCF annotation flows. In (a) the annotation step is external to the VCF loader, whereas
in (b) the annotation step is internal. In either case, SnpEff annotations present as INFO/ANN tags
are loaded into the database by the VCF loader.