VCF Loader ingests Variant Call Format (VCF) files into a Spark database that can
then be queried using Spark SQL.
The input for any run can be a single VCF file or many VCF files, but the
merged input must represent a single logical VCF file. In the multi-file case, the
logical VCF may be partitioned by chromosome, by genomic region, and/or by sample.
In either case, every input VCF file must be a syntactically correct, sorted VCF file.
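To illustrate the sortedness requirement, here is a minimal Python sketch of a checker that accepts a VCF only if positions are non-decreasing within each chromosome block and no chromosome block repeats. This is illustration only; the loader performs its own validation.

```python
# Sketch: verify that a VCF's records are coordinate-sorted per chromosome.
# Illustration only -- the loader does its own validation.
def is_sorted_vcf(lines):
    last_pos = {}        # chromosome -> last POS seen
    closed = set()       # chromosomes whose block has already ended
    prev_chrom = None
    for line in lines:
        if line.startswith("#"):
            continue     # skip header lines
        chrom, pos = line.split("\t")[:2]
        if chrom != prev_chrom:
            if chrom in closed:
                return False          # chromosome block appears twice
            if prev_chrom is not None:
                closed.add(prev_chrom)
            prev_chrom = chrom
        if int(pos) < last_pos.get(chrom, 0):
            return False              # positions must be non-decreasing
        last_pos[chrom] = int(pos)
    return True

records = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr1\t250\t.\tC\tT",
    "chr2\t90\t.\tG\tA",
]
print(is_sorted_vcf(records))  # True
```

A file that revisits a chromosome, or steps backward in position, fails the check.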
Although VCF data can be loaded into Apollo databases immediately after the variant call
step, the dataset may not be normalized for downstream analyses across large cohorts.
In that case, you'll want to preprocess and harmonize your data before loading.
See VCF Preprocessing.
How to Run VCF Loader
vcf_manifest -> a text file containing a list of the file IDs of the VCF files to load (one per line). The referenced files' names must be distinct and end in .vcf.gz. If more than one file is specified, the complete VCF file to load is considered partitioned, and every specified partition must itself be a valid VCF file. Moreover, after the partition-merge step in preprocessing, the complete VCF file must be valid.
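The manifest constraints above (one file ID per line, distinct names, names ending in .vcf.gz) can be sketched as a quick pre-flight check. The file IDs and names below are placeholders, not real platform IDs.

```python
# Sketch: assemble and sanity-check a VCF manifest before upload.
# The IDs and names are placeholders; the loader resolves file IDs itself.
def build_manifest(entries):
    """entries: list of (file_id, file_name) pairs -- one manifest line per ID."""
    names = [name for _, name in entries]
    assert all(name.endswith(".vcf.gz") for name in names), "names must end in .vcf.gz"
    assert len(set(names)) == len(names), "referenced file names must be distinct"
    return "\n".join(file_id for file_id, _ in entries) + "\n"

manifest = build_manifest([
    ("file-aaaa", "cohort.chr1.vcf.gz"),
    ("file-bbbb", "cohort.chr2.vcf.gz"),
])
print(manifest)  # file-aaaa\nfile-bbbb\n
```

The resulting text would be uploaded and its file ID passed as vcf_manifest.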
database_name -> name of the database into which to load the VCF files.
create_mode -> strict mode creates the database and tables from scratch, while optimistic mode creates the database and tables only if they do not already exist.
insert_mode -> append appends data to the end of tables, while overwrite is equivalent to truncating the tables and then appending to them.
site mode processes only the site-specific data, genotype mode processes genotype-specific data and other non-site-specific data, and all mode processes both types of data.
etl_spec_id -> schema choice; currently only genomics-phenotype is supported.
is_sample_partitioned -> boolean indicating whether the raw VCF data is partitioned by sample.
snpeff -> default true -- whether to include the SnpEff annotation step in preprocessing, which adds INFO/ANN tags. If SnpEff annotations are desired in the database, either pre-annotate the raw VCF separately or include this SnpEff annotation step -- it is not necessary to do both.
snpeff_human_genome -> default GRCh38.92 -- ID of the SnpEff human genome to use in the SnpEff annotation step in preprocessing.
snpeff_opt_no_upstream -> default false -- exclude SnpEff upstream gene annotations in the SnpEff annotation step (equivalent to SnpEff's -no-upstream option). Note that this option does not filter pre-calculated annotations outside of the SnpEff annotation step.
snpeff_opt_no_downstream -> default false -- exclude SnpEff downstream gene annotations in the SnpEff annotation step (equivalent to SnpEff's -no-downstream option). Note that this option does not filter pre-calculated annotations outside of the SnpEff annotation step.
calculate_worst_effects -> default false -- whether to include the SnpEff worst-effects annotation step in preprocessing, which adds the SnpEff worst effects for every alternate-allele--gene combination as INFO/ANN_WORST tags (Number "A"). Note that this option automatically filters SnpEff annotations to exclude 'feature_type!=transcript', 'transcript_biotype!=protein_coding', 'effect=upstream_gene_variant' and 'effect=downstream_gene_variant'.
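For intuition, here is a minimal sketch of selecting one worst effect per alternate-allele--gene pair after the filtering described above. The severity ordering and dictionary field names are illustrative assumptions, not SnpEff's actual impact ranking or ANN field layout.

```python
# Sketch: pick a single "worst" effect per (alt allele, gene) pair.
# SEVERITY is an assumed, illustrative ordering (lower = more severe),
# not SnpEff's real impact ranking.
SEVERITY = {"stop_gained": 0, "missense_variant": 1, "synonymous_variant": 2}

def worst_effects(annotations):
    """annotations: list of dicts with keys allele, gene, effect,
    feature_type, transcript_biotype -- loosely modeled on INFO/ANN entries."""
    kept = [
        a for a in annotations
        if a["feature_type"] == "transcript"
        and a["transcript_biotype"] == "protein_coding"
        and a["effect"] not in ("upstream_gene_variant", "downstream_gene_variant")
    ]
    worst = {}
    for a in kept:
        key = (a["allele"], a["gene"])
        if key not in worst or SEVERITY[a["effect"]] < SEVERITY[worst[key]["effect"]]:
            worst[key] = a
    return worst

anns = [
    {"allele": "T", "gene": "BRCA1", "effect": "synonymous_variant",
     "feature_type": "transcript", "transcript_biotype": "protein_coding"},
    {"allele": "T", "gene": "BRCA1", "effect": "missense_variant",
     "feature_type": "transcript", "transcript_biotype": "protein_coding"},
    {"allele": "T", "gene": "BRCA1", "effect": "upstream_gene_variant",
     "feature_type": "transcript", "transcript_biotype": "protein_coding"},
]
print(worst_effects(anns)[("T", "BRCA1")]["effect"])  # missense_variant
```

The upstream annotation is filtered out, and the more severe of the remaining two effects wins.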
calculate_locus_frequencies -> default false -- whether to include the locus-level frequencies annotation step in preprocessing, which adds locus-level allele and genotype frequencies as INFO tags.
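For intuition, here is a small sketch of locus-level allele and genotype frequencies computed from diploid genotype calls. The input format and frequency keys are assumptions for illustration; the loader's actual INFO tag names may differ.

```python
# Sketch: locus-level allele and genotype frequencies from diploid calls.
# Illustrates what the annotation step computes; not the loader's actual code.
from collections import Counter

def locus_frequencies(genotypes):
    """genotypes: diploid calls like "0/1" (missing calls assumed excluded)."""
    alleles = Counter()
    gts = Counter()
    for gt in genotypes:
        a, b = sorted(gt.split("/"))   # normalize 1/0 and 0/1 to one key
        alleles[a] += 1
        alleles[b] += 1
        gts[f"{a}/{b}"] += 1
    n_alleles = sum(alleles.values())
    n_gts = sum(gts.values())
    return (
        {a: c / n_alleles for a, c in alleles.items()},   # allele frequencies
        {g: c / n_gts for g, c in gts.items()},           # genotype frequencies
    )

af, gf = locus_frequencies(["0/0", "0/1", "1/1", "0/1"])
print(af)  # {'0': 0.5, '1': 0.5}
print(gf)  # {'0/0': 0.25, '0/1': 0.5, '1/1': 0.25}
```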
snpsift -> default true -- whether to include the SnpSift/dbSNP annotation step in preprocessing. The SnpSift/dbSNP annotation step adds dbSNP ID annotations to the INFO/RSID tag (Number "A"), which is required in the ETL stage. If the raw VCF is already pre-annotated, then this annotation step is not necessary.
num_init_partitions -> integer defining the number of partitions for the initial VCF-lines Spark RDD.
dx run vcf-loader \
  -i vcf_manifest=file-xxxx \
  -i is_sample_partitioned=false \
  -i database_name=<my_favorite_db> \
  -i etl_spec_id=genomics-phenotype \
  -i create_mode=strict \
  -i insert_mode=append