dx-toolkit
The dx-toolkit provides command-line tools for working with the DNAnexus Platform. They are very useful, so it is worth spending some time reading the documentation.
Installation
DNAnexus provides instructions on how to install it using pip3 here. However, I prefer following the St. Jude Cloud docs and installing it in its own conda environment. Briefly, if you have conda/miniconda installed on your local computer, create a dx conda environment and pip install dxpy:
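The setup might look like the following; this is a minimal sketch assuming conda/miniconda is already installed, and the environment name `dx` and the Python version are arbitrary choices:

```shell
# Create an isolated conda environment for the dx-toolkit
# (the environment name and Python version are arbitrary)
conda create -n dx python=3.10 -y
conda activate dx

# Install the dx-toolkit into this environment
pip install dxpy

# Verify the installation, then log in to the platform
dx --version
dx login
```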
Interactive CLI Analysis
The dx-toolkit sometimes has conflicts with the organizational VPN. Alternatively, we can start a DNAnexus ttyd app (see DNAnexus tools) to use a Unix shell in the browser. One advantage is that you can also launch HTTPS apps such as RStudio Server from this app. Cloud Workstation is another option, but it requires installing the dx-toolkit on your local computer to ssh into it.
Usage
The dx-toolkit is a convenient way to interact with the DNAnexus cloud platform, and with commands prefixed with `dx` it can almost feel like working on an HPC cluster. It can be utilized in various ways:
- Data Management: For example, using the `dx upload` and `dx download` command-line tools to transfer files, as well as the `dx describe` and `dx set_properties` commands to manage metadata associated with the files.
- Workflow Execution: We can use the `dx find apps` and `dx run` commands to find and run apps from the DNAnexus app store, the `dx new workflow` command to create a new workflow with custom apps, and the `dx wait` and `dx watch` commands to monitor the progress of the workflow execution.
- Custom App Development: using the `dx build` and `dx add app` commands to build DNAnexus Apps (more on this in later chapters).
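As a concrete illustration of the data-management commands above, the sketch below only prints the `dx` invocations it would run (a dry run), since actually executing them requires a logged-in platform session; the file name, project name, and property key are illustrative only:

```shell
# Dry-run sketch: compose (but do not execute) the dx commands for
# uploading a file and attaching metadata to it.
make_upload_cmds() {
  local file="$1" project="$2"
  # Upload the file into the project's /data/ folder
  echo "dx upload ${file} --destination ${project}:/data/"
  # Attach a key=value property to the uploaded file
  echo "dx set_properties ${project}:/data/${file} batch=2024-01"
}

make_upload_cmds AMR.bed demo_data
```

Dropping the `echo`s turns this into the real thing; `dx set_properties` takes a file path followed by one or more `key=value` pairs.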
Cost
Currently, we cannot set a spending limit for a project on DNAnexus, so a good practice is to set a cost limit for each analysis, as shown below. The same command in a bash script is also on the Confluence GitHub.
{% hint style="info" %} We recommend testing your script on small sample datasets with `--cost-limit` before running it on the full dataset. {% endhint %}
```shell
# adding --cost-limit to limit the cost of an analysis
dx run app-swiss-army-knife \
  -iin=demo_data:Data/AMR.bed \
  -iin=demo_data:Data/AMR.bim \
  -iin=demo_data:Data/AMR.fam \
  -icmd="plink2 --bfile AMR --hwe 1e-5 --make-bed --out AMR_hwe" \
  --destination demo_analysis:/results -y \
  --cost-limit 10
```
A simple example of submitting parallel jobs and using a Docker image
```shell
# This example loops through 3 chromosomes and runs the R script on
# each one by passing the ${chrom} variable to dx run. It also shows
# that you can pass a Docker image so that the job runs inside the
# container.
for chrom in 1 2 3; do
  dx run app-swiss-army-knife \
    -iimage="shukwong/rstudio_with_gwas_tools:fd324eb9d3d2117fc37a157b66fa371a53693442" \
    -iin=scripts/test_writing_output.R \
    -icmd="Rscript test_writing_output.R ${chrom}" -y
done
```