Skip to content

Instantly share code, notes, and snippets.

View aofarrel's full-sized avatar
🧬

Ash O'Farrell aofarrel

🧬
View GitHub Profile

Querying SRA by/for filename

The Good Method: Using Enterz Direct

  1. Get Entrez Direct (EDirect) Utilities working on your computer, or use my handy docker image (basically any version will work). My Docker image comes with ssh so you can upload the results out of the container with relative ease.
  2. esearch sra with your query and pipe that to efetch in XML format (see example below)
  3. You now have an XML file with files in it. Throw it at my Ranchero script to parse it for you.

Example esearch query: esearch -db sra -query 'PRJNA701308[bioproject]' | efetch -format native -mode xml

@aofarrel
aofarrel / wrapper_example.wdl
Created December 28, 2024 22:26
Old example of a(n unnecessary) wrapper
version 1.0
import "https://raw.githubusercontent.com/aofarrel/myco/main/myco_simple.wdl" as main
# This is just a one-sample wrapper for myco_simple. It is intended for Terra data tables with a format like this:
#
# | entity:sample_id | FASTQ_forward | FASTQ_reverse | decontaminated_fastq_1 | decontaminated_fastq_2 |
# |------------------|---------------------------|---------------------------|------------------------|------------------------|
# | sampleA | raw_A_r1.fq | raw_A_r2.fq | decontam_A_r1.fq | decontam_A_r2.fq |
# | sampleB | raw_B1_r1.fq, rawB2_r1.fq | raw_B1_r2.fq, rawB2_r2.fq | decontam_B_r1.fq | decontam_B_r2.fq |
# Notes:
# * Works for SAMN, SAME, and SAMD BioSamples (should also work for SRS/ERS format)
# * Grabs date and location of sample isolation, host, sample source, and strain
# * edirect tools needs to be on $PATH, or you can use my Docker image for a pre-installed version: https://hub.docker.com/r/ashedpotatoes/sranwrp/tags
# * elink is known to randomly fail so this code doesn't use it -- however, without elink, you can't get the SRA reads (SRR/ERR/DRR) that make up a BioSample
import subprocess
biosamples = ["SAMEA6451356","SAMEA6451357","SAMEA6451358","SAMEA6451360","SAMEA6451361","SAMEA6451362","SAMEA6451363","SAMEA6451366","SAMEA6451367","SAMEA6451368","SAMEA6451369","SAMEA6451370","SAMEA6451371","SAMEA6451372","SAMEA6451373","SAMEA6451374","SAMEA6451375","SAMEA6451377"]

If the input variable is an array, we must include an array separator. In WDL 1.0, this is done using the sep= expression placeholder. Every value in the WDL Array[String] will be separated by whatever value is declared via sep. In this example, that is a simple space, as that is one way how to construct a bash variable.

task count_words {
  input {
    Array[String] a_big_sentence
  }
  command <<<
    ARRAY_OF_WORDS=(~{sep=" " a_big_sentence})
 echo ${#ARRAY_OF_FILES[@]} &gt;&gt; length.txt
@aofarrel
aofarrel / TBProf_SAMEA11744634.txt
Created July 14, 2023 17:05
TBProfiler output when run on myco_sra for BioSample SAMEA11744634. TBProfiler was run on the fastqs this time, not the bams.
TBProfiler report
=================
The following report has been generated by TBProfiler.
Summary
-------
ID: SAMEA11744634
Date: Mon Jul 10 17:53:25 2023
Strain: lineage4.2.2.2
@aofarrel
aofarrel / script
Created June 7, 2023 19:28
Pulled from the folder of a run of a decontamination task (the first one listed here: https://github.com/aofarrel/clockwork-wdl/blob/main/tasks/combined_decontamination.wdl) which ran on Terra. The file is known as "script" with no extension, but has a bash shebang.
#!/bin/bash
cd /cromwell_root
tmpDir=$(mkdir -p "/cromwell_root/tmp.65940862" && echo "/cromwell_root/tmp.65940862")
chmod 777 "$tmpDir"
export _JAVA_OPTIONS=-Djava.io.tmpdir="$tmpDir"
export TMPDIR="$tmpDir"
export HOME="$HOME"
(
cd /cromwell_root
import os
import json
lazy_array_of_strings_to_write = []
for file in os.listdir("rand12344"):
if file.endswith(".json"):
sample = file.rstrip("._to_Ref.H37Rv.bam.results.json")
with open(f"rand12344/{file}") as thisjson:
data = json.load(thisjson)
strain = data["sublin"]
lazy_array_of_strings_to_write.append(f"{sample}\t{strain}\n")
@aofarrel
aofarrel / batch_download_by_shard_from_terra.py
Created October 28, 2022 19:13
Download files per shard from Terra.
import os
vcfs = []
bgas = []
diffs = []
for vcf in vcfs:
for bga in bgas:
for diff in diffs:
split = diff.split("/")
@aofarrel
aofarrel / batch_file_rename_example.py
Created October 25, 2022 20:52
Batch renaming + download of several files in google cloud. This was needed because, when downloading with the gcloud CLI (as opposed to the web interface) google just saves the file as "final.vcf" without the full gs:// path. So, trying to batch download a dozen final.vcf files doesn't work.
import os
vcfs = ["gs://your_bucket_here/call-varcall/shard-1/var_call_SRR1049633/final.vcf", "gs://your_bucket_here/call-varcall/shard-2/var_call_SRR1049634/final.vcf", "gs://your_bucket_here/call-varcall/shard-3/var_call_SRR1062904/final.vcf", "gs://your_bucket_here/call-varcall/shard-4/var_call_SRR1140590/final.vcf", "gs://your_bucket_here/call-varcall/shard-5/var_call_SRR1047996/final.vcf", "gs://your_bucket_here/call-varcall/shard-6/var_call_SRR1144759/final.vcf", "gs://your_bucket_here/call-varcall/shard-7/var_call_SRR1165658/final.vcf", "gs://your_bucket_here/call-varcall/shard-8/var_call_SRR1167377/final.vcf", "gs://your_bucket_here/call-varcall/shard-9/var_call_SRR1169242/final.vcf"]
new_vcfs = []
for vcf in vcfs:
split = vcf.split("/")
new_name_start = "/".join(split[:-1])+"/"
new_name_end = split[-2:][0] + ".vcf"
new_name = "".join([new_name_start, new_name_end])
new_vcfs.append(new_name)
@aofarrel
aofarrel / .dockstore.yml
Created January 24, 2022 21:02
Example .dockstore.yml file. This file is embedded into a Medium post, so please don't delete it.
version: 1.2
workflows:
- subclass: WDL
primaryDescriptorPath: goleft_functions.wdl
testParameterFiles:
- goleft_terra.json