Ash O'Farrell aofarrel

Querying SRA by/for filename

The Good Method: Using Enterz Direct

Get Entrez Direct (EDirect) Utilities working on your computer, or use my handy docker image (basically any version will work). My Docker image comes with ssh so you can upload the results out of the container with relative ease.
esearch sra with your query and pipe that to efetch in XML format (see example below)
You now have an XML file with files in it. Throw it at my Ranchero script to parse it for you.

Example esearch query: esearch -db sra -query 'PRJNA701308[bioproject]' | efetch -format native -mode xml

If the input variable is an array, we must include an array separator. In WDL 1.0, this is done using the sep= expression placeholder. Every value in the WDL Array[String] will be separated by whatever value is declared via sep. In this example, that is a simple space, as that is one way how to construct a bash variable.

task count_words {
  input {
    Array[String] a_big_sentence
  }
  command <<<
    ARRAY_OF_WORDS=(~{sep=" " a_big_sentence})
 echo ${#ARRAY_OF_FILES[@]} &gt;&gt; length.txt

	version 1.0

	import "https://raw.githubusercontent.com/aofarrel/myco/main/myco_simple.wdl" as main

	# This is just a one-sample wrapper for myco_simple. It is intended for Terra data tables with a format like this:
	#
	# \| entity:sample_id \| FASTQ_forward \| FASTQ_reverse \| decontaminated_fastq_1 \| decontaminated_fastq_2 \|
	# \|------------------\|---------------------------\|---------------------------\|------------------------\|------------------------\|
	# \| sampleA \| raw_A_r1.fq \| raw_A_r2.fq \| decontam_A_r1.fq \| decontam_A_r2.fq \|
	# \| sampleB \| raw_B1_r1.fq, rawB2_r1.fq \| raw_B1_r2.fq, rawB2_r2.fq \| decontam_B_r1.fq \| decontam_B_r2.fq \|

	# Notes:
	# * Works for SAMN, SAME, and SAMD BioSamples (should also work for SRS/ERS format)
	# * Grabs date and location of sample isolation, host, sample source, and strain
	# * edirect tools needs to be on $PATH, or you can use my Docker image for a pre-installed version: https://hub.docker.com/r/ashedpotatoes/sranwrp/tags
	# * elink is known to randomly fail so this code doesn't use it -- however, without elink, you can't get the SRA reads (SRR/ERR/DRR) that make up a BioSample


	import subprocess

	biosamples = ["SAMEA6451356","SAMEA6451357","SAMEA6451358","SAMEA6451360","SAMEA6451361","SAMEA6451362","SAMEA6451363","SAMEA6451366","SAMEA6451367","SAMEA6451368","SAMEA6451369","SAMEA6451370","SAMEA6451371","SAMEA6451372","SAMEA6451373","SAMEA6451374","SAMEA6451375","SAMEA6451377"]

	TBProfiler report
	=================

	The following report has been generated by TBProfiler.

	Summary
	-------
	ID: SAMEA11744634
	Date: Mon Jul 10 17:53:25 2023
	Strain: lineage4.2.2.2

	#!/bin/bash

	cd /cromwell_root
	tmpDir=$(mkdir -p "/cromwell_root/tmp.65940862" && echo "/cromwell_root/tmp.65940862")
	chmod 777 "$tmpDir"
	export _JAVA_OPTIONS=-Djava.io.tmpdir="$tmpDir"
	export TMPDIR="$tmpDir"
	export HOME="$HOME"
	(
	cd /cromwell_root

	import os
	import json
	lazy_array_of_strings_to_write = []
	for file in os.listdir("rand12344"):
	if file.endswith(".json"):
	sample = file.rstrip("._to_Ref.H37Rv.bam.results.json")
	with open(f"rand12344/{file}") as thisjson:
	data = json.load(thisjson)
	strain = data["sublin"]
	lazy_array_of_strings_to_write.append(f"{sample}\t{strain}\n")

	import os

	vcfs = []
	bgas = []
	diffs = []

	for vcf in vcfs:
	for bga in bgas:
	for diff in diffs:
	split = diff.split("/")

	import os

	vcfs = ["gs://your_bucket_here/call-varcall/shard-1/var_call_SRR1049633/final.vcf", "gs://your_bucket_here/call-varcall/shard-2/var_call_SRR1049634/final.vcf", "gs://your_bucket_here/call-varcall/shard-3/var_call_SRR1062904/final.vcf", "gs://your_bucket_here/call-varcall/shard-4/var_call_SRR1140590/final.vcf", "gs://your_bucket_here/call-varcall/shard-5/var_call_SRR1047996/final.vcf", "gs://your_bucket_here/call-varcall/shard-6/var_call_SRR1144759/final.vcf", "gs://your_bucket_here/call-varcall/shard-7/var_call_SRR1165658/final.vcf", "gs://your_bucket_here/call-varcall/shard-8/var_call_SRR1167377/final.vcf", "gs://your_bucket_here/call-varcall/shard-9/var_call_SRR1169242/final.vcf"]
	new_vcfs = []
	for vcf in vcfs:
	split = vcf.split("/")
	new_name_start = "/".join(split[:-1])+"/"
	new_name_end = split[-2:][0] + ".vcf"
	new_name = "".join([new_name_start, new_name_end])
	new_vcfs.append(new_name)

	version: 1.2
	workflows:
	- subclass: WDL
	primaryDescriptorPath: goleft_functions.wdl
	testParameterFiles:
	- goleft_terra.json