@aofarrel
Last active May 28, 2025 23:59

Querying SRA by/for filename

The Good Method: Using Entrez Direct

  1. Get Entrez Direct (EDirect) Utilities working on your computer, or use my handy docker image (basically any version will work). My Docker image comes with ssh so you can upload the results out of the container with relative ease.
  2. Run esearch against the sra database with your query and pipe the result to efetch in XML format (see example below)
  3. You now have XML output that lists the files for each run (redirect it to a file if you want to keep it). Throw it at my Ranchero script to parse it for you.

Example esearch query:

esearch -db sra -query 'PRJNA701308[bioproject]' | efetch -format native -mode xml
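
If you just want a quick look at the filenames without the Ranchero script, something like the sketch below may be enough. It is not the Ranchero script and makes no claim to match its behavior. It assumes the efetch output is a single well-formed XML document in which each run is a RUN element with an accession attribute and each file is a nested SRAFile element with a filename attribute. The function name filenames_by_run and the SAMPLE_XML snippet (including its accession and filenames) are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up miniature of the assumed layout, NOT real SRA output.
SAMPLE_XML = """
<EXPERIMENT_PACKAGE_SET>
  <EXPERIMENT_PACKAGE>
    <RUN_SET>
      <RUN accession="SRR0000001">
        <SRAFiles>
          <SRAFile filename="a.bam"/>
          <SRAFile filename="b.bam"/>
        </SRAFiles>
      </RUN>
    </RUN_SET>
  </EXPERIMENT_PACKAGE>
</EXPERIMENT_PACKAGE_SET>
"""


def filenames_by_run(xml_text: str) -> dict[str, list[str]]:
    """Map each run accession to the filenames listed under it."""
    root = ET.fromstring(xml_text)
    found = defaultdict(list)
    # Assumption: runs are <RUN accession="..."> elements and files are
    # nested <SRAFile filename="..."> elements somewhere below each run.
    for run in root.iter("RUN"):
        accession = run.get("accession")
        if accession is None:
            continue
        for sra_file in run.iter("SRAFile"):
            name = sra_file.get("filename")
            # Skip missing names and avoid listing the same name twice.
            if name and name not in found[accession]:
                found[accession].append(name)
    return dict(found)


if __name__ == "__main__":
    for acc, names in filenames_by_run(SAMPLE_XML).items():
        print(acc, names)
```

To use it on real data, redirect the esearch/efetch output to a file (for example by appending > runs.xml to the command above) and pass the file's contents to filenames_by_run. Because it collects every file under a run, multi-file run accessions are not a problem here.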

The Bad Method: Using BigQuery

Warning

This approach is platform-agnostic, but it WILL BREAK for most multi-file run accessions. For example, BigQuery lists only 08_17_21_R941_HG02809_1_Guppy_6.5.7_450bps_modbases_5mc_cg_sup_prom_fail.bam for SRR26313862, but if you look on SRA's website, you can see six bams were submitted to that accession. Five of those six bams would not match using this method. To restate: if you are certain that every run accession is associated with PRECISELY ONE file, this should work; in any other case it will not.

  1. Do your search on BQ to cover the full range of samples -- typically the easiest way to do this is searching by BioProject
  2. Download the BQ output as newline-delimited JSON
  3. pip install polars, pandas, pyarrow, and tqdm (versions shouldn't matter for this use case)
  4. git clone my ranchero repo (it's not pip installable yet because it relies on very specific behavior in one of its dependencies, but this use case does not rely on that behavior at all)
  5. Slap this script into ranchero's basedir
  6. Run pull_SRA_meta_from_filename.py <bq_file> <output_tsv>

Example BigQuery search:

SELECT r.*
FROM `nih-sra-datastore.sra.metadata` r
WHERE r.bioproject IN ('PRJNA731524', 'PRJNA701308')
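
For a rough idea of what the filename lookup amounts to, here is a minimal sketch that scans the newline-delimited JSON export for a filename. It is not pull_SRA_meta_from_filename.py and does not use ranchero. It assumes each record stores the run accession in a field named acc, and it makes no assumption about which field holds the filename: it simply checks every string value in the record. The function names and the SAMPLE_LINES records are made up for illustration.

```python
import json

# Made-up records for illustration, NOT real BigQuery output.
SAMPLE_LINES = [
    '{"acc": "SRR0000001", "attributes": [{"k": "file", "v": "reads_a.bam"}]}',
    '{"acc": "SRR0000002", "attributes": [{"k": "file", "v": "reads_b.bam"}]}',
]


def _strings(value):
    """Yield every string found anywhere inside a decoded JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from _strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from _strings(item)


def runs_matching_filename(ndjson_lines, filename):
    """Return the acc of each record that mentions `filename` anywhere."""
    matches = []
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        record = json.loads(line)
        # Substring match, so this also catches filenames embedded in
        # longer strings (such as a JSON blob stored as text).
        if any(filename in text for text in _strings(record)):
            # Assumption: the run accession lives in a field called "acc".
            matches.append(record.get("acc"))
    return matches


if __name__ == "__main__":
    print(runs_matching_filename(SAMPLE_LINES, "reads_a.bam"))
```

With a real export you would pass an open file handle instead of SAMPLE_LINES. Note that this is a substring match, so a short filename can also match longer names that contain it, and the warning above still applies: a file that BigQuery never lists for its run accession cannot be found this way.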

Other stuff I tried that didn't work

  • Searching NCBI Run Selector for a BioProject and then saving the resulting table --> This doesn't have any filenames at all
  • sra-tools --> You could dump the files (if they are fastqs) and then look at what you end up with, but doing that takes ages. There doesn't seem to be a way to get just a filename list.