- Get Entrez Direct (EDirect) Utilities working on your computer, or use my handy docker image (basically any version will work). My Docker image comes with ssh so you can upload the results out of the container with relative ease.
- Run esearch against the sra database with your query and pipe that to efetch in XML format (see example below)
- You now have an XML file with the filenames in it. Throw it at my Ranchero script to parse it for you.
Example esearch query:
esearch -db sra -query 'PRJNA701308[bioproject]' | efetch -format native -mode xml
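If you want to eyeball what that XML contains before parsing it properly, here is a minimal Python sketch. It is not the Ranchero script, just an illustration. It assumes the efetch output wraps each run in a `RUN` element with an `accession` attribute and lists files as `SRAFile` elements with a `filename` attribute; the `SAMPLE` string, the accessions in it, and both function names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for efetch output; real records carry many more elements.
SAMPLE = """<EXPERIMENT_PACKAGE_SET><EXPERIMENT_PACKAGE><RUN_SET>
<RUN accession="SRR0000001"><SRAFiles>
  <SRAFile filename="a_pass.bam"/>
  <SRAFile filename="a_fail.bam"/>
</SRAFiles></RUN>
<RUN accession="SRR0000002"><SRAFiles>
  <SRAFile filename="b.fastq.gz"/>
</SRAFiles></RUN>
</RUN_SET></EXPERIMENT_PACKAGE></EXPERIMENT_PACKAGE_SET>"""


def files_per_run(xml_text):
    """Map each run accession to the filenames listed under it."""
    root = ET.fromstring(xml_text)
    runs = {}
    for run in root.iter("RUN"):
        acc = run.get("accession")
        if not acc:
            continue  # skip runs with no accession attribute
        names = runs.setdefault(acc, [])
        # Assumed layout: SRAFile elements nested somewhere below RUN.
        for sra_file in run.iter("SRAFile"):
            name = sra_file.get("filename")
            if name and name not in names:
                names.append(name)
    return runs


def multi_file_runs(runs):
    """Return the accessions that have more than one file attached."""
    return sorted(acc for acc, names in runs.items() if len(names) > 1)


print(files_per_run(SAMPLE))
print(multi_file_runs(files_per_run(SAMPLE)))
```

To run it on real data, redirect the efetch output to a file and pass that file's contents to `files_per_run` instead of `SAMPLE`.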
Warning
This approach is platform-agnostic, but it WILL BREAK for most multi-file run accessions. For example, SRR26313862 stores only 08_17_21_R941_HG02809_1_Guppy_6.5.7_450bps_modbases_5mc_cg_sup_prom_fail.bam in BigQuery, but if you look on SRA's website, you can see six bams were submitted to that accession. Five of those six bams would not match using this method.
To restate: if you are certain that every run accession is associated with PRECISELY ONE file, this should work. In any other case it will not.
- Do your search on BQ to cover the full range of samples -- typically the easiest way to do this is searching by a bioproject
- Download BQ output as newline delimited JSON
- pip install polars, pandas, pyarrow, and tqdm (versions shouldn't matter for this use case)
- git clone my ranchero repo (it's not pip installable yet due to relying on very specific behavior in one of its dependencies, but this use case does not rely upon that behavior at all)
- Slap this script into ranchero's basedir
pull_SRA_meta_from_filename.py <bq_file> <output_tsv>
Example BigQuery search:
SELECT r.*
FROM `nih-sra-datastore.sra.metadata` r
WHERE r.bioproject IN ('PRJNA731524', 'PRJNA701308')
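For a rough idea of what matching filenames against the downloaded newline-delimited JSON involves, here is a small Python sketch using only the standard library. It is not pull_SRA_meta_from_filename.py and does not use ranchero. It assumes each record keeps the run accession under an `acc` key, and because it does not assume which field holds the filename, it simply searches the whole record for it. Both function names are hypothetical.

```python
import json


def load_records(lines):
    """Parse newline-delimited JSON from any iterable of lines (e.g. an open file)."""
    return [json.loads(line) for line in lines if line.strip()]


def runs_matching(records, filename, acc_key="acc"):
    """Return the accessions of records that mention `filename` anywhere.

    The record is re-serialized and searched as text, so the match also
    works when the filename sits inside a nested or string-encoded field.
    acc_key="acc" is an assumption about the export's column name.
    """
    return [
        rec.get(acc_key)
        for rec in records
        if filename in json.dumps(rec, ensure_ascii=False)
    ]


# Usage sketch (the path and filename are placeholders):
# with open("bq_output.json", encoding="utf-8") as fh:
#     records = load_records(fh)
# print(runs_matching(records, "some_submitted_file.bam"))
```

This is a plain substring match, so a short filename can also hit longer names that contain it. An empty result for a file you know was submitted is exactly the multi-file failure described in the warning above.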
- Searching NCBI Run Selector for a BioProject and then saving the resulting table --> This doesn't have any filenames at all
- sra-tools --> You could dump the files (if they are fastqs) and then look at what you end up with, but doing that takes ages. There doesn't seem to be a way to get just a filename list.