Skip to content

Instantly share code, notes, and snippets.

@rhine3
Last active July 27, 2017 21:58
Show Gist options
  • Select an option

  • Save rhine3/2317391257eefaeb8fe61520148c4b91 to your computer and use it in GitHub Desktop.

Select an option

Save rhine3/2317391257eefaeb8fe61520148c4b91 to your computer and use it in GitHub Desktop.
eco-evo_intro-to-bioinfo

Introduction to Bioinformatics - EcoEvo17

Goal: trim and map paired-end read files.

Trimming

Each sequenced nucleotide has a quality associated in a .fastq file. To be sure that we are working with high-quality data, trim off low-quality reads.

To trim, use a program called trim_galore. This program can take several different parameters, specified as "flags":

  • --paired - indicates that we are trimming paired-end reads
  • length 70 - indicates we want to keep only reads that are longer than 70bp
  • --quality 30 - indicates we want to keep only reads with a quality score greater than (or equal to?) 30

To submit this job to the cluster, make a file called trim_submit.sh:

# first line tells what should execute the script, e.g. this line says execute on the command line
#!/bin/bash

# specify number of nodes and processes for node; e.g. 16
#PBS -l nodes=1:ppn=1

# specify the time you expect the job to run
#PBS -l walltime=2:00:00

# optional: specify output and error files; hh:mm:ss
#PBS -o myout.o$PBS_JOBID
#PBS -e myout.e$PBS_JOBID

# load source
source ~/.bashrc

# move to current directory
cd $PBS_O_WORKDIR

# Arguments are paths to paired-end read files, read1.fastq and read2.fastq.
trim_galore --paired --length 70 --quality 30 $1 $2

Notice that the last line has $1 $2 at the end. These indicate that when we use this file on the command line, we will need to specify two arguments:

qsub trim_script_example.sh -F "/home/qbiodata/morbidostat/PA83/v00/d00/read1_sample.fastq /home/qbiodata/morbidostat/PA83/v00/d00/read2_sample.fastq"

These filenames are long, ugly, and hard to type. You can also make it easier to submit files by making links using the ln -s command with the syntax ln -s /path/of/file/to/link/to /path/of/link

ln -s /home/qbiodata/morbidostat/PA83/v03/d14/PA83v03d14_GCCAAT_L003_R1_001.fastq.gz ./read1.fastq.gz
ln -s /home/qbiodata/morbidostat/PA83/v03/d14/PA83v03d14_GCCAAT_L003_R2_001.fastq.gz ./read2.fastq.gz

After making these links, the syntax becomes:

qsub trim_script_example.sh -F "read1.fastq.gz read2.fastq.gz"

Mapping

To map a file to a reference file, first make an index of the reference file. This command copies the file to your current directory, then makes an index of the file in the current directory. (You can also make )

cp /home/qbiodata/morbidostat/PA83/reference/PSMA83.fasta .
bwa index PSMA83.fasta

After doing that, make a qsub file for bwa with the following text. BWA is a popular alignment tool.

# first line tells what should execute the script, e.g. this line says execute on the command line
!/bin/bash

# specify number of nodes and processes for node; e.g. 16
#PBS -l nodes=1:ppn=1

# specify the time you expect the job to run
#PBS -l walltime=2:00:00

# optional: specify output and error files; hh:mm:ss
#PBS -o myout.o$PBS_JOBID
#PBS -e myout.e$PBS_JOBID

# load source
source ~/.bashrc

# move to current directory
cd $PBS_O_WORKDIR

# this next line runs bwa, mapping to reads in the file specified (./PSMA83.fasta)
# take arguments in this format: -F "readfile1_val_1.fq.gz readfile2_val_2.fq.gz output.sam"
bwa mem -t 8 ./PSMA83.fasta $1 $2 > $3

Run this qsub command in the command line:

qsub map_script_example.sh -F "PA83v03d14_GCCAAT_L003_R1_001_val_1.fq.gz PA83v03d14_GCCAAT_L003_R2_001_val_2.fq.gz map.sam"
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment