Goal: trim and map paired-end read files.
Each sequenced nucleotide has a quality associated in a .fastq file. To be sure that we are working with high-quality data, trim off low-quality reads.
To trim, use a program called trim_galore. This program can take several different parameters, specified as "flags":
--paired- indicates that we are trimming paired-end readslength 70- indicates we want to keep only reads that are longer than 70bp--quality 30- indicates we want to keep only reads with a quality score greater than (or equal to?) 30
To submit this job to the cluster, make a file called trim_submit.sh:
# first line tells what should execute the script, e.g. this line says execute on the command line
#!/bin/bash
# specify number of nodes and processes for node; e.g. 16
#PBS -l nodes=1:ppn=1
# specify the time you expect the job to run
#PBS -l walltime=2:00:00
# optional: specify output and error files; hh:mm:ss
#PBS -o myout.o$PBS_JOBID
#PBS -e myout.e$PBS_JOBID
# load source
source ~/.bashrc
# move to current directory
cd $PBS_O_WORKDIR
# Arguments are paths to paired-end read files, read1.fastq and read2.fastq.
trim_galore --paired --length 70 --quality 30 $1 $2
Notice that the last line has $1 $2 at the end. These indicate that when we use this file on the command line, we will need to specify two arguments:
qsub trim_script_example.sh -F "/home/qbiodata/morbidostat/PA83/v00/d00/read1_sample.fastq /home/qbiodata/morbidostat/PA83/v00/d00/read2_sample.fastq"
These filenames are long, ugly, and hard to type. You can also make it easier to submit files by making links using the ln -s command with the syntax ln -s /path/of/file/to/link/to /path/of/link
ln -s /home/qbiodata/morbidostat/PA83/v03/d14/PA83v03d14_GCCAAT_L003_R1_001.fastq.gz ./read1.fastq.gz
ln -s /home/qbiodata/morbidostat/PA83/v03/d14/PA83v03d14_GCCAAT_L003_R2_001.fastq.gz ./read2.fastq.gz
After making these links, the syntax becomes:
qsub trim_script_example.sh -F "read1.fastq.gz read2.fastq.gz"
To map a file to a reference file, first make an index of the reference file. This command copies the file to your current directory, then makes an index of the file in the current directory. (You can also make )
cp /home/qbiodata/morbidostat/PA83/reference/PSMA83.fasta .
bwa index PSMA83.fasta
After doing that, make a qsub file for bwa with the following text. BWA is a popular alignment tool.
# first line tells what should execute the script, e.g. this line says execute on the command line
!/bin/bash
# specify number of nodes and processes for node; e.g. 16
#PBS -l nodes=1:ppn=1
# specify the time you expect the job to run
#PBS -l walltime=2:00:00
# optional: specify output and error files; hh:mm:ss
#PBS -o myout.o$PBS_JOBID
#PBS -e myout.e$PBS_JOBID
# load source
source ~/.bashrc
# move to current directory
cd $PBS_O_WORKDIR
# this next line runs bwa, mapping to reads in the file specified (./PSMA83.fasta)
# take arguments in this format: -F "readfile1_val_1.fq.gz readfile2_val_2.fq.gz output.sam"
bwa mem -t 8 ./PSMA83.fasta $1 $2 > $3
Run this qsub command in the command line:
qsub map_script_example.sh -F "PA83v03d14_GCCAAT_L003_R1_001_val_1.fq.gz PA83v03d14_GCCAAT_L003_R2_001_val_2.fq.gz map.sam"