BLAST+, BLAST

module avail blast+/
module avail blast/

The BLAST (Basic Local Alignment Search Tool) library is a collection of software tools and algorithms developed by the National Center for Biotechnology Information (NCBI).

BLAST is widely used in bioinformatics for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA sequences.

BLAST+ refers to an enhanced version of the original BLAST.

Usage

Programs

blastp - compares an amino acid query sequence against a protein sequence database
blastn - compares a nucleotide query sequence against a nucleotide sequence database
blastx - compares a nucleotide query sequence translated in all reading frames against a protein sequence database
tblastn - compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
tblastx - compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database
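
The list above amounts to a lookup on the query type and the database type. As a minimal sketch, the hypothetical helper below (pick_blast_program is not part of BLAST+; the type labels "nucl" and "prot" and the "translated" switch are assumptions made for this illustration) returns the matching program name:

```shell
# Hypothetical helper, not part of BLAST+: choose a program name from the
# query type and the database type ("nucl" or "prot").
# The optional third argument "translated" is an assumed switch that selects
# tblastx instead of blastn for nucleotide-vs-nucleotide searches.
pick_blast_program() {
  local query_type="$1" db_type="$2" mode="${3:-}"
  case "${query_type}:${db_type}" in
    prot:prot) echo "blastp" ;;
    nucl:nucl)
      if [ "$mode" = "translated" ]; then
        echo "tblastx"
      else
        echo "blastn"
      fi
      ;;
    nucl:prot) echo "blastx" ;;
    prot:nucl) echo "tblastn" ;;
    *)
      echo "unknown query/database type: ${query_type}/${db_type}" >&2
      return 1
      ;;
  esac
}
```

For example, pick_blast_program nucl prot prints blastx, which can then be used as the command name in a batch script.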

Command line examples

blast database: makeblastdb -input_type fasta -in FASTA_FILE -dbtype nucl -title NAME -out NAME
blastn: blastn -db DATABASE_NAME -query INPUT_FASTA -out OUTPUT_NAME -max_target_seqs 1 -evalue 1e-5 -num_threads 8
blastp: blastp -db DATABASE_NAME -query INPUT_FASTA -out OUTPUT_NAME -max_target_seqs 1 -evalue 1e-5 -num_threads 8
tblastx: tblastx -db DATABASE_NAME -query INPUT_FASTA -out OUTPUT_NAME -max_target_seqs 1 -evalue 1e-5 -num_threads 8
blastx: blastx -db DATABASE_NAME -query INPUT_FASTA -out OUTPUT_NAME -max_target_seqs 1 -evalue 1e-5 -num_threads 8
tblastn: tblastn -db DATABASE_NAME -query INPUT_FASTA -out OUTPUT_NAME -max_target_seqs 1 -evalue 1e-5 -num_threads 8

Efficiency

By default, BLAST often does not make effective use of most of the CPUs reserved for a job. Set export BATCH_SIZE=3000000 before running any BLAST command (e.g. blastn, blastp); searches typically run much faster with this setting.
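
One way to make sure the setting is never forgotten is to wrap the BLAST call in a small function. The sketch below is only an illustration: run_blast is a hypothetical name, and reading the number of reserved CPUs from PBS_NCPUS is an assumption about the scheduler environment (the function falls back to 1 thread if the variable is unset).

```shell
# Hypothetical wrapper: run any BLAST+ program with BATCH_SIZE exported and
# -num_threads matched to the reserved CPUs.
# Assumption: the scheduler exposes the CPU count in PBS_NCPUS.
run_blast() {
  local program="$1"
  shift
  export BATCH_SIZE=3000000
  local threads="${PBS_NCPUS:-1}"
  # Pass all remaining arguments through unchanged and append the thread count.
  "$program" "$@" -num_threads "$threads"
}

# Usage (placeholders as in the command line examples):
#   run_blast blastn -db DATABASE_NAME -query INPUT_FASTA -out OUTPUT_NAME -evalue 1e-5
```

Because the wrapper appends -num_threads itself, do not pass that option a second time in the argument list.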

Databases

We maintain a local copy of the BLAST databases in the /storage/projects/BlastDB directory. The databases are ready to use.

  • For short/single query jobs, you can use the databases directly in storage and refer to them from the batch script by their full path, i.e. /storage/projects/BlastDB/DB_NAME_PREFIX.
  • If you run a longer job, multiple queries or multiple jobs with a particular DB, it is more efficient to copy the database to the scratch directory.

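In both cases, refer to the database (-db option) within your blastn/blastp/tblastx job by its name prefix without any file extension (e.g. nt, nr, wgs, refseq_genomic), preceded by the directory that holds it. For example, use -db /storage/projects/BlastDB/nt for the shared copy, or -db ./nt after copying the database to the scratch directory.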
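
The copy-to-scratch variant can be sketched as a small function that copies one database and prints the value to pass to -db. The name stage_blast_db is hypothetical, and the glob assumes that every file of a database is named DB_NAME_PREFIX followed by a dot and an extension (this is narrower than copying everything that starts with the prefix, so that, for instance, a request for nt does not also pick up other databases whose names merely begin with nt).

```shell
# Hypothetical helper: copy all files of one database into a target directory
# and print the path to use with -db (directory plus name prefix, no extension).
# Assumption: database files are named "<prefix>.<something>".
stage_blast_db() {
  local src_dir="$1" db="$2" dest_dir="$3"
  # cp fails (and we return 1) if no file matches the pattern.
  cp -p "${src_dir}/${db}".* "${dest_dir}/" || return 1
  echo "${dest_dir}/${db}"
}

# Usage inside a batch script:
#   DB_PATH=$(stage_blast_db /storage/projects/BlastDB nt "$SCRATCHDIR") || exit 6
#   blastn -db "$DB_PATH" -query INPUT_FASTA -out OUTPUT_NAME
```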

All available databases are described on the NCBI website. We mirror all of them. If you need a database updated or a new one added, please contact user support at meta@cesnet.cz.

Warning

New database releases contain very large GI (GenInfo Identifier) numbers, which are incompatible with older versions of the blast modules. Use the latest version of the blast module to prevent potential incompatibilities.

Network load optimization

If you need to run several BLAST jobs with the same database, we ask users to reduce the network load by copying the database only once and using that copy for all jobs running on the same node.

Note

This requires that you don't clean the content of your scratch directory after the first job is finished!

This can be done by inserting the following construct into the batch script:

DB="nt" # name of the database you need to use
TIMEOUT=120
TIMEWAIT=0
LINKDB=false

...

# Enter the scratch dir
cd "$SCRATCHDIR" || exit 4

...

# search the content of all your other scratch directories on that node
# and look for a file called ${DB}.db_here 
LOCAL_DB=$(find .. -name "${DB}.db_here" -print -quit) # LOCAL_DB contains a path as well, contrary to DB

# if the file exists, do...
if [ -n "$LOCAL_DB" ]; then
  LINKDB=true 
  LOCAL_DB="${LOCAL_DB%%.db_here}" # cut off the ".db_here" suffix

  # if the file "${LOCAL_DB}.db_is_ready" does NOT yet exist in the scratch directory where LOCAL_DB resides, wait for it
  while ! test -f "${LOCAL_DB}.db_is_ready"; do
    sleep 5
    TIMEWAIT=$((TIMEWAIT+5))
    if [ "$TIMEWAIT" -gt "$TIMEOUT" ]; then
      echo "timed out"
      LINKDB=false # give up on linking and fall back to copying the DB (must be set before break)
      break
    fi
  done
fi

# the DB exists somewhere on this machine and is complete, so we can link it
if $LINKDB; then
  ln "${LOCAL_DB}"* . || exit 5 # link everything into current scratch directory
# the DB either does not exist on this machine or is not complete, so copy it from /storage/projects
else
  touch "${DB}.db_here"
  cp -p /storage/projects/BlastDB/"${DB}"* . && touch "${DB}.db_is_ready" || exit 6
  # ${DB}.db_is_ready is an empty file that just tells your other future jobs on this machine that the cp operation has finished
  export CLEAN_SCRATCH=false # do not remove content of this scratch
fi

...

# then run the calculation
blastp -db "./${DB}" -query INPUT_FASTA -out OUTPUT_NAME ...
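
The wait-with-timeout step of the script above can also be isolated into a reusable function, which makes the timeout outcome explicit through the return code. This is a minimal sketch; wait_for_file and its default values are illustrative choices and not part of the original script.

```shell
# Hypothetical helper: wait until a file exists, polling every INTERVAL seconds
# for at most TIMEOUT seconds. Returns 0 if the file appeared, 1 on timeout.
wait_for_file() {
  local file="$1" timeout="${2:-120}" interval="${3:-5}" waited=0
  while [ ! -f "$file" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}

# Usage with the variables from the script above:
#   if wait_for_file "${LOCAL_DB}.db_is_ready" "$TIMEOUT"; then
#     LINKDB=true
#   else
#     echo "timed out"
#     LINKDB=false
#   fi
```

Deciding LINKDB from the return code keeps the decision in one place, so a timeout cannot leave LINKDB set to true by accident.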