Sequence analysis tools

A tool that finds regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance.

Bioinformatics

BUSCO

Provides measures for quantitative assessment of genome assembly, gene set, and transcriptome completeness based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs.

Bowtie2

Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.

Read mapping

Genome indexing
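The FM index mentioned in the Bowtie 2 entry can be illustrated with a toy example. The sketch below is not Bowtie 2's implementation: it builds the suffix array naively and stores a full occurrence table, which would be far too large for a real genome. The function names `build_fm_index` and `find` are hypothetical and chosen only for this illustration. It shows the core idea of backward search, where a pattern is matched from its last character to its first by narrowing a range of suffix-array rows.

```python
def build_fm_index(text):
    """Build a toy FM index (suffix array, C table, occurrence table)."""
    text += "$"  # sentinel; sorts before A, C, G, T in ASCII order
    # Naive suffix array construction: fine for a toy input, not for genomes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[ch] = number of characters in text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i] = number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, c_table, occ


def find(pattern, index):
    """Return sorted 0-based start positions of pattern via backward search."""
    sa, c_table, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


index = build_fm_index("GATTACAGATTACA")
print(find("ATTA", index))
```

Each step of the search costs two table lookups regardless of text length, which is why this family of indexes suits read alignment against long references.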

Cactus

Cactus is a reference-free whole-genome multiple alignment program.

Genomics

GATK4

The Genome Analysis Toolkit (GATK) is a set of bioinformatic tools for analyzing high-throughput sequencing (HTS) and variant call format (VCF) data. The toolkit is well established for germline short variant discovery from whole genome and exome sequencing data. GATK4 expands functionality into copy number and somatic analyses and offers pipeline scripts for workflows.

Genetic variation

Polymorphism detection

Genotyping

Statistical calculation

GECKO

Software for pairwise sequence comparison that generates high-quality results (equivalent to MUMmer) with controlled memory consumption and comparable or faster execution times, particularly with long sequences.

MAFFT

Multiple sequence alignment

MAFFT (Multiple Alignment using Fast Fourier Transform) is a high-speed multiple sequence alignment program.

MUSCLE

This tool performs multiple sequence alignments of nucleotide or amino acid sequences.

Maker

Portable and easily configurable genome annotation pipeline. Its purpose is to allow smaller eukaryotic and prokaryotic genome projects to independently annotate their genomes and to create genome databases.

Genomics

DNA

Nextclade

Nextclade is an open-source project for viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement.

Genomics

Cladistics

PACU

PACU is a workflow for whole genome sequencing based phylogeny of Illumina and ONT R9/R10 data. PACU stands for the Prokaryotic Awesome variant Calling Utility and is named after an omnivorous fish (that eats both Illumina and ONT reads).

Phylogenetics

RAxML

A tool for Phylogenetic Analysis and Post-Analysis of Large Phylogenies.

Phylogenetics

RepeatMasker

A program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns).

Sequence composition, complexity and repeats
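The masking step described in the RepeatMasker entry, where annotated repeats are replaced by Ns, can be sketched in a few lines. This is only an illustration of hard masking, not RepeatMasker's own code. The function name `mask_repeats` is hypothetical, and the sketch assumes repeat intervals are given as 0-based, half-open `(start, end)` pairs, which may differ from the coordinate convention in the program's annotation output.

```python
def mask_repeats(sequence, repeats, mask_char="N"):
    """Replace every position covered by a repeat interval with mask_char.

    repeats: iterable of (start, end) pairs, 0-based and half-open
    (an assumption made for this sketch).
    """
    masked = list(sequence)
    for start, end in repeats:
        # Clamp to the sequence bounds so out-of-range intervals are harmless.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = mask_char
    return "".join(masked)


print(mask_repeats("ACGTACGTAC", [(2, 5)]))
```

Overlapping intervals are handled naturally, since masking the same position twice has no further effect.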

SAMtools

SAMtools is a widely used suite of programs for processing and analysing high-throughput sequencing data. It includes tools for file format conversion and manipulation, sorting, querying, statistics, variant calling, and effect analysis, amongst other methods.

T-Coffee

A multiple sequence alignment package that can be used for DNA, RNA and protein sequences. It can be used to align sequences or to combine the output of other alignment methods (Clustal, Mafft, Probcons, Muscle...) into one unique alignment.

Multiple sequence alignment

Vt

Variant tool set that discovers short variants from Next Generation Sequencing data.

Genetics

Sequencing

breseq

breseq is a computational pipeline for finding mutations relative to a reference sequence using high-throughput DNA resequencing data. It is intended for haploid microbial genomes (<20 Mb). breseq is a command line tool implemented in C++ and R.

Sequencing

DNA mutation

kallisto

Gene expression profiling

A program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads. It is based on the novel idea of pseudoalignment for rapidly determining the compatibility of reads with targets, without the need for alignment.

Transcriptomics

RNA-Seq

Gene expression

Statistical models
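The pseudoalignment idea in the kallisto entry can be illustrated with a deliberately simplified sketch. It is not kallisto's algorithm, which works on an indexed transcriptome graph and has its own rules for k-mers that are absent from the index. Here a plain dictionary maps each k-mer to the set of transcripts containing it, and a read's compatible targets are the intersection of those sets over all of its k-mers. The names `build_kmer_index` and `pseudoalign`, and the choice that a missing k-mer makes the read incompatible with everything, are assumptions made for this example.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def pseudoalign(read, index, k):
    """Return the set of transcripts compatible with every k-mer of the read.

    No base-level alignment is computed; only set membership is checked.
    Reads shorter than k yield an empty set.
    """
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()  # early exit: no transcript can match any more
    return compatible or set()


transcripts = {"t1": "ACGTACGGTT", "t2": "ACGTACCCAA"}
index = build_kmer_index(transcripts, k=4)
print(sorted(pseudoalign("ACGTAC", index, k=4)))
print(sorted(pseudoalign("TACGG", index, k=4)))
```

A read from the shared prefix is compatible with both transcripts, while a read spanning the point where they diverge is compatible with only one. A quantification step would then use these compatibility sets to estimate abundances.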

trimAl

Tool for the automated removal of spurious sequences or poorly aligned regions from a multiple sequence alignment.
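One simple way to think about removing poorly aligned regions is to drop alignment columns that are mostly gaps. The sketch below illustrates only that idea; it is not trimAl's method, which offers several trimming strategies. The function name `trim_gappy_columns` and the `max_gap_fraction` parameter are hypothetical, and the alignment is assumed to be a list of equal-length strings with `-` as the gap character.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Drop columns whose fraction of gap characters exceeds the threshold.

    alignment: list of equal-length aligned sequences (an assumption of
    this sketch; no validation is performed).
    """
    if not alignment:
        return []
    n_rows = len(alignment)
    keep = [
        col
        for col in range(len(alignment[0]))
        if sum(row[col] == "-" for row in alignment) / n_rows <= max_gap_fraction
    ]
    # Rebuild each sequence from the retained columns only.
    return ["".join(row[col] for col in keep) for row in alignment]


for row in trim_gappy_columns(["AC-GT", "A--GT", "ACTG-"]):
    print(row)
```

In this example only the third column is removed, because two of its three rows are gaps.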