
Assessing genome assembly and annotation completeness with Benchmarking Universal Single-Copy Orthologs – BUSCOs

Rob Waterhouse

Introduction

Challenges associated with sequencing, assembling, and annotating genomes are numerous and range from obtaining enough high-quality sample to begin with, to dealing with high heterozygosity and very large, often highly repetitive, genomes. Several summary statistics give some indication of the quality of an assembly; for example, the contig or scaffold N50 reflects its contiguity. However, a key measure of quality is the completeness of the genome assembly in terms of its expected gene content.
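To make the N50 statistic mentioned above concrete, here is a minimal sketch that computes it from a FASTA file using standard Unix tools. The function names `fasta_lengths` and `n50` are illustrative and not part of BUSCO; N50 is taken here as the length L such that sequences of length at least L cover at least half of the total assembly length.

```shell
# fasta_lengths: print the length of each sequence in a FASTA file
# (or FASTA read from standard input), one length per line.
fasta_lengths() {
    awk '/^>/ { if (seen) print n; n = 0; seen = 1; next }
         { n += length($0) }
         END { if (seen) print n }' "$@"
}

# n50: read one sequence length per line and print the N50.
n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            running = 0
            for (i = 1; i <= NR; i++) {
                running += len[i]
                # stop at the first sequence that takes us past half the total
                if (running * 2 >= total) { print len[i]; exit }
            }
        }'
}

# Toy assembly with three contigs of length 8, 6 and 2 (total 16)
printf '>c1\nACGTACGT\n>c2\nACGT\nAC\n>c3\nAC\n' | fasta_lengths | n50   # prints 8
```

On a real assembly the call would look like `fasta_lengths GENOS/choose_a_genome.fa | n50`. A high N50 says nothing about gene content, which is why a completeness assessment is a useful complement.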

The identification of genes from many diverse species that are evolving under single-copy control (Waterhouse et al. 2011), i.e. genes found in almost all species and almost never with duplicate copies, defines an evolutionarily informed expected gene content. Benchmarking Universal Single-Copy Orthologue (BUSCO) sets are genes selected from the major species clades at the OrthoDB catalogue of orthologues (Waterhouse et al. 2013; Kriventseva et al. 2015), requiring single-copy orthologues to be present in at least 90% of the species. Because they are so widely present as single-copy orthologues, each BUSCO group is expected to have a matching single-copy orthologue in any newly sequenced genome from the appropriate species clade. If these BUSCOs cannot be identified in a genome assembly or annotated gene set, it is possible that the sequencing, assembly, or annotation approaches (or some combination of them) have failed to capture the complete expected gene content. Real gene losses can and do occur, even of otherwise well-conserved genes (Wyder et al. 2007), so some apparently missing genes could in fact be rare but true biological gene losses.
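The 90% selection rule can be illustrated with a small filter. This is only a sketch of the threshold itself, not of the OrthoDB procedure: the three-column input layout (group identifier, number of species in the clade, number of species with a single-copy orthologue) and the function name `select_buscos` are hypothetical.

```shell
# select_buscos: keep orthologous groups whose single-copy orthologues
# are present in at least 90% of the species in the clade.
# Input (tab-separated, hypothetical layout):
#   group_id  species_in_clade  species_with_single_copy
select_buscos() {
    # integer comparison avoids floating-point rounding at exactly 90%
    awk -F '\t' -v min_pct=90 '$3 * 100 >= $2 * min_pct { print $1 }'
}

# Toy clade of 20 species: 20/20 and 18/20 pass, 17/20 does not
printf 'OG1\t20\t20\nOG2\t20\t18\nOG3\t20\t17\n' | select_buscos
```

In this toy example OG1 and OG2 are retained and OG3 is rejected.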

The BUSCO assessment tool (Simão et al. 2015) implements a computational pipeline to identify and classify BUSCO group matches from genome assemblies, annotated gene sets, or transcriptomes, using HMMER (Eddy 2011) hidden Markov models and de novo gene prediction with Augustus (Keller et al. 2011). The recovered matches are classified as ‘complete’ if their lengths are within the expectation of the BUSCO group lengths. If these are found more than once they are classified as ‘duplicated’. The matches that are only partially recovered are classified as ‘fragmented’, and BUSCO groups for which there are no matches that pass the tests of orthology are classified as ‘missing’.
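The four categories can be summarised as a small decision rule. The sketch below assumes the hard work is already done: for each BUSCO group we are given the number of matches whose length is within expectation and the number of partial matches. That input layout and the function name `classify_buscos` are hypothetical; the real pipeline derives these from HMMER scores and length expectations.

```shell
# classify_buscos: assign one category per BUSCO group.
# Input (tab-separated, hypothetical layout):
#   group_id  full_length_matches  partial_matches
classify_buscos() {
    awk -F '\t' '{
        if ($2 > 1)       status = "duplicated"   # complete, found more than once
        else if ($2 == 1) status = "complete"     # complete, found exactly once
        else if ($3 > 0)  status = "fragmented"   # only partially recovered
        else              status = "missing"      # no match passed the tests
        print $1 "\t" status
    }'
}

printf 'B1\t1\t0\nB2\t2\t0\nB3\t0\t1\nB4\t0\t0\n' | classify_buscos
```

Here B1 is reported as complete, B2 as duplicated, B3 as fragmented and B4 as missing.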

Suggested Reading

Tutorial Instructions

1. BACKGROUND

For the purposes of this tutorial we will focus on assessing bacterial gene sets and genome assemblies, as they are smaller than those of eukaryotes and the bacterial BUSCO assessment set is made up of only 40 conserved orthologues. The same principles apply to the assessment of data from species from other lineages, but working with bacteria means that we can run the analyses and examine the results within the timeframe of the tutorial. We will begin by assessing a selection of bacterial gene set annotations and then a smaller selection of bacterial genome assemblies, downloaded from Ensembl Bacteria (http://bacteria.ensembl.org).

1.1. In your research projects that involve making use of an assembled genome:

1.2. From the introduction and your own background reading, can you briefly describe what BUSCO assessments can tell you about the quality of your genome assembly?

1.3. Can you think of a complementary approach?

2. SETUP

mkdir MyBUSCO
cd MyBUSCO
# if required: wget cegg.unige.ch/pub/SIBCOURSE/BUSCO-datasets.tar.gz
cp ~/data/BUSCO/BUSCO-datasets.tar.gz .
tar -xzf BUSCO-datasets.tar.gz
ls -lR
export AUGUSTUS_CONFIG_PATH=~/software/augustus-3.2.1/config/
printenv AUGUSTUS_CONFIG_PATH

2.1. Is your Augustus config path set correctly?
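One way to check this is sketched below. It relies on the fact that an Augustus config directory contains a `species/` subdirectory with one folder of parameters per species; the function name `check_augustus_config` is illustrative and not part of Augustus or BUSCO.

```shell
# check_augustus_config: sanity-check an Augustus config directory.
#   $1 = config directory (normally $AUGUSTUS_CONFIG_PATH)
#   $2 = optional species name to look for (e.g. fly)
check_augustus_config() {
    dir=$1
    species=$2
    if [ -z "$dir" ]; then
        echo "AUGUSTUS_CONFIG_PATH is empty or not set"
        return 1
    fi
    if [ ! -d "$dir/species" ]; then
        echo "no species/ directory found under $dir"
        return 1
    fi
    if [ -n "$species" ] && [ ! -d "$dir/species/$species" ]; then
        echo "no parameters for species '$species' under $dir/species"
        return 1
    fi
    echo "Augustus config looks usable: $dir"
}

check_augustus_config "$AUGUSTUS_CONFIG_PATH" fly \
    || echo "check the export line in step 2 and try again"
```

The same check can be repeated with `thermoanaerobacter_tengcongensis`, the species used for the bacterial assemblies later in the tutorial.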

3. TEST SETUP

python3 ~/software/BUSCO_v1.1b1/BUSCO_v1.1b1.py -o test1 -in ~/software/BUSCO_v1.1b1/sample_data/target.fa -l ~/software/BUSCO_v1.1b1/sample_data/example -m genome --sp fly >& test1_log.txt &
ls -l run_test1/
# Expected result from test run
more ~/software/BUSCO_v1.1b1/sample_data/run_TEST/short_summary_TEST
# Actual result from the test run
more run_test1/short_summary_test1
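Rather than comparing the two summaries by eye, they can be compared with `diff`. The helper below is a small convenience sketch (the name `compare_summaries` is ours); note that header lines recording run names or file paths may differ legitimately between the expected and the actual summary, so a reported difference is a prompt to look, not proof of a problem.

```shell
# compare_summaries: show the differences between two summary files
# and report whether they are identical.
compare_summaries() {
    if diff "$1" "$2"; then
        echo "summaries are identical"
    else
        echo "summaries differ (see the lines above)"
        return 1
    fi
}

# In the tutorial directory this would be called as:
#   compare_summaries ~/software/BUSCO_v1.1b1/sample_data/run_TEST/short_summary_TEST \
#                     run_test1/short_summary_test1
```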

3.1. Understanding the test:

4. RUN ONE GENE SET

python3 ~/software/BUSCO_v1.1b1/BUSCO_v1.1b1.py -o test2 -in PROTS/Streptomyces_albulus_pd_1.GCA_000504065.2.29.pep.all.fa -l bacteria -m OGS >& test2_log.txt &
ls -l run_test2/
more run_test2/short_summary_test2
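If you want the summary figures in a form that is easy to tabulate, the one-line BUSCO notation can be split into named fields. This sketch assumes the summary contains a line of the form `C:..%[D:..%],F:..%,M:..%,n:..` (complete, duplicated, fragmented, missing, number of BUSCO groups); check your own short summary file to confirm the layout. The function name and the example values are made up for illustration.

```shell
# parse_busco_notation: turn a BUSCO notation line into name=value pairs.
# Lines that do not match the assumed pattern are ignored.
parse_busco_notation() {
    sed -n 's/^[[:space:]]*C:\([0-9.]*\)%\[D:\([0-9.]*\)%],F:\([0-9.]*\)%,M:\([0-9.]*\)%,n:\([0-9]*\).*$/complete=\1 duplicated=\2 fragmented=\3 missing=\4 total=\5/p'
}

# Toy input line (values invented for the example)
echo 'C:95%[D:2.5%],F:2.5%,M:2.5%,n:40' | parse_busco_notation

# On a real run: parse_busco_notation < run_test2/short_summary_test2
```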

4.1. Understanding the OGS analysis:

5. RUN ONE GENOME ASSEMBLY

python3 ~/software/BUSCO_v1.1b1/BUSCO_v1.1b1.py -o test3 -in GENOS/choose_a_genome.fa -l bacteria -m genome --sp thermoanaerobacter_tengcongensis >& test3_log.txt &
ls -l run_test3/
more run_test3/short_summary_test3

5.1. Understanding the assembly analysis:

6. RUN MULTIPLE GENE SETS

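Save the following script as busco_ogs_set.sh:

#!/bin/bash
# Run BUSCO in OGS mode on every protein set in PROTS/ and record
# which run name (s1, s2, ...) corresponds to which input file.
FILENO=1
date
printf "Run\tName\n" > run2name_ogs_map.txt
for i in PROTS/*; do
    echo "$i"
    python3 ~/software/BUSCO_v1.1b1/BUSCO_v1.1b1.py -o "s$FILENO" -in "$i" -l bacteria -m OGS > "s$FILENO.log.txt" 2>&1
    printf "%s\t%s\n" "s$FILENO" "$i" >> run2name_ogs_map.txt
    FILENO=$((FILENO + 1))
done
date

Then run it in the background:

bash busco_ogs_set.sh >& busco_ogs_set.log.txt &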

6.1. Understanding the OGS analysis output:

7. VISUALISE RESULTS

# create RESULTS if it does not already exist
mkdir -p RESULTS
cp run_s*/full_table_* RESULTS/
ls -l RESULTS/
perl BUSCO_summary_plots.pl RESULTS
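Alongside the chart, it can help to see the raw counts behind it. The sketch below tallies the status of each BUSCO group in one full table. It assumes the table is tab-separated with the BUSCO group identifier in column 1, the status in column 2, and comment lines starting with `#`; inspect one of your own tables with `head` to confirm. The function name `tally_status` is ours.

```shell
# tally_status: count BUSCO groups per status in one full_table file.
# A group listed on several lines (e.g. a duplicated BUSCO) is counted once.
tally_status() {
    awk -F '\t' '
        !/^#/ && NF >= 2 { seen[$1] = $2 }
        END {
            for (id in seen) count[seen[id]]++
            for (s in count) print s "\t" count[s]
        }' "$1" | sort
}

# Example loop over all copied tables:
#   for f in RESULTS/full_table_*; do echo "$f"; tally_status "$f"; done
```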

7.1. Understanding the results chart:

8. RUN MULTIPLE GENOME ASSEMBLIES

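Save the following script as busco_geno_set.sh:

#!/bin/bash
# Run BUSCO in genome mode on every assembly in GENOS/ and record
# which run name (g1, g2, ...) corresponds to which input file.
FILENO=1
date
printf "Run\tName\n" > run2name_geno_map.txt
for i in GENOS/*; do
    echo "$i"
    python3 ~/software/BUSCO_v1.1b1/BUSCO_v1.1b1.py -o "g$FILENO" -in "$i" -l bacteria -m genome --sp thermoanaerobacter_tengcongensis > "g$FILENO.log.txt" 2>&1
    printf "%s\t%s\n" "g$FILENO" "$i" >> run2name_geno_map.txt
    FILENO=$((FILENO + 1))
done
date

Then run it in the background:

bash busco_geno_set.sh >& busco_geno_set.log.txt &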
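Because the assessments run in the background, it is useful to be able to check on progress. The helper below is a rough sketch that counts how many input files already have a short summary; it assumes the short summary file is only written when a run has finished, and the name `count_finished` is ours.

```shell
# count_finished: report how many runs have produced a short summary.
#   $1 = directory holding the input files (e.g. GENOS)
#   $2 = run-name prefix used in the batch script (e.g. g)
count_finished() {
    # $(( ... )) strips the padding that some versions of wc add
    total=$(( $(ls "$1" | wc -l) ))
    finished=$(( $(ls -d run_"$2"*/short_summary_* 2>/dev/null | wc -l) ))
    echo "$finished of $total runs finished"
}

# From the MyBUSCO directory:
#   count_finished GENOS g
#   count_finished PROTS s
```

`tail busco_geno_set.log.txt` shows which input file the script is currently working on.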

8.1. While waiting for the assembly assessments to run:

9. VISUALISE RESULTS

cp run_g*/full_table_* RESULTS/.
ls -l RESULTS/
perl BUSCO_summary_plots.pl RESULTS

9.1. Understanding the results chart:

10. EXTRA TIME – optional homework for later

wget http://busco.ezlab.org/files/vertebrata_buscos.tar.gz
tar -xzf vertebrata_buscos.tar.gz
wget ftp://ftp.ensembl.org/pub/release-84/fasta/gallus_gallus/pep/Gallus_gallus.Galgal4.pep.all.fa.gz
gunzip Gallus_gallus.Galgal4.pep.all.fa.gz
wget ftp://ftp.ensembl.org/pub/release-67/fasta/gallus_gallus/pep/Gallus_gallus.WASHUC2.67.pep.all.fa.gz
gunzip Gallus_gallus.WASHUC2.67.pep.all.fa.gz
mkdir galnew
mv Gallus_gallus.Galgal4.pep.all.fa galnew/
mkdir galold
mv Gallus_gallus.WASHUC2.67.pep.all.fa galold/
cd galnew
python3 ~/software/BUSCO_v1.1b1/BUSCO_v1.1b1.py -o GALNEW -in Gallus_gallus.Galgal4.pep.all.fa -l ../vertebrata -m OGS -c 2 >& GALNEW_log.txt &
cd ../galold
python3 ~/software/BUSCO_v1.1b1/BUSCO_v1.1b1.py -o GALOLD -in Gallus_gallus.WASHUC2.67.pep.all.fa -l ../vertebrata -m OGS -c 2 >& GALOLD_log.txt &
cd ../
# both runs were started in the background; wait for them to finish before collecting results
wait
mkdir galresults
cp galnew/run_GALNEW/full_table_* galresults/
cp galold/run_GALOLD/full_table_* galresults/
ls -l galresults/
perl BUSCO_summary_plots.pl galresults
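To see which BUSCOs account for the difference between the old and new chicken gene sets, the two full tables can be compared directly. This sketch assumes tab-separated tables with the BUSCO group identifier in column 1 and a status of `Complete`, `Duplicated`, `Fragmented` or `Missing` in column 2, and that the tables are named `full_table_GALOLD` and `full_table_GALNEW`; check both assumptions against your own files. The function name `gained_buscos` is ours.

```shell
# gained_buscos: list BUSCO groups that are Missing in the first table
# but recovered (any other status) in the second.
#   $1 = older full table, $2 = newer full table
gained_buscos() {
    awk -F '\t' '
        /^#/ { next }
        # first file: remember which groups are missing
        NR == FNR { if ($2 == "Missing") missing[$1] = 1; next }
        # second file: report each recovered group once, with its new status
        ($1 in missing) && $2 != "Missing" && !printed[$1]++ { print $1 "\t" $2 }
    ' "$1" "$2"
}

# From the MyBUSCO directory:
#   gained_buscos galresults/full_table_GALOLD galresults/full_table_GALNEW
```

Swapping the two arguments lists BUSCOs that were found in the older annotation but are missing from the newer one.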