Return to Main Page

Part 3: Gene prediction

You need to have gone through Part 2: Genome assembly before starting this practical.

Many tools exist for gene prediction, some based on ab initio statistical models of what a protein-coding gene should look like, others that use similarity with protein-coding genes from other species, and others (such as Augustus and SNAP), which use both.

There is no perfect tool or approach, thus we typically run many gene-finding tools and call a consensus between the different predicted gene models. MAKER, BRAKER and JAMg can do this for us.
In this practical, we will use MAKER.

1. Running Maker

Following the same procedure described in Section 1.2 of Part 1: Read cleaning, create a new main directory for today’s practical (e.g., 2024-09-26-gene_prediction) and the input, tmp, and results subdirectories, and the file WHATIDID.txt to log your commands. Link the output (assembly) from Part 2 practical into input subdirectory:

cd ~/2024-09-26-gene_prediction/input
ln -s ~/2024-09-25-assembly/results/scaffolds.fasta .
cd ..

Pull out the longest few scaffolds from scaffolds.fasta into a new file:

seqtk seq -L 10000 input/scaffolds.fasta > tmp/min10000.fa

Gene prediction can be difficult if the assembly is low quality and does not include long scaffolds (remember, trash in = trash out). For instance, in the case of short scaffolds, if a gene is 2,000 bp long and includes introns, it may be very hard to find many entire genes.

Note:
If you have difficulty in predicting the genes or you suspect that your assembly may be affected by the aforementioned issues, you can use already assembled scaffolds.
# Link this scaffolds file into your input directory
ln -s /shared/data/backup_assembly/scaffolds.fasta .

In this practical, we will show how to run MAKER in a simple scenario. For a better understanding of how this tool works, and how it can be applied in real-case scenarios, we strongly encourage to read the paper and documentation. Also, checking which settings were used in recent publications can be very helpful for reproducing (or critiquing) analyses.

Change to your tmp directory and run maker:

cd tmp
maker -OPTS

This will generate an empty maker_opts.ctl configuration file (ignore the warnings). Edit that file using a text editor such as nano or emacs and specify:

genome: min10000.fa
deactivate RepeatMasker by changing model_org line to model_org= (i.e., nothing afer =)
deactivate RepeatRunner by changing repeat_protein line to repeat_protein= (i.e., nothing after =)
Augustus_species:honeybee1 (remember to add 1 at the end; this provides hints to Augustus about the gene structure based on what we know about gene composition from the honeybee.

We deactivated RepeatMakser and RepeatRunner due to computational limitations as well as the lack of a suitable library of repetitive elements for this species. For a real project, we would include RepeatMasker, likely after creating a new repeat library for our species.
For a real project, we would also include gene expression data (RNAseq improves gene prediction performance tremendously), protein sequences from related species, and iteratively train gene prediction algorithms (e.g., Augustus and SNAP) for our data.

Finally, run MAKER using the edited configuration. This may take a few minutes, depending on how much data you gave it:

maker maker_opts.ctl

Genome annotation software like MAKER usually provide information about the exon-intron structure of the genes (e.g., in GFF3 format), and sequence of corresponding messenger RNA and protein products (e.g., in FASTA format).

Question:
While MAKER is running, note the different file formats you have encountered until now.

Which type of data does each file format contain?

Do you understand the difference between the file formats and data types?

Once MAKER is done the results will be hidden in subdirectories of min10000.maker.output. MAKER provides a helper script to collect this hidden output in one place (again please ignore the warnings for these steps):

# Pull out information about exon-intron structure of the predicted genes. This
# will be saved to the file min10000.all.gff.
gff3_merge -d min10000.maker.output/min10000_master_datastore_index.log

# Pull out predicted messenger RNA and protein sequences. These will be saved
# to the files: min10000.all.maker.augustus.transcripts.fasta, and
# min10000.all.maker.augustus.proteins.fasta
fasta_merge -d min10000.maker.output/min10000_master_datastore_index.log

2. Quality control of individual genes

So now we have some gene predictions… how can we know if they are any good? The easiest way to get a feel for this is to use the following example sequences: predicted protein sequences from rice and honeybee. We will compare them using BLAST to known sequences from other species against the Swissprot database (faster), or the Uniref50 database (slower).

2.1 Running BLAST with SequenceServer

We will use SequenceServer to run BLAST. Open genomicscourse.sequenceserver.com in your browser, paste the example rice and honeybee protein sequences in the textbox and click on the ‘BLAST’ button to run a BLAST search. THIS WILL TAKE A MINUTE OR TWO.

Question:
Look at the BLAST results:

Do any of the gene predictions have significant similarity to known sequences?

For a given gene prediction, do you think it is complete, or can you infer from the BLAST alignments that something may be wrong? Start by comparing the length of your gene prediction to that of the BLAST hits.

Is your gene prediction considerably longer or considerably shorter than BLAST hits? Why?

Now try a few of your gene predictions (use the predicted protein sequences). Run BLAST on only a maximum of 12 sequences at a time (instead of simply selecting the first 12 genes in your file, copy-paste sequences randomly from the file). This will take a bit longer.

As you can see, gene prediction software is imperfect. This is even the case when using all available evidence. This is potentially costly for analyses that rely on gene predictions, i.e. many of the analyses we might want to do:

“Incorrect annotations [ie. gene identifications] poison every experiment that makes use of them. Worse still the poison spreads.” – Yandell & Ence (2012).

2.2 Using GeneValidator

The GeneValidator tool can help to evaluate the quality of a gene prediction by comparing features of a predicted gene to similar database sequences. This approach expects that similar sequences should for example be of similar length. Genevalidator was built to automate the comparison of sequence characteristics similarly to what we just did through visual individual BLAST results.

Try to run the example rice and honeybee protein sequences through GeneValidator. It should be accessible at https://genevalidator.genomicscourse.com/ or https://genevalidator.wurmlab.com/.

3. Comparing whole genesets and prioritizing genes for manual curation

Genevalidator’s visual output can be handy when looking at a few genes. But the tool also provides tab-delimited output, useful when working in the command-line or running the software on whole proteomes. This can help the analysis:

in situations when you can choose between multiple gene sets.
to identify which gene predictions are likely correct, and which predictions need might require further inspection and potentially be manually fixed.

4. Manual curation

Because automated gene predictions are not perfect, manual inspection and fixing is often required. The most commonly used software for this is Apollo/WebApollo.

We will not curate any gene models as part of this practical, but you can learn about gene model curation through these YouTube videos: