Population genetics in R
We have samples with two genotypes: the B genotype (associated with single-queen colony phenotype) and the b genotype (associated with multiple-queen colony phenotype).
Our dummy assembly has two scaffolds, from two different chromosomes. The aim of our analysis is to test whether any parts of this assembly differ between the individuals from these two groups (B and b).
In the first part of the analysis, we are going to create a heat map of the genotypes of the individuals and we are going to run Principal Component Analysis (PCA) on these genotypes. This will allow us to test if any of the individuals cluster together by their B/b genotype. This will be done using the
adegenet package in R.
In the second part, we are going to measure genetic differentiation between the two groups (B and b). We will do this analysis over a sliding window, to see if the differentiation between B and b are specific to any portion of the genome. We will also measure the genetic diversity among each of the groups, which may tell us something about the evolutionary history of the portions of genome represented in our assembly. This will be done using the
PopGenome package in R.
Input into R
Again, make a directory for this practical. You should create a directory for the
input, one for the
results and a
WHATIDID.txt file in which you log your commands.
2019-10-xx-population_genetics ├── input │ └── snp.vcf ├── results └── WHATIDID.txt
You will only need the
snp.vcf file we created in the last practical and place it into the appropriate input directory (if you don’t have this file, you can download it from here, if you do this you will need to replace
snp.vcf.gz in the code below).
It’s a good idea to note down the results of your analysis in the results directory, as well as saving any graph you make.
adegenet uses an object called
r genlight. To create it, we need to input a matrix where each row is an individual and each column is a locus (i.e. a SNP position). We can do this using bcftools:
# Select the information in the vcf file without the header bcftools query input/snp.vcf -f '%CHROM\t%POS[\t%GT]\n' > snp_matrix.txt # get sample names bcftools query -l input/snp.vcf > sample_names.txt
You can open a new R session by typing
rstudio-genomics in the terminal. The following lines of code should be entered into your console within Rstudio:
# input the SNP data and the sample names by reading the .txt files you just created with bcftools snp_data <- read.table("snp_matrix.txt") names_input <- read.table("sample_names.txt") raw_sample_names <- names_input$V1 # Keep the position of the loci loci <- snp_data[, 1:2] colnames(loci) <- c("scaffold", "position") # Transpose the data (turn it on its side so that rows = individuals and columns = loci) snp_matrix <- t(snp_data[,3:ncol(snp_data)]) # Clean the sample names and add them to the rows of our matrix sample_names <- gsub("\\.bam", "", raw_sample_names) row.names(snp_matrix) <- sample_names # Finally we reorder the rows by population bb <- sample_names[grep("B", sample_names)] # B samples (bb = "big B") lb <- sample_names[grep("b", sample_names) ]# b samples (lb = "little b") snp_matrix <- snp_matrix[c(bb, lb), ]
Once this is done, we can create a new
genlight object that contains all the SNP data
library(adegenet) # Note that the following line has been split up over multiple lines to make it easier to read # R won't run the command until all brackets have been closed snp <- new("genlight", snp_matrix, chromosome = loci$scaffold, position = loci$position, pop = as.factor(c(rep("B", 7), rep("b", 7)))) # You can access the data using the "@" sign: snp email@example.com snp@gen snp@chromosome # There are some functions that may also be helpful: ploidy(snp)
Now plot a heatmap showing the genotypes.
## Heat map of genotype glPlot(snp)
You can also perform a Principle Component Analysis (PCA) and plot the first few components.
## PCA pca <- glPca(snp, nf = 10) # you can select 10 components # Quick plot scatter(pca, posi = "none") # Plot of the first few axes coloured by population plot(x = pca$scores[, 1], y = pca$scores[, 2], col = c(rep("blue", 7), rep("orange", 7)), cex = 2) text(pca$scores[, 1], pca$scores[, 2] + 0.7, labels = rownames(pca$scores), cex = 0.7) plot(pca$scores[, 1], pca$scores[, 3], col = c(rep("blue", 7), rep("orange", 7)), cex = 2) text(pca$scores[,1], pca$scores[,3] + 0.7, labels=rownames(pca$scores), cex = 0.7)
The aim of these analysis is to test whether the B and the b individuals cluster together or separately. The first principal component separates B and b. But if you look at the heat map, the separation is not constant in the whole genome (you can navigate between your plots using the arrow buttons above the plot window in Rstudio).
Each of the scaffolds has been retrieved from a different chromosome. We can test whether the differentiation between B and b is only seen in one of the scaffolds.
# We will start with scaffold_2 scaffold_2_index <- which(snp@chromosome == "scaffold_2") scaffold_2 <- snp[, scaffold_2_index] glPlot(scaffold_2) scaffold_2_pca <- glPca(scaffold_2, nf = 10) scatter(scaffold_2_pca, posi = "none") # scaffold_1 scaffold_1_index <- which(snp@chromosome == "scaffold_1") scaffold_1 <- snp[, scaffold_1_index] glPlot(scaffold_1) scaffold_1_pca <- glPca(scaffold_1, nf = 10) scatter(scaffold_1_pca, posi = "none")
- Is differentiation coming mainly from one of the scaffolds?
- If so, which one?
PopGenome to measure differentiation and diversity
Another way of measuring differentiation between groups of individuals is using the fixation index, FST, which tests whether there is genetic structure in the population. FST can be used, for example, to test whether there is evidence of low gene flow between populations. In the case of the fire ant, the B and b supergene variants coexist in the population, but do not recombine with each other - so they should show strong differentiation.
An important population genetics measure is genetic diversity. Patterns of genetic diversity can be informative of a population’s evolutionary past - for example, low genetic diversity may be evidence for a recent population bottleneck. Furthermore, the variation of diversity within the genome can be informative of different evolutionary effects, such as the strength of selection in different parts of the genome.
We will measure FST and nucleotide diversity (a measure of genetic diversity) using the R package
In theory, the
r PopGenome can read VCF files directly, using the
readVCF function. However, because our samples are haploid, we need to use a different function,
readData, which requires a folder with a separate VCF for each scaffold.
Return to the command line by either opening a new terminal or by typing
q() into R.
## On your command line # Make new directory mkdir popgenome-vcf # compress and index the VCF bgzip input/snp.vcf # This step is not necessary if you downloaded the snp.vcf.gz file from the link above tabix -p vcf input/snp.vcf.gz bcftools view input/snp.vcf.gz scaffold_1 > popgenome-vcf/scaffold_1 bcftools view input/snp.vcf.gz scaffold_2 > popgenome-vcf/scaffold_2
You can now load the data in
R (Open with
library(PopGenome) # Load the data # MODIFY the path for the above created folder snp <- readData("popgenome-vcf", format = "VCF") # This is complex object, with several slots get.sum.data(snp) show.slots(snp) # You can access the different "slots" by using the "@" sign: firstname.lastname@example.org # Set populations pops <- get.individuals(snp)[] pop1 <- pops[grep("B\\.bam", pops)] pop2 <- pops[grep("b\\.bam",pops)] snp <- set.populations(snp, list(pop1, pop2)) snp@populations # check if it worked
Let’s calculate FST between the two populations and nucleotide diversity in each of the populations.
# Diversities and FST (by scaffold) snp <- F_ST.stats(snp) # this does the calculations and # adds the results to the appropriate slots # Print FST get.F_ST(snp) # each line is a scaffold snp@nucleotide.F_ST # Print diversities get.diversity(snp) get.diversity(snp)[] # pop1 (B) get.diversity(snp)[] # pop2 (b) email@example.com
Another useful tool is to do the calculations along a sliding window.
# Transform object into object divided by sliding window win_snp <- sliding.window.transform(snp, width = 10000, jump = 10000, type = 2, whole.data = FALSE) # Measurements per window win_snp <- F_ST.stats(win_snp) win_snp@nucleotide.F_ST firstname.lastname@example.org # A simple plot win_fst <- win_snp@nucleotide.F_ST[, 1] bb_div <- email@example.com[, 1] # diversity among B (bb = "big B") lb_div <- firstname.lastname@example.org[, 2] # diversity among b (lb = "little b") plot(1:length(win_fst), win_fst) plot(1:length(bb_div), bb_div) plot(1:length(lb_div), lb_div)
- Knowing that there is no recombination between B and b somewhere in the genome, why do you think there is high FST in one scaffold and not the other?
- What factors could make b have such low diversity in one of the scaffolds?