VIPERA/template.qmd at main · PathoGenOmics-Lab/VIPERA · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
---
title: "VIPERA report: `r params$name`"
subtitle: "Workflow version: `r params$workflow_version`"
date: last-modified
date-format: "YYYY-MM-DD"
title-block-banner: true
format:
  html:
    page-layout: article
    embed-resources: true
    smooth-scroll: true
    theme: cosmo
    toc: true
    toc-expand: true
    toc-location: left
    toc-title: Summary
    number-sections: true
css: __CSSPLACEHOLDER__
editor: source
params:
  ufboot_reps:
  shalrt_reps:
  min_ivar_freq:
  workflow_version:
  use_bionj:
  cor_method:
  div:
  demix:
  tree:
  tempest:
  SNV:
  SNV_s:
  evo:
  div_value:
  panel:
  volcano:
  tree_ml:
  fig_cor_snp:
  stats_lm:
  stats_ml:
  table:
  sum_nv:
  heat_tab:
  omega_plot:
  name:
  freyja_ts:
output-file: report.html
---
<head>
  <link rel="preconnect" href="https://fonts.googleapis.com">
  <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
  <link href="https://fonts.googleapis.com/css2?family=Noto+Sans&display=swap" rel="stylesheet">
</head>

```{r setup, echo = F, message = F, include = F}
display.num <- function(n, digits) {
  format(n, digits = digits, nsmall = digits)
}
```

```{r read, echo = F, message = F, include = F}
library(jsonlite)
library(gt)
library(tidyr)
library(dplyr)
library(heatmaply)
library(readr)
library(tibble)

# Diversity
div_values <- fromJSON(params$div_value)

# Temporal signal and context tree
stats <- append(
  fromJSON(params$stats_lm),
  fromJSON(params$stats_ml)
)
correlation <- stats[["r2"]]
sub_rate <- stats[["sub_rate"]]
p_value_lm <- stats[["pvalue"]]

# NV counts
nv.counts <- fromJSON(params$sum_nv)
n_SNV <- nv.counts[["SNV"]]
n_INDELS <- nv.counts[["INDELS"]]

# Summary table
table <- read.csv(params$table)
n.samples <- table %>% pull(Sample) %>% unique() %>% length()

# Heatmap
cor.mat <- read_csv(params$heat_tab) %>%
  column_to_rownames("...1") %>%
  as.matrix()
if (params$cor_method == "pearson") {
  cor.method.name <- "Pearson's"
} else if (params$cor_method == "spearman") {
  cor.method.name <- "Spearman's"
} else if (params$cor_method == "kendall") {
  cor.method.name <- "Kendall's"
} else {
  cor.method.name <- "Undetermined"
}

# Distance tree
if (params$use_bionj) {
  dist.tree.algo <- "BIONJ (modified neighbor-joining)"
} else {
  dist.tree.algo <- "neighbor-joining (NJ)"
}

# Freyja timestamp
freyja.timestamp <- read_file(params$freyja_ts) %>%
  as.POSIXct(., format = "%m_%d_%Y-%H-%M") %>%
  strftime(., format = "%Y-%m-%d at %H:%M")
```

## Summary of the target samples dataset

```{r tbl-summary, echo = F,message = F}
#| tbl-cap: Target dataset summary

table %>%
  mutate(
    DeltaTime = difftime(
      as.Date(Collection_Date),
      min(as.Date(Collection_Date), na.rm = TRUE), units = "days"
    )
  ) %>%
  gt %>%
  cols_label(
    Sample = "Sample",
    Collection_Date = "Collection date",
    Lineage = "Lineage",
    DeltaTime = "Days after first sampling"
  ) %>%
  tab_style(
    style = list(
      cell_text(weight = "bold")
    ),
    location = cells_column_labels()
  ) %>%
  cols_align(
    align = "center",
    columns = everything()
  ) %>%
  opt_table_font(
    font = google_font("Noto Sans")
  )
```

## Evidence for single, serially-sampled infection

### Lineage admixture

The lineage admixture for each sample has been estimated
using [Freyja](https://github.com/andersen-lab/Freyja) (@fig-demix).

![Estimated lineage admixture of each sample (Freyja barcode timestamp: `r freyja.timestamp`).
Samples in the X-axis are ordered chronologically, from more ancient to newer.](`r params$demix`){#fig-demix}

### Phylogenetic reconstruction

A maximum likelihood tree of the target and context samples has been
built using [IQTREE](http://www.iqtree.org/).
The target samples `r stats[["monophyly"]]` monophyletic. The clade
that contains all $`r n.samples`$ target samples has $`r stats[["clade_tips"]]`$
tips ($`r round(100 * n.samples / stats[["clade_tips"]], 1)`$% targets) and is supported by a UFBoot score of
$`r stats[["boot"]]`$% and a SH-aLRT score of $`r stats[["alrt"]]`$% (@fig-tree_ml).

![Maximum-likelihood phylogeny with $`r params$ufboot_reps`$ UFBoot
and $`r params$shalrt_reps`$ SH-aLRT support replicates of the
target dataset and its context samples. The clade that contains the target
samples is squared.](`r params$tree_ml`){#fig-tree_ml}

### Nucleotide diversity comparison

Nucleotide diversity (π) has been calculated for $`r div_values[["boot.reps"]]`$ random
subsets of size $`r div_values[["sample.size"]]`$ drawn from the context dataset.
A Shapiro–Wilk test indicates the distribution is `r div_values[["norm.text"]]`
normal (p-value of $`r div_values[["normal.pvalue"]]`$).

The nucleotide diversity of the target samples is $`r display.num(div_values[["diversity"]], 3)`$
(vertical line in @fig-div). Assuming the context subsets are independent, the
`r div_values[["type.test"]]` test gives a p-value of $`r display.num(div_values[["p.value"]], 3)`$
for observing a nucleotide diversity as low as that of the target samples.

![Distribution of nucleotide diversity (π) calculated from $`r div_values[["boot.reps"]]`$
random subsets of $`r div_values[["sample.size"]]`$ sequences from the context dataset.
The shaded area shows the empirical density; the overlaid curve is the normal distribution
with the same mean and standard deviation.
The vertical line marks the π value of the target samples.](`r params$div`){#fig-div}

## Evolutionary trajectory of the serially-sampled SARS-CoV-2 infection

### Number of polymorphic sites

Sites with minor allele frequency $> `r params$min_ivar_freq`$ are considered polymorphic.
The linear association between the collection date of the samples and the number of
polymorphic sites has an $R^2$ of $`r display.num(nv.counts[["r2"]], 4)`$ and a p-value of
$`r nv.counts[["value"]]`$ (@fig-fig_cor_snp).

![Number of polymorphic sites along time. The
blue line shows the linear model fit.](`r params$fig_cor_snp`){#fig-fig_cor_snp}

### Description of intra-host nucleotide variants

A total of $`r n_SNV`$ different nucleotide variants (NV) and $`r n_INDELS`$
insertions and deletions (indels) have been detected along the genome (@fig-SNV).

::: {.panel-tabset}

## Whole genome

![Summary of the intra-host accumulation of nucleotide variants,
using the reconstructed dataset ancestor as reference. A) Nucleotide
variants per site along the SARS-CoV-2 genome. Relative abundance of variants is calculated
with a sliding window of width $`r nv.counts[["window"]]`$ nucleotides and a step of
$`r nv.counts[["step"]]`$. Labels indicate the coding regions of the non structural
proteins (NSP) within ORF1ab. B) Genome variation along the genome for each sample.
The Y-axis displays samples in chronological order, with the earliest collection date
at the bottom, and the latest, at the top.](`r params$SNV`){#fig-SNV}

## Spike ORF

![Summary of the intra-host accumulation of nucleotide variants
in the spike sequence, using the reconstructed dataset ancestor as reference. A) Nucleotide
variants per site along the S gene. Relative abundance of variants is calculated
with a sliding window of width $`r nv.counts[["window"]]`$ nucleotides and a step of
$`r nv.counts[["step"]]`$. B) Genome variation along the S gene for each sample.
The Y-axis displays samples in chronological order, with the earliest collection date
at the bottom, and the latest, at the top.](`r params$SNV_s`){#fig-SNV_s}

:::

### Temporal signal of the intra-host mutations

`r cor.method.name` correlation of the allele frequency of each variant with the time since the
initial sampling has been calculated (@fig-volcano).

![`r cor.method.name` correlation coefficients and adjusted p-values of
allele frequencies with time. The dashed line indicates adjusted (false discovery rate) $p = 0.05$.
Labeled dots represent nucleotide variants whose allele frequency is correlated with time
(adjusted $p < 0.05$).](`r params$volcano`){#fig-volcano}

Significantly correlated nucleotide variants are described in more detail in @fig-panel.

![Time series of relative allele frequencies. Nucleotide variants with a significant
correlation with time and sites with more than two possible states are included. Each
subplot depicts the progression of the allele frequencies in time for a given genome
position; lines connecting points indicate data from contiguous time points, while
unconnected points denote intervals with missing data.](`r params$panel`){#fig-panel}

A `r dist.tree.algo` tree has been constructed using pairwise distances
between target samples (@fig-tree), based on the allele frequencies measured from
read mappings.

![Distance tree built with the `r dist.tree.algo` method based on the pairwise allele
frequency-weighted distances.](`r params$tree`){#fig-tree}

To estimate the evolutionary rate, root-to-tip distances measured on the previous
tree (@fig-tree) have been correlated with time, obtaining a $R^2$ of
$`r display.num(correlation, 4)`$ and a p-value of $`r p_value_lm`$. The estimated evolutionary
rate is $`r display.num(sub_rate, 2)`$ changes per year (@fig-tempest).

![Scatterplot depicting the relationship between root-to-tip
distances and the number of days passed since the first sample. The solid
line represents the linear model fit.](`r params$tempest`){#fig-tempest}

### Correlation between alternative alleles

`r cor.method.name` correlation coefficients of allele frequencies between pairs
of variants were calculated to detect interactions between them (@fig-heatmap).
The heatmap is interactive and allows zooming in on specific regions.

```{r fig-heatmap, echo = F, message = F, warning = F, fig.align = 'center'}
#| fig-cap: "Interactive heatmap with hierarchical clustering of the pairwise correlation coefficients between the time series of allele frequencies in the case study."

# Calculate dendrograms using data with no NAs
cor.mat.clustering <- cor.mat
cor.mat.clustering[is.na(cor.mat.clustering)] <- 0
dist.rows <- dist(cor.mat.clustering)
dend.rows <- hclust(dist.rows)
dend.cols <- hclust(dist(t(cor.mat.clustering)))

# Display custom dendrograms over real data
heatmaply_cor(
  cor.mat,
  Rowv = as.dendrogram(dend.rows),
  Colv = as.dendrogram(dend.cols),
  grid_gap = 0.25,
  fontsize_row = 7,
  fontsize_col = 7,
  column_text_angle = 45,
  label_names = c("row", "column", "coefficient"),
  width  = 600,
  height = 600
)
```

### Non-synonymous and synonymous substitution rate over time

To track selection footprints, the substitutions per synonymous site ($dS$) and
per non-synonymous site ($dN$) for each sample with respect to the reconstructed
ancestral sequence have been calculated (@fig-evo), as well as their ratio ($\omega = dN/dS$; @fig-omega).

::: {.panel-tabset}

## $dN$ and $dS$
![Time series of $dN$ and $dS$. Each point corresponds to a different sample, sorted in chronological order.](`r params$evo`){#fig-evo}

## $\omega$ ($dN/dS$)

![Time series of $\omega$ ($dN/dS$). Each point corresponds to a different sample, sorted in chronological order.](`r params$omega_plot`){#fig-omega}

:::