Riding the wave of genomics to investigate aquatic coliphage diversity and activity

Summary Bacteriophages infecting Escherichia coli (coliphages) have been used as a proxy for faecal matter and water quality from a variety of environments. However, the diversity of coliphages that is present in seawater remains largely unknown, with previous studies largely focusing on morphological diversity. Here, we isolated and characterized coliphages from three coastal locations in the United Kingdom and Poland. Comparative genomics and phylogenetic analysis of phage isolates facilitated the identification of putative new species within the genera Rb69virus and T5virus and a putative new genus within the subfamily Tunavirinae. Furthermore, genomic and proteomic analysis combined with host range analysis allowed the identification of a putative tail fibre that is likely responsible for the observed differences in host range of phages vB_Eco_mar003J3 and vB_Eco_mar004NP2.


Introduction 28
Bacteriophages are a key component of microbial communities playing important roles such as 29 increasing the virulence and driving the evolution of their bacterial hosts, and influencing major 30 biogeochemical cycles (see (1-3) for reviews). It is estimated that there are 10 31 viruses in the 31 biosphere with each millilitre of seawater containing millions of these viruses (1, 2) largely infecting 32 the numerically dominant bacterial genera Synechococcus, Prochlorococcus and SAR11 (5-11). 33 Culture-and metagenomics-based approaches have shed much light on their genetic diversity (12-34 16) including the description of several previously unknown phage groups that are widespread in the 35 environment (9, 10, [17][18][19]. 36 In the context of marine systems, bacteriophage infecting Escherichia coli, so-called coliphage, have 37 perhaps received less attention even though they have been widely studied as a proxy for drinking 38 water quality and the presence of faecal coliforms and enteric viruses (20-23). Thus, much is known 39 about how the use of different E. coli strains or growth media used can lead to variable estimates of 40 phage abundance (24-26) and this has resulted in global standards for using coliphages as a measure 41 of water quality (27). For assessment of water quality these standards rely on the use of E. coli C strains 42 derived from ATCC13706, which has been shown to detect increased titres over E. coli B and E. coli 43 K12 derivatives (26). A criticism of the use of coliphages as indicators of water quality has been the 44 reproduction of coliphages in the environment which will increase abundance estimates (28). Whilst 45 the consensus seems to be that coliphage replication is not a significant issue (24), more recent 46 research provides evidence that coliphages may well replicate in the environment (29). 47 Regarding the diversity of coliphages found in seawater, studies have largely focused on 48 morphological diversity (29-32), assessing the number and range of E. coli hosts they can infect. This 49 has shown that many coliphages have a broad host range, with detection of coliphages comprising 50 members of the Siphoviridae and Myoviridae families off the Californian (29) and Brazilian coasts (31) 51 but with Siphoviridae being the most frequently observed taxa (31). 52 Coliphages in general are one of the most sequenced phage types with ~450 complete phage genomes 53 within Genbank, isolated from a variety of sources including animal faeces (33-36), human faeces (37), 54 urine (38), clinical samples (39) river water (40), agricultural surface waters (41), lagoons (42), sewage 55 (43) and animal slurries (34). However, as alluded to above, much less is known about the genetic 56 diversity of coliphages in seawater. To begin to resolve this we isolated coliphages from three locations 57 in the UK and Poland and undertook genomic and proteomic characterisation of the isolated phages, 58 to provide insights into their phylogenetic position and functional potential. 59 between proteins was lowered by 5% on each run of ROARY and the analysis repeated until a number 127 of phylogenetic markers passed the filtering criteria, this was reached at a protein identity of 75%. At 128 this point 44 core-genes were identified, of which only 14 passed further filtering steps (Table S2). The 129 top nine markers as selected by the GET_PHYLOMARKERS pipeline were used for phylogenetic analysis 130 (47). 131 Phylogenetic analysis on the selected marker genes confirmed that vB_Eco_mar004NP2 and 132 vB_Eco_mar003J3 fall within the genus T5virus (Figure 2). Phage vB_Eco_mar004NP2 is a sister group 133 to that of phage SPC35 (HQ406778) and vB_Eco_mar003J3 a sister group to that of phage LVR16A 134 (MF681663) (Figure 2). Phage vB_Eco_mar004NP2 represents a new species within the genus T5virus, 135 as it has <95% ANI with any other phage within the genus (45). For phage vB_Eco_mar003J3, it is not 136 clear if the phage represents a new species. It has an ANI >95% with phages saus132, and paul149 137 which have recently been described as new species (52). However, these phages are not the closest 138 group based on a phylogenetic analysis ( Figure 2). When an ANI value of >97% is used then currently 139 defined species are more congruent with the observed phylogenetic analysis, suggesting 140 vB_Eco_mar003J3 is a novel species (Figure 2). Applying this threshold of 97% ANI across the entire 141 genus would maintain the current species and create a total of 23 species across the genus. 142 Tunavirinae 143 Phages vB_Eco_mar001J1, vB_Eco_mar002J2 and vB_Eco_swan01 had greatest nucleotide sequence 144 similarity to pSf-1 and SECphi27 which are members of the subfamily Tunavirinae. To classify the newly 145 isolated phages, a phylogenetic analysis was carried out using the gene encoding the large subunit 146 terminase that has previously been used to classify phages within the subfamily Tunavirinae by the 147 ICTV (53). The analysis included all current members of the subfamily Tunavirinae (April 2018). The 148 newly isolated phages vB_Eco_mar001J1, vB_Eco_mar002J2 and vB_Eco_swan01 form a clade with 149 phages pSf-1, SECphi27 and Esp2949-1 ( Figure S3). This clade is a sister to the clades that represent 150 the previously defined genera of KP36virus and TLSvirus, thus clearly placing these new phages within 151 the subfamily Tunavirinae ( Figure S3). 152 To further clarify the phylogeny of these phages, again a core-gene analysis of all members of the 153 subfamily Tunavirinae was carried out. Given these phage form part of a taxonomic sub-family, using 154 ROARY with similarity cut-off values of 90% resulted, unsurprisingly, in the detection of no core genes. 155 Therefore, an alternative method was used using an orthoMCL approach from within 156 GET_HOMOLOGUES software (54). OrthoMCL based analysis identified a core of only nine genes, 157 which were then filtered in the same manner as for the RB69virus and T5virus genera. A phylogeny 7 Phylogenetic analysis confirmed the previously defined genera within Tunavirinae, with the five 160 genera of Kp36virus, Roguevirus, Rtpvirus, T1virus and TLSvirus also supported by good bootstrap 161 support values (Figure 3). Furthermore, a clade which is sister to that of genus TLSvirus was identified 162 with good bootstrap support comprising vB_Eco_mar001J1, vB_Eco_mar002J2, vB_Eco_swan01, 163 SECphi27 (KC710998) and pSf-1 (NC_021331). Their clear separation from existing genera within the 164 subfamily suggests this clade is a new genus. The phages within this putative genus all share an ANI 165 >75% with other phages in the genus, compared to 60-70% ANI with phages in the other described 166 genera within the Tunavirinae. All phages within the putative genus have a conserved genome 167 organisation and share thirty orthologues. We propose that this clade represents a new genus and 168 should be named pSF1virus after pSF-1, the first representative isolate. Furthermore, we propose the 169 unclassified phage Esp2949-1 (NC_019509) is the sole representative of a new genus, as it doesn't 170 currently fit within currently defined genera. Phylogenetic analysis indicates that phages of the genus 171 TL1virus, TLSvirus, psF1virus all have a common ancestor, with Esp2949-1 ancestral to phages in the 172 genus TL1virus and psF1virus. (Figure 3). Comparative genomic analysis also supports this, with 173 Esp2949-1 having <70% ANI to phages of the genera TL1virus or TLSvirus, its closest relatives. Phages 174 within the putative genus psF1virus were further analysed to determine the number of species. Using 175 a cut-off of 95% or 97% ANI, the genus will contain three species vB_Eco_swan01 (SECphi27, 176 vB_Eco_swan01), vB_Eco_mar002J2 (vB_Eco_mar001J1, vB_Eco_mar002J2) and the orphan species classified the phages within subgroup B1 (55). Phage vB_Eco_mar005P1 was also observed to have a 189 polyhedral head, but with a long contractile tail, with tail fibres clearly observable which allows 190 classification within sub group A2 within the Myoviridae (55) ( Figure S4, Table 2). 191

Proteomic Characterisation 193
As with most phages the majority of the genes predicted within each genome encode for hypothetical 194 proteins with unknown function. In order to identify further structural proteins or proteins that may 195 be contained within the capsid, proteomic analysis of representative phages was carried out using 196 electrospray ionization mass spectrometry (ESI-MS/MS). The number of identified proteins per phage 197 was five, five, seven and eight for phages vB_Eco_mar005P1, vB_Eco_swan01, vB_Eco_mar003J3, and 198 vB_Eco_mar004NP2 respectively (Table 3) The burst size, latent period and eclipse period for representative phage isolates was also determined 223 (Table 2). There was considerable variation in these parameters across all isolates, with burst size 9 ranging from 31 (vB_Eco_mar005P1) to 192 (vB_Eco_mar004NP2) ( The host range of representative phage isolates was determined using a range of bacterial hosts via a 232 spot test assay (Table S4). Phylogenetic analysis highlighted that the isolated coliphages were often 233 closely related to phages that are known to infect other Enterobacteriaceae, including Klebsiella and 234 Salmonella (Figures 1, 2, 3). For this reason the host range of these phage was also tested against 235 other Enterobacteriace. Phage vB_Eco_mar005P1 a representative of the genus RB69virus was only 236 able to infect its host of isolation (E. coli MG1655), whereas phages of the genus T5virus and subfamily 237 Tunavirinae were capable of infecting between five and eight strains (Table S4). Whilst 238 vB_Eco_mar002J2 was found to infect the greatest number of strains (8), this was limited to strains of 239

Detection in viral metagenomes 242
The presence of these new coliphage species in viral metagenomes was investigated using existing 243 metagenomics databases. The Baltic virome dataset was chosen as it contains both DNA sequence 244 data and RNA expression data (56). The abundance of representative phage species was determined 245 by the stringent mapping of reads from the virome to representative genomes. Synechococcus phage 246 Syn9 was also included, as it has previously been demonstrated to be present in this dataset (56). The 247 overall coverage of each genome was low, with small numbers of reads mapping to each genome 248 ( Figure S4a). However, reads mapping to coliphage were found, although at far less abundance than 249 cyanophages Syn9 ( Figure S4a). We then searched for evidence of gene expression from these phages 250 using transcriptomic datasets. The majority of samples showed expression of cyanophage Syn9 genes, 251 as previously reported (56). In contrast, genes from coliphage NP2 and RB69 ( Figure S4b) were only 252 detected in samples GS852 and GS677, respectively. These samples, GS852 and GS677, were collected 253 from low salinity surface waters (56). The reads mapping to coliphages were further analysed by 254 Discussion 258 Using E.coli MG1655 we were able to isolate and characterise ten phages from coastal marine waters 259 and one from a freshwater pond. The titre of coliphages in all water samples was extremely low (range 260 0.0125 pfu ml -1 -0.28 pfu ml -1 ). This low abundance is lower than previous reports of coliphages that 261 are present at an order of 1 x 10 2 pfu ml in other coastal environments (57-59). This lower abundance 262 may well be linked to water quality, as faecal contamination is known to be linked to coliphage 263 abundance and/or the time of sampling. Only one sample point was collected, and previous work has 264 found there are distinct seasonal patterns in coliphage abundance (59). Despite this low abundance, 265 it was still possible to isolate coliphages to further characterise their genetic diversity, which was the 266 focus of this study. 267 Given the small number of phages isolated and sequenced, there was a surprising amount of genomic 268 diversity. Five species of coliphage were identified in the 10 phages isolated. The phages 269 vB_Eco_mar005P1, vB_Eco_mar006P2, vB_Eco_mar008P4 and vB_Eco_mar009P5 were identical, 270 with vB_Eco_mar007P3 only differing from the others by a single SNP. This similarity is probably due 271 to the enrichment method, which has enriched for a single phage that has then proliferated in the 272 enrichment and been re-isolated. Phages vB_Eco_mar001J1 and vB_Eco_mar002J2 also had identical 273 genome sequences despite being independently isolated, and represent a novel species. The 274 remaining phages vB_Eco_mar003J3, vB_Eco_mar004NP2, vB_Eco_swan01 were all unique and also 275 represent new species. 276 Phages infecting Escherichia account for ~7% of all phages sequenced to date. To discover a novel 277 genera from the sequencing of a small number of coliphages here further highlights the vast diversity 278 of phages present in the environment and how much more is to be discovered. To accurately place 279 phages in the context of current phage taxonomy, we identified core-genes and used the 280 GET_PHYLOMARKERS pipeline to select the most appropriate gene for phylogeny reconstruction that 281 do not show signs of recombination, and are thus likely to lead to inaccurate branch lengths (49). Our 282 phylogenetic analysis of phage genomes using selected marker genes was congruent with current 283 classifications of phage species. Some of these classifications are originally based on historical 284 phenotypic data such as phage RB69 which cannot recombine with phage T4 and was classified as a 285 separate species (60). Recently, this inability to recombine with phage T4 DNA was postulated to be 286 caused by the arabinosyl modification of DNA in RB69, likely caused by a novel glucosyltransferase 287 present in RB69 but not T4 (61). In this study, the gene thought to encode a putative 288 arabinosyltransferase (61), was found to be core to all members of the genus RB69virus. Whether the phage isolated in this study also glycosylate their DNA in a similar manner to RB69 remains to be 290 determined. However, the genes thought to be responsible for it are clearly a signature of this genus. 291 Whilst the phylogenetic analysis was congruent with currently defined species within the T5virus and 292 RB69virus genera, combining this phylogenetic analysis with ANI data demonstrated that using an ANI 293 value >95% was insufficient to delineate species that were congruent with the observed phylogeny 294 when additional phage from this study, and those present in GenBank but having undefined species 295 were added. Phages that form clearly distinct clades had an ANI >95% with phages outside of the 296 phylogenetic clades. Thus, suggesting 95% ANI is insufficient to discriminate between species for some 297 genera. We therefore suggest an ANI of 97% should be used to discriminate phage within the genera 298 T5virus and RB69virus, which has previously been used for the demarcation of phage species within 299 the genus Seuratvirus (62). We have begun to elucidate for the first time the genomic diversity of coliphage within seawater, 347 identifying phages that represent several novel taxa, further expanding the diversity of phages that 348 are known to infect E. coli. Furthermore, the analysis and identification of core-genes and selection of 349 genes suitable for phylogenetic analysis provides a framework for the future classification of phages 350 in the genera RB69virus, T5virus and subfamily Tunanvirinae. We further suggest that an ANI of >95% 351 is not suitable for the delineation of species within the genera RB69virus and T5virus and that a value 352 of >97% ANI should be used. Characterisation of phage replication parameters and host range further 353 reinforces that morphologically similar phage can have diverse replication strategies and host ranges.
Whilst we are cautious about the detection of coliphage transcripts in seawater metatranscriptomes, 355 the most parsimonious explanation is that coliphage are actively replicating, an observation that 356 certainly warrants further investigation. 357 358

Materials and Methods 359
Escherichia coli MG1655 was used as the host for both phage isolation and phage characterisation 360 work. E .coli MG1655 was cultured in LB broth at 37°C with shaking (200 rpm). Seawater samples were 361 collected from UK and Polish coastal waters (see Table 1), filtered through a 0.22 µm pore-size 362 polycarbonate filter (Sarstedt) and stored at 4°C prior to use in plaque assays. Plaque assays were 363 undertaken within 24 hr of collecting these samples. Phages were initially isolated and enumerated 364 using a simple single layer plaque assay (67). However, where this was unsuccessful a modified plaque 365 assay was used that allowed a greater volume of water to be added. Briefly, filtered seawater was 366 mixed with CaCl2 to a final concentration of 1 mM followed by addition of E. coli MG1655 cells at a 367 1:20 ratio and incubating the mixture at room temperature for 5 minutes. Subsequently, samples were 368 mixed with molten LB agar at a 1:1 ratio, final agar concentration 0.5% (w/v). Agar plates were 369 incubated overnight at 37°C and checked for the presence of plaques. For samples in which no 370 coliphage were detected an enrichment procedure was carried out. Briefly, 20 mL filtered seawater 371 was mixed with 20 mL LB broth and 1 mL E. coli MG1655 (OD600=~0.3 i.e. mid-exponential phase) and 372 incubated overnight at 37°C, followed by filtration through a 0.22 µm pore-size filter. Phages from this 373 enriched sample were then isolated using the standard plaque assay procedure. Three rounds of 374 plaque purification were used to obtain clonal phage isolates (67) . 375

Genome Sequencing 376
Phage DNA was prepared using a previously established method (68). DNA was quantified using Qubit 377 and 1 ng DNA used as input for NexteraXT library preparation following the manufacturer's 378 instructions. Sequencing was carried out using a MiSeq platform with V2 (2 x250 bp) chemistry. Fastq 379 files were trimmed with Sickle v1, using default parameters (69). Genome assembly used SPAdes v3.7 380 with the careful option (70). Reads were then mapped back against the resulting contig with BWA 381 MEM v0.7.12 (71) and SAM and BAM files manipulated with SAMtools v1.6 to determine the average 382 coverage of each contig (71). If the coverage exceeded 100x then the reads were subsampled and the 383 assembly process repeated, as high coverage is known to impede assembly (68). Phage genomes were 384 then annotated with Prokka using a custom database of all phage genomes that had previously been 385 extracted from Genbank (72). Further annotation was carried out using the pVOG database to 386 annotate any proteins that fall within current pVOGS using hmmscan (73, 74). Raw sequence data and 387 assembled genomes were deposited in the ENA under the project accession number PRJEB28824 A MASH database was constructed of all complete bacteriophage genomes available at the time of 390 analysis (~ 8500, April 2018) using the following mash v2 settings " -s 1000" (75). This database was 391 then used to identify related genomes based on MASH distance which has previously been shown to 392 be equivalent to ANI (75). Phage genomes that were found to be similar were re-annotated with 393 Prokka to ensure consistent gene calling between genomes for comparative analysis (72). Core 394 genome analysis was carried out with ROARY using "--e --mafft -p 32 -i 90" as a starting point for 395 analysis (76). These parameters were adjusted as detailed in the text. The optimal phylogenetic 396 markers were determined using the GET_PHYLOMARKERS pipeline, with the following settings "-R1 -397 t DNA" (47). Average nucleotide identity was calculated using autoANI.pl (77). Phylogenetic analysis 398 was carried out using IQ-TREE (78), with models of evolution selected using modeltest (79); trees were 399 visualised in ITOL (80). 400

One-step growth experiments 401
Phage growth parameters (burst size, eclipse and latent period) were determined by performing one-402 step growth experiments as described by Hyman and Abedon (81), with free phages being removed 403 from the culture by pelleting the host cells via centrifugation at 10,000 g for 1 min, removing the 404 supernatant and resuspending cells in fresh medium (81). Three independent replicates were carried 405 out for each experiment. 406

TEM 407
Representative phages, as determined from genome sequencing, were imaged using a Transmission 408 electron microscope (TEM) as follows: 10 µl high titre phage stock was added to a glow discharged 409 formvar copper grid (200 mesh), left for 2 mins, wicked off and 10 µl water added to wash the grid 410 prior to being wicked off with filter paper. 10 µl 2% (w/v) uranyl acetate stain was added to the grid 411 and left for 30 secs, prior to its removal. The grid was air dried before imaging using a JEOL JEM-1400 412 TEM with an accelerating voltage of 100kV. Digital images were collected with a Megaview III digital 413 camera using iTEM software. Phage images were processed in ImageJ using the measure tool and the 414 scale bar present on each image to obtain phage particle size (82). Measurements are the average of 415 at least 10 phage particles. 416

Preparation of viral proteomes for nanoLC-MS/MS and data analysis 417
Prior to proteomics high titre phage stocks were purified using CsCl density gradient centrifugation at 418 35,000 g for 2 hrs at 4 °C. Subsequently, 30 µl concentrated phage was added to 10 µl NuPAGE LDS 4X 419 sample buffer (Invitrogen) heated for 5 min at 95°C and analysed by SDS-PAGE as described (83). with iodoacetamide and trypsin (Roche) proteolysis was performed prior to tryptic peptide extraction 422 (83). Samples were separated and analysed by means of a nanoLC-ESI-MS/MS using an Ultimate 3000 423 LC system (Dionex-LC Packings) coupled to an Orbitrap Fusion mass spectrometer (Thermo Scientific) 424 with a 60 minute LC separation on a 25 cm column and settings as described previously (83). Compiled 425 MS/MS spectra were processed using the MaxQuant software package (version 1.5.5.1) for shotgun 426 proteomics (84). Default parameters were used to identify proteins (unless specified below), searching 427 an in-house-generated database derived from the translation of phage genomes. Firstly, a six reading 428 frame translation of the genome with a minimum coding domain sequence (CDS) cut-off of 30 amino 429 acids (i.e. stop-to-stop) was used to search for tryptic peptides. Second, the search space was reduced 430 by using a database containing only CDS detected in the first database search, again, looking for tryptic 431 peptides. Finally, the reduced CDS database was also searched using the N-terminus semi-tryptic 432 digest setting to find the protein N-terminus. Analysis was completed using Perseus software version 433 1.6.0.7 (85). All detected peptides from all three analyses are compiled in Supplementary Table S5. 434 Only proteins detected with two or more non-redundant peptides were considered.    Table S1. Core-genes, ANI and genes used for phylogenetic analysis of phages within the genus 679 RB69virus. All phages were re-annotated to ensure consistent gene calling. ANI was calculated using 680 autoANI. 681 682   Table S2. Core-genes, ANI, and genes used for phylogenetic analysis of phages within the genus 683 T5virus. All phages were re-annotated to ensure consistent gene calling. ANI was calculated using 684 autoANI. 685 686 Table S3. Core-genes, ANI, and genes used for phylogenetic analysis of phages within the subfamily 687 Tunavirinae. ANI was calculated using autoANI. 688 689 690