Unravelling the Identity, Metabolic Potential and Global Biogeography of the Atmospheric Methane‐Oxidizing Upland Soil Cluster α

Summary Understanding of global methane sources and sinks is a prerequisite for the design of strategies to counteract global warming. Microbial methane oxidation in soils represents the largest biological sink for atmospheric methane. However, still very little is known about the identity, metabolic properties and distribution of the microbial group proposed to be responsible for most of this uptake, the uncultivated upland soil cluster α (USCα). Here, we reconstructed a draft genome of USCα from a combination of targeted cell sorting and metagenomes from forest soil, providing the first insights into its metabolic potential and environmental adaptation strategies. The 16S rRNA gene sequence recovered was distinctive and suggests this crucial group as a new genus within the Beijerinckiaceae, close to Methylocapsa. Application of a fluorescently labelled suicide substrate for the particulate methane monooxygenase enzyme (pMMO) coupled to 16S rRNA fluorescence in situ hybridisation (FISH) allowed for the first time a direct link of the high‐affinity activity of methane oxidation to USCα cells in situ. Analysis of the global biogeography of this group further revealed its presence in previously unrecognized habitats, such as subterranean and volcanic biofilm environments, indicating a potential role of these environments in the biological sink for atmospheric methane.

assembly using Minimus2 (Sommer et al., 2007) in directional mode with the minimetagenomes as subjects and each metagenome assembly as references. Minimum overlap size was set at 500 bp, with a 94% sequence identity cutoff and a maximum of 20 bp divergence allowed at the contig ends. Thereby, the median contig size could be increased more than four-fold for a total subset of 972 potentially USCα enriched original metagenome contigs. This included a total subset of 384 metagenome contigs, which were originally shorter than 1 kb, but reached lengths suitable for subsequent binning steps after merging. Furthermore, a total of 229 additional contigs could be added to the assembly, which were shown to be present in the metagenomes by read mappings, but not successfully reconstructed in the original metagenome assemblies due to low coverage and high complexity. The resulting SCG-merged metagenomes were then similarly merged, using MFS12 as subject and MFS3 as reference.
Contig classification and quantification: 16S and 23S rRNA gene sequences were predicted using rnammer v1.2 (Lagesen et al., 2007) and classified against the SINA database using the least common ancestor approach implemented by SINA (Pruesse et al., 2012). Total proteins were predicted using prodigal v.2.6.3 (Hyatt et al., 2010) and universal single copy marker genes were extracted using fetchMG (Mende et al., 2013). Contigs were then classified, both on total protein and on universal marker level, by BLASTing the respective protein sequences against the NCBI nr database, associating the hits of each protein directly with the respective contig, sorting the results based on alignment score and finally submitting this information to the blast2lca script included with MEGAN6 (Huson and Weber, 2013) distributions. Final taxonomic assignments were selected hierarchically: Whenever possible, rRNA based classifications were preferred over protein-based classifications. Otherwise, marker based classifications were preferred over total protein based classifications. Furthermore, contigs were quantified by mapping each read dataset individually against the final merged assembly using bwa (Li and Durbin, 2010) and calculating the "outlier pileup coverage" using bamm (http://ecogenomics.github.io/BamM/).

Binning:
In order to increase binning specificity, the assembly was again partitioned into separate fractions, analoguous to the previous read partitioning, based on relative contig abundances in datsets MFS1 and MFS2 (MFS12) compared to MFS3_1 and MFS3_2 (MFS3). This was considered more specific than the original read partitioning step, as it is based on greater sequence context. Contig partitioning resulted in nine fractions, ranging from MFS12low_MFS3low to MFS12high_MFS3high. Each partition was then binned with MaxBin v2.2.3 (Wu et al., 2016), using the contig coverage information of each read dataset in order to enable a differential coverage based approach.
Analyses and processing of metagenomic bins: Estimations of bin completeness and potential contamination were obtained using the "lineage_wf" workflow of checkM v1.0.4 (Parks et al., 2015). All bins were further "purified" using a two-step approach. First, the predominant taxa were determined for each bin on domain, phylum, class, order and family level, based on the previously calculated, SINA-and blast2lca-based, hierarchical contig classifications. However, for this purpose, classification confidence values produced by blast2lca were also considered: Taxonomic assignments with confidence values below 50 were ignored and assumed "unassigned" instead. All contigs that were confidently classified to a taxon other than "unassigned" and differed from the respective bin's predominant taxon on any level, were removed from the bin. Second, the remaining contigs of each bin were filtered using a z-score based differential coverage approach previously described (Vollmers et al., 2017a) using a custom python script (https://github.com/jvollme/bin_polisher). Contigs displaying z-score differences of more than 2 between any read datasets were removed.
Identification and reassembly of USCα among metagenome bins: Processed bins were screened for presence of the USCα fosmid sequence using BLAST. Hits were retained if they showed a sequence identity >99.9% over >90% of the alignable subject contig length (but allowing for overhangs extending beyond the fosmid boundaries). Reads corresponding to contigs of a potential USCα bin were extracted from metagenome read mappings using bamm, and reassembled using spades v.3.10 with "careful" mode and automatic k-mer selection. Contigs shorter than 1 kb were removed from the final assembly.
Four contigs corresponding to the USCα fosmid could be identified (Suppl. Fig. S4). The majority (~90%) of the USCα fosmid sequence (Ricke et al., 2005), including a partial pmoC gene, could be reconstructed, and even extended, by only two contigs found exclusively in bin "007" of metagenome fraction "MFS12low_MFS3mid", which was classified as closely related to the genus Methylocapsa within the Beijerinckiaceae (Suppl. Fig. S2 and Suppl. Table 1) and subsequently designated as "USCα genomic bin". This metagenome fraction also contained two almost identical contigs, mid_NODE_24546 and mergemg_128398, encoding the remaining USCα pmoCAB gene cluster, but these were not unambiguously assigned to any specific bin, most likely due to the dissemination of related pmo genes among phylogenetically diverse organisms (Gilbert et al., 2000). However, based on best BLAST hits against the NCBI nr database, both contigs were found to be most closely related to reference sequences of the genus Methylocapsa, thereby matching the taxonomic assignment of the USCα genomic bin. The fact that the pmo gene cluster sequence was identical between both contigs, with differences only occurring at the extreme contig ends, may indicate strain variations in the exact genomic location of this feature within USCα genomes. Therefore, the relative abundance profiles of the USCα pmoCAB containing contigs were tested for compatibility with the USCα genomic bin, by temporarily adding them and performing the z-score based differential coverage filtering approach already used during preliminary bin processing (this study and Vollmers et al, 2017a). Only one of the contigs, mergemg_128398, passed this filtering criterion, thereby proving a matching coverage profile across all metagenomic samples. As a result, and in consideration of the matching taxonomic assignments and co-localization within the known USCα fosmid sequence, this contig was transferred to the USCα genomic bin.

Multilocus sequence analyses:
All reference comparison genomes were downloaded from NCBI. In order to maximize the available sequence information for phylogenetic analyses, a custom MLSA workflow was performed: Potential homologs between all reference genomes were detected using the bidirectional BLAST approach implemented by proteinortho5 (Lechner et al., 2011) with the following settings: "-ident=25 -cov=60 -e=1e-8 --selfblast". Single copy core genes were identified based on the proteinortho results and aligned using muscle (Edgar, 2004). In order to remove unalignable N-and C-terminal overhangs, as well as reduce the influence of spurious ORF-calling, each core gene alignment was cropped to the region between the first and last alignment positions that contained sequence information for all aligned gene products. The cropped alignments of all core genes were concatenated and any remaining unalignable regions were filtered using Gblocks (Castresana, 2000) with the following settings: '-t=p -d=n -b3=10 -b4=5 -b5=a' and the '-b2' argument set to 50% of the number of comparison genomes. The complete workflow was implemented in the form of a custom python script (https://github.com/jvollme/PO_2_MLSA). Phylogenetic clustering of the resulting alignment was performed using fasttree2 (Price et al., 2010) in double precision mode.

Percentage of conserved proteins (POCP) Analyses:
In order to better match the criteria described by Qin et al (Qin et al., 2014), proteinortho-based homolog detection was repeated with the following settings: "-ident=40 -cov=50 -e=1e-5selfblast". Pairwise POCP values were calculated from the proteinortho results using a custom python script (https://github.com/jvollme/po2pocp).   Legend: Supplementary Figure S4: Alignment of metagenomic contigs against the USCα fosmid reference CT005232 (Ricke et al. 2014), prior to reassembly of the USCα genomic bin. Annotated ORFs are represented as arrows, colorcoded as indicated by the legend included on the lower left. Subsequences which were aligned between contigs are shown connected by colored blocks, which are colorcoded based on sequence identity. The USCα Fosmid reference could be almost completely reconstructed by only three contigs (mergemg123_101251, mergemg123_100323 & mergemg123_128398) showing >99.9% sequence identity over the complete respective alignment length as well as compatible coverage profiles. In addition, the fosmid reference sequence context was expanded by ~ 38 kb by metagenome contig mergemg123_100323. An similar expansion of the 5'-end of the fosmid reference was not possible, due to the presence of a transposase gene, The USCα signature pmoBAC genecluster was found to be encoded contig mergemg123_128398 as well as contig mid_NODE_24546, sharing 100% sequence identity over the gene cluster but differing at the extreme contig ends, thereby indicating potential strain heterologies concerning the exact genomic location of this gene cluster. However, in contrast to contig mergemg123_128398, the coverage profile of contig mid_NODE_24546 did not prove compatible with the fosmid reference and the respective genomic bin, therefore this contig was excluded from the metagenomic bin.