SPIRE v01 2023-09
Description of genome metadata bulk download file

Contact: anthony.fullam@embl.de and sebastian.schmidt@embl.de
For further questions and feedback please visit https://github.com/grp-bork/spire_contribute/issues

After downloading, uncompress the file "spire_v1_genome_metadata.tsv.gz", either by double-clicking the file or by typing "gunzip spire_v1_genome_metadata.tsv.gz" in a UNIX shell while in the appropriate path.
Afterwards, "spire_v1_genome_metadata.tsv" is a tab-separated file with one row per SPIRE MAG (only MAGs of medium+ quality are included) and the following columns:

genome_id: genome identifier
spire_cluster: genome cluster to which the current genome was assigned; this can be either a specI cluster as defined in proGenomes3 (progenomes.embl.de; "specI_v4_") or a de novo genome cluster based on 95% whole-genome average nucleotide identity ("spire_v1_095_")
spire_cluster_assignment: how the current genome was assigned to a cluster. Possible values are:
    "pg_v3_mapped_marker_gene": assignment to a proGenomes3 specI cluster based on a consensus mapping of specI marker gene sequences (performed using MAPseq, https://github.com/jfmrod/mapseq)
    "pg_v3_mapped_ANI_95": assignment to a proGenomes3 specI cluster based on whole-genome average nucleotide identity (>= 95%)
    "95_ANI": de novo average linkage clustering of unmapped genomes based on >= 95% whole-genome average nucleotide identity
genome_size: actual genome size (total number of nucleotides in all binned contigs for the current MAG)
genome_size_est: estimated real genome size based on (estimated) genome completeness and contamination, using the following formula:
    genome_size * (100 / completeness) * ((100 - contamination) / 100)
gs_est_ratio: ratio between "genome_size_est" and "genome_size"
n_contigs: total number of binned contigs
n50: binned contig N50
max_contig_length: maximum length among binned contigs
translation_table: (inferred) translation table, as per CheckM2 (https://github.com/chklovski/CheckM2)
completeness: CheckM2-estimated genome completeness
contamination: CheckM2-estimated genome contamination
drep: custom-computed dRep score
n_genes: total number of predicted ORFs (as per Prodigal, https://github.com/hyattpd/Prodigal)
gunc_taxlevel: taxonomic level with maximum clade separation score as per GUNC (see https://github.com/grp-bork/gunc and https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02393-0 for details)
clade_separation_score: GUNC clade separation score (CSS) at "gunc_taxlevel"
gunc_contamination: GUNC-estimated genome contamination
reference_representation_score: GUNC reference representation score (estimates confidence of the computed clade separation score and contamination based on how well the current genome is represented in the GUNC reference database used)
gunc_pass: does this genome pass the default GUNC quality filter (based on clade separation and reference representation scores; <= 2% contamination)?
gunc_pass_5: does this genome pass the relaxed GUNC quality filter (<= 5% contamination)?
classification: full taxonomic classification as per GTDB-Tk (https://github.com/Ecogenomics/GTDBTk)
domain, phylum, class, order, family, genus, species: (re-formatted) classification at individual taxonomic levels
red_value: relative evolutionary distance to reference as per GTDB-Tk

##############################################
Cluster Metadata

After downloading, uncompress the file "spire_v1_cluster_metadata.tsv.gz", either by double-clicking the file or by typing "gunzip spire_v1_cluster_metadata.tsv.gz" in a UNIX shell while in the appropriate path.
Afterwards, "spire_v1_cluster_metadata.tsv" is a tab-separated file with one row per genome cluster (only MAGs of medium+ quality are included) and the following columns:

spire_cluster: IDs of genome clusters included in SPIRE, spanning both specI clusters as defined in proGenomes3 (progenomes.embl.de) to which SPIRE MAGs were mapped (‘specI_v4_XXXXX’), and newly generated de novo clusters at 95% average nucleotide identity (‘spire_v1_095_XXXXXXXXX’)
size.spire: number of medium or high quality SPIRE MAGs in cluster
size.pg3: number of reference genomes in proGenomes3 (>0 only for specI clusters)
size.combined: total number of genomes in cluster
domain, phylum, class, order, family, genus, species: consensus classification (re-formatted) at individual taxonomic levels

##############################################
Microntology

A list of terms included in the initial version (v01) of the newly developed microntology. Terms are organised by ‘category’, i.e. the highest level of biome (e.g., aquatic, terrestrial, host-associated), physicochemical (e.g., temperature, pH, oxygen level) or human host (age group, birth term) descriptors for microbial samples. The file contains the following columns:

category: highest level term category (see above)
term: microntology terms, organised into a flat hierarchy
ontology_link: cross-links to terms in established ontologies (EnvO, UBERON)
pull_term: microntology annotations in SPIRE follow a ‘multiple tag’ logic where a given sample is annotated with a combination of terms, rather than with a single most descriptive term. As a consequence, some microntology terms are inherently cross-linked: e.g., a mangrove sample (microntology term ‘terrestrial:wetland:mangrove’) will automatically receive the further tags ‘aquatic:lentic’ and ‘aquatic:littoral’.
host_ncbi_taxonomy_id: cross-link to highest-level assignable host taxonomy ID in NCBI, where applicable
comment: additional descriptive definitions of individual terms

##############################################
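
Example: recomputing the genome size estimate

The "genome_size_est" formula given for the genome metadata file can be reproduced from the "genome_size", "completeness" and "contamination" columns. The following is a minimal Python sketch, not part of the SPIRE distribution. It assumes that the file has a header row using the column names listed above, that the three columns parse as plain numbers, and that completeness is greater than zero. The function names (estimate_genome_size, iter_genome_rows, recompute_estimates) are hypothetical and chosen for illustration only.

```python
import csv
import gzip


def estimate_genome_size(genome_size, completeness, contamination):
    """Apply the genome_size_est formula from the column description.

    completeness and contamination are percentages (0-100).
    """
    return genome_size * (100 / completeness) * ((100 - contamination) / 100)


def iter_genome_rows(path):
    """Yield one dict per row of a tab-separated file.

    Assumption: the first line is a header with the column names
    described in this document. Files ending in ".gz" are read directly.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def recompute_estimates(path):
    """Yield (genome_id, estimated size, estimated/actual ratio) per MAG."""
    for row in iter_genome_rows(path):
        size = float(row["genome_size"])
        est = estimate_genome_size(
            size,
            float(row["completeness"]),
            float(row["contamination"]),
        )
        yield row["genome_id"], est, est / size
```

Calling recompute_estimates("spire_v1_genome_metadata.tsv.gz") iterates over the compressed file without unpacking it first. The recomputed values can then be compared with the "genome_size_est" and "gs_est_ratio" columns, keeping in mind that the distributed values may be rounded.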