SPIRE v01
2023-09
Description of genome metadata bulk download file
Contact: anthony.fullam@embl.de and sebastian.schmidt@embl.de
For further questions and feedback please visit https://github.com/grp-bork/spire_contribute/issues
After downloading, uncompress the file "spire_v1_genome_metadata.tsv.gz", either by double clicking the file or by typing "gunzip spire_v1_genome_metadata.tsv.gz" in a UNIX shell while in the appropriate path. Afterwards, "spire_v1_genome_metadata.tsv" is a tab-separated file with one row per SPIRE MAG (only MAGs of medium+ quality are included) and the following columns:
genome_id:
genome identifier.
spire_cluster:
genome cluster to which current genome was assigned; this can either be a specI cluster as defined in proGenomes3 (progenomes.embl.de; "specI_v4_") or a de novo genome cluster based on 95% whole-genome average nucleotide identity ("spire_v1_095_").
spire_cluster_assignment:
how was the current genome assigned to a cluster? possible values are:
      "pg_v3_mapped_marker_gene": assignment to a proGenomes3 specI cluster based on a consensus mapping of specI marker gene sequences (performed using MAPseq, https://github.com/jfmrod/mapseq)
      "pg_v3_mapped_ANI_95": assignment to a proGenomes3 specI cluster based on whole-genome average nucleotide identity (>= 95%)
      "95_ANI": de novo average linkage clustering of unmapped genomes based on >= 95% whole-genome average nucleotide identity
genome_size:
actual genome size (total number of nucleotides in all binned contigs for the current MAG)
genome_size_est:
estimated real genome size based on (estimated) genome completeness and contamination, using the following formula:
genome_size * (100 / completeness) * ((100 - contamination) / 100)
gs_est_ratio:
ratio between "genome_size_est" and "genome_size"
n_contigs:
total number of binned contigs
n50:
binned contig N50
max_contig_length:
maximum length among binned contigs
translation_table:
(inferred) translation table, as per CheckM2 (https://github.com/chklovski/CheckM2)
completeness:
CheckM2-estimated genome completeness
contamination:
CheckM2-estimated genome contamination
drep:
custom-computed dRep score.
n_genes:
total number of predicted ORFs (as per prodigal, https://github.com/hyattpd/Prodigal)
gunc_taxlevel:
taxonomic level with maximum clade separation score as per GUNC (see https://github.com/grp-bork/gunc and https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02393-0 for details)
clade_separation_score:
GUNC clade separation score (CSS) at "gunc_taxlevel"
gunc_contamination:
GUNC-estimated genome contamination
reference_representation_score:
GUNC reference representation score (estimates confidence of computed clade separation score and contamination based on how well the current genome is represented in the used GUNC reference database)
gunc_pass:
does this genome pass the default GUNC quality filter (based on clade separation and reference representation scores; <= 2% contamination)?
gunc_pass_5:
does this genome pass the relaxed GUNC quality filter (<= 5% contamination)?
classification:
full taxonomic classification as per GTDB-Tk (https://github.com/Ecogenomics/GTDBTk)
domain, phylum, class, order, family, genus, species:
(re-formatted) classification at individual taxonomic levels
red_value:
relative evolutionary distance to reference as per GTDB-Tk
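
The "genome_size_est" formula above can be sketched in Python as follows. This is a minimal illustration, not the code used to build SPIRE: the function names and the example row are made up, and the direction of "gs_est_ratio" (estimate divided by actual size) is an assumption based on the order in which the two columns are named.

```python
import csv
import io


def estimate_genome_size(genome_size, completeness, contamination):
    """genome_size * (100 / completeness) * ((100 - contamination) / 100)"""
    return genome_size * (100.0 / completeness) * ((100.0 - contamination) / 100.0)


def read_genome_metadata(handle):
    """Yield one dict per MAG from an open text handle of the tab-separated file."""
    yield from csv.DictReader(handle, delimiter="\t")


# Made-up example row (NOT real SPIRE data), reduced to the columns used here.
example_tsv = (
    "genome_id\tgenome_size\tcompleteness\tcontamination\n"
    "example_mag_1\t2000000\t80.0\t5.0\n"
)

rows = list(read_genome_metadata(io.StringIO(example_tsv)))
row = rows[0]
size_est = estimate_genome_size(
    int(row["genome_size"]),
    float(row["completeness"]),
    float(row["contamination"]),
)
# Assumed direction of the ratio: genome_size_est / genome_size.
ratio = size_est / int(row["genome_size"])
print(row["genome_id"], round(size_est), round(ratio, 4))
```

To read the downloaded file without uncompressing it first, the same reader can be given a handle from gzip.open("spire_v1_genome_metadata.tsv.gz", "rt", newline="") in place of the in-memory example.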

##############################################
Cluster Metadata
After downloading, uncompress the file "spire_v1_cluster_metadata.tsv.gz", either by double clicking the file or by typing "gunzip spire_v1_cluster_metadata.tsv.gz" in a UNIX shell while in the appropriate path. Afterwards, "spire_v1_cluster_metadata.tsv" is a tab-separated file with one row per SPIRE genome cluster (only MAGs of medium+ quality are included) and the following columns:
spire_cluster: IDs of genome clusters included in SPIRE, spanning both specI clusters as defined in proGenomes3 (progenomes.embl.de) to which SPIRE MAGs were mapped (‘specI_v4_XXXXX’), and newly generated de novo clusters at 95% average nucleotide identity (‘spire_v1_095_XXXXXXXXX’)
size.spire: number of medium or high quality SPIRE MAGs in cluster
size.pg3: number of reference genomes in proGenomes3 (>0 only for specI clusters)
size.combined: total number of genomes in cluster
domain, phylum, class, order, family, genus, species: consensus classification (re-formatted) at individual taxonomic levels
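
The two kinds of cluster ID described above can be told apart by their prefix. The sketch below is illustrative only: the function name, the returned labels and the example IDs are hypothetical, and only the two prefixes are taken from this description.

```python
def cluster_origin(spire_cluster):
    """Classify a cluster ID by the prefixes documented for spire_cluster."""
    if spire_cluster.startswith("specI_v4_"):
        return "proGenomes3 specI cluster"
    if spire_cluster.startswith("spire_v1_095_"):
        return "de novo 95% ANI cluster"
    return "unknown"


# Hypothetical IDs following the documented patterns (not real cluster IDs).
for cluster_id in ("specI_v4_00001", "spire_v1_095_000000001"):
    print(cluster_id, "->", cluster_origin(cluster_id))
```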
##############################################
Microntology
A list of terms included in the initial version (v01) of the newly developed microntology. Terms are organised by ‘category’, i.e. the highest level of biome (e.g., aquatic, terrestrial, host-associated, etc.), physicochemical (e.g., temperature, pH, oxygen level) or human host (age group, birth term) descriptors for microbial samples. The file contains the following columns:
category: highest level term category (see above)
term: microntology terms, organised into a flat hierarchy
ontology_link: cross-links to terms in established ontologies (EnvO, UBERON)
pull_term: further terms that are automatically assigned together with the current term. Microntology annotations in SPIRE follow a ‘multiple tag’ logic where a given sample is annotated with a combination of terms, rather than choosing a single most descriptive term. In consequence, some microntology terms are inherently cross-linked: e.g., a mangrove sample (microntology term ‘terrestrial:wetland:mangrove’) will automatically receive the further tags ‘aquatic:lentic’ and ‘aquatic:littoral’.
host_ncbi_taxonomy_id: cross-link to highest-level assignable host taxonomy ID in NCBI, where applicable
comment: additional descriptive definitions of individual terms
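
The ‘multiple tag’ logic described for pull_term can be sketched as a small lookup. This is an assumption-laden illustration: the dictionary layout and the function name are hypothetical, the way multiple pulled terms are encoded in the file is not specified here, and following pulled terms transitively is an assumption. Only the mangrove example comes from this description.

```python
def expand_terms(terms, pull_terms):
    """Return the given terms plus every term they pull in (followed transitively)."""
    expanded = set(terms)
    queue = list(terms)
    while queue:
        term = queue.pop()
        for pulled in pull_terms.get(term, ()):
            if pulled not in expanded:
                expanded.add(pulled)
                queue.append(pulled)
    return expanded


# Hypothetical in-memory mapping built from the 'term' and 'pull_term' columns.
pull_terms = {
    "terrestrial:wetland:mangrove": ["aquatic:lentic", "aquatic:littoral"],
}

print(sorted(expand_terms(["terrestrial:wetland:mangrove"], pull_terms)))
```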
##############################################