Results & Reports

vConTACT3 generates several standard output files with every run. These include the taxonomic assignments and performance metrics, which are typically the outputs most often used for downstream analysis.

Standard Outputs

The following files are created during every run in the output directory (--output). These are distinct from optional exports (e.g., GraphML, profiles), which are documented more extensively in the Exports section.

Assignments

final_assignments.csv

This file contains the predicted taxonomic assignments (naming skills are top-notch) for each input genome or contig, along with basic genome statistics (length, proteins, genome name). The file also includes all reference genomes available in the database.

final_assignments.csv
Genome	GenomeName	Reference	Size_Kb	realm_reference	realm_prediction	...	genus_reference	genus_prediction
genome_001	environmental genome HF-0	False	10.5	np.nan	Duplodnaviria	...	np.nan	Herellevirus
genome_002	environmental genome XZ-2	False	15.1	np.nan	Monodnaviria	...	np.nan	Circovirus

Each prediction column is paired with a column holding the corresponding reference taxonomy, if available.
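As a rough illustration of working with this file, the sketch below reads an abbreviated copy of the excerpt above using only Python's standard library and collects the predicted genus for each non-reference genome. It assumes the file is tab-delimited, as the excerpt appears to be (pass delimiter="," if your copy is comma-separated), and that the Reference column holds the literal text True or False. The function name predicted_genera is hypothetical and not part of vConTACT3.

```python
import csv
import io

# Abbreviated copy of the excerpt above; the elided ("...") rank columns are omitted.
SAMPLE = (
    "Genome\tGenomeName\tReference\tSize_Kb\tgenus_reference\tgenus_prediction\n"
    "genome_001\tenvironmental genome HF-0\tFalse\t10.5\tnp.nan\tHerellevirus\n"
    "genome_002\tenvironmental genome XZ-2\tFalse\t15.1\tnp.nan\tCircovirus\n"
)


def predicted_genera(handle, delimiter="\t"):
    """Map each non-reference genome ID to its predicted genus.

    The delimiter is an assumption based on the excerpt; adjust it to
    match the file actually written by your run.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    return {
        row["Genome"]: row["genus_prediction"]
        for row in reader
        if row["Reference"] == "False"  # skip database references
    }


genera = predicted_genera(io.StringIO(SAMPLE))
print(genera)
```

For a real run, the io.StringIO(SAMPLE) argument would be replaced by an open file handle on final_assignments.csv in the output directory.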

Note: The taxonomy is derived from NCBI and can be out of date. Because the official taxonomy (defined by the ICTV) is incorporated into NCBI slowly, there will always be a delay between the release of official names and their appearance in NCBI. A further delay arises from when vConTACT3 last pulled the NCBI taxonomy database.

A more detailed description of the final assignments file, including a walkthrough, is available here

Performance Metrics

performance_metrics.csv

A table of performance metrics that uses reference data for internal calibration. For each of the 5 virus realms and each of their 6 ranks, the vConTACT3-predicted groups are compared against the reference taxonomy in the provided database. In short, the more consistently reference genomes that share a taxon at a given rank are grouped together, the higher the accuracy. Additionally, two network/clustering-based approaches are included.

performance_metrics.csv
index	realm	rank	Clustering-wise PPV	Clustering-wise Sensitivity	Clustering-wise Accuracy	Clustering-wise Separation	Complex-wise Separation	Separation	Adjusted Rand Index (ARI)	Normalized Mutual Info Score (NMI)	Compo Score
7	Duplodnaviria	realm	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	3.0
8	Duplodnaviria	kingdom	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	3.0
9	Duplodnaviria	phylum	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	3.0
10	Duplodnaviria	class	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	3.0
11	Duplodnaviria	order	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	3.0
12	Duplodnaviria	family	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	3.0
13	Duplodnaviria	subfamily	0.9464	0.9965	0.9711	0.9636	0.8153	0.8864	0.9961	0.9738	2.9142
14	Duplodnaviria	genus	0.9720	0.9966	0.9842	0.9903	0.9091	0.9488	0.9995	0.9890	2.9529
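One way to use this table is to flag the ranks where the predicted groups agree less well with the references. The sketch below does this for a few rows copied from the excerpt above, keeping only one metric column for brevity. The tab delimiter, the 0.98 cutoff, and the function name ranks_below are illustrative assumptions, not vConTACT3 defaults.

```python
import csv
import io

# A few rows from the excerpt above, reduced to a single metric column.
SAMPLE = (
    "index\trealm\trank\tClustering-wise Accuracy\n"
    "12\tDuplodnaviria\tfamily\t1.0\n"
    "13\tDuplodnaviria\tsubfamily\t0.9711\n"
    "14\tDuplodnaviria\tgenus\t0.9842\n"
)


def ranks_below(handle, column="Clustering-wise Accuracy",
                threshold=0.98, delimiter="\t"):
    """Return (realm, rank, value) for rows whose metric is below threshold.

    The threshold is an arbitrary example value chosen for this sketch.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    flagged = []
    for row in reader:
        value = float(row[column])
        if value < threshold:
            flagged.append((row["realm"], row["rank"], value))
    return flagged


flagged = ranks_below(io.StringIO(SAMPLE))
print(flagged)
```

The same function can be pointed at any other metric column by passing its header text as the column argument.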

Other Outputs

These outputs are generated depending on the options chosen with --exports.

Completeness

An EXPERIMENTAL feature that relies on the number of genes found within each predicted genus.

Graph-Based

networks/ folder

Networks generated from the run. Formats include:

GraphML (graphml)

Cytoscape (cyjs)

Cosmograph (cosmo_data.csv, cosmo_metadata.csv)

d3js (d3.json and d3.json.html)

Profiles

pc_profiles/ folder

PC profiles (genome X PCs), split by the groups at the rank indicated by --target-rank and filtered by the minimum genome count set with --target-members.

For every desired rank and every group predicted within that rank, the following files are generated:

rank_rank-name.svg (SVG heatmap)

rank_rank-name.csv (presence/absence PC counts)

WARNING: Selecting lower ranks (e.g., subfamily and genus) and/or using a large input dataset can result in hundreds or thousands of file pairs.
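To see how many file pairs a run actually produced, a small helper along these lines can count them. It is a sketch that assumes each CSV sits next to an SVG with the same base name, as described above. The function name count_profile_pairs and the example file names are made up for the demonstration.

```python
from pathlib import Path
import tempfile


def count_profile_pairs(profile_dir):
    """Count CSV files that have an SVG heatmap with the same base name."""
    profile_dir = Path(profile_dir)
    return sum(
        1
        for csv_path in profile_dir.glob("*.csv")
        if csv_path.with_suffix(".svg").exists()
    )


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for stem in ("genus_ExampleA", "genus_ExampleB"):
        for ext in (".csv", ".svg"):
            (Path(tmp) / f"{stem}{ext}").touch()
    pair_count = count_profile_pairs(tmp)

print(pair_count)
```

On a real run, the argument would be the pc_profiles/ folder inside the output directory.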

PyUpset

Forthcoming

Newick

Forthcoming

Temporary & Intermediate Files

By default, vConTACT3 cleans up large intermediate files after a successful run. However, if the --keep-temp flag is used, additional files (e.g., MMseqs2 outputs, subgraphs) will be retained in the results folder.

Using --keep-temp is not recommended, as none of these files are re-used by vConTACT3. Even if a run terminates unexpectedly, all of these temporary files need to be rebuilt. The flag is mainly useful for advanced users who wish to use the MMseqs2 results directly in other analyses.

Checkpoint Files

Checkpoint files are generated during the course of every run. They consist of file paths, parameters, and other data processing components that are required between the last checkpoint file and the next. The presence of a checkpoint file, generally speaking, indicates that the run has completed successfully up to that point. If a failure occurs "downstream" of a checkpoint file, subsequent runs can restart from that point.

The files are:

*.profile.pkl.gz

*.resolver.pkl.gz

*.gbga.pkl.gz

*.performance.pkl.gz

There is one caveat to checkpoint files. Since they often contain most (if not all) of the information/data required to move to the next step, any errors or issues that survived to the checkpoint will be propagated to the rest of the processing.

For example, if a genome is mis-named or duplicated within the dataset, vConTACT3 may not detect the problem. The genes, while technically from two different genomes, will be connected via the duplicated name. If the user notices this issue and deletes only the last checkpoint file, it won't matter, because all the checkpoints will have incorporated the same issue.
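Because such problems persist across checkpoints, it can help to check for duplicated names before starting a run. The following sketch assumes the input is a FASTA file and treats the first whitespace-delimited token of each header as the genome identifier. How vConTACT3 itself derives genome names from headers is not described here, so treat this as an approximation. The function name duplicated_ids is hypothetical.

```python
from collections import Counter
import io


def duplicated_ids(handle):
    """Return sequence IDs that appear more than once in a FASTA stream.

    The ID is taken as the first whitespace-delimited token after '>';
    this is an assumption, not a documented vConTACT3 rule.
    """
    counts = Counter(
        line[1:].split()[0]
        for line in handle
        if line.startswith(">") and line[1:].strip()
    )
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)


# Made-up example in which genome_001 appears twice.
SAMPLE = (
    ">genome_001 first copy\nACGT\n"
    ">genome_002\nGGCC\n"
    ">genome_001 second copy\nTTAA\n"
)
dupes = duplicated_ids(io.StringIO(SAMPLE))
print(dupes)
```

An empty result means no identifier was repeated under this definition.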

File Locations

All results are placed in the output directory specified by --output (default: vConTACT3_results/).

Directory structure:

vConTACT3_results/
├── final_assignments.csv
├── performance_metrics.csv
├── networks/
│   ├── part1.graphml
│   ├── part1.cyjs
│   ├── part1.d3.json
│   ├── part1.d3.json.html
│   ├── part2_cosmo_data.csv
│   ├── part2_cosmo_metadata.csv
│   └── part2.d3.json
├── pc_profiles/
│   ├── rank_rank-name.csv
│   └── rank_rank-name.svg
└── completeness/
    └── completeness.csv
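A quick sanity check after a run is to confirm that the two standard outputs exist in the output directory. The sketch below does only that. The helper name missing_standard_outputs is hypothetical, and the temporary directory stands in for a real results folder.

```python
from pathlib import Path
import tempfile

# The two files described above as being created during every run.
STANDARD_OUTPUTS = ("final_assignments.csv", "performance_metrics.csv")


def missing_standard_outputs(output_dir):
    """List the standard output files that are absent from output_dir."""
    output_dir = Path(output_dir)
    return [
        name for name in STANDARD_OUTPUTS
        if not (output_dir / name).is_file()
    ]


# Demonstration: a stand-in results folder holding only one of the files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "final_assignments.csv").touch()
    missing = missing_standard_outputs(tmp)

print(missing)
```

For an actual run, the argument would be the directory passed to --output.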

For more details about the optional export formats (profiles, Cytoscape files, etc.), see the Exports section.