Results & Reports
vConTACT3 generates several standard output files on every run. These include the taxonomic assignments and the performance metrics, which are typically the outputs most often used for downstream analysis.
Standard Outputs
The following files are created during every run in the output directory (--output). These are distinct from optional exports (e.g., GraphML, profiles), which are documented more extensively in the Exports section.
Assignments
final_assignments.csv
This file contains the predicted taxonomic assignments (naming skills are top-notch) for each input genome or contig and basic genome statistics (length, proteins, genome name). Included with this file are all the references available in the database.
| Genome | GenomeName | Reference | Size_Kb | realm_reference | realm_prediction | ... | genus_reference | genus_prediction |
|---|---|---|---|---|---|---|---|---|
| genome_001 | environmental genome HF-0 | False | 10.5 | np.nan | Duplodnaviria | ... | np.nan | Herellevirus |
| genome_002 | environmental genome XZ-2 | False | 15.1 | np.nan | Monodnaviria | ... | np.nan | Circovirus |
Each prediction column is paired with the corresponding reference taxonomy column, which is filled in where a reference taxonomy is available.
Note: The taxonomy is derived from NCBI and can be out of date. Because the official taxonomy (defined by the ICTV) is incorporated into NCBI slowly, there is always a delay between the release of official names and their appearance in NCBI. There is a further delay between an NCBI update and the point at which vConTACT3 pulls the NCBI taxonomy database.
A more detailed description of the final assignments file, including a walkthrough, is available here
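As a rough illustration of how this file might be used downstream, the sketch below extracts the non-reference genomes that received a prediction at a given rank. It uses only Python's standard csv module. The column names follow the example table above. How missing values are actually written in the CSV (empty, nan, and so on) is an assumption, and the helper names and the two sample rows are made up for the example.

```python
import csv
import io

# Strings treated as "no value"; the exact spelling used in the real file is an assumption.
MISSING = {"", "nan", "NaN", "np.nan", "None"}


def read_assignments(handle):
    """Read final_assignments.csv-style content into a list of dicts."""
    return list(csv.DictReader(handle))


def predicted_queries(rows, rank="genus"):
    """Return (Genome, prediction) pairs for non-reference rows predicted at `rank`."""
    column = f"{rank}_prediction"
    hits = []
    for row in rows:
        is_reference = (row.get("Reference") or "").strip().lower() == "true"
        value = (row.get(column) or "").strip()
        if not is_reference and value not in MISSING:
            hits.append((row["Genome"], value))
    return hits


# Made-up stand-in for a real file, with only a subset of the columns.
EXAMPLE_CSV = (
    "Genome,GenomeName,Reference,Size_Kb,genus_reference,genus_prediction\n"
    "genome_001,environmental genome HF-0,False,10.5,,Herellevirus\n"
    "ref_001,hypothetical reference,True,40.2,Circovirus,Circovirus\n"
)
rows = read_assignments(io.StringIO(EXAMPLE_CSV))
print(predicted_queries(rows))
```

For a real run, the same functions could be applied to an open file handle on final_assignments.csv in the output directory.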
Performance Metrics
performance_metrics.csv
A table of performance metrics computed from the reference data and used for internal calibration. For each of the 5 virus realms and each of their 6 ranks, the groups predicted by vConTACT3 are compared against the reference taxonomy in the provided database. In short, the more consistently reference genomes are grouped with other references of the same taxon at a given rank, the higher the accuracy. Additionally, two network/clustering-based approaches are included.
| index | realm | rank | Clustering-wise PPV | Clustering-wise Sensitivity | Clustering-wise Accuracy | Clustering-wise Separation | Complex-wise Separation | Separation | Adjusted Rand Index (ARI) | Normalized Mutual Info Score (NMI) | Compo Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | Duplodnaviria | realm | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 |
| 8 | Duplodnaviria | kingdom | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 |
| 9 | Duplodnaviria | phylum | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 |
| 10 | Duplodnaviria | class | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 |
| 11 | Duplodnaviria | order | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 |
| 12 | Duplodnaviria | family | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 |
| 13 | Duplodnaviria | subfamily | 0.9464 | 0.9965 | 0.9711 | 0.9636 | 0.8153 | 0.8864 | 0.9961 | 0.9738 | 2.9142 |
| 14 | Duplodnaviria | genus | 0.9720 | 0.9966 | 0.9842 | 0.9903 | 0.9091 | 0.9488 | 0.9995 | 0.9890 | 2.9529 |
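The first six metric columns appear to follow the usual clustering-wise definitions, in which accuracy is the geometric mean of sensitivity and PPV, and separation is the geometric mean of the clustering-wise and complex-wise separations. That reading is an assumption and is not stated by vConTACT3 here, but the subfamily and genus rows in the table above are consistent with it to within rounding, as this small check illustrates:

```python
import math


def geometric_mean(a, b):
    """Geometric mean of two non-negative numbers."""
    return math.sqrt(a * b)


# Values copied from the example table, in the order:
# (PPV, sensitivity, accuracy, clustering-wise sep., complex-wise sep., separation)
ROWS = {
    "subfamily": (0.9464, 0.9965, 0.9711, 0.9636, 0.8153, 0.8864),
    "genus": (0.9720, 0.9966, 0.9842, 0.9903, 0.9091, 0.9488),
}


def is_consistent(values, tol=5e-4):
    """Check the assumed geometric-mean relationships, allowing for 4-decimal rounding."""
    ppv, sens, acc, cl_sep, co_sep, sep = values
    acc_ok = abs(geometric_mean(ppv, sens) - acc) < tol
    sep_ok = abs(geometric_mean(cl_sep, co_sep) - sep) < tol
    return acc_ok and sep_ok


for rank, values in ROWS.items():
    print(rank, is_consistent(values))
```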
Other Outputs
These outputs are generated depending on the --exports options chosen.
Completeness
An EXPERIMENTAL feature that relies on the number of genes found within each predicted genus.
Graph-Based
networks/ folder
Networks generated from the run. Formats include:
GraphML (graphml)
Cytoscape (cyjs)
Cosmograph (cosmo_data.csv, cosmo_metadata.csv)
d3js (d3.json and d3.json.html)
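GraphML is plain XML, so a quick sanity check of an exported network does not require a graph library. The sketch below counts nodes and edges with Python's standard library. The function name and the tiny inline graph are illustrative only, and nothing is assumed about the node or edge attributes that vConTACT3 writes.

```python
import io
import xml.etree.ElementTree as ET

# Standard GraphML XML namespace.
GRAPHML_NS = "{http://graphml.graphdrawing.org/xmlns}"


def count_nodes_and_edges(source):
    """Return (n_nodes, n_edges) for a GraphML file path or file-like object."""
    root = ET.parse(source).getroot()
    n_nodes = sum(1 for _ in root.iter(f"{GRAPHML_NS}node"))
    n_edges = sum(1 for _ in root.iter(f"{GRAPHML_NS}edge"))
    return n_nodes, n_edges


# Minimal hand-written GraphML document standing in for a real export.
EXAMPLE_GRAPHML = (
    '<graphml xmlns="http://graphml.graphdrawing.org/xmlns">'
    '<graph edgedefault="undirected">'
    '<node id="genome_001"/><node id="genome_002"/>'
    '<edge source="genome_001" target="genome_002"/>'
    "</graph></graphml>"
)
print(count_nodes_and_edges(io.StringIO(EXAMPLE_GRAPHML)))
```

The same function accepts a path to a .graphml file in the networks/ folder.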
Profiles
pc_profiles/ folder
PC profiles (genome × PC matrices), split by the groups at the rank given by --target-rank and restricted to groups with at least the minimum number of genomes set by --target-members.
For every desired rank and every group predicted within that rank, the following files are generated:
rank_rank-name.svg (SVG heatmap)
rank_rank-name.csv (presence/absence PC counts)
WARNING: If lower ranks (e.g. subfamily and genus) are selected and/or the input dataset is large, this can result in 100s or 1000s of file pairs.
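Because the number of files can grow quickly, it may help to count the CSV/SVG pairs before opening the folder in a file browser. A minimal sketch, assuming only that each group yields one .csv and one .svg sharing a file stem, as listed above (the function name is hypothetical):

```python
from pathlib import Path


def count_profile_pairs(folder):
    """Count file stems in `folder` that have both a .csv and a .svg file."""
    folder = Path(folder)
    csv_stems = {p.stem for p in folder.glob("*.csv")}
    svg_stems = {p.stem for p in folder.glob("*.svg")}
    return len(csv_stems & svg_stems)


if __name__ == "__main__":
    # Default output location taken from this page; adjust to the --output used.
    profiles = Path("vConTACT3_results") / "pc_profiles"
    if profiles.is_dir():
        print(count_profile_pairs(profiles), "profile pairs found")
```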
PyUpset
Forthcoming
Newick
Forthcoming
Temporary & Intermediate Files
By default, vConTACT3 cleans up large intermediate files after a successful run. However, if the --keep-temp flag is used, additional files (e.g., MMseqs2 outputs, subgraphs) are retained in the results folder.
Using --keep-temp is not recommended, because none of these files are re-used by vConTACT3. Even if a run terminates unexpectedly, all of these temporary files must be rebuilt. The flag is mainly useful for advanced users who wish to use the MMseqs2 results directly in other analyses.
Checkpoint Files
Checkpoint files are generated during the course of every run. They consist of file paths, parameters, and other data processing components that are required between one checkpoint and the next. Generally speaking, the presence of a checkpoint file indicates that the run completed successfully up to that point. If a failure occurs "downstream" of that checkpoint file, subsequent runs can restart from that point.
The files are:
*.profile.pkl.gz
*.resolver.pkl.gz
*.gbga.pkl.gz
*.performance.pkl.gz
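To see how far a previous run progressed, one option is to look for these files by suffix. The sketch below only reports which checkpoint files are present. It deliberately does not open the pickles, since their contents are not described here. That the checkpoints sit directly in the directory being searched is an assumption, and the function name is hypothetical.

```python
from pathlib import Path

# Suffixes taken from the list above.
CHECKPOINT_SUFFIXES = (
    ".profile.pkl.gz",
    ".resolver.pkl.gz",
    ".gbga.pkl.gz",
    ".performance.pkl.gz",
)


def find_checkpoints(directory):
    """Map each checkpoint suffix to the sorted matching file names in `directory`."""
    directory = Path(directory)
    return {
        suffix: sorted(p.name for p in directory.glob(f"*{suffix}"))
        for suffix in CHECKPOINT_SUFFIXES
    }


if __name__ == "__main__":
    for suffix, names in find_checkpoints("vConTACT3_results").items():
        print(suffix, "->", names if names else "not found")
```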
There is one caveat to checkpoint files. Since they often contain most (if not all) of the information/data required to move to the next step, any errors or issues that survived to the checkpoint will be propagated to the rest of the processing.
For example, if a genome is misnamed or duplicated within the dataset, vConTACT3 may not detect it. The genes, while technically from two different genomes, will be connected via the duplicated name. If the user notices this issue and deletes only the last checkpoint file, it will not help, because the earlier checkpoints will have incorporated the same issue.
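Since problems of this kind cannot be fixed by deleting a checkpoint, it is cheaper to catch them before the run. A minimal pre-flight sketch, assuming the input is a FASTA file in which the genome identifier is the first whitespace-delimited token of each header line (the function name is illustrative, and this check is not part of vConTACT3):

```python
from collections import Counter


def duplicated_ids(fasta_lines):
    """Return the sorted sequence IDs that occur in more than one FASTA header."""
    counts = Counter(
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    )
    return sorted(name for name, n in counts.items() if n > 1)


# Made-up records: genome_001 appears twice with different descriptions.
example = [
    ">genome_001 sample A", "ACGT",
    ">genome_002", "GGCC",
    ">genome_001 sample B", "TTAA",
]
print(duplicated_ids(example))
```

The function accepts any iterable of lines, so an open file handle on the input FASTA can be passed in directly.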
File Locations
All results are placed in the output directory specified by --output (default: vConTACT3_results/).
Directory structure:
vConTACT3_results/
├── final_assignments.csv
├── performance_metrics.csv
├── networks/
│   ├── part1.graphml
│   ├── part1.cyjs
│   ├── part1.d3.json
│   ├── part1.d3.json.html
│   ├── part2_cosmo_data.csv
│   ├── part2_cosmo_metadata.csv
│   └── part2.d3.json
├── pc_profiles/
│   ├── rank_rank-name.csv
│   └── rank_rank-name.svg
└── completeness/
    └── completeness.csv
For more details about the optional export formats (profiles, Cytoscape files, etc.), see the Exports section.