Results & Reports

vConTACT3 generates several standard output files as part of every run. These include the assignments and the performance metrics, which are typically the outputs most commonly used for downstream analysis.

Standard Outputs

The following files are created during every run in the output directory (--output). These are distinct from optional exports (e.g., GraphML, profiles), which are documented more extensively in the Exports section.

Assignments

final_assignments.csv

This file (naming skills are top-notch) contains the predicted taxonomic assignments for each input genome or contig, along with basic genome statistics (length, proteins, genome name). All references available in the database are also included in this file.

final_assignments.csv

Genome      GenomeName                 Reference  Size_Kb  realm_reference  realm_prediction  ...  genus_reference  genus_prediction
genome_001  environmental genome HF-0  False      10.5     np.nan           Duplodnaviria     ...  np.nan           Herellevirus
genome_002  environmental genome XZ-2  False      15.1     np.nan           Monodnaviria      ...  np.nan           Circovirus

Every prediction column is paired with a reference column that holds the reference taxonomy, if available.

Note: The taxonomy is derived from NCBI and can be out of date. Because the official taxonomy (defined by the ICTV) is only slowly incorporated into NCBI, there will always be a delay between the release of official names and their appearance in NCBI. There is a further delay between NCBI's updates and the point at which vConTACT3 pulls the NCBI taxonomy database.

A more detailed description of the final assignments file, including a walkthrough, is available here
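As a rough sketch of downstream use, the snippet below reads an assignments table with Python's standard csv module and keeps only the user-supplied genomes (rows whose Reference column is False). The column names come from the example table above. The inline sample text, the invented reference row, the helper name user_genomes, and the assumption that booleans are written as the text True/False are all illustrative and should be checked against a real file.

```python
import csv
import io

# Illustrative stand-in for final_assignments.csv. Only a few columns are shown,
# and the ref_001 row is invented for this example.
SAMPLE = """Genome,GenomeName,Reference,Size_Kb,genus_prediction
genome_001,environmental genome HF-0,False,10.5,Herellevirus
genome_002,environmental genome XZ-2,False,15.1,Circovirus
ref_001,hypothetical reference genome,True,40.2,Herellevirus
"""


def user_genomes(handle):
    """Return the rows that are not database references."""
    reader = csv.DictReader(handle)
    # Assumption: the Reference column holds the text "True" or "False".
    return [row for row in reader if row["Reference"].strip().lower() != "true"]


rows = user_genomes(io.StringIO(SAMPLE))
# Map each user genome to its predicted genus.
predicted = {row["Genome"]: row["genus_prediction"] for row in rows}
print(predicted)
```

To try this on real output, io.StringIO(SAMPLE) could be replaced with open("vConTACT3_results/final_assignments.csv", newline=""), assuming the default output directory.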

Performance Metrics

performance_metrics.csv

A table of performance metrics that uses the reference data for internal calibration. For each of the 5 virus realms and each of their 6 ranks, the groups predicted by vConTACT3 are compared against the reference taxonomy in the provided database. In short, the more often reference genomes that share a taxon at a given rank are placed in the same predicted group, the higher the accuracy. Additionally, two network/clustering-based approaches are included.

performance_metrics.csv

index  realm          rank       Clustering-wise PPV  Clustering-wise Sensitivity  Clustering-wise Accuracy  Clustering-wise Separation  Complex-wise Separation  Separation  Adjusted Rand Index (ARI)  Normalized Mutual Info Score (NMI)  Compo Score
7      Duplodnaviria  realm      1.0                  1.0                          1.0                       1.0                         1.0                      1.0         1.0                        1.0                                 3.0
8      Duplodnaviria  kingdom    1.0                  1.0                          1.0                       1.0                         1.0                      1.0         1.0                        1.0                                 3.0
9      Duplodnaviria  phylum     1.0                  1.0                          1.0                       1.0                         1.0                      1.0         1.0                        1.0                                 3.0
10     Duplodnaviria  class      1.0                  1.0                          1.0                       1.0                         1.0                      1.0         1.0                        1.0                                 3.0
11     Duplodnaviria  order      1.0                  1.0                          1.0                       1.0                         1.0                      1.0         1.0                        1.0                                 3.0
12     Duplodnaviria  family     1.0                  1.0                          1.0                       1.0                         1.0                      1.0         1.0                        1.0                                 3.0
13     Duplodnaviria  subfamily  0.9464               0.9965                       0.9711                    0.9636                      0.8153                   0.8864      0.9961                     0.9738                              2.9142
14     Duplodnaviria  genus      0.9720               0.9966                       0.9842                    0.9903                      0.9091                   0.9488      0.9995                     0.9890                              2.9529
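The subfamily and genus rows above suggest how some of the columns relate to one another: in both rows, Clustering-wise Accuracy matches the geometric mean of Clustering-wise PPV and Clustering-wise Sensitivity, and Separation matches the geometric mean of Clustering-wise Separation and Complex-wise Separation, to within rounding. The sketch below checks this on the values shown. It is an observation about these sample rows only, not a statement of how vConTACT3 computes the metrics internally; the short dictionary keys and function names are shorthand invented for this example.

```python
import math

# Values copied from the subfamily and genus rows of the example table.
# The short keys are shorthand for the column names, not names from the file.
ROWS = {
    "subfamily": {"ppv": 0.9464, "sens": 0.9965, "acc": 0.9711,
                  "clust_sep": 0.9636, "complex_sep": 0.8153, "sep": 0.8864},
    "genus": {"ppv": 0.9720, "sens": 0.9966, "acc": 0.9842,
              "clust_sep": 0.9903, "complex_sep": 0.9091, "sep": 0.9488},
}


def geometric_mean(a, b):
    """Geometric mean of two non-negative numbers."""
    return math.sqrt(a * b)


def is_consistent(row, tol=1e-3):
    """True if accuracy and separation both match their geometric means."""
    acc_ok = math.isclose(geometric_mean(row["ppv"], row["sens"]),
                          row["acc"], abs_tol=tol)
    sep_ok = math.isclose(geometric_mean(row["clust_sep"], row["complex_sep"]),
                          row["sep"], abs_tol=tol)
    return acc_ok and sep_ok


for rank, row in ROWS.items():
    print(rank, is_consistent(row))
```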

Other Outputs

These outputs are generated depending on which --exports options are chosen.

Completeness

An EXPERIMENTAL feature that relies on the number of genes found within each predicted genus.

Graph-Based

networks/ folder

Networks generated from the run. Formats include:

  • GraphML (graphml)

  • Cytoscape (cyjs)

  • Cosmograph (cosmo_data.csv, cosmo_metadata.csv)

  • d3js (d3.json and d3.json.html)

Profiles

pc_profiles/ folder

PC profiles (genomes x PCs), split into the groups found at the rank given by --target-rank and restricted to groups with at least the minimum genome count given by --target-members.

For every desired rank and every group predicted within that rank, the following files are generated:

  • rank_rank-name.svg (SVG heatmap)

  • rank_rank-name.csv (presence/absence PC counts)

WARNING: If lower ranks are selected (e.g., subfamily and genus) and/or the input dataset is large, this can produce hundreds or thousands of file pairs.
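To gauge how many file pairs a run actually produced, a small script can tally the CSV files per rank. The sketch below assumes file names follow the rank_rank-name pattern described above, with the rank being the text before the first underscore. The helper name count_profile_pairs and the group names in the demo are made up for illustration.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_profile_pairs(profile_dir):
    """Count CSV files per rank, assuming names of the form <rank>_<group>.csv."""
    counts = Counter()
    for csv_path in Path(profile_dir).glob("*.csv"):
        # Assumption: the text before the first underscore is the rank.
        rank, _, _group = csv_path.stem.partition("_")
        counts[rank] += 1
    return dict(counts)


# Demo on a throwaway directory populated with made-up group names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("genus_GroupA", "genus_GroupB", "family_GroupC"):
        for ext in (".csv", ".svg"):
            (Path(tmp) / f"{name}{ext}").touch()
    demo_counts = count_profile_pairs(tmp)

print(demo_counts)
```

On real output, the function would be pointed at the pc_profiles/ folder inside the output directory.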

PyUpset

Forthcoming

Newick

Forthcoming

Temporary & Intermediate Files

By default, vConTACT3 cleans up large intermediate files after a successful run. However, if the --keep-temp flag is used, additional files (e.g., MMseqs2 outputs, subgraphs) will be retained in the results folder.

Using --keep-temp is not recommended, as none of these files are re-used by vConTACT3. Even if a run terminates unexpectedly, all of these temporary files need to be rebuilt. The flag is mainly useful for advanced users who wish to use the MMseqs2 results directly in other analyses.

Checkpoint Files

Checkpoint files are generated during the course of every run. They comprise file paths, parameters, and other data-processing components that are required between one checkpoint and the next. Generally speaking, the presence of a checkpoint file indicates that the run completed successfully up to that point. If a failure occurs "downstream" of a checkpoint file, subsequent runs can restart from that point.

The files are:

  • *.profile.pkl.gz

  • *.resolver.pkl.gz

  • *.gbga.pkl.gz

  • *.performance.pkl.gz

There is one caveat to checkpoint files. Since they often contain most (if not all) of the information/data required to move to the next step, any errors or issues that survived to the checkpoint will be propagated to the rest of the processing.

For example, if a genome is mis-named or duplicated within the dataset, vConTACT3 may not detect it. The genes, while technically from two different genomes, will be connected via the duplicated name. If the user notices this issue and deletes the last checkpoint file, it won't help, because all of the checkpoints will have incorporated the same issue.
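Because such problems are easiest to fix before any checkpoint is written, a quick pre-flight check for duplicated names can help. The sketch below is not part of vConTACT3. It assumes the input is a FASTA file and that the genome name is the header text up to the first whitespace, which may differ from how vConTACT3 itself parses names. The function name duplicated_ids and the sample sequences are illustrative.

```python
import io
from collections import Counter


def duplicated_ids(handle):
    """Return sequence IDs that appear more than once in a FASTA stream."""
    ids = Counter()
    for line in handle:
        if line.startswith(">"):
            # Assumption: the ID is the header text up to the first whitespace.
            fields = line[1:].split()
            ids[fields[0] if fields else ""] += 1
    return sorted(name for name, n in ids.items() if n > 1)


# Made-up sample in which genome_001 appears twice.
SAMPLE = (
    ">genome_001 first copy\nACGT\n"
    ">genome_002\nGGCC\n"
    ">genome_001 second copy\nTTAA\n"
)
print(duplicated_ids(io.StringIO(SAMPLE)))
```

For a real dataset, the io.StringIO(SAMPLE) argument would be replaced with an open file handle on the input FASTA.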

File Locations

All results are placed in the output directory specified by --output (default: vConTACT3_results/).

Directory structure:

vConTACT3_results/
├── final_assignments.csv
├── performance_metrics.csv
├── networks/
│   ├── part1.graphml
│   ├── part1.cyjs
│   ├── part1.d3.json
│   ├── part1.d3.json.html
│   ├── part2_cosmo_data.csv
│   ├── part2_cosmo_metadata.csv
│   └── part2.d3.json
├── pc_profiles/
│   ├── rank_rank-name.csv
│   └── rank_rank-name.svg
└── completeness/
    └── completeness.csv

For more details about the optional export formats (profiles, Cytoscape files, etc.), see the Exports section.
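As a simple sanity check after a run, the two standard outputs can be looked up in the output directory. The sketch below uses only the file names listed in this section. The helper name missing_outputs is hypothetical, and the demo runs against a temporary directory rather than real results.

```python
import tempfile
from pathlib import Path

# File names taken from the Standard Outputs section.
STANDARD_OUTPUTS = ("final_assignments.csv", "performance_metrics.csv")


def missing_outputs(output_dir):
    """Return the standard output files that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in STANDARD_OUTPUTS if not (root / name).is_file()]


# Demo: a throwaway directory containing only one of the two files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "final_assignments.csv").touch()
    demo_missing = missing_outputs(tmp)

print(demo_missing)
```

With the default settings, the call would be missing_outputs("vConTACT3_results"); an empty list means both files are present.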