Results & Reports
=================

vConTACT3 generates several standard output files as part of every run. These files include the assignments and performance metrics, which are typically the outputs most commonly used for downstream analysis.

Standard Outputs
----------------

The following files are created in the output directory (`--output`) during every run. These are distinct from optional exports (e.g., GraphML, profiles), which are documented more extensively in the :doc:`Exports ` section.

Assignments
~~~~~~~~~~~

**final_assignments.csv**

This file contains the predicted taxonomic assignments (naming skills are top-notch) for each input genome or contig, along with basic genome statistics (length, proteins, genome name). All references available in the database are also included in this file.

.. csv-table:: final_assignments.csv
   :header: Genome,GenomeName,Reference,Size_Kb,realm_reference,realm_prediction,...,genus_reference,genus_prediction

   genome_001,environmental genome HF-0,False,10.5,np.nan,Duplodnaviria,...,np.nan,Herellevirus
   genome_002,environmental genome XZ-2,False,15.1,np.nan,Monodnaviria,...,np.nan,Circovirus

For every prediction, the reference taxonomy is also given, if available.

Note: The taxonomy is derived from NCBI and can be out of date. Because the official taxonomy (defined by the ICTV) is incorporated into NCBI only slowly, there will always be a delay between the release of the official names and their appearance in NCBI. There is an additional delay arising from when vConTACT3 pulls the NCBI taxonomy database.

A more detailed description of the final assignments file, including a walkthrough, is available :doc:`here `.

Performance Metrics
~~~~~~~~~~~~~~~~~~~

**performance_metrics.csv**

A table of performance metrics that uses reference data for internal calibration. For each of the 5 virus realms and each of their 6 ranks, the vConTACT3-predicted groups are compared against the reference taxonomy in the provided database.
In short, the more consistently reference genomes that share a group at a given rank are placed together with the other references of that group, the higher the accuracy. Additionally, two network/clustering-based approaches are included.

.. csv-table:: performance_metrics.csv
   :header: index,realm,rank,Clustering-wise PPV,Clustering-wise Sensitivity,Clustering-wise Accuracy,Clustering-wise Separation,Complex-wise Separation,Separation,Adjusted Rand Index (ARI),Normalized Mutual Info Score (NMI),Compo Score

   7,Duplodnaviria,realm,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0
   8,Duplodnaviria,kingdom,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0
   9,Duplodnaviria,phylum,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0
   10,Duplodnaviria,class,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0
   11,Duplodnaviria,order,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0
   12,Duplodnaviria,family,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0
   13,Duplodnaviria,subfamily,0.9464,0.9965,0.9711,0.9636,0.8153,0.8864,0.9961,0.9738,2.9142
   14,Duplodnaviria,genus,0.9720,0.9966,0.9842,0.9903,0.9091,0.9488,0.9995,0.9890,2.9529

Other Outputs
-------------

These outputs are generated depending on the `--exports` options chosen.

Completeness
~~~~~~~~~~~~

An EXPERIMENTAL feature that relies on the number of genes found within each predicted genus.

Graph-Based
~~~~~~~~~~~

**networks/ folder**

Networks generated from the run. Formats include:

- GraphML (graphml)
- Cytoscape (cyjs)
- Cosmograph (cosmo_data.csv, cosmo_metadata.csv)
- d3js (d3.json and d3.json.html)

Profiles
~~~~~~~~

**pc_profiles/ folder**

PC profiles (genome x PCs), split by the groups at the rank indicated by `--target-rank` and filtered to a minimal genome count by `--target-members`.

For *every* desired rank and every group predicted within that rank, the following files are generated:

- rank_rank-name.svg (SVG heatmap)
- rank_rank-name.csv (presence/absence PC counts)

WARNING: If lower ranks are selected (e.g. subfamily and genus) and/or the input dataset is large, this can result in 100s or 1000s of file pairs.

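
To gauge how many profile file pairs a run actually produced, the CSV/SVG pairs in the folder can be counted per rank. The following is a minimal sketch and not part of vConTACT3: the function name ``count_profile_pairs`` is hypothetical, and it assumes the ``rank_rank-name`` file naming described above, taking the rank to be the text before the first underscore.

.. code-block:: python

   from collections import Counter
   from pathlib import Path


   def count_profile_pairs(profile_dir):
       """Count complete CSV/SVG profile pairs per rank in a pc_profiles/ folder.

       Assumption: files are named <rank>_<rank-name>.csv and .svg, so the
       rank is the text before the first underscore in the file stem.
       """
       profile_dir = Path(profile_dir)
       csv_stems = {p.stem for p in profile_dir.glob("*.csv")}
       svg_stems = {p.stem for p in profile_dir.glob("*.svg")}

       counts = Counter()
       for stem in csv_stems & svg_stems:  # keep only stems that have both files
           rank = stem.split("_", 1)[0]
           counts[rank] += 1
       return counts

Calling ``count_profile_pairs("vConTACT3_results/pc_profiles")`` (assuming the default output directory) returns a ``Counter`` keyed by rank, which makes it easy to spot a rank that produced an unexpectedly large number of files.
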
PyUpset
~~~~~~~

Forthcoming

Newick
~~~~~~

Forthcoming

Temporary & Intermediate Files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, vConTACT3 cleans up large intermediate files after a successful run. However, if the `--keep-temp` flag is used, additional files (e.g., MMSeqs2 outputs, subgraphs) are retained in the results folder.

Using `--keep-temp` is not recommended, as none of these files are re-used by vConTACT3. Even if a run terminates unexpectedly, all of these temporary files need to be rebuilt. The flag is mainly useful for advanced users who wish to use the MMSeqs2 results directly in other analyses.

Checkpoint Files
----------------

Checkpoint files are generated during the course of every run. They comprise the file paths, parameters, and other data processing components that are required between one checkpoint and the next. The presence of a checkpoint file, generally speaking, indicates that the run completed successfully up to that point. If a failure occurs "downstream" of a checkpoint file, subsequent runs can restart from that point.

The files are:

- \*.profile.pkl.gz
- \*.resolver.pkl.gz
- \*.gbga.pkl.gz
- \*.performance.pkl.gz

There is one caveat to checkpoint files. Since they often contain most (if not all) of the information/data required to move to the next step, any errors or issues that survived to the checkpoint *will be propagated* to the rest of the processing. For example, if a genome is mis-named or duplicated within the dataset, vConTACT3 may not identify it. The genes, while technically being from two different genomes, will be connected via the duplicated name. If the user notices this issue and deletes the last checkpoint file, it won't matter, because all the checkpoints will have incorporated the same issue.

File Locations
--------------

All results are placed in the output directory specified by `--output` (default: `vConTACT3_results/`).

Directory structure:

.. code-block:: text

   vConTACT3_results/
   ├── final_assignments.csv
   ├── performance_metrics.csv
   ├── networks/
   │   ├── part1.graphml
   │   ├── part1.cyjs
   │   ├── part1.d3.json
   │   ├── part1.d3.json.html
   │   ├── part2_cosmo_data.csv
   │   ├── part2_cosmo_metadata.csv
   │   └── part2.d3.json
   ├── pc_profiles/
   │   ├── rank_rank-name.csv
   │   └── rank_rank-name.svg
   └── completeness/
       └── completeness.csv

For more details about the optional export formats (profiles, Cytoscape files, etc.), see the :doc:`Exports ` section.
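
As a starting point for downstream analysis, the assignments table can be read with Python's standard ``csv`` module. This is a minimal sketch rather than an official vConTACT3 API: the helper name ``predicted_genera`` is hypothetical, the column names (``Genome``, ``Reference``, ``genus_prediction``) are taken from the example table on this page, and it assumes that ``Reference`` is written as the text ``True``/``False`` and that missing values appear as ``np.nan``, ``nan``, or an empty field.

.. code-block:: python

   import csv

   # Assumed encodings of a missing value in the CSV (not confirmed by vConTACT3).
   MISSING = {"", "nan", "np.nan"}


   def predicted_genera(assignments_csv):
       """Map each non-reference genome to its predicted genus.

       Assumptions: the columns 'Genome', 'Reference' and 'genus_prediction'
       exist (as in the example table), and 'Reference' holds True/False text.
       """
       genera = {}
       with open(assignments_csv, newline="") as handle:
           for row in csv.DictReader(handle):
               if row["Reference"].strip().lower() == "true":
                   continue  # skip genomes that come from the reference database
               genus = row["genus_prediction"].strip()
               if genus.lower() not in MISSING:
                   genera[row["Genome"]] = genus
       return genera

With the default output directory, ``predicted_genera("vConTACT3_results/final_assignments.csv")`` returns a dictionary of input genomes and their predicted genera, leaving out references and genomes without a genus-level prediction.
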