FAQ & Troubleshooting

Frequently Asked Questions (FAQ)

The name / shared name columns in my Cytoscape export show numeric IDs, not my original FASTA headers.

These columns are internal network node identifiers used by Cytoscape — they are not genome identifiers. The original FASTA sequence IDs are preserved in the GenomeName column for all user sequences (Reference=False). For reference genomes (Reference=True), GenomeName contains the NCBI genome name from the database. To cross-reference the Cytoscape export with final_assignments.csv, join on GenomeName.
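
As a rough illustration of the join described above, the following sketch merges two small in-memory tables on GenomeName using only the Python standard library. The file contents are invented, and the genus_prediction column is a hypothetical stand-in for whatever columns your final_assignments.csv actually contains; only the GenomeName, name, shared name and Reference column names come from the text above.

```python
import csv
import io


def join_on_genome_name(cytoscape_rows, assignment_rows, key="GenomeName"):
    """Attach assignment columns to each Cytoscape node row that shares the key."""
    by_genome = {row[key]: row for row in assignment_rows}
    joined = []
    for node in cytoscape_rows:
        match = by_genome.get(node[key], {})
        # Node columns win on name clashes so Cytoscape's own fields are kept.
        joined.append({**match, **node})
    return joined


# Tiny in-memory stand-ins for the two files (contents are invented).
cytoscape_csv = io.StringIO(
    "name,shared name,GenomeName,Reference\n"
    "0,0,contig_A,False\n"
    "1,1,contig_B,False\n"
)
assignments_csv = io.StringIO(
    "GenomeName,genus_prediction\n"  # genus_prediction is a hypothetical column
    "contig_A,novel_genus_1\n"
    "contig_B,novel_genus_2\n"
)

nodes = list(csv.DictReader(cytoscape_csv))
assignments = list(csv.DictReader(assignments_csv))
merged = join_on_genome_name(nodes, assignments)
for row in merged:
    print(row["name"], row["GenomeName"], row.get("genus_prediction"))
```

To use it on real exports, replace the two io.StringIO objects with open(...) calls on your own files.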

Re-running my dataset from scratch yielded slightly different results than resuming a previous run using the same dataset!

This is the result of the non-deterministic nature of MMseqs2 clustering. Everything after clustering (i.e. after the profiles are generated) is deterministic and will produce identical results. The absolute "value" of the taxonomy may differ between runs, meaning the name and/or number of a novel group, but the comparison will be symmetrical. So, genomes A, B and C might be "novel_X" in one run, but "novel_Y" in another.
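
If you want to confirm that two runs differ only in how the novel groups are labelled, one option is to compare the groupings while ignoring the label names. The sketch below is a minimal example of that idea, not part of vConTACT3; the dictionaries are invented, and in practice you would build them from the relevant column of each run's output.

```python
def same_grouping(run_a, run_b):
    """Return True if two {genome: label} maps group genomes identically,
    ignoring what the labels are called."""
    if run_a.keys() != run_b.keys():
        return False
    a_to_b, b_to_a = {}, {}
    for genome, label_a in run_a.items():
        label_b = run_b[genome]
        # Each label in one run must map to exactly one label in the other.
        if a_to_b.setdefault(label_a, label_b) != label_b:
            return False
        if b_to_a.setdefault(label_b, label_a) != label_a:
            return False
    return True


# Invented example: same groups, different names.
fresh = {"A": "novel_X", "B": "novel_X", "C": "novel_X", "D": "novel_Z"}
resumed = {"A": "novel_Y", "B": "novel_Y", "C": "novel_Y", "D": "novel_W"}
print(same_grouping(fresh, resumed))
```

The function returns False as soon as one run splits or merges a group that the other run keeps intact.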

Can I run vConTACT3 with only protein sequences?

Yes. Provide a protein FASTA via --proteins and a gene-to-genome mapping via --gene2genome. A genome lengths file (--len-nucleotide) is optional — if omitted, Size (Kb) will be NaN in the output and the ANI export will be automatically disabled.

vcontact3 run --proteins proteins.faa --gene2genome gene2genome.tsv --output results_dir

# With genome lengths (enables Size (Kb) in output and ANI export):
vcontact3 run --proteins proteins.faa --gene2genome gene2genome.tsv --len-nucleotide genome_lengths.tsv --output results_dir

My run fails with FileNotFoundError for my .faa file after the first identity level succeeds.

The temporary file cleanup uses a glob pattern that matches any filename containing mmseq. If your input file includes that string (e.g. my-mmseqs-results.fna), derived files such as the predicted proteins (.faa) will be deleted at the end of the first identity pass and subsequent identity levels will fail. Renaming the input to omit mmseq from the filename resolves the issue.
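
A quick pre-flight check along these lines can catch the problem before a long run. This is only a sketch: the exact glob pattern used by the cleanup is not given here, so the code assumes a plain substring match on the file name and checks case-insensitively to be conservative.

```python
from pathlib import Path


def clashes_with_cleanup(path, needle="mmseq"):
    """Return True if the file name contains the substring that the
    temporary-file cleanup is described as matching."""
    # Only the file name is checked, not parent directories (an assumption).
    return needle in Path(path).name.lower()


for candidate in ["my-mmseqs-results.fna", "my-results.fna"]:
    status = "rename before running" if clashes_with_cleanup(candidate) else "ok"
    print(f"{candidate}: {status}")
```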

Why did the novel prediction naming change in 3.1.5?

vConTACT3 was not initially benchmarked against the upper ranks (kingdom through class) and used the placeholder novel_<rank>_<#>_of_<parent_rank> to assign ranks where no official taxonomy existed. These assignments could lead to something like the following: "novel_genus_0_of_novel_subfamily_0_of_novel_family_0_of_novel_order_0_of_novel_class_0_of_novel_phylum_0_of_novel_kingdom_1233_of_default". This created two problems: 1) users mistakenly assumed that 1233 new kingdoms had been identified, rather than simply that the sequence in 1233 might be in a different kingdom than a sequence in 1234, and 2) the text itself is lengthy and implies a hierarchical structure that might not be grounded in biology.

The prediction now uses "unplaced_<rank>_within_<parent_rank>" for all the upper ranks, and then starts numbering at order. This will allow users to more easily group predictions and prevent over-interpretation of results. Additionally, references existing in upper ranks will be grouped at least by order.

Troubleshooting

Missing import TreeStyle

There is a known issue when installing ete3 through conda. Unfortunately, the bug persists on some systems (none of which I have access to). So, if you want to export dendrogram figures, you will need to install PyQt5 manually as well. This does not affect Newick-formatted dendrogram exports.

You can try the following:

python -m pip install PyQt5
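
To check whether the install fixed the problem, you can test the import directly in the same environment. The helper below is a small sketch (the function name is made up for illustration); it simply reports whether `from ete3 import TreeStyle` succeeds.

```python
def can_import_treestyle():
    """Return True if `from ete3 import TreeStyle` succeeds in this environment."""
    try:
        from ete3 import TreeStyle  # noqa: F401  (imported only to test availability)
    except ImportError:
        return False
    return True


if can_import_treestyle():
    print("TreeStyle is importable: dendrogram figure export should be possible.")
else:
    print("TreeStyle is not importable: install PyQt5, or rely on Newick exports.")
```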

Verify the location with --db-path if using a custom directory.

'/usr/bin/clang' failed with exit code 1

This error is fairly generic, and there could be a lot of reasons for it. For this software specifically, it can occur if you try to install the packages (via pip) on Apple Silicon (M1 and M2). The failing package might be scikit-bio, numba, or pandas.

You can try the following:

pip3 install --upgrade pip
python3 -m pip install --upgrade setuptools

High Memory Usage on Large Datasets

For more details, see memory usage

For large genome sets, memory consumption may spike. To mitigate this, you can use --reduce-memory to cast arrays to float16:

vcontact3 run --reduce-memory --nucleotide genomes.fna --output results_dir

Keep in mind that this is a stop-gap measure. While it will reduce memory by 50%, if memory consumption spikes from 60 GB to 600 GB, 50% of that is still 300 GB. Please consider running on a high-memory node for very large datasets.
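
The arithmetic behind the 50% figure can be sketched as follows. This is purely illustrative: it assumes the arrays are float32 by default (inferred from the halving, not stated above) and uses a hypothetical dense genome-by-genome array, which may not match how vConTACT3 actually stores its data.

```python
# Bytes per element for the two dtypes being compared.
BYTES_PER_VALUE = {"float32": 4, "float16": 2}


def dense_matrix_gib(n_rows, n_cols, dtype):
    """Size in GiB of a dense n_rows x n_cols array of the given dtype."""
    return n_rows * n_cols * BYTES_PER_VALUE[dtype] / 1024**3


n_genomes = 100_000  # hypothetical dataset size
for dtype in ("float32", "float16"):
    print(f"{dtype}: {dense_matrix_gib(n_genomes, n_genomes, dtype):.1f} GiB")
```

Because the size grows with the square of the genome count in this scenario, halving the element width delays the problem rather than removing it.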

Slow Performance on D3js Exports

Exporting large networks for D3js visualization can be memory-intensive, and the results can be challenging for even the best browsers to render.

Use --breaks to split networks into smaller chunks:

vcontact3 run --breaks 5 --nucleotide genomes.fna --output results_dir --exports d3js

Be aware that this does not split network components, but helps manage file size and rendering. Since it doesn't split components, that also means if a component has 100K members, you'll always get a 100K-member network.
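
To see how large your biggest component is, and therefore the smallest network you can expect for it, you could count connected-component sizes from an edge list. The sketch below uses a simple union-find; the edge list is invented, and how you obtain edges from your own export is left open. Genomes with no edges do not appear in an edge list, so they are not counted.

```python
from collections import Counter


def component_sizes(edges):
    """Return connected-component sizes (largest first) for an edge list."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    sizes = Counter(find(node) for node in list(parent))
    return sorted(sizes.values(), reverse=True)


# Invented example: one component of three genomes, one of two.
edges = [("g1", "g2"), ("g2", "g3"), ("g4", "g5")]
print(component_sizes(edges))  # prints [3, 2]
```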