FAQ & Troubleshooting
=====================

Frequently Asked Questions (FAQ)
--------------------------------

- **The** ``name`` **/** ``shared name`` **columns in my Cytoscape export show numeric IDs, not my original FASTA headers.**

  These columns are internal network node identifiers used by Cytoscape; they are not genome identifiers. The original FASTA sequence IDs are preserved in the ``GenomeName`` column for all user sequences (``Reference=False``). For reference genomes (``Reference=True``), ``GenomeName`` contains the NCBI genome name from the database. To cross-reference the Cytoscape export with ``final_assignments.csv``, join on ``GenomeName``.

- **Re-running my dataset from scratch yielded slightly different results than resuming a previous run using the same dataset!**

  This is a result of the non-deterministic nature of MMseqs2 clustering. Everything "post"-clustering (i.e. after the profiles are generated) will produce identical results. The absolute "value" of the taxonomy, meaning the name and/or number of the novel group, may differ between runs, but the comparison will be symmetrical: genomes A, B and C might be "novel_X" in one run, but "novel_Y" in another.

- **Can I run vConTACT3 with only protein sequences?**

  Yes. Provide a protein FASTA via ``--proteins`` and a gene-to-genome mapping via ``--gene2genome``. A genome lengths file (``--len-nucleotide``) is optional. If it is omitted, ``Size (Kb)`` will be ``NaN`` in the output and the ``ANI`` export will be automatically disabled.

  .. code-block:: bash

     vcontact3 run --proteins proteins.faa --gene2genome gene2genome.tsv --output results_dir

     # With genome lengths (enables Size (Kb) in output and ANI export):
     vcontact3 run --proteins proteins.faa --gene2genome gene2genome.tsv --len-nucleotide genome_lengths.tsv --output results_dir

- **My run fails with** ``FileNotFoundError`` **for my** ``.faa`` **file after the first identity level succeeds.**

  The temporary file cleanup uses a glob pattern that matches any filename containing ``mmseq``. If your input filename includes that string (e.g. ``my-mmseqs-results.fna``), derived files such as the predicted proteins (``.faa``) will be deleted at the end of the first identity pass, and subsequent identity levels will fail. Renaming the input to omit ``mmseq`` from the filename resolves the issue.

- **Why did the novel prediction naming change in 3.1.5?**

  vConTACT3 was not initially benchmarked against the upper ranks (kingdom --> class) and used the placeholder ``novel__<#>_of_`` to assign ranks where no official taxonomy existed. These assignments could lead to something like the following:

  "novel_genus_0_of_novel_subfamily_0_of_novel_family_0_of_novel_order_0_of_novel_class_0_of_novel_phylum_0_of_novel_kingdom_1233_of_default"

  This created two problems: 1) users unknowingly assumed that *1233* new kingdoms had been identified, rather than simply that the sequence in *1233* might be in a different kingdom than a sequence in *1234*, and 2) the text itself is lengthy and implies a hierarchical structure that might not be grounded in biology. The prediction now uses ``unplaced__within_`` for all the upper ranks, and then starts numbering at order. This allows users to more easily group predictions and prevents over-interpretation of results. Additionally, references existing in upper ranks will be grouped *at least* by order.
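
For the ``FileNotFoundError`` entry above, a quick pre-flight check of the input filename may save a failed run. The sketch below is not part of vConTACT3: ``name_contains_mmseq`` and ``suggest_safe_name`` are hypothetical helpers, and the check assumes the cleanup pattern is case-sensitive and only looks at the filename, not at its directory.

```bash
# Hypothetical helper (not part of vConTACT3): succeeds when the file name,
# ignoring its directory, contains the substring "mmseq".
name_contains_mmseq() {
    case "${1##*/}" in
        *mmseq*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Hypothetical helper: print the file name with every "mmseq" broken up
# by a hyphen, so the result no longer contains that substring.
suggest_safe_name() {
    printf '%s\n' "${1##*/}" | sed 's/mmseq/mm-seq/g'
}

input="my-mmseqs-results.fna"   # example filename from the FAQ entry
if name_contains_mmseq "$input"; then
    echo "Rename before running: $(suggest_safe_name "$input")"
else
    echo "File name is safe: $input"
fi
```

With the example filename this prints ``Rename before running: my-mm-seqs-results.fna``. The helpers only suggest a name; the rename itself (for example with ``mv``) is left to you.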
Troubleshooting
---------------

Missing import TreeStyle
~~~~~~~~~~~~~~~~~~~~~~~~

There is a known `issue `_ when installing ete3 through conda. Unfortunately, the bug persists on some systems (none of which I have access to). So, if you want to export dendrogram figures, you will need to install PyQt5 manually as well. This *does not* affect Newick-formatted dendrogram exports. You can try the following:

.. code-block:: bash

   python -m pip install PyQt5

Verify the location with ``--db-path`` if using a custom directory.

'/usr/bin/clang' failed with exit code 1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This error is fairly generic, and there could be many causes. For this software specifically, it can occur if you try to install the packages (via pip) on Apple Silicon (M1 and M2). The failing package might be scikit-bio, numba or pandas. You can try the following:

.. code-block:: bash

   pip3 install --upgrade pip
   python3 -m pip install --upgrade setuptools

High Memory Usage on Large Datasets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For more details, see :doc:`memory usage `

For large genome sets, memory consumption may spike. To mitigate this, you can use ``--reduce-memory`` to cast arrays to ``float16``:

.. code-block:: bash

   vcontact3 run --reduce-memory --nucleotide genomes.fna --output results_dir

Keep in mind that this is a stop-gap measure. While it will reduce memory by 50%, if memory consumption spikes from 60 GB to 600 GB, 50% of that is still 300 GB. Please consider running on a high-memory node for very large datasets.

Slow Performance on D3js Exports
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Exporting large networks for D3js visualization can be memory-intensive and challenging for even the best browsers to render.

- Use ``--breaks`` to split networks into smaller chunks:

  .. code-block:: bash

     vcontact3 run --breaks 5 --nucleotide genomes.fna --output results_dir --exports d3js

  Be aware that this does not split network components, but helps manage file size and rendering. Since it doesn't split components, that also means if a component has 100K members, you'll *always* get a 100K-member network.
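
Returning to the ``--reduce-memory`` arithmetic under High Memory Usage on Large Datasets, the sketch below shows one way to judge whether halving an observed peak would fit on a given node. It takes the 50% figure from the text at face value, and actual savings may differ. ``reduced_peak_gb`` and ``fits_with_reduce_memory`` are hypothetical helpers, and the 256 GB node size is an assumed example value.

```bash
# Hypothetical helper: halve an observed peak (whole GB), rounding up,
# following the 50% reduction quoted for --reduce-memory.
reduced_peak_gb() {
    echo $(( ($1 + 1) / 2 ))
}

# Hypothetical helper: succeeds when the halved peak fits in the node's
# memory. Both arguments are whole GB: observed peak, then node memory.
fits_with_reduce_memory() {
    [ "$(reduced_peak_gb "$1")" -le "$2" ]
}

peak_gb=600   # peak without --reduce-memory (example value from the text)
node_gb=256   # memory available on the node (assumed example value)

if fits_with_reduce_memory "$peak_gb" "$node_gb"; then
    echo "--reduce-memory may be enough: about $(reduced_peak_gb "$peak_gb") GB needed"
else
    echo "Still about $(reduced_peak_gb "$peak_gb") GB after halving; use a high-memory node"
fi
```

With these values the script reports that about 300 GB is still needed, which matches the example in the text: the flag helps with moderate spikes but does not replace a high-memory node.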