FAQ & Troubleshooting
=====================

Frequently Asked Questions (FAQ)
--------------------------------

- **The** ``name`` **/** ``shared name`` **columns in my Cytoscape export show numeric IDs, not my original FASTA headers.**

  These columns are internal network node identifiers used by Cytoscape; they are not genome identifiers. The original
  FASTA sequence IDs are preserved in the ``GenomeName`` column for all user sequences (``Reference=False``). For reference
  genomes (``Reference=True``), ``GenomeName`` contains the NCBI genome name from the database. To cross-reference the
  Cytoscape export with ``final_assignments.csv``, join on ``GenomeName``.
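
  As a rough sketch of that join using only the standard library, the function below matches rows from the Cytoscape
  export to rows from ``final_assignments.csv`` on ``GenomeName``. The in-memory CSV text, the ``genus`` column, and
  the function name are illustrative assumptions; with real files you would pass each opened file to
  ``csv.DictReader`` instead.

  .. code-block:: python

     import csv
     import io


     def join_on_genome_name(cytoscape_rows, assignment_rows):
         """Attach assignment columns to each Cytoscape row that shares a GenomeName."""
         # Index the assignments by GenomeName for fast lookup.
         by_genome = {row["GenomeName"]: row for row in assignment_rows}
         joined = []
         for node in cytoscape_rows:
             match = by_genome.get(node["GenomeName"])
             if match is not None:
                 # On a column-name clash the Cytoscape value is kept.
                 joined.append({**match, **node})
         return joined


     # Tiny stand-ins for the two files (all values are made up).
     cytoscape_csv = (
         "name,shared name,GenomeName,Reference\n"
         "101,101,contig_A,False\n"
         "102,102,contig_B,False\n"
     )
     assignments_csv = "GenomeName,genus\ncontig_A,placeholder_genus\n"  # 'genus' is hypothetical

     nodes = list(csv.DictReader(io.StringIO(cytoscape_csv)))
     assignments = list(csv.DictReader(io.StringIO(assignments_csv)))
     merged = join_on_genome_name(nodes, assignments)
     print(merged)

  This is an inner join: nodes without a matching ``GenomeName`` in the assignments are dropped.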

- **Re-running my dataset from scratch yielded slightly different results than resuming a previous run using the same dataset!**

  This is a result of the non-deterministic nature of MMseqs2 clustering. Everything after clustering (i.e. after
  the profiles are generated) is deterministic and yields identical results. The absolute labels of the taxonomy may
  differ between runs (the name and/or number of a novel group), but the groupings themselves should correspond. So,
  genomes A, B and C might be "novel_X" in one run, but "novel_Y" in another.
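
  One way to compare two runs without being misled by relabeling is to compare group membership only. The sketch
  below is a minimal illustration under that idea; the function name and the genome-to-label dictionaries are
  hypothetical and would need to be built from your own output files.

  .. code-block:: python

     from collections import defaultdict


     def same_grouping(run_a, run_b):
         """Return True if two genome->label mappings group genomes identically,
         regardless of what the groups are called."""
         if run_a.keys() != run_b.keys():
             return False

         def partition(assignments):
             groups = defaultdict(set)
             for genome, label in assignments.items():
                 groups[label].add(genome)
             # A set of frozensets discards the labels and keeps only membership.
             return {frozenset(members) for members in groups.values()}

         return partition(run_a) == partition(run_b)


     first = {"A": "novel_X", "B": "novel_X", "C": "novel_X", "D": "novel_Z"}
     second = {"A": "novel_Y", "B": "novel_Y", "C": "novel_Y", "D": "novel_W"}
     print(same_grouping(first, second))  # True: same groups, different names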

- **Can I run vConTACT3 with only protein sequences?**

  Yes. Provide a protein FASTA via ``--proteins`` and a gene-to-genome mapping via ``--gene2genome``. A genome lengths
  file (``--len-nucleotide``) is optional. If it is omitted, ``Size (Kb)`` will be ``NaN`` in the output and the ``ANI``
  export will be automatically disabled.

  .. code-block:: bash

     vcontact3 run --proteins proteins.faa --gene2genome gene2genome.tsv --output results_dir

     # With genome lengths (enables Size (Kb) in output and ANI export):
     vcontact3 run --proteins proteins.faa --gene2genome gene2genome.tsv --len-nucleotide genome_lengths.tsv --output results_dir

- **My run fails with** ``FileNotFoundError`` **for my** ``.faa`` **file after the first identity level succeeds.**

  The temporary file cleanup uses a glob pattern that matches any filename containing ``mmseq``. If your input file
  includes that string (e.g. ``my-mmseqs-results.fna``), derived files such as the predicted proteins (``.faa``) will
  be deleted at the end of the first identity pass and subsequent identity levels will fail. Renaming the input to
  omit ``mmseq`` from the filename resolves the issue.
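
  A quick pre-flight check along these lines can catch the problem before a run. It is only a sketch: the helper name
  is hypothetical, and it compares case-insensitively to be conservative, since the case sensitivity of the actual
  cleanup pattern is not confirmed here.

  .. code-block:: python

     from pathlib import Path


     def name_contains_mmseq(path):
         """Return True if the file name itself (not its parent directories) contains 'mmseq'."""
         return "mmseq" in Path(path).name.lower()


     for candidate in ["my-mmseqs-results.fna", "genomes.fna"]:
         status = "rename before running" if name_contains_mmseq(candidate) else "ok"
         print(f"{candidate}: {status}")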

- **Why did the novel prediction naming change in 3.1.5?**

  vConTACT3 was not initially benchmarked against the upper ranks (kingdom through class) and used the placeholder
  ``novel_<rank>_<#>_of_<parent_rank>`` to assign ranks where no official taxonomy existed. These assignments could
  produce names such as
  ``novel_genus_0_of_novel_subfamily_0_of_novel_family_0_of_novel_order_0_of_novel_class_0_of_novel_phylum_0_of_novel_kingdom_1233_of_default``.
  This created two problems: 1) users mistakenly assumed that *1233* new kingdoms had been identified, rather than
  simply that the sequence in *1233* might be in a different kingdom than a sequence in *1234*, and 2) the text itself
  is lengthy and implies a hierarchical structure that might not be grounded in biology.

  The prediction now uses ``unplaced_<rank>_within_<parent_rank>`` for all the upper ranks, and numbering starts at
  order. This allows users to group predictions more easily and helps prevent over-interpretation of results.
  Additionally, references existing in upper ranks will be grouped *at least* by order.


Troubleshooting
---------------

Missing import TreeStyle
~~~~~~~~~~~~~~~~~~~~~~~~

There is a known `issue <https://github.com/etetoolkit/ete/issues/354>`_ when installing ete3 through conda.
Unfortunately, the bug persists on some systems (none of which I have access to). So, if you want to export
dendrogram figures, you will need to install PyQt5 manually as well. This *does not* affect Newick-formatted
dendrogram exports.
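
To see whether your environment is affected, a small check like the following may help. It is a sketch, not part of
vConTACT3: the function name is hypothetical, and it only tests whether ``TreeStyle`` is exposed by the installed
``ete3`` package.

.. code-block:: python

   import importlib


   def can_import_treestyle():
       """Return True if ``ete3`` is installed and exposes ``TreeStyle``."""
       try:
           module = importlib.import_module("ete3")
       except ImportError:
           # ete3 itself is missing from this environment.
           return False
       # ete3 can import without its drawing classes when the Qt bindings are absent.
       return hasattr(module, "TreeStyle")


   print("TreeStyle available:", can_import_treestyle())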

You can try the following:

.. code-block:: bash

   python -m pip install PyQt5

Verify the location with ``--db-path`` if using a custom directory.

'/usr/bin/clang' failed with exit code 1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This error is fairly generic and can have many causes. For this software specifically, it can occur if you try
to install the packages (via pip) on Apple Silicon (M1 and M2). It might be triggered by scikit-bio, numba, pandas, etc.

You can try the following:

.. code-block:: bash

   python3 -m pip install --upgrade pip
   python3 -m pip install --upgrade setuptools


High Memory Usage on Large Datasets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For more details, see :doc:`memory usage <memory_usage>`.

For large genome sets, memory consumption may spike. To mitigate this, you can use ``--reduce-memory`` to cast arrays
to ``float16``:
  
.. code-block:: bash

   vcontact3 run --reduce-memory --nucleotide genomes.fna --output results_dir

Keep in mind that this is a stop-gap measure. While it will reduce memory by 50%, if memory consumption spikes from
60 GB to 600 GB, 50% of that is still 300 GB. Please consider running on a high-memory node for very large datasets.
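
The arithmetic behind that caveat can be sketched in a few lines. This assumes the affected arrays are stored as
32-bit floats by default (inferred from the 50% figure, not confirmed here) and that they dominate peak memory;
``reduced_peak_gb`` is a hypothetical helper name.

.. code-block:: python

   import struct

   FLOAT32_BYTES = struct.calcsize("f")  # size of a 32-bit float
   FLOAT16_BYTES = struct.calcsize("e")  # size of a 16-bit (half-precision) float


   def reduced_peak_gb(peak_gb):
       """Estimate peak memory after casting float32 arrays to float16."""
       return peak_gb * FLOAT16_BYTES / FLOAT32_BYTES


   for peak in (60, 600):
       print(f"{peak} GB peak -> about {reduced_peak_gb(peak):.0f} GB with float16")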

Slow Performance on D3js Exports
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Exporting large networks for D3js visualization can be memory-intensive, and the results can be challenging for even
the best browsers to render.

- Use ``--breaks`` to split networks into smaller chunks:
  
  .. code-block:: bash

     vcontact3 run --breaks 5 --nucleotide genomes.fna --output results_dir --exports d3js

Be aware that this does not split network components, but helps manage file size and rendering. Since it doesn't split
components, that also means if a component has 100K members, you'll *always* get a 100K-member network.