Memory Usage & Performance Considerations

vConTACT3 processes large-scale viral genome datasets, which can be resource-intensive. Memory and runtime are hard to predict precisely for any given dataset, but this page outlines what to expect.

Typical Resource Usage

The following are rough estimates by dataset size, assuming 40 cores and 1.5% dataset sparsity:

  • 20,000 genomes: ~5-7 GB RAM, 15 min

  • 50,000 genomes: ~9-12 GB RAM, 25 min

  • 150,000 genomes: ~50-60 GB RAM, 1 hr 30 min

  • 400,000 genomes: ~175-225 GB RAM, 4-5 hr

  • 750,000 genomes: ~700-800 GB RAM, 16 hr

Runtime scales with several factors: dataset size, dataset sparsity, the number of user genomes co-localized with references, and, most importantly, the size of the largest connected component in the network.
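For a dataset size that falls between the listed points, a rough figure can be read off by interpolating the table. The sketch below is not part of vConTACT3; it is a minimal illustration that assumes the midpoints of the RAM ranges listed above and interpolates between them on a log-log scale. The function name estimate_ram_gb and the choice of log-log interpolation are assumptions made for this example, and the result inherits every caveat above (40 cores, 1.5% sparsity, and no knowledge of your largest connected component).

```python
import math

# Midpoints of the RAM ranges listed above, as (genomes, GB).
# These are the page's rough figures, not measurements of your data.
REFERENCE_POINTS = [
    (20_000, 6.0),
    (50_000, 10.5),
    (150_000, 55.0),
    (400_000, 200.0),
    (750_000, 750.0),
]


def estimate_ram_gb(n_genomes: int) -> float:
    """Interpolate RAM (GB) on a log-log scale between the listed points.

    Sizes outside the listed range are extrapolated from the nearest
    segment and are even less reliable than interpolated ones.
    """
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")

    # Walk the segments in order; stop at the first one whose upper end
    # covers n_genomes. If none does, the last segment is kept.
    for (x0, y0), (x1, y1) in zip(REFERENCE_POINTS, REFERENCE_POINTS[1:]):
        if n_genomes <= x1:
            break

    slope = (math.log(y1) - math.log(y0)) / (math.log(x1) - math.log(x0))
    log_estimate = math.log(y0) + slope * (math.log(n_genomes) - math.log(x0))
    return math.exp(log_estimate)


if __name__ == "__main__":
    for n in (100_000, 250_000, 600_000):
        print(f"{n:>9,} genomes: roughly {estimate_ram_gb(n):.0f} GB RAM")
```

Treat the output as a starting point for a resource request rather than a guarantee, and leave generous headroom: a single very large connected component can push real usage well past the interpolated value.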

Current and future work by the vConTACT3 team is primarily focused on reducing memory requirements, so check back for updates.

Performance Optimization Strategies

  • Reduce Memory Consumption

    Use the --reduce-memory flag to downcast clustering arrays to float16, reducing RAM usage by ~50%.

    vcontact3 run --reduce-memory --nucleotide genomes.fna --output results_dir
    

    This is only a stop-gap measure. Depending on dataset complexity, a 50% memory saving may not be enough to offset the growth in memory requirements.

  • Split Large Exports

    Use --breaks to chunk large network exports into smaller parts, improving file handling and visualization speed.

    vcontact3 run --breaks 5 --nucleotide genomes.fna --output results_dir
    

    Keep in mind that --breaks does not split individual network components, so a component with 100K nodes will still have 100K nodes.

  • Adjust CPU Count

    Several portions of vConTACT3 are multithreaded. Set the maximum number of CPUs to use with --threads.

    vcontact3 run --threads 64 --nucleotide genomes.fna --output results_dir
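
To make the --reduce-memory trade-off more concrete, the sketch below shows what downcasting to float16 costs and saves per stored value. It does not use or reproduce vConTACT3's internals. It assumes the clustering arrays are 32-bit floats before downcasting (the page does not state the original type, but a ~50% saving is consistent with float32 to float16), and the value count used is a hypothetical number chosen only for the arithmetic.

```python
import struct

# Standard sizes of IEEE 754 single and half precision values, in bytes.
BYTES_FLOAT32 = struct.calcsize("<f")  # 4
BYTES_FLOAT16 = struct.calcsize("<e")  # 2


def array_size_gb(n_values: int, bytes_per_value: int) -> float:
    """Size in GB of a dense array holding n_values numbers."""
    return n_values * bytes_per_value / 1e9


def roundtrip_float16(value: float) -> float:
    """Store a value as half precision and read it back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


if __name__ == "__main__":
    n_values = 10_000_000_000  # hypothetical number of stored values
    print("float32:", array_size_gb(n_values, BYTES_FLOAT32), "GB")
    print("float16:", array_size_gb(n_values, BYTES_FLOAT16), "GB")

    # Half precision keeps only about three significant decimal digits,
    # so the stored value is close to, but not equal to, the original.
    original = 0.1234567
    print("original:", original, "stored as float16:", roundtrip_float16(original))
```

The halved footprint comes at the price of precision, which is one reason to treat the flag as a fallback for runs that would otherwise not fit in memory.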