Virus strains:
HSV-1 strain 17 (UP000009294)
HSV-2 strain HG52 (UP000001874)
VZV strain Dumas (UP000002602)
HCMV strain Merlin (UP000000938)
HHV-6A isolate U1102 (NC_001664)
HHV-6B strain Z29 (NC_000898)
HHV-7 strain RK (NC_001716)
EBV strain B95-8 (UP000153037)
KSHV strain GK18 (UP000000942)
Predictions:
All proteins were initially predicted using ColabFold with 3x recycles. If this model failed the quality scoring (see below for details), the prediction was rerun with AlphaFold2 and ColabFold with 20x recycles. The model that passed the quality scoring (or, if both passed, the model with the higher pTM) is shown.
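The selection rule described above can be sketched in Python as follows. This is an illustrative sketch only, not code from the Herpesfolds repository: the `Model` class, the `select_model` function, and the model labels are hypothetical names, and returning `None` when no rerun passes is an assumption, since that case is not described here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Model:
    name: str     # hypothetical label, e.g. "colabfold_3x"
    ptm: float    # predicted template modeling score
    passed: bool  # outcome of the quality scoring


def select_model(initial: Model, reruns: List[Model]) -> Optional[Model]:
    """Pick the model to show for one protein.

    The initial model is kept if it passed. Otherwise the passing rerun
    is chosen, or the passing rerun with the higher pTM if several passed.
    """
    if initial.passed:
        return initial
    passing = [m for m in reruns if m.passed]
    if not passing:
        # Assumption: the handling of this case is not specified in the text.
        return None
    return max(passing, key=lambda m: m.ptm)
```

For example, if the initial model fails and both reruns pass with pTM values of 0.5 and 0.6, the rerun with pTM 0.6 is selected.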
Proteins were also analyzed with DeepTMHMM to identify signal peptides. If a signal peptide was found, the prediction was rerun without the signal peptide. If a transmembrane domain was also identified, this domain is specified.
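Removing a signal peptide before rerunning a prediction could be sketched as below. The sketch assumes a per-residue label string of the same length as the sequence, in which 'S' marks signal peptide residues and 'M' marks transmembrane residues. This label alphabet and both function names are assumptions made for illustration and do not reproduce DeepTMHMM's interface or the scripts used for the analysis.

```python
import re
from typing import List, Tuple


def strip_signal_peptide(sequence: str, labels: str) -> str:
    """Return the sequence without its N-terminal signal peptide.

    Assumes `labels` has one character per residue and that the signal
    peptide is the leading run of 'S' labels.
    """
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    n_signal = len(labels) - len(labels.lstrip("S"))
    return sequence[n_signal:]


def transmembrane_segments(labels: str) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) positions of each run of 'M'.

    Positions refer to the full-length sequence, signal peptide included.
    """
    return [(m.start() + 1, m.end()) for m in re.finditer(r"M+", labels)]
```

With the sequence "MKTAYIAKQR" and the labels "SSSOOMMMII", the first function returns "AYIAKQR" and the second reports one transmembrane segment at positions 6 to 8.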
Structural similarity searches were performed using locally installed versions of DALI (DaliLite.v5) [8578593] and Foldseek [37156916]. Networks were visualized with Gephi v0.10.1. Data analysis was done in Python; all scripts used are available at https://github.com/QuantitativeVirology/Herpesfolds/
Quality scoring:
All proteins were initially predicted using ColabFold, as it is more time-efficient. To assess the quality of the resulting models, three thresholds were used; failing any one of them led to the model being flagged as “fail”.
We first assessed whether model optimization had converged by testing the consistency of the predicted local distance difference test (pLDDT) scores across the 5 models predicted for each protein. To do so, we calculated the mean and standard deviation of the pLDDT values for each model and then took the standard deviation of the 5 means and of the 5 standard deviations, resulting in ‘StDev of mean’ and ‘StDev of StDev’. To set thresholds, we fit a Gaussian curve to the histogram of each metric and set the cutoff at the mean of the Gaussian plus 2 times its standard deviation. Accordingly, a ‘StDev of mean’ > 3.2 or a ‘StDev of StDev’ > 1.9 was deemed to indicate a low-quality model.
We also evaluated the predicted template modeling score (pTM) to further validate model confidence. To set a global threshold above which a prediction likely constitutes a folded protein, we randomly chose 106 herpesvirus proteins representing the protein length distribution of the herpesvirus proteomes and scrambled their amino acid sequences. These scrambled sequences were used for model prediction and treated as a negative dataset, assuming that random sequences should not fold. Receiver operating characteristic (ROC) analysis, with an area under the curve of 91.3%, was used to set a threshold using the Youden index (Figure S1E). Following this analysis, a pTM < 0.3150 was deemed to indicate a low-quality model. We reran all models that failed the initial thresholding in AlphaFold and rescored them. The corresponding code can be found in our Herpesfolds GitHub repository in the file ‘score_colabfold.py’.
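The scoring steps can be illustrated with a short Python sketch that uses only the standard library. It is not a reproduction of ‘score_colabfold.py’. All function names are hypothetical, the use of the population standard deviation is an assumption (the text does not say whether the population or sample form was used), and the pTM value is taken as a single input per protein.

```python
import random
import statistics
from typing import List, Sequence, Tuple

STDEV_OF_MEAN_MAX = 3.2   # cutoff stated in the text
STDEV_OF_STDEV_MAX = 1.9  # cutoff stated in the text
PTM_MIN = 0.3150          # cutoff stated in the text


def consistency_stats(plddt_per_model: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return ('StDev of mean', 'StDev of StDev') across the models of one protein."""
    means = [statistics.mean(p) for p in plddt_per_model]
    stdevs = [statistics.pstdev(p) for p in plddt_per_model]  # assumption: population SD
    return statistics.pstdev(means), statistics.pstdev(stdevs)


def passes_quality(plddt_per_model: Sequence[Sequence[float]], ptm: float) -> bool:
    """A model set fails if any one of the three thresholds is violated."""
    sd_mean, sd_sd = consistency_stats(plddt_per_model)
    return (sd_mean <= STDEV_OF_MEAN_MAX
            and sd_sd <= STDEV_OF_STDEV_MAX
            and ptm >= PTM_MIN)


def scramble(sequence: str, seed: int = 0) -> str:
    """Shuffle residues to build a negative-control sequence of equal composition."""
    rng = random.Random(seed)
    return "".join(rng.sample(sequence, len(sequence)))


def youden_threshold(real_ptm: List[float], scrambled_ptm: List[float]) -> Tuple[float, float]:
    """Return the pTM cutoff maximizing J = sensitivity + specificity - 1.

    A prediction is called 'folded' when its pTM is >= the cutoff.
    Ties are resolved in favor of the lowest cutoff.
    """
    best_t, best_j = 0.0, -1.0
    for t in sorted(set(real_ptm) | set(scrambled_ptm)):
        sensitivity = sum(x >= t for x in real_ptm) / len(real_ptm)
        specificity = sum(x < t for x in scrambled_ptm) / len(scrambled_ptm)
        j = sensitivity + specificity - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

In this sketch, five models with identical pLDDT profiles give a ‘StDev of mean’ and a ‘StDev of StDev’ of 0 and pass whenever the pTM is at least 0.3150, whereas per-model means that alternate between 90 and 50 give a ‘StDev of mean’ of about 19.6 and fail.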