### Classification tree trained with the training set

Figure 2 shows the classification tree trained with the training set. From the top (root) to the bottom (leaves) of the tree, we show the features selected to predict the outcome; features closer to the root are considered to be more important. The branches of each node, visualized by the arrows, describe the feature and the two possible conditions used for binary partitioning of the data according to which condition is satisfied. The color of each node indicates the predicted outcome for records partitioned into the category corresponding to the node: red indicates a positive predicted outcome (at least one infection in the associated buildings), and blue indicates a negative predicted outcome. The value in the circle of each node indicates the percentage of all data records that are partitioned into that node.

The model in Fig. 2 indicates that the most important feature in predicting the outcome is whether fewer than (<) 3 days in the last 7 had positive wastewater test results. The outcome is predicted to be positive if wastewater results are positive in at least 3 out of the past 7 days, and negative otherwise. Given positive wastewater results on fewer than 3 out of the past 7 days, the second most important feature is whether none of the past 5 days have positive wastewater results. If so, the outcome is predicted to be negative; otherwise, it is predicted to be positive.
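The two splits described above can be written out as a plain decision rule. The following is a minimal sketch that transcribes the description of Fig. 2; the function and argument names are hypothetical, and it assumes the day counts have already been computed from the wastewater results.

```python
def predict_building_outcome(pos_days_last7: int, pos_days_last5: int) -> bool:
    """Return True if at least one infection is predicted in the associated buildings.

    The splits follow the textual description of the fitted tree; the
    function and argument names are hypothetical.
    """
    if not 0 <= pos_days_last7 <= 7 or not 0 <= pos_days_last5 <= 5:
        raise ValueError("day counts out of range")
    # Root split: 3 or more positive wastewater days in the last 7 -> positive.
    if pos_days_last7 >= 3:
        return True
    # Second split, reached only with fewer than 3 of 7 days positive:
    # no positive day in the past 5 -> negative, otherwise positive.
    return pos_days_last5 > 0
```

Under this rule, a single positive wastewater result within the past 5 days is enough to flag the buildings, which is consistent with the stated priority on sensitivity.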

The classification/decision tree in Fig. 2 is fitted with weights of positive outcomes equal to (2/# positive classes) and weights of negative outcomes equal to (1/# negative classes).

Note that the weights are standardized by the total number of positive and negative outcomes, respectively, and then multiplied by scalars that reflect the importance placed on correctly predicting the positive and negative outcomes. Our choice of weights reflects the priority of sensitivity (true positive rate) over specificity (true negative rate) in predicting positive individual infections. A sensitivity analysis using weights equal to the reciprocal of class sizes for both classes is performed in the Appendix. The value of the penalty parameter on model complexity, *cp* = 0.02, is chosen to balance optimal performance in the training set, as suggested by cross-validation, against keeping the number of nodes in the tree small for model interpretability. A sensitivity analysis using *cp* = 0.001 to train the model is available in the Appendix to further investigate the influence of model complexity on the prediction performance and the trade-off between model complexity and interpretability.
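To make the weighting scheme concrete, the sketch below computes per-record weights of the form (scalar / class size). It is an illustration only: the function name, the default scalars 2 and 1, and the toy labels are assumptions for the example, not data or code from the study.

```python
from collections import Counter


def class_standardized_weights(outcomes, a=2.0, b=1.0):
    """Weight a / (#positive records) for positives, b / (#negative records) for negatives."""
    counts = Counter(bool(y) for y in outcomes)
    n_pos, n_neg = counts[True], counts[False]
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both outcome classes must be present")
    return [a / n_pos if y else b / n_neg for y in outcomes]


# Hypothetical toy labels (not study data): 2 positives, 6 negatives.
toy_outcomes = [1, 0, 0, 1, 0, 0, 0, 0]
weights = class_standardized_weights(toy_outcomes)

# After standardization the positive class carries total weight a and the
# negative class total weight b, regardless of the class sizes.
pos_total = sum(w for w, y in zip(weights, toy_outcomes) if y)
neg_total = sum(w for w, y in zip(weights, toy_outcomes) if not y)
```

Weights of this kind would typically be passed to a tree-fitting routine as per-observation weights; how a given library interprets them is an implementation detail not covered here.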

Table 1 shows the confusion matrix of the predictions when applying the model to the training set. The sensitivity (True Positive Rate, TPR = TP/(TP + FN)) is 83.7% and the specificity (True Negative Rate, TNR = TN/(TN + FP)) is 58.5%. Note that the calculations of sensitivity and specificity are unaffected by the weights allocated to positive and negative outcome classes as the weights appear in both numerators and denominators and cancel out. The overall weighted prediction accuracy is 75.3%, which is calculated by

$$\frac{\sum_{i = 1}^{n} w_{i} \left[ I\left( \text{predict positive} \mid \text{positive} \right) + I\left( \text{predict negative} \mid \text{negative} \right) \right]}{\sum_{i = 1}^{n} w_{i}}$$

where \(w_{i}\) denotes the weight of sample \(i\), \(I\left( \text{predict positive} \mid \text{positive} \right)\) denotes the indicator function that sample \(i\) has a positive outcome that is predicted to be positive, and \(I\left( \text{predict negative} \mid \text{negative} \right)\) denotes the indicator function that sample \(i\) has a negative outcome that is predicted to be negative. A higher estimated sensitivity than specificity is expected, as we are over-sampling the positive outcome class compared to the negative class.
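Because the weights are constant within each class, the weighted accuracy reduces to (*a* × sensitivity + *b* × specificity)/(*a* + *b*) when positives are weighted *a*/# positives and negatives *b*/# negatives. The sketch below checks this identity on hypothetical toy labels and then applies it to the rates reported in the text; the function names are illustrative.

```python
def weighted_accuracy(y_true, y_pred, weights):
    """Weighted share of records whose predicted outcome matches the observed one."""
    correct = sum(w for t, p, w in zip(y_true, y_pred, weights) if bool(t) == bool(p))
    return correct / sum(weights)


def accuracy_from_rates(sensitivity, specificity, a=2.0, b=1.0):
    """Closed form when positives weigh a/#positives and negatives b/#negatives."""
    return (a * sensitivity + b * specificity) / (a + b)


# Hypothetical toy data (not study data): sensitivity 1/2, specificity 3/4.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
toy_weights = [2 / 2 if t else 1 / 4 for t in y_true]
toy_direct = weighted_accuracy(y_true, y_pred, toy_weights)
toy_closed = accuracy_from_rates(0.5, 0.75)

# Rates reported in the text for the 2:1 weighting.
train_acc = accuracy_from_rates(0.837, 0.585)  # training set
test_acc = accuracy_from_rates(0.771, 0.628)   # testing set
```

With the reported rates, the closed form gives approximately 75.3% for the training set and 72.3% for the testing set, matching the weighted accuracies stated in the text.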

To evaluate the prediction performance of the classification tree, we then apply the model to the set-apart testing set covering the period 06/30/21 to 11/13/21. The confusion matrix is provided in Table 2. For the testing set, the sensitivity decreased from 83.7 to 77.1% while the specificity increased from 58.5 to 62.8%. The overall weighted prediction accuracy is 72.3%. The testing set covers a period in which most of the student residents had already been vaccinated and the wave of the highly infectious SARS-CoV-2 Omicron variant had not yet arrived^{49}. Therefore, fewer infections were observed, and infected cases were underrepresented relative to the total population. Despite the evolving nature of the pandemic, the model performed well and was able to predict individual infections with satisfactory accuracy and high sensitivity. We also trained a model on the testing set alone and compared it with the model trained with the training set; the comparison of results is available in the Appendix.

### Influence of weights

In this section we investigate the role of the relative weights of positive and negative outcomes in the prediction. For simplicity of notation, we denote a relative weight of (*a*/# positive classes):(*b*/# negative classes) for positive vs. negative outcomes as *a:b*. For example, the model in Fig. 2 is trained with weights 2:1; this weighting places twice as much emphasis on records with positive outcomes as on those with negative outcomes after standardizing by the total numbers of positive and negative outcomes. The trained classification tree model for relative weights 1:1 is available in the Appendix as a sensitivity analysis.

Figure 3 displays the receiver operating characteristic (ROC) curve^{48,51}, which demonstrates a trade-off between sensitivity and specificity; the *x*-axis indicates one minus the specificity, and the *y*-axis indicates the sensitivity. This curve permits a comparison of the performance of models trained with varying weights. Detailed results are provided in Table 3. With relative weights on the positive class as small as 0.2:1, all the outcomes are predicted to be negative; hence, the sensitivity is 0 and the specificity is 1. As the weight for the positive class increases, the sensitivity increases and the specificity decreases. With relative weights of 4:1 or greater, all outcomes are predicted to be positive, yielding a sensitivity of 1 and a specificity of 0.

Table 4 summarizes the importance of features in models trained with different weights, as given by the order in which nodes appear in the classification trees. For results to be comparable, a *cp* value of 0.02 is used in training all models with different weights; this approach leads to different numbers of nodes under different weight settings. For all models, the root nodes are defined by whether or not fewer than 3 out of the past 7 days have positive wastewater signals; this is consistently the most predictive wastewater feature for predicting individual COVID-19 infections. In all models with a lower-level node/leaf, the next most important feature is whether or not none of the previous 5 days have positive wastewater signals. Combined with the result of the root node, a predictive model that is robust to the choice of weights consistently includes the dichotomous features: 3 or more out of 7 days wastewater positive (yes/no) and 1 to 5 of the previous 5 days wastewater positive (vs. 0 days). This model leverages features characterizing wastewater results both in a longer-term trend of 7 days and in a shorter period of 5 days.
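The two dichotomous features that recur across weight settings are simple rolling-window counts. The sketch below shows one way to derive them from a daily series of wastewater results. It assumes that the window ends on (and includes) the day being evaluated and that there are no missing days; neither detail is specified in this section, and all names and the toy series are hypothetical.

```python
def positives_in_window(daily_results, day_index, window):
    """Count positive results in the `window` days ending at `day_index` (inclusive).

    Assumption: the window includes the current day; early days with a
    shorter history simply use the days that are available.
    """
    start = max(0, day_index - window + 1)
    return sum(1 for r in daily_results[start:day_index + 1] if r)


def robust_features(daily_results, day_index):
    """The two features that appear in the trees across weight settings."""
    return {
        "ge3_of_last7": positives_in_window(daily_results, day_index, 7) >= 3,
        "ge1_of_last5": positives_in_window(daily_results, day_index, 5) >= 1,
    }


# Hypothetical daily wastewater results (1 = positive), not study data.
toy_series = [1, 1, 0, 0, 0, 0, 0, 1, 0]
features_day7 = robust_features(toy_series, 7)
```

The same helper extends to other window lengths, such as the 3-day window mentioned for the LASSO comparison, by changing the `window` argument.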

### Prediction with random forest model as a benchmark

To further evaluate the prediction performance of the proposed classification tree model, we apply a weighted random forest model^{52} consisting of an ensemble of 1000 individual weighted classification trees. As in the classification tree model, weights are applied for oversampling the positive individual cases. The random forest is known for its high prediction accuracy but lacks the interpretability of the classification trees. Comparing the performance of the proposed model to that of the random forest enables us to assess the proposed model with a reliable benchmark and to understand the trade-off between the interpretability and prediction accuracy of models.

Detailed results are provided in Tables 5 and 6. The proposed classification tree models generally outperform the random forest models in the same weight settings, especially when the relative weights of positive vs. negative outcomes are high. For the random forest approach, a weight ratio that leads to high sensitivity and relatively high specificity is 3:1. In this case, sensitivity equals 72.9% and specificity equals 68.5%, leading to a 71.7% prediction accuracy, while the proposed classification tree model has a prediction accuracy of 73.5% (at the same 3:1 weight ratio). One possible reason for the random forest to under-perform compared to the proposed classification tree is that the random forest is based on bootstrapping (or subsampling) the data, which breaks the chronological structure of the time series in the data and thereby potentially affects the prediction performance. Another possible reason is that, given the relatively small feature space and the limited number of positive COVID-19 infections in the data, the increased complexity of the random forest model introduces more risk of overfitting, which likely contributed to its decreased accuracy when applied to unseen test data. Furthermore, the random forest model is not the preferred choice in our study due to its reduced interpretability and transparency, particularly for the purpose of guiding campus-wide policies.

### Comparisons to other statistical and machine learning models

Besides the random forest models, we also assess the proposed classification tree model against various commonly used statistical and machine learning models, thoroughly evaluating their predictive performance and interpretability. All of these models are fitted using identical features extracted from the wastewater signals, the same training and testing data partitioning, and the same weight ratios of positive vs. negative outcomes as for the classification tree models, ensuring a fair comparison. Results listed in this section focus on the model performance under the weight ratio of 2:1, as in the proposed classification tree model. Complete results under a variety of weight ratios can be found in Tables 5 and 6.

First, we apply both the logistic regression model and the logistic regression with LASSO regularization^{53} for variable selection to our preprocessed data. Ten-fold cross-validation is used to determine the value of the penalty parameter lambda for LASSO. A threshold of 0.5 for the predicted probability of positive individual infection is used to determine the binary predicted outcome. Logistic regression without variable selection produces a relatively low prediction accuracy of 67.7%, with sensitivity of 66.7% and specificity of 69.8%, when applying the model fitted using the training set to the set-aside test set. The observed under-performance of prediction in the test set may be due to its higher model complexity, which can lead to overfitting. Logistic regression with LASSO yields improved accuracy of 72.3%, with sensitivity of 77.1% and specificity of 62.8% in the test set. Variables selected using LASSO include indicators of positive wastewater signals in at least 3 days out of the past 7 days, at least 1 day out of the past 5 days, and at least 1 day out of the past 3 days. These largely overlap with the important features selected by the classification tree model, which explains the similar results. Although LASSO selects one additional variable compared to the decision tree method, it has a very similar prediction performance (identical to three significant digits). This is because the variable indicating whether wastewater signals are positive in at least 1 day out of the past 3 days has a regression coefficient very close to 0 (though not exactly equal to 0). As a popular variable selection method, LASSO is considered a comparable approach to the decision tree in our study, but it is less intuitive in terms of ranking the variable importance in prediction, which is a critical factor we consider in our policy-making process.

We also apply several machine learning models, including the Support Vector Machine (SVM) with a linear kernel and the Feedforward Neural Network (FNN) with a single hidden layer^{54,55}. Neither traditional SVM nor FNN supports variable selection, and both have limited interpretability. Furthermore, the prediction performance of these two methods falls short of the classification tree method in the test set (SVM: accuracy: 71.1%, sensitivity: 72.9%, specificity: 67.4%; FNN: accuracy: 69.0%, sensitivity: 70.8%, specificity: 67.5%). Notably, FNN exhibits impressive performance in the training set, achieving an accuracy of 78.1%. This underscores that complex machine learning methods can excel at fitting the training data but may encounter overfitting issues when applied to an unseen testing dataset. Furthermore, when the weight ratio of positive vs. negative increases to 3:1, the SVM loses its effectiveness, resulting in a specificity of 0 and predicting all outcomes as positive.

Tables 5 and 6 include detailed results of sensitivity vs. (1-specificity) when applying training-set-fitted models with different weight ratios to the test data, using each of the models in comparison. The classification tree methods and logistic regression with LASSO are the two approaches that strike a good balance between interpretability and high sensitivity, particularly when using weight ratios of 2:1 and/or 3:1. Overall, the proposed classification tree model still possesses the best prediction accuracy. Given its good prediction performance and interpretability of results, the logistic regression with LASSO can serve as a viable alternative to the classification tree model. Nevertheless, from the perspective of policy makers, the classification tree may still hold an advantage due to its intuitive feature importance ranking. Further details on prediction accuracy, sensitivity and specificity for training models can be found in the Appendix.

### Positive predictive value (PPV) and negative predictive value (NPV)

We further examine the positive predictive value (PPV) and negative predictive value (NPV) of the predictions of individual infections as defined below:

$$\begin{aligned} \text{Positive predictive value (PPV) of wastewater (WW) test} &= \frac{\text{sensitivity} \times \text{prevalence}}{\text{sensitivity} \times \text{prevalence} + \left( 1 - \text{specificity} \right) \times \left( 1 - \text{prevalence} \right)} \\ &= \frac{\text{TP}}{\text{TP} + \text{FP}} \end{aligned}$$

$$\begin{aligned} \text{Negative predictive value (NPV) of WW test} &= \frac{\text{specificity} \times \left( 1 - \text{prevalence} \right)}{\text{specificity} \times \left( 1 - \text{prevalence} \right) + \left( 1 - \text{sensitivity} \right) \times \text{prevalence}} \\ &= \frac{\text{TN}}{\text{TN} + \text{FN}} \end{aligned}$$

where TP and FP are the numbers of true and false positives and TN and FN are the numbers of true and false negatives in the prediction, and the prevalence is the proportion of truly positive units (TP + FN) among all tested units of observation (which could be, for example, at a building or individual level).
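The two forms of each definition agree, which can be checked numerically. The sketch below computes PPV and NPV from sensitivity, specificity, and prevalence, and compares them with the count-based forms TP/(TP + FP) and TN/(TN + FN). The confusion-matrix counts are hypothetical and chosen only for illustration.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from sensitivity, specificity and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def npv(sensitivity, specificity, prevalence):
    """Negative predictive value from sensitivity, specificity and prevalence."""
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)


# Hypothetical confusion-matrix counts (not study data).
TP, FN, FP, TN = 30, 10, 20, 40
sens = TP / (TP + FN)
spec = TN / (TN + FP)
prev = (TP + FN) / (TP + FN + FP + TN)

ppv_from_rates = ppv(sens, spec, prev)
npv_from_rates = npv(sens, spec, prev)
ppv_from_counts = TP / (TP + FP)
npv_from_counts = TN / (TN + FN)
```

The rate-based form is the useful one in practice, because it allows the prevalence to be changed while holding the test's sensitivity and specificity fixed.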

These quantities can be particularly useful in developing policies regarding control of the COVID-19 epidemic. In the case of pooled tests, results can help in using testing resources more efficiently by focusing intensive testing where cases are most likely to reside. In addition, the tests can provide an early warning about the potential for at least one resident of a building unit to be infected. To make best use of the wastewater tests, we estimate the probability that there is at least one infected person in a residence given a positive wastewater test. This estimate will aid in evaluating the cost-benefit of different strategies for testing the residents. In addition, knowledge of the relationship between the timing of positive wastewater tests and positive individual-level tests can inform us about when, or on what schedule, it is best to offer the latter to residents.

Our testing setting is a little more complex than usual, because the wastewater test is a pooled test that aggregates results of buildings associated with the same manholes; hence, the number of people who contribute to the pool varies across tests, which are done at the residence level. Furthermore, the prevalence of interest is at the residence level; as noted above, we define a residence to be a true positive if there is at least 1 infected resident in the residence. Like the wastewater test itself, this definition is at the residence building level.

The prevalence at the residence building level, *p*_{c}, can be estimated from the prevalence *p* at the individual level given the number of residents (*n*), under the assumption that infection events are independent across residents: \(p_{c} = \Pr\left( \ge 1 \text{ infected resident} \right) = 1 - \left( 1 - p \right)^{n}\). Because most detected infection events we observed involve only a single person, we believe that violation of this assumption has little effect on our estimates. As the prevalence of COVID-19 and the number of residents vary with date, the estimates of PPV and NPV will vary with date as well. There are also possible dilution effects that could affect the estimates. For example, the detectability of SARS-CoV-2 genetic material may depend on the total number of residents living upstream of the manholes.

Here we provide approximate building-level estimates of the PPV and NPV and demonstrate how they are affected by the number of residents in buildings associated with manholes. We focus on the week before the Fall 2021 quarter begins, as most student residents are in the process of moving back onto campus during that week and are required to take individual-level tests as soon as they move into their residences. The curves of PPV and NPV as a function of the number of people in residence buildings are shown in Fig. 4. We note that the PPV and NPV are quite sensitive to the number of residents; the usefulness of wastewater tests must be considered in this context. Negative tests are less reassuring as the number climbs toward 1000, whereas the PPV only approaches 50% when the number of residents is near 250.
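The dependence of PPV and NPV on the number of residents can be illustrated by combining the building-level prevalence with the PPV and NPV definitions. The sketch below is not a reproduction of Fig. 4: it uses the test-set sensitivity and specificity reported earlier as a stand-in, and the individual-level prevalence is a hypothetical value, since the inputs underlying the figure are not stated in this section.

```python
def building_prevalence(p_individual, n_residents):
    """p_c = 1 - (1 - p)^n, assuming independent infections across residents."""
    return 1.0 - (1.0 - p_individual) ** n_residents


def ppv_npv(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at the given prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)


SENSITIVITY, SPECIFICITY = 0.771, 0.628  # test-set rates reported in the text
P_INDIVIDUAL = 0.002                     # hypothetical individual-level prevalence

curve = []
for n in (10, 50, 100, 250, 500, 1000):
    p_c = building_prevalence(P_INDIVIDUAL, n)
    curve.append((n, p_c, *ppv_npv(SENSITIVITY, SPECIFICITY, p_c)))

for n, p_c, ppv_value, npv_value in curve:
    print(f"n={n:5d}  p_c={p_c:.3f}  PPV={ppv_value:.3f}  NPV={npv_value:.3f}")
```

Because the building-level prevalence rises with the number of residents, the PPV increases and the NPV decreases as buildings get larger, which is the qualitative pattern described above. The exact crossover points depend on the assumed prevalence.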

### Sensitivity analysis

Furthermore, we conduct a comprehensive sensitivity analysis designed to assess the model's performance. We systematically vary sampling settings and model parameters, compare the proposed approach to other models and methods, and evaluate the results. Specifically, the sensitivity analysis includes: (1) altering the time-window length in defining the outcome of individual COVID-19 infection, (2) applying different weight ratios of positive vs. negative outcomes in fitting the models, (3) varying model complexity, including the number of predictive features selected, (4) fitting a separate classification tree model using only the test set, (5) examining data from May and June 2021, (6) training the proposed model on data including only Fall 2020, when vaccines were not yet publicly available, (7) varying the sampling frequency of wastewater signals, and (8) conducting a comparative analysis of the proposed classification tree model against other statistical and machine learning models. Details of the sensitivity analysis and results are available in the Appendix. Based on results from the above sensitivity analyses, we conclude that the proposed model and method stand as the overall best choice in the context of our study. We recommend that researchers who apply our model in their own studies carry out a similar sensitivity analysis to refine the parameter settings for the specifics of their individual models.