Tuesday, May 30, 2023

Proof of concept of the potential of a machine learning algorithm to extract new information from conventional SARS-CoV-2 rRT-PCR results – Scientific Reports

Brief description of the rRT-PCR technique employed, primers, and software

We performed nucleic acid extraction in a MicrolabStarlet IVD platform using the STARMag 96 × 4 Universal Cartridge Kit (Seegene Inc, Seoul, South Korea). To detect SARS-CoV-2, we applied the Allplex™ SARS-CoV-2 Assay (Seegene Inc, Seoul, South Korea), a multiplex one-step rRT-PCR that simultaneously detects four viral targets (the structural protein envelope (E) gene, the RNA-dependent RNA polymerase (RdRP) gene, the spike (S) gene, and the nucleocapsid (N) gene) together with an exogenous RNA-based internal control (IC). This rRT-PCR step was run on a CFX96™ system (Bio-Rad Laboratories, Hercules, CA, USA), and the analysis was performed using Seegene Viewer-specific SARS-CoV-2 software (Seegene Inc, Seoul, South Korea), resulting in separate cycle threshold (Ct) values for the E and N genes and one combined Ct value for the RdRP and S genes (RdRP/S) in the FAM, Cal Red 610 and Quasar 670 channels, respectively. The HEX channel is used for the internal control. Regarding interpretation of the results, according to the manufacturer’s instructions, Ct values ≤ 40 are considered detected, and Ct values > 40 or not applicable (N/A) are considered not detected.
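The interpretation rule above can be written as a short check. This is only an illustrative sketch: the function name, the use of `None` to stand for an N/A result, and the sample values are assumptions, not part of the Seegene software.

```python
from typing import Optional

CT_CUTOFF = 40.0  # manufacturer's threshold quoted in the text


def target_detected(ct: Optional[float]) -> bool:
    """Return True when a target counts as detected (Ct <= 40)."""
    # None stands in for an "N/A" result; how N/A is actually encoded
    # in an export is an assumption made for this sketch.
    return ct is not None and ct <= CT_CUTOFF


# Made-up Ct values for one sample, for illustration only.
sample = {"E": 24.3, "RdRP/S": 27.1, "N": None}
calls = {gene: target_detected(ct) for gene, ct in sample.items()}
print(calls)
```

A Ct of exactly 40 still counts as detected under this rule, which is why the comparison is `<=` rather than `<`.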


The study protocol (2021/295) was approved by the Galician network of committees of research ethics and conformed to the principles outlined in the Declaration of Helsinki. All methodologies were performed according to the relevant guidelines and regulations, and patient data were anonymized. The dataset used and the waiver of informed consent were approved by the Galician network of committees of research ethics.

Description of the dataset employed in this work

Positive samples from two different sources were used for this study. First, 3274 positive samples were obtained from the 688,763 samples processed by the pooling techniques between August 2020 and July 2021. Second, we also included 17,144 positive samples obtained from the 313,939 samples processed in the microbiology laboratory between February 2020 and March 2021.

The samples processed in the ‘pooling lab’ are screenings to detect SARS-CoV-2 in a non-symptomatic population. The participants were asked to collect saliva (self-sampling) in TRANSPORT MEDIUM-2 (Vircell® Ref: TM013) immediately after waking up, following the manufacturer’s instructions. Although each result pertains to an individual rRT-PCR for each sample, these samples were first flagged as possibly positive by group testing. The original samples from each positive pool were then analyzed individually, and those individual results are the ones used here. Individual samples and pools were analyzed following the same standard rRT-PCR protocol described in “Brief description of the rRT-PCR technique employed, primers, and software” section.

The other samples were nasopharyngeal swabs processed individually in the laboratory of the CHUVI Microbiology Department as part of the assistance routine for SARS-CoV-2 diagnosis. It is important to note that the supply of this source of positive samples ended prematurely in March 2021 because high demand forced a change of the reagent used in this laboratory (from the Allplex™ SARS-CoV-2 Assay to the Allplex™ SARS-CoV-2/Flu A/Flu B/RSV assay, both from Seegene Inc.). In this way, we were able to reserve the Allplex™ SARS-CoV-2 assay for group testing, since in this case an assay change requires a full re-evaluation of the system, and the increase in the Ct values for the N gene previously described by Wollschläger et al.14 may have greater significance in group testing. As explained in “Classification results” section, the data from 12,313 positive samples obtained by the Allplex™ SARS-CoV-2/Flu A/Flu B/RSV assay between February and August 2021 could not be included in the present study.

Characterization of the wave concept

Since the pandemic began in March 2020, the successive increases and decreases in cases have been linked to the concept of a ‘wave’, which is determined using subjective, unofficial criteria. To the best of the authors’ knowledge, this is an abstract nomenclature for which no rigorous definition has yet been established. To characterize the pandemic dynamics in our area, we tracked the curve of active cases at the level of Galicia and, more specifically, Vigo and determined the boundaries between the so-called ‘waves’ in a data-driven way.

The database of active cases in the entire Galician region during the SARS-CoV-2 pandemic was obtained from data provided by the public health service of the Autonomous Spanish Community. To determine the time limits of each wave, the contagion curve was fitted with a smoothing spline (R² = 0.99), and the waves were defined by the local minima and maxima of the curve, as shown in Fig. 1. Therefore, it can be concluded that, although vaguely defined, waves are quite distinguishable, and the number of samples is inherently higher near the peak of each wave and much lower at their boundaries. Additionally, each new wave could also be associated with a higher proportion of samples with lower Ct values at the beginning29.

Figure 1

Spline approximation of the infection curve in Vigo during the pandemic. The concept of wave is based on the local minima and maxima of the curve of SARS-CoV-2 active cases, which mark the borderline dates between the five waves. In addition, the peak of infection in each of the five waves has also been pointed out.
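As a minimal sketch of the boundary-finding step, the snippet below locates strict local minima and maxima in a curve that is assumed to be already smoothed (the spline fit itself is not reproduced here). The function name and the toy values are hypothetical and are not the Galician case counts.

```python
def local_extrema(curve):
    """Return (minima, maxima): indices of strict interior local extrema."""
    minima, maxima = [], []
    for i in range(1, len(curve) - 1):
        if curve[i] < curve[i - 1] and curve[i] < curve[i + 1]:
            minima.append(i)  # candidate boundary between two waves
        elif curve[i] > curve[i - 1] and curve[i] > curve[i + 1]:
            maxima.append(i)  # candidate peak of a wave
    return minima, maxima


# Toy "smoothed active cases" series with three bumps (made-up values).
smoothed = [1, 4, 9, 5, 2, 6, 12, 7, 3, 5, 8, 4]
boundaries, peaks = local_extrema(smoothed)
print("boundaries at indices", boundaries)
print("peaks at indices", peaks)
```

Because only strict inequalities are tested, flat plateaus are ignored; a curve evaluated from a smoothing spline on a fine grid would rarely contain exact ties, but real code might need to handle them.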

Focusing now on the data available for this study, Fig. 2 shows how the existence of the waves is reduced to four clearly differentiated peaks. First, the slight increase in the number of cases experienced in autumn of 2020 is not seen in the data collected. The peak of cases in the second wave is concentrated in the last months of the year.

Figure 2

Number of positive SARS-CoV-2 tests in Vigo, averaged by week, detected by pooling in CHUVI with the old PCR reagents (blue) and with the new PCR reagents (orange). (Left) Infection curve as seen from CHUVI. (Right) Zoom on the effect of the reagent change in March 2021. Considering that the algorithm presented in this article has been trained with samples obtained with the old reagent (blue), data contamination with the results of the new reagent (orange) has been avoided in order not to include a temporal indicator that could misleadingly help the algorithm in decision making.

As will be explained in this work, the key aspect is the capacity of the machine learning algorithm to correctly predict the wave to which each sample belongs based on the numerical results of rRT-PCR for each gene. A change in the target genes of the PCR performed during the fourth wave would therefore be too strong an indicator of the temporal position of those tests, so they had to be removed from our database to avoid giving an unfair advantage to the algorithm. Unfortunately, the dire circumstances under which laboratories had to work during the pandemic led to this type of disturbance. In our case, it significantly affected only the fourth wave.

Descriptive analysis of the database used in the work

After excluding the samples that could lead to unfair results, the resulting database used in this study corresponds to a set of 20,418 PCR samples collected by the Microbiology Department of the CHUVI from March 2020 to July 2021. For each sample that tested positive for SARS-CoV-2, the database included an anonymized identification number, the date when the sample was taken, the Ct value for each target gene (E, N and RdRP/S) and the Ct value for the internal control (IC). The RdRP and S genes share the same channel; therefore, we obtained a single Ct value for both genes.

Figure 3 shows the distribution of the number of cycles from the analyzed gene profiles, where the average Ct value is approximately 26 for genes E and N and close to 28 for the combination of genes RdRP/S.

Figure 3

Gene E (blue), N (green) and RdRP/S (yellow) distributions from the pooling dataset. Histograms of the Ct value around which the samples are concentrated for each of the target genes. A slight deviation of the mean number of cycles of the RdRP/S gene can be visually appreciated with respect to the E and N genes.

Some visual features arise from a simple analysis of the data collected. As seen in Fig. 4, the RdRP/S gene distribution seems to be slightly offset toward higher Ct values and, in fact, ends more abruptly. However, a strong linear relationship between the number of cycles of the three genes can be observed from the database (R² = 0.96 for E vs. N, R² = 0.95 for RdRP/S vs. N, and R² = 0.97 for E vs. RdRP/S). This is anticipated, since the presence of the genes is expected to be similar and each number of cycles is presumably related to the viral load of the individual; thus, the numbers of cycles detected for the different genes in any sample are usually quite close.

Figure 4

Relationship between the number of cycles of genes E, N and RdRP/S. A linear trend is shown between them; the points are more tightly clustered for gene E versus RdRP/S and more dispersed for the other pairs.
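For readers who want to reproduce this kind of figure, the coefficient of determination of a simple linear fit between two Ct series equals the squared Pearson correlation. The sketch below computes it with the standard library only; the function name and the Ct values are made up for illustration and are not taken from the study database.

```python
from math import fsum


def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (squared Pearson r)."""
    n = len(x)
    mean_x, mean_y = fsum(x) / n, fsum(y) / n
    sxy = fsum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = fsum((a - mean_x) ** 2 for a in x)
    syy = fsum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical paired Ct values for two targets in the same samples.
ct_e = [18.2, 22.5, 25.1, 29.8, 33.4, 36.0]
ct_n = [18.9, 22.1, 26.0, 30.5, 33.0, 37.2]
print(round(r_squared(ct_e, ct_n), 3))
```

Values close to 1 indicate that one gene's Ct is almost a linear function of the other's, which is the pattern the text reports for all three gene pairs.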

Figure 5 shows the temporal evolution of the number of cycles of each gene during the pandemic. At least at first glance, the figure shows no trend or time evolution that points to Ct differentiation over time.

Figure 5

Evolution of the number of cycles of each gene against the number of positive cases during the pandemic. Each graph corresponds to the individual results of each target gene. The vertical solid lines identify the boundaries between different waves. Inherently, there is a higher concentration of points around the peak of each wave.

Classification techniques employed

To classify the samples, a supervised learning technique would allow predicting the membership of a sample to a wave based simply on the number of cycles obtained as a result of the PCR. Supervised learning algorithms rely on labeled input data, i.e., data that come with the correct answer for their classification. Thus, as the algorithm is trained, it compares its predicted output with the correct label until the error in its decision is minimized. In this work, we have labeled each sample with the wave that was dominant at the time the sample was taken from the individual. Since the definition of each wave, described in “Characterization of the wave concept” section, is unique, there is no possible ambiguity on the wave that we assigned to each sample. However, when a new wave becomes dominant, it means that, for the reasons discussed later, new mechanisms in the pandemic progression have become dominant over the receding conditions of the previous wave. Hence, there is an intrinsic overlap that, by the methodology employed, cannot be resolved. Instead, the wave assigned to each sample is simply the dominant wave at that specific date.
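The labeling rule described above (each sample receives the wave in force on its collection date) can be sketched as a lookup against an ordered list of boundary dates. The dates below are hypothetical placeholders, since the actual cutoffs come from the local minima of the fitted curve and are not listed in the text; the function name is likewise an assumption.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical first days of waves 2 to 5 (placeholders, not the study's cutoffs).
WAVE_STARTS = [
    date(2020, 7, 1),
    date(2020, 12, 10),
    date(2021, 3, 1),
    date(2021, 6, 20),
]


def wave_label(sample_date: date) -> int:
    """Return the 1-based number of the wave covering sample_date."""
    # bisect_right counts how many wave starts fall on or before the date.
    return bisect_right(WAVE_STARTS, sample_date) + 1


print(wave_label(date(2020, 3, 15)))
print(wave_label(date(2021, 7, 1)))
```

Every date maps to exactly one wave under this scheme, which matches the statement that the assignment is unambiguous even though the underlying epidemiological conditions overlap near the boundaries.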

Since a supervised algorithm was chosen to classify the waves, we compared two classic approaches within the machine learning field: a support vector machine (SVM) and a neural network (NN), using MATLAB R2021a as the main tool for developing the models and post-processing the results. The fundamentals of the two classification techniques are completely different, although their final performance, as will be shown, is similar. In the NN approach, the model learns according to the training strategy and adjusts the weights of each of the neurons towards the optimum, whereas in the SVM approach a maximum-margin hyperplane is built by means of kernel functions that map the data into a higher-dimensional space and thus facilitate the classification task. A detailed mathematical description of both models utilized in this work, SVM and NN, can be found in the Supplementary material.

Structure of the model

Figure 6 details the steps that constitute the entire ML pipeline. First, the number of cycles of each of the genes for a single sample and an additional number of cycles corresponding to the internal control are taken as input parameters. Then, the training process starts after the machine learning algorithm is chosen.

Figure 6

Predictive scenario of an ML model (From left to right description of the methodology employed). Using as input data the number of cycles of each gene, and the number of cycles of the internal control (Ct IC), the Machine Learning models were trained to obtain a score with reference to the probability of each sample to be a member of every wave. In the example depicted, the score of the first wave is the highest and, therefore, this wave is chosen as the result of the prediction; this prediction is then compared with the wave covering the date when the sample was taken, and such comparison produces a true (coincident) or false (difference) result of the classification.

The output of the algorithm is a set of confidence scores, one per wave, each representing the probability that the sample belongs to that wave. Since the main objective is to predict the membership of a sample to a wave, the wave with the highest confidence score is the one chosen as the predicted wave.

Subsequently, the prediction is compared with the real wave value assigned to the sample. If the prediction coincides with the real value, it counts as a correct classification; otherwise, it counts as an incorrect one. The actual wave value of the sample is determined from the date when the sample was taken and the cutoffs estimated from the spline approximation of the curve of active cases.
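The decision and comparison steps can be illustrated with a few lines of code. This is a sketch under stated assumptions: the per-wave scores are invented, the function names are hypothetical, and no SVM or NN is trained here; only the "highest score wins, then compare with the date-based label" logic is shown.

```python
def predict_wave(scores):
    """Return the wave whose confidence score is highest."""
    return max(scores, key=scores.get)


def accuracy(score_rows, true_waves):
    """Fraction of samples whose predicted wave matches the assigned wave."""
    hits = sum(predict_wave(s) == t for s, t in zip(score_rows, true_waves))
    return hits / len(true_waves)


# Made-up confidence scores for three samples over four waves.
score_rows = [
    {1: 0.70, 2: 0.15, 3: 0.10, 4: 0.05},
    {1: 0.20, 2: 0.25, 3: 0.45, 4: 0.10},
    {1: 0.10, 2: 0.30, 3: 0.20, 4: 0.40},
]
true_waves = [1, 3, 2]  # labels derived from sampling dates (hypothetical)

print([predict_wave(s) for s in score_rows])
print(accuracy(score_rows, true_waves))
```

In this toy example the third sample is assigned to wave 4 while its date-based label is wave 2, so it counts as a false result in the comparison.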
