Estimating the number of probable new SARS-CoV-2 infections among tested subjects from the number of confirmed cases – BMC Medical Research Methodology

The study population

The data were extracted from the SARS-CoV-2 infection surveillance system in South Kivu (DRC) during the 2020 pandemic. South Kivu has approximately 4,800,000 inhabitants living in an area of 64,492 km². The study considered the first thousand alert cases recorded between March 29 and November 29, 2020. An alert case was defined as a person with signs suggestive of COVID-19 (fever, headache, breathing difficulties, asthenia… with or without loss of taste and smell) or a person who had been in contact with a person who tested positive for SARS-CoV-2 infection.

From each alert case, two samples were collected: a nasopharyngeal sample for the RT-PCR test and a blood sample for IgM and IgG serology. In general, the recording of an alert case and the sample collection took place on the same day.

The serological test used was the ‘SARS-CoV-2 IgG/IgM Rapid Test Kit’ (Abbexa Ltd, Cambridge, UK). This test detects IgM and IgG antibodies against the virus separately but on the same ‘cassette’. The tests (RT-PCR and serological test) were carried out in two centers: Kinshasa (March 29 to June 16, 2020) and Bukavu (June 17 to November 29, 2020).

A confirmed case was defined as an alert case with a positive RT-PCR test.

Statistical analyses

Data presentation

Data presentation used 2 by 2 contingency tables, first for cross-tabulation of RT-PCR versus IgM test results (numbers and percentages), then for cross-tabulation of RT-PCR versus combined IgM and IgG test results (positive when IgM+ or IgG+, negative when IgM– and IgG–). The information given by each cell of these tables depends on the proportion of infection cases and on the sensitivity (Se) and the specificity (Sp) of each test. For example, the number of alert cases positive on test A and test B is the sum of two numbers: (i) the number of true positive results on both tests (i.e., the number of alert cases multiplied by the proportion of infection cases and the Se of each test); and (ii) the number of false positive results on both tests (i.e., the number of alert cases multiplied by the complement to 100% of the percentage of infected cases and the complement to 100% of the Sp of each test). This information is needed to estimate the incidence proportion (first table) and the prevalence (second table) of SARS-CoV-2 infection using a latent class model and a Bayesian inference method.
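The decomposition of each cell into true and false results can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the function name `expected_cell_counts` and all numeric inputs are hypothetical, and the products assume that the two tests are independent conditionally on the infection status.

```python
def expected_cell_counts(n, pi, se_a, sp_a, se_b, sp_b):
    """Expected counts of the 2 by 2 table for tests A and B.

    n is the number of alert cases, pi the proportion of infected cases.
    Each cell is the sum of a contribution from infected subjects and a
    contribution from non-infected subjects (conditional independence assumed).
    """
    return {
        "A+B+": n * (pi * se_a * se_b + (1 - pi) * (1 - sp_a) * (1 - sp_b)),
        "A+B-": n * (pi * se_a * (1 - se_b) + (1 - pi) * (1 - sp_a) * sp_b),
        "A-B+": n * (pi * (1 - se_a) * se_b + (1 - pi) * sp_a * (1 - sp_b)),
        "A-B-": n * (pi * (1 - se_a) * (1 - se_b) + (1 - pi) * sp_a * sp_b),
    }


# Hypothetical inputs chosen for illustration; they are not study estimates.
cells = expected_cell_counts(n=1000, pi=0.30, se_a=0.70, sp_a=1.00,
                             se_b=0.80, sp_b=0.90)
for label, count in cells.items():
    print(label, round(count, 1))
```

The four expected counts always add up to the number of alert cases, which is why a single 2 by 2 table carries only three independent pieces of information.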

The latent class model

In the latent class model, the infection status is considered unknown and the results of the diagnostic tests are used to estimate the proportion of infected cases and the performance (Se and Sp) of the tests. The model was built with the assumption that the RT-PCR and the antibody test results are independent conditionally on the infection status. This assumption is plausible because the two types of diagnostic tests (RT-PCR and IgM or IgG serology) rely on different biological mechanisms.

Two separate latent class models were used; one to estimate the incidence proportion (using RT-PCR and IgM serology) and the other to estimate the prevalence (using RT-PCR and IgM/IgG serology).

The Bayesian inference method

With two tests, the information provided by the observed data is not sufficient to estimate the proportion of infection cases and the performance of the tests in terms of Se and Sp. A Bayesian inference method was used to add prior knowledge on the Se of the RT-PCR and the Se and Sp of the serological tests to the observed data [14]. The Sp of RT-PCR was set to 100%. This implies the use of two latent classes (instead of four without this assumption).

Prior knowledge was extracted from the literature and summarized using prior distributions. Prior information on the performance of the tests was obtained from a search on PubMed with various combinations of the keywords “COVID-19”, “diagnosis”, “performance”, “accuracy”, “test”, and “serological”. The retained articles were those that reported the performance of at least one of the tests (RT-PCR, IgG, and IgM). The excluded articles were those where the ‘gold standard’ was an imperfect diagnostic test, those that reported on pre-pandemic sera (to determine serologic test specificities), and those that used clinical or biological criteria to select the population. From the selected articles [7,8,9,13], we extracted the smallest lower bound and the largest upper bound of the 95% confidence intervals (CoIs) of each test Se and Sp to derive prior intervals. When no confidence intervals were available, point estimates were used to derive the prior intervals (Table 1). Beta distributions were used as prior distributions, with means equal to the centers of the corresponding prior intervals and standard deviations equal to one quarter of their ranges (Table 1). For the proportion of infection cases, a beta distribution with both parameters equal to one was used; this corresponds to a uniform distribution between 0 and 1.
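One way to turn a prior interval into beta parameters is the method of moments: choose the two parameters so that the beta distribution has the required mean and standard deviation. The sketch below assumes this is how the conversion was done, which the text does not state explicitly; the helper name `beta_from_interval` and the interval (0.60, 0.90) are hypothetical and do not come from Table 1.

```python
def beta_from_interval(lower, upper):
    """Beta(alpha, beta) with mean at the interval center and
    standard deviation equal to one quarter of the interval range."""
    mean = (lower + upper) / 2
    sd = (upper - lower) / 4
    var = sd ** 2
    # A beta distribution with this mean exists only if var < mean * (1 - mean).
    if not 0 < mean < 1 or var >= mean * (1 - mean):
        raise ValueError("no beta distribution matches this interval")
    k = mean * (1 - mean) / var - 1  # k equals alpha + beta
    return mean * k, (1 - mean) * k


# Hypothetical prior interval for a sensitivity (not taken from Table 1).
alpha, beta = beta_from_interval(0.60, 0.90)
print(alpha, beta)
```

With a standard deviation of one quarter of the range, the prior interval spans the mean plus or minus two standard deviations, so it holds most of the prior mass without excluding values outside it.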

Table 1 Prior knowledge on the sensitivities and specificities of the tests used for the diagnosis of SARS-CoV-2 infection

Gibbs sampling was used to obtain a sample from the posterior distribution of each parameter, from which a point estimate (median of the posterior distribution) and a 95% credibility interval (CrI, between the 2.5% and 97.5% quantiles of the posterior distribution) were derived [14].
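A minimal sketch of such a sampler, written in Python with the standard library only, is given below. It follows a common data-augmentation scheme for two-test latent class models: impute the unknown infection status, then update each parameter from its conjugate beta distribution. Whether the authors' R implementation has exactly this form is an assumption. The function name, the counts, the priors and the starting values are all hypothetical, and the RT-PCR specificity is fixed at 1 as stated in the text.

```python
import random
import statistics


def gibbs_two_tests(counts, priors, n_iter, start, seed=0):
    """Gibbs sampler for RT-PCR plus one serological test.

    counts: (n_pp, n_pn, n_np, n_nn), RT-PCR result first, serology second.
    priors: (alpha, beta) pairs for 'pi', 'se_pcr', 'se_sero', 'sp_sero'.
    start:  starting values (pi, se_pcr, se_sero, sp_sero).
    RT-PCR specificity is fixed at 1, so every RT-PCR positive is infected.
    """
    rng = random.Random(seed)
    n_pp, n_pn, n_np, n_nn = counts
    n = sum(counts)
    pi, se_pcr, se_sero, sp_sero = start
    draws = {"pi": [], "se_pcr": [], "se_sero": [], "sp_sero": []}
    for _ in range(n_iter):
        # Step 1: impute the number of infected subjects among RT-PCR negatives.
        inf_np = pi * (1 - se_pcr) * se_sero
        p_np = inf_np / (inf_np + (1 - pi) * (1 - sp_sero))
        inf_nn = pi * (1 - se_pcr) * (1 - se_sero)
        p_nn = inf_nn / (inf_nn + (1 - pi) * sp_sero)
        y_np = sum(rng.random() < p_np for _ in range(n_np))
        y_nn = sum(rng.random() < p_nn for _ in range(n_nn))
        infected = n_pp + n_pn + y_np + y_nn
        # Step 2: conjugate beta updates given the imputed infection status.
        a, b = priors["pi"]
        pi = rng.betavariate(a + infected, b + n - infected)
        a, b = priors["se_pcr"]
        se_pcr = rng.betavariate(a + n_pp + n_pn, b + y_np + y_nn)
        a, b = priors["se_sero"]
        se_sero = rng.betavariate(a + n_pp + y_np, b + n_pn + y_nn)
        a, b = priors["sp_sero"]
        sp_sero = rng.betavariate(a + n_nn - y_nn, b + n_np - y_np)
        for name, value in zip(draws, (pi, se_pcr, se_sero, sp_sero)):
            draws[name].append(value)
    return draws


# Hypothetical table and priors, for illustration only (not study data).
counts = (60, 40, 90, 810)
priors = {"pi": (1, 1), "se_pcr": (24.25, 8.08),
          "se_sero": (20, 5), "sp_sero": (45, 5)}
draws = gibbs_two_tests(counts, priors, n_iter=500,
                        start=(0.2, 0.75, 0.8, 0.9), seed=1)
print(statistics.median(draws["pi"][100:]))  # posterior median after a short burn-in
```

The run length of 500 iterations is kept short so that the example finishes quickly; it is far below the 60,000 iterations per chain reported in the study.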

Three sets of 60,000 values were sampled from the conditional posterior distribution of each of the three parameters of the model, using three different sets of starting values for the parameters. These starting sets were chosen using the centers and the upper and lower bounds of the intervals of the literature data on test performance (Table 1). For the proportion of infected people in the data, an interval was formed by the proportions of positive RT-PCR and positive IgG/IgM serology results in the cross table of these two tests; the bounds and the center of this interval were used as starting values. The convergence of the three Markov chains was assessed with the Gelman index. The first 10,000 iterations of each chain, which allowed convergence to be reached, were discarded. The remaining 50,000 iterations of each of the three chains were pooled to give point estimates (medians of the posterior distributions) and 95% credibility intervals (2.5% and 97.5% quantiles of the posterior distributions) of the parameters.
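The convergence check and the pooled summaries can be sketched as follows. The sketch assumes that the "Gelman index" refers to the Gelman-Rubin potential scale reduction factor, which the text does not spell out. The function names are hypothetical, and the simulated chains stand in for real sampler output.

```python
import random
import statistics


def gelman_rubin(chains):
    """Potential scale reduction factor for chains of equal length.

    Values close to 1 suggest that the chains have reached the same distribution.
    """
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean(statistics.variance(c) for c in chains)
    between = n * statistics.variance(chain_means)
    pooled_var = (n - 1) / n * within + between / n
    return (pooled_var / within) ** 0.5


def summarize(chains, burn_in):
    """Median and 95% credibility interval after discarding the burn-in
    of each chain and pooling the remaining iterations."""
    pooled = [x for c in chains for x in c[burn_in:]]
    cuts = statistics.quantiles(pooled, n=40)  # cut points every 2.5%
    return statistics.median(pooled), cuts[0], cuts[-1]


# Simulated chains for illustration; real chains would come from the sampler.
rng = random.Random(0)
chains = [[rng.gauss(0.25, 0.02) for _ in range(2000)] for _ in range(3)]
print(gelman_rubin(chains))
print(summarize(chains, burn_in=500))
```

In the study the same two steps would apply to each parameter, with a burn-in of 10,000 iterations per chain and 50,000 retained iterations per chain.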

Estimating the multiplying factor

A sample of the posterior distribution of the multiplying factor was obtained by dividing each value of the posterior distribution sample of the incidence proportion of SARS-CoV-2 infection by the observed proportion of alert cases with a positive RT-PCR test result. A point estimate of the factor and a 95% CrI were extracted from that sample (for more details, see Additional files 1, 2 and 3).
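The division described above is a one-line transformation of the posterior sample. A minimal Python sketch follows; the function name and the input values are hypothetical and do not reproduce the study results.

```python
import statistics


def multiplying_factor(incidence_draws, n_pcr_positive, n_alert):
    """Posterior median and 95% CrI of the multiplying factor.

    Each posterior draw of the incidence proportion is divided by the
    observed proportion of alert cases with a positive RT-PCR result.
    """
    observed = n_pcr_positive / n_alert
    factors = [draw / observed for draw in incidence_draws]
    cuts = statistics.quantiles(factors, n=40)  # cut points every 2.5%
    return statistics.median(factors), cuts[0], cuts[-1]


# Hypothetical posterior draws and counts, for illustration only.
draws = [0.10 + 0.001 * i for i in range(101)]
print(multiplying_factor(draws, n_pcr_positive=100, n_alert=1000))
```

Because the observed proportion is treated as a fixed number, the credibility interval of the factor reflects only the posterior uncertainty on the incidence proportion.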

In this work, qualitative variables were summarized by the numbers and percentages in each category, and quantitative variables by the mean, the standard deviation, the median, and the first and third quartiles.

All statistical analyses were performed with R software version 3.6.3 (2020-02-29, R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/).
