Based on the report from the 2013–2017 United States Cancer Statistics (USCS) database, we identified the ten most common incident malignant cancer types for females and for males, after excluding non-melanoma skin cancer31. First, we surveyed the NHGRI-EBI Catalog of Published Genome-Wide Association Studies (GWAS Catalog)39 and the Polygenic Risk Score (PGS) Catalog40 to select the largest European ancestry-based GWAS as of May 2020 for each cancer type. We additionally searched PubMed41 for large cancer-specific GWASs that were not included in the GWAS Catalog or PGS Catalog. For breast and colorectal cancer, we searched for prior large-scale polygenic risk score (PRS) studies based on European samples as of July 2020 and selected the studies reporting the best-performing PRS (Supplementary Data). We did not consider pleiotropic GWAS. We filtered to cancer types with at least ten independent genome-wide significant SNPs after LD clumping at the genome-wide significance (GWS) p-value threshold of 5E-8. Ultimately, eleven cancer types (bladder, breast, colorectum, endometrium, kidney, lung, melanoma, Non-Hodgkin’s lymphoma, ovary, pancreas, and prostate) were included in our analysis. For the full list of source literature and GWAS summary statistics included in our analysis, see Supplementary Data.
UK Biobank (UKBB) is a prospective epidemiological cohort study with over 500,000 participants42,43,44. Individuals aged 40–69 at baseline were recruited across the United Kingdom (UK) from 2006–201042,43,44. A wide range of genotypic and phenotypic information, including personal medical history, family history, and lifestyle data, was collected at enrollment42,43,44. UKBB data are regularly updated through follow-up questionnaires and through linkage to national cancer and mortality registries and hospital inpatient electronic medical record systems42,43,44. Through linkage to the national cancer registry data, cancer diagnosis date and type (coded according to the International Classification of Diseases, 10th Revision (ICD-10)) were available for participants diagnosed with cancer42,43,44. For our analysis, we used ICD-10 codes for cancer classification (see Supplementary Table 4).
We then filtered to unrelated UKBB participants of White British ancestry with imputed genotype data. We excluded individuals who were lost to follow-up, individuals whose genetic sex and self-reported sex did not match, individuals with any cancer diagnosis prior to baseline assessment (prevalent cancers), and individuals with missing data in any of the classical risk factors (BMI, smoking status, pack-years of smoking, and family history of cancer in non-adoptive first-degree relatives). UKBB does not record family history for all cancers; it reports family history only for the top three incident cancer types for females (breast, bowel, and lung) and males (breast, bowel, and prostate). These quality control procedures resulted in a study population of 133,830 females and 115,207 males.
After determining the source literature (Supplementary Data) for each cancer type, we reviewed the manuscript and any relevant additional resources. We extracted all autosomal SNPs from each cancer GWAS along with their summary statistics: RSID, observed effect size estimate (OR or beta), effective (or risk) allele, risk allele frequency (RAF), and p-value. We excluded variants with minor allele frequency (MAF) < 0.01 and ambiguous SNPs (A/T or G/C alleles) with MAF > 0.40. We filtered to variants with a MAF difference of less than 0.10 relative to the UK Biobank data. We removed variants with allele mismatches that could not be resolved by strand or dosage flips, as well as SNPs whose RSID, chromosome number, and position all failed to match the European 1000 Genomes reference panel45 or the UK Biobank data. We filtered to variants with an information score ≥0.90 based on the UK Biobank imputed genotype data. Finally, we used the fixed threshold approach to calculate PRS for each cancer. Using PLINK46, we performed LD clumping at a p-value threshold of 5E-8, \(r^2\) of 0.1, and a 1000 kb window, with the European 1000 Genomes panel45 as the reference, to remove SNPs in linkage disequilibrium within each cancer type.
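The frequency and imputation-quality filters described above can be sketched as a single predicate. This is an illustrative sketch only, not the pipeline used in the study: the `Variant` record and its field names are hypothetical, and it assumes that the MAF thresholds are applied to the GWAS-reported frequency and that the MAF difference is taken between the GWAS and the target cohort. Allele harmonization and LD clumping are not covered.

```python
from dataclasses import dataclass

# Strand-ambiguous allele pairs: A/T and G/C
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


@dataclass
class Variant:
    """Hypothetical per-SNP record; field names are illustrative."""
    rsid: str
    effect_allele: str
    other_allele: str
    gwas_eaf: float  # effect (risk) allele frequency reported by the GWAS
    ukb_eaf: float   # frequency of the same allele in the target cohort
    info: float      # imputation information score in the target cohort


def maf(freq: float) -> float:
    """Minor allele frequency from the frequency of either allele."""
    return min(freq, 1.0 - freq)


def passes_qc(v: Variant) -> bool:
    """Apply the MAF, ambiguity, MAF-difference, and info-score filters."""
    m = maf(v.gwas_eaf)
    if m < 0.01:
        return False
    ambiguous = frozenset((v.effect_allele, v.other_allele)) in AMBIGUOUS_PAIRS
    if ambiguous and m > 0.40:
        return False
    if abs(m - maf(v.ukb_eaf)) >= 0.10:
        return False
    return v.info >= 0.90
```

Keeping the filters in one predicate makes the order of the checks explicit; an ambiguous SNP with MAF at or below 0.40 is retained, as in the text.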
Then, PRS for UK Biobank participants was computed using PRSice247.
The formula used for PRS calculation in PRSice2:
\(PRS_j = \sum_i \beta_i SNP_{ij}\) where \(PRS_j\) is the PRS for the jth individual, \(\beta_i\) is the observed effect size estimate for the ith SNP, and \(SNP_{ij}\) is the dosage information for the effective allele of the ith SNP for the jth individual. We standardized each PRS to have unit variance and zero mean.
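As a minimal illustration of the weighted sum and the standardization step, the sketch below computes raw scores and rescales them to zero mean and unit variance. The effect sizes and dosages are invented for the example, and the use of the population standard deviation is an assumption; the text does not state which estimator was used.

```python
from statistics import mean, pstdev


def raw_prs(betas, dosages):
    """Weighted sum of effect-allele dosages for one individual."""
    return sum(b * d for b, d in zip(betas, dosages))


def standardize(scores):
    """Rescale scores to zero mean and unit (population) variance."""
    mu = mean(scores)
    sd = pstdev(scores)
    return [(s - mu) / sd for s in scores]


# Hypothetical inputs: three SNPs, four individuals
betas = [0.12, -0.05, 0.30]        # made-up log odds ratios
cohort = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 2]]  # made-up dosages
scores = standardize([raw_prs(betas, d) for d in cohort])
```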
We developed a sex-specific pan-cancer risk prediction model to estimate the risk of developing at least one cancer over the course of follow-up. The multicancer model included eleven cancer types (bladder, breast [Female only], colorectum, endometrium [Female only], kidney, lung, melanoma, Non-Hodgkin’s lymphoma, ovary [Female only], pancreas, and prostate [Male only]). Data were split into a training set (2/3) and a test set (1/3); the test set served as an independent validation dataset for model performance evaluation and subsequent analysis.
A Cox proportional hazards regression (Cox) model32 was fitted to the training set, with the outcome defined as the incidence of any first cancer included in the analysis. The models specified a baseline hazard as a function of age and assumed multiplicative effects of the risk factors32:
$$\lambda \left( {t|{{{\boldsymbol{z}}}}} \right) = \lambda _0(t)\exp \left( {\beta _1z_1 + \beta _2z_2 + \ldots + \beta _nz_n} \right)$$
t: time-to-event; time to any first cancer incidence, censoring age, or death age
\(\lambda _0(t)\): baseline hazard function
\(\boldsymbol{z} = (z_1, z_2, \ldots, z_n)\): set of covariates (risk factors) included in the Cox model
\(\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_n)\): set of coefficients (log hazard ratios) for the predictors
Polygenic risk scores for each cancer (Supplementary Figs. 1 and 2), family history of cancer (breast, colorectum, lung, and prostate) in any first-degree relatives (nonadopted), body mass index, and pack-years of smoking were included as predictors in the model. We also adjusted for the first ten principal components. Because UKBB is a left-truncated and right-censored cohort, we used age as the timescale for the Cox model: participants enter the model at recruitment age and exit at cancer incidence age, censoring age, or death age, whichever occurs first. We used the censoring date for the cancer registry data provided by UKBB48. In the underlying analysis of the UK Biobank data using the Cox proportional hazards model, the “event” is defined as the occurrence of any of these cancers, and the “time-to-event” is the time to first onset of any of these cancers. Thus, if an individual has multiple cancers, e.g., lung cancer first and then prostate cancer, follow-up ends at the onset of the lung cancer, which is counted as the event. If an individual first develops a cancer of a type not included in our list, they are censored at the first onset of that cancer. Deaths from non-cancer causes were also treated as censoring events. Thus, the underlying hazard ratio parameters of the model describe effects on the instantaneous risk of developing at least one of the selected cancers, given that a person was free of all cancers up to that time point.
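The follow-up rules and the effect of left truncation on the risk set can be made concrete with a small sketch. It is not the fitting code used in the study: the function names and the tuple layout are hypothetical, and ties are handled in the simplest (Breslow-type) way. With age as the timescale, a participant contributes to the risk set at a given age only if they had already been recruited and were still under observation at that age.

```python
import math


def follow_up(first_listed, first_other, death, admin_end):
    """Return (exit_age, event) from ages at candidate end points.

    Arguments are ages, or None if the end point never occurred. Only the
    first onset of a listed cancer counts as the event; an unlisted cancer,
    death, or administrative censoring ends follow-up without an event.
    """
    candidates = [
        (age, kind)
        for age, kind in [
            (first_listed, "event"),
            (first_other, "censor"),
            (death, "censor"),
            (admin_end, "censor"),
        ]
        if age is not None
    ]
    exit_age, kind = min(candidates, key=lambda c: c[0])
    return exit_age, kind == "event"


def neg_log_partial_likelihood(beta, subjects):
    """Negative log partial likelihood with delayed entry.

    subjects: list of (entry_age, exit_age, event, covariates).
    """
    def linear_predictor(z):
        return sum(b * x for b, x in zip(beta, z))

    total = 0.0
    for _, exit_i, event_i, z_i in subjects:
        if not event_i:
            continue
        # Risk set at age exit_i: recruited before that age, not yet exited
        risk = [
            linear_predictor(z)
            for entry, stop, _, z in subjects
            if entry < exit_i <= stop
        ]
        total -= linear_predictor(z_i) - math.log(sum(math.exp(r) for r in risk))
    return total
```

Minimizing this function over `beta` with any general-purpose optimizer would give the log hazard ratio estimates for a toy dataset; in practice a dedicated survival analysis package would be used.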
Additionally, recognizing the concerns with the imputation of clinical/epidemiologic data, we conducted a complete-case analysis for the paper. About 19% of subjects were removed because they had missing data in at least one of the risk factors. Pack-years of smoking had the highest proportion of missing data (~16%); all other individual variables had a small missing rate (<5%). For demonstrating the risk-stratification ability of models, a complete-case analysis is more desirable, because imputation and model averaging diminish risk stratification relative to the full potential of the model. In other words, our goal is to demonstrate the risk-stratification ability of the models for a population in which the underlying risk factors could be fully observed.
We computed pan-cancer risk scores (PCRS) or cancer-specific risk scores for all UKBB participants as the weighted sum of the predictors, with the weight for each predictor given by its estimated log hazard ratio (log HR) from the fitted Cox model. Then, in the test set, we assessed the discriminatory accuracy of the PCRS or the cancer-specific risk score (for individual cancer models) using Harrell’s concordance index (C-statistic) and the area under the curve (AUC) at five years of follow-up.
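The risk score and the concordance index can be illustrated as follows. This is a simplified sketch under stated assumptions: it uses the basic pairwise definition of Harrell's C, skips pairs with tied event times, and ignores left truncation, so it will not reproduce the output of a survival analysis package exactly. The function names are hypothetical.

```python
def risk_score(log_hrs, covariates):
    """Weighted sum of predictors, weighted by estimated log hazard ratios."""
    return sum(b * z for b, z in zip(log_hrs, covariates))


def harrell_c(times, events, scores):
    """Basic Harrell's C for right-censored data.

    A pair (i, j) is comparable when i had the event and j was still
    event-free at a strictly later time. The pair is concordant when the
    subject with the earlier event has the higher score.
    """
    concordant = tied = comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[j] > times[i]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1
                elif scores[i] == scores[j]:
                    tied += 1
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return (concordant + 0.5 * tied) / comparable
```

A value of 0.5 corresponds to a score with no discriminatory ability and 1.0 to perfect ranking of event times.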
We used iCARE (Individualized Coherent Absolute Risk Estimation)49 to estimate absolute risk. Detailed methodology for absolute risk model building is described in Choudhury et al. 202049. Briefly, risk estimates for each individual in the test set were obtained by supplying the model with age-specific cancer incidence rates in 1-year strata, log HR parameters from the Cox model, and the reference dataset. We used 2016 cancer incidence rates for white individuals from the SEER*Stat database50. Site-specific cancer incidence rates were obtained and then summed to give the overall incidence rate for any cancer included in our study. Cancer incidence rates for a given age and sex were taken from the next one-year age stratum. For instance, in our study, cancer incidence rates for females aged 50–51 correspond to SEER*Stat’s cancer incidence rates for females aged 51–52. This accounts for the fact that the DETECT-A test was performed at study enrollment and the female participants were then followed up over the course of 12 months. DETECT-A and Galleri will both be used to detect cancers early, prior to conventional diagnosis. The reference dataset was obtained by simulating 10,000 samples representative of the underlying UKBB population from a normal distribution with the mean and standard deviation of the PCRS or cancer-specific risk score.
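The construction of the overall incidence rate, including the one-year age shift, might look like the sketch below. The rates are invented placeholders, not SEER values, and the nested-dictionary layout and function name are assumptions made for the example.

```python
def overall_rate(site_rates, age):
    """Overall incidence rate assigned to a given age.

    Sums the site-specific rates, reading each site's rate from the next
    one-year age stratum (age + 1), as described in the text.
    """
    return sum(rates[age + 1] for rates in site_rates.values())


# Made-up rates per 100,000 person-years, keyed by site and then by age
site_rates = {
    "breast": {50: 220.0, 51: 230.0},
    "lung": {50: 40.0, 51: 45.0},
}

rate_for_age_50 = overall_rate(site_rates, 50)  # uses the age-51 strata
```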
The DETECT-A study reported an overall sensitivity of 27.1% at 98.9% specificity and an empirical PPV of 19.4% (95% CI: 13.1–27.1%)24. We wanted to select a time window for absolute risk estimation so that the PPV for females aged 65–75 is equal to the point estimate of 19.4% reported in the DETECT-A study24. We varied the time window by one month around one year and calculated the weighted average PPV for females aged 65–75 based on the UKBB PCRS distribution and the age distribution reported by the US Census Bureau51. We found that a time window of 11 months gave the closest match between the overall PPV for the 65–75 group and the empirically determined PPV of 19.4%. Thus, we subsequently calculated PPV and NPV for different age and PCRS risk groups based on the underlying 11-month absolute risk.
Galleri reported an overall sensitivity of 51.5% at 99.5% specificity. For Galleri, we used a 1-year time window21,26. For DETECT-A, we omitted the calculation of projected PPVs and NPVs for males because the test does not include prostate cancer (the highest incident cancer for males) among its detectable cancer types50.
Given the absolute risk estimate, x, the positive predictive value and negative predictive value of the multicancer liquid biopsy test can be calculated using the formula below:
$$Se = \text{sensitivity}; \quad Sp = \text{specificity}$$
$$PPV(x) = \frac{Se \times p(x)}{Se \times p(x) + \left(1 - Sp\right) \times \left(1 - p(x)\right)}$$
$$NPV(x) = \frac{Sp \times \left(1 - p(x)\right)}{\left(1 - Se\right) \times p(x) + Sp \times \left(1 - p(x)\right)}$$
The absolute risk estimate can be written as a function of age and risk factors. We assumed that the sensitivity and specificity of the multicancer liquid biopsy test do not depend on the underlying risk factors, and we used the values reported by the DETECT-A and Galleri studies (Supplementary Table 1)24,26.
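The PPV and NPV formulas translate directly into code. The sketch below also includes the algebraic inverse of the PPV formula, which gives the absolute risk at which a test with fixed sensitivity and specificity reaches a target PPV; this inverse is added here for illustration and is not described in the text. Function names are hypothetical, and sensitivity and specificity are treated as constants, as assumed above.

```python
def ppv(p, se, sp):
    """Positive predictive value at absolute risk p."""
    return se * p / (se * p + (1 - sp) * (1 - p))


def npv(p, se, sp):
    """Negative predictive value at absolute risk p."""
    return sp * (1 - p) / ((1 - se) * p + sp * (1 - p))


def risk_for_ppv(target_ppv, se, sp):
    """Absolute risk at which the PPV equals target_ppv (inverse of ppv)."""
    return target_ppv * (1 - sp) / (se * (1 - target_ppv) + target_ppv * (1 - sp))


# Reported DETECT-A operating point: 27.1% sensitivity, 98.9% specificity
implied_risk = risk_for_ppv(0.194, se=0.271, sp=0.989)
```

Because PPV increases monotonically with the absolute risk for fixed sensitivity and specificity, stratifying by age and risk score changes the projected PPV even though the test characteristics themselves are held constant.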
This study was conducted under UK Biobank Application Number 17712 (PI: Dr. Nilanjan Chatterjee). The study analyzes existing UK Biobank data and does not involve new human research participants. UK Biobank was approved by the North West Multi-center Research Ethics Committee (https://www.ukbiobank.ac.uk/learn-more-about-uk-biobank/about-us/ethics).
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.