Study population
Our primary analyses were performed within three prospective US cohort studies: NHS, NHSII, and HPFS. The NHS started in 1976 and enrolled 121,700 female nurses aged 30–55 years42; the NHSII began in 1989 and recruited 116,429 female nurses aged 25–42 years42; and the HPFS was established in 1986 and enrolled 51,529 male health professionals aged 40–75 years43. Participants have been followed biennially and completed questionnaires obtaining lifestyle and health-related information. Blood samples were collected from 32,826 participants in the NHS during 1989–199044, from 29,611 participants in the NHSII during 1996–199944, and from 18,225 participants in the HPFS during 1993–199545 using a similar protocol. After collection, samples were shipped with an icepack via overnight courier to the laboratory, where they were processed and separated into plasma, buffy coat, and red blood cells and stored in liquid nitrogen freezers.
In the present study, we included participants from13 prior nested case-control sub-studies on metabolomics (Supplementary Table 1). For the analysis of mortality, we excluded participants reporting a history of cardiovascular disease (CVD) and cancer at blood collection or lost to follow-up after blood draw. For the analysis of longevity, we excluded participants who were alive but did not achieve longevity (living to age 85 years) at the end of follow-up. In all, 11,634 participants were included in the prospective analysis for mortality and 5297 for longevity (Fig. 1). The study protocol for each cohort was approved by the institutional review boards (IRBs) of the Brigham and Women’s Hospital, Harvard T.H. Chan School of Public Health, and participating registries. The present study was considered as non-human research by IRBs because it utilized de-identified samples that were originally collected for other studies.
The external replication for mortality was conducted in the PREDIMED study, a multicenter randomized controlled trial among individuals at high cardiovascular risk carried out in Spain from 2003 to 2009, which examined the effects of the traditional Mediterranean diet in the primary prevention of CVD, with type 2 diabetes as a secondary outcome46,47. For the present analyses, follow-up for death was available until 2020 through linkage with the National Death Index. We included participants from two nested case–cohort (CVD and diabetes outcomes) studies that included metabolomics profiling31,48. After excluding participants with >25% missing values in metabolites, 1878 were included in the replication analysis. The IRB of Hospital Clinic (Barcelona, Spain) approved the study protocol, and all participants provided written informed consent.
Metabolomics measurement
The plasma metabolomics profiling in all four studies (NHS, NHSII, HPFS, and PREDIMED) was performed using high-throughput liquid chromatography-mass spectrometry (LC–MS) techniques at the Broad Institute of MIT and Harvard (Cambridge, MA) during 2015–202137,49,50. Pooled plasma reference samples (prepared by combining small aliquots from the study samples), were analyzed every 20 participant samples to enable standardizing temporal drift in instrument response over time and between batches. In addition, quality control (QC) samples, to which the laboratory was blinded, were randomly distributed among the participants’ samples and were also profiled.
Hydrophilic interaction liquid chromatography (HILIC) analyses of water-soluble metabolites in the positive ionization mode (HILIC-positive) were conducted using an LC–MS system comprised of a Shimadzu Nexera X2 U-HPLC (Shimadzu Corp.; Marlborough, MA) coupled to a Q Exactive mass spectrometer (Thermo Fisher Scientific; Waltham, MA). Metabolites were extracted from plasma (10 μL) using 90 μL of acetonitrile/methanol/formic acid (74.9:24.9:0.2 v/v/v) containing stable isotope-labeled internal standards (valine-d8, Sigma-Aldrich; St. Louis, MO; and phenylalanine-d8, Cambridge Isotope Laboratories; Andover, MA). The samples were centrifuged (10 min, 9000 × g, 4 °C), and the supernatants were injected directly onto a 150 × 2 mm, 3 μm Atlantis HILIC column (Waters; Milford, MA). The column was eluted isocratically at a flow rate of 250 μL/min with 5% mobile phase A (10 mM ammonium formate and 0.1% formic acid in water) for 0.5 min followed by a linear gradient to 40% mobile phase B (acetonitrile with 0.1% formic acid) over 10 min. MS analyses were carried out using electrospray ionization in the positive ion mode using full scan analysis over 70–800 m/z at 70,000 resolution and 3 Hz data acquisition rate. Other MS settings were: sheath gas 40, sweep gas 2, spray voltage 3.5 kV, capillary temperature 350 °C, S-lens RF 40, heater temperature 300 °C, microscans 1, automatic gain control target 1e6, and maximum ion time 250 ms.
Plasma lipids were profiled using a Shimadzu Nexera X2 U-HPLC (Shimadzu Corp.; Marlborough, MA) (C8-positive). Lipids were extracted from plasma (10 μL) using 190 μL of isopropanol containing 1,2- didodecanoyl-sn-glycero-3-phosphocholine as an internal standard (Avanti Polar Lipids; Alabaster, AL). After centrifugation, supernatants were injected directly onto a 100 × 2.1 mm, 1.7 μm ACQUITY BEH C8 column (Waters; Milford, MA). The column was eluted isocratically with 80% mobile phase A (95:5:0.1 vol/vol/vol 10 mM ammonium acetate/methanol/formic acid) for 1 min followed by a linear gradient to 80% mobile-phase B (99.9:0.1 vol/vol methanol/formic acid) over 2 min, a linear gradient to 100% mobile phase B over 7 min, then 3 min at 100% mobile-phase B. MS analyses were carried out using electrospray ionization in the positive ion mode using full scan analysis over 200–1100 m/z at 70,000 resolution and 3 Hz data acquisition rate. Other MS settings were: sheath gas 50, in source CID 5 eV, sweep gas 5, spray voltage 3 kV, capillary temperature 300 °C, S-lens RF 60, heater temperature 300 °C, microscans 1, automatic gain control target 1e6, and maximum ion time 100 ms.
Only named metabolites (a total of 396 measured in NHS/NHSII/HPFS) were considered in the present analysis. After excluding metabolites whose intraclass correlation coefficient across blinded quality control replicates (10% of study samples) <0.3 (n = 8) and metabolites with a missing rate ≥25% (n = 145), 243 metabolites were included in the current analysis. They were primarily lipids (n = 171, including 70 glycerolipids, 31 glycerophospholipids, 24 plasmalogens, 20 carnitines, 10 cholesterol esters [CEs], 8 lysophospholipids, 4 ceramides, and 4 sphingolipids), but also included amino acids related metabolites (n = 31), nucleosides and derivatives (n = 12), and other metabolites (n = 29). Most of the named metabolites (208 out of 243) were available for the replication analysis in PREDIMED. Metabolite data were log-transformed if highly skewed (absolute skewness > 2)51, and all metabolites were then converted to z-scores within each sub-study. Missing data for each metabolite were imputed using the random forest imputation approach as it has been previously recommended for metabolomics analysis52.
Ascertainment of mortality and longevity
Deaths in the cohorts were identified from state vital statistics records and the National Death Index or by reports from next of kin or the postal authorities. The follow-up for mortality in these cohorts is over 98% complete using these methods. The cause of death was determined by physician’s review of medical records, autopsy reports, or death certificates. We used the International Classification of Diseases, Eighth Revision (ICD-8) in NHS and ICD-9 in HPFS, which were the ICD systems used at the time the cohorts began. Longevity was defined as living to the age of 85 years. This threshold has also been adopted by other studies6. In the PREDIMED study, four sources of information to identify endpoints were used: repeated contacts with participants, contacts with family physicians, a yearly review of medical records, and consultation with the National Death Index. All medical records that were related to endpoints were examined by medical doctors and ascertained by the end-point adjudication committee, whose members were unaware of the intervention-group assignments.
Covariables
In NHS/NHSII/HPFS, we collected information on body weight, smoking status, physical activity, multivitamin use, race, diabetes, hypertension, hypercholesterolemia, antihypertensive medication use, and lipid-lowering medication use via self-reported questionnaires preceding blood collection. Body mass index (BMI) was calculated using height reported at cohort baseline and body weight reported before the blood draw. Age and fasting status were obtained through questionnaires completed at blood collection. Total calorie, alcohol intake, and the Alternate Healthy Eating Index (AHEI, a measure of overall diet quality ranging from 0–100, not including alcohol), were derived from the last semiquantitative food-frequency questionnaire (FFQ) before blood draw. The validity and reproducibility of FFQs have been reported elsewhere53,54. In PREDIMED, medical conditions, medication use, family history of diseases, and lifestyle factors were collected through a questionnaire during the first screening visit. At the baseline visit, trained personnel measured participants’ body weight and height according to the study protocol.
Statistical analysis
In NHS/NHSII/HPFS, the associations of individual metabolites (per 1-SD increment) with all-cause, cardiovascular, and cancer mortality were assessed by multivariable Cox regression models. The Cox regressions were stratified by cohort, original sub-study, and case/control status in the original sub-study and adjusted for age in months, BMI (continuous), race (white or non-white), fasting status (yes or no), multivitamin use (yes or no), smoking status (current, past, or never), physical activity (continuous), diabetes (yes or no), hypertension (yes or no), hypercholesterolemia (yes or no), antihypertensive medication use (yes or no), lipid-lowering medication use (yes or no), total energy intake (continuous), alcohol intake (continuous), and AHEI (continuous). The person-time for each participant was calculated from the blood collection date until the date of death or end of follow-up (June 2018 for the NHS/HPFS and June 2019 for the NHSII), whichever came first. We also conducted subgroup analyses for all-cause mortality by age at baseline (<55 or ≥55 years), sex, and case/control status and sensitivity analyses by limiting to the first 10- or 20-year follow-up and by excluding participants diagnosed with CVD or cancer within the first 4 years after blood collection. In addition, we conducted a mediation analysis to calculate the percentage of association with mortality that can be attributed to incident CVD.
We created knowledge-based metabolite groups with molecular or biological similarity and performed metabolite set enrichment analysis (MSEA)55 to identify groups associated with mortality. MSEA ranks the metabolites by the multivariable-adjusted β coefficient of the association with mortality. This metric is used to identify enriched metabolite groups at the two extremes of the distribution of β estimates (positive/inverse associations)16. We derived metabolite modules using weighted gene co-expression network analysis (WGCNA), which constructs a scale-free network based on hierarchical clustering16,56. Each module was summarized by a score, calculated as the sum of measured metabolite values weighted by their corresponding loadings on the first principal component of all metabolites in that module. The module score was subsequently used in the Cox regression models to assess its association with mortality risk.
To account for the high correlations among metabolites and develop a multi-metabolite profile score for mortality, we used elastic net penalized Cox regression with a training-testing approach57. Participants were randomized to either the training set or the testing set in a 7 to 3 fashion. The elastic net model was first built in the training set within a 10-fold cross-validation framework and was then applied to the testing set to calculate the multi-metabolite profile score. The metabolite profile score was calculated as the weighted sum of the selected metabolites with weights equal to the elastic net regression coefficients. The score in the training set was obtained using a leave-one-out approach to avoid overfitting. We then examined the association between the metabolite score and mortality by multivariable Cox regression using combined data from the training set and testing set.
The associations between individual metabolites and longevity were evaluated by logistic regression with the same covariable adjustment as mortality. We also conducted a subgroup analysis by sex and sensitivity analysis for men by using 80 years instead of 85 years as the threshold due to the shorter life expectancy in men. The associations of knowledge-based metabolite groups, data-derived metabolite modules, and the multi-metabolite profile score with longevity were assessed as well. In the external replication dataset (PREDIMED Study) for all-cause and cardiovascular mortality, we conducted Cox regression analysis for available individual metabolites and the multi-metabolite profile score, stratified by recruitment center and intervention group and adjusted for age (continuous), sex (male or female), BMI (continuous), smoking status (current, past, or never), physical activity (continuous), baseline diabetes (yes or no), hypertension (yes or no), antihypertensive medication use (yes or no), hypercholesterolemia (yes or no), lipid-lowering medication use (yes or no), total energy intake (continuous), alcohol intake (continuous), and Mediterranean Diet Adherence Screener (MEDAS, a measure of overall diet quality in PREDIMED ranging from 0–14) (continuous). All statistical tests were two-sided. To account for multiple comparisons, we used the Benjamini–Hochberg procedure58 and controlled the false discovery rate (FDR) < 5%. All analyses were performed in R version 4.1.0. The main R packages used were ‘missRanger 2.1.5’ for random forest imputation, ‘survival 3.4-0’ for Cox regression, ‘fgsea 1.23.2’ for MSEA, ‘WGCNA 1.71’ for metabolite module derivation, and ‘glmnet 4.1-4’ for elastic net regression.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.