Sunday, May 28, 2023

Synthetic lethality prediction in DNA damage repair, chromatin remodeling and the cell cycle using multi-omics data from cell lines and patients. – Scientific Reports


Definitions

LoF alteration

For both cell line and patient data, we considered loss of function (LoF) alterations that result in producing a non-functional protein. LoF alterations were defined based on the Ensembl (ref. 56) classification of the variants, and only variants with a predicted high impact effect were utilized. This included the following alterations: splice site, nonsense and damaging missense mutations, frameshift insertion/deletion, start codon insertion/deletion/single nucleotide polymorphism, stop codon insertion/deletion. To assess the functional effect of the missense variants and to define damaging ones, we used ANNOVAR (ref. 57). Five variant effect prediction methods included in ANNOVAR were used to increase the accuracy of the prediction and decrease the false-positive rate: PolyPhen-2 (ref. 58), SIFT (ref. 59), PROVEAN (ref. 60), CADD (ref. 61) and DANN (ref. 62). A missense mutation was classified as damaging if at least three out of these five methods predicted a damaging effect on the protein function according to the damaging score cut-off established by each method.
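The three-out-of-five voting rule for missense variants can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the input format (one boolean call per predictor, already derived from that method's own cut-off) are assumptions.

```python
def is_damaging_missense(predictions, min_votes=3):
    """Return True if at least `min_votes` predictors call the variant damaging.

    `predictions` maps a predictor name to a boolean call that was already
    derived from that method's own score cut-off (hypothetical input format).
    """
    votes = sum(1 for call in predictions.values() if call)
    return votes >= min_votes


# Hypothetical calls for a single missense variant.
example_calls = {
    "PolyPhen-2": True,
    "SIFT": True,
    "PROVEAN": False,
    "CADD": True,
    "DANN": False,
}
print(is_damaging_missense(example_calls))  # three of five methods agree
```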

Focus genes

Focus genes were defined as genes from DDR, chromatin remodeling and cell cycle pathways. Sets of genes involved in these pathways were collected based on published literature43,63,64,65,66,67 and curated databases (total 1241 genes; Supplementary Table 1; Fig. 1A).

Druggable genes

Druggable genes were defined as genes targeted by drugs in an advanced stage of development. Specifically, 953 druggable genes were selected from the Open Targets Platform database (68, release 19.06), based on the stage of development of the drug that targets the selected gene, including genes in the approved, advanced clinical and phase 1 clinical development stage (Bucket 1–3, Supplementary Table 2).

Data source and pre-processing

Cell line data

The dataset of genome-scale CRISPR knockout screens from Project Achilles (https://depmap.org/portal/achilles/) for 789 cell lines was obtained from the Cancer Dependency Portal (DepMap 20Q3 release25, https://depmap.org/portal/download). Of these, 787 cell lines had genomic information and were therefore used in the analysis. Additionally, large-scale RNAi screening datasets, including the Broad Institute Project Achilles, Novartis Project DRIVE13 and the Marcotte et al.69 breast cell lines, with the genetic dependencies estimated using the DEMETER2 model70, were downloaded from the DepMap portal (712 cell lines). The genomic characterization data of the cell lines was generated by the Cancer Cell Line Encyclopedia (CCLE) and obtained from DepMap 20Q3 (787 and 670 cell lines for the Achilles and DEMETER2 projects, respectively). The cell line alteration data was then organized into a cell-line-by-gene binary matrix with the value 1 if the cell line acquired at least one LoF alteration in the gene, and 0 otherwise.
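As an illustration of the binary alteration matrix, the sketch below builds a sample-by-gene matrix from a list of LoF calls. The function name, the nested-dictionary layout and the sample and gene labels are hypothetical; the same construction would apply to patients instead of cell lines.

```python
def build_lof_matrix(lof_calls, samples, genes):
    """Build a sample-by-gene binary matrix as a nested dict.

    `lof_calls` is an iterable of (sample, gene) pairs, one per LoF alteration.
    Several LoF alterations in the same gene of one sample still yield 1.
    """
    matrix = {s: {g: 0 for g in genes} for s in samples}
    for sample, gene in lof_calls:
        if sample in matrix and gene in matrix[sample]:
            matrix[sample][gene] = 1
    return matrix


# Hypothetical calls: CL1 carries two LoF alterations in GENE_A.
calls = [("CL1", "GENE_A"), ("CL1", "GENE_A"), ("CL2", "GENE_B")]
matrix = build_lof_matrix(calls, samples=["CL1", "CL2"], genes=["GENE_A", "GENE_B"])
print(matrix)
```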

Patient data

The somatic single nucleotide variant (SNV) alterations, mRNA gene expression and clinical data for cancer patients were downloaded for The Cancer Genome Atlas (TCGA) Pan-Cancer cohort via the University of California Santa Cruz Xena data portal71. Raw counts from TCGA expression data were log2-scaled and quantile normalized to mitigate batch effects between different datasets (e.g. sequencing lab, time, etc.). Quantile normalization was performed using the “preprocessCore” R package72. The patient data was organized into a patient-by-gene binary matrix with the value 1 if the patient acquired at least one LoF alteration in the gene, and 0 otherwise.
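To make the normalization step concrete, here is a minimal sketch of log2 scaling followed by quantile normalization. It is not the preprocessCore implementation: the pseudocount of 1 before taking the logarithm and the simple handling of ties (tied values receive consecutive ranks) are assumptions made for brevity.

```python
import math


def log2_quantile_normalize(counts):
    """counts: dict sample -> list of raw counts, same gene order in every sample."""
    # Assumed pseudocount of 1 so that zero counts have a defined logarithm.
    logged = {s: [math.log2(c + 1) for c in v] for s, v in counts.items()}
    n_genes = len(next(iter(logged.values())))

    # Mean of the i-th smallest value across samples defines the target distribution.
    sorted_cols = [sorted(v) for v in logged.values()]
    rank_means = [sum(col[i] for col in sorted_cols) / len(sorted_cols)
                  for i in range(n_genes)]

    normalized = {}
    for s, v in logged.items():
        order = sorted(range(n_genes), key=lambda i: v[i])
        out = [0.0] * n_genes
        for rank, gene_idx in enumerate(order):
            out[gene_idx] = rank_means[rank]  # replace value by the mean for its rank
        normalized[s] = out
    return normalized


# Hypothetical raw counts for two patients and three genes.
normalized = log2_quantile_normalize({"P1": [0, 3, 15], "P2": [31, 1, 7]})
print(normalized)
```

After normalization every sample has the same set of values, assigned according to each gene's rank within that sample.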

Drug screen data

The drug screening data for the drug sensitivity analysis was collected from the Genomics of Drug Sensitivity in Cancer database (GDSC; https://www.cancerrxgene.org/downloads)26,27,28, and the Profiling Relative Inhibition Simultaneously in Mixtures database (PRISM25) from the DepMap data portal. Specifically, we used both available GDSC experimental setups (GDSC1 and GDSC2) and the secondary PRISM Repurposing 19Q4 screen. GDSC datasets were curated using the Open Targets database68 to unify the gene and protein information and assign gene targets to each drug. The screens included 518 unique drugs (367 in GDSC1 and 198 in GDSC2, with some overlap) targeting genes B in 988 unique cell lines in the GDSC database (987 cell lines in GDSC1 and 809 cell lines in GDSC2, with some overlap) and 1210 drugs targeting genes B in 481 cell lines in the PRISM database.

Pathway data

The gene pathways were collected from the curated gene sets: Kyoto Encyclopedia of Genes and Genomes (KEGG73,74,75), Pathway Interaction Database (PID76) and Reactome77, available as part of the c2 collection in MSigDB29. Together, we evaluated 1951 pathways: 186 in KEGG, 196 in PID and 1569 in Reactome.

Synthetic lethality tests

We used eight tests on four data sources (cell line, patient, drug and pathway data) to collect evidence of synthetic lethality: two types of statistical tests on cell line data (SPID and SPEA), four tests on patient data (ExprSL, SoF, SurvLRT and iSurvLRT), one test on drug screens (SPDD) and one pathway enrichment test (SPSP). Although several of these tests were previously used in SL discovery by different frameworks15,16,17,18,19, here we introduced a new patient-based test (iSurvLRT) and a novel modification of GSEA, called SPEA, and further combined all tests in a comprehensive framework for identifying clinically relevant SL pairs.

Synthetic partner inactivation dependency (SPID)

The SPID test applies a one-sided Wilcoxon rank-sum test to determine whether cell lines with a LoF alteration in gene A have significantly lower gene B dependency scores than cell lines without a gene A LoF alteration, i.e. whether they are more sensitive to gene B inactivation. In addition, we calculate the positive dependency percentage, which is the fraction of such cell lines, out of the total number of cell lines, that have a positive estimated dependency score as a result of gene B deactivation. A positive dependency score implies that the cells grow and divide more efficiently when gene B is deactivated. Thus, a high positive dependency percentage implies that a given pair should not be considered SL.
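The SPID comparison can be sketched in plain Python as below. This is an illustrative approximation rather than the slideCell implementation: the p-value uses the normal approximation to the rank-sum statistic without tie or continuity correction, and the positive dependency fraction is computed relative to the gene A LoF group, which is one possible reading of the definition. All function names are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_sum_less(x, y):
    """One-sided rank-sum p-value for the alternative 'x tends to be lower than y'."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return 0.5 * math.erfc(-z / math.sqrt(2))  # normal CDF evaluated at z


def spid(dependency_b, lof_a):
    """dependency_b: gene B dependency score per cell line; lof_a: 0/1 per cell line."""
    lof = [s for s, flag in zip(dependency_b, lof_a) if flag == 1]
    wt = [s for s, flag in zip(dependency_b, lof_a) if flag == 0]
    p_value = rank_sum_less(lof, wt)
    # Assumption: fraction taken within the gene A LoF group.
    positive_fraction = sum(1 for s in lof if s > 0) / len(lof)
    return p_value, positive_fraction


# Toy data (far fewer than the 20 LoF cell lines required in practice).
scores = [-1.0, -0.9, -0.8, -0.7, -0.6, 0.0, 0.1, 0.2, 0.3, 0.4]
status = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
p_value, positive_fraction = spid(scores, status)
print(p_value, positive_fraction)
```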

Test application criteria. To perform the SPID test for a given gene pair, gene A must have LoF alterations in at least 20 cell lines and gene B must be a knock-out target in a given cell line experiment. The criterion of a minimum of 20 LoF mutations in cell lines (SPID, SPEA and SPDD tests) or patients (SoF, SurvLRT and iSurvLRT) is set arbitrarily, and consistently across all tests, to ensure enough statistical power.

SL evidence criteria. Gene pairs with a combined and adjusted p-value less than 0.05, where the combination is across two datasets (Achilles and DEMETER2) and two tests (SPID and SPEA), are considered significant evidence for SL according to these tests.

Synthetic partner enrichment analysis (SPEA)

The SPEA test is a novel modification of the widely used Gene Set Enrichment Analysis (GSEA). GSEA identifies sets of genes that are over- or under-represented in lists ranked by gene expression29. Specifically, given a list of genes ranked by correlation of their expression with a certain phenotype of interest and a certain set of genes, GSEA uses a permutation-based test and a Kolmogorov-Smirnov statistic to compute the significance of enrichment of the gene set at the top of the ranked gene list.

Instead of ranking genes by differences in expression between two samples, as is typical in GSEA, SPEA ranks cell lines according to their dependency score for gene B, so that the cell lines most sensitive to gene B silencing are situated at the top of the list. We are then interested in whether the subset of cell lines that carry a gene A LoF alteration is enriched at the top of that ranked list. We thus calculate an enrichment score (ES) by walking down the list, increasing a running-sum statistic when we encounter a cell line from the subset, and decreasing it when we encounter a cell line without a gene A LoF alteration. The ES is the maximum deviation from zero encountered in the random walk and corresponds to a weighted Kolmogorov-Smirnov-like statistic29. The ES indicates the direction of enrichment: a positive score means that cell lines with a gene A LoF alteration are concentrated at the beginning of the ranking, i.e. are more sensitive to gene B knockdown. See Supplementary Text for a formal description of the ES calculation.

We estimate the statistical significance (nominal p-value) of the ES by comparing it with the set of scores \(\text {ES}_{\text {rand}}\), computed with randomly assigned gene A LoF alteration status and for a reordered cell line list. We perform this permutation step 200 times, recompute \(\text {ES}_{\text {rand}}\) of the gene set for the permuted data and compile all results to generate a null distribution for the ES. The empirical, nominal p-value of the observed ES is then calculated relative to this null distribution.
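The running-sum walk and the permutation step can be sketched as follows. The formal ES definition is given in the Supplementary Text, so this sketch makes simplifying assumptions: it uses an unweighted running sum (equal step sizes within each group), a one-sided empirical p-value for positive ES only, and an add-one correction to avoid p-values of exactly zero. Function names and the seed are hypothetical.

```python
import random


def enrichment_score(ranked_labels):
    """Signed maximum deviation from zero of an unweighted running sum.

    `ranked_labels` holds 1 (gene A LoF) or 0 per cell line, most sensitive first.
    Both groups are assumed to be non-empty.
    """
    n_hit = sum(ranked_labels)
    n_miss = len(ranked_labels) - n_hit
    running, best = 0.0, 0.0
    for label in ranked_labels:
        running += 1.0 / n_hit if label else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def spea(dependency_b, lof_a, n_perm=200, seed=0):
    """Return the observed ES and an empirical one-sided p-value."""
    # Lowest dependency score (most sensitive to gene B loss) goes to the top.
    order = sorted(range(len(dependency_b)), key=lambda i: dependency_b[i])
    labels = [lof_a[i] for i in order]
    observed = enrichment_score(labels)

    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # random reassignment of gene A LoF status
        if enrichment_score(shuffled) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


# Toy data: all gene A LoF cell lines are the most sensitive ones.
scores = [-1.0, -0.9, -0.8, -0.7, -0.6, 0.0, 0.1, 0.2, 0.3, 0.4]
status = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
es, p_value = spea(scores, status)
print(es, p_value)
```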

Test application criteria. To perform the SPEA test for a given gene pair, gene A must have LoF alterations in at least 20 cell lines and gene B must be a knock-out target in a given cell line experiment.

SL evidence criteria. Gene pairs with a combined and adjusted p-value less than 0.05, where the combination is across two datasets (Achilles and DEMETER2) and two tests (SPID and SPEA), are considered significant evidence for SL according to these tests.

Gene co-expression (ExprSL)

The ExprSL test is computed as the pairwise gene expression correlation, and its significance is evaluated using a two-sided t-test for Spearman correlation. ExprSL is based on two assumptions. The first is that SL gene pairs are involved in closely related biological processes and thus are more likely to be significantly positively correlated. The second is that SL partner genes may compensate for each other and thus can be significantly negatively correlated78.

Test application criteria. To perform the ExprSL test for a given gene pair, the expression of both gene A and gene B must be measured.

SL evidence criteria. Gene pairs with an absolute Spearman correlation coefficient higher than 0.4 and an adjusted p-value less than 0.05 are considered significantly correlated.
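A minimal sketch of the ExprSL decision rule is shown below. For brevity it computes the Spearman coefficient with the rank-difference formula, which assumes no tied expression values, and it takes the adjusted p-value as a precomputed input rather than deriving it from the t-test. Function names are hypothetical.

```python
def spearman_no_ties(x, y):
    """Spearman correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1)); assumes no ties."""
    n = len(x)
    rank_x = {i: r for r, i in enumerate(sorted(range(n), key=lambda i: x[i]), start=1)}
    rank_y = {i: r for r, i in enumerate(sorted(range(n), key=lambda i: y[i]), start=1)}
    d2 = sum((rank_x[i] - rank_y[i]) ** 2 for i in range(n))
    return 1 - 6 * d2 / (n * (n * n - 1))


def exprsl_evidence(rho, adjusted_p, rho_cutoff=0.4, alpha=0.05):
    """Both strong positive and strong negative correlation count as evidence."""
    return abs(rho) > rho_cutoff and adjusted_p < alpha


# Hypothetical expression values of gene A and gene B across five patients.
expr_a = [1.0, 2.0, 3.0, 4.0, 5.0]
expr_b = [9.5, 7.1, 6.0, 3.2, 1.4]
rho = spearman_no_ties(expr_a, expr_b)
print(rho, exprsl_evidence(rho, adjusted_p=0.01))
```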

Survival of the fittest (SoF)

SoF test assumes that cells with joint LoF alteration of gene A and reduced expression of gene B in a given SL pair will not survive in a tumor cell population78. Thus, intuitively, SoF test assesses whether tumors with LoF alteration of gene A compensate for this loss by an increase of gene B expression. Specifically, SoF uses a one-sided Wilcoxon rank-sum test to examine whether gene B has a significantly higher expression in samples with LoF alteration in gene A compared to the rest of the samples.

Test application criteria. To perform SoF test for a given gene pair, gene A must have LoF alterations in at least 20 patients and gene B’s expression must be measured.

SL evidence criteria. Gene pairs with an SoF test adjusted p-value less than 0.05 are considered significant according to this test.

SurvLRT

We use the survival likelihood ratio test (SurvLRT)14 to estimate the tumor fitness with a given genotype g from survival data of patients. Here, the genotype \(g = (g_A, g_B)\) is defined by alterations in gene A and gene B. Specifically, for a given patient, \(g_A = 0\) if gene A is not altered in that patient’s tumor, and \(g_A = 1\) otherwise, and similarly for \(g_B\). In the original approach14, the alteration could be of any type. Here, it is strictly confined to the definition of LoF alteration. SurvLRT assumes a survival model of tumor fitness, stating that a decrease in tumor fitness due to LoF alteration in gene A and gene B is exhibited by a proportional increase of survival of the patients. Thus, the survival of patients with LoF alterations in both SL genes should be longer than expected from the survival of patients without LoF alteration in those genes or with only one gene altered.

Consider a reference survival function S(t), estimated based on a cohort of patients who did not die of cancer, as described previously14. Denote the fitness of a tumor with genotype g by \(\Delta _g\) and denote the log fitness as \(\delta _g = \log (\Delta _g)\). We assume that the survival of patients whose tumor carries genotype g is given by \(S(t)^{\Delta _g}\). In the case when there is no epistatic relation between genes A and B, we expect

$$\begin{aligned} \delta _{00} + \delta _{11} = \delta _{01} + \delta _{10}. \end{aligned} \tag{1}$$

In the case when gene A and gene B are in any epistatic relation (positive or negative), however, we expect that

$$\begin{aligned} \delta _{00} + \delta _{11} \ne \delta _{01} + \delta _{10}. \end{aligned} \tag{2}$$

Finally, for A and B being synthetic lethal partners, we expect that

$$\begin{aligned} \delta _{00} + \delta _{11} < \delta _{01} + \delta _{10}. \end{aligned} \tag{3}$$

SurvLRT is a likelihood ratio test, which is based on analytical estimates of the parameters \({\bar{\Delta }}_{00}\), \({\bar{\Delta }}_{11}\), \({\bar{\Delta }}_{01}\), and \({\bar{\Delta }}_{10}\) (and, correspondingly, \({\bar{\delta }}_{00}\), \({\bar{\delta }}_{11}\), \({\bar{\delta }}_{01}\), and \({\bar{\delta }}_{10}\)) and tests the null hypothesis given by Eq. (1) against the alternative hypothesis defined by inequality (2). To decide if the detected interaction is synthetic lethal, we compute the effect size \(\delta = {\bar{\delta }}_{00} + {\bar{\delta }}_{11} - {\bar{\delta }}_{01} - {\bar{\delta }}_{10}\) and check if the constraint \(\delta < 0\) holds. If it holds, we set an “SL flag” to 1 and report the pair as synthetic lethal with the associated p-value from the likelihood ratio test. Otherwise, we set the “SL flag” to \(-1\). For each investigated pair, we in addition report the log fitness of the double LoF alteration genotype that would be expected in the case of no epistatic interaction, i.e., in the case when Eq. (1) would hold, denoted \(\delta _{11}^{\text {Expected}}\).

Importantly, not all such SL interactions are clinically relevant. Inequality (3) can also hold when \(\delta _{00} < \delta _{11}\). In such a case, the log fitness of the genotype with double LoF alterations, \(\delta _{11}\), is still unexpectedly low, given the single LoF alterations. In particular, it is smaller than the log fitness of the double genotype expected in the case of no epistatic interaction, \(\delta _{11}^{\text {Expected}}\). Such an interaction, however, is not clinically relevant: using treatment to turn off the synthetic lethal partner in addition to the first gene in the pair would cause the patients to survive worse than patients without inactivation of either gene. Thus, we additionally set a clinically relevant (“CL”) flag, which is 1 if \(\delta _{00} > \delta _{11}\) and \(-1\) otherwise.
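Given the four estimated log fitness values, the effect size, the expected double-alteration log fitness and the two flags follow directly from Eqs. (1) and (3). The sketch below illustrates only this bookkeeping; it does not estimate the fitness parameters or compute the likelihood ratio p-value, and the function name and example values are hypothetical.

```python
def survlrt_flags(delta_00, delta_01, delta_10, delta_11):
    """Derive effect size, expected double log fitness, SL flag and CL flag."""
    effect = delta_00 + delta_11 - delta_01 - delta_10
    # Solving Eq. (1) for delta_11 gives the value expected without epistasis.
    expected_11 = delta_01 + delta_10 - delta_00
    sl_flag = 1 if effect < 0 else -1
    cl_flag = 1 if delta_00 > delta_11 else -1
    return {"effect": effect, "expected_11": expected_11,
            "sl_flag": sl_flag, "cl_flag": cl_flag}


# Double alteration is much less fit than expected and less fit than wild type.
relevant = survlrt_flags(delta_00=0.0, delta_01=-0.1, delta_10=-0.2, delta_11=-1.0)

# Inequality (3) holds, but the double alteration is still fitter than wild type.
not_relevant = survlrt_flags(delta_00=-1.0, delta_01=0.5, delta_10=0.5, delta_11=-0.5)

print(relevant)
print(not_relevant)
```

The second example shows why the CL flag is needed: the pair satisfies the SL inequality, yet inactivating both genes would not lower tumor fitness below that of the unaltered genotype.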

Test application criteria. To perform SurvLRT for a given gene pair, gene A must have LoF alterations in at least 20 patients. Additionally, at least 5 patients must have LoF alteration and be classified as deceased for each genotype.

SL evidence criteria. Gene pairs with a SurvLRT adjusted p-value less than 0.05 and with both the SL flag and the CL flag equal to 1 are considered significant evidence for clinically relevant SL according to this test.

iSurvLRT

Here, we propose a novel test, called iterative SurvLRT (iSurvLRT), which is an extension to SurvLRT. SurvLRT test14 has limited applicability, as it can only be used to test such gene pairs where both genes carry LoF alterations in a sufficient number of tumor samples. Instead, for a given gene A, which is often altered in cancer, it is desirable to find a partner B that itself does not necessarily acquire alterations in tumors.

The solution is iSurvLRT, which is based on LoF alterations in gene A and the expression of gene B. Specifically, the LoF alteration status of gene A in a given patient is defined as \(g_A = 0\) if gene A is not altered in the patient and \(g_A = 1\) if gene A is altered. The expression status of gene B in the same patient is defined according to whether the expression of gene B, denoted \(e_B\), is low in that patient or not, i.e. \(g_B(t) = 0\) if \(e_B \ge t\) and \(g_B(t) = 1\) if \(e_B < t\) for a given threshold t. To define the threshold t, we consider a grid of possible thresholds given by the quantiles of the empirical distribution of expression of gene B. Specifically, the grid of thresholds is defined by \(t \in (q(0.05), q(0.1), \dots , q(0.5))\), where \(q(\alpha )\) is the \(\alpha\)-th quantile of the distribution of \(e_B\). We next iterate over the grid of thresholds and define the genotypes \(g (t)= (g_A, g_B(t))\) for each patient based on the current threshold t in the iteration. For each iteration, we compute the p-value of the SurvLRT test for the genotype g(t) obtained for the current threshold. Finally, we return the lowest p-value across all iterations, the threshold used in that iteration and the obtained SL and CL flags.
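The threshold scan can be sketched as a loop over quantiles that keeps the iteration with the lowest p-value. This is an outline under stated assumptions, not the slidePat code: the grid is taken to have a step of 0.05, quantiles are computed by linear interpolation, and `survlrt` is a hypothetical callable standing in for the actual likelihood ratio test. The `toy_survlrt` function exists only to make the example runnable and has no statistical meaning.

```python
def quantile(values, alpha):
    """Empirical quantile by linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = alpha * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def isurvlrt_scan(g_a, e_b, survlrt):
    """Scan thresholds q(0.05), q(0.10), ..., q(0.50) and keep the best result.

    `survlrt` is a hypothetical callable (g_a, g_b) -> (p_value, sl_flag, cl_flag).
    """
    best = None
    for step in range(1, 11):
        alpha = step / 20  # assumed grid step of 0.05
        t = quantile(e_b, alpha)
        g_b = [1 if e < t else 0 for e in e_b]  # 1 marks low expression of gene B
        p_value, sl_flag, cl_flag = survlrt(g_a, g_b)
        if best is None or p_value < best["p_value"]:
            best = {"p_value": p_value, "alpha": alpha, "threshold": t,
                    "sl_flag": sl_flag, "cl_flag": cl_flag}
    return best


def toy_survlrt(g_a, g_b):
    """Stand-in with no statistical meaning: smallest p-value near 30% low expression."""
    frac_low = sum(g_b) / len(g_b)
    return abs(frac_low - 0.3) + 0.001, 1, 1


# Hypothetical data for 100 patients.
e_b = [float(i) for i in range(100)]
g_a = [i % 2 for i in range(100)]
best = isurvlrt_scan(g_a, e_b, toy_survlrt)
print(best)
```

Because the reported p-value is a minimum over ten thresholds, it is selected rather than fixed in advance, which is worth keeping in mind when interpreting it.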

Test application criteria. To perform iSurvLRT for a given gene pair, gene A must have LoF alterations in at least 20 patients and gene B’s expression must be measured. At least 5 patients must have LoF alteration and be classified as deceased for each genotype.

SL evidence criteria. Gene pairs with an iSurvLRT adjusted p-value less than 0.05 and with both the SL flag and the CL flag equal to 1 are considered significant evidence for clinically relevant SL according to this test.

Synthetic partner drug dependency (SPDD)

The SPDD test assesses the susceptibility of cancer cell lines with a gene A LoF alteration to drugs targeting gene B. To this end, cell lines are grouped into either wild type (WT) or gene A LoF. Drug sensitivity is assessed using the drug targets from the GDSC and PRISM data and a one-sided Wilcoxon test comparing the WT and LoF groups (cell lines without or with a LoF alteration in gene A, respectively) on the ln(IC50) values (natural logarithm of the fitted half maximal inhibitory concentration) for the GDSC datasets and on the median-collapsed log fold change profiles for PRISM. This test is performed separately for each drug and the drug with the best result (the lowest p-value) is reported.

Test application criteria. To perform the SPDD test for a given gene pair, gene A must have LoF alterations in at least 20 cell lines and gene B must be a known drug target.

SL evidence criteria. Gene pairs with an SPDD test p-value less than 0.05 are considered significant evidence for SL according to this test.

Synthetic partners shared pathways (SPSP)

SPSP assesses enrichment in common pathways using a part of c2 collection from MSigDB, consisting of pathways from KEGG, PID and Reactome databases. Specifically, a hypergeometric test (calculating the probability of co-existence of gene A and gene B in pathways) is used to detect pairs that share a significantly large number of pathways, given the number of pathways each of the genes is involved in.
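One way to express the shared-pathway test is as the upper tail of a hypergeometric distribution. The sketch below assumes that the population is the full set of evaluated pathways (1951 in this study) and that the p-value is the probability of observing at least the given number of shared pathways; the exact parametrization used by the authors is not stated here, so this is an assumption. The function name and the example counts are hypothetical.

```python
from math import comb


def shared_pathway_pvalue(n_total, n_a, n_b, n_shared):
    """P(X >= n_shared) for X ~ Hypergeometric(n_total, n_a, n_b).

    n_total: number of pathways considered
    n_a, n_b: number of pathways containing gene A and gene B, respectively
    n_shared: number of pathways containing both genes
    """
    upper = min(n_a, n_b)
    tail = sum(comb(n_a, k) * comb(n_total - n_a, n_b - k)
               for k in range(n_shared, upper + 1))
    return tail / comb(n_total, n_b)


# Hypothetical pair: gene A in 12 pathways, gene B in 8, sharing 5 of them.
p_shared = shared_pathway_pvalue(n_total=1951, n_a=12, n_b=8, n_shared=5)
print(p_shared)
```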

Test application criteria. To perform the SPSP test for a given gene pair, both gene A and gene B must be present in at least one of the considered pathways.

SL evidence criteria. Gene pairs with an SPSP test adjusted p-value less than 0.05 are considered significant evidence for SL according to this test.

slideCell and slidePat: cell line and patient tests R implementation

SL tests on cell line data were implemented in an R package called slideCell. Given dependency scores and alteration information for cancer cell lines, as well as a list of gene pairs, slideCell can be used to generate statistics, p-values and plots for SPID and SPEA tests. The source code for slideCell package is freely available at https://github.com/szczurek-lab/slideCell. SoF, ExprSL, SurvLRT and iSurvLRT tests on patient data were implemented in an R package slidePat, which takes as input gene alteration and expression, as well as patient survival data. The source code for slidePat package is freely available at https://github.com/szczurek-lab/slidePat.

SLIDE-VIP WebApp

The SLIDE-VIP WebApp is an online application for the visualization of test results for 224,169 potential SL gene pairs from this publication. The application has been developed in RStudio79, version 1.2.5033, using the Shiny package, version 1.5.0 and R80 version 4.0.5. The application is freely available online at slidevip.app.


