Mathematical Risk Models — Where Independent Risks Become Dependent

2004 risk working group

In the early 2000s, breast cancer risk assessment began to ease into the clinic, specifically for guiding SERM risk reduction strategies. Our Komen-sponsored Working Group (above), under the leadership of Victor Vogel, MD, cobbled some guidelines together over the course of one year before publishing in 2004. But risk assessment didn’t hit full force until the 2007 American Cancer Society guidelines were announced for screening high-risk women with MRI. (After 11 years, these high-risk guidelines are currently in the process of revision.)

WARNING: The following discussion of quasi-precision medicine is not for the faint of heart when it comes to numbers and mathematical models of risk.

In the pre-Gail days, the only predictive models were esoteric family history tables from the epidemiologic literature that were virtually unknown to clinicians. No attempt was made to incorporate other risk factors, and maybe that was a good thing.

Having started one of the earliest risk assessment programs in the U.S., pre-Gail, I leaned heavily on the Dupont tables that converted relative risks to absolute risks over defined time periods, preferably 20 years max. The problem was determining a reasonable overall relative risk (RR) when more than one risk factor was present. Anticipating this problem, investigators had begun coupling risk factors and calculating single RRs. For instance, “nulliparity plus a first-degree relative with BC” carried an RR of 2.7 in one study. Plug that into the 20-year risk table from Dupont, and for a 50-year-old, you’ll get a 12% risk of breast cancer over the next 20 years (through age 70). This calculation is a little higher than what the Gail predicts today, and about the same as what the Tyrer-Cuzick (TC) model predicts.
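
For the numerically inclined, here is a minimal sketch of that conversion. The baseline 20-year cumulative incidence plugged in below (roughly 4.5% for a 50-year-old) is my own assumption, chosen simply to reproduce the 12% figure quoted above; the actual Dupont tables were built from cohort data, not from this shortcut.

```python
# Back-of-the-envelope sketch of the relative-to-absolute conversion the
# Dupont tables performed.  NOT the published tables: the baseline 20-year
# cumulative incidence below (~4.5% for a 50-year-old) is an assumption,
# chosen only to reproduce the ~12% figure quoted in the text.

BASELINE_20YR_RISK_AGE_50 = 0.045   # assumed population risk, ages 50-70

def absolute_20yr_risk(relative_risk, baseline=BASELINE_20YR_RISK_AGE_50):
    """Scale a baseline cumulative incidence by a single overall RR."""
    return relative_risk * baseline

print(f"{absolute_20yr_risk(2.7):.0%}")   # -> 12%
```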

The best-known coupling of risk factors was “atypical hyperplasia and a first-degree relative with breast cancer,” with the Page/Dupont team assigning a synergistic RR = 9.0 (later nullified by the Mayo Clinic data). Although Page and Dupont introduced their seminal work in the 1970s, it was a landmark paper published in 1985 in the New England Journal of Medicine that captured the attention of what were then precious few breast specialists. Nevertheless, multi-factorial risk assessment was born.

This dual-risk approach was sufficient for risk factors in couplets, but what about 3 or more risk factors? And what about those continuums that are applicable to all, such as age?

Enter Gail (1989 publication, although common use lagged until the P-01 trial), where risks were not directly studied as couplets, triplets, or beyond. The Gail approach didn’t care about biologic relationships. Rather, relative risks (RRs) were merged into absolute risk mathematically (with multiplication of RRs being at the heart of the statistical formulas). The user never sees the RRs, only the final tally in the form of an absolute risk over a defined period of time.
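
As a rough caricature of that machinery (and nothing more), here is what a multiplicative merger looks like. The RRs and the baseline cumulative hazard below are placeholders, not Gail coefficients; the real model works with age-specific hazards and competing mortality.

```python
import math

# A crude caricature of the machinery: multiply the per-factor RRs, then
# apply the combined RR to a baseline cumulative hazard to get an absolute
# risk.  The RRs and the 0.12 baseline hazard are placeholders, not Gail
# coefficients.

def combined_rr(rrs):
    result = 1.0
    for rr in rrs:
        result *= rr
    return result

def absolute_risk(baseline_cum_hazard, rrs):
    # survival-style conversion, which keeps the answer below 100%
    return 1.0 - math.exp(-combined_rr(rrs) * baseline_cum_hazard)

# three illustrative risk factors with made-up RRs
print(f"{absolute_risk(0.12, [1.5, 1.8, 1.2]):.1%}")
```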

Validation of the Gail came through the Texas Breast Screening Project, then from the 6,700 women in the placebo arm of the NSABP P-01 trial (then CASH validation, then NHS validation), where it was accurate in predicting the number of cancers that would occur in a large cohort (called “calibration” – #predicted vs. #observed). But prediction at the individual level (called “discrimination”)? Well, not so good. With a concordance-statistic (c-stat, which is comparable to statistical “accuracy” expressed as AUC) of 0.58, the original Gail was not much better than flipping a coin for an individual. This should not be surprising given that the majority of patients who develop breast cancer do not have the risk factors included in the Gail model, if they have any known risks at all.
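
A toy simulation makes the calibration-versus-discrimination distinction concrete. The data here are entirely synthetic (everyone is assigned a predicted risk near 10%, and outcomes land almost independently of the prediction), so the totals line up beautifully while the c-stat hovers near a coin flip.

```python
import random

# Synthetic illustration of good calibration coexisting with poor
# discrimination -- no relation to actual Gail output.

random.seed(0)
n = 10_000
true_rate = 0.10
# everyone gets a predicted risk near 10%, jittered slightly
predicted = [max(0.01, random.gauss(true_rate, 0.02)) for _ in range(n)]
# outcomes occur at 10% regardless of the prediction
outcome = [1 if random.random() < true_rate else 0 for _ in range(n)]

# calibration: number predicted vs. number observed
print("expected cases:", round(sum(predicted)), "observed cases:", sum(outcome))

# discrimination (c-stat): chance that a random case outranks a random non-case
cases = [p for p, y in zip(predicted, outcome) if y == 1][:200]
controls = [p for p, y in zip(predicted, outcome) if y == 0][:200]
pairs = [(c, k) for c in cases for k in controls]
c_stat = sum((c > k) + 0.5 * (c == k) for c, k in pairs) / len(pairs)
print("c-statistic:", round(c_stat, 2))   # hovers near 0.5, i.e., a coin flip
```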

After the Tyrer-Cuzick (IBIS) model emerged and surpassed the Gail in popularity (for reasons I won’t bother with here), one could easily fiddle with many more variables to see the impact that individual risks had on the final calculated risk. C-stats crept above 0.60, but realize that 0.70 is a minimum standard for “fairly good performance,” and it would be far better to reach 0.80, where one will find the predictive models used for diabetes and cardiovascular disease.

In order to be included in the mathematical models, risk factors must be “independent.” That is, they exert their effect with or without other risk factors. Yet, after they are merged mathematically, strange things happen biologically.

Let’s look at a patient who undergoes risk assessment with the Tyrer-Cuzick model at age 50, with no risk factors other than her plan to take Prempro™ HRT for the next 20 years (yes, it’s an extreme example to make a point).

Tyrer-Cuzick generates a baseline risk of 11.4% lifetime that will increase to 16.3% with 20 years of intended E+P. So, E+P adds 4.9% in absolute risk over a 35-year span.

Now do the same E+P calculation for a woman who has 2 first-degree relatives with breast cancer in their 30s, is nulliparous, and has had a benign biopsy with specific results unknown. Baseline risk is 40%, and if she adds 20 years of intended HRT use with E+P, her absolute risk is now 63%. This is a 23% absolute increase in risk over a 35-year period with E+P in this patient.
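
To make the arithmetic explicit, here are the two sets of Tyrer-Cuzick outputs side by side; nothing is recomputed, the sketch only restates the numbers quoted above.

```python
# The arithmetic behind the comparison, using the Tyrer-Cuzick outputs
# quoted above (nothing here is recomputed from the model).

cases = {
    "average-risk 50 y/o": (0.114, 0.163),   # lifetime risk without / with 20 yr E+P
    "high-risk 50 y/o":    (0.400, 0.630),
}

increments = {}
for label, (without_ep, with_ep) in cases.items():
    increments[label] = with_ep - without_ep
    print(f"{label}: +{increments[label]:.1%} absolute risk from E+P")

ratio = increments["high-risk 50 y/o"] / increments["average-risk 50 y/o"]
print(f"same exposure, {ratio:.1f}x the apparent absolute effect")
```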

So, the T-C model is telling us that this high-risk patient is far more susceptible to E+P than the average-risk patient.

Does this strike you as odd? Is this precision medicine? Are we to believe that E+P is nearly 5 times more powerful in risk potential if taken by someone who is already at high risk? If so, the independent risk of HRT with E+P is, in fact, dependent on other risk factors with the T-C model.

This is because mathematics drives these models, not biology. Mitchell Gail did not invent the mathematics he used in 1989 from scratch. There were statistical rules already established for the merger of risk factors, independent of medical science. They can be applied to industry or commerce or any other discipline that analyzes risk. Biology has a way of fouling things up, however.

Does the Women’s Health Initiative help with this issue? Not really. Theoretically, if the hazard ratio is the same (HR = 1.24) for all levels of risk, then yes, there will be a disproportionate increase in absolute risk from E+P in the higher-risk patients. In that study, however, the higher-risk women had lower HRs (1.13), which would support my contention that the absolute risk added by E+P is roughly fixed across the different levels of underlying risk. That said, none of the calculations reached statistical significance, so we don’t know.

Let’s move on to the newly added breast density feature of the Tyrer-Cuzick model, version 8.0. What’s the absolute risk of having Level D mammograms, a so-called independent risk factor? There is no single absolute risk. As we just saw with E+P, the independent density risk factor is dependent on the other risks.

Using version 8.0 of the T-C model, start with a 40 y/o, but add no other risk factors, leaving breast density “unknown.” (Tricky point – if you try to use v.8.0 on patients younger than 40, the density feature will not work at all.)

Lifetime risk will be 11.3%, but now add Level D density and you’ll get 17.5%. So, Level D density carries with it an absolute risk of 6.2% in this patient. Fair enough. Sounds about right.

Now enter our high-risk parameters above (two first-degree relatives diagnosed at 35, prior biopsy, and nulliparity), generating a lifetime risk of 40%. Level D density takes on a whole new level of absolute power, raising absolute risk to 54%, an absolute increase of 14%, compared to the 6.2% in the baseline-risk patient (an even bigger disparity will be calculated if you use the no-risk patient rather than the patient with general population risk).
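
The same side-by-side arithmetic for Level D density, again using only the version 8.0 outputs quoted above:

```python
# Same exercise for Level D density, using only the version 8.0 outputs
# quoted above.

density_cases = {
    "no other risk factors entered": (0.113, 0.175),
    "high-risk parameters entered":  (0.400, 0.540),
}

for label, (without_d, with_d) in density_cases.items():
    print(f"{label}: Level D adds {with_d - without_d:+.1%} absolute risk")
```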

As an aside, if you fiddle with various densities, you might be surprised to learn that the referent (RR = 1.0) in this new addition to the T-C model is Level C density. So, in our example above where the baseline risk was 11.3%, if you then add Level C, the calculated risk barely changes, to 11.6%. This is in sharp contrast to what the high-density grassroots movement is doing when women are told (by legislation in most states) that they have Level C or D density and that this raises risk 2-fold. When it comes to the T-C model, only Level D counts, and even then, it’s dependent on other risks. Incidentally, both Levels A & B will generate lifetime risks well below what you’ll get using version 7.0 of the T-C model.

Another feature to point out – even though Level D density might be considered a 2-fold risk factor (RR) when compared to the average patient (which should be the patient at the border between B & C with 50% density, not Level C), the model does not multiply the baseline risk by 2. Again, there are complex statistical models at work here, and even though multiplication is at the heart of those models, it’s the accepted statistical formulas used for merging risk factors that keep that from happening.
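
I don’t claim to reproduce the Tyrer-Cuzick internals, but one standard statistical device that keeps a 2-fold RR from simply doubling an absolute risk (and from blowing past 100%) is to apply the RR on the hazard, or log-survival, scale. A generic sketch:

```python
# A generic way a model can apply a 2-fold RR without doubling an absolute
# risk: apply the RR on the hazard (log-survival) scale.  This is an
# illustration of the principle, not the actual Tyrer-Cuzick formulas.

def apply_rr_on_hazard_scale(baseline_risk, rr):
    return 1.0 - (1.0 - baseline_risk) ** rr

for baseline in (0.113, 0.40, 0.60):
    adjusted = apply_rr_on_hazard_scale(baseline, 2.0)
    print(f"baseline {baseline:.0%} -> {adjusted:.0%} "
          f"(naive doubling would give {2 * baseline:.0%})")
```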

Moving now to SNPs, where virtually every known SNP studied in breast cancer is added to the T-C model, as is currently being done by Myriad and Ambry. I already addressed SNPs in the January 2018 blog, where I noted that the heart of merging risks is multiplication. If there are 86 SNPs, all with RRs under 1.26, then it’s this: 1.10 × 1.02 × 1.18 × 0.9 × 1.10 × 0.88 and on and on until all SNPs are included, with a final RR usually close to 1.0 (no effect). That said, it is possible to generate cumulative risks (mathematically) that reach RR = 2.5, more powerful than some of the predisposition genes being tested at the same time.
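
The running product, using just the six example RRs spelled out above (the commercial panels obviously carry their own values for all 86):

```python
from math import prod

# The multiplication spelled out above, using the six example per-SNP RRs
# quoted in the text (the commercial panels include all 86).
example_snp_rrs = [1.10, 1.02, 1.18, 0.90, 1.10, 0.88]
print(f"running combined RR: {prod(example_snp_rrs):.2f}")   # ~1.15, close to 1.0
```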

And not only are the SNPs impacting our final results more than I anticipated, they have more power on the protective side as well. With a SNP tally carrying an RR below 1.0, I’ve seen 10 absolute percentage points shaved off of one patient’s calculated risk, dragging her below the threshold for MRI testing, whereas without SNPs, she easily qualified. (Now I ask for SNP panels only in my patients who do NOT currently qualify for MRI.)

But as for the power of the SNPs as related to the calculations without SNPs, we have the same phenomenon as with E+P and breast density above. That is, for the baseline risk patient, the SNPs have little impact. But take the same SNP results and apply them to someone at 40% lifetime risk, and suddenly the SNPs are far more powerful. Same SNPs. Different power.

My question that I’m tossing out to experts around the country is this: if RRs are DERIVED from patients without the risk in question (that is, baseline-risk individuals who serve as the denominator), then why are they APPLIED to patients after other risks are included? It works fine for the baseline-risk patient. But when it comes to the patient already at 40% risk, the newly added risk factor was not measured in women already at 40% risk. Its power was calculated without other risks.

In contrast, witness what has happened to the combination of atypical hyperplasia and a first-degree relative with breast cancer. Originally, Page and Dupont measured a 9-fold risk, and this synergism (termed interactive risks) was considered solid. Recently, the Mayo Clinic data (with the assistance of Dupont and his Nashville data) have confirmed that the family history adds next to nothing once atypical hyperplasia is diagnosed. It’s the same 4-fold risk, with or without FH, that Page and Dupont had calculated years ago for atypia alone. So, the key point in question — i.e., independent risks being dependent on other risks — no longer applies to the combination of ADH and FH (using either the T-C model or the “benign breast disease” model (BBDAH-BC) generated by Mayo).

The T-C model has therefore made adjustments for this new information (the Gail has not), and if you enter a patient who has ADH and no other risk factors, the ADH prevails no matter what else you add, with regard to family history, that is. Jump over to breast density or SNPs or other risk factors, and in sharp contrast to what happened with the addition of a positive FH, the calculated risk will be augmented, sometimes dramatically. Thus, the patching of the T-C model creates some remarkable incongruities, with ADH fixing the risk at a single point in spite of FH, yet leaving it totally vulnerable to exaggeration by other co-existing risk factors.

Furthermore, try the same exercise using plain old hyperplasia, rather than ADH. You’ll find that we’re back to synergism with FH, and the calculations begin to zoom once again. In other words, only ADH fixes the risk at a value that is impermeable to family history. This is what happens when you try to fix the model based on direct observation of two risks working (or not working) together. Although a more accurate calculation is derived, it creates an inconsistency because the remainder of the model is still using statistical mergers that don’t care about biology. Thus, hyperplasia plus a first-degree relative will generate the same risk as if that same patient had ADH plus a first-degree relative (wherein the FH won’t count in the latter instance)!

The point is that direct measurement of risk factors as couplets or triplets is going to be more accurate than our mathematical models that take independent risk factors and turn them mathematically into factors dependent on each other, without any prospective validation.

My intuitive approach would be blasphemy to the makers of these models. But just to give the statisticians a good laugh, here is how I would handle the addition of SNPs:

A 40-year-old woman is already at 40% risk of breast cancer due to other factors, and with her SNPs having been found to impart a 2-fold risk, her final calculation is a 63% risk for breast cancer, according to the T-C model with the SNP feature enlisted (under Tools). Note: even the overestimated risk, using approved statistics, doesn’t multiply 40% × 2 to get 80%, or else it wouldn’t take much to exceed 100%.

But the SNPs were not measured in women already at 40% risk. Could we be counting the same risks twice? What if the SNP risk is the same risk imparted by FH, or proliferative changes on a biopsy? (After all, it turned out that positive family history added nothing to the diagnosis of ADH. In the past, we were in effect counting the same risk twice.) Instead of deriving SNPs from women at 40% risk, the SNPs were derived from women without other risk factors (with the exception of some work in patients with positive family history). So it seems (to me) that if SNPs are operating independently, they should be converted to an absolute risk increase based on the general population risk from which they were derived, then added to the absolute risk already calculated. Same risk factor = same absolute imparted risk, regardless of other factors (true independence).

That gibberish is difficult to follow, so here’s how my (antithetical) methodology based on apparently faulty intuition would work: the 2-fold SNP risk in the baseline 40 y/o patient generates a risk of 22% (11% × 2), so this relatively powerful SNP combination has a power of 11 additional percentage points of absolute risk. If our patient is already at 40% risk, then we add 11% to reach 51%, not the 63% generated by the T-C model, and not the 80% generated by those who simply multiply SNP scores by calculated risk.

It seems as though it would be even more accurate to start out – not with the 11-12% baseline risk – but with a 7% lifetime risk (assuming full life expectancy in a young woman) since this is the true “no-risk” baseline upon which RRs are usually calculated. Thus, a 2-fold SNP result would convert 7% to 14%, or an absolute 7% increase (as opposed to the 11% increase above).
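
Here is the whole comparison in one place, using the numbers from the last few paragraphs. The additive figures are my own back-of-the-envelope alternative, not anything endorsed by the model builders; the 63% is simply the T-C output as quoted, and the 47% line extends the 7% “no-risk” variant to the same 40%-risk patient.

```python
# The competing arithmetic from the last few paragraphs, side by side.
# The additive function is a back-of-the-envelope alternative, not an
# accepted statistical method; the 63% is the T-C output as quoted.

def additive(current_risk, rr, reference_baseline):
    """Convert the RR to an absolute increment at the baseline from which
    it was derived, then add that increment to the already-calculated risk."""
    increment = reference_baseline * rr - reference_baseline
    return current_risk + increment

current = 0.40   # lifetime risk already calculated from other factors
snp_rr = 2.0     # combined SNP relative risk

print(f"naive multiplication:          {current * snp_rr:.0%}")
print(f"Tyrer-Cuzick (as quoted):      63%")
print(f"additive, 11% baseline:        {additive(current, snp_rr, 0.11):.0%}")
print(f"additive, 7% no-risk baseline: {additive(current, snp_rr, 0.07):.0%}")
```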

Would it be so weird to convert the well-documented risk factors to absolute risk elevations based on a 7% baseline, then add them together? (yes, it would, if we include all 100+ alleged risk factors that have been described over the years).

The statisticians are screaming foul right now, if they didn’t stop reading already. Nonetheless, I’m trying my best to understand what is happening here, as we pile risks on top of risks on top of risks, multiplying RRs all the way. I have ongoing e-mail discussions with statisticians and experts and “model designers,” and I’m not progressing very well as a student, still trying to understand the mathematics at the biologic/philosophic/logical level, where independent risk factors become so dependent on each other.

Help is welcome. If you can reconcile this conundrum, please contact me. This is a black and white issue only for the statisticians. And if you want to defend these models to the death, remember this: excellent calibration (# of predicted cases vs. # of observed cases) can co-exist with bad discrimination (predicting who is going to get cancer at the individual level). Even with density and SNPs added in, the models are still in the 0.60s on c-stats. That’s not good, and that is not precision medicine.

To be frank, I don’t think the science of breast cancer risk assessment has caught up with our technologies. We have the tools needed to find virtually all cancers early, but due to cost and inconvenience, we are limited to risk stratification based on marginal models. As a confirmed cynic and skeptic (as commonly occurs with age), I think the strategy outlined by our Risk Assessment Working Group sponsored by Komen many years ago (2004 – photo at top) was on the right track by simply assigning women to one of 3 groups – baseline, high risk and very high risk. Although that article was intended to refine risk assessment by identifying patients for ductal lavage and the search for atypia, we mentioned the possibility of MRI screening (prior to most results from the international MRI screening trials).

But no one today is interested in grouping patients into risk levels based on categorical risks. The trend is to convert anything that moves into a number, and let empiricism take over from there. After all, how do you add digits to the right of the decimal point if you’re using such archaic terms as “very high risk”? We are much more comfortable with the illusion of certainty provided by 29.7%, or 11.6%, or an improbable 91.3%.

In debates about the propriety of “risk-based screening,” both for routine mammography (doing less) and for multimodality imaging (doing more), remember we are dealing with mathematical modeling applicable to industry and commerce that might not reflect true underlying biologic risk at all, not to mention c-stats that are sub-optimal.

So what’s the big deal if we’re off a bit, as long as we ballpark it? Coupla things. First, insurers are not playing in the ballpark; they have canonized the precarious (and age-discriminatory) 20% threshold, so there’s a huge difference between 19% and 21% now that we’ve created this quagmire for ourselves. Second, at the higher end of risk, take a patient at modest risk elevation, add E+P HRT, Level D density and bad SNPs, and in the quest to beat 20%, you end up with an inflated risk value that drives her to preventive mastectomies. It seems odd that we are to counsel breast cancer patients out of contralateral prevention because the risks of cancer are considered to be so low (sometimes quoting 10-year risks for a lifetime potential), while at the same time, we have no hesitancy to jack up risks in asymptomatic patients (quoting lifetime risks) and, as a result, drive them beyond MRI screening or SERM risk reduction to bilateral preventive mastectomies.

Alas, I’m like everyone else who battles insurers to pay for screening MRI – “Get that 20% even if it means reverting to version 7.0 for density Level A, B & C patients.” I’m just trying to avoid that unwieldy trap of fooling myself. As Richard Feynman said in his 1974 Caltech commencement address: “The first principle is that you must not fool yourself — and you are the easiest person to fool.”