


























Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
This technical report proposes the LASSO-Patternsearch algorithm as a two-stage procedure to identify clusters of multiple risk factors for outcomes of interest in large demographic studies, when the predictor variables are dichotomous or take on values in a small finite set. The method is applied to data from the Beaver Dam Eye Study and is shown to expose physiologically interesting interacting risk factors. The document also discusses related research and computational procedures.
Typology: Study notes
1 / 34
This page cannot be seen from the preview
Don't miss anything!



























DEPARTMENT OF STATISTICS University of Wisconsin 1300 University Ave. Madison, WI 53706
LASSO-Patternsearch Algorithm with Application to Ophthalmology Data
Weiliang Shi^1 Department of Statistics, University of Wisconsin, Madison WI
Grace Wahba^1 Department of Statistics, Department of Computer Science and Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison WI
Stephen Wright^2 Department of Computer Science, University of Wisconsin, Madison WI
Kristine Lee^3 , Ronald Klein^4 , Barbara Klein^4 Department of Ophthalmology and Visual Science, University of Wisconsin, Madison WI
(^1) Research supported in part by NIH Grant EY09946, NSF Grants DMS-0505636, DMS-0604572 and ONR Grant N0014-06-0095 2 Research supported in part by NSF Grants SCI-0330538, DMS-0427689, CCF-0430504, CTS-0456694, CNS-0540147 and DOE Grant DE-FG02-04ER (^3) Research supported in part by NIH grants EY06594 and EY (^4) Research support in part by NIH grants EY06594, EY015286 and Research to Prevent Blindness Senior Investigator Awards
Abstract The LASSO-Patternsearch is proposed, as a two-stage procedure to identify clusters of multiple risk factors for outcomes of interest in large demographic studies, when the predictor variables are dichotomous or take on values in a small finite set. Many dis- eases are suspected of having multiple interacting risk factors acting in concert, and it is of much interest to uncover higher order interactions when they exist. The method is related to Zhang et al(2004) except that variable flexibility is sacrificed to allow enter- taining models with high as well as low order interactions among multiple predictors. A LASSO is used to select important patterns, being applied conservatively to have a high rate of retention of true patterns, while allowing some noise. Then the patterns selected by the LASSO are tested in the framework of (parametric) generalized linear models to reduce the noise. Notably, the patterns are those that arise naturally from the log linear expansion of the multivariate Bernoulli density. Separate tuning pro- cedures are proposed for the LASSO step and then the parametric step and a novel computational algorithm for the LASSO step is developed to handle the large number of unknowns in the problem. The method is applied to data from the Beaver Dam Eye Study and is shown to expose physiologically interesting interacting risk factors. In a study of progression of myopia in an older cohort, it is found in this group that the risk for smokers is reduced by taking vitamins, while the risk for non-smokers is independent of the “taking vitamins” variable, which is in agreement with the general result that smoking reduces the absorption of vitamins, and certain vitamins have been
B The GACV Score 24
C Minimizing the Penalized Log Likelihood Function 27
D Effect of Coding Flips 30
1 Introduction
We consider the problem which occurs in large demographic studies when there is some dichotomous outcome with a large number of risk factors that interact in complicated ways to induce elevated risk. Our goal is to search for important patterns of multiple risk factors among a very large number of possible patterns, with results that are easily interpretable. In the examples motivating this study both dichotomous, polychotomous and continuous variables were available, but for the purposes of this work we assume that all variables are binary or that they have been dichotomized before the analyses considered here. As we will see, this allows the study of higher order interactions than would be possible with risk factors with more than two possible values. There are many approaches that can model data with binary covariates and binary re- sponses, see, for example [3] [4] [25] [26]. The method that we are proposing here takes ac- count of correlations between clusters of predictors in a particularly transparent way, which is different than the above methods. The main core of the method is global, in that it deals with a very large number of patterns simultaneously, as opposed to tree-structured methods or greedy (sequential) algorithms which constitute much of the literature in this area. We will be using LASSO ideas on problems with many potential binary predictors and binary outcomes with many possible patterns of predictor variables to consider. For Gaussian data the LASSO was proposed in [27] as a variant of linear least squares ridge regression with many predictor variables. As proposed there, the LASSO minimized the residual sum of squares subject to a constraint that the sum of absolute values of the coefficients be less than some constant, say t. This is equivalent to minimizing the residual sum of squares plus a penalty which is some multiple λ (depending on t) of the sum of absolute values (l 1 penalty). It was demonstrated there that this approach tended to set many of the coefficients to zero, result- ing in a sparse model, a property not generally obtaining with quadratic penalties. A similar idea was exploited in [5] to select a good subset of an overcomplete set of nonorthogonal
wavelet basis functions. The asymptotic behavior of LASSO type estimators was studied in [15], and [23] discussed computational procedures in the Gaussian context. More recently [7] discussed variants of the LASSO and methods for computing the LASSO for a continuous range of values of λ, in the Gaussian case. Variable selection properties of the LASSO were examined in [18] in some special cases, and many applications can be found on the web. In the context of nonparametric ANOVA decompositions [33] used an overcomplete set of basis functions obtained from a Spline ANOVA model, and used ℓ 1 penalties on the coefficients of main effects and low order interaction terms, in the spirit of [5]. The present paper uses some ideas from [33]. Other work in the context of nonparametric ANOVA decompositions have implemented ℓ 1 penalties along with quadratic (reproducing kernel square norm) penalties to take advantage of their sparsity-introducing properties, see for example [11] [17] [32]. In this paper, we present a two-stage method for logistic regression, beginning with a complete (in a sense to be defined) or otherwise overly large set of basis functions, embody- ing a large number of potential patterns. In the first stage the number of potential patterns is reduced, in the spirit of [5] and [33], by penalized likelihood with an l 1 penalty on the coefficients, the penalty parameter being chosen here in a conservative way for model selec- tion. This is followed by a tuned parametric logistic regression in the second stage, to select out significant patterns. Finally, pseudo-significance tests can be carried out by scrambling the response data in various ways to see to what extent, if any the two stage method pro- posed will generate spurious patterns like the ones found. In constructing the patterns, for ease of interpretation we assume that it is known a priori in which direction each variable promotes risk, if it does. In practice this condition may be violated by a very small number of variables without losing interpretability. With this assumption the method is turning out to be extremely useful in teasing out and interpreting high order interactions in complex demographic data sets that are not readily found by more classical methods. The rest of the article is organized as follows. In Section 2 we describe the first (LASSO) step of the LASSO-Patternsearch algorithm, including choosing the smoothing parameter by Generalized Approximate Cross Validation (GACV) and a discussion of an efficient algo- rithm for the LASSO step. Section 3 describes the second step of the LASSO-Patternsearch algorithm, along with a modification of the GACV for use in a sequential pattern deletion procedure. Section 4 presents two simulation examples, and Section 5 presents two real examples from Beaver Dam Eye Study. Section 6 notes some generalizations, and, finally, Section 7 gives a summary and conclusions. Appendix A relates the patterns to the log
p(x) = P rob{y = 1|x} and the logit f (x) = log[p(x)/(1−p(x))], we estimate f by minimizing
Iλ(y, f ) = L(y, f ) + λJ(f ) (1)
where L(y, f ) is (^) n^1 times the negative log likelihood:
L(y, f ) = n^1
∑^ n i=
[−yif (x(i)) + log(1 + ef^ (x(i)))] (2)
with
f (x) = μ +
N∑B − 1 ℓ=
cℓBℓ(x), (3)
where we are relabeling the NB − 1 (non-constant) patterns from 1 to NB − 1, and
J(f ) =
N∑B − 1 ℓ=
|cℓ|. (4)
In Step 1 of the LASSO-Patternsearch we minimize (1) using the GACV criteria to choose λ. The next section describes the GACV and the kinds of results it can be expected to produce.
The tuning parameter λ in (1) balances the trade-off between data fitting and the sparsity of the model. The bigger λ is, the sparser the model is, and vice versa. The choice of λ is generally a crucial part of penalized likelihood methods and machine learning techniques like the Support Vector Machine. For the smoothing spline models with Gaussian data, [29] proposed the ordinary cross validation (OCV). The generalized cross validation (GCV), derived from OCV was suggested by [6][10], and theoretical properties were obtained in [19]. For smoothing spline models with Bernoulli data and quadratic penalty functionals, [31] derived Generalized Approximate Cross Validation (GACV) from an OCV estimate following the method used to obtain GCV. In [33] GACV was extended to the case of Bernoulli data with l 1 penalties. We will use the GACV of [33], however, in the present context the expression for it reduces to a much simpler form, so we shall provide a simplified derivation. The GACV score derivation begins with a leaving out one likelihood to minimize an estimate of the comparative Kullback-Leibler distance (CKL) between the true and estimated
model distributions. The ordinary leave out one cross validation score for CKL is
CV (λ) =^1 n
∑^ n i=
[−yif (^) λ− i + log(1 + efλ(x(i)))], (5)
where fλ is the minimizer of the objective function (1), and f (^) λ[− i]is the minimizer of (1) with the ith data point left out. In Appendix B, a series of approximations and an averaging step are described, which results in the much easier to compute GACV (λ) score:
GACV (λ) = n^1
∑^ n i=
[−yifλi + log(1 + efλi^ )] +^1 n (^1 ntrH)
∑n i=1 yi(yi^ −^ pλi) (1 − (^1) n NB 0 ) ,^ (6)
here H = B∗(B∗′W B∗)−^1 B∗′, where W is the n × n diagonal matrix with iith element the estimated variance at x(i) and B∗^ is the n × NB 0 design matrix for the NB 0 non-zero cℓ in the model, see Appendix B.
The GACV criteria, being a predictive criteria rather than a direct model selection criteria (e. g. say, like BIC) is conservative in selecting basis functions (patterns) in that it has a high probability of including true patterns but at the expense of including noise patterns. We consider this a desirable feature, since in exploratory analysis, the cost of losing a true pattern is greater than that of including a noise pattern, since confirmatory analysis can take place later. We illustrate this with a small simulation. In this example, we generate 7 independent variables from Bernoulli(0.5). The sample size is 1000 and the true logit function contains two main effects, one size-two pattern and one size-three pattern:
f (x) = −2 + 1. 5 B 1 (x) + 1. 5 B 7 (x) + 1. 5 B 23 (x) + 2B 456 (x). The experiment was repeated 100 times. Table 1 shows the appearance frequency and estimate of the patterns in the model selected by Step 1 of the LASSO-Patternsearch. The LASSO never misses any of the patterns in the 100 runs. However, the average model size (the number of patterns) is 13.9 and the standard deviation is 4.32. This model size is much bigger than the true model size, 4. Figure 1 is the histogram of model sizes. It can be seen that the smallest model in the 100 runs still has 8 patterns. Table 1 also presents the mean estimates of the parameters (numbers in the parenthesis are standard deviations). Clearly they are much smaller than the truth. The l 1 penalty is
exploit the structure of the problem are needed. We use an algorithm that formulates the problem as nonlinear optimization with nonnegativity constraints and makes use of the fact that first and second partial derivatives of the function in (1) with respect to the coefficients cℓ are easy to compute analytically once the function has been evaluated at these values of cℓ. It is not economical to compute the full Hessian (the matrix of second partial deriva- tives), so the algorithm computes only the second derivatives of the log likelihood function L with respect to those coefficients cℓ that appear to be nonzero at the solution. (Generally, just a small fraction of these NB − 1 coefficients are nonzero at the solution.) It applies a gradient projection approach, using the partial Hessian to compute a Newton-like step in the apparently-nonzero coefficient. Rapid convergence is obtained once the correct nonzero coefficients are identified correctly. The algorithm is particularly well suited to solving the problem (1) for a number of different values of λ in succession; the solution for one value of λ provides an excellent starting point for the minimization with a nearby value of λ. Further details of this approach can be found in Appendix C.
3 The LASSO-Patternsearch Algorithm - Step 2
In Step 2 of LASSO-Patternsearch algorithm, the NB 0 patterns surviving Step 1 are entered into a linear logistic regression model, (glmfit in MATLAB is used here). Pattern coefficients and p-values for the significance level of each coefficient are obtained. With n ∼ 1000 and the pattern sizes obtained in Figure 1, there was no difficulty executing the code. The simplest procedure is just to choose a cutoff p-value arbitrarily or consistent with subjective prior information concerning the scientific problem at hand, and to delete all patterns with a p-value over the cutoff. Here the p-value cutoff is “tuned” as follows: The patterns are rank ordered by their p-values, from high to low, as 1, 2 , · · · , NB 0. A tuning score is computed for the model with all NB 0 patterns. Then the rank 1 pattern is deleted and the model refitted with the remaining NB 0 − 1 patterns and the tuning score computed. Then the rank 2 pattern is deleted, and so forth, until there are no patterns left. A final model is chosen from the pattern set with the best tuning score. It is possible for the rank orders to change as patterns are removed, however, for simplicity the initial rank ordering is maintained, and it is believed that the results would not, in general be different. Note that all subset selection is not being done, since this will introduce an overly large number of degrees of freedom into
the process. If a tuning set is available, then it can be used to create the tuning score. It is easy to derive a GACV score which could be used as the tuning score for each model. The Leave-Out-One Lemma still holds, the proof is simple, and the derivation of the GACV score follows that in Appendix B. The only difference is that the likelihood function is now smooth with respect to the parameters so the robust assumption that appears in Appendix B is not needed. All columns of the design matrix will be in the GACV score. That’s not a problem because the matrix is much smaller than that in Step 1. Let Bs be the design matrix here, then the GACV score for logistic regression is the same as (6) with H = Bs(B s′W Bs)−^1 B s′. However, it was found that using the GACV score was still conservative in problems at the scale (e.g. n, NB 0 , signal strength) of the data sets of interest here, retaining too many noise patterns, both in simulations and in retrospective examination of analyses (to be described) of observational data. Thus, a weight was inserted in front of the second term, the so called ”optimism” part of the GACV score. In the spirit of the variable selection criteria BIC the weight logn/2 was chosen, to obtain the BIC-type GACV (BGACV):
BGACV (λ) =^1 n
∑^ n i=
[−yifλi + log(1 + efλi^ )] + log 2 n^1 n (^1 ntrH)
∑n i=1 yi(yi^ −^ pλi) (1 − (^1) n NB 0 ).^ (7) The BGACV scores are compared for each model in the ordered sequence of models obtained above, and the model with the smallest BGACV score is taken as the final model. The following is a summary of the LASSO-Patternsearch algorithm:
For simulated data, the results can be compared with the simulated model. For observational data, a post-analysis step involving scrambling the data will be used to validate the results.
4 Simulation Studies
In this section we study the empirical performances of the LASSO-Patternsearch algorithm through two simulated examples. The first example continues with simulated data of Section
(^04 6 8 10 12 14 16 )
10
20
30
40
50
60
70
model size
frequency
(^03 4 5 6 7 8 9 )
10
20
30
40
50
60
70
model size
frequency
Figure 2: Histogram of sizes of the final models. Left panel: Simulation Example 1. Right panel: Simulation Example 2.
92 out of 100 times, B 2 shows up with an estimate close to 1.5 and B 23 shows up with an estimate close to -1.5. Therefore, we can tell that the direction of x 3 is wrong. This is a general phenomenon, coding in the wrong direction can only increase the number of patterns and include negative coefficients, and too many coding “errors” will make the results difficult to interpret. See Appendix D.
All the covariates in the last example are independent. In the real world however, risk factors within a pattern are frequently correlated to various degrees. In this example, we intentionally introduce some correlation among the risk variables and increase the number of risk factors to 8 and the sample size to 2000. Only the first five variables are correlated and the first seven are related to the risk. x 1 is generated from a Bernoulli distribution with a success rate of 0.5. x 2 and x 3 are conditionally independent given x 1. When x 1 is 1, both of them follow Bernoulli(0.7). When x 1 is 0, both of them follow Bernoulli(0.5). x 4 and x 5 are conditionally independent given x 2 and x 3. When both x 2 and x 3 are 1, x 4 and x 5 follow Bernoulli(0.7). When only one of x 2 and x 3 is 1, x 4 and x 5 follow Bernoulli(0.5). When none of x 2 and x 3 is 1, x 4 and x 5 follow Bernoulli(0.3). x 6 , x 7 and x 8 are independently generated from Bernoulli(0.5). They are also independent of the first five variables. The true logit
function f is
f (x) = −2 + 1. 5 B 5 (x) + 1. 5 B 23 (x) + 1. 5 B 34 (x) + 2B 123 (x) + 2B 4567 (x)
The simulation is run 100 times and the result is shown in Table 3.
constant B 5 B 23 B 34 B 123 B 4567 frequency 100 99 98 100 99 87 estimate -2.01(0.12) 1.50(0.15) 1.54(0.21) 1.49(0.18) 2.02(0.36) 2.16(0.44) true coef -2 1.5 1.5 1.5 2 2
Table 3: The appearance frequency and estimate of patterns in simulation example 2
All terms showed up almost every time except the size-four pattern I 4567. It appeared 87 times. It is very difficult to get high appearance frequency for high order patterns because often times, very few observations have these patterns. For example, suppose the four covariates are independent (they are not in the above example) and their occurrence rates are p 1 , p 2 , p 3 and p 4. We can only expect to see np 1 p 2 p 3 p 4 subjects with this pattern, where n is the sample size. This feature, on the other hand, tells us that the variables in high order patterns are probably correlated. Figure 2 Right panel shows the histogram of model sizes in the 100 runs. Again, the correct model size was found around 70% of the time. Unlike simulation example 1, there were a few cases where the model size is smaller than the true model size. This is caused by the low appearance frequency of B 4567.
5 The Beaver Dam Eye Study
The Beaver Dam Eye Study (BDES) is an ongoing population-based study of age-related ocular disorders. The purpose of the study is to collect information on the prevalence and incidence of age-related cataract, macular degeneration, and diabetic retinopathy. Between 1987 and 1988, a private census identified 5924 people aged 43 through 84 years in Beaver Dam, WI. 4926 of these people participated the baseline exam (bd1) between 1988 and 1900. Five (bd2), ten (bd3) and fifteen (bd4) year follow-up data have been collected and there have been several hundred publications on this data. A detailed description of the study can
substantial smoking history. catct has five levels of severity and we cut it at the third level. Aspirin (asa) and vitamin supplements (vtm) are commonly taken to maintain good health so we treat not taking them as risk factors. y is assigned 1 if a person’s refraction has changed more than -0.75 diopters from the baseline exam to the five-year follow-up and 0 otherwise. There are 1374 participants in this age group at the baseline examination, of which 952 have measurements of refraction at the baseline and the five-year follow-up. Among the 952 people, 76 have missing values in the covariates. We assume that the missing values are missing at random for both response and covariates, although this assumption is not necessarily valid. However the examination of the missingness and possible imputation are beyond the scope of this study. Our final data consists of 876 subjects without any missing values in the seven risk factors. Table 5 lists all the patterns selected by Step 1 of the LASSO-Patternsearch. There are 12 patterns plus the constant. We can easily handle 13 variables in a linear logistic regression, with the sample size here. Figure 3 shows the results of Step 2 of the LASSO- Patternsearch. Each point represents a pattern, in the order as they appeared in Table
Pattern Estimate Pattern Estimate 1 constant -3.4435 2 catct 2. 3 asa 0.8336 4 vtm 0. 5 inc pky -0.3034 6 catct asa 0. 7 catct vtm -0.7967 8 pky vtm 1. 9 sex asa vtm -0.6961 10 inc catct pky 0. 11 inc pky vtm 0.4848 12 sex inc jomyop asa 2. 13 sex inc catct asa 1.
Table 5: Patterns selected at Step 1 in the progression of myopia data.
0 2 4 6 8 10 12 14
−
−
−
−
0
1
2
3
4
Estimate
Pattern
Figure 3: The patterns that survived Step 1 of the Lasso-Patternsearch and the confidence intervals at Step 2 of the Lasso-Patternsearch in the progression of myopia data
The (refitted) model is f (catct, pky, vtm, sex, inc, jomyop, asa) = − 3 .29 + 2. 42 ∗ cact +
significance level of this pattern in the final model was .031 (= 1-.969), (the last pattern to be included). Comparing (11+1)/600) = .02 we get a rough agreement between the false pattern rate for this pattern from the simulation and the final model, although it could be argued that a higher noise rate could be expected to appear in the scrambling due to the two step “mining”.
−
−
0
−
− 0 5 10
10
−6 0 10 20 30 40
−
−
−
−
−
0
−6 0 10 20 30
−
−
−
−
−
0
10
−6 0 2 4 6 8 10
−
−
−
−
−
0
Figure 4: Scatter plots of log 10 p-values of noise patterns surviving the LASSO-Patternsearch in the completely reassigned data, 600 runs. Left to right, top to bottom: patterns of size 1, 2, 3 and 4.
Pigmentary abnormality is an early sign of age-related macular degeneration and an impor- tant cause of visual loss in people 65 years of age or older. It is defined by the presence of
retinal depigmentation or increased retinal pigmentation in association with retinal drusen. Based on previous work, age is a strong risk factor for pigmentary abnormality and age- related macular degeneration. In [13] the association between cardiovascular disease and its risk factors and the incidence of age-related macular degeneration was examined. It was shown in [12] that hormone replacement therapy has a weak protective effect while [24] and [22] said a history of heavy alcohol consumption has a deleterious effect for pigmentary ab- normality. The association of pigmentary abnormality with six attributes was studied by the “univariate” penalized logistic regression in [21]. The same data was studied by [9] in the setting of penalized multivariate logistic regression. The six variables used there were: current usage of hormone replacement therapy, history of heavy drinking, body mass index, age, systolic blood pressure and serum total cholesterol. Similar to the progression of myopia data, we only chose people aged 60 through 69 years as age is a special risk factor. The seven variables we use and their binary cut points are listed in Table 7. One interesting variable here is chol. We all know that high serum total cholesterol is associated with higher risk of cardiovascular disease morbidity and mortality, but both [21] and [9] found a protecting effect of serum total cholesterol for pigmentary abnormality. Therefore, we define lower serum total cholesterol as higher risk accordingly. High systolic blood pressure and high body mass index increased the risk and current usage of hormone replacement therapy was seen to reduce the risk in [9], where serum total cholesterol, systolic blood pressure and body mass index were treated as continuous variables. All the risk factors are the measurements at the baseline examination and the response is the status of pigmentary abnormality at the five-year follow-up. There are 2762 women at the baseline exam, of whom 749 are between the age of 60 and 69 years. After deleting the missing values, there remains 561 people. The incidence rate of pigmentary abnormality is 82/561 = 14.6%. The model fitted by LASSO-Patternsearch is:
f (x) = − 2 .1793 + 0. 6238 ∗ sys + 0. 9640 ∗ bmi ∗ hormone ∗ chol. (8)
Two terms showed up in the final model: the main effect of sys and a size-three pattern (bmi hormone chol). Remembering that the risk direction of chol is coded as lower serum total cholesterol and the risk direction of hormone is coded as no current usage of hormone this result agrees with [21] and [9], although less information is being used here since the data has been dichotomized. hist doesn’t show up partly because there are very few (30) women with the history of heavy drinking in this age group.