Heritability Estimates in Twin Studies: Comparing NACE and Falconer's Formula, Assignments of Mathematical Modeling and Simulation

This research paper investigates the accuracy of heritability estimates in twin studies using two models: nace and falconer's formula. The study explores the impact of violating model assumptions, such as non-normality of data and unequal variances, on the accuracy of heritability estimates. The paper also examines the influence of varying sample sizes and variance components on the results. The findings highlight the importance of considering model assumptions and data characteristics when interpreting heritability estimates in twin studies.

Typology: Assignments

2024/2025

Uploaded on 02/14/2025

wanglihui04
wanglihui04 🇺🇸

15 documents

1 / 10

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
ISYE 6644: SIMULATION: EVALUATING CONVERGENCE OF HERITABILTIY ESTIMATES
ACROSS TRAITS OF INTEREST VIA GENETIC MODELS AND SIMULATION
Georgia Institute of Technology, USA
ABSTRACT
Assessing the genetic impact of an individual is of crucial im-
portance across a variety of domains in human health, disease,
and function. Inferring this impact necessitates the utilization
of twin data and genetic modelling techniques. Such genetic
models make inherent assumptions about the data, and it is
of interest to investigate the model performance across dif-
ferent conditions using simulation. For two models, Normal
ACE model and Falconer’s method, input data simulated from
bivariate Lagrangian Poisson (blgp), multivariate t, and bi-
variate normal and relevant modulations. It was found that
an increase in sample size resulted in a decrease in the av-
erage and true standard error of genetic estimates. When the
data was normally distributed, the standard error (average and
true) converged. The greater the normality of the simulated
twin data, the greater performance both models had across
coverage rates and average and true standard errors. When the
model assumptions for ACE were violated (non-normality of
data, unequal variances across MZ and DZ groups) the model
resulted in biased estimates of heritability and associated cov-
erage rates. Lastly, the coverage rates for Falconer’s model
appears to contradict that found in [1], but otherwise the re-
sults strongly converge to that of [1].
Index TermsTwin study, heritability, simulation
1. BACKGROUND & DESCRIPTION OF PROBLEM
Assessing the enviornmental and genetic impact of an indi-
vidual is of crucial importance across a variety of domains in
human health, disease, and function. Inferring these factors
necessitates either strict genetic data (one’s genome) or, more
often, the utilization of twin data [1]. Twins can be separated
into two types, monozygotic (MZ) or dizygotic (DZ) [1].
What defines a twin in general is that they are born at the
same time but in which two distinct processes had occurred
even so. Specifically, MZ twins share 100% of their genomic
information with one another, and DZ twins share 50% on
average [1]. The similarity in MZ twins, and the fact that
they were born at the same time point (essentially), allows for
any differences in the twins to be more indicative of environ-
mental influences than that of genetic influences (since, by
definition, their genomes are identical). However, this attri-
bution of genetic influences can also be decomposed further
into additive genetics and dominance (the second of which is
not an interest in this study) [1]. Specifically, it is possible,
using both MZ and DZ twins to evaluate heritability which
is a function of the genetic components (defined differently
per model) of the trait of interest. Two models widely used
in the twin study literature (and genetics literature) are Fal-
coner’s Formula and the Normal ACE (NACE) model, both
of which make implict assumptions about the input data from
MZ and DZ twins, and both of which have their own benefits
and disadvantages. Hence, it is important to talk about the
mathematical properties of both models.
The NACE model utilizes MZ and DZ twin data in order
to model the heritability of a given trait of interest. In doing
so, it effectively breaks down the trait covariance of each
MZ or DZ twin into additive genetics (A), common shared
environment (C), and nonshared environment (E) variance
components [1]. This approach is done via structural equa-
tion modelling (SEM). For the NACE model, notably, there
are key assumptions which allow it to achieve the results it
does. Firstly, any given trait of interest (which is what would
be investigated using the data of the MZ and DZ twin pairs
that one has access to) is normally distributed [1]. Secondly,
the NACE model assumes that the ACE variance parameters
are equal for both MZ twin pairs and DZ twin pairs [1].
In contrast, another model, Falconer’s Formula, which is
a distribution free method of moment estimators [1] makes no
assumptions regarding the variance across MZ twin pairs and
DZ twin pairs (i.e. they are allowed to be unequal), with the
additional assumption that the proportion of the total variance
(for genetics or environmental effects) are the same [1].
Therefore there are key differences in assumptions be-
tween the NACE model and Falconer’s formula. Further-
more, given the wide use of these models across the literature
in looking at differently distributed traits (which are of wide
interest in a variety of fields as they relate to genetics, human
behavior, disease, and biomarkers), there is a great need to
evaluate the effect of these assumptions that these models
hold on the resultant parameter estimates in these scenarios.
In doing so, this work can be considered for future traits of
interest when utilizing models, the approach can be consid-
ered for other models of interest, and modulations therein can
be expanded upon based on the implementation herein [1].
pf3
pf4
pf5
pf8
pf9
pfa

Partial preview of the text

Download Heritability Estimates in Twin Studies: Comparing NACE and Falconer's Formula and more Assignments Mathematical Modeling and Simulation in PDF only on Docsity!

ISYE 6644: SIMULATION: EVALUATING CONVERGENCE OF HERITABILTIY ESTIMATES

ACROSS TRAITS OF INTEREST VIA GENETIC MODELS AND SIMULATION

Georgia Institute of Technology, USA

ABSTRACT

Assessing the genetic impact of an individual is of crucial im- portance across a variety of domains in human health, disease, and function. Inferring this impact necessitates the utilization of twin data and genetic modelling techniques. Such genetic models make inherent assumptions about the data, and it is of interest to investigate the model performance across dif- ferent conditions using simulation. For two models, Normal ACE model and Falconer’s method, input data simulated from bivariate Lagrangian Poisson (blgp), multivariate t, and bi- variate normal and relevant modulations. It was found that an increase in sample size resulted in a decrease in the av- erage and true standard error of genetic estimates. When the data was normally distributed, the standard error (average and true) converged. The greater the normality of the simulated twin data, the greater performance both models had across coverage rates and average and true standard errors. When the model assumptions for ACE were violated (non-normality of data, unequal variances across MZ and DZ groups) the model resulted in biased estimates of heritability and associated cov- erage rates. Lastly, the coverage rates for Falconer’s model appears to contradict that found in [1], but otherwise the re- sults strongly converge to that of [1].

Index Terms— Twin study, heritability, simulation

1. BACKGROUND & DESCRIPTION OF PROBLEM

Assessing the enviornmental and genetic impact of an indi- vidual is of crucial importance across a variety of domains in human health, disease, and function. Inferring these factors necessitates either strict genetic data (one’s genome) or, more often, the utilization of twin data [1]. Twins can be separated into two types, monozygotic (MZ) or dizygotic (DZ) [1]. What defines a twin in general is that they are born at the same time but in which two distinct processes had occurred even so. Specifically, MZ twins share 100% of their genomic information with one another, and DZ twins share 50% on average [1]. The similarity in MZ twins, and the fact that they were born at the same time point (essentially), allows for any differences in the twins to be more indicative of environ- mental influences than that of genetic influences (since, by definition, their genomes are identical). However, this attri-

bution of genetic influences can also be decomposed further into additive genetics and dominance (the second of which is not an interest in this study) [1]. Specifically, it is possible, using both MZ and DZ twins to evaluate heritability which is a function of the genetic components (defined differently per model) of the trait of interest. Two models widely used in the twin study literature (and genetics literature) are Fal- coner’s Formula and the Normal ACE (NACE) model, both of which make implict assumptions about the input data from MZ and DZ twins, and both of which have their own benefits and disadvantages. Hence, it is important to talk about the mathematical properties of both models. The NACE model utilizes MZ and DZ twin data in order to model the heritability of a given trait of interest. In doing so, it effectively breaks down the trait covariance of each MZ or DZ twin into additive genetics (A), common shared environment (C), and nonshared environment (E) variance components [1]. This approach is done via structural equa- tion modelling (SEM). For the NACE model, notably, there are key assumptions which allow it to achieve the results it does. Firstly, any given trait of interest (which is what would be investigated using the data of the MZ and DZ twin pairs that one has access to) is normally distributed [1]. Secondly, the NACE model assumes that the ACE variance parameters are equal for both MZ twin pairs and DZ twin pairs [1]. In contrast, another model, Falconer’s Formula, which is a distribution free method of moment estimators [1] makes no assumptions regarding the variance across MZ twin pairs and DZ twin pairs (i.e. they are allowed to be unequal), with the additional assumption that the proportion of the total variance (for genetics or environmental effects) are the same [1]. Therefore there are key differences in assumptions be- tween the NACE model and Falconer’s formula. Further- more, given the wide use of these models across the literature in looking at differently distributed traits (which are of wide interest in a variety of fields as they relate to genetics, human behavior, disease, and biomarkers), there is a great need to evaluate the effect of these assumptions that these models hold on the resultant parameter estimates in these scenarios. In doing so, this work can be considered for future traits of interest when utilizing models, the approach can be consid- ered for other models of interest, and modulations therein can be expanded upon based on the implementation herein [1].

This paper subequently seeks to use the methods in [1] in the same manner to see how simulation applies in the con- text of evaluating these models’ heritability estimates, across NACE model and Falconer’s formula. By validating the find- ings in the paper, this will be an important step forward in looking at additional models of interest, and understanding more in-depth the process in which models can be assessed (whether it be for heritability or other metrics as well). The upcoming sections of this paper are outlined as fol- lows: Section 2) the mathematical theory of each model (NACE and Falconer), relevant assumptions and mathemat- ical setup of how the analysis will be performed; section

  1. discussion of the application, or general framework of how to do the simulation analysis and associated simulation concepts. Afterwards, in section 4), resulting visual demon- strations of the simulation will be shown for the various steps outlined in 3), and the results specific to the model compar- isons will be demonstrated. Finally in 4), there will be a discussion of the results, and additionally whether it validates the paper of interest, and 5) the conclusion will state the over- all takeaways along with planned future work relevant to the results. In brief, the techniques that will be used to investigate this question are a) input analysis (determining optimal dis- tributions of interest for the traits selected based on literature review), b) simulation of data based on the input data analysis distribution, c) using both NACE and Falconer’s methods on the simulated data, and d) output analysis using confidence intervals on the resultant model parameter estimates.

2. THEORY

2.1. Models

Following mostly from [1] (unless stated otherwise) the fol- lowing equations relevant to models of interest (NACE and Falconer’s) can be derived. From here on, the explanation will be from understanding or from the paper [1], and any deriva- tions by hand will be noted. First, define the total amount of twin pairs for MZ twins as NM Z and that of DZ twins as NDZ , for a total number of twin pairs as N = NM Z + NDZ. For any given zygosity (MZ or DZ), z, define yz = (yz 1 , yz 2 ), where yz is the trait or phenotype of interest of the respective twin pair for a given zygoosity. Notably we can decompose this yz then as yz = xTz β + Az + Cz + Ez , in which xTz is a 2 × P matrtix of covariances for both twins (note we will expect that the expectation E(yz ) = 0 for all problems hence- forth, i.e. that there will be no covariates). It then follows that cov(yz ) = σz = cov(Az ) + cov(Cz ) + cov(Ez ) and to cre- ate the model to have randomness, we define the parameters to be distributed as: Az ∼ ( 0 , σ^2 Az Kz ), Cz ∼ ( 0 , σ^2 Cz J), Ez ∼

( 0 , σ^2 Ez I), where we define the matrices:

I =

[

]

, J =

[

]

, Kz =

[

1 wz wz 1

]

Here is where wz is defined as the ’genome relationship ma- trix’ where wz = 1 for MZ twins as they share 100% of their genes, and wz = 0. 5 for DZ twins since they share 50% of their genes. Based on these equations, then the overall vari- ation of a phenotype can be decomposed into σ^2 Az , σ^2 Cz , σ^2 Ez , which represents the additive genetic, shared environment, and unshared environment variances for twin of type zygos- ity z. Then, using these parameters as a baseline, we can estimate the heritability, an indication of the level of genetic importance of a trait as (the proportion of total trait variance due to additive genetics):

h^2 =

σ A^2 M Z σ^2 AM Z + σ C^2 M Z + σ^2 EM Z

σ^2 ADZ σ A^2 DZ + σ^2 CDZ + σ E^2 DZ

Relatedly, the shared environment estimate (that is, the pro- portion of trait variance due to shared environmental effects) can be derived as:

c^2 =

σ^2 CM Z σ^2 AM Z + σ^2 CM Z + σ E^2 M Z

σ^2 CDZ σ^2 ADZ + σ^2 CDZ + σ E^2 DZ

These parameters (along with e^2 = (1 − h^2 − c^2 ) can be derived using NACE and Falconer’s formula based on their inherent assumptions. Then the multivariate normal distribution is given as (not from paper):

yz = (yz 1 , yz 2 ) ∼ N 2 (μ = 0 , Σz)

= f (yz) =

( 2 π)d/^2

det Σz

exp

yzT Σ−z 1 yz

Where d is the number of dimensions (in this case d = 2, so the exponent for the 2 π term goes to 1 .). We can then find the log likelihood (derived by hand) such that:

log(f (yz |α)) = log(1) − log(2π) −

log(det Σz ) −

yTz Σ−z 1 yz

log(f (yz |α)) = −

log(det Σz ) −

yTz Σ− z 1 yz − log( 2 π)

log(f (yz |α)) = −

log(|Σz |) −

yTz Σ− z 1 yz − log( 2 π)

log(f (yz |α)) = − 0 .5(log(|Σz | + yzT Σ− z 1 yz + 2 log(2π))

where Σz =

[

σ A^2 + σ^2 C + σ E^2 wz σ^2 A + σ^2 C wz σ A^2 + σ^2 C σ^2 A + σ^2 C + σ E^2

]

, and α = (σ^2 A, σ C^2 , σ E^2 ). Further, the NACE model then assumes that:

c^2 space.

∂c^2 ∂

σ C^2

(σ^2 A + σ E^2 )(

σ E^2 ) (σ^2 A + σ C^2 + σ^2 E )^2 ∂c^2 ∂

σ A^2

σ^2 C (

σ A^2 ) (σ^2 A + σ C^2 + σ^2 E )^2 ∂c^2 ∂

σ E^2

σ^2 C (

σ E^2 ) (σ^2 A + σ C^2 + σ^2 E )^2

∇hc( ˆ

α) =

( (^) ∂c 2

σ A^2

∂c^2 ∂

σ^2 C

∂c^2 ∂

σ^2 E

From there, we can then compute the standard errors directly using the delta method as aforementioned. Once that is done, we can compute 95% confidence intervals using a zα 2 score equal to 1.96 (α = 0. 05 ), and then construct the confidence intervals such that:

h^2 ± zα/ 2 (SE) c^2 ± zα/ 2 (SE)

where the standard error for either h^2 or c^2 is derived again by the delta method (once we have the variance var(h(

αˆ) as shown in the delta method we just have to take the square root and we would get our transformed standard error for both functions). Then, similar to the NACE model, Falconer’s formula is serived as follows, with the heritability h^2 and shared env- iornment estimates c^2 as:

ˆh^2 F alc = 2(rM Z − rDZ ), ˆc^2 F alc = 2rDZ − rM Z

where rM Z is the pearson correlation within MZ twin pairs, and rDZ is the pearson correlation within DZ twin pairs. Fal- coner’s estimators are then derived [1] as:

ρM Z = Corr(yM Z 1 , yM Z 2 ) =

covM Z var(yM Z )

=

σ A^2 M Z + σ C^2 M Z σ^2 AM Z + σ^2 CM Z + σ E^2 M Z

= h^2 + c^2

ρDZ = Corr(yDZ 1 , yDZ 2 ) =

covDZ var(yDZ )

=

σ 02. 5 ADZ + σ^2 CDZ σ^2 ADZ + σ C^2 DZ + σ E^2 DZ

= 0. 5 h^2 + c^2

By definition from the previous defined equations Falconer’s heritability and shared environment are then:

⇒ 2(ρM Z − ρDZ ) = h^2 2 ρDZ − ρM Z = c^2

where ρM Z and ρDZ are the population correlation coeffi- cients within MZ twins and within DZ twins respectively. One possible means of computing the standard error for Fal- coner’s formulas is by considering the asymptotic variance of the pearson correlation coefficient which ultimately yields the following standard error for the two estimates of interest:

SE^ ˆ(hˆ^2 F alc) ≈

4( ˆvar(rM Z ) + varˆ(rDZ )

( (^) (1 − r 2 M Z ) 2 NM Z

(1 − r DZ^2 )^2 NDZ

SE^ ˆ(ˆc^2 F alc) ≈

4( ˆvar(rDZ ) + varˆ (rM Z )

( (^) (1 − r 2 DZ ) 2 NDZ

(1 − r^2 M Z )^2 NM Z

Once these are obtained in the study of interest, the confi- dence intervals for h^2 and c^2 will be constructed using a 95% confidence interval in the same way as NACE (except no need for the delta method). Therefore, it would be calculated again such that:

h^2 ± zα/ 2 (SE) c^2 ± zα/ 2 (SE)

2.2. Distributions

Three distributions will be used in this study and they are de- rived as follows (no particular order). First, the heavy weighted multivariate t-distribution is given as [1]:

yz = (yz 1 , yz 2 ) ∼ f (yz )

=

Γ( ν+2 2 ) Γ( ν 2 )νπ|Σz |^1 /^2

[

ν

yTz Σ− z 1 yz

] −(ν 2 +2)

yz = (yz 1 , yz 2 ) ∼ bLGP (σ^2 A + σ C^2 + σ^2 E , λ)

Next, we can also define and sample discrete count variables from the bivariate Lagrangian Poisson (BLGP) [1] distribu- tion eas follows for MZ twins:

Q 0 ∼ LGP (σ A^2 + σ^2 C , λ) Q 1 , Q 2 ∼ LGP (σ^2 E , λ) Y 1 = Q 0 + Q 1 Y 2 = Q 0 + Q 2 ⇒ Y 1 , Y 2 ∼ bLGP (σ^2 A + σ C^2 + σ^2 E , λ)

and for DZ twins:

Q 0 ∼ LGP (0. 5 σ^2 A + σ C^2 , λ) Q 1 , Q 2 ∼ LGP (0. 5 σ A^2 + σ^2 E , λ) Y 1 = Q 0 + Q 1 Y 2 = Q 0 + Q 2 ⇒ Y 1 , Y 2 ∼ bLGP (σ^2 A + σ C^2 + σ^2 E , λ)

Again, we can sample from the multivariate normal dis- tribution (bivariate in this case, with d = 2) based on the pre- defined distribution above:

yz = (yz 1 , yz 2 ) ∼ N 2 (μ = 0 , Σz)

= f (yz) =

( 2 π)d/^2

det Σz

exp

yTz Σ−z 1 yz

2.3. Other

  1. APPLICATION

First, for the distributions in theory mentioned, a single dataset distribution for 700 MZ and 700 DZ pairs was gener- ated from each (normal, BLGP, t) to visualize what the sam- ples look like (if they were expected, etc.). Specifically, for all distributions, the parameter estimates for the covariance matrix or as direct inputs were given as: σ^2 A = 0. 5 , σ C^2 = 0. 3 and σ^2 E = 0. 2. For the t-distribution specificially, the degrees of freedom ν = 4. 5 was used, and for BLGP, the value of λ = 0. 35 was used. The sample normal distribution results was also tested using the Anderson-Darling test in order to determine normality (to assure that the distribution was multivariate normal). Beyond this initial data visualizaton, three simulations were carried out based on the predefined theory. Using the same aforementioned distribution parameters, these samples were again sampled from for each simulation performed with the same parameters.

3.1. Simulation 1: Effect of distributions on parameter estimates and model

Using 1000 datasets (repetitions) of 700 MZ and DZ twin pairs each across the bivariate normal, t, and BLGP distribu- tions, the input data to both the NACE and Falconer’s methods were implemented and simulated. The average mean value h^ ¯^2 was calculated from the 1000 h^2 estimates per model, per distribution. The average standard error (derived from the according theory on delta method for NACE, and other- wise square root of the variance found in Falconer’s), SE¯ was found by averaging the resultant standard errors from each 1000 simulated estimates, per model, per distribution. The true standard error, denoted as SE in the resulting ta- bles, was found as the square root of the variance across all

1000 simulated h^2 estimates), and the coverage rate (Cov) across all 95% confidence intervals generated (again through the previous delta method, z-values and SEs as in the the- ory section), was calculated. Specifically, for coverage, the true parameter estimate was determined as those given above σ^2 A = 0. 5 , σ^2 C = 0. 3 since the resultant heritability and shared environment parameters are such that c^2 = σ C^2 /(σ^2 A + σ C^2 + σ^2 E ) = 0. 3 , and h^2 = σ^2 A/(σ^2 A + σ^2 C + σ E^2 )0. 5 by the same logic. Therefore, if the confidence intervals for a given model contained these true parameter values within the, these were included in the coverage rate (the amount of times that these true parameter values fell within the confidence interval upper and lower bounds). This was computed across each specific 1000 simulations (from the respective distributions), for a sin- gle coverage rate value per model of the estimate. This exact same procedure was done for c^2. That is, for 1000 repetitions from a given distribution (normal, t, BLGP) of 700 MZ and DZ twins each simulation, for both models (NACE and Falconer’s) respectively, the c¯^2 , SE, SE¯ and Cov all assocaited with the estimate were determined and reported. The resuls for this simulation can be found in Table 1 and Ta- ble 2. This procedure was an extension of that from [1]. This specific simulation sought to see the effect of the distributions on the different parameter estimate outputs as well as any dif- ference in the models with respect to these.

3.2. Simulation 2: Effect of sample size on heritability pa- rameter estimate

Just as in the previous simulation, 1000 datasets (repetitions) were simulated, except in this case, across N = NM Z = NDZ = 50, 100 , 200 , 400 and 700. This was an extension of that from [1]. This simulation procedure was done only on the heritability estimates compared to the previous simu- lation. The same values as in simulation 1 were reported for each sample size: h¯^2 , SE, SE¯ and Cov, all associated to the parameter estimate hˆ^2. The results from this simulation can be found in Tables 3,4 and 5 in the results section.

3.3. simulation 3: Effect of varying variance on heritabil- ity estimate for models Similar to the previous simulations, again 1000 repetitions were used, but this time the focus was on varying the related variance of the MZ input data versus that of the DZ input data to see its impact on the assumptions of the models. That is,

var(yDZ ) = var(yDZ 2 ) = σ^2 ADZ + σ^2 CDZ + σ^2 EDZ = .5 + .3 + .2 = 1 var(yM Z ) = τ ∗ var(yDZ ), τ = [0. 5 , 0. 6 , 0. 7 , 0. 8 , 0. 9 , 1]

So for every τ , there was 700 MZ and 700 DZ twins simulated with 1000 repetitions. This was then plot as in Figure 6. The bias in the average heritability estimate bias for each 1000 bias for each iteration was calculated (see formula in Fig 6).

(a) (b)

(c) (d)

Fig. 3: Heatmap of the resulting trait observations drawn from a/c) the multivariate t-distribution for MZ twins, and b/d) the multviariate t-distribution for DZ twins. Essentially, the dis- tribution represents the actual multivariate t-distribution with the parameters input into the model as σ^2 a = 0. 5 , σ^2 c =

  1. 3 , σ e^2 = 0. 2 , ν = 4. 5 for the MZ covariance matrix and for the DZ covariance matrix. As expected, for a/c) the t-distribution for MZ twin pairs sam- pled demonstrates a sharper correlation along one dimension than the other (i.e. less variance in one dimension). This re- sult would be anticipated, because MZ twins should vary less on a given trait than DZ twins by definition. Conversely, the trait for DZ twins have a wider variance across both dimen- sions, which is what would be expected given their greater dissimilarity compared to MZ twins. Note the tails of the dis- tribution are quite spread apart (as will be shown in contrast to the normal distribution in Figure 4). Note that the greatest density (or where the distribution is centered at) is toward the middle of the distribution, as would be expected. Further, the distributions are roughly symmetric.

so it does not run into this issue as shown. This is a critical demonstration and successful replication of the paper that must not be understated. For Table 1, it can be observed that regardless of the model or distribution, the ¯h^2 estimate was as expected, at a value 0.5 which is what the modelled σ A^2 was initialized to (again our actual expected heritability parameter would be this additive variance divided by the total variances which already sum up to 1, and this means that the expected heri- tability estimate would be 0.5; so for the estimate to converge

(a) (b)

(c) (d)

Fig. 4: Heatmap of the resulting trait observations drawn from a/c) the bivariate normal distribution for MZ twins, and b/d) the bivariate normal distribution for DZ twins. Essentially, the distribution represents the actual bivariate normal distribution with the parameters input into the model as σ^2 a = 0. 5 , σ^2 c = 0. 3 , σ^2 e = 0. 2 for the MZ covariance ma- trix and for the DZ covariance matrix. As expected, for a/c) the normal distribution for MZ twin pairs sampled demon- strates a sharper correlation along one dimension than the other (i.e. less variance in one dimension than the other). This result would be anticipated, because MZ twins should vary less on a given trait than DZ twins by definition. Con- versely, the trait for DZ twins have a wider variance across both dimensions, which is what would be expected given their greater dissimilarity compared to MZ twins. Note the tails of the distribution are not spread out considerably (as in contrast to the heavy tailed t-distribution in Figure 3). Note that the greatest density (or where the distribution is centered at) is toward the middle of the distribution, as would be expected. Further, the distributions are roughly symmetric.

to this value demonstrates its success). Notably, the average standard error of the heritability estimate in general varied from the ’true’ standard error estimate until the normal dis- tribution was considered. It was notable that the performance (average SE vs true SE) became slightly better from the BLGP distribution to the t-distribution for both models, and then converged during the normal distribution. This logically follows from the NACE model, because its most inherent assumption is that the data are normally distributed. This fact is further exemplified by looking at the coverage across the

(a) (b)

(c) (d)

Fig. 5: Heatmap of the resulting trait observations drawn from a/c) the bivariate lagrangian poisson (blgp) distribution for MZ twins, and b/d) the blgp for DZ twins. Essentially, the distribution represents the actual blgp dis- tribution with the parameters input into the model as σ^2 a =

  1. 5 , σ c^2 = 0. 3 , σ e^2 = 0. 2 , λ = 0. 35 for the MZ and DZ bivari- ate lagrangian poisson variance input. As expected, for a/c) the normal distribution for MZ twin pairs sampled demon- strates a sharper correlation along one dimension than the other (i.e. less variance in one dimension than the other). This result would be anticipated, because MZ twins should vary less on a given trait than DZ twins by definition. Con- versely, the trait for DZ twins have a wider variance across both dimensions, which is what would be expected given their greater dissimilarity compared to MZ twins. Note that these are discrete outcomes as that is what the Poisson dis- tribution is concerned with (as opposed to the normal and t- distributions earlier). Furthermore, note that the greatest den- sity is at around (0,0), which is what would be expected based on the Poisson distribution (for both MZ and DZ twins).

BLGP, t and normal distributions in that order. The cover- age increases as the distribution becomes more normal, in essence, and the coverage is almost 100% when a normal distribution is being simulated. Therefore, it is clear that the traits being investigated, when it comes to MZ or DZ twins, must be as normal as possible, otherwise there is great risk in lowering the coverage of the estimate, for the NACE model, but also based on the results, Falconer’s formula as well. Notably Table 2 demonstrates the same kind of results except in the context of c^2 , the shared environment parameter

Fig. 6: Effect of varying the MZ twin pairs’ variance as a function of DZ twin pairs’ variance on the heritability esti- mate ˆh^2 across NACE and Falconer’s model. 1000 datasets of 700 MZ and 700 DZ pairs were generated from a bivaraite normal distribution as prior, and input into each model, with the modulation that the variances differed between MZ and DZ twins. Specifically, Tau = τ = V ar(ymz )/V ar(ydz ), and tau varied from 0.5 to 1 with 0.1 increments for this dataset. As is seen, given that a core assumption of the NACE model is that σM Z = σDZ , for all respective variance components (genetic, shared environment, unshared environment), when this assumption is violated, there is not consistent conver- gence of the heritability estimate ˆh^2 until τ = 1, as com- pared to Falconer’s Formula which is consistently low bi- ased (or unbiased). The heritability bias was estimated as (ˆh^2 − h^2 )/h^2 ∗ 100 % Falconer’s formula makes no such as- sumption about the equality of variances, and hence does not have this issue.

estimate. As was the case for the average h^2 in Table 1, the average shared environment estimate matched with the expected 0.3, in roughly all the cases. The coverage, yet again, regardless of the model, was highest for the normally distributed data. In terms of the findings for Table 3, by observing the table, it was found that as the number of pairs increased for the input into either model when performing the same type of simulation as in the previous contexts (except only for the heritability estimate), the average standard error and the ’true’ standard error decrease. That result is expected because the standard error is a function of the variance, and if the sample size is increasing, then the variance in the result should seek to decrease by the central limit theorem. By definition, the standard error is a function of √σn , and with higher sample size, this should inevitably decrease (and ideally converge to a stable estimate). However, in this case, the average stan- dard error and the ’true’ standard error did not converge, for this BLGP distribution, which could speak more to the fact

BLGP-Dist ACE h^2 Falconer h^2 MZ/DZ Pairs (¯h^2 , SE, SE, Cov.¯ ) (¯h^2 , SE, SE, Cov.¯ ) N = 50 (0.48,0.14,0.29,0.61) (0.51,0.22,0.39,0.73) N = 100 (0.50,0.12,0.24,0.64) (0.51,0.16,0.29,0.70) N = 200 (0.49,0.09,0.20,0.63) (0.50,0.11,0.22,0.67) N = 400 (0.50,0.07,0.14,0.63) (0.50,0.08,0.15,0.70) N = 700 (0.50,0.05,0.12,0.63) (0.51,0.06,0.12,0.67)

Table 3: Varying twin pair sample size across BLGP dis- tributed data in 1000 simulations across MZ/DZ pairs = 50, 100, 200, 400, and 700. (ACE = NACE, Falc = Falconer’s Formula, blgp = bivariate lagrangian poisson distributed data, t indicates multivariate t- distributed data, ¯h^2 = average estimate of h^2 after 1000 simu- lations, SE¯ =the average of the h^2 -associated standard errors for all 1000 simulations, SE the true standard error of all the 1000 estimates of h^2 , and Cov = coverage of the true param- eter σ a^2 within the 1000 95% constructed confidence intervals across the datasets.)

new space. 7) The Anderson-Darling test is a suitable means to test for normality, while upon searching the literature, there appears to be only very recent developments in looking for equivalent tests for multivariate t-distributed data (ignoring the BLGP distribution). Overall, simulation allowed investi- gation into all these factors, so its utility as a tool for future research cannot be discounted. Future work should focus on using simulation to evaluate models based on their assumptions and differing distributions as in this instance. Additionally, it should be determined why there was a notable difference in Falconer’s formula coverage results as opposed to that found in [1], given all other results were consistent. It may be interesting to look into other mul- tivariate distributions as well, in the context of these models, and see how other data are typically distributed to see how well that trait data would fare when analyzing the data with these models through future simulation.

7. REFERENCES

[1] Arbet, J., McGue, M., & Basu, S. (2020). A robust and unified framework for estimating heritability in twin studies using generalized estimating equations. Statistics in Medicine. [2] Hofert, M. (2013). On sampling from the multivariate t distribution. The R Journal, 5(2), 129-136. [3] Fleiss, J. L., Levin, B., & Paik, M. C. (2013). Statistical methods for rates and proportions. John Wiley & Sons. [4] Keener, R. W. (2011). Theoretical statistics: Topics for a core course. Springer.

t-Dist ACE h^2 Falconer h^2 MZ/DZ Pairs (¯h^2 , SE, SE, Cov.¯ ) (h¯^2 , SE, SE, Cov.¯ ) N = 50 (0.47,0.16,0.24,0.72) (0.51,0.22,0.34,0.82) N = 100 (0.48,0.12,0.20,0.76) (0.49,0.16,0.24,0.80) N = 200 (0.50,0.09,0.16,0.74) (0.50,0.11,0.18,0.78) N = 400 (0.50,0.07,0.12,0.73) (0.50,0.08,0.13,0.79) N = 700 (0.50,0.05,0.10,0.73) (0.50,0.06,0.11,0.74)

Table 4: Varying twin pair sample size across multivariate t-distributed data in 1000 simulations across MZ/DZ pairs = 50, 100, 200, 400, and 700. (ACE = NACE, Falc = Falconer’s Formula, blgp = bivariate lagrangian poisson distributed data, t indicates multivariate t- distributed data, ¯h^2 = average estimate of h^2 after 1000 simu- lations, SE¯ =the average of the h^2 -associated standard errors for all 1000 simulations, SE the true standard error of all the 1000 estimates of h^2 , and Cov = coverage of the true param- eter σ a^2 within the 1000 95% constructed confidence intervals across the datasets.)

Normal-Dist ACE h^2 Falconer h^2 MZ/DZ Pairs (¯h^2 , SE, SE, Cov.¯ ) (h¯^2 , SE, SE, Cov.¯ ) N = 50 (0.50,0.18,0.18,0.88) (0.51,0.22,0.23,0.95) N = 100 (0.50,0.13,0.13,0.95) (0.50,0.16,0.15,0.95) N = 200 (0.50,0.10,0.10,0.94) (0.50,0.11,0.11,0.94) N = 400 (0.50,0.07,0.07,0.96) (0.50,0.08,0.08,0.95) N = 700 (0.50,0.05,0.05,0.96) (0.50,0.06,0.06,0.95)

Table 5: Varying twin pair sample size across bivariate nor- mal distributed data in 1000 simulations across MZ/DZ pairs = 50, 100, 200, 400, and 700. (ACE = NACE, Falc = Falconer’s Formula, blgp = bivariate lagrangian poisson distributed data, t indicates multivariate t- distributed data, ¯h^2 = average estimate of h^2 after 1000 simu- lations, SE¯ =the average of the h^2 -associated standard errors for all 1000 simulations, SE the true standard error of all the 1000 estimates of h^2 , and Cov = coverage of the true param- eter σ a^2 within the 1000 95% constructed confidence intervals across the datasets.)