















A transcript of a classroom discussion between students and a professor about effect sizes in research studies. They discuss various methods for calculating effect sizes, such as t tests and correlations, and the importance of reporting means and standard deviations. The students also share their experiences with coding studies and making assumptions when data is missing.
















Week 5, Psych Education 7670
Professor: We're going to talk today about coding articles. You've all coded the ones for this program. Before that, take one of these. This is one of the studies from our orange juice meta-analysis. I want you to tell me what the effect size is in this study. We were doing it with points gained for the other studies. I skipped over this, but I don't know if you included it or not. I shouldn't have skipped it.
Female Student: You have to think about the formula.
Professor: Which?
Female Student: I don't know yet.
Female Student: I know if you use sums of squares, you can use eta squared for ...
Professor: Can you convert eta squared to a standardized effect size? It would be a little difficult to convert eta squared. Would eta squared be reasonable to use if you could compute it for all studies? Yeah. It's on a different metric. Many people have trouble conceptualizing a little gain when you put it in terms of eta squared because it's not something we deal with a lot. But it would be a perfectly good effect size. All the other studies in the orange juice meta-analysis used points on an IQ test, though, so if you can avoid it, you don't want to use a different effect size. The notion behind doing an integrative review is that you have some kind of effect size that is common across all studies. You can't always achieve that. Any other ideas? Melynda's given you a hint and is on the right track.
Male Student: To calculate gain scores?
Professor: You have to somehow compute it backward.
Female Student: There is a formula you can use if you have some value, like the p value of a t score, to kind of ...
Professor: Here's a formula. We don't have a regular t test, so we have to use a correlated t. I put together this handout borrowing from a Glass book. This isn't an original derivation. Page two: if you have a t test design and you know the t value and the number of people in each group, then effect size equals t times the square root of one over the number in the experimental group plus one over the number in the control group. You're assuming that the variance in the groups is equal, which is a standard assumption when you do a t test. It may not be true. A t test, in terms of calculating a probability, is robust to violations of that assumption if you have fairly equal n sizes. It could mess up your effect size calculation a little.
The one below is the computation for the correlated pairs or dependent t test. In this case ... I left my notes. Here they are. You say effect size equals t times the square root of two over N, the number of pairs, times one minus r_xy, where r_xy is the correlation between pre- and posttest. Twenty-four pairs? It's forty-eight, because the pairs are pre- to posttest. So you have forty-eight people on whom you have pre- and posttest scores, half in the experimental group, half in the control group. That's not the issue, because that's what t tells you. You have 7.75 times the square root of two over 48 times one minus ... what's the correlation? It's not given.
This is like an example we talked about a while ago. If you want to code age and you know they're in first grade, then you know they're not fourteen or two. How far off will you be? Your choices are to leave the study out or to estimate what the correlation is, and as long as the consequences of misestimating are not too serious, it's probably worth doing. So we could make our best guess, then do another estimate where it's lower and another where it's higher, see how much difference it makes, and then decide if you're willing to go ahead.
So what do you think the correlation is between IQ scores in five-year-olds over a four-month period?
Male Student: Point seven. Point eight.
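The two handout formulas and the "best guess, then lower, then higher" check can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, the formulas are the standard t-to-effect-size conversions as described in the discussion, and the correlations tried (0.6 to 0.8) are guesses bracketing the values suggested in class, not data from the study.

```python
import math


def effect_size_independent_t(t, n_experimental, n_control):
    """ES = t * sqrt(1/n_E + 1/n_C); assumes equal variances in the two groups."""
    return t * math.sqrt(1.0 / n_experimental + 1.0 / n_control)


def effect_size_dependent_t(t, n_pairs, r_pre_post):
    """ES = t * sqrt((2/N) * (1 - r)) for a correlated-pairs (dependent) t test."""
    if not -1.0 <= r_pre_post <= 1.0:
        raise ValueError("correlation must lie between -1 and 1")
    return t * math.sqrt((2.0 / n_pairs) * (1.0 - r_pre_post))


def sensitivity(t, n_pairs, r_values):
    """Effect size under each guessed pre/post correlation."""
    return {r: effect_size_dependent_t(t, n_pairs, r) for r in r_values}


if __name__ == "__main__":
    # t = 7.75 and N = 48 come from the discussion; the r values are guesses.
    for r, es in sensitivity(7.75, 48, [0.6, 0.7, 0.8]).items():
        print(f"assumed r = {r:.1f} -> effect size = {es:.2f}")
```

If the estimates across the plausible range of correlations lead to the same conclusion, the study can be kept with the best guess; if they diverge widely, that is an argument for leaving it out.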
There will be many studies you're looking at that may be looking at something different but provide data that's ideal for your question. So when I looked at socioeconomic status and academic studies, there were many that looked at whether a particular curricular innovation led to better outcomes in a content area like reading. If, when they did the study, they did an analysis of variance where one variable was socioeconomic status, and they split the group into high, medium, and low socioeconomic status, then I could look at that variable, get information, and answer my question even though it wasn't the question they designed the study to answer. You aren't interested in their conclusions but in their data. Sometimes you code both what their conclusions are and what their data say. They can be different, and that can be useful when you're writing the review. You're reading the study to code the information that we just laid out. So given that question, this study had three groups. Which groups did you look at to code an effect size? Melynda doesn't get to answer because we talked about it. Doug's looking puzzled.
Male Student: I always look that way.
Professor: There's nothing wrong with that. It's a statement of fact, not a value judgment. Which groups were involved in the study?
Female Student: Group one had the dramatized version of the literature. They were the treatment group. Group two was a second treatment group, but they didn't have a dramatized version of the program, so then group three was the control group.
Professor: Don't use their terms. Give me a phrase that defines one, two, and three.
Female Student: Group one is the treatment group.
Professor: Don't use "treatment," etc. Group one was taught ...
Female Student: Group one had the drama component. Group two had a modified instruction. The teacher used questioning techniques and the teacher ... I can't remember what else. They didn't use drama but taught the same content.
Professor: Group one was taught content with drama. Group two was taught the same content without drama. Group three?
Female Student: Group three was the regular remedial reading program set up through whatever program was in the school. Different content.
Professor: It sounds like different content to me. So if you really want to answer the question "Does teaching reading with drama result in more reading gain than teaching reading without drama?", what would you want?
Female Student: One and two.
Professor: Right. So for our purposes, group three is irrelevant, and we shouldn't even try to include them. For her purposes too. I don't know why she included group three. But that's not relevant to our purpose today. The point is, when you read these articles, have in your mind the issue you want to collect evidence for. You have a problem statement where you made a series of assertions, then you go to the literature to support or refute the assertions. If you don't have the questions you want to answer firmly in mind, then you end up doing a lot with studies that are merely on the same general topic, that have those key words in the title. That's not what you want to do. So we want to look at groups one and two. How many effect sizes can we get from looking at this article?
Female Student: One.
Professor: You sure? What is it?
Female Student: I am not sure, but I think it's one. I had a problem computing effect sizes from the table for groups one and two. I wanted to use the table for the criteria, but it was a weekly test. It wasn't like a pre- and posttest, so I didn't want to use those scores. So I had
is 10.59. Their post would be 10.71. But in the middle they have higher scores for two weeks. I didn't think it was fair to use those as pre and post measures.
Professor: I have no response. You're right. It makes me queasy about using the data. How do you interpret that? Why would that happen? Who's giving these tests?
Female Student: I imagine the classroom teacher is giving group two's test. The researcher, I assume, is giving group one's test.
Professor: I think we're pretty sure the researcher is giving them in group one. We're not sure who's giving them in group two.
Female Student: Group one's look perfect. The standard deviation decreases as the weeks go on. I know the researcher probably wrote the questions for group one's test and didn't write the questions for group two's.
Professor: And you don't know for sure. But I wouldn't be surprised if the researcher gave them to group two. If I were a researcher hoping for a certain result from group two, and scores started going up in group two, I wouldn't fudge the data, but I'd subconsciously do some things that, you know, would keep the scores from going up. Maybe that explains the peak. Maybe random fluctuation explains the peak. If you've got scores with standard deviations fluctuating around 1.5 points, you have almost a full standard deviation of variation from the low to the high score in one group and two in the other group. That's not a whole lot. It could be sampling fluctuation.
Female Student: That's what I chalked it up to.
Professor: Whatever it is, it isn't enough to make me throw out the study. I'll rate the study in terms of methodological practice. It is enough to add another data point for "this study isn't all that well done." This study won't get a high rating. But it's still a data point. Now, there's nothing wrong if you decide in a systematic way that you'll only include studies that meet certain criteria. You may throw this study out and say, "I don't think this helps me answer the question at all," as long as you throw out all studies that fail the same criteria, whether or not they match your bias about whether integrating art helps to teach reading. The foundational point of a good integrative review is that you make your rules and have some basis for those rules. You can defend them to other people, and you stick with them. But for discussion's sake, we'll include the study and use table three to compute some sort of effect size. So someone tell me how you did it, or how you would like to do it now. Give me numbers.
Female Student: I used the control group's standard deviations.
Professor: Give me some numbers.
Male Student: 11.71 minus 14.
Female Student: Use treatment minus control. So it's 14 minus 11.71.
Male Student: 1.75 times 0.71 divided by two. I don't know what this is.
Professor: I have a calculator. Conceptually, you want the best estimate of growth. Is that the best? I think we can do better. Growth. You have pre- and posttest scores. We want the difference between groups one and two. How much growth did group one make? They made 14.0 minus 11.71. How much growth did group two make?
Female Student: They made 10.71 minus nine point [can't hear/can't understand].
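The "growth" approach the professor is steering toward, comparing how much each group gained and scaling by a standard deviation, can be sketched as follows. This is a hedged illustration, not the class's actual calculation: the function name is hypothetical, group one's 11.71 and 14.0 and group two's 10.71 come from the discussion, and group two's pretest mean and the control standard deviation are placeholders because the transcript does not give them clearly.

```python
def standardized_gain_difference(pre_treat, post_treat, pre_ctrl, post_ctrl, sd_ctrl):
    """(treatment gain - comparison gain) / comparison-group standard deviation."""
    if sd_ctrl <= 0:
        raise ValueError("standard deviation must be positive")
    treatment_gain = post_treat - pre_treat
    comparison_gain = post_ctrl - pre_ctrl
    return (treatment_gain - comparison_gain) / sd_ctrl


if __name__ == "__main__":
    # Group one (drama): 11.71 -> 14.0, as read out in class.
    # Group two (no drama): posttest 10.71 as read out in class.
    # PLACEHOLDERS, not values from the study: group two's pretest mean
    # and the standard deviation used as the denominator.
    placeholder_pre_ctrl = 10.0
    placeholder_sd_ctrl = 1.5
    es = standardized_gain_difference(
        11.71, 14.0, placeholder_pre_ctrl, 10.71, placeholder_sd_ctrl
    )
    print(f"illustrative effect size: {es:.2f}")
```

Subtracting the comparison group's gain adjusts, roughly, for where each group started, which is why it is a better estimate than the raw posttest difference.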
there, we could have done something with it. And even better, had we had the means and standard deviations, we could have done a lot more with it. One reason I think everyone in grad school should do a systematic integrative review is so you become so frustrated with what other people report that you will always report the right information. Means and standard deviations should be the foundational statistics that you always report when you do group experiments. That doesn't mean you could not improve on that by extracting variance, by using covariates, or regression, or something else. But you should always report means and standard deviations. It takes little room. So we'll get an effect size measure: 1.52. That's our best estimate. That's the only effect size we'd get out of this.
Let's code the rest of the article. I'll have to do it from memory. Study ID number is zero one one. You'll use column. Effect size will be one. That's nominal. Year of publication.
Female Student: So you don't write 1.52, because we got one effect size.
Professor: Right. You could just as soon call it Billy or Suzie. It's a way to keep track of which effect size you're talking about. So I could have put "effect size ID." Year of publication: 1992. Middle grade level: five. It will make a difference when you punch this into a computer whether you put "fifth" or "five." It's okay if you write "fifth" as long as you do that with all the others. You have to do it consistently. And you have to code it in. It depends on how you analyze it. If you'll just look at it, that's fine. But if you'll use a computer, then ... when we did the meta-analysis on early intervention, we were going to code pretest IQ in three digits with no decimals. Some people insisted on coding decimals. We stuck it into the computer and had kids with an IQ of eight hundred and forty-six, because it was 84.6. It works as long as everybody's doing it the same way.
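The "846 IQ" problem is the kind of thing a small check at data entry can catch. Below is a minimal sketch, assuming a coding convention of whole-number IQ in at most three digits and integer grade levels; the function names, the plausible-range bounds (20 to 200), and the grade-word table are all hypothetical choices for illustration, not part of the class's coding sheet.

```python
GRADE_WORDS = {
    "first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5, "sixth": 6,
}


def parse_three_digit_iq(raw, low=20, high=200):
    """Accept only whole-number IQ codes of up to three digits in a plausible range."""
    text = str(raw).strip()
    if not text.isdigit() or len(text) > 3:
        raise ValueError(f"IQ must be coded as up to three digits, no decimals: {raw!r}")
    value = int(text)
    if not low <= value <= high:
        raise ValueError(f"IQ outside plausible range {low}-{high}: {value}")
    return value


def normalize_grade(raw):
    """Map 'fifth', '5', or 5 to the integer 5 so every study is coded the same way."""
    text = str(raw).strip().lower()
    if text in GRADE_WORDS:
        return GRADE_WORDS[text]
    if text.isdigit():
        return int(text)
    raise ValueError(f"unrecognized grade code: {raw!r}")


if __name__ == "__main__":
    print(parse_three_digit_iq("085"))
    print(normalize_grade("fifth"), normalize_grade(5))
    for bad in ("84.6", "846"):
        try:
            parse_three_digit_iq(bad)
        except ValueError as err:
            print("rejected:", err)
```

Rejecting the entry, rather than silently stripping the decimal point, forces the coder to round according to the agreed convention.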
Grade?
Female Student: All the same grade.
Professor: Zero. Maybe. You'll have conventions that go with the coding sheet, the instruction book. The way I do instructions for a code book, I let them evolve as I code. I have some basic coding ideas, but I don't try to write everything, because you end up writing rules for situations you never encounter and not writing rules for things you do encounter. In the early intervention review we did, we wanted to code parental involvement. We tried to write rules about whether a parent was involved to a large or small degree. We had pages and pages and multiple iterations. We had a team of six people coding. The more complicated the rules got, the worse our inter-rater agreement was. Finally we threw them away and said, "I think you know about what a lot of parental involvement is and what a little is. Just code it." When we threw the rules away: almost perfect agreement. Then, after the fact, we took examples of studies that were coded high and low and said, "This is how we coded it." If you're coding by yourself, you can't get away with that as easily. Sometimes it's worth writing rules. Sometimes it's not. As you encounter situations, you write the rule down at that time so you can remember and use the guideline in the future. If you coded age, you say, "They didn't report age, but they're all in first grade. My rule is if they're in first grade, I'll say they're six," or whatever. Percent male. You all say you don't know. SES? They say they're remedial. You want to say all remedial?
Students: No.
Professor: There are a lot of subjective decisions. Some might say, "Oh no, kids that are in remedial programs are low SES." I wouldn't say that's absolutely wrong. I've seen worse decisions. I'd agree with you.
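One way to put a number on the coder agreement described above is simple percent agreement, optionally corrected for chance with Cohen's kappa. Note that kappa is a standard statistic swapped in here for illustration; the transcript does not say which agreement measure the team used. The function names and the sample codes are hypothetical.

```python
from collections import Counter


def percent_agreement(codes_a, codes_b):
    """Share of studies on which two coders assigned the same code."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must rate the same non-empty set of studies")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)


def cohens_kappa(codes_a, codes_b):
    """Agreement corrected for chance: (p_observed - p_expected) / (1 - p_expected)."""
    n = len(codes_a)
    p_observed = percent_agreement(codes_a, codes_b)
    counts_a, counts_b = Counter(codes_a), Counter(codes_b)
    categories = counts_a.keys() | counts_b.keys()
    p_expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both coders used a single identical category throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Hypothetical parental-involvement codes from two coders on four studies.
    coder_1 = ["high", "high", "low", "low"]
    coder_2 = ["high", "low", "low", "low"]
    print(percent_agreement(coder_1, coder_2))
    print(cohens_kappa(coder_1, coder_2))
```

Computing agreement on a sample of studies before and after simplifying a rule gives some evidence about whether the rule is helping or hurting.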
Professor: Thirty-four or seventeen? You could record both students in the experimental group and students in the control group. It doesn't help. It just makes it harder to code. So pick one: either combined or experimental group only. If you pick combined and it's a pre-post design, then you have to record the pre- and posttest number of students. Classrooms?
Female Student: Four. Or two if it's just experimental.
Professor: So you have rules. Schools?
Female Student: One or two.
Professor: So are classrooms and schools important? I don't know. My experience is that in most integrative reviews, there are only four or five variables that really make a difference. The problem is you don't know which they are. If you leave out the most important one, you end up with what looks like a pile of rocks instead of a house, and you don't know why. I err on the side of including more rather than fewer things. If you take that too far, you create a system that collapses around your ears because you just can't do it all. When we coded the early intervention meta-analysis, because we had lots of money and a whole team doing it, we had a hundred and fifty variables for every study that we coded. Every study took eight hours. But there were only four or five variables that made a difference. We could have predicted then what they were. But sometimes you can't. School type?
Female Student: I assumed public.
Professor: One says public. One says, "I'm not going there. We have no evidence." Most schools are public. Maybe your convention is that if they don't say private, you assume it's public. That is what I did. It's a little dangerous. It depends on how critical the public-private distinction is to your analysis.
School location.
Female Student: It said both were suburban schools. Mixed.
Professor: Average years of experience of the teachers? Don't know. Duration?
Female Student: Six weeks.
Professor: Or is it five? Five. Hours per week?
Female Student: They didn't specify.
Professor: Anyone willing to guess? You guys are very consistent.
Female Student: I'd say an hour a day, but it didn't specify.
Professor: I think most school periods last about an hour. It's a good assumption. But there's a lot of variation. The issue becomes, when all is said and done, do you want to differentiate between studies that last two hours versus those that last a hundred hours, or do you want to differentiate between those that last twenty-seven hours and twenty-eight hours? If it's the latter, you'd better not code this, because you don't have that fine a gradation. But if it's the former, you have a pretty good idea this lasted five weeks, no more than an hour a day. This is probably no more than twenty-five hours' worth of treatment. Might be fifteen. Might be forty. But it's not three hundred. If, going into this, based on other reviews, you have a sense of what distinctions you want to make when you're all done, that'll guide you in terms of what kinds of assumptions you're willing to make now.
Male Student: It says there are six intact classes of fifth-grade remedial reading students, and the class sizes are small. So there are six classes, and they all have from eight to nine students in them. Those classes are labeled as having kids with a remedial reading level. So I wondered if it wasn't the whole day.
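The professor's reasoning about treatment hours, that a rough estimate is fine when only coarse distinctions matter, can be sketched as below. This is an illustration under assumptions: five school days per week is assumed (the study did not specify), and the bin cutoffs of 10 and 100 hours and the bin labels are hypothetical, chosen only to show the idea.

```python
def estimated_hours(weeks, days_per_week, hours_per_day):
    """Rough total treatment time from the duration details a study reports."""
    return weeks * days_per_week * hours_per_day


def duration_bin(hours, cutoffs=(10, 100)):
    """Coarse category for total treatment hours; cutoffs are illustrative."""
    low, high = cutoffs
    if hours < low:
        return "short"
    if hours < high:
        return "medium"
    return "long"


if __name__ == "__main__":
    # Five weeks at no more than an hour a day; five days per week is assumed.
    best_guess = estimated_hours(5, 5, 1)
    print(best_guess, duration_bin(best_guess))
    # The low and high guesses from the discussion land in the same bin,
    # so the assumption does not change how the study is coded.
    for guess in (15, 40):
        print(guess, duration_bin(guess))
    print(300, duration_bin(300))
```

If the planned analysis instead needed to separate twenty-seven hours from twenty-eight, no such binning would rescue the estimate, and the variable would be better left uncoded for this study.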
Female Student: Check it.
Professor: I'd code it like a dummy variable in a multiple regression. So I'd code it zero zero zero one. Then for the next study, you'd code it zero one zero one. So when you get all done, you can say how many studies did just one, how many did this combination, and how many did that combination. Dependent variable?
Male Student: Four.
Professor: Are you okay?
Male Student: I wondered if you wanted to differentiate reading versus reading comprehension?
Professor: You could do that. Again, it depends a little bit. The point I'm trying to make is that how you prepare your coding sheet and how you code your articles depends on what you've learned about the area from previous reviews. You can't just go out and start doing this. To do it well, you have to know a lot about the area. If, in this area, it's important to distinguish between reading comprehension and decoding and whatever the other reading things are, you would want to do that. I probably wouldn't want to use "other." I'd probably want to do this earlier. Type of measure? What do you say, Doug?
Male Student: I gave it a one, but probably just because of validity. Given those criterion measures, it's gotta be objective.
Professor: Given those options, it's gotta be one ... it may be a poor objective. Time of testing after treatment, in months: zero. Reliability of measure: I wouldn't guess on that one. We don't know. General validity of outcome scores: this gets maybe closer to what you were talking about. We still don't know, because they don't give us much information. I wouldn't give it a "good." Fair or poor. I'd need some rules. Given that the program administrator administered it and it was a criterion-referenced test, there is some danger of teaching to the test. I'd probably lean toward poor. Design characteristics? Method of assigning units to treatments? Three? Observation schedule?
Female Student: Two.
Professor: Or multiple pre-post. I don't know what that means, but it's probably more than two. Maybe three. Experimental unit?
Female Student: One.
Male Student: I put three because it was two classes put together.
Professor: Three says dyad or triad. It's smaller than a normal class but bigger than a dyad or triad.
Female Student: What do those mean?
Professor: Two or three kids, I think. Threats to internal validity. One was a minor threat. Three was a major threat. A three, all by itself, could have explained anything. Any threes?
Female Student: Perhaps regression. They were selected because they were different to start with.
Professor: They were different to start with, but there's a control group, so both groups were different to start with. Regression will occur, but it will happen equally for both. Regression is a zero.
Male Student: I think history would be a three because of the examiner. Bias there.
Professor: I would call examiner bias instrumentation. But if you explained it well ... What I'm looking for is whether you identify ... as long as, when you code the other articles, you code them the same
If you have seven ones, you don't have to give it a five. You can have seven ones and have a pretty good study, because they might have canceled each other out. But if you give something a three, conceptually you're saying this threat by itself could have canceled everything out, so it's a bad study. I'd say this is a three. Standardized effect size is 1.52. Okay? The important thing is that you learn things that will help you code your articles. What have you learned?
Male Student: One second. I'm curious about the inappropriateness of the statistical procedure. I think I would have coded that differently, because when you look at the pretests, there was a difference between the groups. Why not use an ANCOVA?
Professor: It would have been better. But when you calculate the statistics this way, you're doing a poor man's ANCOVA. You're adjusting for pretest scores, but not as well as with an ANCOVA. Had you run an ANCOVA, maybe it would have turned out to be 1.57. I don't think it would have made much difference. Again, I didn't develop this coding sheet. They were looking for grosser kinds of statistically inappropriate tests. If you're given means and standard deviations, then coding for funny t tests doesn't worry you too much, because you have the real data. I'd only code that if I only had that to rely on and they'd done it poorly. What did you learn from this for yours?
Male Student: To create my own sheet rather than using someone else's.
Professor: I'm hoping you learned the sheet is important and it's going to take a few iterations. You have to come up with conventions, rules, definitions. Those can evolve as you go, but it's important to do. What else?
Male Student: We should have included more information. It's too bad we had to throw the MAT6 out for effect size.
Professor: So when you do your own study, include more information. But when you code others' studies, there's not much you can do about it. What else? The importance of looking at reviews to develop the coding sheet. The importance of content expertise in the area. As you start reading in the area where you want to do this integrative review, and you'd better be well into it, there will be a lot you read that doesn't yield an effect size or get included in the review. That doesn't mean it's wasted. It's also about getting background and context and a feeling for the issues, and understanding what others think is important.
Female Student: Much of that reading is qualitative.
Professor: I don't believe in that. I shouldn't have said that. Every study looks at qualitative as much as what you call quantitative. You're still trying to answer questions. Give me an example of qualitative.
Female Student: Someone asks a bunch of high school principals their thoughts about student attendance.
Professor: Perfect. From that qualitative study, that person will draw conclusions about what principals think about high attendance and low attendance. That's a quantitative outcome: students who have parents who care a lot about school tend to have less truancy than students who have parents who don't care. The fact that it was based on interviews is a great thing. I'm not saying what you're calling qualitative data isn't good. I'm saying I don't see the value in calling it