Understanding Reliability in Experimental Measurements
Reliable measurements are essential in cognitive psychology to ensure that experimental data accurately reflect the conceptual variables under study. Reliability refers to the extent to which measurements are free from random error, enabling trust in the consistency of results.
Sources of Measurement Error
- Random Errors: Caused by chance fluctuations like participant distraction, misreading questions, environmental variations (e.g., room temperature), or recording mistakes. These errors are self-canceling across multiple measurements.
- Systematic Errors: Result from confounding variables such as participants' self-esteem, optimism, or social desirability bias, which can consistently skew results and threaten construct validity.
Types of Reliability Assessment Methods
1. Test-Retest Reliability
- Measures stability over time by correlating scores from the same test administered at two different points.
- High correlation indicates high reliability; low correlation suggests measurement error or a change in the underlying trait (see the computation sketch after this list).
Limitations
- Reactivity: Participants may alter responses if they recognize the test's purpose.
- Retesting effect: Participants might remember or intentionally change answers.
- Changes in state variables (e.g., mood, anxiety) over time reduce reliability estimates.
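As a concrete illustration (a minimal sketch with made-up scores, not data from any study), test-retest reliability is estimated as the correlation between two administrations of the same measure:

```python
import numpy as np

# Hypothetical anxiety scores for eight participants, measured twice
# (illustration data only).
time1 = np.array([12, 18, 9, 22, 15, 30, 11, 25])
time2 = np.array([14, 17, 10, 21, 16, 28, 13, 24])

# Test-retest reliability is estimated as the Pearson correlation between
# the two administrations; values approaching 1 indicate a stable measure.
r = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest reliability: r = {r:.2f}")
```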
2. Equivalent Forms Reliability
- Uses two parallel versions of a test containing similar but not identical items measuring the same construct.
- Helps avoid retesting effects and memorization.
- Common in standardized testing (e.g., GRE, TOEFL).
3. Internal Consistency Reliability
- Evaluated within a single test administration by analyzing the correlation among multiple items intended to measure the same construct.
- Higher average inter-item correlations indicate that items consistently reflect the true score.
Measures of Internal Consistency
- Split-Half Reliability: Divides the test items into two halves and correlates the scores.
- Cronbach's Coefficient Alpha: Provides an overall estimate of the average inter-item correlation; it is numerically equivalent to the average of all possible split-half reliabilities and is the most widely used index.
- Item-Total Correlation: Correlates each item's score with the total score (excluding that item) to identify items contributing less to reliability.
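Split-half and item-total correlations are easy to compute directly. Here is a minimal sketch on simulated data (the odd/even split below is one of many possible splits; a sketch of Cronbach's alpha appears later in these notes):

```python
import numpy as np

# Simulated 6-item scale for 50 respondents: each item reflects a shared
# true score plus independent random error (illustration only).
rng = np.random.default_rng(0)
true_score = rng.normal(size=(50, 1))
items = true_score + rng.normal(scale=0.8, size=(50, 6))

# Split-half reliability: correlate totals from the odd items (indices
# 0, 2, 4) with totals from the even items (indices 1, 3, 5).
odd_total = items[:, 0::2].sum(axis=1)
even_total = items[:, 1::2].sum(axis=1)
print(f"split-half r = {np.corrcoef(odd_total, even_total)[0, 1]:.2f}")

# Item-total correlation: each item vs. the sum of the remaining items
# (the item itself is excluded from the total).
for i in range(items.shape[1]):
    rest = np.delete(items, i, axis=1).sum(axis=1)
    print(f"item {i + 1}: item-total r = {np.corrcoef(items[:, i], rest)[0, 1]:.2f}")
```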
Enhancing Measurement Reliability
- Increase the number of measurements or items to average out random errors.
- Remove or revise items showing low item-total correlation to improve scale consistency.
- Use multiple raters in behavioral assessments and calculate interrater reliability to account for observer errors.
Interrater Reliability
- Applicable when judgments are made by multiple observers.
- Quantitative ratings can use coefficient alpha; nominal ratings use the kappa statistic, both ranging from 0 (random error) to 1 (perfect agreement).
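For nominal ratings, the simplest case is two judges, for which Cohen's kappa can be computed by hand as below (a sketch with invented ratings; designs with more than two judges use generalizations such as Fleiss' kappa):

```python
from collections import Counter

# Invented ratings: two judges categorize ten children's play episodes.
judge1 = ["aggressive", "cooperative", "alone", "cooperative", "aggressive",
          "alone", "cooperative", "cooperative", "aggressive", "alone"]
judge2 = ["aggressive", "cooperative", "alone", "aggressive", "aggressive",
          "alone", "cooperative", "cooperative", "cooperative", "alone"]

n = len(judge1)
# Observed agreement: proportion of episodes with identical labels.
p_obs = sum(a == b for a, b in zip(judge1, judge2)) / n

# Chance agreement: probability both judges pick the same category if each
# labeled independently according to their own category frequencies.
c1, c2 = Counter(judge1), Counter(judge2)
p_chance = sum((c1[k] / n) * (c2[k] / n) for k in set(c1) | set(c2))

# Kappa: agreement corrected for chance (0 = chance level, 1 = perfect).
kappa = (p_obs - p_chance) / (1 - p_chance)
print(f"observed = {p_obs:.2f}, chance = {p_chance:.2f}, kappa = {kappa:.2f}")
```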
Distinguishing Traits Versus States
- Trait Variables: Relatively stable characteristics (e.g., optimism, intelligence) that should yield consistent test-retest scores over short periods.
- State Variables: Fluctuating conditions (e.g., mood, anxiety) that may change rapidly, making test-retest reliability less applicable.
Practical Considerations and Conclusion
- Avoid short retest intervals to reduce memory or practice effects.
- Provide clear instructions to reduce participant misunderstanding.
- Recognize that perfect reliability is rare; strive to maximize the proportion of true-score variance.
In the subsequent lecture, the focus will shift to addressing systematic errors and enhancing construct validity to ensure that experiments accurately measure intended psychological constructs.
Hello and welcome to the course Basics of Experimental Design for Cognitive Psychology. I am Dr. Ark Verma from the Department of Cognitive Science at IIT Kanpur. We are in week five of the course; in the last three lectures we talked about different kinds of experimental designs, the one-way experimental design and the factorial design. Moving on, something that is extremely important for experimental researchers is to be able to ensure that the experiments we carry out, and the measurements we take, are both reliable, in the sense that we can trust that the process has been followed and the measures are actually yielding what they are supposed to measure, and valid. Both the reliability and the validity of an instrument are essential, and it is from that perspective that I will discuss how to ensure higher reliability and validity of our experimental findings. In today's lecture I will mainly talk about reliability.
An important concern among researchers is to ensure that the measurements arising out of their experiments are reliable, basically meaning that there are no errors or spurious measurements, and that the results are conceptually valid. To ensure that these errors are minimal, and that statistically and conceptually sound measures are used to interpret the results of the experiment, steps are taken to ensure reliability and validity. Let us talk about reliability first. One of the basic difficulties in determining the effectiveness of a measured variable is that the measure will always be influenced by some kind of error. Remember, we talked earlier about error and about minimizing the possibility of error by increasing the number of measurements: the idea is that the more measurements we take, the more the random error is reduced. That is basically the concept we are going to talk about today. So, what are the sources of these errors?
These errors are typically clubbed under what is called random error, and random error can arise from many sources. Let us look at them. It is possible that the measured variable contains chance fluctuations in the measurement, which we call random error, arising from sources such as misreading or misunderstanding a question, or the measurement of individuals in different places. Say, for example, you take a test in one classroom and then a test in a different classroom where the temperature and other conditions are very different, and the participant's performance varies. It is also possible that the experimenter has misprinted a question or, in a behavioral experiment, has misrecorded the responses. Sometimes what students do is miscode a trial: a line goes missing, the responses do not get recorded, or the responses pick up some extra amount, maybe 5 or 10 seconds added because of the response-measurement devices, and so on. There can be any number of sources that bring random error into our measurements.

Now, these random errors do influence the measured variable, but they do so in a way that is self-cancelling. Whatever number of random variables have an effect on your dependent measure, a lot of the time it will happen in a self-cancelling way. What does that imply? Although the experimenter can make some recording errors, and individuals can give some incorrect responses, typically you will find that these errors increase the scores of some people and decrease the scores of others. That is one of the reasons people work with means and averages: it will all balance out, and the amount of error in your critical measurement will be reduced.
In contrast, there is another kind of error that can be more problematic: what is called systematic error. The measured variable can also be influenced by other conceptual variables that are not part of the conceptual variable of interest; these are called extraneous variables or confounding variables. Sometimes we find, when conducting our experiments, that these confounding variables influence the dependent measure in a very systematic way. Remember, the expectation is that whatever measurement we get on the dependent measure is solely due to the manipulation of the independent variable. But if you have this kind of systematic error, arising from other variables that are playing a part, it decreases the reliability of the measurements we eventually get. For example, individuals with high self-esteem may score systematically lower on an anxiety measure than those with low self-esteem, and more optimistic individuals can be expected to score consistently higher. What you are trying to measure is anxiety; anxiety is the dependent measure. If you have not taken care of self-esteem, if you have not already measured or controlled for it, then it can produce systematic variance, a systematic error, in the way your participants respond to your scale.
Also, sometimes the participants you call to your experiments will have a tendency to self-promote. Participants are almost always trying to guess what the experiment is for; there is a certain degree of reactivity you can expect from all participants across all kinds of studies. This tendency to self-promote can lead some respondents to answer the items in ways that make them appear less anxious. This is a bit of an issue with most scales and most surveys, especially when we run them with psychology undergraduates or postgraduate students: the students have a sense of what the survey is for, and in some sense they project what they want into the survey. For example, if you are running an anxiety test and the person does not want to be portrayed as a highly anxious individual, they will respond to the items in the questionnaire in a way that keeps them from appearing anxious. A lot of the time, respondents will answer your survey in a way that makes them appear less anxious than they actually are, and they might do this for any reason: sometimes to please the experimenter, sometimes just to feel better about themselves. In such cases, the measured variable will basically be assessing self-esteem, optimism, or the tendency to self-promote in addition to the conceptual variable of interest, which was anxiety. This is a very good example of how systematic error can creep into our measurements.
It happens in experiments too. For example, we were talking earlier about watching violent cartoons, prior state, frustration, and gender. A lot of the time other things creep in; remember the confounding variable of the parents' disciplinary style, which may be playing a role as well. This, basically, is the difference between random error and systematic error. Random error can arise for various reasons: coding errors, participant inattention, misperception, or other conditions. Maybe the room is too hot and the participants you are asking to perform the experiment tire very quickly, maybe in the middle of a block, maybe at the beginning of a block, and so on. All of that can threaten the reliability of the measurement, but it will eventually cancel itself out. Some participants will be very fatigued and tired when they come to your experiment; some will be extremely fresh. Somewhere the score will go up, somewhere the score will go down, and eventually it will all cancel out. But in the case of systematic error, you have other conceptual variables in play, for example self-esteem, mood, self-promotion, or, with reference to the previous example, the parents' disciplinary style; personality factors may come in as well. These, basically, are threats to construct validity: they confound the actual measure you are taking, and that is referred to as a threat to the construct validity of the experiment. What is it that you are trying to measure? If your item is not measuring that, but is instead confounded with another conceptual variable, that is called a threat to construct validity. We will talk about construct validity in more detail as we go forward.
Now we can see that the impacts of random and systematic error on a measured variable are slightly different from each other. And even though there is no definitive way to determine whether the measured variables are completely free from random and systematic error, there are techniques, statistical and methodological, that allow us to get an idea of how well our measured variables actually capture the conceptual variable they are designed to assess; in this case, we are interested in anxiety.
So let us see how we go about it. Reliability is a very interesting concept: it refers to the extent to which measurements are free from random error. Typically, when you want random error to be taken care of, you are looking at enhancing reliability; you are looking for ways that allow you first to assess, and then to enhance, the reliability of your measurements. A simple way to determine the reliability of a measured variable is to measure it a larger number of times, to measure it more than once. For example, one can test the reliability of a weighing scale simply by weighing the same object again and again. Say there is an object of 10 kg weight and you are putting it on the weighing scale; suppose you are calibrating the scale, or it is a new one and you want to determine how it works. The first measurement gives you a reading with some error; the second gives you a reading with some other error. You do this three or four times and then average the readings. What you have done is minimize the error component: it averages, or cancels itself, out, and you get closer to the actual measurement. If your scale gives the same reading on several trials, you can assume it is a correct one; but if on some measurements it says 10, on others 12, and on others 7, then you can conclude that the weighing scale is not measuring the weight reliably.
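This weighing-scale intuition can be checked in a few lines of code. The sketch below is not part of the lecture; it uses made-up zero-mean noise (the 1.5 kg spread is an arbitrary assumption) to show repeated readings averaging toward the true weight:

```python
import numpy as np

rng = np.random.default_rng(42)
true_weight = 10.0  # kg: the true score the scale is trying to capture

# Each reading is the true weight plus zero-mean random error.
readings = true_weight + rng.normal(loc=0.0, scale=1.5, size=20)

print("single reading:", round(readings[0], 2))
print("mean of 5     :", round(readings[:5].mean(), 2))
print("mean of 20    :", round(readings.mean(), 2))
# Averaging more readings cancels the random errors and converges on 10.0.
```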
Let us now go forward and assess the different approaches to measuring reliability in an experiment. The first approach, and one that is very commonly used, is test-retest reliability, the one we were just describing. Test-retest reliability is the extent to which scores on the same measured variable correlate with each other across two different measurements. You take a measurement on day one, then again on day three, day five, and so on, and compare the correlations between them. If the correlation is high, you can assume the measurements were reliable; if the correlation is low, the measurement was not as reliable. If the test is perfectly reliable, and if the scores on the conceptual variable do not change considerably over the time period, individuals should receive essentially the same score each time, and the correlation between the scores will tend to one. One is a perfect correlation; zero is no correlation at all. The idea is that whatever correlation you get should approach one, in the high range, say .70 and above. If your measured variable contains random error, the two scores will obviously not be highly correlated: higher positive correlations between the scores at the two times indicate higher test-retest reliability, and low correlations indicate low test-retest reliability. Although the test-retest procedure is a direct way to measure reliability, it also has its limits.
measure reliability it also has limits. Remember we were talking about uh uh you know a bunch of things uh earlier
carryover practice and fatigue things like that. Now for instance when the procedure is used to assess also it can
uh you know we'll just talk about this and I'll come to it. For instance when the procedure is used to assess the
reliability of a self-report measure it can produce reactivity. So this is what I was talking about a lot of times when
you are reme-measuring the participant again and again and it happens more with qualitative than with quantitative and
experimental methods but it happens uh that the participant starts responding slightly unnaturally to the procedure
because they are trying to guess or they have already guessed uh the purpose of the experiment. Okay. So uh when when
you're doing uh let's say an attitude scale or when you're doing a personality scale and things like that and you're
doing it again and again what happens a lot of times is that the participant is trying to double guess the experimental
the participant will basically say uh that okay why is it maybe the experimental is expecting a certain kind
of answer maybe I did not provide it in the first instance so let us change let me change my answers in the next time
and you know be more uh uh you helpful let us say helpful to the experimental. So this is by the way what is reactivity
and reactivity can actually be a potential problem because when the same or the similar measures are given twice
responses typically on the second administration may have been influenced by the measure uh already you know that
was already taken the first time. These effects are typically known as retesting effect. So if you do this
again and again and if you discover that the participants are changing their responses or unnaturally responding them
you know that you know there is a degree of carryover practice reactive uh you know reactivity based retest effects. So
retesting problems can actually occur in cases typically where people remember how they had answered or responded the
uh you know uh first time to the same questionnaire. Some people believe that you know some
As I was saying, some people may believe that the experimenter wants them to express different opinions on the second occasion, and that is why they change their responses, which reduces the reliability of your measure. This pattern obviously reduces the test-retest correlation: if you respond toward one end of the scale in the first administration and toward the other end in the second, the correlation between the two sets of values across items will be lower. It could also be that the respondent tries simply to duplicate their answers, say because they want to come across as extremely consistent. They remember what they answered in the first administration: on a Likert-type scale from one to five, I answered 1 on item one, 4 on item two, 2 on item three. If the participant remembers this and replicates it exactly in the second administration, that unnaturally inflates the reliability estimate: with the same numbers across the two measurements, the reliability, the correlation, rises to 0.9 or even higher, and that too is an unnatural inflation; you are not getting what you are looking for. It can also happen that participants get bored with answering the same questions again and again, and that can likewise be a source of differences.
Some of these problems can be avoided by using a long testing interval: you do the first administration on day one and the second on day 15 or day 30. Sometimes people will genuinely forget what they did in the first administration of the test and will still respond naturally on the second or third administration; in that sense the response will not be reactive, and it will not affect your reliability computations. On other occasions you can simply use instructions: explain things very clearly, and hope that the participants respond naturally and sincerely. Still, retesting effects pose a general problem for the computation of test-retest reliability.

To get around this, researchers often employ a more sophisticated type of test-retest reliability known as equivalent forms reliability. Here the items are not identical, but they are very similar and measure the same thing. Under this approach, two different but equivalent forms of the same measure are given at different times. Say you want to test somebody's English language proficiency: you have one test, and the next time you give another test whose items broadly measure the same thing, probably organized along the same dimensions, but which are not the exact same items. When you then assess the correlations, you should be closer to getting lower random error and higher reliability estimates. The equivalent forms approach is particularly useful when certain items have correct answers that individuals can learn when they take the first test, or find out between administrations ("oh, so that is the correct answer to that item"). If you do not repeat the items but use equivalent versions, things become much easier. This is done in many practical cases as well: since students might remember the exact questions and learn the answers to aptitude tests such as the GRE, the SAT, the TOEFL, or exams like the CAT, these tests typically employ equivalent forms. They all test the same thing on the second or third administration, but the items are not identical.
In addition to the problems that can occur when people complete the same measure more than once, there is another problem with test-retest reliability: some conceptual variables actually change over time within an individual. It is possible that the difference in responses is not because of the items but because the conceptual variable you are assessing, or trying to assess, has itself changed within the individual. A lot of personality tests may be like that: if you take a given personality test at age 15, and then at 25, 35, 45, and 55, the things the test is trying to assess, the conceptual variables, say optimism or extraversion-introversion, may themselves have changed.

Let us take the optimism example again. If optimism has meaning as a conceptual variable, then people who are optimists on a Tuesday (and remember, we typically do not compare tests across a large span of time; the first, second, and third administrations are usually done within a short period) should be expected to be optimists on the Friday of the same week as well. This is a characteristic assumed to be relatively stable within an individual, at least over short periods of time. Characteristics such as intelligence, friendliness, assertiveness, and optimism, as I was just saying, are known as traits, personality traits, and they are not expected to vary over a very short period of time; over long periods, yes, maybe. These are trait variables, and you would expect them not to change very quickly over time. But there are also state variables, for example anxiety, or some kind of emotional reaction, which can obviously change over a period of hours or days. Conceptual variables such as level of stress, mood, or even, say, a preference for classical over rock music are known as states; these are variables that are expected to change even within the same person over short periods of time. As a result, a person's score on a mood measure administered on Tuesday is not necessarily going to give the same information when the same measure is administered on the Friday of that week, or the next Friday. In these cases the test-retest approach will not really work and will not give us a good estimate of reliability; we will have to look for a different approach.
The other approach people can take to assess reliability is known as the internal consistency approach. Here we are not hinging our bet on repeated measurement of the individual; we are instead working with the internal consistency of the test or measure itself. Given the problems associated with test-retest and equivalent forms reliability, a different measure of reliability, internal consistency, has become the preferred choice and the most accurate way of assessing reliability for both trait and state measures. Internal consistency can be assessed using the scores from a single administration of the measure, so you do not have to take the measurements again and again. And as we have seen, most self-report measures contain a large number of items: you will typically not have a survey or a test with just one item. That is one of the features that helps us determine internal consistency, as we will see going forward.
When we take these measurements, there is something referred to as the true score and the random score. One of the basic principles of reliability is that the more measured variables are combined, when you average many measurements, the more reliable the resulting test becomes. This is because, although each measured variable will in some sense be influenced in part by random error, some part of each item will also measure what is called the true score, the part of the scale score that is not random error. For example, whenever you are collecting reaction times for something, with the mental process B(x) in mind that you want to approximate, there will obviously be some error in each measurement; but when you take several measurements over a period of time, the random errors start cancelling each other out, and you get closer to B(x), closer to the true score, which is what you are after.

To repeat myself: since random error is self-cancelling, the random-error components of the measured variables will not be correlated with each other, so they will not figure in the correlations; the parts of the measured variables that represent the true score will be correlated. When you compute correlations over a large number of measurements, the only thing that keeps figuring in the correlations is the true score. Consequently, when several measures are combined, by taking an average or a sum, the combination will produce a more reliable estimate of the conceptual variable than will any of the individual measured variables themselves. Let us talk about how this is done.
Typically, the roles of the true score and the random score (which is basically the error component) can be expressed in the form of two equations, which we were just discussing, and which form the basis of reliability. An individual's score on any measure M(x) consists of both the true score and the random score. The reliability you want to arrive at is the proportion of the actual score that reflects the true score: reliability equals the true score divided by the true score plus the random score, that is, divided by the actual score. That ratio gives you a reliability estimate for your measures.
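Written out explicitly (the M(x) notation follows the lecture; T and E are labels added here for the true and random components, and the variance form is the standard classical-test-theory statement of the same ratio):

```latex
M(x) = T(x) + E(x), \qquad
\text{reliability} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E} = \frac{\sigma^2_T}{\sigma^2_M}
```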
Take an example: the Rosenberg self-esteem scale has around 10 items, each designed to assess the conceptual variable of self-esteem in a slightly different way. Although each of these items will carry some part of the random score, each will also measure some aspect of the true score, some aspect of the self-esteem you wanted to measure. If one averages all 10 of these items together to form a single score, a single measure, what should be expected is that the overall scale score will be a more reliable measure than any of the individual items. That is why you will see that a combined or average score is taken on these scales: the combination tells you more about the conceptual variable you are interested in than a single item does.

Internal consistency, in this case, refers to the extent to which the scores on the items correlate with each other and thus are all measuring the same true score, rather than being affected much by the random score, the random error. On this scale, a person who answers above average on question one (say it measures one aspect of self-esteem, and the answer indicates high self-esteem) should also answer above average on all the other questions, because they are also measuring self-esteem. In that sense, the responses to these different items can, and must, be expected to be correlated. The pattern will obviously not be perfect; it will not be 1.00 all the time, because each item will have some error as well. But to the extent that the items measure the same conceptual variable and assess the same true score, rather than being affected too much by random error, the average correlation among the items will approach r = 1.00, the perfect correlation. And to the extent that the correlation among items is lower, say closer to zero, it tells us that there is too much random error or that the items are not really measuring the same thing.
One way to calculate the internal consistency of a scale is to correlate a person's score on one half of the items with their score on the other half. Say you have a scale of 100 items: you correlate the score on the first half, items 1 to 50, with the score on items 51 to 100. You can also pick the halves randomly, say items 1, 3, 5, 7, 9 in one half and items 2, 4, 6, 8, 10 in the other, and correlate the scores on the two halves. This procedure is known as split-half reliability. If the scale is reliable, and if it is not too much affected by random error, the correlation between the two halves will approach r = 1, indicating that both halves of the scale are measuring one and the same thing, that all the items across the two halves are capturing the same true score. Since split-half reliability uses only some of the available correlations among the items, however, it is preferable to have a measure that indexes the average correlation among all of the items on the scale; that calls for a slightly different way of doing it.
The most common and best index of internal consistency, and the one most widely used, is Cronbach's coefficient alpha. This measure is an estimate of the average correlation among all of the items on the scale and is numerically equivalent to the average of all possible split-half reliabilities. It is, again, a statistical measure that gives you the best estimate of split-half reliability for your scale. Since coefficient alpha reflects the underlying correlational structure of the scale, it ranges from 0.00, indicating that the measure is entirely error, to +1.00, indicating that the measure has no error at all. I am sure it is clear by now that perfect correlations are very rare to find.
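This definition translates directly into code. The sketch below (simulated data, illustration only) computes alpha from the standard variance formula, k/(k-1) multiplied by one minus the ratio of summed item variances to total-score variance:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (participants x items) score matrix."""
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated 10-item scale, 100 respondents: items share a true score
# plus independent random error (illustration only).
rng = np.random.default_rng(1)
true = rng.normal(size=(100, 1))
scores = true + rng.normal(scale=0.7, size=(100, 10))
print(f"alpha = {cronbach_alpha(scores):.2f}")  # nearer 1 as error shrinks
```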
There is also another way to get a handle on internal consistency: the item-total correlation, in which you compare one item against the rest of the scale. When a new scale is being developed, its initial reliability may obviously be low. This is because, even though the researcher has selected items that he or she believes will be reliable, some items will turn out to contain more random error, for reasons the researcher sometimes simply misses while planning and creating the items. A strategy commonly used in the initial development of a scale is therefore to calculate the correlation between the score on each individual item and the total scale score excluding that item. Say I have a scale of 10 items: I take the score on item four separately and the total of the other nine items separately, and I calculate the correlation between them. This is known as the item-total correlation, another interesting and useful method for assessing the internal consistency of your scale and getting the best reliability estimate possible.
Items that do not correlate highly with the rest of the scale, with the overall total score, can then be deleted from the scale. Since this approach deletes the items that do not correlate highly, it leads to a reduction in the number of items. Typically, and you will see this a lot, people create the initial scale with 20, 50, or 100 items; after you have run these tests a few times and computed these correlations, a lot of items get deleted. What you end up with is a much shorter, smaller scale, but one with a higher estimate of reliability, which you can trust more and use more widely.
Another approach that we have not talked about so far arises because individuals, the raters, are also afflicted with errors of their own. Reliability is important not only for self-report scales but also for the behavioral measures we collect from participants. It is therefore a common practice for a number of judges to rate the same observed behaviors, after which we take an average, or otherwise combine their ratings, to create a single measured variable. So just as there are item-level reliability estimates, you also need interrater reliability, so that you have a broad sense of how the raters are responding on a particular item or whatever measure you are interested in. These calculations use the internal consistency approach as well. Just as any single item on a scale is expected to have error, the ratings of one judge, an individual judge, are more likely to contain error than the average rating across a group of judges. The errors of the judges can be caused by several things, including inattention, the time of day, misunderstanding of instructions, or even personal preferences. When the internal consistency of a group of judges is calculated, the resulting reliability estimate is known as interrater reliability. Previously we were talking about how to enhance and magnify the reliability of the items on a scale; we can do the same at the level of the raters as well.
Now, if the ratings of the judges that are combined are quantitative variables, then coefficient alpha can also be used to evaluate reliability. In some cases, however, the kind of measurement you are taking is nominal. Say you are asking judges how the children played: did they play aggressively, did they play cooperatively, did they play alone? That kind of rating does not give you easy numbers. In such cases a different statistic, known as the kappa statistic, is used as a measure of agreement among the judges: how did these judges rate a given behavioral measure? Like coefficient alpha, kappa ranges from 0.00, which indicates that the judges' ratings are entirely made of random error, to +1.00, indicating that the ratings agree perfectly. That is broadly all I wanted to say about reliability. In the next lecture we will talk about how to fight systematic error, and about various ways of addressing construct validity. Thank you.
Reliability refers to the degree to which experimental measurements are free from random errors, ensuring consistent and accurate reflection of the conceptual variables under study. High reliability is crucial because it allows researchers to trust that their results are consistent across repeated tests and truly represent the constructs being measured.
Test-retest reliability assesses stability over time by correlating scores from the same test taken at different points. However, limitations like participant memory of the test, changes in mood or anxiety, and reactivity to testing can lower reliability estimates. To mitigate this, researchers should avoid short intervals between tests and consider state variables when interpreting results.
Internal consistency reliability can be enhanced by increasing the number of test items, ensuring items measure the same construct, and removing or revising items with low item-total correlations. Using measures such as Cronbach's coefficient alpha, split-half reliability, and item-total correlations helps identify and improve consistent item performance within a single test administration.
Interrater reliability is assessed when multiple observers make judgments or ratings about behavior or responses. For quantitative ratings, coefficient alpha is used, while nominal categories use the kappa statistic. Both metrics range from 0 (random agreement) to 1 (perfect agreement), helping quantify agreement between raters and reduce observer-related measurement errors.
Trait variables are stable characteristics like intelligence or optimism and should show high test-retest reliability over short periods. State variables, such as mood or anxiety, fluctuate frequently, leading to lower test-retest reliability. Understanding this distinction helps researchers choose appropriate reliability methods and interpret stability of measurements accordingly.
To enhance reliability, researchers can increase the number of items or measurements to average out random errors, carefully design and revise test items based on item-total correlations, use multiple raters with interrater reliability assessments, provide clear instructions to reduce misunderstanding, and space retests appropriately to minimize memory or practice effects.
Systematic errors arise from consistent confounding influences like social desirability bias or participants’ self-esteem, which skew results in a predictable direction and threaten construct validity. Unlike random errors, which fluctuate and tend to cancel out over multiple measurements, systematic errors bias results consistently, making it essential to identify and control them for accurate experimental conclusions.