Reliability in Research: Definitions, Measurement, & Examples

The term reliability in psychological research refers to the consistency of a quantitative research study or measuring test.

For example, if a person weighs themselves several times during the day, they would expect to see a similar reading each time. Scales that measured weight differently each time would be of little use.

The same analogy could be applied to a tape measure that measures inches differently each time it is used. It would not be considered reliable.

If findings from research are replicated consistently, they are reliable. A correlation coefficient can be used to assess the degree of reliability. If a test is reliable, scores from repeated uses of it should show a high positive correlation.

Of course, it is unlikely the exact same results will be obtained each time as participants and situations vary. Still, a strong positive correlation between the same test results indicates reliability.
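The article does not say which correlation coefficient is meant. Assuming Pearson's r, a common choice for paired scores, the coefficient for n participants with scores x_i on one administration and y_i on the other can be written as:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Here \bar{x} and \bar{y} are the mean scores on each administration. Values of r close to +1 indicate that participants who scored high (or low) the first time also did so the second time.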

There are two types of reliability – internal and external reliability.
  • Internal reliability assesses the consistency of results across items within a test.
  • External reliability refers to the extent to which a measure is consistent from one use to another.

Assessing Reliability

[Table: types of reliability]

Split-half method

The split-half method assesses the internal consistency of a test, such as a psychometric test or questionnaire. It measures the extent to which all parts of the test contribute equally to what is being measured.

This is done by comparing the results of one half of a test with the results of the other half. A test can be split in half in several ways, e.g., the first half and the second half or by odd and even numbers. If the two halves of the test provide similar results, this would suggest that the test has internal reliability.
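The odd/even split described above can be sketched in a few lines of Python. This is a minimal illustration, not a standard procedure from the article: the response data are invented, the function names `pearson_r` and `split_half_r` are hypothetical, and Pearson's r is assumed as the correlation coefficient.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def split_half_r(responses):
    """Correlate each participant's odd-item total with their even-item total."""
    odd_totals = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return pearson_r(odd_totals, even_totals)

# Hypothetical data: 5 participants x 6 items, each item scored 1-5.
responses = [
    [4, 5, 4, 4, 5, 4],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 3],
    [5, 5, 5, 4, 5, 5],
    [1, 2, 1, 1, 2, 2],
]

r = split_half_r(responses)
print(f"split-half r = {r:.2f}")
```

A value of r near 1 for the two half-test totals would suggest the halves are measuring the same thing; splitting into first half and second half instead only changes the two slices.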

The reliability of a test could be improved by using this method. For example, any items on separate halves of a test with a low correlation (e.g., r = .25) should either be removed or rewritten.

The split-half method is a quick and easy way to establish reliability. However, it can only be effective with large questionnaires in which all questions measure the same construct. This means it would not be appropriate for tests that measure different constructs.

For example, the Minnesota Multiphasic Personality Inventory has subscales measuring different constructs, such as depression, schizophrenia, and social introversion. Therefore, the split-half method would not be an appropriate way to assess the reliability of this personality test.


Test-retest

The test-retest method assesses the external consistency of a test. Examples of appropriate tests include questionnaires and psychometric tests. It measures the stability of a test over time.

A typical assessment would involve giving participants the same test on two separate occasions. If the same or similar results are obtained, then external reliability is established. A disadvantage of the test-retest method is that it takes a long time for results to be obtained.
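As a rough sketch of that comparison, the scores from the two occasions can be correlated participant by participant. The scores below are invented for illustration, the helper name `pearson_r` is hypothetical, and Pearson's r is an assumed choice of coefficient.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical questionnaire totals for the same six participants,
# listed in the same order for both occasions.
session_1 = [12, 25, 7, 31, 18, 22]
session_2 = [14, 23, 9, 29, 17, 24]

retest_r = pearson_r(session_1, session_2)
print(f"test-retest r = {retest_r:.2f}")
```

Individual scores shift a little between occasions, but the ordering of participants is largely preserved, which is what a high positive correlation reflects.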

Beck et al. (1996) studied the responses of 26 outpatients at two separate therapy sessions one week apart. They found a correlation of .93, demonstrating high test-retest reliability of the depression inventory.

This is an example of why reliability in psychological research is necessary. Without reliable tests, some individuals may not be successfully diagnosed with disorders such as depression and, consequently, may not be given appropriate therapy.

The timing of the retest is important; if the interval between the two tests is too brief, then participants may recall information from the first test, which could bias the results.

Alternatively, if the interval is too long, it is feasible that the participants could have changed in some important way, which could also bias the results.

Inter-rater reliability

Inter-rater reliability refers to the degree to which different raters give consistent estimates of the same behavior. It can be used for interviews.

Note it can also be called inter-observer reliability when referring to observational research. Here researchers observe the same behavior independently (to avoid bias) and compare their data. If the data is similar, then it is reliable.

Where observer scores do not significantly correlate, then reliability can be improved by:

  • Training observers in the observation techniques and ensuring everyone agrees with them.
  • Ensuring behavior categories have been operationalized. This means that they have been objectively defined.

For example, if two researchers are observing ‘aggressive behavior’ of children at a nursery, they would each have their own subjective opinion regarding what aggression comprises.

In this scenario, it would be unlikely that they would record aggressive behavior in the same way, and the data would be unreliable.

However, if they were to operationalize the behavior category of aggression this would be more objective and make it easier to identify when a specific behavior occurs.

For example, while “aggressive behavior” is subjective and not operationalized, “pushing” is objective and operationalized. Thus researchers could simply count how many times children push each other over a certain duration of time.
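A small sketch of that counting step, under stated assumptions: the observation logs, child identifiers, and the function name `count_behavior` are all hypothetical, and the share of children with identical counts is used here only as a simple illustrative agreement check (the article itself speaks of correlating observer scores).

```python
from collections import Counter

# Hypothetical logs: each entry is (child_id, behavior) as recorded
# independently by two observers during the same session.
observer_a = [("c1", "push"), ("c2", "push"), ("c1", "push"),
              ("c3", "shout"), ("c2", "push")]
observer_b = [("c1", "push"), ("c2", "push"), ("c1", "push"),
              ("c2", "push"), ("c3", "push")]

def count_behavior(log, behavior="push"):
    """Tally how often each child showed the operationalized behavior."""
    return Counter(child for child, seen in log if seen == behavior)

counts_a = count_behavior(observer_a)
counts_b = count_behavior(observer_b)

# Compare the two observers child by child (a missing child counts as 0).
children = sorted(set(counts_a) | set(counts_b))
for child in children:
    print(child, counts_a[child], counts_b[child])

# Proportion of children for whom both observers recorded the same count.
agreement = sum(counts_a[c] == counts_b[c] for c in children) / len(children)
print(f"exact agreement: {agreement:.2f}")
```

In this invented example the observers disagree only about child c3, where one coded the event as shouting and the other as pushing. Disagreements of this kind point to where the behavior category may need a tighter definition.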


References

Beck, A. T., Steer, R. A., & Brown, G. K. (1996). Manual for the Beck Depression Inventory. San Antonio, TX: The Psychological Corporation.

Hathaway, S. R., & McKinley, J. C. (1943). Manual for the Minnesota Multiphasic Personality Inventory. New York: Psychological Corporation.

Olivia Guy-Evans

BSc (Hons), Psychology, MSc, Psychology of Education

Associate Editor for Simply Psychology

Olivia Guy-Evans is a writer and associate editor for Simply Psychology. She has previously worked in healthcare and educational sectors.

Saul Mcleod, PhD

Educator, Researcher

BSc (Hons) Psychology, MRes, PhD, University of Manchester

Saul Mcleod, Ph.D., is a qualified psychology teacher with over 18 years of experience working in further and higher education.