January 28, 2016

Methodological Design

Evidence-based practice is a method of incorporating the best available evidence to support an intervention (physiotherapy, medical, nutritional, etc.). This involves drawing on the scientific literature as well as clinical/field experience. However, this does not mean that we should adopt the literature blindly. We must critically appraise it by assessing its internal and external validity.

Internal validity reflects the quality of the study design, implementation, and data analysis: how well the study minimizes bias so that a true ‘cause and effect’ relationship between an intervention and an outcome can be determined (1). External validity describes the circumstances under which the results of the research can be generalized to the population of interest.

Please note that the majority of the sections listed below are components of the PEDro Scale. There are a few additional sections described below that may add to the quality of a critical appraisal. In no way is the list complete, as there are many factors to consider when critically appraising the literature. 

Literature review
An adequate literature review provides a thorough background of the topic in the introduction. It should include previous research conducted on the topic and a reason why the current study is being conducted.

Participant selection
The participants included in a study are considered a sample of the population of interest. Two forms of sampling can be used to select participants: probability sampling and nonprobability sampling (3). The type of sampling can influence the characteristics of the sample, which in turn influences the generalizability, or external validity, of the results (2).

Eligibility criteria
This is a list of criteria used to determine who was eligible to participate in the study (5). This affects external validity, but not internal or statistical validity (5). The more homogeneous the sample based on the inclusion/exclusion criteria, the lower the external validity (5).

Sample size
This is the number of subjects included in the study from a given population. Larger sample sizes tend to be more representative of the population.

Randomization
Participants are randomly assigned to one of two or more interventions. Randomization minimizes the risk of confounding variables, reduces the risk of bias, and allows for examination of direct relationships. A major limitation of using a randomized controlled trial (RCT) is that it is impossible to obtain a true random sample of the population (2). This can make it difficult to generalize the results.
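As a rough illustration of random allocation, the sketch below assigns participants to arms in shuffled blocks so that group sizes stay balanced. The permuted-block scheme, the function name `block_randomize`, the arm labels, and the seed are assumptions made for this example and do not come from the cited sources; simple (unrestricted) or stratified randomization would be equally valid choices.

```python
import random

def block_randomize(participant_ids, arms=("intervention", "control"), seed=None):
    """Assign participants to arms in shuffled blocks so group sizes stay balanced."""
    rng = random.Random(seed)  # a fixed seed makes the allocation list reproducible
    allocation = {}
    block = []
    for pid in participant_ids:
        if not block:
            # Start a new block containing each arm once, in random order
            block = list(arms)
            rng.shuffle(block)
        allocation[pid] = block.pop()
    return allocation

# Hypothetical trial with 20 participants numbered 1 to 20
allocation = block_randomize(range(1, 21), seed=2016)
counts = {arm: list(allocation.values()).count(arm) for arm in ("intervention", "control")}
print(counts)  # {'intervention': 10, 'control': 10}
```

Blocks as small as two make the next assignment easy to guess, so in practice the allocation list would also need to be concealed from the people enrolling participants.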

Concealed allocation
Allocation is concealed when the person who determines whether a subject is eligible for inclusion in the trial is unaware, when making that decision, of the group to which the subject will be allocated (5). If allocation is not concealed, the decision about whether or not to include a person in a trial could be influenced by knowledge of whether the subject is to receive treatment. This could produce systematic biases in otherwise random allocation (5).

Blinding
Blinding is a method used to prevent study participants, as well as those collecting and analyzing data, from knowing who is in the intervention group and who is in the control group (1). When subjects are blinded, it is less likely that the results of treatment are due to a placebo effect (5). Blinding assessors prevents their personal bias from affecting the results (5).

Baseline comparability
Baseline comparability involves a comparison of the baseline values of the groups (intervention and control). An appropriate randomization should ensure that groups are similar at baseline, so there should be no statistically significant difference between them. Comparing baseline values provides an indication of potential bias arising by chance with random allocation (5). A significant difference between groups may indicate an issue with the randomization procedures (5).

Outcome measures
The outcome measure is the method used to assess the effect of the intervention. The purpose of an outcome measure is to discriminate among subjects at one point in time, predict a subsequent event or outcome, and assess change over time.

Standardized outcome measures
1.      Have explicit instructions for administering, scoring, and interpreting results.
2.      Are supported by measurement properties that have been estimated, reported, and defended in the peer-reviewed literature.
3.      Are valid, reliable, sensitive, and specific.

Measures of validity
1.      The outcome measure assesses what it is intended to measure (face validity)
2.      The outcome measure is appropriate for the population of interest (content validity)
3.      The outcome measure provides results that are consistent with the gold standard (criterion validity)

Measures of reliability
1.      Repeated assessments of one individual provide consistent results over time (test-retest/absolute reliability), among assessors (inter-rater reliability), and within the same assessor (intra-rater reliability)
2.      The outcome measure can determine the degree to which the condition exists (relative reliability)

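Sensitivity and specificity
Sensitivity is the ability of a test to reliably detect the presence of a condition (3). It is calculated as true positives/(true positives + false negatives), that is, the proportion of subjects with the condition who test positive. Therefore, if a highly sensitive test is negative, the subject is unlikely to have the condition (SNOUT: a sensitive test with a negative result rules out).

Specificity is the ability of a test to reliably detect the absence of a condition (3). It is calculated as true negatives/(true negatives + false positives), that is, the proportion of subjects without the condition who test negative. Therefore, if a highly specific test is positive, the subject is likely to have the condition (SPIN: a specific test with a positive result rules in).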
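The two proportions can be computed directly from a 2x2 table of test results against the gold standard. The sketch below uses invented counts and hypothetical function names purely for illustration.

```python
def sensitivity(true_pos, false_neg):
    """Proportion of people WITH the condition whom the test correctly flags."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Proportion of people WITHOUT the condition whom the test correctly clears."""
    return true_neg / (true_neg + false_pos)

# Hypothetical 2x2 table: 100 people with the condition, 100 without
tp, fn = 90, 10   # condition present: test positive / test negative
tn, fp = 80, 20   # condition absent:  test negative / test positive

print(sensitivity(tp, fn))  # 0.9
print(specificity(tn, fp))  # 0.8
```

Note that both denominators are defined by true disease status, not by the test result. Dividing by the total number of positive (or negative) test results would give the predictive values instead.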

Intervention description
The intervention should be described in enough detail to be reproducible. An inadequate description decreases internal validity because it is unclear exactly what led to the change in outcomes.

Adequate follow-up
The number of subjects who completed the trial to provide follow-up data for statistical analysis must be sufficient. The PEDro group states that data collected from a minimum of 85% of subjects increases internal validity (5). It is important that measurements of outcome are made on all subjects who are randomized to groups. Subjects who are not followed up may differ systematically from those who are, and this potentially introduces bias. The magnitude of the potential bias increases with the proportion of subjects not followed up (5).
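A quick check of the follow-up proportion against the 85% figure might look like the sketch below. The trial numbers and the function name are made up for illustration.

```python
def follow_up_rate(randomized, with_outcome_data):
    """Fraction of randomized subjects who contributed outcome data."""
    return with_outcome_data / randomized

# Hypothetical trial: 120 subjects randomized, 99 with outcome data at follow-up
randomized = 120
analysed = 99

rate = follow_up_rate(randomized, analysed)
meets_threshold = rate >= 0.85  # 85% minimum cited from the PEDro group

print(f"{rate:.1%}", meets_threshold)  # 82.5% False
```

It is worth running the same check for each group separately, since unequal dropout between arms is itself a warning sign.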

Intention-to-treat analysis
This is a strategy that ensures that all patients allocated to either the treatment or control groups are analysed together as representing that treatment arm whether or not they received the prescribed treatment or completed the study (1). When patients are excluded from the analysis, the main rationale for randomization is defeated, leading to potential bias (5).
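To make the distinction concrete, the sketch below compares an intention-to-treat analysis with a completers-only (per-protocol) analysis on a small invented data set. The records, field order, and function name are hypothetical, and the example assumes outcomes were still measured for subjects who did not complete treatment.

```python
from statistics import mean

# Hypothetical records: (allocated arm, completed treatment as prescribed, outcome score)
participants = [
    ("treatment", True, 8), ("treatment", True, 7),
    ("treatment", False, 3), ("treatment", False, 2),
    ("control", True, 4), ("control", True, 5),
    ("control", True, 3), ("control", False, 4),
]

def arm_mean(records, arm, completers_only=False):
    """Mean outcome for one arm, grouped by ALLOCATION rather than treatment received."""
    scores = [score for allocated, completed, score in records
              if allocated == arm and (completed or not completers_only)]
    return mean(scores)

# Intention-to-treat: everyone is analysed in the arm they were randomized to
itt_diff = arm_mean(participants, "treatment") - arm_mean(participants, "control")

# Per-protocol: subjects who did not complete treatment are dropped
pp_diff = (arm_mean(participants, "treatment", completers_only=True)
           - arm_mean(participants, "control", completers_only=True))

print(itt_diff, pp_diff)
```

With these invented numbers the completers-only difference is larger than the intention-to-treat difference, which shows how excluding non-completers can inflate the apparent treatment effect.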

Between-group comparisons
A between-group comparison is a statistical comparison of one group with another. It is performed to determine if the difference between groups is greater than can plausibly be attributed to chance (5).
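One way to see what "attributed to chance" means is a permutation test: shuffle the group labels many times and count how often a difference at least as large as the observed one appears. This is only one of several possible tests (a t-test is more common in trial reports) and is not prescribed by the PEDro scale; the data, function name, and defaults below are assumptions for illustration.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided p-value for the difference in means under random relabelling."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical post-intervention pain scores (lower is better)
intervention = [2, 3, 1, 4, 2, 3, 2, 1]
control = [5, 4, 6, 5, 3, 6, 4, 5]

p = permutation_p_value(intervention, control)
print(p < 0.05)  # True for this data
```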

Point estimates (effect size) and variability

Point Estimates
A point estimate or effect size is a value that represents the most likely estimate of the true population value (4). Some examples include the mean difference, regression coefficient, Cohen’s d, and correlation coefficient.
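As a small worked example, the sketch below computes two of these point estimates, the mean difference and Cohen's d, for two invented groups. The pooled-standard-deviation form of Cohen's d is used here; the data and function name are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Hypothetical change scores
intervention = [12, 15, 11, 14, 13, 16]
control = [10, 11, 9, 12, 10, 11]

mean_difference = mean(intervention) - mean(control)  # in the outcome's own units
d = cohens_d(intervention, control)                   # in standard deviation units

print(mean_difference, round(d, 2))
```

The mean difference is easier to interpret clinically, while Cohen's d allows comparison across studies that used different scales.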

Variability
It is important to consider the variability of the effect size (point estimate). A few examples of variability include the standard deviation, the standard error, and a range of values. The standard deviation (SD) is an estimate of the degree of scatter (variability) of individual sample data points about the sample mean (7). The standard error is the standard deviation of the test statistic (point estimate) obtained from all the samples randomly drawn from the population (3). For a mean, the standard error of the mean (SEM) is the SD divided by the square root of the sample size. It quantifies the variability of mean values and is used to derive confidence intervals. Since the SEM is smaller than the SD for any sample of more than one observation, authors sometimes present data as “Mean ± SEM” instead of “Mean ± SD”. This understates the variability of the individual data and can mislead the reader.
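The gap between the two presentations is easy to see on a small invented sample. The sketch below assumes the usual relationship SEM = SD / sqrt(n); the scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical outcome scores from one group
scores = [12, 15, 11, 14, 13, 16, 10, 13]

sd = stdev(scores)              # scatter of individual scores about the sample mean
sem = sd / sqrt(len(scores))    # precision of the sample mean itself

print(f"{mean(scores):.1f} ± {sd:.2f} (SD)")    # 13.0 ± 2.00 (SD)
print(f"{mean(scores):.1f} ± {sem:.2f} (SEM)")  # 13.0 ± 0.71 (SEM)
```

The same data look far less variable when reported with the SEM, which is why a reader should always check which of the two follows the "±" sign.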

Confidence Intervals
It is common for variability to be displayed in a confidence interval (CI). A confidence interval is the range of values that encompasses the population parameter with a given probability (4). The width of the CI depends on the SEM and the degree of confidence we choose (usually 90%, 95%, or 99%). By using a 95% CI we are 95% confident that the true population mean falls within the interval; more precisely, if the study were repeated many times, about 95% of the intervals constructed this way would contain the true population mean. If a 95% CI for a mean change or a between-group difference includes zero, it is possible that the true effect of the intervention in the population is zero. Therefore, the result is not statistically significant at the 5% level.
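The sketch below builds a confidence interval for a difference in means from the standard error and the chosen confidence level. It uses a large-sample normal approximation, which is an assumption of this example; with small samples like the invented ones here, a t-distribution would give a somewhat wider interval. The data and function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def mean_difference_ci(group_a, group_b, confidence=0.95):
    """Normal-approximation CI for the difference in means of two independent groups."""
    diff = mean(group_a) - mean(group_b)
    # Standard error of the difference between two independent means
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for a 95% CI
    return diff - z * se, diff + z * se

# Hypothetical change scores
intervention = [12, 15, 11, 14, 13, 16]
control = [10, 11, 9, 12, 10, 11]

lower, upper = mean_difference_ci(intervention, control)
lower99, upper99 = mean_difference_ci(intervention, control, confidence=0.99)

print(round(lower, 2), round(upper, 2))
print(lower > 0)  # True: the 95% interval excludes zero for this data
```

Asking for more confidence (99% instead of 95%) widens the interval, and a larger standard error does the same.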

Study Limitations
A description of the limitations of the study design and methodology provides transparency, as it should describe potential biases. This includes an explanation of possible threats to internal and external validity.


References
1.      Akobeng A. Understanding randomized controlled trials. Arch Dis Child 2005;90:840-844.

2.      Carter R, Lubinsky J, Domholdt E. Rehabilitation Research. 4th ed. St. Louis, Missouri: Elsevier; 2010.

3.      Gaddis G, Gaddis M. Introduction to biostatistics: part 3, sensitivity, specificity, predictive value, and hypothesis testing. Ann Emerg Med 1990;19:145-151.

4.      Nakagawa S, Cuthill I. Effect size, confidence interval and statistical significance: a practical guide for biologists. Biol Rev 2007;82(4):591-605.

5.      Physiotherapy Evidence Database. PEDro Scale (1999). http://www.pedro.org.au/english/downloads/pedro-scale/. Accessed on January 27, 2015.