## Glossary

Alternative hypothesis---the alternative hypothesis in a study is that there is a relationship between the predictor and response variables.

Anderson-Darling test---a test for the fit of data to a particular distribution, often used to test the normality of data.

The Central Limit Theorem---a fundamental theorem in probability that says the sampling distribution of the mean of any distribution with a well-defined mean and finite variance tends to a normal distribution as the sample size tends to infinity. A consequence of this theorem is that, for large sample sizes, it is possible to use methods based on the normal distribution for inference about means, regardless of the underlying distribution of your response variable.
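
The theorem can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: it draws repeated samples from an exponential distribution with mean 1 (chosen here because it is clearly not normal), and the function name `sample_means`, the sample size and the number of repetitions are all hypothetical choices. If the theorem applies, the sample means should centre near 1 with a standard deviation near 1/sqrt(50).

```python
import random
import statistics


def sample_means(sample_size, repetitions, seed=0):
    """Draw repeated samples from an exponential distribution (mean 1)
    and return the mean of each sample."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(repetitions):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means


means = sample_means(sample_size=50, repetitions=2000)
centre = statistics.fmean(means)   # expected to be close to 1
spread = statistics.stdev(means)   # expected to be close to 1 / sqrt(50)
print(f"mean of sample means: {centre:.3f}")
print(f"sd of sample means:   {spread:.3f}")
```

Plotting `means` as a histogram should give a roughly bell-shaped picture, even though the individual measurements are strongly skewed.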

Centre of a distribution---the centre, or middle, of a distribution can be studied either through the average of the values (the sum of the values divided by the number of entries) or through the median value of the distribution (the value below which half of the values lie).
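
As a small illustration of how the two measures of centre can differ, the sketch below computes both for a made-up sample containing one very large value. The data and variable names are hypothetical.

```python
import statistics

# Hypothetical sample: six similar values and one extreme value.
incomes = [21, 23, 24, 25, 27, 29, 250]

mean_income = statistics.mean(incomes)      # sum divided by number of entries
median_income = statistics.median(incomes)  # value with half the data below it

print("mean:", mean_income)
print("median:", median_income)
```

Here the average is pulled well above most of the data by the single extreme value, while the median stays in the middle of the bulk of the values.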

Confidence interval---a p% confidence interval for a parameter in a statistical model is an interval estimate of this parameter that can be calculated from any sample from the population. There is a p% chance of choosing a random sample from the population that gives a confidence interval containing the true value of the parameter.
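
A rough sketch of how such an interval might be calculated for a population mean is given below. It assumes a normal approximation (a z critical value), which is only reasonable for larger samples; for small samples a t-based interval would normally be preferred. The function name `mean_confidence_interval` and the data are hypothetical.

```python
import math
import statistics


def mean_confidence_interval(data, level=0.95):
    """Approximate confidence interval for the mean, using the normal
    approximation (an assumption; small samples usually call for a t value)."""
    n = len(data)
    mean = statistics.fmean(data)
    standard_error = statistics.stdev(data) / math.sqrt(n)
    # Critical value leaving (1 - level) / 2 in each tail of the normal curve.
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * standard_error, mean + z * standard_error


measurements = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
lower, upper = mean_confidence_interval(measurements)
print(f"95% interval for the mean: ({lower:.2f}, {upper:.2f})")
```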

Correlation---a measure of the association between two quantitative variables.
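
One common measure of this kind is the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below computes it directly from its definition; the function name `pearson_correlation` and the data are hypothetical, and other measures (such as rank correlations) may suit some data better.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_correlation([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(f"r = {r:.2f}")
```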

Count---a variable whose possible values are counts, such as number of children a subject has.

Degrees of freedom---a measure of the amount of data available for building statistical models.

Effect size---the effect size in a study is a measure of the degree of influence of a predictor variable on a response variable. For instance, when comparing the means of two groups, the effect size would be the true difference in means between the two populations. Effect sizes are generally estimated statistically by confidence intervals calculated from samples.

Experiment---a study in which experimental units are randomised into one of several treatments or a control group and one or more interventions are carried out on the various groups.

Experimental unit---the units on which the measurements in a study have been made. These may be called participants or subjects if they are people, or may be called cases or simply units.

Exponential variable---a variable whose values tend to follow an exponential distribution, with most measurements expected near zero, and the expected number of measurements farther out getting smaller and smaller.

Fixed effect---a nominal variable is considered as a fixed effect if your research question is only about those levels of the variable you have sampled, or where you have sampled all possible levels. This is as opposed to a random effect, which is a nominal variable where you have not sampled all levels, but wish to draw conclusions about all levels.

General linear model---a linear equation relating a normally distributed quantitative response variable to one or more predictor variables and a normally distributed random error.

Generalised linear model---an equation relating a parameter in a family of distributions modelling a response variable to a linear combination of predictor variables.

Groups---the sets of experimental units at the various levels of a nominal predictor variable. For instance, in a study of the relationship of gender to height, the predictor variable is gender, which has levels male and female. Thus the data would be in two groups, the heights of men and the heights of women.

Histogram---a visual representation of a quantitative variable on a sample, with bars representing certain ranges of values of the variable. The height of the bar over a given range represents the number of measurements in that range from the sample.

[ Generating a Histogram in R ] [ Generating a Histogram in SPSS ]
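
As a language-neutral illustration of what a histogram counts, the sketch below tallies a made-up sample into equal-width ranges and prints a simple text version of the bars. The function name `histogram_counts`, the bin settings and the data are hypothetical.

```python
def histogram_counts(values, start, bin_width, n_bins):
    """Count the values falling in each range
    [start + k * bin_width, start + (k + 1) * bin_width)."""
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // bin_width)
        if 0 <= index < n_bins:  # values outside all ranges are ignored
            counts[index] += 1
    return counts


heights_cm = [152, 158, 161, 163, 164, 167, 168, 171, 174, 183]
counts = histogram_counts(heights_cm, start=150, bin_width=10, n_bins=4)
for k, count in enumerate(counts):
    lower = 150 + 10 * k
    print(f"[{lower}, {lower + 10}): {'#' * count}")
```

Each row of `#` symbols plays the role of one bar, with its length giving the number of measurements in that range.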

Hypothesis test---many statistical tests involve calculating a p-value and deciding whether or not to reject a null hypothesis based on this p-value. This process is called hypothesis testing.

Levels of a variable---the possible values a nominal or ordinal variable may take. For instance, for the variable “gender” the levels are “male” and “female”. For a Likert-type question, the levels could be “strongly disagree”, “disagree”, “neutral”, “agree” and “strongly agree”.

Likert scale data---data, often measuring opinions, on a scale of the type from “strongly disagree” to “strongly agree”. Commonly, Likert scale data is given on a three to seven point scale, that is, with three to seven possible responses.

Median---the value below which half of the values of a variable lie.

Nominal variable---a nominal variable is a measurement on a population that takes values that are labelled by words, such as “male” and “female”. The values of a nominal variable cannot be put into any meaningful order. Nominal variables are also sometimes called categorical variables.

Normal variable---a variable whose values tend to follow a normal distribution, with most measurements near the average value, and the expected number of measurements decreasing as you move in either direction away from the average.

Null hypothesis---the null hypothesis in a study is that the predictor and response variables are not related to each other.

Observational study---a study in which one or more measurements are made on a sample, but no treatment is given.

Ordinal variable---a variable whose values may be put into a meaningful order, but where differences between levels are not meaningful, such as “Stage of cancer: I, II, III, IV”.

Parameter---a constant in a statistical model that is to be determined from the data collected.

Poisson variable---a variable whose values tend to follow a Poisson distribution. Count data is often modelled as a Poisson variable.

Population---the larger set of possible units from which your sample is drawn and about which you would like to draw conclusions based on the evidence from your sample.

Power---the power of a statistical test is the chance of obtaining a significant p-value when there is in fact a true effect in the population. Equivalently, 1 - power is the probability of failing to reject the null hypothesis when a true effect exists.
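
Power can be estimated by simulation: generate many samples with a known true effect and count how often the test comes out significant. The sketch below does this for a two-sided two-sample z-test, under the simplifying assumptions that both groups are normally distributed with a known standard deviation of 1. The function name `simulated_power` and all settings are hypothetical.

```python
import math
import random
import statistics


def simulated_power(effect, n_per_group, alpha=0.05, repetitions=2000, seed=1):
    """Estimate the power of a two-sided two-sample z-test by simulation.
    Assumes normal data with known standard deviation 1 in both groups."""
    rng = random.Random(seed)
    normal = statistics.NormalDist()
    standard_error = math.sqrt(2 / n_per_group)
    significant = 0
    for _ in range(repetitions):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, 1.0) for _ in range(n_per_group)]
        z = (statistics.fmean(treated) - statistics.fmean(control)) / standard_error
        p_value = 2 * (1 - normal.cdf(abs(z)))
        if p_value < alpha:
            significant += 1
    return significant / repetitions


for effect in (0.0, 0.5, 1.0):
    print(f"true difference {effect}: estimated power {simulated_power(effect, 30):.2f}")
```

With a true difference of zero, the proportion of significant results should sit near the significance level; it should rise towards 1 as the true difference grows.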

Predictor variable---a variable that influences the values of another variable, called the response variable. Predictor variables are also sometimes called independent variables or covariates.

Proportion---for a nominal variable, the goal of statistical analysis may be an estimate of the proportion of the population at a particular level of that variable. For instance, a study of a sample of undergraduate engineering students may wish to determine the proportion of women in undergraduate engineering programs in the UK.

QQ plot---a “quantile-quantile plot”, which is a visual method for examining the fit of a particular distribution to data. Often used to examine the normality of data.

[ QQ Plot in R ] [ QQ plot in SPSS ]
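
To show what a QQ plot actually compares, the sketch below computes the pairs of points for a normal QQ plot: each sorted observation is matched with the corresponding quantile of a normal distribution fitted to the data. The plotting positions `(i + 0.5) / n` are one common convention among several, and the function name `normal_qq_points` and the data are hypothetical.

```python
import statistics


def normal_qq_points(data):
    """Return (theoretical quantile, observed value) pairs for a normal QQ plot."""
    ordered = sorted(data)
    n = len(ordered)
    # Normal distribution with the same mean and standard deviation as the data.
    fitted = statistics.NormalDist(statistics.fmean(ordered), statistics.stdev(ordered))
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


sample = [5.1, 4.9, 5.6, 5.0, 4.7, 5.3, 5.2]  # hypothetical measurements
for expected, observed in normal_qq_points(sample):
    print(f"expected {expected:.2f}   observed {observed:.2f}")
```

If the data are close to normal, the pairs lie near the line where expected and observed values are equal.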

Quantitative variable---a measurement whose possible values are numbers in some interval. Quantitative variables are also sometimes called scale or interval variables.

Random effect---a nominal variable is a random effect if you have not sampled all levels of the variable, but hope to draw conclusions about all levels. This is as compared to a fixed effect, which is a nominal variable for which all levels have been sampled, or for which you only hope to draw conclusions about those levels that have been sampled.

Regression line---the line of best fit to a scatterplot of a quantitative predictor variable against a quantitative response variable.

Repeated measure---a particular measurement done several times on each experimental unit under different conditions. For instance, researchers studying weight loss may weigh all participants at the beginning of an exercise regime, and then every week for five weeks during the regime.

Residuals---the differences between the values of a response variable on the sample and the values predicted from the predictor variable values by the regression line or more general model.
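
The regression line and its residuals can be illustrated with a short least-squares calculation. This is a minimal sketch for a single quantitative predictor; the function names `fit_line` and `residuals` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(xs, ys, slope, intercept):
    """Observed response minus the value predicted by the line."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


hours = [1, 2, 3, 4, 5]         # hypothetical predictor values
scores = [52, 55, 61, 64, 68]   # hypothetical response values
slope, intercept = fit_line(hours, scores)
errors = residuals(hours, scores, slope, intercept)
print(f"score = {intercept:.1f} + {slope:.1f} * hours")
print("residuals:", [round(e, 1) for e in errors])
```

For a least-squares line with an intercept, the residuals sum to zero, which makes a quick sanity check on the fit.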

Response variable---a variable whose values are influenced by other values, called the predictor variables. Response variables are also sometimes called dependent variables, primary variables or endpoints.

Sample---the particular set of experimental units or participants on which measurements were done to collect data.

Scatter plot---a visual representation of the relationship of a quantitative predictor variable to a response variable, given by plotting pairs of predictor and response values for the various units in the sample against axes representing all possible values of the predictor variable on the horizontal axis and all possible values of the response variable on the vertical axis.

Significance level---the significance level of a test is the probability of obtaining a significant result purely from random variation when there is in fact no real effect, that is, the probability of rejecting the null hypothesis when it is true. If the p-value calculated from a test on a sample is below the significance level, we reject the null hypothesis in favour of the alternative hypothesis and say the result is significant. A standard significance level is 0.05 (5%), which corresponds to a 95% confidence level.

Spread of a distribution---any measurement of the tendency of measurements to vary from the centre. Examples of statistics measuring spread are variance, standard deviation and interquartile range.
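
The three statistics named above can be computed as in the sketch below. Note that several conventions exist for calculating quartiles; the one used here is simply the default of Python's `statistics.quantiles`, and other software may give slightly different values. The data are hypothetical.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

variance = statistics.variance(data)        # sample variance (divides by n - 1)
standard_deviation = statistics.stdev(data)  # square root of the variance
q1, _median, q3 = statistics.quantiles(data, n=4)
interquartile_range = q3 - q1                # width of the middle half of the data

print(f"variance: {variance:.2f}")
print(f"standard deviation: {standard_deviation:.2f}")
print(f"interquartile range: {interquartile_range:.2f}")
```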

Statistical model---either a family of distributions that is assumed to represent the distribution of a particular variable or an equation relating variables.

Stem and leaf plot---a visual representation of quantitative data in which higher place values (the “stem” values) are listed vertically down the left side, and the lower place values of the datapoints (the “leaf” values) are listed to the right of the appropriate stem value.

[ Stem and Leaf Plot in R ] [ Stem and Leaf Plot in SPSS ]
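
The layout of a stem and leaf plot can be sketched in a few lines. The version below assumes non-negative whole numbers, with the tens as stems and the units as leaves; the function name `stem_and_leaf` and the data are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the rows of a stem and leaf plot for non-negative integers,
    using tens as stems and units as leaves."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        rows[stem].append(leaf)
    lines = []
    # Include stems with no leaves so gaps in the data stay visible.
    for stem in range(min(rows), max(rows) + 1):
        leaves = " ".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
    return lines


ages = [12, 15, 21, 24, 24, 38, 40]
print("\n".join(stem_and_leaf(ages)))
```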

Study---any form of research in which data is collected and analysed to answer research questions.

Tiered sample---a sample that is selected at various levels. For instance, a selection of UK schoolchildren may be obtained through a tiered sample through first selecting a sample of towns from the UK, then from each town selecting a sample of schools, then from each school selecting a sample of children. The levels in this sampling regime are then town, school and child. The levels above the bottom level in a tiered sample must be treated as random effect nominal variables in the analysis.

Time to event---data that records the time from some common starting point until some event of interest occurs. For instance, data may be collected on the time from diagnosis with lung cancer until death from lung cancer.

Transformation of a variable---the application of some formula to all measured values of a variable. For example, it is common to take the logarithm of certain values before carrying out inference. Often a variable is transformed to make its distribution normal in form, and thereby to permit the use of normal inference methods on data that did not originally follow a normal distribution.
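
The effect of a logarithmic transformation can be seen on a small made-up example. In the sketch below the raw values double at each step, so they are strongly skewed, while their base-2 logarithms are evenly spaced. The data are hypothetical and chosen to make the effect obvious; real data will rarely transform this cleanly.

```python
import math
import statistics

raw = [1, 2, 4, 8, 16, 32, 64]           # hypothetical skewed measurements
logged = [math.log2(x) for x in raw]     # the same formula applied to every value

print("raw:    mean", round(statistics.mean(raw), 2), " median", statistics.median(raw))
print("logged: mean", statistics.mean(logged), " median", statistics.median(logged))
```

Before the transformation the mean sits well above the median, a sign of skew; afterwards the two agree, as they would for a symmetric distribution.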

Variable---any measurement made on experimental units in a sample.