The collection of scientifically evaluable data has to be planned carefully. Every scientific data collection starts with raising a question. Pilot studies, previous knowledge and literature data can help us raise an adequate question. Our goal is to formulate a scientific hypothesis and to derive predictions from it. These predictions are specific statements that can be tested statistically (Précsényi et al., 2000).
The measured variables that will be used to test the predictions have to be defined precisely before data collection, and this definition has to be applied consistently throughout data collection (Martin and Bateson, 1993). Defining the variables is not always a straightforward or easy process. It is easy to define body mass and its measurement, but the situation is more difficult if we intend to measure a behaviour that includes complicated, variable components, such as a fight or courtship between individuals. In such cases it is not always obvious when a given behaviour starts and ends, what its intensity is, etc.
The behaviour of animals is characterized by natural variability. This variability is the result of several factors: genetic factors, biotic and abiotic environmental effects and their interactions all shape the behaviour of individuals (Székely et al., 2010). Because of this variability, our measurements contain ‘noise’ that cannot be controlled for. Therefore, to collect statistically evaluable data, several measurements have to be taken. During data collection we have to pay utmost attention to random sampling (Zar, 2010): every individual in the group to be investigated (the statistical population, not necessarily identical to the biological population) should have the same chance of being measured (the statistical sample). If random sampling (i.e. the temporal and spatial independence of the sample elements) is not assured during data collection, then the data will be pseudoreplicated, and the conclusions drawn from the analysis may be incorrect. It is easy to see that by measuring the height of the same person twice we do not obtain two independent data points; assuring spatial independence, however, is not always so simple (e.g. within a group, more similar individuals may be closer together than more dissimilar ones). Furthermore, measurements of relatives (e.g. siblings) are also not independent, firstly because relatives share genes, and secondly because they may have developed in the same social environment.
Variables describing behaviour can usually be divided into four categories (Fig. 20.1, Martin and Bateson, 1993). Latency variables measure the time from the beginning of sampling until the occurrence of the behaviour. Occurrence or frequency variables measure whether the behaviour occurs, or how many times it occurs, during a unit of time, e.g. a minute. Duration variables measure the length of each occurrence of the behaviour. If the behaviour occurs several times during the recording, then a total duration and an average duration can be calculated for the full sample. If not only the occurrence but also the extent of the behaviour (e.g. the volume of a call, the speed of running) has to be described, then we use intensity variables.
Fig. 20.1. Latency, frequency, duration and intensity. The grey rectangles represent the occurrence of the behaviour over time t. The width of the rectangles is the length of each occurrence, whereas the height is the intensity of the behaviour. The frequency of the behaviour over time t is four. The total duration is a + b + c + d, and the average duration is (a + b + c + d)/4. Based on Martin and Bateson (1993).
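The four variable types can be derived directly from a timestamped record of behaviour bouts. Below is a minimal Python sketch with made-up data, assuming a hypothetical format where each bout is stored as a (start, end, intensity) tuple:

```python
# Each bout of the behaviour: (start_s, end_s, intensity) - hypothetical format, made-up data.
bouts = [(5.0, 8.0, 2.0), (12.0, 15.0, 3.0), (20.0, 26.0, 1.0), (40.0, 43.0, 2.0)]
observation_time = 60.0  # total sampling time t in seconds

latency = bouts[0][0]                        # time until the first occurrence
frequency = len(bouts) / observation_time    # number of occurrences per unit time
durations = [end - start for start, end, _ in bouts]
total_duration = sum(durations)              # a + b + c + d in Fig. 20.1
mean_duration = total_duration / len(bouts)  # (a + b + c + d)/4
mean_intensity = sum(i for _, _, i in bouts) / len(bouts)
```

With the data above, latency is 5 s, frequency is 4 occurrences per 60 s, and the mean bout duration is 3.75 s.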
Before data collection, we also have to decide on which scale each variable will be measured (Fig. 20.2), because the scale of measurement largely determines which statistical procedures can be used to analyse the collected data.
Fig. 20.2. Types of variables according to their scale of measurement.
Behaviour can be recorded continuously, or only at given time intervals (e.g. every ten seconds; instantaneous sampling). While continuous recording can describe behaviour very precisely, it can be used to record only a few variables simultaneously. As the number of recorded variables increases, the accuracy of continuous recording decreases, so in such cases it is better to use instantaneous sampling. During instantaneous sampling, with the help of a stopwatch or, better, a timer (a device giving a short beep at given time intervals), we record which behaviour occurs at each sampling point. The accuracy of instantaneous sampling is largely determined by the sampling interval, i.e. the time elapsed between sampling points. For swiftly changing behaviours (e.g. a fight between individuals) rather short intervals of a few seconds have to be used, whereas it may be enough to record the behaviour of resting individuals only every minute.
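The effect of the sampling interval can be illustrated with a short sketch. Assuming a hypothetical record format in which a continuous observation is stored as (start, end, behaviour) intervals, instantaneous sampling simply notes which behaviour occurs at each sampling point:

```python
# Continuous record of one individual: (start_s, end_s, behaviour) - hypothetical format.
record = [(0, 14, "rest"), (14, 22, "feed"), (22, 60, "rest")]

def behaviour_at(t, record):
    """Return the behaviour occurring at time t (intervals are [start, end))."""
    for start, end, behaviour in record:
        if start <= t < end:
            return behaviour
    return None

# Instantaneous sampling: note the behaviour at every 10-second sampling point.
interval = 10
samples = [behaviour_at(t, record) for t in range(0, 60, interval)]
```

With a 10-second interval the 8-second feeding bout is caught by only a single sampling point (t = 20 s); as noted above, swiftly changing behaviours require a shorter interval.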
The simplest way to record behaviour is with paper and pencil or pen. For continuous recording even an empty sheet of paper may be appropriate, whereas for instantaneous sampling usually a behavioural sheet prepared beforehand is used. The header of the behavioural sheet contains the name of the observer, the date, the start and end of the recording, the identification of the observed individual(s) (e.g. name, ring number), and further data (e.g. temperature). The behavioural sheet itself is a table whose rows are the sampling points and whose columns are either different behavioural variables (feeding, preening, etc.) or different individuals (male, female, offspring 1, offspring 2, etc.). If the columns are behavioural variables, then at each sampling point we can indicate which behaviour occurs by writing e.g. an X in the corresponding column. If the columns represent individuals, then we can indicate the behaviour of the different individuals using previously defined one- or two-letter abbreviations. The biggest advantage of recording behaviour with paper and pencil is that they can be used almost anywhere, any time, and there is no chance of technical failure. The disadvantage of this method is that before analysis the data have to be entered into a spreadsheet or database, which can be a time-consuming process. Manual data entry can be avoided by using an event recorder. Any kind of portable computer (smartphone, tablet, laptop) can serve as an event recorder: running an appropriate application, we can record which behaviour occurs by hitting predefined key combinations or by touching the appropriate part of the screen. With an event recorder we can effectively record behaviours that consist of well-defined categories; however, it may be much more difficult to add comments to the sampling points than to jot a quick note on the margin of a behavioural sheet.
Behaviour may also be recorded on video; however, this method again requires time-consuming later coding of the data into a database. Video recordings have the advantage that if new questions arise later during the study, further, previously unplanned variables can be obtained by re-watching the footage. Their disadvantage is that one can often see less on footage than in real time, so some details of the behaviour may not be visible. This is especially true for time-lapse videos, where only one or a few frames are taken per second, e.g. because of limited data storage.
Behavioural data can also be recorded by automatic devices. For example, an electronic scale can be placed under the nest of birds to describe the parents' feeding activity based on the body mass differences of the sexes (Szép et al., 1995). Another possibility is to glue small RFID (Radio Frequency IDentification) tags (transponders) to the birds, and record the unique identification codes of the tags with a computer-controlled reader connected to an antenna placed under the nest or at the entrance of the nestbox (Kosztolányi and Székely, 2002). The advantage of automatic recording systems is that large amounts of data (even data from several days) can be collected and logged directly, so there is no need for time-consuming data entry. Their disadvantage, however, is that these systems are usually complicated, the result of long planning processes, and because of their complexity the probability of failure may also be high. Furthermore, before data recording we have to make sure that the automatic system estimates the true behaviour well, that is, that the data collected by the system are in accordance with data collected by an observer.
Measurements are subject to two kinds of error: systematic and random (Fig. 20.3). Systematic error is the difference between the true value of the variable and its measured value, i.e. it concerns the validity of the measurement, whereas random error covers the errors occurring during measurement, i.e. it concerns the reliability of the measurement (Martin and Bateson, 1993). For example, a systematic error occurs if a thermometer always shows 3 degrees less than the actual temperature because it was miscalibrated (the zero line was drawn at +3 °C). A random error occurs if the scale on our thermometer is marked only at every 5 °C, so our readings are not accurate and repeated readings do not agree.
Observers can be regarded as instruments that measure a given parameter of the behaviour in the same way, based on the same principles. To return to the thermometer example: just as there can be systematic error between two thermometers because one of them is miscalibrated, there can be systematic differences between two observers because, for example, they interpret and apply the definitions consistently differently. Likewise, just as there can be random error in the values read from two thermometers with different scaling, there can be random error between two observers because, for example, one of them is less experienced or less focused, and thus the data collected by that observer contain more errors.
Therefore, if our data were collected by several observers, then before data analysis we have to check whether the agreement between the sets of data collected by the different observers is adequate (inter-observer agreement or reliability; Martin and Bateson, 1993). To test this, two observers have to evaluate the same behaviour sequence in real time or from video footage, and we compare the resulting data.
The reliability of data collection has to be checked even when data were collected by only one observer. In this case, we examine the degree of agreement of the observer with himself/herself (intra-observer agreement or reliability): the observer evaluates the same behaviour sequence twice and we analyse the agreement between the two codings.
Even if all data were collected by one observer, it may be worth testing inter-observer agreement by including an independent observer. This way it can be detected whether the data collected by our single observer contain systematic errors, similarly to the case when we collect all data with a miscalibrated thermometer.
There are several methods to measure the reliability between observers (Martin and Bateson, 1993). Here we review the three most often applied methods.
The degree of agreement can often be estimated by the correlation between the two sets of data. The degree of association between two datasets is measured by the correlation coefficient (r), whose value can vary between -1 and +1. If r = +1, there is full agreement between the two datasets. With decreasing r the degree of agreement decreases, and if r = 0 there is no linear association between the two datasets. If r < 0, the two datasets describe the given behaviour in opposite ways.
If the variable follows a normal distribution, the Pearson correlation coefficient (r) is used; otherwise the Spearman rank correlation coefficient (rs) can be used, whose value can also vary between -1 and +1.
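Both coefficients can be computed without specialised software. The sketch below (made-up observer scores) computes Pearson's r directly from its definition and Spearman's rs as the Pearson correlation of the ranks, averaging the ranks of tied values:

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient r."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(v):
    """Rank the values (1-based), averaging the ranks of ties."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation rs: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Scores given by two observers to the same ten behaviour sequences (made-up data).
obs_a = [3, 5, 2, 8, 7, 4, 6, 9, 1, 5]
obs_b = [2, 6, 2, 7, 8, 4, 5, 9, 2, 6]
r = pearson(obs_a, obs_b)
```

With these made-up scores r is well above the 0.7 threshold discussed below, so the two observers' data could be combined.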
If we calculate the correlation coefficient with software, then usually the statistical significance (p) is also reported, which refers to the divergence of the coefficient from zero. It is important to emphasize that the p value in itself does not say much about the degree of agreement, because the significance level of a given correlation coefficient decreases sharply with increasing sample size. We usually consider the association between the observers' datasets acceptable if r ≥ 0.7. If r is lower than this, then we cannot combine the data collected by the two observers, and either the definitions used for coding have to be refined or the coding experience of the observers has to be improved.
It is important that independent data pairs be used for the calculation of the correlation. That is, it is not correct to calculate the correlation from one sample, i.e. from data gained from sections of a single behaviour sequence (e.g. one video footage); data from separate samples have to be used. Furthermore, the sections used for the calculation should be random and representative of our sample; otherwise we can easily obtain misleadingly high agreement between two observers, for example if we choose sections where the behaviour in question does not occur at all or occurs continuously.
The index of concordance is usually used if the investigated variable was measured on a nominal or ordinal scale (Table 1). To calculate the index of concordance we first count the cases where the coding of the two observers agrees (A) and the cases where it disagrees (D), then we calculate the index KI = A/(A + D), which can take values between 0 and 1.
A shortcoming of the index of concordance is that it does not take into account that some agreement between the observers can occur just by chance, and therefore it may overestimate their agreement.
In contrast to the index of concordance, the calculation of Cohen’s Kappa takes the agreement expected by chance into account: κ = (KI – V)/(1 – V), where KI is the index of concordance, i.e. the proportion of agreement between the two observers, and V is the proportion of agreement expected by chance.
Let’s assume that two observers (‘A’ and ‘B’) analysed a 10-minute-long footage, sampling every 10 seconds (n = 60 sampling points in total). The observers recorded whether a dog on the footage barks or not. At 30 of the 60 sampling points both of them found that the dog barked, and at 25 both found that the dog did not bark. Thus the index of concordance is KI = (30 + 25)/60 = 0.917.
To determine the value of V, we first need to count how many times out of the 60 cases observer ‘A’ coded barking (A+) and non-barking (A–), and similarly the values B+ and B– for observer ‘B’ (Fig. 20.4).
Fig. 20.4. Calculation of Cohen’s Kappa for a sample consisting of n = 60 sampling points.
The probability that the two observers code the same category, assuming independence, is V = A+/n × B+/n + A–/n × B–/n = 33/60 × 32/60 + 27/60 × 28/60 = 0.293 + 0.210 = 0.503. Therefore Cohen’s Kappa is κ = (0.917 – 0.503)/(1 – 0.503) = 0.833. This is considerably lower than the index of concordance, and shows that almost 10% of the agreement between the observers is due to chance.
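The worked example can be reproduced in a few lines of Python. The off-diagonal counts (3 and 2, i.e. the points where only one observer coded barking) follow from the marginal totals A+ = 33 and B+ = 32:

```python
# Agreement counts from the worked example: 60 sampling points, two observers.
both_bark = 30  # both observers coded barking
both_no = 25    # both observers coded no barking
a_only = 3      # only 'A' coded barking (A+ = 33 minus the 30 joint cases)
b_only = 2      # only 'B' coded barking (B+ = 32 minus the 30 joint cases)
n = both_bark + both_no + a_only + b_only  # 60 sampling points

# Index of concordance: proportion of agreeing sampling points.
KI = (both_bark + both_no) / n

# Chance agreement V from each observer's marginal frequencies.
a_plus, a_minus = both_bark + a_only, both_no + b_only  # 33 and 27
b_plus, b_minus = both_bark + b_only, both_no + a_only  # 32 and 28
V = (a_plus / n) * (b_plus / n) + (a_minus / n) * (b_minus / n)

# Cohen's Kappa corrects the index of concordance for chance agreement.
kappa = (KI - V) / (1 - V)
```

With unrounded intermediate values kappa comes out as about 0.832; the text's 0.833 results from rounding KI and V before the final division.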
Similarly to the correlation coefficient, for Cohen’s Kappa too there is no objective threshold above which the agreement is considered appropriate. Kappa values above 0.6 may usually be considered acceptable; however, the closer the value is to 1, the more reliable the analysis of the behavioural variable is.
After checking the reliability of our data, we can start the data analysis, whose first step is the description of the data (Fig. 3). The localization of the collected sample, that is, where the sample sits on the axis, is most often characterized by the mean. Another often used descriptive statistic of localization is the median, the middle value of the rank-ordered sample.
The dispersion of our sample, that is, how widely the data are spread on the axis, is most often characterized by the standard deviation (s) or its square, the variance (s² = Σ(xi − x̄)²/(n − 1)), where the xi are the individual data points, x̄ is the mean of the sample, and n is the sample size. The dispersion of the data can also be characterized by the interquartile range, which contains the middle 50% of the data and is calculated as IQ = Q3 − Q1, where Q3 is the upper and Q1 the lower quartile, i.e. the medians of the two sub-samples split by the median of the full dataset (Fig. 20.5).
Fig. 20.5. Localization and dispersion of the sample: the time spent on the nest by the male and the female in an imaginary bird species (n = 10 pairs). On the boxplot the middle line is the median (M), the bottom and top of the box are the lower (Q1) and upper (Q3) quartiles, and the “whiskers” extend to the minimum and maximum values within the ranges Q1 − 1.5 × (Q3 − Q1) and Q3 + 1.5 × (Q3 − Q1). Values outside these ranges are called outliers or extreme values and are depicted with a dot or asterisk.
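All of these descriptive statistics are available in Python's standard statistics module; the sketch below uses made-up nest attendance times. Note that statistics.quantiles uses an interpolation convention that can differ slightly from the median-of-halves quartile definition above for small samples:

```python
import statistics

# Time (minutes) spent on the nest by ten males of an imaginary bird species (made-up data).
male = [12, 15, 11, 18, 14, 16, 13, 17, 15, 14]

mean = statistics.mean(male)       # localization: mean
median = statistics.median(male)   # localization: median
s = statistics.stdev(male)         # dispersion: sample standard deviation
var = statistics.variance(male)    # dispersion: variance (s squared)

# Interquartile range IQ = Q3 - Q1 from the quartiles.
q1, _, q3 = statistics.quantiles(male, n=4)  # Q1, median, Q3
iq = q3 - q1
```

For this sample the mean and the median coincide (14.5 minutes), which already hints that the distribution is roughly symmetric.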
The statistical hypothesis is different from the scientific one. The scientific hypothesis is a logical framework based on our previous knowledge, from which predictions can be derived. The statistical hypothesis, in turn, is a simple pair of statements about a characteristic of the statistical population. The null hypothesis (H0) states the absence of a difference, whereas the alternative hypothesis (HA or H1) states its presence. The members of the hypothesis pair should be mutually exclusive: if H0 is not true then HA should be true, and vice versa.
During statistical testing we first calculate the value of the test statistic from our sample (most often with statistical software). From this value we can determine the probability (p) of obtaining a test statistic value at least this extreme if H0 is true. If this is highly unlikely, that is, if p is small, then we reject H0 and accept HA. If the probability is high, we retain H0. The probability used as the threshold for rejecting H0 is the level of significance, denoted by α. In biology the widely accepted level of significance is α = 0.05; that is, p ≤ α values are considered significant.
It is typical of the distribution of variables measured on an interval or ratio scale that the values aggregate near the mean, and the further from the mean, the fewer values can be found (“bell-shaped curve”). Not all distributions of this kind are normal, but many biological variables are close to normally distributed. According to the central limit theorem, if several random samples are drawn from a non-normally distributed population, then the distribution of the sample means converges to the normal distribution. Biological variables are usually the result of several different factors, which is why they tend to converge to the normal distribution.
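The central limit theorem can be demonstrated with a short simulation. Here we draw many samples of size 30 from a strongly skewed (exponential) population with mean 1, a scenario chosen purely for illustration:

```python
import random
import statistics

random.seed(1)  # fixed seed so the simulation is reproducible

# Skewed (exponential) population with mean 1: clearly non-normal.
def sample_mean(n):
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

# Draw many random samples of size 30 and collect their means.
means = [sample_mean(30) for _ in range(2000)]

# The sample means cluster around the population mean (1.0), and their
# spread is roughly 1/sqrt(30) of the population's, as the theorem predicts.
centre = statistics.mean(means)
spread = statistics.stdev(means)
```

Plotting a histogram of `means` would show an approximately bell-shaped curve, even though the underlying population is strongly skewed.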
Before conducting parametric statistical tests (see 2.11) we have to make sure that the assumption of normality is fulfilled. The most often used normality test is the Kolmogorov-Smirnov test, which, however, has very low power, so its application is not recommended. Another often used normality test is the Shapiro-Wilk test, which has high power if the data do not contain equal values. However, equal values (ties) often occur in biological data, so the applicability of this test is also limited. Therefore, normality is often inspected graphically with a quantile-quantile plot (Q-Q plot). If the distribution of the sample does not diverge greatly from the normal distribution, then the theoretical and sample quantiles fall on an approximately straight line (Fig. 20.6).
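The points of a Q-Q plot are easy to compute by hand. The sketch below pairs the sorted sample values with standard normal quantiles evaluated at the plotting positions (i − 0.5)/n, one common convention among several:

```python
import statistics

# Sample to be checked for normality (made-up data).
x = sorted([4.1, 5.0, 3.8, 6.2, 5.5, 4.9, 5.1, 4.4, 5.8, 4.6])
n = len(x)

# Theoretical quantiles of the standard normal distribution at the
# plotting positions (i - 0.5)/n for i = 1..n.
nd = statistics.NormalDist()  # standard normal: mean 0, sd 1
theoretical = [nd.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# A Q-Q plot draws the pairs (theoretical quantile, sample quantile);
# an approximately straight line suggests the sample is close to normal.
points = list(zip(theoretical, x))
```

These pairs are what a statistics package plots on a Q-Q plot; a plotting library would then be used to draw them together with a reference line.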
Statistical tests can be divided into two large groups based on the distribution of the variables to be investigated (Précsényi et al., 2000). Parametric tests, as their name implies, estimate a parameter of the investigated population. They assume that the distribution of the investigated variable (or of the error) is normal. The power of parametric tests is high (even small differences can be detected), but they have several assumptions and usually can be applied only to variables measured on a ratio or interval scale (see Fig. 20.2). In contrast, non-parametric tests do not estimate a parameter. They do not require normality, although some of them assume that the distribution has a particular shape (e.g. symmetric). Non-parametric tests have fewer assumptions and can also be applied to variables measured on a nominal or ordinal scale. They usually have lower power than their parametric counterparts, and many parametric tests, especially the more complex ones, have no non-parametric counterpart.
If we have one sample and intend to compare one of its characteristics to a theoretical value, we can use the one-sample t-test (parametric) or the Wilcoxon signed-rank test (non-parametric). If we have two independent samples and intend to compare one of their characteristics, we can use the two-sample t-test (parametric) or the Mann-Whitney test (non-parametric). If we have two samples whose elements can be arranged into pairs (e.g. values measured on the same individual before and after a treatment, or the males and females of pairs), we can use the paired t-test (parametric) or the Wilcoxon paired signed-rank test (non-parametric). If we have more than two unrelated samples, we can compare them by Analysis of Variance (ANOVA, parametric) or the Kruskal-Wallis test (non-parametric). If we have more than two measurements from the same individuals, we can apply repeated-measures ANOVA (parametric) or the Friedman test (non-parametric).
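As an illustration of the paired case, the paired t-test reduces to a one-sample test on the within-pair differences. A minimal sketch with made-up before/after measurements computes the t statistic; the corresponding p value would then be looked up in a t table or computed by statistical software:

```python
import statistics

# Time (minutes) on the nest before and after a treatment,
# measured on the same ten individuals (made-up data).
before = [12, 15, 11, 18, 14, 16, 13, 17, 15, 14]
after = [14, 16, 13, 19, 15, 18, 13, 19, 16, 16]

# The paired t-test works on the within-pair differences.
d = [a - b for a, b in zip(after, before)]
n = len(d)
t = statistics.mean(d) / (statistics.stdev(d) / n ** 0.5)
# Compare |t| with the critical value of the t-distribution with
# n - 1 = 9 degrees of freedom to decide about H0.
```

The t statistic is simply the mean difference divided by its standard error, which is why independence of the pairs (but not of the two measurements within a pair) is required.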
The linear association between two normally distributed variables can be investigated with Pearson correlation, whereas for non-normally distributed data Spearman rank correlation can be used. Correlation, however, does not imply causality between the two variables. If we are interested in how an independent variable linearly influences a dependent variable, we can apply regression analysis.
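Simple linear regression fits the least-squares line y = intercept + slope·x. A short sketch with made-up data computes both coefficients directly from the standard formulas:

```python
from statistics import mean

# Independent (x) and dependent (y) variable, made-up data.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8]

mx, my = mean(x), mean(y)

# Least-squares estimates: slope = Sxy / Sxx, intercept from the means.
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx  # the fitted line is y = intercept + slope * x
```

The same estimates are what statistical software reports for a simple linear regression, together with their standard errors and p values.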
Association between variables measured on a nominal scale (e.g. whether the distribution of hair colour depends on sex in humans) can be examined with a test of independence, for which usually the χ2 test is used.
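The χ2 statistic for a test of independence compares the observed counts of a contingency table with the counts expected under independence (row total × column total / n). A sketch with made-up counts for a 2×2 table:

```python
# 2x2 contingency table: hair colour (dark, light) by sex (made-up counts).
observed = [[30, 20],   # males: dark, light
            [18, 32]]   # females: dark, light

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum (observed - expected)^2 / expected over all cells.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (obs - expected) ** 2 / expected

# Compare chi2 with the chi-squared distribution with
# (rows - 1) * (columns - 1) = 1 degree of freedom.
dof = (len(observed) - 1) * (len(observed[0]) - 1)
```

A large χ2 value relative to the critical value at the chosen α indicates that hair colour distribution differs between the sexes in this made-up sample.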
When reporting the results of statistical tests, we usually have to give the name of the test used, the value of the test statistic (e.g. t value, χ2 value), the degrees of freedom (parametric tests) or the sample size (non-parametric tests), and the p value.
See Chapter 21 for an introduction to InStat.