The collection of scientifically evaluable data has to be planned carefully.
All scientific data collection starts with **raising a question**. Pilot studies, previous knowledge
and literature data can help us to raise an adequate question. Our goal is to
formulate a scientific hypothesis and to derive predictions from it.
These **predictions** are specific statements that can be
tested statistically (Précsényi et al., 2000).

The **measured variables** that will be used to test the
predictions have to be defined precisely before data collection, and this
definition has to be applied consistently throughout data collection (Martin and
Bateson, 1993). Defining the variables is not always a straightforward or
easy process. It is easy to define body mass and its measurement, but the
situation is more difficult if we intend to measure a behaviour with
complicated, variable components, such as fighting or courtship between individuals.
In such cases it is not always obvious when a given behaviour starts and ends,
what its intensity is, and so on.

The behaviour of animals is characterized by natural variability. This
variability is the result of several factors: genetic factors, biotic and
abiotic environmental effects, and their interactions shape the behaviour of
individuals (Székely et al., 2010). Because of this variability, our
measurements contain ‘noise’ that cannot be controlled for. Therefore, to
collect statistically evaluable data, several measurements have to be taken.
During data collection we have to pay the utmost attention to random sampling (Zar,
2010): every individual in the group to be investigated (the **statistical population**, not necessarily identical
with the **biological population**) should have
the same chance of being measured (**statistical sample**). If random sampling (i.e. the
temporal and spatial independence of the sample elements) is not assured during
data collection, then the **data will be pseudoreplicated**, and the conclusions
drawn from the analysis may be incorrect. It is easy to see that by
measuring the height of the same person twice we do not obtain two independent
data points; however, assuring spatial independence is not always so simple
(e.g. within a group, more similar individuals may be closer together
than more dissimilar ones). Furthermore, measurements of relatives (e.g.
siblings) are also not independent, because firstly the relatives share common
genes, and secondly they may have developed in the same social environment.
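
Drawing a simple random sample, in which every individual has the same chance of being measured, can be sketched in a few lines of Python (the ring numbers, population size and sample size below are invented for illustration):

```python
import random

# Hypothetical statistical population: 100 ringed birds, identified by
# ring numbers 1..100.
population = list(range(1, 101))

random.seed(1)  # fixed seed only to make this sketch reproducible

# Sampling without replacement: every individual has the same chance
# of ending up in the statistical sample.
sample = random.sample(population, k=5)

print(sample)  # five distinct ring numbers
```

Sampling without replacement (as `random.sample` does) ensures that no individual is measured twice, which would violate the independence of the sample elements.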

Variables describing behaviour can usually be divided into **four categories** (Fig. 20.1, Martin and Bateson,
1993). **Latency variables** measure the time from the beginning
of sampling until the first occurrence of the behaviour. **Occurrence or frequency variables** measure whether, or how many
times, the behaviour occurs during a unit of time, e.g. a minute. **Duration variables** measure the length of each
occurrence of the behaviour. If the behaviour occurs several times during the
recording, then the total duration and the average duration can be calculated for
the full sample. If not only the occurrence but also the extent of the behaviour
(the volume of a call, the speed of running) has to be described, then we use **intensity variables**.

Fig. 20.1. Latency, frequency, duration and intensity. The grey rectangles represent the occurrence of the behaviour over time t. The width of the rectangles is the length of each occurrence, whereas the height is the intensity of the behaviour. The frequency of the behaviour over time t is four. The total duration is a + b + c + d, and the average duration is (a + b + c + d)/4. Based on Martin and Bateson (1993).
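
The four variable types can be illustrated with a short sketch (the event times and intensities below are invented; each tuple records one occurrence of the behaviour as start time, end time and intensity over a 60 s observation):

```python
# One behaviour observed four times during a 60 s sample:
# (start_s, end_s, intensity) per occurrence -- invented data.
events = [(5.0, 9.0, 2.0), (18.0, 20.5, 1.0), (31.0, 38.0, 3.0), (50.0, 53.5, 2.5)]

latency = events[0][0]                         # time until first occurrence
frequency = len(events)                        # number of occurrences in the sample
durations = [end - start for start, end, _ in events]
total_duration = sum(durations)                # a + b + c + d in Fig. 20.1
mean_duration = total_duration / frequency     # (a + b + c + d) / 4
mean_intensity = sum(i for _, _, i in events) / frequency

print(latency, frequency, total_duration, mean_duration, mean_intensity)
```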

Before data collection, we also have to decide on which scale each variable will be measured (Figure 20.2), because the scale of measurement largely influences which statistical procedures can be used to analyse the collected data.

Figure 20.2. Types of variables according to their scale of measurement.

Behaviour can be recorded **continuously**, or only at given time intervals (e.g.
every ten seconds; **instantaneous sampling**). While continuous data
recording can describe behaviour very precisely, it can be used to record only a
few variables simultaneously. As the number of recorded variables increases, the
accuracy of continuous data recording decreases, so in such cases it is better
to use instantaneous sampling. During instantaneous sampling, with the help of a
stopwatch or, preferably, a timer (a device giving a short beep at given time
intervals), we record which behaviour occurs at each
sampling point. The accuracy of instantaneous sampling is largely influenced by
the sampling interval, i.e. the time elapsed between sampling points. For
swiftly changing behaviours (e.g. a fight between individuals), rather short
intervals of a few seconds have to be used, whereas it may be enough to record the
behaviour of resting individuals only every minute.
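
A common use of instantaneous sampling is to estimate a time budget: the proportion of sampling points at which a behaviour was recorded estimates the proportion of time spent on it (the shorter the interval, the better the estimate). A minimal sketch with invented codes (`'F'` = feeding, `'P'` = preening, `'R'` = resting):

```python
# Behaviour noted at every 10-second beep; 12 points = 2 minutes observed.
samples = ['F', 'F', 'R', 'F', 'P', 'R', 'R', 'F', 'F', 'R', 'P', 'F']

interval = 10                          # seconds between sampling points
total_time = interval * len(samples)   # 120 s observed

# Proportion of sampling points per behaviour estimates its share of time.
budget = {b: samples.count(b) / len(samples) for b in set(samples)}
est_feeding_time = budget['F'] * total_time   # estimated seconds spent feeding

print(budget, est_feeding_time)
```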

The simplest way to record behaviour is to use **paper and pencil** or pen. For continuous data
recording even an empty sheet of paper may be appropriate, whereas for
instantaneous sampling a behavioural sheet prepared beforehand is usually used.
The header of the **behavioural sheet** contains the name of the observer,
the date, the start and end of data recording, the identification of the
observed individual(s) (e.g. name, ring number), and further data (e.g.
temperature). The behavioural sheet itself is a table whose rows are the
sampling points and whose columns are either different behavioural variables
(feeding, preening, etc.) or different individuals (male, female, offspring 1,
offspring 2, etc.). If the columns are behavioural variables, then at each
sampling point we can indicate which behaviour occurs by writing, for example, an X in
the corresponding column. If the columns represent individuals, then we
can indicate the behaviour of each individual using previously defined one- or
two-letter abbreviations. The biggest advantage of recording behaviour
with paper and pencil is that they can be used almost anywhere, at any time, and
there is no chance of technical failure. In contrast, the disadvantage of this
recording method is that before analysis the data have to be entered into a
spreadsheet or database, which may be a time-consuming process. Manual data entry
can be avoided by using an **event recorder**. Any kind of portable **computer** (smartphone, tablet, laptop) can be used as an
event recorder. Running an appropriate application, we can record which behaviour
occurs by hitting predefined key combinations or by touching the appropriate
part of the screen. With an event recorder we can effectively record behaviours
that consist of well-defined behavioural categories; however, it may be much
more difficult to add comments to the sampling points than to jot a quick
note in the margin of a behavioural sheet.

Behaviour may also be recorded on **video tapes**; however, this method again requires
time-consuming coding of the data into a database afterwards. Video recordings have the
advantage that if new questions arise later during the study, then further, **previously not planned variables** can be recorded by
re-watching the footage. The disadvantage of video recordings is that one can
often see less on footage than in real time, so some details of the
behaviour may not be visible. This is especially true for time-lapse
videos, where only one or a few frames are taken per second, e.g. because of
limited data storage.

Behavioural data can also be recorded by **automatic devices**. For example, an **electronic scale** can be placed under the nest of
birds to describe the parents' feeding activity based on the body mass
differences of the sexes (Szép et al., 1995). Another possibility is to glue
small RFID (Radio Frequency IDentification) tags (transponders) to the birds,
and to record the unique identification codes of the tags with a computer-controlled
reader connected to an antenna placed under the nest or at the entrance of the
nestbox (Kosztolányi and Székely, 2002). The advantage of automatic
recording systems is that large amounts of data (even data from several days) can
be collected, and the data are recorded directly into a logger, so there is no
need for time-consuming data entry. Their disadvantage, however, is that these
systems are usually complicated, they are the result of long planning
processes, and because of their complexity the probability of failure may
also be high. Furthermore, before data recording we have to make sure that the
automatic system estimates the true behaviour well, that is, that the data collected
by the system are in accordance with data collected by an observer.

Measurements are subject to two kinds of errors: **systematic and random errors** (Fig. 20.3). Systematic
error is the difference between the true value of the variable and its
measured value, and determines **the validity of the measurement**, whereas random error
comprises the errors occurring during individual measurements, and determines **the reliability of the measurement** (Martin and
Bateson, 1993). For example, a systematic error arises if a thermometer always shows
3 degrees less than the actual temperature because it was miscalibrated (the
zero line was drawn at +3 °C). A random error arises if the scale of our
thermometer is marked only at every 5 °C, so our readings are not
accurate and repeated readings do not agree.

Observers can be regarded as instruments that measure a given parameter of the behaviour in the same way, based on the same principles. To return to the thermometer example: just as there can be a systematic error between two thermometers because one of them is miscalibrated, there can be systematic differences between two observers because, for example, they interpret and apply the definitions consistently differently. Furthermore, just as there can be random error in the values read from two thermometers with different scaling, there can be random error between two observers because, for example, one of them is less experienced or less focused, and thus the data collected by this observer contain more errors.

Therefore, if our data were collected by several observers, then before data
analysis we have to check whether the agreement between the sets of data
collected by the different observers is adequate (**inter-observer agreement or reliability**; Martin and
Bateson, 1993). To test this, two observers have to evaluate the same behaviour
sequence, in real time or from video footage, and we compare the
resulting data.

The reliability of data collection has to be checked even when data were
collected by only one observer. In this case, we examine the degree of agreement
of the observer with himself/herself (**intra-observer agreement or reliability**): the
observer evaluates the same behaviour sequence twice and we analyse the
agreement between the two codings.

If all data were collected by one observer, it may still be worth testing the inter-observer agreement by including an independent observer. This way it can be detected whether the data collected by our single observer contain systematic errors, similarly to the case when we collect all data with a miscalibrated thermometer.

There are several methods to measure the reliability between observers (Martin and Bateson, 1993). Here we review the three most often applied methods.

The degree of agreement can often be estimated by the **correlation** between the two sets of data. The
degree of association between two datasets is measured by the **correlation coefficient** (*r*),
whose value can vary between -1 and +1. If *r* =
+1, then there is full agreement between the two datasets. With decreasing
*r*, the degree of agreement decreases, and if
*r* = 0, there is no linear association between the
two datasets. If *r* < 0, then the two datasets
describe the given behaviour in opposite ways.

If the variable follows a **normal distribution**, then the **Pearson correlation coefficient**
(*r*) is used; otherwise the **Spearman rank correlation coefficient**
(*r*_{s}) can be used, whose
value can also vary between -1 and +1.
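
The calculation of *r* can be sketched directly from its definition (a minimal Python version with invented data for two hypothetical observers; in practice statistical software is used):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient, written out from its definition:
    covariance of the deviations divided by the product of the deviation norms."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented codings of the same five samples by two observers.
observer_a = [3.0, 5.0, 4.0, 8.0, 6.0]
observer_b = [2.5, 5.5, 4.5, 7.5, 6.0]

r = pearson_r(observer_a, observer_b)
print(r)  # close to +1: the two codings agree well (r >= 0.7)
```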

If we calculate the correlation coefficient with software, then usually the
statistical significance (*p*) is also reported, which
refers to the divergence of the coefficient from zero. It is important to
emphasize that the *p* value in itself does not say
much about the degree of agreement, because the significance
level of a given correlation coefficient decreases sharply with increasing
sample size. Usually we consider the association between the observers'
datasets acceptable if *r* ≥ 0.7. If
*r* is lower than this value, then we cannot combine
the data collected by the two observers, and either the
definitions used for coding need to be refined or the coding experience of the
observers has to be improved.

It is important that **independent data pairs** be used for the
calculation of the correlation. That is,
it is not correct to calculate the correlation from one sample, i.e. from
data gained from sections of a single behaviour sequence (e.g. one video
footage); data gained from separate samples have to be used.
Furthermore, the sample sections used for the calculation
should be random and representative of our sample, otherwise we can
easily obtain misleadingly high agreement between two observers, for
example if we choose sections where the behaviour in question does not occur at
all or occurs continuously.

The **index of concordance** is usually used if the
investigated variable was measured on a nominal or ordinal scale (Table 1). To
calculate the index of concordance, we first count the cases where the coding
of the two observers agrees (A) and the cases where it disagrees (D), then
we calculate the index *KI* =
*A*/(*A* + *D*),
which can take values between 0 and 1.
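
The calculation can be sketched in a few lines (the codings below are invented):

```python
# The same six sampling points coded by two observers (invented data).
coder_1 = ['bark', 'quiet', 'bark', 'bark', 'quiet', 'bark']
coder_2 = ['bark', 'quiet', 'quiet', 'bark', 'quiet', 'bark']

agree = sum(a == b for a, b in zip(coder_1, coder_2))   # A: matching codings
disagree = len(coder_1) - agree                         # D: mismatching codings
KI = agree / (agree + disagree)                         # index of concordance

print(agree, disagree, KI)
```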

A shortcoming of the index of concordance is that it does not take into account that some amount of agreement can occur between the observers just by chance, and therefore it may overestimate the agreement between observers.

In contrast to the index of concordance, the calculation of Cohen's
Kappa takes into account the number of agreements occurring by chance:
*κ* = (*KI* –
*V*)/(1 – *V*), where
*KI* is the index of concordance, that is, the
proportion of agreement between the two observers, and
*V* is the amount of agreement expected by
chance.

Let's assume that two observers ('A' and 'B') analysed a 10-minute-long
footage with sampling every 10 seconds (*n* = 60 sampling
points in total). The observers recorded whether a dog on the footage barked
or not. Of the 60 sampling points, in 30 cases both of them found that the
dog barked, whereas in 25 cases both of them found that the dog did not
bark. Thus the index of concordance is *KI* = (30 +
25)/60 = 0.917.

To determine the value of *V*, we first need to count
how many times out of the 60 cases observer 'A' coded barking
(A^{+}) and non-barking
(A^{-}), and similarly we need the values of
B^{+} and B^{-} for
observer 'B' (Figure 20.4).

Figure 20.4. Calculation of Cohen's Kappa for a sample consisting of n = 60 sampling points.

The probability that the two observers give the same coding, assuming independence, is:
*V* = *A*^{+}/*n* × *B*^{+}/*n* + *A*^{–}/*n* × *B*^{–}/*n*
= 33/60 × 32/60 + 27/60 × 28/60 = 0.293 + 0.210 = 0.503. Therefore Cohen's
Kappa is *κ* = (0.917 – 0.503)/(1 – 0.503) = 0.833. This is
much lower than the index of concordance, and shows that almost 10% of the
agreement between the observers is due to chance.
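
The worked example above can be recomputed step by step as a sketch in Python. Note that with unrounded intermediate values, κ comes out as ≈ 0.832; the text's 0.833 results from using the rounded *KI* and *V*:

```python
# Counts from the worked example: 60 sampling points, joint codings and
# the marginal totals for observers 'A' and 'B'.
n = 60
both_bark, both_quiet = 30, 25
A_plus, A_minus = 33, 27
B_plus, B_minus = 32, 28

KI = (both_bark + both_quiet) / n                    # index of concordance, 55/60
V = (A_plus / n) * (B_plus / n) + (A_minus / n) * (B_minus / n)  # chance agreement
kappa = (KI - V) / (1 - V)                           # Cohen's Kappa

print(round(KI, 3), round(V, 3), round(kappa, 3))
```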

As with the correlation coefficient, in the case of Cohen's Kappa there is no objective threshold above which the agreement is considered appropriate. Kappa values above 0.6 are usually already considered acceptable; however, the closer the value is to 1, the more reliable the analysis of the behavioural variable is.

After checking the reliability of our data, we can start the data analysis,
the first step of which is the description of the data (Fig. 3). The **location of the collected sample**, that is, where
our sample sits on the axis, is most often characterized by the **mean**. Another often used descriptive statistic of the
sample location is the **median**, the value at the middle of the rank-ordered
sample.

The **dispersion of our sample**, that is, how widely the data
are spread on the axis, is most often characterized by the **standard deviation** (*s*) or by **its square, the variance**
(*s*^{2} = Σ(*x*_{i} – x̄)^{2}/(*n* – 1)), where *x*_{i} are
the individual data points, x̄ is the mean of the sample, and *n* is
the sample size. The dispersion of the data can also be characterized by the
interquartile range, which contains the middle 50% of the data and is calculated as
*IQ* = *Q*_{3}
- *Q*_{1}.
*Q*_{3} is **the upper**, whereas
*Q*_{1} is **the lower quartile**, the medians of the two
sub-samples split by the median of the full dataset (Fig. 20.5).
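
Using invented data, these descriptive statistics can be computed with the Python standard library (one possible tool; `statistics.quantiles(..., n=4)` returns the three cut points Q1, median, Q3):

```python
import statistics

# Invented sample: time spent on the nest (minutes) by n = 10 birds.
time_on_nest = [12, 15, 11, 18, 14, 16, 13, 19, 15, 17]

mean = statistics.mean(time_on_nest)        # location: the mean
median = statistics.median(time_on_nest)    # location: the median
s = statistics.stdev(time_on_nest)          # dispersion: sample standard deviation
var = statistics.variance(time_on_nest)     # its square, the variance
q1, m, q3 = statistics.quantiles(time_on_nest, n=4)
iqr = q3 - q1                               # interquartile range IQ = Q3 - Q1

print(mean, median, round(s, 3), round(var, 3), iqr)
```

Note that `statistics.quantiles` interpolates between data points, so its quartiles may differ slightly from the "median of the half-samples" definition used in the text; for descriptive purposes either convention is acceptable.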

Figure 20.5. Location and dispersion of the sample: the time spent on the
nest by the male and the female in an imaginary bird species (n = 10 pairs). On
the boxplot the middle line is the median (M), the bottom and top of the box are
the lower (Q_{1}) and upper (Q_{3})
quartiles, and the "whiskers" are the minimum and maximum values within the ranges
Q_{1} – 1.5 × (Q_{3} –
Q_{1}) and Q_{3} + 1.5 ×
(Q_{3} – Q_{1}). Values outside
these ranges are called outliers or extreme values and are depicted with a dot
or asterisk.

The statistical hypothesis is different from the scientific one. The
scientific hypothesis is a logical framework that is based on our previous
knowledge and from which predictions can be drawn. In contrast, the statistical
hypothesis is a simple pair of statements about a characteristic of the statistical
population. The null hypothesis (*H*_{0})
states the absence of a difference, whereas the alternative hypothesis
(*H*_{A} or
*H*_{1}) states the presence of a
difference. The members of the hypothesis pair must exclude each other, i.e.
if *H*_{0} is not true then
*H*_{A} must be true, and
*vice versa*.

During statistical testing we first calculate the value of the test statistic
from our sample (this is most often done with statistical software^{[8]}). From this value we can determine the probability
(*p*) of obtaining a test statistic value this high (or
higher) if *H*_{0} is true. If this is
highly unlikely, that is, if *p* is small, then we reject
*H*_{0} and accept
*H*_{A}. If the probability is high,
then we keep *H*_{0}. The probability
used as the threshold for rejecting
*H*_{0} is the **level of significance**, denoted by α. In biology
the widely accepted level of significance is α = 0.05, that is,
*p* ≤ α values are considered significant.
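
The logic of *H*_{0}, the test statistic and the *p* value can be made concrete with a small exact permutation test (a sketch with invented data; permutation tests are not among the tests reviewed below, but the reasoning is the same): under *H*_{0} every reassignment of the observed values to the two groups is equally likely, and *p* is the proportion of reassignments giving a difference at least as large as the observed one.

```python
from itertools import combinations
from statistics import mean

# Invented measurements for two groups of three individuals each.
group_1 = [1.0, 2.0, 3.0]
group_2 = [10.0, 11.0, 12.0]
pooled = group_1 + group_2

observed = abs(mean(group_1) - mean(group_2))  # |2 - 11| = 9, the test statistic

# Enumerate all C(6, 3) = 20 possible "group 1" assignments and count how
# often a difference at least this large arises by chance alone.
count = total = 0
for idx in combinations(range(len(pooled)), len(group_1)):
    g1 = [pooled[i] for i in idx]
    g2 = [pooled[i] for i in range(len(pooled)) if i not in idx]
    total += 1
    if abs(mean(g1) - mean(g2)) >= observed:
        count += 1

p = count / total
print(p)  # 2/20 = 0.1 > 0.05: with such small samples we keep H0
```

Even though the group means differ greatly, the exact *p* = 0.1 exceeds α = 0.05, illustrating that small samples often cannot reject *H*_{0}.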

The distribution of variables measured on an interval or ratio scale is
typically such that the values aggregate near the mean, and fewer and fewer
values are found further from it (“bell-shaped curve”). Not all
distributions of this kind are normal, but many biological variables converge to a
normal distribution if the sample size is large. According to the **central limit theorem**, if several random samples are
drawn from a non-normally distributed population, then the distribution of the
sample means converges to a normal distribution. Biological variables are usually the
result of several different factors; therefore they tend to converge to a normal
distribution.

Before conducting parametric statistical tests (see 2.11) we have to make sure
that the assumption of normality is fulfilled. The most often used test for
checking normality is the **Kolmogorov-Smirnov test**, which, however, has very low
power, so its application is not recommended. The other often used test
for checking normality is the **Shapiro-Wilk test**, which has high power if the data
do not contain equal values. However, equal values (ties) often occur in
biological data, so the applicability of this latter test is also limited.
Therefore, normality is often inspected graphically with the **quantile-quantile plot (Q-Q plot)**. If the
distribution of the sample does not diverge greatly from the normal
distribution, then the theoretical and sample quantiles form a nearly straight
line (Fig. 20.6).
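
The numbers behind a Q-Q plot can be sketched as follows (the sample is invented; `NormalDist` from the Python standard library supplies the theoretical quantiles of a normal distribution fitted to the sample):

```python
from statistics import NormalDist, mean, stdev

# Invented, roughly normal sample (e.g. body masses in grams).
sample = sorted([14.2, 15.1, 13.8, 15.9, 14.7, 15.4, 14.9, 16.3, 14.4, 15.6])
n = len(sample)

# Normal distribution fitted to the sample's mean and standard deviation.
fitted = NormalDist(mean(sample), stdev(sample))

# Theoretical quantile for each sorted data point, using the common
# plotting positions (i + 0.5) / n.
theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]

for t, s in zip(theoretical, sample):
    print(round(t, 2), s)  # near-normal data: pairs fall close to a straight line
```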

Statistical tests can be divided into two large groups based on the distribution
of the variables to be investigated (Précsényi et al., 2000). **Parametric tests**, as their name suggests, estimate a
parameter of the investigated population. They assume that the distribution of
the investigated variable (or of the error) is normal. The power of parametric
tests is high (even small differences can be detected), but they have several
assumptions and usually can be used only on variables measured on a ratio or
interval scale (see Figure 20.2). In contrast, **non-parametric tests** do not estimate a parameter.
They do not require normality, although some of them assume that
the distribution has a particular shape (e.g. symmetric). Non-parametric tests
have fewer assumptions and can also be used on variables measured on a nominal or
ordinal scale. However, they usually have lower power than their parametric counterparts,
and many parametric tests, especially the more complex ones, do not have a
non-parametric counterpart.

If we have one sample and intend to compare one of its characteristics to a
theoretical value, then we can use the **one-sample t-test** (parametric) or the **Wilcoxon signed-rank test** (non-parametric). If we
have two independent samples and intend to compare one of their
characteristics, then we can use the **two-sample t-test** (parametric) or the **Mann-Whitney test** (non-parametric). If we have two
samples whose elements can be arranged into pairs (e.g. values measured on the
same individual before and after a treatment, or the males and females
of pairs), then we can use the **paired t-test** (parametric) or the **Wilcoxon paired signed-rank test** (non-parametric). If
we have more than two unrelated samples, then we can compare them by **Analysis of Variance** (**ANOVA**, parametric) or by the **Kruskal-Wallis test** (non-parametric). If we have more
than two measurements from the same individuals, then we can apply **repeated-measures ANOVA** (parametric) or the **Friedman test** (non-parametric).

The linear association between two normally distributed variables can be
investigated by **Pearson correlation**, whereas for non-normally
distributed data we can use **Spearman rank correlation**. Correlation, however, does
not imply causality between the two variables. If we are interested in how an
independent variable linearly influences a dependent variable, then we can apply **regression analysis**.

Association between variables measured on a nominal scale (e.g. whether the
distribution of hair colour depends on sex in humans) can be examined with a **test of independence**, for which
usually the χ^{2} test is used.
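
As a sketch of the underlying arithmetic, the χ^{2} statistic for an invented 2×2 contingency table (hair colour by sex) can be computed from its definition, χ^{2} = Σ(*O* – *E*)^{2}/*E*, where the expected counts *E* = row total × column total / *n* assume independence:

```python
# Invented 2x2 contingency table: rows = sexes, columns = hair colours.
table = [[10, 20],
         [20, 10]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n  # count expected under independence
        chi2 += (observed - expected) ** 2 / expected

df = (len(table) - 1) * (len(table[0]) - 1)  # degrees of freedom: (rows-1)(cols-1) = 1

print(round(chi2, 3), df)
```

Here χ^{2} ≈ 6.667 with df = 1, which exceeds the critical value of 3.841 at α = 0.05, so the association would be judged significant.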

When reporting the results of statistical tests, we usually have to give the
name of the statistical test used, the value of the test statistic (e.g. the
*t* value or χ^{2} value), the
degrees of freedom of the sample (parametric tests) or the sample size
(non-parametric tests), and the *p* value.