Chapter 3: Descriptive Stats
When we have collected (or otherwise acquired) a data set in order to investigate questions pertaining to human behavior and cognition, the first thing to do is often to get some descriptive summary statistics. In themselves, data are just piles of meaningless numbers. Only when we start throwing questions at the data, in the form of statistical analyses and plots, can we begin to appreciate what is actually in our data. Often we can approach the first answers to our questions by looking at means and variance and by making some simple visualizations of the distributions. We will treat visualization of data in the next chapter. In this chapter we will look at how to obtain summary indices of central tendency and variability (mean, median, standard deviation and standard error) from a large data set.
First, however, we will learn how to install packages in RStudio, which lets us extend the basic R repertoire with new functions. In this case we need a function from the pastecs package.
Installing packages in RStudio
Whenever we need particular functions that are not part of the default R environment, we install a new package. Programmers all over the world are continuously contributing new functions to R, extending and improving its functionality. When we want to install a package, RStudio goes online and fetches it for us, so you need to be online to install packages.
You need to install a package only once. It is then added to your R library, and you can use it thereafter as long as you load it in your script using the command library('name of package').
There are two ways of installing a package in RStudio. Either we do it 'by command' or we can use the graphical interface.
Installing packages with commands
If you are certain about the name of the package you need, the easiest way to install it is to use the command install.packages() in the console. The package name must be given in quotes. To install the pastecs package we would write:
# Install the pastecs package (the name must be quoted)
install.packages("pastecs")
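Because a package only needs to be installed once, it can be convenient to check whether it is already available before installing it. The sketch below uses a hypothetical helper name, install_if_missing(), which is not part of R or pastecs; it relies only on the base functions requireNamespace() and install.packages().

```r
# Hypothetical helper (not part of base R): install a package only if missing
install_if_missing <- function(pkg) {
  # requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # needs an internet connection
  }
  # report (invisibly) whether the package is available now
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call never triggers a download
install_if_missing("stats")
```

In a script for this chapter you would call install_if_missing("pastecs") once near the top, followed by library(pastecs).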
Installing packages with the GUI
If we are a bit uncertain about the name of the package or can't remember the install command, we can also use the graphical interface to browse and install packages.
In the top menu bar, choose "Tools > Install packages..."
Once you click it, the 'Install packages' window pops up. Now you can type the first couple of letters of the package name and it will appear in a dropdown list.
Now we simply choose the package from the list and click 'Install'. The package is then installed, and we can use it once we include the command library(pastecs) at the beginning of our scripts.
Mean, median, SD, and SE
Note that in the examples below we will still use the sample data set from the last chapter, so you might start your script file by setting your working directory and importing the data set.
If we just want to get the mean, median and standard deviation of a full column in our data set, we can use the commands mean(), median(), and sd() and specify the column as the argument.
# get the mean of the tongue twister data
mean(DATA$Tongue_twister_rt)
# get the median
median(DATA$Tongue_twister_rt)
# get the standard deviation
sd(DATA$Tongue_twister_rt)
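The three functions behave the same way on any numeric vector. As a minimal sketch, the example below uses a small made-up vector of reaction times (toy_rt is a hypothetical name, not a column from the sample data set). It also shows the na.rm argument, which is needed when the data contain missing values.

```r
# Made-up reaction times, for illustration only
toy_rt <- c(40, 45, 50, 55, 60)

mean(toy_rt)    # 50
median(toy_rt)  # 50
sd(toy_rt)      # about 7.91

# With a missing value the result is NA unless missing values are dropped
toy_rt_na <- c(toy_rt, NA)
mean(toy_rt_na)                # NA
mean(toy_rt_na, na.rm = TRUE)  # 50
```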
There is no built-in function to calculate the standard error of the mean (SE). However, we know that it is simply:
$$SE = \frac{SD}{\sqrt{n}}$$
That is, the standard deviation (SD) divided by the square root of the sample size (n). If we need to report a standard error we can thus do something like:
# get the standard error
sd(DATA$Tongue_twister_rt)/sqrt(length(DATA$Tongue_twister_rt))
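If the standard error is needed repeatedly, the calculation can be wrapped in a small function. This is only a sketch: se_mean() is a hypothetical name, and dropping missing values before counting the sample size is an assumption about how you want NAs to be handled.

```r
# Hypothetical helper: standard error of the mean
se_mean <- function(x) {
  x <- x[!is.na(x)]        # drop missing values so they are not counted in n
  sd(x) / sqrt(length(x))  # SD divided by the square root of the sample size
}

# Made-up reaction times, for illustration only
toy_rt <- c(40, 45, 50, 55, 60)
se_mean(toy_rt)  # about 3.54
```

With the sample data set this would be called as se_mean(DATA$Tongue_twister_rt).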
The stat.desc function
Rather than running a number of individual commands to retrieve our different summary scores, we can also use a single function to get them all. The stat.desc() function is part of the pastecs package that we installed above. We can use it to get a whole range of summary scores for a variable with one command. Let's try running it on the tongue twister data again:
# get descriptive summary data from the tongue twister task
stat.desc(DATA$Tongue_twister_rt)
You will notice that the output gives you the max, min, mean, SD and SE of the data plus a number of additional summary stats. The output is not always easy to read due to the large number of decimals. You can use the round() function to round the output values to a fixed number of decimals. You simply put the whole stat.desc() command inside the round() command and specify the number of decimals you want after a comma.
# get descriptive summary data, rounded to two decimals
round(stat.desc(DATA$Tongue_twister_rt), 2)
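When given a single column, stat.desc() returns a named numeric vector, so individual entries can be picked out by name. The sketch below uses made-up data (toy_rt is hypothetical) and assumes that pastecs is installed; the entry names used here ("mean", "median", "std.dev", "SE.mean") are the labels that stat.desc() prints in its output.

```r
library(pastecs)

# Made-up reaction times, for illustration only
toy_rt <- c(40, 45, 50, 55, 60)

# full set of descriptive statistics as a named vector
all_stats <- stat.desc(toy_rt)

# keep only the scores we want to report, rounded to two decimals
round(all_stats[c("mean", "median", "std.dev", "SE.mean")], 2)
```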
Using the by() function to get summary stats by group
Rather than getting the mean and variance from a full column of data, we often want means for different groups in the data set. For instance, we might want to know the mean of the tongue twister data for females and males respectively, or to compare means and variance for two experimental conditions. Here the by() function comes in handy.
The by() function takes three arguments. First you specify the data that you want to run the function on, e.g. the tongue twister data. Then you specify the grouping variable, e.g. gender (or condition), by which you want to split the data. The last argument is the function you want to run on each group of data, e.g. mean or sd. Let's have a look at an example and its output.
# get the mean of the tongue twister data by gender
by(DATA$Tongue_twister_rt, DATA$Gender, mean)
DATA$Gender: female
[1] 49.4375
--------------------------------------------------------------------------
DATA$Gender: male
[1] 46.71769
# get the standard deviation
by(DATA$Tongue_twister_rt, DATA$Gender, sd)
DATA$Gender: female
[1] 16.14505
--------------------------------------------------------------------------
DATA$Gender: male
[1] 11.88231
In principle you can use the by() function with any function that accepts the data you pass to it. For instance you can use it with the stat.desc() function as well.
# get all descriptive stats by gender
by(DATA$Tongue_twister_rt, DATA$Gender, stat.desc)
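The function passed to by() can also be one we write ourselves. As a sketch, the example below computes the standard error of the mean for each group in a small made-up data frame (toy and its columns are hypothetical, not the sample data set).

```r
# Made-up data, for illustration only
toy <- data.frame(
  gender = c("female", "female", "female", "male", "male", "male"),
  rt     = c(40, 50, 60, 45, 50, 55)
)

# anonymous function: SD divided by the square root of the group size
se_by_gender <- by(toy$rt, toy$gender, function(x) sd(x) / sqrt(length(x)))
se_by_gender
```

The output has the same layout as the by() examples above, with one block per gender.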