Chapter 2: Data Frame Basics

In this chapter we will look at how we can create or import data frames in order to inspect the data and run analyses. For this purpose we will need the script editor in RStudio.

Data Frames

Data frames are two-dimensional tables or matrices of values. They are absolutely fundamental to all kinds of data analysis in the cognitive sciences. Data frames consist of horizontal rows and vertical columns. Often each row contains data from one participant or one experimental trial, while each column contains individual variables such as, for instance, the participant's name, age and some measurements.

  Name       Age   Condition   Reaction_time
  Kate       22    test        1904
  John       24    test        2231
  Mohammed   20    control     2089
  Susan      25    test        1892
  Anna       23    control     2277
  Charlie    21    control     3002

Figure 1: example of a data frame

Notice that data frames most often have a header line with labels specifying the data in each column.

Writing data frames

We can easily create a data frame in RStudio. We do that by first writing a set of list variables of equal length (strictly speaking, R calls these vectors), one for each column we want in the new data frame, and then we use the command data.frame() to gather them in a two-dimensional table.

  # write lists of values 
  Names = c('Bob', 'Lizzy', 'Knut')
  Ages = c(23, 25, 47)
  RTs = c(2.45, 1.99, 6.22)
  # gather in data frame
  my_data_frame = data.frame(Name = Names, Age = Ages, RT = RTs)
  # now print the data frame
  my_data_frame

      Name   Age   RT
  1   Bob    23    2.45
  2   Lizzy  25    1.99
  3   Knut   47    6.22
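Once a data frame exists, it is worth checking that it contains what we expect. As a minimal sketch, reusing the example above, the base R functions names() and str() show the column labels and a compact overview of each column:

```r
# rebuild the example data frame from above
Names = c('Bob', 'Lizzy', 'Knut')
Ages = c(23, 25, 47)
RTs = c(2.45, 1.99, 6.22)
my_data_frame = data.frame(Name = Names, Age = Ages, RT = RTs)

# print the column labels
names(my_data_frame)

# print a compact overview: one line per column with its type and first values
str(my_data_frame)
```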

This is, however, not the way we usually work with data frames. Most often, data are imported into RStudio from another source, for instance from an experiment run in PsychoPy or from the internet.

Using the RStudio script editor

Until now we have run pieces of code one by one in the console window. Once we start doing slightly more advanced data handling, it is practical to be able to write and execute multiple lines of code and to save them in a script file for later use. We can start a new script file in RStudio by clicking the green '+' sign in the upper left corner and picking the first option, 'R Script'.

Figure 2: click 'R Script' to open the script editor

This opens the script editor window, which looks something like this:

Figure 3: the script editor

Commenting your code

It is highly recommended that you develop good habits of commenting your code. Comments in your code have at least three purposes:

  1. They help you navigate your own code while you are working on an analysis
  2. They help remind you what different parts of your code do when you revisit it later on
  3. They help other people to navigate your code if they borrow it

You can write a comment in your code by first typing a # followed by your comment. Everything following the # on that line is ignored by R and will not be run as a command. You can either put comments on their own line or immediately following a command.
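Both comment placements can be sketched in a few lines. The variable names and values here are only illustrative:

```r
# a comment on its own line: store three ages in a list
Ages = c(23, 25, 47)

mean_age = mean(Ages)  # a comment immediately following a command

# print the result
mean_age
```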

Let's start our new script by making a 'header' comment with some meta information about the script.

  # This script does basic data mining of the X data set
  # Written by Kristian Tylén
  # September 2016

Defining a working directory

Whenever we want to work with a data file that comes from an external source, we need to tell RStudio about the location of that file. We do that by designating a folder on our computer as the "working directory" for a particular analysis or project. RStudio will then search in this folder when we ask it to import data files. This is also where it will save our analysis script and graphs if we make some.

There are two ways in which we can define a working directory in RStudio: i) by using the command setwd(), or ii) by using the browser in the graphical user interface (GUI). Since it can be hard to remember a long path to a folder, we will use the GUI solution here. It also works the same way on all platforms.

Click the top bar menu item 'Session', then 'Set Working Directory', and then 'Choose Directory'.

Figure 4: set working directory

Now a browser window appears and you can navigate to the folder you wish to use as working directory and where you keep your data file.

Once you have chosen a working directory and clicked 'open', RStudio will write the setwd(path-to-your-chosen-working-directory) command in the console. You can now copy-paste the command into your script. Then, next time you open RStudio, you can just run your script and you do not need to browse to the directory again. Below is an example of what our script could look like. Notice that I have added a comment above the new command.

  # This script does basic data mining of the X data set
  # Written by Kristian Tylén
  # September 2016

  # set working directory
  setwd("~/Dropbox/Undervisning/Experimental Methods I/Fall_2016/Code")
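To check which folder R currently treats as the working directory, we can use the base R command getwd(). The sketch below uses R's temporary folder, tempdir(), as a stand-in for a real project folder. That is purely an assumption to keep the example runnable on any computer:

```r
# remember the current working directory so we can restore it afterwards
old_wd = getwd()

# set the working directory (tempdir() is a stand-in for your own folder)
setwd(tempdir())

# print the working directory to confirm the change
getwd()

# list the files R can see in the working directory
list.files()

# go back to where we started
setwd(old_wd)
```

If a data file does not show up in the output of list.files(), R will not be able to import it by file name alone.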

Importing data sets into RStudio

Often we will be analyzing data collected in experiments, simulations or gathered from the internet. Such data are usually stored in tab-separated .txt files or comma-separated .csv files. We can import tab-separated files into RStudio using the read.delim() command; for comma-separated files the corresponding command is read.csv().
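The difference between the two file types can be sketched with a small round trip. The example table (three rows from Figure 1) and the temporary files are made up for this illustration and are not part of the course data; write.table() and write.csv() are used only to create something to import:

```r
# a tiny example table (values taken from Figure 1)
example = data.frame(Name = c('Kate', 'John', 'Mohammed'),
                     Age = c(22, 24, 20),
                     Condition = c('test', 'test', 'control'))

# write it once as a tab-separated file and once as a comma-separated file
txt_file = tempfile(fileext = '.txt')
csv_file = tempfile(fileext = '.csv')
write.table(example, txt_file, sep = '\t', row.names = FALSE, quote = FALSE)
write.csv(example, csv_file, row.names = FALSE)

# read.delim() expects tabs between the values
from_txt = read.delim(txt_file)

# read.csv() expects commas between the values
from_csv = read.csv(csv_file)

# both imports should show the same table
from_txt
from_csv
```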

For the examples and exercises, we will use the sample data set called 'SampleDataSet.txt', which can be downloaded using this link. Be sure to place it in your working directory. The data originate from a fun activity where students of BSc Cognitive Science responded to a number of small tests and questions. The data have been anonymized.

Let's import the data set and write it to a variable called DATA (again, we could really call this anything).

  # Import data
  DATA = read.delim('SampleDataSet.txt')

In order to run the command, we need to select it and press the 'Run' button in the top bar of the script window. You can also use the shortcut Ctrl+Enter (Windows) or Command+Enter (Mac).

What should happen now is that the new data frame variable appears in the 'Environment' window to the right of the script editor. We can see that the data file has 29 observations of 9 variables.

Figure 5: Notice the new data frame variable DATA in the Global Environment window

Now try to click on DATA in the Environment window (you can also run the command View(DATA), with a capital V, which does the same thing). This opens up a new tab in the main window showing the data.

Figure 6: Data visualization tab

This looks nice and tidy so we can proceed.
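The same checks can also be done with commands in a script. Since the sample file may not be at hand, the sketch below uses a hypothetical stand-in called DATA_example, built from the values in Figure 1. With the real data, you would write DATA in its place:

```r
# stand-in data frame with the values from Figure 1
DATA_example = data.frame(
  Name = c('Kate', 'John', 'Mohammed', 'Susan', 'Anna', 'Charlie'),
  Age = c(22, 24, 20, 25, 23, 21),
  Condition = c('test', 'test', 'control', 'test', 'control', 'control'),
  Reaction_time = c(1904, 2231, 2089, 1892, 2277, 3002)
)

# number of observations (rows) and variables (columns)
nrow(DATA_example)
ncol(DATA_example)

# show only the first three rows, which is handy for large data sets
head(DATA_example, 3)
```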

Indexing data frames using numbers

Once we have our data imported, we want to do some basic exploratory mining. For this we need 'indexing' in order to pull out specific values or subsets of the data for further inspection.

You might remember from previous sections on list handling that we can ask for a particular value by indexing the position of the value in square brackets []. We can try to use the same procedure here by running the following command:

  DATA[1]

While for lists this procedure would give us the first element (a single value), here we get the full column of names. Since data frames are two-dimensional, we need to specify both the row and the column number to get a single value. If we want the first value in the second column, we would thus use the following indexing:

  # get value from first row, second column
  DATA[1,2]

Notice that we always specify the row followed by the column. Imagine that we wanted to pull out all values from the fourth row. Then we can just leave the column unspecified:

  DATA[4,]
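To see what each of these three index forms returns, here is a small sketch on a hypothetical stand-in data frame, DATA_example, built from the values in Figure 1 (not the sample data set itself):

```r
# stand-in data frame with the values from Figure 1
DATA_example = data.frame(
  Name = c('Kate', 'John', 'Mohammed', 'Susan', 'Anna', 'Charlie'),
  Age = c(22, 24, 20, 25, 23, 21),
  Condition = c('test', 'test', 'control', 'test', 'control', 'control'),
  Reaction_time = c(1904, 2231, 2089, 1892, 2277, 3002)
)

# a single index: the whole first column (all names)
DATA_example[1]

# row 1, column 2: a single value (Kate's age)
DATA_example[1, 2]

# row 4, column left empty: all values for Susan
DATA_example[4, ]
```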

We could also be interested in a particular range of values from the data frame. Then we can use : to specify a range of indices from one position to another.

  # get values from row 1 to 10 in column 3
  DATA[1:10, 3]

  # get values from row 5 to 15 from column 2 to 4
  DATA[5:15, 2:4]

  # use embedded list indexing to get values from all rows in columns 4 and 8
  DATA[,c(4, 8)]

Finally, we might want to get a range of values except a particular row or column. For this we can use negative indexing.

  # get values from rows 10 to 20, all columns except column 3
  DATA[10:20, -3]

  # get values from columns 2 to 4, all rows except row 11
  DATA[-11, 2:4]
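Range, list and negative indexing can be tried out on a smaller scale. This is a sketch on a hypothetical stand-in data frame, DATA_example, built from the values in Figure 1, so the row and column numbers differ from the examples above:

```r
# stand-in data frame with the values from Figure 1
DATA_example = data.frame(
  Name = c('Kate', 'John', 'Mohammed', 'Susan', 'Anna', 'Charlie'),
  Age = c(22, 24, 20, 25, 23, 21),
  Condition = c('test', 'test', 'control', 'test', 'control', 'control'),
  Reaction_time = c(1904, 2231, 2089, 1892, 2277, 3002)
)

# rows 2 to 4 in column 4 (three reaction times)
DATA_example[2:4, 4]

# rows 1 to 3 in columns 1 and 4 (names and reaction times)
DATA_example[1:3, c(1, 4)]

# all rows, every column except column 3
DATA_example[, -3]

# every row except row 6, columns 2 to 4
DATA_example[-6, 2:4]
```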

Indexing using column header labels

If we have a large data set with many column variables, it is not always practical to use number indexing, since we would have to remember what the numbers signify. We can thus do more transparent indexing by using the column labels. We do this by writing the data frame variable name, a $ and then the column label, e.g. DATA$Name (this returns the same values as DATA[, 1], but as a plain list of values rather than a one-column data frame, and the code shows more clearly what we are trying to get).
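Strictly speaking, the $ form returns a plain list of values (a vector), whereas a single number in square brackets returns a data frame with one column. A minimal sketch of the difference, using a hypothetical stand-in data frame, DATA_example, with values from Figure 1:

```r
# stand-in data frame with the values from Figure 1
DATA_example = data.frame(
  Name = c('Kate', 'John', 'Mohammed', 'Susan', 'Anna', 'Charlie'),
  Age = c(22, 24, 20, 25, 23, 21)
)

# $ with the column label: a plain list of values
DATA_example$Age

# row left empty, column 2: the same plain list of values
DATA_example[, 2]

# a single index: a data frame with one column
DATA_example[2]

# check that the first two forms give exactly the same result
identical(DATA_example$Age, DATA_example[, 2])
```

For most purposes in this chapter the difference does not matter, but functions such as mean() and min() expect the plain list of values.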

Let's look at a few examples of indexing using the column header labels. Notice that in these examples the data that we are interested in are from the column specified outside the brackets, while the particular rows are defined by a different column inside the brackets.

  # pull out the names of those participants who have a shoe size of 39
  DATA$Name[DATA$Shoe_size == 39]

  # pull out the smallest shoe size in the data set
  min(DATA$Shoe_size)

  # write a new variable with all the data from those participants who have a shoe size larger than 42
  DATA_big_feet = DATA[DATA$Shoe_size > 42,]
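It may help to see what happens inside the brackets. A comparison such as == or > produces one TRUE or FALSE per row, and only the rows marked TRUE are kept. The sketch below shows this on a hypothetical stand-in data frame, DATA_example, with values from Figure 1. The threshold of 2200 is arbitrary and chosen only for illustration:

```r
# stand-in data frame with the values from Figure 1
DATA_example = data.frame(
  Name = c('Kate', 'John', 'Mohammed', 'Susan', 'Anna', 'Charlie'),
  Age = c(22, 24, 20, 25, 23, 21),
  Condition = c('test', 'test', 'control', 'test', 'control', 'control'),
  Reaction_time = c(1904, 2231, 2089, 1892, 2277, 3002)
)

# the comparison on its own gives one TRUE/FALSE per row
slow = DATA_example$Reaction_time > 2200
slow

# inside the brackets, only the rows marked TRUE are kept
DATA_example$Name[slow]

# the same works for text columns: all data from the control condition
DATA_control = DATA_example[DATA_example$Condition == 'control', ]
DATA_control

# combine with a function: mean reaction time in the control condition
mean(DATA_example$Reaction_time[DATA_example$Condition == 'control'])
```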

Now we should be ready for an exercise!


Exercise 1

  1. How many left-handed and how many right-handed individuals participated in the test?
  2. Do right- or left-handers have the largest shoe sizes on average?
  3. Write a new variable called shark_data with all the data from students who prefer diving with white sharks (see the column variable 'Pick_danger'), except the first column, which we want to leave out.
  4. Who are faster at the tongue twister on average – males or females?
  5. Are people who can hold their breath longer than 60 secs faster on average at the tongue_twister than people who can only hold their breath for a shorter period?
