First of all, we need to open the dataset.
nobel <- read.csv('C:/Users/benjamcr/Rproj/GEOG3006/data/nobel.csv')
head(nobel)
id firstname surname year category
1 1 Wilhelm Conrad Röntgen 1901 Physics
2 2 Hendrik A. Lorentz 1902 Physics
3 3 Pieter Zeeman 1902 Physics
4 4 Henri Becquerel 1903 Physics
5 5 Pierre Curie 1903 Physics
6 6 Marie Curie 1903 Physics
affiliation
1 Munich University
2 Leiden University
3 Amsterdam University
4 École Polytechnique
5 École municipale de physique et de chimie industrielles (Municipal School of Industrial Physics and Chemistry)
6 <NA>
city country born_date died_date gender born_city
1 Munich Germany 1845-03-27 1923-02-10 male Remscheid
2 Leiden Netherlands 1853-07-18 1928-02-04 male Arnhem
3 Amsterdam Netherlands 1865-05-25 1943-10-09 male Zonnemaire
4 Paris France 1852-12-15 1908-08-25 male Paris
5 Paris France 1859-05-15 1906-04-19 male Paris
6 <NA> <NA> 1867-11-07 1934-07-04 female Warsaw
born_country born_country_code died_city died_country
1 Germany DE Munich Germany
2 Netherlands NL <NA> Netherlands
3 Netherlands NL Amsterdam Netherlands
4 France FR <NA> France
5 France FR Paris France
6 Poland PL Sallanches France
died_country_code overall_motivation share
1 DE <NA> 1
2 NL <NA> 2
3 NL <NA> 2
4 FR <NA> 2
5 FR <NA> 4
6 FR <NA> 4
motivation
1 "in recognition of the extraordinary services he has rendered by the discovery of the remarkable rays subsequently named after him"
2 "in recognition of the extraordinary service they rendered by their researches into the influence of magnetism upon radiation phenomena"
3 "in recognition of the extraordinary service they rendered by their researches into the influence of magnetism upon radiation phenomena"
4 "in recognition of the extraordinary services he has rendered by his discovery of spontaneous radioactivity"
5 "in recognition of the extraordinary services they have rendered by their joint researches on the radiation phenomena discovered by Professor Henri Becquerel"
6 "in recognition of the extraordinary services they have rendered by their joint researches on the radiation phenomena discovered by Professor Henri Becquerel"
born_country_original born_city_original
1 Prussia (now Germany) Lennep (now Remscheid)
2 the Netherlands Arnhem
3 the Netherlands Zonnemaire
4 France Paris
5 France Paris
6 Russian Empire (now Poland) Warsaw
died_country_original died_city_original city_original
1 Germany Munich Munich
2 the Netherlands <NA> Leiden
3 the Netherlands Amsterdam Amsterdam
4 France <NA> Paris
5 France Paris Paris
6 France Sallanches <NA>
country_original
1 Germany
2 the Netherlands
3 the Netherlands
4 France
5 France
6 <NA>
Question 1: How many observations and how many variables are in the dataset?
We can use the functions nrow and ncol to get the number of observations and the number of variables in a dataset. As you may remember from the tutorials, the rows of a dataset are the observations (Tor, Gus or Lena in the kids_frame dataset, for instance). The columns are the variables (names, shirt_color or height in the kids_frame dataset).
# Number of observations
nrow(nobel)
[1] 935
# Number of variables
ncol(nobel)
[1] 26
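Both numbers can also be obtained in one call. A minimal sketch on a small made-up data frame (toy is hypothetical and only stands in for a real dataset): dim() returns the number of rows followed by the number of columns, and str() gives a compact overview of each variable.

```r
# Toy data frame, values made up for illustration
toy <- data.frame(
  name = c("Ada", "Ben", "Cy"),
  height = c(110, 130, 115)
)

# Number of rows first, then number of columns
toy_dim <- dim(toy)
toy_dim

# Name, type and first values of every column
str(toy)
```

On the nobel dataset the equivalent call would be dim(nobel).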
Question 2: How many women won a Nobel Prize? How many men?
Although there are multiple ways of answering this question in R, a fast and efficient way of doing so is to combine group_by() and summarise(), two functions included in the dplyr package.
library(tidyverse)
nobel %>%
dplyr::group_by(gender) %>%
dplyr::summarise(n = n())
# A tibble: 3 x 2
gender n
<chr> <int>
1 female 52
2 male 856
3 org 27
In the first line of the code above I load the tidyverse library. Later in the document we will see what this library is and why it is so useful.
From the output we can see that only 52 women got a Nobel Prize, while 856 men did.
We can calculate the proportion of women among the individual laureates (organisations excluded):
[1] 0.05726872
Only about 5.7%!
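The call that produced this value is not shown. A minimal sketch, assuming the share is taken among individual laureates only (the 27 organisations are left out) and using the counts from the table above:

```r
# Counts taken from the gender table above
n_female <- 52
n_male <- 856

# Share of women among individual laureates (organisations excluded)
prop_female <- n_female / (n_female + n_male)
prop_female

# The same share expressed as a percentage, rounded to one decimal
pct_female <- round(100 * prop_female, 1)
pct_female
```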
Question 3: Create a new data frame called nobel_living that filters for laureates whose country is available (hint: drop_na(), look on Google what it does!), who are female, and who are still alive.
Here, we need to create a new object that we will call nobel_living. To select the data we are supposed to place in this dataset we will use the function filter, again from the dplyr package.
nobel_living <- nobel %>%
drop_na(country) %>%
filter(gender == "female") %>%
filter(is.na(died_date))
The code above has three steps. Note that in this case the order of the steps does not matter.
drop_na removes the laureates whose country is missing.
The first filter keeps only the laureates whose gender is female.
The second filter keeps only the laureates whose died_date is NA. Here I combine 2 functions: filter and is.na(). The function is.na() tests whether a particular cell value is NA or not. Here, if is.na() returns TRUE (the person is still living), then the row is kept. Roughly, you can translate this line of code as "Filter the laureates whose died_date is NA".
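To see what is.na() and drop_na() do on their own, here is a minimal sketch on a made-up table (people and its values are hypothetical, not taken from the nobel dataset). Note that drop_na() comes from the tidyr package, which is loaded together with the tidyverse.

```r
library(dplyr)
library(tidyr)  # drop_na() lives in tidyr

# Hypothetical mini dataset, made up for illustration
people <- data.frame(
  name = c("A", "B", "C", "D"),
  country = c("Norway", NA, "France", "Norway"),
  died_date = c(NA, NA, "1934-07-04", "2001-01-01")
)

# TRUE where the value is missing, FALSE otherwise
is.na(people$died_date)

living <- people %>%
  drop_na(country) %>%        # removes B, whose country is missing
  filter(is.na(died_date))    # keeps the rows without a death date

living  # only A is left
```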
Note that you can filter only the deceased laureates by adding a ! before the is.na() function.
# Stored under a different name so that nobel_living is not overwritten
nobel_dead <- nobel %>%
drop_na(country) %>%
filter(gender == "female") %>%
filter(!is.na(died_date))
We can translate this as "Filter the laureates whose died_date is not NA".
Question 4: With this new dataset, summarise the number of female laureates who are still alive by country and make a histogram of the number of female laureates per country. Your histogram should include a title and titles for the axes.
First, we will create a dataset which summarises the number of female laureates by country:
nobel_female_country <- nobel_living %>%
group_by(country) %>%
summarise(n = n())
This dataset should have 7 observations and only two variables: country, and n, the number of living female laureates per country.
Then we can build our visualisation with the package ggplot2. Here I do not build a histogram but rather a lollipop chart.
nobel_female_country %>%
  ggplot(., aes(x = country, y = n)) +
  geom_point(size = 3, color = "#a25079") +
  # Grey stem from 0 up to the count of each country
  geom_segment(aes(x = country, xend = country, y = 0, yend = n), color = "grey") +
  coord_flip() +
  theme_minimal() +
  # labs() expects x and y (not xlab and ylab); the labels follow the axes when coord_flip() swaps them
  labs(title = "Number of Nobel Prizes per country",
       x = "Country",
       y = "Number of Nobel Prizes")
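The question asks for a histogram. Since the counts are already summarised per country, the closest equivalent is a bar chart. A minimal sketch with geom_col(), using made-up counts (toy_counts, its countries and its numbers are hypothetical, not the real result):

```r
library(ggplot2)

# Hypothetical counts, made up for illustration only
toy_counts <- data.frame(
  country = c("A", "B", "C"),
  n = c(5, 2, 1)
)

# reorder() sorts the bars by count instead of alphabetically
p <- ggplot(toy_counts, aes(x = reorder(country, n), y = n)) +
  geom_col(fill = "#a25079") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Number of laureates per country (toy data)",
       x = "Country",
       y = "Number of laureates")

p
```

With the real data you would replace toy_counts by nobel_female_country.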
Question 5: Which country has the most female laureates?
From the lollipop chart we drew, we can clearly see that the US has the most living female laureates. Note that Norway has a woman who won a Nobel Prize and who is still alive: in fact, May-Britt Moser works at NTNU in St-Olav!
tidyverse library
Previously, I have used the tidyverse package, which is a collection of R packages that share an underlying design philosophy, grammar and data structures.
We have already been using some of the packages included in the tidyverse, including dplyr, magrittr (the package which provides the pipe) and ggplot2. There is a lot more to explore and a lot of very useful functions in it. Of course I do not know everything in the tidyverse, and later in this course you may find more efficient ways to wrangle the data than I do. Let me know if that’s the case!
I highly recommend installing the tidyverse package and loading it at the beginning of your script. If you do that you won’t have to load ggplot2, dplyr… every time.
install.packages('tidyverse')
library(tidyverse)
If you want to improve your data science skills you can find some tips here.
You have already been introduced to some of the main functions from the dplyr package:
filter
group_by
summarise
There are two other functions which will help you in your data analysis workflow, namely select() and mutate().
select()
It’s not uncommon to get datasets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you’re actually interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
select() is not very useful on the kids_frame dataset but you can still get the idea:
# We reconstruct the kids_frame dataset
kids_frame <- data.frame(
names = c("Tor", "Gus", "Bob", "Di", "Lena", "Tony", "Ingrid", "Maria", "Ed", "Raghnild"),
height = c(110, 130, 115, 140, 125, 135, 120, 130, 130, 115),
shirt_color = c("green", "green", "green", "blue", "blue", "green", "blue", "green", "green", "blue"),
shoe_color = c("blue", "red", "grey", "blue", "pink", "red", "grey", "pink", "pink", "blue"),
sex = c("m", "m", "m", "f", "f", "m", "f", "f", "m", "f"),
age = c(8,11,8,12,11,11,9,12,12,8))
Imagine you would like to create a new dataset with only the columns names and height. With the select() function you would write:
kids_selected <- kids_frame %>%
dplyr::select(names, height)
head(kids_selected)
names height
1 Tor 110
2 Gus 130
3 Bob 115
4 Di 140
5 Lena 125
6 Tony 135
You have now reduced the initial dataset to only two columns.
mutate()
Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().
By default, mutate() adds new columns at the end of your dataset, so with a wide dataset you may want to work on a narrower one to see the new variables. Remember that when you’re in RStudio, the easiest way to see all the columns is View().
Here I will create a new column about the kids’ favorite food:
names height shirt_color shoe_color sex age fav_food
1 Tor 110 green blue m 8 strawberry
2 Gus 130 green red m 11 candy
3 Bob 115 green grey m 8 pasta
4 Di 140 blue blue f 12 chocolate
5 Lena 125 blue pink f 11 candy
6 Tony 135 green red m 11 beef
Now you can see that the new column has been created!
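The call that produced this output is not shown. A minimal sketch, assuming the column was added with mutate(): only the six favorite foods visible in the output are known, so the last four are left as NA here. To keep the example short, kids_frame is rebuilt with two columns only, and height_m is an extra, hypothetical column that illustrates a new column computed from an existing one.

```r
library(dplyr)

# Shortened version of kids_frame (names and height only)
kids_frame <- data.frame(
  names = c("Tor", "Gus", "Bob", "Di", "Lena", "Tony", "Ingrid", "Maria", "Ed", "Raghnild"),
  height = c(110, 130, 115, 140, 125, 135, 120, 130, 130, 115)
)

kids_frame <- kids_frame %>%
  mutate(
    # First six values come from the output above; the rest are unknown, hence NA
    fav_food = c("strawberry", "candy", "pasta", "chocolate", "candy", "beef",
                 NA, NA, NA, NA),
    # Hypothetical extra column: height converted from centimetres to metres
    height_m = height / 100
  )

head(kids_frame)
```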
Now you know the main functions of the tidyverse package that will allow you to wrangle data efficiently. The list is not exhaustive and, as I said, I do not know all the functions! However, here is a small list of tidyverse functions worth mentioning. You can look them up in R by typing ?NAME_FUNCTION or on Google.
transmute
rename
starts_with
ends_with
slice
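To give an idea of what these functions do, here is a minimal sketch on a shortened, four-row version of kids_frame (the object name kids and the new column names height_cm and height_m are made up for the example):

```r
library(dplyr)

# Shortened version of kids_frame
kids <- data.frame(
  names = c("Tor", "Gus", "Bob", "Di"),
  height = c(110, 130, 115, 140),
  shirt_color = c("green", "green", "green", "blue"),
  shoe_color = c("blue", "red", "grey", "blue")
)

# ends_with() and starts_with() are helpers used inside select()
colors <- kids %>% select(ends_with("color"))    # shirt_color and shoe_color
shirts <- kids %>% select(starts_with("shirt"))  # shirt_color only

# rename() changes a column name: new_name = old_name
renamed <- kids %>% rename(height_cm = height)

# transmute() works like mutate() but keeps only the columns you list
metres <- kids %>% transmute(names, height_m = height / 100)

# slice() picks rows by position
first_two <- kids %>% slice(1:2)
```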
During the next lab we will finish the block on data visualisation and data wrangling. We will mainly repeat what we have been doing, using the same functions.
First, download the dataset here:
As you may see, this is an .rda file. An .rda file is a compressed file that stores one or more R objects under their original names. I saved this file in the folder “data”. To open it I should write:
load("data/ncbikecrash.rda")
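Unlike read.csv(), load() does not need an assignment: the objects come back under the names they were saved with. A minimal sketch of the round trip with a made-up object (toy and the temporary file path are hypothetical):

```r
# Toy object, made up for illustration
toy <- data.frame(id = 1:3)

# Save it to a temporary .rda file, then remove it from the session
path <- file.path(tempdir(), "toy.rda")
save(toy, file = path)
rm(toy)

# load() restores the object and returns the names of what it loaded
loaded <- load(path)
loaded

# toy is available again under its original name
nrow(toy)
```

After loading the bike crash file, ls() lists the objects now available in your session.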
You will use what you learned from the previous labs to answer these questions:
Question 1: Run View(ncbikecrash) in your Console to view the data in the data viewer. What does each row in the dataset represent?
Question 2: How many bike crashes were recorded in NC between 2007 and 2014? How many variables are recorded on these crashes?
Question 3: How many bike crashes occurred in residential development areas where the driver was between 0 and 19 years old?
Question 4: Create a frequency table of the estimated speed of the car (driver_est_speed) involved in the crash. What is the most common estimated speed range in the dataset?
Good luck!