1 Today's exercise

We will use the tidyverse to extract and transform sociodemographic data for local areas in Paris. At the end of the class, we will produce maps of the 1000 IRISes in Paris, coloured by sociodemographic variables. For example, we will be able to see which areas of Paris have the highest number of qualified professionals, the highest number of immigrants, or the highest number of young people.

We aim to create two data frames, one by IRIS and the other by arrondissement, that look something like this:

We then aim to plot this data on maps of Paris.

2 Set-up

There are many different libraries in the R universe for a wide variety of tasks. In this first class, we will be covering methods to import and transform data in R using the tidyverse. I try to adhere to the tidyverse style guide.

Libraries in the tidyverse

2.1 R projects and structuring your code

R projects are good for managing your data and scripts in a particular folder on your computer. Using R-Studio, click File -> New project to create a new R project in a new or existing folder. A good name for a new folder is something like "Class1", which you can save somewhere logical on your computer, such as in a folder called "Introduction_to_R".

Within the folder "Class1", create 3 subfolders, "Data", "Scripts" and "Output". We will save all R code in the folder "Scripts". A key advantage of using R projects is that all paths leading to our input data and output files will be relative to the location of the R project (the folder "Class1").

Create a new R script by clicking File -> New file -> R Script. This should be saved in the folder "Scripts". You can call this script something like "cleaning_paris_data".

2.2 Elements of R-Studio

R-Studio interface

Top left: Script editor - this is where you type your code.

Bottom left: Console - this is where the code runs. You can also type and run code here, but generally we want to write and edit code in the script editor, so that it can be easily saved and organised.

Top right: Information on R objects that we have created (dataframes, vectors, variables...)

Bottom right: Files, plots, packages and versions, help...

Note on running scripts: To run the whole script, click Run. To run part of the script, highlight it and press Ctrl+Enter.

2.3 Commenting code

It is always a good idea to comment lines of code. Use # at the start of a line to write a comment, or to deactivate a line of code so that it does not run.
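
As a minimal sketch (the object name x is purely illustrative), the different uses of # look like this:

```r
# A full-line comment describing the next step
x <- 1 + 1  # a comment can also follow code on the same line
# x <- 100  # this line is deactivated, so x keeps its previous value
print(x)
```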

2.4 Installing packages

Installing packages in R is done with the command install.packages("package_name"). To install the tidyverse package, we can thus type install.packages("tidyverse"). Once the package has been installed, it needs to be loaded every session using the code library("package_name"). Thus, the first lines of our script will be:

### Installing and loading packages
# install.packages("tidyverse")
# install.packages("sf")
library("tidyverse")
library("sf")

A useful piece of code to install packages only if they are not already installed, then load them, is:

### installs if necessary and loads tidyverse and sf, another package which we will be using today
list.of.packages <- c("tidyverse", "sf")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages, repos = "http://cran.us.r-project.org")

invisible(lapply(list.of.packages, library, character.only = TRUE))
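
The check for missing packages can also be sketched with requireNamespace, which tests whether a package can be loaded without attaching it. The helper name missing_packages is hypothetical, and "stats" and "utils" are used here only because they ship with every R installation.

```r
# Hypothetical helper: returns the names of packages that are not installed
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise
  is_available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_available]
}

# Both packages come with base R, so nothing should be reported as missing
missing_packages(c("stats", "utils"))
```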

3 Reading data

All the data for today's exercise can be downloaded from here.

This folder contains the French population data at the IRIS level (50k units in metropolitan France) (source: Insee website), the shapefiles for the IRIS (source) and the shapefiles for the arrondissements in Paris (source).

3.1 Reading delimited data

This data is in the form of a .csv file (comma-separated values). It can be read using the function read_delim from the tidyverse package readr. To the left of the <- sign is the new R object we wish to define; to the right is how we wish to define it.

df <- read_delim(file = "Data/base-ic-evol-struct-pop-2013.csv", 
                 delim = ",", 
                 col_names = TRUE, 
                 skip = 5, 
                 locale = locale(encoding = "UTF-8"))

The options for the function read_delim can be found by typing ?read_delim in the console. Here, we just present a few frequently used options.

Argument	Description
file (required)	path to file (relative to R project)
delim (required)	delimiter
col_names (TRUE by default)	TRUE if first line is column names, else FALSE or a vector of column names
skip (0 by default)	the number of lines to skip at the start
locale	control the regional options, importantly the encoding

We can check that our dataframe df is how we want it to be by typing View(df) in the console, or by clicking on the data frame in the "Environment" panel.
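
To see how these arguments interact, here is a small sketch on a made-up file written to a temporary location. The file contents, the two title lines and the object names toy_path and toy are assumptions for illustration, not the Insee data. The col_types argument, which is not used above, forces the codes to be read as character rather than guessed as numbers.

```r
library(readr)

# Write a tiny illustrative file: two title lines, then a header and two rows
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c("Some title line to skip",
    "Another line to skip",
    "IRIS,COM,P13_POP",
    "751010101,75101,1200",
    "751010102,75101,0"),
  toy_path
)

toy <- read_delim(
  file = toy_path,
  delim = ",",
  col_names = TRUE,   # the first line read (after skipping) holds the column names
  skip = 2,           # skip the two title lines
  col_types = cols(   # declare the column types instead of letting readr guess
    IRIS = col_character(),
    COM = col_character(),
    P13_POP = col_double()
  ),
  locale = locale(encoding = "UTF-8")
)

toy
```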

Encoding matters

3.2 Other types of data

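There are other functions in the tidyverse family that can be used to read most other classic types of data, for example read_csv (package readr), read_xls (package readxl), and read_dta, read_sas and read_sav (package haven). The packages readxl and haven are installed with the tidyverse but must be loaded separately with library. These functions work similarly.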

3.3 The tibble in R

A tibble is a standard R object used to store datasets. It is the more modern version of a data frame. A tibble consists of rows and columns, where each column contains one of five basic classes of data.

logical (e.g., TRUE, FALSE)
integer (e.g., 213, as.integer(3))
numeric (real or decimal) (e.g., 2, 2.0, pi)
complex (e.g., 1 + 0i, 1 + 4i)
character (e.g., "hello", "AA231")

When our data set was imported from .csv, R recognised character and numeric columns. We will later learn how to change column types.

Each column may be accessed by its name, e.g. df$IRIS, or by its position, e.g. df[,2].
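
The sketch below builds a small made-up tibble (the values and the columns n_buildings and residential are invented for illustration) to show the column classes and the different ways of accessing a column. Note that on a tibble, selecting by position with [ , ] returns a one-column tibble, whereas $ and [[ ]] return a plain vector.

```r
library(tibble)

toy <- tibble(
  IRIS = c("751010101", "751010102"),  # character
  P13_POP = c(1200.5, 0),              # numeric (double)
  n_buildings = c(12L, 3L),            # integer
  residential = c(TRUE, FALSE)         # logical
)

sapply(toy, class)       # the class of each column

by_name <- toy$IRIS      # a character vector
by_position <- toy[, 2]  # a tibble with a single column
as_vector <- toy[[2]]    # the second column as a plain vector
```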

4 The dplyr pipe function

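The pipe operator %>% is provided by the package magrittr and made available by dplyr in the tidyverse. It passes the object on its left as the first argument of the function on its right, which lets us chain successive transformations of a tibble. A cheat sheet for the dplyr package can be found by clicking Help -> Cheatsheets -> Data Transformation with dplyr, or at this link.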

The pipe function
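
As a small illustration on a plain numeric vector (the object names nested and piped are arbitrary), the same computation can be written as nested function calls or as a pipeline read from left to right:

```r
library(dplyr)

# Nested calls: read from the inside out
nested <- round(sqrt(sum(c(1, 4, 9))), 1)

# The same steps with the pipe: read from left to right
piped <- c(1, 4, 9) %>%
  sum() %>%     # add up the values
  sqrt() %>%    # take the square root of the total
  round(1)      # round to one decimal place

identical(nested, piped)
```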

4.1 Selecting rows and columns: select and filter

We want to select only the columns IRIS, COM, P13_POP and the age variables P13_POP0014 through to P13_POP75P. We want to select the rows that denote data from Paris only. To select columns, we use the function select. To select rows, we use the function filter.

iris <- df %>% # creates a copy of df called iris
  filter(DEP=="75") %>% # filters ROWS where departement == 75 (Paris)
  select(IRIS, COM, P13_POP, P13_POP0014:P13_POP75P) # selects COLS

The order of these two lines matters: if we select the columns first, the variable DEP is dropped and we can no longer use it to filter the rows. It is also possible to deselect variables by putting a minus sign before the variable, e.g. select(-COM).

4.1.1 Renaming columns

Columns can be renamed by using rename(new_name=old_name), or by integrating the new names into the select function, e.g. select(new_name_1=old_name_1, new_name_2=old_name_2).
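
A minimal sketch of both approaches on a made-up tibble (the values and the new names population and iris_code are illustrative only):

```r
library(dplyr)

toy <- tibble(
  IRIS = c("751010101", "751010102"),
  COM = c("75101", "75101"),
  P13_POP = c(1200, 0)
)

# rename() keeps all columns and changes only the names given
renamed <- toy %>%
  rename(population = P13_POP)

# select() keeps only the listed columns, renaming them on the way
selected <- toy %>%
  select(iris_code = IRIS, population = P13_POP)

names(renamed)
names(selected)
```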

4.1.2 Note on logical statements in R (and most other languages)

In logical statements, e.g. "if A is equal to B, then apply function Y", we use the following notation. Specifically, we use a double equals sign for 'equal to'.

Statement	Meaning
`==`	equal to
`>=`, `<=`	greater than or equal to, less than or equal to
`>`, `<`	greater than, less than
`!=`	not equal to
`&`	and
`|`	or
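
These operators work element by element on vectors, which is what allows them to be used inside filter. A short sketch on an arbitrary vector x:

```r
x <- c(1, 5, 10)

is_five <- x == 5             # equal to 5
mid_range <- x >= 5 & x != 10 # at least 5 AND not equal to 10
extremes <- x < 2 | x > 8     # below 2 OR above 8

is_five
mid_range
extremes
```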

4.2 Mutating variables

We now wish to convert all the population variables to percentages. To modify one, many or all columns, we use the functions mutate, mutate_at or mutate_all.

To convert one variable, say P13_POP0014, to a percentage, we divide this variable by P13_POP. This can be done with the function mutate. We can continue changing each variable to a percentage like this, but it will become quite cumbersome!

iris <- df %>% 
  filter(DEP=="75") %>%
  select(IRIS, COM, P13_POP, P13_POP0014:P13_POP75P) %>%
  mutate(P13_POP0014 = P13_POP0014/P13_POP) %>%
  mutate(P13_POP1529 = P13_POP1529/P13_POP) %>%
  mutate(P13_POP3044 = P13_POP3044/P13_POP)

A better way to do this is to use mutate_at to mutate multiple variables at once.

iris <- df %>% 
  filter(DEP=="75") %>%
  select(IRIS, COM, P13_POP, P13_POP0014:P13_POP75P) %>%
  mutate_at(vars(P13_POP0014:P13_POP75P), ~ (./P13_POP))
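
In dplyr 1.0 and later, the same operation can also be written with across inside mutate. The sketch below assumes a recent version of dplyr and uses a made-up tibble with only two age variables, so the numbers are illustrative rather than real Insee values.

```r
library(dplyr)

toy <- tibble(
  IRIS = c("751010101", "751010102"),
  P13_POP = c(100, 200),
  P13_POP0014 = c(20, 50),
  P13_POP75P = c(10, 30)
)

toy_pc <- toy %>%
  # divide every column from P13_POP0014 to P13_POP75P by the total population
  mutate(across(P13_POP0014:P13_POP75P, ~ .x / P13_POP))

toy_pc
```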

4.2.1 Conditional mutations

We notice that there are some IRISes for which the population is 0. In these cases, dividing 0 by 0 gives the result NaN (not a number). We wish to convert these values to 0. We can use mutate_if to mutate only the columns satisfying a particular condition, and we can use the ifelse function to replace NaN by 0. The three arguments of the ifelse function are:

Logical statement
Action to take if logical statement is true
Action to take if logical statement is false

iris <- df %>% 
  filter(DEP=="75") %>%
  select(IRIS, COM, P13_POP, P13_POP0014:P13_POP75P) %>%
  mutate_at(vars(P13_POP0014:P13_POP75P), ~ (./P13_POP)) %>%
  mutate_if(is.numeric, ~ (ifelse(is.nan(.), 0, .)))
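
To see what ifelse and is.nan do in isolation, here is a sketch on two made-up vectors (the names young, total, share and share_clean are illustrative):

```r
young <- c(20, 0)    # number of young people in two areas
total <- c(100, 0)   # total population; the second area is empty

share <- young / total   # 0 / 0 gives NaN for the empty area
share

# ifelse(test, value if TRUE, value if FALSE), applied element by element
share_clean <- ifelse(is.nan(share), 0, share)
share_clean
```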

4.2.2 Some basic string operations

Here we learn two simple functions for string variables, substr and paste0.

Say we wish to convert the column COM into a more readable string, e.g. instead of "75114", we wish to write "Paris 14". We use the function substr to extract from the 4th to the 5th position of the string, and paste0 to concatenate strings.

iris <- df %>% 
  filter(DEP=="75") %>%
  select(IRIS, COM, P13_POP, P13_POP0014:P13_POP75P) %>%
  mutate_at(vars(P13_POP0014:P13_POP75P), ~ (./P13_POP)) %>%
  mutate_if(is.numeric, ~ (ifelse(is.nan(.), 0, .))) %>%
  mutate(name_arrd = substr(COM, 4, 5)) %>%
  mutate(name_arrd = paste0("Paris ", name_arrd))
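
The two functions can be tried on a plain character vector first. In this sketch the vector com is made up; the last line is an optional variation, not part of the pipeline above, that converts the extracted digits to an integer to drop the leading zero.

```r
com <- c("75101", "75114")

arrd_number <- substr(com, 4, 5)   # characters 4 to 5 of each string
arrd_number

with_zero <- paste0("Paris ", arrd_number)                 # keeps the leading zero
without_zero <- paste0("Paris ", as.integer(arrd_number))  # drops the leading zero

with_zero
without_zero
```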

4.3 Grouping and aggregating variables

We now wish to group the IRISes by arrondissement, in order to obtain aggregated statistics of the population by arrondissement. Using the function group_by, we can group the rows by COM, which indicates the arrondissement.

To apply a function to a column by group, we use summarise. As with mutate, there are also summarise_at, summarise_all and summarise_if.

After this aggregation, we need to ungroup our data frame.

arrd <- df %>% 
  filter(DEP=="75") %>%
  select(COM, P13_POP, P13_POP0014:P13_POP75P) %>%
  group_by(COM) %>%
  summarise_all(~ (sum(.))) %>%
  ungroup() %>%
  mutate_at(vars(P13_POP0014:P13_POP75P), ~ (./P13_POP)) %>%
  mutate_if(is.numeric, ~ (ifelse(is.nan(.), 0, .)))

The final two lines are the same as before.
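
The order of operations matters here: the counts are summed by arrondissement first, and only then divided by the total population. A sketch on a made-up tibble with three IRISes in two arrondissements (all values are invented) makes this visible:

```r
library(dplyr)

toy <- tibble(
  COM = c("75101", "75101", "75102"),
  P13_POP = c(100, 200, 50),
  P13_POP75P = c(10, 30, 5)
)

toy_arrd <- toy %>%
  group_by(COM) %>%                 # one group per arrondissement
  summarise_all(~ sum(.)) %>%       # sum each count within the group
  ungroup() %>%
  mutate(P13_POP75P = P13_POP75P / P13_POP)  # share computed on the totals

toy_arrd
```

Averaging the IRIS-level percentages instead would give each IRIS the same weight regardless of its population, which is generally not what we want.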

4.4 Changing from wide to long, and long to wide

Our data is currently in wide format. To change it from wide to long format, we use the function gather, and to change it from long to wide format, we use the function spread.

long <- arrd %>%
  gather(key = population_variable, value = value, -COM)

wide <- long %>%
  spread(key = population_variable, value = value)
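
In tidyr 1.0 and later, gather and spread are superseded by pivot_longer and pivot_wider, although the older functions still work. A sketch of the equivalent calls, assuming a recent version of tidyr and using a made-up tibble toy_arrd in place of arrd:

```r
library(dplyr)
library(tidyr)

toy_arrd <- tibble(
  COM = c("75101", "75102"),
  P13_POP0014 = c(0.10, 0.15),
  P13_POP75P = c(0.08, 0.05)
)

# Wide to long: one row per arrondissement and variable
long <- toy_arrd %>%
  pivot_longer(cols = -COM,
               names_to = "population_variable",
               values_to = "value")

# Long to wide: back to one row per arrondissement
wide <- long %>%
  pivot_wider(names_from = population_variable,
              values_from = value)

long
wide
```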

4.5 Writing data

We can write data in .csv format using write_csv. We can also use the .rds format (R's own format for storing a single object) with write_rds in order to preserve the tibble attributes, such as which variables are numeric and which are character.

iris <- df %>% 
  filter(DEP=="75") %>%
  select(IRIS, COM, P13_POP, P13_POP0014:P13_POP75P) %>%
  mutate_at(vars(P13_POP0014:P13_POP75P), ~ (./P13_POP)) %>%
  mutate_if(is.numeric, ~ (ifelse(is.nan(.), 0, .))) %>%
  mutate(name_arrd = substr(COM, 4, 5)) %>%
  mutate(name_arrd = paste0("Paris ", name_arrd)) %>%
  write_csv("Output/iris.csv") %>%
  write_rds("Output/iris.rds") 

arrd <- df %>% 
  filter(DEP=="75") %>%
  select(COM, P13_POP, P13_POP0014:P13_POP75P) %>%
  group_by(COM) %>%
  summarise_all(~ (sum(.))) %>%
  ungroup() %>%
  mutate_at(vars(P13_POP0014:P13_POP75P), ~ (./P13_POP)) %>%
  mutate_if(is.numeric, ~ (ifelse(is.nan(.), 0, .))) %>%
  write_csv("Output/arrd.csv") %>%
  write_rds("Output/arrd.rds")
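
The difference between the two formats shows up when the files are read back. In this sketch a made-up tibble is written to temporary files (the paths and object names are illustrative): the .csv stores only text, so the column types are guessed again on reading, whereas the .rds file restores the object as it was saved.

```r
library(readr)
library(tibble)

toy <- tibble(COM = c("75101", "75102"), share = c(0.25, 0.5))

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write_csv(toy, csv_path)
write_rds(toy, rds_path)

from_csv <- suppressMessages(read_csv(csv_path))  # types are guessed from the text
from_rds <- read_rds(rds_path)                    # types are restored as saved

class(from_csv$COM)  # the codes are guessed to be numbers
class(from_rds$COM)  # the codes are still character
```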

4.6 Joining two datasets

There are four key types of joins.

Function	Meaning
`a %>% left_join(b, by="x")`	Join matching rows from b to a
`a %>% right_join(b, by="x")`	Join matching rows from a to b
`a %>% inner_join(b, by="x")`	Join data retaining only rows that match in a and b
`a %>% full_join(b, by="x")`	Join data retaining all rows from a and b
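
The four joins can be compared on two small made-up tibbles a and b that share the key x but only partly overlap (all values are invented for illustration):

```r
library(dplyr)

a <- tibble(x = c("A", "B", "C"), pop = c(10, 20, 30))
b <- tibble(x = c("B", "C", "D"), area = c(1, 2, 3))

left <- a %>% left_join(b, by = "x")     # keeps A, B, C; area is NA for A
right <- a %>% right_join(b, by = "x")   # keeps B, C, D; pop is NA for D
inner <- a %>% inner_join(b, by = "x")   # keeps only B and C
full <- a %>% full_join(b, by = "x")     # keeps A, B, C and D

nrow(left)
nrow(right)
nrow(inner)
nrow(full)
```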

We will apply a join with geographical data, in order to display our variables on a map.

4.6.1 Import geographical data

Shapefiles are a common format of geographical data. We can import them using the package sf, which is not part of the tidyverse, but follows the same syntax. We select only the variable corresponding to the IRIS code, and call this IRIS to match our other data set. We then apply a right_join to join the geographical data to the iris tibble that we have created.

irisshp <- read_sf(dsn = "Data/iris", layer = "CONTOURS-IRIS") %>%
  select(IRIS=CODE_IRIS) %>% # here we do 2 things, select the column and rename it to IRIS
  right_join(iris, by="IRIS") # join matching rows from irisshp to iris

4.6.2 Plot data

In the next class, we will plot data in a much nicer way using ggplot2. However, for now, we will simply use the plot function.

We wish to plot a demographic variable, such as the percentage of people over 75 years old, on a map of Paris. We select only the variable of interest, then use the function plot.

iristoplot <- irisshp %>%
  # mutate(P13_POP75P=ifelse(TYP_IRIS=="H", P13_POP75P, NA)) %>%  ### optional line to exclude IRISes with no or few inhabitants (requires keeping TYP_IRIS in the earlier select)
  select(P13_POP75P) # selects variable we are interested in

plot(iristoplot)

In order to save the plots, use the following code.

### to save plot use these two lines
# dev.copy(pdf, 'Output/age.pdf')
# dev.off()
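
An alternative sketch is to open the pdf device before plotting and close it afterwards, so that the plot is drawn directly into the file. Here a temporary path and a trivial plot stand in for 'Output/age.pdf' and the map; both are placeholders.

```r
# Placeholder path; in the project this could be "Output/age.pdf"
out_path <- file.path(tempdir(), "age.pdf")

pdf(out_path)   # open a pdf graphics device writing to out_path
plot(1:10)      # any plotting code goes here, e.g. plot(iristoplot)
dev.off()       # close the device so that the file is written

file.exists(out_path)
```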

The same plot by arrondissement is given by the following code.

arrdshp <- read_sf(dsn = "Data/arrondissements", layer = "arrondissements") %>%
  select(COM=c_arinsee) %>% 
  mutate(COM=as.character(COM)) %>%
  left_join(arrd, by="COM") %>%
  select(P13_POP75P)

plot(arrdshp)

5 Exercises

Create a map of Paris by arrondissement with the percentage of immigrants
Choose a French city aside from Paris and create a map of the percentage of qualified professionals (cadre) by IRIS in that city (C13_POP15P_CS3)

Comment each line of your code using # to demonstrate that you understand what each line is doing.

Upload your code and two maps using the link on the home page.

Class 1: Introduction to data wrangling with the tidyverse