1 Today’s exercise

We will scrape data from Wikipedia on the leaders of all of the countries in the world, their political parties and the political orientation of those parties. This will allow us to plot a map of the world in which each country is coloured by the political orientation of its leader, wherever we manage to scrape it.

Map of the political orientations of heads of government

We will be using the page List of current heads of state and government in order to obtain a list of names of current heads of government in each country. We will then follow the Wikipedia link for each of these heads of government in order to scrape their political party. Then, we will follow the Wikipedia link for each of these parties in order to identify its political position.

2 Set-up

You will need a browser such as Chrome or Firefox and an internet connection. You will also need the packages tidyverse and rvest for the web scraping part, and the packages countrycode and maps to produce the map later on.

### installs any missing packages, then loads tidyverse, rvest, countrycode and maps
list.of.packages <- c("tidyverse", "rvest", "countrycode", "maps")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages, repos = "http://cran.us.r-project.org")
invisible(lapply(list.of.packages, library, character.only = TRUE))

3 Theory

To begin, we will briefly cover six aspects of essential theory: functions, if/else statements, for loops, tryCatch, regular expressions and XPaths.

3.1 Functions

A function is a set of instructions that R executes. It takes arguments as inputs, and returns an output. It can be really helpful to create a function in R when you have to repeat these instructions several times.


name_function <- function(argument1, argument2, ...) {
  instruction 
  return(output) # optional: without return(), the value of the last expression is returned
}

For example, we can define a function called square_root which computes the square root of an argument x.

square_root <- function(x) {
  sqrt(x)
}

square_root(16)

## [1] 4

square_root(144)

## [1] 12

Functions can have multiple arguments, or even no arguments.

sum_square_root_x_y <- function(x,y) {
  sqrt(x)+sqrt(y)
}

print_hello <- function() {
  print("Hello world!")
}
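
As a quick check, the two functions above can be called like any built-in function. The argument values below are arbitrary and chosen only for illustration; the functions are redefined so that the block runs on its own.

```r
# Redefine the two functions from above so this block is self-contained
sum_square_root_x_y <- function(x, y) {
  sqrt(x) + sqrt(y)
}

print_hello <- function() {
  print("Hello world!")
}

# Arguments can be passed by position...
result <- sum_square_root_x_y(9, 16)
print(result)
## [1] 7

# ...or by name, in which case the order does not matter
same_result <- sum_square_root_x_y(y = 16, x = 9)

# A function without arguments is still called with parentheses
print_hello()
## [1] "Hello world!"
```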

3.2 If/else statement

You can adapt your code to different circumstances. If a condition is TRUE, your program executes one block of code. The else statement indicates what to do if the condition is not met. If you do not specify an else and the if condition is not met, R does nothing.


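if (condition) {   # condition must evaluate to a single TRUE or FALSE
  # some R code, run when the condition is TRUE
} else {
  # some other R code, run when the condition is FALSE
}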
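
A small concrete case may help here. The variable `temperature` and the thresholds are hypothetical; the point is only to show how `if`, `else if` and `else` chain together.

```r
temperature <- 3  # hypothetical value, in degrees Celsius

if (temperature < 0) {
  message_text <- "It is freezing"
} else if (temperature < 15) {
  message_text <- "It is cold"
} else {
  message_text <- "It is mild"
}

print(message_text)
## [1] "It is cold"
```

Only the first branch whose condition is TRUE is executed; the remaining branches are skipped.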

3.3 For loops

For loops execute a command on each item of a vector in turn, which saves us from writing the same command once per item.

for (x in c(1,4,9,16,25)) {
  print(square_root(x))
}

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

for (i in c("a", "b", "c")) {
  if (i %in% c("c", "o", "d", "e")) {
    print(paste("The letter", i, "is in the word 'code'"))
  } else {
    print(paste("The letter", i, "is not in the word 'code'"))
  }
}

## [1] "The letter a is not in the word 'code'"
## [1] "The letter b is not in the word 'code'"
## [1] "The letter c is in the word 'code'"

3.4 tryCatch

Often, we do not want our R script to stop due to one error, especially when we are running a loop over many elements. For example, if we are trying to scrape information off hundreds of websites, we do not want our script to stop every time one of the websites is down, or doesn’t have the structure we expect.

To deal with errors, we can use the function tryCatch. The syntax of this function is: tryCatch({do something}, error = function(e) {do something else}).

for (x in c("a", "b")) {
  tryCatch({
    print(square_root(x))
  }, error = function(e) {
    print(paste("Error for", x))
  })
}
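
When we want to keep the results of a loop, it is often convenient for the error handler to return a fallback value such as NA. The following is a minimal sketch of that pattern; the names `inputs` and `results` are illustrative and do not appear elsewhere in this class.

```r
square_root <- function(x) {
  sqrt(x)
}

inputs <- list(16, "a", 25)              # the second element will cause an error
results <- rep(NA_real_, length(inputs))

for (i in seq_along(inputs)) {
  results[i] <- tryCatch({
    square_root(inputs[[i]])
  }, error = function(e) {
    # report the problem, then return NA so that the loop carries on
    message("Error for ", inputs[[i]], ": ", conditionMessage(e))
    NA_real_
  })
}

print(results)
## [1]  4 NA  5
```

The value of the tryCatch call is either the value of the first block or, if an error occurred, the value returned by the error function.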

3.5 Regular expressions

Regular expressions are used to identify strings following a specified pattern.

Regex	Description
^	Start of string; inside square brackets, as in [^abc], it means "not"
$	End of string
.	Wildcard, which denotes any character (type `\\.` to mean a literal full stop)
[]	A set or range of characters
[A-Z]	Any capital letter from A to Z
[a-z]	Any lowercase letter from a to z
[0-9]	Any digit from 0 to 9
[[:alnum:]]	Any alphanumeric character
[[:punct:]]	Any punctuation character
[[:alpha:]]	Any alphabetical character
[[:upper:]]	Any uppercase character
[[:lower:]]	Any lowercase character
[[:digit:]]	Any digit
[^abc]	Any character other than a, b or c
*	0 or more occurrences
+	1 or more occurrences
{n}	Exactly n occurrences
{n,}	n or more occurrences
{n,m}	Between n and m occurrences

Some examples:

Regex	Meaning	Some examples
l[oe]t	Contains an l, followed by either o or e, followed by t	`"let"`, `"lots"`, but not `"latitude"`
`".*\\.jpg$"`	Ending in .jpg	`"image1.jpg"`, `"Output/map.jpg"`
`"^[A-B][[:lower:]]+.*"`	Starting with A or B, followed by at least 1 lower case letter	`"Adam Smith"`
`".*@gmail\\.com$"`	Anything ending in @gmail.com	`"john.smith@gmail.com"`
`"^[0-9]{5}$"`	A string of exactly 5 digits	`"01234"`
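
These patterns can be tested directly with the base R function grepl, which returns TRUE where a pattern matches. The test strings below are made up for illustration. Note that in base R a POSIX class such as `[:lower:]` has to be written inside a second pair of brackets, as `[[:lower:]]`.

```r
# grepl(pattern, strings) returns one TRUE or FALSE per string
m1 <- grepl("l[oe]t", c("let", "lots", "latitude"))
print(m1)
## [1]  TRUE  TRUE FALSE

# Ends in .jpg (the last string ends in .txt, so it does not match)
m2 <- grepl(".*\\.jpg$", c("image1.jpg", "Output/map.jpg", "notes.jpg.txt"))

# Starts with A or B, followed by at least one lowercase letter
m3 <- grepl("^[A-B][[:lower:]]+.*", c("Adam Smith", "adam smith", "Carl Menger"))

# Ends in @gmail.com
m4 <- grepl(".*@gmail\\.com$", c("john.smith@gmail.com", "john.smith@gmail.com.au"))

# Exactly five digits
m5 <- grepl("^[0-9]{5}$", c("01234", "1234", "012345"))
```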

3.6 XPaths

If you access the page List of current heads of state and government and right-click to view the source code, then you can see the html code behind the website.

Html code is structured by ‘tags’. For example, the document starts with the tag <html> and ends with the tag </html>. Similarly, an html table is contained between the tags <table> and </table>, and a link to another page is contained between the tags <a> and </a>.

XPaths are paths to html tags. For example, the XPath /html/body/div[3]/div/table[2] is the path to the second table tag in the div tag in the third div tag in the body tag in the html tag.

If we are interested in scraping a certain object, such as a table or a link from a html page, then we can use the XPath of that object to access it.

In order to find the XPath of an object, we can right-click on the object, click ‘Inspect’ (labelled ‘Inspect Element’ in some browsers), then find the html tag of the object we would like to access. We can then right-click on the html tag and copy the XPath to paste into our R code.

We can also use the text contents of an object to specify the XPath. For example, the XPath "//a[text()='Alain Berset']" gives us all <a> tags with the text ‘Alain Berset’. This is useful for finding links behind the text ‘Alain Berset’ on the Wikipedia page.

We can use the XPath function ‘contains’ in order to match only part of an attribute of a html tag. For example, //table[contains(@class, "infobox")] selects all tables where the table class contains the word ‘infobox’.

Two dots in an XPath return to the parent tag. For example, the XPath //th[text()="Political party"]/.. returns us to the parent tag of the ‘th’ tag with the text ‘Political party’.
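
The XPath features described above can be tried out on a small hand-written page before turning to Wikipedia. The html below is invented for illustration (the names ‘Jane Doe’ and ‘Example Party’ are placeholders), and the sketch assumes that the rvest package is installed.

```r
library(rvest)

# A tiny hand-written page (hypothetical content, not taken from Wikipedia)
page <- read_html('
<html><body>
  <table class="infobox vcard">
    <tr><th>Political party</th><td><a href="/wiki/Example_Party">Example Party</a></td></tr>
    <tr><th>Born</th><td>1970</td></tr>
  </table>
  <p><a href="/wiki/Jane_Doe">Jane Doe</a></p>
</body></html>')

# All <a> tags whose text is exactly 'Jane Doe', then the link behind them
jane_href <- page %>%
  html_nodes(xpath = "//a[text()='Jane Doe']") %>%
  html_attr("href")

# Number of tables whose class attribute contains 'infobox'
n_infobox <- length(html_nodes(page, xpath = '//table[contains(@class, "infobox")]'))

# From the <th> with text 'Political party', go up to its parent row (..),
# then down to the link in the <td> of the same row
party_href <- page %>%
  html_nodes(xpath = '//th[text()="Political party"]/../td/a') %>%
  html_attr("href")
```

Here `jane_href` holds the link behind the text ‘Jane Doe’ and `party_href` holds the link in the ‘Political party’ row of the infobox.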

4 Scraping government leaders and their Wikipedia pages

In a first step, we aim to scrape the table of government leaders on the page List of current heads of state and government, as well as the links to the Wikipedia pages of each of the leaders.

4.1 Reading html data and extracting tables

We use the function read_html in the rvest package to read html data from a website. We use the function html_nodes to select the portion of the html document under a particular XPath. We then use html_table to extract the table under the XPath.

# get webpage
html_heads <- read_html("https://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government")
# extract table
df <- html_heads %>%
  html_nodes(xpath = '/html/body/div[3]/div[3]/div[5]/div[1]/table[2]') %>%
  html_table(fill=TRUE) %>%
  .[[1]]

4.2 The function `gsub`

The function gsub takes three main arguments. The first is the regular expression to search for, the second is the string to replace each match with, and the third is the object in which to make the replacement. For example, we can remove everything up to and including an en dash (–), or we can remove the string “Prince” at the start of a string. We use this to clean up the names of the world leaders, so that each name matches the text of the link to that leader’s Wikipedia page.

df <- html_heads %>%
  html_nodes(xpath = '/html/body/div[3]/div[3]/div[5]/div[1]/table[2]') %>%
  html_table(fill=TRUE) %>%
  .[[1]] %>%
  mutate(leader = gsub(".*– ", "", `Head of government`)) %>%
  mutate(leader = gsub(".*\u2013 ", "", leader)) %>% # this line is for the windows computers where the previous line does not work
  mutate(leader = gsub("^Sheikh |^Cardinal |^Prince ", "", leader)) %>%
  mutate(leader = gsub(" \\(.*$|\\[.*$", "", leader)) %>%
  mutate(leader = ifelse(leader=="Hasina", "Sheikh Hasina", leader)) %>%
  select(State, leader) %>%
  filter(!duplicated(State))
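
To see what each gsub step does, it can help to run the same patterns on a few strings by hand. The strings below are invented to mimic the format of the table cells; they are not real entries from the Wikipedia page.

```r
# Hypothetical cell contents in the style "Title – Name"
raw <- c("Prime Minister \u2013 Jane Doe",
         "President \u2013 Sheikh John Roe (acting)",
         "Premier \u2013 Alex Poe[a]")

# Step 1: drop everything up to and including the en dash and the space after it
step1 <- gsub(".*\u2013 ", "", raw)

# Step 2: drop a leading title
step2 <- gsub("^Sheikh |^Cardinal |^Prince ", "", step1)

# Step 3: drop a trailing note in parentheses or a footnote marker in brackets
step3 <- gsub(" \\(.*$|\\[.*$", "", step2)

print(step3)
## [1] "Jane Doe" "John Roe" "Alex Poe"
```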

4.3 Extracting links

Links to other pages are contained within <a> tags. If we wish to extract the link behind the name “Alain Berset”, for example, we can search for the <a> tag with the text “Alain Berset”, then use the function html_attr to extract the link from the <a> tag.

### useful function to get link behind text on webpage
get_link_text <- function(text, html_page){
  tryCatch({
    html_page %>%
      html_nodes(xpath=paste0("//a[text()='", text, "']")) %>% 
      .[[1]] %>% 
      html_attr("href")
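  }, error = function(e) {
    # report the failure and return NA rather than the error message,
    # so that the message is not mistaken for a link later on
    message("Error for ", text)
    NA_character_
  })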
}

We can now integrate this function into a lapply in order to extract the links for all the heads of government in a vector.

df <- html_heads %>%
  html_nodes(xpath = '/html/body/div[3]/div[3]/div[5]/div[1]/table[2]') %>%
  html_table(fill=TRUE) %>%
  .[[1]] %>%
  mutate(leader = gsub(".*– ", "", `Head of government`)) %>%
  mutate(leader = gsub(".*\u2013 ", "", leader)) %>% # this line is for the windows computers where the previous line does not work
  mutate(leader = gsub("^Sheikh |^Cardinal |^Prince ", "", leader)) %>%
  mutate(leader = gsub(" \\(.*$|\\[.*$", "", leader)) %>%
  mutate(leader = ifelse(leader=="Hasina", "Sheikh Hasina", leader)) %>%
  select(State, leader) %>%
  filter(!duplicated(State)) %>%
  mutate(link_leader = lapply(leader, get_link_text, html_page=html_heads)) 
  # lapply(LIST, FUNCTION, FIXED ARGUMENTS OF FUNCTION)
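
The way lapply passes fixed arguments can be seen on a smaller case. The function `make_url` and its argument `base` are hypothetical helpers written for this illustration only.

```r
# Hypothetical helper: prepend a base address to a relative path
make_url <- function(path, base) {
  paste0(base, path)
}

paths <- c("/wiki/A", "/wiki/B")

# lapply calls make_url once per element of paths; base is the same in every call
urls_list <- lapply(paths, make_url, base = "https://en.wikipedia.org")

# sapply does the same but simplifies the result to a character vector
urls_vector <- sapply(paths, make_url, base = "https://en.wikipedia.org", USE.NAMES = FALSE)

print(urls_vector)
## [1] "https://en.wikipedia.org/wiki/A" "https://en.wikipedia.org/wiki/B"
```

Note that lapply always returns a list, so a column created with lapply inside mutate is a list column rather than a character column.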

5 Extracting parties

In this step, we extract the political party, and the Wikipedia link to the party, from the Wikipedia page of each world leader.

## first do example
get_link_party_example <- "https://en.wikipedia.org/wiki/Jean_Castex" %>%
  read_html() %>%
  html_nodes(xpath='//table[contains(@class, "infobox")]/tbody/tr/th[text()="Political party"]/../td/a') %>%
  html_attr("href") %>%
  tail(., n=1)

5.1 Converting this example into a function

We convert this example into a function of the political leader’s Wikipedia page.

## then write function 
get_link_party <- function(leader_link){
  paste0("https://en.wikipedia.org", leader_link) %>%
    read_html() %>%
    html_nodes(xpath='//table[contains(@class, "infobox")]/tbody/tr/th[text()="Political party"]/../td/a') %>%
    html_attr("href") %>%
    tail(., n=1)
}

5.2 Integrating the function into a loop

We incorporate this function into a loop with a tryCatch.

## then write loop
df$link_party <- ""
for (i in 1:nrow(df)) {
  tryCatch({
    df$link_party[i] <- get_link_party(df$link_leader[i])
    print(paste("Got link for party of", df$leader[i]))
  }, error = function(e) {
    print(paste("Error for", df$leader[i]))
  })
}

5.2.1 Correcting by hand

We can correct any errors in the loop by hand.

df <- df %>%
  mutate(link_party=ifelse(leader=="Angela Merkel", "/wiki/Christian_Democratic_Union_of_Germany", link_party))

6 Extracting political positions

Now we wish to extract the political positions from the political party Wikipedia pages.

6.1 The function `grepl`

The function grepl has two main arguments, the first is a regular expression and the second is the string that we wish to search. The output is TRUE or FALSE depending on whether or not the regular expression matches the string. Here we use a regular expression to overcome various spellings and cases of “Political position”.

## first write example
get_political_position_example <- "https://en.wikipedia.org/wiki/La_R%C3%A9publique_En_Marche!" %>%
  read_html() %>%
  html_nodes(xpath='//table[contains(@class, "infobox")]') %>%
  html_table(fill=TRUE) %>%
  .[[1]] %>%
  set_tidy_names() %>%
  filter(grepl('[Pp]olitical.*[Pp]osition', .[[1]])) %>% # matches "Political position", "political Position", etc.
  .[[2]]
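
As a standalone illustration of grepl used as a row filter, the sketch below applies a case-tolerant pattern to a small invented table. The data frame `infobox` and its contents are hypothetical stand-ins for the two-column table that html_table returns.

```r
# Hypothetical two-column infobox: labels in the first column, values in the second
infobox <- data.frame(
  label = c("Leader", "Political position", "political Position", "Colours"),
  value = c("Jane Doe", "Centre-left", "Centre", "Blue"),
  stringsAsFactors = FALSE
)

# TRUE for labels that match either capitalisation of each word
is_position <- grepl("[Pp]olitical.*[Pp]osition", infobox$label)
print(is_position)
## [1] FALSE  TRUE  TRUE FALSE

# Keep only the values of the matching rows
positions <- infobox$value[is_position]
print(positions)
## [1] "Centre-left" "Centre"
```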

## then write function 
get_political_position <- function(party_link){
  ### FILL HERE
}
## then write loop 
df$political_position <- ""
for (i in 1:nrow(df)) {
  ### FILL HERE
}

7 Creating a map

Here we use regular expressions and functions that you know already in order to plot the political orientation of leaders on a world map.

## Cleaning
political_classification = c("far left","left wing","centre left to left wing", "centre left", 
                             "centre to centre left", "centre/big tent", "centre to centre right", "centre right",
                             "centre right to right wing", "right wing", "unavailable")
df_clean <- df %>%
  select(State, political_position) %>%
  mutate(political_position_clean = political_position) %>%
  mutate(political_position_clean=gsub("^[^:]+:", " ", political_position_clean)) %>%
  mutate(political_position_clean=gsub(":.*$", " ", political_position_clean)) %>%
  mutate(political_position_clean=gsub("[0-9]{4}.*$", " ", political_position_clean)) %>%
  mutate(political_position_clean=gsub("\\([^()]*\\)", "", political_position_clean)) %>%
  mutate(political_position_clean=gsub("\\[[^]]*\\]", "", political_position_clean)) %>% # remove bracketed footnotes such as [1] or [12]
  mutate(political_position_clean=tolower(political_position_clean)) %>%
  mutate(political_position_clean=gsub("[^a-z ]", " ", political_position_clean)) %>%
  mutate(political_position_clean=str_trim(political_position_clean)) %>%
  mutate(political_position_clean=ifelse(political_position_clean=="", NA, political_position_clean)) %>%
  mutate(political_position_clean=gsub("center", "centre", political_position_clean)) %>%
  mutate(political_position_clean=gsub("tocentre", "to centre", political_position_clean)) %>%
  mutate(political_position_clean=gsub("toright", "to right", political_position_clean)) %>%
  mutate(political_position_clean=gsub(" formerly", "", political_position_clean)) %>%
  mutate(political_position_clean=gsub("historical", "", political_position_clean)) %>%
  mutate(political_position_clean=gsub(" correa era", "", political_position_clean)) %>%
  mutate(political_position_clean=gsub(" citation needed", "", political_position_clean)) %>%
  mutate(political_position_clean=gsub("economic", "", political_position_clean)) %>%
  mutate(political_position_clean=gsub("after", "", political_position_clean)) %>%
  mutate(political_position_clean=str_trim(political_position_clean)) %>%
  mutate(political_position_clean=gsub("centre left to centre right", "centre", political_position_clean)) %>%
  mutate(political_position_clean=gsub("big tent of the left", "big tent", political_position_clean)) %>%
  mutate(political_position_clean=gsub("big tent", "centre/big tent", political_position_clean)) %>%
  mutate(political_position_clean=gsub("^centre$", "centre/big tent", political_position_clean)) %>%
  mutate(political_position_clean = ifelse(political_position_clean %in% political_classification, political_position_clean, NA)) %>%
  mutate(iso3 = countrycode(State, "country.name", "iso3c")) %>%
  mutate(iso3 = ifelse(State=="Micronesia", "FSM", iso3))
map_world <- map_data("world") %>%
  mutate(iso3 = countrycode(region, "country.name", "iso3c")) %>%
  mutate(iso3 = ifelse(region=="Micronesia", "FSM", iso3)) %>%
  left_join(df_clean, by="iso3") %>%
  mutate(political_position_clean=ifelse(is.na(political_position_clean), "unavailable", political_position_clean)) %>%
  mutate(political_position_clean=factor(political_position_clean,
                                         levels = political_classification,
                                         labels = political_classification))
plot <- ggplot() +
  geom_polygon(data = map_world, aes(x = long, y = lat, group = group, fill = political_position_clean)) +
  scale_fill_manual(values = c("#FF0000","#E2001C","#C60038", "#AA0055", "#8D0071", 
                               "#71008D", "#5500AA", "#3800C6","#1C00E2", "#0000FF", "#bebebe")) + coord_fixed()
ggsave("Output/worldmappolitics.jpg", plot = plot, width = 20, height = 12, device = "jpg") # create the folder Output first if it does not exist
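
The core of the cleaning step can be tried on a few strings in isolation. The function `clean_position` below is a hypothetical, condensed version of the pipeline above, written in base R only; it covers just some of the steps, and the input strings are invented.

```r
# Hypothetical condensed version of the cleaning pipeline (base R only)
clean_position <- function(x) {
  x <- gsub("\\([^()]*\\)", "", x)   # drop text in parentheses
  x <- gsub("\\[[^]]*\\]", "", x)    # drop bracketed footnote markers such as [1]
  x <- tolower(x)
  x <- gsub("[^a-z ]", " ", x)       # replace anything that is not a letter or a space
  x <- trimws(x)                     # remove leading and trailing spaces
  x <- gsub("center", "centre", x)   # harmonise the spelling
  x
}

cleaned <- clean_position(c("Centre-left[1]",
                            "Center-right to right-wing",
                            "Far-left (disputed)"))
print(cleaned)
## [1] "centre left"                "centre right to right wing"
## [3] "far left"
```

Each cleaned string can then be compared against the vector political_classification, as the pipeline does with %in%.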

8 Exercises

Fill in the code blanks
Correct a few errors for countries that were not scraped correctly and improve my map

Upload your R script and your map at the link on the homepage.

Class 4: Working with string variables and functions through an introduction to web scraping with R