What will you learn in this course?

Session 1
- RStudio interface
- Installing and using R packages
- Reading data into RStudio from multiple sources

Session 2
- Exploring data using crosstabs and frequencies
- Manipulating data using R code

Session 3
- Becoming familiar with the widely used dplyr package

Session 4
- ggplot2

What is R?

  • R is a language and environment for statistical computing and graphics.

  • Widely used across a variety of sectors for data analysis and statistical inference.

  • R is open source and free to use, and available for all major operating systems.

  • It is part of the analytical strategy but R is not the only tool we can use.

Difference between R and RStudio?

  • RStudio is a free and open-source integrated development environment (IDE) for R, a programming language for statistical computing and graphics. You cannot run RStudio without R.
  • RStudio is more flexible and powerful, and provides direct access to R code.
  • Both are good for viewing data, neither is good for manually editing data.
  • Use RStudio for any project that requires direct interaction with code and/or manipulation of complex data.
  • RStudio Pro is proprietary and comes with user support from RStudio and some extra functionality like the use of RStudio server.

RStudio

RStudio

R packages

  • R packages are collections of functions and, sometimes, datasets created by the R community.

  • R packages cover a wide range of needs and can improve the basic functionality of R or add additional functionality.

  • Packages come with extensive helpfiles called vignettes. These contain information on all the functions in the package, a description of what the package is designed to do

  • Vignette for the lubridate package

Installing packages

  • Packages are installed using:
install.packages("haven")

and are called using:

library("haven")
  • As anyone can create an R package, different packages can have functions of the same name.

  • We can specify which package a function should be from using ::

dplyr::summarise() 
# This calls the summarise function specifically from the dplyr package

Data Frames and Objects

Objects
- Objects are things we can use in our R code, for example, a string containing a file path, a list of hospital codes or data in the form of a data frame.

input <- "//conf/pub_incubator/"
codes <- read.csv("hospitalcodes.csv")
x <- 4

Data Frame
- A data frame is a table of information in which each column contains values of one variable and each row contains all the information on a single observation.

# Look at the iris dataset which is built into R
iris <- iris

How to import and export data

You can import data of any type including but not limited to:

  • csv
  • rds
  • SPSS
  • SMRA

csv

You can use the readr package to read in csv files. To do this, we first need to install and load the package.

install.packages("readr")
library(readr)

Once we have loaded the package, we can then use this simple command to read in our file

csv_data <- read_csv("path/file_name.csv")

The data is now stored as an object called csv_data. The <- operator is used to assign objects.

Let's try an example!

EXAMPLE

Load the readr package

library(readr)

Read in the Borders.csv data from

borders <- read_csv("data/Borders.csv")
View(borders)

SPSS

A package called haven provides a simple function to read in SPSS files (.sav or .zsav).

install.packages("haven")
library(haven)
spss_data <- read_sav("path/file_name.sav")

Let's try an example!

EXAMPLE

Load the haven package

library(haven)

Read in the Borders.sav SPSS data

borders_spss <- read_sav("data/Borders.sav")
View(borders_spss)

Reading in data from a website

  • Data in any file format can be read in from a website by replacing the filepath with the data URL.

  • For example to read in the Hospital Codes CSV from the open data website:

library(readr)

hospital_codes <- read_csv(
  "https://www.opendata.nhs.scot/dataset/cbd1802e-0e04-4282-88eb-d7bdcfb120f0/resource/c698f450-eeed-41a0-88f7-c1e40a568acc/download/current_nhs_hospitals_in_scotland_030320.csv")

SMRA (for information only)

Reading data from SMRA is slightly more tricky.

install.packages("odbc") # One time install
library(odbc)
# Set up a connection
smra_connection <- dbConnect(drv = odbc(), dsn = "SMRA",
  uid = .rs.askForPassword("SMRA Username:"),
  pwd = .rs.askForPassword("SMRA Password:")
)
# Do the extract using SQL
smr01 <- dbGetQuery(smra_connection,
  paste("select DISCHARGE_DATE, LOCATION, MAIN_OPERATION",
    "from SMR01_PI",
    "where ROWNUM <= 100")
)

Exporting Data

  • Exporting data is very similar to importing data.

  • To export as a CSV:

write_csv(borders, "data/borders.csv")

END OF BLOCK 1

START BLOCK 2

Data Exploration

  • You can easily produce summary statistics in R. For example, we can calculate the mean by using the mean() function.

  • To reference a specific column in a dataframe, use the $ operator e.g. df$column. You can use this to produce summary statistics of a specific column e.g. mean(df$column).

  • In addition to the mean, it is also possible to calculate other statistics such as the median value and the range of values (for a numeric column). Use the summary() function to see all summary statistics at once.

summary(df$column)  

Data Exploration - Frequencies/Crosstabs

The table() function can be used to create frequencies and crosstabs from your data. Using the following code will give a frequency table for column 1:

table(df$column1)
# Create a table of Species from the iris dataset
table(iris$Species)

Using the following code will give a crosstab for column1 and column2:

table(df$column1, df$column2)
# Create a cross tab from the iris dataset
table(iris$Sepal.Length , iris$Species)

Let's try an example!

EXAMPLE

Using our data:

table(borders$HospitalCode, borders$Sex)

 # Addmargins() will add extra row and column for totals
addmargins(table(borders$HospitalCode, borders$Sex))

Exercise 1:

  • Read in the Borders.csv data and check the frequency on column Sex.

  • Using the same dataset, what are the mean and median values for LengthOfStay?

END BLOCK 2?

START BLOCK 3?

Data Manipulation

  • As an organisation we use the tidyverse suite of packages - this is industry best practice.

  • The tidyverse is a collection of R packages for data exploration, manipulation and visualisation. Tidyverse packages use a consistent format and data structure making them simple to use.

  • In tidyverse functions, the first argument is the dataset e.g. function(data, task).

  • Many useful tidyverse functions come from the dplyr package:

install.packages("dplyr")
library(dplyr)

Comparison Operators

It is possible to select rows (using the function filter(), which we'll meet soon) in a dataframe, the table below shows a selection of comparison operators that can be used.

Operator Function
== Equal to
!= Not equal to
< Less than
<= Less than or equal to
> More than
>= More than or equal to

Logical Operators

In addition to comparison operators, sometimes it is necessary to use logical operators.

Operator Text Equivalent Definition
& AND Both relations must be true
| OR Either relation can be true
! NOT Logical negation operator. Reverses the outcome of an expression

Let's try using the operators

Using the borders data, select all records with Specialty E12. We can use the filter() function to use only specific rows from the data frame. (This works in a similar way to 'select if' in SPSS.)

borders_E12 <- filter(borders, Specialty == "E12")

Use filter() with the borders data to get patients who attended hospital B120H and had a length of stay greater than 10.

borders_filtered <- filter(borders, 
                           HospitalCode == "B120H" & 
                           LengthOfStay > 10)

Exercise 2:

  • How many patients had a length of stay between 2 and 6 days inclusive in Borders General Hospital (B120H)?

mutate()

Often there is a need to create a new column in a dataframe or modify an existing column. The simplest way to do this is to use the mutate() function.

mutate(df, newcolumn = expression)

For example if we want to create LengthOfStay divided by 2.

borders_2 <- mutate(borders, los_div_2 = LengthOfStay / 2)

arrange()

Data can also be sorted with dplyr using the arrange() function. Multiple columns can be selected and arrange() will sort the dataframe in the order the columns are selected.

arrange(df, column1, column2, desc(column3))

We can sort our data into ascending order by HospitalCode

arrange(borders, HospitalCode)

Pipe Operator

While all the dplyr functions shown are useful on their own, there are times when we need to use more of them to achieve desired outputs. For this, dplyr contains the pipe operator, %>%. This operator works by linking the dplyr functions together.

The pipe operator essentially means "and then". This makes reading your code much easier, as rather than forcing several functions into one line of code, the pipe allows you to refer to your data frame and then perform several functions on it.

A common short cut for the pipe operator is "Ctrl+Shift+M".

Pipe Operator

For example, if we wanted to get all records with HospitalCode "B102H" and sort by Dateofbirth, we would have had to run this code:

arrange(filter(borders, HospitalCode == "B102H"), Dateofbirth)

However, we can make this code much easier to read by introducing the pipe operator:

# Take the borders data frame
borders %>%
# AND THEN filter so only rows where HospitalCode is B102H
filter(HospitalCode == "B102H") %>% 
# AND THEN sort the result by date of birth
arrange(Dateofbirth) 

Pipe Operator Notes

  • Whatever precedes the pipe is automatically passed as the first argument to the next function.

  • Pipe can also be used with non dplyr functions.

select()

Sometimes you will only be interested in a few columns in your dataframe. The select() function allows you to specify certain columns from within your dataframe to keep.

select(df, column_name)

select() can also be used to delete a specific column, you can simply run the following code:

select(df, -column_name)

Let's try using the borders data.

select(borders, -Postcode)

Example

Select the URI, Specialty and LengthOfStay columns, filter for rows which contain C8 for Specialty and arrange by LengthOfStay.

borders %>%
  select(URI, Specialty, LengthOfStay) %>%
  filter(Specialty == "C8") %>%
  arrange(LengthOfStay)

group_by()

The group_by() function allows you to easily aggregate your data into groups. This function simply lists the desired columns you wish to group by within your dataframe.

group_by(df, column_name)

Using our data:

group_by(borders, HospitalCode)

As you can see, using the group_by function on its own doesn't affect the appearance of the data. However, the output in the console now shows the number of groups within the data.

Groups: HospitalCode [48]

summarise()

The summarise() function allows you to calculate summary statistics for each desired group. To use this function let's try an example using the borders data, using the code below we calculate the mean length of stay for each hospital.

borders %>%
  group_by(HospitalCode) %>%
  summarise(mean_los = mean(LengthOfStay))

If we run this code without group_by(), then the output will be the mean length of stay across all the data.

borders %>%
  summarise(mean_los = mean(LengthOfStay))

ungroup()

We need an example here of using ungroup using the example data - happy for suggestions.

count()

The count() function is very useful for running frequencies on a dataframe. This function automatically calls group_by() and produces counts for a specified column.

borders %>%
  count(Sex)

You can also sort frequencies in descending order using count()

borders %>%
  count(Specialty, sort = TRUE)

Exercise 3:

  • Work out the earliest admission date and latest discharge date for each specialty.

  • Count the number of admission (frequencies) per hospital per specialty, using count()

rename()

The dplyr package also makes it very straight forward to rename specific columns by using the rename() function. The rename() function works as follows:

df <- rename(df, new_name = existing_name)
borders <- rename(borders, date_of_birth = Dateofbirth)

If you look at your borders dataframe, you will see that the Dateofbirth column is now called date_of_birth.

recode()

While rename() changes the column name, recode() changes the values within the column. The recode() function works best when used within a mutate():

df <- df %>% 
  mutate(column_name = recode(column_name, 
                              "existing_code" = "new_code"))
borders <- borders %>% 
  mutate(HospitalCode = recode(HospitalCode, "B120V" = "B120H"))

Exercise 4:

  • Select the URI, Specialty and DateOfBirth columns from borders data and save this into a new data frame.
  • Arrange this new data frame in ascending order by Specialty and view it to see the results.
  • Use the filter function to extract records that have a missing DateOfBirth value.

Joining Data

The dplyr package also makes it very easy to merge data by matching files together using common variables. There are several join functions within dplyr that are designed to merge dataframes together to create one dataframe containing the relevant variables.

The general structure of a join function is to specify the two dataframes used for merging, x and y, and a common variable to match by. For example, for a left_join() you would have:

new_data <- left_join(df_1, df_2, by = "common_variable")

Examples of Joining Functions

dplyr joining functions

Example

Read in Baby5.csv and Baby6.csv

baby5 <- read_csv("data/Baby5.csv")
baby6 <- read_csv("data/Baby6.csv")
baby_joined <- baby5 %>%
  left_join(baby6, by = c("FAMILYID", "DOB"))

END BLOCK 3

START BLOCK 4

Data Visulisation - ggplot2

A basic line graph can be created using the following code:

library(ggplot2)
library(tibble)

#Define data
x <- c(1, 3, 2, 4)
y <- c(4, 8, 7, 11)

# Create a data frame
df <- tibble(x, y)
# Plot the data
line_plot <- ggplot(df, aes(x, y)) + 
  geom_line() +
  geom_point()

View plot

line_plot

Example - Barplot

  • Read in the Borders (inc Age).csv data
borders_age <- read_csv("data/BORDERS (inc Age).csv")
  • Calculate frequencies of admissionday
counts <- count(borders_age, admissionday)
colnames(counts) <- c("day", "count")
# plot
bar_plot <- ggplot(borders_age) +
  geom_bar(aes(x = admissionday))

# View plot
bar_plot

Example - Scatter plot

Use the borders_age data:

# plot
scatter_plot <- ggplot(borders_age, aes(x = ageonadmission, 
                                        y = lengthofstay1)) +
  geom_point()

# View plot
scatter_plot

Customising Plots

It is possible to customise plots created in R.

  • Change the colour and width of the bars
bar_plot <- ggplot(borders_age) +
  geom_bar(aes(x = admissionday), fill = "blue", width = 0.5)

bar_plot
  • Add titles
bar_plot <- ggplot(borders_age) +
  geom_bar(aes(x = admissionday)) +
  xlab("Admission Day") +
  ylab("Number of Admissions") +
  ggtitle("Distribution of Admission Day")

bar_plot

Exercise 5:

  • Filter your data to select records with HospitalCode "B109H". Create a histogram for patient's length of stay. Add a title and axis labels and manually save the plot to the plots folder as a PNG.

Hint: Use geom_histogram() to create the histogram

Resources

Contact

If you have questions or need help, you can contact

PHI R User Group Inbox: phs.rusergroupinbox@nhs.net
(email here to be added to the distribution list or Teams channel)

Thank You