Building Shiny Apps with Rvest

What are Shiny and Rvest?

In this project I walk through how to build a beginner-to-intermediate Shiny application that incorporates elements of the rvest package. The goal is to provide a framework for building a Shiny app that scrapes IMDB for the top 50 movies of all time by user rating and reports on them with some simple visuals.

For those not familiar with Shiny applications or the shiny library, Shiny is a free-to-use package developed and maintained by RStudio that gives R users some really nice options for making interactive visuals and displays that might otherwise require programming outside the typical R-user wheelhouse (think: Java, JavaScript, PHP, HTML, etc.). I wouldn’t say building a Shiny application or dashboard qualifies someone as a software developer, but I will say that it’s a wonderful, if not essential, tool for data scientists and analysts. People in these roles are ultimately judged on the end result, and that’s where Shiny excels: it lets R users easily construct tools for colleagues and peers.

rvest is part of the tidyverse and is similar to BeautifulSoup for Python enthusiasts. The purpose of rvest is to enable HTML/XML parsing and “scraping” of web page elements. These elements all have specific identities that are accessible using various tools, but for this project I am going to use Selector Gadget, a simple Google Chrome extension that allows for easy point-and-click retrieval of HTML elements. Check out examples of the Selector Gadget in action to understand how I retrieve some of my HTML elements.

Hang on, this sounds more like a tutorial than a project…

I struggled a bit to figure out if I wanted to place this in the tutorials section of this site or in projects, but in the end I selected the projects section because this was largely a self-undertaking to understand Shiny and rvest to produce a product. But I highly encourage new users of these packages to lean on this as a tutorial for their own use!

Creating the Shiny App

For anyone who’s never created a Shiny application, RStudio makes it pretty friendly and intuitive (makes sense, the company did make it after all). Simply choose File > New File > Shiny Web App... from the menu bar in the RStudio IDE:

[Screenshot: Shiny App Location]

Then simply follow the GUI prompt with an application name and file location:

[Screenshot: Shiny App Location]

How do you choose between “Single File” and “Multiple File”? In the end it depends on each project and, in my opinion, comes down more to individual style and the need for tidy scripting than anything else. Shiny recognizes either layout (a single “app.R”, or the pair “ui.R”/“server.R”) as an application it can deploy. I chose the single “app.R” for this project since it is not overwhelmingly complex; “app.R” simply holds the contents of the multiple-file layout in a single script.

When creating a Shiny app there are two main components that wrap the application: ui and server. ui controls all user interface functions while server supplies all of the functionality needed for them to work. The most alien part of working with Shiny apps for new users is likely to be the idea of “reactive” coding and treating the app like a dynamic program. Without going too into specifics, I think that parallels can be drawn between reactive coding and the use of “parameterization” which I go through in my RMarkdown tutorial and RMarkdown project. Static declarations of variables which could typically be called during the debugging or initial construction phases become inaccessible until dynamic rendering.
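To make the ui/server split and the idea of reactivity concrete, here is a minimal sketch of a single-file app. It is not the app built in this post; the input ID `n`, the output ID `summary_text`, and the `doubled` reactive are hypothetical names chosen only for illustration.

```r
library(shiny)

# ui: declares what the user sees (one slider and one text output)
ui <- fluidPage(
  sliderInput(inputId = "n", label = "Pick a number",
              min = 1, max = 10, value = 5),
  textOutput("summary_text")
)

# server: supplies the logic behind the ui elements
server <- function(input, output) {
  # A reactive expression is recomputed only when input$n changes
  doubled <- reactive({
    input$n * 2
  })

  # Reactive expressions are called like functions: doubled()
  output$summary_text <- renderText({
    paste("Twice your number is", doubled())
  })
}

# Bundle the two halves into an app object; printing it (or passing it
# to runApp()) launches the app
app <- shinyApp(ui = ui, server = server)
```

Note that `input$n` cannot be inspected from the console while the server function is being written. It only exists once the app is running, which is the “inaccessible until dynamic rendering” behavior described above.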

Necessary Libraries

Note for any viewers following along, the following libraries will be utilized in the creation of this application:

library(shiny)
library(tidyverse)
library(rvest)
library(knitr)
library(kableExtra)
library(highcharter)
library(viridisLite)

Grabbing Data with rvest

On the IMDB page there are a number of elements I want to grab but to start off I want to look into the top 50 movie ratings on the 1-10 scale by users. As shown below, when using the Selector Gadget, I can point and click on the HTML element I want and it will deliver to me two views displaying “div strong”. div strong is what I want to scrape from this page.

The first step is to create a variable, url, storing the web page address. Then read_html and rvest’s html_nodes are invoked to scrape the data of interest. The %>% pipe (loaded with dplyr and rvest) is used to assist with tidying and code simplification. Note also that once the ratings data is pulled, some cleaning is done at the end to make the data usable.

# First establish the URL we want to scrape data from, in this case IMDB
url <- "https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc"
webpage <- read_html(url)

# Using rvest and html_nodes in combination with the Google Chrome Selector
# Gadget extension, tease out specific elements of the site for analysis.
# In this case ratings, titles, year of release, runtimes and certification.
ratings <- webpage %>% 
  html_nodes("div strong") %>% 
  html_text() %>% 
  as.numeric()

ratings <- ratings[!is.na(ratings)]  # remove the NAs introduced by as.numeric()
ratings <- ratings[ratings <= 10.0]  # remove captured numeric values outside the rating scale
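The same select, extract, convert, clean sequence can be tried offline before pointing it at a live site. The sketch below parses a small inline HTML string; the markup is a made-up stand-in and does not reflect IMDB's actual page structure.

```r
library(rvest)

# A tiny stand-in for a scraped page (hypothetical markup, not IMDB's)
page <- read_html('
  <html><body>
    <div class="movie"><h3>Movie A</h3><div><strong>9.2</strong></div></div>
    <div class="movie"><h3>Movie B</h3><div><strong>8.7</strong></div></div>
    <div class="votes"><strong>2,500,000</strong></div>
  </body></html>
')

# Select nodes by CSS selector, then pull out their text
raw <- page %>%
  html_nodes("div strong") %>%
  html_text()

# "2,500,000" cannot be parsed as a number, so it becomes NA (with a warning)
ratings_demo <- suppressWarnings(as.numeric(raw))

# Keep only values that parsed and that fall on the rating scale
ratings_demo <- ratings_demo[!is.na(ratings_demo) & ratings_demo <= 10]
ratings_demo
```

A broad selector such as "div strong" tends to catch unrelated elements as well, which is why the cleanup step at the end matters.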

I then used similar commands and logic to store data for movie titles, year of release, runtimes, genres, and certificates (i.e. ratings as in “R” for adult restricted movies and “G” for general audiences). One command that proved very helpful was stringr::str_squish(), which removes stray whitespace from the text scraped for genres.
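As a small illustration of that cleanup step, here is a sketch using made-up genre strings padded the way scraped text often is:

```r
library(stringr)

# Hypothetical raw genre strings with leading newlines and uneven spacing
genre_raw <- c("\nCrime, Drama            ",
               "\nAction,   Adventure, Sci-Fi   ")

# str_squish() trims both ends and collapses internal runs of whitespace
genre_clean <- str_squish(genre_raw)
genre_clean
```

Unlike base R's trimws(), which only strips the ends of a string, str_squish() also reduces repeated spaces inside the string to a single space.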

It really is as simple as that (for the purposes of this project). HTML nodes in the webpage are easily scraped and turned into usable data for R analyses.

Getting Started with ui and server

For this application I had three main goals:

  • Tabulate all of the movies with their different attributes
  • Make a dynamically interactive table with filtering elements
  • Visually depict certificate rating and genre as elements of a whole

In the ui portion of the application, I declare the elements users will be greeted with on opening my application. Elements such as titlePanel() and sidebarLayout() control what their names indicate: the title at the top of the app and the elements of the sidebar panel which will allow users to filter from the table in my second goal.

Another thing to note is that it’s good practice to build with size in mind. Shiny’s page grid is 12 units wide, so when declaring column widths, keep the total of the columns in a row less than or equal to 12. Additionally, every ___Output() in the ui must refer by name to an output$ object declared in the server portion of the application.

ui <- fluidPage(
    
    # Application title
    titlePanel("Top 50 IMDB Movies by User Rating"),
    
    # Sidebar with a checkbox group for filtering by genre
    sidebarLayout(
        sidebarPanel(width = 1,
                     checkboxGroupInput(inputId = "choicefilter", 
                                        label = "Select Movie Genre(s)", inline = FALSE,
                                        choices = sort(c("Drama", "Crime", "Action",
                                                         "Adventure", "Fantasy", 
                                                         "Biography", "History", "Sci-Fi",
                                                         "Romance", "Western", 
                                                         "Animation", "Family",
                                                         "War", "Comedy",
                                                         "Mystery", "Thriller",
                                                         "Music", "Horror"))
                     )
        ),
        
        fluidRow(
            column(4,
                   tableOutput("movietablet")# Total movie table
            ), 
            column(4,
                   tableOutput("movietablef")  # Filtered movie table
            )
        )
    ),
    
    fluidRow(column(width = 5, highchartOutput("movieratingspie", height = "400px")),
             column(width = 4, highchartOutput("moviegenrespie", height = "400px"))),
    
    fluidRow(column(width = 2,textOutput("InfoBox")))
    
)
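The two layout rules mentioned before the ui code (column widths summing to at most 12, and each ___Output() matching an output$ name) can be seen in isolation in the small sketch below. The IDs `wide_table` and `narrow_plot` are hypothetical and are not part of the movie app.

```r
library(shiny)

ui_demo <- fluidPage(
  fluidRow(
    column(width = 8, tableOutput("wide_table")),  # 8 of 12 units
    column(width = 4, plotOutput("narrow_plot"))   # 4 of 12 units
  )
)

server_demo <- function(input, output) {
  # The names after output$ must match the IDs used in the ui
  output$wide_table  <- renderTable(head(mtcars))
  output$narrow_plot <- renderPlot(hist(mtcars$mpg))
}
```

If an ID in the ui has no matching output$ entry in the server, that slot simply renders empty, which can be a confusing thing to debug.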

Since the server is home to most of the heavy lifting that the R-chitect (see what I did there?) executes, I chose to leave all of the static rvest commands in a separate .R script to be read in via source(). The variable movie_table combines all of the rvest scraping into a single, succinct data frame which I then pass through kable to beautify and to control coloring and interactive tabular elements. By making the static table encompassing all of the movie elements from rvest I have accomplished the first goal. The first portion of the server is shown below:

server <- function(input, output) {
    
    source("top_movies.R") # Load all of the variables from the top_movies.r file
    
    # Create the movie_table from the variables declared in source()
    movie_table <- data.frame(
        Rank = seq(1, length(year), 1),
        Ratings = ratings,
        Titles = titles,
        Year = year,
        Runtime = runtimes,
        Genre = genre, 
        Certificate = Certificate)
    
    movie_table_t <- movie_table %>% 
        mutate(
            Ratings = cell_spec(x = Ratings, format = "html", bold = T, 
                                color = "white", 
                                background = ifelse(Ratings > mean(Ratings), "#66bf3f", "#0e5a9b"))
        ) %>%
        kable(escape = F) %>% # NOTE must have "escape = F" for HTML to render
        kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
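The above-the-mean highlighting can be tried on a toy data frame outside of Shiny. This is only a sketch with made-up titles and ratings; `demo`, `bg`, and `html_table` are hypothetical names.

```r
library(knitr)
library(kableExtra)

# Hypothetical ratings for illustration
demo <- data.frame(Titles  = c("Movie A", "Movie B", "Movie C"),
                   Ratings = c(9.2, 8.7, 8.5))

# Green for ratings above the mean, blue otherwise
bg <- ifelse(demo$Ratings > mean(demo$Ratings), "#66bf3f", "#0e5a9b")

# cell_spec() wraps each value in an HTML <span> carrying the styling
demo$Ratings <- cell_spec(demo$Ratings, format = "html", bold = TRUE,
                          color = "white", background = bg)

# escape = FALSE keeps the <span> markup from being printed literally
html_table <- kable(demo, format = "html", escape = FALSE)
```

The background colors have to be computed before (or in the same step as) the conversion, because cell_spec() turns the numeric column into character strings of HTML.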

To accomplish my second goal, most of the work has already been completed in the first task above. Now all I need to do is add in my reactive components which will lean on input$choicefilter for user-based directives. The input was declared in the sidebarPanel in the ui part of the application.

# Create a table that is reactive based on OR conditions across the selected genre inputs
    movie_table_f <- reactive({
        
        validate(
            need(input$choicefilter != "", "Please select a movie genre")
        )
        
        # Apply filters here
        movie_table_f <- movie_table %>% filter(grepl(paste(input$choicefilter, collapse = "|"), Genre))
        
        # Feed the movie table through kableExtra to produce pretty output
        movie_table_f <- movie_table_f %>% 
            mutate(
                Ratings = cell_spec(x = Ratings, format = "html", bold = T, 
                                    color = "white", 
                                    background = ifelse(Ratings > mean(Ratings), "#66bf3f", "#0e5a9b"))
            ) %>%
            kable(escape = F) %>% # NOTE must have "escape = F" for HTML to render
            kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
    })
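The filtering line is the heart of that reactive, and it works the same way outside of Shiny. In the sketch below, `movies` and `selected` are hypothetical stand-ins for movie_table and input$choicefilter.

```r
# Hypothetical stand-ins for movie_table and input$choicefilter
movies <- data.frame(
  Titles = c("Movie A", "Movie B", "Movie C"),
  Genre  = c("Crime, Drama", "Action, Sci-Fi", "Comedy, Romance")
)
selected <- c("Drama", "Sci-Fi")

# Collapse the selections into one regular expression: "Drama|Sci-Fi"
pattern <- paste(selected, collapse = "|")

# grepl() is TRUE for every row whose Genre matches any of the selections
filtered <- movies[grepl(pattern, movies$Genre), ]
filtered
```

Because this is substring matching, a selection such as "Music" would also match a genre string containing "Musical", which is worth keeping in mind if the genre list grows.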

I also want to call attention to the use of validate and need in the code above which are excellent for removing annoying red text errors when there is no user input. I highly recommend incorporating the combination of these function calls whenever designing an app that requires user inputs.
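To see what need() is doing, it can be called directly outside of an app. This is a sketch under the assumption that an empty checkbox group arrives in the server as NULL; `no_selection` and `two_selections` are hypothetical values standing in for input$choicefilter.

```r
library(shiny)

msg <- "Please select a movie genre"

no_selection   <- NULL                 # assumed value when no box is ticked
two_selections <- c("Drama", "Crime")

# need() returns the message when the condition fails and NULL when it passes
failed <- need(no_selection != "", msg)
passed <- need(two_selections != "", msg)
```

Inside a reactive or render function, validate() collects these results and, if any message comes back, stops the computation quietly and shows the message in place of the output.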

Before moving on to the final goal, I want to call attention to the formatting for declaring outputs that will be read by the ui. The typical setup of the output will look something like this:

    output$output_name <- function(){
        declared_finalized_variable %>% 
            other_function_calls()
    }

For the two movie tables created above, movietablet and movietablef, the code looks like:

# Output for total movietable
    output$movietablet <- function(){
        movie_table_t %>% 
            scroll_box(width = "550px", height = "400px")
    }
    
    # Output for filtered movie_table_f
    output$movietablef <- function(){
        movie_table_f() %>% 
            scroll_box(width = "550px", height = "400px")
    }

Now finally for the third goal: visuals. Most everyone who has touched R knows of ggplot2 and the magic of pretty graphing. In other posts on this site I have incorporated plotly to create interactive graphics. For this project I instead want to use another visual library, highcharter, an R wrapper for the Highcharts JavaScript library. I also used viridisLite for the viridis color palette and trimws() to “trim white space” from elements of my data frames. For these I bundled the construction of the visuals inside the output functions since there was no reliance on reactive user input.

    output$movieratingspie <- renderHighchart({
        # Pie Chart Construction================================================
        
        clrs <- c("#E45402", "#E48D02", "#0E4B95", "#02966D", "#B70268")
        
        MovieRatings.pie <- cbind.data.frame(titles, Certificate) %>% 
            group_by(Certificate) %>% 
            tally() %>% 
            hchart(type = "pie", hcaes(x = Certificate, y = n)) %>% 
            hc_title(text = "<b>Movie Ratings</b>",
                     margin = 20, align = "center",
                     style = list(color = "#000000", useHTML = TRUE)) %>% 
            hc_colors(clrs)
      
        MovieRatings.pie  
    })
    
    output$moviegenrespie <- renderHighchart({
        # Pie Chart Construction================================================
        
        genrepie.df <- data.frame(unlist(strsplit(as.character(genre), ",")))
        colnames(genrepie.df) <- c("genres")
        genrepie.df$genres <- trimws(genrepie.df$genres) # defaults trim whitespace from both ends
        
        clrs2 <- viridis(n = length(unique(genrepie.df$genres)))
        
        MovieGenres.pie <-  genrepie.df %>% 
            group_by(genres) %>% 
            tally() %>% 
            hchart(type = "pie", hcaes(x = genres, y = n)) %>% 
            hc_title(text = "<b>Movie Genres</b>",
                     margin = 20, align = "center",
                     style = list(color = "#000000", useHTML = TRUE)) %>% 
            hc_colors(clrs2)
        
        MovieGenres.pie 

    })
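The data preparation behind the genre pie chart (split, flatten, trim, count) can be checked on its own in base R before any charting is involved. The genre strings below are made up for illustration.

```r
# Hypothetical genre strings, one per movie
genre_demo <- c("Crime, Drama", "Action, Crime, Drama", "Drama")

# Split on commas, flatten to one long vector, and trim stray spaces
genres_long <- trimws(unlist(strsplit(genre_demo, ",")))

# Count occurrences of each genre (a base R analogue of group_by + tally)
genre_counts <- as.data.frame(table(genres = genres_long),
                              stringsAsFactors = FALSE)
genre_counts
```

Since one movie can carry several genres, the counts add up to more than the number of movies; each slice of the pie is a share of genre mentions, not of films.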

Now all that’s left is to punch the “Run App” button at the top right of the RStudio source editor and the Top Movies app will play out as expected. Due to the size of the application and spacing of elements, the application below may require scrolling to the right or zooming out in the web browser to get a better look at the full application. Otherwise, visit the shinyapps.io site where it is hosted by clicking the button at the top of this web page.

The Final Product

Parting Remarks and Final Comments

As one of my first forays into Shiny and rvest I was very pleased with how this application turned out and how easy it was to deploy. I learned much about succinct scripting for Shiny apps and a lot about coding with reactive elements. I hope to explore web scraping further in future analyses and develop more complex applications to support them.

This application is freely available for download for learning. Please feel free to clone and download the entire app from my GitHub page by visiting the link at the top of this webpage.

shinyapps.io: the shinyapps.io site is a great way to host applications for free. Without the use of a live server or hosting platform, Shiny apps can only be run locally on a personal machine. So long as there is no concern over public access and personal data, shinyapps.io is a great choice for broadcasting and serving your application. The site, as well as RStudio’s other products (like RStudio Connect), also offers paid plans for privacy and greater control.

Richard Hanna
Biomedical Engineer and Data Scientist