Creating a Data Pipeline with Github Actions & the {googledrive} Package for the Canadian Premier League Soccer Data Initiative




In analytics for any particular field, it's not enough to be able to create output (fancy charts, dashboards, reports, etc.); you also need to be able to collect the data you want to use in an easy, reproducible, and most importantly, consistent way. This is all the more important in a field like sports where, throughout the course of a season, new data is being added to a database or some kind of folder.

In this blog post I will go over how to create a data pipeline for Canadian Premier League data stored in a Google Drive folder (courtesy of Centre Circle & StatsPerform) using R and Github Actions.

The simple example I'll go over will guide you through setting up a Google service account and creating a Github Actions workflow that runs a few R scripts. The end product is a very simple ggplot2 chart using the data downloaded by this workflow.

For more analysis- and visualization-focused blog posts, have a look at the other posts on my website or check out my soccer_ggplots Github repository.

Let's get started!

Canadian Premier League data

The Canadian Premier League was started to improve the quality of soccer in Canada, and alongside its inaugural launch in the 2019 season, a data initiative was started by Centre Circle in partnership with StatsPerform to provide detailed data on all CPL teams and players.

The data is divided into .csv or .xlsx files for:

  • Player Total stats
  • Player per Game stats
  • Team Total stats
  • Team per Game stats

With around 147 metrics available, ranging from 'expected goals from set pieces' and '% of passes that go forward' to 'fouls committed in the attacking 3rd', this initiative provides a great source of data for both beginner and expert analysts to hone their skills.

You can sign up to gain access to the Google Drive containing the data here. You should do so before we get started so that you can familiarize yourself not just with the data but also with the best practices and usage permissions stated in the files.

So, it's nice that we have all this data in a Google Drive folder and that it's being updated every few days by Centre Circle. However, it's a bit annoying to have to manually re-download the data files after every update, save them in the proper folder, and only then finally get down to analyzing the data. This is where automation can help: in this case, Github Actions and the {googledrive} R package can be used to create an ETL pipeline that automates the data loading/saving for you.

Guide

In the following section I'll go over the steps you need to create an ETL pipeline for CPL data. This tutorial assumes you know the basics of R programming, that you already have a Github account, and that you can navigate your way around the platform.

It can be a daunting process to set all of this up, particularly the Google service account part. It was the same for me; I had to go through the documentation provided by the {googledrive} R package quite a bit, along with a lot of googling things separately.

  • {googledrive} R package documentation
  • Non-interactive authentication docs from the {gargle} R package

Here is the link to my own repository, CanPL_Analysis, which has all the files I'll be talking about below.

Google Drive

    1. Make sure you're already signed into your Google account and go to the Google Cloud Platform Console. On the left-side menu bar, go to the "IAM & Admin" section.

    2. Scroll down to find the Create a project button in the menu bar. Give it a good name that states the purpose of your project.

    3. Create a service account and fill in the service account details.

    4. Select the "Owner" role, or another role as relevant for your project.

    5. Once you're done, click on your newly created service account from the project page. The email you see listed for your service account is something you'll need later, so keep a copy of that address somewhere.

    6. Go to the "Keys" tab, click on "ADD KEY", and then "Create new key".

    7. Make sure the key type is "JSON" and create it.

    8. Store the file in a secure place. Make sure you store it somewhere so that it is NOT uploaded to a public repository on Github. You can do that by .gitignore-ing the file or by making the credential an environment variable in R with usethis::edit_r_environ() (a sketch of this local setup follows the notes below).

    9. Go to your Google Drive API page and enable the API. The URL is: https://console.developers.google.com/apis/api/drive.googleapis.com/overview?project={YOUR-PROJECT-ID} (fill in {YOUR-PROJECT-ID} with your project ID).

    10. Ask the owner of the folder/file to share it with the service account by right-clicking on the folder/file in Google Drive and clicking the 'Share' button (Steven Scott is the owner of the CanPL data). Use the email address you see in the "client_email" field of your Google credential JSON file as the account to add.

NOTE: Since the service account doesn't have a physical email inbox, you can't send an email to it with the share link and open the file/folder from the email message. You have to 'share'/'ask to share' the folder/file from Google Drive directly.

NOTE: Please only do this part if you are serious about using the Canadian Premier League data, and only if you have signed up here on the Canadian PL website and read all the terms, conditions, and "best practices" sheet provided in the Google Drive folder containing the data. Otherwise, you can create your own separate folder on Google Drive, put some random data in it, share that with your Google service account, and run the Github Actions workflow on that instead.
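To tie the last few steps together, here is a minimal local sketch: keep the key file out of git, point an environment variable at it, and check that the service account can actually see the shared folder. The file name below is a placeholder, the environment variable name matches the Github secret used later in this post, and the folder name is the one used in the download script further down.

## Keep the service account key out of git (the file name is a placeholder)
usethis::use_git_ignore("my-service-account-key.json")

## Store the path to the key in your user-level .Renviron ...
usethis::edit_r_environ()
## ... by adding a line like:
## GOOGLE_AUTHENTICATION_CREDENTIALS="C:/path/to/my-service-account-key.json"

## After restarting R, authenticate as the service account and check that it
## can actually see the folder that was shared with it:
library(googledrive)
drive_auth(path = Sys.getenv("GOOGLE_AUTHENTICATION_CREDENTIALS"))
drive_ls(path = "Centre Circle Data & Info")
## If sharing worked, drive_ls() lists the files inside the folder;
## if not, it errors or comes back empty.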

Github Actions (GHA)

Github Actions (GHA) is a relatively new feature introduced in late 2019 that allows you to set up workflows and take advantage of Github's VMs (Windows, Mac, Linux, or your own setup). For using Github Actions with R, there's the aptly named GitHub Actions with R book to help you out, while I have gotten a lot of mileage out of looking up other people's GHA YAML files online to see how they tackled problems. R has considerable support for using Github Actions to power your analyses. Take a look at r-lib's GHA repository, which will give you templates and commands to set up different workflows for your needs.

Some other examples of using R and GHA:

  • Automate Web Scraping in R with Github Actions
  • Automating web scraping with GitHub Actions and R: an example from New Jersey
  • R-Package GitHub Actions via {usethis} and r-lib
  • Up-to-date blog stats in your README
  • Launch an R script using github actions (R for SEO book)
  • A Twitter bot with {rtweet} and GitHub Actions

To set up GHA in your own Github repository:

    1. Have a Github repository set up with the scripts and other materials you want to use. Below is how I set up mine (link); the folders are important as we're going to be referring to them to save our data and output. Name yours however you wish, just remember to refer to them properly in the R scripts or YAML files you use.

    2. Open the repo up in RStudio and type in: usethis::use_github_actions(). This will do all the setup for you to get GHA running in your repository (see the sketch after this list).
    3. Your GHA workflows are stored in the .github/workflows folder as YAML files. If you used the function above, it will have created one for R-CMD-check for you. You don't need that for what we're doing since this repository isn't an R package, so either delete it or change it into what we want to do. We'll be working on the YAML files in the next section.
    4. Note that for both private and public repositories you have a number of free credits to use per month, but anything more is going to cost you. See here for pricing details.
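As a rough sketch, the console side of that setup looks something like this (note that in more recent {usethis} releases the function below has been superseded, so the exact call may differ):

## Run this inside the repository's RStudio project
usethis::use_github_actions()
## This creates .github/workflows/R-CMD-check.yaml, which we'll delete or
## repurpose since this repository is not an R package.

## In newer {usethis} versions the rough equivalent is:
# usethis::use_github_action("check-standard")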

To let the Github Actions workflow use your Google credentials, you need to store them in a place where GHA can retrieve them while it's running.

    1. Go into "Settings" in your Github repository, then "Secrets", then "New repository secret".

    2. Call it GOOGLE_AUTHENTICATION_CREDENTIALS or whatever you want (just make sure it's consistent between what you call it here and in the workflow YAML file or R script). Then copy-paste the contents of the Google credential key .JSON file (the one you downloaded earlier) into the "value" prompt. I believe you need to include the {} brackets as well.

Workflow YAML file

Now we need to create a workflow YAML file inside the .github/workflows/ directory. This is the file that gives GHA instructions on what to do. Here is the link to the one I created.

First, you want to figure out how often you want this GHA workflow to run. It really depends on what you want to do with GHA. For the purposes of the CPL data project, it appears that new data is added every few days, so you may want to schedule it to run maybe every two days. It really depends. To trigger your GHA workflow you can use keywords such as "push", "pull_request", etc., or you can schedule it with cron.

crontab.guru is a useful website for working out the specific syntax you need to schedule your GHA workflow with cron, whether that be 'once a day', 'every 3 hours', or 'every 2 days at 3 AM' (for example, 0 3 */2 * *). If you still don't know, try googling "cron run every 6 hours" or similar.

To accomplish what we want to do, the basic steps are as follows:

  1. It does some setup, installing R itself (with the r-lib/actions/setup-r action) and checking out the repository (with the actions/checkout action).

  2. If you're using Ubuntu Linux as the VM running this workflow, then you need to install the libcurl/openssl system libraries to be able to install the {googledrive} package in later steps (mainly due to the {curl} and {httr} dependency R packages; see this StackOverflow post for details). You don't need to do this if you're using macOS as the VM.

  3. Installs R packages from CRAN. Be warned that due to some dependencies, the packages you list here might not actually install even if GHA says that step completed. This part tripped me up quite a bit until I figured out the libcurl/openssl thing. Also note that there should be a way to cache the R packages you're installing, but so far I've only found solutions for when the Github repo you're using is itself an R package. This part usually takes quite long (and in this simplified example I'm only installing three packages!), so if you really want to do something with a lot of dependencies I suggest you research this a lot more or you will use up your GHA minutes quite quickly!

  4. Runs the R script Get-GoogleDrive-Data.R.

  5. Runs the R script Plot-ggplot.R.

  6. Commits and pushes the downloaded data files to the Github repository.

Make sure your indentation for each section is correct or it won't run properly. It can be an annoyance to figure out, but it is what it is. I usually just copy-paste someone else's YAML file that has steps and setups close to what I want to do and then start editing from there.

More details (a minimal sketch of a full workflow file follows this list):

  • on: "When" to run this action. Can be on git push, pull_request, etc. (use [] when specifying multiple conditions), or you can schedule it using cron as we talked about earlier.

  • runs-on: Which OS do you want to run this GHA workflow on? Note that per the terms of GHA minutes and billing, Ubuntu Linux is the cheapest, then Windows, and macOS is the most expensive, so plan accordingly.

  • env: This is where you refer to the environment variables you have set up in your Github repository, such as your Github token and your Google service account credentials.

  • steps: This is where you outline the specific steps your workflow should take.
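Putting those pieces together, here is a minimal sketch of what such a workflow file could look like. The schedule, action versions, system library, package list, script paths, and commit step below are illustrative assumptions on my part (and the secret name matches the one chosen earlier); see the linked YAML file in my repository for what I actually use.

# .github/workflows/get-cpl-data.yaml -- a minimal, illustrative sketch
name: get-cpl-data

on:
  schedule:
    - cron: '0 3 */2 * *'   # roughly "every 2 days at 3 AM" (UTC)

jobs:
  get-data:
    runs-on: ubuntu-latest
    env:
      GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
      GOOGLE_AUTHENTICATION_CREDENTIALS: ${{ secrets.GOOGLE_AUTHENTICATION_CREDENTIALS }}
    steps:
      - uses: actions/checkout@v2
      - uses: r-lib/actions/setup-r@v1

      # system libraries needed on Ubuntu so {curl}/{httr}/{googledrive} will install
      - name: Install system dependencies
        run: sudo apt-get update -qq && sudo apt-get install -y libcurl4-openssl-dev

      - name: Install R packages
        run: Rscript -e 'install.packages(c("googledrive", "purrr", "ggplot2"), repos = "https://cloud.r-project.org")'

      - name: Download data from Google Drive
        run: Rscript scripts/Get-GoogleDrive-Data.R

      - name: Create plot
        run: Rscript scripts/Plot-ggplot.R

      - name: Commit and push new files
        run: |
          git config user.name "github-actions"
          git config user.email "actions@users.noreply.github.com"
          git add data/ basic_plots/
          git commit -m "Update CPL data and plots" || echo "Nothing to commit"
          git push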

An important question is: how can I refer to my Google credentials, stored as a Github secret, in the googledrive::drive_auth() function that will be used in the R script for authentication?

Since you named the Github secret "GOOGLE_AUTHENTICATION_CREDENTIALS" in the previous section, you can use that name as the reference for the Sys.getenv() function so that it can grab it as an environment variable. You'll see how that works in the next section.
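As a small illustration of how the pieces connect (using the secret name from this post): the workflow's env: block exposes the secret to the job, and the R script simply reads the corresponding environment variable. Note that the value here is the credential JSON as text rather than a file path; {gargle}, which handles authentication for {googledrive}, accepts a JSON string for this argument as well as a path.

## Inside the R script run by the workflow, the Github secret is just an
## ordinary environment variable holding the credential JSON:
googledrive::drive_auth(path = Sys.getenv("GOOGLE_AUTHENTICATION_CREDENTIALS"))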

Also remember that you need to git pull your repo every now and then so that you have the latest data to work with. This is because when the GHA workflow downloads and commits the data into your Github repo, it is only updating it online on Github and not on your local computer. So you need to pull all the new stuff in first, or you'll be working with outdated data from the last time you pulled.

R Scripts

Now that we've done a lot of the setup, we can actually start doing stuff in R. For both of these scripts I tried to use a minimal number of packages to reduce the dependencies that GHA has to download and install as part of the workflow run. You could easily use a for loop instead of purrr::map2() and base R plotting instead of {ggplot2} if you want to go even further (I attempted this in scripts/Plot-BaseR.R, but it was just quicker doing it with ggplot2).

Get-GoogleDrive-Data.R

Link

  1. Load R packages.

  2. Authenticate with Google Drive by fetching the environment variable you set up in the Github repository as a Github secret.

  3. Find the Google Drive folder you want to grab data from.

  4. Filter the folder for the .csv files.

  5. Create a download function that grabs the .csv files and saves them in the data/ folder. Add some handy messages throughout the function so that they show up in the GHA log (this helps with debugging and just knowing what's going on as the workflow runs).

  6. Now use purrr::map2() to apply the download function to each individual .csv file in the folder.

## Load packages ----
library(googledrive)
library(purrr)
# library(dplyr)
# library(stringr)
# library(fs)
options(googledrive_quiet = TRUE)

## Authenticate into googledrive service account ----
## 'GOOGLE_AUTHENTICATION_CREDENTIALS' is what we named the Github Secret that
## contains the credential JSON file
googledrive::drive_auth(path = Sys.getenv("GOOGLE_AUTHENTICATION_CREDENTIALS"))

## Find Google Drive folder 'Centre Circle Data & Info'
data_folder <- drive_ls(path = "Centre Circle Data & Info")

## Filter for only the .csv files
data_csv <- data_folder[grepl(".csv", data_folder$name), ]
# dplyr: data_csv <- data_folder %>% filter(str_detect(name, ".csv")) %>% arrange(name)

data_path <- "data"
dir.create(data_path)
## dir.create() will fail if the folder already exists, so it's not great for scripts run
## locally, but as GHA creates a new environment every time it runs we won't have that
## problem here
# normally I prefer using fs::dir_create(data_path)

## download function ----
get_drive_cpl_data <- function(g_id, data_name) {
  cat("\n... Trying to download", data_name, "...\n")

  ## Wrap drive_download() in safely() so one failure doesn't stop the whole run
  safe_drive_download <- purrr::safely(drive_download)

  ## Run the download function for the data file
  dl_return <- safe_drive_download(file = as_id(g_id),
                                   path = paste0("data/", data_name),
                                   overwrite = TRUE)

  ## Log messages for success or failure
  if (is.null(dl_return$result)) {
    cat("\nSomething went wrong!\n")
    dl_error <- as.character(dl_return$error) ## errors come back as lists sometimes so coerce to character
    cat("\n", dl_error, "\n")
  } else {
    res <- dl_return$result
    cat("\nFile:", res$name, "download successful!", "\nPath:", res$local_path, "\n")
  }
}

## Download all files from Google Drive! ----
map2(data_csv$id, data_csv$name,
     ~ get_drive_cpl_data(g_id = .x, data_name = .y))

cat("\nAll done!\n")

Plot-ggplot.R

Link

  1. Load R packages.

  2. Read data from the data/ folder.

  3. Do some data cleaning and create non-penalty versions of some of the variables.

  4. Create a very basic bar chart.

  5. Save it in the basic_plots/ folder.

## Load packages ----
library(ggplot2)

## Read data ----
## normally you should use readr::read_csv() instead
cpl_teamtotal_2021 <- read.csv("data/CPLTeamTotals2021.csv")

## Data cleaning ----
## normally would use {dplyr} but trying to reduce dependencies for this minimal script example...

## Select a subset of variables
select_vars <- c("ShotsTotal", "xGPerShot", "GM", "Team", "NonPenxG", "PenTaken")
cpl_teamtotal_2021 <- cpl_teamtotal_2021[select_vars]

## Calculate non-penalty variables
cpl_teamtotal_2021$NonPenShotsTotal <- cpl_teamtotal_2021$ShotsTotal - cpl_teamtotal_2021$PenTaken
cpl_teamtotal_2021$NonPenXGPerShot <- cpl_teamtotal_2021$NonPenxG / cpl_teamtotal_2021$NonPenShotsTotal
cpl_teamtotal_2021$NonPenShotsP90 <- cpl_teamtotal_2021$NonPenShotsTotal / (cpl_teamtotal_2021$GM * 90) * 90
cpl_teamtotal_2021$NonPenxGP90 <- cpl_teamtotal_2021$NonPenxG / (cpl_teamtotal_2021$GM * 90) * 90

## Plot ----
basic_plot <- ggplot(data = cpl_teamtotal_2021,
                     aes(x = NonPenxGP90, y = reorder(Team, NonPenxGP90))) +
  geom_col() +
  annotate(geom = "text", x = 0.65, y = 4.5,
           label = "EXAMPLE", color = "white", angle = 45, fontface = "bold",
           size = 30, alpha = 0.5) +
  scale_x_continuous(
    expand = c(0, 0.025),
    limits = c(0, 1.5)
  ) +
  labs(
    title = "Pacific FC leads the league in expected goals for per 90...",
    subtitle = paste0("Canadian Premier League 2021 Season (As of ", format(Sys.Date(), "%B %d, %Y"), ")"),
    x = "Non-Penalty xG per 90",
    y = NULL,
    caption = "Data: Centre Circle & StatsPerform\nMedia: @CanPLdata #CCdata #CanPL"
  ) +
  theme_minimal() +
  theme(axis.ticks = element_blank(),
        panel.grid.major.y = element_blank())

## Save in 'basic_plots' folder ----
ggsave(filename = paste0("basic_plots/basic_plot_", Sys.Date(), ".PNG"), plot = basic_plot)

Output

Once a workflow run is successful, you should be able to see that another git commit was made in your Github repository that saved the new data downloaded from the CanPL Google Drive folder into your data/ folder, while the simple plot of the xG data was saved and committed into the basic_plots/ folder. When you're creating work from this data set, please remember to add social media links to the Canadian Premier League as well as the logos for Centre Circle Data and StatsPerform (the example plot below is shown without the logos).

(It's very possible that as the data is updated throughout the season, Pacific FC won't be the league leaders anymore, but whatever, you get the point.)

Conclusion

Hopefully this was a helpful guide to grabbing Canadian Premier League soccer data automatically with Github Actions. There is a lot to learn as there are many different moving parts to make this work. I tried to link to a lot of the documentation and blog posts that helped me through it, so those should be of use to you as well.

There are many things you can do by extending the very basic ETL setup we created in this blog post, including but not limited to:

  • Create a Twitter bot to post your visualizations!
  • Create a parameterized RMarkdown report!
  • Create an RMarkdown dashboard!
  • Use the updated data to power a Shiny app!
  • Create your own separate database and upload new data into it after some cleaning steps!
  • Etc.

Some of these are things which I hope to talk about in future blog posts, so stay tuned!


Source: https://www.r-bloggers.com/2021/09/creating-a-data-pipeline-with-github-actions-the-googledrive-package-for-the-canadian-premier-league-soccer-data-initiative/
