Reproducible computing in R. How to separate code and data?

Quite often there is a need for periodic calculations and a consolidated report based on self-contained data, that is, data stored as files. This can be data collected from open sources, assorted documents and Excel tables, or exports from corporate systems. The raw data may occupy anywhere from a few megabytes to several gigabytes. The data may be anonymized or may contain confidential information. When the calculation code lives in a repository and the work is done by more than one person on more than one computer, the problem of keeping code and data consistent arises. At the same time, different access rights to the code and to the data still have to be enforced. What to do?


This is a continuation of previous publications.


RStudio is now actively developing the pins package to solve this problem. Unfortunately, the supported backends are somewhat unpopular and expensive to use in our part of the world: AWS, Azure, Google Cloud... for each of them you have to pay, both for storage and for traffic. pins does not yet support AWS4 authentication, so Yandex Cloud is also out of the picture, although it is not free either.


On the other hand, teams of analysts working on specific tasks are usually small (no more than 5-10 people). Many already use Google Drive, OneDrive, and the like, on a paid or free plan. Why not take advantage of resources that are already at hand? Below is one possible workflow.


Overall plan


  1. Calculations are carried out locally on each machine, which means every machine must have an up-to-date replica of all the data needed for the calculation.
  2. The code must be under version control. The data must not end up in it under any circumstances (because of its potential volume and its confidentiality). We will store the data replica either in a separate folder inside the project (adding it to .gitignore) or in a directory outside the project.
  3. The master copy of the data will live on Google Drive, where we also manage access rights to the directories.
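Steps 1 and 2 can be sketched as a small project-initialization helper. The folder name "data" and the .gitignore handling here are assumptions for illustration, not part of the original workflow:

```r
library(fs)

# Create the local data replica folder and make sure git ignores it
init_data_dir <- function(project_root = ".") {
  data_dir <- path(project_root, "data")
  dir_create(data_dir)                       # local replica of the data
  gitignore <- path(project_root, ".gitignore")
  lines <- if (file_exists(gitignore)) readLines(gitignore) else character(0)
  if (!"data/" %in% lines) {
    writeLines(c(lines, "data/"), gitignore) # keep the replica out of git
  }
  invisible(data_dir)
}
```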

What remains is a small matter: implementing synchronization of the local data replica with the cloud. Authorization and authentication are handled by Google.


Below is the code.


library(googledrive)
library(memoise)
library(magrittr)  # for the %>% pipe used inside updateGdCache
# authenticate against Google Drive (opens a browser on first use)
drive_user()
updateGdCache(here::here("data/"), cloud_folder = "XXX___jZnIW3jdkbdxK0iazx7t63Dc")

The cache synchronization function:
updateGdCache <- function(local_folder, cloud_folder){
  # file in which the cloud folder signature is cached locally
  cache_fname <- "gdrive_sig.Rds"
  # 0. memoise the folder listing so repeated calls do not hit the Drive API
  getGdriveFolder <- memoise(function(gdrive_folder){
    drive_ls(as_id(gdrive_folder), recursive = FALSE)
  })

  # 1. get the current signature (file listing) of the cloud folder
  cloud_gdrive_sig <- purrr::possibly(getGdriveFolder, NULL)(cloud_folder)
  # if the folder could not be read, there is nothing to synchronize against
  if(is.null(cloud_gdrive_sig)) {
    message("Some Google Drive issues happened. Can't update cache")
    return()
  }
  # 2. determine the directory that holds the local replica
  fdir <- if(fs::is_dir(local_folder)) local_folder else fs::path_dir(local_folder)

  # 3. list the files currently present in the local replica
  local_files <- fs::dir_ls(fdir, recurse = FALSE) %>%
    fs::path_file()

  # 4. load the signature saved during the previous synchronization
  local_gdrive_sig <- purrr::possibly(readRDS, NULL, quiet = TRUE)(fs::path(fdir, cache_fname))
  if(is.null(local_gdrive_sig)){
    # no local signature: assume nothing has been synchronized yet and
    # build an empty signature with the same structure as the cloud one
    local_gdrive_sig <- cloud_gdrive_sig %>%
      dplyr::filter(dplyr::row_number() == -1)
  }
  # keep only the entries whose files actually exist locally
  local_gdrive_sig <- local_gdrive_sig %>%
    dplyr::filter(name %in% local_files)

  # 5. reconcile the cloud and local signatures; a file is out of sync
  # if it is missing locally or if the modification times differ
  reconcile_tbl <- cloud_gdrive_sig %>%
    dplyr::rename(drive_resource_cloud = drive_resource) %>%
    dplyr::left_join(local_gdrive_sig, by = c("name", "id")) %>%
    tidyr::hoist(drive_resource_cloud, cloud_modified_time = "modifiedTime") %>%
    tidyr::hoist(drive_resource, local_modified_time = "modifiedTime") %>%
    # TODO: decide how to treat files removed from the cloud;
    # a file missing locally shows up here as local_modified_time = NA
    dplyr::mutate(not_in_sync = is.na(local_modified_time) | cloud_modified_time != local_modified_time)

  # 6. helper that downloads a single file, returning TRUE on success
  syncFile <- function(fpath, id){
    res <- purrr::possibly(drive_download, otherwise = NULL)(as_id(id), path = fpath, overwrite = TRUE, verbose = TRUE)
    !is.null(res)
  }
  # download every file that is missing locally or out of date
  sync_gdrive_sig <- reconcile_tbl %>%
    dplyr::filter(not_in_sync == TRUE) %>%
    dplyr::mutate(fpath = fs::path(fdir, name)) %>%
    dplyr::mutate(sync_status = purrr::map2_lgl(fpath, id, syncFile)) %>%
    dplyr::select(name, id, sync_status)

  # 7. save the fresh cloud signature locally, excluding the files
  # that failed to download so they are retried next time
  cloud_gdrive_sig %>%
    # drop the failed downloads
    dplyr::anti_join(dplyr::filter(sync_gdrive_sig, sync_status == FALSE), by = c("name", "id")) %>%
    saveRDS(fs::path(fdir, cache_fname))
}
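A side note on step 0: memoise caches a function's results keyed by its arguments, so repeated listings of the same folder within one session hit the Drive API only once. A toy illustration with a local counter (no Drive access; `slow_ls` and `calls` are made up for the demo):

```r
library(memoise)

calls <- 0
slow_ls <- function(folder) {
  calls <<- calls + 1          # count real invocations
  paste("contents of", folder)
}
fast_ls <- memoise(slow_ls)

fast_ls("reports")
fast_ls("reports")   # served from the cache, slow_ls is not called again
calls                # 1
```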

As cloud_folder, specify the Google Drive folder identifier; it can be taken from the address bar of the browser. The identifier remains unchanged even if the folder is moved within the drive.
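If you prefer to paste the whole URL from the address bar, a small helper can pull the identifier out of it. This function is a hypothetical convenience, not part of the original code, and assumes the standard `.../drive/folders/<id>` URL shape:

```r
# Extract the folder id from a Google Drive URL, e.g.
# https://drive.google.com/drive/folders/<id>
driveFolderIdFromUrl <- function(url) {
  sub("^.*/folders/([^/?#]+).*$", "\\1", url)
}

driveFolderIdFromUrl("https://drive.google.com/drive/folders/XXX___jZnIW3jdkbdxK0iazx7t63Dc")
# "XXX___jZnIW3jdkbdxK0iazx7t63Dc"
```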



Simple, compact, convenient and free.


A couple of comments


  1. There are encoding problems with gargle version 0.4.0; you need to install the dev version. More details here.
  2. There are authorization problems on RStudio Server ("Unable to authorize from RStudio Server #79", closed), but ideas for a workaround can be found here.

Previous publication: "Programming and the Christmas tree, can they be combined?".

Source: https://habr.com/ru/post/undefined/

