admin管理员组

文章数量:1202815

I would like to write an R function that takes a string as input, checks if several thousand substrings are present in that string, and returns a vector of substrings found in the tested string.

I wrote code to do this, but it would be unacceptably slow if I had to call the function many times for 10,000-20,000 different values of "tested_string":

# The function
# This function would actually do many other things to the tested string too, but for now I'm just testing the slow step
check_substrings <- function(tested_string,substrings) {
  result <- sapply(substrings,function(substring,tested_string) { return(any(grepl(substring,tested_string,fixed=T)))},tested_string=tested_string)
  return(names(result)[result])
}
# Testing speed
library('stringi') # Used for generating random strings for testing

# Setting up to test it
# Make random substrings of varying length
set.seed(5)
substrings <- unique(c(stri_rand_strings(20000,6,pattern='[A-Z]'),
                stri_rand_strings(30000,7,pattern='[A-Z]'),
                stri_rand_strings(40000,8,pattern='[A-Z]')))
# Pre-generate random tested strings so they won't be part of the timing below
set.seed(5)
teststrings <- unique(stri_rand_strings(100,20,pattern='[A-Z]'))
teststrings_1k <- stri_rand_strings(1000,20,pattern='[A-Z]')

# Time how long it takes to check 100 tested strings this way
# (I'll actually need to do 10,000 to 20,000 tested strings)
system.time(
  for(tstring in teststrings) {
    x <- check_substrings(tstring,substrings) 
  }
)
# user  system elapsed 
# 12.457   0.046  12.499
# At a rate of 12.4 seconds per 100 test strings, it would take 41.3 minutes to do 20k test strings

I know there are several functions like grepl() and stri_detect() that can solve the opposite problem (check a vector of multiple strings for a single pattern). Turning the problem sideways and checking all possible tested strings for each pattern, one pattern at a time speeds things up considerably:

system.time(
  {
    # Initial check for which substrings are in which test strings
    res_matrix <- sapply(substrings,grepl,x=teststrings_1k,fixed=T)
    # Turn result into a list of which substrings are in which test strings
    rownames(res_matrix) <- teststrings_1k
    res_list <- apply(res_matrix,1,function(x) { return(names(x)[x])})
  }
)
# user  system elapsed 
# 3.641   0.227   3.904
# This is much better - at a rate of 3.9 seconds per 1000 test strings it will take
# take 1.3 minutes to do 20,000 test strings

I can use this second approach if I have to, but it would be a somewhat messy solution because I would have to pre-do this for all substrings and test strings together ahead of time (rather than letting the function that does a bunch of things to each particular test string do this check for that string on the fly).

Is there a better/more efficient way to do something like my check_substrings() function in the first example (check one string for multiple substrings), or am I better off sticking with the second example despite the other complications it would cause?

I would like to write an R function that takes a string as input, checks if several thousand substrings are present in that string, and returns a vector of substrings found in the tested string.

I wrote code to do this, but it would be unacceptably slow if I had to call the function many times for 10,000-20,000 different values of "tested_string":

# The function
# This function would actually do many other things to the tested string too, but for now I'm just testing the slow step
check_substrings <- function(tested_string,substrings) {
  result <- sapply(substrings,function(substring,tested_string) { return(any(grepl(substring,tested_string,fixed=T)))},tested_string=tested_string)
  return(names(result)[result])
}
# Testing speed
library('stringi') # Used for generating random strings for testing

# Setting up to test it
# Make random substrings of varying length
set.seed(5)
substrings <- unique(c(stri_rand_strings(20000,6,pattern='[A-Z]'),
                stri_rand_strings(30000,7,pattern='[A-Z]'),
                stri_rand_strings(40000,8,pattern='[A-Z]')))
# Pre-generate random tested strings so they won't be part of the timing below
set.seed(5)
teststrings <- unique(stri_rand_strings(100,20,pattern='[A-Z]'))
teststrings_1k <- stri_rand_strings(1000,20,pattern='[A-Z]')

# Time how long it takes to check 100 tested strings this way
# (I'll actually need to do 10,000 to 20,000 tested strings)
system.time(
  for(tstring in teststrings) {
    x <- check_substrings(tstring,substrings) 
  }
)
# user  system elapsed 
# 12.457   0.046  12.499
# At a rate of 12.4 seconds per 100 test strings, it would take 41.3 minutes to do 20k test strings

I know there are several functions like grepl() and stri_detect() that can solve the opposite problem (check a vector of multiple strings for a single pattern). Turning the problem sideways and checking all possible tested strings for each pattern, one pattern at a time speeds things up considerably:

system.time(
  {
    # Initial check for which substrings are in which test strings
    res_matrix <- sapply(substrings,grepl,x=teststrings_1k,fixed=T)
    # Turn result into a list of which substrings are in which test strings
    rownames(res_matrix) <- teststrings_1k
    res_list <- apply(res_matrix,1,function(x) { return(names(x)[x])})
  }
)
# user  system elapsed 
# 3.641   0.227   3.904
# This is much better - at a rate of 3.9 seconds per 1000 test strings it will take
# take 1.3 minutes to do 20,000 test strings

I can use this second approach if I have to, but it would be a somewhat messy solution because I would have to pre-do this for all substrings and test strings together ahead of time (rather than letting the function that does a bunch of things to each particular test string do this check for that string on the fly).

Is there a better/more efficient way to do something like my check_substrings() function in the first example (check one string for multiple substrings), or am I better off sticking with the second example despite the other complications it would cause?

Share Improve this question edited Jan 22 at 7:03 TiredSquirrel asked Jan 22 at 5:10 TiredSquirrelTiredSquirrel 6514 silver badges14 bronze badges 3
  • Additional info: Solutions involving grepl(paste(substrings,collapse='|'),tested_string) will not work for this - I need to know which substrings were found, not just whether substrings were found. – TiredSquirrel Commented Jan 22 at 5:11
  • 1 See Gregor Thomas's answer in stackoverflow.com/questions/48837538/r-grepl-vector-over-vector – Edward Commented Jan 22 at 7:29
  • If you really want to make this fast... some specific details about the problem at hand will probably help to find the best algorithm. For example what is the alphabet in question? Is it only uppercase A-Z? – s_baldur Commented Jan 23 at 10:28
Add a comment  | 

2 Answers 2

Reset to default 5

/../ stri_detect() that can solve the opposite problem (check a vector of multiple strings for a single pattern)/../

Recycling in stringi works both ways:

# many strings - one pattern
stringi::stri_detect_fixed(str = c("ABC", "BCD", "CDE"), pattern = "A")
#> [1]  TRUE FALSE FALSE

# one string - many patterns
stringi::stri_detect_fixed(str = "ABC", pattern = c("A", "B", "D"))
#> [1]  TRUE  TRUE FALSE

So you could rewrite that check with stri_extract_first_fixed() or stri_detect_fixed() + subsetting (tiny bit faster than omitting NA-values from stri_extract results) to look something like this:

library(stringi)
check_substrings_stringi <- \(tested_string, substrings) substrings[stri_detect_fixed(tested_string, substrings)]

Single tested_string and absolute time values:

bench::mark(
  check_substrings(teststrings[1], substrings),
  check_substrings_stringi(teststrings[1], substrings),
  iterations = 10
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 check_substrings(teststrings[1… 226.98ms 236.64ms      4.15    3.75MB     26.1
#> 2 check_substrings_stringi(tests…   4.36ms   4.44ms    210.    709.75KB      0

Full sweep over teststrings and relative execution times:

bench::mark(
  check_substrings         = lapply(teststrings, check_substrings, substrings = substrings),
  check_substrings_stringi = lapply(teststrings, check_substrings_stringi, substrings = substrings),
  relative = TRUE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                 min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>               <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 check_substrings          40.8   40.8       1        5.46     5.51
#> 2 check_substrings_stringi   1      1        40.8      1        1

I wrote a cpp-function that calculates a substring matrix in 5.3 seconds for 20.000 values of "testedstrings" and another string-vector which you called "substrings".

Under the hood, stri_extract_first_fixed() executes a C++ function which makes it so fast. However, there is some overhead to this function which I guess comes from various checks. Also it checks only one string for a pattern vector which leads us to use sapply. We can write a much faster C++ function which does the same but without the need for sapply. We use OpenMP parallelization and calculate a logical matrix directly in C++.

How my function works

> strings <- c("apple", "banana", "cherry", "pineapple")
> patterns <- c("an", "apple", "berry")
> res <- string_detect_multiple(strings, patterns)
> rownames(res) <- patterns
> colnames(res) <- strings
> res
      apple banana cherry pineapple
an    FALSE   TRUE  FALSE     FALSE
apple  TRUE  FALSE  FALSE      TRUE
berry FALSE  FALSE  FALSE     FALSE

In your R-directory create a file "string_detect_multiple.cpp":

string_detect_multiple.cpp

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
LogicalMatrix string_detect_multiple(CharacterVector strings, CharacterVector patterns) {
  const int n_strings = strings.size();
  const int n_patterns = patterns.size();
  LogicalMatrix results(n_patterns, n_strings);
  
  // Pre-convert patterns to std::string to avoid repeated conversion
  std::vector<std::string> pattern_strings(n_patterns);
  for (int i = 0; i < n_patterns; ++i) {
    pattern_strings[i] = std::string(patterns[i]);
  }
  
  // Pre-convert strings to avoid repeated conversion
  std::vector<std::string> str_vec(n_strings);
  for (int j = 0; j < n_strings; ++j) {
    str_vec[j] = std::string(strings[j]);
  }
  
#pragma omp parallel for collapse(2)
  for (int i = 0; i < n_patterns; ++i) {
    for (int j = 0; j < n_strings; ++j) {
      results(i, j) = str_vec[j].find(pattern_strings[i]) != std::string::npos;
    }
  }
  
  return results;
}

Import faster function to R

We then use this function in R making use of the package Rccp. Create this R-Script in the same folder and load in the CPP-Script like shown:

library(stringi)
library(microbenchmark)
setwd(dirname(rstudioapi::getSourceEditorContext()$path)) # set the current script's location as working directory

# Test data
set.seed(5)
pat2 <- unique(c(stri_rand_strings(20000,6,pattern='[A-Z]'),stri_rand_strings(30000,7,pattern='[A-Z]'), stri_rand_strings(40000,8,pattern='[A-Z]')))
# Pre-generate random tested strings 
test_string <- "FRXHCSNVYCHM" 
str1 <- c("FRXHCSNVYCHM",stri_rand_strings(1000,20,pattern='[A-Z]'))


# get matrix with stri_detect_fixed
res_sdf  <-  sapply(str1, stri_detect_fixed, pat2)
# rownames(res_sdf_m) <- pat2 # optional name rows
# import our own C++ function :)
#install.packages("Rcpp")
library(Rcpp)
Sys.setenv("PKG_CXXFLAGS" = "-fopenmp")
Sys.setenv("PKG_LIBS" = "-fopenmp")
sourceCpp("string_detect_multiple.cpp")
# use

res_cpp <- string_detect_multiple(str1, pat2)
colnames(res_cpp) <- str1 # we don't really count that into the function time!! :)
# rownames(res_cpp) <- pat2 # optional name rows

identical(res_cpp, res_sdf)
# benchmark

bm <- microbenchmark::microbenchmark(
  stringi = sapply(str1, stri_detect_fixed, pat2),
  cpp = string_detect_multiple(str1, pat2),
  times = 5
)
bm
boxplot(bm)

Benchmark Results

Unit: milliseconds
    expr       min        lq      mean    median        uq       max neval cld
 stringi 6543.8222 6609.0601 6627.8566 6623.9730 6675.8684 6686.5595     5  a 
     cpp  244.3382  244.6394  266.1575  251.5274  270.3496  319.9331     5   b

As you can see, that makes it 25 times faster than sapply(str1, stri_detect_fixed, pat2).


Single Test-String

Of course, you can use my function to mimic the behavior of your check_substrings() but 112 times as fast and 2.677906 times faster than stringi:

check_subsstrings(test_string, pat2) # run your function
pat2[string_detect_multiple(test_string, pat2)] # run mine

本文标签: rA more efficient way to check which substrings are present in several test stringsStack Overflow