r - A more efficient way to check which substrings are present in several test strings? - Stack Overflow

IT技术

更新时间：2025-02-030

admin管理员组
文章数量:1202815

I would like to write an R function that takes a string as input, checks if several thousand substrings are present in that string, and returns a vector of substrings found in the tested string.

I wrote code to do this, but it would be unacceptably slow if I had to call the function many times for 10,000-20,000 different values of "tested_string":

# The function
# This function would actually do many other things to the tested string too, but for now I'm just testing the slow step
check_substrings <- function(tested_string,substrings) {
  result <- sapply(substrings,function(substring,tested_string) { return(any(grepl(substring,tested_string,fixed=T)))},tested_string=tested_string)
  return(names(result)[result])
}

# Testing speed
library('stringi') # Used for generating random strings for testing

# Setting up to test it
# Make random substrings of varying length
set.seed(5)
substrings <- unique(c(stri_rand_strings(20000,6,pattern='[A-Z]'),
                stri_rand_strings(30000,7,pattern='[A-Z]'),
                stri_rand_strings(40000,8,pattern='[A-Z]')))
# Pre-generate random tested strings so they won't be part of the timing below
set.seed(5)
teststrings <- unique(stri_rand_strings(100,20,pattern='[A-Z]'))
teststrings_1k <- stri_rand_strings(1000,20,pattern='[A-Z]')

# Time how long it takes to check 100 tested strings this way
# (I'll actually need to do 10,000 to 20,000 tested strings)
system.time(
  for(tstring in teststrings) {
    x <- check_substrings(tstring,substrings) 
  }
)
# user  system elapsed 
# 12.457   0.046  12.499
# At a rate of 12.4 seconds per 100 test strings, it would take 41.3 minutes to do 20k test strings

I know there are several functions like grepl() and stri_detect() that can solve the opposite problem (check a vector of multiple strings for a single pattern). Turning the problem sideways and checking all possible tested strings for each pattern, one pattern at a time speeds things up considerably:

system.time(
  {
    # Initial check for which substrings are in which test strings
    res_matrix <- sapply(substrings,grepl,x=teststrings_1k,fixed=T)
    # Turn result into a list of which substrings are in which test strings
    rownames(res_matrix) <- teststrings_1k
    res_list <- apply(res_matrix,1,function(x) { return(names(x)[x])})
  }
)
# user  system elapsed 
# 3.641   0.227   3.904
# This is much better - at a rate of 3.9 seconds per 1000 test strings it will take
# take 1.3 minutes to do 20,000 test strings

I can use this second approach if I have to, but it would be a somewhat messy solution because I would have to pre-do this for all substrings and test strings together ahead of time (rather than letting the function that does a bunch of things to each particular test string do this check for that string on the fly).

Is there a better/more efficient way to do something like my check_substrings() function in the first example (check one string for multiple substrings), or am I better off sticking with the second example despite the other complications it would cause?

I would like to write an R function that takes a string as input, checks if several thousand substrings are present in that string, and returns a vector of substrings found in the tested string.

I wrote code to do this, but it would be unacceptably slow if I had to call the function many times for 10,000-20,000 different values of "tested_string":

# The function
# This function would actually do many other things to the tested string too, but for now I'm just testing the slow step
check_substrings <- function(tested_string,substrings) {
  result <- sapply(substrings,function(substring,tested_string) { return(any(grepl(substring,tested_string,fixed=T)))},tested_string=tested_string)
  return(names(result)[result])
}

# Testing speed
library('stringi') # Used for generating random strings for testing

# Setting up to test it
# Make random substrings of varying length
set.seed(5)
substrings <- unique(c(stri_rand_strings(20000,6,pattern='[A-Z]'),
                stri_rand_strings(30000,7,pattern='[A-Z]'),
                stri_rand_strings(40000,8,pattern='[A-Z]')))
# Pre-generate random tested strings so they won't be part of the timing below
set.seed(5)
teststrings <- unique(stri_rand_strings(100,20,pattern='[A-Z]'))
teststrings_1k <- stri_rand_strings(1000,20,pattern='[A-Z]')

# Time how long it takes to check 100 tested strings this way
# (I'll actually need to do 10,000 to 20,000 tested strings)
system.time(
  for(tstring in teststrings) {
    x <- check_substrings(tstring,substrings) 
  }
)
# user  system elapsed 
# 12.457   0.046  12.499
# At a rate of 12.4 seconds per 100 test strings, it would take 41.3 minutes to do 20k test strings

I know there are several functions like grepl() and stri_detect() that can solve the opposite problem (check a vector of multiple strings for a single pattern). Turning the problem sideways and checking all possible tested strings for each pattern, one pattern at a time speeds things up considerably:

system.time(
  {
    # Initial check for which substrings are in which test strings
    res_matrix <- sapply(substrings,grepl,x=teststrings_1k,fixed=T)
    # Turn result into a list of which substrings are in which test strings
    rownames(res_matrix) <- teststrings_1k
    res_list <- apply(res_matrix,1,function(x) { return(names(x)[x])})
  }
)
# user  system elapsed 
# 3.641   0.227   3.904
# This is much better - at a rate of 3.9 seconds per 1000 test strings it will take
# take 1.3 minutes to do 20,000 test strings

I can use this second approach if I have to, but it would be a somewhat messy solution because I would have to pre-do this for all substrings and test strings together ahead of time (rather than letting the function that does a bunch of things to each particular test string do this check for that string on the fly).

Is there a better/more efficient way to do something like my check_substrings() function in the first example (check one string for multiple substrings), or am I better off sticking with the second example despite the other complications it would cause?

Share Improve this question edited Jan 22 at 7:03 asked Jan 22 at 5:10 TiredSquirrel 6514 silver badges14 bronze badges

Additional info: Solutions involving grepl(paste(substrings,collapse='|'),tested_string) will not work for this - I need to know which substrings were found, not just whether substrings were found. – TiredSquirrel Commented Jan 22 at 5:11
1 See Gregor Thomas's answer in stackoverflow.com/questions/48837538/r-grepl-vector-over-vector – Edward Commented Jan 22 at 7:29
If you really want to make this fast... some specific details about the problem at hand will probably help to find the best algorithm. For example what is the alphabet in question? Is it only uppercase A-Z? – s_baldur Commented Jan 23 at 10:28

Add a comment |

2 Answers 2

Sorted by: Reset to default 5

/../ stri_detect() that can solve the opposite problem (check a vector of multiple strings for a single pattern)/../

Recycling in stringi works both ways:

# many strings - one pattern
stringi::stri_detect_fixed(str = c("ABC", "BCD", "CDE"), pattern = "A")
#> [1]  TRUE FALSE FALSE

# one string - many patterns
stringi::stri_detect_fixed(str = "ABC", pattern = c("A", "B", "D"))
#> [1]  TRUE  TRUE FALSE

So you could rewrite that check with stri_extract_first_fixed() or stri_detect_fixed() + subsetting (tiny bit faster than omitting NA-values from stri_extract results) to look something like this:

library(stringi)
check_substrings_stringi <- \(tested_string, substrings) substrings[stri_detect_fixed(tested_string, substrings)]

Single tested_string and absolute time values:

bench::mark(
  check_substrings(teststrings[1], substrings),
  check_substrings_stringi(teststrings[1], substrings),
  iterations = 10
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 check_substrings(teststrings[1… 226.98ms 236.64ms      4.15    3.75MB     26.1
#> 2 check_substrings_stringi(tests…   4.36ms   4.44ms    210.    709.75KB      0

Full sweep over teststrings and relative execution times:

bench::mark(
  check_substrings         = lapply(teststrings, check_substrings, substrings = substrings),
  check_substrings_stringi = lapply(teststrings, check_substrings_stringi, substrings = substrings),
  relative = TRUE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                 min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>               <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 check_substrings          40.8   40.8       1        5.46     5.51
#> 2 check_substrings_stringi   1      1        40.8      1        1

I wrote a cpp-function that calculates a substring matrix in 5.3 seconds for 20.000 values of "testedstrings" and another string-vector which you called "substrings".

Under the hood, stri_extract_first_fixed() executes a C++ function which makes it so fast. However, there is some overhead to this function which I guess comes from various checks. Also it checks only one string for a pattern vector which leads us to use sapply. We can write a much faster C++ function which does the same but without the need for sapply. We use OpenMP parallelization and calculate a logical matrix directly in C++.

How my function works

> strings <- c("apple", "banana", "cherry", "pineapple")
> patterns <- c("an", "apple", "berry")
> res <- string_detect_multiple(strings, patterns)
> rownames(res) <- patterns
> colnames(res) <- strings
> res
      apple banana cherry pineapple
an    FALSE   TRUE  FALSE     FALSE
apple  TRUE  FALSE  FALSE      TRUE
berry FALSE  FALSE  FALSE     FALSE

In your R-directory create a file "string_detect_multiple.cpp":

string_detect_multiple.cpp

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
LogicalMatrix string_detect_multiple(CharacterVector strings, CharacterVector patterns) {
  const int n_strings = strings.size();
  const int n_patterns = patterns.size();
  LogicalMatrix results(n_patterns, n_strings);
  
  // Pre-convert patterns to std::string to avoid repeated conversion
  std::vector<std::string> pattern_strings(n_patterns);
  for (int i = 0; i < n_patterns; ++i) {
    pattern_strings[i] = std::string(patterns[i]);
  }
  
  // Pre-convert strings to avoid repeated conversion
  std::vector<std::string> str_vec(n_strings);
  for (int j = 0; j < n_strings; ++j) {
    str_vec[j] = std::string(strings[j]);
  }
  
#pragma omp parallel for collapse(2)
  for (int i = 0; i < n_patterns; ++i) {
    for (int j = 0; j < n_strings; ++j) {
      results(i, j) = str_vec[j].find(pattern_strings[i]) != std::string::npos;
    }
  }
  
  return results;
}

Import faster function to R

We then use this function in R making use of the package Rccp. Create this R-Script in the same folder and load in the CPP-Script like shown:

library(stringi)
library(microbenchmark)
setwd(dirname(rstudioapi::getSourceEditorContext()$path)) # set the current script's location as working directory

# Test data
set.seed(5)
pat2 <- unique(c(stri_rand_strings(20000,6,pattern='[A-Z]'),stri_rand_strings(30000,7,pattern='[A-Z]'), stri_rand_strings(40000,8,pattern='[A-Z]')))
# Pre-generate random tested strings 
test_string <- "FRXHCSNVYCHM" 
str1 <- c("FRXHCSNVYCHM",stri_rand_strings(1000,20,pattern='[A-Z]'))


# get matrix with stri_detect_fixed
res_sdf  <-  sapply(str1, stri_detect_fixed, pat2)
# rownames(res_sdf_m) <- pat2 # optional name rows
# import our own C++ function :)
#install.packages("Rcpp")
library(Rcpp)
Sys.setenv("PKG_CXXFLAGS" = "-fopenmp")
Sys.setenv("PKG_LIBS" = "-fopenmp")
sourceCpp("string_detect_multiple.cpp")
# use

res_cpp <- string_detect_multiple(str1, pat2)
colnames(res_cpp) <- str1 # we don't really count that into the function time!! :)
# rownames(res_cpp) <- pat2 # optional name rows

identical(res_cpp, res_sdf)
# benchmark

bm <- microbenchmark::microbenchmark(
  stringi = sapply(str1, stri_detect_fixed, pat2),
  cpp = string_detect_multiple(str1, pat2),
  times = 5
)
bm
boxplot(bm)

Benchmark Results

Unit: milliseconds
    expr       min        lq      mean    median        uq       max neval cld
 stringi 6543.8222 6609.0601 6627.8566 6623.9730 6675.8684 6686.5595     5  a 
     cpp  244.3382  244.6394  266.1575  251.5274  270.3496  319.9331     5   b

As you can see, that makes it 25 times faster than sapply(str1, stri_detect_fixed, pat2).

Single Test-String

Of course, you can use my function to mimic the behavior of your check_substrings() but 112 times as fast and 2.677906 times faster than stringi:

check_subsstrings(test_string, pat2) # run your function
pat2[string_detect_multiple(test_string, pat2)] # run mine

本文标签： rA more efficient way to check which substrings are present in several test stringsStack Overflow

版权声明：本文标题：r - A more efficient way to check which substrings are present in several test strings? - Stack Overflow 内容由网友自发贡献，该文观点仅代表作者本人，转载请联系作者并注明出处：http://www.betaflare.com/web/1738585341a2101398.html，本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容，一经查实，本站将立刻删除。

编程频道|软件玩家 - 软件改变生活！

r - A more efficient way to check which substrings are present in several test strings? - Stack Overflow

2 Answers 2

How my function works

string_detect_multiple.cpp

Import faster function to R

Benchmark Results

Single Test-String

更多相关文章

r - A more efficient way to check which substrings are present in several test strings? - Stack Overflow

发表评论

推荐文章

javascript - because it violates the following Content Security Policy directive: &quot;style-src &#39;self&#39;&

javascript - Put text on an image and save as image - Stack Overflow

javascript - When cloning a draggable object in jQuery UI, how can you transfer the data and events to the new element? - Stack

database - how to sum cells struct in matlab - Stack Overflow

javascript - Node.js - callback parameters - Stack Overflow

热门文章

python - How to merge dataframes over multiple columns and split rows? - Stack Overflow

javascript - using jQuery to get absolute or relative position position of a div - Stack Overflow

call a specific javascript function when multiple same name function are in memory - Stack Overflow

powershell - Escaping issue with az ad app update in Azure DevOps Pipelines - Stack Overflow

How to delete resized (cropped) image uploads and prevent future resizing?

naming conventions - Why do I see JavaScript variables prefixed with $? - Stack Overflow

javascript - .setCapture and .releaseCapture in Chrome - Stack Overflow

Check If Taxonomy A and Taxonomy B has same Slug, 301 Auto Redirect Tax A to Tax B if True in Wordpress

javascript - addEventListener in a React app isn&#39;t working - Stack Overflow

shortcode doesnt work on excerpt

最新文章

电脑小白怎么重装系统_电脑小白u盘重装系统详细教程【小白必看】

忘记电脑密码如何修改win7

Windows7 SP1更新升级失败

Windows7BT种子大全

适合win7的python版本_Win7操作系统上安装 Python3.X环境

php - How to use instagram api to fetch image with certain hashtag? - Stack Overflow

javascript - How to solve TypeError: spyOn andReturn is not a function? - Stack Overflow

android - Flutter WebView and Back Button with PopScope showing a black screen - Stack Overflow

custom post type: own database table structure and custom &quot;new-post&quot; interface

javascript - Calculate point between two coordinates based on a percentage - Stack Overflow

惠普OMEN 15-CE001TX 2EF91PA参数报价

苹果新款MacBook Pro 15英寸 i732GB1TBVega Pro 20参数报价

联想Y330A-PSE L参数报价

神舟战神Z7 D6 i7-12650H16GB512GBRTX4050旗舰版参数报价

神舟战神Z7 D6 i7-12650H16GB1TBRTX4050参数报价

javascript - because it violates the following Content Security Policy directive: "style-src 'self'&

javascript - addEventListener in a React app isn't working - Stack Overflow

custom post type: own database table structure and custom "new-post" interface