admin管理员组文章数量:1202815
I would like to write an R function that takes a string as input, checks if several thousand substrings are present in that string, and returns a vector of substrings found in the tested string.
I wrote code to do this, but it would be unacceptably slow if I had to call the function many times for 10,000-20,000 different values of "tested_string":
# The function
# This function would actually do many other things to the tested string too, but for now I'm just testing the slow step
check_substrings <- function(tested_string,substrings) {
result <- sapply(substrings,function(substring,tested_string) { return(any(grepl(substring,tested_string,fixed=T)))},tested_string=tested_string)
return(names(result)[result])
}
# Testing speed
library('stringi') # Used for generating random strings for testing
# Setting up to test it
# Make random substrings of varying length
set.seed(5)
substrings <- unique(c(stri_rand_strings(20000,6,pattern='[A-Z]'),
stri_rand_strings(30000,7,pattern='[A-Z]'),
stri_rand_strings(40000,8,pattern='[A-Z]')))
# Pre-generate random tested strings so they won't be part of the timing below
set.seed(5)
teststrings <- unique(stri_rand_strings(100,20,pattern='[A-Z]'))
teststrings_1k <- stri_rand_strings(1000,20,pattern='[A-Z]')
# Time how long it takes to check 100 tested strings this way
# (I'll actually need to do 10,000 to 20,000 tested strings)
system.time(
for(tstring in teststrings) {
x <- check_substrings(tstring,substrings)
}
)
# user system elapsed
# 12.457 0.046 12.499
# At a rate of 12.4 seconds per 100 test strings, it would take 41.3 minutes to do 20k test strings
I know there are several functions like grepl()
and stri_detect()
that can solve the opposite problem (check a vector of multiple strings for a single pattern). Turning the problem sideways and checking all possible tested strings for each pattern, one pattern at a time speeds things up considerably:
system.time(
{
# Initial check for which substrings are in which test strings
res_matrix <- sapply(substrings,grepl,x=teststrings_1k,fixed=T)
# Turn result into a list of which substrings are in which test strings
rownames(res_matrix) <- teststrings_1k
res_list <- apply(res_matrix,1,function(x) { return(names(x)[x])})
}
)
# user system elapsed
# 3.641 0.227 3.904
# This is much better - at a rate of 3.9 seconds per 1000 test strings it will take
# take 1.3 minutes to do 20,000 test strings
I can use this second approach if I have to, but it would be a somewhat messy solution because I would have to pre-do this for all substrings and test strings together ahead of time (rather than letting the function that does a bunch of things to each particular test string do this check for that string on the fly).
Is there a better/more efficient way to do something like my check_substrings()
function in the first example (check one string for multiple substrings), or am I better off sticking with the second example despite the other complications it would cause?
I would like to write an R function that takes a string as input, checks if several thousand substrings are present in that string, and returns a vector of substrings found in the tested string.
I wrote code to do this, but it would be unacceptably slow if I had to call the function many times for 10,000-20,000 different values of "tested_string":
# The function
# This function would actually do many other things to the tested string too, but for now I'm just testing the slow step
check_substrings <- function(tested_string,substrings) {
result <- sapply(substrings,function(substring,tested_string) { return(any(grepl(substring,tested_string,fixed=T)))},tested_string=tested_string)
return(names(result)[result])
}
# Testing speed
library('stringi') # Used for generating random strings for testing
# Setting up to test it
# Make random substrings of varying length
set.seed(5)
substrings <- unique(c(stri_rand_strings(20000,6,pattern='[A-Z]'),
stri_rand_strings(30000,7,pattern='[A-Z]'),
stri_rand_strings(40000,8,pattern='[A-Z]')))
# Pre-generate random tested strings so they won't be part of the timing below
set.seed(5)
teststrings <- unique(stri_rand_strings(100,20,pattern='[A-Z]'))
teststrings_1k <- stri_rand_strings(1000,20,pattern='[A-Z]')
# Time how long it takes to check 100 tested strings this way
# (I'll actually need to do 10,000 to 20,000 tested strings)
system.time(
for(tstring in teststrings) {
x <- check_substrings(tstring,substrings)
}
)
# user system elapsed
# 12.457 0.046 12.499
# At a rate of 12.4 seconds per 100 test strings, it would take 41.3 minutes to do 20k test strings
I know there are several functions like grepl()
and stri_detect()
that can solve the opposite problem (check a vector of multiple strings for a single pattern). Turning the problem sideways and checking all possible tested strings for each pattern, one pattern at a time speeds things up considerably:
system.time(
{
# Initial check for which substrings are in which test strings
res_matrix <- sapply(substrings,grepl,x=teststrings_1k,fixed=T)
# Turn result into a list of which substrings are in which test strings
rownames(res_matrix) <- teststrings_1k
res_list <- apply(res_matrix,1,function(x) { return(names(x)[x])})
}
)
# user system elapsed
# 3.641 0.227 3.904
# This is much better - at a rate of 3.9 seconds per 1000 test strings it will take
# take 1.3 minutes to do 20,000 test strings
I can use this second approach if I have to, but it would be a somewhat messy solution because I would have to pre-do this for all substrings and test strings together ahead of time (rather than letting the function that does a bunch of things to each particular test string do this check for that string on the fly).
Is there a better/more efficient way to do something like my check_substrings()
function in the first example (check one string for multiple substrings), or am I better off sticking with the second example despite the other complications it would cause?
2 Answers
Reset to default 5/../
stri_detect()
that can solve the opposite problem (check a vector of multiple strings for a single pattern)/../
Recycling in stringi
works both ways:
# many strings - one pattern
stringi::stri_detect_fixed(str = c("ABC", "BCD", "CDE"), pattern = "A")
#> [1] TRUE FALSE FALSE
# one string - many patterns
stringi::stri_detect_fixed(str = "ABC", pattern = c("A", "B", "D"))
#> [1] TRUE TRUE FALSE
So you could rewrite that check with stri_extract_first_fixed()
or stri_detect_fixed()
+ subsetting (tiny bit faster than omitting NA-values from stri_extract
results) to look something like this:
library(stringi)
check_substrings_stringi <- \(tested_string, substrings) substrings[stri_detect_fixed(tested_string, substrings)]
Single tested_string
and absolute time values:
bench::mark(
check_substrings(teststrings[1], substrings),
check_substrings_stringi(teststrings[1], substrings),
iterations = 10
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 check_substrings(teststrings[1… 226.98ms 236.64ms 4.15 3.75MB 26.1
#> 2 check_substrings_stringi(tests… 4.36ms 4.44ms 210. 709.75KB 0
Full sweep over teststrings
and relative execution times:
bench::mark(
check_substrings = lapply(teststrings, check_substrings, substrings = substrings),
check_substrings_stringi = lapply(teststrings, check_substrings_stringi, substrings = substrings),
relative = TRUE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 check_substrings 40.8 40.8 1 5.46 5.51
#> 2 check_substrings_stringi 1 1 40.8 1 1
I wrote a cpp-function that calculates a substring matrix in 5.3 seconds for 20.000 values of "testedstrings" and another string-vector which you called "substrings".
Under the hood, stri_extract_first_fixed()
executes a C++ function which makes it so fast. However, there is some overhead to this function which I guess comes from various checks. Also it checks only one string for a pattern vector which leads us to use sapply. We can write a much faster C++ function which does the same but without the need for sapply. We use OpenMP parallelization and calculate a logical matrix directly in C++.
How my function works
> strings <- c("apple", "banana", "cherry", "pineapple")
> patterns <- c("an", "apple", "berry")
> res <- string_detect_multiple(strings, patterns)
> rownames(res) <- patterns
> colnames(res) <- strings
> res
apple banana cherry pineapple
an FALSE TRUE FALSE FALSE
apple TRUE FALSE FALSE TRUE
berry FALSE FALSE FALSE FALSE
In your R-directory create a file "string_detect_multiple.cpp":
string_detect_multiple.cpp
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
LogicalMatrix string_detect_multiple(CharacterVector strings, CharacterVector patterns) {
const int n_strings = strings.size();
const int n_patterns = patterns.size();
LogicalMatrix results(n_patterns, n_strings);
// Pre-convert patterns to std::string to avoid repeated conversion
std::vector<std::string> pattern_strings(n_patterns);
for (int i = 0; i < n_patterns; ++i) {
pattern_strings[i] = std::string(patterns[i]);
}
// Pre-convert strings to avoid repeated conversion
std::vector<std::string> str_vec(n_strings);
for (int j = 0; j < n_strings; ++j) {
str_vec[j] = std::string(strings[j]);
}
#pragma omp parallel for collapse(2)
for (int i = 0; i < n_patterns; ++i) {
for (int j = 0; j < n_strings; ++j) {
results(i, j) = str_vec[j].find(pattern_strings[i]) != std::string::npos;
}
}
return results;
}
Import faster function to R
We then use this function in R making use of the package Rccp
. Create this R-Script in the same folder and load in the CPP-Script like shown:
library(stringi)
library(microbenchmark)
setwd(dirname(rstudioapi::getSourceEditorContext()$path)) # set the current script's location as working directory
# Test data
set.seed(5)
pat2 <- unique(c(stri_rand_strings(20000,6,pattern='[A-Z]'),stri_rand_strings(30000,7,pattern='[A-Z]'), stri_rand_strings(40000,8,pattern='[A-Z]')))
# Pre-generate random tested strings
test_string <- "FRXHCSNVYCHM"
str1 <- c("FRXHCSNVYCHM",stri_rand_strings(1000,20,pattern='[A-Z]'))
# get matrix with stri_detect_fixed
res_sdf <- sapply(str1, stri_detect_fixed, pat2)
# rownames(res_sdf_m) <- pat2 # optional name rows
# import our own C++ function :)
#install.packages("Rcpp")
library(Rcpp)
Sys.setenv("PKG_CXXFLAGS" = "-fopenmp")
Sys.setenv("PKG_LIBS" = "-fopenmp")
sourceCpp("string_detect_multiple.cpp")
# use
res_cpp <- string_detect_multiple(str1, pat2)
colnames(res_cpp) <- str1 # we don't really count that into the function time!! :)
# rownames(res_cpp) <- pat2 # optional name rows
identical(res_cpp, res_sdf)
# benchmark
bm <- microbenchmark::microbenchmark(
stringi = sapply(str1, stri_detect_fixed, pat2),
cpp = string_detect_multiple(str1, pat2),
times = 5
)
bm
boxplot(bm)
Benchmark Results
Unit: milliseconds
expr min lq mean median uq max neval cld
stringi 6543.8222 6609.0601 6627.8566 6623.9730 6675.8684 6686.5595 5 a
cpp 244.3382 244.6394 266.1575 251.5274 270.3496 319.9331 5 b
As you can see, that makes it 25 times faster than sapply(str1, stri_detect_fixed, pat2)
.
Single Test-String
Of course, you can use my function to mimic the behavior of your check_substrings()
but 112 times as fast and 2.677906 times faster than stringi:
check_subsstrings(test_string, pat2) # run your function
pat2[string_detect_multiple(test_string, pat2)] # run mine
本文标签: rA more efficient way to check which substrings are present in several test stringsStack Overflow
版权声明:本文标题:r - A more efficient way to check which substrings are present in several test strings? - Stack Overflow 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1738585341a2101398.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
grepl(paste(substrings,collapse='|'),tested_string)
will not work for this - I need to know which substrings were found, not just whether substrings were found. – TiredSquirrel Commented Jan 22 at 5:11