admin管理员组文章数量:1278854
I have a list of all the words in the English dictionary (270,000+ words) stored in a variable called theList
.
I have a scrambled word word
that I want to unscramble by matching against the word list.
Initially, I thought that the following code would do the trick, but it doesn't work so well.
var theList; // Contains all the words in the English dictionary.
var word = "iexospensr"; // The word I want to unscramble.
var matches = word.match(new RegExp("^["+word+"]{"+word.length+"}$", "gim"));
I'd have expected "EXPRESSION" as the unscrambled result, but I instead get a lot more results (listed below).
EERINESSES,EXPRESSERS,EXPRESSION,IRONNESSES,ISOSPORIES,NONPERSONS,NONPROSSES,NOSINESSES,OPENNESSES,OPPRESSION,OPPRESSORS,ORNERINESS,PENSIEROSO,PENSIONEER,PENSIONERS,PEPPERONIS,PERSIENNES,PERSONISES,PIPINESSES,PIXINESSES,POORNESSES,PORINESSES,POSSESSION,POSSESSORS,PREEXPOSES,PREPOSSESS,PREPPINESS,PRESENSION,PRIORESSES,PRISSINESS,PROPENSION,PROPERNESS,REINSPIRES,REPRESSERS,REPRESSION,REPRESSORS,RESERPINES,RESPONSERS,RESPONSORS,RIPENESSES,ROPINESSES,ROSINESSES,SERENENESS,SEXINESSES,SIXPENNIES,SNIPPINESS,SORENESSES,SPINNERIES
Perhaps, if I could find a way to tell the regular expression to consider each letter in the string word
just once irrespective of the order of the letters. So the end result would be an array of bination of those letters, and not permutations (what I have now).
Any help would be appreciated.
EDIT: I think the way to go would be: 1. finding all the binations of the scrambled word 2. matching them against the word list to check for validity
If you have a better solution (performance-wise), it'd help.
The best solution to this problem seems to be reordering the anagram by alphabet, and the entire word list and matching the word against each item in the list.
Here's the code:
var textList; // the entire dictionary
var list = textList.match(/^.*$/gim);
var sortedList = [];
list.forEach(function(element, index, array) {
sortedList[index] = element.split("").sort().join("");
});
function unscramble(word)
{
word = word.toUpperCase().split("").sort().join("");
var matches = [];
for (var i = 0; i < list.length; i++) {
if (word.indexOf(sortedList[i]) >= 0) {
if (!matches[list[i].length])
matches[list[i].length] = [];
matches[list[i].length].push(list[i]);
}
}
return matches;
}
I have a list of all the words in the English dictionary (270,000+ words) stored in a variable called theList
.
I have a scrambled word word
that I want to unscramble by matching against the word list.
Initially, I thought that the following code would do the trick, but it doesn't work so well.
var theList; // Contains all the words in the English dictionary.
var word = "iexospensr"; // The word I want to unscramble.
var matches = word.match(new RegExp("^["+word+"]{"+word.length+"}$", "gim"));
I'd have expected "EXPRESSION" as the unscrambled result, but I instead get a lot more results (listed below).
EERINESSES,EXPRESSERS,EXPRESSION,IRONNESSES,ISOSPORIES,NONPERSONS,NONPROSSES,NOSINESSES,OPENNESSES,OPPRESSION,OPPRESSORS,ORNERINESS,PENSIEROSO,PENSIONEER,PENSIONERS,PEPPERONIS,PERSIENNES,PERSONISES,PIPINESSES,PIXINESSES,POORNESSES,PORINESSES,POSSESSION,POSSESSORS,PREEXPOSES,PREPOSSESS,PREPPINESS,PRESENSION,PRIORESSES,PRISSINESS,PROPENSION,PROPERNESS,REINSPIRES,REPRESSERS,REPRESSION,REPRESSORS,RESERPINES,RESPONSERS,RESPONSORS,RIPENESSES,ROPINESSES,ROSINESSES,SERENENESS,SEXINESSES,SIXPENNIES,SNIPPINESS,SORENESSES,SPINNERIES
Perhaps, if I could find a way to tell the regular expression to consider each letter in the string word
just once irrespective of the order of the letters. So the end result would be an array of bination of those letters, and not permutations (what I have now).
Any help would be appreciated.
EDIT: I think the way to go would be: 1. finding all the binations of the scrambled word 2. matching them against the word list to check for validity
If you have a better solution (performance-wise), it'd help.
The best solution to this problem seems to be reordering the anagram by alphabet, and the entire word list and matching the word against each item in the list.
Here's the code:
var textList; // the entire dictionary
var list = textList.match(/^.*$/gim);
var sortedList = [];
list.forEach(function(element, index, array) {
sortedList[index] = element.split("").sort().join("");
});
function unscramble(word)
{
word = word.toUpperCase().split("").sort().join("");
var matches = [];
for (var i = 0; i < list.length; i++) {
if (word.indexOf(sortedList[i]) >= 0) {
if (!matches[list[i].length])
matches[list[i].length] = [];
matches[list[i].length].push(list[i]);
}
}
return matches;
}
Share
Improve this question
edited May 8, 2011 at 16:04
Himanshu
asked May 7, 2011 at 22:00
HimanshuHimanshu
3993 silver badges14 bronze badges
3
- How big are the words you wanto to scramble? How big are the strings they will be matched against? Will your regex be used once or many times? – hugomg Commented May 7, 2011 at 22:14
- @missingno The words can be of variable length (2-15 characters). The string matched against contains all the words in the English dictionary, so that's around 270,000+ words. – Himanshu Commented May 7, 2011 at 22:21
-
The entire point of pre-sorting all the words is that you can have a hash that has "zero" lookup time instead of having to iterate through an array. I'm not sure what you're doing at the beginning there with the
.match(/^.*$/gim);
, but here's an example assuming you have your words in an array: jsfiddle/RYHAh – user578895 Commented May 8, 2011 at 16:24
8 Answers
Reset to default 2I think a better approach would not use regular expressions. Instead it would test each member of the list against your scrambled word, by walking the characters of the word, and looking if that character exists in the word in the list. Each time it finds a character, it can mark that character as "already used".
Here's something to mark a character position as "used":
function checkUsed(o, which) {
if (o[which] != null) {
o[which] = 1;
return false;
}
return true;
}
var usedMap = [];
if (checkUsed(usedMap, 5) == false) {
...
}
Don't use regular expressions for this, there are simpler ways, if you split your dictionary into words instead of doing a mega large string:
A scrambled word is defined by the frequency of letter occurence:
//WARNING, untested code alphabet = 'qwertyuiopasdfghjklzxcvbnm'; function empty_frequences(){ var freqs = {}; var i=; for(i=0; i<alphabet.length; i++){ freqs[alphabet[i]] = 0; } return freqs; } function frequences(str){ var freqs = empty_frequences(); var i; for(i=0; i<str.length; i++){ freqs[str[i]] += 1; } }
Use this fact to find all matches in your dictionary
function matcher(word){ //returns a function that matchs against this word var word_freqs = frequences(word); function do_the_match(word2){ var freqs2 = frequences(word2); var i, c; for(i=0; i<alphabet.length; i++){ c = alphabet[i] if(freqs[c] > freqs2[c]){return false;} //change > to != to allow only strict anagrams } return true; } return do_the_match; } function main(word, dict){ var mf = matcher(word); var i, matcheds = []; for(i=0; i<dict.length; i++){ if(mf(dict[i])){ matcheds.push(dict[i]); } } return matcheds; }
Here's an idea for you. Constructing the initial lookup data will be slow, but finding a match should be simple. However, you should only build the dictionary once and load it! Reputing every time is a waste of time.
I am assuming you're only using the Latin alphabet (i.e. what English is written in), everything is case insensitive and you use no numerals...etc. so you have only characters A-Z.
For each word in your dictionary, build a "hash" based on the counts of each letters occurrence. The hash-array will have 26 positions. Each position will the count of the times a specific character for that position was encountered. (e.g. A is in the first array position/index 0; Z in 26th/index 25)
To cheat a little you can store the results as a a pair of strings. Few, if any, words have 9 repetitions of a single letter, so a single "digit" per letter should work fine. For example: "the" bees "00001001000000000001000000"; "hat" bees "10000001000000000001000000"; "that" bees "10000001000000000002000000".Load the preputed dictionary. Use the hashed value as a key in a key-value pair, and have a collection as the value. Append each word with the same key to the end of the collection for that key.
Perform the same hash algorithm on the scrambled word, and look up the key. Output the collection referred to by the key.
EDIT 1: If building a dictionary up front is not viable, then use a variation on this where you create an associative array/dictionary with the letter as the key, and the count of the times it's found as the value. Before puting this, pare lengths, if the strings are of different lengths, then don't bother paring as you know they mismatch. After puting these arrays for the source (scrambled) and the target (a possible match) pare the keys and values in your associative array.
EDIT 2: Pretty much along the same lines as above, sort the characters within the string for both source and target strings.
Just for the fun of it:
> var words = 'exceptional extraordinary retinas retains retsina antsier nastier retrains starfish';
> words.match(/\b([aeinrst])(?!\1)([aeinrst])(?!\1|\2)([aeinrst])(?!\1|\2|\3)([aeinrst])(?!\1|\2|\3|\4)([aeinrst])(?!\1|\2|\3|\4|\5)([aeinrst])(?!\1|\2|\3|\4|\5|\6)([aeinrst])\b/ig)
[ 'retinas', 'retains', 'retsina', 'antsier', 'nastier' ]
Note that I can't quite figure out how to get the above method to work if you have two of the same letter, e.g. I can't match "boo" :)
If lookup has to be fast, and buildup at the start is no big problem then using a Trie is the most efficient solution I know of. I could explain it, but the WP article is actually very good and gives code samples.
The solution using histograms is probably the best if you're interested primarily in whether 2 given strings match.
I don't know if a regular expression is the best tool for this job. The regex you're building will end up being
"^[iexospensr]{10}$"
which matches any 10-letter word made up of any of the letters in the character class [iexospensr]
.
Perhaps, if I could find a way to tell the regular expression to consider each letter in the string word just once irrespective of the order of the letters.
You could do that with word.length
different regular expressions, but some of your letters repeat. You'd get closer if you sort the letters in the scrambled word, then search for words that have the right number of repetitions of each letter. For example, two e's, two s's, one x, etc.
Regular expressions although powerful aren't the solution for everything.
In some cases like this it's just better to build your own solution: Start by removing all words that don't match the required length, then start paring letters.
Depending on the length of your dictionary you can build in different optimizations.
I should have seen this Q&A long time ago. I have been doing research on this and I want to share my solution to the problem.
Solution: Step 1: Sort alphabetically the scrambled word (Note: or even the scrambled page of the book for that matter)
Step 2: Build your WORD or PAGE list with an additional column for the sorted word (Note: You can hash this column if you want)
Step 3: Do your matching process. This should find scrambled word from the lookup list.
I was doing some research on finding arbitrary scrambled no. of words in a page and creating a list containing those scrambled words given the scrambled letters.
版权声明:本文标题:regex - Regular Expression (Javascript) - Take a scrambled word and find an unscrambled match - Stack Overflow 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1741270987a2369291.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
发表评论