I'm trying to build a regular expression that places a limit on the input length, but not all characters count equally toward this length. I'll put the rationale at the bottom of the question. As a simple example, let's limit the maximum length to 12 and allow only a and b, but b counts for 3 characters.
Allowed are:
- aa (anything less than 12 is fine).
- aaaaaaaaaaaa (exactly 12 is fine).
- aaabaaab (6 + 2 * 3 = 12, which is fine).
- abaaaaab (still 6 + 2 * 3 = 12).
Disallowed is:
- aaaaaaaaaaaaa (13 a's).
- bbbba (1 + 4 * 3 = 13, which is too much).
- baaaaaaab (7 + 2 * 3 = 13, which is too much).
I've made an attempt that gets fairly close:
^(a{0,3}|b){0,4}$
This matches on up to 4 clusters that may consist of 0-3 a's or one b.
However, it fails to match on my last positive example: abaaaaab, because that forces the first cluster to be the single a at the beginning, consumes a second cluster for the b, then leaves only 2 more clusters for the rest, aaaaab, which is too long.
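The counting rule and the attempt can be checked side by side. The following is only a minimal sketch; the names weightedLength and WEIGHTS are made up for illustration and are not part of any API.

```javascript
// Hypothetical helper: sums per-character weights ('a' counts 1, 'b' counts 3).
const WEIGHTS = { a: 1, b: 3 };

function weightedLength(str) {
  let total = 0;
  for (const ch of str) {
    if (!(ch in WEIGHTS)) return Infinity; // any other character is not allowed
    total += WEIGHTS[ch];
  }
  return total;
}

// The attempted pattern from the question.
const attempt = /^(a{0,3}|b){0,4}$/;

const samples = ['aa', 'aaaaaaaaaaaa', 'aaabaaab', 'abaaaaab',
                 'aaaaaaaaaaaaa', 'bbbba', 'baaaaaaab'];
for (const s of samples) {
  const expected = weightedLength(s) <= 12;
  console.log(s, 'expected:', expected, 'attempt:', attempt.test(s));
}
// Of these samples, only 'abaaaaab' disagrees: expected true, attempt false.
```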
Constraints
- Must run in JavaScript. This regex is supplied to Qt, which apparently uses JavaScript's syntax.
- Doesn't really need to be fast. In the end it'll only be applied to strings of up to 40 characters. I hope it validates within 50ms or so, but slightly slower is acceptable.
Rationale
Why do I need to do this with a regular expression?
It's for a user interface in Qt via PyQt and QML. The user can type a name in a text field here for a profile. This profile name is url-encoded (special characters are replaced by %XX), and then saved on the user's file system. We encounter problems when the user types a lot of special characters, such as Chinese, which then encode to a very long file name. Turns out that at somewhere like 17 characters, this file name becomes too long for some file systems. The URL-encoding encodes as UTF-8, which has up to 4 bytes per character, resulting in up to 12 characters in the file name (as each of these gets percent-encoded).
16 characters is too short for profile names. Even some of our default names exceed that. We need a variable limit based on these special characters.
Qt normally allows you to specify a Validator to determine which values are acceptable in a text box. We tried implementing such a validator, but that resulted in a segfault upstream, due to a bug in PyQt. It can't seem to handle custom Validator implementations at the moment. However, PyQt also exposes three built-in validators. Two apply only to numbers. The third is a regex validator that allows you to put a regular expression that matches all valid strings. Hence the need for this regular expression.
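To illustrate where the per-character cost comes from, here is a small sketch of how percent-encoding inflates a name. It uses JavaScript's encodeURIComponent as a stand-in for whatever URL-encoding routine the application really calls (that substitution is an assumption), and encodedLength is a name invented for this example.

```javascript
// Illustration only: encodeURIComponent stands in for the application's own
// URL-encoding step. Each UTF-8 byte of a special character becomes "%XX",
// that is, 3 characters in the resulting file name.
function encodedLength(str) {
  return encodeURIComponent(str).length;
}

console.log(encodedLength('a'));         // 1  (ASCII letter, left as-is)
console.log(encodedLength('\u00e9'));    // 6  (2 UTF-8 bytes)
console.log(encodedLength('敗'));        // 9  (3 UTF-8 bytes)
console.log(encodedLength('\u{1F600}')); // 12 (4 UTF-8 bytes)
```

In the simplified a/b example from the question, b plays the role of such an expensive character.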
asked Oct 28, 2016 at 1:32 by Ghostkeeper

Comments:
- I can make this regex without much trouble, but I feel dirty doing it. I've made several attempts, but can't make a good, generic solution that can be expanded for longer strings (length 13 for example) or higher values (b=4 for example). – Addison, Oct 28, 2016 at 4:29
- Could you not check the length of the submitted name (after url-encoding) and then decide to accept or reject it? Seems the simplest solution. – A. L, Oct 28, 2016 at 5:00
- I'm bookmarking this question as my point of reference on how to ask a good regex question. Too many regex questions out there are sloppily written, unspecific and unclear. This is perfect. – Tim Pietzcker, Oct 28, 2016 at 5:28
- @A.Lau That's impossible. I've tried a solution where I could write my own validator via PyQt, but that resulted in a segfault. We traced that to a bug in PyQt and submitted a chreq for Riverbank Solutions. I'm therefore limited to using one of their built-in validators. The only validator that applies to other stuff than numbers is the RegExpValidator. – Ghostkeeper, Oct 28, 2016 at 8:04
3 Answers
There is no real straightforward way to do this, given the limitations of regexp. You're going to have to test for all combinations, such as thirteen b with up to one a, twelve b with up to four a, and so on (these counts are for a limit of 40). We will build a little program to generate these for us. The basic format for testing for up to four a will be
/^(?=([^a]*a){0,4}[^a]*$)/
We'll write a little routine to create these lookaheads for us, given some letter and a minimum and maximum number of occurrences:
function matchLetter(c, m, n) {
return `(?=([^${c}]*${c}){${m},${n}}[^${c}]*$)`;
}
> matchLetter('a', 0, 4)
< "(?=([^a]*a){0,4}[^a]*$)"
We can combine these to test for three b with up to three a:
/^(?=([^b]*b){3}[^b]*$)(?=([^a]*a){0,3}[^a]*$)/
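As a quick sanity check of how these lookaheads behave, the sketch below runs the single-letter and the combined pattern against a few inputs. The variable names are invented for the example. Each lookahead counts only its own letter and is indifferent to every other character, which is why two of them can be stacked at the same position.

```javascript
// Same builder as in the answer: a lookahead allowing between m and n
// occurrences of the letter c anywhere in the rest of the string.
function matchLetter(c, m, n) {
  return `(?=([^${c}]*${c}){${m},${n}}[^${c}]*$)`;
}

const upToFourA = new RegExp('^' + matchLetter('a', 0, 4));
console.log(upToFourA.test('aabaa'));  // true  (four a)
console.log(upToFourA.test('aaaaa'));  // false (five a)
console.log(upToFourA.test('bbbbbb')); // true  (no a at all; b is not counted)

// Exactly three b AND at most three a, both checked from the start.
const threeBUpToThreeA = new RegExp(
  '^' + matchLetter('b', 3, 3) + matchLetter('a', 0, 3));
console.log(threeBUpToThreeA.test('ababab'));  // true  (3 b, 3 a)
console.log(threeBUpToThreeA.test('abababa')); // false (4 a)
console.log(threeBUpToThreeA.test('abab'));    // false (only 2 b)
```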
We will write a function to create such combined lookaheads which matches exactly m occurrences of c1 and up to n occurrences of c2:
function matchTwoLetters(c1, m, c2, n) {
return matchLetter(c1, m, m) + matchLetter(c2, 0, n);
}
We can use this to match exactly twelve b and up to four a, for a total of forty or less:
> matchTwoLetters('b', 12, 'a', 4)
< "(?=([^b]*b){12,12}[^b]*$)(?=([^a]*a){0,4}[^a]*$)"
It remains to simply create versions of this for each count of b, and glom them together (for the case of a max count of 12):
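function makeRegExp() {
  const res = [];
  // One alternative per possible number of b (0 to 4); each b uses up 3 of the 12.
  for (let bs = 0; bs <= 4; bs++)
    res.push(matchTwoLetters('b', bs, 'a', 12 - bs*3));
  // Join the alternatives; the resulting pattern consists of lookaheads only.
  return new RegExp(`^(${res.join('|')})`);
}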
> makeRegExp()
< "^((?=([^b]*b){0,0}[^b]*$)(?=([^a]*a){0,12}[^a]*$)|(?=([^b]*b){1,1}[^b]*$)(?=([^a]*a){0,9}[^a]*$)|(?=([^b]*b){2,2}[^b]*$)(?=([^a]*a){0,6}[^a]*$)|(?=([^b]*b){3,3}[^b]*$)(?=([^a]*a){0,3}[^a]*$)|(?=([^b]*b){4,4}[^b]*$)(?=([^a]*a){0,0}[^a]*$))"
Now you can do the test with
makeRegExp().test("baabaaa");
For the case of length=40, the regexp is 679 characters long. A very rough benchmark shows that it executes in under a microsecond.
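The same construction extends to other limits and weights. The sketch below is one possible parameterised version, not the answer's own code: makeWeightedRegExp, allStrings and their parameters are names invented here, and it assumes both characters are plain letters that need no escaping inside a regular expression. It also appends a character class that consumes the input, on the assumption that some consumers (a validator, for instance) require the pattern to match the whole string rather than merely succeed at position 0. A brute-force loop then compares the pattern with the counting rule.

```javascript
// Same lookahead builder as in the answer.
function matchLetter(c, m, n) {
  return `(?=([^${c}]*${c}){${m},${n}}[^${c}]*$)`;
}

// Hypothetical generalisation: `limit` is the maximum weighted length,
// `heavy` counts for `weight`, `light` counts for 1.
function makeWeightedRegExp(limit, light, heavy, weight) {
  const alternatives = [];
  for (let k = 0; k * weight <= limit; k++) {
    alternatives.push(
      matchLetter(heavy, k, k) + matchLetter(light, 0, limit - k * weight));
  }
  // The trailing class consumes the input and rejects any other character.
  return new RegExp(`^(?:${alternatives.join('|')})[${light}${heavy}]*$`);
}

// Yields every string over `alphabet` with length 0..maxLen.
function* allStrings(alphabet, maxLen) {
  let layer = [''];
  for (let len = 0; len <= maxLen; len++) {
    yield* layer;
    layer = layer.flatMap(s => [...alphabet].map(c => s + c));
  }
}

// Compare the generated pattern with the rule for every a/b string up to 13.
const re = makeWeightedRegExp(12, 'a', 'b', 3);
let mismatches = 0;
for (const s of allStrings('ab', 13)) {
  const weight = [...s].reduce((sum, c) => sum + (c === 'b' ? 3 : 1), 0);
  if (re.test(s) !== (weight <= 12)) mismatches++;
}
console.log(mismatches); // 0
```

Calling makeWeightedRegExp(13, 'a', 'b', 4) covers the "length 13, b=4" variation raised in the comments.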
If you want to count bytes when multibyte encoding is present, you can use this function:
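function bytesLength(str) {
  // Start from the UTF-16 length and add the extra bytes UTF-8 needs.
  var s = str.length;
  for (var i = s-1; i > -1; i--) {
    var code = str.charCodeAt(i);
    if (code > 0x7f && code <= 0x7ff) {s++;}           // 2-byte sequence
    else if (code > 0x7ff && code <= 0xffff) {s+=2;}   // 3-byte sequence
    // Low surrogate: skip its high surrogate so the pair totals 4 bytes.
    if (code >= 0xDC00 && code <= 0xDFFF) {i--;}
  }
  return s;
}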
console.log(bytesLength('敗')); // length 3
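For comparison, the standard TextEncoder API (available in current browsers and in Node.js) gives the UTF-8 byte count directly. The sketch below cross-checks it against bytesLength on a few well-formed strings; utf8Length is a name chosen for this example, and strings containing unpaired surrogates are deliberately left out because the two approaches may treat them differently.

```javascript
// bytesLength as given in the answer.
function bytesLength(str) {
  var s = str.length;
  for (var i = s - 1; i > -1; i--) {
    var code = str.charCodeAt(i);
    if (code > 0x7f && code <= 0x7ff) { s++; }
    else if (code > 0x7ff && code <= 0xffff) { s += 2; }
    if (code >= 0xDC00 && code <= 0xDFFF) { i--; }
  }
  return s;
}

// Alternative using the built-in encoder.
function utf8Length(str) {
  return new TextEncoder().encode(str).length;
}

for (const s of ['abc', '\u00e9', '敗', '\u{1F600}']) {
  console.log(s, bytesLength(s), utf8Length(s));
}
// Byte counts for the four samples: 3, 2, 3 and 4, identical for both functions.
```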
Try using something like this:
^((a{1,3}|b){1,4}|(a{1,4}|a?b|ba){1,3}|((a{2,3}|b){2}|aaba|abaa){2})$
Example: https://regex101.com/r/yTTiEX/6
This breaks it up into the logical possibilities:
- 4 parts, each with a value up to 3.
- 3 parts, each with a value up to 4.
- 2 parts, each with a value up to 6.
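A hand-written pattern like this is easy to spot-check against the counting rule. The sketch below does so for a few inputs; candidate and weight are names invented for the example. On these inputs the pattern handles the examples from the question, but it appears to reject the empty string and aabbaaaa (6 + 2 * 3 = 12), so it may not cover every valid input.

```javascript
// The pattern from this answer.
const candidate =
  /^((a{1,3}|b){1,4}|(a{1,4}|a?b|ba){1,3}|((a{2,3}|b){2}|aaba|abaa){2})$/;

// Reference rule from the question: a counts 1, b counts 3, limit 12.
const weight = s => [...s].reduce((sum, c) => sum + (c === 'b' ? 3 : 1), 0);

for (const s of ['abaaaaab', 'baaaaaaab', 'aabbaaaa', '']) {
  console.log(JSON.stringify(s), 'weight:', weight(s),
              'allowed:', weight(s) <= 12, 'candidate:', candidate.test(s));
}
// 'abaaaaab'  is allowed (12) and matches.
// 'baaaaaaab' is not allowed (13) and does not match.
// 'aabbaaaa'  is allowed (12) but does not match.
// ''          is allowed (0) but does not match.
```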