python - Regex \b - Devanagari - Stack Overflow


Updated: 2025-03-07


I have a regex in Python which uses \b to split words. When I use it on Devanagari text, I notice that not all characters in the Unicode block are treated as word characters. Certain marks, such as vowel signs, appear to be treated as non-word characters. This seems wrong to me, because words in this script can end with these characters.

Is it possible to tell the regex to treat the entire block from U+0900 to U+097F as word characters?

See for example the following regex.

'(?<!\.)(a(?:bc|de)|zip|चाय|पानी)\b'

Here, the first four words abc, ade, zip and चाय are detected at proper word boundaries. The word पानी, however, ends with the vowel sign ी, and the regex does not see a word boundary after it, although ideally it should.

>>> import re
>>> re.findall(r"(?<!\.)(a(?:bc|de)|zip|चाय|पानी)\b", 'This is abc, ade, चाय, पानी  and abca')
['abc', 'ade', 'चाय']

Can I change this regex behavior, and if so, how?


asked Feb 22 at 20:37, edited Feb 22 at 21:36, by ACBlue

The correct answer here depends a lot on the RegExp implementation/flavor you're using - please tag the language. A minimal reproducible example would also help. – esqew Commented Feb 22 at 20:42
I guess you can use re.findall(r'[\w\u0900-\u097f]+|[^\w\u0900-\u097f]+', text) – Wiktor Stribiżew Commented Feb 22 at 20:56
Please see the regex statement for Python that I have posted with an explanation of the problem that I am facing. – ACBlue Commented Feb 22 at 21:16
You mention Python in a comment, if that is correct then please edit the question to add the python tag and also show some Python code that demonstrates the problem. – AdrianHHH Commented Feb 22 at 21:26
Please see if the changes I have made are what you had in mind. – ACBlue Commented Feb 22 at 21:37
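The character class suggested in the comment above can be tried out as a tokenizer. The following is a minimal sketch, not a definitive solution: the sample text and the names TOKEN, tokens and words are made up for illustration, and it assumes that every code point in U+0900 to U+097F should count as a word character.

```python
import re

# Assumption: "word" characters are whatever \w matches plus the whole
# Devanagari block U+0900-U+097F. Everything else is a separator.
TOKEN = re.compile(r"[\w\u0900-\u097f]+|[^\w\u0900-\u097f]+")

text = "चाय, पानी and abc"  # illustrative sample, not from the question

# Alternating runs of word characters and separator characters.
tokens = TOKEN.findall(text)

# Keep only the runs that start with a word character.
words = [t for t in tokens if re.match(r"[\w\u0900-\u097f]", t)]

print(tokens)
print(words)
```

With this class the vowel signs stay attached to their consonants, so पानी comes out as a single token instead of being broken at each vowel sign.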


1 Answer


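The problem with the pattern is that \b treats U+093E (DEVANAGARI VOWEL SIGN AA) and U+0940 (DEVANAGARI VOWEL SIGN II) as non-word characters, so the boundaries in the word पानी occur after each consonant, before the dependent vowel. Because the final ी is a non-word character and so is the space that follows it, there is no boundary at the end of the word and the trailing \b fails.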

When working with Devanagari text in Python regular expressions, it is critical to understand that the re module's definitions of \w and \b differ from Unicode's definitions.
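This can be checked directly with the standard library. The sketch below only inspects the four code points of पानी (written with escapes so the code points are explicit); the names word, report, is_word and pieces are illustrative.

```python
import re
import unicodedata

word = "\u092a\u093e\u0928\u0940"  # पानी

# For each code point: its number, General_Category, Unicode name,
# and whether re's \w matches it.
report = [
    (
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        unicodedata.name(ch),
        re.fullmatch(r"\w", ch) is not None,
    )
    for ch in word
]
for row in report:
    print(*row)

is_word = [row[3] for row in report]

# \w+ therefore breaks the word at every vowel sign.
pieces = re.findall(r"\w+", word)
```

The two consonants have category Lo and match \w, while the two vowel signs have category Mc (spacing combining mark) and do not, which is why re sees boundaries inside the word but none at its end.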

The easiest fix is to use the third-party regex module instead. Unlike re, its \w and \b treat the dependent vowel signs as part of a word, so the same pattern finds the boundary after पानी:

import regex as re
re.findall(r"(?<!\.)(a(?:bc|de)|zip|चाय|पानी)\b", 'This is abc, ade, चाय, पानी  and abca')
# ['abc', 'ade', 'चाय', 'पानी']
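If installing a third-party package is not an option, something similar can be sketched with the standard re module by replacing the trailing \b with a negative lookahead on a custom character class. This is only one possible approach, under the assumption stated in the question that the whole block U+0900 to U+097F should count as word characters; WORD_CHAR, pattern and matches are illustrative names.

```python
import re

# Assumption: every code point in U+0900-U+097F counts as a word character,
# in addition to whatever \w already matches.
WORD_CHAR = r"[\w\u0900-\u097F]"

# Same alternatives as in the question. The trailing \b is replaced by a
# negative lookahead meaning "not followed by a word character".
pattern = re.compile(
    r"(?<!\.)(a(?:bc|de)|zip|चाय|पानी)(?!" + WORD_CHAR + r")"
)

text = "This is abc, ade, चाय, पानी  and abca"
matches = pattern.findall(text)
print(matches)
```

One caveat with treating the whole block as word characters: it also contains the danda (U+0964) and double danda (U+0965), which are punctuation, so a word followed directly by a danda would not match. A narrower class that leaves those two code points out may be a better fit for running text.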
