I'm using a JavaScript regex to parse a database field for image URLs and format them for output. So far, I have been using:
input = input.replace(/(https?:\/\/.*?\.(?:png|jpe?g|gif)(.*))(\w|$)/ig, "<br><img style='max-width:100%;overflow:hidden;' src='$1'>");
and it's been serving me well. All png, jpe?g and gif references get replaced by IMG tags, and the images show in the output stream as intended.
However, I've been thrown for a loop.
I've noticed that some URLs (notably those from the Facebook CDN, though I suppose others could be doing this as well) have a whole pile of "stuff" appended after the image type. Without that stuff the files are not available, and a missing-image icon gets produced. For example, this is a valid picture URL from fbcdn:
https://scontent-lga1-1.xx.fbcdn/hphotos-xtf1/v/t1.0-9/11147160_10156300867440377_5455334309678688318_n.jpg?oh=916e68ac2c908bbe15961825c373d6bc&oe=5606B6F4
Can someone suggest a change/improvement to the regex that would pick up the extra trailing characters? Or is another method of attack necessary?
(I personally like the global regEx as I can nail all of the instances in the stream at once... having to manually parse the stream is not something I would look forward to...)
Update: I understand there is some ambiguity in the request - hopefully this will clarify.
I need to pull out any image URL, regardless of the "stuff" after the image extension. It could be the first item in the text string, or the last, or embedded somewhere in the middle.
The processing is done in JavaScript. I am currently using this as my validity test. All images within it are valid URLs pulled from Google image search.
http://well-being.esdc.gc.ca/misme-iowb/auto/diagramme-chart/stg2/c_4_21_6_1_eng.png?20150508104424447 This is arbitrary text https://scontent-lga1-1.xx.fbcdn/hphotos-xtf1/v/t1.0-9/11147160_10156300867440377_5455334309678688318_n.jpg?oh=916e68ac2c908bbe15961825c373d6bc&oe=5606B6F4 this is arbitrary text
http://lh6.ggpht./-1Rua79J-EDo/TwuyZkHwcmI/AAAAAAAADvA/ENfg1TeayvU/type_catalog_error_thumb%25255B1%25255D.jpg?imgmax=800 this is arbitrary text http://image.slidesharecdn./top5thingstodoafteranaccident-140826163850-phpapp02/95/top-five-things-to-do-after-any-type-of-accident-causing-injury-1-638.jpg?cb=1409089267
Hopefully this sheds sufficient light on the types of variations I may encounter. (The only one I know for sure is the FBCDN; I'm basing the others on what else I've seen out there, so a generalized solution is needed, not one specific to FBCDN.)
Thank you to all that offer suggestions...
Asked Jun 2, 2015 at 4:59 by Scott Brown; edited Jun 2, 2015 at 14:47.
To catch the optional question mark and the rest, you would use (\?blabla)? but typing this almost sounds too easy. Is there a problem? – Mr Lister, Commented Jun 2, 2015 at 5:15
@MrLister - yes, the problem was I was staring at it too long and getting nowhere with my testing on regexpal... all the variations I tried were either too greedy, or not greedy enough. The FB URLs have some consistency, but I'm not sure I should be limiting myself to this. I've also seen a few (example not available, sorry) where there is sizing info appended, and others that seem to append a timestamp (for a cache?). Who knows what evil concoctions others have put in place. – Scott Brown, Commented Jun 2, 2015 at 12:15
1 Answer
Updated after the OP added more example input.
There are three issues with your attempt: the boundaries of your matches, the use of ".*", and the missing pattern for a legal postfix.
The dot-star notation is usually a bad idea in regex, which the article "Death to Dot Star!" illustrates quite well. Use a more restrictive character class instead; here I chose "\S*?", which is "any character that is not whitespace", matched lazily. If you try replacing that with ".*?" on regex101, you can see it fail to match properly (it includes a link that is not an image).
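The difference can be reproduced outside regex101. This is a minimal sketch, assuming a made-up input string on example.com (not taken from the question) in which a non-image link precedes an image link:

```javascript
// Hypothetical input: a non-image link followed by an image link.
const sample = "see http://example.com/page.html and http://example.com/pic.png here";

// ".*?" is allowed to cross whitespace, so the match starts at the first "http://"
// and keeps growing until it reaches the first image extension.
const dotStar = /https?:\/\/.*?\.(?:png|jpe?g|gif)/i;

// "\S*?" cannot cross whitespace, so the attempt at the first link fails
// and only the image URL matches.
const nonSpace = /https?:\/\/\S*?\.(?:png|jpe?g|gif)/i;

const dotStarMatch = sample.match(dotStar)[0];
const nonSpaceMatch = sample.match(nonSpace)[0];

console.log(dotStarMatch);  // http://example.com/page.html and http://example.com/pic.png
console.log(nonSpaceMatch); // http://example.com/pic.png
```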
Since everything is in the same string, boundaries must be defined for each match, and since the URLs are delimited by whitespace, the word boundary "\b" does the trick nicely. This also removes the need for the "(.*)" and "(\w|$)" parts.
The last thing you missed was the legal endings of the URL, and there are two solutions to this: either define what you think is plausible, covering most scenarios with no false positives, or accept anything and risk getting too many results.
Wrap it all together, and you are left with these two different approaches:
Solution 1 - define what is correct
\b(https?:\/\/\S*?\.(?:png|jpe?g|gif)
# allowed postfixes to the filetype
(?:\?(?:
# alphanumeric key/value pairs
(?:(?:[\w_-]+=[\w_-]+)(?:&[\w_-]+=[\w_-]+)*)|
# alphanumeric postfix
(?:[\w_-]+)
))?
)\b
Try it out on regex101
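JavaScript regex literals have no free-spacing mode, so the commented pattern above has to be collapsed onto one line before it can be passed to replace(). The following is a sketch of that, assuming a shortened, made-up test string on example.com; the names imageUrlStrict, strictText and strictHtml are only illustrative:

```javascript
// Solution 1 collapsed onto one line (comments and line breaks removed).
const imageUrlStrict = /\b(https?:\/\/\S*?\.(?:png|jpe?g|gif)(?:\?(?:(?:(?:[\w_-]+=[\w_-]+)(?:&[\w_-]+=[\w_-]+)*)|(?:[\w_-]+)))?)\b/ig;

// Hypothetical input: two image URLs with different postfixes, then a non-image link.
const strictText = "http://example.com/a.png?20150508104424447 some text " +
  "https://example.com/b.jpg?oh=916e68ac&oe=5606B6F4 more text " +
  "http://example.com/page.html end";

// With the g flag, match() returns every full match (here equal to group 1).
const strictMatches = strictText.match(imageUrlStrict);
console.log(strictMatches);

// Same replacement template as in the question; $1 is the captured URL.
const strictHtml = strictText.replace(
  imageUrlStrict,
  "<br><img style='max-width:100%;overflow:hidden;' src='$1'>"
);
console.log(strictHtml);
```

The non-image link is left untouched, because no image extension appears before the next whitespace.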
Solution 2 - use whitespace as the only factor
\b(https?:\/\/\S+(?:png|jpe?g|gif)\S*)\b
Try it out on regex101
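As a rough illustration of the trade-off, here is Solution 2 plugged into the replace() call from the question. The input strings are made up for the example, and imageUrlLoose, looseHtml and falsePositive are hypothetical names:

```javascript
// Solution 2: anything non-whitespace around an image extension counts.
const imageUrlLoose = /\b(https?:\/\/\S+(?:png|jpe?g|gif)\S*)\b/ig;

// Hypothetical input with an image URL embedded in arbitrary text.
const looseText = "intro http://example.com/c.jpg?imgmax=800 outro";

const looseHtml = looseText.replace(
  imageUrlLoose,
  "<br><img style='max-width:100%;overflow:hidden;' src='$1'>"
);
console.log(looseHtml);
// intro <br><img style='max-width:100%;overflow:hidden;' src='http://example.com/c.jpg?imgmax=800'> outro

// The "too many results" risk: the extension is not anchored to a dot,
// so a URL that merely contains "gif" somewhere is also picked up.
const falsePositive = "http://example.com/gifts/page.html".match(imageUrlLoose);
console.log(falsePositive); // [ 'http://example.com/gifts/page.html' ]
```

If such false positives matter for the data at hand, Solution 1 is the more conservative choice.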