I want to use a whitelist of tags, attributes, and values to sanitize an HTML string before I place it in the DOM. Can I safely construct a DOM element and traverse it to implement the whitelist filter, assuming that no malicious JavaScript can execute until I append the element to the document? Are there pitfalls to this approach?
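
For concreteness, the kind of filter described above could be sketched roughly as follows. The whitelist contents and the names `WHITELIST`, `isAllowedTag`, `isAllowedAttribute`, and `filterTree` are illustrative assumptions, not a vetted sanitizer; text and comment nodes are left untouched.

```javascript
// Hypothetical whitelist: allowed tag -> allowed attribute -> test for its value.
const WHITELIST = {
  a: { href: (value) => /^https?:\/\//i.test(value) },
  b: {},
  i: {},
  p: {},
};

// hasOwnProperty guards against inherited keys such as "constructor".
function isAllowedTag(tagName, whitelist = WHITELIST) {
  return Object.prototype.hasOwnProperty.call(whitelist, tagName.toLowerCase());
}

function isAllowedAttribute(tagName, attrName, value, whitelist = WHITELIST) {
  if (!isAllowedTag(tagName, whitelist)) return false;
  const rules = whitelist[tagName.toLowerCase()];
  const name = attrName.toLowerCase();
  return Object.prototype.hasOwnProperty.call(rules, name) && rules[name](value) === true;
}

// Filters the element descendants of `root` in place and returns `root`.
function filterTree(root, whitelist = WHITELIST) {
  // Copy the live collections first, because the tree is mutated while iterating.
  for (const el of Array.from(root.children)) {
    if (!isAllowedTag(el.tagName, whitelist)) {
      el.remove(); // drop the disallowed element together with its subtree
      continue;
    }
    for (const attr of Array.from(el.attributes)) {
      if (!isAllowedAttribute(el.tagName, attr.name, attr.value, whitelist)) {
        el.removeAttribute(attr.name);
      }
    }
    filterTree(el, whitelist);
  }
  return root;
}
```

In this sketch anything not explicitly listed is dropped, so an `onclick` attribute or a `javascript:` URL in `href` is removed without needing a blacklist.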


asked Feb 13, 2014 at 0:37 by Piwakawaka
  • I haven't used the library in the accepted answer myself, but you might check out stackoverflow.com/questions/5575559/… , with the help pages of perhaps most relevance: owasp/index.php/… and code.google.com/p/owasp-esapi-js/wiki/MitigatingDOMBasedXSS – Brett Zamir, Feb 13, 2014 at 0:52
  • The advantage of this over HTMLPurifier, etc. would be that it can run dynamically on the client side without round-tripping to the server. – Brett Zamir, Feb 13, 2014 at 0:54
  • As for the whitelist that you need: owasp/index.php/… does mention one JS library, github.com/ecto/bleach, and perhaps it could be adapted for client-side usage, but it appears to rely on regular expressions, which I would not trust to do the job very well (e.g., it doesn't currently match newlines within tags). – Brett Zamir, Feb 13, 2014 at 1:10
  • I also found github.com/gbirke/Sanitize.js. I like both answers here. What is the protocol for choosing the correct answer? – Piwakawaka, Feb 13, 2014 at 17:45
  • Haven't examined it, but its approach definitely sounds like the way to go. As far as liking both answers, do you mean liking both libraries or liking both of our Stack Overflow answers? If the latter, no worries. Normally, it's whatever you liked the best (I like to pick the first poster if the answers were similar). Once you have enough reputation, you can also up-vote other answers. – Brett Zamir, Feb 13, 2014 at 22:55

3 Answers

It doesn't appear that anything will execute until you insert into the document, as per @rvighne's answer, but there are at least these (unusual) exceptions (tested in FF 27.0):

var userInput = '<a href="http://example.com" onclick="alert(\'boo!\')">Link<\/a>';
var el = document.createElement('div');
el.innerHTML = userInput;
el.addEventListener("click", function(e) {
    if (e.target.nodeName.toLowerCase() === 'a') {
        alert("I will also cause side effects; I shouldn't run on the wrong link!");
    }
});
el.getElementsByTagName('a')[0].click(); // Alerts "boo!" and "I will also cause side effects; I shouldn't run on the wrong link!"

...or...

var userInput = '<a href="http://example.com" onclick="alert(\'boo!\')">Link<\/a>';
var el = document.createElement('div');
el.innerHTML = userInput;
el.addEventListener("cat", function(e) { this.getElementsByTagName('a')[0].click(); });
var event = new CustomEvent("cat", {"detail":{}});
el.dispatchEvent(event); // Alerts "boo!"

...or... (setUserData is deprecated, but it still worked in the tested version):

var userInput = '<a href="http://example.com" onclick="alert(\'boo!\')">Link<\/a>';
var span = document.createElement('span');
span.innerHTML = userInput;
span.setUserData('key', 10, {handle: function (n1, n2, n3, src) {
    src.getElementsByTagName('a')[0].click();
}});
var div = document.createElement('div');
div.appendChild(span);
span.cloneNode(); // Alerts "boo!"
var imprt = document.importNode(span, true); // Alerts "boo!"
var adopt = document.adoptNode(span); // Alerts "boo!" (adoptNode takes a single argument)

...or during iteration...

var userInput = '<a href="http://example.com" onclick="alert(\'Boo!\');">Link</a>';
var span = document.createElement('span');
span.innerHTML = userInput;
var treeWalker = document.createTreeWalker(
  span,
  NodeFilter.SHOW_ELEMENT,
  { acceptNode: function(node) { node.click(); } },
  false
);
var nodeList = [];
while(treeWalker.nextNode()) nodeList.push(treeWalker.currentNode); // Alerts 'Boo!'

But without these kinds of (unusual) event interactions, merely building the DOM would not, as far as I have been able to detect, cause any side effects (and of course the examples above are contrived, and one wouldn't expect to encounter them very often, if at all).

A script element embedded in the HTML does not execute when you parse it into an element that is not in the document. Try running this code on any page:

var html = "<script>document.body.innerHTML = '';</script>";
var div = document.createElement('div');
div.innerHTML = html;

You will notice that nothing changes. If the "malicious" script in the HTML had run, the document would have vanished. So you can use the DOM to sanitize HTML without that embedded script running, as long as your sanitizer snips out the script, of course.
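
One caveat to the demonstration above: it only covers `<script>` elements. Inline handlers on elements that start loading a resource immediately, such as `<img src=x onerror="...">`, can fire even though the element created from the live document is never attached. A common way to sidestep this is to parse the string into a separate document that has no browsing context, for example with `DOMParser`. The following is a minimal sketch under that assumption; `parseInert` is a hypothetical helper name, and its `parser` parameter exists only so that a different parser object can be supplied.

```javascript
// Parse untrusted HTML into a separate document with no browsing context,
// so scripts do not run and images are not fetched while the tree is inspected.
// `parseInert` is a hypothetical helper name, not part of any library.
function parseInert(html, parser = new DOMParser()) {
  const doc = parser.parseFromString(html, 'text/html');
  return doc.body; // root of the parsed tree, still outside the live document
}

// Browser usage (left as comments, because it needs a real DOM):
// const body = parseInert('<img src=x onerror="alert(1)"><b>hi</b>');
// ...apply the whitelist filter to `body`, then import only the filtered
// nodes into the live document, e.g. with document.importNode(node, true).
```

Filtering has to finish before any node is moved into the live document, since handlers that survive the filter become active once the node is there.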


By the way, your approach is pretty safe and smarter than what most people try (parse it with regex, the poor fools). However, it's best to rely on good, trusted HTML sanitizing libraries for this, like HTML Purifier on the server. Or, if you want to do it client-side, you can use ESAPI-JS (recommended by @Brett Zamir).

You can use a "sandboxed" iframe that won't execute anything.

var iframe = document.createElement('iframe');
iframe.setAttribute('sandbox', 'allow-same-origin'); // no allow-scripts token, so scripts stay blocked

From w3schools:

The sandbox attribute enables an extra set of restrictions for the content in the iframe. When the sandbox attribute is present, it will:

  • block form submission
  • block script execution
  • disable APIs
  • ...

P.S. That's, by the way, exactly how we do it in our Html Sanitizer https://github.com/jitbit/HtmlSanitizer - we use the browser to interpret HTML and convert it to DOM. Feel free to check the code (or actually use the component).

(disclaimer: I'm a contributor to that OSS project)
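
A minimal sketch of the sandboxed-iframe idea, written for illustration and not taken from the HtmlSanitizer code: `withSandboxedBody` is a hypothetical helper, and it assumes the iframe's initial `about:blank` document, including its `<body>`, is available synchronously once the iframe is attached.

```javascript
// Hypothetical helper: lets the browser parse `html` inside a hidden iframe
// whose sandbox omits allow-scripts, then hands the parsed <body> to `inspect`.
// The `doc` parameter defaults to the page's document.
function withSandboxedBody(html, inspect, doc = document) {
  const iframe = doc.createElement('iframe');
  // Set the sandbox before insertion so the restrictions apply from the start.
  iframe.setAttribute('sandbox', 'allow-same-origin');
  iframe.style.display = 'none';
  doc.body.appendChild(iframe);
  try {
    // Assumption: the initial about:blank document and its body exist here.
    const inner = iframe.contentDocument;
    inner.body.innerHTML = html;
    return inspect(inner.body);
  } finally {
    doc.body.removeChild(iframe); // always clean up the temporary iframe
  }
}

// Browser usage (left as comments, because it needs a real DOM):
// const text = withSandboxedBody('<b onclick="alert(1)">hi</b>', (body) => body.textContent);
```

`allow-same-origin` is what lets the parent page read `contentDocument`; leaving out `allow-scripts` is what keeps scripts in the parsed content from running.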
