I'm using HTML Tidy in PHP and it's producing unexpected results because of a <script> tag in a JavaScript string literal. Here's a sample input:

<html>
<script>
var t='<script><'+'/script>';
</script>
</html>

HTML Tidy's output:

<html>
<script>
//<![CDATA[
var t='<script><'+'/script>';
<\/script>
<\/html>
//]]>
</script>
</html>

It's interpreting </script></html> as part of the script. Then, it adds another </script></html> to close the open tags. I tried this on an online version of HTML Tidy (http://www.dirtymarkup./) and it's producing the same error.

How do I prevent this error from occurring in PHP?

Asked Feb 26, 2014 at 0:31 by Leo Jiang; edited Mar 7, 2014 at 9:52 by user2428118.
  • I would say "open a bug ticket", but they do not have any means to do so on their web site... – akonsu, Feb 26, 2014 at 0:35
  • It's an interesting bug, but it seems very specific to the closing </script> tag, so I would just use your current solution. Also, the use case for outputting the < and the /script> separately confuses me. – clancer, Feb 26, 2014 at 0:40
  • Can you specify why you want to put a script tag in a variable? – Viswanath Polaki, Mar 1, 2014 at 4:53
  • @ViswanathPolaki I'm parsing webpages and the authors of those webpages may want to do so. – Leo Jiang, Mar 1, 2014 at 5:01
  • Bug reports go here: sourceforge/p/tidy/bugs. But it does not seem like they want to solve any of them. Sad :( – func0der, Mar 7, 2014 at 15:52

6 Answers

After playing around with it a bit, I discovered that one can use the comment //'<\/script>' to confuse the algorithm and prevent this bug from occurring:

<html>
<script>
var t='<script><'+'/script>'; //'<\/script>'
</script>
</html>

After clean-up:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">

<html>
<head>

   <script>
var t='<script><'+'/script>'; //'<\/script>'
   </script>

   <title></title>
</head>

<body>
</body>
</html>

My guess is that the clean-up algorithm scans the code, sees the string <script> twice, and then expects a matching </script> for each one. Splitting < from /script> hides the second </script> from it, which is why it adds another </script> at the end of the code and, for some reason, another </html> as well. (Poor design indeed!)

So I made a second assumption: the algorithm does not check whether a </script> sits inside a comment. I was right! Putting another <\/script> in a JavaScript comment does make the algorithm think there are two </script> tags in total.
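
That guess can be modeled with a toy scanner. The sketch below only illustrates the hypothesis and is not HTML Tidy's actual code: the function name countScriptTokens and the counting rule (literal <script> openers versus </script> or <\/script> closers) are assumptions made up for this example.

```javascript
// Toy model of the guess above. This is NOT HTML Tidy's real algorithm;
// countScriptTokens is a hypothetical helper written for illustration.
function countScriptTokens(source) {
  // Literal "<script>" openers, wherever they appear (tag or string).
  const opens = (source.match(/<script>/g) || []).length;
  // "</script>" or "<\/script>" closers, wherever they appear (tag or comment).
  const closes = (source.match(/<\\?\/script>/g) || []).length;
  return { opens, closes, balanced: opens === closes };
}

// String.raw keeps the backslash exactly as it would sit in the HTML file.
const original = String.raw`<script>var t='<script><'+'/script>';</script>`;
const withComment = String.raw`<script>var t='<script><'+'/script>'; //'<\/script>'</script>`;

console.log(countScriptTokens(original));
console.log(countScriptTokens(withComment));
```

Under this model the original input has two openers and one closer, while the version with the comment has two of each. That is consistent with the behavior described above, though it does not prove that Tidy works this way.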

There's no need for string concatenation to avoid the closing </script>. Simply escaping the / character is enough to "fool" the parsers in browsers and, it seems, HTML Tidy's parser as well:

<html>
<script>
var t='<script><\/script>';
</script>
</html>
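
To see why the escape is harmless on the JavaScript side, here is a small sketch (the function name scriptTagMarkup is made up for this example). In a JavaScript string literal, \/ is just /, so the runtime value is identical to the concatenated version from the question, while the source text never spells out the closing tag.

```javascript
// Hypothetical helper: returns the markup using the escaped slash.
function scriptTagMarkup() {
  return '<script><\/script>';
}

// The same value built with the concatenation from the question.
const concatenated = '<script><' + '/script>';

console.log(scriptTagMarkup() === concatenated); // true
console.log(scriptTagMarkup().length);           // 17
```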

Try splitting the word "script" itself across a string concatenation, so the tag never appears as a whole word:

<html>
<script>
var t='<scr'+'ipt><'+'/script>';
</script>
</html>

Resulting cleaned code:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">

<html>
<head>

    <script>
var t='<scr'+'ipt><'+'/script>';
    </script>

    <title></title>
</head>

<body>
</body>
</html>

It is probably better practice to create a script tag like this (it should also solve your Tidy issues):

<script>
    // Build the element through the DOM, so no tag text appears inside a string.
    var script = document.createElement('script');
    script.type = 'text/javascript';
    script.src = 'http://myserver./file.js';
    document.getElementsByTagName('head')[0].appendChild(script);
</script>

One way is to make it so Tidy doesn't detect the script tag. The "cleanest" way I could come up with is to escape a character in the tag.

<html>
<script>
var t='<\script><'+'/script>';
</script>
</html>

So you could even do this, without having to break the string up as above:

var t='<\script></\script>';
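
A caveat worth checking before relying on this: it works because \s is not a recognized escape sequence in a JavaScript string literal, so the backslash is simply dropped. Letters that do form an escape, such as \t or \r, would change the string instead. A quick sketch (the variable names are illustrative):

```javascript
// "\s" is not a special escape, so the backslash disappears at parse time.
const escapedS = '<\script></\script>';
console.log(escapedS === '<script><' + '/script>');  // true

// "\t" IS a special escape (tab), so escaping the wrong letter corrupts the tag.
const escapedT = '<scrip\t>';
console.log(escapedT === '<script>');                // false
console.log(escapedT === '<scrip' + '\u0009' + '>'); // true
```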

This just works as expected:

<html>
    <script>
        var t='<'+'script><'+'/script>';
    </script>
</html>

By the way, string concatenation is not the best way to dynamically create HTML to insert into a page. Look into document.createElement or even template engines (handlebars.js is my favourite).

Tags: php, HTML Tidy fails on script tag in JavaScript string literal, Stack Overflow