I'm attempting to scrape a <script> tag from a set of webpages using Simple HTML Dom. At first, I was scraping it by providing the numerical order of the tag I needed:

$script = $html->find('script', 17); //The tag I need is typically the 18th <script> tag on the page

I've come to realize that the order differs depending on the page, and relying on it isn't scalable anyway since it could change at any time. How can I instead search for a keyword within the tag I need and then pull back the full tag? For example, the tag I need always contains the string "PRODUCT_METADATA".

Thanks in advance for any ideas!

Asked Aug 3, 2015 at 18:46 by user994585
  • Use XPath with SimpleXML or DOMDocument. – splash58, commented Aug 3, 2015 at 18:49
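
The XPath suggestion in the comment above could look roughly like the sketch below. It uses PHP's built-in DOMDocument and DOMXPath rather than Simple HTML Dom, and the `$page` markup is a made-up stand-in for a fetched page, so treat it as an illustration under those assumptions rather than a drop-in replacement.

```php
<?php
// Sample markup standing in for a fetched page (hypothetical content).
$page = <<<HTML
<html><head>
<script>var analytics = {};</script>
<script>var PRODUCT_METADATA = {"id": 42};</script>
</head><body></body></html>
HTML;

$doc = new DOMDocument();
// Collect parser warnings internally; real-world HTML is rarely well-formed.
libxml_use_internal_errors(true);
$doc->loadHTML($page);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select every <script> element whose text content contains the keyword.
$matches = $xpath->query("//script[contains(., 'PRODUCT_METADATA')]");

// Take the first match, or null if no script contains the keyword.
$script = $matches->length > 0 ? $matches->item(0) : null;
if ($script !== null) {
    // saveHTML() with a node argument serializes just that element.
    echo $doc->saveHTML($script), PHP_EOL;
}
```

Because the query matches on content instead of position, it should keep working when the number or order of `<script>` tags changes between pages.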

2 Answers


I ended up using the below code to search all script tags for my keyword:

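$script = null;
$scripts = $html->find('script');
foreach ($scripts as $s) {
    // innertext is the raw text between <script> and </script>
    if (strpos($s->innertext, 'PRODUCT_METADATA') !== false) {
        $script = $s;
        break; // stop at the first matching tag
    }
}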

This works. In my case I was trying to find a CSRF token hidden in a script tag, and at first I couldn't get it to work: all I got back was NULL.

My solution was to call explode() on the script's contents. It is important to remember ->innertext, otherwise you don't get a string to work with.

I was lucky that the token was in double quotes, so it was easy to extract.

My final code looks like this:

$scripts = $html->find('script');
foreach($scripts as $s) {
    if (strpos($s->innertext, 'csrf_token') !== false) {
        $script_array = explode('"', $s->innertext);
        $token = $script_array[1];
        break;
    }
}
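
Note that `$script_array[1]` is the text between the first pair of double quotes in the script, so the explode() approach depends on the token being the first quoted string. A regular expression anchored on the key name may be a little more robust. The sketch below assumes the token appears as a double-quoted value after `csrf_token`, separated by a colon or equals sign; the `$scriptText` sample is hypothetical, and a real page may need a different pattern.

```php
<?php
// Hypothetical script body; the real variable name and layout may differ.
$scriptText = 'var config = {"csrf_token": "abc123", "locale": "en"};';

// Match the key name, an optional closing quote, a ':' or '=' separator,
// then capture whatever sits inside the following pair of double quotes.
$pattern = '/csrf_token["\']?\s*[:=]\s*"([^"]*)"/';

$token = null;
if (preg_match($pattern, $scriptText, $m) === 1) {
    $token = $m[1]; // first capture group: the quoted value
}

var_dump($token);
```

With this sample input, `explode('"', $scriptText)[1]` would return the key name `csrf_token` rather than its value, which is the case the pattern is meant to handle.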
