How to get all links from a website with puppeteer
Well, I would like a way to use puppeteer and a for loop to get all the links on a site and add them to an array. In this case, the links I want are not just the links in the HTML tags; they are links that appear directly in the source code, JavaScript file links, etc. I want something like this:
array = []
for (L in links) {
    array.push(L)
    // The code should take all the links and add them to the array
}
But how can I get all references to JavaScript and style files, and all URLs that are in the source code of a website? I have only found posts and questions that show how to get the links from the tags, not all the links from the source code.
Suppose you want to get all the tags on this page, for example:
view-source:https://www.nike.com/
How can I get all the script tags and print them to the console? I put view-source:https://www.nike.com
because that way you can see the script tags. I don't know if you can do it without displaying the source code, but my idea was to display the source and grab the script tags; however, I don't know how to do it.
- Bounties are a way of using reputation to advertise questions, but be forewarned: you lose the rep immediately, with few chances to get it back. – Heretic Monkey Commented Jun 1, 2021 at 0:24
- Stack Overflow isn't a code-writing service. Show us your own research first, please, and what works and what issues you run into. – Tschallacka Commented Jun 3, 2021 at 7:23
- As "site", do you mean one particular link (e.g. google.com) or all sublinks (e.g. google.com and google.com/something etc.) as well? – ulou Commented Jun 3, 2021 at 8:05
- @Tschallacka I don't have code; I didn't find anything explaining this, so I asked on Stack Overflow to get an answer. I didn't find what I was looking for – user15594988 Commented Jun 3, 2021 at 13:13
- @ulou I want to get all links and sublinks, including links from CSS and JavaScript files etc. I want to be able to get all links and sublinks that are visible in the source code – user15594988 Commented Jun 3, 2021 at 13:15
3 Answers
It is possible to get all links from a URL using only node.js, without puppeteer:
There are two main steps:
- Get the source code for the URL.
- Parse the source code for links.
Simple implementation in node.js:
// get-links.js
///
/// Step 1: Request the URL's HTML source.
///
const axios = require('axios');
const promise = axios.get('https://www.nike.com/');
// Extract the HTML source from the response, then process it:
promise.then(function(response) {
    const htmlSource = response.data;
    getLinksFromHtml(htmlSource);
});
///
/// Step 2: Find links in HTML source.
///
// This function takes HTML (as a string) and outputs all the links within it.
function getLinksFromHtml(htmlString) {
    // Regular expression that matches the syntax of a link (https://stackoverflow.com/a/3809435/117030):
    const LINK_REGEX = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/gi;
    // Use the regular expression above to find all the links:
    const matches = htmlString.match(LINK_REGEX);
    // Output to console:
    console.log(matches);
    // Alternatively, return the array of links for further processing:
    return matches;
}
Sample usage:
$ node get-links.js
[
'http://www.w3.org/2000/svg',
...
'https://s3.nikecdn.com/unite/scripts/unite.min.js',
'https://www.nike.com/android-icon-192x192.png',
...
'https://connect.facebook.net/',
... 658 more items
]
Notes:
- I used the axios library for simplicity and to avoid "access denied" errors from nike.com. It is possible to use any other method to get the HTML source, like:
- Native node.js http/https libraries (a minimal sketch follows this list)
- Puppeteer (Get complete web page source html with puppeteer - but some part always missing)
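For illustration, here is a minimal sketch of the native approach using only the built-in https module. It assumes the getLinksFromHtml() function from above is in scope, and note that some sites reject plain requests without browser-like headers, which is why axios was used in the first place:
// fetch-links-native.js
// Sketch using only node.js built-ins; assumes getLinksFromHtml()
// from get-links.js above is defined in the same file.
const https = require('https');

// Fetch the raw HTML of a URL as a string:
function fetchHtml(url) {
    return new Promise((resolve, reject) => {
        https.get(url, (res) => {
            let body = '';
            res.on('data', (chunk) => { body += chunk; });
            res.on('end', () => resolve(body));
        }).on('error', reject);
    });
}

fetchHtml('https://www.nike.com/').then((html) => getLinksFromHtml(html));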
Although the other answers are applicable in many situations, they will not work for client-side rendered sites. For instance, if you just make an axios request to Reddit, all you'll get is a couple of divs with some metadata. Since Puppeteer actually loads the page and runs all its JavaScript in a real browser, the website's choice of document rendering becomes irrelevant for extracting page data.
Puppeteer has an evaluate method on the page object which allows you to run JavaScript directly on the page. Using that, you can easily extract all links as follows:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');
const pageUrls = await page.evaluate(() => {
const urlArray = Array.from(document.links).map((link) => link.href);
const uniqueUrlArray = [...new Set(urlArray)];
return uniqueUrlArray;
});
console.log(pageUrls);
await browser.close();
})();
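Note that document.links only covers <a> and <area> elements. If you also want script and stylesheet URLs (as the question asks), you can broaden the selectors inside evaluate. A sketch along the same lines, again using example.com as a placeholder:
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com');

    const allUrls = await page.evaluate(() => {
        // Anchors, scripts with a src, and <link> tags (stylesheets, icons, etc.):
        const elements = [
            ...document.querySelectorAll('a[href]'),
            ...document.querySelectorAll('script[src]'),
            ...document.querySelectorAll('link[href]'),
        ];
        // href covers anchors and <link> tags; src covers scripts:
        const urls = elements.map((el) => el.href || el.src);
        return [...new Set(urls)];
    });

    console.log(allUrls);
    await browser.close();
})();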
Yes, you can get all the script tags and their links without opening the view source.
You need to add a dependency on the jsdom library in your project and then pass the HTML response to a jsdom instance, as below.
Here is the code:
const axios = require('axios');
const jsdom = require("jsdom");

// await must live inside an async function in CommonJS:
(async () => {
    // Make a simple HTTP request using axios or node-fetch, as you wish:
    const nikePageResponse = await axios.get('https://www.nike.com/');

    // Now parse this response into an HTML document using the jsdom library:
    const dom = new jsdom.JSDOM(nikePageResponse.data);
    const nikePage = dom.window.document;

    // Now get all the script tags by querying this page:
    const scriptLinks = [];
    nikePage.querySelectorAll('script[src]').forEach(script => scriptLinks.push(script.src.trim()));
    console.debug('%o', scriptLinks);
})();
Here I have written a CSS selector for <script> tags that have a src attribute.
You could write the same code with puppeteer, but it would take longer, since puppeteer opens a full browser before giving you the page source.
You can use this to find the links and then do whatever you want with them, using puppeteer or anything else.
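The same approach extends to stylesheet links with a different selector. A sketch, where the url option is an added assumption so that jsdom resolves relative hrefs (like /main.css) into absolute URLs:
const axios = require('axios');
const jsdom = require("jsdom");

(async () => {
    const response = await axios.get('https://www.nike.com/');

    // `url` tells jsdom the document's base URL, so relative hrefs
    // resolve to absolute ones when read via link.href:
    const dom = new jsdom.JSDOM(response.data, { url: 'https://www.nike.com/' });

    const styleLinks = [];
    dom.window.document
        .querySelectorAll('link[rel="stylesheet"][href]')
        .forEach(link => styleLinks.push(link.href));
    console.debug('%o', styleLinks);
})();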