I am trying to open a public company page on LinkedIn using Puppeteer, but every time it is redirected to an authentication form. This does not happen when I manually paste the URL into Chromium or Chrome.
This is the code:
const puppeteer = require("puppeteer");

(async () => {
  const url = "https://www.linkedin.com/company/google/";
  const browser = await puppeteer.launch({
    headless: false,
    args: [
      "--lang=en-GB",
      "--no-sandbox",
      "--disable-setuid-sandbox",
      "--disable-gpu",
      "--disable-dev-shm-usage",
    ],
    defaultViewport: null,
    pipe: true,
    slowMo: 30,
  });
  const page = await browser.newPage();
  await page.goto(url, {
    waitUntil: "networkidle0",
  });
  // Times out after 10 s when the page was redirected to the login form.
  await page.waitForSelector(".top-card-layout__entity-info-container", { timeout: 10000 });
  await page.close();
  await browser.close();
})();
This is where the browser is redirected: [screenshot of the authentication form]
This does not happen if I manually paste the URL https://www.linkedin.com/company/google/ into Chromium or Chrome.
What I have tried so far:
- Use an incognito browser context:
// [...]
const context = await browser.createIncognitoBrowserContext();
const page = await context.newPage();
// [...]
- Use puppeteer-extra-plugin-stealth to avoid being detected as a bot:
const puppeteer = require("puppeteer-extra");
puppeteer.use(require("puppeteer-extra-plugin-stealth")());
// [...]
- Use a random user agent:
const randomUserAgent = require("random-useragent");
// [...]
await page.setUserAgent(randomUserAgent.getRandom());
// [...]
Nothing has worked. Is there anything else I can try?
asked Aug 30, 2020 at 17:49 by revy, edited Aug 30, 2020 at 18:35
- I copied the link you provided. I also got the authentication page the first time. After a refresh, I got to view the page once. In incognito, I am always asked to log in. – Rupjyoti, Aug 30, 2020 at 18:00
- Note that I tried this in Chrome on a mobile phone. – Rupjyoti, Aug 30, 2020 at 18:05
- @Rupjyoti That's weird, it works for me in Chrome incognito and Safari incognito. – revy, Aug 30, 2020 at 18:31
3 Answers
The cause
This is caused by Microsoft's strict protection of LinkedIn profiles. If you are able to visit public profiles in incognito mode, I think some shared cookies are responsible; normally you cannot visit public company profiles on LinkedIn without logging in because of the AuthWall (which is what blocks you in this case). For me, login is required every time, even in a non-incognito window.
A bit of background from data expert John Koala:
When Microsoft bought LinkedIn they invested billions into the purchase. They also started to act, quite soon they battled scraping. Companies like the now famous, due to it’s court battle, “HiQ Labs” use the LinkedIn data to make a huge profit.
Now LinkedIn had the problem that public scraping is not a legal offense, they failed (like all other websites) t[o] prevent well developed public scraping.
So LinkedIn added and strengthened a feature called “Authwall”, that is a very sensitive scraping detection. It allows rarely any public views from non authorized accounts making scraping without account impossible.
Scraping with accounts is a legal offense and it’s a lot more difficult as accounts need to be maintained. This is when HiQ Labs and all other scraping companies went out of business. HiQ saw millions of profit going down the sink, they battled LinkedIn at court.
The only company left scraping them is “scraping.services“, it will stay interesting what is going to happen during the next years.
Source: John Koala, Why does LinkedIn no longer allow me to see public profiles without logging in? In: Quora
I am sure the fact that the whole former Puppeteer team now works at Microsoft will not make it any easier to get past the AuthWall either (note that even with puppeteer-extra-plugin-stealth the page cannot be visited).
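One practical consequence of the AuthWall is that the script in the question only fails once `waitForSelector` times out. A minimal sketch of detecting the redirect right after navigation could look like the code below. It assumes the redirect target has a path beginning with `/authwall`, `/login`, `/uas/login` or `/checkpoint`; these prefixes and the helper names `looksLikeAuthWall` and `openPublicPage` are illustrative assumptions, not a documented LinkedIn contract.

```javascript
// Path prefixes assumed to indicate a login redirect (an assumption, adjust as needed).
const AUTH_PATH_PREFIXES = ["/authwall", "/login", "/uas/login", "/checkpoint"];

// Returns true when the URL's path equals or sits below one of the assumed prefixes.
function looksLikeAuthWall(finalUrl) {
  const { pathname } = new URL(finalUrl);
  return AUTH_PATH_PREFIXES.some(
    (prefix) => pathname === prefix || pathname.startsWith(prefix + "/")
  );
}

// Hypothetical helper: opens `url` in a new page and reports whether the
// navigation ended on a login page. `browser` is a Puppeteer Browser instance.
async function openPublicPage(browser, url) {
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "domcontentloaded" });
  const finalUrl = page.url(); // URL after any redirects
  return { page, finalUrl, blocked: looksLikeAuthWall(finalUrl) };
}
```

Checking `blocked` before calling `waitForSelector` lets the script report the redirect immediately instead of waiting for the timeout.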
Solution
The only reliable way to visit LinkedIn pages is to log in through the form (or to use a Chrome profile that is already logged in and has valid session cookies).
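The logged-in profile variant could be sketched roughly as follows. It relies on Puppeteer's `userDataDir` launch option; the helper names `buildLaunchOptions` and `launchWithProfile` are hypothetical, and the profile directory is assumed to have been logged in manually beforehand.

```javascript
const path = require("path");

// Builds launch options that reuse an existing browser profile directory.
// `profileDir` is whatever directory holds the logged-in profile (an assumption).
function buildLaunchOptions(profileDir) {
  return {
    headless: false,
    defaultViewport: null,
    userDataDir: path.resolve(profileDir), // cookies and sessions are stored here
  };
}

// Hypothetical helper: launches the browser with that profile.
async function launchWithProfile(profileDir) {
  const puppeteer = require("puppeteer"); // loaded lazily, only when launching
  return puppeteer.launch(buildLaunchOptions(profileDir));
}
```

Because the session cookies live in the profile directory, no credentials need to appear in the script itself.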
Update: Scraping with an existing account violates LinkedIn's user agreement, so it is not advised. My solution above applies only to one-time visits (which is not a valid scenario anyway). So the final answer is: it is not possible to visit these profiles with Puppeteer.
Try a different user agent. Just pick one: https://developers.whatismybrowser.com/useragents/explore/software_type_specific/web-browser/
More about implementing user agents in Puppeteer: https://dev.to/sonyarianto/user-agent-string-difference-in-puppeteer-headless-and-headful-4aoh
Edit: before you try the above, maybe try the stealth add-on first: https://www.npmjs.com/package/puppeteer-extra-plugin-stealth
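For reference, a fixed user agent can be set per page with `page.setUserAgent`. The sketch below is only an illustration: the user agent string is an example value and `openWithUserAgent` is a hypothetical helper. Since the question reports that a random user agent and the stealth plugin both still led to the redirect, this may not be sufficient on its own.

```javascript
// Example desktop Chrome user agent string (illustrative value only).
const EXAMPLE_USER_AGENT =
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
  "(KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36";

// Hypothetical helper: opens `url` with the given user agent and returns the
// page together with the user agent string the page itself reports.
// `browser` is a Puppeteer Browser instance.
async function openWithUserAgent(browser, url, userAgent = EXAMPLE_USER_AGENT) {
  const page = await browser.newPage();
  await page.setUserAgent(userAgent); // set before navigating
  await page.goto(url, { waitUntil: "domcontentloaded" });
  const reported = await page.evaluate(() => navigator.userAgent);
  return { page, reported };
}
```

Comparing `reported` with the value passed in is a quick way to confirm that the override took effect.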
This happens because Puppeteer opens a fresh browser profile, much like an incognito window, so LinkedIn redirects you to the login page. You just need to write a script that clicks the sign-in option and enters your credentials through Puppeteer. Once you are logged in, you can do whatever you intended, whether scraping or anything else.
E.g.:
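// All selectors below are placeholders: replace them with the real ones from the login page.
const login = ".whatever the login button class is";
await page.click(login);
// Type your email and password into the login form.
await page.type(".email", "Email here");
await page.type(".password", "Pass here");
// Submit the form (the selector must be a quoted string).
await page.click(".login-button-class");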