I'm trying to scrape the gbarbosa page, but it only returns 8 nodes, when the total number of products is 16 in page one. Any suggestions?
library(rvest)
url <- "https://www.gbarbosa.br/caes/caes?_q=C%C3%A3es&map=ft,categoria&page=1"
sku <-
  read_html(url) |>
  html_nodes(".gbarbosa-gbarbosa-components-0-x-ProductName.false") |>
  html_text(trim = TRUE)
gives
> sku
[1] "Ração p/ Cães Pedigree Adulto 7+ Carne Sachê 100g"
[2] "Ração p/ Cães Pedigree Adulto Raças Pequenas Carne 100g"
[3] "Ração p/ Cães Pedigree Adulto Raças Pequenas Frango 100g"
[4] "Alimento p/ Cães Balance Adultos Premium Especial Cordeiro e Vegetais ao Molho Sachê 85g"
[5] "Alimento p/ Cães Pedigree Nutrição Essencial Adultos 9+ Carne ao Leite Pacote 10.1kg Grátis 1.1kg"
[6] "Tapete Higiênico Jumbo s/ Fragrância 80x60cm c/ 7 Unid"
[7] "Ração Kynus p/ Cães Adultos Carne Pacote 15Kg"
[8] "Petisco Dguitos Bifinho de Carne Cães Adultos e Filhotes 65g"
asked Jan 17 at 21:18 by Rafael Díaz (edited Jan 17 at 21:57 by Friede)
2 Answers
Item data for that page is embedded in script elements as JSON-LD (application/ld+json). You can extract it with rvest, parse it with jsonlite as normal JSON, and rectangle the resulting list into a data frame (tidyr comes with a few handy tools for that):
library(rvest)
library(jsonlite)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
url_ <- "https://www.gbarbosa.br/caes/caes?_q=C%C3%A3es&map=ft,categoria&page=1"
ldjson <-
  read_html(url_) |>
  html_element(".render-container script[type='application/ld+json']") |>
  html_text() |>
  parse_json()
# details of the 1st item:
lobstr::tree(ldjson[["itemListElement"]][[1]])
#> <list>
#> ├─@type: "ListItem"
#> ├─position: 1
#> └─item: <list>
#>   ├─@context: "https://schema./"
#>   ├─@type: "Product"
#>   ├─@id: "https://www.gbarbosa.br/raca..."
#>   ├─name: "Ração p/ Cães Pedigree Adulto 7+..."
#>   ├─brand: <list>
#>   │ ├─@type: "Brand"
#>   │ └─name: "Pedigree"
#>   ├─image: "https://gbarbosa.vtexassets/..."
#>   ├─description: ""
#>   ├─mpn: "1634520"
#>   ├─sku: "7922"
#>   ├─offers: <list>
#>   │ ├─@type: "AggregateOffer"
#>   │ ├─lowPrice: 3.39
#>   │ ├─highPrice: 3.39
#>   │ ├─priceCurrency: "BRL"
#>   │ ├─offers: <list>
#>   │ │ └─<list>
#>   │ │   ├─@type: "Offer"
#>   │ │   ├─price: 3.39
#>   │ │   ├─priceCurrency: "BRL"
#>   │ │   ├─availability: "http://schema./InStock"
#>   │ │   ├─sku: "7922"
#>   │ │   ├─itemCondition: "http://schema./NewCondition"
#>   │ │   ├─priceValidUntil: "2026-01-18T11:20:49Z"
#>   │ │   └─seller: <list>
#>   │ │     ├─@type: "Organization"
#>   │ │     └─name: "www.gbarbosa.br"
#>   │ └─offerCount: 1
#>   └─gtin: "7896029014981"
ldjson$itemListElement |>
  tibble(items = _) |>
  hoist(items,
    brand     = list("item", "brand", "name"),
    name      = list("item", "name"),
    sku       = list("item", "sku"),
    lowPrice  = list("item", "offers", "lowPrice"),
    highPrice = list("item", "offers", "highPrice")
  ) |>
  select(-items)
Results:
#> # A tibble: 16 × 5
#> brand name sku lowPrice highPrice
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 Pedigree Ração p/ Cães Pedigree Adulto 7+ Carne Sac… 7922 3.39 3.39
#> 2 Pedigree Ração p/ Cães Pedigree Adulto Raças Pequen… 7466 3.39 3.39
#> 3 Pedigree Ração p/ Cães Pedigree Adulto Raças Pequen… 16425 3.39 3.39
#> 4 Balance Alimento p/ Cães Balance Adultos Premium E… 30424 2.99 2.99
#> 5 Pedigree Alimento p/ Cães Pedigree Nutrição Essenci… 22080 129.99 129.99
#> 6 Jumbo Tapete Higiênico Jumbo s/ Fragrância 80x60… 7205 22.99 22.99
#> 7 Kynus Ração Kynus p/ Cães Adultos Carne Pacote 1… 22206 109.99 109.99
#> 8 Doguitos Petisco Dguitos Bifinho de Carne Cães Adul… 7952 7.99 7.99
#> 9 Pedigree Ração p/ Cães Pedigree Carne 280g 7537 12.99 12.99
#> 10 Balance Alimento p/ Cães Balance Adultos Premium E… 30458 2.99 2.99
#> 11 Pitty Alimento p/ Cães Pitty Adulto Carne 15Kg 35189 79.9 79.9
#> 12 Balance Alimento p/ Cães Balance Adultos Premium E… 29988 2.99 2.99
#> 13 Dog Chow Ração Úmida Dog Chow Cães Adultos Minis e … 7443 3.99 3.99
#> 14 Pedigree Ração p/ Cães Pedigree Filhote Raças Peque… 7378 3.39 3.39
#> 15 Pedigree Racao p/ Cães Pedigree Adulto Cordeiro 100g 7435 3.39 3.39
#> 16 Dog Chow Ração Úmida Dog Chow Cães Adultos Minis e … 7255 3.99 3.99
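To try the same extraction steps without hitting the site, here is a minimal offline sketch. The HTML snippet, product names and prices in it are made up for illustration; only the rvest, jsonlite and tidyr calls mirror the code above, and the real page's JSON-LD may be structured differently.

```r
library(rvest)
library(jsonlite)
library(tibble)
library(tidyr)

# Hypothetical page: one JSON-LD script holding an ItemList with two products
html <- '
<div class="render-container">
  <script type="application/ld+json">
  {"@type": "ItemList",
   "itemListElement": [
     {"@type": "ListItem", "position": 1,
      "item": {"name": "Product A", "sku": "1", "offers": {"lowPrice": 3.39}}},
     {"@type": "ListItem", "position": 2,
      "item": {"name": "Product B", "sku": "2", "offers": {"lowPrice": 12.99}}}
   ]}
  </script>
</div>'

# Same pipeline as above, but reading from the inline string
ldjson <-
  minimal_html(html) |>
  html_element(".render-container script[type='application/ld+json']") |>
  html_text() |>
  parse_json()

# One row per list item; hoist() pulls nested fields into columns
items <-
  tibble(items = ldjson$itemListElement) |>
  hoist(items,
    name     = list("item", "name"),
    sku      = list("item", "sku"),
    lowPrice = list("item", "offers", "lowPrice")
  )

# Keep only the hoisted columns (base subsetting, so dplyr is not needed)
items <- items[c("name", "sku", "lowPrice")]
```

Because the data comes from the JSON payload rather than from rendered product cards, the number of rows does not depend on how many cards the browser has drawn.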
You can also try rvest::read_html_live(), which uses a headless Chrome browser through the chromote package to properly render dynamic content. With an up-to-date Chrome (>= v132), and until the current CRAN chromote release (0.3.1) gets updated, make sure to set the chromote.headless = "new" option (see NEWS.md for details).
To trigger rendering of all elements, we first need to scroll through the page; for this we can use scroll_into_view().
# for chromote <= 0.3.1 and Chrome >= v132 make sure to switch to new headless mode
options(chromote.headless = "new")
sess <- read_html_live(url_)
sess$scroll_into_view(".vtex-render__container-id-shelf-related-search")
# give it a bit time to render elements
Sys.sleep(1)
containers <- html_elements(sess, "#gallery-layout-container .vtex-product-summary-2-x-container")
tibble(
  price = html_element(containers, ".promotion-price-by") |> html_text(),
  name  = html_element(containers, "h3") |> html_text()
)
#> # A tibble: 16 × 2
#> price name
#> <chr> <chr>
#> 1 R$ 3,39 Ração p/ Cães Pedigree Adulto 7+ Carne Sachê 100g
#> 2 R$ 3,39 Ração p/ Cães Pedigree Adulto Raças Pequenas Carne 100g
#> 3 R$ 3,39 Ração p/ Cães Pedigree Adulto Raças Pequenas Frango 100g
#> 4 R$ 2,99 Alimento p/ Cães Balance Adultos Premium Especial Cordeiro e Veget…
#> 5 R$ 129,99 Alimento p/ Cães Pedigree Nutrição Essencial Adultos 9+ Carne ao L…
#> 6 R$ 22,99 Tapete Higiênico Jumbo s/ Fragrância 80x60cm c/ 7 Unid
#> 7 R$ 109,99 Ração Kynus p/ Cães Adultos Carne Pacote 15Kg
#> 8 R$ 7,99 Petisco Dguitos Bifinho de Carne Cães Adultos e Filhotes 65g
#> 9 R$ 12,99 Ração p/ Cães Pedigree Carne 280g
#> 10 R$ 2,99 Alimento p/ Cães Balance Adultos Premium Especial Carne e Vegetais…
#> 11 R$ 79,90 Alimento p/ Cães Pitty Adulto Carne 15Kg
#> 12 R$ 2,99 Alimento p/ Cães Balance Adultos Premium Especial Frango e Vegetai…
#> 13 R$ 3,99 Ração Úmida Dog Chow Cães Adultos Minis e Pequenos Frango 100g
#> 14 R$ 3,39 Ração p/ Cães Pedigree Filhote Raças Pequenas Frango 100g
#> 15 R$ 3,39 Racao p/ Cães Pedigree Adulto Cordeiro 100g
#> 16 R$ 3,99 Ração Úmida Dog Chow Cães Adultos Minis e Pequenos Frango 100g
Created on 2025-01-18 with reprex v2.1.1
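The price column in the rendered output is text in Brazilian format ("R$ 3,39"). If numeric prices are needed, a small base R helper could convert it. `parse_brl()` is a hypothetical name introduced here, and the sketch assumes a comma as the decimal separator with optional "." thousands separators.

```r
# Convert strings such as "R$ 1.234,56" to numeric; NA when no price is found
parse_brl <- function(x) {
  # Locate the first number shaped like 1.234,56 in each string
  hit <- regexpr("[0-9][0-9.]*,[0-9]{2}", x)
  ok <- as.vector(!is.na(hit) & hit > -1L)
  out <- rep(NA_character_, length(x))
  out[ok] <- regmatches(x, hit)
  # Drop thousands separators, then switch the decimal comma to a point
  out <- gsub(".", "", out, fixed = TRUE)
  out <- sub(",", ".", out, fixed = TRUE)
  as.numeric(out)
}

prices <- parse_brl(c("R$ 3,39", "R$ 129,99", "R$ 1.234,56", "sold out"))
```

Only the first price in each string is used, so a cell that holds both a unit price and a per-kg price yields the unit price.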
Some websites load part of their content with JavaScript after the initial HTML has been delivered. rvest's read_html() only sees that initial HTML, so list items that are added later are missing from what it returns. You will have more success with selenider, which drives a real browser from code and can therefore run JavaScript:
# Load the given packages, installing any that are missing
using <- function(...) {
  libs <- unlist(list(...))
  req <- unlist(lapply(libs, require, character.only = TRUE))
  need <- libs[req == FALSE]
  if (length(need) > 0) {
    install.packages(need)
    lapply(need, require, character.only = TRUE)
  }
}
using("selenium","selenider","purrr","rvest","chromote","dplyr")
# 1. Make sure your system has Java version 17 or higher installed. Download the latest version of Java from the Official Oracle Website: https://www.oracle/java/technologies/downloads/
# 2. Restart your pc after installation
# How to use selenider: https://www.listendata/2024/04/how-to-use-selenium-in-r.html
# Connecting to url
session <- selenider_session("selenium", browser = "chrome")
url <- "https://www.gbarbosa.br/caes/caes?_q=C%C3%A3es&map=ft,categoria&page=1"
open_url(url)
# Click reject cookies button
button <- session %>%
  find_element("button[id*='onetrust-reject-all-handler']") %>%
  elem_wait_until(is_present)
session %>%
  find_element("button[id*='onetrust-reject-all-handler']") %>%
  elem_click()
# Function to safely extract element text
safe_extract <- function(session, selector, wait = TRUE) {
  tryCatch({
    if (wait) {
      session %>%
        find_element(selector) %>%
        elem_wait_until(is_present)
    }
    session %>%
      get_page_source() %>%
      html_elements(selector) %>%
      html_text(trim = TRUE)
  }, error = function(e) NA)
}
click_show_more <- function(session, max_attempts = 10) {
  for (i in seq_len(max_attempts)) {
    # Try to find the button
    button <- try({
      session %>%
        find_element("div[class*='vtex-button__label flex items-center justify-center h-100 ph5']")
    }, silent = TRUE)
    # If button not found or error, break the loop
    if (inherits(button, "try-error")) {
      message("Button no longer found after ", i - 1, " clicks")
      break
    }
    # Try to click the button. The error handler is a separate function,
    # so it cannot call `break`; it returns a flag that is checked below.
    clicked <- tryCatch({
      button %>% elem_click()
      message("Successfully clicked button - attempt ", i)
      Sys.sleep(2) # Wait for new content to load
      TRUE
    }, error = function(e) {
      message("Failed to click button: ", e$message)
      FALSE
    })
    if (!clicked) break
  }
}
# click more button
click_show_more(session)
# Get all Elements
res <- data.frame(
  products = safe_extract(session, "h3[class*='gbarbosa-gbarbosa-components-0-x-ProductName false']"),
  prices = safe_extract(session, "div[class*='vtex-store-components-3-x-sellingPrice vtex-store-components-3-x-sellingPriceContainer pv1 b c-on-base vtex-store-components-3-x-price_sellingPriceContainer vtex-store-components-3-x-price_sellingPriceContainer--summary vtex-store-components-3-x-price-without-from']")
)
res
Result:
products | prices |
---|---|
Ração p/ Cães Pedigree Adulto 7+ Carne Sachê 100g | R$ 3,39 |
Ração p/ Cães Pedigree Adulto Raças Pequenas Carne 100g | R$ 3,39 |
Ração p/ Cães Pedigree Adulto Raças Pequenas Frango 100g | R$ 3,39 |
Alimento p/ Cães Balance Adultos Premium Especial Cordeiro e Vegetais ao Molho Sachê 85g | R$ 2,99 |
Alimento p/ Cães Pedigree Nutrição Essencial Adultos 9+ Carne ao Leite Pacote 10.1kg Grátis 1.1kg | R$ 129,99 |
Tapete Higiênico Jumbo s/ Fragrância 80x60cm c/ 7 Unid | R$ 22,99 |
Alimento p/ Cães Balance Adultos Raças Pequenas Carne, Frango e Vegetais Pouch 900g | R$ 29,90 |
Ração Úmida Dog Chow Cães Adultos Cordeiro 100g | R$ 3,99 |
Ração Úmida Dog Chow Cães Adultos Cordeiro 100g | R$ 3,99 |
Racão p/ Gato Pitukats Carne 1Kg | R$ 15,90 |
Petisco Doguitos Bifinho de Frango Cães Adultos e Filhotes 65g | R$ 7,99 |
Alimento p/ Cães Pedigree Adultos Carne e Frango 900g | R$ 25,99 |
Petisco p/ Cães Pedigree Biscrok Adultos Leite Cenoura Fígado e Espinafre Multi Pouch 500g | R$ 26,99 |
Alimento p/ Cães Pedigree Filhotes Carne ao Molho Sachê 100g | R$ 3,39 |
Alimento p/ Cães Purina Filhotes Purina Alpo Carne e Frango 1Kg | R$ 24,90 |
Alimento p/ Cães Balance Adultos Raças Pequenaa Carne/Frango e Vegetais Pouch 2.7Kg | R$ 59,90 |
Alimento p/ Cães Balance Filhotes Raças Pequenaa Carne/Frango e Vegetais Pouch 2.7Kg | R$ 59,90 |
Shampoo Sanol Antipulgas p/ Cães 500ml | R$ 22,90 |
Alimento p/ Cães Champ Adulto Sabor de Casa Frango Sachê 85g | R$ 2,99 |
Alimento p/ Cães Faro Filhotes Carne ao Molho Sachê 85g | R$ 2,29 |
Alimento p/ Cães Balance Adultos Premium Especial Carne ao Molho c/ Frutas Sachê 85g | R$ 2,99 |
Bifinho p/ Cães Keldog Kelco Criadores Churrasco 500g Leve + Pague - | R$ 34,99 |
Alimento p/ Cães Purina Adultos Mini/Pequenos Carne/Frango/Arroz 1Kg | R$ 35,90 |
Ração p/ Cães Biscrok Pedigree Filhotes 300g | R$ 20,99 |
Alimento p/ Cães Pitty Adulto Carne c/ Vitaminas 7Kg | R$ 54,90 |
Kit Banho Sanol Dog Shampoo Neutro 500ml + Colônia 120ml GrátisCondicionador 500ml | R$ 54,99 |
Ração p/ Cães Pedigree Júnior 280g | R$ 12,99 |
Alimento p/ Cães Purina Alpo Carne e Vegetais 1Kg | R$ 26,99 |
Alimento p/ Cães Pedigree Adultos Carne 900g | R$ 25,99 |
Alimento p/ Cães Pedigree Adultos Carne 2.7Kg | R$ 59,90 |
Ovos Brancos Grandes GBarbosa c/ 30 Unid | R$ 19,39 |
Filé de Peito de Frango Sadia Bandeja Congelado 1Kg | R$ 1,44Un (Aprox. 180g)R$ 7,99/ kg |
Filé de Peito de Frango Sadia Zip 1Kg | R$ 1,98Un (Aprox. 200g)R$ 9,90/ kg |
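The two selectors passed to data.frame() are matched independently across the whole page, so the columns only line up if every product has exactly one name and one price. A possible alternative, in the spirit of the read_html_live() answer, is to select the product containers first and then look inside each one. The sketch below shows this on a made-up HTML snippet; the `.card` and `.price` classes are hypothetical, not the site's real class names.

```r
library(rvest)

# Made-up listing: the second product has no price element
doc <- minimal_html('
  <div class="card"><h3>Product A</h3><span class="price">R$ 3,39</span></div>
  <div class="card"><h3>Product B</h3></div>
  <div class="card"><h3>Product C</h3><span class="price">R$ 7,99</span></div>')

# Page-wide selectors give 3 names but only 2 prices,
# so data.frame(all_names, all_prices) would fail
all_names  <- doc |> html_elements(".card h3") |> html_text(trim = TRUE)
all_prices <- doc |> html_elements(".card .price") |> html_text(trim = TRUE)

# Per-card extraction: html_element() returns one result per card,
# with NA where the child element is absent
cards <- html_elements(doc, ".card")
res <- data.frame(
  products = cards |> html_element("h3") |> html_text(trim = TRUE),
  prices   = cards |> html_element(".price") |> html_text(trim = TRUE)
)
```

Scoping the container selector to the main gallery, as the other answer does, should also keep items from recommendation shelves out of the result.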
I could also iterate through each of the 4 pages. But I still don't understand why on page 1 it only allows me to extract 8 nodes. – Rafael Díaz, commented Jan 17 at 21:39
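Iterating over the pages, as this comment suggests, could be sketched as follows. `set_page()`, `item_names()` and `scrape_pages()` are hypothetical helpers introduced here. The sketch assumes the listing URL carries a `page=<n>` query parameter and that the first JSON-LD script on each page holds the item list, which may not be true for every page.

```r
library(rvest)
library(jsonlite)

# Swap the page number in a listing URL (assumes a "page=<n>" parameter)
set_page <- function(url, page) {
  sub("page=[0-9]+", paste0("page=", page), url)
}

# Product names from the JSON-LD item list of one parsed page
item_names <- function(doc) {
  scripts <- html_elements(doc, "script[type='application/ld+json']")
  if (length(scripts) == 0) return(character())
  ld <- parse_json(html_text(scripts[[1]]))
  vapply(ld$itemListElement, function(el) el$item$name, character(1))
}

# Fetch several pages and combine the names, pausing between requests
scrape_pages <- function(url, pages, pause = 1) {
  unlist(lapply(pages, function(p) {
    Sys.sleep(pause)
    item_names(read_html(set_page(url, p)))
  }))
}

# Offline check of the helpers on a made-up page
demo <- minimal_html('<script type="application/ld+json">
  {"itemListElement": [{"item": {"name": "Product A"}},
                       {"item": {"name": "Product B"}}]}
</script>')
demo_names <- item_names(demo)
demo_url <- set_page("https://example.com/list?map=ft&page=1", 3)
```

Against the live site, the call would look like `scrape_pages(url, 1:4)` with the listing URL from the question. That call needs network access, so it is not run here.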