This question was published by the Tutorial Guruji team.
I am trying to fetch the content of the Product Description section on the Nykaa website.
URL: https://www.nykaa.com/nykaa-skinshield-matte-foundation/p/460512?productId=460512&pps=1&skuId=460502
On this page, clicking the 'Read More' button in the Product Description section reveals additional text at the end.
The text I want to extract is:
Explore the entire range of Foundation available on Nykaa. Shop more Nykaa Cosmetics products here. You can browse through the complete world of Nykaa Cosmetics Foundation. Alternatively, you can also find many more products from the Nykaa SkinShield Anti-Pollution Matte Foundation range.
Expiry Date: 15 February 2024
Country of Origin: India
Name of Mfg / Importer / Brand: FSN E-commerce Ventures Pvt Ltd
Address of Mfg / Importer / Brand: 104 Vasan Udyog Bhavan Sun Mill Compound Senapati Bapat Marg, Lower Parel, Mumbai City Maharashtra – 400013
After inspecting the page, I found that when I disable JavaScript, all the content in the Product Description vanishes. This means the content is loaded dynamically via JavaScript.
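Because the content only appears after JavaScript runs, a fixed sleep is fragile; the usual approach is to poll until the element shows up, which is what Selenium's WebDriverWait does. The polling idea behind it can be sketched in plain Python (the commented usage line and its locator are illustrative assumptions, not code from the page):

```python
import time

def wait_until(condition, timeout=20.0, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` expires.

    This mirrors the idea behind selenium.webdriver.support.ui.WebDriverWait:
    repeatedly evaluate a condition (e.g. "is the element present yet?")
    instead of sleeping for a fixed amount of time.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError(f"condition not met within {timeout} seconds")

# Hypothetical usage with a real driver:
# paragraphs = wait_until(
#     lambda: browser.find_elements(By.XPATH, '//div[@id="content-details"]/p'))
```

With a real driver you would normally reach for `WebDriverWait(browser, 20).until(...)` directly; the sketch just shows why it succeeds where a fixed `implicitly_wait` plus an absolute XPath can fail.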
I have used Selenium for this purpose, and this is what I have tried:
from msilib.schema import Error
from tkinter import ON
from turtle import goto
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import numpy as np
from random import randint
import pandas as pd
import requests
import csv

browser = webdriver.Chrome(
    r'C:\Users\paart\.wdm\drivers\chromedriver\win32\97.0.4692.71\chromedriver.exe')
browser.maximize_window()  # For maximizing window
browser.implicitly_wait(20)  # gives an implicit wait for 20 seconds
browser.get(
    "https://www.nykaa.com/nykaa-skinshield-matte-foundation/p/460512?productId=460512&pps=1&skuId=460502")

# Creates "load more" button object.
browser.implicitly_wait(20)
loadMore = browser.find_element_by_xpath(
    xpath="/html/body/div[1]/div/div[3]/div[1]/div[2]/div/div/div[2]")
loadMore.click()

browser.implicitly_wait(20)
desc_data = browser.find_elements_by_class_name('content-details')

for desc in desc_data:
    para_details = browser.find_element_by_xpath(
        './/*[@id="content-details"]/p[1]').text
    extra_details = browser.find_elements_by_xpath(
        './/*[@id="content-details"]/p[2]',
        './/*[@id="content-details"]/p[3]',
        './/*[@id="content-details"]/p[4]',
        './/*[@id="content-details"]/p[5]').text
    print(para_details, extra_details)
And this is the output that is displayed:
PS E:\Web Scraping - Nykaa> python -u "e:\Web Scraping - Nykaa\scrape_nykaa_final.py"
e:\Web Scraping - Nykaa\scrape_nykaa_final.py:16: DeprecationWarning: executable_path has been deprecated, please pass in a Service object
  browser = webdriver.Chrome(

DevTools listening on ws://127.0.0.1:1033/devtools/browser/097c0e11-6f2c-4742-a2b5-cd05bee72661
e:\Web Scraping - Nykaa\scrape_nykaa_final.py:28: DeprecationWarning: find_element_by_* commands are deprecated. Please use find_element() instead
  loadMore = browser.find_element_by_xpath(
[9312:4972:0206/110327.883:ERROR:ssl_client_socket_impl.cc(996)] handshake failed; returned -1, SSL error code 1, net_error -101
[9312:4972:0206/110328.019:ERROR:ssl_client_socket_impl.cc(996)] handshake failed; returned -1, SSL error code 1, net_error -101
Traceback (most recent call last):
  File "e:\Web Scraping - Nykaa\scrape_nykaa_final.py", line 28, in <module>
    loadMore = browser.find_element_by_xpath(
  File "C:\Python310\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 520, in find_element_by_xpath
    return self.find_element(by=By.XPATH, value=xpath)
  File "C:\Python310\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 1244, in find_element
    return self.execute(Command.FIND_ELEMENT, {
  File "C:\Python310\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 424, in execute
    self.error_handler.check_response(response)
  File "C:\Python310\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 247, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"/html/body/div[1]/div/div[3]/div[1]/div[2]/div/div/div[2]"}
  (Session info: chrome=97.0.4692.99)
Stacktrace:
Backtrace:
  Ordinal0 [0x00FDFDC3+2555331]
  Ordinal0 [0x00F777F1+2127857]
  Ordinal0 [0x00E72E08+1060360]
  Ordinal0 [0x00E9E49E+1238174]
  Ordinal0 [0x00E9E69B+1238683]
  Ordinal0 [0x00EC9252+1413714]
  Ordinal0 [0x00EB7B54+1342292]
  Ordinal0 [0x00EC75FA+1406458]
  Ordinal0 [0x00EB7976+1341814]
  Ordinal0 [0x00E936B6+1193654]
  Ordinal0 [0x00E94546+1197382]
  GetHandleVerifier [0x01179622+1619522]
  GetHandleVerifier [0x0122882C+2336844]
  GetHandleVerifier [0x010723E1+541697]
  GetHandleVerifier [0x01071443+537699]
  Ordinal0 [0x00F7D18E+2150798]
  Ordinal0 [0x00F81518+2168088]
  Ordinal0 [0x00F81660+2168416]
  Ordinal0 [0x00F8B330+2208560]
  BaseThreadInitThunk [0x76C9FA29+25]
  RtlGetAppContainerNamedObjectPath [0x77337A9E+286]
  RtlGetAppContainerNamedObjectPath [0x77337A6E+238]
Could anyone please help me resolve this issue, or point out the specific piece of code I am missing to fetch the text content from the Product Description? It would be a big help.
Thanks 🙏🏻.
Answer
Try this:
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# executable_path is deprecated in Selenium 4 (as your DeprecationWarning
# says), so pass a Service object instead.
browser = webdriver.Chrome(service=Service(
    r'C:\Users\paart\.wdm\drivers\chromedriver\win32\97.0.4692.71\chromedriver.exe'))
browser.maximize_window()  # For maximizing window
browser.implicitly_wait(20)  # gives an implicit wait for 20 seconds
browser.get(
    "https://www.nykaa.com/nykaa-skinshield-matte-foundation/p/460512?productId=460512&pps=1&skuId=460502")

# Zooming out and back in forces the lazily loaded description to render.
browser.execute_script("document.body.style.zoom='50%'")
time.sleep(1)
browser.execute_script("document.body.style.zoom='100%'")

# Locate the "Read More" button by its class instead of an absolute XPath.
browser.implicitly_wait(20)
loadMore = browser.find_element(By.XPATH, '//div[@class="css-mqbsar"]')
loadMore.click()

browser.implicitly_wait(20)
# In your previous code, find_elements_by_class_name('content-details')
# matched a single container element, so iterating over it did not give you
# the paragraphs. This XPath selects every <p> element under the element
# with id="content-details".
desc_data = browser.find_elements(By.XPATH, '//div[@id="content-details"]/p')

for desc in desc_data:
    para_detail = desc.text
    print(para_detail)

# If you want a specific paragraph, index into the list instead:
# para_detail = desc_data[0].text
# expiry_date = desc_data[1].text
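Once the paragraph texts are collected, the trailing "Key: Value" lines (Expiry Date, Country of Origin, and so on) can be split into a dict instead of relying on fixed indices like `desc_data[1]`. A minimal sketch of that post-processing (the helper name and the sample lines are mine, taken from the text quoted in the question):

```python
def parse_detail_lines(lines):
    """Split "Key: Value" detail paragraphs into a dict.

    Lines without a "Key: Value" shape (the free-form description text)
    are returned separately, so nothing is silently dropped.
    """
    details, free_text = {}, []
    for line in lines:
        key, sep, value = line.partition(":")
        if sep and value.strip():
            details[key.strip()] = value.strip()
        else:
            free_text.append(line.strip())
    return details, free_text

# Hypothetical usage after scraping:
# details, description = parse_detail_lines([p.text for p in desc_data])
# print(details.get("Expiry Date"))
```

This keeps the scraper working even if Nykaa reorders the paragraphs, since lookups go by key rather than by position.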
Also, don't just copy the full XPath from the Chrome DevTools; absolute paths like that are not reliable for dynamically loaded content.
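One way to avoid brittle absolute XPaths is to prefer stable attributes (an id like `content-details`) and fall back through several locators in order. A small helper along these lines; the locator pairs in the commented usage are assumptions based on the page structure described above, and `driver` only needs a `find_elements(by, value)` method, so the sketch is not tied to Selenium itself:

```python
def find_first(driver, locators):
    """Return the first element matched by any (by, value) pair in `locators`.

    Trying a stable locator (an id or semantic attribute) before a
    positional XPath keeps a scraper working when the page layout shifts.
    Raises LookupError if nothing matches.
    """
    for by, value in locators:
        elements = driver.find_elements(by, value)
        if elements:
            return elements[0]
    raise LookupError(f"no element matched any of: {locators}")

# Hypothetical usage with a real driver (By comes from
# selenium.webdriver.common.by):
# desc = find_first(browser, [
#     (By.ID, "content-details"),
#     (By.XPATH, '//div[@id="content-details"]'),
# ])
```

Because the helper is duck-typed, it works unchanged whether the fallback list holds id, CSS, or XPath locators.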