Scrapy doesn't show dynamic content, Selenium gets redirected: how do I get the pages as displayed when web scraping?
I don't understand why, when I use Scrapy, the HTML I get differs from the page source I see in the browser for the URL https://www.nordstrom.com/browse/beauty/fragrance/perfume. Here is the Scrapy output:
(scr_env) C:\Users\antoi\Documents\Programming\Work\scrapy-scraper>scrapy shell https://www.nordstrom.com/browse/beauty/fragrance/perfume
2022-02-23 12:48:46 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: nosetime_scraper)
2022-02-23 12:48:46 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.
9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform
Windows-10-10.0.19043-SP0
2022-02-23 12:48:46 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-02-23 12:48:46 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'nosetime_scraper',
'CONCURRENT_REQUESTS': 1,
'COOKIES_ENABLED': False,
'DOWNLOAD_DELAY': 7,
'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
'LOGSTATS_INTERVAL': 0,
'NEWSPIDER_MODULE': 'nosetime_scraper.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['nosetime_scraper.spiders'],
'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 '
'Safari/537.36'}
2022-02-23 12:48:46 [scrapy.extensions.telnet] INFO: Telnet Password: 23c7a4cba0ad587c
2022-02-23 12:48:46 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole']
2022-02-23 12:48:46 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-02-23 12:48:46 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-02-23 12:48:46 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-02-23 12:48:46 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-02-23 12:48:46 [scrapy.core.engine] INFO: Spider opened
2022-02-23 12:48:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.nordstrom.com/robots.txt> (referer: None)
2022-02-23 12:48:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.nordstrom.com/browse/beauty/fragrance/perfume> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x000002A56E53F700>
[s] item {}
[s] request <GET https://www.nordstrom.com/browse/beauty/fragrance/perfume>
[s] response <200 https://www.nordstrom.com/browse/beauty/fragrance/perfume>
[s] settings <scrapy.settings.Settings object at 0x000002A56E53FD30>
[s] spider <DefaultSpider 'default' at 0x2a56ed171c0>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>> response.css('head')
[<Selector xpath='descendant-or-self::head' data='<head>\n <title></title>\n <style>\n ...'>]
As you can see, the result [<Selector xpath='descendant-or-self::head' data='<head>\n <title></title>\n <style>\n ...'>] differs from the page source:
<!doctype html>
<html lang="en-us" xml:lang="en-us" >
<head>
<title data-react-helmet="true">Women's Perfume & Fragrances | Nordstrom</title>
<meta data-react-helmet="true" charset="UTF-8"/><meta data-react-helmet="true" name="viewport" content="width=980"/><meta data-react-helmet="true" http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"/><meta data-react-helmet="true" name="description" content="Free shipping and returns on women's perfume and fragrances at Nordstrom.com. Find perfume, eau de toilette, and eau de parfum from top brands."/><meta data-react-helmet="true" name="keywords" content=""/><meta data-react-helmet="true" property="og:title" content="Women's Perfume & Fragrances | Nordstrom"/><meta data-react-helmet="true" property="og:description" content="Free shipping and returns on women's perfume and fragrances at Nordstrom.com. Find perfume, eau de toilette, and eau de parfum from top brands."/><meta data-react-helmet="true" property="og:url" content="https://www.nordstrom.com/browse/beauty/fragrance/perfume"/><meta data-react-helmet="true" property="og:site_name" content="Nordstrom"/><meta data-react-helmet="true" property="og:type" content="website"/><meta data-react-helmet="true" id="page-team" property="page-team" content="sbn"/>
<meta property="fb:app_id" content="143447719050737" />
The title should be Women's Perfume & Fragrances | Nordstrom
I also tried Selenium:
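To make the mismatch concrete, here is a minimal sketch that pulls the `<title>` text out of an HTML string using only the standard library. The two sample strings are shortened stand-ins for the Scrapy response and the browser source shown above, and `TitleParser` and `extract_title` are illustrative names, not part of Scrapy.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the stripped <title> text, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


# Shortened stand-in for what Scrapy received (empty title).
scrapy_head = "<head>\n <title></title>\n <style>\n </style></head>"
# Shortened stand-in for the source seen in the browser.
browser_head = (
    '<head><title data-react-helmet="true">'
    "Women's Perfume &amp; Fragrances | Nordstrom</title></head>"
)

print(repr(extract_title(scrapy_head)))
print(repr(extract_title(browser_head)))
```

Inside `scrapy shell`, `response.css('title::text').get()` does the same kind of check on the live response.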
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
import time
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
# driver = webdriver.Chrome(ChromeDriverManager().install())
# Pass the options to the driver; otherwise the settings above are never applied.
driver = webdriver.Chrome(options=options)
driver.get("https://www.nordstrom.com/browse/beauty/fragrance/perfume")
time.sleep(30)
But the site detects that I am using Chrome and even sends me to an invitation page for businesses only.
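One way to confirm the redirect, sketched under the assumption that `driver` is the Selenium WebDriver from the snippet above: compare the requested URL with `driver.current_url` and look at `driver.title` once the page has loaded. `LandingReport`, `describe_landing`, and `FakeDriver` are hypothetical names, and the example.com URL and its title are made up so the sketch can run without a browser.

```python
from dataclasses import dataclass

REQUESTED_URL = "https://www.nordstrom.com/browse/beauty/fragrance/perfume"


@dataclass
class LandingReport:
    requested_url: str
    final_url: str
    title: str

    @property
    def redirected(self):
        # Ignore a trailing slash when comparing the two URLs.
        return self.final_url.rstrip("/") != self.requested_url.rstrip("/")


def describe_landing(driver, requested_url):
    """Summarise where the browser ended up after driver.get(requested_url)."""
    return LandingReport(requested_url, driver.current_url, driver.title)


class FakeDriver:
    """Hypothetical stand-in exposing the two WebDriver attributes used here."""

    def __init__(self, current_url, title):
        self.current_url = current_url
        self.title = title


# Made-up landing page, for illustration only.
redirected_driver = FakeDriver("https://example.com/invitation", "Invitation")
report = describe_landing(redirected_driver, REQUESTED_URL)
print(report.redirected, report.final_url, report.title)
```

With a real session, calling `describe_landing(driver, REQUESTED_URL)` after `driver.get(...)` would show whether the browser stayed on the perfume listing or was sent somewhere else.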