Scrapy no longer works on a website that it used to crawl fine
I am using Scrapy to crawl this site: https://www.sephora.fr/marques/de-a-a-z/.
It worked fine a year ago, but it now fails with:
User timeout caused connection failure: Getting https://www.sephora.fr/robots.txt took longer than 180.0 seconds.
It retries a few times (the log below shows it giving up after three failed attempts) and then fails completely. I can open the URL in Chrome, but it does not work in Scrapy. I have tried custom user agents and emulating browser request headers (see the start_requests sketch after the code below), but it still does not work.
My friend ran my code, without even changing the robots setting, and did not hit the timeout. I know how to work around blocks with proxies and so on, and normally that would be the way to go, but the problem does not seem to be in my code: it seems to be in my environment. I just don't know which part of it ....
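To tell the two apart, a minimal probe outside Scrapy should help: if the sketch below also hangs on my AWS box but succeeds on my friend's machine, the block is against the host's IP rather than the spider. (The header value and the 30-second timeout are arbitrary choices for the test.)

import requests

# Minimal connectivity probe, independent of Scrapy and its middlewares.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36',
}
try:
    r = requests.get('https://www.sephora.fr/robots.txt', headers=headers, timeout=30)
    print(r.status_code, len(r.text))
except requests.exceptions.RequestException as exc:
    print('request failed:', exc)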
Here is my code:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import json
import requests
from urllib.parse import parse_qsl, urlencode
import re
from ..pipelines import Pipeline


class SephoraSpider(scrapy.Spider):
    """
    The SephoraSpider object gets you the data on all products hosted on sephora.fr.
    """

    name = 'sephora'
    allowed_domains = ['sephora.fr']
    # the URL listing all the brands
    start_urls = ['https://www.sephora.fr/marques-de-a-a-z/']
    custom_settings = {
        'DOWNLOAD_TIMEOUT': '180',
    }

    def __init__(self, *args, **kwargs):
        # accept and forward Scrapy's spider arguments
        super().__init__(*args, **kwargs)
        self.base_url = 'https://www.sephora.fr'

    def parse(self, response):
        """
        Parses the response of the first webpage we are given when we start crawling.
        This method is called automatically by Scrapy.

        :param response: the response from the webpage, triggered by a GET request
            while crawling. A Response object represents an HTTP response, which is
            usually downloaded (by the Downloader) and fed to the spiders for processing.
        :return: the results of parse_brand().
        :rtype: scrapy.Request
        """
        # if we are on the URL of the brand we are interested in (Burberry),
        # we re-dispatch it to parse_brand(); dont_filter=True keeps the dupe
        # filter from dropping a request to the URL we have already visited
        if response.url == "https://www.sephora.fr/marques/de-a-a-z/burberry-burb/":
            yield scrapy.Request(url=response.url, callback=self.parse_brand,
                                 dont_filter=True)
        # otherwise we are visiting another HTML object (another brand, a
        # higher-level URL, ...) and we call its sub-links back with another method
        else:
            self.log("parse: I just visited: " + response.url)
            urls = response.css('a.sub-category-link::attr(href)').extract()
            if urls:
                for url in urls:
                    yield scrapy.Request(url=self.base_url + url,
                                         callback=self.parse_brand)

    ...
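By "emulating browser request headers" I mean a variant of the spider along these lines (a sketch of a method added to SephoraSpider; the exact header set is my guess at what a browser sends, and it made no difference):

    def start_requests(self):
        # variant tried: issue the start request with explicit browser-like headers
        browser_headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.9',
        }
        for url in self.start_urls:
            yield scrapy.Request(url, headers=browser_headers, callback=self.parse)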
Scrapy log:
(scr_env) [email protected]:~/environment/bass2/scraper (master) $ scrapy crawl sephora
2022-03-13 16:39:19 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: nosetime_scraper)
2022-03-13 16:39:19 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.2.0, Python 3.6.9 (default, Dec 8 2021, 21:08:43) - [GCC 8.4.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform Linux-5.4.0-1068-aws-x86_64-with-Ubuntu-18.04-bionic
2022-03-13 16:39:19 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'nosetime_scraper',
'CONCURRENT_REQUESTS': 1,
'COOKIES_ENABLED': False,
'DOWNLOAD_DELAY': 7,
'DOWNLOAD_TIMEOUT': '180',
'EDITOR': '',
'NEWSPIDER_MODULE': 'nosetime_scraper.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['nosetime_scraper.spiders'],
'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 '
'Safari/537.36'}
2022-03-13 16:39:19 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-03-13 16:39:19 [scrapy.extensions.telnet] INFO: Telnet Password: af81c5b648cc3542
2022-03-13 16:39:19 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2022-03-13 16:39:19 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-13 16:39:19 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-13 16:39:19 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-13 16:39:19 [scrapy.core.engine] INFO: Spider opened
2022-03-13 16:39:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:39:19 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-13 16:40:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:41:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:42:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:42:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.sephora.fr/robots.txt> (failed 1 times): User timeout caused connection failure: Getting https://www.sephora.fr/robots.txt took longer than 180.0 seconds..
2022-03-13 16:42:19 [py.warnings] WARNING: /home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/scrapy/core/engine.py:276: ScrapyDeprecationWarning: Passing a 'spider' argument to ExecutionEngine.download is deprecated
return self.download(result, spider) if isinstance(result, Request) else result
2022-03-13 16:43:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
current.result, *args, **kwargs
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py", line 360, in _cb_timeout
raise TimeoutError(f"Getting {url} took longer than {timeout} seconds.")
twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting https://www.sephora.fr/robots.txt took longer than 180.0 seconds..
2022-03-13 16:49:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:50:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:51:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:51:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.sephora.fr/marques-de-a-a-z/> (failed 1 times): User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 16:52:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:53:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:54:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:54:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.sephora.fr/marques-de-a-a-z/> (failed 2 times): User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 16:55:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:56:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:57:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:57:19 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.sephora.fr/marques-de-a-a-z/> (failed 3 times): User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 16:57:19 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.sephora.fr/marques-de-a-a-z/>
Traceback (most recent call last):
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 1657, in _inlineCallbacks
cast(Failure, result).throwExceptionIntoGenerator, gen
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 62, in run
return f(*args, **kwargs)
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
return (yield download_func(request=request, spider=spider))
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 858, in _runCallbacks
current.result, *args, **kwargs
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py", line 360, in _cb_timeout
raise TimeoutError(f"Getting {url} took longer than {timeout} seconds.")
twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 16:57:19 [scrapy.core.engine] INFO: Closing spider (finished)
2022-03-13 16:57:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 6,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 6,
'downloader/request_bytes': 1881,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 6,
'elapsed_time_seconds': 1080.231435,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 3, 13, 16, 57, 19, 904633),
'log_count/DEBUG': 5,
'log_count/ERROR': 4,
'log_count/INFO': 28,
'log_count/WARNING': 1,
'memusage/max': 72749056,
'memusage/startup': 70950912,
'retry/count': 4,
'retry/max_reached': 2,
'retry/reason_count/twisted.internet.error.TimeoutError': 4,
"robotstxt/exception_count/<class 'twisted.internet.error.TimeoutError'>": 1,
'robotstxt/request_count': 1,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2022, 3, 13, 16, 39, 19, 673198)}
2022-03-13 16:57:19 [scrapy.core.engine] INFO: Spider closed (finished)
I am going to inspect the request headers with Fiddler and run some tests. Maybe Scrapy sends a Connection: close header by default, which would explain why I get no response from the Sephora site?
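One way to test that hypothesis without Fiddler would be to force the header through the default request headers in settings.py (a sketch; I am not certain Scrapy's downloader honours an explicit Connection header, so this is purely an experiment):

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    # experiment: explicitly ask for a persistent connection
    'Connection': 'keep-alive',
}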
Here are the logs from a run where I chose not to obey robots.txt:
(scr_env) [email protected]:~/environment/bass2/scraper (master) $ scrapy crawl sephora
2022-03-13 23:23:38 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: nosetime_scraper)
2022-03-13 23:23:38 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.2.0, Python 3.6.9 (default, Dec 8 2021, 21:08:43) - [GCC 8.4.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform Linux-5.4.0-1068-aws-x86_64-with-Ubuntu-18.04-bionic
2022-03-13 23:23:38 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'nosetime_scraper',
'CONCURRENT_REQUESTS': 1,
'COOKIES_ENABLED': False,
'DOWNLOAD_DELAY': 7,
'DOWNLOAD_TIMEOUT': '180',
'EDITOR': '',
'NEWSPIDER_MODULE': 'nosetime_scraper.spiders',
'SPIDER_MODULES': ['nosetime_scraper.spiders'],
'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 '
'Safari/537.36'}
2022-03-13 23:23:38 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-03-13 23:23:38 [scrapy.extensions.telnet] INFO: Telnet Password: 3f4205a34aff02c5
2022-03-13 23:23:38 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2022-03-13 23:23:38 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-13 23:23:38 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-13 23:23:38 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-13 23:23:38 [scrapy.core.engine] INFO: Spider opened
2022-03-13 23:23:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:23:38 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-13 23:24:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:25:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:26:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:26:38 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.sephora.fr/marques-de-a-a-z/> (failed 1 times): User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 23:27:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:28:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:29:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:29:38 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.sephora.fr/marques-de-a-a-z/> (failed 2 times): User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 23:30:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:31:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:32:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:32:38 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.sephora.fr/marques-de-a-a-z/> (failed 3 times): User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 23:32:38 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.sephora.fr/marques-de-a-a-z/>
Traceback (most recent call last):
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 1657, in _inlineCallbacks
cast(Failure, result).throwExceptionIntoGenerator, gen
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 62, in run
return f(*args, **kwargs)
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
return (yield download_func(request=request, spider=spider))
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 858, in _runCallbacks
current.result, *args, **kwargs
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py", line 360, in _cb_timeout
raise TimeoutError(f"Getting {url} took longer than {timeout} seconds.")
twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 23:32:39 [scrapy.core.engine] INFO: Closing spider (finished)
2022-03-13 23:32:39 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 3,
'downloader/request_bytes': 951,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'elapsed_time_seconds': 540.224149,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 3, 13, 23, 32, 39, 59500),
'log_count/DEBUG': 3,
'log_count/ERROR': 2,
'log_count/INFO': 19,
'memusage/max': 72196096,
'memusage/startup': 70766592,
'retry/count': 2,
'retry/max_reached': 1,
'retry/reason_count/twisted.internet.error.TimeoutError': 2,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2022, 3, 13, 23, 23, 38, 835351)}
2022-03-13 23:32:39 [scrapy.core.engine] INFO: Spider closed (finished)
And here is the output of pip list in my environment:
(scr_env) C:\Users\antoi\Documents\Programming\Work\scrapy-scraper>pip list
Package Version
------------------- ------------------
async-generator 1.10
attrs 21.4.0
Automat 20.2.0
beautifulsoup4 4.10.0
blis 0.7.5
bs4 0.0.1
catalogue 2.0.6
certifi 2021.10.8
cffi 1.15.0
charset-normalizer 2.0.12
click 8.0.4
colorama 0.4.4
configparser 5.2.0
constantly 15.1.0
crayons 0.4.0
cryptography 36.0.1
cssselect 1.1.0
cymem 2.0.6
DAWG-Python 0.7.2
docopt 0.6.2
en-core-web-sm 3.2.0
et-xmlfile 1.1.0
geographiclib 1.52
geopy 2.2.0
h11 0.13.0
h2 3.2.0
hpack 3.0.0
hyperframe 5.2.0
hyperlink 21.0.0
idna 3.3
incremental 21.3.0
itemadapter 0.4.0
itemloaders 1.0.4
Jinja2 3.0.3
jmespath 0.10.0
langcodes 3.3.0
libretranslatepy 2.1.1
lxml 4.8.0
MarkupSafe 2.1.0
murmurhash 1.0.6
numpy 1.22.2
openpyxl 3.0.9
outcome 1.1.0
packaging 21.3
pandas 1.4.1
parsel 1.6.0
pathy 0.6.1
pip 22.0.4
preshed 3.0.6
priority 1.3.0
Protego 0.2.1
pyaes 1.6.1
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycparser 2.21
pydantic 1.8.2
PyDispatcher 2.0.5
pymongo 3.11.0
pymorphy2 0.9.1
pymorphy2-dicts-ru 2.4.417127.4579844
pyOpenSSL 22.0.0
pyparsing 3.0.7
PySocks 1.7.1
python-dateutil 2.8.2
pytz 2021.3
queuelib 1.6.2
requests 2.27.1
rsa 4.8
ru-core-news-md 3.2.0
Scrapy 2.5.1
selenium 4.1.2
service-identity 21.1.0
setuptools 56.0.0
six 1.16.0
smart-open 5.2.1
sniffio 1.2.0
sortedcontainers 2.4.0
soupsieve 2.3.1
spacy 3.2.2
spacy-legacy 3.0.9
spacy-loggers 1.0.1
srsly 2.4.2
Telethon 1.24.0
thinc 8.0.13
tqdm 4.62.3
translate 3.6.1
trio 0.20.0
trio-websocket 0.9.2
Twisted 22.1.0
twisted-iocpsupport 1.0.2
typer 0.4.0
typing_extensions 4.1.1
urllib3 1.26.8
w3lib 1.22.0
wasabi 0.9.0
webdriver-manager 3.5.3
wsproto 1.0.0
zope.interface 5.4.0
With scrapy runspider sephora.py I noticed that it does not accept my relative import from ..pipelines import Pipeline:
(scr_env) C:\Users\antoi\Documents\Programming\Work\scrapy-scraper\nosetime_scraper\spiders>scrapy runspider sephora.py
2022-03-14 01:00:27 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: nosetime_scraper)
2022-03-14 01:00:27 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.
9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform
Windows-10-10.0.19043-SP0
2022-03-14 01:00:27 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
Usage
=====
scrapy runspider [options] <spider_file>
runspider: error: Unable to load 'sephora.py': attempted relative import with no known parent package
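As far as I understand, the relative import only resolves when the spider module is loaded as part of the nosetime_scraper package, which is what scrapy crawl sephora does from the project root; runspider on the bare file has no parent package. A workaround, assuming the project root is on the path, is to make the import absolute:

# sephora.py: absolute import instead of the relative one
from nosetime_scraper.pipelines import Pipeline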
Here is settings.py:
# Scrapy settings for nosetime_scraper project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'nosetime_scraper'
SPIDER_MODULES = ['nosetime_scraper.spiders']
NEWSPIDER_MODULE = 'nosetime_scraper.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 7
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
COOKIES_ENABLED = True
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'nosetime_scraper.middlewares.NosetimeScraperSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'nosetime_scraper.middlewares.NosetimeScraperDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'nosetime_scraper.pipelines.NosetimeScraperPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'