python - When and how to use multiple spiders in one Scrapy project
I'm using Scrapy, and it is great! It's so fast to build a crawler. But the number of web sites to crawl keeps increasing and new spiders need to be created. These web sites are all of the same type, so all the spiders use the same items, pipelines and parsing process.
Contents of the project directory:
```
test/
├── scrapy.cfg
└── test
    ├── __init__.py
    ├── items.py
    ├── mybasespider.py
    ├── pipelines.py
    ├── settings.py
    ├── spider1_settings.py
    ├── spider2_settings.py
    └── spiders
        ├── __init__.py
        ├── spider1.py
        └── spider2.py
```

To reduce source code redundancy, there is a base spider MyBaseSpider in mybasespider.py. It contains about 95% of the source code and all the other spiders inherit from it; if a spider needs something special, it just overrides some class methods. In general, only a few lines of source code are needed to create a new spider.
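A rough sketch of that base-spider pattern (the actual code isn't shown in the question; the class layout, selectors and the build_item helper below are assumptions for illustration only):

```python
import scrapy


class MyBaseSpider(scrapy.Spider):
    """Holds the ~95% of logic shared by every site of this type."""

    # hypothetical default, overridable per site
    item_selector = 'div.listing'

    def parse(self, response):
        # shared parsing process: walk the listing rows and build items
        for row in response.css(self.item_selector):
            yield self.build_item(row, response)

    def build_item(self, row, response):
        # shared item construction; a subclass overrides this if a site differs
        return {
            'title': row.css('a::text').get(),
            'url': response.urljoin(row.css('a::attr(href)').get()),
        }


class Spider1(MyBaseSpider):
    # a new spider typically needs only a few lines like these
    name = 'spider1'
    start_urls = ['http://test1.com/']
    item_selector = 'table.results tr'  # site-specific override
```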
All the common settings are kept in settings.py, and one spider's special settings go in [spider name]_settings.py. For example, the special settings of spider1 in spider1_settings.py:
```python
from settings import *

LOG_FILE = 'spider1.log'
LOG_LEVEL = 'INFO'
JOBDIR = 'spider1-job'
START_URLS = ['http://test1.com/']
```
The special settings of spider2 in spider2_settings.py:

```python
from settings import *

LOG_FILE = 'spider2.log'
LOG_LEVEL = 'DEBUG'
JOBDIR = 'spider2-job'
START_URLS = ['http://test2.com/']
```
Scrapy uses LOG_FILE, LOG_LEVEL and JOBDIR before launching a spider. All the URLs in START_URLS are filled into MyBaseSpider.start_urls; different spiders have different contents, but the name START_URLS used in the base spider MyBaseSpider never changes.
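One way that filling could be done (a minimal sketch; the question doesn't show the actual mechanism, so reading the setting in from_crawler is an assumption):

```python
import scrapy


class MyBaseSpider(scrapy.Spider):

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # START_URLS comes from whichever spiderX_settings.py module is active,
        # so every subclass gets its own start_urls without redefining them
        spider.start_urls = crawler.settings.getlist('START_URLS')
        return spider
```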
The contents of scrapy.cfg:

```
[settings]
default = test.settings
spider1 = spider1.settings
spider2 = spider2.settings

[deploy]
url = http://localhost:6800/
project = test
```
To run a spider, such as spider1:

1. export SCRAPY_PROJECT=spider1
2. scrapy crawl spider1
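For reference, the same selection can be done programmatically (a sketch, not part of the question's setup, assuming it is run from the project root so scrapy.cfg can be found):

```python
import os

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Equivalent of `export SCRAPY_PROJECT=spider1`: get_project_settings()
# resolves this name through the [settings] section of scrapy.cfg.
os.environ['SCRAPY_PROJECT'] = 'spider1'

process = CrawlerProcess(get_project_settings())
process.crawl('spider1')  # spider name as registered in the project
process.start()
```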
But this way of selecting a settings module can't be used to run spiders in scrapyd: the scrapyd-deploy command always uses the 'default' project name from the [settings] section of scrapy.cfg to build an egg file and deploy it to scrapyd.
I have a few questions:

1. Is this the right way to use multiple spiders in one project if I don't build one project per spider? Are there any better ways?
2. How can I separate a spider's special settings as above so that the spiders can run in scrapyd while still reducing source code redundancy?
3. If all spiders use the same JOBDIR, is it safe to run all the spiders concurrently? Will the persistent spider state be corrupted?
Any insights would be highly appreciated.

All spiders should have their own class; you can set per-spider settings with the custom_settings class attribute, something like this:
```python
from scrapy import Spider


class MySpider1(Spider):
    name = "spider1"
    custom_settings = {'USER_AGENT': 'user_agent_for_spider1/version1'}


class MySpider2(Spider):
    name = "spider2"
    custom_settings = {'USER_AGENT': 'user_agent_for_spider2/version2'}
```
These custom_settings will override the ones from the settings.py file, so you can still set some global ones there.
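As a concrete example, the per-spider values from the question (e.g. JOBDIR and the start URLs) could move into custom_settings and class attributes; a sketch, assuming a project-wide USER_AGENT is defined in settings.py:

```python
import scrapy


class MySpider1(scrapy.Spider):
    name = 'spider1'
    start_urls = ['http://test1.com/']
    custom_settings = {
        'USER_AGENT': 'user_agent_for_spider1/version1',  # overrides settings.py
        'JOBDIR': 'spider1-job',                          # per-spider persistence dir
    }

    def parse(self, response):
        # self.settings is the merged view: custom_settings wins over settings.py
        self.logger.info('USER_AGENT=%s', self.settings.get('USER_AGENT'))
```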