python - When and how should I use multiple spiders in one Scrapy project?


I'm using Scrapy, and it is great! It's so fast to build a crawler. But the number of web sites is increasing, and I need to create new spiders. These web sites are all of the same type, so all the spiders use the same items, pipelines, and parsing process.

Contents of the project directory:

    test/
    ├── scrapy.cfg
    └── test
        ├── __init__.py
        ├── items.py
        ├── mybasespider.py
        ├── pipelines.py
        ├── settings.py
        ├── spider1_settings.py
        ├── spider2_settings.py
        └── spiders
            ├── __init__.py
            ├── spider1.py
            └── spider2.py

To reduce source code redundancy, there is a base spider MyBaseSpider in mybasespider.py; it contains 95% of the source code, and all the other spiders inherit from it. If a spider needs something special, it overrides some class methods. Normally, only a few lines of source code are needed to create a new spider.
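As an illustration, a stripped-down version of this inheritance pattern might look like the following; the hook names (extract_links, parse_item) are invented for the example, not taken from the actual project:

    import scrapy

    class MyBaseSpider(scrapy.Spider):
        """Shared crawling logic; subclasses override only the hooks they need."""

        def parse(self, response):
            # Common parsing process shared by every spider.
            for href in self.extract_links(response):
                yield response.follow(href, callback=self.parse_item)

        def extract_links(self, response):
            # Default link extraction; a subclass can override this hook.
            return response.css('a::attr(href)').getall()

        def parse_item(self, response):
            raise NotImplementedError


    class Spider1(MyBaseSpider):
        name = 'spider1'

        def parse_item(self, response):
            # Only the site-specific part lives in the subclass.
            yield {'title': response.css('title::text').get()}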

All the common settings are kept in settings.py; one spider's special settings go in [spider name]_settings.py, such as:

spider1_settings.py:

    from settings import *

    LOG_FILE = 'spider1.log'
    LOG_LEVEL = 'INFO'
    JOBDIR = 'spider1-job'
    START_URLS = ['http://test1.com/',]

spider2_settings.py:

    from settings import *

    LOG_FILE = 'spider2.log'
    LOG_LEVEL = 'DEBUG'
    JOBDIR = 'spider2-job'
    START_URLS = ['http://test2.com/',]

Scrapy uses LOG_FILE, LOG_LEVEL, and JOBDIR before launching a spider; all the URLs in START_URLS are filled into MyBaseSpider.start_urls. Different spiders have different content there, but the name START_URLS used in the base spider MyBaseSpider is never changed.
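START_URLS is not a built-in Scrapy setting, so the base spider has to read it explicitly. A minimal sketch of one way this could be wired up (the from_crawler hook is an assumption, not shown in the question):

    import scrapy

    class MyBaseSpider(scrapy.Spider):

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            # Copy the project-defined START_URLS setting (taken from the
            # settings module selected via SCRAPY_PROJECT) into start_urls.
            spider.start_urls = crawler.settings.getlist('START_URLS')
            return spider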

The contents of scrapy.cfg:

    [settings]
    default = test.settings
    spider1 = spider1.settings
    spider2 = spider2.settings

    [deploy]
    url = http://localhost:6800/
    project = test

To run a spider, such as spider1:

  1. export SCRAPY_PROJECT=spider1
  2. scrapy crawl spider1

But spiders run this way can't be used in scrapyd. The scrapyd-deploy command always uses the 'default' project name from the 'settings' section of scrapy.cfg to build an egg file and deploy it to scrapyd.

  • Is this the way to use multiple spiders in one project if I don't build a project per spider? Are there any better ways?

  • Is there a way to separate a spider's special settings as above, so that they work in scrapyd and still reduce source code redundancy?

  • If all spiders use the same JOBDIR, is it safe to run all the spiders together? Is the persistent spider state corrupted?

Any insights would be highly appreciated.

Answer:

As all spiders should have their own class, you can set the settings per spider with the custom_settings class attribute, something like this:

    import scrapy

    class MySpider1(scrapy.Spider):
        name = "spider1"
        custom_settings = {'USER_AGENT': 'user_agent_for_spider1/version1'}

    class MySpider2(scrapy.Spider):
        name = "spider2"
        custom_settings = {'USER_AGENT': 'user_agent_for_spider2/version2'}

These custom_settings will override the ones in the settings.py file, so you can still keep some global settings in that file.
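Applied to the setup in the question, the per-spider values from spider1_settings.py could move into custom_settings (a sketch under that assumption; whether log-related settings take effect from custom_settings can depend on the Scrapy version):

    from test.mybasespider import MyBaseSpider

    class Spider1(MyBaseSpider):
        name = 'spider1'
        start_urls = ['http://test1.com/']
        custom_settings = {
            # Values that previously lived in spider1_settings.py.
            'LOG_FILE': 'spider1.log',
            'LOG_LEVEL': 'INFO',
            'JOBDIR': 'spider1-job',
        }

With the settings on the class, a plain `scrapy crawl spider1` (and a scrapyd-deploy with the default project) should be enough; the SCRAPY_PROJECT environment variable and the per-spider settings modules are no longer needed.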

