Getting Scrapy project settings when the script is outside the root directory

Date: 2021-06-07 15:58:33

I have made a Scrapy spider that can be run successfully from a script located in the root directory of the project. Since I need to run multiple spiders from different projects from the same script (this will be a Django app calling the script upon the user's request), I moved the script from the root of one of the projects to the parent directory. For some reason, the script is no longer able to get the project's custom settings, which are needed to pipeline the scraped results into the database tables. Here is the code from the Scrapy docs that I'm using to run the spider from a script:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def spiderCrawl():
    settings = get_project_settings()
    settings.set('USER_AGENT', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')
    process = CrawlerProcess(settings)
    process.crawl(MySpider3)
    process.start()

Is there some extra module that needs to be imported in order to get the project settings from outside of the project? Or do some additions need to be made to this code? Below is the code for the script that runs the spiders. Thanks.

from ticket_city_scraper.ticket_city_scraper import *
from ticket_city_scraper.ticket_city_scraper.spiders import tc_spider
from vividseats_scraper.vividseats_scraper import *
from vividseats_scraper.vividseats_scraper.spiders import vs_spider 

tc_spider.spiderCrawl()
vs_spider.spiderCrawl()

4 Answers

#1 (score: 3)

It should work. Can you share your Scrapy log file?

Edit: your approach will not work because, when you execute the script, Scrapy looks for your default settings in this order:

  1. if you have set the SCRAPY_SETTINGS_MODULE environment variable, it loads the settings module it points to
  2. if there is a scrapy.cfg file in the directory you are executing your script from, and that file points to a valid settings.py module, it loads those settings
  3. otherwise it runs with the vanilla settings provided by Scrapy (your case)

Solution 1: create a scrapy.cfg file inside the directory you run the script from (the outer folder) and point it at the valid settings.py file, as sketched below.
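
For example, a minimal scrapy.cfg might look like this (the module path is an assumption based on the question's ticket_city_scraper layout, with the script living in the projects' parent directory):

[settings]
default = ticket_city_scraper.ticket_city_scraper.settings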

Solution 2: make your parent directory a package, so that an absolute path is not required and you can use relative module paths, e.g.

python -m cron.project1
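
For that invocation to work, the parent directory needs to be importable as a package; a possible layout (cron and project1 are just the names from the example above):

cron/
    __init__.py
    project1.py     # script that starts the first project's spider
    project2.py     # script that starts the second project's spider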

Solution 3

Alternatively, you can try something like the following: leave each script where it is, inside its project directory where it is already working, and create an sh file along these lines:

  • Line 1: cd to the first project's location (its root directory)
  • Line 2: python script1.py
  • Line 3: cd to the second project's location
  • Line 4: python script2.py

Now you can execute the spiders via this sh file whenever Django requests it; a sketch follows.

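A sketch of such an sh file, assuming the two project roots from the question (the paths and script names are illustrative):

#!/bin/sh
# Run each script from inside its own project root so that
# get_project_settings() can find that project's scrapy.cfg.
cd /path/to/ticket_city_scraper
python script1.py
cd /path/to/vividseats_scraper
python script2.py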

#2 (score: 1)

This could happen because you are no longer "inside" a Scrapy project, so get_project_settings() doesn't know how to find the settings.

You can also specify the settings as a dictionary, as in the example here:

http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script

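A minimal sketch of that approach, passing the settings inline instead of relying on project discovery (the import path and pipeline entry are hypothetical; MySpider3 is the spider from the question):

from scrapy.crawler import CrawlerProcess
# illustrative import path for the question's spider class
from ticket_city_scraper.ticket_city_scraper.spiders.tc_spider import MySpider3

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    # hypothetical pipeline entry; point this at your project's actual pipeline class
    'ITEM_PIPELINES': {'ticket_city_scraper.pipelines.DatabasePipeline': 300},
})
process.crawl(MySpider3)
process.start()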

#3 (score: 1)

I have used this code to solve the problem:

import os

from scrapy.settings import Settings

settings = Settings()

# pick the settings module from the environment, falling back to the dev module
settings_module_path = os.environ.get('SCRAPY_ENV', 'project.settings.dev')
settings.setmodule(settings_module_path, priority='project')

print(settings.get('BASE_URL'))
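
With this approach, the settings module can be switched per environment when launching the script, e.g. (the module path and script name are illustrative):

SCRAPY_ENV=project.settings.prod python run.py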

#4 (score: 0)

Thanks to some of the answers already provided here, I realised Scrapy wasn't actually importing the settings.py file. This is how I fixed it.

TLDR: Make sure you set the 'SCRAPY_SETTINGS_MODULE' environment variable to the module path of your actual settings.py file. I'm doing this in the __init__() method of Scraper.

Consider a project with the following structure.

my_project/
    main.py                 # Where we are running scrapy from
    scraper/
        run_scraper.py               #Call from main goes here
        scrapy.cfg                   # deploy configuration file
        scraper/                     # project's Python module, you'll import your code from here
            __init__.py
            items.py                 # project items definition file
            pipelines.py             # project pipelines file
            settings.py              # project settings file
            spiders/                 # a directory where you'll later put your spiders
                __init__.py
                quotes_spider.py     # Contains the QuotesSpider class

Basically, the command scrapy startproject scraper was executed in the my_project folder; I then added a run_scraper.py file to the outer scraper folder, a main.py file to the root folder, and quotes_spider.py to the spiders folder.

My main file:

from scraper.run_scraper import Scraper
scraper = Scraper()
scraper.run_spiders()

My run_scraper.py file:

from scraper.scraper.spiders.quotes_spider import QuotesSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import os


class Scraper:
    def __init__(self):
        settings_file_path = 'scraper.scraper.settings'  # the settings module path as seen from the root, i.e. from main.py
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
        self.process = CrawlerProcess(get_project_settings())
        self.spider = QuotesSpider  # the spider class you want to crawl

    def run_spiders(self):
        self.process.crawl(self.spider)
        self.process.start()  # the script will block here until the crawling is finished

Also, note that the settings might require a look-over, since the module paths in settings.py need to be given relative to the root folder (my_project, not scraper). So in my case:

SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'

etc...
