Scraping with Qt in Django works once, then the next request crashes with "QApplication was not created in the main() thread"

Date: 2021-09-05 00:36:21

I am building a Django-based scraper in which a user can enter a search term. I use the search term(s) to build a URL and query the site, which returns un-rendered HTML and JS. I then take the POST request and render the page by creating a QWebPage, passing it the URL, and grabbing the frame's rendered HTML. This works once in my Django app; the next POST request crashes the site.

My first concern is that with the current setup I am forced to run everything under the xvfb-run wrapper. Will this pose a problem when I deploy? Put better: can I use an xvfb wrapper in production somehow?

That said, I am able to make one POST request, and it returns the page I am looking for. If I hit back and send another request, I see the following errors in the console, and the ./manage.py server then shuts down:

WARNING: QApplication was not created in the main() thread.
QObject::connect: Cannot connect (null)::configurationAdded(QNetworkConfiguration) to QNetworkConfigurationManager::configurationAdded(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::configurationRemoved(QNetworkConfiguration) to QNetworkConfigurationManager::configurationRemoved(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::configurationChanged(QNetworkConfiguration) to QNetworkConfigurationManager::configurationChanged(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::onlineStateChanged(bool) to QNetworkConfigurationManager::onlineStateChanged(bool)
QObject::connect: Cannot connect (null)::configurationUpdateComplete() to QNetworkConfigurationManager::updateCompleted()
Segmentation fault (core dumped)

I will admit that I do not understand what the error is in particular, since I'm rather new to threading concepts. I am not sure whether it means that it can't reconnect to the xvfb wrapper that's already running, or whether it really is a threading issue. The code that works once is below. It has been changed slightly, since I don't want to show the site I'm actually scraping. Also, I am not hunting for data in this sample; it simply brings rendered HTML to your browser as a test:

import sys
from django.shortcuts import render

# Create your views here.
from django.http import HttpResponse
from django.http import HttpResponseRedirect
from django.views.generic import View
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import * 
from bs4 import BeautifulSoup 

from .forms import QueryForm

def query(request):
        results = google.search("Real Estate")
        context = {'results': results}
        return render(request, 'searchlistings/search.html', context)

class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)  
    self.loadFinished.connect(self._loadFinished)  
    self.mainFrame().load(QUrl(url))  
    self.app.exec_()  

  def _loadFinished(self, result):  
    self.frame = self.mainFrame()  
    self.app.quit()

class SearchView(View):
    form_class = QueryForm
    template_name = 'searchlistings/index.html'

    def get(self, request, *args, **kwargs):
        form = self.form_class()
        return render(request, self.template_name, {'form': form})

    def post(self, request, *args, **kwargs):
        form = self.form_class(request.POST)
        if form.is_valid():
            query = form.cleaned_data['query']
            context = self.isOnSite(query)
            #return context
            #return render(request, 'searchlistings/search.html', {'context': context})
            return HttpResponse(context)

    def isOnSite(self, query):
        url = "http://google.com"
        #This does the magic.Loads everything
        r = Render(url)  
        #result is a QString.
        result = r.frame.toHtml()
        r.app.quit()
        return result;
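
A quick way to see whether this is a threading issue is to check which thread the view runs in. The sketch below is not part of the original code; `running_in_main_thread` and `check_from_worker` are hypothetical helper names. It only shows that code started from a worker thread is not in the main thread, which is the situation Django's development server creates when it handles requests in threads (its default behaviour).

```python
import threading


def running_in_main_thread():
    """Return True only when called from the interpreter's main thread."""
    return threading.current_thread() is threading.main_thread()


def check_from_worker():
    """Run the check inside a worker thread, as a threaded server would."""
    result = []
    worker = threading.Thread(
        target=lambda: result.append(running_in_main_thread())
    )
    worker.start()
    worker.join()
    return result[0]
```

Calling `running_in_main_thread()` at the top of `post()` would likely report `False` under `./manage.py runserver`, which would match the first warning line in the console output.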

So my primary questions are these:

  1. Is the XVFB wrapper appropriate here, and can I use this setup in production on a different host? Will it work anywhere other than my local Vagrant box?

  2. The main() thread issue: is this a threading issue, or a failure to connect back to the xvfb server? Can it be resolved with Celery or something similar?

  3. Is this an appropriate way to do what I want? I've seen lots of other solutions, including scrapyjs, spynner, selenium and so on, but they seem either overly complicated or based on Qt. A better question: do any of these alternative packages solve the main() thread issue?

Thanks for your help!

1 Answer

#1


0  

OK, the solution here was to use twill, as documented at http://twill.idyll.org/python-api.html. I am able to run it without the xvfb wrapper, and it is much faster than the previous methods, with much less overhead. I can recommend it.
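
If a rendered page is still needed (twill does not execute JavaScript), one option that avoids the main() thread problem is to run the Qt code in a fresh Python process for every request, so that each QApplication is created in the main thread of its own short-lived process. The following is a minimal sketch under assumptions, not a tested production setup: `render_in_child` and `RENDER_SCRIPT` are hypothetical names, the child script is adapted from the `Render` class in the question, and it assumes Python 3 with PyQt4 installed and a display available to the child (for example via xvfb-run).

```python
import subprocess
import sys

# Child script adapted from the Render class in the question. It runs as the
# main program of a new interpreter, so QApplication lives in that process's
# main thread. Assumption: Python 3 with PyQt4 installed.
RENDER_SCRIPT = r"""
import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

app = QApplication(sys.argv)
page = QWebPage()
page.loadFinished.connect(lambda ok: app.quit())
page.mainFrame().load(QUrl(sys.argv[1]))
app.exec_()
sys.stdout.buffer.write(page.mainFrame().toHtml().encode("utf-8"))
"""


def render_in_child(url, script=RENDER_SCRIPT, timeout=60):
    """Run `script` in a new Python process and return what it prints."""
    proc = subprocess.run(
        [sys.executable, "-c", script, url],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the child crashes
    )
    return proc.stdout.decode("utf-8")
```

In the view, `isOnSite` could then return `render_in_child(url)` instead of constructing `Render` directly. A crash in the child surfaces as `subprocess.CalledProcessError` rather than taking down the Django process. A Celery worker, as raised in question 2, could be another way to move the rendering out of the request thread.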
