I am working on a Django-based scraper in which a user can enter a search term. I use the search term(s) to build a URL and query the site, which returns un-rendered HTML and JS. I then handle the POST request by rendering the page: I create a QWebPage, pass it the URL, and grab the frame's rendered HTML. This works once in my Django app; the next POST request crashes the site.
My first concern is that, in the current setup, I am forced to use the xvfb-run wrapper to run it. Is this going to pose an issue when I deploy? A better question is: can I use an xvfb wrapper in production somehow?
With that said, I am able to make one POST request, and it returns the page that I am looking for. If I hit back and send another request, I see the following errors in the console, and the ./manage.py server then shuts down:
WARNING: QApplication was not created in the main() thread.
QObject::connect: Cannot connect (null)::configurationAdded(QNetworkConfiguration) to QNetworkConfigurationManager::configurationAdded(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::configurationRemoved(QNetworkConfiguration) to QNetworkConfigurationManager::configurationRemoved(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::configurationChanged(QNetworkConfiguration) to QNetworkConfigurationManager::configurationChanged(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::onlineStateChanged(bool) to QNetworkConfigurationManager::onlineStateChanged(bool)
QObject::connect: Cannot connect (null)::configurationUpdateComplete() to QNetworkConfigurationManager::updateCompleted()
Segmentation fault (core dumped)
I will admit that I do not understand what in particular the error is here, since I'm rather new to threading concepts. I am uncertain whether this error means that it can't reconnect to the xvfb wrapper that's already running, or whether it is indeed a threading issue. The code that works once is below. It has been changed slightly since I don't want to show the site I'm actually scraping. Also, I am not hunting for data in this sample; it simply brings rendered HTML to your browser as a test:
    import sys
    from django.shortcuts import render
    # Create your views here.
    from django.http import HttpResponse
    from django.http import HttpResponseRedirect
    from django.views.generic import View
    from PyQt4.QtGui import *
    from PyQt4.QtCore import *
    from PyQt4.QtWebKit import *
    from bs4 import BeautifulSoup
    from .forms import QueryForm

    def query(request):
        results = google.search("Real Estate")
        context = {'results': results}
        return render(request, 'searchlistings/search.html', context)

    class Render(QWebPage):
        def __init__(self, url):
            self.app = QApplication(sys.argv)
            QWebPage.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            self.mainFrame().load(QUrl(url))
            self.app.exec_()

        def _loadFinished(self, result):
            self.frame = self.mainFrame()
            self.app.quit()

    class SearchView(View):
        form_class = QueryForm
        template_name = 'searchlistings/index.html'

        def get(self, request, *args, **kwargs):
            form = self.form_class()
            return render(request, self.template_name, {'form': form})

        def post(self, request, *args, **kwargs):
            form = self.form_class(request.POST)
            if form.is_valid():
                query = form.cleaned_data['query']
                context = self.isOnSite(query)
                #return context
                #return render(request, 'searchlistings/search.html', {'context': context})
                return HttpResponse(context)

        def isOnSite(self, query):
            url = "http://google.com"
            #This does the magic.Loads everything
            r = Render(url)
            #result is a QString.
            result = r.frame.toHtml()
            r.app.quit()
            return result;
So my primary questions are these:
- Is the xvfb wrapper appropriate here, and can I use this setup in production on a different host? Will it work anywhere other than my local Vagrant box?
- The main() thread issue: is this a threading issue, or a failure to connect back to the xvfb server? Can it be resolved with Celery or something similar?
- Is this an appropriate way to do what I want? I've seen lots of other solutions, including scrapyjs, spynner, selenium and so on, but they seem either overly complicated or based on Qt. A better question is: do any of these alternative packages solve the main() thread issue?
Thanks for your help!
1 Answer
#1
0
OK, the solution here was to use twill, as documented at http://twill.idyll.org/python-api.html. I am able to run it without the xvfb wrapper, and it is much faster than the previous method, with much less overhead. I can recommend it.
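For readers who still need the Qt renderer, one possible workaround (not what was used above, and only a sketch) is to run the rendering in a fresh child process for each request. The warning in the question says the QApplication was not created in the main thread, and the Django development server normally handles requests in worker threads; a child process gets its own main thread, and a crash there does not take the server down with it. The helper name `render_in_child` and the script name `render_page.py` are hypothetical, and `subprocess.run` assumes Python 3.5 or later.

```python
import subprocess
import sys


def render_in_child(argv, timeout=60):
    """Run argv as a separate process and return its stdout as text.

    Raises RuntimeError if the child exits with a non-zero status,
    which is what a segfaulting renderer would produce.
    """
    proc = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
    )
    if proc.returncode != 0:
        raise RuntimeError(
            "renderer exited with status %d: %s"
            % (proc.returncode, proc.stderr.decode("utf-8", "replace"))
        )
    return proc.stdout.decode("utf-8")


if __name__ == "__main__":
    # Stand-in child command so the sketch runs without Qt installed.
    # In the view it might instead look like (render_page.py is hypothetical
    # and would hold the Render class and print the rendered HTML):
    #   render_in_child(["xvfb-run", "-a", sys.executable, "render_page.py", url])
    demo = [sys.executable, "-c", "print('<html></html>')"]
    print(render_in_child(demo))
```

Starting a process per request is slow, so a task queue such as Celery would be the natural next step if the volume grows; the point here is only that the Qt event loop lives and dies in its own process.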