Cron job fails in GAE Python

Time: 2022-02-11 08:04:16

I have a script on Google App Engine that is started every 20 minutes by cron.yaml. This works locally, on my own machine. When I (manually) go to the URL that starts the script online, it also works. However, the script always fails to complete online, on Google's instances, when cron.yaml is in charge of starting it.

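For reference, the cron.yaml entry looks roughly like this, with /scrape standing in for whatever URL ScrapeHandler is actually mapped to:

cron:
- description: scrape news articles
  url: /scrape
  schedule: every 20 minutes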

The log shows no errors, only 2 debug messages:

D 2013-07-23 06:00:08.449
type(soup): <class 'bs4.BeautifulSoup'> END type(soup)

D 2013-07-23 06:00:11.246
type(soup): <class 'bs4.BeautifulSoup'> END type(soup)

Here's my script:

# coding: utf-8
import jinja2, webapp2, urllib2, re

from bs4 import BeautifulSoup as bs
from google.appengine.api import memcache
from google.appengine.ext import db

class Article(db.Model):
    content = db.TextProperty()
    datetime = db.DateTimeProperty(auto_now_add=True)
    companies = db.ListProperty(db.Key)
    url = db.StringProperty()

class Company(db.Model):
    name = db.StringProperty() 
    ticker = db.StringProperty()

    @property
    def articles(self):
        return Article.gql("WHERE companies = :1", self.key()) 

def companies_key(companies_name=None):
    return db.Key.from_path('Companies', companies_name or 'default_companies')

def articles_key(articles_name=None):
    return db.Key.from_path('Articles', articles_name or 'default_articles')

def scrape():
    companies = memcache.get("companies")
    if not companies:
        # Cache the fetched entities rather than the lazy Query object.
        companies = list(Company.all())
        memcache.add("companies", companies, 30)
    for company in companies:
        # Use a distinct name so the links() function is not shadowed.
        company_links = set(links(company.ticker))
        for link in company_links:
            if link != "None":  # compare by value, not identity
                article_object = Article()
                text = fetch(link)
                article_object.content = text
                article_object.url = link
                article_object.companies.append(company.key())  # doesn't work.
                article_object.put()

def fetch(link):
    try:
        html = urllib2.urlopen(link).read()  # fetch the link argument
        soup = bs(html)
    except:
        return "None"
    text = soup.get_text()
    text = text.encode('utf-8')
    text = text.decode('utf-8')
    text = unicode(text)
    if text != "None":
        return text
    else:
        return "None"


def links(ticker):
    url = "https://www.google.com/finance/company_news?q=NASDAQ:" + ticker + "&start=10&num=10"
    html = urllib2.urlopen(url).read()
    soup = bs(html)
    div_class = re.compile("^g-section.*")
    divs = soup.find_all("div", {"class" : div_class})
    links = []
    for div in divs:
        a = unicode(div.find('a', attrs={'href': re.compile("^http://")})) 
        link_regex = re.search("(http://.*?)\"",a)
        try:
            link = link_regex.group(1)
            soup = bs(link)
            link = soup.get_text() 
        except:
            link = "None"
        links.append(link)

    return links

...and the script's handler in main:

class ScrapeHandler(webapp2.RequestHandler):
    def get(self):
        scrape.scrape()
        self.redirect("/")

My guess is that the problem might be the double for loop in the scrape script, but I don't understand exactly why.

Update: Articles are indeed being scraped (as many as there should be), and now there are no log errors, or even any debug messages at all. Judging from the log, the cron job seems to execute perfectly. Even so, App Engine's cron job panel says the cron job failed.

1 solution

#1


I'm pretty sure this error was due to a DeadlineExceededError, which I did not run into locally. My scrape() script now does its thing on fewer companies and articles per run, and no longer exceeds the deadline.

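One way to stay under the deadline, sketched below rather than taken from my actual code, is to have the cron request only enqueue work: one deferred task per company, since a task queue task gets a longer deadline than a normal request. The scrape_company() helper is hypothetical and just wraps the per-company part of scrape() above (using deferred also requires enabling the deferred builtin in app.yaml):

from google.appengine.ext import deferred

def scrape_company(company_key):
    # Runs as its own task queue request, which gets a longer deadline
    # than the cron-triggered request that enqueued it.
    company = Company.get(company_key)
    for link in set(links(company.ticker)):
        if link != "None":
            article_object = Article()
            article_object.content = fetch(link)
            article_object.url = link
            article_object.companies.append(company.key())
            article_object.put()

def scrape():
    # The cron handler now only schedules work, so it finishes quickly.
    for company in Company.all():
        deferred.defer(scrape_company, company.key())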
