I am trying to implement a solution on AWS which is as follows:
I have a crawler that will run once a day to index certain sites. I want to cache this data and expose it in the form of an API, since after crawling the data will not change for an entire day. After the crawler refetches, I want to invalidate and rebuild this cache to serve the updated data. I'm trying to use a serverless architecture to build this.
Possible Solutions
It is clear that the crawler will run on AWS Lambda. What is unclear to me is how to manage the cache that will serve the data. Here are some solutions I thought of:
- S3 and CloudFront for caching: After crawling, store the data as .json files in S3, which will be cached using AWS CloudFront. When the crawler refetches new data, it will rebuild these files and ask CloudFront to invalidate the cache.
- API Gateway and DynamoDB: After crawling, store the data in DynamoDB, which will then be served through API Gateway with caching enabled. The only problem here is: how can I ask for this cache to be invalidated at the end of the day when the crawler re-crawls? And since the data will be static for a day, how can I avoid paying for the extra time that DynamoDB will be running? (If I implement caching on API Gateway, there will be only one call to DynamoDB to fill the cache; after that it will sit idle for a day.)
Is there any other way that I am missing?
Thanks!
1 solution
#1
You can store each day's data under a different path in S3 that includes the creation date. Maybe something like:
index_2017_08_11.json
Then there is no need to invalidate caches on the CloudFront side. Since the new objects are accessed through new URLs, the old CloudFront cache entries won't be an issue. You can remove the previous day's S3 files automatically with an S3 lifecycle expiration rule (S3's equivalent of a TTL).
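As a rough sketch of the dated-key idea, the crawler could build the object key from the crawl date and attach a lifecycle rule that expires old files. The function names, the `index_` prefix, the rule ID, and the two-day retention are assumptions made for illustration; only `put_object` and `put_bucket_lifecycle_configuration` are real boto3 S3 client calls.

```python
import json
from datetime import date


def dated_key(day: date, prefix: str = "index") -> str:
    """Build an object key such as index_2017_08_11.json for the given day."""
    return f"{prefix}_{day:%Y_%m_%d}.json"


def expiration_rule(prefix: str = "index_", days: int = 2) -> dict:
    """Lifecycle configuration that expires objects under `prefix` after `days` days."""
    return {
        "Rules": [
            {
                "ID": "expire-old-crawl-output",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def publish(bucket: str, records: list, day: date) -> str:
    """Upload one day's crawl output; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    s3 = boto3.client("s3")
    key = dated_key(day)
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(records).encode("utf-8"),
        ContentType="application/json",
    )
    # Note: this call replaces any existing lifecycle configuration on the bucket.
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=expiration_rule()
    )
    return key
```

With this layout, clients need some way to learn the current key, for example by computing it from today's date in the same way the crawler does.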
Another option is to set the Expires HTTP caching header to specify when the cached data should expire:
The Expires header field lets you specify an expiration date and time using the format specified in RFC 2616, Hypertext Transfer Protocol -- HTTP/1.1 Section 3.3.1, Full Date, for example: Sat, 27 Jun 2015 23:59:59 GMT
You can set this header on the API Gateway response to specify when an object should expire.
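For illustration, here is a minimal sketch of producing such an Expires value in a Lambda function behind API Gateway, assuming a Lambda proxy integration and a crawler that runs daily at 03:00 UTC. The crawl hour, the function names, and the placeholder body are all hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def expires_at_next_crawl(now: datetime, crawl_hour_utc: int = 3) -> str:
    """Return the next crawl time as an HTTP date; `now` must be timezone-aware."""
    now_utc = now.astimezone(timezone.utc)
    next_run = now_utc.replace(
        hour=crawl_hour_utc, minute=0, second=0, microsecond=0
    )
    if next_run <= now_utc:
        # Today's crawl has already happened, so expire at tomorrow's run.
        next_run += timedelta(days=1)
    return format_datetime(next_run, usegmt=True)


def handler(event, context):
    """Lambda proxy-style response that carries the Expires header."""
    body = {"items": []}  # placeholder; real code would load the crawl output
    return {
        "statusCode": 200,
        "headers": {
            "Content-Type": "application/json",
            "Expires": expires_at_next_crawl(datetime.now(timezone.utc)),
        },
        "body": json.dumps(body),
    }
```

Tying the expiry to the next scheduled crawl, rather than using a fixed 24-hour lifetime, keeps responses fetched late in the day from outliving the data they describe.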
Since the data will be static for a day, how can I not pay for the extra time that DynamoDB will be running
If the data is static, can you store it in S3 and use API Gateway to serve it from S3 instead of DynamoDB?