How can I make a WritableField not load the entire database into memory?

Posted: 2021-02-04 19:35:17

I have a very large database (6 GB) that I would like to use Django-REST-Framework with. In particular, I have a model that has a ForeignKey relationship to the django.contrib.auth.models.User table (not so big) and a ForeignKey to a BIG table (let's call it Products). The model can be seen below:

class ShoppingBag(models.Model):

    user     = models.ForeignKey('auth.User', related_name='+')
    product  = models.ForeignKey('myapp.Product', related_name='+')
    quantity = models.SmallIntegerField(default=1)

Again, there are 6GB of Products.

The serializer is as follows:

class ShoppingBagSerializer(serializers.ModelSerializer):

    product = serializers.RelatedField(many=False)
    user    = serializers.RelatedField(many=False)

    class Meta:
        model  = ShoppingBag
        fields = ('product', 'user', 'quantity')

So far this is great: I can do a GET on the list and on individual shopping bags, and everything is fine. For reference, the queries (captured with a query logger) look something like this:

SELECT * FROM myapp_product WHERE product_id=1254
SELECT * FROM auth_user WHERE user_id=12
SELECT * FROM myapp_product WHERE product_id=1404
SELECT * FROM auth_user WHERE user_id=12
...

...and so on, for as many shopping bags as are returned.

I would also like to be able to POST to create new shopping bags, but serializers.RelatedField is read-only. Let's make it read-write:

class ShoppingBagSerializer(serializers.ModelSerializer):

    product = serializers.PrimaryKeyRelatedField(many=False)
    user    = serializers.PrimaryKeyRelatedField(many=False)

    ...

Now things get bad... GET requests to the list action take more than 5 minutes, and my server's memory jumps up to ~6 GB. Why?! Well, back to the SQL queries, and now I see:

SELECT * FROM myapp_product;
SELECT * FROM auth_user;

OK, so that's not good. Clearly something like prefetch_related or select_related is being done in order to get access to all the products; but this table is HUGE.

Further inspection reveals where this happens on Line 68 of relations.py in DRF:

def initialize(self, parent, field_name):
    super(RelatedField, self).initialize(parent, field_name)
    if self.queryset is None and not self.read_only:
        manager = getattr(self.parent.opts.model, self.source or field_name)
        if hasattr(manager, 'related'):  # Forward
            self.queryset = manager.related.model._default_manager.all()
        else:  # Reverse
            self.queryset = manager.field.rel.to._default_manager.all()

If the field is not read-only, self.queryset is set to .all() on the related model's default manager, i.e. the entire table!

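To make the failure mode concrete, here is a minimal sketch (a hypothetical shell session, not code from my project) of why that fallback hurts: a Django queryset is lazy, so the .all() assignment itself costs nothing, and the ~6 GB only lands in memory once something in the request cycle actually iterates that queryset.

# Hypothetical illustration of queryset laziness vs. evaluation.
from myapp.models import Product

qs = Product.objects.all()  # lazy: no SQL has run yet, no rows in memory
qs.count()                  # runs SELECT COUNT(*), still cheap

rows = list(qs)             # evaluating the queryset runs SELECT * FROM myapp_product
                            # and instantiates every Product row in memory (~6 GB here)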

So I'm pretty sure this is where my problem is, and I need a way to say "don't load the whole table here", but I'm not 100% sure this is really the issue or where to handle it. It seems like everything should be memory-safe with pagination, but that is simply not the case. I'd appreciate any advice.

1 Answer

#1

In the end, we had to simply create our own PrimaryKeyRelatedField class to override the default behavior in Django-Rest-Framework. Basically, we ensured that the queryset stayed empty until we actually wanted to look up the object, and then performed the lookup ourselves. This was extremely annoying, and I hope the Django-Rest-Framework guys take note of this!

Our final solution:

from django.core.exceptions import ValidationError
from django.utils.encoding import smart_text
from rest_framework import serializers

from myapp.models import Product


class ProductField(serializers.PrimaryKeyRelatedField):

    many = False

    def __init__(self, *args, **kwargs):
        # Hack to ensure ALL products are not loaded: with a non-None queryset,
        # DRF's initialize() never falls back to Product.objects.all()
        kwargs['queryset'] = Product.objects.none()
        super(ProductField, self).__init__(*args, **kwargs)

    def field_to_native(self, obj, field_name):
        return unicode(obj)

    def from_native(self, data):
        """
        Perform the query lookup here, one object at a time.
        """
        try:
            return Product.objects.get(pk=data)
        except Product.DoesNotExist:
            msg = self.error_messages['does_not_exist'] % smart_text(data)
            raise ValidationError(msg)
        except (TypeError, ValueError):
            msg = self.error_messages['incorrect_type'] % type(data)
            raise ValidationError(msg)

And then our serializer is as follows:

class ShoppingBagSerializer(serializers.ModelSerializer):

    product = ProductField()
    ...

This hack ensures the entire database isn't loaded into memory; instead it performs one-off selects based on the submitted data. It's not as efficient computationally, but it also doesn't blast our server with 5-second full-table queries whose results get loaded into memory!

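As a quick sanity check, here is a hedged usage sketch (hypothetical view-level code and example IDs, written against the same DRF 2.x-era API as above) showing that deserializing a POST now triggers only the single primary-key lookup inside from_native:

# Hypothetical usage sketch: creating a bag now costs one product SELECT.
data = {'product': 1254, 'user': 12, 'quantity': 2}
serializer = ShoppingBagSerializer(data=data)

if serializer.is_valid():
    # ProductField.from_native ran a single query:
    #   SELECT * FROM myapp_product WHERE id = 1254
    serializer.save()
else:
    print(serializer.errors)  # field-level errors for a bad or missing pk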
