Is there any way to speed up the authenticate function in Django?

Date: 2022-04-06 05:38:28

We're using Django to build a JSON web service front end for MySQL. We have Apache and Django running on an EC2 instance and MySQL running on an RDS instance. We've started benchmarking with ApacheBench (ab) and got some really poor performance numbers. We also noticed that while the tests run, our Apache/Django instance goes to 100% CPU usage at very low load, while the MySQL instance never gets above 2% CPU usage.

We're trying to make sense of this and isolate the problem, so we did several ab tests:

  1. A request for a static HTML page from Apache -- ~2000 requests/second.
  2. A request that executes a small Python function in Django, with no db interaction -- ~1000 requests/second.
  3. A request that executes one of our Django web service functions, which calls authenticate and then does a very simple query to fetch one record from a table -- 11 requests/second.
  4. Same as 3, but with the call to authenticate commented out -- 95 requests/second.

Why is authenticate so slow? Is it writing data to the db, finding a billion digits of pi, what?

We would like to keep the call to authenticate in these functions, because we don't want to leave them open to anyone who can guess the URL, etc. Has anyone here noticed that authenticate is slow, and can anyone suggest a way to remedy it?

Thank you very much!

1 answer

#1


7  

I am no expert in authentication and security but the following are some ideas as to why this might be happening and possibly how you can increase the performance somewhat.

Since passwords are stored in the db, they are not stored as plaintext; their hashes are stored instead. This way you can still validate a user logging in by comparing the hash computed from the typed password to the one stored in the db. This increases security: if a malicious party gets a copy of the db, the only way to recover the plaintext passwords is by using rainbow tables or doing a brute-force attack.

This is where things get interesting. According to Moore's Law, computers are becoming exponentially faster, so computing hash functions becomes much cheaper in terms of time, especially quick hash functions like MD5 or SHA1. This poses a problem: with all of the computing power available today combined with fast hash functions, attackers can brute-force hashed passwords relatively easily. To combat this, two things can be done. One is to loop the hash function multiple times (the output of the hash is fed back into the hash). With a small, fixed number of rounds this is not very effective, because it only increases the cost of the hash function by a constant factor. That's why the second approach is preferred, which is to use a function that is deliberately expensive to compute and whose cost can be raised over time. The more expensive the function, the longer each hash takes. Even if it takes a second to compute, it is not a big deal for end users, but it is a big deal for a brute-force attack because millions of hashes have to be computed. That's why, starting with Django 1.4, the default is a pretty computationally expensive function called PBKDF2, which applies a salted HMAC a large, configurable number of times.
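
To make the cost difference concrete, here is a rough sketch using only Python's standard library. It is not Django's code: the helper names (`fast_hash`, `slow_hash`, `seconds_per_call`) and the iteration count of 100,000 are my own assumptions for illustration, and the timings will vary by machine.

```python
import hashlib
import os
import time


def fast_hash(password: bytes, salt: bytes) -> bytes:
    """One salted SHA-1 pass: cheap for the server, and just as cheap for an attacker."""
    return hashlib.sha1(salt + password).digest()


def slow_hash(password: bytes, salt: bytes, iterations: int = 100_000) -> bytes:
    """PBKDF2-HMAC-SHA256; the cost grows roughly linearly with `iterations`.

    The default of 100_000 is an illustrative assumption, not Django's setting.
    """
    return hashlib.pbkdf2_hmac("sha256", password, salt, iterations)


def seconds_per_call(func, *args, repeat: int = 5) -> float:
    """Average wall-clock time of func(*args) over `repeat` calls."""
    start = time.perf_counter()
    for _ in range(repeat):
        func(*args)
    return (time.perf_counter() - start) / repeat


if __name__ == "__main__":
    salt = os.urandom(16)
    password = b"correct horse battery staple"
    print(f"single SHA-1: {seconds_per_call(fast_hash, password, salt):.6f} s per call")
    print(f"PBKDF2:       {seconds_per_call(slow_hash, password, salt):.6f} s per call")
```

Running it should show the PBKDF2 call taking orders of magnitude longer than the single SHA-1 pass, which is exactly the CPU time every authenticate call has to pay.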

To get back to your question: it is because of this function that, when you enable authentication, your benchmark numbers drop drastically and your CPU usage goes up.
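
As a back-of-the-envelope check on the numbers in the question, the snippet below converts the two throughput figures into time per request. It assumes the server is CPU-bound and effectively handling one request at a time, which the question does not state, so treat the result as a rough estimate only.

```python
def ms_per_request(requests_per_second: float) -> float:
    """Convert a throughput figure into average milliseconds per request."""
    return 1000.0 / requests_per_second


# Figures taken from the benchmark results in the question.
with_auth_ms = ms_per_request(11)      # authenticate + simple query
without_auth_ms = ms_per_request(95)   # simple query only
auth_cost_ms = with_auth_ms - without_auth_ms

print(f"with authenticate:    {with_auth_ms:.1f} ms per request")
print(f"without authenticate: {without_auth_ms:.1f} ms per request")
print(f"implied cost of authenticate: {auth_cost_ms:.1f} ms per request")
```

Under that assumption the call to authenticate accounts for roughly 80 ms of CPU time per request, which is in the range you would expect from a deliberately slow password hash.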

Here are some ways you can increase the performance.

  • Starting with Django 1.4, you can change the default password hashing function through the PASSWORD_HASHERS setting (docs). If you don't need much security, you can change the default to either SHA1 or MD5. This should increase performance, but keep in mind that security will be much weaker. My personal opinion is that security is important and worth the extra time, but if it is not warranted in your application, it's something you might want to consider.
  • Use sessions. The expensive hash function is only computed on the initial login. Once the user logs in, a session is created for that user and a cookie with the session id is sent to the user. On subsequent requests the user sends the cookie back, and if the session has not expired yet, the user is automatically authenticated (don't worry about security since session data is signed...). The point is that verifying a session is A LOT less computationally expensive than computing the expensive hash function. I guess that in your ab tests you did not send a session cookie. Try some tests that also send a session cookie and see how it performs. If sending cookies is not really an option since you are making a JSON API, you can modify the session back-end to accept the session id via a GET parameter instead of a cookie. I'm not sure, however, what the security ramifications of doing that are.
  • Switch to nginx. I am not an expert in deployment, but in my experience nginx is much faster and more friendly to Django than Apache. One advantage which I think might be of particular interest to you is nginx's ability to run multiple worker processes and to use proxy_pass to hand off requests to Django process(es). If you have multiple worker processes, you can point each worker to a separate Django process via proxy_pass, which will effectively add multiprocessing to Django. Another alternative is to use something like the gevent WSGI server, which lets you run a pool inside the Django process and might also increase performance. I'm not sure any of these will improve your performance drastically, since your CPU load is already at 100%, but it might be something to look into.
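
To illustrate the sessions idea, here is a minimal standard-library sketch. It is not how Django implements sessions: the `TinyAuth` class, its method names, and the iteration count are hypothetical, and it leaves out expiry and persistent storage. It only shows that the slow hash runs at login, while later requests cost a dictionary lookup.

```python
import hashlib
import hmac
import os
import secrets

ITERATIONS = 50_000  # illustrative value, not Django's setting


class TinyAuth:
    """Toy in-memory auth store: slow hash at login, cheap token lookup afterwards."""

    def __init__(self):
        self._users = {}      # username -> (salt, derived_key)
        self._sessions = {}   # session token -> username
        self.hash_calls = 0   # counts how often the expensive hash ran

    def _derive(self, password: str, salt: bytes) -> bytes:
        self.hash_calls += 1
        return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)

    def add_user(self, username: str, password: str) -> None:
        salt = os.urandom(16)
        self._users[username] = (salt, self._derive(password, salt))

    def login(self, username: str, password: str):
        """Expensive path: returns a new session token, or None on failure."""
        record = self._users.get(username)
        if record is None:
            return None
        salt, stored = record
        if not hmac.compare_digest(stored, self._derive(password, salt)):
            return None
        token = secrets.token_urlsafe(32)
        self._sessions[token] = username
        return token

    def user_for_session(self, token: str):
        """Cheap path: a dictionary lookup, no hashing at all."""
        return self._sessions.get(token)


if __name__ == "__main__":
    auth = TinyAuth()
    auth.add_user("alice", "s3cret")
    token = auth.login("alice", "s3cret")
    for _ in range(1000):
        auth.user_for_session(token)
    print("expensive hash computed", auth.hash_calls, "times")
```

To benchmark the equivalent path in your setup, you could log in once, copy the session cookie, and pass it to ab with its `-C name=value` option (assuming Django's default session cookie name, `sessionid`).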
