Non-blocking I/O vs. using threads (how bad is context switching?)

时间:2021-09-26 23:56:24

We use sockets a lot in a program that I work on, and at times we handle connections from up to about 100 machines simultaneously. We use a combination of non-blocking I/O, with a state table to manage it, and traditional blocking Java sockets that use threads.

We have quite a few problems with non-blocking sockets, and I personally much prefer using threads to handle sockets. So my questions are:

How much is saved by handling all sockets with non-blocking I/O on a single thread? How costly is the context switching involved in using threads, and how many concurrent connections can the threaded model scale to in Java?

3 Answers

#1


10  

The choice between blocking I/O and non-blocking I/O depends on your server's activity profile. For example, with long-lived connections and thousands of clients, blocking I/O may become too expensive because it exhausts system resources. However, direct blocking I/O that doesn't crowd out the CPU cache is faster than non-blocking I/O. There is a good article about that - Writing Java Multithreaded Servers - what's old is new.
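For comparison, the single-threaded non-blocking model looks roughly like the sketch below (a minimal loopback echo; the class name, port handling, and buffer sizes are illustrative, not from the original program). One `Selector` multiplexes every socket, and per-connection state lives in buffers attached to the keys rather than in per-thread stacks:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;

public class SelectorEcho {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0)); // ephemeral port
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        // A plain blocking client on another thread, standing in for a remote machine.
        int port = ((InetSocketAddress) server.getLocalAddress()).getPort();
        Thread client = new Thread(() -> {
            try (SocketChannel ch = SocketChannel.open(new InetSocketAddress("127.0.0.1", port))) {
                ch.write(ByteBuffer.wrap("ping".getBytes(StandardCharsets.UTF_8)));
                ByteBuffer reply = ByteBuffer.allocate(16);
                ch.read(reply);
                reply.flip();
                System.out.println("client got: " + StandardCharsets.UTF_8.decode(reply));
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        client.start();

        // One thread services all sockets; connection state is kept in the
        // ByteBuffer attached to each SelectionKey (the "state table").
        boolean done = false;
        while (!done) {
            selector.select();
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isAcceptable()) {
                    SocketChannel ch = server.accept();
                    ch.configureBlocking(false);
                    ch.register(selector, SelectionKey.OP_READ, ByteBuffer.allocate(16));
                } else if (key.isReadable()) {
                    SocketChannel ch = (SocketChannel) key.channel();
                    ByteBuffer buf = (ByteBuffer) key.attachment();
                    if (ch.read(buf) > 0) {
                        buf.flip();
                        ch.write(buf); // echo back; a 4-byte loopback write completes here
                        ch.close();
                        done = true;
                    }
                }
            }
            selector.selectedKeys().clear();
        }
        client.join();
        selector.close();
        server.close();
    }
}
```

The trade-off is visible even at this size: none of the server-side calls block, but every bit of per-connection progress has to be stored explicitly, which is exactly the state-table bookkeeping the question complains about.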

As for the context-switch cost - it's a rather cheap operation. Consider the simple test below:

package com;

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Simple benchmark: every thread churns through thread-local data for DURATION,
// then reports its iteration count. Comparing total throughput at different
// thread counts gives a rough measure of context-switch overhead.
public class AAA {

    private static final long DURATION = TimeUnit.NANOSECONDS.convert(30, TimeUnit.SECONDS);
    private static final int THREADS_NUMBER = 2;
    private static final ThreadLocal<AtomicLong> COUNTER = new ThreadLocal<AtomicLong>() {
        @Override
        protected AtomicLong initialValue() {
            return new AtomicLong();
        }
    };
    private static final ThreadLocal<AtomicLong> DUMMY_DATA = new ThreadLocal<AtomicLong>() {
        @Override
        protected AtomicLong initialValue() {
            return new AtomicLong();
        }
    };
    // Accumulates the dummy results so the JIT cannot eliminate the work as dead code.
    private static final AtomicLong DUMMY_COUNTER = new AtomicLong();
    private static final AtomicLong END_TIME = new AtomicLong(System.nanoTime() + DURATION);

    private static final List<ThreadLocal<CharSequence>> DUMMY_SOURCE = new ArrayList<ThreadLocal<CharSequence>>();
    static {
        for (int i = 0; i < 40; ++i) {
            DUMMY_SOURCE.add(new ThreadLocal<CharSequence>());
        }
    }

    // A queue rather than a Set: a Set would silently drop the result of any
    // thread that happens to finish with the same count as another.
    private static final Collection<Long> COUNTERS = new ConcurrentLinkedQueue<Long>();

    public static void main(String[] args) throws Exception {
        final CountDownLatch startLatch = new CountDownLatch(THREADS_NUMBER);
        final CountDownLatch endLatch = new CountDownLatch(THREADS_NUMBER);

        for (int i = 0; i < THREADS_NUMBER; i++) {
            new Thread() {
                @Override
                public void run() {
                    initDummyData();
                    startLatch.countDown();
                    try {
                        startLatch.await();
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                    while (System.nanoTime() < END_TIME.get()) {
                        doJob();
                    }
                    COUNTERS.add(COUNTER.get().get());
                    DUMMY_COUNTER.addAndGet(DUMMY_DATA.get().get());
                    endLatch.countDown();
                }
            }.start();
        }
        startLatch.await();
        // Restart the clock once all threads are ready, so each runs the full DURATION.
        END_TIME.set(System.nanoTime() + DURATION);

        endLatch.await();
        printStatistics();
    }

    private static void initDummyData() {
        for (ThreadLocal<CharSequence> threadLocal : DUMMY_SOURCE) {
            threadLocal.set(getRandomString());
        }
    }

    private static CharSequence getRandomString() {
        StringBuilder result = new StringBuilder();
        Random random = new Random();
        for (int i = 0; i < 127; ++i) {
            result.append((char) random.nextInt(0xFF));
        }
        return result;
    }

    private static void doJob() {
        Random random = new Random();
        for (ThreadLocal<CharSequence> threadLocal : DUMMY_SOURCE) {
            for (int i = 0; i < threadLocal.get().length(); ++i) {
                DUMMY_DATA.get().addAndGet(threadLocal.get().charAt(i) << random.nextInt(31));
            }
        }
        COUNTER.get().incrementAndGet();
    }

    private static void printStatistics() {
        long total = 0L;
        for (Long counter : COUNTERS) {
            total += counter;
        }
        System.out.printf("Total iterations number: %d, dummy data: %d, distribution:%n", total, DUMMY_COUNTER.get());
        for (Long counter : COUNTERS) {
            System.out.printf("%f%%%n", counter * 100d / total);
        }
    }
}

I ran four tests for the two- and ten-thread scenarios, and they show a performance loss of about 2.5% (78626 iterations for two threads versus 76754 for ten); system resources are used approximately equally across the threads.

Also, the `java.util.concurrent` authors estimate a context switch at about 2000-4000 CPU cycles (on the order of a microsecond on a ~3 GHz core), as the comment in `Exchanger` shows:

public class Exchanger<V> {
   ...
   private static final int NCPU = Runtime.getRuntime().availableProcessors();
   ....
   /**
    * The number of times to spin (doing nothing except polling a
    * memory location) before blocking or giving up while waiting to
    * be fulfilled.  Should be zero on uniprocessors.  On
    * multiprocessors, this value should be large enough so that two
    * threads exchanging items as fast as possible block only when
    * one of them is stalled (due to GC or preemption), but not much
    * longer, to avoid wasting CPU resources.  Seen differently, this
    * value is a little over half the number of cycles of an average
    * context switch time on most systems.  The value here is
    * approximately the average of those across a range of tested
    * systems.
    */
   private static final int SPINS = (NCPU == 1) ? 0 : 2000; 

#2


1  

For your questions, the best approach might be to build a test program, gather some hard measurement data, and make the decision based on that data. I usually do this when trying to make such decisions, and it helps to have hard numbers on hand to back up your argument.

Before starting, though: how many threads are you talking about, and what type of hardware are you running your software on?

#3


1  

For 100 connections you are unlikely to have a problem with blocking IO, using two threads per connection (one for reading, one for writing). That's the simplest model, IMHO.
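A minimal sketch of that blocking, thread-per-connection model (a loopback echo with illustrative names; a real server would also run a dedicated writer thread per connection, fed by a queue):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class ThreadPerConnection {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(0); // ephemeral port

        // One blocking reader thread per accepted connection; the thread's
        // stack holds all per-connection state, so no state table is needed.
        Thread handler = new Thread(() -> {
            try (Socket socket = server.accept()) {
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(socket.getInputStream()));
                PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
                String line;
                while ((line = in.readLine()) != null) {
                    out.println("echo: " + line); // blocking write; thread simply waits on I/O
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        handler.start();

        // A client exercising the server over loopback.
        try (Socket socket = new Socket("127.0.0.1", server.getLocalPort())) {
            PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream()));
            out.println("hello");
            System.out.println(in.readLine());
        }
        handler.join();
        server.close();
    }
}
```

With 100 connections that is about 200 mostly idle threads, which (per the context-switch figures above) is well within what a JVM on ordinary hardware handles comfortably.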

However, you may find that JMS is a better way to manage your connections. If you use something like ActiveMQ, you can consolidate all your connections.
