Non-blocking I/O versus using threads (how bad is context switching?)

Date: 2021-09-26 23:56:24

We use sockets a lot in a program that I work on and we handle connections from up to about 100 machines simultaneously at times. We have a combination of non-blocking I/O in use with a state table to manage it and traditional Java sockets which use threads.


We have quite a few problems with non-blocking sockets and I personally like using threads to handle sockets much better. So my question is:


How much saving is made by using non-blocking sockets on a single thread? How bad is the context switching involved in using threads and how many concurrent connections can you scale to using the threaded model in Java?


3 Solutions



The choice between blocking I/O and non-blocking I/O depends on your server's activity profile. E.g. if you have long-lived connections and thousands of clients, blocking I/O may become too expensive because of system resource exhaustion. However, direct I/O that doesn't crowd out the CPU cache is faster than non-blocking I/O. There is a good article about that: "Writing Java Multithreaded Servers - whats old is new".
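For comparison, here is a minimal sketch of what the non-blocking side looks like with `java.nio`: one thread, one `Selector`, and a per-connection buffer standing in for the "state table" mentioned in the question. The class name `SelectorEchoSketch`, the echo behaviour, and the `roundTrip` helper are my own assumptions for illustration, not anything from the question's code base.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.concurrent.atomic.AtomicBoolean;

public class SelectorEchoSketch {

    /** Single-threaded event loop: accepts clients and echoes whatever they send. */
    public static void serve(ServerSocketChannel server, Selector selector, AtomicBoolean running)
            throws IOException {
        while (running.get()) {
            selector.select(100); // wait up to 100 ms for ready channels
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    if (client != null) {
                        client.configureBlocking(false);
                        // The attached buffer is this connection's entry in the "state table".
                        client.register(selector, SelectionKey.OP_READ, ByteBuffer.allocate(256));
                    }
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    ByteBuffer buffer = (ByteBuffer) key.attachment();
                    if (client.read(buffer) < 0) {
                        client.close(); // peer closed the connection
                        continue;
                    }
                    buffer.flip();
                    // Simplification: a real server would register OP_WRITE instead of looping.
                    while (buffer.hasRemaining()) {
                        client.write(buffer);
                    }
                    buffer.clear();
                }
            }
        }
    }

    /** Starts the loop on a loopback port, sends one message, and returns the echo. */
    public static String roundTrip(String message) {
        try (Selector selector = Selector.open();
             ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0)); // port 0 = any free port
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);
            int port = ((InetSocketAddress) server.getLocalAddress()).getPort();
            AtomicBoolean running = new AtomicBoolean(true);
            Thread loop = new Thread(() -> {
                try {
                    serve(server, selector, running);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
            loop.start();
            try (Socket socket = new Socket("127.0.0.1", port)) {
                byte[] out = message.getBytes(StandardCharsets.UTF_8);
                socket.getOutputStream().write(out);
                socket.getOutputStream().flush();
                byte[] in = new byte[out.length];
                new DataInputStream(socket.getInputStream()).readFully(in);
                return new String(in, StandardCharsets.UTF_8);
            } finally {
                running.set(false);
                selector.wakeup();
                loop.join();
            }
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("ping"));
    }
}
```

Even in this toy form you can see where the extra complexity comes from: partial reads, partial writes, and per-connection state all have to be handled by hand, which the threaded model gets for free from the call stack.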


About context switch cost: it's a rather cheap operation. Consider the simple test below:


package com;

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class AAA {

    private static final long DURATION = TimeUnit.NANOSECONDS.convert(30, TimeUnit.SECONDS);
    private static final int THREADS_NUMBER = 2;
    // Per-thread count of completed doJob() iterations.
    private static final ThreadLocal<AtomicLong> COUNTER = new ThreadLocal<AtomicLong>() {
        @Override
        protected AtomicLong initialValue() {
            return new AtomicLong();
        }
    };
    // Per-thread accumulator; it exists only so the work cannot be optimized away.
    private static final ThreadLocal<AtomicLong> DUMMY_DATA = new ThreadLocal<AtomicLong>() {
        @Override
        protected AtomicLong initialValue() {
            return new AtomicLong();
        }
    };
    private static final AtomicLong DUMMY_COUNTER = new AtomicLong();
    private static final AtomicLong END_TIME = new AtomicLong(System.nanoTime() + DURATION);

    // 40 thread-local strings per thread, so every thread touches its own data.
    private static final List<ThreadLocal<CharSequence>> DUMMY_SOURCE = new ArrayList<ThreadLocal<CharSequence>>();
    static {
        for (int i = 0; i < 40; ++i) {
            DUMMY_SOURCE.add(new ThreadLocal<CharSequence>());
        }
    }

    // Note: this is a set, so threads that finish with identical counts are reported once.
    private static final Set<Long> COUNTERS = new ConcurrentSkipListSet<Long>();

    public static void main(String[] args) throws Exception {
        final CountDownLatch startLatch = new CountDownLatch(THREADS_NUMBER);
        final CountDownLatch endLatch = new CountDownLatch(THREADS_NUMBER);

        for (int i = 0; i < THREADS_NUMBER; i++) {
            new Thread() {
                @Override
                public void run() {
                    initDummyData();
                    startLatch.countDown();
                    try {
                        // Wait until every worker has initialized its data.
                        startLatch.await();
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                    while (System.nanoTime() < END_TIME.get()) {
                        doJob();
                    }
                    COUNTERS.add(COUNTER.get().get());
                    DUMMY_COUNTER.addAndGet(DUMMY_DATA.get().get());
                    endLatch.countDown();
                }
            }.start();
        }
        startLatch.await();
        // Restart the clock once all workers are ready, so setup time is not measured.
        END_TIME.set(System.nanoTime() + DURATION);

        endLatch.await();
        printStatistics();
    }

    private static void initDummyData() {
        for (ThreadLocal<CharSequence> threadLocal : DUMMY_SOURCE) {
            threadLocal.set(getRandomString());
        }
    }

    private static CharSequence getRandomString() {
        StringBuilder result = new StringBuilder();
        Random random = new Random();
        for (int i = 0; i < 127; ++i) {
            result.append((char) random.nextInt(0xFF));
        }
        return result;
    }

    // One unit of CPU-bound work over the calling thread's own data.
    private static void doJob() {
        Random random = new Random();
        for (ThreadLocal<CharSequence> threadLocal : DUMMY_SOURCE) {
            for (int i = 0; i < threadLocal.get().length(); ++i) {
                DUMMY_DATA.get().addAndGet(threadLocal.get().charAt(i) << random.nextInt(31));
            }
        }
        COUNTER.get().incrementAndGet();
    }

    private static void printStatistics() {
        long total = 0L;
        for (Long counter : COUNTERS) {
            total += counter;
        }
        System.out.printf("Total iterations number: %d, dummy data: %d, distribution:%n", total, DUMMY_COUNTER.get());
        for (Long counter : COUNTERS) {
            System.out.printf("%f%%%n", counter * 100d / total);
        }
    }
}

I ran four tests each for the two-thread and ten-thread scenarios, and they show a performance loss of about 2.5% (78626 iterations for two threads versus 76754 for ten threads). System resources are used by the threads approximately equally.
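To make the arithmetic behind that figure explicit, here is a small worked calculation from the two iteration counts quoted above. The class and method names are just placeholders of mine. The exact loss comes out a little under the rounded "about 2.5%".

```java
import java.util.Locale;

public class ThroughputLoss {

    /** Relative throughput loss, in percent, when going from baseline to observed. */
    public static double lossPercent(long baseline, long observed) {
        return (baseline - observed) * 100.0 / baseline;
    }

    public static void main(String[] args) {
        // Iteration counts quoted in the answer: two threads vs. ten threads.
        double loss = lossPercent(78626, 76754);
        // (78626 - 76754) / 78626 = 1872 / 78626
        System.out.println(String.format(Locale.ROOT, "%.2f%%", loss)); // prints 2.38%
    }
}
```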


Also, the authors of 'java.util.concurrent' assume that a context switch takes about 2000-4000 CPU cycles:


public class Exchanger<V> {
    private static final int NCPU = Runtime.getRuntime().availableProcessors();

    /**
     * The number of times to spin (doing nothing except polling a
     * memory location) before blocking or giving up while waiting to
     * be fulfilled.  Should be zero on uniprocessors.  On
     * multiprocessors, this value should be large enough so that two
     * threads exchanging items as fast as possible block only when
     * one of them is stalled (due to GC or preemption), but not much
     * longer, to avoid wasting CPU resources.  Seen differently, this
     * value is a little over half the number of cycles of an average
     * context switch time on most systems.  The value here is
     * approximately the average of those across a range of tested
     * systems.
     */
    private static final int SPINS = (NCPU == 1) ? 0 : 2000;

    // ... (rest of the class omitted)
}
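The idea that comment describes (poll for a while, and only pay for a context switch if the value still isn't there) can be sketched as below. This is not how `Exchanger` is actually implemented; it is a simplified, single-use hand-off of my own (`SpinThenBlockSketch` and its methods are hypothetical names) that reuses the same `SPINS` policy.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

public class SpinThenBlockSketch {
    static final int NCPU = Runtime.getRuntime().availableProcessors();
    // Same policy as the excerpt: spinning is pointless on a single CPU.
    static final int SPINS = (NCPU == 1) ? 0 : 2000;

    private final AtomicReference<String> slot = new AtomicReference<>();
    private final CountDownLatch filled = new CountDownLatch(1);

    /** Publishes a non-null value for the consumer (single use). */
    public void put(String value) {
        slot.set(value);
        filled.countDown();
    }

    /** Polls the slot SPINS times, then blocks until the value arrives. */
    public String take() throws InterruptedException {
        for (int i = 0; i < SPINS; i++) {
            String value = slot.get();
            if (value != null) {
                return value; // arrived while spinning: no context switch needed
            }
        }
        filled.await(); // give up the CPU; this is where a context switch may occur
        return slot.get();
    }

    /** Small demo: one producer thread hands a value to the calling thread. */
    public static String demo() {
        SpinThenBlockSketch handoff = new SpinThenBlockSketch();
        Thread producer = new Thread(() -> handoff.put("hello"));
        producer.start();
        try {
            String value = handoff.take();
            producer.join();
            return value;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The spin budget only makes sense if blocking costs more than the spin itself, which is why the JDK comment ties the constant to the cost of a context switch.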



For your questions the best method might be to build a test program, get some hard measurement data and make the best decision based on the data. I usually do this when trying to make such decisions, and it helps to have hard numbers to bring with you to back up your argument.
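As a starting point, such a test program could be as small as the sketch below: run the same unit of work on N threads for a fixed time and compare total iterations across thread counts. Everything here is an assumption on my part: the class name, the placeholder `unitOfWork` (swap in something that resembles your real request handling), and the thread counts in `main`. It does no JIT warm-up, so treat the numbers as rough.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadScalingProbe {
    // Written once per worker so the JIT cannot discard the computation.
    static volatile long sink;

    /** CPU-bound placeholder for one unit of work. */
    static long unitOfWork(long seed) {
        long x = seed;
        for (int i = 0; i < 10_000; i++) {
            x = x * 6364136223846793005L + 1442695040888963407L;
        }
        return x;
    }

    /** Runs unitOfWork on the given number of threads and returns total iterations. */
    public static long totalIterations(int threads, long durationMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            final long deadline = System.nanoTime() + durationMillis * 1_000_000L;
            List<Future<Long>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final long seed = t + 1;
                results.add(pool.submit(() -> {
                    long count = 0;
                    long state = seed;
                    while (System.nanoTime() < deadline) {
                        state = unitOfWork(state);
                        count++;
                    }
                    sink = state;
                    return count;
                }));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Thread counts are arbitrary; 200 roughly matches two threads per connection at 100 connections.
        for (int threads : new int[] {2, 10, 100, 200}) {
            System.out.printf("%d threads: %d iterations%n", threads, totalIterations(threads, 2000));
        }
    }
}
```

If the totals stay roughly flat as the thread count grows, context switching is not your bottleneck on that hardware.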


Before starting though, how many threads are you talking about? And on what type of hardware are you running your software?




For 100 connections you are unlikely to have a problem with blocking I/O and two threads per connection (one for reading and one for writing). That's the simplest model, IMHO.
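A rough sketch of that model, assuming a line-based protocol and a simple echo as the "business logic" (both are my assumptions, as are the class and method names): the reader thread blocks on the socket and pushes replies into a queue, and the writer thread blocks on the queue and writes them out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class TwoThreadsPerConnection {
    // Sentinel compared by identity; tells the writer that the reader has finished.
    private static final String EOF = new String("EOF");

    /** Starts one reader thread and one writer thread for the given connection. */
    public static void handle(Socket socket) {
        BlockingQueue<String> outbox = new LinkedBlockingQueue<>();
        Thread reader = new Thread(() -> {
            try {
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
                String line;
                while ((line = in.readLine()) != null) { // blocks until a line arrives
                    outbox.put("echo: " + line);
                }
            } catch (IOException | InterruptedException e) {
                // Connection dropped or thread interrupted: fall through and stop.
            } finally {
                outbox.offer(EOF);
            }
        });
        Thread writer = new Thread(() -> {
            try (Socket s = socket) { // the writer closes the socket when done
                PrintWriter out = new PrintWriter(
                        new OutputStreamWriter(s.getOutputStream(), StandardCharsets.UTF_8), true);
                String message;
                while ((message = outbox.take()) != EOF) { // blocks until there is output
                    out.println(message);
                }
            } catch (IOException | InterruptedException e) {
                // Nothing more to send on this connection.
            }
        });
        reader.start();
        writer.start();
    }

    /** Demo: serves a single loopback connection and returns the reply to one line. */
    public static String roundTrip(String line) {
        try (ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            Thread acceptor = new Thread(() -> {
                try {
                    handle(server.accept()); // a real server would loop here
                } catch (IOException e) {
                    // Server socket closed.
                }
            });
            acceptor.start();
            try (Socket client = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort())) {
                PrintWriter out = new PrintWriter(
                        new OutputStreamWriter(client.getOutputStream(), StandardCharsets.UTF_8), true);
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(client.getInputStream(), StandardCharsets.UTF_8));
                out.println(line);
                return in.readLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("ping")); // prints echo: ping
    }
}
```

With 100 machines that is about 200 mostly idle threads, and each one reads like straight-line code with no state table to maintain.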


However, you may find that using JMS is a better way to manage your connections. If you use something like ActiveMQ, you can consolidate all your connections.




