Unlike C#'s IEnumerable, where an execution pipeline can be executed as many times as we want, in Java a stream can be 'iterated' only once.
Any call to a terminal operation closes the stream, rendering it unusable. This 'feature' takes away a lot of power.
I imagine the reason for this is not technical. What were the design considerations behind this strange restriction?
Edit: in order to demonstrate what I am talking about, consider the following implementation of Quick-Sort in C#:
IEnumerable<int> QuickSort(IEnumerable<int> ints)
{
    if (!ints.Any()) {
        return Enumerable.Empty<int>();
    }
    int pivot = ints.First();
    IEnumerable<int> lt = ints.Where(i => i < pivot);
    IEnumerable<int> gt = ints.Where(i => i > pivot);
    return QuickSort(lt).Concat(new int[] { pivot }).Concat(QuickSort(gt));
}
Now to be sure, I am not advocating that this is a good implementation of quick sort! It is, however, a great example of the expressive power of lambda expressions combined with stream operations.
And it can't be done in Java! I can't even ask a stream whether it is empty without rendering it unusable.
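To make the restriction concrete, here is a minimal sketch of what I mean. The class and method names are made up for illustration, and it assumes the OpenJDK implementation, which throws IllegalStateException when a consumed stream is reused (the specification only says an implementation may do so).

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

class OneShotStreamDemo {
    // Returns true if a second terminal operation on the same Stream instance is rejected.
    static boolean secondTraversalFails(List<Integer> data) {
        Stream<Integer> s = data.stream();
        s.findAny();                 // first terminal operation consumes the stream
        try {
            s.count();               // second terminal operation on the same instance
            return false;
        } catch (IllegalStateException e) {
            return true;             // the stream was already operated upon
        }
    }

    public static void main(String[] args) {
        System.out.println(secondTraversalFails(Arrays.asList(3, 1, 2)));
    }
}
```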
5 answers
#1
334
I have some recollections from the early design of the Streams API that might shed some light on the design rationale.
Back in 2012, we were adding lambdas to the language, and we wanted a collections-oriented or "bulk data" set of operations, programmed using lambdas, that would facilitate parallelism. The idea of lazily chaining operations together was well established by this point. We also didn't want the intermediate operations to store results.
The main issues we needed to decide were what the objects in the chain looked like in the API and how they hooked up to data sources. The sources were often collections, but we also wanted to support data coming from a file or the network, or data generated on-the-fly, e.g., from a random number generator.
There were many influences of existing work on the design. Among the more influential were Google's Guava library and the Scala collections library. (If anybody is surprised about the influence from Guava, note that Kevin Bourrillion, Guava lead developer, was on the JSR-335 Lambda expert group.) On Scala collections, we found this talk by Martin Odersky to be of particular interest: Future-Proofing Scala Collections: from Mutable to Persistent to Parallel. (Stanford EE380, 2011 June 1.)
Our prototype design at the time was based around Iterable. The familiar operations filter, map, and so forth were extension (default) methods on Iterable. Calling one added an operation to the chain and returned another Iterable. A terminal operation like count would call iterator() up the chain to the source, and the operations were implemented within each stage's Iterator.
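To give a feel for that shape, here is a rough sketch. It is not the actual prototype code; LazyIterables, filter and count are names invented for this illustration. Each stage is an Iterable whose iterator() asks the stage above for an Iterator, and the terminal operation drains the last stage.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

class LazyIterables {
    // Hypothetical helper: returns a lazy view; nothing is filtered until iterator() is called.
    static <T> Iterable<T> filter(Iterable<T> source, Predicate<? super T> keep) {
        return () -> new Iterator<T>() {
            private final Iterator<T> upstream = source.iterator(); // one stage up the chain
            private T pending;
            private boolean ready;

            @Override public boolean hasNext() {
                while (!ready && upstream.hasNext()) {
                    T candidate = upstream.next();
                    if (keep.test(candidate)) { pending = candidate; ready = true; }
                }
                return ready;
            }

            @Override public T next() {
                if (!hasNext()) throw new NoSuchElementException();
                ready = false;
                return pending;
            }
        };
    }

    // A terminal operation: pulls an Iterator from the last stage and drains it.
    static <T> long count(Iterable<T> stage) {
        long n = 0;
        for (T ignored : stage) n++;
        return n;
    }

    public static void main(String[] args) {
        List<Integer> source = Arrays.asList(1, 2, 3, 4, 5, 6);
        Iterable<Integer> evens = filter(source, i -> i % 2 == 0);
        System.out.println(count(evens));
        System.out.println(count(evens)); // a second traversal builds a new Iterator chain
    }
}
```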
Since these are Iterables, you can call the iterator() method more than once. What should happen then?
If the source is a collection, this mostly works fine. Collections are Iterable, and each call to iterator() produces a distinct Iterator instance that is independent of any other active instances, and each traverses the collection independently. Great.
Now what if the source is one-shot, like reading lines from a file? Maybe the first Iterator should get all the values but the second and subsequent ones should be empty. Maybe the values should be interleaved among the Iterators. Or maybe each Iterator should get all the same values. Then, what if you have two iterators and one gets farther ahead of the other? Somebody will have to buffer up the values in the second Iterator until they're read. Worse, what if you get one Iterator and read all the values, and only then get a second Iterator? Where do the values come from now? Is there a requirement for them all to be buffered up just in case somebody wants a second Iterator?
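Here is a small sketch of the first of those possibilities, using today's APIs. The adapter is deliberately naive and the class name is invented for this example; a BufferedReader over a string stands in for a file or network source.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class OneShotSourceDemo {
    // Drains an Iterator into a List so the results can be inspected.
    static <T> List<T> drain(Iterator<T> it) {
        List<T> out = new ArrayList<>();
        it.forEachRemaining(out::add);
        return out;
    }

    // Returns what the first and the second Iterator each manage to read.
    static List<List<String>> readTwice(String text) {
        // The reader can only be read once, like a file or a socket.
        BufferedReader reader = new BufferedReader(new StringReader(text));
        // Naive adapter: every iterator() call reads from the same underlying reader.
        Iterable<String> lines = () -> reader.lines().iterator();
        List<String> first = drain(lines.iterator());
        List<String> second = drain(lines.iterator());
        return Arrays.asList(first, second);
    }

    public static void main(String[] args) {
        // The first Iterator sees every line; the second finds the reader exhausted.
        System.out.println(readTwice("a\nb\nc"));
    }
}
```

Nothing in the Iterable type warns the caller that the second traversal silently yields nothing.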
Clearly, allowing multiple Iterators over a one-shot source raises a lot of questions. We didn't have good answers for them. We wanted consistent, predictable behavior for what happens if you call iterator() twice. This pushed us toward disallowing multiple traversals, making the pipelines one-shot.
We also observed others bumping into these issues. In the JDK, most Iterables are collections or collection-like objects, which allow multiple traversal. It isn't specified anywhere, but there seemed to be an unwritten expectation that Iterables allow multiple traversal. A notable exception is the NIO DirectoryStream interface. Its specification includes this interesting warning:
While DirectoryStream extends Iterable, it is not a general-purpose Iterable as it supports only a single Iterator; invoking the iterator method to obtain a second or subsequent iterator throws IllegalStateException.
[bold in original]
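The documented behavior is easy to observe. The following is only a sketch: the class name is invented, and it creates (and deletes) an empty scratch directory so that it does not depend on any existing path.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

class DirectoryStreamDemo {
    // Returns true if asking the same DirectoryStream for a second Iterator is rejected.
    static boolean secondIteratorRejected() {
        try {
            Path dir = Files.createTempDirectory("ds-demo"); // scratch directory for the demo
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                entries.iterator();          // first Iterator: fine
                try {
                    entries.iterator();      // second Iterator on the same stream
                    return false;
                } catch (IllegalStateException e) {
                    return true;
                }
            } finally {
                Files.delete(dir);           // clean up the scratch directory
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(secondIteratorRejected());
    }
}
```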
This seemed unusual and unpleasant enough that we didn't want to create a whole bunch of new Iterables that might be once-only. This pushed us away from using Iterable.
About this time, an article by Bruce Eckel appeared that described a spot of trouble he'd had with Scala. He'd written this code:
// Scala
val lines = fromString(data).getLines
val registrants = lines.map(Registrant)
registrants.foreach(println)
registrants.foreach(println)
It's pretty straightforward. It parses lines of text into Registrant objects and prints them out twice. Except that it actually only prints them out once. It turns out that he thought that registrants was a collection, when in fact it's an iterator. The second call to foreach encounters an empty iterator, from which all values have been exhausted, so it prints nothing.
This kind of experience convinced us that it was very important to have clearly predictable results if multiple traversal is attempted. It also highlighted the importance of distinguishing lazy pipeline-like structures from actual collections that store data. This in turn drove the separation of the lazy pipeline operations into the new Stream interface, keeping only eager, mutative operations directly on Collections. Brian Goetz has explained the rationale for that.
What about allowing multiple traversal for collection-based pipelines but disallowing it for non-collection-based pipelines? It's inconsistent, but it's sensible. If you're reading values from the network, of course you can't traverse them again. If you want to traverse them multiple times, you have to pull them into a collection explicitly.
But let's explore allowing multiple traversal from collections-based pipelines. Let's say you did this:
Iterable<?> it = source.filter(...).map(...).filter(...).map(...);
it.into(dest1);
it.into(dest2);
(The into operation is now spelled collect(toList()).)
If source is a collection, then the first into() call will create a chain of Iterators back to the source, execute the pipeline operations, and send the results into the destination. The second call to into() will create another chain of Iterators, and execute the pipeline operations again. This isn't obviously wrong but it does have the effect of performing all the filter and map operations a second time for each element. I think many programmers would have been surprised by this behavior.
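The effect can be approximated with the API as it shipped. In this sketch a Supplier of streams stands in for a re-traversable pipeline (that substitution, and all the names, are my own for illustration), and a counter records how often the map function actually runs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class RepeatedPipelineDemo {
    // Runs the same lazy pipeline twice and reports how often the map function ran.
    static int mapCallsForTwoTraversals(List<Integer> source) {
        AtomicInteger mapCalls = new AtomicInteger();
        // Each get() builds a fresh chain over the same source.
        Supplier<Stream<Integer>> pipeline = () -> source.stream()
                .filter(i -> i % 2 == 0)
                .map(i -> { mapCalls.incrementAndGet(); return i * 10; });

        List<Integer> dest1 = pipeline.get().collect(Collectors.toList());
        List<Integer> dest2 = pipeline.get().collect(Collectors.toList());
        if (!dest1.equals(dest2)) throw new AssertionError("traversals disagree");
        return mapCalls.get();
    }

    public static void main(String[] args) {
        // Three even inputs, traversed twice: the map function runs six times, not three.
        System.out.println(mapCallsForTwoTraversals(Arrays.asList(1, 2, 3, 4, 5, 6)));
    }
}
```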
As I mentioned above, we had been talking to the Guava developers. One of the cool things they have is an Idea Graveyard where they describe features that they decided not to implement along with the reasons. The idea of lazy collections sounds pretty cool, but here's what they have to say about it. Consider a List.filter() operation that returns a List:
The biggest concern here is that too many operations become expensive, linear-time propositions. If you want to filter a list and get a list back, and not just a Collection or an Iterable, you can use ImmutableList.copyOf(Iterables.filter(list, predicate)), which "states up front" what it's doing and how expensive it is.
To take a specific example, what's the cost of get(0) or size() on a List? For commonly used classes like ArrayList, they're O(1). But if you call one of these on a lazily-filtered list, it has to run the filter over the backing list, and all of a sudden these operations are O(n). Worse, it has to traverse the backing list on every operation.
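To illustrate the cost, here is a hypothetical lazily filtered view. It is neither a Guava nor a JDK class, just a sketch written for this point; the predicate counts its own invocations so the hidden work becomes visible.

```java
import java.util.AbstractList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Hypothetical lazily filtered view: no element is stored, every call re-runs the filter.
class LazyFilteredList<T> extends AbstractList<T> {
    private final List<T> backing;
    private final Predicate<? super T> keep;

    LazyFilteredList(List<T> backing, Predicate<? super T> keep) {
        this.backing = backing;
        this.keep = keep;
    }

    @Override public int size() {
        int n = 0;
        for (T t : backing) if (keep.test(t)) n++;    // full scan on every call
        return n;
    }

    @Override public T get(int index) {
        int seen = 0;
        for (T t : backing) {                          // scan until the index-th match
            if (keep.test(t) && seen++ == index) return t;
        }
        throw new IndexOutOfBoundsException("index: " + index);
    }

    // Counts predicate invocations caused by one get(0) and one size() call.
    static int predicateCallsForGetAndSize() {
        AtomicInteger tests = new AtomicInteger();
        List<Integer> view = new LazyFilteredList<Integer>(
                Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8),
                i -> { tests.incrementAndGet(); return i > 6; });
        view.get(0);   // tests 1 through 7 before finding the first match
        view.size();   // tests all 8 elements again
        return tests.get();
    }

    public static void main(String[] args) {
        System.out.println(predicateCallsForGetAndSize());
    }
}
```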
This seemed to us to be too much laziness. It's one thing to set up some operations and defer actual execution until you say "Go". It's another to set things up in such a way that hides a potentially large amount of recomputation.
In proposing to disallow non-linear or "no-reuse" streams, Paul Sandoz described the potential consequences of allowing them as giving rise to "unexpected or confusing results." He also mentioned that parallel execution would make things even trickier. Finally, I'd add that a pipeline operation with side effects would lead to difficult and obscure bugs if the operation were unexpectedly executed multiple times, or at least a different number of times than the programmer expected. (But Java programmers don't write lambda expressions with side effects, do they? DO THEY??)
So that's the basic rationale for the Java 8 Streams API design that allows one-shot traversal and that requires a strictly linear (no branching) pipeline. It provides consistent behavior across multiple different stream sources, it clearly separates lazy from eager operations, and it provides a straightforward execution model.
With regard to IEnumerable, I am far from an expert on C# and .NET, so I would appreciate being corrected (gently) if I draw any incorrect conclusions. It does appear, however, that IEnumerable permits multiple traversal to behave differently with different sources; and it permits a branching structure of nested IEnumerable operations, which may result in some significant recomputation. While I appreciate that different systems make different tradeoffs, these are two characteristics that we sought to avoid in the design of the Java 8 Streams API.
The quicksort example given by the OP is interesting, puzzling, and I'm sorry to say, somewhat horrifying. Calling QuickSort takes an IEnumerable and returns an IEnumerable, so no sorting is actually done until the final IEnumerable is traversed. What the call seems to do, though, is build up a tree structure of IEnumerables that reflects the partitioning that quicksort would do, without actually doing it. (This is lazy computation, after all.) If the source has N elements, the tree will be N elements wide at its widest, and it will be lg(N) levels deep.
It seems to me -- and once again, I'm not a C# or .NET expert -- that this will cause certain innocuous-looking calls, such as pivot selection via ints.First(), to be more expensive than they look. At the first level, of course, it's O(1). But consider a partition deep in the tree, at the right-hand edge. To compute the first element of this partition, the entire source has to be traversed, an O(N) operation. But since the partitions above are lazy, they must be recomputed, requiring O(lg N) comparisons. So selecting the pivot would be an O(N lg N) operation, which is as expensive as an entire sort.
But we don't actually sort until we traverse the returned IEnumerable. In the standard quicksort algorithm, each level of partitioning doubles the number of partitions. Each partition is only half the size, so each level remains at O(N) complexity. The tree of partitions is O(lg N) high, so the total work is O(N lg N).
With the tree of lazy IEnumerables, at the bottom of the tree there are N partitions. Computing each partition requires a traversal of N elements, each of which requires lg(N) comparisons up the tree. To compute all the partitions at the bottom of the tree, then, requires O(N^2 lg N) comparisons.
(Is this right? I can hardly believe this. Somebody please check this for me.)
In any case, it is indeed cool that IEnumerable can be used this way to build up complicated structures of computation. But if it does increase the computational complexity as much as I think it does, it would seem that programming this way is something that should be avoided unless one is extremely careful.
#2
117
Background
While the question appears simple, the actual answer requires some background to make sense. If you want to skip to the conclusion, scroll down...
Pick your comparison point - Basic functionality
Using basic concepts, C#'s IEnumerable concept is more closely related to Java's Iterable, which is able to create as many Iterators as you want. IEnumerables create IEnumerators. Java's Iterables create Iterators.
The history of each concept is similar, in that both IEnumerable and Iterable have a basic motivation to allow 'for-each' style looping over the members of data collections. That's an oversimplification as they both allow more than just that, and they also arrived at that stage via different progressions, but it is a significant common feature regardless.
Let's compare that feature: in both languages, if a class implements IEnumerable/Iterable, then that class must implement at least a single method (for C#, it's GetEnumerator and for Java it's iterator()). In each case, the instance returned from that (IEnumerator/Iterator) allows you to access the current and subsequent members of the data. This feature is used in the for-each language syntax.
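On the Java side, a minimal sketch looks like this. IntRange is an invented class, not part of the JDK; the point is only that implementing iterator() is enough to take part in the for-each syntax, and that each loop gets a fresh Iterator.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Illustrative class: a half-open integer range [from, to) usable in a for-each loop.
class IntRange implements Iterable<Integer> {
    private final int from;
    private final int to;

    IntRange(int from, int to) {
        this.from = from;
        this.to = to;
    }

    // The single method Iterable requires; each call returns a new, independent Iterator.
    @Override public Iterator<Integer> iterator() {
        return new Iterator<Integer>() {
            private int cursor = from;
            @Override public boolean hasNext() { return cursor < to; }
            @Override public Integer next() {
                if (!hasNext()) throw new NoSuchElementException();
                return cursor++;
            }
        };
    }

    static int sum(Iterable<Integer> values) {
        int total = 0;
        for (int v : values) total += v;   // for-each calls iterator() behind the scenes
        return total;
    }

    public static void main(String[] args) {
        IntRange range = new IntRange(1, 5);
        System.out.println(sum(range));
        System.out.println(sum(range));    // iterating again works: a new Iterator each time
    }
}
```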
Pick your comparison point - Enhanced functionality
IEnumerable in C# has been extended to allow a number of other language features (mostly related to LINQ). Features added include selections, projections, aggregations, etc. These extensions are strongly motivated by set theory, similar to SQL and relational database concepts.
Java 8 has also had functionality added to enable a degree of functional programming using Streams and Lambdas. Note that Java 8 streams are not primarily motivated by set theory, but by functional programming. Regardless, there are a lot of parallels.
So, this is the second point. The enhancements made to C# were implemented as an enhancement to the IEnumerable concept. In Java, though, the enhancements made were implemented by creating new base concepts of Lambdas and Streams, and then also creating a relatively trivial way to convert from Iterators and Iterables to Streams, and vice versa.
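For reference, the conversions look roughly like this. The helper names fromIterable and fromIterator are mine; StreamSupport, Spliterators and Stream.iterator() are the standard JDK pieces doing the work.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class StreamConversions {
    // Iterable -> Stream: every Iterable has a default spliterator() method.
    static <T> Stream<T> fromIterable(Iterable<T> iterable) {
        return StreamSupport.stream(iterable.spliterator(), false);
    }

    // Iterator -> Stream: wrap it in a Spliterator of unknown size first.
    static <T> Stream<T> fromIterator(Iterator<T> iterator) {
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED), false);
    }

    public static void main(String[] args) {
        Iterable<String> words = Arrays.asList("a", "b", "c");
        System.out.println(fromIterable(words).map(String::toUpperCase).collect(Collectors.toList()));
        System.out.println(fromIterator(words.iterator()).count());

        // Stream -> Iterator: iterator() is itself a terminal operation.
        Iterator<String> it = Stream.of("x", "y").iterator();
        while (it.hasNext()) System.out.println(it.next());
    }
}
```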
So, comparing IEnumerable to Java's Stream concept is incomplete. You need to compare it to the combined Streams and Collections APIs in Java.
In Java, Streams are not the same as Iterables, or Iterators
Streams are not designed to solve problems the same way that iterators are:
- Iterators are a way of describing the sequence of data.
- Streams are a way of describing a sequence of data transformations.
With an Iterator, you get a data value, process it, and then get another data value.
With Streams, you chain a sequence of functions together, then you feed an input value to the stream, and get the output value from the combined sequence. Note, in Java terms, each function is encapsulated in a single Stream instance. The Streams API allows you to link a sequence of Stream instances in a way that chains a sequence of transformation expressions.
In order to complete the Stream concept, you need a source of data to feed the stream, and a terminal function that consumes the stream.
The way you feed values into the stream may in fact be from an Iterable, but the Stream sequence itself is not an Iterable; it is a compound function.
A Stream is also intended to be lazy, in the sense that it only does work when you request a value from it.
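A quick way to see the laziness is to count how many elements have been pulled before and after the terminal operation. This is only a sketch with invented names; peek is used purely as a probe.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

class LazyStreamDemo {
    // Returns {elements pulled before the terminal operation, elements pulled after it}.
    static int[] elementsPulled(List<Integer> source) {
        AtomicInteger pulled = new AtomicInteger();
        Stream<Integer> pipeline = source.stream()
                .peek(i -> pulled.incrementAndGet())   // probe: counts elements entering the chain
                .filter(i -> i > 2);
        int beforeTerminal = pulled.get();             // pipeline is built, nothing has run yet
        pipeline.findFirst();                          // terminal operation requests one value
        int afterTerminal = pulled.get();              // only as many as findFirst needed
        return new int[] { beforeTerminal, afterTerminal };
    }

    public static void main(String[] args) {
        int[] r = elementsPulled(Arrays.asList(1, 2, 3, 4, 5));
        System.out.println(r[0] + " before, " + r[1] + " after");
    }
}
```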
Note these significant assumptions and features of Streams:
- A Stream in Java is a transformation engine: it transforms a data item in one state to being in another state.
- streams have no concept of the data order or position; they simply transform whatever they are asked to.
- streams can be supplied with data from many sources, including other streams, Iterators, Iterables, Collections,
- you cannot "reset" a stream, that would be like "reprogramming the transformation". Resetting the data source is probably what you want.
- there is logically only 1 data item 'in flight' in the stream at any time (unless the stream is a parallel stream, at which point, there is 1 item per thread). This is independent of the data source which may have more than the current items 'ready' to be supplied to the stream, or the stream collector which may need to aggregate and reduce multiple values.
- Streams can be unbounded (infinite), limited only by the data source or the collector (which can be infinite too).
- Streams are 'chainable', the output of filtering one stream, is another stream. Values input to and transformed by a stream can in turn be supplied to another stream which does a different transformation. The data, in its transformed state flows from one stream to the next. You do not need to intervene and pull the data from one stream and plug it in to the next.
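Two of the points above, unbounded sources and chaining, can be sketched together. The class and method names are invented for illustration; Stream.iterate supplies an endless sequence and limit is the only thing that makes the result finite.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class InfiniteStreamDemo {
    // Squares of the first few even numbers, taken from an unbounded source.
    static List<Integer> firstSquaresOfEvens(int howMany) {
        return Stream.iterate(1, i -> i + 1)   // unbounded source: 1, 2, 3, ...
                .filter(i -> i % 2 == 0)       // each stage feeds the next one directly
                .map(i -> i * i)
                .limit(howMany)                // without this the collect would never finish
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(firstSquaresOfEvens(3)); // prints [4, 16, 36]
    }
}
```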
C# Comparison
When you consider that a Java Stream is just one part of a supply, stream, and collect system, and that Streams and Iterators are often used together with Collections, it is no wonder that it is hard to relate them to the same concepts, which are almost all embedded into a single IEnumerable concept in C#.
Parts of IEnumerable (and closely related concepts) are apparent in all of the Java Iterator, Iterable, Lambda, and Stream concepts.
There are small things that the Java concepts can do that are harder in IEnumerable, and vice versa.
Conclusion
- There's no design problem here, just a problem in matching concepts between the languages.
- Streams solve problems in a different way
- Streams add functionality to Java (they add a different way of doing things, they do not take functionality away)
Adding Streams gives you more choices when solving problems, which is fair to classify as 'enhancing power', not 'reducing', 'taking away', or 'restricting' it.
Why are Java Streams once-off?
This question is misguided, because streams are function sequences, not data. Depending on the data source that feeds the stream, you can reset the data source and feed the same or a different stream.
Unlike C#'s IEnumerable, where an execution pipeline can be executed as many times as we want, in Java a stream can be 'iterated' only once.
Comparing an IEnumerable to a Stream is misguided. The context you are using to say IEnumerable can be executed as many times as you want is best compared to Java Iterables, which can be iterated as many times as you want. A Java Stream represents a subset of the IEnumerable concept, and not the subset that supplies data, and thus cannot be 'rerun'.
Any call to a terminal operation closes the stream, rendering it unusable. This 'feature' takes away a lot of power.
The first statement is true, in a sense. The 'takes away power' statement is not. You are still comparing Streams to IEnumerables. The terminal operation in the stream is like a 'break' clause in a for loop. You are always free to have another stream, if you want, and if you can re-supply the data you need. Again, if you consider the IEnumerable to be more like an Iterable, for this statement, Java does it just fine.
I imagine the reason for this is not technical. What were the design considerations behind this strange restriction?
The reason is technical, and for the simple reason that a Stream is a subset of what you think it is. The stream subset does not control the data supply, so you should reset the supply, not the stream. In that context, it is not so strange.
QuickSort example
Your quicksort example has the signature:
IEnumerable<int> QuickSort(IEnumerable<int> ints)
You are treating the input IEnumerable as a data source:
IEnumerable<int> lt = ints.Where(i => i < pivot);
Additionally, the return value is an IEnumerable too, which is a supply of data, and since this is a Sort operation, the order of that supply is significant. If you consider the Java Iterable class to be the appropriate match for this, specifically the List specialization of Iterable, since List is a supply of data which has a guaranteed order of iteration, then the equivalent Java code to your code would be:
Stream<Integer> quickSort(List<Integer> ints) {
    // Using a stream to access the data, instead of the simpler ints.isEmpty()
    if (!ints.stream().findAny().isPresent()) {
        return Stream.of();
    }
    // treating the ints as a data collection, just like the C#
    final Integer pivot = ints.get(0);
    // Using streams to get the two partitions
    List<Integer> lt = ints.stream().filter(i -> i < pivot).collect(Collectors.toList());
    List<Integer> gt = ints.stream().filter(i -> i > pivot).collect(Collectors.toList());
    return Stream.concat(Stream.concat(quickSort(lt), Stream.of(pivot)), quickSort(gt));
}
Note there is a bug (which I have reproduced), in that the sort does not handle duplicate values gracefully; it is a 'unique value' sort.
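If duplicates matter, one possible fix is to add a third partition for values equal to the pivot. This is a sketch of that variation, not part of the original C# code, and the class name is invented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class QuickSortWithDuplicates {
    static Stream<Integer> quickSort(List<Integer> ints) {
        if (ints.isEmpty()) {
            return Stream.empty();
        }
        final Integer pivot = ints.get(0);
        // Three partitions: smaller than, equal to (keeps duplicates), and larger than the pivot.
        List<Integer> lt = ints.stream().filter(i -> i < pivot).collect(Collectors.toList());
        List<Integer> eq = ints.stream().filter(i -> i.equals(pivot)).collect(Collectors.toList());
        List<Integer> gt = ints.stream().filter(i -> i > pivot).collect(Collectors.toList());
        return Stream.concat(Stream.concat(quickSort(lt), eq.stream()), quickSort(gt));
    }

    public static void main(String[] args) {
        System.out.println(quickSort(Arrays.asList(3, 1, 3, 2, 1)).collect(Collectors.toList()));
    }
}
```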
Also note how the Java code uses a data source (List) and stream concepts at different points, and that in C# those two 'personalities' can be expressed in just IEnumerable. Also, although I have used List as the base type, I could have used the more general Collection, and with a small iterator-to-Stream conversion, I could have used the even more general Iterable.
#3
20
Streams are built around Spliterators, which are stateful, mutable objects. They don’t have a “reset” action and in fact, requiring them to support such a rewind action would “take away much power”. How would Random.ints() be supposed to handle such a request?
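The statefulness is easy to observe directly on a Spliterator. A small sketch (the class and helper names are invented for this example):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;

class SpliteratorStateDemo {
    // Collects whatever the Spliterator still has to offer, advancing its internal cursor.
    static List<Integer> traverse(Spliterator<Integer> sp) {
        List<Integer> seen = new ArrayList<>();
        sp.forEachRemaining(seen::add);
        return seen;
    }

    public static void main(String[] args) {
        Spliterator<Integer> sp = Arrays.asList(1, 2, 3).spliterator();
        System.out.println(traverse(sp)); // all elements
        System.out.println(traverse(sp)); // nothing left, and there is no method to rewind
    }
}
```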
On the other hand, for Streams which have a retraceable origin, it is easy to construct an equivalent Stream to be used again. Just put the steps made to construct the Stream into a reusable method. Keep in mind that repeating these steps is not an expensive operation as all these steps are lazy operations; the actual work starts with the terminal operation and depending on the actual terminal operation entirely different code might get executed.
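As a sketch of such a reusable method (shortWords and the class name are illustrative only): every call builds a fresh pipeline over the same collection, and each pipeline can end in a different terminal operation.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ReusableStreamFactory {
    // The construction steps live in a method; calling it is cheap because every step is lazy.
    static Stream<String> shortWords(Collection<String> words) {
        return words.stream()
                .filter(w -> w.length() <= 3)
                .map(String::toUpperCase);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cat", "horse", "ox");
        System.out.println(shortWords(words).count());                           // first stream
        System.out.println(shortWords(words).collect(Collectors.joining(","))); // second stream
    }
}
```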
It would be up to you, the writer of such a method, to specify what calling the method twice implies: does it reproduce exactly the same sequence, as streams created for an unmodified array or collection do, or does it produce a stream with similar semantics but different elements, like a stream of random ints or a stream of console input lines?
By the way, to avoid confusion, a terminal operation consumes the Stream, which is distinct from closing the Stream as calling close() on the stream does (which is required for streams having associated resources, e.g. those produced by Files.lines()).
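The difference can be made visible with a close handler. This is only a sketch with invented names; onClose registers the handler, and the flag shows that the terminal operation alone does not trigger it.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Stream;

class ConsumeVersusCloseDemo {
    // Returns {handler ran after the terminal operation, handler ran after close()}.
    static boolean[] closeHandlerRan() {
        AtomicBoolean closed = new AtomicBoolean();
        Stream<String> s = Stream.of("a", "b").onClose(() -> closed.set(true));
        s.count();                              // terminal operation: consumes the stream
        boolean afterTerminal = closed.get();   // the close handler has not run yet
        s.close();                              // explicit close (or try-with-resources) runs it
        boolean afterClose = closed.get();
        return new boolean[] { afterTerminal, afterClose };
    }

    public static void main(String[] args) {
        boolean[] r = closeHandlerRan();
        System.out.println(r[0] + " " + r[1]);
    }
}
```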
It seems that a lot of confusion stems from the misguided comparison of IEnumerable with Stream. An IEnumerable represents the ability to provide an actual IEnumerator, so it’s like an Iterable in Java. In contrast, a Stream is a kind of iterator and comparable to an IEnumerator, so it’s wrong to claim that this kind of data type can be used multiple times in .NET; the support for IEnumerator.Reset is optional. The examples discussed here rather use the fact that an IEnumerable can be used to fetch new IEnumerators, and that works with Java’s Collections as well; you can get a new Stream. If the Java developers had decided to add the Stream operations to Iterable directly, with intermediate operations returning another Iterable, it would be really comparable and it could work the same way.
However, the developers decided against it and the decision is discussed in this question. The biggest point is the confusion about eager Collection operations and lazy Stream operations. By looking at the .NET API, I (yes, personally) find it justified. While it looks reasonable looking at IEnumerable alone, a particular Collection will have lots of methods manipulating the Collection directly and lots of methods returning a lazy IEnumerable, while the particular nature of a method isn’t always intuitively recognizable. The worst example I found (within the few minutes I looked at it) is List.Reverse(), whose name matches exactly the name of the inherited (is this the right terminus for extension methods?) Enumerable.Reverse() while having an entirely contradicting behavior.
Of course, these are two distinct decisions: the first to make Stream a type distinct from Iterable/Collection, and the second to make Stream a kind of one-time iterator rather than another kind of iterable. But these decisions were made together and it might be the case that separating the two was never considered. It wasn’t created with being comparable to .NET’s in mind.
The actual API design decision was to add an improved type of iterator, the Spliterator. Spliterators can be provided by the old Iterables (which is how these were retrofitted) or by entirely new implementations. Then, Stream was added as a high-level front-end to the rather low-level Spliterators. That’s it. You may discuss whether a different design would be better, but that’s not productive; it won’t change, given the way they are designed now.
实际的API设计决策是添加改进的迭代器类型Spliterator。Spliterators可以由旧的可迭代(这是对它们进行改造的方式)或全新的实现提供。然后,流被作为一个高级前端添加到相当低级的Spliterators。就是这样。您可能会讨论不同的设计是否更好,但这并不是有效的,考虑到它们现在的设计方式,它不会改变。
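The relationship described above can be seen directly in the JDK: Iterable has a default spliterator() method, and StreamSupport.stream turns any Spliterator into a Stream. The following is only a minimal sketch; the class name SpliteratorFrontEnd and the helper streamOf are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical helper class, not part of the JDK.
class SpliteratorFrontEnd {
    // Wraps the Spliterator of any Iterable in a sequential Stream.
    static <T> Stream<T> streamOf(Iterable<T> source) {
        Spliterator<T> spliterator = source.spliterator(); // default method on Iterable
        return StreamSupport.stream(spliterator, false);   // false = sequential
    }

    public static void main(String[] args) {
        Iterable<String> words = Arrays.asList("a", "bb", "ccc");
        // Each call asks the Iterable for a fresh Spliterator, so each Stream is new.
        List<Integer> lengths = streamOf(words).map(String::length).collect(Collectors.toList());
        long count = streamOf(words).count();
        System.out.println(lengths + " " + count);
    }
}
```

The Iterable stays reusable here; only each individual Stream built on top of it is one-shot.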
There is another implementation aspect you have to consider. Streams are not immutable data structures. Each intermediate operation may return a new Stream instance encapsulating the old one, but it may also manipulate its own instance instead and return itself (that doesn't preclude doing both for the same operation). Commonly known examples are operations like parallel or unordered, which do not add another step but manipulate the entire pipeline. Having such a mutable data structure and attempting to reuse it (or even worse, using it multiple times at the same time) doesn't play well…
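Both points can be observed with a small experiment. The sketch below reflects what I would expect on OpenJDK; the Stream specification only says an implementation "may" throw IllegalStateException on reuse, and the visibility of the parallel flag on an upstream stage is an implementation detail, so treat both results as assumptions. The class and method names are made up.

```java
import java.util.Arrays;
import java.util.stream.Stream;

// Hypothetical demo class; the observed behavior is assumed for OpenJDK.
class StreamReuseDemo {
    // Returns true if a second terminal operation on the same Stream is rejected.
    static boolean secondTraversalRejected() {
        Stream<Integer> s = Arrays.asList(1, 2, 3).stream();
        s.forEach(x -> { });          // first terminal operation consumes the pipeline
        try {
            s.forEach(x -> { });      // second terminal operation on the same instance
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // Returns the source stage's parallel flag after parallel() was called on a later stage.
    static boolean parallelFlagVisibleUpstream() {
        Stream<Integer> source = Arrays.asList(1, 2, 3).stream();
        Stream<Integer> mapped = source.map(x -> x + 1);
        mapped.parallel();            // adds no new stage, changes the whole pipeline
        return source.isParallel();
    }

    public static void main(String[] args) {
        System.out.println(secondTraversalRejected());
        System.out.println(parallelFlagVisibleUpstream());
    }
}
```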
For completeness, here is your quicksort example translated to the Java Stream API. It shows that it does not really "take away much power".
static Stream<Integer> quickSort(Supplier<Stream<Integer>> ints) {
    // Each ints.get() call yields a fresh Stream over the same source.
    final Optional<Integer> optPivot = ints.get().findAny();
    if (!optPivot.isPresent()) return Stream.empty();
    final int pivot = optPivot.get();
    Supplier<Stream<Integer>> lt = () -> ints.get().filter(i -> i < pivot);
    Supplier<Stream<Integer>> gt = () -> ints.get().filter(i -> i > pivot);
    return Stream.of(quickSort(lt), Stream.of(pivot), quickSort(gt)).flatMap(s -> s);
}
It can be used like this:
List<Integer> l=new Random().ints(100, 0, 1000).boxed().collect(Collectors.toList());
System.out.println(l);
System.out.println(quickSort(l::stream)
.map(Object::toString).collect(Collectors.joining(", ")));
You can write it even more compactly as:
static Stream<Integer> quickSort(Supplier<Stream<Integer>> ints) {
    return ints.get().findAny().map(pivot ->
            Stream.of(
                quickSort(() -> ints.get().filter(i -> i < pivot)),
                Stream.of(pivot),
                quickSort(() -> ints.get().filter(i -> i > pivot)))
            .flatMap(s -> s)).orElse(Stream.empty());
}
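For anyone who wants to run it, here is the compact version wrapped in a self-contained class. The wrapper class SupplierQuickSort and the helper sorted are my own names, added only for this sketch. One thing worth noticing: because the partitions use strict < and >, elements equal to the pivot are dropped, just as in the C# original, so duplicates disappear from the result.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical wrapper class around the quickSort method shown above.
class SupplierQuickSort {
    static Stream<Integer> quickSort(Supplier<Stream<Integer>> ints) {
        return ints.get().findAny().map(pivot ->
                Stream.of(
                    quickSort(() -> ints.get().filter(i -> i < pivot)),
                    Stream.of(pivot),
                    quickSort(() -> ints.get().filter(i -> i > pivot)))
                .flatMap(s -> s)).orElse(Stream.empty());
    }

    // Convenience helper: a List can hand out a new Stream on every call.
    static List<Integer> sorted(List<Integer> input) {
        return quickSort(input::stream).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(sorted(Arrays.asList(3, 1, 2, 5, 4)));
        System.out.println(sorted(Arrays.asList(2, 1, 2))); // the duplicate 2 is dropped
    }
}
```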
#4
8
I think there are very few differences between the two when you look closely enough.
On its face, an IEnumerable does appear to be a reusable construct:
IEnumerable<int> numbers = new int[] { 1, 2, 3, 4, 5 };
foreach (var n in numbers) {
    Console.WriteLine(n);
}
However, the compiler is actually doing a little bit of work to help us out; it generates code roughly equivalent to the following:
IEnumerable<int> numbers = new int[] { 1, 2, 3, 4, 5 };
IEnumerator<int> enumerator = numbers.GetEnumerator();
while (enumerator.MoveNext()) {
    Console.WriteLine(enumerator.Current);
}
Each time you actually iterate over the enumerable, the compiler creates an enumerator. The enumerator is not reusable; further calls to MoveNext will just return false, and there is no way to reset it to the beginning. If you want to iterate over the numbers again, you will need to create another enumerator instance.
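The same split exists on the Java side between Iterable and Iterator, which may make the comparison easier to follow. A minimal sketch, with a made-up class name and helper:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical demo class for the Iterable/Iterator distinction in Java.
class IterableVsIterator {
    // Drains an Iterator into a list, leaving the Iterator exhausted.
    static <T> List<T> drain(Iterator<T> it) {
        List<T> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        Iterable<Integer> numbers = Arrays.asList(1, 2, 3);
        Iterator<Integer> first = numbers.iterator();
        System.out.println(drain(first));              // all elements
        System.out.println(drain(first));              // same Iterator again: nothing left
        System.out.println(drain(numbers.iterator())); // fresh Iterator: all elements again
    }
}
```

The Iterable plays the role of IEnumerable here, and each Iterator, like each IEnumerator, is good for a single pass.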
To better illustrate that an IEnumerable has (or can have) the same 'feature' as a Java Stream, consider an enumerable whose source of numbers is not a static collection. For example, we can create an enumerable object which generates a sequence of 5 random numbers:
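using System;
using System.Collections;
using System.Collections.Generic;

class Generator : IEnumerator<int> {
    Random _r;
    int _current;
    int _count = 0;
    public Generator(Random r) {
        _r = r;
    }
    public bool MoveNext() {
        _current = _r.Next();
        _count++;
        return _count <= 5;
    }
    public int Current {
        get { return _current; }
    }
    // Members required by IEnumerator<int> that the original snippet left out.
    object IEnumerator.Current {
        get { return _current; }
    }
    public void Reset() {
        throw new NotSupportedException();
    }
    public void Dispose() { }
}
class RandomNumberStream : IEnumerable<int> {
    Random _r = new Random();
    public IEnumerator<int> GetEnumerator() {
        return new Generator(_r);
    }
    // Explicit interface implementations cannot carry an access modifier.
    IEnumerator IEnumerable.GetEnumerator() {
        return this.GetEnumerator();
    }
}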
Now we have very similar code to the previous array-based enumerable, but with a second iteration over numbers:
IEnumerable<int> numbers = new RandomNumberStream();
foreach (var n in numbers) {
    Console.WriteLine(n);
}
foreach (var n in numbers) {
    Console.WriteLine(n);
}
The second time we iterate over numbers we will get a different sequence of numbers, so it isn't reusable in the same sense. Or, we could have written the RandomNumberStream to throw an exception if you try to iterate over it multiple times, making the enumerable actually unusable (like a Java Stream).
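To show what such a throw-on-second-traversal design could look like, here is a rough Java counterpart. OneShotIterable is a hypothetical class written for this illustration, not something from the JDK; it simply hands out its underlying Iterator once and refuses afterwards.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical wrapper, not a JDK class: an Iterable that can be traversed only once.
class OneShotIterable<T> implements Iterable<T> {
    private Iterator<T> source;

    OneShotIterable(Iterator<T> source) {
        this.source = source;
    }

    @Override
    public Iterator<T> iterator() {
        if (source == null) {
            throw new IllegalStateException("already traversed");
        }
        Iterator<T> it = source;
        source = null; // forget the source so a second request fails fast
        return it;
    }

    public static void main(String[] args) {
        OneShotIterable<Integer> once = new OneShotIterable<>(Arrays.asList(1, 2, 3).iterator());
        for (int n : once) {
            System.out.println(n);
        }
        try {
            for (int n : once) {
                System.out.println(n);
            }
        } catch (IllegalStateException e) {
            System.out.println("second traversal rejected: " + e.getMessage());
        }
    }
}
```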
Also, what does your enumerable-based quick sort mean when applied to a RandomNumberStream?
Conclusion
So, the biggest difference is that .NET allows you to reuse an IEnumerable by implicitly creating a new IEnumerator in the background whenever it would need to access elements in the sequence.
This implicit behavior is often useful (and 'powerful' as you state), because we can repeatedly iterate over a collection.
But sometimes, this implicit behavior can actually cause problems. If your data source is not static, or is costly to access (like a database or web site), then a lot of assumptions about IEnumerable have to be discarded; reuse is not that straightforward.
#5
1
It is possible to bypass some of the "run once" protections in the Stream API; for example, we can avoid java.lang.IllegalStateException exceptions (with message "stream has already been operated upon or closed") by referencing and reusing the Spliterator (rather than the Stream directly).
For example, this code will run without throwing an exception:
Spliterator<String> split = Stream.of("hello", "world")
        .map(s -> "prefix-" + s)
        .spliterator();
Stream<String> replayable1 = StreamSupport.stream(split, false);
Stream<String> replayable2 = StreamSupport.stream(split, false);
replayable1.forEach(System.out::println);
replayable2.forEach(System.out::println);
However the output will be limited to
prefix-hello
prefix-world
rather than repeating the output twice. This is because the ArraySpliterator used as the Stream source is stateful and stores its current position. When we replay this Stream, we start again at the end.
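If you would rather check this than take the printed output on faith, the same experiment can be written so that it counts what each Stream sees. This is only a sketch; the class and method names are invented, and the result assumes the OpenJDK behavior described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical demo class: two Streams built over one shared Spliterator.
class SharedSpliteratorDemo {
    // Returns how many elements the first and the second Stream saw.
    static int[] elementsSeen() {
        Spliterator<String> split = Stream.of("hello", "world")
                .map(s -> "prefix-" + s)
                .spliterator();
        List<String> first = new ArrayList<>();
        List<String> second = new ArrayList<>();
        StreamSupport.stream(split, false).forEach(first::add);  // drains the Spliterator
        StreamSupport.stream(split, false).forEach(second::add); // finds it already drained
        return new int[] { first.size(), second.size() };
    }

    public static void main(String[] args) {
        int[] seen = elementsSeen();
        System.out.println(seen[0] + " then " + seen[1]);
    }
}
```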
We have a number of options to solve this challenge:
- We could make use of a stateless Stream creation method such as Stream#generate(). We would have to manage state externally in our own code and reset between Stream "replays":
Spliterator<String> split = Stream.generate(this::nextValue)
        .map(s -> "prefix-" + s)
        .spliterator();
Stream<String> replayable1 = StreamSupport.stream(split, false);
Stream<String> replayable2 = StreamSupport.stream(split, false);
replayable1.forEach(System.out::println);
this.resetCounter();
replayable2.forEach(System.out::println);
- Another (slightly better but not perfect) solution to this is to write our own ArraySpliterator (or similar Stream source) that includes some capacity to reset the current counter. If we were to use it to generate the Stream we could potentially replay them successfully.
MyArraySpliterator<String> arraySplit = new MyArraySpliterator<>("hello", "world");
Spliterator<String> split = StreamSupport.stream(arraySplit, false)
        .map(s -> "prefix-" + s)
        .spliterator();
Stream<String> replayable1 = StreamSupport.stream(split, false);
Stream<String> replayable2 = StreamSupport.stream(split, false);
replayable1.forEach(System.out::println);
arraySplit.reset();
replayable2.forEach(System.out::println);
- The best solution to this problem (in my opinion) is to make a new copy of any stateful Spliterators used in the Stream pipeline when new operators are invoked on the Stream. This is more complex and involved to implement, but if you don't mind using third party libraries, cyclops-react has a Stream implementation that does exactly this. (Disclosure: I am the lead developer for this project.)
Stream<String> replayableStream = ReactiveSeq.of("hello", "world")
        .map(s -> "prefix-" + s);
replayableStream.forEach(System.out::println);
replayableStream.forEach(System.out::println);
This will print
prefix-hello
prefix-world
prefix-hello
prefix-world
as expected.
#1
334
I have some recollections from the early design of the Streams API that might shed some light on the design rationale.
Back in 2012, we were adding lambdas to the language, and we wanted a collections-oriented or "bulk data" set of operations, programmed using lambdas, that would facilitate parallelism. The idea of lazily chaining operations together was well established by this point. We also didn't want the intermediate operations to store results.
The main issues we needed to decide were what the objects in the chain looked like in the API and how they hooked up to data sources. The sources were often collections, but we also wanted to support data coming from a file or the network, or data generated on-the-fly, e.g., from a random number generator.
There were many influences of existing work on the design. Among the more influential were Google's Guava library and the Scala collections library. (If anybody is surprised about the influence from Guava, note that Kevin Bourrillion, Guava lead developer, was on the JSR-335 Lambda expert group.) On Scala collections, we found this talk by Martin Odersky to be of particular interest: Future-Proofing Scala Collections: from Mutable to Persistent to Parallel. (Stanford EE380, 2011 June 1.)
Our prototype design at the time was based around Iterable. The familiar operations filter, map, and so forth were extension (default) methods on Iterable. Calling one added an operation to the chain and returned another Iterable. A terminal operation like count would call iterator() up the chain to the source, and the operations were implemented within each stage's Iterator.
Since these are Iterables, you can call the iterator() method more than once. What should happen then?
If the source is a collection, this mostly works fine. Collections are Iterable, and each call to iterator() produces a distinct Iterator instance that is independent of any other active instances, and each traverses the collection independently. Great.
Now what if the source is one-shot, like reading lines from a file? Maybe the first Iterator should get all the values but the second and subsequent ones should be empty. Maybe the values should be interleaved among the Iterators. Or maybe each Iterator should get all the same values. Then, what if you have two iterators and one gets farther ahead of the other? Somebody will have to buffer up the values in the second Iterator until they're read. Worse, what if you get one Iterator and read all the values, and only then get a second Iterator? Where do the values come from now? Is there a requirement for them all to be buffered up just in case somebody wants a second Iterator?
Clearly, allowing multiple Iterators over a one-shot source raises a lot of questions. We didn't have good answers for them. We wanted consistent, predictable behavior for what happens if you call iterator() twice. This pushed us toward disallowing multiple traversals, making the pipelines one-shot.
We also observed others bumping into these issues. In the JDK, most Iterables are collections or collection-like objects, which allow multiple traversal. It isn't specified anywhere, but there seemed to be an unwritten expectation that Iterables allow multiple traversal. A notable exception is the NIO DirectoryStream interface. Its specification includes this interesting warning:
While DirectoryStream extends Iterable, it is not a general-purpose Iterable as it supports only a single Iterator; invoking the iterator method to obtain a second or subsequent iterator throws IllegalStateException.
[bold in original]
This seemed unusual and unpleasant enough that we didn't want to create a whole bunch of new Iterables that might be once-only. This pushed us away from using Iterable.
About this time, an article by Bruce Eckel appeared that described a spot of trouble he'd had with Scala. He'd written this code:
// Scala
val lines = fromString(data).getLines
val registrants = lines.map(Registrant)
registrants.foreach(println)
registrants.foreach(println)
It's pretty straightforward. It parses lines of text into Registrant objects and prints them out twice. Except that it actually only prints them out once. It turns out that he thought that registrants was a collection, when in fact it's an iterator. The second call to foreach encounters an empty iterator, from which all values have been exhausted, so it prints nothing.
This kind of experience convinced us that it was very important to have clearly predictable results if multiple traversal is attempted. It also highlighted the importance of distinguishing lazy pipeline-like structures from actual collections that store data. This in turn drove the separation of the lazy pipeline operations into the new Stream interface, keeping only eager, mutative operations directly on Collections. Brian Goetz has explained the rationale for that.
What about allowing multiple traversal for collection-based pipelines but disallowing it for non-collection-based pipelines? It's inconsistent, but it's sensible. If you're reading values from the network, of course you can't traverse them again. If you want to traverse them multiple times, you have to pull them into a collection explicitly.
But let's explore allowing multiple traversal from collections-based pipelines. Let's say you did this:
Iterable<?> it = source.filter(...).map(...).filter(...).map(...);
it.into(dest1);
it.into(dest2);
(The into operation is now spelled collect(toList()).)
If source is a collection, then the first into() call will create a chain of Iterators back to the source, execute the pipeline operations, and send the results into the destination. The second call to into() will create another chain of Iterators, and execute the pipeline operations again. This isn't obviously wrong but it does have the effect of performing all the filter and map operations a second time for each element. I think many programmers would have been surprised by this behavior.
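The prototype Iterable API never shipped, so it can't be run as written, but the effect is easy to reproduce with the released API by using a Supplier that rebuilds the pipeline for each traversal. This is a sketch under that assumption; the class and method names are made up, and the counter exists only to make the repeated work visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Stream;

// Hypothetical demo class: a re-creatable pipeline stands in for the prototype's Iterable.
class RepeatedTraversalCost {
    // Sends the same pipeline into two destinations and returns
    // how many times the filter predicate ran in total.
    static int predicateCallsForTwoTraversals(List<Integer> source) {
        AtomicInteger calls = new AtomicInteger();
        Supplier<Stream<Integer>> pipeline = () -> source.stream()
                .filter(x -> { calls.incrementAndGet(); return x % 2 == 0; })
                .map(x -> x * 10);
        List<Integer> dest1 = new ArrayList<>();
        List<Integer> dest2 = new ArrayList<>();
        pipeline.get().forEach(dest1::add); // first traversal runs filter and map
        pipeline.get().forEach(dest2::add); // second traversal runs them all again
        return calls.get();
    }

    public static void main(String[] args) {
        System.out.println(predicateCallsForTwoTraversals(Arrays.asList(1, 2, 3, 4)));
    }
}
```

With the Supplier, the recomputation is at least spelled out at the call site, which is roughly the point of making the pipeline itself one-shot.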
As I mentioned above, we had been talking to the Guava developers. One of the cool things they have is an Idea Graveyard where they describe features that they decided not to implement along with the reasons. The idea of lazy collections sounds pretty cool, but here's what they have to say about it. Consider a List.filter() operation that returns a List:
The biggest concern here is that too many operations become expensive, linear-time propositions. If you want to filter a list and get a list back, and not just a Collection or an Iterable, you can use ImmutableList.copyOf(Iterables.filter(list, predicate)), which "states up front" what it's doing and how expensive it is.
To take a specific example, what's the cost of get(0) or size() on a List? For commonly used classes like ArrayList, they're O(1). But if you call one of these on a lazily-filtered list, it has to run the filter over the backing list, and all of a sudden these operations are O(n). Worse, it has to traverse the backing list on every operation.
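To make the cost concrete, here is a rough sketch of what such a lazily-filtered List view might look like. LazyFilteredList is a hypothetical class written only for this illustration (neither the JDK nor Guava provides it), and the predicateCalls counter is there purely to expose how much work each call does.

```java
import java.util.AbstractList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical lazy view: nothing is stored, every call re-filters the backing list.
class LazyFilteredList<T> extends AbstractList<T> {
    private final List<T> backing;
    private final Predicate<T> keep;
    int predicateCalls = 0; // exposed only to make the cost visible

    LazyFilteredList(List<T> backing, Predicate<T> keep) {
        this.backing = backing;
        this.keep = keep;
    }

    @Override
    public int size() {
        int n = 0;
        for (T t : backing) {           // O(n): the whole backing list is scanned
            predicateCalls++;
            if (keep.test(t)) n++;
        }
        return n;
    }

    @Override
    public T get(int index) {
        if (index < 0) throw new IndexOutOfBoundsException("index " + index);
        int seen = 0;
        for (T t : backing) {           // O(n) in the worst case
            predicateCalls++;
            if (keep.test(t) && seen++ == index) return t;
        }
        throw new IndexOutOfBoundsException("index " + index);
    }

    public static void main(String[] args) {
        LazyFilteredList<Integer> evens =
                new LazyFilteredList<>(Arrays.asList(1, 2, 3, 4, 5, 6), x -> x % 2 == 0);
        System.out.println(evens.size() + " elements after " + evens.predicateCalls + " predicate calls");
        System.out.println(evens.get(2) + " after " + evens.predicateCalls + " predicate calls in total");
    }
}
```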
This seemed to us to be too much laziness. It's one thing to set up some operations and defer actual execution until you say "Go". It's another to set things up in such a way that hides a potentially large amount of recomputation.
In proposing to disallow non-linear or "no-reuse" streams, Paul Sandoz described the potential consequences of allowing them as giving rise to "unexpected or confusing results." He also mentioned that parallel execution would make things even trickier. Finally, I'd add that a pipeline operation with side effects would lead to difficult and obscure bugs if the operation were unexpectedly executed multiple times, or at least a different number of times than the programmer expected. (But Java programmers don't write lambda expressions with side effects, do they? DO THEY??)
So that's the basic rationale for the Java 8 Streams API design that allows one-shot traversal and that requires a strictly linear (no branching) pipeline. It provides consistent behavior across multiple different stream sources, it clearly separates lazy from eager operations, and it provides a straightforward execution model.
With regard to IEnumerable, I am far from an expert on C# and .NET, so I would appreciate being corrected (gently) if I draw any incorrect conclusions. It does appear, however, that IEnumerable permits multiple traversal to behave differently with different sources; and it permits a branching structure of nested IEnumerable operations, which may result in some significant recomputation. While I appreciate that different systems make different tradeoffs, these are two characteristics that we sought to avoid in the design of the Java 8 Streams API.
The quicksort example given by the OP is interesting, puzzling, and I'm sorry to say, somewhat horrifying. Calling QuickSort takes an IEnumerable and returns an IEnumerable, so no sorting is actually done until the final IEnumerable is traversed. What the call seems to do, though, is build up a tree structure of IEnumerables that reflects the partitioning that quicksort would do, without actually doing it. (This is lazy computation, after all.) If the source has N elements, the tree will be N elements wide at its widest, and it will be lg(N) levels deep.
It seems to me -- and once again, I'm not a C# or .NET expert -- that this will cause certain innocuous-looking calls, such as pivot selection via ints.First(), to be more expensive than they look. At the first level, of course, it's O(1). But consider a partition deep in the tree, at the right-hand edge. To compute the first element of this partition, the entire source has to be traversed, an O(N) operation. But since the partitions above are lazy, they must be recomputed, requiring O(lg N) comparisons. So selecting the pivot would be an O(N lg N) operation, which is as expensive as an entire sort.
But we don't actually sort until we traverse the returned IEnumerable. In the standard quicksort algorithm, each level of partitioning doubles the number of partitions. Each partition is only half the size, so each level remains at O(N) complexity. The tree of partitions is O(lg N) high, so the total work is O(N lg N).
With the tree of lazy IEnumerables, at the bottom of the tree there are N partitions. Computing each partition requires a traversal of N elements, each of which requires lg(N) comparisons up the tree. To compute all the partitions at the bottom of the tree, then, requires O(N^2 lg N) comparisons.
(Is this right? I can hardly believe this. Somebody please check this for me.)
In any case, it is indeed cool that IEnumerable can be used this way to build up complicated structures of computation. But if it does increase the computational complexity as much as I think it does, it would seem that programming this way is something that should be avoided unless one is extremely careful.
#2
117
Background
While the question appears simple, the actual answer requires some background to make sense. If you want to skip to the conclusion, scroll down...
Pick your comparison point - Basic functionality
Using basic concepts, C#'s IEnumerable concept is more closely related to Java's Iterable, which is able to create as many Iterators as you want. IEnumerables create IEnumerators. Java's Iterables create Iterators.
The history of each concept is similar, in that both IEnumerable and Iterable have a basic motivation to allow 'for-each' style looping over the members of data collections. That's an oversimplification as they both allow more than just that, and they also arrived at that stage via different progressions, but it is a significant common feature regardless.
Let's compare that feature: in both languages, if a class implements the IEnumerable/Iterable, then that class must implement at least a single method (for C#, it's GetEnumerator and for Java it's iterator()). In each case, the instance returned from that (IEnumerator/Iterator) allows you to access the current and subsequent members of the data. This feature is used in the for-each language syntax.
Pick your comparison point - Enhanced functionality
IEnumerable in C# has been extended to allow a number of other language features (mostly related to Linq). Features added include selections, projections, aggregations, etc. These extensions have a strong motivation from use in set-theory, similar to SQL and Relational Database concepts.
Java 8 has also had functionality added to enable a degree of functional programming using Streams and Lambdas. Note that Java 8 streams are not primarily motivated by set theory, but by functional programming. Regardless, there are a lot of parallels.
So, this is the second point. The enhancements made to C# were implemented as an enhancement to the IEnumerable concept. In Java, though, the enhancements made were implemented by creating new base concepts of Lambdas and Streams, and then also creating a relatively trivial way to convert from Iterators and Iterables to Streams, and vice versa.
So, comparing IEnumerable to Java's Stream concept is incomplete. You need to compare it to the combined Streams and Collections APIs in Java.
In Java, Streams are not the same as Iterables, or Iterators
Streams are not designed to solve problems the same way that iterators are:
- Iterators are a way of describing the sequence of data.
- Streams are a way of describing a sequence of data transformations.
With an Iterator, you get a data value, process it, and then get another data value.
With Streams, you chain a sequence of functions together, then you feed an input value to the stream, and get the output value from the combined sequence. Note, in Java terms, each function is encapsulated in a single Stream instance. The Streams API allows you to link a sequence of Stream instances in a way that chains a sequence of transformation expressions.
In order to complete the Stream concept, you need a source of data to feed the stream, and a terminal function that consumes the stream.
The way you feed values in to the stream may in fact be from an Iterable, but the Stream sequence itself is not an Iterable, it is a compound function.
A Stream is also intended to be lazy, in the sense that it only does work when you request a value from it.
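The laziness is easy to observe: building the chain runs none of the functions, and they only execute once a terminal operation pulls values through. A minimal sketch, with invented class and method names and a counter added only for observation:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical demo class showing that intermediate operations are deferred.
class LazyPipelineDemo {
    // Returns {map calls before the terminal operation, map calls after it}.
    static int[] mapCallsBeforeAndAfterCollect() {
        AtomicInteger calls = new AtomicInteger();
        Stream<Integer> pipeline = Arrays.asList(1, 2, 3).stream()   // source
                .map(x -> { calls.incrementAndGet(); return x * 2; }); // transformation
        int before = calls.get();                  // nothing has been pulled through yet
        List<Integer> result = pipeline.collect(Collectors.toList()); // terminal operation
        int after = calls.get();
        System.out.println(result);
        return new int[] { before, after };
    }

    public static void main(String[] args) {
        int[] calls = mapCallsBeforeAndAfterCollect();
        System.out.println(calls[0] + " calls before, " + calls[1] + " calls after");
    }
}
```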
Note these significant assumptions and features of Streams:
- A Stream in Java is a transformation engine, it transforms a data item in one state, to being in another state.
- streams have no concept of the data order or position, they simply transform whatever they are asked to.
- streams can be supplied with data from many sources, including other streams, Iterators, Iterables, and Collections.
- you cannot "reset" a stream, that would be like "reprogramming the transformation". Resetting the data source is probably what you want.
- there is logically only 1 data item 'in flight' in the stream at any time (unless the stream is a parallel stream, at which point, there is 1 item per thread). This is independent of the data source which may have more than the current items 'ready' to be supplied to the stream, or the stream collector which may need to aggregate and reduce multiple values.
- Streams can be unbounded (infinite), limited only by the data source, or collector (which can be infinite too).
- Streams are 'chainable', the output of filtering one stream, is another stream. Values input to and transformed by a stream can in turn be supplied to another stream which does a different transformation. The data, in its transformed state flows from one stream to the next. You do not need to intervene and pull the data from one stream and plug it in to the next.
C# Comparison
When you consider that a Java Stream is just a part of a supply, stream, and collect system, and that Streams and Iterators are often used together with Collections, then it is no wonder that it is hard to relate to the same concepts which are almost all embedded in to a single IEnumerable concept in C#.
Parts of IEnumerable (and closely related concepts) are apparent in all of the Java Iterator, Iterable, Lambda, and Stream concepts.
There are small things that the Java concepts can do that are harder in IEnumerable, and vice versa.
Conclusion
- There's no design problem here, just a problem in matching concepts between the languages.
- Streams solve problems in a different way
- Streams add functionality to Java (they add a different way of doing things, they do not take functionality away)
Adding Streams gives you more choices when solving problems, which is fair to classify as 'enhancing power', not 'reducing', 'taking away', or 'restricting' it.
Why are Java Streams once-off?
This question is misguided, because streams are function sequences, not data. Depending on the data source that feeds the stream, you can reset the data source, and feed the same, or different stream.
Unlike C#'s IEnumerable, where an execution pipeline can be executed as many times as we want, in Java a stream can be 'iterated' only once.
Comparing an IEnumerable to a Stream is misguided. The sense in which you say an IEnumerable can be executed as many times as you want is best compared to Java Iterables, which can be iterated as many times as you want. A Java Stream represents a subset of the IEnumerable concept, and not the subset that supplies data, and thus cannot be 'rerun'.
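To make the distinction concrete, here is a minimal sketch (the class and method names are my own, purely for illustration). It asks a List, which plays the role of the re-iterable data supply, for a fresh stream twice, and then shows that a second terminal operation on one and the same Stream instance is rejected.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

class OneShotStreamDemo {

    // Sums the list twice, asking the data source for a fresh stream each time.
    static int sumTwice(List<Integer> source) {
        int first = source.stream().mapToInt(Integer::intValue).sum();
        int second = source.stream().mapToInt(Integer::intValue).sum();
        return first + second;
    }

    // Returns true if a second terminal operation on the same stream is rejected.
    static boolean secondTerminalOperationFails(List<Integer> source) {
        Stream<Integer> stream = source.stream();
        stream.count(); // first terminal operation consumes the stream
        try {
            stream.count(); // second terminal operation on the same instance
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3);
        System.out.println(sumTwice(numbers));                     // 12
        System.out.println(secondTerminalOperationFails(numbers)); // true
    }
}
```

The supply can be "rerun" as often as you like; only the individual pipeline is single-use.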
Any call to a terminal operation closes the stream, rendering it unusable. This 'feature' takes away a lot of power.
The first statement is true, in a sense. The 'takes away power' statement is not. You are still comparing Streams to IEnumerables. The terminal operation in the stream is like a 'break' clause in a for loop. You are always free to have another stream if you want, provided you can re-supply the data you need. Again, if you consider the IEnumerable to be more like an Iterable for this statement, Java does it just fine.
I imagine the reason for this is not technical. What were the design considerations behind this strange restriction?
The reason is technical, and for the simple reason that a Stream is a subset of what you think it is. The stream subset does not control the data supply, so you should reset the supply, not the stream. In that context, it is not so strange.
QuickSort example
Your quicksort example has the signature:
IEnumerable<int> QuickSort(IEnumerable<int> ints)
You are treating the input IEnumerable as a data source:
IEnumerable<int> lt = ints.Where(i => i < pivot);
Additionally, the return value is an IEnumerable too, which is a supply of data, and since this is a sort operation, the order of that supply is significant. If you consider the Java Iterable interface to be the appropriate match for this, specifically the List specialization of Iterable, since List is a supply of data which has a guaranteed order of iteration, then the equivalent Java code to your code would be:
Stream<Integer> quickSort(List<Integer> ints) {
    // Using a stream to access the data, instead of the simpler ints.isEmpty()
    if (!ints.stream().findAny().isPresent()) {
        return Stream.of();
    }
    // treating the ints as a data collection, just like the C#
    final Integer pivot = ints.get(0);
    // Using streams to get the two partitions
    List<Integer> lt = ints.stream().filter(i -> i < pivot).collect(Collectors.toList());
    List<Integer> gt = ints.stream().filter(i -> i > pivot).collect(Collectors.toList());
    return Stream.concat(Stream.concat(quickSort(lt), Stream.of(pivot)), quickSort(gt));
}
Note there is a bug (which I have reproduced) in that the sort does not handle duplicate values gracefully; it is a 'unique value' sort.
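If you did want duplicates to survive, one possible variation (my own sketch, not part of the original C# code; the class name and the `sorted` helper are hypothetical) is to collect a third partition holding every element equal to the pivot and concatenate that instead of the single pivot value:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class DuplicateSafeQuickSort {

    static Stream<Integer> quickSort(List<Integer> ints) {
        if (ints.isEmpty()) {
            return Stream.empty();
        }
        final Integer pivot = ints.get(0);
        List<Integer> lt = ints.stream().filter(i -> i < pivot).collect(Collectors.toList());
        // Every element equal to the pivot, including the pivot itself, lands here.
        List<Integer> eq = ints.stream().filter(i -> i.equals(pivot)).collect(Collectors.toList());
        List<Integer> gt = ints.stream().filter(i -> i > pivot).collect(Collectors.toList());
        return Stream.concat(Stream.concat(quickSort(lt), eq.stream()), quickSort(gt));
    }

    // Convenience wrapper that materializes the sorted stream.
    static List<Integer> sorted(List<Integer> ints) {
        return quickSort(ints).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(sorted(Arrays.asList(3, 1, 3, 2, 1))); // [1, 1, 2, 3, 3]
    }
}
```

This costs one extra pass over the data per level, which is in keeping with the "expressive rather than efficient" spirit of the example.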
Also note how the Java code uses data source (List) and stream concepts at different points, and that in C# those two 'personalities' can be expressed in just IEnumerable. Also, although I have used List as the base type, I could have used the more general Collection, and with a small iterator-to-Stream conversion, I could have used the even more general Iterable.
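The "small iterator-to-Stream conversion" mentioned above could look roughly like this. It is only a sketch: `streamOf` is a hypothetical helper name, while `StreamSupport.stream` and `Iterable.spliterator()` are the standard library calls doing the work.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class IterableStreams {

    // Hypothetical helper: builds a new sequential stream from any Iterable.
    static <T> Stream<T> streamOf(Iterable<T> source) {
        return StreamSupport.stream(source.spliterator(), false);
    }

    public static void main(String[] args) {
        List<Integer> backing = Arrays.asList(3, 1, 2);
        // An Iterable that is deliberately not a Collection.
        Iterable<Integer> source = backing::iterator;
        // Each call asks the Iterable for a fresh iterator, so it can be repeated.
        System.out.println(streamOf(source).map(i -> i * 10).collect(Collectors.toList()));
        System.out.println(streamOf(source).count());
    }
}
```

With such a helper, the quickSort signature above could accept an Iterable and call `streamOf(ints)` wherever it currently calls `ints.stream()`.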
#3
20
Streams are built around Spliterators, which are stateful, mutable objects. They don’t have a “reset” action, and in fact, requiring them to support such a rewind action would “take away much power”. How would Random.ints() be supposed to handle such a request?
On the other hand, for Streams which have a retraceable origin, it is easy to construct an equivalent Stream to be used again. Just put the steps made to construct the Stream into a reusable method. Keep in mind that repeating these steps is not an expensive operation, as all these steps are lazy operations; the actual work starts with the terminal operation, and depending on the actual terminal operation, entirely different code might get executed.
It would be up to you, the writer of such a method, to specify what calling the method twice implies: does it reproduce exactly the same sequence, as streams created for an unmodified array or collection do, or does it produce a stream with similar semantics but different elements, like a stream of random ints or a stream of console input lines?
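As a small sketch of the "reusable method" idea (the class and method names are made up for illustration), the construction steps go into a factory method, and every caller gets its own lazy pipeline and is free to finish it with a different terminal operation:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ReusablePipeline {

    // Hypothetical factory: each call builds a fresh, lazy pipeline over the same source.
    static Stream<String> prefixed(List<String> words) {
        return words.stream().map(s -> "prefix-" + s);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("hello", "world");
        // Two independent pipelines, two different terminal operations.
        long count = prefixed(words).count();
        String joined = prefixed(words).collect(Collectors.joining(", "));
        System.out.println(count);  // 2
        System.out.println(joined); // prefix-hello, prefix-world
    }
}
```

Because the source here is an unmodified list, calling the method twice reproduces the same sequence; a factory built on Random.ints() would make the other choice.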
By the way, to avoid confusion: a terminal operation consumes the Stream, which is distinct from closing the Stream as calling close() on the stream does (which is required for streams having associated resources, e.g. those produced by Files.lines()).
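The difference between consuming and closing can be observed with a close handler. The following is a minimal sketch with names of my own choosing; it uses an in-memory stream with `onClose` instead of Files.lines() so that it needs no file.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Stream;

class ConsumeVersusClose {

    // Returns { handler ran after the terminal operation, handler ran after close() }.
    static boolean[] observe() {
        AtomicBoolean closed = new AtomicBoolean(false);
        Stream<String> stream = Stream.of("a", "b").onClose(() -> closed.set(true));
        stream.forEach(s -> { });            // terminal operation: consumes the stream
        boolean afterTerminal = closed.get();
        stream.close();                      // closing: runs the registered handler
        boolean afterClose = closed.get();
        return new boolean[] { afterTerminal, afterClose };
    }

    public static void main(String[] args) {
        boolean[] result = observe();
        System.out.println(result[0]); // false
        System.out.println(result[1]); // true
    }
}
```

For resource-backed streams the usual idiom is a try-with-resources block around the stream, so that close() runs even if the terminal operation throws.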
It seems that a lot of confusion stems from a misguided comparison of IEnumerable with Stream. An IEnumerable represents the ability to provide an actual IEnumerator, so it’s like an Iterable in Java. In contrast, a Stream is a kind of iterator and comparable to an IEnumerator, so it’s wrong to claim that this kind of data type can be used multiple times in .NET; the support for IEnumerator.Reset is optional. The examples discussed here rather use the fact that an IEnumerable can be used to fetch new IEnumerators, and that works with Java’s Collections as well; you can get a new Stream. If the Java developers had decided to add the Stream operations to Iterable directly, with intermediate operations returning another Iterable, it would have been really comparable and could have worked the same way.
However, the developers decided against it, and the decision is discussed in this question. The biggest point is the confusion about eager Collection operations and lazy Stream operations. Looking at the .NET API, I (yes, personally) find it justified. While it looks reasonable when looking at IEnumerable alone, a particular Collection will have lots of methods manipulating the Collection directly and lots of methods returning a lazy IEnumerable, while the particular nature of a method isn’t always intuitively recognizable. The worst example I found (within the few minutes I looked at it) is List.Reverse(), whose name matches exactly the name of the inherited (is this the right term for extension methods?) Enumerable.Reverse() while having entirely contradictory behavior.
Of course, these are two distinct decisions: the first to make Stream a type distinct from Iterable/Collection, and the second to make Stream a kind of one-time iterator rather than another kind of iterable. But these decisions were made together, and it might be the case that separating the two was never considered. The API wasn’t designed with comparability to .NET’s in mind.
The actual API design decision was to add an improved type of iterator, the Spliterator. Spliterators can be provided by the old Iterables (which is how these were retrofitted) or by entirely new implementations. Then, Stream was added as a high-level front-end to the rather low-level Spliterators. That’s it. You may discuss whether a different design would be better, but that’s not productive; it won’t change, given the way they are designed now.
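To illustrate the "entirely new implementations" case, here is a sketch of a hand-written Spliterator with a Stream placed on top of it. The class name and the `countdown` factory are invented for this example; `Spliterators.AbstractSpliterator` and `StreamSupport.stream` are the standard building blocks.

```java
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class CountdownSpliterator extends Spliterators.AbstractSpliterator<Integer> {

    private int next; // mutable traversal state: the next value to emit

    CountdownSpliterator(int start) {
        // Size is reported as unknown; only the encounter order is promised.
        super(Long.MAX_VALUE, Spliterator.ORDERED);
        this.next = start;
    }

    @Override
    public boolean tryAdvance(Consumer<? super Integer> action) {
        if (next <= 0) {
            return false; // exhausted, and there is no way to rewind
        }
        action.accept(next--);
        return true;
    }

    // The Stream is merely a high-level front-end over the spliterator.
    static Stream<Integer> countdown(int start) {
        return StreamSupport.stream(new CountdownSpliterator(start), false);
    }

    public static void main(String[] args) {
        System.out.println(countdown(3).collect(Collectors.toList())); // [3, 2, 1]
    }
}
```

Note that re-use here again means calling `countdown` a second time, which creates a new spliterator with fresh state.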
There is another implementation aspect you have to consider. Streams are not immutable data structures. Each intermediate operation may return a new Stream instance encapsulating the old one, but it may also manipulate its own instance instead and return itself (that doesn’t preclude doing even both for the same operation). Commonly known examples are operations like parallel or unordered, which do not add another step but manipulate the entire pipeline. Having such a mutable data structure and attempting to reuse it (or even worse, using it multiple times at the same time) doesn’t play well…
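The in-place manipulation can be observed directly. The sketch below relies on the behavior of the OpenJDK implementation, which is an assumption on my part: the specification only says that parallel() "may return itself", so other implementations are free to behave differently. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.stream.Stream;

class MutablePipelineDemo {

    // Returns { isParallel before, isParallel after } for one and the same reference.
    static boolean[] observeParallelFlag() {
        Stream<Integer> mapped = Arrays.asList(1, 2, 3).stream().map(i -> i * 2);
        boolean before = mapped.isParallel();
        mapped.parallel(); // return value deliberately ignored
        boolean after = mapped.isParallel();
        return new boolean[] { before, after };
    }

    public static void main(String[] args) {
        boolean[] flags = observeParallelFlag();
        // On OpenJDK (assumed here) this prints "false -> true".
        System.out.println(flags[0] + " -> " + flags[1]);
    }
}
```

The existing object changed its mode without any new stage being added, which is exactly the kind of state that makes sharing or re-running a pipeline instance hazardous.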
For completeness, here is your quicksort example translated to the Java Stream API. It shows that it does not really “take away much power”.
static Stream<Integer> quickSort(Supplier<Stream<Integer>> ints) {
    final Optional<Integer> optPivot = ints.get().findAny();
    if(!optPivot.isPresent()) return Stream.empty();
    final int pivot = optPivot.get();
    Supplier<Stream<Integer>> lt = ()->ints.get().filter(i -> i < pivot);
    Supplier<Stream<Integer>> gt = ()->ints.get().filter(i -> i > pivot);
    return Stream.of(quickSort(lt), Stream.of(pivot), quickSort(gt)).flatMap(s->s);
}
It can be used like this:
List<Integer> l=new Random().ints(100, 0, 1000).boxed().collect(Collectors.toList());
System.out.println(l);
System.out.println(quickSort(l::stream)
    .map(Object::toString).collect(Collectors.joining(", ")));
You can write it even more compactly as
static Stream<Integer> quickSort(Supplier<Stream<Integer>> ints) {
    return ints.get().findAny().map(pivot ->
        Stream.of(
            quickSort(()->ints.get().filter(i -> i < pivot)),
            Stream.of(pivot),
            quickSort(()->ints.get().filter(i -> i > pivot)))
        .flatMap(s->s)).orElse(Stream.empty());
}
#4
8
I think there are very few differences between the two when you look closely enough.
On its face, an IEnumerable does appear to be a reusable construct:
IEnumerable<int> numbers = new int[] { 1, 2, 3, 4, 5 };
foreach (var n in numbers) {
    Console.WriteLine(n);
}
However, the compiler is actually doing a little bit of work to help us out; it generates roughly the following code:
IEnumerable<int> numbers = new int[] { 1, 2, 3, 4, 5 };
IEnumerator<int> enumerator = numbers.GetEnumerator();
while (enumerator.MoveNext()) {
    Console.WriteLine(enumerator.Current);
}
Each time you actually iterate over the enumerable, the compiler creates an enumerator. The enumerator is not reusable; further calls to MoveNext will just return false, and there is no way to reset it to the beginning. If you want to iterate over the numbers again, you will need to create another enumerator instance.
To better illustrate that the IEnumerable has (can have) the same 'feature' as a Java Stream, consider an enumerable whose source of numbers is not a static collection. For example, we can create an enumerable object which generates a sequence of 5 random numbers:
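using System;
using System.Collections;
using System.Collections.Generic;

class Generator : IEnumerator<int> {
    Random _r;
    int _current;
    int _count = 0;
    public Generator(Random r) {
        _r = r;
    }
    public bool MoveNext() {
        _current = _r.Next();
        _count++;
        return _count <= 5;
    }
    public int Current {
        get { return _current; }
    }
    // Members required by the non-generic IEnumerator and by IDisposable.
    object IEnumerator.Current {
        get { return _current; }
    }
    public void Reset() {
        throw new NotSupportedException();
    }
    public void Dispose() { }
}
class RandomNumberStream : IEnumerable<int> {
    Random _r = new Random();
    public IEnumerator<int> GetEnumerator() {
        return new Generator(_r);
    }
    // Explicit interface implementations cannot carry an access modifier.
    IEnumerator IEnumerable.GetEnumerator() {
        return this.GetEnumerator();
    }
}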
Now we have very similar code to the previous array-based enumerable, but with a second iteration over numbers:
IEnumerable<int> numbers = new RandomNumberStream();
foreach (var n in numbers) {
    Console.WriteLine(n);
}
foreach (var n in numbers) {
    Console.WriteLine(n);
}
The second time we iterate over numbers we will get a different sequence of numbers, so it isn't reusable in the same sense. Or, we could have written the RandomNumberStream to throw an exception if you try to iterate over it multiple times, making the enumerable actually unusable (like a Java Stream).
Also, what does your enumerable-based quick sort mean when applied to a RandomNumberStream?
Conclusion
So, the biggest difference is that .NET allows you to reuse an IEnumerable by implicitly creating a new IEnumerator in the background whenever it needs to access elements in the sequence.
This implicit behavior is often useful (and 'powerful' as you state), because we can repeatedly iterate over a collection.
But sometimes, this implicit behavior can actually cause problems. If your data source is not static, or is costly to access (like a database or web site), then a lot of assumptions about IEnumerable have to be discarded; reuse is not that straightforward.
#5
1
It is possible to bypass some of the "run once" protections in the Stream API; for example, we can avoid java.lang.IllegalStateException exceptions (with the message "stream has already been operated upon or closed") by referencing and reusing the Spliterator (rather than the Stream directly).
For example, this code will run without throwing an exception:
Spliterator<String> split = Stream.of("hello","world")
    .map(s->"prefix-"+s)
    .spliterator();
Stream<String> replayable1 = StreamSupport.stream(split,false);
Stream<String> replayable2 = StreamSupport.stream(split,false);
replayable1.forEach(System.out::println);
replayable2.forEach(System.out::println);
However, the output will be limited to
prefix-hello
prefix-world
rather than repeating the output twice. This is because the ArraySpliterator used as the Stream source is stateful and stores its current position. When we replay this Stream, we start again at the end.
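The statefulness is easy to see without any Stream replay at all. In this small sketch (class and method names are illustrative only), the same Spliterator is traversed twice and the elements seen on each pass are counted:

```java
import java.util.Spliterator;
import java.util.stream.Stream;

class ExhaustedSpliterator {

    // Returns the number of elements seen on the first and on the second traversal.
    static int[] countTwice() {
        Spliterator<String> split = Stream.of("hello", "world").spliterator();
        int[] counts = new int[2];
        split.forEachRemaining(s -> counts[0]++); // advances the position to the end
        split.forEachRemaining(s -> counts[1]++); // nothing is left to traverse
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countTwice();
        System.out.println(counts[0] + ", " + counts[1]); // 2, 0
    }
}
```

The second pass sees nothing because the position was never reset, which is the same reason the second forEach above prints nothing.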
We have a number of options to solve this challenge:
- We could make use of a stateless Stream creation method such as Stream#generate(). We would have to manage state externally in our own code and reset it between Stream "replays":
  Spliterator<String> split = Stream.generate(this::nextValue)
      .map(s->"prefix-"+s)
      .spliterator();
  Stream<String> replayable1 = StreamSupport.stream(split,false);
  Stream<String> replayable2 = StreamSupport.stream(split,false);
  replayable1.forEach(System.out::println);
  this.resetCounter();
  replayable2.forEach(System.out::println);
- Another (slightly better but not perfect) solution to this is to write our own ArraySpliterator (or similar Stream source) that includes some capacity to reset the current counter. If we were to use it to generate the Stream, we could potentially replay it successfully.
  MyArraySpliterator<String> arraySplit = new MyArraySpliterator("hello","world");
  Spliterator<String> split = StreamSupport.stream(arraySplit,false)
      .map(s->"prefix-"+s)
      .spliterator();
  Stream<String> replayable1 = StreamSupport.stream(split,false);
  Stream<String> replayable2 = StreamSupport.stream(split,false);
  replayable1.forEach(System.out::println);
  arraySplit.reset();
  replayable2.forEach(System.out::println);
- The best solution to this problem (in my opinion) is to make a new copy of any stateful Spliterators used in the Stream pipeline when new operators are invoked on the Stream. This is more complex and involved to implement, but if you don't mind using third-party libraries, cyclops-react has a Stream implementation that does exactly this. (Disclosure: I am the lead developer for this project.)
  Stream<String> replayableStream = ReactiveSeq.of("hello","world")
      .map(s->"prefix-"+s);
  replayableStream.forEach(System.out::println);
  replayableStream.forEach(System.out::println);
This will print
prefix-hello
prefix-world
prefix-hello
prefix-world
as expected.