Pattern name for a "flippable" data structure?

Time: 2021-06-02 11:44:12

I'm trying to think of a naming convention that accurately conveys what's going on within a class I'm designing. On a secondary note, I'm trying to decide between two almost-equivalent user APIs.

Here's the situation:

I'm building a scientific application, where one of the central data structures has three phases: 1) accumulation, 2) analysis, and 3) query execution.

In my case, it's a spatial modeling structure, internally using a KDTree to partition a collection of points in 3-dimensional space. Each point describes one or more attributes of the surrounding environment, with a certain level of confidence about the measurement itself.

After adding (a potentially large number of) measurements to the collection, the owner of the object will query it to obtain an interpolated measurement at a new data point somewhere within the applicable field.

The API will look something like this (the code is in Java, but that's not really important; the code is divided into three sections, for clarity):

// SECTION 1:
// Create the aggregation object, and get the zillion objects to insert...
ContinuousScalarField field = new ContinuousScalarField();
Collection<Measurement> measurements = getMeasurementsFromSomewhere();

// SECTION 2:
// Add all of the zillion objects to the aggregation object...
// Each measurement contains its xyz location, the quantity being measured,
// and a numeric value for the measurement. For example, something like
// "68 degrees F, plus or minus 0.5, at point 1.23, 2.34, 3.45"
for (Measurement m : measurements) {
   field.add(m);
}

// SECTION 3:
// Now the user wants to ask the model questions about the interpolated
// state of the model. For example, "what's the interpolated temperature
// at point (3, 4, 5)?"
Point3d p = new Point3d(3, 4, 5);
Measurement result = field.interpolateAt(p);

For my particular problem domain, it will be possible to perform a small amount of incremental work (partitioning the points into a balanced KDTree) during SECTION 2.

And there will be a small amount of work (performing some linear interpolations) that can occur during SECTION 3.

But there's a huge amount of work (constructing a kernel density estimator and performing a Fast Gauss Transform, using Taylor series and Hermite functions, but that's totally beside the point) that must be performed between sections 2 and 3.

Sometimes in the past, I've just used lazy-evaluation to construct the data structures (in this case, it'd be on the first invocation of the "interpolateAt" method), but then if the user calls the "field.add()" method again, I have to completely discard those data structures and start over from scratch.

In other projects, I've required the user to explicitly call an "object.flip()" method, to switch from "append mode" into "query mode". The nice thing about a design like this is that the user has better control over the exact moment when the hard-core computation starts. But it can be a nuisance for the API consumer to keep track of the object's current mode. And besides, in the standard use case, the caller never adds another value to the collection after starting to issue queries; data-aggregation almost always fully precedes query preparation.
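
To make the two options concrete, here's a rough sketch of both approaches on one class (KernelDensityModel and its methods are just placeholders for my real internals, and I've left out the java.util imports):

// Rough sketch only -- illustrative names, not the real implementation.
public class ContinuousScalarField {
   private final List<Measurement> samples = new ArrayList<>();
   private KernelDensityModel model;  // result of the expensive analysis
   private boolean dirty = true;      // set whenever new samples arrive

   public void add(Measurement m) {
      samples.add(m);
      dirty = true;                   // lazy variant: just invalidate
   }

   // Explicit variant: the caller decides when the expensive work happens.
   public void flip() {
      model = KernelDensityModel.fit(samples);  // KDE + Fast Gauss Transform, etc.
      dirty = false;
   }

   // Lazy variant: the first query after an add() triggers the expensive work.
   public Measurement interpolateAt(Point3d p) {
      if (dirty) {
         flip();
      }
      return model.estimate(p);
   }
}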

How have you guys handled designing a data structure like this?

Do you prefer to let an object lazily perform its heavy-duty analysis, throwing away the intermediate data structures when new data comes into the collection? Or do you require the programmer to explicitly flip the data structure from append-mode into query-mode?

And do you know of any naming convention for objects like this? Is there a pattern I'm not thinking of?


ON EDIT:

There seems to be some confusion and curiosity about the class I used in my example, named "ContinuousScalarField".

You can get a pretty good idea for what I'm talking about by reading these wikipedia pages:

Let's say you wanted to create a topographical map (this is not my exact problem, but it's conceptually very similar). So you take a thousand altitude measurements over an area of one square mile, but your survey equipment has a margin of error of plus-or-minus 10 meters in elevation.

Once you've gathered all the data points, you feed them into a model which not only interpolates the values, but also takes into account the error of each measurement.

To draw your topo map, you query the model for the elevation of each point where you want to draw a pixel.

As for the question of whether a single class should be responsible for both appending and handling queries, I'm not 100% sure, but I think so.

Here's a similar example: HashMap and TreeMap classes allow objects to be both added and queried. There aren't separate interfaces for adding and querying.

Both classes are also similar to my example, because the internal data structures have to be maintained on an ongoing basis in order to support the query mechanism. The HashMap class has to periodically allocate new memory, re-hash all objects, and move objects from the old memory to the new memory. A TreeMap has to continually maintain tree balance, using the red-black-tree data structure.

The only difference is that my class will perform optimally if it can perform all of its calculations once it knows the data set is closed.

6 Answers

#1


2  

I generally prefer to have an explicit change, rather than lazily recomputing the result. This approach makes the performance of the utility more predictable, and it reduces the amount of work I have to do to provide a good user experience. For example, if this occurs in a UI, where do I have to worry about popping up an hourglass, etc.? Which operations are going to block for a variable amount of time, and need to be performed in a background thread?

That said, rather than explicitly changing the state of one instance, I would recommend the Builder Pattern to produce a new object. For example, you might have an aggregator object that does a small amount of work as you add each sample. Then instead of your proposed void flip() method, I'd have an Interpolator interpolator() method that gets a copy of the current aggregation and performs all your heavy-duty math. Your interpolateAt method would be on this new Interpolator object.
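
A rough sketch of what I mean, with illustrative names (KernelDensityModel stands in for whatever does your heavy math):

// Sketch only: the aggregator stays cheap; the interpolator owns the heavy math.
public class ScalarFieldAggregator {
   private final List<Measurement> samples = new ArrayList<>();

   public void add(Measurement m) {
      samples.add(m);                          // small incremental work per sample
   }

   public Interpolator interpolator() {
      // All the heavy-duty math runs here, on a snapshot of the samples.
      return new Interpolator(new ArrayList<>(samples));
   }

   public static final class Interpolator {
      private final KernelDensityModel model;  // hypothetical helper type

      private Interpolator(List<Measurement> snapshot) {
         this.model = KernelDensityModel.fit(snapshot);
      }

      public Measurement interpolateAt(Point3d p) {
         return model.estimate(p);             // cheap per-query work
      }
   }
}

The caller loads all the data, asks the aggregator for an Interpolator once, and then issues every interpolateAt() query against that object.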

If your usage patterns warrant it, you could do simple caching by keeping a reference to the interpolator you create, returning it to multiple callers, and clearing it only when the aggregator is modified.

This separation of responsibilities can help yield more maintainable and reusable object-oriented programs. An object that can return a Measurement at a requested Point is very abstract, and perhaps a lot of clients could use your Interpolator as one strategy implementing a more general interface.


I think that the analogy you added is misleading. Consider an alternative analogy:

Key[] data = new Key[...];
data[idx++] = new Key(...); /* Fast! */
...
Arrays.sort(data); /* Slow! */
...
boolean contains = Arrays.binarySearch(data, datum) >= 0; /* Fast! */

This can work like a set, and actually, it gives better performance than Set implementations (which are implemented with hash tables or balanced trees).

A balanced tree can be seen as an efficient implementation of insertion sort. After every insertion, the tree is in a sorted state. The predictable time requirements of a balanced tree are due to the fact that the cost of sorting is spread over each insertion, rather than happening on some queries and not others.

The rehashing of hash tables does result in less consistent performance, and because of that, hash tables aren't appropriate for certain applications (perhaps a real-time microcontroller). But even the rehashing operation depends only on the load factor of the table, not the pattern of insertion and query operations.

For your analogy to hold strictly, you would have to "sort" (do the hairy math) your aggregator with each point you add. But it sounds like that would be cost prohibitive, and that leads to the builder or factory method patterns. This makes it clear to your clients when they need to be prepared for the lengthy "sort" operation.

#2


4  

If an object has two modes like this, I would suggest exposing two interfaces to the client. If the object is in append mode, then you make sure that the client can only ever use the IAppendable implementation. To flip to query mode, you add a method to IAppendable such as AsQueryable. To flip back, call IQueryable.AsAppendable.

You can implement IAppendable and IQueryable on the same object, and keep track of the state in the same way internally, but having two interfaces makes it clear to the client what state the object is in, and forces the client to deliberately make the (expensive) switch.
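
In Java terms, that could look roughly like this (the interface and method names just mirror the idea):

// Sketch: two views of the same object, one per mode.
interface AppendableField {
   void add(Measurement m);
   QueryableField asQueryable();       // the expensive switch happens here
}

interface QueryableField {
   Measurement interpolateAt(Point3d p);
   AppendableField asAppendable();     // switch back, discarding derived structures
}

// The client only ever holds the view that matches the current mode:
//    AppendableField builder = new ContinuousScalarField();
//    builder.add(m);
//    QueryableField field = builder.asQueryable();  // heavy computation runs once
//    Measurement result = field.interpolateAt(new Point3d(3, 4, 5));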

#3


2  

Your objects should have one role and responsibility. In your case, should the ContinuousScalarField be responsible for interpolating?

Perhaps you might be better off doing something like:

IInterpolator interpolator = field.GetInterpolator();
Measurement measurement = interpolator.InterpolateAt(...);

I hope this makes sense, but without fully understanding your problem domain it's hard to give you a more coherent answer.

#4


1  

"I've just used lazy-evaluation to construct the data structures" -- Good

"if the user calls the "field.add()" method again, I have to completely discard those data structures and start over from scratch." -- Interesting

"in the standard use case, the caller never adds another value to the collection after starting to issue queries" -- Whoops, false alarm, actually not interesting.

Since lazy eval fits your use case, stick with it. That's a very heavily used model because it is so delightfully reliable and fits most use cases very well.

The only reasons for rethinking this are (a) a use-case change (mixed adding and interpolation) or (b) performance optimization.

Since use case changes are unlikely, you might consider the performance implications of breaking up interpolation. For example, during idle time, can you precompute some values? Or with each add is there a summary you can update?

Also, a highly stateful (and not very meaningful) flip method isn't so useful to clients of your class. However, breaking interpolation into two parts might still be helpful to them -- and help you with optimization and state management.

You could, for example, break interpolation into two methods.

public void interpolateAt( Point3d p );        // "open": prepare the query at p
public Measurement interpolatedMeasurement();  // "fetch": return the prepared result

This borrows the relational database Open and Fetch paradigm. Opening a cursor can do a lot of preliminary work, and may start executing the query; you don't know. Fetching the first row may do all the work, or execute the prepared query, or simply fetch the first buffered row. You don't really know. You only know that it's a two-part operation. The RDBMS developers are free to optimize as they see fit.

#5


0  

Do you prefer to let an object lazily perform its heavy-duty analysis, throwing away the intermediate data structures when new data comes into the collection? Or do you require the programmer to explicitly flip the data structure from append-mode into query-mode?

I prefer using data structures that allow me to incrementally add to them with "a little more work" per addition, and to incrementally pull the data I need with "a little more work" per extraction.

Perhaps if you do some "interpolate_at()" call in the upper-right corner of your region, you only need to do calculations involving the points in that upper-right corner, and it doesn't hurt anything to leave the other 3 quadrants "open" to new additions. (And so on down the recursive KDTree).

Alas, that's not always possible -- sometimes the only way to add more data is to throw away all the previous intermediate and final results, and re-calculate everything again from scratch.

The people who use the interfaces I design -- in particular, me -- are human and fallible. So I don't like using objects where those people must remember to do things in a certain way, or else things go wrong -- because I'm always forgetting those things.

If an object must be in the "post-calculation state" before getting data out of it, i.e. some "do_calculations()" function must be run before the interpolateAt() function gets valid data, I much prefer letting the interpolateAt() function check if it's already in that state, running "do_calculations()" and updating the state of the object if necessary, and then returning the results I expected.
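
Something like this sketch, where the "calculated" flag and "do_calculations()" are only illustrative names:

// Sketch: the query method checks and repairs its own state.
public Measurement interpolateAt(Point3d p) {
   if (!calculated) {           // flag cleared by every add_data() call
      do_calculations();        // the heavy one-time work
      calculated = true;
   }
   return model.estimate(p);    // the cheap per-query interpolation
}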

Sometimes I hear people describe such a data structure as one that "freezes" the data, "crystallizes" the data, "compiles" it, or "puts the data into an immutable data structure". One example is converting a (mutable) StringBuilder or StringBuffer into an (immutable) String.

I can imagine that for some kinds of analysis, you expect to have all the data ahead of time, and pulling out some interpolated value before all the data has been put in would give wrong results. In that case, I'd prefer to set things up such that the "add_data()" function fails or throws an exception if it (incorrectly) gets called after any interpolateAt() call.
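
For that stricter case the guard goes the other way (again, only a sketch):

// Sketch: once the first query has frozen the object, further adds are rejected.
public void add_data(Measurement m) {
   if (frozen) {
      throw new IllegalStateException("data set is closed; cannot add after interpolateAt()");
   }
   samples.add(m);
}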

I would consider defining a lazily-evaluated "interpolated_point" object that doesn't really evaluate the data right away, but only tells the program that the data at that point will be required sometime in the future. The collection isn't actually frozen, so it's OK to continue adding more data to it, up until the point where something actually extracts the first real value from some "interpolated_point" object, which internally triggers the "do_calculations()" function and freezes the object. It might speed things up if you know not only all the data, but also all the points that need to be interpolated, ahead of time. Then you can throw away data that is "far away" from the interpolated points, and only do the heavy-duty calculations in regions "near" the interpolated points.

For other kinds of analysis, you do the best you can with the data you have, but when more data comes in later, you want to use that new data in your later analysis. If the only way to do that is to throw away all the intermediate results and recalculate everything from scratch, then that's what you have to do. (And it's best if the object handles this automatically, rather than requiring people to remember to call some "clear_cache()" and "do_calculations()" function every time.)

#6


-1  

You could have a state variable. Have a method for starting the high-level processing, which will only work if the state is SECTION-1. It will set the state to SECTION-2, and then to SECTION-3 when it is done computing. If there's a request to the program to interpolate a given point, it will check whether the state is SECTION-3. If not, it will start the computations, and then interpolate at the given point.
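
A minimal sketch of that state machine (all names illustrative; do_calculations() is assumed to build the queryable model):

// Sketch: an explicit state variable mirroring the three sections.
public class ScalarFieldModel {
   enum State { ACCUMULATING, COMPUTING, READY }

   private State state = State.ACCUMULATING;
   private KernelDensityModel model;            // placeholder for the heavy result

   public void startAnalysis() {                // may be called early, e.g. overnight
      if (state == State.ACCUMULATING) {
         state = State.COMPUTING;
         model = do_calculations();             // the expensive step, defined elsewhere
         state = State.READY;
      }
   }

   public Measurement interpolateAt(Point3d p) {
      if (state != State.READY) {
         startAnalysis();                       // otherwise the first query triggers it
      }
      return model.estimate(p);
   }
}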

This way, you accomplish both - the program will perform its computations at the first request to interpolate a point, but can also be requested to do so earlier. This would be convenient if you wanted to run the computations overnight, for example, without needing to request an interpolation.
