How do you test programs that need complex input data?

We have a suite of converters that take complex data and transform it. Mostly the input is EDI and the output XML, or vice-versa, although there are other formats.

There are many inter-dependencies in the data. What methods or software are available that can generate complex input data like this?

Right now we use two methods: (1) a suite of sample files that we've built up over the years, mostly from filed bugs and samples in documentation, and (2) generated pseudo-random test data. But the former covers only a fraction of the cases, and the latter involves lots of compromises and tests only a subset of the fields.

Before we go further down the path of implementing (reinventing?) a complex table-driven data generator, what options have you found successful?

2 Answers

#1

Well, the answer is in your question. Unless you implement a complex table-driven data generator, you're doing things right with (1) and (2).

(1) covers the rule of "1 bug verified, 1 new test case". And as long as the structure of the pseudo-random test data in (2) corresponds to real-life situations, it is fine.

(2) can always be improved, and it will improve mainly over time, as you think of new edge cases. The problem with random test data is that it can only be random up to a point: beyond that, computing the expected output from the random input becomes so difficult that you basically have to rewrite the algorithm under test inside the test case.

So (2) will always cover only a fraction of the cases. If one day it covers all of them, it will in fact be a new version of your algorithm.

#2

I'd advise against using random data, as it can make it difficult, if not impossible, to reproduce an error that was reported (I know you said 'pseudo-random'; I'm just not sure exactly what you mean by that).
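
On the reproducibility concern: if 'pseudo-random' means a seeded generator, a failing input can be regenerated from its seed. A minimal Python sketch, assuming a hypothetical `make_order` generator with made-up field names (none of this comes from the asker's actual converters):

```python
import random

def make_order(seed):
    """Build one pseudo-random order record; the same seed gives the same record."""
    rng = random.Random(seed)  # private generator, independent of global random state
    line_count = rng.randint(1, 5)
    lines = [
        {"sku": f"SKU{rng.randint(0, 9999):04d}", "qty": rng.randint(1, 100)}
        for _ in range(line_count)
    ]
    # An inter-dependent field: the header total must agree with the lines.
    total_qty = sum(line["qty"] for line in lines)
    return {"seed": seed, "lines": lines, "total_qty": total_qty}
```

Recording the seed alongside each failure would then be enough to rebuild the exact input that triggered it.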

Operating over entire files of data would likely be considered functional or integration testing. I would suggest taking your set of files with known bugs and translating them into unit tests, or at least doing so for any future bugs you come across. Then you can also extend these unit tests to cover the other erroneous conditions for which you don't have any 'sample data'. This will likely be easier than coming up with a whole new data file every time you think of a condition/rule violation you want to check for.
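
As a rough illustration of turning a bug report into a unit test, here is a sketch using Python's `unittest`. The `parse_segment` helper and the bug itself (an empty trailing element being dropped) are invented for the example and are not taken from the original system:

```python
import unittest

def parse_segment(raw, element_sep="*", terminator="~"):
    """Split one EDI-style segment into its tag and elements (hypothetical helper)."""
    body = raw[:-len(terminator)] if raw.endswith(terminator) else raw
    tag, *elements = body.split(element_sep)
    return tag, elements

class ParseSegmentRegressionTests(unittest.TestCase):
    def test_trailing_empty_elements_are_kept(self):
        # Regression test for an imagined bug report: only the offending
        # segment is needed, not the whole file it arrived in.
        tag, elements = parse_segment("REF*ZZ**~")
        self.assertEqual(tag, "REF")
        self.assertEqual(elements, ["ZZ", "", ""])

    def test_condition_with_no_sample_file(self):
        # A case nobody has sent yet: a segment with no elements at all.
        self.assertEqual(parse_segment("SE~"), ("SE", []))
```

Each test carries only the few characters of input that matter, so adding a new condition is a matter of one more method rather than one more data file.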

Make sure your parsing of the data format is kept separate from the interpretation of the data in that format. This will make unit testing as described above much easier.
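
One possible shape for that separation, sketched in Python. The `Segment` type, the `parse` function, and the simplified trailer-count rule are all assumptions made for illustration:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    tag: str
    elements: list

def parse(text, element_sep="*", terminator="~"):
    """Syntax only: turn raw text into Segment objects."""
    segments = []
    for chunk in text.split(terminator):
        chunk = chunk.strip()
        if chunk:
            tag, *elements = chunk.split(element_sep)
            segments.append(Segment(tag, elements))
    return segments

def check_segment_count(segments):
    """Meaning only: the trailer's declared count must match the actual count."""
    declared = int(segments[-1].elements[0])
    return declared == len(segments)

# The rule can be exercised on hand-built Segments, with no file or parser involved.
ok = check_segment_count([Segment("ST", ["850"]), Segment("SE", ["2"])])
bad = check_segment_count([Segment("ST", ["850"]), Segment("SE", ["5"])])
```

Because `check_segment_count` never sees raw text, a test for a rule violation only has to build the two or three segments that matter.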

If you definitely need to drive your testing from data files, you may want to consider getting a machine-readable description of the file format and writing a test data generator that analyzes the format and generates valid/invalid files based upon it. This will also allow your test data to evolve as the file formats do.
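
A very small sketch of that idea in Python, assuming the format description can be reduced to a dictionary of per-field constraints. The `FORMAT` table and all function names are hypothetical, and a real EDI or XML schema would be far richer:

```python
import random

# Hypothetical machine-readable description: field name -> constraints.
FORMAT = {
    "doc_type": {"choices": ["850", "810", "856"]},
    "quantity": {"min": 1, "max": 999},
    "currency": {"choices": ["USD", "EUR"]},
}

def valid_record(fmt, rng):
    """Pick a value inside the constraints for every field."""
    record = {}
    for name, rule in fmt.items():
        if "choices" in rule:
            record[name] = rng.choice(rule["choices"])
        else:
            record[name] = rng.randint(rule["min"], rule["max"])
    return record

def invalid_records(fmt, rng):
    """Yield (broken_field, record) pairs; each record breaks exactly one rule."""
    for name, rule in fmt.items():
        record = valid_record(fmt, rng)
        if "choices" in rule:
            record[name] = "__not_a_valid_choice__"
        else:
            record[name] = rule["max"] + 1
        yield name, record

def conforms(fmt, record):
    """Check a record against the description (a sanity check on the generator)."""
    for name, rule in fmt.items():
        value = record[name]
        if "choices" in rule:
            if value not in rule["choices"]:
                return False
        elif not (isinstance(value, int) and rule["min"] <= value <= rule["max"]):
            return False
    return True
```

When a field is added to the description, both the valid and the invalid cases pick it up without any change to the generator code.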
