How does software development compare with statistical programming/analysis?

Time: 2021-01-29 16:17:36

Statistical analysis/programming is writing code. Whether the work is descriptive or inferential, you write code to import data, clean it, analyse it, and compile a report.

Analysing the data can involve many twists and turns of statistical procedures, and many angles from which you look at your data. At the end, you have many files, with many lines of code, performing tasks on your data. Some of that code is reusable, and you encapsulate it as a "good to have" function.
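To make the workflow concrete, here is a minimal sketch of the import / clean / analyse / report steps, each wrapped as a small reusable function. Everything in it is an assumption made for illustration: the inline `RAW` data, the column names `group` and `score`, and the function names are hypothetical, and a real analysis would read from files and do far more.

```python
import csv
import io
import statistics

# Hypothetical raw data; in practice this would be read from a file.
RAW = """group,score
a,4.0
a,5.5
b,
b,7.0
a,6.5
"""


def import_data(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Drop rows with a missing score and convert the rest to float."""
    return [
        {"group": r["group"], "score": float(r["score"])}
        for r in rows
        if r["score"].strip()
    ]


def analyse(rows):
    """Return the mean score per group."""
    groups = {}
    for r in rows:
        groups.setdefault(r["group"], []).append(r["score"])
    return {g: statistics.mean(v) for g, v in groups.items()}


def report(summary):
    """Format the per-group means as plain-text lines."""
    return "\n".join(
        f"{g}: mean score {m:.2f}" for g, m in sorted(summary.items())
    )


if __name__ == "__main__":
    print(report(analyse(clean(import_data(RAW)))))
```

Each step takes plain data in and hands plain data back, which is what makes a piece like `clean` easy to lift into the "good to have" collection later.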

This process of "statistical analysis" feels to me like "programming", but I am not sure it feels the same to everyone.

From the Wikipedia article on Software development:


The term software development is often used to refer to the activity of computer programming, which is the process of writing and maintaining the source code, whereas the broader sense of the term includes all that is involved between the conception of the desired software through to the final manifestation of the software. Therefore, software development may include research, new development, modification, reuse, re-engineering, maintenance, or any other activities that result in software products. For larger software systems, usually developed by a team of people, some form of process is typically followed to guide the stages of production of the software.


According to this simplistic definition (and my humble opinion), this sounds very much like building a statistical analysis. But I imagine it is not that simple.


Which leads me to my question: what differences can you outline between the two activities?


The differences can be in terms of technical aspects, strategies or work styles, and whatever else you think is relevant.

This question came to me from the following threads:


3 answers

#1


12  

As I said in my response to your other question, what you're describing is programming. So the short answer is: there is no difference. The slightly longer answer is that statistical and scientific computing should require even more controls around development than other programming.


A certain percentage of statistical analysis can be done using Excel, or in a point-and-click approach using SPSS, SAS, Matlab, or S-Plus (for instance). A more sophisticated analysis done using one of those programs (or R) that involves programming is clearly a form of software development. And this kind of statistical computing can benefit immensely from following all the best practices from software development: source control, documentation, a project plan, scope document, bug tracking/change control, etc.


Moreover, there are different kinds of statistical analyses that can follow different approaches, as with any programming project:


  • Exploratory data analysis should follow an iterative methodology, like the Agile methodology. In this case, when you don't know explicitly the steps involved up front, it's critical to use a development methodology that is adaptive and self-reflective.
  • A more routine kind of analysis (e.g. a government annual survey such as the Census) could follow a more traditional methodology such as the waterfall approach, since it would be following a very clear set of steps that are mostly known in advance.

I would suggest that any statistician would benefit from reading a book like "Code Complete" (look at the other top books in this post): the more organized you are with your analysis, the greater the likelihood of success.


Statistical analysis in some sense requires even more good practices around version control and documentation than other programming. If your program is just serving some business need, then the algorithm or software used is really of secondary importance so long as the program functions the way the specifications require. On the other hand, with scientific and statistical computing, accuracy and reproducibility are paramount. This is one of John Chambers' (the creator of the S language) major emphases in "Software for Data Analysis". That is another reason to add literate programming (e.g. with Sweave) as an important tool in the statistician's toolkit.

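As a rough illustration of what reproducibility can mean in code, the sketch below records the two things a reader needs to regenerate a result: a fingerprint of the input data and the random seed. This is Python rather than the Sweave/R workflow mentioned above, and the function names, the bootstrap interval, and the sample data are all assumptions made for the example.

```python
import hashlib
import json
import random
import statistics


def fingerprint(values):
    """Hash the input data so a report can state exactly what was analysed."""
    payload = json.dumps(values).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def bootstrap_mean_ci(values, seed, n_resamples=2000, level=0.95):
    """Percentile bootstrap interval for the mean, driven by an explicit seed."""
    rng = random.Random(seed)  # local generator, no hidden global state
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lo_idx = round((1 - level) / 2 * n_resamples)
    hi_idx = round((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


def run_analysis(values, seed):
    """Bundle the result with everything needed to regenerate it."""
    low, high = bootstrap_mean_ci(values, seed)
    return {
        "data_sha256": fingerprint(values),
        "seed": seed,
        "mean": statistics.mean(values),
        "ci_95": [low, high],
    }


if __name__ == "__main__":
    data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8]  # made-up measurements
    print(json.dumps(run_analysis(data, seed=20210129), indent=2))
```

Running it twice with the same data and seed gives the same dictionary, and committing that output alongside the script under version control gives a result that someone else can check.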

#2


4  

Perhaps the common denominator is "problem solving."


Beyond that, i doubt i could provide any insight, but i can at least provide a limited answer from personal experience.

This issue arises for us in hiring--i.e., do we hire a programmer and teach them statistics or do we hire a statistics person and teach them to program? Ideally we could find someone fluent in both disciplines, and indeed, that's the third net we cast, but rarely with any success.

Here's an example. The most stable distinction between the two activities (software dev & statistical analysis) is probably their respective outputs, or project deliverables. For instance, in my group someone is conducting the statistical analysis on the results of our split-path and factorial experiments (e.g., from the t-test results, whether the difference is significant, or whether the test ought to continue). That analysis will be sent to the marketing department, which will use it to modify the web pages comprising the Site with a view towards improving conversion. A second task involves the abstraction and partial automation of those analyses so the results can be processed in near-real time.

For the first task, we'll assign a statistician; for the second, a programmer. The business problem we are trying to solve is the same for both tasks, yet for the first, the crux is statistics; for the second, the statistics problems have been largely solved and the crux is a core programming task (I/O).
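To give a feel for what the automated half of that work might look like, here is a minimal sketch of a significance check on two conversion rates. Note the swap: it uses a two-proportion z-test rather than the t-test mentioned above, only because that needs nothing beyond Python's standard library. The function names, the 0.05 threshold, and the counts in the usage lines are assumptions for illustration, not anything from our actual setup.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    Assumes at least one conversion and one non-conversion overall,
    so the pooled standard error is non-zero.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def decide(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Turn the test into the kind of verdict an automated job could emit."""
    z, p = two_proportion_z_test(conv_a, n_a, conv_b, n_b)
    verdict = "significant" if p < alpha else "keep collecting data"
    return {"z": z, "p_value": p, "verdict": verdict}


if __name__ == "__main__":
    # Hypothetical counts: conversions and visitors for paths A and B.
    print(decide(200, 5000, 260, 5000))
    print(decide(200, 5000, 205, 5000))
```

Choosing the test and deciding what counts as "significant" is the statistician's half of the job. Wiring `decide` to incoming data so it runs in near-real time is the programming half.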

Notice also how the tools associated with the two activities have evolved so that the distinction between the two (software dev & data analysis) is further blurred: mainstream development languages are being adapted for use as domain-specific analytical tools; at the same time, frameworks continue to be developed which enable non-developers to quickly build lightweight, task-oriented applications in DSLs.

For instance, Python, a general-purpose development language, has R bindings (RPy2) which, along with its native interactive interpreter (IDLE), substantially facilitate Python's use in statistical analysis. At the same time, there is a clear trend in R package development toward (web) application development: R Bindings for Qt, gWidgetsWWW, and RApache are all R packages directed to client or web app development, and their initial releases were (i think) within the past 18 months. Aside from that, since at least the last quarter of last year, i've noticed an accelerating frequency of blog posts, presentations, etc. on the subject of web app development in R.

Finally, i wonder if your question is perhaps evidence of the growing popularity of R. Here's what i mean. A decade ago, when my employer purchased a site license, i began learning and using one of the major statistical computing products (no point here in saying which one; it begins with "S"). i found it unnatural and inflexible. Unlike Perl (which i was using at the time), this tool was not an extension of my brain (which isn't an optional attribute of an analytical tool; to me it's more or less the definition of one). Interacting with this System was more like using a vending machine--i selected some statistical function i wanted and then waited for the "output", which was often an impressive set of high-impact, full-color charts and tables. Nearly always, though, what i wanted was to modify my input or use that output for the next analytical step. That seemed to require another, separate trip to the vending machine. The fact that this tool was context-aware--i.e., it knew statistics--and Perl wasn't did not compensate for the awkward interaction. Statistical analysis done this way would never be confused with software development. (Again, this is just a summary of my own experience; i don't claim it can be generalized. It's also not a polemic against any (or all) commercial data analysis platforms--millions use them and they've earned zillions for the people who created them, so let's assume it was my own limitations that caused the failure to bond.)

i had never heard of R until about 18 months ago, and i only discovered it while scanning PyPI (the web interface to Python's external package repository) for statistics libraries for Python. There i came across RPy, which seemed brilliant but required a dependency called "R" (RPy of course is really just a set of Python bindings to R).

Perhaps R appeals to programmers and non-programmers equally; still, for a programmer/analyst, this was a godsend. It hit everything on my wish list for a data analysis platform: an engine based on a full-featured, general programming language (which in this case is a proven Scheme descendant), an underlying functional paradigm, a built-in interactive interpreter, native data types built from the ground up for data analysis, and the domain knowledge baked in. Data analysis became more like coding. Life was good.

#3


1  

If you are using R, then you'll likely be writing code to solve your statistical questions, so in this sense, statistical analysis is a subset of programming.


On the other hand, there are plenty of SPSS users who have never ventured beyond a bit of pointing and clicking to solve their stats problems. This feels less like programming to me.
