Does anyone have a tool or a recommended practice for finding a piece of code that is similar to some other code?
Often I write a function or a code fragment and remember that I have written something like it before. I would like to reuse the previous implementation, but a plain text search does not reveal anything, because I did not use exactly the same variable names.
Similar code fragments lead to unnecessary code duplication, but with a large code base it is impossible to keep all the code in your head. Are there any tools that would analyze the code and mark fragments or functions that are "similar" in terms of functionality?
Consider the following examples:
float xDistance = 0, zDistance = 0;
if (camPos.X()<xgMin) xDistance = xgMin-camPos.X();
if (camPos.X()>xgMax) xDistance = camPos.X()-xgMax;
if (camPos.Z()<zgMin) zDistance = zgMin-camPos.Z();
if (camPos.Z()>zgMax) zDistance = camPos.Z()-zgMax;
float dist = sqrt(xDistance*xDistance+zDistance*zDistance);
and
float distX = 0, distZ = 0;
if (cPos.X()<xgMin) distX = xgMin-cPos.X();
if (cPos.X()>xgMax) distX = cPos.X()-xgMax;
if (cPos.Z()<zgMin) distZ = zgMin-cPos.Z();
if (cPos.Z()>zgMax) distZ = cPos.Z()-zgMax;
float dist = sqrt(distX*distX +distZ*distZ);
It seems to me this has already been asked and answered several times:
https://*.com/questions/204177/what-tool-to-find-code-duplicates-in-c-projects
How to detect code duplication during development?
I suggest closing this as a duplicate.
Actually, I think it is a more general search problem, like: how do I check whether a question has already been asked on *?
3 solutions
#1
You can use Simian. It is a tool that detects duplicate code in Java, C#, C++, XML, and many more formats (even plain text files). It also integrates nicely with a tool like CruiseControl.
#2
Our CloneDR finds duplicate code, both exact copies and near-misses, across large source systems, parameterized by language syntax. It supports Java, C#, COBOL, C++, PHP, Python and many other languages.
It accepts a number of parameters that define "What is a clone?", including:
a) a similarity threshold, controlling how similar two blocks of code must be to be declared clones (typically 95% is good)
b) the minimum clone size in lines (3 tends to be a good choice)
c) the number of parameters (distinct changes to the text; 5 tends to be a good choice)
With these settings, it tends to find 10-15% redundant code in virtually everything it processes.
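To make the idea of a similarity threshold concrete, here is a rough sketch of how such a score could be computed over two token sequences. This is an assumption-laden illustration, not how CloneDR works internally: the names `lcsLength`, `similarity` and `isClone`, the longest-common-subsequence measure, and the `minTokens` parameter (a token-count stand-in for the minimum size in lines) are all made up for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Tokens = std::vector<std::string>;

// Length of the longest common subsequence of two token sequences,
// computed with the classic dynamic-programming table (only two rows kept).
std::size_t lcsLength(const Tokens& a, const Tokens& b) {
    std::vector<std::size_t> prev(b.size() + 1, 0), cur(b.size() + 1, 0);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j) {
            cur[j] = (a[i - 1] == b[j - 1]) ? prev[j - 1] + 1
                                            : std::max(prev[j], cur[j - 1]);
        }
        std::swap(prev, cur);  // the row just filled becomes the previous row
    }
    return prev[b.size()];
}

// Hypothetical similarity score in [0, 1]: the share of tokens that the
// two sequences have in common, in order.
double similarity(const Tokens& a, const Tokens& b) {
    if (a.empty() && b.empty()) return 1.0;
    return 2.0 * static_cast<double>(lcsLength(a, b)) /
           static_cast<double>(a.size() + b.size());
}

// Report a clone only if both fragments are large enough and similar enough.
// The defaults are illustrative, not settings taken from any real tool.
bool isClone(const Tokens& a, const Tokens& b,
             double threshold = 0.95, std::size_t minTokens = 10) {
    return a.size() >= minTokens && b.size() >= minTokens &&
           similarity(a, b) >= threshold;
}
```

With a measure like this, raising the threshold trades missed near-misses for fewer false positives, and the minimum size keeps trivial fragments such as a lone `i++;` out of the report.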
Line-oriented clone detection tools such as Simian can't find cloned code that has been reformatted, but CloneDR will. They may tell you that two blocks of code match, but they usually don't show you exactly how they match or where the differences are; CloneDR will. They don't suggest how to abstract the cloned code; CloneDR will.
Because their matching algorithms are weaker, line-oriented tools also tend to produce more false positives; when you get 5000 clones reported across a million lines, the number of false positives matters a lot.
Based on your example, I'd expect it to find those two fragments (you don't have to point it at either one) and note that they are similar if you abstract away the variable names.
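As a toy illustration of what "abstracting away the variable names" can mean, the sketch below replaces every identifier with a placeholder numbered by first appearance and then compares the resulting token sequences. It is only a minimal sketch under my own assumptions, not the algorithm of CloneDR or any other tool; `tokenize`, `normalize`, `sameShape` and the small keyword list are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

bool isIdentStart(char ch) {
    unsigned char c = static_cast<unsigned char>(ch);
    return std::isalpha(c) || c == '_';
}

bool isIdentChar(char ch) {
    unsigned char c = static_cast<unsigned char>(ch);
    return std::isalnum(c) || c == '_';
}

// Split source text into identifiers, numbers and single punctuation
// characters. Whitespace is dropped, so reformatting does not matter.
std::vector<std::string> tokenize(const std::string& src) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        if (std::isspace(static_cast<unsigned char>(src[i]))) { ++i; continue; }
        std::size_t start = i;
        if (isIdentStart(src[i])) {
            while (i < src.size() && isIdentChar(src[i])) ++i;
        } else if (std::isdigit(static_cast<unsigned char>(src[i]))) {
            while (i < src.size() && (isIdentChar(src[i]) || src[i] == '.')) ++i;
        } else {
            ++i;  // one punctuation character per token
        }
        tokens.push_back(src.substr(start, i - start));
    }
    return tokens;
}

// Replace each distinct identifier with $0, $1, ... in order of first
// appearance. A few keywords are kept verbatim; the list is illustrative.
std::vector<std::string> normalize(const std::string& src) {
    static const std::unordered_set<std::string> keywords = {
        "if", "else", "for", "while", "return", "float", "double", "int"};
    std::unordered_map<std::string, std::string> names;
    std::vector<std::string> out;
    for (const std::string& tok : tokenize(src)) {
        assert(!tok.empty());
        if (isIdentStart(tok[0]) && keywords.count(tok) == 0) {
            auto it = names.find(tok);
            if (it == names.end()) {
                it = names.emplace(tok, "$" + std::to_string(names.size())).first;
            }
            out.push_back(it->second);
        } else {
            out.push_back(tok);
        }
    }
    return out;
}

// Two fragments have the same shape if they are identical after
// consistent renaming of identifiers.
bool sameShape(const std::string& a, const std::string& b) {
    return normalize(a) == normalize(b);
}
```

Under these assumptions, `sameShape` would return true for the two fragments in the question, since `camPos`/`cPos` and `xDistance`/`distX` map to the same placeholders, while `a = b + c;` and `a = b + b;` would still be told apart. A real clone detector works on syntax trees and tolerates small differences, which this sketch does not attempt.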
#3
Here is the best collection of material on code clone detection I've seen:
https://web.archive.org/web/20120502162147/http://students.cis.uab.edu/tairasr/clones/literature
There are many programs, but none of them seems to be the best or the most popular. Decide what matters most to you and pick the one that suits your needs.