Is there a way to remove duplicate records WITHOUT sorting? Sometimes I want to keep the original order and just remove the duplicated records.
Is it possible?
BTW, below is what I know about removing duplicate records; both approaches end up sorting the data.
1.
proc sql;
  create table yourdata_nodupe as
  select distinct *
  from abc;
quit;
2.
proc sort data=YOURDATA nodupkey;
  by var1 var2 var3 var4 var5;
run;
8 solutions
#1
16
You could use a hash object to keep track of which values have been seen as you pass through the data set. Only output when you encounter a key that hasn't been observed yet. This outputs in the order the data was observed in the input data set.
Here is an example using the input data set "sashelp.cars". The original data was in alphabetical order by Make so you can see that the output data set "nodupes" maintains that same order.
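data nodupes (drop=rc);
  length Make $ 13;
  /* Hash object holding every key value seen so far */
  declare hash found_keys();
  found_keys.definekey('Make');
  found_keys.definedone();
  /* Read the whole input inside a single DATA step iteration */
  do while (not done);
    set sashelp.cars end=done;
    rc = found_keys.check();
    /* Non-zero return code: key not seen yet, so store it and keep the row */
    if rc ^= 0 then do;
      rc = found_keys.add();
      output;
    end;
  end;
  stop;
run;
proc print data=nodupes; run;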
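The same idea extends to several key variables. Below is a minimal sketch, assuming a hypothetical data set abc with keys var1, var2 and var3 (names borrowed from the question; the test rows are made up). It relies on add() returning a non-zero code when the key is already in the hash, so no separate check() call is needed.

```sas
/* Made-up test data: duplicates are deliberately not adjacent */
data abc;
  input var1 var2 var3 $;
  datalines;
1 1 a
2 2 b
1 1 a
3 3 c
2 2 b
;
run;

data abc_nodup (drop=rc);
  /* Build the hash once, on the first iteration only */
  if _n_ = 1 then do;
    declare hash seen();
    seen.definekey('var1', 'var2', 'var3');
    seen.definedone();
  end;
  set abc;
  /* add() returns 0 only when this key combination was not stored yet */
  rc = seen.add();
  /* Subsetting IF: keep the row only on its first appearance */
  if rc = 0;
run;
```

With the five made-up rows above, abc_nodup should contain 1 1 a, 2 2 b and 3 3 c, in that order. The hash lives in memory, so this only works while the set of distinct keys fits there.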
#2
1
/* Give each record in the original dataset a row number */
data with_id;
  set mydata;
  _id = _n_;
run;
/* Remove dupes */
proc sort data=with_id nodupkey;
  by var1 var2 var3;
run;
/* Sort back into original order */
proc sort data=with_id;
  by _id;
run;
#3
1
I think the short answer is no, at least not without a much bigger performance hit than a method based on sorting.
There may be specific cases where this is possible (a dataset where all variables are indexed? A relatively small dataset that you could reasonably load into memory and work with there?) but this wouldn't help you with a general method.
Something along the lines of Chris J's solution is probably the best way to get the outcome you're after, but that's not an answer to your actual question.
#4
0
Depending on the number of variables in your data set, the following might be practical:
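data abc_nodup;
  set abc;
  /* _var1-_var4 hold the previous row's values. If var1-var4 are
     character, declare _var1-_var4 with a LENGTH statement first. */
  retain _var1 _var2 _var3 _var4;
  if _n_ eq 1 then output;
  else do;
    /* Drop a row only when it matches the row immediately before it */
    if (var1 eq _var1) and (var2 eq _var2) and
       (var3 eq _var3) and (var4 eq _var4)
      then delete;
    else output;
  end;
  _var1 = var1;
  _var2 = var2;
  _var3 = var3;
  _var4 = var4;
  drop _var:;
run;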
#5
0
This is the fastest way I can think of. It requires no sorting, as long as the input is already ordered by person_id and stay (which is what the sortedby= option asserts).
data output_data_name;
  set input_data_name (
    sortedby = person_id stay
    keep =
      person_id
      stay
      ... more variables ...);
  by person_id stay;
  if first.stay > 0 then output;
run;
#6
0
Please refer to Usage Note 37581: How can I eliminate duplicate observations from a large data set without sorting, http://support.sas.com/kb/37/581.html . Usage Note 37581 shows how PROC SUMMARY can be used to more efficiently remove duplicates without the use of sorting.
#7
0
The two examples given in the original post are not identical.
- distinct in proc sql only removes lines which are fully identical
- nodupkey in proc sort removes any line where the key variables are identical (even if other variables are not identical). You need the option noduprecs to remove fully identical lines; since noduprecs only compares each record with the previous one kept, the identical lines must end up next to each other after the sort (for example, by sorting by _all_).
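A small sketch of the difference, using a made-up data set (the names demo, id and value are only for illustration):

```sas
/* Three rows sharing the same key, two of them fully identical */
data demo;
  input id value;
  datalines;
1 10
1 20
1 10
;
run;

/* DISTINCT compares whole rows: (1,10) and (1,20) both survive, so 2 rows */
proc sql;
  create table demo_distinct as
  select distinct *
  from demo;
quit;

/* NODUPKEY compares only the BY variables: one row per id, so 1 row */
proc sort data=demo out=demo_nodupkey nodupkey;
  by id;
run;
```

So the two snippets in the question only give the same result when every variable is listed in the BY statement.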
If you are only looking for records that share the same key variables, another solution I can think of is to create a dataset with only the key variable(s), find out which values are duplicated, and then apply a format to the original data to flag the duplicate records. If the dataset has more than one key variable, you would need to create a new variable containing the concatenation of all the key variable values, converted to character if needed.
#8
-1
data output;
  set yourdata;
  by var notsorted;
  if first.var then output;
run;
This will not sort the data, but it only removes duplicates within each run of consecutive identical values; a value that shows up again later in the data is kept again.
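To see what BY with NOTSORTED does when a value repeats non-consecutively, here is a tiny sketch with made-up data (the data set names runs and runs_nodup are hypothetical):

```sas
/* The value 'a' appears in two separate runs */
data runs;
  input var $;
  datalines;
a
a
b
a
;
run;

data runs_nodup;
  set runs;
  /* NOTSORTED treats each run of equal consecutive values as a BY group */
  by var notsorted;
  /* Keep the first row of every run */
  if first.var then output;
run;
```

Here runs_nodup ends up with three rows (a, b, a): the second a in the first run is dropped, but the a in the last row starts a new group and is kept. So this is only a full de-duplication when equal values are already next to each other.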