How do I do a GROUP BY (as in MySQL) in Stata?

时间:2020-12-20 04:28:47

I am not a statistics guy but have to deal with quite some data. In most cases these data sets come from an online survey; hence I do have a MySQL database and know how to get some results out of that.

However, now I got a Stata file and I am required to do some analysis. In MySQL I'd know how to do that, but I am stuck in Stata and ask for your help.

I have a not too small table (roughly 50k rows) containing the following columns (there are more columns, but these are the ones I have to work with):

id - Object ID, unique values

name - Name of object, string value

class - Class of object, integer range 1 - 6

origin - Origin of object, integer range 1 - 2

Within the 50k rows there are only about 7k different names. In Stata I can retrieve all names with list name and could even restrict it to a single class with list name if class == 2.

Now I want a list of all different names along with a count of objects having that name and have the list sorted by count. In MySQL I'd query SELECT name, COUNT(*) AS cnt FROM objects GROUP BY name ORDER BY cnt DESC. But how would that be done in Stata?

Next steps would be to get such lists for each class or for both origins, i.e. SELECT name, COUNT(*) AS cnt FROM objects WHERE class = 2 GROUP BY name ORDER BY cnt DESC, is that possible with Stata, too?

PS: I don't know if Stack Overflow is the right place, as Stata is not really a programming language, is it? But I found some Stata-related questions here; that's why I posted it here. If there's a better place to do so, please point me to it.

2 solutions

#1


2  

Keep in mind that Stata only works with rectangular tables of fixed length, so you can only add columns that span the whole 50k rows. Within this setup, this is what you can do.

For the first problem (the list of names and frequencies), you can

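   * Count rows per class/name pair (assumes id is numeric and never missing)
   collapse (count) freq = id, by(class name)
   * Within each class, put the most frequent names first
   gsort class -freq name
   list class name freq, sepby(class)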

Note that collapse will delete the existing data and replace it with the summary. (Usually, I hate this command for this aspect of data management, but it should work here.) If you don't want this to happen, here's a more sophisticated trick:

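   * freq = number of rows sharing this class/name pair
   bysort class name : generate long freq = _N
   * Flag one row per class/name pair so each pair is listed once
   bysort class name : generate byte first = (_n==1)
   * Within each class, put the most frequent names first
   gsort class -freq name
   list class name freq if first, sepby(class)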

(Explanation: _N is the number of observations in by-group, and _n is the number of the current observation within the by-group.)
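
To see what _N and _n do, here is a small sketch on a made-up four-row dataset (the values are purely illustrative and not taken from the question). The two rows with name "a" in class 1 should both get freq equal to 2, and only one of them should be flagged as first.

```stata
* Toy data, hypothetical values for illustration only
clear
input str5 name byte class
"a" 1
"a" 1
"b" 1
"a" 2
end

* _N: size of the current by-group; _n: position within the by-group
bysort class name: generate long freq = _N
bysort class name: generate byte first = (_n == 1)

list class name freq first, sepby(class)
```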

You can then subset this to the class of interest with if class==#, as you already know.
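
As an alternative that this answer does not use, Stata's built-in contract command is arguably the closest analogue to the SQL in the question. The following is only a sketch for the single-class query (WHERE class = 2), assuming the variable names from the question; cnt is just a name chosen here to mirror the SQL alias. Because contract replaces the data in memory, it is wrapped in preserve and restore.

```stata
* Sketch: SELECT name, COUNT(*) AS cnt ... WHERE class = 2
*         GROUP BY name ORDER BY cnt DESC
preserve                              // keep a copy of the full data
contract name if class == 2, freq(cnt)
gsort -cnt name                       // highest counts first
list name cnt, noobs
restore                               // bring the original data back
```

Dropping the if qualifier gives the counts over all classes, and replacing it with if origin == 1 gives them for one origin.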

#2


0  

Also check out the groups command downloadable using ssc inst groups.
