GitHub BigQuery commits-over-time query returns no results for some repos

Time: 2022-06-23 14:05:41

I'm trying to pull data about GitHub monthly commits over time using the public dataset at Google BigQuery. The following query provided results for Chef and Ansible but returned nothing for Puppet or Salt.

SELECT
  MONTH(committer.date) month,
  YEAR(committer.date) year,
  repo_name,
  COUNT(*) commits,
FROM 
  [bigquery-public-data:github_repos.commits]
WHERE
  repo_name IN ('puppetlabs/puppet',
  'saltstack/salt',
  'ansible/ansible',
  'chef/chef')
GROUP BY
  month,
  year,
  repo_name

I then tried narrowing the query to pull only Salt or Puppet by changing the WHERE clause to:

WHERE
    repo_name = 'puppetlabs/puppet'

(I also repeated this with 'saltstack/salt' as a separate query.) In each case I received the error message:

'Query returned zero records.'

I have tried to troubleshoot by:
1) confirming that I am using the correct repo names
2) confirming that the repos are public and should (in theory) be included in the BigQuery data and
3) tying the query results for Ansible and Chef back to the commits on github.com; in those cases the query gave accurate results.

Does anyone have any ideas about where the issue lies and how I can modify my query to return data for Salt and Puppet?

2 Answers

#1

puppetlabs/puppet is not open source, at least as determined by GitHub's License API:

curl -H "Accept: application/vnd.github.drax-preview+json" \
     https://api.github.com/repos/puppetlabs/puppet | grep license -A 6

"license": {
  "key": "other",
  "name": "Other",
  "spdx_id": null,
  "url": null,
  "featured": false
},

Documentation for the API:

puppetlabs/puppet LICENSE:

It looks like an Apache License 2.0 to me, but the repo won't be included in the GitHub dataset on BigQuery until the GitHub License API can determine that this is in fact an open source license.

Note that GitHub uses the licensee gem to power its License API, and this is how its authors say the algorithm runs:

If the license file has an explicit copyright notice, and nothing more (e.g., Copyright (c) 2015 Ben Balter), we'll assume the author intends to retain all rights, and thus the project isn't licensed.

If the license is an exact match to a known license. If we strip away whitespace and copyright notice, we might get lucky, and direct string comparison in Ruby is cheap.

If we still can't match the license, we use a fancy math thing called the Sørensen–Dice coefficient, which is really good at calculating the similarity between two strings. By calculating the percent changed from the known license to the license file, you can tell, e.g., that a given license is 90% similar to the MIT license, that 10% likely representing the copyright line being properly adapted to the project.
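
To make the similarity step concrete, here is a minimal Python sketch of a Sørensen–Dice coefficient. It is an illustration only, not licensee's actual implementation (which is written in Ruby): the function names, the use of character bigrams, and the whitespace/case normalization are all assumptions chosen for this example.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def bigrams(text: str) -> Counter:
    """Count overlapping character bigrams of the normalized text."""
    s = normalize(text)
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice_coefficient(a: str, b: str) -> float:
    """Sørensen–Dice similarity: 2 * |A ∩ B| / (|A| + |B|) over bigram multisets."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Neither string has a bigram; treat them as similar only if identical.
        return 1.0 if normalize(a) == normalize(b) else 0.0
    overlap = sum((grams_a & grams_b).values())  # multiset intersection
    return 2 * overlap / total


if __name__ == "__main__":
    print(dice_coefficient("night", "nacht"))
```

For "night" and "nacht" the only shared bigram is "ht", so the score is 2 * 1 / (4 + 4) = 0.25. A LICENSE file that differs from a known license text only in its copyright line would score close to 1.0 under the same measure.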

Now, if you are trying to get their commit info over time, you could use the GitHub Archive BigQuery dataset:

SELECT type, COUNT(*) c
FROM [githubarchive:month.201607]
WHERE repo.name = 'puppetlabs/puppet'
AND type='PushEvent'
GROUP BY 1

#2

Run the query below to see, for example, all repos from puppetlabs:

SELECT repo_name, COUNT(1) commits
FROM [bigquery-public-data:github_repos.commits]
WHERE repo_name LIKE 'puppetlabs/%' 
GROUP BY repo_name
ORDER BY commits DESC

There are quite a number!
