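GitHub can report how many lines of code each contributor has committed to a repository. At our company's annual party there was a special "Code God Award" for the engineer who contributed the most code in the past year, and according to GitHub's statistics the winner had committed 1.1 million lines. That figure is staggering; nobody writes that much code in a year. Curious, I dug into it and found that the total included a lot of third-party libraries he had checked in, which GitHub counted as his, along with the code in commits he had merged. So is there a way to strip out this noise and get a realistic measure of contribution? After looking through the GitHub API and combining it with git commands, it turns out there is. The code follows; it is also available on GitHub: https://github.com/heliclei/githubtools/blob/master/github-stats.py
#copy this script to your target repo
#run python github-stats.py <github-oauth-token> to collect data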
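import re
import os
import sys
import requests

#get token from cmd line
tk = sys.argv[1]
#per-author totals, keyed by git author name
user_stats={"dummy":{"additions":0,"deletions":0,"total":0}}
#query github api for last year's commits
#note: GitHub has since dropped access_token query-parameter authentication;
#on the current API the token goes in an "Authorization" request header
payload = {'since':'2013-01-01T00:00:00Z','until':'2014-01-01T00:00:00Z','access_token':tk}
token = {'access_token':tk}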
def is_merge(commit_sha):
    #-s suppresses the diff, so only the one-line title is searched;
    #without it, any commit whose diff contains the word "Merge" would be skipped
    cmd = "git show -s --oneline " + commit_sha
    title = os.popen(cmd).read()
    return re.search("Merge", title) is not None
def collect_stats(commit_list):
    for m in commit_list:
        #merge commits carry other people's work, so skip them
        if is_merge(m['sha']):
            continue
        #author name as recorded in the local clone
        git_show_command = "git show -s --format=%an " + m['sha']
        user = os.popen(git_show_command).read().strip(' \t\n\r')
        #diff parent -> commit, so insertions are the lines this commit added
        git_diff_command = "git diff --shortstat " + m['sha'] + "^ " + m['sha']
        data = os.popen(git_diff_command).read()
        ins_data = 0
        del_data = 0
        r_ins = re.search(r"(\d+) insertion", data)
        if r_ins is not None:
            ins_data = int(r_ins.group(1))
        r_del = re.search(r"(\d+) deletion", data)
        if r_del is not None:
            del_data = int(r_del.group(1))
        #a commit touching more than 5000 lines is treated as an import of
        #third-party code: report it, then count it as zero
        if ins_data + del_data > 5000:
            print(user)
            print('ins:'+str(ins_data))
            print('del:'+str(del_data))
            ins_data = 0
            del_data = 0
        if user in user_stats:
            stats = user_stats[user]
            stats['additions'] += ins_data
            stats['deletions'] += del_data
            stats['total'] += (ins_data + del_data)
        else:
            user_stats[user] = {'additions':ins_data, 'deletions':del_data, 'total':ins_data+del_data}
r = requests.get("https://api.github.com/repos/cocos2d/cocos2d-x/commits", params = payload)
collect_stats(r.json())
print(user_stats)
#the Link response header carries the url of the next page of results
pattern = re.compile(r'<(\S+)>; rel="next"')
print(r.headers['X-RateLimit-Remaining'])
#a single-page response has no Link header, hence .get
result = pattern.search(r.headers.get('link', ''))
while result is not None:
    next_url = result.group(1)
    r = requests.get(next_url, params = token)
    collect_stats(r.json())
    link = r.headers.get('link', '')
    print(link)
    result = pattern.search(link)
print(r.headers['X-RateLimit-Remaining'])
print(user_stats)
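
The paging loop above pulls the next-page URL out of the Link response header with a regular expression. As a minimal sketch, the same extraction can be written as a standalone function. `next_page_url` is a hypothetical helper name, and the sample header below is made up for illustration rather than captured from a real response.

```python
import re

# Matches the URL of the entry marked rel="next" in a Link header.
NEXT_LINK = re.compile(r'<([^<>\s]+)>;\s*rel="next"')

def next_page_url(link_header):
    """Return the rel="next" URL from a Link header, or None on the last page."""
    if not link_header:
        return None
    match = NEXT_LINK.search(link_header)
    return match.group(1) if match else None

# Illustrative header value (assumed shape, not real data).
sample = ('<https://api.github.com/repositories/1/commits?page=2>; rel="next", '
          '<https://api.github.com/repositories/1/commits?page=9>; rel="last"')
print(next_page_url(sample))
```

The requests library also exposes the parsed header as `r.links`, so `r.links.get('next', {}).get('url')` is an alternative to a hand-written pattern.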
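The script above skips any single commit that changes more than 5000 lines, and it skips merge commits. To use it, first clone the repository you want to measure, then copy the script into the local git repository. Note that this line must be changed to the URL of that repository:
https://api.github.com/repos/cocos2d/cocos2d-x/commits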
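
The two filters can be isolated into a small function that works on the text printed by `git diff --shortstat`. This is only a sketch: `parse_shortstat` and `countable_lines` are hypothetical helper names, the sample string imitates typical shortstat output, and 5000 is the threshold the script uses.

```python
import re

INSERTIONS = re.compile(r"(\d+) insertion")
DELETIONS = re.compile(r"(\d+) deletion")
MAX_LINES = 5000  # same threshold as the script

def parse_shortstat(text):
    """Return (insertions, deletions) parsed from `git diff --shortstat` output."""
    ins = INSERTIONS.search(text)
    dels = DELETIONS.search(text)
    return (int(ins.group(1)) if ins else 0,
            int(dels.group(1)) if dels else 0)

def countable_lines(text, is_merge=False):
    """Apply both filters: merges and oversized commits count as (0, 0)."""
    ins, dels = parse_shortstat(text)
    if is_merge or ins + dels > MAX_LINES:
        return (0, 0)
    return (ins, dels)

# Illustrative shortstat line (assumed, not taken from a real repository).
sample = " 3 files changed, 120 insertions(+), 45 deletions(-)"
print(countable_lines(sample))
```

Keeping the parsing separate from the git calls makes the threshold easy to tune and to check against a few hand-written inputs before running it over a whole year of commits.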
A GitHub token can be generated with the script from the previous post.
Run: python github-stats.py xxxxxxxxxxxxxgithub-oauth-tokenxxxxxxxxxxxxxxxxxxx
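
The script sends the token as an `access_token` query parameter, which GitHub has since stopped accepting. A hedged sketch of the header-based alternative follows. `auth_headers` is a hypothetical helper and the token value is a placeholder.

```python
def auth_headers(token):
    """Build request headers that carry a GitHub token (hypothetical helper)."""
    return {
        "Authorization": "token " + token,
        "Accept": "application/vnd.github+json",
    }

headers = auth_headers("xxxxxxxx")  # placeholder, not a real token
print(headers["Authorization"])

# With requests, the headers would be passed alongside the date range, e.g.:
#   requests.get(url,
#                params={'since': '2013-01-01T00:00:00Z',
#                        'until': '2014-01-01T00:00:00Z'},
#                headers=headers)
```

With this approach the token would no longer appear in `payload` or in the URLs printed by the paging loop.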
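PS: After filtering, the top contributor to cocos2d-x still committed more than 100,000 lines last year, which is very impressive, but the figure is far less surreal than 1.1 million.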