We currently have some data on an HDFS cluster on which we generate reports using Hive. The infrastructure is in the process of being decommissioned, and we are left with the task of finding an alternative way of generating reports on the data (which we imported as tab-separated files into our new environment).
Assuming we have a table with the following fields.
- Query
- IPAddress
- LocationCode
The original SQL query we used to run on Hive was (well, not exactly... but something similar):
select
COUNT(DISTINCT Query, IPAddress) as c1,
LocationCode as c2,
Query as c3
from table
group by Query, LocationCode
I was wondering if someone could provide me with an efficient script using standard Unix/Linux tools such as sort, uniq, and awk that can act as a replacement for the above query.
Assume the input to the script would be a directory of text files. The directory would contain about 2000 files. Each file would contain an arbitrary number of tab-separated records of the form:
Query <TAB> LocationCode <TAB> IPAddress <NEWLINE>
2 Answers
#1
Once you have a sorted file containing all the unique
Query <TAB> LocationCode <TAB> IPAddress <NEWLINE>
you could:
awk -F '\t' 'NR == 1 {q=$1; l=$2; count=0}
q == $1 && l == $2{count++}
q != $1 || l != $2{printf "%s\t%s\t%d\n", q, l, count; q=$1; l=$2; count=1}
END{printf "%s\t%s\t%d\n", q, l, count}' sorted_uniq_file
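Putting the two steps together on a tiny self-contained example (the query strings, location codes, and IPs below are made up for illustration):

```shell
# Build a small sample "dir" with duplicate records across files.
dir=$(mktemp -d)
printf 'foo\tUS\t1.1.1.1\nfoo\tUS\t2.2.2.2\n' > "$dir/a.txt"
printf 'foo\tUS\t1.1.1.1\nbar\tDE\t3.3.3.3\n' > "$dir/b.txt"

# sort -u deduplicates the triples; awk then counts rows per
# (Query, LocationCode) group. Because the triples are unique, that
# row count is exactly the COUNT(DISTINCT Query, IPAddress) of the
# original Hive query.
result=$(sort -u "$dir"/* |
    awk -F '\t' 'NR == 1 {q=$1; l=$2; count=0}
                 q == $1 && l == $2 {count++}
                 q != $1 || l != $2 {printf "%s\t%s\t%d\n", q, l, count; q=$1; l=$2; count=1}
                 END {printf "%s\t%s\t%d\n", q, l, count}')
printf '%s\n' "$result"
rm -rf "$dir"
```

The duplicate `foo US 1.1.1.1` record collapses to one row, so the output is `bar DE 1` and `foo US 2` (tab-separated).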
To get this sorted_uniq_file, the naive way can be:
sort -u dir/* > sorted_uniq_file
But this can be very slow and memory-consuming.
A faster (and less memory-consuming) option is to eliminate duplicates as soon as possible, sorting first and merging later. This needs temporary space for the sorted files; let's use a directory named sorted:
mkdir sorted
for f in dir/*; do
    sort -u "$f" > "sorted/$(basename "$f")"
done
sort -mu sorted/* > sorted_uniq_file
rm -rf sorted
If the solution above hits some shell or sort limit (expansion of dir/* or sorted/*, or the number of arguments to sort):
mkdir sorted
ls dir | while read -r f; do
    sort -u "dir/$f" > "sorted/$f"
done
# Repeatedly merge pairs of sorted files until only one remains.
while [ "$(ls sorted | wc -l)" -gt 1 ]; do
    mkdir sorted_tmp
    ls sorted | while read -r f1; do
        if read -r f2; then
            sort -mu "sorted/$f1" "sorted/$f2" > "sorted_tmp/$f1"
        else
            mv "sorted/$f1" sorted_tmp
        fi
    done
    rm -rf sorted
    mv sorted_tmp sorted
done
mv sorted/* sorted_uniq_file
rm -rf sorted
The solution above can be optimized to merge more than two files at a time.
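With GNU coreutils, this batching is built into sort itself: `--batch-size` caps how many files are merged at once, and `--files0-from` sidesteps argument-list limits. A sketch with made-up file contents (both options are GNU extensions, not POSIX):

```shell
# Two pre-sorted, deduplicated per-file outputs (sample contents).
sorted=$(mktemp -d)
printf 'a\tX\t1\nc\tY\t2\n' > "$sorted/f1"
printf 'a\tX\t1\nb\tZ\t3\n' > "$sorted/f2"

# Merge the already-sorted files at most 16 at a time; sort spills
# intermediate merges to temporary files, so neither the shell's
# argument-list limit nor the open-file limit is hit.
find "$sorted" -type f -print0 |
    sort -mu --batch-size=16 --files0-from=- > sorted_uniq_file

cat sorted_uniq_file
rm -rf "$sorted"
```

This replaces the whole pairwise-merge loop with a single sort invocation.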
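If the set of distinct triples fits in memory, the sorting can be skipped entirely with a single awk pass over all the files (a sketch; the sample data is made up, and since awk's `for (k in count)` order is arbitrary, the output is piped through sort only to make it reproducible):

```shell
dir=$(mktemp -d)
printf 'foo\tUS\t1.1.1.1\nfoo\tUS\t1.1.1.1\nfoo\tUS\t2.2.2.2\n' > "$dir/a.txt"
printf 'bar\tDE\t3.3.3.3\n' > "$dir/b.txt"

# seen[] drops duplicate Query/Location/IP triples as they stream by;
# count[] then tallies the surviving rows per (Query, LocationCode).
result=$(awk -F '\t' '!seen[$0]++ { count[$1 FS $2]++ }
                      END { for (k in count) printf "%s\t%d\n", k, count[k] }' \
             "$dir"/* | sort)
printf '%s\n' "$result"
rm -rf "$dir"
```

The trade-off is memory proportional to the number of distinct triples instead of disk space for the sorted copies.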
#2
Not a direct answer to your original question (which you already got), but if you have a bunch of flat-file data that you want to query in different ways, you might consider using NoSQL:
http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/nosql/Home%20Page
This NoSQL project is a totally different animal from (and predates by many years) what has more recently come to be known as "NoSQL databases". Instead, this NoSQL ties together Unix tools, with Awk as the centerpiece, to simplify their use in accessing and maintaining a database of formatted text files. It makes it easy to do a lot of slick stuff, e.g., table joins.