SQL query engine for text files on Linux?

Time: 2021-08-22 00:11:54

We use grep, cut, sort, uniq, and join at the command line all the time to do data analysis. They work great, although there are shortcomings. For example, you have to give column numbers to each tool. We often have wide files (many columns) with a header line that gives column names. In fact, our files look a lot like SQL tables. I'm sure there is a driver (ODBC?) that can operate on delimited text files, and some query engine that will use that driver, so we could just run SQL queries against our text files. Since the analysis is usually ad hoc, the setup for querying new files would have to be minimal (just use the files I specify in this directory) rather than declaring particular tables in some config.

Practically speaking, what's the easiest option? That is, which SQL engine and driver are easiest to set up and use against text files?

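To make that concrete with an invented example (the file name and column position are made up): counting by a field today means hard-coding its position into a pipeline,

$ cut -d' ' -f3 access.log | sort | uniq -c | sort -rn

whereas what I'm after is naming the column in a query with some yet-to-be-chosen tool, along the lines of:

$ sometool "SELECT status, COUNT(*) FROM access.log GROUP BY status"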

6 Answers

#1 (3 votes)

Riffing off someone else's suggestion, here is a Python script for sqlite3. A little verbose, but it works.

I don't like having to completely copy the file to drop the header line, but I don't know how else to convince sqlite3's .import to skip it. I could create INSERT statements, but that seems just as bad if not worse.

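Update: newer versions of the sqlite3 shell (3.32 and later, if I read the changelog right) accept a --skip option on .import, which would make the temporary copy below unnecessary. A rough sketch, assuming a space-separated file foo whose header declares columns a and b:

$ sqlite3 :memory: <<'EOF'
CREATE TABLE data (`a` NUMERIC, `b` NUMERIC);
.separator ' '
.import --skip 1 'foo' data
SELECT count(*) FROM data;
EOF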

Sample invocation:

$ sql.py --file foo --sql "select count(*) from data"

The code:

#!/usr/bin/env python3

"""Run a SQL statement on a text file"""

import os
import sys
import getopt
import tempfile

class Usage(Exception):
    def __init__(self, msg):
        self.msg = msg

def runCmd(cmd):
    # Run a shell command; abort on failure.
    if os.system(cmd):
        print("Error running " + cmd)
        sys.exit(1)
        # TODO(dan): Return actual exit code

def usage():
    print("Usage: sql.py --file file --sql sql", file=sys.stderr)

def main(argv=None):
    if argv is None:
        argv = sys.argv

    try:
        try:
            opts, args = getopt.getopt(argv[1:], "h",
                                       ["help", "file=", "sql="])
        except getopt.error as msg:
            raise Usage(msg)
    except Usage as err:
        print(err.msg, file=sys.stderr)
        print("for help use --help", file=sys.stderr)
        return 2

    filename = None
    sql = None
    for o, a in opts:
        if o in ("-h", "--help"):
            usage()
            return 0
        elif o == "--file":
            filename = a
        elif o == "--sql":
            sql = a
        else:
            print("Found unexpected option " + o)

    if not filename:
        print("Must give --file", file=sys.stderr)
        sys.exit(1)
    if not sql:
        print("Must give --sql", file=sys.stderr)
        sys.exit(1)

    # Get the first line of the file to make a CREATE statement.
    #
    # Copy the rest of the lines into a new file (datafile) so that
    # sqlite3 can import the data without the header.  If sqlite3 could
    # skip the first line with .import, this copy would be unnecessary.
    datafile = tempfile.NamedTemporaryFile(mode="w")
    headers = None
    with open(filename) as infile:
        for line in infile:
            if headers is None:
                headers = line.rstrip().split()
            else:
                datafile.write(line)
    datafile.flush()

    # Create columns with NUMERIC affinity so that if they are numbers,
    # SQL queries will treat them as such.
    create_statement = "CREATE TABLE data (" + ",".join(
        "`%s` NUMERIC" % h for h in headers) + ");"

    # Write the schema, the import command, and the query to a command
    # file, then feed it to the sqlite3 shell.
    cmdfile = tempfile.NamedTemporaryFile(mode="w")
    print(create_statement, file=cmdfile)
    print(".separator ' '", file=cmdfile)
    print(".import '" + datafile.name + "' data", file=cmdfile)
    print(sql + ";", file=cmdfile)
    cmdfile.flush()
    runCmd("cat %s | sqlite3" % cmdfile.name)

if __name__ == "__main__":
    sys.exit(main())

#2 (5 votes)

David Malcolm wrote a little tool named "squeal" (formerly "show"), which allows you to use SQL-like command-line syntax to parse text files of various formats, including CSV.

An example on squeal's home page:

$ squeal "count(*)", source from /var/log/messages* group by source order by "count(*)" desc
count(*)|source              |
--------+--------------------+
1633    |kernel              |
1324    |NetworkManager      |
98      |ntpd                |
70      |avahi-daemon        |
63      |dhclient            |
48      |setroubleshoot      |
39      |dnsmasq             |
29      |nm-system-settings  |
27      |bluetoothd          |
14      |/usr/sbin/gpm       |
13      |acpid               |
10      |init                |
9       |pcscd               |
9       |pulseaudio          |
6       |gnome-keyring-ask   |
6       |gnome-keyring-daemon|
6       |gnome-session       |
6       |rsyslogd            |
5       |rpc.statd           |
4       |vpnc                |
3       |gdm-session-worker  |
2       |auditd              |
2       |console-kit-daemon  |
2       |libvirtd            |
2       |rpcbind             |
1       |nm-dispatcher.action|
1       |restorecond         |

#3 (3 votes)

Maybe write a script that creates an SQLite instance (possibly in memory), imports your data from a file/stdin (accepting your data's format), runs a query, then exits?

Depending on the amount of data, performance could be acceptable.

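A minimal sketch of that idea, assuming a space-separated file with a header row (file and table names invented). When .import targets a table that does not yet exist, the sqlite3 shell creates it and uses the header line for column names, though every column then gets TEXT affinity, which is the trade-off versus the script in answer #1:

$ sqlite3 :memory: <<'EOF'
.separator ' '
.import 'mydata.txt' data
SELECT count(*) FROM data;
EOF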

#4 (2 votes)

MySQL has a CSV storage engine that might do what you need, if your files are CSV files.

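For reference, a rough sketch of what the CSV engine looks like (table and column names invented; the engine requires NOT NULL columns and backs the table with a plain .CSV file in the database directory):

$ mysql mydb -e "CREATE TABLE weblog (host VARCHAR(64) NOT NULL, status INT NOT NULL) ENGINE=CSV"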

Otherwise, you can use mysqlimport to import text files into MySQL. You could create a wrapper around mysqlimport, which figures out columns etc. and creates the necessary table.

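A hedged sketch of such a wrapper, assuming a tab-separated file whose first line holds the column names: mysqlimport derives the table name from the file name, tab is its default field separator, and --ignore-lines skips the header. Everything else here, including the file name, is invented:

$ cols=$(head -1 report.txt | tr '\t' '\n' | sed 's/.*/`&` TEXT/' | paste -sd, -)
$ mysql mydb -e "CREATE TABLE report ($cols)"
$ mysqlimport --local --ignore-lines=1 mydb report.txt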

You might also be able to use DBD::AnyData, a Perl module which lets you access text files like a database.

That said, it sounds a lot like you should really look at using a database. Is it really easier keeping table-oriented data in text files?

#5 (2 votes)

q - Run SQL directly on CSV or TSV files:

https://github.com/harelba/q
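
A couple of illustrative invocations (file names invented): -H treats the first input line as a header so columns can be referenced by name, -t selects tab-delimited input, and -d sets an arbitrary delimiter:

$ q -H -t "SELECT source, COUNT(*) FROM ./access.tsv GROUP BY source"
$ q -H -d' ' "SELECT name, score FROM ./results.txt WHERE score > 90"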

#6 (0 votes)

I have used Microsoft LogParser to query CSV files several times... and it serves the purpose. It was surprising to see such a useful tool from M$, and free at that!

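For reference, a typical invocation looks something like this (file name invented; -i:CSV selects the CSV input format, and the first row is treated as a header by default):

C:\> LogParser -i:CSV "SELECT status, COUNT(*) FROM access.csv GROUP BY status"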
