Creating an Impala external table from a partitioned file structure

Date: 2022-07-10 13:49:43

Given a partitioned filesystem structure like the following:

logs
└── log_type
    └── 2013
        ├── 07
        │   ├── 28
        │   │   ├── host1
        │   │   │   └── log_file_1.csv
        │   │   └── host2
        │   │       ├── log_file_1.csv
        │   │       └── log_file_2.csv
        │   └── 29
        │       ├── host1
        │       │   └── log_file_1.csv
        │       └── host2
        │           └── log_file_1.csv
        └── 08

I've been trying to create an external table in Impala:

create external table log_type (
    field1    string,
    field2    string,
    ...
)
row format delimited fields terminated by '|' location '/logs/log_type/2013/08';

I was hoping Impala would recurse into the subdirectories and load all the CSV files, but no cigar. No errors are thrown, but no data is loaded into the table either.

Different globs like /logs/log_type/2013/08/*/* or /logs/log_type/2013/08/*/*/* did not work either.

Is there a way to do this? Or should I restructure the fs - any advice on that?

2 answers

#1


9  

In case you are still searching for an answer: you need to register each individual partition manually.

See "Registering External Table" for details.

The schema of your table needs to be adjusted:

create external table log_type (
        field1    string,
        field2    string,
...)
  partitioned by (year int, month int, day int, host string)
  row format delimited fields terminated by '|';

After changing the schema to include year, month, day and host, you have to walk the directory tree and add each partition to the table, one leaf directory at a time.

Something like this:

ALTER TABLE log_type ADD PARTITION (year=2013, month=07, day=28, host="host1")
    LOCATION '/logs/log_type/2013/07/28/host1';
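Since one such statement is needed per leaf directory, it may be easier to generate them than to type them. The following is only a sketch in Python: the helper name `add_partition_statement` is made up for illustration, the directory list is copied from the tree in the question rather than read from the filesystem, and host names are assumed to contain no quote characters.

```python
from pathlib import PurePosixPath


def add_partition_statement(table, partition_dir, root):
    """Build one ALTER TABLE ... ADD PARTITION statement (hypothetical helper).

    Assumes partition_dir has the shape <root>/<year>/<month>/<day>/<host>.
    """
    parts = PurePosixPath(partition_dir).relative_to(root).parts
    if len(parts) != 4:
        raise ValueError(f"expected year/month/day/host under {root}: {partition_dir}")
    year, month, day, host = parts
    # int() drops the leading zero of directory names such as "07".
    return (
        f"ALTER TABLE {table} ADD PARTITION "
        f"(year={int(year)}, month={int(month)}, day={int(day)}, host='{host}')\n"
        f"    LOCATION '{partition_dir}';"
    )


# Leaf directories taken from the tree in the question; in practice this
# list would come from a listing of the actual filesystem.
partition_dirs = [
    "/logs/log_type/2013/07/28/host1",
    "/logs/log_type/2013/07/28/host2",
    "/logs/log_type/2013/07/29/host1",
    "/logs/log_type/2013/07/29/host2",
]

statements = [
    add_partition_statement("log_type", d, "/logs/log_type") for d in partition_dirs
]
print("\n".join(statements))
```

The printed statements can then be saved to a file and run against Impala in one go.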

Afterwards you need to refresh the table in Impala:

invalidate metadata log_type;
refresh log_type;

#2


0  

Another way to do this might be to use the LOAD DATA statement in Impala. If your data is in a SequenceFile or another less Impala-friendly format (see "Impala file formats"), you can create your external table like Joey does above, but instead of ALTER TABLE you can do something like:

LOAD DATA INPATH '/logs/log_type/2013/07/28/host1/log_file_1.csv' INTO TABLE log_type PARTITION (year=2013, month=07, day=28, host='host1');
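With many files, the statement can be derived from each file's path instead of being written by hand. This is a rough Python sketch, not a tested tool: `load_data_statement` is a made-up helper, and it assumes every file sits at <root>/<year>/<month>/<day>/<host>/<file> as in the question.

```python
from pathlib import PurePosixPath


def load_data_statement(table, file_path, root):
    """Build one LOAD DATA statement from a file path (hypothetical helper).

    Assumes file_path has the shape <root>/<year>/<month>/<day>/<host>/<file>.
    """
    parts = PurePosixPath(file_path).relative_to(root).parts
    if len(parts) != 5:
        raise ValueError(f"expected year/month/day/host/file under {root}: {file_path}")
    year, month, day, host, _filename = parts
    # String partition values such as the host name must be quoted.
    return (
        f"LOAD DATA INPATH '{file_path}' INTO TABLE {table} "
        f"PARTITION (year={int(year)}, month={int(month)}, day={int(day)}, host='{host}');"
    )


example = load_data_statement(
    "log_type",
    "/logs/log_type/2013/07/28/host2/log_file_2.csv",
    "/logs/log_type",
)
print(example)
```

Keep in mind that LOAD DATA moves the source files into the table's directory rather than copying them, so the original layout under /logs changes.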
