如何有效地实现tail ?

时间:2022-12-09 10:06:14

What is the efficient way to implement tail in *NIX? I came up (wrote) with two simple solution, both using kind of circular buffer to load lines into circular structure (array | doubly linked circular list - for fun). I've seen part of older implementation in busybox and from what I understood, they used fseek to find EOF and then read stuff "backwards". Is there anything cleaner and faster out there? I got asked this on interview and asker did not look satisfied. Thank you in advance.

在*NIX中实现tail的有效方法是什么?我提出(编写)了两个简单的解决方案,都使用某种圆形缓冲区将线加载到圆形结构中(为了好玩,数组|双链圆形列表)。我在busybox中看到过一些旧的实现,根据我的理解,他们使用fseek查找EOF,然后读取“反向”的内容。外面有更干净、更快的东西吗?我在面试中被问到这个问题,问我的人看起来并不满意。提前谢谢你。

4 个解决方案

#1


14  

I don't think there are solutions different than "keep the latest N lines while reading forward the data" or "start from the end and go backwards until you read the Nth line".

我不认为有什么解决方案与“在读取数据时保持最新的N行”或“从末尾开始,一直向后直到读取第N行”不同。

The point is that you'd use one or the another based on the context.

重点是您将根据上下文使用其中一个或另一个。

The "go to the end and go backwards" is better when tail accesses a random access file, or when the data is small enough to be put on memory. In this case the runtime is minimized, since you scan the data that has to be outputted (so, it's "optimal")

当tail访问随机访问文件,或者当数据足够小时,“go to The end and go backwards”会更好。在这种情况下,运行时被最小化,因为您扫描了必须输出的数据(因此,这是“最优”)

Your solution (keep the N latest lines) is better when tail is fed with a pipeline or when the data is huge. In this case, the other solution wastes too much memory, so it is not practical and, in the case the source is slower than tail (which is probable) scanning all the file doesn't matter that much.

当尾部使用管道或当数据量很大时,您的解决方案(保持最新的N行)会更好。在这种情况下,其他的解决方案浪费了太多的内存,所以它是不实际的,而且,如果源的速度比tail(可能的)慢,那么扫描所有的文件就不那么重要了。

#2


6  

Read backwards from the end of the file until N linebreaks are read or the beginning of the file is reached.

从文件的末尾向后读取,直到读取N个换行符或到达文件的开头。

Then print what was just read.

然后打印刚刚阅读的内容。

I dont think any fancy datastructures are needed here.

我认为这里不需要任何花哨的数据结构。

Here is the source code of tail if you're interested.

如果你感兴趣的话,这是tail的源代码。

#3


5  

First use fseek to find the end-of-file then subtract 512 and fseek to that offset, then read forward from there to end. Count the number of line-breaks because if there are too few you will have to do the same with a subtracted offset of 1024 ... but in 99% of cases 512 will be enough.

首先使用fseek查找文件的末尾,然后减去512,然后fseek到该偏移量,然后从那里读取到结束。计算断行的数量,因为如果断行的数量太少,你就必须用减去1024的偏移量来计算。但在99%的情况下512就足够了。

This (1) avoids reading the whole file forward and (2) the reason why this is probably more efficient than reading backwards from the end is that reading forward is typically faster.

这(1)避免向前读取整个文件,(2)这可能比从末尾向后读取更有效的原因是向前读取通常更快。

#4


0  

/*This example implements the option n of tail command.*/

/*此示例实现tail命令的选项n。*/

#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
#include <getopt.h>

#define BUFF_SIZE 4096

FILE *openFile(const char *filePath)
{
  FILE *file;
  file= fopen(filePath, "r");
  if(file == NULL)
  {
    fprintf(stderr,"Error opening file: %s\n",filePath);
    exit(errno);
  }
  return(file);
}

void printLine(FILE *file, off_t startline)
{
  int fd;
  fd= fileno(file);
  int nread;
  char buffer[BUFF_SIZE];
  lseek(fd,(startline + 1),SEEK_SET);
  while((nread= read(fd,buffer,BUFF_SIZE)) > 0)
  {
    write(STDOUT_FILENO, buffer, nread);
  }
}

void walkFile(FILE *file, long nlines)
{
  off_t fposition;
  fseek(file,0,SEEK_END);
  fposition= ftell(file);
  off_t index= fposition;
  off_t end= fposition;
  long countlines= 0;
  char cbyte;

  for(index; index >= 0; index --)
  {
    cbyte= fgetc(file);
    if (cbyte == '\n' && (end - index) > 1)
    {
      countlines ++;
      if(countlines == nlines)
      {
    break;
      }
     }
    fposition--;
    fseek(file,fposition,SEEK_SET);
  }
  printLine(file, fposition);
  fclose(file);
}

int main(int argc, char *argv[])
{
  FILE *file;
  file= openFile(argv[2]);
  walkFile(file, atol(argv[1]));
  return 0;
}

/*Note: take in mind that i not wrote code to parse input options and arguments, neither code to check if the lines number argument is really a number.*/

#1


14  

I don't think there are solutions different than "keep the latest N lines while reading forward the data" or "start from the end and go backwards until you read the Nth line".

我不认为有什么解决方案与“在读取数据时保持最新的N行”或“从末尾开始,一直向后直到读取第N行”不同。

The point is that you'd use one or the another based on the context.

重点是您将根据上下文使用其中一个或另一个。

The "go to the end and go backwards" is better when tail accesses a random access file, or when the data is small enough to be put on memory. In this case the runtime is minimized, since you scan the data that has to be outputted (so, it's "optimal")

当tail访问随机访问文件,或者当数据足够小时,“go to The end and go backwards”会更好。在这种情况下,运行时被最小化,因为您扫描了必须输出的数据(因此,这是“最优”)

Your solution (keep the N latest lines) is better when tail is fed with a pipeline or when the data is huge. In this case, the other solution wastes too much memory, so it is not practical and, in the case the source is slower than tail (which is probable) scanning all the file doesn't matter that much.

当尾部使用管道或当数据量很大时,您的解决方案(保持最新的N行)会更好。在这种情况下,其他的解决方案浪费了太多的内存,所以它是不实际的,而且,如果源的速度比tail(可能的)慢,那么扫描所有的文件就不那么重要了。

#2


6  

Read backwards from the end of the file until N linebreaks are read or the beginning of the file is reached.

从文件的末尾向后读取,直到读取N个换行符或到达文件的开头。

Then print what was just read.

然后打印刚刚阅读的内容。

I dont think any fancy datastructures are needed here.

我认为这里不需要任何花哨的数据结构。

Here is the source code of tail if you're interested.

如果你感兴趣的话,这是tail的源代码。

#3


5  

First use fseek to find the end-of-file then subtract 512 and fseek to that offset, then read forward from there to end. Count the number of line-breaks because if there are too few you will have to do the same with a subtracted offset of 1024 ... but in 99% of cases 512 will be enough.

首先使用fseek查找文件的末尾,然后减去512,然后fseek到该偏移量,然后从那里读取到结束。计算断行的数量,因为如果断行的数量太少,你就必须用减去1024的偏移量来计算。但在99%的情况下512就足够了。

This (1) avoids reading the whole file forward and (2) the reason why this is probably more efficient than reading backwards from the end is that reading forward is typically faster.

这(1)避免向前读取整个文件,(2)这可能比从末尾向后读取更有效的原因是向前读取通常更快。

#4


0  

/*This example implements the option n of tail command.*/

/*此示例实现tail命令的选项n。*/

#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
#include <getopt.h>

#define BUFF_SIZE 4096

FILE *openFile(const char *filePath)
{
  FILE *file;
  file= fopen(filePath, "r");
  if(file == NULL)
  {
    fprintf(stderr,"Error opening file: %s\n",filePath);
    exit(errno);
  }
  return(file);
}

void printLine(FILE *file, off_t startline)
{
  int fd;
  fd= fileno(file);
  int nread;
  char buffer[BUFF_SIZE];
  lseek(fd,(startline + 1),SEEK_SET);
  while((nread= read(fd,buffer,BUFF_SIZE)) > 0)
  {
    write(STDOUT_FILENO, buffer, nread);
  }
}

void walkFile(FILE *file, long nlines)
{
  off_t fposition;
  fseek(file,0,SEEK_END);
  fposition= ftell(file);
  off_t index= fposition;
  off_t end= fposition;
  long countlines= 0;
  char cbyte;

  for(index; index >= 0; index --)
  {
    cbyte= fgetc(file);
    if (cbyte == '\n' && (end - index) > 1)
    {
      countlines ++;
      if(countlines == nlines)
      {
    break;
      }
     }
    fposition--;
    fseek(file,fposition,SEEK_SET);
  }
  printLine(file, fposition);
  fclose(file);
}

int main(int argc, char *argv[])
{
  FILE *file;
  file= openFile(argv[2]);
  walkFile(file, atol(argv[1]));
  return 0;
}

/*Note: take in mind that i not wrote code to parse input options and arguments, neither code to check if the lines number argument is really a number.*/