Organizing the nightly IMDB dumps into structured data

Date: 2021-01-31 14:53:12

I'm currently trying to write a website, for testing and learning purposes, that wraps the IMDB datasets published as nightly dumps.

I'm having trouble determining the best way to extract the data into a format that is easier to manage. I will need to pull data from several files:

  • movies.list = List of all movies and their year of production
  • mpaa-ratings-reasons.list = MPAA ratings
  • running-times.list = Running times

The data in these tables are linked by a unique name that is given to each line. Essentially, I will need to join the lines of each of these text files together using the unique name. After doing this, I will need to parse the data I need out of the actual unique name since the movie title isn't listed explicitly. The unique name also specifies if the entry is a video game or TV show, which I will not be collecting data for.
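To make the parsing step concrete, here is a minimal sketch in Python (chosen for brevity; the same pattern should port to .NET's Regex class). It assumes unique names look like `Title (1979)` or `Title (2004/I)`, that TV entries are wrapped in double quotes, and that video games carry a `(VG)` tag. Those conventions are assumptions and should be checked against the actual files. `NAME_RE` and `parse_unique_name` are hypothetical names used only for illustration.

```python
import re

# Assumed shape: Title (YYYY) or Title (YYYY/II), optionally followed by
# upper-case tags such as (VG). Unknown years are assumed to appear as (????).
NAME_RE = re.compile(
    r'^(?P<title>.+?) \((?P<year>\d{4}|\?{4})(?:/[IVXLC]+)?\)'
    r'(?P<tags>(?: \([A-Z]+\))*)$'
)

def parse_unique_name(name):
    """Return (title, year) for a film, or None for entries to skip."""
    name = name.strip()
    if name.startswith('"'):
        return None  # assumed marker for TV shows
    match = NAME_RE.match(name)
    if match is None:
        return None  # does not fit the assumed shape
    if "(VG)" in match.group("tags"):
        return None  # assumed marker for video games
    year = match.group("year")
    return match.group("title"), (int(year) if year.isdigit() else None)
```

For example, `parse_unique_name("Alien (1979)")` gives `("Alien", 1979)`, while a quoted name or one tagged `(VG)` gives `None`.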

Pulling the data from those unique name qualifiers is most likely going to be a Regex nightmare, but I'm more concerned with what the best method is for actually grouping the text files into a manageable format somewhere... Should I...

  1. Pull the data into staging tables on the SQL server, and then write a separate part in my app to join the tables and pull everything together?

  2. Load the lines from the text files into a .NET data table and do my processing that way?
    1. In doing so, am I going to cause a memory nightmare for the box that is running this app?

  3. Some other alternative?

On a side note, the movies.list file alone contains over 1 million lines of data.

Thanks in advance for your help.

Chris

1 Answer

#1


Load the files into staging tables on the DB server, then scrub the data into final tables.
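As a rough illustration of the staging idea, the sketch below uses Python's built-in `sqlite3` module as a stand-in for SQL Server. Everything in it is an assumption made for the example: the table names (`stg_movies`, `stg_running_times`, `movie`), the function name `load_and_scrub`, and the filters that treat quoted names as TV shows and `(VG)` as video games. Raw values land in the staging tables as text, and a single `INSERT ... SELECT` joins them on the unique name and casts them into the final table.

```python
import sqlite3

def load_and_scrub(movies, running_times):
    """Stage raw (unique_name, value) pairs, then scrub into a final table."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE stg_movies (unique_name TEXT, year TEXT);
        CREATE TABLE stg_running_times (unique_name TEXT, minutes TEXT);
        CREATE TABLE movie (
            unique_name TEXT PRIMARY KEY,
            year INTEGER,
            minutes INTEGER
        );
    """)
    # Step 1: bulk-load the raw text untouched.
    db.executemany("INSERT INTO stg_movies VALUES (?, ?)", movies)
    db.executemany("INSERT INTO stg_running_times VALUES (?, ?)", running_times)
    # Step 2: join on the unique name, drop unwanted entries, cast types.
    db.execute("""
        INSERT INTO movie (unique_name, year, minutes)
        SELECT m.unique_name,
               MAX(CAST(m.year AS INTEGER)),
               MAX(CAST(r.minutes AS INTEGER))
        FROM stg_movies AS m
        LEFT JOIN stg_running_times AS r ON r.unique_name = m.unique_name
        WHERE m.unique_name NOT LIKE '"%'       -- assumed TV marker
          AND m.unique_name NOT LIKE '%(VG)%'   -- assumed video game marker
        GROUP BY m.unique_name
    """)
    db.commit()
    return db
```

The `LEFT JOIN` keeps movies that have no running time (their `minutes` column is NULL), which is one possible choice; an inner join would drop them instead.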

If this means loading back into a client app for the processing, so be it.
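If the processing does end up in a client app, it need not hold every file in memory at once. The sketch below (in Python, with hypothetical helper names `read_pairs` and `join_by_name`) streams one file line by line and looks up each unique name in a dictionary built from the other. It assumes each data line is the unique name, one or more tabs, then the value, and that lines without a tab are headers or blanks; that layout is an assumption to verify against the real files.

```python
def read_pairs(lines):
    """Yield (unique_name, value) pairs lazily, one line at a time."""
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # assumed: header text and blank lines have no tab
        name, _, rest = line.partition("\t")
        yield name, rest.strip()

def join_by_name(movie_lines, runtime_lines):
    """Stream the movie lines, looking up each name in the runtime file."""
    runtimes = dict(read_pairs(runtime_lines))  # only this file is held in memory
    for name, year in read_pairs(movie_lines):
        yield name, year, runtimes.get(name)
```

Both arguments can be open file objects, so `join_by_name(open("movies.list"), open("running-times.list"))` would read the large file one line at a time, keeping only the lookup dictionary resident.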

In practice, a DB server will handle that quantity of data, but SQL Server may not be the best tool for the processing itself.
