IMDB 5000 Movie Dataset

Date: 2022-06-13 04:39:13

Description

Background

How can we tell how good a movie is before it is released in cinemas?

This question puzzled me for a long time, since there is no universal way to judge the quality of a movie. Many people rely on critics to gauge the quality of a film, while others use their instincts. But it takes time to accumulate a reasonable number of critic reviews after a movie is released, and human instinct is sometimes unreliable.

Questions

  1. Given that thousands of movies are produced each year, is there a better way to tell how good a movie is without relying on critics or our own instincts?

  2. Does the number of human faces in a movie poster correlate with the movie's rating?

Method

To answer these questions, I scraped 5000+ movies from the IMDB website using a Python library called “scrapy”.

The scraping process took 2 hours to finish. In the end, I obtained all 28 variables I needed for 5043 movies, along with 4906 posters (998MB), spanning 100 years and 66 countries. There are 2399 unique director names and thousands of actors and actresses. Below are the 28 variables:

“movie_title”: movie title
“color”: color
“num_critic_for_reviews”: number of critic reviews
“movie_facebook_likes”: Facebook likes for the movie
“duration”: running time
“director_name”: director's name
“director_facebook_likes”: Facebook likes for the director
“actor_3_name”: name of actor 3
“actor_3_facebook_likes”: Facebook likes for actor 3
“actor_2_name”: name of actor 2
“actor_2_facebook_likes”: Facebook likes for actor 2
“actor_1_name”: name of actor 1
“actor_1_facebook_likes”: Facebook likes for actor 1
“gross”: box office gross
“genres”: genres
“num_voted_users”: number of users who voted
“cast_total_facebook_likes”: total Facebook likes for the cast
“facenumber_in_poster”: number of faces in the poster
“plot_keywords”: plot keywords
“movie_imdb_link”: IMDB link for the movie
“num_user_for_reviews”: number of user reviews
“language”: language
“country”: country
“content_rating”: content rating
“budget”: budget
“title_year”: release year
“imdb_score”: IMDB score
“aspect_ratio”: aspect ratio
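As a rough illustration of how these variables might be read and compared, the sketch below parses a few rows in the dataset's column layout and computes a Pearson correlation between “facenumber_in_poster” and “imdb_score”. The sample rows are made up for illustration and are not taken from the dataset; the helper names are hypothetical, and only the standard library is used.

```python
import csv
import io
import math

# Made-up sample rows in the dataset's column layout (illustrative values only).
SAMPLE_CSV = """movie_title,facenumber_in_poster,imdb_score
Movie A,0,7.9
Movie B,1,7.1
Movie C,4,6.2
Movie D,,6.8
Movie E,2,6.6
"""


def read_numeric_pairs(handle, x_col, y_col):
    """Return (x, y) float pairs, skipping rows where either value is blank."""
    pairs = []
    for row in csv.DictReader(handle):
        x_raw = (row.get(x_col) or "").strip()
        y_raw = (row.get(y_col) or "").strip()
        if x_raw and y_raw:
            pairs.append((float(x_raw), float(y_raw)))
    return pairs


def pearson(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant column has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


pairs = read_numeric_pairs(
    io.StringIO(SAMPLE_CSV), "facenumber_in_poster", "imdb_score"
)
r = pearson(pairs)
print(f"rows used: {len(pairs)}, correlation: {r:.3f}")
```

For the real file, the `io.StringIO` object would be replaced by an opened CSV file. A correlation on a handful of invented rows says nothing about the actual dataset.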

To answer question 2, I ran a human face detection algorithm on all the posters using a Python library called dlib, and extracted the number of faces in each poster.
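A minimal sketch of what such a face count could look like is shown below. The text above only says that dlib was used, so the choice of dlib's frontal face detector is an assumption, and the helper names are hypothetical. The detector and image loader are passed in as arguments so the counting logic can be exercised without dlib installed.

```python
def count_faces(image, detector, upsample=1):
    """Count the detections a dlib-style detector returns for one image.

    `detector` is any callable taking (image, upsample) and returning a
    sequence of detections, which is how dlib's frontal face detector is called.
    """
    return len(detector(image, upsample))


def count_faces_in_files(paths, load_image, detector):
    """Map each poster path to its detected face count."""
    return {path: count_faces(load_image(path), detector) for path in paths}


def load_dlib_tools():
    """Import dlib lazily and return (image loader, frontal face detector).

    Assumption: the frontal face detector is used here for illustration;
    the original project may have configured dlib differently.
    """
    import dlib  # third-party; only needed when this function is called

    return dlib.load_rgb_image, dlib.get_frontal_face_detector()
```

With dlib installed, `load_image, detector = load_dlib_tools()` followed by `count_faces_in_files(["poster.jpg"], load_image, detector)` would return a dictionary of counts, where `poster.jpg` is a placeholder path.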

Blog and GitHub code

See here for more details about the scraping steps, the EDA, and the predictions: https://blog.nycdatascience.com/student-works/machine-learning/movie-rating-prediction/

GitHub page: https://github.com/sundeepblue/movie_rating_prediction

Important notes

  1. This dataset is by no means a comprehensive scrape of all attributes relating to movies. It stemmed from one of my projects, built from scratch and finished in around one week. So please do not be surprised if you find that something is off.

  2. This dataset is a proof of concept. It can be used for experimental and learning purposes, to get your hands dirty with web scraping, basic EDA, and learning algorithms in R or Python. For comprehensive movie analysis and accurate movie rating prediction, 28 attributes from 5000 movies might not be enough. A decent dataset could contain hundreds of attributes from 50K or more movies, and would require a great deal of feature engineering.

  3. There are around 800 “0”s in the “gross” attribute. Each was caused either by (a) no gross number being found on the movie's page, or (b) the scrapy HTTP request returning nothing within a short period of time. So please use your own judgement when analyzing this attribute.

  4. There are around 908 directors whose “director_facebook_likes” attribute is 0. If somebody analyzes the “director_facebook_likes” attribute, the results could be off; for example, a top 10 or top 50 list of directors could be inaccurate. Thanks to user Kryslor for pointing this out. This is interesting, since the code I used to scrape everybody's Facebook likes was identical. See function parse_facebook_likes_number(). It was hard to scrape this data directly from the IMDB website (due to a dynamically embedded div frame), so I had to use a hacky approach and send requests directly to the Facebook website (see line 38 of this file). Perhaps for some directors, Facebook did not respond with a reasonable result within a short timespan (< 0.25 second) and the request returned “None” in Python (translated to 0 in my code).

  5. For those 0s, you might want to treat them as “missing values” when using certain machine learning algorithms.

  6. Thanks to user “Quinton”, who found a bug in the dataset on 11/23/2016:
    (November 23, 2016 at 12:08 am) We actually used your IMDB dataset for an Advanced Data Mining class at Rockhurst University in Kansas City, MO. We love the data set and we really appreciate the time it took to create it. However, we believe we found a small flaw in the data. Not all of the IMDB movie budget numbers are in US dollars, for example, the South Korean movie “The Host” has its budget numbers in S. Korean Won (Korean currency). But there is no data in the dataset that tells you the currency. The existence of foreign currencies skews the budget data for foreign films, particularly for currencies with extreme exchange rates when compared to USD. For instance, many could assume the data set shows “The Host” cost $12 billion to make when it truthfully cost only 12 billion Won, but the dataset doesn’t make the distinction. It is not just an issue with Korean movies; we found Turkish and Japanese movies with the same issue.
    Quinton was right. When I parsed the currency, I didn’t take the Korean currency into consideration. Therefore, please be cautious if you analyze the currency-related attributes for movies whose figures are not in US dollars. The fix is actually quite simple in the corresponding Python code.

  7. Please be mindful that analyzing currency-related attributes, such as “gross” or “budget”, is actually more complicated than it seems. For a really thorough and accurate analysis (EDA or prediction), we may want to do some feature engineering on those attributes in a systematic way. For example, one US dollar in 1920 is worth something different from one US dollar in 2010. So we need to take inflation across years into consideration and normalize all US dollar amounts to one basis (a certain year). The same goes for all other currencies (British pound, Chinese RMB Yuan, etc.). If you also consider the exchange rate between two different currencies and want to convert everything into dollars, things become trickier, because those rates also vary over time: $1 equaled about RMB8.4 in 2000 but about RMB6.8 in 2015.
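The notes above on zeros and inflation can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to build the dataset: the helper names are hypothetical, the price index values are placeholders rather than real CPI figures, and the sample rows are invented. It only handles amounts already known to be in US dollars, so it does not address the mixed-currency problem in “budget”.

```python
# Placeholder price index by year. These are NOT real CPI figures;
# substitute an official inflation series before drawing conclusions.
PRICE_INDEX = {2000: 100.0, 2010: 125.0, 2015: 140.0}


def zero_to_missing(value):
    """Treat a recorded 0 as unknown (None) rather than as a real amount."""
    return None if value is None or value == 0 else value


def to_base_year_dollars(amount, year, base_year, index=PRICE_INDEX):
    """Rescale a USD amount from `year` prices to `base_year` prices.

    Returns None when the amount is missing or a year is not in the index.
    """
    if amount is None or year not in index or base_year not in index:
        return None
    return amount * index[base_year] / index[year]


# Invented rows using the dataset's column names.
movies = [
    {"movie_title": "Movie A", "title_year": 2000, "gross": 50_000_000},
    {"movie_title": "Movie B", "title_year": 2010, "gross": 0},
    {"movie_title": "Movie C", "title_year": 2015, "gross": 70_000_000},
]

for movie in movies:
    movie["gross"] = zero_to_missing(movie["gross"])
    movie["gross_base"] = to_base_year_dollars(
        movie["gross"], movie["title_year"], base_year=2015
    )
    print(movie["movie_title"], movie["gross"], movie["gross_base"])
```

Keeping missing values as None, instead of 0, prevents them from dragging down averages or being ranked as the lowest grossing movies.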