Selcuk Gulcan

GroupLens recently published the Movielens 25M dataset, the successor to the Movielens 20M dataset, which is heavily used in the data science field. In this post, I'll do a quick analysis of the new dataset and compare it with Movielens 20M.

Each dataset comes in several files. I'm only going to use the following ones:

movies.csv contains the movie id, title, and genre information.

movieId	title	genres
1	Toy Story (1995)	Adventure
2	Jumanji (1995)	Adventure
3	Grumpier Old Men (1995)	Comedy
4	Waiting to Exhale (1995)	Comedy
5	Father of the Bride Part II (1995)	Comedy

ratings.csv is the main data file I'm interested in; it contains the rating matrix. The rating matrix is a big matrix where rows represent users, columns represent movies, and each value is the rating, between 0.5 and 5, that a particular user gave to a particular movie. The datasets are named after this file: Movielens 20M means that the rating matrix has 20 million ratings. This is what the file looks like:

userId	movieId	rating
1	296	5.0
1	306	3.5
1	307	5.0
1	665	5.0
1	899	3.5
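Since the matrix is mostly empty, it is usually stored sparsely rather than as a full users-by-movies grid. Here is a minimal sketch of that idea using only the Python standard library. The function name `load_rating_matrix` is made up for illustration, and the sketch assumes the file is comma-separated with the header row shown above; it is not necessarily how the stats in this post were produced.

```python
import csv
import io
from collections import defaultdict


def load_rating_matrix(csv_file):
    """Read rating rows into a sparse {userId: {movieId: rating}} mapping.

    Only the userId, movieId and rating columns are used; any extra
    columns in the file are ignored.
    """
    matrix = defaultdict(dict)
    for row in csv.DictReader(csv_file):
        matrix[int(row["userId"])][int(row["movieId"])] = float(row["rating"])
    return dict(matrix)


# The five sample rows from the excerpt above, written as CSV text.
sample = io.StringIO(
    "userId,movieId,rating\n"
    "1,296,5.0\n"
    "1,306,3.5\n"
    "1,307,5.0\n"
    "1,665,5.0\n"
    "1,899,3.5\n"
)
matrix = load_rating_matrix(sample)
print(matrix[1][296])  # rating that user 1 gave to movie 296
```

For the real file, the same function could be called on an opened file object, for example `open("ratings.csv", newline="")`, assuming the file sits in the working directory.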

tags.csv has user-generated tags. I'll use this file to create a tag cloud.
links.csv is a mapping file that pairs Movielens identifiers with their corresponding IMDb identifiers. I will use it to get posters of the top-rated movies.

The code I've used to get all the following stats and plots is located here.

Basic Statistics Comparison

-	Movielens 20M	Movielens 25M
Rating count	20000263	25000095
User count	138493	162541
Movie count	26744	59047
Density of the Matrix	0.00540	0.00260
Max # movies rated by a user	9254	32202
Min # movies rated by a user	20	20
Average # movies rated by a user	144.41	153.81
Max # users rated a movie	67310	81491
Min # users rated a movie	1	1
Average # users rated a movie	747.84	423.39
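The rows of this table can all be derived from per-user and per-movie rating counts. The sketch below shows one way to do that in plain Python. The function `basic_stats` and its key names are hypothetical. It assumes each (user, movie) pair appears at most once, and it counts only movies that have at least one rating, which seems consistent with the minimum of 1 in the table.

```python
from collections import Counter


def basic_stats(pairs):
    """Summarise an iterable of (userId, movieId) pairs, one per rating."""
    per_user = Counter()   # number of movies rated by each user
    per_movie = Counter()  # number of users who rated each movie
    n_ratings = 0
    for user_id, movie_id in pairs:
        per_user[user_id] += 1
        per_movie[movie_id] += 1
        n_ratings += 1

    n_users, n_movies = len(per_user), len(per_movie)
    return {
        "rating_count": n_ratings,
        "user_count": n_users,
        "movie_count": n_movies,
        # Share of the users-by-movies matrix that is actually filled in.
        "density": n_ratings / (n_users * n_movies),
        "max_per_user": max(per_user.values()),
        "min_per_user": min(per_user.values()),
        "avg_per_user": n_ratings / n_users,
        "max_per_movie": max(per_movie.values()),
        "min_per_movie": min(per_movie.values()),
        "avg_per_movie": n_ratings / n_movies,
    }


# Toy data: 3 users, 3 movies, 6 ratings.
toy_pairs = [(1, 10), (1, 20), (1, 30), (2, 10), (3, 10), (3, 20)]
stats = basic_stats(toy_pairs)
print(stats)
```

Note that the averages are simply the rating count divided by the user count or the movie count, which is why the average number of users per movie drops in the 25M dataset even though there are more ratings.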

According to the data, there is someone on Earth who watched 32202 movies and rated them on the Movielens platform. That is probably not true, considering that watching all of those movies would take around 6-7 years of non-stop viewing, but that is what the data says, so I'm not the one to judge.
Users with fewer than 20 ratings are filtered out, so they do not appear in the matrix. The README file confirms this: "Users were selected at random for inclusion. All selected users had rated at least 20 movies."
Another thing to note is that the 25M dataset is a lot sparser because of the increased number of movies: the new dataset has 25% more ratings but more than twice as many movies.
The most-rated movie is Forrest Gump, with over 80k ratings.
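The arithmetic behind these remarks is easy to check from the table. In the sketch below, the counts are copied from the table, while the average runtime of 1.75 hours per movie is purely an assumption for illustration and does not come from the dataset.

```python
# Figures copied from the comparison table above.
ratings_20m, ratings_25m = 20_000_263, 25_000_095
movies_20m, movies_25m = 26_744, 59_047
users_25m = 162_541
max_rated_by_one_user = 32_202

rating_growth = ratings_25m / ratings_20m - 1  # fractional increase in ratings
movie_ratio = movies_25m / movies_20m          # how many times more movies
density_25m = ratings_25m / (users_25m * movies_25m)

# ASSUMPTION: average runtime per movie, in hours (not taken from the data).
ASSUMED_HOURS_PER_MOVIE = 1.75
years_nonstop = max_rated_by_one_user * ASSUMED_HOURS_PER_MOVIE / (24 * 365)

print(f"{rating_growth:.1%} more ratings, {movie_ratio:.2f}x the movies")
print(f"density of the 25M matrix: {density_25m:.5f}")
print(f"{years_nonstop:.1f} years of non-stop viewing")
```

With a different assumed runtime the number of years changes proportionally, so the 6-7 year figure should be read as an order of magnitude rather than a precise value.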

Rating Distributions

Rating histograms show how high and low scores are distributed among movies and users.

User rating distribution

It looks almost like a normal distribution. Things are different on the movies' side, however.

Movie rating distribution
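One plausible way to build such histograms is to average the ratings per user (or per movie) and then count how many averages fall into each bin. The sketch below follows that reading. The helper names `mean_rating_by` and `histogram` and the 0.5 bin width are assumptions for illustration, and the plots in this post may have been produced differently.

```python
from collections import Counter, defaultdict


def mean_rating_by(ratings, key_index):
    """Mean rating per user (key_index=0) or per movie (key_index=1).

    `ratings` is an iterable of (userId, movieId, rating) tuples.
    """
    totals = defaultdict(float)
    counts = Counter()
    for row in ratings:
        key, rating = row[key_index], row[2]
        totals[key] += rating
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


def histogram(values, bin_width=0.5):
    """Count values per bin; each bin is labelled by its lower edge."""
    bins = Counter()
    for value in values:
        lower_edge = (value // bin_width) * bin_width
        bins[lower_edge] += 1
    return dict(sorted(bins.items()))


# Toy (userId, movieId, rating) rows.
toy_ratings = [
    (1, 10, 5.0), (1, 20, 4.0),
    (2, 10, 3.0), (2, 20, 2.0),
    (3, 10, 4.0),
]
user_hist = histogram(mean_rating_by(toy_ratings, 0).values())
movie_hist = histogram(mean_rating_by(toy_ratings, 1).values())
print(user_hist)
print(movie_hist)
```

The resulting bin counts can then be handed to any plotting library to draw the bars.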

Movielens top 20

To build this list, I only consider movies with more than 2000 ratings. Otherwise, movies with a single 5-star rating would end up at the top. The movies are listed below with their average rating:
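As a rough sketch of that filtering step, the function below averages the ratings per movie, drops movies that do not have more than a given number of ratings, and sorts the rest by average. The name `top_movies` and the toy data are made up for illustration; on the real data the threshold would be 2000 and `n` would be 20.

```python
from collections import defaultdict


def top_movies(ratings, min_ratings, n):
    """Rank movies by mean rating, keeping only those with more than
    `min_ratings` ratings, and return the best `n` as (movie, mean) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _user, movie, rating in ratings:
        totals[movie] += rating
        counts[movie] += 1
    means = {
        movie: totals[movie] / counts[movie]
        for movie in totals
        if counts[movie] > min_ratings
    }
    return sorted(means.items(), key=lambda item: item[1], reverse=True)[:n]


# Toy (userId, movie, rating) rows: movie "A" has a single 5-star rating.
toy = [
    (1, "A", 5.0),
    (1, "B", 4.0), (2, "B", 4.5), (3, "B", 4.0),
    (1, "C", 3.0), (2, "C", 3.5), (3, "C", 2.5),
]
filtered = top_movies(toy, min_ratings=2, n=2)
unfiltered = top_movies(toy, min_ratings=0, n=2)
print(filtered)    # "A" is excluded by the threshold
print(unfiltered)  # without a threshold, "A" comes first
```
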

20 - Pulp Fiction (1994)

Score: 4.189