
Project: Data Analysis of Movie Releases in Python

The Project

This project is part of the Udacity Data Analysis Nanodegree. In this section of the course we perform our own analysis on a dataset chosen from a prescribed list. I chose Movie Releases, a processed version of this dataset on Kaggle: https://www.kaggle.com/tmdb/tmdb-movie-metadata/data

The project brings together several concepts taught over the duration of the course and applies them to the dataset, so that we can analyse several of its attributes and answer the questions we pose of the data ourselves.

I downloaded the data from the above link and imported it into Python, using a Jupyter Notebook to create the required report. A notebook lets us document and code in the same place, which is great for presenting findings and visualisations from the data.
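As a rough illustration of the import step, here is a minimal sketch of reading a CSV into a Pandas data frame and taking a first look at it. The column names and values are made up for the example, and the in-memory text stands in for the downloaded file; it is not the notebook's actual code.

```python
import io

import pandas as pd

# Stand-in for the downloaded file. The columns and rows here are invented
# for illustration and are not taken from the real dataset.
csv_text = io.StringIO(
    "original_title,budget,revenue,release_year\n"
    "Film A,100,250,2010\n"
    "Film B,50,40,2012\n"
)

# With a real download, pass the path of the CSV file instead of csv_text.
movies = pd.read_csv(csv_text)

# First checks: how many rows and columns, and what type each column has.
print(movies.shape)
print(movies.dtypes)
print(movies.head())
```

In a notebook, ending a cell with `movies.head()` rather than `print(...)` renders the first rows as a formatted table.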

I structured the project along the lines of the CRISP-DM method. That is, I (i) stated the objectives, (ii) decided what questions to ask of the data, (iii) carried out tasks to understand the data, (iv) performed data wrangling and exploratory data analysis, and then drew conclusions and answered the questions posed.

The PDF report written to communicate my project and findings can also be found here

What We Learned

  • Using hist() and plot() to build histogram visualisations
  • Using plotting.scatter_matrix and plot() to build scatter plot visualisations
  • Changing the figsize of a chart to a more readable format, and adding a ‘;’ to the end of the line to suppress unwanted text output
  • Renaming data frame columns in Pandas
  • Using groupby and query in Pandas to aggregate and group selections of data
  • Creating line charts, bar charts and heatmaps in matplotlib, and utilising Seaborn to add better visuals and formatting such as appropriate labels, titles and colours
  • Using lambda functions to wrangle data formats
  • Structuring a report in a way that is readable and informative, taking the reader through conclusions drawn
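To make a few of these techniques concrete, here is a small sketch on a toy data frame. The column names, the pipe-separated genre format and all of the values are assumptions made for the example, not the notebook's actual code or data.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display

import pandas as pd

# Toy data; the columns and values are invented for illustration.
df = pd.DataFrame({
    "original_title": ["Film A", "Film B", "Film C", "Film D"],
    "genres": ["Action|Comedy", "Drama", "Action", "Comedy|Drama"],
    "budget": [100, 50, 200, 80],
    "revenue": [250, 40, 500, 160],
    "release_year": [2010, 2012, 2012, 2015],
})

# Renaming a data frame column.
df = df.rename(columns={"original_title": "title"})

# A lambda function to wrangle a format: pipe-separated string -> list of genres.
df["genre_list"] = df["genres"].apply(lambda s: s.split("|"))

# query() selects rows with a readable expression.
profitable = df.query("revenue > budget")

# groupby() aggregates a column within each group.
mean_revenue_by_year = df.groupby("release_year")["revenue"].mean()

# Histogram with a larger figsize. In a notebook the trailing ';' stops the
# text representation of the returned object being printed under the chart.
df["budget"].plot(kind="hist", figsize=(8, 5), title="Budget distribution");

# Scatter plots of every pair of numeric columns.
pd.plotting.scatter_matrix(df[["budget", "revenue"]], figsize=(6, 6));
```

The trailing `;` only makes a difference in a notebook cell; in a plain script it has no visible effect.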

Interesting Snippets

Average Budget versus Average Revenue of Genres
Average RoI (Revenue by Budget)
Two dimensional analysis of Genres over Time, judging the average Budget, Revenue and Ratings
Top 10 Directors by their Total Revenue
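Aggregates like the ones behind these charts could be computed along the following lines. This is a sketch under assumptions: the toy data, the column names, the pipe-separated genres and the definition of RoI as mean revenue divided by mean budget are all mine, and the report may define or compute them differently.

```python
import pandas as pd

# Toy data; the columns and values are invented for illustration.
df = pd.DataFrame({
    "director": ["Dir X", "Dir Y", "Dir X", "Dir Z"],
    "genres": ["Action|Comedy", "Drama", "Action", "Comedy|Drama"],
    "budget": [100, 50, 200, 80],
    "revenue": [250, 40, 500, 160],
})

# A film can belong to several genres, so give each (film, genre) pair its own row.
per_genre = df.assign(genre=df["genres"].str.split("|")).explode("genre")

# Average budget and average revenue for each genre.
genre_means = per_genre.groupby("genre")[["budget", "revenue"]].mean()

# RoI taken here as mean revenue divided by mean budget (an assumption);
# averaging each film's own revenue / budget ratio is another option.
genre_means["roi"] = genre_means["revenue"] / genre_means["budget"]

# Top 10 directors by total revenue.
top_directors = (
    df.groupby("director")["revenue"]
    .sum()
    .sort_values(ascending=False)
    .head(10)
)

print(genre_means)
print(top_directors)
```

From here, `genre_means[["budget", "revenue"]].plot(kind="bar")` would be one way to draw the budget versus revenue comparison.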

The Code and the Report

  • GitHub repository for the data, SQL, PDF report and Jupyter Notebook
  • The PDF report can also be found here
