The Project
This project is part of the Udacity Data Analyst Nanodegree. In this section of the course we perform our own analysis on a dataset chosen from a prescribed list. I chose Movie Releases, a processed version of this dataset on Kaggle: https://www.kaggle.com/tmdb/tmdb-movie-metadata/data
The project brings together several concepts taught over the course and applies them to the dataset, so that we can analyse several of its attributes and answer questions that we pose of the data ourselves.
I downloaded the data from the link above and imported it into Python, using a Jupyter Notebook to create the required report. A notebook lets us document and code in the same place, which is great for presenting findings and visualisations from the data.
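A minimal sketch of the import step, assuming pandas is available in the notebook. The column names, the three sample rows and the file name `tmdb-movies.csv` mentioned in the comment are placeholders for illustration, not the real data; an in-memory CSV stands in for the download so the cell runs on its own.

```python
import io

import pandas as pd

# Tiny stand-in for the downloaded file; rows and values are made up.
sample_csv = io.StringIO(
    "original_title,release_year,budget,revenue\n"
    "Film A,2010,100,250\n"
    "Film B,2011,0,40\n"
    "Film C,2011,80,\n"
)

# With the real download this would be pd.read_csv("tmdb-movies.csv"),
# where the file name is whatever the CSV was saved as locally (assumed).
df = pd.read_csv(sample_csv)

print(df.shape)         # number of rows and columns
print(df.dtypes)        # data type of each column
print(df.isna().sum())  # count of missing values per column
```

Checking the shape, dtypes and missing values first gives a quick picture of how much wrangling the data will need.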
I structured the project similarly to the CRISP-DM method. That is, I (i) stated the objectives, (ii) decided what questions to ask of the data, (iii) carried out tasks to understand the data, (iv) performed data wrangling and exploratory data analysis, and then drew conclusions and answered the questions posed.
The PDF report written to communicate my project and findings can also be found here
What We Learned
- Using hist() and plot() to build histogram visualisations
- Using plotting.scatter_matrix and plot() to build scatter plot visualisations
- Changing the figsize of a chart to a more readable format, and adding a ‘;’ to the end of the line to remove unwanted text
- Renaming data frame columns in pandas
- Using groupby() and query() in pandas to aggregate and group selections of data
- Creating line charts, bar charts and heatmaps in matplotlib, and utilising Seaborn to add better visuals and formatting, such as appropriate labels, titles and colours
- Using lambda functions to wrangle data formats
- Structuring a report in a way that is readable and informative, taking the reader through conclusions drawn
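A few of the pandas techniques above (renaming columns, a lambda for wrangling, then query and groupby) can be sketched as follows. This is an illustration on made-up rows, not the code from my notebook; the column names and values are assumptions.

```python
import pandas as pd

# Made-up rows standing in for the movie data; column names are assumptions.
movies = pd.DataFrame({
    "original_title": ["Film A", "Film B", "Film C", "Film D"],
    "release_year": [2010, 2010, 2011, 2011],
    "genres": ["Action|Drama", "Comedy", "Drama|Romance", "Action"],
    "revenue": [250.0, 40.0, 120.0, 300.0],
})

# Rename a column to something shorter.
movies = movies.rename(columns={"original_title": "title"})

# Use a lambda to keep only the first genre of a pipe-separated string.
movies["main_genre"] = movies["genres"].apply(lambda g: g.split("|")[0])

# query() selects rows; groupby() then aggregates them per release year.
mean_revenue = (
    movies.query("revenue > 100")
    .groupby("release_year")["revenue"]
    .mean()
)
print(mean_revenue)
```

Chaining query() before groupby() keeps the filter and the aggregation in one readable expression.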
Interesting Snippets
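As an illustration of the plotting techniques listed under What We Learned, here is a minimal sketch with made-up numbers. It assumes pandas and matplotlib are installed; the column names and values are hypothetical and this is not the exact code from my notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix

# Made-up numbers for illustration only.
movies = pd.DataFrame({
    "budget": [10, 25, 40, 60, 90, 120],
    "revenue": [15, 20, 80, 95, 200, 310],
    "vote_average": [5.1, 6.0, 6.4, 7.2, 6.8, 7.9],
})

# Histogram on a larger figure, with a title and axis label added.
fig, ax = plt.subplots(figsize=(8, 5))
movies["revenue"].plot(kind="hist", bins=5, ax=ax)
ax.set_title("Revenue distribution")
ax.set_xlabel("Revenue")

# Scatter-plot matrix of every numeric column against every other.
# As the last line of a notebook cell, the trailing ';' stops Jupyter
# from echoing the returned array of axes as text above the chart.
scatter_matrix(movies, figsize=(8, 8));
```

Passing figsize up front is usually simpler than resizing a chart after it has been drawn.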




The Code and the Report
- GitHub repository for the data, SQL, PDF report and Jupyter Notebook
- The PDF report can also be found here
References
- TMDb movie data (cleaned from Kaggle, by Udacity) (https://d17h27t6h515a5.cloudfront.net/topher/2017/October/59dd1c4c_tmdbmovies/tmdbmovies.csv)
- Udacity Data Analyst Nanodegree (https://eu.udacity.com/course/data-analyst-nanodegree--nd002)
- Investigate TMDb Movie Dataset (Python Data Analysis Project) by Lorna Yen (https://medium.com/@onpillow/01-investigate-tmdb-movie-dataset-python-data-analysis-projectpart-1-data-wrangling-3d2b55ea7714)
- Investigate a dataset (https://praxitelisk.github.io/DAND-P1-Investigate-a-Dataset/Investigate_a_Dataset.html)
- Title Image (http://pluspng.com/)