blog, Coaching, data, data engineering, data science, portfolio, python

Project:- Who are the Goodest Doggos? Wrangling & Analysing WeRateDogs Tweets to Find the Goodest Floofs

The Project

This project focused on wrangling data from the WeRateDogs Twitter account using Python, documented across two Jupyter Notebooks: the wrangling in wrangle_act.ipynb and the subsequent analysis in act_analysis_notebook.ipynb.

This Twitter account rates dogs with humorous commentary. The rating denominator is almost always 10; the numerators, however, are usually greater than 10. WeRateDogs has over 4 million followers and has received international media coverage. Each day you can see a good doggo, lots of floofers and many puppers.

WeRateDogs downloaded their Twitter archive and sent it to Udacity via email exclusively for us to use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017. The archive is enhanced by a second dataset containing dog breed predictions for each of the tweets. Finally, we used the Twitter API to glean further basic information about each tweet, such as favourite and retweet counts.

Using this freshly cleaned WeRateDogs Twitter data, interesting and trustworthy analyses and visualizations can be created to communicate our findings.

The Python notebooks and PDF reports written to communicate my project and findings can also be found here

What Questions Are We Trying To Answer?

  • Q1. What Correlations can we find in the data that make a good doggo?
  • Q2. Which are more popular: doggos, puppers, floofers or puppos?
  • Q3. Which are the more popular doggo breeds and why is it Spaniels?

WeRateDogs @dog_rates

What Correlations can we find in the data that make a Good Doggo?

First, we wanted to determine whether there were any interesting relationships in the data. To do this we performed some correlation analysis and produced visuals to support it. Prior to the analysis, we assumed that Favourites and Retweets would be correlated, since both are ways to show your appreciation for a tweet on Twitter.

The output of our analysis is as follows:

This scatter plot matrix shows the relationships between each pair of variables. While there looks to be a strong linear relationship between Favourites and Retweets, no other relationships were highlighted.

With that in mind, we wanted to quantify these relationships to solidify our understanding.

Again, the above heat map shows our correlation relationships, with a strong relationship between Favourites and Retweets and a correlation coefficient of approximately r = 0.8.

Let’s narrow in on just that relationship.

With this chart we can see Favourites versus Retweets and their strong, positive linear relationship.

Observations

  • As we assumed, there is a strong linear relationship between Favourites and Retweets.
  • The correlation coefficient for this relationship is r = 0.797.
  • From the points we plotted, we cannot find any other correlations.
  • In future, we could encode the source and dog_stage columns as categories and investigate how they correlate with the popularity of a tweet.
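For reference, a correlation like this can be computed directly in pandas. A minimal sketch on synthetic data (the column names follow the Twitter API; the numbers and the 3x scaling are made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned archive: favourites roughly scale
# with retweets, plus noise (illustrative numbers only)
rng = np.random.default_rng(1)
retweets = rng.integers(100, 5000, 500)
favourites = retweets * 3 + rng.normal(0, 1500, 500)

df = pd.DataFrame({"retweet_count": retweets, "favorite_count": favourites})

# Pearson correlation coefficient, as quoted in the observations
r = df["retweet_count"].corr(df["favorite_count"])
```

Calling df.corr() on the full frame gives the matrix behind the heat map in one call.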

Which are more popular: doggos, puppers, floofers or puppos?

We performed some data wrangling on the tweet_archive dataset to consolidate the four different dog “stage” columns into a single column, which is easier to analyse.

These stages are fun terminologies used by WeRateDogs, so it would be really cool to see the popularity of the different types (dog_stage = [doggo, floofer, pupper, puppo])
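A sketch of that consolidation, assuming the archive's four stage columns hold either the stage name or the string "None" (as in the raw WeRateDogs archive; the rows here are toy data):

```python
import numpy as np
import pandas as pd

# Toy rows mimicking the archive's four stage columns
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "doggo":   ["doggo", "None", "None", "None"],
    "floofer": ["None", "floofer", "None", "None"],
    "pupper":  ["None", "None", "pupper", "None"],
    "puppo":   ["None", "None", "None", "puppo"],
})

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Collapse the four columns into one: take the first non-"None" value per row
archive["dog_stage"] = (
    archive[stage_cols]
    .replace("None", np.nan)
    .bfill(axis=1)      # pull the first stage value into the leftmost column
    .iloc[:, 0]
)
archive = archive.drop(columns=stage_cols)
```

Tweets tagged with two stages at once would need extra handling; this sketch keeps only the first.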

Can we ascertain which category of dog is more popular?

Observations

  • Interestingly, looking at Retweets and Favourites, Puppos are by far the most popular on average, with the highest numbers of favourites and retweets.
  • From the points we plotted, Puppers have the lowest numbers on average, though there are a lot of outliers.

Which are the more popular doggo breeds and why is it Spaniels?

Everyone loves doggos, but we all have a different favourite kind. With so many to choose from, which breed really is the goodest doggo and why is it Spaniels?

By integrating the image_prediction data into our dataset, we have three columns denoting the probability of the image being of a particular breed. This is some really interesting data, so let's use it to see if we can determine the popularity of certain breeds of doggos.
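A sketch of how that prediction data can drive the breed counts (the column names p1, p1_conf and p1_dog follow the image-prediction file; the rows are invented):

```python
import pandas as pd

# Toy image-prediction rows: p1 is the top predicted class,
# p1_conf its probability, p1_dog whether the class is a dog breed
preds = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "p1": ["golden_retriever", "seat_belt", "pug"],
    "p1_conf": [0.95, 0.40, 0.80],
    "p1_dog": [True, False, True],
})

# Keep confident, dog-classified predictions before counting breeds;
# this also drops non-dog values like "seat_belt"
dogs = preds.query("p1_dog and p1_conf >= 0.5")
top_breeds = dogs["p1"].value_counts()
```

Filtering on p1_dog and a confidence threshold addresses the non-dog values noted in the observations below.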

Observations

  • The most common types of dog here are Golden Retrievers and Labrador Retrievers, which seems sensible since these breeds are very common. Other breeds rounding out the top 5 are Chihuahuas, Pugs and Pembrokes.
  • We could also filter the predictions to ensure they meet a minimum probability threshold.
  • Some non-dog values like seat belt, hamster and bath towel still exist in the data, which we could clean given more time in future.
  • We only used the 1st prediction column; we might have used all 3 to determine the overall probability or popularity of dog breeds.
  • There must be some mistake, Spaniels were not even in the top 10!?

Observations and Conclusion

  • During our analysis, we found that there is a strong linear relationship between the number of Favourites and the number of Retweets of a given tweet. The correlation coefficient for this relationship is r = 0.797.
  • We anticipated this relationship, since if a user enjoys a tweet they can choose to Favourite or Retweet it – both are measures of the user’s enjoyment of the tweet.
  • We have also found through visualisation and data wrangling that the puppo is the most popular doggo, with, on average, more Retweets and more Favourites per tweet than the other 3 categories: doggo, floofer and pupper.
  • Golden Retrievers are the goodest doggos; Labrador Retrievers, Pembrokes, Chihuahuas and Pugs complete the top 5 most common dog breeds in the data.

The Goodest Doggos

What We Learned

  • How to programmatically download files using the Python <code>requests</code> library
  • How to sign up for and use an API
  • How to use the <code>tweepy</code> library to connect Python to the Twitter API
  • How to handle JSON files in Python
  • How to manually assess and programmatically assess datasets and define Quality and Tidiness Issues
  • How to structure a report to document, define, and test Data Cleansing Steps
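The JSON handling above boils down to reading the API output back line by line. A stdlib-only sketch (in the real project the file is written via <code>tweepy</code>, and the field names here are assumptions):

```python
import io
import json

# Stand-in for a file holding one JSON object per line, as saved
# from the Twitter API responses
raw = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 30}\n'
    '{"id": 2, "retweet_count": 5, "favorite_count": 12}\n'
)

# Each line parses to a plain dict, ready to load into a DataFrame
tweets = [json.loads(line) for line in raw]
favourite_counts = [t["favorite_count"] for t in tweets]
```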


Project:- Analyse A/B Test Results to Determine the Conversion Rate of a New Web Page

The Project

The project is part of the Udacity Data Analysis Nanodegree. This section of the course is a project where we perform our own data analysis to determine whether a website should change its page design from an old page to a new one, based on the results of an A/B test on a subset of users.

The project aims to bring together several concepts taught over the duration of the course, applying them to the dataset to determine the probability of a user converting or not, using various statistical methods, based on whether the user saw the old page or the new page.

The PDF report written to communicate my project and findings can also be found here

What We Learned

  • Using proportions to find probability
  • How to write hypothesis statements and test against them
  • Writing out hypotheses and observations in accurate terminology
  • Using statsmodels to simulate 10,000 examples from a sample dataset, and finding differences from the mean
  • Plotting differences from the mean in a plt.hist histogram, and adding a line representing the actual observed difference
  • Using logistic regression to determine probabilities for one of two possible outcomes
  • Creating dummy variables to make categorical variables usable in regression
  • Creating interaction variables to better represent attributes in combination for use in regression
  • Interpreting regression summary() results and accurately drawing conclusions and observations from the results
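The simulation step in the list above can be sketched with plain NumPy; the conversion counts below are invented for illustration, not the project's actual figures:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical conversion counts for the old and new pages
old_conv, old_n = 1800, 15000
new_conv, new_n = 1860, 15000
obs_diff = new_conv / new_n - old_conv / old_n

# Under the null hypothesis, both pages share one conversion rate
p_null = (old_conv + new_conv) / (old_n + new_n)

# Simulate 10,000 differences in conversion rate under the null
diffs = (
    rng.binomial(new_n, p_null, 10_000) / new_n
    - rng.binomial(old_n, p_null, 10_000) / old_n
)

# p-value: how often chance alone produces a difference at least this large
p_value = (diffs >= obs_diff).mean()
```

Plotting diffs with plt.hist and drawing a vertical line at obs_diff reproduces the visual described above.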


Project:- Data Analysis of Movie Releases, in Python

The Project

The project is part of the Udacity Data Analysis Nanodegree. This section of the course is a project where we perform our own analysis on a dataset of our choosing from a prescribed list. I chose movie releases, a processed version of this dataset on Kaggle: https://www.kaggle.com/tmdb/tmdb-movie-metadata/data

The project aims to bring together several concepts taught over the duration of the course, applying them to the dataset to analyse several attributes and answer several questions that we ourselves ask of the data.

I downloaded the data from the above link. I then imported the data into Python so we could use a Jupyter Notebook to create the required report, which allows us to document and code in the same document, great for presenting back findings and visualisations from the data.

I structured the project similarly to the CRISP-DM method – that is, I: i. stated the objectives; ii. decided what questions to ask of the data; iii. carried out tasks to understand the data; iv. performed data wrangling and exploratory data analysis; and then drew conclusions and answered the questions posed.

The PDF report written to communicate my project and findings can also be found here

What We Learned

  • Using hist() and plot() to build histogram visualisations
  • Using plotting.scatter_matrix and plot() to build scatter plot visualisations
  • Changing the figsize of a chart to a more readable format, and adding a ‘;’ to the end of the line to remove unwanted text
  • Renaming data frame columns in Pandas
  • Using GroupBy and Query in Pandas to aggregate and group selections of data
  • Creating line charts, bar charts and heatmaps in matplotlib, and utilising Seaborn for better visuals and formatting, like adding appropriate labels, titles and colour
  • Using lambda functions to wrangle data formats
  • Structuring a report in a way that is readable and informative, taking the reader through the conclusions drawn
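The GroupBy and Query pattern from the list can be sketched on a toy frame (the genre names and figures are invented, not from the TMDb data):

```python
import pandas as pd

# Toy movie rows with budget and revenue per genre
movies = pd.DataFrame({
    "genre":   ["Action", "Action", "Comedy", "Comedy"],
    "budget":  [100, 150, 30, 20],
    "revenue": [300, 200, 90, 100],
})

# Average budget vs average revenue per genre, plus a simple RoI column
by_genre = movies.groupby("genre")[["budget", "revenue"]].mean()
by_genre["roi"] = by_genre["revenue"] / by_genre["budget"]

# Query selects rows by expression, e.g. films returning over 2x budget
profitable = movies.query("revenue > 2 * budget")
```

by_genre.plot(kind="bar") then gives the budget-versus-revenue chart shown in the snippets below.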

Interesting Snippets

  • Average Budget versus Average Revenue of Genres
  • Average RoI (Revenue by Budget)
  • Two-dimensional analysis of Genres over Time, judging the average Budget, Revenue and Ratings
  • Top 10 Directors by their Total Revenue

The Code and the Report

  • GitHub repository for the data, SQL, PDF report and Jupyter Notebook
  • The PDF report can also be found here


Project:- Data Analysis of Wine Quality, in Python

The Project

The project is part of the Udacity Data Analysis Nanodegree. This section of the course is a case study on wine quality, using the UCI Wine Quality Data Set: https://archive.ics.uci.edu/ml/datasets/Wine+Quality

The case study introduces several new concepts which we can apply to the dataset, allowing us to analyse several attributes and ascertain which qualities of wine correspond to highly rated wines.

I downloaded the data from the above link. I then imported the data into Python so we could use a Jupyter Notebook to create the required report, which allows us to document and code in the same document, great for presenting back findings and visualisations from the data.

I structured the project similarly to the CRISP-DM method – that is I i. Stated the objectives, ii. Decided what questions to ask of the data, iii. Carried out tasks to understand the data, iv. Performed Data Wrangling and Exploratory Data Analysis and then drew conclusions and answered the questions posed.

The PDF report written to communicate my project and findings can also be found here

What We Learned

  • Using hist() and plot() to build histogram visualisations
  • Using plotting.scatter_matrix and plot() to build scatter plot visualisations
  • Changing the figsize of a chart to a more readable format, and adding a ‘;’ to the end of the line to remove unwanted text
  • Appending data frames together in Pandas
  • Renaming data frame columns in Pandas
  • Using GroupBy and Query in Pandas to aggregate and group selections of data
  • Creating bar charts in matplotlib and using Seaborn to add better formatting
  • Adding appropriate labels, titles and colour
  • Engineering proportionality in the data that allows data sets to be compared more easily
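The last two ideas in the list, appending frames and engineering proportions, can be sketched together; the red/white split mirrors the case study, but the quality values are invented:

```python
import pandas as pd

# Toy red and white wine samples with a quality rating each
red = pd.DataFrame({"quality": [5, 6, 5, 7]})
white = pd.DataFrame({"quality": [6, 6, 7, 5]})

# Label each frame, then append them so both can live in one table
red["color"] = "red"
white["color"] = "white"
wine = pd.concat([red, white], ignore_index=True)

# Proportion of highly rated wines per colour puts the two groups,
# whatever their sizes, on a comparable scale
high_share = (wine["quality"] >= 7).groupby(wine["color"]).mean()
```

Plotting high_share as a bar chart gives a fair comparison even when one colour has far more samples than the other.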

The Code and the Report

  • GitHub repository for the data, SQL, PDF report and Jupyter Notebook
  • The PDF report can also be found here
