
Project:- Analyse A/B Test Results to Determine the Conversion Rate of a New Web Page

The Project

The project is part of the Udacity Data Analysis Nanodegree. This section of the course is a project where we perform our own data analysis to determine whether a website should change its page design from an old page to a new page, based on the results of an A/B test on a subset of users.

The project aims to bring together several concepts taught to us over the course. We apply these to the dataset to analyse the data and, using various statistical methods, determine the probability of a user converting depending on whether they saw the old page or the new page.

The PDF report written to communicate my project and findings can also be found here.

What We Learned

  • Using proportions to find probabilities
  • How to write hypothesis statements and test against them
  • Writing out hypotheses and observations in accurate terminology
  • Using statsmodels to simulate 10,000 samples from the dataset and find differences from the mean (see the sketch after this list)
  • Plotting those differences in a matplotlib histogram (plt.hist) and adding a reference line for the actual observed difference
  • Using logistic regression to determine probabilities for one of two possible outcomes
  • Creating dummy variables to make categorical variables usable in regression
  • Creating interaction variables to better represent attributes in combination for use in regression
  • Interpreting regression summary() results and drawing accurate conclusions and observations from them
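As a flavour of that simulation step, here is a minimal sketch of a null-hypothesis simulation and histogram. The sample sizes, conversion rate and observed difference are placeholder values, and the sampling here uses NumPy's binomial sampler rather than statsmodels itself:

import numpy as np
import matplotlib.pyplot as plt

# Placeholder figures for illustration only - not the project's actual numbers
n_new, n_old = 145000, 145000   # users shown the new and old pages
p_null = 0.12                   # conversion rate assumed under the null hypothesis
obs_diff = -0.0016              # observed difference: p_new - p_old

# Simulate 10,000 differences in conversion rate under the null hypothesis
p_diffs = []
for _ in range(10000):
    new_page_converted = np.random.binomial(n_new, p_null) / n_new
    old_page_converted = np.random.binomial(n_old, p_null) / n_old
    p_diffs.append(new_page_converted - old_page_converted)
p_diffs = np.array(p_diffs)

# Histogram of the simulated differences, with a line marking the observed difference
plt.hist(p_diffs)
plt.axvline(x=obs_diff, color='red')
plt.xlabel('Simulated difference in conversion rate (new - old)')
plt.ylabel('Frequency')
plt.show()

# p-value: proportion of simulated differences at least as large as the observed one
print((p_diffs > obs_diff).mean())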

Interesting Snippets
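A minimal sketch of the regression step described above: a logistic regression with dummy and interaction variables fitted with statsmodels. The data below is randomly generated and the column names are assumptions for illustration, not the project's actual data:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Randomly generated stand-in data - column names are illustrative only
rng = np.random.default_rng(42)
n = 5000
df = pd.DataFrame({
    'converted': rng.binomial(1, 0.12, n),
    'group': rng.choice(['control', 'treatment'], n),
    'country': rng.choice(['US', 'UK', 'CA'], n),
})

# Dummy variable for the page: 1 = new page (treatment), 0 = old page (control)
df['ab_page'] = (df['group'] == 'treatment').astype(int)

# Country dummies, dropping 'US' as the baseline category
df = df.join(pd.get_dummies(df['country'])[['CA', 'UK']].astype(int))

# Interaction terms: page x country
df['CA_page'] = df['CA'] * df['ab_page']
df['UK_page'] = df['UK'] * df['ab_page']

# statsmodels' Logit needs an explicit intercept column
df['intercept'] = 1
model = sm.Logit(df['converted'],
                 df[['intercept', 'ab_page', 'CA', 'UK', 'CA_page', 'UK_page']])
results = model.fit()
print(results.summary())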

The Code and the Report



Project:- Data Analysis of Movie Releases, in Python

The Project

The project is part of the Udacity Data Analysis Nanodegree. This section of the course is a project where we perform our own analysis on a dataset of our choosing from a prescribed list. I chose Movie Releases, using a processed version of this dataset on Kaggle: https://www.kaggle.com/tmdb/tmdb-movie-metadata/data

The project aims to bring together several concepts taught to us over the course. We apply these to the dataset to analyse several attributes and answer questions that we ask of the data ourselves.

I downloaded the data from the link above and imported it into Python so we could use a Jupyter Notebook to create the required report. The notebook lets us document and code in the same place, which is great for presenting findings and visualisations from the data.

I structured the project along the lines of the CRISP-DM method: i. stated the objectives, ii. decided what questions to ask of the data, iii. carried out tasks to understand the data, iv. performed data wrangling and exploratory data analysis, and then drew conclusions and answered the questions posed.

The PDF report written to communicate my project and findings can also be found here.

What We Learned

  • Using hist() and plot() to build histogram visualisations
  • Using pandas.plotting.scatter_matrix() and plot() to build scatter plot visualisations
  • Changing the figsize of a chart to a more readable format, and adding a ';' to the end of a line to suppress unwanted text output
  • Renaming DataFrame columns in pandas
  • Using groupby() and query() in pandas to aggregate and filter selections of data
  • Creating line charts, bar charts and heatmaps in matplotlib, and utilising Seaborn for better visuals and formatting such as appropriate labels, titles and colour
  • Using lambda functions to wrangle data formats (see the sketch after this list)
  • Structuring a report in a way that is readable and informative, taking the reader through the conclusions drawn
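For example, genre information in datasets like this is often packed into a single delimited string, and a lambda is a quick way to wrangle it into something usable. A minimal sketch, assuming a 'genres' column holding pipe-separated values:

import pandas as pd

# Hypothetical rows - the real dataset has many more columns and records
df = pd.DataFrame({
    'original_title': ['Movie A', 'Movie B'],
    'genres': ['Action|Adventure', 'Drama'],
    'revenue': [1000000, 250000],
})

# Use a lambda to split the pipe-separated genre string into a list,
# then explode it so each (movie, genre) pair becomes its own row
df['genres'] = df['genres'].apply(lambda g: g.split('|'))
genre_rows = df.explode('genres')
print(genre_rows)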

Interesting Snippets

Charts produced in the report include:

  • Average budget versus average revenue by genre
  • Average RoI (revenue divided by budget)
  • Two-dimensional analysis of genres over time, judging the average budget, revenue and ratings
  • Top 10 directors by their total revenue
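As an illustration of the kind of code behind the last of those charts, here is a minimal sketch of a groupby aggregation and Seaborn-styled bar chart; the tiny inline data frame and column names are stand-ins for the real dataset:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical data - the project uses the full movie dataset
df = pd.DataFrame({
    'director': ['Director A', 'Director B', 'Director A', 'Director C', 'Director B'],
    'revenue':  [500, 300, 700, 200, 400],
})

# Total revenue per director, keeping only the top 10
top_directors = (df.groupby('director')['revenue']
                   .sum()
                   .sort_values(ascending=False)
                   .head(10))

# Bar chart with Seaborn styling, labels and a title
# (in a notebook, a trailing ';' on the plot line suppresses the text output)
sns.set_style('whitegrid')
top_directors.plot(kind='bar', figsize=(10, 6), color='steelblue')
plt.xlabel('Director')
plt.ylabel('Total revenue')
plt.title('Top 10 Directors by Total Revenue')
plt.show()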

The Code and the Report

  • GitHub repository for the data, SQL, PDF report and Jupyter Notebook
  • the PDF report can also be found here



Project:- Data Analysis of Wine Quality, in Python

The Project

The project is part of the Udacity Data Analysis Nanodegree. The section of the course is a Case Study on wine quality, using the UCI Wine Quality Data Set: https://archive.ics.uci.edu/ml/datasets/Wine+Quality

The case study introduces us to several new concepts, which we apply to the dataset to analyse several attributes and ascertain which qualities correspond to highly rated wines.

I downloaded the data from the link above and imported it into Python so we could use a Jupyter Notebook to create the required report. The notebook lets us document and code in the same place, which is great for presenting findings and visualisations from the data.

I structured the project along the lines of the CRISP-DM method: i. stated the objectives, ii. decided what questions to ask of the data, iii. carried out tasks to understand the data, iv. performed data wrangling and exploratory data analysis, and then drew conclusions and answered the questions posed.

The PDF report written to communicate my project and findings can also be found here.

What We Learned

  • Using hist() and plot() to build histogram visualisations
  • Using pandas.plotting.scatter_matrix() and plot() to build scatter plot visualisations
  • Changing the figsize of a chart to a more readable format, and adding a ';' to the end of a line to suppress unwanted text output
  • Appending DataFrames together in pandas (see the sketch after this list)
  • Renaming DataFrame columns in pandas
  • Using groupby() and query() in pandas to aggregate and filter selections of data
  • Creating bar charts in matplotlib and using Seaborn to add better formatting
  • Adding appropriate labels, titles and colour
  • Engineering proportions in the data so that datasets can be compared more easily
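A minimal sketch of the append-and-compare pattern described above, assuming the two UCI CSV files have been downloaded locally (the file names and the particular proportion calculated are illustrative):

import pandas as pd

# The UCI wine quality files are semicolon-separated; file names assume a local download
red = pd.read_csv('winequality-red.csv', sep=';')
white = pd.read_csv('winequality-white.csv', sep=';')

# Label each frame, then append them into one combined dataset
red['color'] = 'red'
white['color'] = 'white'
wine = pd.concat([red, white], ignore_index=True)

# Engineer proportions so the two differently sized datasets can be compared:
# share of each colour's wines that score 7 or above
high_quality = wine.query('quality >= 7')
proportions = high_quality.groupby('color').size() / wine.groupby('color').size()
print(proportions)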

The Code and the Report

  • GitHub repository for the data, SQL, PDF report and Jupyter Notebook
  • the PDF report can also be found here





Project:- Data Analysis of Global Weather Trends Compared to Edinburgh, in Python

The Project

The project is part of the Udacity Data Analysis Nanodegree. The project requires the student to extract Global average temperature data and the average temperature of a local city. In this case I chose Edinburgh. We then need to discuss what questions we want to ask of the data, analyse the data, visualise the data and draw our conclusions.

I extracted the data from the Udacity virtual environment using SQL, then imported it into Python so we could use a Jupyter Notebook to create the required report. The notebook lets us document and code in the same place, which is great for presenting findings and visualisations from the data.

I approached the project by first deciding what questions to ask of the data. I then imported the data into Python using pandas and carried out some rudimentary exploration to understand its layout, structure, number of records and so on. To prepare the data for visualisation, I applied the rolling() function to smooth out some of the jagged year-to-year changes in the data.

With the data smoothed over a rolling window, I visualised the global and local data in both a line chart and a box plot so that we could compare them, identify trends and answer the questions posed.
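A minimal sketch of the smoothing and charting steps, assuming the extracted data has been saved as CSVs with 'year' and 'avg_temp' columns (the file and column names are assumptions, not the exact output of the SQL extract):

import pandas as pd
import matplotlib.pyplot as plt

# File and column names are assumptions - adjust to match the extracted data
global_temps = pd.read_csv('global_data.csv')        # columns: year, avg_temp
edinburgh_temps = pd.read_csv('edinburgh_data.csv')  # columns: year, avg_temp

# Smooth the jagged year-to-year changes with a 10-year rolling average
global_temps['rolling_avg'] = global_temps['avg_temp'].rolling(10).mean()
edinburgh_temps['rolling_avg'] = edinburgh_temps['avg_temp'].rolling(10).mean()

# Line chart comparing the smoothed global and local trends
plt.figure(figsize=(10, 6))
plt.plot(global_temps['year'], global_temps['rolling_avg'], label='Global', color='steelblue')
plt.plot(edinburgh_temps['year'], edinburgh_temps['rolling_avg'], label='Edinburgh', color='darkorange')
plt.xlabel('Year')
plt.ylabel('Average temperature (°C, 10-year rolling mean)')
plt.title('Global vs Edinburgh Average Temperature')
plt.legend()
plt.show()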

Finally, I drew conclusions and answered the questions posed at the start.

The PDF report written to communicate my project and findings can also be found here.

What We Learned

  • How to approach an analysis project, posing questions and drawing conclusions
  • Manipulating data in Python
  • Creating a rolling average in Python using the rolling() function
  • Utilising matplotlib to visualise data in line charts and box plots, complete with customised colours, axes, labels and titles

The Code and the Report

  • GitHub repository for the data, SQL, PDF report and Jupyter Notebook
  • the PDF report can also be found here



Applying CRISP-DM to Data Science and a Re-Usable Template in Jupyter

What is CRISP-DM?

CRISP-DM is a process methodology that provides a certain amount of structure for data-mining and analysis projects. It stands for Cross-Industry Standard Process for Data Mining. According to polls on the popular data science website KDnuggets, it is the most widely used process for data mining.

The process revolves around six major steps:

1. Business Understanding

Start by focusing on understanding the objectives and requirements from a business perspective, then use this knowledge to define the data problem and project plan.

2. Data Understanding

As with every data project, there is an initial hurdle to collect the data and familiarise yourself with it, identify data quality issues, discover initial insights, or detect interesting nuggets of information that might form a hypothesis for analysis.

3. Data Preparation

The nitty-gritty dirty work of preparing the data: cleaning, merging and moulding it to form a final dataset that can be used in modelling.

4. Modeling

At this point we decide which modelling techniques to actually use and build the models.

5. Evaluation

Once we appear to have good enough model results, these need to be tested to ensure they perform well against unseen data and that all the key business issues have been answered.

6. Deployment

At this stage we are ready to deploy our code representation of the model into a production environment and solve our original business problem.

Why Use It ?

Puts the Business Problem First

One of the greatest advantages is that it puts business understanding at the centre of the project. This means we are concentrating on solving the business’s needs first and foremost and trying to deliver value to our stakeholders.

Commonality of Structured Solutions

As a manager of data scientists and analysts, it also ensures the team sticks to a common methodology, so that best practice is followed, common issues are tackled consistently and results stay at their best.

Flexible

It's also not a rigid structure: it's malleable, the steps are repeatable, and you will often naturally go back through the steps to optimise your final dataset for modelling.

It does not necessarily need to be mining or model related. Since so many business problems today require extensive data analysis and preparation, the methodology can flex to suit other categories of solution such as recommender systems, sentiment analysis and NLP, amongst others.

A Template in Jupyter

Since we have a structured process to follow, it is likely we have re-usable steps and components, which is ideal for re-usable code. What's more, a Jupyter notebook can contain the documentation necessary for our business understanding and a description of each step in the process.

To that end, I have a re-usable notebook on my DataSci_Resources GitHub repository here.



Python Web Scraper

Why Build This?

It can sometimes be a challenge to find and get the data you require. Whether we are data developers, engineers, analysts or scientists, we perform a business function where we need to ensure we return value to our stakeholders.

Sometimes the data you need isn't so easily or readily available to do that. Sometimes it might not even exist. We may need to look externally and use APIs to bring in complementary data sources for our reporting; alternatively, we might even have to scrape it ourselves from other sources.

Luckily, there are packages available to help us do just that.

The Challenge

I'm used to dealing with structured data from various RDBMSs; it's second nature now. However, with more and more data becoming available through the web, particularly on social media, there is a treasure trove of information we can use to drive valuable insight for our business stakeholders.

We can glean customer sentiment from social media, mine sites for information, or watch how stocks shift in real time in response to news events.

While still learning Python, I figured I could kill two birds with one stone and write something that would be a good stretch of my abilities, while also giving back data that we could analyse and visualise.

To that end I started looking around for what tools were available, and found BeautifulSoup:

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

What We’ve Learned

  • Scraping from HTML using BeautifulSoup
  • Using loops to parse through information and pass it into pandas DataFrames
  • Some good practices to avoid excessive processing while looping, and to avoid putting load on the sites being scraped (see the sketch after this list)
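A minimal sketch of that kind of loop, scraping a couple of pages from the public quotes.toscrape.com practice sandbox; the URL and CSS classes below belong to that sandbox and are used purely for illustration, not the site scraped in the project:

import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Collect records page by page into a list of dicts
rows = []
for page in range(1, 3):
    response = requests.get(f'http://quotes.toscrape.com/page/{page}/')
    soup = BeautifulSoup(response.text, 'html.parser')

    # Loop over each quote block and pull out the fields we want
    for quote in soup.find_all('div', class_='quote'):
        rows.append({
            'text': quote.find('span', class_='text').get_text(strip=True),
            'author': quote.find('small', class_='author').get_text(strip=True),
        })

    # Be polite: pause between requests rather than hammering the site
    time.sleep(1)

# Pass the scraped records into a pandas DataFrame for analysis
df = pd.DataFrame(rows)
print(df.head())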

Code link to GitHub
