portfolio, python

Python Web Scraper

Why Build This?

It can sometimes be a challenge to find and get the data you require. Whether we are Data Developers, Engineers, Analysts or Scientists, we perform a business function and need to ensure we return value to our stakeholders.

Sometimes the data you need to do that isn’t easily or readily available. Sometimes it might not even exist. We may need to look externally and use APIs to bring in complementary data sources for our reporting. Alternatively, we might even have to scrape it ourselves from other sources.

Luckily, there are packages available to help us do just that.

The Challenge

I’m used to dealing with structured data from various RDBMSs; it’s second nature now. However, with more and more data becoming available through the web, particularly on social media, there is a treasure trove of information that we can use to drive valuable insight for our business stakeholders.

We can glean customer sentiment from social media, mine sites for information, or watch how stocks react in real time to news events.

While still learning Python, I figured I could kill two birds with one stone here: write something that would be a good stretch of my abilities, and get back data that we can analyse, so I could also learn some visualisation and analysis along the way.

To that end, I started looking around for what tools are available and found BeautifulSoup. In the words of its documentation:

Beautiful Soup is a Python library for pulling data out of HTML and XML files.
It works with your favorite parser to provide idiomatic ways of navigating,
searching, and modifying the parse tree. It commonly saves programmers
hours or days of work.
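
To make "navigating and searching the parse tree" a little more concrete, here is a minimal sketch. The HTML snippet, the "story" class name and the variable names are all made up for illustration; they are not taken from the scraper in this post. It assumes the bs4 package is installed and uses Python's built-in html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, standing in for a page we have already downloaded.
html = """
<html><body>
  <h1>Latest Headlines</h1>
  <ul>
    <li class="story"><a href="/news/1">Markets rally</a></li>
    <li class="story"><a href="/news/2">Rates hold steady</a></li>
  </ul>
</body></html>
"""

# Build the parse tree using the parser that ships with Python.
soup = BeautifulSoup(html, "html.parser")

# Navigating: attribute access returns the first matching tag.
title = soup.h1.get_text(strip=True)

# Searching: find_all returns every tag that matches the filters.
links = [
    (li.a.get_text(strip=True), li.a["href"])
    for li in soup.find_all("li", class_="story")
]

print(title)
print(links)
```

On a real site the tag names and classes will differ, so the filters passed to find_all would need to be adapted after inspecting the page source.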

What We’ve Learned

  • Scraping from HTML using BeautifulSoup
  • Using loops to parse through information and pass it into…
  • pandas DataFrames
  • Some good practices to avoid excessive processing while looping, avoiding load on
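
As a rough sketch of how these pieces can fit together, the example below loops over the rows of a small table and loads them into a pandas DataFrame. The table, the column names and the values are invented for illustration and are not the data from this project. The "good practices" shown are only one possible reading of the point above: parse the document once outside the loop, collect plain dictionaries in a list, and build the DataFrame a single time at the end instead of growing it row by row.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML table, standing in for scraped page content.
html = """
<table id="prices">
  <tr><th>Ticker</th><th>Price</th></tr>
  <tr><td>ABC</td><td>10.50</td></tr>
  <tr><td>XYZ</td><td>7.25</td></tr>
</table>
"""

# Parse once, outside the loop, rather than re-parsing per row.
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) != 2:
        continue  # the header row has <th> cells only, so skip it
    rows.append(
        {
            "ticker": cells[0].get_text(strip=True),
            "price": float(cells[1].get_text(strip=True)),
        }
    )

# Build the DataFrame in one go from the accumulated rows.
df = pd.DataFrame(rows, columns=["ticker", "price"])
print(df)
```

Appending to a Python list is cheap, whereas adding rows to a DataFrame inside a loop copies data on every iteration, which is why the DataFrame is created only once here.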

Code link to GitHub

Resources