I’m currently working on an app project that requires scraping data from websites, then mapping and iterating over it.
Frankly, building your first application with a cumulative software engineering knowledge of 4 weeks is not an easy task. When I was tasked with writing a web scraper for my software engineering program at The Flatiron School, my initial reaction was to panic and want to scrape my own brain.
After I spent some time with scraping, things started to come together and finally made some sense. In no way do I think of myself as an “expert” on this topic. However, I would like to share some insights that you might find helpful.
NOKOGIRI WILL SAVE YOUR LIFE!
Nokogiri is a Ruby gem for parsing HTML and XML, and it is an essential tool for Ruby developers who need to scrape data from websites. When it is combined with OpenURI, an easy-to-use Ruby standard library module that lets you open a URL much like a file, it becomes a powerful tool for getting information from websites that don’t make their data available through an API.
Scraping can be difficult to accomplish. You might scrape a lot of unstructured data at first and it might seem very scary.
But all you need is a little understanding of Google Chrome Developer Tools, a little bit of CSS, Ruby fundamentals, and lots of patience. To get the right data, you will need to closely analyze the HTML and identify exactly which page elements contain the information you want.
For my project, I chose to build a scraper that would look at the Comixology Sales Campaigns page and scrape the comics listed on sale. Let’s walk through a simplified version of the scraping process.
1- Install Nokogiri
First, let’s install the Nokogiri gem. It is as simple as typing gem install nokogiri in the terminal!
2- Require Nokogiri and Open-Uri
Installing Nokogiri might take some time, so in the meantime you might want to go to your kitchen and get a cup of coffee or tea. Once it is installed, require Nokogiri and OpenURI at the top of your Ruby file.
3- Inspect the content to scrape
Inspect the content you want to scrape on your web page by using Google Chrome Developer Tools. It is easy: just right-click on the page you want to scrape and click “Inspect”. Now you will be able to inspect the elements on the web page! Just don’t forget to click the “select an element in the page to inspect it” button on the top left (the arrow pointing to the middle of the square).
Hover over the items you are looking for. When you select an item, the corresponding HTML will be highlighted in the Elements panel. Feel free to click on items and look at their tags and classes there. In my case, I can look for a couple of things: book titles, issues, discounted price, original price, and so on. It is important to keep in mind that we are going to target these with CSS selectors. Here are some important things to consider. Is the content we are looking for stored within a single item? Can we see the text, link, image, etc. wrapped within that element in the Elements panel? Is there any ID or class information attached to the tag?
So it is critical to find the element that actually wraps the content we are looking for. Let’s try to see where the image, text, link, or whatever information we want to scrape is wrapped in the Elements panel. Check the classes. Also, make sure that the markup follows a pattern, so that if we check another item, the information is stored in the same way. Let’s go and play around. Bingo! The ‘.content-item’ class wraps all the information we are looking for!
OK, let’s save the site we would like to scrape to a variable in our Ruby code, and then play some more with the developer tools and CSS selectors to get to the specifics of each item we would like to scrape.
I will not go into much detail about the rest of our Ruby code (this is not a play-by-play instructional article), but once we finish creating variables and mapping and iterating over each element we have scraped, we should have a nice clean list of book titles (book_title), issues (book_number), discounted prices (book_price), and original prices (book_original_price) based on the CSS selectors we have chosen.
Let’s run our code and Shazam!!
Now we have scraped a beautiful list of comics on sale! We can be proud of our scraping skills even if we don’t have adamantium claws like Wolverine!
Final Words — A couple of things I learned from my experience:
- If the website does not offer an API, Nokogiri will save your life!
- The browser’s element inspector is very helpful for understanding where the data is stored.
- Creating the right CSS selectors in the Ruby code might be challenging, but with some trial and error we will get there.
- Scraping data is fun!
I am still learning about scraping and using CSS selectors effectively. Creating this project has definitely been a huge learning experience. If you are a beginner software developer like I am, I hope you find a couple of useful tips here to help with your own studies. If you have any questions or thoughts, please share them in the comments.