WebScrapers

Two WebScrapers

Python, BS4, Requests, Selenium

Inspiration

As a person with an avid interest in classic novels, I'm always on the hunt for interesting new selections to read. After learning Python and web scraping, I thought I could employ these skills for my own purposes, such as going onto certain online book websites to gather well-rated classic novels for my reading list without having to personally click into and investigate each book for the relevant information. I visit the Open Library website every now and then, and its classics section had 80 pages that would have taken me far too long to comb through by hand, so I decided to attempt web scraping instead.

BS4 and Requests

While learning Python, I learned how to find the HTML that corresponds to a desired piece of information and scrape it using Beautiful Soup and Requests, so I decided to apply this technique to Open Library's classic novels section. Since I was looking specifically for classic novels with a good reputation by general consensus, I collected books that were rated four stars or above by at least 50 people.

Since the results were paginated and the URL for each page differed only by the page number, I used a for loop to reconstruct the URL for every page. Within the loop, I first checked that the page was valid, after which I could safely set variables for the information I wanted from each valid book, such as the title, author, star count, and number of ratings. One challenge was that the HTML listed the star count and number of ratings together (e.g., "4.2 stars (41 ratings)"), which I handled by splitting the text of the rating tag, sometimes multiple times. I also used strip() to remove any leading or trailing spaces, just in case there were exceptions to the generally consistent format.

After running the program, I realized that some of the books were already checked out and I wouldn't be able to read them immediately, so I added another condition: I only wanted books with a status of 'read' (readable online) or 'borrow' (available to borrow). I noticed that Open Library uses a single class, 'cta-btn available', for this more general criterion, which served my purposes exactly. Finally, I used the pandas library to save the web scraping results to a CSV file in the same folder so they would be easily accessible if needed.
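The sketch below outlines that pipeline under some assumptions: the pagination pattern and every selector except 'cta-btn available' are hypothetical stand-ins, since Open Library's actual markup may differ.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

books = []
for page in range(1, 81):  # reconstruct the URL for each of the 80 pages
    url = f"https://openlibrary.org/subjects/classics?page={page}"  # hypothetical pattern
    response = requests.get(url)
    if response.status_code != 200:  # skip any page that fails to load
        continue
    soup = BeautifulSoup(response.text, "html.parser")

    for book in soup.select("div.book"):  # hypothetical container selector
        # Only keep books Open Library marks as readable or borrowable.
        if book.select_one("a.cta-btn.available") is None:
            continue

        rating_tag = book.select_one("span.ratings-stats")  # hypothetical selector
        if rating_tag is None:  # some books have no ratings at all
            continue

        # The rating text reads e.g. "4.2 stars (41 ratings)"; split it apart,
        # stripping stray whitespace in case the format varies slightly.
        rating_text = rating_tag.text.strip()
        stars = float(rating_text.split(" stars")[0])
        num_ratings = int(rating_text.split("(")[1].split(" ")[0])

        if stars >= 4.0 and num_ratings >= 50:
            books.append({
                "title": book.select_one("h3.booktitle").text.strip(),      # hypothetical
                "author": book.select_one("span.bookauthor").text.strip(),  # hypothetical
                "stars": stars,
                "ratings": num_ratings,
            })

pd.DataFrame(books).to_csv("classic_novels.csv", index=False)
```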

Selenium and ChromeDriver

After trying the method with BS4 and Requests, it came to my attention that some relevant information was not directly accessible on each listing page. More specifically, I would have to click into each book to access details like the publisher or page count. This is how I was introduced to Selenium (and, by extension, ChromeDriver), which enabled me to access the desired information by automating those clicks. I was interested in the page count of the highly rated books, as I would much rather read a long book as a physical copy than online, so I set a filter that would only return books with at most 800 pages.

Drawing on my experience building the first web scraper, I attempted to simulate clicking into each book by reconstructing the URL again, appending each book's href attribute (a relative URL) to the domain. However, I kept getting errors on some pages indicating that the URL did not exist. By adding a print statement for the URL being accessed, I discovered that the domain was being prepended twice, even though I had confirmed that each href was a relative URL and not an absolute one. Through further research, I learned that modern browsers automatically convert a relative URL in an <a> tag to an absolute one through a process called URL resolution, before Selenium even reads it. Since Selenium interacts with the browser through the WebDriver API, it was extracting the resolved absolute URL rather than the relative URL I thought it was extracting. So I only had to take the extracted URL (book_link) as-is and pass it to the .get() method, a small yet satisfying mystery to debug.

Another error I encountered while running the program was that the page-count tag sometimes simply did not exist; some books on the later pages did not list a page count at all. I added another if statement to resolve this: I would only assign the page count if the tag existed, and otherwise it would simply be recorded as None. I also changed the method from find_element to find_elements so that, should the tag not exist, it would return an empty list instead of raising an exception. As before, I saved the relevant books to a CSV file in the same folder.
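Here is a minimal sketch of that scraper, again with hypothetical selectors and pagination. The two details from the debugging above are that get_attribute("href") returns the browser-resolved absolute URL (so it can be passed straight to driver.get()), and that find_elements returns an empty list when nothing matches.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

driver = webdriver.Chrome()  # assumes ChromeDriver is installed and reachable

# Collect every book link up front, since navigating away from a listing
# page invalidates its element references.
book_links = []
for page in range(1, 81):  # hypothetical pagination pattern, as in the first scraper
    driver.get(f"https://openlibrary.org/subjects/classics?page={page}")
    for a in driver.find_elements(By.CSS_SELECTOR, "a.book-link"):  # hypothetical selector
        # get_attribute("href") yields the already-resolved absolute URL.
        book_links.append(a.get_attribute("href"))

rows = []
for book_link in book_links:
    driver.get(book_link)  # already absolute, so no domain prepending needed

    # find_elements returns an empty list (rather than raising
    # NoSuchElementException) when the page-count tag is missing.
    page_tags = driver.find_elements(By.CSS_SELECTOR, "span.page-count")  # hypothetical
    page_count = int(page_tags[0].text.strip().split()[0]) if page_tags else None

    title = driver.find_element(By.TAG_NAME, "h1").text.strip()
    if page_count is None or page_count <= 800:  # keep books of at most 800 pages
        rows.append({"title": title, "pages": page_count, "url": book_link})

driver.quit()
pd.DataFrame(rows).to_csv("classic_novels_selenium.csv", index=False)
```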