This repository contains a script that uses BeautifulSoup to scrape the Ulysses Page by Page blog (http://ulyssespages.blogspot.com/) and collect a list of the MP3 files embedded on each page.
This is not terribly complicated: the script walks the Ulysses Page by Page index, which provides one link for each page of Ulysses. The only difficulty is that some of the links on the index are broken; the broken links (and their corresponding correct links) are listed in the bad_links.txt CSV file.
The resulting output files contain one link per line and can be used with the wget utility to download the MP3s from the command line.
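The core extraction step can be sketched roughly like this. The function name and the anchor-tag selector logic are illustrative assumptions, not necessarily the approach taken in get_soup.py:

```python
from bs4 import BeautifulSoup

def mp3_links(html):
    """Extract every href ending in .mp3 from one blog page's HTML.

    A minimal sketch: real pages may embed audio differently, so the
    anchor-tag heuristic here is an assumption.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].lower().endswith(".mp3")]
```

Collecting the results from every page of the index and writing them one per line yields files in the same shape as the OUTPUT_*.txt files above.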
For example, to download all of the MP3 files to a directory called mp3/, you would run the following commands from the repository:

```
mkdir mp3/
cd mp3/
wget -i ../OUTPUT_media_links_sorted.txt
```