Introduction
I do love a good excuse to start a new project and learn something new. There is something profoundly beautiful about the journey you need to take when trying to find a way to solve a new problem.
So here's the current excuse for a new project: I was going on a trip to New York City and needed to find a way to (productively) spend my time on both flights (it takes about 10 hours to get to NYC). I decided to catch up on the WWDC videos I hadn't watched at the time. After going to the Apple Developer portal, I decided that I wasn't going to spend all the time clicking through all the links on the page to download every video, so I used it as an excuse to create a tiny video downloader script using Ruby.
I have used Nokogiri a number of times in the past, when I needed to scrape data from a website, but I had never used it in conjunction with a system command to download files.
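Before getting to the script itself, here is a minimal, standalone sketch of the "system command from Ruby" half of that combination (not taken from the script below; `echo` simply stands in for a downloader like wget): backticks run a shell command and capture its standard output, while the global `$?` holds the exit status of the last command.

```ruby
# Minimal sketch: run a shell command from Ruby and inspect the result.
# `echo` stands in for a real downloader such as wget or curl.
output = `echo "downloading..."` # backticks capture stdout as a String
success = $?.success?            # $? holds the status of the last command

puts output  # => downloading...
puts success # => true
```

The same pattern appears in the downloader below, where the backtick command is a real `wget` invocation instead of `echo`.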
The download utility I used was wget (I also added an option for cURL in the code below). I decided not to write a download function from scratch because there was no reason to reinvent the wheel, especially when such great utilities already exist. The main learning focus was on how to effectively scrape a web page recursively and use command line utilities programmatically.
I also wanted the script to be usable on any *NIX or BSD system, so I could potentially use it in a Rails application should I ever need to, with only a few small changes to some methods or logic.
Script execution and flow explanation
You can execute the script using this command: ruby [name-of-script] https://developer.apple.com/videos/wwdc2017/. You need to specify the page you want to scrape and the directory to save the files to. The defaults are the WWDC 2017 videos page and the current directory.
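Both arguments are optional because the script reads them straight from ARGV with `||` fallbacks; a minimal sketch of that pattern (the variable names mirror the ones used in the script below):

```ruby
# Positional command-line arguments with defaults:
# ARGV[n] is nil when the argument is absent, so || supplies the fallback.
page_url       = ARGV[0] || "https://developer.apple.com/videos/wwdc2017/"
save_directory = ARGV[1] || "."

puts page_url
puts save_directory
```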
The basic flow of the script is as follows:
- The script opens the main page specified by the first argument of the script (default: WWDC 2017)
- It scrapes the index page for all video links, creates video objects and adds them to the videos array of the downloader
- It goes through each entry in the videos array and starts downloading it, creating its parent directory (if it doesn't exist)
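One detail worth calling out from the flow above: the download links end in a ?dl=1 query string, so the script derives the on-disk filename, the file extension, and the session number from the URL with plain string splitting. A standalone sketch of that logic (the example URL is made up for illustration):

```ruby
# Hypothetical download URL in the shape the script expects:
# .../<session>_<title>.<ext>?dl=1
download_url = "https://example.com/wwdc2017/102/102_platforms_state_of_the_union_hd.mp4?dl=1"

filename  = download_url.split("/").last.split("?").first # drop the path and the ?dl=1 suffix
extension = filename.split(".").last
session   = filename.split("_").first

puts filename  # => 102_platforms_state_of_the_union_hd.mp4
puts extension # => mp4
puts session   # => 102
```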
The following section contains the full source code for the downloader/scraper script.
Code
module DownloadController
  require 'nokogiri'
  require 'open-uri'
  require 'awesome_print'

  # Helper functions
  #
  # Returns the result of opening the page that is passed to it
  def self.open_page(uri)
    Nokogiri::HTML(URI.open(uri))
  end

  # Video class definition
  class Video
    attr_accessor :name, :page_url, :download_url, :section

    def initialize(name, page_url = "", download_url = "", section = "")
      @name = name
      @page_url = page_url
      @download_url = download_url
      @section = section
    end

    def describe
      puts "=========================================================="
      puts "Video Info:\n Name: #{@name}\n Page: #{@page_url}\n Download Link: #{@download_url}\n Section: #{@section}\n"
      puts "=========================================================="
    end

    def set_download_url
      video_download_url = DownloadController::open_page(@page_url).css('a').select { |a| a.text == 'HD Video' }
      # Checks for an HD Video link. If none exists, select will return an empty array
      if video_download_url.empty?
        @download_url = ""
      else
        @download_url = video_download_url.first['href']
      end
    end

    # Returns true if the video contains a valid download url
    def valid_download?
      @download_url != ""
    end

    # Returns the video filename, removing the ?dl=1 suffix
    def filename
      @download_url.split("/").last.split("?").first
    end

    # Returns the file extension using the filename
    def file_extension
      filename.split(".").last
    end

    # Returns the number for the video session
    def video_session
      filename.split("_").first
    end

    # Returns a more readable version of the video filename
    def proper_name
      "#{@name} (Session: #{video_session}).#{file_extension}"
    end
  end

  # Downloader class
  class Downloader
    def initialize
      @base_url = ""
      @page_url = ARGV[0] || "https://developer.apple.com/videos/wwdc2017/"
      @save_directory = ARGV[1] || "."
      @videos = []
    end

    # Helper functions

    # Sets the base url for the video links
    def set_base_url
      uri = URI.parse(@page_url)
      @base_url = "#{uri.scheme}://#{uri.host}"
    end

    # Scrapes the index page for all videos
    def get_videos_from_index_page
      ap "Getting index page links"
      DownloadController::open_page(@page_url).css(".collection-focus-group").each do |c|
        # Get the video section to name the folder
        section_name = c.css(".font-bold").text.strip
        # Go through each of the section's video links and create the video objects
        c.css("a").select { |link| link.text.strip != '' }.each do |link|
          v = Video.new(link.text.strip, "#{@base_url}#{link['href']}", "", section_name)
          ap "Scraping #{v.name}"
          v.set_download_url
          v.describe
          @videos << v
        end
      end
    end

    # Goes over every entry in the videos array and downloads valid videos
    def download_files
      @videos.select { |video| video.valid_download? }.each do |v|
        # Download video file
        download_file(v)
      end
    end

    # Bootstrap method. It sets the base URL for the videos, gets all video links from the front page
    # and proceeds to download the videos in the videos array of the Downloader instance.
    def start
      set_base_url
      get_videos_from_index_page
      download_files
    end

    private

    # Download file method. It creates the folder if it doesn't already exist
    def download_file(video)
      # Check if the save directory exists and, if not, create it
      `mkdir -p "#{@save_directory}/#{video.section}"` unless Dir.exist?("#{@save_directory}/#{video.section}/")
      video.describe
      ap "Downloading file..."
      # Note: wget's -N (timestamping) is incompatible with -O, so only -O is used here
      `wget -O "#{@save_directory}/#{video.section}/#{video.filename}" "#{video.download_url}" --show-progress`
      # `curl --url "#{video.download_url}" -o "#{@save_directory}/#{video.section}/#{video.filename}"`
    end
  end
end
# Initialize a new Downloader object and start downloading the files
d = DownloadController::Downloader.new
d.start
Resources
You can get the source code here