This is Part 1 of a 5-part series on building an ETL platform with Ruby.
AC (acceptance criteria): download all federal German law by calling one method.
def execute
  links = crawl_links
  books = scrape_books(links)
  until books.empty?
    book = books.shift
    documents = get_documents(book)
    next if documents.empty?
    handle_deleted
    handle_updated
    post_process(book)
  end
end
If this looks like pseudo-code, that’s because it is. BUT the code we’re going to have at the end of this series will look pretty similar. Okay, so that’s the big plan. In this blog post, we’re going to tackle the part that gets the links.
Before we start the implementation, we need to answer several questions:
- What’s the root URL?
  - In our case, the data lives at http://www.gesetze-im-internet.de
- Where do the links come from?
  - There’s an alphabetical list of all the law books at http://www.gesetze-im-internet.de/aktuell.html
- Which piece of data do we need?
  - There’s a compiled XML version of each book, linked from the book’s page. We need to extract the XML and import the book’s attributes into our database.
Meet Mechanize:
The Mechanize library is used for automating interaction with websites. Mechanize automatically stores and sends cookies, follows redirects, and can follow links and submit forms. Form fields can be populated and submitted. Mechanize also keeps track of the sites that you have visited as a history.
Add gem 'mechanize' to your Gemfile and run bundle install (or simply gem install mechanize).
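If you’re starting from a blank project, a minimal Gemfile for this post could look like the following; nokogiri and composable_operations only come into play further down:

  # Gemfile
  source 'https://rubygems.org'

  gem 'mechanize'              # crawling
  gem 'nokogiri'               # XML parsing, used later in this post
  gem 'composable_operations'  # the operation framework, used later in this post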
Now, open an IRB console and let’s walk through it step by step.
require 'mechanize'

@agent = Mechanize.new
base_url = "http://www.gesetze-im-internet.de"
start_url = "#{base_url}/aktuell.html"

page = @agent.get(start_url)
puts page.links
There are the links!
Actually, no. These are all the links on that page. We only need a handful of them, namely the ones whose text is a single character (the letters of the alphabetical index).
page_links = []
page.links.each do |l|
  if l.text.length == 1
    page_links << l
  end
end
There are the correct links!
Yes. Now we need to visit each of these links and get to the XML file. After you’ve downloaded the XML file a couple of hundred times, it becomes obvious that the URL of the book page and the URL of the XML archive are nearly identical: we just need to substitute ‘index.html’ with ‘xml.zip’. This is a movie-style shortcut, but these things happen more often than you’d imagine, for one simple reason: the people who built this service are a lot like you, and their work is systematic and predictable. Use that as leverage. Back to code:
books_links = []

page_links.each do |link|
  page = link.click
  page.links.each do |l|
    next unless /\/\S+\/index\.html/.match? l.uri.to_s
    relative_url = l.uri.to_s.gsub('index.html', 'xml.zip')
    relative_url[0] = '' # most efficient first char removal according to THE INTERNET
    books_links << [base_url, relative_url].join
  end
end
There are the links to all the XML files! YAY!
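A quick sanity check never hurts before moving on; this just prints how many archive URLs we collected and a few samples:

  puts books_links.size
  puts books_links.take(3)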
Now, once you’ve subscribed for the next parts, let’s really get it going. Meet Composable Operations:
Composable Operations is a tool set for creating operations and assembling multiple of these operations in operation pipelines. An operation is, at its core, an implementation of the strategy pattern and in this sense an encapsulation of an algorithm. An operation pipeline is an assembly of multiple operations and useful for implementing complex algorithms. Pipelines themselves can be part of other pipelines.
This is precisely what we want. (If the strategy pattern is new to you, it’s worth reading up on.) Before porting our crawling code into an operation, let’s look at the building blocks.
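Here’s a minimal sketch of what an operation and a pipeline look like. Shout and Exclaim are throwaway examples, not part of our project:

  require 'composable_operations'

  class Shout < ComposableOperations::Operation
    processes :text

    def execute
      text.upcase
    end
  end

  class Exclaim < ComposableOperations::Operation
    processes :text

    def execute
      "#{text}!"
    end
  end

  # a pipeline: the output of each operation is fed into the next one
  class ShoutAndExclaim < ComposableOperations::ComposedOperation
    use Shout
    use Exclaim
  end

  ShoutAndExclaim.perform("hello") # => "HELLO!"

That’s exactly the shape our download, scrape and post-process steps will take. With that in mind, here’s the deal: the download step as a proper operation.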
require 'mechanize'
require 'composable_operations'

module Download
  class Law < ComposableOperations::Operation
    before do
      @agent = Mechanize.new
    end

    def execute
      base_url = "http://www.gesetze-im-internet.de"
      start_url = "#{base_url}/aktuell.html"
      page = @agent.get(start_url)
      page_links = []
      books_links = []

      # the single-character links of the alphabetical index
      page.links.each do |l|
        if l.text.length == 1
          page_links << l
        end
      end

      # visit every index page and collect the xml.zip URL of each book
      page_links.each do |link|
        page = link.click
        page.links.each do |l|
          next unless /\/\S+\/index\.html/.match? l.uri.to_s
          relative_url = l.uri.to_s.gsub('index.html', 'xml.zip')
          success(self, relative_url)
          relative_url[0] = '' # most efficient first char removal according to THE INTERNET
          books_links << [base_url, relative_url].join
        end
      end

      books_links
    end
  end
end
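With the class in place, a single call from IRB (or a rake task) kicks off the whole crawl. A minimal sketch, assuming the file above is saved as download/law.rb:

  require_relative 'download/law'

  links = Download::Law.perform
  puts "#{links.size} book archives found"
  puts links.take(3)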
Next step: parsing XML
Weapon of choice: Nokogiri
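Before diving into the full operation, here’s roughly what the law XML looks like and how Nokogiri queries it. The sample document is heavily simplified and made up; only the elements we actually read below are shown:

  require 'nokogiri'

  xml = Nokogiri::XML(<<~XML)
    <dokumente>
      <norm doknr="BJNR0010">
        <metadaten>
          <jurabk>GG</jurabk>
          <langue>Grundgesetz für die Bundesrepublik Deutschland</langue>
          <enbez>Art 1</enbez>
          <titel>Titel des Artikels</titel>
        </metadaten>
      </norm>
    </dokumente>
  XML

  docs = xml.search("norm")                          # every paragraph/article of the book
  puts docs.first.xpath('./metadaten/jurabk').text   # short label, e.g. "GG"
  puts docs.first.xpath('./metadaten/langue').text   # long title ('langue' = Langüberschrift)
  puts docs.first.attribute('doknr').text            # unique document number

That’s all the Nokogiri we need, plus an XSLT stylesheet for the HTML rendering. Here’s the whole scrape step as an operation: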
require 'nokogiri'
require 'composable_operations'
require 'active_support/core_ext/hash/indifferent_access' # for with_indifferent_access

module Scrape
  class Law < ComposableOperations::Operation
    processes :book # a single xml.zip link gathered by the Download::Law operation

    def execute
      documents = []

      # download(book) fetches the zip archive, e.g. via Mechanize; the helper is not shown here
      page = download(book)
      if page.nil?
        error(self, "Empty Page Error #{book}")
        return []
      end

      # unpacks the archive and returns the contained XML as a Nokogiri document
      xml = RemoteArchiveExtractor.perform(page.body)
      if xml.nil?
        error(self, "Empty XML Error #{page.uri}")
        return []
      end

      # XSLT (https://en.wikipedia.org/wiki/XSLT) is a way to transform XML
      # into all possible formats, or custom ones like in our case.
      # There isn't always a sane way to parse or transform XML by hand,
      # so if there's something that can help, it's XSLT.
      xslt = Nokogiri::XSLT(File.read(File.join('inventory', 'import', 'federal_law_de.xsl')))

      docs = xml.search("norm")

      # book attributes
      book_label = docs.first.xpath('./metadaten/jurabk').text
      book_title = docs.first.xpath('./metadaten/langue').text # 'langue' = Langüberschrift, the long title
      book_uid = docs.first.attribute('doknr').text
      book_meta = { legislation: 'federal' }
      book_attributes = { label: book_label, title: book_title, uid: book_uid, meta_data: book_meta }.with_indifferent_access

      # Abstract away saving to the database:
      # create the book in any way suitable for your needs by implementing the following operation.
      # In our case it is ActiveRecord saving to a PostgreSQL database.
      book_obj = Postprocess::Book::Create.perform(book_attributes)

      # let's count the paragraphs (documents) of the book
      index = 0
      docs.each do |d|
        # document attributes
        title = d.xpath('./metadaten/titel').text
        label = d.xpath('./metadaten/enbez').text
        next if title.include? "Inhaltsverzeichnis" # Table of Contents

        uid = d.attribute("doknr").text
        extracted_label = Analysis::PrefixExtractor.perform(label)
        label = extracted_label.dig(:label)
        prefix = extracted_label.dig(:prefix)

        # XSLT operates on documents, so wrap the current norm node before transforming it
        norm_doc = Nokogiri::XML(d.to_xml)
        doc = xslt.transform(norm_doc)

        # `content` (the norm's plain text) and `exclude_from_search` are produced by
        # code omitted from this excerpt
        html_content = content.empty? ? nil : doc.to_html

        # coming in next episode...
        keywords = Analysis::KeywordExtractor.perform([title, content].flatten)
        meta = { legislation: 'federal', keywords: keywords, type: "law", order: index }

        # increment the documents count
        index += 1

        # the document's attributes
        d = { uid: uid, label: label, title: title, content: content, book_id: book_obj.id,
              html_content: html_content, prefix: prefix, exclude_from_search: exclude_from_search, slug: urlify(label),
              meta_title: Preprocess::WhitespaceRemover.perform("#{prefix} #{label} #{book_obj.meta_title} #{title}"),
              meta_data: meta }.with_indifferent_access

        documents << d
        success(self, d)
      end

      error(self, "No documents found on #{page.uri}") if documents.empty?

      documents
    rescue StandardError => e
      puts e.message
      puts e.backtrace.inspect
      []
    end
  end
end
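Wiring the two operations together already gets us surprisingly close to the execute method from the top of the post. A minimal sketch, assuming both classes above are loaded:

  books = Download::Law.perform

  until books.empty?
    book = books.shift
    documents = Scrape::Law.perform(book)
    next if documents.empty?
    # handle_deleted, handle_updated and post_process come in the next parts
  end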
We’ll look at the KeywordExtractor in Part 2, but I can show you this little fellow below first. It’s not only useful for strings that need a bit of whitespace trimming, it also shows how simple it is to add new operations.
module Preprocess
  class WhitespaceRemover < ComposableOperations::Operation
    processes :string

    def execute
      string.delete("\t\n").gsub(/[[:space:]]+/, ' ').strip
    end
  end
end
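A quick example of what it does (the input string is made up):

  Preprocess::WhitespaceRemover.perform("  Art 1 \t GG \n deleted ")
  # => "Art 1 GG deleted"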
The last piece I’m willing to share in this session extends the Operation class itself; it could also be added as a helper module instead. Either way, I strongly recommend abstracting the success and error logging away like this, because it gives you fine-grained control over the output, from a simple dot to something more verbose.
class ComposableOperations::Operation
  def success(operation, obj)
    quiet = true
    if quiet
      print "."
    else
      puts "#{operation} successfully finished processing #{obj.inspect}"
    end
  end

  def error(operation, error)
    puts "#{urlify(operation)} exit error #{error}"
  end

  # String#to_slug comes from a slug gem such as babosa
  def urlify(string)
    string.to_s.to_slug.normalize(transliterations: :german).to_s
  end
end
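The hard-coded quiet flag is the obvious thing to improve. One option is to drive it from an environment variable; VERBOSE is my own naming here, nothing the gem provides:

  class ComposableOperations::Operation
    def success(operation, obj)
      if ENV['VERBOSE']
        puts "#{operation} successfully finished processing #{obj.inspect}"
      else
        print "." # one dot per processed document keeps long runs readable
      end
    end
  end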