This is Part 1 of a 5-part series on building an ETL platform with Ruby.
AC (acceptance criteria): download all federal German law by calling one method.
def execute
  links = crawl_links
  books = scrape_books(links)
  until books.empty?
    book = books.shift
    documents = get_documents(book)
    next if documents.empty?
    handle_deleted
    handle_updated
    post_process(book)
  end
end
If this looks like pseudo-code, that’s because it is. BUT the code we’re going to have at the end of this series will look pretty similar. Okay, so that’s the big plan. In this blog post, we’re going to tackle the part that gets the links.
Before we start the implementation, we need to answer several questions:
- What’s the root URL?
  - In our case, the data lives at http://www.gesetze-im-internet.de
- Where do the links come from?
  - There’s an alphabetical list of all the law books at http://www.gesetze-im-internet.de/aktuell.html
- Which piece of data do we need?
  - There’s a compiled XML version of each book, linked from the book’s page. We need to extract the XML and import the book’s attributes into our database.
Meet Mechanize:
The Mechanize library is used for automating interaction with websites. Mechanize automatically stores and sends cookies, follows redirects, and can follow links and submit forms. Form fields can be populated and submitted. Mechanize also keeps track of the sites that you have visited as a history.
Add gem 'mechanize' to your Gemfile and run bundle install (or simply gem install mechanize).
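If you’re starting from a blank project, a minimal Gemfile for this post could look like the following; nokogiri and composable_operations only come into play further down:

  # Gemfile
  source 'https://rubygems.org'

  gem 'mechanize'              # crawling
  gem 'nokogiri'               # XML parsing, used later in this post
  gem 'composable_operations'  # the operation framework, used later in this post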
Now, open an IRB console and let’s walk through it step by step.
require 'mechanize'

@agent = Mechanize.new
base_url = "http://www.gesetze-im-internet.de"
start_url = "#{base_url}/aktuell.html"

page = @agent.get(start_url)
puts page.links
There are the links!
Actually, no. These are all the links on that page. We only need a handful of them, namely the ones whose text is a single character (the letters of the alphabetical index).
page_links = []
page.links.each do |l|
  if l.text.length == 1
    page_links << l
  end
end
There are the correct links!
Yes. Now we need to visit each of these links and get to the XML file. After you’ve downloaded the XML file a couple of hundred times, it becomes obvious that the URL of the book page and the URL of the XML archive are nearly identical: we just need to substitute ‘index.html’ with ‘xml.zip’. This is a movie-style shortcut, but these things happen more often than you’d imagine, for one simple reason: the people who built this service are a lot like you, and their work is systematic and predictable. Use that as leverage. Back to code:
books_links = []

page_links.each do |link|
  page = link.click
  page.links.each do |l|
    next unless /\/\S+\/index\.html/.match? l.uri.to_s
    relative_url = l.uri.to_s.gsub('index.html', 'xml.zip')
    relative_url[0] = '' # most efficient first char removal according to THE INTERNET
    books_links << [base_url, relative_url].join
  end
end
There are the links to all the XML files! YAY!
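A quick sanity check never hurts before moving on; this just prints how many archive URLs we collected and a few samples:

  puts books_links.size
  puts books_links.take(3)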
Now, once you’ve subscribed for the next parts, let’s really get it going. Meet Composable Operations:
Composable Operations is a tool set for creating operations and assembling multiple of these operations in operation pipelines. An operation is, at its core, an implementation of the strategy pattern and in this sense an encapsulation of an algorithm. An operation pipeline is an assembly of multiple operations and useful for implementing complex algorithms. Pipelines themselves can be part of other pipelines.
This is precisely what we want. (If the strategy pattern is new to you, it’s worth reading up on.) Before porting our crawling code into an operation, let’s look at the building blocks.
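Here’s a minimal sketch of what an operation and a pipeline look like. Shout and Exclaim are throwaway examples, not part of our project:

  require 'composable_operations'

  class Shout < ComposableOperations::Operation
    processes :text

    def execute
      text.upcase
    end
  end

  class Exclaim < ComposableOperations::Operation
    processes :text

    def execute
      "#{text}!"
    end
  end

  # a pipeline: the output of each operation is fed into the next one
  class ShoutAndExclaim < ComposableOperations::ComposedOperation
    use Shout
    use Exclaim
  end

  ShoutAndExclaim.perform("hello") # => "HELLO!"

That’s exactly the shape our download, scrape and post-process steps will take. With that in mind, here’s the deal: the download step as a proper operation.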
require 'mechanize'
require 'composable_operations'

module Download
  class Law < ComposableOperations::Operation
    before do
      @agent = Mechanize.new
    end

    def execute
      base_url = "http://www.gesetze-im-internet.de"
      start_url = "#{base_url}/aktuell.html"
      page = @agent.get(start_url)
      page_links = []
      books_links = []

      # the single-character links of the alphabetical index
      page.links.each do |l|
        if l.text.length == 1
          page_links << l
        end
      end

      # visit every index page and collect the xml.zip URL of each book
      page_links.each do |link|
        page = link.click
        page.links.each do |l|
          next unless /\/\S+\/index\.html/.match? l.uri.to_s
          relative_url = l.uri.to_s.gsub('index.html', 'xml.zip')
          success(self, relative_url)
          relative_url[0] = '' # most efficient first char removal according to THE INTERNET
          books_links << [base_url, relative_url].join
        end
      end

      books_links
    end
  end
end
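With the class in place, a single call from IRB (or a rake task) kicks off the whole crawl. A minimal sketch, assuming the file above is saved as download/law.rb:

  require_relative 'download/law'

  links = Download::Law.perform
  puts "#{links.size} book archives found"
  puts links.take(3)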
Next step: parsing XML
Weapon of choice: Nokogiri
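Before diving into the full operation, here’s roughly what the law XML looks like and how Nokogiri queries it. The sample document is heavily simplified and made up; only the elements we actually read below are shown:

  require 'nokogiri'

  xml = Nokogiri::XML(<<~XML)
    <dokumente>
      <norm doknr="BJNR0010">
        <metadaten>
          <jurabk>GG</jurabk>
          <langue>Grundgesetz für die Bundesrepublik Deutschland</langue>
          <enbez>Art 1</enbez>
          <titel>Titel des Artikels</titel>
        </metadaten>
      </norm>
    </dokumente>
  XML

  docs = xml.search("norm")                          # every paragraph/article of the book
  puts docs.first.xpath('./metadaten/jurabk').text   # short label, e.g. "GG"
  puts docs.first.xpath('./metadaten/langue').text   # long title ('langue' = Langüberschrift)
  puts docs.first.attribute('doknr').text            # unique document number

That’s all the Nokogiri we need, plus an XSLT stylesheet for the HTML rendering. Here’s the whole scrape step as an operation: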
require 'nokogiri'
require 'composable_operations'
require 'active_support/core_ext/hash/indifferent_access' # for with_indifferent_access

module Scrape
  class Law < ComposableOperations::Operation
    processes :book # a single xml.zip link gathered by the Download::Law operation

    def execute
      documents = []

      # download(book) fetches the zip archive, e.g. via Mechanize; the helper is not shown here
      page = download(book)
      if page.nil?
        error(self, "Empty Page Error #{book}")
        return []
      end

      # unpacks the archive and returns the contained XML as a Nokogiri document
      xml = RemoteArchiveExtractor.perform(page.body)
      if xml.nil?
        error(self, "Empty XML Error #{page.uri}")
        return []
      end

      # XSLT (https://en.wikipedia.org/wiki/XSLT) is a way to transform XML
      # into all possible formats, or custom ones like in our case.
      # There isn't always a sane way to parse or transform XML by hand,
      # so if there's something that can help, it's XSLT.
      xslt = Nokogiri::XSLT(File.read(File.join('inventory', 'import', 'federal_law_de.xsl')))

      docs = xml.search("norm")

      # book attributes
      book_label = docs.first.xpath('./metadaten/jurabk').text
      book_title = docs.first.xpath('./metadaten/langue').text # 'langue' = Langüberschrift, the long title
      book_uid = docs.first.attribute('doknr').text
      book_meta = { legislation: 'federal' }
      book_attributes = { label: book_label, title: book_title, uid: book_uid, meta_data: book_meta }.with_indifferent_access

      # Abstract away saving to the database:
      # create the book in any way suitable for your needs by implementing the following operation.
      # In our case it is ActiveRecord saving to a PostgreSQL database.
      book_obj = Postprocess::Book::Create.perform(book_attributes)

      # let's count the paragraphs (documents) of the book
      index = 0
      docs.each do |d|
        # document attributes
        title = d.xpath('./metadaten/titel').text
        label = d.xpath('./metadaten/enbez').text
        next if title.include? "Inhaltsverzeichnis" # Table of Contents

        uid = d.attribute("doknr").text
        extracted_label = Analysis::PrefixExtractor.perform(label)
        label = extracted_label.dig(:label)
        prefix = extracted_label.dig(:prefix)

        # XSLT operates on documents, so wrap the current norm node before transforming it
        norm_doc = Nokogiri::XML(d.to_xml)
        doc = xslt.transform(norm_doc)

        # `content` (the norm's plain text) and `exclude_from_search` are produced by
        # code omitted from this excerpt
        html_content = content.empty? ? nil : doc.to_html

        # coming in next episode...
        keywords = Analysis::KeywordExtractor.perform([title, content].flatten)
        meta = { legislation: 'federal', keywords: keywords, type: "law", order: index }

        # increment the documents count
        index += 1

        # the document's attributes
        d = { uid: uid, label: label, title: title, content: content, book_id: book_obj.id,
              html_content: html_content, prefix: prefix, exclude_from_search: exclude_from_search, slug: urlify(label),
              meta_title: Preprocess::WhitespaceRemover.perform("#{prefix} #{label} #{book_obj.meta_title} #{title}"),
              meta_data: meta }.with_indifferent_access

        documents << d
        success(self, d)
      end

      error(self, "No documents found on #{page.uri}") if documents.empty?

      documents
    rescue StandardError => e
      puts e.message
      puts e.backtrace.inspect
      []
    end
  end
end
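Wiring the two operations together already gets us surprisingly close to the execute method from the top of the post. A minimal sketch, assuming both classes above are loaded:

  books = Download::Law.perform

  until books.empty?
    book = books.shift
    documents = Scrape::Law.perform(book)
    next if documents.empty?
    # handle_deleted, handle_updated and post_process come in the next parts
  end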
We’ll look at the KeywordExtractor in Part 2, but I can show you this little fellow below first. It’s not only useful for strings that need a bit of whitespace trimming, it also shows how simple it is to add new operations.
module Preprocess
  class WhitespaceRemover < ComposableOperations::Operation
    processes :string

    def execute
      string.delete("\t\n").gsub(/[[:space:]]+/, ' ').strip
    end
  end
end
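A quick example of what it does (the input string is made up):

  Preprocess::WhitespaceRemover.perform("  Art 1 \t GG \n deleted ")
  # => "Art 1 GG deleted"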
The last piece I’m willing to share in this session extends the Operation class itself; it could also be added as a helper module instead. Either way, I strongly recommend abstracting the success and error logging away like this, because it gives you fine-grained control over the output, from a simple dot to something more verbose.
class ComposableOperations::Operation
  def success(operation, obj)
    quiet = true
    if quiet
      print "."
    else
      puts "#{operation} successfully finished processing #{obj.inspect}"
    end
  end

  def error(operation, error)
    puts "#{urlify(operation)} exit error #{error}"
  end

  # String#to_slug comes from a slug gem such as babosa
  def urlify(string)
    string.to_s.to_slug.normalize(transliterations: :german).to_s
  end
end
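The hard-coded quiet flag is the obvious thing to improve. One option is to drive it from an environment variable; VERBOSE is my own naming here, nothing the gem provides:

  class ComposableOperations::Operation
    def success(operation, obj)
      if ENV['VERBOSE']
        puts "#{operation} successfully finished processing #{obj.inspect}"
      else
        print "." # one dot per processed document keeps long runs readable
      end
    end
  end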