This is Part 1 of a 5-part series on building an ETL platform with Ruby.

AC (acceptance criteria): Download all German federal law by calling one method

def execute
  links = crawl_links          # collect the links to every law book
  books = scrape_books         # turn those links into book records
  until books.empty?
    book = books.shift
    documents = get_documents  # extract the book's paragraphs
    next if documents.empty?
    handle_deleted             # drop paragraphs that no longer exist upstream
    handle_updated             # refresh paragraphs that changed upstream
    post_process(book)
  end
end

If this looks like pseudo-code, that's because it is. BUT the code we're going to have at the end of this series will look pretty similar. Okay, so that's the big plan. In this blog post, we're going to discuss the part that gets the links.

Before we start the implementation, we need to answer several questions:

  1. What’s the root URL?
    • In our case, the data lives at http://www.gesetze-im-internet.de
  2. Where do the links come from?
    • There’s an alphabetical list of all the law books at http://www.gesetze-im-internet.de/aktuell.html
  3. Which piece of data do we need?
    • Each book page links to a compiled XML version of that book. We need to extract the XML and import the book’s parameters into our database.

Meet Mechanize:

The Mechanize library is used for automating interaction with websites. Mechanize automatically stores and sends cookies, follows redirects, and can follow links and submit forms. Form fields can be populated and submitted. Mechanize also keeps track of the sites that you have visited as a history.

Add gem 'mechanize' to your Gemfile and run bundle install (or simply gem install mechanize).
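For reference, a minimal Gemfile covering this part of the series could look like the sketch below (composable_operations only shows up a bit later in the post):

# Gemfile
source 'https://rubygems.org'

gem 'mechanize'              # crawling, link handling, cookies
gem 'nokogiri'               # XML parsing (also pulled in by mechanize)
gem 'composable_operations'  # the operation/pipeline toolkit used below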
Now, open an IRB console and let's go through it step by step.

require 'mechanize'

agent = Mechanize.new
base_url = "http://www.gesetze-im-internet.de"
start_url = "#{base_url}/aktuell.html"
page = agent.get(start_url)
puts page.links

Actually, no. These are all the links on that page. We only need a handful of them, namely the ones whose text is exactly one character long - the alphabetical index.

page_links = []
page.links.each do |l|
  if l.text.length == 1
    page_links << l
  end
end
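If you prefer a one-liner, Enumerable#select does the same job:

page_links = page.links.select { |l| l.text.length == 1 }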

Yes. Now we need to visit each of these links and get that XML file. After you've downloaded the XML a couple of hundred times, it becomes obvious that the URL of a book page and the URL of its XML archive are nearly identical - we just need to substitute ‘index.html’ with ‘xml.zip’. This is a movie-style shortcut, but these things happen more often than one could imagine, for one simple reason - the people who built this service are kinda similar to you: their work is systematic and predictable. One should use that as leverage. Back to code:

books_links = []
page_links.each do |link|
  page = link.click
  page.links.each do |l|
    next unless /\/\S+\/index\.html/.match? l.uri.to_s
    relative_url = l.uri.to_s.gsub('index.html', 'xml.zip')
    relative_url[0] = '' # drop the leading "." from the relative "./<book>/xml.zip"
    books_links << [base_url, relative_url].join
  end
end
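To make the shortcut concrete, here's the rewrite for a single book, taking the Civil Code (BGB) as an example - its index page sits at ./bgb/index.html relative to the site root:

relative_url = "./bgb/index.html".gsub('index.html', 'xml.zip')
# => "./bgb/xml.zip"
relative_url[0] = '' # drop the leading "."
relative_url
# => "/bgb/xml.zip"
["http://www.gesetze-im-internet.de", relative_url].join
# => "http://www.gesetze-im-internet.de/bgb/xml.zip"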

Now let's really get it going. Meet Composable Operations:

Composable Operations is a tool set for creating operations and assembling multiple of these operations in operation pipelines. An operation is, at its core, an implementation of the strategy pattern and in this sense an encapsulation of an algorithm. An operation pipeline is an assembly of multiple operations and useful for implementing complex algorithms. Pipelines themselves can be part of other pipelines.
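To get a feel for the gem before we use it for real, here's a minimal sketch of two toy operations chained into a pipeline (the operation names are made up for illustration; the processes and use DSL comes from the gem):

require 'composable_operations'

class Double < ComposableOperations::Operation
  processes :number

  def execute
    number * 2
  end
end

class AddOne < ComposableOperations::Operation
  processes :number

  def execute
    number + 1
  end
end

class DoubleThenAddOne < ComposableOperations::ComposedOperation
  use Double
  use AddOne
end

DoubleThenAddOne.perform(3) # => 7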

This is precisely what we want. Read up on the strategy pattern if you haven't met it before. Here's the deal:

require 'mechanize'

module Download
  class Law < ComposableOperations::Operation
    before do
      @agent = Mechanize.new
    end

    def execute
      base_url = "http://www.gesetze-im-internet.de"
      start_url = "#{base_url}/aktuell.html"
      page = @agent.get(start_url)
      page_links = []
      books_links = []
      # collect the single-character links of the alphabetical index
      page.links.each do |l|
        if l.text.length == 1
          page_links << l
        end
      end
      # visit each index page and turn every book link into an xml.zip URL
      page_links.each do |link|
        page = link.click
        page.links.each do |l|
          next unless /\/\S+\/index\.html/.match? l.uri.to_s
          relative_url = l.uri.to_s.gsub('index.html', 'xml.zip')
          success(self, relative_url)
          relative_url[0] = '' # drop the leading "." from the relative "./<book>/xml.zip"
          books_links << [base_url, relative_url].join
        end
      end
      books_links
    end
  end
end
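Assuming the operation above is loaded, a run in IRB looks roughly like this (the exact entries depend on the site's current content):

books_links = Download::Law.perform
books_links.first # e.g. "http://www.gesetze-im-internet.de/bgb/xml.zip"
books_links.size  # several thousand links, one per book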
    

Next step: parsing XML

Weapon of choice: Nokogiri
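If you haven't used Nokogiri before, the two calls we rely on below are Nokogiri::XML for parsing and search/xpath for querying. A quick, hypothetical IRB session against a locally saved book XML:

require 'nokogiri'

xml = Nokogiri::XML(File.read('bgb.xml'))     # hypothetical local copy of one book's XML
norms = xml.search('norm')                    # every <norm> node: the book itself plus its paragraphs
norms.first.xpath('./metadaten/jurabk').text  # => short label of the book, e.g. "BGB"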

require 'nokogiri'
require 'active_support/core_ext/hash/indifferent_access'

module Scrape
  class Law < ComposableOperations::Operation
    processes :book # a book link gathered by the Download::Law operation

    def execute
      begin
        documents = []

        # download is a small Mechanize helper (not shown here) that fetches the xml.zip
        page = download(book)
        if page.nil?
          error(self, "Empty Page Error #{book}")
          return []
        end
        xml = RemoteArchiveExtractor.perform(page.body)
        if xml.nil?
          error(self, "Empty XML Error #{page.uri}")
          return []
        end

        # XSLT https://en.wikipedia.org/wiki/XSLT
        # is a way to transform XML into all kinds of formats,
        # or to create custom ones, like in our case.
        # There isn't always a sane way to parse or transform XML by hand,
        # so if there's something that can help, it's XSLT.
        xslt = Nokogiri::XSLT(File.read(File.join('inventory', 'import', 'federal_law_de.xsl')))

        docs = xml.search("norm")

        # get the book attributes from the first norm node
        book_label = docs.first.xpath('./metadaten/jurabk').text
        book_title = docs.first.xpath('./metadaten/langue').text
        book_uid = docs.first.attribute('doknr').text
        book_meta = { legislation: 'federal' }
        book_attributes = { label: book_label, title: book_title, uid: book_uid, meta_data: book_meta }.with_indifferent_access

        # abstract away saving to the database:
        # create the book in any way suitable for your needs by implementing the following operation.
        # In this case it is ActiveRecord saving to a PostgreSQL database.
        book_obj = Postprocess::Book::Create.perform(book_attributes)

        # let's count the paragraphs (documents) of the book
        index = 0
        docs.each do |d|

          # document attributes
          title = d.xpath('./metadaten/titel').text
          label = d.xpath('./metadaten/enbez').text
          next if title.include? "Inhaltsverzeichnis" # Table of Contents
          uid = d.attribute("doknr").text
          extracted_label = Analysis::PrefixExtractor.perform(label)
          label = extracted_label.dig(:label)
          prefix = extracted_label.dig(:prefix)
          content = d.xpath('./textdaten/text').text # assumption: the norm's body text lives under <textdaten><text>
          # wrap the single norm node in its own document so the stylesheet can transform it
          doc = xslt.transform(Nokogiri::XML(d.to_xml))
          html_content = content.empty? ? nil : doc.to_html
          exclude_from_search = content.empty? # assumption: norms without body text are hidden from search

          # coming in next episode...
          keywords = Analysis::KeywordExtractor.perform([title, content].flatten)
          meta = { legislation: 'federal', keywords: keywords, type: "law", order: index }

          # increment documents count
          index += 1

          # document's attributes
          d = { uid: uid, label: label, title: title, content: content, book_id: book_obj.id,
                html_content: html_content, prefix: prefix, exclude_from_search: exclude_from_search, slug: urlify(label),
                meta_title: Preprocess::WhitespaceRemover.perform("#{prefix} #{label} #{book_obj.meta_title} #{title}"), meta_data: meta }.with_indifferent_access
          documents << d
          success(self, d)
        end
        if documents.empty?
          error(self, "No documents found on #{page.uri}")
        end
      rescue StandardError => e
        puts e.message
        puts e.backtrace.inspect
        return []
      end
      documents
    end
  end
end
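Assuming both operations (and the helper operations they call) are loaded, wiring them together is a plain loop for now - the composed pipeline from the intro comes later in the series:

books_links = Download::Law.perform
books_links.each do |link|
  documents = Scrape::Law.perform(link)
  # documents is an array of attribute hashes for the book's paragraphs;
  # the book itself has already been persisted by Postprocess::Book::Create
end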
    
    

We’ll look at the KeywordExtractor in Part 2, but I can show you this little fellow below first. It's not only useful for strings that need a bit of whitespace trimming, it also shows how simple it is to add new operations.

module Preprocess
  class WhitespaceRemover < ComposableOperations::Operation
    processes :string

    def execute
      string.delete("\t\n").gsub(/[[:space:]]+/, ' ').strip
    end
  end
end
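Usage is the same .perform call as everywhere else:

Preprocess::WhitespaceRemover.perform("§ 90   Begriff der Sache \n")
# => "§ 90 Begriff der Sache"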
    

The last operation I'm willing to share in this session extends the Operation class itself; it could also be added as a helper module instead. Either way, I strongly recommend abstracting away the success and error logging like this - it gives you fine-grained control, e.g. a terse output or a more verbose one.

class ComposableOperations::Operation
  def success(operation, obj)
    quiet = true
    if quiet
      print "."
    else
      puts "#{operation} successfully finished processing #{obj.inspect}"
    end
  end

  def error(operation, error)
    puts "#{urlify(operation)} exit error #{error}"
  end

  def urlify(string)
    # String#to_slug comes from the babosa gem
    string.to_s.to_slug.normalize(transliterations: :german).to_s
  end
end
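With quiet left at true, every successfully processed document prints a single dot, so a full run reads like a poor man's progress bar. The urlify helper leans on the babosa gem; with the German transliterations, umlauts come out the way a German reader would expect, which is exactly what we want for slugs. Inside any operation it should behave roughly like this:

urlify("Bürgerliches Gesetzbuch") # => "buergerliches-gesetzbuch"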