Get all the data!
Nowadays it’s quite a business to get hold of data and, by analyzing it, provide additional value. I’ve worked on two projects that acquire data, transform it at different levels, and charge you for access to the extras. As this remains a valid business case, no matter the industry, I’m sharing the lessons I’ve learned on how to create an ETL with Ruby.
But why with Ruby? Isn’t it better to do it with Go? – somebody
Here and here are some mostly active gems on various NLP topics. Ruby isn’t insanely fast and optimized like Go, for example, but the pool of documents I was interested in (and probably the one most of you face) is around 200,000+ law documents, some of which I had to check daily. At that scale, I’d argue it’s better to spin up your own ETL platform for your ML models or data-driven business with a scripting language like Ruby (aka “the slow one”), given how fast and easy it is to get something up and running.
There is, of course, the argument of volume and scaling, but with simple multithreading, which by the way is easy in Ruby too, we can reduce the time to completion dramatically. Our little ETL is a collection of I/O-intensive processes: HTTP requests, downloading files, extracting ZIP archives, loading files into memory, parsing them, and saving parts of them to the database. This part was implemented before Ruby’s new actor model (Ractors) and definitely deserves to be rewritten using it.
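To illustrate the multithreading point, here is a minimal sketch of fanning I/O-bound work out to a small pool of threads using Ruby’s built-in Queue. The `process` method and the sleep are stand-ins for a real step such as an HTTP request or unzipping a file; they are illustrative, not code from the actual project.

```ruby
# Hypothetical stand-in for an I/O-bound ETL step (HTTP request, unzip, etc.)
def process(doc)
  sleep(0.01) # simulate I/O latency
  doc.upcase
end

docs    = %w[norm_a norm_b norm_c norm_d]
queue   = Queue.new       # thread-safe work queue
results = Queue.new       # thread-safe result collection
docs.each { |d| queue << d }

# A fixed pool of worker threads drains the queue concurrently;
# while one thread waits on I/O, the others keep working.
workers = 4.times.map do
  Thread.new do
    until queue.empty?
      doc = queue.pop(true) rescue break # non-blocking pop; stop when drained
      results << process(doc)
    end
  end
end
workers.each(&:join)

puts results.size # => 4
```

Because the work here is I/O-bound, plain threads help despite the GVL; for CPU-bound steps, processes or Ractors are the better fit, which is what Part-3 is about.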
What I will try to do here is start a 5-part series on how I created an inventory-loader type of service for the German Federal Law:
[Part-1] Composable operations; crawl; download; normalize white spaces; XSLT
[WIP][Part-2] Pre- and post-processing: stemming; lemmatisation; stop words; black-listing
[Part-3] Take advantage of multi-core CPUs with parallel composable operations
[Part-4] Keyword extraction; document similarity calculation
[Part-5] Extend the Analysis Module: dataset-internal link recognition
In Part-2 through Part-4 we’ll apply some natural language pre- and post-processing. Some basic NLP functionality will expand the search capabilities and allow easy access to the documents’ content for more advanced ML models.
The German Government has a nice open website with options to download documents as relatively well-formatted XML files. Kudos to them! With some small catches, which we’ll discuss in one of the bonus parts, of course. The truth is, access is seldom as easy as it is to government documents in Germany, and the developer needs to do magic tricks like switching IPs, solving/disabling captchas, etc. Some of these topics are going to be touched upon, but not in great detail. Ping me if you want to read more on that.
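As a taste of the download-and-parse step covered in Part-1, here is a minimal sketch using only Ruby’s standard library (net/http and REXML). The URL and the element names are placeholders, not the portal’s actual endpoints or schema.

```ruby
require "net/http"
require "rexml/document"

# Hypothetical fetch step: download one XML document over HTTP.
# The URL passed in is a placeholder for a real document endpoint.
def fetch_xml(url)
  response = Net::HTTP.get_response(URI(url))
  raise "HTTP #{response.code}" unless response.is_a?(Net::HTTPSuccess)
  REXML::Document.new(response.body)
end

# Parsing works the same on a local string, which is handy for tests.
# <law>/<title> is an invented structure for illustration.
doc = REXML::Document.new("<law><title>Example Act</title></law>")
puts doc.elements["law/title"].text # => "Example Act"
```

In the real service the raw XML is then normalized and pushed through XSLT, but the shape of the step is the same: fetch, parse, hand the document tree to the next operation.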