Get all the data!
Nowadays it’s a hell of a business to get hold of data and, by analyzing it, to provide additional value. I’ve worked twice on projects that acquire data, transform it on different levels, and charge for access to the added value. As this is still a valid business case in pretty much any industry, I’m sharing the lessons I’ve learned on how to create an ETL with Ruby.
But why with Ruby? Isn’t it better to do it in Go? – somebody
Here and here are some active gems on various topics in NLP. Ruby isn’t insanely fast and optimized the way Go is, for example, but the pool of documents I was interested in, and probably most of you too, is about 200,000+ law documents, some of which I had to check daily. So I’d argue it’s faster to create an ETL platform for your ML models or data-driven business with a scripting language like Ruby (aka “the slow”) when we keep in mind how fast and easy it is to get something up and running.
For the counter-argument of volume and scaling, I have one magical keyword: #parallelism. ETL is a collection of I/O-intensive processes that make good use of multi-core machines if done correctly.
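To make the parallelism point a bit more concrete, here is a minimal sketch of running I/O-bound work on a small pool of threads. It is not the code from the series: `parallel_map`, `fake_download` and the example.org URLs are hypothetical names invented for illustration, and the download is simulated with `sleep`. In CRuby, threads mainly overlap I/O waits; spreading CPU-bound steps across cores usually calls for multiple processes instead.

```ruby
# Hypothetical helper (not from the series' repository): maps a block over
# items using a fixed number of worker threads, preserving input order.
def parallel_map(items, workers: 4, &block)
  queue = Queue.new
  items.each_with_index { |item, index| queue << [item, index] }
  queue.close # workers stop once the queue is drained

  results = Array.new(items.size)

  threads = Array.new(workers) do
    Thread.new do
      # Queue#pop returns nil when the queue is closed and empty.
      while (pair = queue.pop)
        item, index = pair
        results[index] = block.call(item)
      end
    end
  end

  threads.each(&:join)
  results
end

# Simulated I/O-bound step; a real ETL would issue an HTTP request here.
fake_download = lambda do |url|
  sleep(0.05)
  "body of #{url}"
end

urls = (1..8).map { |i| "https://example.org/doc/#{i}" }
bodies = parallel_map(urls, workers: 4) { |url| fake_download.call(url) }
```

Because each worker spends most of its time waiting, the eight simulated downloads overlap instead of running strictly one after another.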
In Part-2 through Part-4 we’ll apply some natural language pre- and post-processing. Some basic NLP functionality will expand the search capabilities and allow easy access to documents’ content for more advanced ML models.
What I aim to do here is start a 5-part series on creating an inventory loader type of service for the German Federal Law:
[Part-1] Composable operations; Crawl; Download; Normalize white spaces; XSLT
[WIP][Part-2] Pre- and post-processing: stemming; lemmatisation; stop words; black-listing
[Part-3] Take advantage of multi-core CPUs with parallel composable operations
[Part-4] Keyword extraction; document similarity calculation
[Part-5] Extend Analysis Module: dataset-internal link recognition
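As a rough preview of what “composable operations” from Part-1 can mean in Ruby, here is a small sketch that chains steps with `Proc#>>` (available since Ruby 2.6). All names here (`PAGES`, `fetch`, `normalize_whitespace`, `count_words`) are assumptions made up for this example and are not taken from the series’ repository; the download step is replaced by an in-memory lookup.

```ruby
# Hypothetical in-memory stand-in for the download step.
PAGES = { "doc-1" => "  Federal \n\n law   text\t" }.freeze

# Each operation is a lambda that takes one value and returns one value,
# which is what makes the steps composable.
fetch                = ->(id)   { PAGES.fetch(id) }
normalize_whitespace = ->(text) { text.gsub(/\s+/, " ").strip }
count_words          = ->(text) { { text: text, words: text.split(" ").size } }

# Proc#>> builds a new lambda that runs the steps left to right.
pipeline = fetch >> normalize_whitespace >> count_words

record = pipeline.call("doc-1")
# record[:text]  is "Federal law text"
# record[:words] is 3
```

The appeal of this shape is that a step can be added, removed or reordered without touching the others, as long as each one accepts what the previous one returns.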
The German Government has a nice open website with options to download documents as relatively well-formatted XML files. Kudos to them! There are some small catches, of course, which we’ll discuss in one of the bonus parts. The truth is, access is seldom as easy as it is for government documents in Germany, and the developer often needs magic tricks like switching IPs, solving or disabling captchas, etc. Some of these topics are going to be touched upon, but not in great detail. Ping me if you want to read more on that.
Show me the code
Code is available here. The repository is rough around the edges and shouldn’t be used for production purposes. I’ll try to put it into shape as we proceed through the series. Be advised that the posts are code-heavy as well.