Massively modifying images with ImageMagick

    Web editors often need to edit a large number of images. For example, the large images produced by professional cameras tend to be overkill for most sites. I wanted a quick and easy way to resize large images, and that is how I found ImageMagick.

    ImageMagick is a suite of tools; according to its man page, we can “use it to convert between image formats as well as resize an image, blur, crop, despeckle, dither, draw on, flip, join, re-sample, and much more”. First, let’s install ImageMagick:
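
    The exact command depends on your distribution’s package manager; on Debian or Ubuntu, for example, it would look like this:

        $ sudo apt-get update
        $ sudo apt-get install imagemagick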

    Then, we can use the convert command to do the actual editing. Check the man page to see the astounding number of options this command has. For example, if I wanted to resize all the JPG images in the current directory to a width of 1280 pixels and save each result under the same name prefixed with “min-”, I would execute the following command:
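
    A one-line shell loop does the job (a sketch; adjust the *.jpg pattern to match your file names):

        $ for image in *.jpg; do convert "$image" -resize 1280 "min-$image"; done

    Here -resize 1280 sets the width to 1280 pixels and lets ImageMagick compute the height so the aspect ratio is preserved, and each output file name is simply the input name with the “min-” prefix.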

    And herein lies the advantage of ImageMagick: it can be used in a script to edit images extremely quickly. ImageMagick can also be used on Mac OS X and Windows. For more information about the convert command, refer to http://www.imagemagick.org/script/convert.php


Alan Verdugo / 2016/12/12 / Uncategorized / 0 Comments

CompTIA Linux+ certification

    I recently completed the CompTIA Linux+ certification. I spent much more time on this than I had hoped, and because of that, I wanted to write about it. After all, this was the reason why I did not update this blog as frequently as I wanted.

    First of all, let me tell you about the basic stuff. I chose this particular certification as my first one because I am very interested in Linux and everything related to Open Source. Also, this particular certification has a 3-for-1 offer: if you complete the certification requirements, you will not only get the CompTIA Linux+ certification, you will also get the LPIC-1 and SUSE CLA certifications. Alas, after September 1st, 2016, SUSE decided to stop participating in this offer, so now it is actually a 2-for-1 offer, which is still pretty good in my opinion.

    In order to get the certification, you need to pass two exams: LX0-103 and LX0-104. Currently, each exam attempt costs $194 US dollars. Each exam consists of 60 questions that you can answer in a 90-minute period. In order to pass an exam, you need a minimum of 500 points (on a scale of 200 to 800). I am still not sure how the questions are graded.

Preparing for the exams.

    The only material I used for studying was the “CompTIA Linux+ Powered by Linux Professional Institute Study Guide: Exam LX0-103 and Exam LX0-104” (3rd edition). Its name alone should tell you how long and boring it is to read (like most technical books). However, it is the tool that allowed me to get certified, so it does deliver what it promises and I would recommend it. The book also includes a discount code for the exams and access to a website where you can study using flashcards and a practice exam.

    I admit I did not study consistently: there were days when I read the book for a couple of hours, and then I would not pick it up again for weeks because I just did not have the time. I know for a fact that proper discipline and a regular study schedule with this same book will result in better grades on the exams. However, I read the book three times from beginning to end. It was boring, painful, and I got sick of reading the same thing over and over again (I committed to not reading any other book until I got the certification), but it was worth it in the end.

Taking the exams.

    Once you have paid for and scheduled your exam, you just need to go to the PearsonVue center you selected. You only need to take a couple of official IDs with you. The lady who helped me was very kind and made sure to explain the whole process clearly. She asked me for my IDs, verified that my signature and picture matched, and then took another picture of me. All of this is to ensure that nobody else can take the exam and claim it was you. So, if you were thinking of asking a friend to go and take your certification test for you, it will simply not work. Security is very tight, and I think that is good. I was given a set of rules and told to agree to them. The rules basically say that you will not cheat and will not help other people cheat (which is practically impossible anyway).

    After that, I was given a key and told to put all my things in a drawer. You are not allowed to sit at the exam computer with your cellphone, jacket, keys, notebooks, or anything else that could be used to cheat. I was given a marker and a small whiteboard, which I was supposed to use as a notepad if needed.

    As for the actual questions, some are multiple choice, some are what I like to call multiple-multiple-choice (“choose the 3 correct answers from 5 options”), and in some you have to actually type the answer into a text box. I think 90 minutes is much more time than is actually needed for 60 questions: you either know the answer right away or you do not, and in both cases you only need a few seconds per question. I used my extra time to re-read and think about the answers I chose, since some of the questions can be very tricky.

    Once you finish the exam, you are given your grade, so you know right away whether you passed or not. The only “feedback” you receive is the list of exam objectives you failed. You never know which questions you answered incorrectly or why. If you failed a question related to network routing (for example), your results sheet will say that “network routing” is one of the exam objectives you failed. And that’s it. Of course, this is done to further ensure that you do not spread information about the questions or answers after you take the exam.

Lessons learned.

    I spent several months studying for the exams. Actually, I spent so much time studying that the original exams (LX0-101 and LX0-102) were updated to new versions, which made me start studying again with new study materials because the exams’ objectives were also updated. In the future I will try to complete certifications faster to avoid this. The SUSE CLA offer was removed just after I scheduled my second exam, but before I actually took it, so I lost that opportunity as well, just because I wanted more time to study. This is just another example of how quickly technology advances: you can literally watch projects become outdated in a matter of days. If you want to stay current, you need to move fast, and that is something not a lot of people can or want to do.

    Would I do this again? Yes, I would. Maybe not this year or even next, but I think certifications are valuable, not just because of the title on your CV, but because they show that you are willing to undertake a challenge, prepare for it, and actually achieve it, while learning new tricks in the process. Maybe CompTIA Linux+ and LPIC-1 are not as famous as the certifications from Red Hat, and I did pass both exams on my first try, but they were much harder than I expected, and because of that I think they should be taken more seriously by employers and recruiters. I considered myself an advanced Linux user with professional experience as a system administrator, but I was still able, and required, to learn many new things in order to get the certification; for that fact alone I think it was worth it.


Alan Verdugo / 2016/09/21 / Uncategorized / 0 Comments

Introduction to Apache Spark

   Spark is an open source cluster computing framework widely known for being extremely fast. It was started by AMPLab at UC Berkeley in 2009 and is now an Apache top-level project. Spark can run on its own or on top of, for example, Hadoop or Mesos, and it can access data from diverse sources, including HDFS, Cassandra, HBase and Hive. Spark shares some characteristics with Apache Hadoop, but they have important differences: Spark was developed to overcome the limitations of Hadoop’s MapReduce with regard to iterative algorithms and interactive data analysis.

[Figure: Logistic regression in Hadoop and Spark.]

   Since the very beginning, Spark showed great potential. Soon after its creation, Spark was already proving to be ten or twenty times faster than MapReduce for certain jobs. It is often said that it can be up to 100 times faster, and this has been proven many times. For that reason, it is now widely used in areas where analysis is fundamental, like retail, astronomy, biomedicine, physics, marketing, and of course, IT. Thanks to this, Spark has become synonymous with a new term: “fast data”. This means having the capability to process large amounts of data as fast as possible. Let’s not forget Spark’s motto and raison d’être: “Lightning-fast cluster computing”.

   Spark can efficiently scale up and down using minimal resources, and developers enjoy a more concise API, which helps them be more productive. Spark supports Scala, Java, Python, and R, and it also offers interactive shells for Scala and Python.
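
   As a rough illustration of that conciseness, the classic word count fits in a few lines of PySpark. This is only a sketch under assumptions not taken from the article: a local SparkContext and an input file named input.txt.

        from pyspark import SparkContext

        sc = SparkContext("local[*]", "WordCount")  # local mode, using all cores

        counts = (sc.textFile("input.txt")                  # RDD of lines
                    .flatMap(lambda line: line.split())     # RDD of words
                    .map(lambda word: (word, 1))            # pair each word with a 1
                    .reduceByKey(lambda a, b: a + b))       # sum the 1s per word

        for word, count in counts.take(10):                 # show a small sample
            print(word, count)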

Components:

  • Spark Core and Resilient Distributed Datasets. As its name implies, Spark Core contains the basic Spark functionality: task scheduling, memory management, fault recovery, interaction with storage systems, and more. Resilient Distributed Datasets (RDDs) are Spark’s main programming abstraction; they are logical collections of data partitioned across nodes, and can be seen as the “base unit” in Spark. Generally, RDDs reside in memory and only use the disk as a last resort. This is the reason why Spark is usually much faster than MapReduce jobs.
  • Spark SQL. This is how Spark interacts with structured data (like SQL or HQL). Shark was a previous effort at this, later abandoned in favor of Spark SQL. Spark SQL allows developers to intermix SQL with any of Spark’s supported programming languages (a short sketch follows this list).
  • Spark Streaming. Streaming, in the context of Big Data, is not to be confused with video or audio streaming, even if they are similar concepts. Spark Streaming handles any kind of data, not only video or audio feeds, so it is used to process data that keeps arriving with no particular end (like a stream): for example, tweets or logs from production web servers.
  • MLlib. This is Spark’s machine learning library; it contains common machine learning algorithms for classification, regression, clustering, and collaborative filtering. All of these algorithms are designed to scale out across the cluster.
  • GraphX. This is Apache Spark’s API for graphs and graph-parallel computation; it comes with a variety of graph algorithms.
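
   Here is a minimal Spark SQL sketch in Python, assuming the SQLContext API that was current when this was written and a hypothetical people.json input file (neither comes from the article):

        from pyspark import SparkContext
        from pyspark.sql import SQLContext

        sc = SparkContext("local[*]", "SparkSQLSketch")
        sqlContext = SQLContext(sc)

        # people.json is a hypothetical file with one {"name": ..., "age": ...} record per line.
        df = sqlContext.read.json("people.json")

        # Register the DataFrame as a temporary table and intermix SQL with regular Python.
        df.registerTempTable("people")
        adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
        for row in adults.collect():
            print(row.name, row.age)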


   Spark can recover failed nodes by recomputing the Directed Acyclic Graph (DAG) of the RDDs, and it also supports a recovery method using “checkpoints”. This is a clever way of guaranteeing fault tolerance that minimizes network I/O. RDDs achieve this by using lineage: if an RDD is lost, it has enough information about how it was derived from other RDDs (i.e. the DAGs are “replayed”) to be rebuilt easily. This works better than fetching data from disk every time, and in this way fault tolerance is achieved without using replication.
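
   A small sketch of what this looks like from the PySpark API: toDebugString() prints the lineage that Spark would replay after a failure, and checkpoint() saves an RDD so a long lineage does not have to be recomputed. The local mode and checkpoint directory are assumptions, not taken from the article.

        from pyspark import SparkContext

        sc = SparkContext("local[*]", "LineageSketch")
        sc.setCheckpointDir("/tmp/spark-checkpoints")   # in a real cluster this would be reliable storage, e.g. HDFS

        base = sc.parallelize(range(100000))
        derived = base.map(lambda x: x * 2).filter(lambda x: x % 3 == 0)

        print(derived.toDebugString())   # shows the lineage (DAG of transformations) behind this RDD

        derived.checkpoint()             # mark the RDD to be saved, truncating its lineage
        derived.count()                  # an action forces the checkpoint to actually run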

Spark usage metrics:

   In late December 2014, Typesafe conducted a survey about Spark and noticed a “hockey-stick-like” growth in its use[1], with many people already using Spark in production or planning to do so soon. The survey reached 2,136 technology professionals. These are some of its conclusions:

  • 13% of respondents currently use Spark in production, while 31% are evaluating Spark and 20% planned to use it in 2015.
  • 78% of Spark users hope to use the tool to solve issues with fast batch processing of large data sets.
  • Low awareness and/or experience is currently the biggest barrier for users implementing Spark effectively.
  • Top 3 industries represented: Telecoms, Banks, Retail.

[Figure: Typesafe survey results snapshot.]

   So, is Spark better than Hadoop? This is a question that is very difficult to answer. The topic is frequently discussed, and the only conclusion is that there is no clear winner. Both technologies offer different advantages, and they can be used alongside each other perfectly well. That is the reason why the Apache foundation has not merged both projects, and this will likely not happen, at least not anytime soon. Hadoop’s reputation was cemented as the Big Data poster child, and with good reason. Because of that, with every new project that emerges, people wonder how it relates to Hadoop. Is it a complement for Hadoop? A competitor? An enabler? Something that could leverage Hadoop’s capabilities? All of the above?

[Figure: Comparison of Spark’s stack and alternatives.]

   As you can see in the comparison above, Spark offers a lot of functionality out of the box. If you wanted to build an environment with the same capabilities, you would need to install, configure and maintain several projects at the same time. This is one of the great advantages of Spark: having a full-fledged data engine ready to work out of the box. Of course, many of the projects are interchangeable. For example, you could easily use HDFS for storage instead of Tachyon, or use YARN instead of Mesos. The fact that all of these are open source projects gives users a lot of versatility and many options, so they can have their cake and eat it too. For example, if you are used to programming in Pig and want to use it in Spark, a new project called Spork (you have to love the name) was created so you can do exactly that. Hive, Hue, Mahout and many other tools from the Hadoop ecosystem already work, or soon will work, with Spark.[2]

   Let’s say you want to build a cluster, and you want it to be cheap. Since Spark uses memory heavily, and RAM is relatively expensive, one could think that Hadoop MapReduce is cheaper, since MapReduce relies more on disk space than on RAM. However, the potentially more expensive Spark cluster could finish the job faster precisely because it uses RAM so heavily. So, you could end up paying for a few hours of usage of the Spark cluster instead of days for the Hadoop cluster. The official Spark FAQ page describes how Spark was used to sort 100 TB of data three times faster than Hadoop MapReduce on one tenth of the machines, winning the 2014 Daytona GraySort benchmark[3].

   If you have specific needs (like running machine learning algorithms, for example), you may decide in favor of one technology or the other. It all really depends on what you need to do and how you are paying for resources; it is basically a case-by-case decision. Neither Spark nor Hadoop is a silver bullet. In fact, I would say that there are no silver bullets in Big Data yet. For example, while Spark has streaming capabilities, Apache Storm is generally considered better at streaming.

Real-world use-cases:

   There are many ingenious and useful examples of Spark in the wild. Let’s talk about some of them.

   IBM has been working with NASA and the SETI Institute using Spark in order to analyze 100 million radio events detected over several years. This analysis could lead to the discovery of intelligent extraterrestrial life. [4] [5] [6]

   The analytic capabilities of Spark are also being used to identify suspicious vehicles mentioned in AMBER alerts. Basically, video feeds are fed into a Spark cluster using Spark Streaming, then processed with OpenCV for image recognition and MLlib for machine learning; together, they identify the model and color of cars, which in turn could help find missing children.[4] Spark’s speed is crucial here: huge amounts of live data need to be processed as quickly as possible, and it needs to be done continually, i.e. processing the data as it is being collected, hence the use of Spark Streaming.[7]

   Warren Buffet created an application where social network analysis is performed in order to predict stock trends. Based on this, the user gets recommendations from the application about when and how to buy, sell or hold stocks. It is obvious that a lot of people would be interested in suggestions like this, especially when they are drawn from live, real data like tweets. All this is accomplished with Spark Streaming and MLlib.[4]

   Of course, there is also a long list of companies using Spark for their day-to-day analytics; the “Powered by Spark” page includes important names like Yahoo, eBay, Amazon, NASA, Nokia, IBM Almaden, UC Berkeley and TripAdvisor, among many others.

   Take, for example, mapping Twitter activity based on streaming data. In the video below you can see a Spark Notebook that consumes the Twitter stream, filters the tweets that have geospatial information, and plots them on a map that narrows the view to the minimal bounding box enclosing the last batch’s tweets. It is very easy to imagine how streaming technologies and the Internet of Things will end up working together: all the data generated by IoT devices will need to be processed, and streaming tools like Spark Streaming and/or Apache Storm will be there to do the job.
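
   As a rough Python sketch of the filtering step, assume the tweets arrive as one JSON object per line over a socket on localhost:9999 (the notebook in the video uses the Scala Twitter connector instead; everything here is an illustrative assumption):

        import json

        from pyspark import SparkContext
        from pyspark.streaming import StreamingContext

        sc = SparkContext("local[2]", "TweetGeoFilter")  # at least 2 threads: one receiver, one processor
        ssc = StreamingContext(sc, batchDuration=5)      # 5-second micro-batches

        lines = ssc.socketTextStream("localhost", 9999)  # assumed feed of tweets as JSON lines

        coords = (lines.map(json.loads)
                       .filter(lambda t: t.get("coordinates"))            # keep only geotagged tweets
                       .map(lambda t: t["coordinates"]["coordinates"]))   # [longitude, latitude]

        coords.pprint()   # the notebook plots these points on a map instead of printing them

        ssc.start()
        ssc.awaitTermination()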

Conclusion:

   Spark was designed in a very intelligent way. Since it is newer, its architects applied the lessons learned from other projects (mainly Hadoop). The emergence of the Internet of Things is already producing a constant flow of large amounts of data. There will be a need to gather that data, process it, and draw conclusions from it. Spark can do all of this, and do it blindingly fast.

   IBM has shown a tremendous amount of interest in and commitment to Spark. For example, they founded the Spark Technology Center in San Francisco, enabled a Spark-as-a-Service model on Bluemix, and organized Spark hackathons[8]. They also committed to train more than 1 million data scientists, and donated SystemML (a machine learning technology) to further advance Spark’s development[9]. That does not happen unless an initiative has support at the highest levels of the company. In fact, they have called Spark “potentially, the most significant open source project of the next decade”.[10]

   All of this heralds a bright future for Spark and its related projects. It is hard to predict how the project will evolve, but the impact it has already made on the big data ecosystem is something to take very seriously.

References:

[1] http://www.slideshare.net/Typesafe_Inc/sneak-preview-apache-spark

[2] http://es.slideshare.net/sbaltagi/spark-or-hadoop-is-it-an-eitheror-proposition-by-slim-baltagi

[3] https://spark.apache.org/faq.html

[4] http://www.spark.tc/projects/

[5] http://blog.ibmjstart.net/2015/07/14/seti-sparks-machine-learning-to-sift-big-data/

[6] http://blog.ibmjstart.net/2015/08/06/types-of-bigdata-from-the-allen-telescope-array/

[7] https://github.com/hackspark/Amber-Alert-Aid

[8] http://blog.ibmjstart.net/2015/06/29/why-is-ibm-involved-with-apache-spark/

[9] http://www.ibm.com/analytics/us/en/technology/spark/

[10] https://www-03.ibm.com/press/us/en/pressrelease/47107.wss


Alan Verdugo / 2016/02/09 / Uncategorized / 0 Comments