Probando Microsoft SQL Server en Linux

    Hace años, nunca hubiera pensado escribir un título como ese, pero las cosas cambian y Microsoft ahora está prestando atención a otros ambientes que no son los suyos. Esto nos ha dado la oportunidad de probar algunas de sus herramientas sin tener que vernos obligados a instalar Windows, lo cual se agradece mucho.

    Voy a instalar SQL Server en Linux (específicamente en Linux Mint). Aquí voy a detallar el proceso de instsalación a modo de tutorial y también voy a ejecutar algunas consultas sencillas para demostrar el uso de SQL Server.

    Durante el proceso de instalación, necesitaremos permisos de administrador, así que nos cambiaremos a nuestra cuenta root:

    Creamos un directorio destinado para herramientas y notas sobre SQL Server y nos cambiamos a él (este paso es totalmente opcional):

    Bajamos e instalamos las llaves de los repositorios de Microsoft:

    Añadimos el repositorio de Microsoft:

    Actualizamos nuestra lista de repositorios:

    Finalmente, instalamos SQL Server:

    Ejecutamos la herramienta de configuración:

    Verificamos que el servicio de SQL Server está ejecutándose correctamente:

    Ahora instalaremos las herramientas para la línea de comandos:

    Nos conectamos a nuestra instancia local:

    Creamos una base de datos de prueba y mostramos las bases de datos existentes:

    En nuestra nueva base de datos, creamos una tabla de prueba e insertamos un registro:

    Hacemos una consulta de prueba:

   Como se puede ver, SQL Server fue inesperadamente sencillo de instalar, configurar y utilizar, sobre todo teniendo en mente que utilizamos un sistema operativo que no es del propio Microsoft. En esta práctica aprendimos a crear tablas y hacer consultas con las herramientas de SQL Server, las cuales son muy parecidas a otras herramientas de otros RDBMS, como MySQL.


Alan Verdugo / 2017/08/18 / Uncategorized / 0 Comments

Massively modifying images with ImageMagick

    Web editors often have the need to edit a large number of images. For example, the large image size of professional cameras tends to be overkill for most sites. I wanted a quick and easy way to resize large images and that was how I found ImageMagick.

    ImageMagick is a suite of tools and according to the man page, we can “use it to convert between image formats as  well  as  resize  an image, blur, crop, despeckle, dither, draw on, flip, join, re-sample, and much more”. First, let’s install imagemagick:

    Then, we can use the convert command to do the actual edition. Check the man page to see the astounding number of options this command has. For example, if I want to resize all the JPG images in the current directory to a width of 1280 pixels and save the resulting images as the same name but with “min-” before the name I would execute the following command:

    And here lies the advantage of ImageMagick: it can be used in a script to edit images extremely quickly. ImageMagick can also be used on Mac OS X and Windows. For more information about the convert command, refer to


Alan Verdugo / 2016/12/12 / Uncategorized / 0 Comments

Broken WordPress after Ubuntu 16.04 upgrade

    After some delays, I finally upgraded the server’s OS to the LTS Ubuntu 16.04. At first I thought that everything went fine, but then I tried to access the blog and it did not work, it only showed a blank page. A very bad omen. Then, when I tried to login into WordPress, this horrible message appeared:

    The message was actually much longer, I am just posting the beginning. If you have suffered with PHP in the past (like me), you will notice that this uninterpreted PHP code. That was my first clue, something was wrong with PHP. I created the infamous test.php page to test if PHP is actually working correctly with Apache. For those of you who haven’t done this, it basically is a “hello world” approach to see if PHP is working correctly. We paste the following code into a file named test.php or pleasework.php or something like that.

    Then we move that file to the Apache public directory (/var/www/html, is the default in Ubuntu) and grant it appropriate permissions. Then we go to and, if PHP is working, we should see a page with PHP’s logo and all sorts of information like System, Server API and many more. In my case, I only got another blank page. This meant that something was very wrong with PHP.

    So I went into the server via SSH and executed php -v. Turns out I didn’t even have the php command. How was that possible? Well, turns out PHP5 is no longer the default in Ubuntu 16.04, instead, PHP7 is the default. At some point during the upgrade, PHP was completely uninstalled. So, let’s install it again:

Then install libapache2-mod-php7.0:

Then install php7.0-mbstring:

Then install php7.0-mysql:

Finally, reload Apache’s configuration:

    Once all that was done, I reloaded the test.php page and it gave me all the information I mentioned before. I also logged in successfully into WordPress. Now I am wondering if I should change the OS to something else than Ubuntu, and if I should change the WordPress theme. There are other problems that need to be solved, but for now WordPress is working as it should and I am happy.


Alan Verdugo / 2016/12/04 / Uncategorized / 1 Comment

Introduction to Apache Spark

SPARKlogosmall    Spark is an open source cluster computing framework widely known for being extremely fast. It was started by AMPLab at UC Berkeley in 2009. Now it is an Apache top-level project. Spark can run on its own or can run, for example, in Hadoop or Mesos, and it can access data from diverse sources, including HDFS, Cassandra, HBase and Hive. Spark shares some characteristics with Apache Hadoop but they have important differences. Spark was developed to overcome the limitations of Hadoop’s MapReduce in regards of iterative algorithms and interactive data analysis.

Logistic regression in Hadoop and Spark.

Logistic regression in Hadoop and Spark.

   Since the very beginning, Spark showed great potential. Soon after its creation, Spark was already showing that it was being ten or twenty times faster than MapReduce for certain jobs. If is often said that it can be as 100 times faster, and this has been proven many times. For that reason, it is now widely used in areas where analysis is fundamental, like retail, astronomy, biomedicine, physics, marketing, and of course, IT. Thanks to this, Spark has become synonymous with a new term: “Fast data”. This means having the capability to process large amounts of data as fast as possible. Let’s not forget Spark’s motto and raison d’être: “Lightning-fast cluster computing”.

   Spark can efficiently scale up and down using minimal resources, and developers enjoy a more concise API, which helps them be more productive. Spark supports Scala, Java, Python, and R. While it also offers interactive shells for Scala and Python.


  • Spark Core and Resilient Distributed Datasets. As its name implies, Spark Core contains the basic Spark functionality, it includes task scheduling, memory management, fault recovery, interaction with storage systems, and more. Resilient Distributed Datasets (RDDs) are Spark’s main programming abstraction, they are logical collections of data partitioned across nodes. RDDs can be seen as the “base unit” in Spark. Generally, RDDs reside in memory and will only use the disc as a last resource. This is the reason why Spark is usually much faster than MapReduce jobs.
  • Spark SQL. This is how Spark interacts with structured data (like SQL or HQL). Shark was a previous effort in doing this, which was later abandoned in favor of Spark SQL. Spark SQL allows developers to intermix SQL with any of Spark’s supported programming languages.
  • Spark Streaming. Streaming, in the context of Big Data, is not to be confused with video or audio streaming, even if they are similar concepts. Spark Streaming handles any kind of data instead of only video or audio feeds. So, it is used to process data that does not stop coming at any time in particular (like a stream). For example, Tweets or logs from production web servers.
  • MLlib. It is a Machine Learning library, it contains the common functionality of machine learning algorithms like classification, regression, clustering, and collaborative filtering. All of these algorithms are designed to scale out across the cluster.
  • GraphX. Apache Spark’s API for graphs and graph-parallel computation, it comes with a variety of graph algorithms.


   Spark can recover failed nodes by recomputing the Directed Acyclic Graph (DAG) of the RDDs, and it also supports a recovery method using “checkpoints”. This is a clever way of guaranteeing fault tolerance that minimizes network I/O. RDDs achieve this by using lineage, i.e. if an RDD is lost, the RDD has enough information about how it was derived from other RDD (i.e. the DAGs are “replayed”), so it can be rebuilt easily. This works better than fetching data from disk every time. In this way, fault tolerance is achieved without using replication.

Spark usage metrics:

   In late December 2014, Typesafe created a survey about Spark, and they noticed a “hockey-stick-like” growth in its use[1], with many people already using Spark in production or planning to do it soon. The survey reached 2,136 technology professionals. These are some of their conclusions:

  • 13% of respondents currently use Spark in production, while 31% are evaluating Spark and 20% planned to use it in 2015.
  • 78% of Spark users hope to use the tool to solve issues with fast batch processing of large data sets.
  • Low awareness and/or experience is currently the biggest barrier for users implementing Spark effectively.
  • Top 3 industries represented: Telecoms, Banks, Retail.
Typesafe survey results snapshot.

Typesafe survey results snapshot.

   So, is Spark better than Hadoop? This is a question that is very difficult to answer. This is a topic that is frequently discussed, and the only conclusion is that there is no clear winner. Both technologies offer different advantages and they can be used alongside perfectly. That is the reason why the Apache foundation has not merged both projects, and this will likely not happen, at least not anytime soon. Hadoop reputation was cemented as the Big Data poster child, and with good reason. And for that, with every new project that emerges, people wonder how it relates to Hadoop. Is it a complement for Hadoop? A competitor? An enabler? something that could leverage Hadoop’s capabilities? all of the above?

Comparison of Spark's stack and alternatives.

Comparison of Spark’s stack and alternatives.

   As you can see in the table above, Spark offers a lot of functionality out of the box. If you wanted to build an environment with the same capabilities, you would need to install, configure and maintain several projects at the same time. This is one of the great advantages in Spark: having a full-fledged data engine ready to work out of the box. Of course, many of the projects are interchangeable. For example, you could easily use HDFS for storage instead of Tachyon, or use YARN instead of Mesos. The fact that all of these are open source projects means they have a lot of versatility, and this means the users can have many options available, so they can have their cake and eat it. For example, if you are used to program in Pig and want to use it in Spark, a new project called Spork (you got to love the name) was created so you can do it. Hive, Hue, Mahout and many other tools from the Hadoop ecosystem already work or soon will work with Spark.[2]

   Let’s say you want to build a cluster, and you want it to be cheap. Since Spark uses memory heavily, and RAM is relatively expensive, one could think that Hadoop MapReduce is cheaper, since MapReduce relies more on disc space than on RAM. However, the potentially more expensive Spark cluster could finish the job faster precisely because it uses RAM heavily. So, you could end up paying a few hours of usage for the Spark cluster instead of days for the Hadoop cluster. The official Spark FAQ page talks about how Spark was used to sort 100 TB of data three times faster than Hadoop MapReduce on 1/10th of the machines, winning the 2014 Daytona Graysort Benchmark[3].

   If you have specific needs (like running machine learning algorithms, for example) you may decide over one technology or the other. It all really depends of what you need to do and how you are paying for resources. It is basically a case-by-case decision. Neither Spark nor Hadoop are silver bullets. In fact, I would say that there are no silver bullets in Big Data yet. For example, while Spark has streaming capabilities, Apache Storm generally is considered better at it.

Real-world use-cases:

   There are many ingenious and useful examples of Spark in the wild. Let’s talk about some of them.

   IBM has been working with NASA and the SETI Institute using Spark in order to analyze 100 million radio events detected over several years. This analysis could lead to the discovery of intelligent extraterrestrial life. [4] [5] [6]

   The analytic capabilities of Spark are also being used to identify suspicious vehicles mentioned in AMBER alerts. Basically, video feeds are entered into a Spark cluster using Spark Streaming, then they are processed with OpenCV for image recognition and MLlib for machine learning. Together, they would identify the model and color of cars. Which in turn could help to find missing children.[4] Spark’s speed here is crucial. In this example, huge amounts of live data need to be processed as quickly as possible and it also needs to be done continually, i.e. processing as the data is being collected, hence the use of Spark Streaming. [7]

   Warren Buffet created an application where social network analysis is performed in order to predict stock trends. Based on this, the user of the application gets recommendations from the application about when and how to buy, sell or hold stocks. It is obvious that a lot of people would be interested in suggestions like this, and specially when they are taken from live, real data like Tweets. All this is accomplished with Spark Streaming and MLlib.[4]

   Of course, there is also a long list of companies using Spark for their day-to-day analytics, the “Powered by Spark” page has a lot of important names like Yahoo, eBay, Amazon, NASA, Nokia, IBM Almaden, UC Berkeley, TripAdvisor, among many others.

   Take, for example, mapping Twitter activity based on streaming data. In the video below you can see a Spark Notebook that is consuming the Twitter stream, filtering the tweets that have geospatial information and plotting them on a map that is narrowing the view to the minimal bounding box enclosing the last batch’s tweets. It is very easy to imagine the collaboration that streaming technologies and the Internet of Things will end up doing. All that data generated by IoT devices will need to be processed and streaming tools like Spark Streaming and/or Apache Storm will be there to do the job.


   Spark was designed in a very intelligent way. Since it is newer, the architects used the learned lessons from other projects (mainly from Hadoop). The emergence of the Internet of Things is already producing a constant flow of large amounts of data. There will be a need to gather that data, process it and draw conclusions on it. Spark can do all of this and do it blindingly fast.

   IBM has shown a tremendous amount of interest and commitment to Spark. For example, they founded the Spark Technology Center in San Francisco, enabled a Spark-As-A-Service model on Bluemix, and organized Spark hackatons[8]. They also committed to train more than 1 million of data scientists, and donated SystemML (a machine learning technology) to further advance Spark’s development [9]. That doesn’t happen if an initiative doesn’t have support at the highest levels of the company. In fact, they have called it “potentially, the most significant open source project of the next decade”. [10]

   All this heralds a bright future for Spark and the related projects. It is hard to imagine how the project will evolve, but the impact it has already done in the big data ecosystem is something to take very seriously.













Alan Verdugo / 2016/02/09 / Uncategorized / 0 Comments

Installing and configuring a Hadoop cluster

    As I mentioned in a previous post, Hadoop has great potential and is one of the best known projects for Big Data. Now, let’s take a deeper look into Hadoop and how it works. So, let’s roll up our sleeves and actually work with Hadoop. For that, we will install and configure a Hadoop cluster. Our cluster will consists on four nodes (one master and three slaves). So I provisioned four cloud servers:

Architecture diagram.

Architecture diagram.

    We designated atlbz153122 as the master for no reason in particular. The aliases can be modified according to your needs. I.e. If you are using these same hosts for other cluster projects, you may want to specify the aliases as “hadoop-master” or “hadoop-slave-1”. They were created using the same image, so all of them have this setup:

Initial setup:
In every node, we will do the following:
-Create a hadoop user account (Of course, you can chose to name it differently), and add it to the sudoers group.

-Optionally (but extremely recommended), map all the hostnames in the respective /etc/hosts files.
-Copy SSH keys between all the nodes.
Obviously, first we need to make sure SSH is up and running on all the servers. This step could be different for you according to your security requirements, but the goal here is to allow Hadoop and its processes to communicate between the nodes without prompting for credentials.
-Generate SSH keys.

-Copy the keys.
What I do is to create a list of all the public keys and then add that master list to the authorized keys in all the hosts.
-Now check that you can ssh to the localhost and all the other nodes without a passphrase:

-Create a basic file structure and set the ownership to the Hadoop user you created before.

-Setup Java.
As mentioned above, in this image Java is pre-installed. The Java version that comes pre-installed in this RHEL image should work for us since the Hadoop Java versions page says it has been tested with IBM Java 6 SR 8. Just make sure that your Java version is compatible.
-Add the following to the root and Hadoop user ~/.bashrc file (and modify according to your needs):

From now on, in order to save time, we will install and do a basic setup of Hadoop in the master node, then copy that basic setup to the slaves and then perform specific customizations in each of them.

From the master node, download the latest Hadoop tarball (version 2.6.2 at the time of this writing) from one of the mirrors and extract its contents in the previously created Hadoop home:

Update core-site.xml (located in $HADOOP_HOME/hadoop/etc/hadoop/core-site.xml), we will change “localhost” to our master node hostname, IP, or alias. In our case, it would be like this:

Update hdfs-site.xml (located in $HADOOP_HOME/hadoop/etc/hadoop/hdfs-site.xml), we will change the replication factor to 3, and specify the datanode and namenodes directories (take note of these, since we will create them later), as well as add an http address:

Update yarn-site.xml (located in $HADOOP_HOME/hadoop/etc/hadoop/yarn-site.xml). There will be 3 properties that we need to update with our master node hostname or alias:

Update mapred-site.xml (located in $HADOOP_HOME/hadoop/etc/hadoop/mapred-site.xml), we will add the following properties:

Update the masters file (located in $HADOOP_HOME/hadoop/etc/masters). We only need to add the hostname or alias of the master node. So in our case this file will only contain one line:

Update the slaves file (located in $HADOOP_HOME/hadoop/etc/slaves). I am sure you already know what to do here, we will only add our slaves nodes to the file, one alias per line. In our case:

This concludes the basic setup, we will now transfer all the files to the slave nodes and continue the customizations there.

On the master node, run the following commands in order to copy the basic setup to the slaves (you can also do this with scp):

In our case, it would be something like this:

Now, we need to perform some changes exclusively to the master node:
We will create a directory that will be used for HDFS. I decided to create it in /opt/hadoop/hadoop_tmp
Inside that directory, create another with the name “hdfs” and there, create yet another one named “namenode”. So you will have something like this:

Once the basic installation is in all four nodes and the master is properly configured, we can go ahead and perform the customizations in the slaves nodes.
We will create a directory that will be used for HDFS. I decided to create it in /opt/hadoop/hadoop_tmp (you can choose anything else, but make sure it is consistent with the contents of hdfs-site.xml)
Inside that directory, create another with the name “hdfs” and there, create yet another one named “datanode”. So you will have something like this:

Again, all the files inside your Hadoop home should be owned by the Hadoop user you created. Otherwise you may get “Permission denied” errors while starting the services.

Back in the master, format the namenode:

Finally, start the services. Previously, Hadoop used the script for this, but it has been deprecated and the recommended method now is to use the start-<service>.sh scripts individually. In the master node, execute the following:

    There are several ways to confirm that everything is running properly, for example, you can point your browser to http://master:50070 or http://master:8088. You can also check the Java processes in the nodes. In the master node you should see at least 3 processes: NameNode, SecondaryNameNode and ResourceManager. In the slaves you should see only two: NodeManager and DataNode. My favourite way is to use the hdfs report command. This is the output report of my configuration:

    Congratulations! At this point, Hadoop is configured and running. You can stop here, but I would prefer to run a test, just to make sure everything is working correctly. We will upload some files to the HDFS.

Check the disk usage:

Make a test directory:


List the contents of root:

Let’s create a file locally and add text to it (courtesy of just so we see it is replicated to all the nodes:

Then add our text file to our new directory:


Now you can see that our file has been automatically replicated to all 3 nodes!



    Hadoop is not hard to install nor configure, but you need to pay very close attention to many aspects, otherwise you will get obscure errors and since this is a “new” technology, the amount of useful help you can get is not great. Also, since this a complex technology, many problems can arise and are hard to troubleshoot. Sadly, there are not many online resources that are easy to follow, or the ones that are written properly are mostly outdated already. Personally, I had to read many tutorials and do a lot of troubleshooting in order to setup this correctly because many of the tutorials I read omitted bits of important information. There are, however, other projects that can help with the administration of Hadoop and its components. Two good examples are IBM BigInsights and Ambari. Even without those tools, adding and removing Hadoop nodes is very easy once the basic setup is correctly configured. I am nowhere near of being an expert on Big Data yet, but I tried my best to write this post as clear as possible and I truly hope it is useful. Big Data is undoubtedly here to stay and this kind of publications are indeed a valuable resource for newcomers.


Alan Verdugo / 2016/01/05 / Uncategorized / 0 Comments