Massively modifying images with ImageMagick

    Web editors often need to edit a large number of images. For example, the large images produced by professional cameras tend to be overkill for most sites. I wanted a quick and easy way to resize large images, and that is how I found ImageMagick.

    ImageMagick is a suite of tools and, according to its man page, we can “use it to convert between image formats as well as resize an image, blur, crop, despeckle, dither, draw on, flip, join, re-sample, and much more”. First, let’s install imagemagick:
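
    On a Debian/Ubuntu system, this is presumably just the imagemagick package from apt:

        sudo apt-get install imagemagick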

    Then, we can use the convert command to do the actual editing. Check the man page to see the astounding number of options this command has. For example, if I want to resize all the JPG images in the current directory to a width of 1280 pixels and save each result under its original name prefixed with “min-”, I would execute the following command:
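
    The exact command isn’t preserved in this copy of the post, but a loop along these lines does what is described (passing a single number to -resize sets the width and scales the height proportionally):

        for image in *.jpg; do
            convert "$image" -resize 1280 "min-$image"
        done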

    And herein lies the advantage of ImageMagick: it can be used in scripts to edit large numbers of images extremely quickly. ImageMagick is also available on Mac OS X and Windows. For more information about the convert command, refer to http://www.imagemagick.org/script/convert.php


Alan Verdugo / 2016/12/12 / Uncategorized / 0 Comments

Broken WordPress after Ubuntu 16.04 upgrade

    After some delays, I finally upgraded the server’s OS to Ubuntu 16.04 LTS. At first I thought that everything had gone fine, but then I tried to access the blog and it did not work; it only showed a blank page. A very bad omen. Then, when I tried to log in to WordPress, this horrible message appeared:

    The message was actually much longer; I am just posting the beginning. If you have suffered with PHP in the past (like me), you will notice that this is uninterpreted PHP code. That was my first clue: something was wrong with PHP. I created the infamous test.php page to check whether PHP was actually working correctly with Apache. For those of you who haven’t done this, it is basically a “hello world” approach to see if PHP is working correctly. We paste the following code into a file named test.php or pleasework.php or something like that.
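
    The snippet itself isn’t preserved in this copy, but the page described below (PHP’s logo plus system information) is what phpinfo() produces, so the file presumably contained something like this:

        <?php
        // Print PHP configuration details (version, Server API, loaded modules, etc.)
        phpinfo();
        ?>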

    Then we move that file to the Apache public directory (/var/www/html is the default in Ubuntu) and grant it appropriate permissions. Then we go to yourdomain.com/test.php and, if PHP is working, we should see a page with PHP’s logo and all sorts of information, such as System, Server API, and much more. In my case, I only got another blank page. This meant that something was very wrong with PHP.

    So I went into the server via SSH and executed php -v. It turns out I didn’t even have the php command. How was that possible? Well, it turns out PHP 5 is no longer the default in Ubuntu 16.04; PHP 7 is. At some point during the upgrade, PHP was completely uninstalled. So, let’s install it again:
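
    The commands themselves aren’t preserved in this copy, but on Ubuntu 16.04 these steps are presumably plain apt installs, starting with the PHP 7.0 package itself:

        sudo apt-get update
        sudo apt-get install php7.0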

Then install libapache2-mod-php7.0:
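
Again via apt:

        sudo apt-get install libapache2-mod-php7.0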

Then install php7.0-mbstring:
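
Likewise, the mbstring extension:

        sudo apt-get install php7.0-mbstring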

Then install php7.0-mysql:
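
And the MySQL extension:

        sudo apt-get install php7.0-mysql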

Finally, reload Apache’s configuration:
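
On Ubuntu 16.04 (systemd), this would be something like:

        sudo systemctl reload apache2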

    Once all that was done, I reloaded the test.php page and it gave me all the information I mentioned before. I also logged into WordPress successfully. Now I am wondering if I should change the OS to something other than Ubuntu, and if I should change the WordPress theme. There are other problems that need to be solved, but for now WordPress is working as it should and I am happy.


Alan Verdugo / 2016/12/04 / Uncategorized / 1 Comment

Introduction to Apache Spark

    Spark is an open source cluster computing framework widely known for being extremely fast. It was started by the AMPLab at UC Berkeley in 2009 and is now an Apache top-level project. Spark can run on its own or on top of, for example, Hadoop or Mesos, and it can access data from diverse sources, including HDFS, Cassandra, HBase, and Hive. Spark shares some characteristics with Apache Hadoop, but they have important differences: Spark was developed to overcome the limitations of Hadoop’s MapReduce with regard to iterative algorithms and interactive data analysis.

Logistic regression in Hadoop and Spark.

   Since the very beginning, Spark showed great potential. Soon after its creation, Spark was already proving to be ten or twenty times faster than MapReduce for certain jobs. It is often said that it can be up to 100 times faster, and this has been demonstrated many times. For that reason, it is now widely used in areas where analysis is fundamental, like retail, astronomy, biomedicine, physics, marketing, and of course, IT. Thanks to this, Spark has become synonymous with a new term: “fast data”, meaning the capability to process large amounts of data as fast as possible. Let’s not forget Spark’s motto and raison d’être: “Lightning-fast cluster computing”.

   Spark can efficiently scale up and down using minimal resources, and developers enjoy a more concise API, which helps them be more productive. Spark supports Scala, Java, Python, and R, and it also offers interactive shells for Scala and Python.

Components:

  • Spark Core and Resilient Distributed Datasets. As its name implies, Spark Core contains the basic Spark functionality: task scheduling, memory management, fault recovery, interaction with storage systems, and more. Resilient Distributed Datasets (RDDs) are Spark’s main programming abstraction; they are logical collections of data partitioned across nodes. RDDs can be seen as the “base unit” in Spark. Generally, RDDs reside in memory and will only use the disk as a last resort, which is why Spark is usually much faster than MapReduce jobs. A minimal code sketch of the RDD API follows this list.
  • Spark SQL. This is how Spark interacts with structured data (using SQL or HQL). Shark was an earlier effort at this, later abandoned in favor of Spark SQL. Spark SQL allows developers to intermix SQL queries with any of Spark’s supported programming languages.
  • Spark Streaming. Streaming, in the context of Big Data, is not to be confused with video or audio streaming, even if they are similar concepts: Spark Streaming handles any kind of data, not only video or audio feeds. It is used to process data that keeps arriving with no particular end (like a stream), for example tweets or logs from production web servers.
  • MLlib. A machine learning library that contains common machine learning algorithms for classification, regression, clustering, and collaborative filtering. All of these algorithms are designed to scale out across the cluster.
  • GraphX. Apache Spark’s API for graphs and graph-parallel computation; it comes with a variety of graph algorithms.
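
   As a rough illustration of the RDD API described above (this example is not from the original post; the file path and the word-count task are placeholders), a minimal PySpark job looks something like this:

        from pyspark import SparkContext

        # Connect to a cluster; "local[*]" runs locally, just for illustration.
        sc = SparkContext("local[*]", "WordCountExample")

        # An RDD: a collection of lines partitioned across the cluster.
        lines = sc.textFile("some_text_file.txt")

        # Transformations are lazy; nothing runs until an action is called.
        counts = (lines.flatMap(lambda line: line.split())
                       .map(lambda word: (word, 1))
                       .reduceByKey(lambda a, b: a + b))

        # collect() is an action: it triggers the computation and returns the results.
        for word, count in counts.collect():
            print(word, count)

        sc.stop()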


   Spark can recover failed nodes by recomputing the Directed Acyclic Graph (DAG) of the RDDs, and it also supports a recovery method using “checkpoints”. This is a clever way of guaranteeing fault tolerance that minimizes network I/O. RDDs achieve this through lineage: if an RDD is lost, it has enough information about how it was derived from other RDDs (the DAGs are “replayed”), so it can be rebuilt easily. This works better than fetching data from disk every time, and fault tolerance is achieved without replication.

Spark usage metrics:

   In late December 2014, Typesafe conducted a survey about Spark and noticed a “hockey-stick-like” growth in its use[1], with many people already using Spark in production or planning to do so soon. The survey reached 2,136 technology professionals. These are some of its conclusions:

  • 13% of respondents currently use Spark in production, while 31% are evaluating Spark and 20% planned to use it in 2015.
  • 78% of Spark users hope to use the tool to solve issues with fast batch processing of large data sets.
  • Low awareness and/or experience is currently the biggest barrier for users implementing Spark effectively.
  • Top 3 industries represented: Telecoms, Banks, Retail.
Typesafe survey results snapshot.

   So, is Spark better than Hadoop? This question is very difficult to answer. The topic is frequently discussed, and the only conclusion is that there is no clear winner. Both technologies offer different advantages, and they can be used alongside each other perfectly well. That is the reason why the Apache Software Foundation has not merged the two projects, and this will likely not happen, at least not anytime soon. Hadoop’s reputation as the Big Data poster child was cemented with good reason, so with every new project that emerges, people wonder how it relates to Hadoop. Is it a complement to Hadoop? A competitor? An enabler? Something that could leverage Hadoop’s capabilities? All of the above?

Comparison of Spark’s stack and alternatives.

   As you can see in the table above, Spark offers a lot of functionality out of the box. If you wanted to build an environment with the same capabilities, you would need to install, configure, and maintain several projects at the same time. This is one of the great advantages of Spark: having a full-fledged data engine ready to work out of the box. Of course, many of the projects are interchangeable. For example, you could easily use HDFS for storage instead of Tachyon, or YARN instead of Mesos. The fact that all of these are open source projects gives users a lot of versatility and many options, so they can have their cake and eat it too. For example, if you are used to programming in Pig and want to use it on Spark, a project called Spork (you have to love the name) was created so you can do exactly that. Hive, Hue, Mahout, and many other tools from the Hadoop ecosystem already work, or soon will work, with Spark.[2]

   Let’s say you want to build a cluster, and you want it to be cheap. Since Spark uses memory heavily, and RAM is relatively expensive, one could think that Hadoop MapReduce is cheaper, since MapReduce relies more on disk space than on RAM. However, the potentially more expensive Spark cluster could finish the job faster precisely because it uses RAM heavily, so you could end up paying for a few hours of usage of the Spark cluster instead of days for the Hadoop cluster. The official Spark FAQ page mentions that Spark was used to sort 100 TB of data three times faster than Hadoop MapReduce on one tenth of the machines, winning the 2014 Daytona GraySort benchmark[3].

   If you have specific needs (running machine learning algorithms, for example), you may favor one technology over the other. It all really depends on what you need to do and how you are paying for resources; it is basically a case-by-case decision. Neither Spark nor Hadoop is a silver bullet. In fact, I would say that there are no silver bullets in Big Data yet. For example, while Spark has streaming capabilities, Apache Storm is generally considered better at streaming.

Real-world use-cases:

   There are many ingenious and useful examples of Spark in the wild. Let’s talk about some of them.

   IBM has been working with NASA and the SETI Institute using Spark in order to analyze 100 million radio events detected over several years. This analysis could lead to the discovery of intelligent extraterrestrial life. [4] [5] [6]

   The analytic capabilities of Spark are also being used to identify suspicious vehicles mentioned in AMBER alerts. Basically, video feeds are fed into a Spark cluster using Spark Streaming and then processed with OpenCV for image recognition and MLlib for machine learning. Together, they identify the model and color of cars, which in turn could help find missing children.[4] Spark’s speed is crucial here: huge amounts of live data need to be processed as quickly as possible, and it needs to be done continually, i.e. processing the data as it is being collected, hence the use of Spark Streaming.[7]

   Warren Buffet created an application that performs social network analysis in order to predict stock trends. Based on this, the user gets recommendations from the application about when and how to buy, sell, or hold stocks. Obviously, a lot of people would be interested in suggestions like these, especially when they are drawn from live, real data such as tweets. All of this is accomplished with Spark Streaming and MLlib.[4]

   Of course, there is also a long list of companies using Spark for their day-to-day analytics: the “Powered by Spark” page lists many important names, such as Yahoo, eBay, Amazon, NASA, Nokia, IBM Almaden, UC Berkeley, and TripAdvisor, among many others.

   Take, for example, mapping Twitter activity based on streaming data. In the video below you can see a Spark Notebook that consumes the Twitter stream, filters the tweets that have geospatial information, and plots them on a map that narrows the view to the minimal bounding box enclosing the last batch’s tweets. It is very easy to imagine how streaming technologies and the Internet of Things will end up working together: all that data generated by IoT devices will need to be processed, and streaming tools like Spark Streaming and/or Apache Storm will be there to do the job.

Conclusion:

   Spark was designed in a very intelligent way. Since it is newer, its architects applied the lessons learned from other projects (mainly from Hadoop). The emergence of the Internet of Things is already producing a constant flow of large amounts of data. There will be a need to gather that data, process it, and draw conclusions from it. Spark can do all of this, and do it blindingly fast.

   IBM has shown a tremendous amount of interest in and commitment to Spark. For example, it founded the Spark Technology Center in San Francisco, enabled a Spark-as-a-Service model on Bluemix, and organized Spark hackathons[8]. IBM also committed to train more than one million data scientists, and donated SystemML (a machine learning technology) to further advance Spark’s development[9]. That doesn’t happen if an initiative doesn’t have support at the highest levels of the company. In fact, IBM has called Spark “potentially, the most significant open source project of the next decade”.[10]

   All this heralds a bright future for Spark and its related projects. It is hard to predict how the project will evolve, but the impact it has already had on the big data ecosystem is something to take very seriously.

References:

[1] http://www.slideshare.net/Typesafe_Inc/sneak-preview-apache-spark

[2] http://es.slideshare.net/sbaltagi/spark-or-hadoop-is-it-an-eitheror-proposition-by-slim-baltagi

[3] https://spark.apache.org/faq.html

[4] http://www.spark.tc/projects/

[5] http://blog.ibmjstart.net/2015/07/14/seti-sparks-machine-learning-to-sift-big-data/

[6] http://blog.ibmjstart.net/2015/08/06/types-of-bigdata-from-the-allen-telescope-array/

[7] https://github.com/hackspark/Amber-Alert-Aid

[8] http://blog.ibmjstart.net/2015/06/29/why-is-ibm-involved-with-apache-spark/

[9] http://www.ibm.com/analytics/us/en/technology/spark/

[10] https://www-03.ibm.com/press/us/en/pressrelease/47107.wss


Alan Verdugo / 2016/02/09 / Uncategorized / 0 Comments

Installing and configuring a Hadoop cluster

    As I mentioned in a previous post, Hadoop has great potential and is one of the best known projects for Big Data. Now, let’s take a deeper look into Hadoop and how it works. Let’s roll up our sleeves and actually work with Hadoop: we will install and configure a Hadoop cluster. Our cluster will consist of four nodes (one master and three slaves), so I provisioned four cloud servers:

Architecture diagram.

    We designated atlbz153122 as the master for no particular reason. The aliases can be modified according to your needs; e.g., if you are using these same hosts for other cluster projects, you may want to specify the aliases as “hadoop-master” or “hadoop-slave-1”. They were created using the same image, so all of them have this setup:

Initial setup:
On every node, we will do the following:
-Create a hadoop user account (of course, you can choose to name it differently) and add it to the sudoers group.
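
The exact commands aren’t preserved in this copy; a sketch of this step (assuming the RHEL image mentioned below, where the sudo-capable group is wheel) would be:

        sudo useradd -m hadoop          # create the hadoop user with a home directory
        sudo passwd hadoop              # set its password
        sudo usermod -aG wheel hadoop   # "wheel" on RHEL; use "sudo" on Debian/Ubuntu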

-Optionally (but highly recommended), map all the hostnames in the respective /etc/hosts files.
-Copy SSH keys between all the nodes.
Obviously, first we need to make sure SSH is up and running on all the servers. This step could be different for you according to your security requirements, but the goal here is to allow Hadoop and its processes to communicate between the nodes without prompting for credentials.
-Generate SSH keys.
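
Presumably with ssh-keygen, along these lines (an empty passphrase keeps the later connections non-interactive):

        ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa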

-Copy the keys.
What I do is create a single list of all the public keys and then append that combined list to the authorized_keys file on every host.
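
A sketch of that approach (the hostnames are the example aliases, not the real ones; you will still be prompted for passwords at this stage):

        # Gather every node's public key into one file:
        for host in hadoop-master hadoop-slave-1 hadoop-slave-2 hadoop-slave-3; do
            ssh hadoop@$host "cat ~/.ssh/id_rsa.pub" >> all_keys.pub
        done

        # Append the combined list to authorized_keys on every node:
        for host in hadoop-master hadoop-slave-1 hadoop-slave-2 hadoop-slave-3; do
            ssh hadoop@$host "cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys" < all_keys.pub
        done
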
-Now check that you can ssh to the localhost and all the other nodes without a passphrase:
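
For example (again using the example aliases):

        ssh localhost
        ssh hadoop-slave-1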

-Create a basic file structure and set the ownership to the Hadoop user you created before.
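
Something along these lines, assuming /opt/hadoop as the Hadoop home referenced later in this post:

        sudo mkdir -p /opt/hadoop
        sudo chown -R hadoop:hadoop /opt/hadoop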

-Setup Java.
As mentioned above, Java comes pre-installed in this image. The Java version shipped with this RHEL image should work for us, since the Hadoop Java versions page says Hadoop has been tested with IBM Java 6 SR 8. Just make sure that your Java version is compatible.
-Add the following to the ~/.bashrc files of the root and hadoop users (and modify according to your needs):
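
The exact variables aren’t preserved in this copy; a typical set, consistent with the paths used in the rest of the post, would be:

        export JAVA_HOME=/usr/java/default                     # adjust to your Java installation
        export HADOOP_HOME=/opt/hadoop
        export HADOOP_CONF_DIR=$HADOOP_HOME/hadoop/etc/hadoop
        export PATH=$PATH:$HADOOP_HOME/hadoop/bin:$HADOOP_HOME/hadoop/sbin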

From now on, in order to save time, we will install and do a basic setup of Hadoop in the master node, then copy that basic setup to the slaves and then perform specific customizations in each of them.

From the master node, download the latest Hadoop tarball (version 2.6.2 at the time of this writing) from one of the mirrors and extract its contents in the previously created Hadoop home:
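
For example (the URL below points to the Apache archive rather than whichever mirror was used originally):

        cd /opt/hadoop
        wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.2/hadoop-2.6.2.tar.gz
        tar -xzf hadoop-2.6.2.tar.gz
        mv hadoop-2.6.2 hadoop    # so the tree matches the $HADOOP_HOME/hadoop paths used below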

Update core-site.xml (located in $HADOOP_HOME/hadoop/etc/hadoop/core-site.xml): we will change “localhost” to our master node’s hostname, IP, or alias. In our case, it would be like this:
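
The original snippet isn’t preserved; it presumably looked roughly like this (the port is an assumption, 9000 being a common choice in tutorials):

        <configuration>
          <property>
            <name>fs.defaultFS</name>
            <value>hdfs://atlbz153122:9000</value>
          </property>
        </configuration>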

Update hdfs-site.xml (located in $HADOOP_HOME/hadoop/etc/hadoop/hdfs-site.xml): we will change the replication factor to 3 and specify the datanode and namenode directories (take note of these, since we will create them later), as well as add an HTTP address:
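
A sketch of what that likely contained (the directory paths must match the ones created later in this post, and 50070 is simply the usual default web UI port):

        <configuration>
          <property>
            <name>dfs.replication</name>
            <value>3</value>
          </property>
          <property>
            <name>dfs.namenode.name.dir</name>
            <value>file:/opt/hadoop/hadoop_tmp/hdfs/namenode</value>
          </property>
          <property>
            <name>dfs.datanode.data.dir</name>
            <value>file:/opt/hadoop/hadoop_tmp/hdfs/datanode</value>
          </property>
          <property>
            <name>dfs.namenode.http-address</name>
            <value>atlbz153122:50070</value>
          </property>
        </configuration>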

Update yarn-site.xml (located in $HADOOP_HOME/hadoop/etc/hadoop/yarn-site.xml). There will be 3 properties that we need to update with our master node hostname or alias:
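
Those three are presumably the ResourceManager addresses, along these lines (the port numbers are common tutorial values, not something preserved from the original post):

        <configuration>
          <property>
            <name>yarn.resourcemanager.resource-tracker.address</name>
            <value>atlbz153122:8025</value>
          </property>
          <property>
            <name>yarn.resourcemanager.scheduler.address</name>
            <value>atlbz153122:8030</value>
          </property>
          <property>
            <name>yarn.resourcemanager.address</name>
            <value>atlbz153122:8050</value>
          </property>
        </configuration>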

Update mapred-site.xml (located in $HADOOP_HOME/hadoop/etc/hadoop/mapred-site.xml): we will add the following properties:
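
The exact properties aren’t preserved here; at a minimum, this file usually tells MapReduce to run on top of YARN:

        <configuration>
          <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
          </property>
        </configuration>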

Update the masters file (located in $HADOOP_HOME/hadoop/etc/hadoop/masters). We only need to add the hostname or alias of the master node, so in our case this file will contain only one line:
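
In our case, that is simply:

        atlbz153122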

Update the slaves file (located in $HADOOP_HOME/hadoop/etc/hadoop/slaves). I am sure you already know what to do here: we will only add our slave nodes to the file, one hostname or alias per line. In our case:
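
The real slave hostnames aren’t preserved in this copy; using the example aliases, the file would look like this:

        hadoop-slave-1
        hadoop-slave-2
        hadoop-slave-3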

This concludes the basic setup; we will now transfer all the files to the slave nodes and continue the customizations there.

On the master node, run the following commands in order to copy the basic setup to the slaves (you can also do this with scp):

In our case, it would be something like this:
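
Again using the example aliases in place of the real slave hostnames:

        for slave in hadoop-slave-1 hadoop-slave-2 hadoop-slave-3; do
            rsync -avz /opt/hadoop/hadoop/ hadoop@$slave:/opt/hadoop/hadoop/
        done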

Now, we need to perform some changes exclusively on the master node:
We will create a directory that will be used for HDFS. I decided to create it in /opt/hadoop/hadoop_tmp.
Inside that directory, create another one named “hdfs” and, inside that, yet another one named “namenode”. So you will end up with something like this:
/opt/hadoop/hadoop_tmp/hdfs/namenode
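
In other words, something like:

        mkdir -p /opt/hadoop/hadoop_tmp/hdfs/namenode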

Once the basic installation is on all four nodes and the master is properly configured, we can go ahead and perform the customizations on the slave nodes.
We will create a directory that will be used for HDFS. I decided to create it in /opt/hadoop/hadoop_tmp (you can choose anything else, but make sure it is consistent with the contents of hdfs-site.xml).
Inside that directory, create another one named “hdfs” and, inside that, yet another one named “datanode”. So you will end up with something like this:
/opt/hadoop/hadoop_tmp/hdfs/datanode
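
Or, as a command:

        mkdir -p /opt/hadoop/hadoop_tmp/hdfs/datanode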

Again, all the files inside your Hadoop home should be owned by the Hadoop user you created. Otherwise you may get “Permission denied” errors while starting the services.

Back in the master, format the namenode:
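
With the environment variables set as above, that is:

        hdfs namenode -format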

Finally, start the services. Previously, Hadoop used the start-all.sh script for this, but it has been deprecated and the recommended method now is to use the start-<service>.sh scripts individually. In the master node, execute the following:
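
Presumably the DFS and YARN start scripts:

        start-dfs.sh
        start-yarn.sh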

    There are several ways to confirm that everything is running properly. For example, you can point your browser to http://master:50070 or http://master:8088. You can also check the Java processes in the nodes: in the master node you should see at least 3 processes (NameNode, SecondaryNameNode, and ResourceManager), while in the slaves you should see only two (NodeManager and DataNode). My favourite way is to use the hdfs dfsadmin -report command. This is the output report of my configuration:

    Congratulations! At this point, Hadoop is configured and running. You could stop here, but I prefer to run a test, just to make sure everything is working correctly. We will upload some files to HDFS.

Check the disk usage:
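
Probably via the df subcommand:

        hdfs dfs -df -h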

Make a test directory:
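
The directory name used originally isn’t preserved; for example:

        hdfs dfs -mkdir /test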


List the contents of root:
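
That would be:

        hdfs dfs -ls /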

Let’s create a file locally and add text to it (courtesy of loremgibson.com) just so we see it is replicated to all the nodes:
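
For instance (the file name and text are placeholders):

        echo "Lorem ipsum dolor sit amet, consectetur adipiscing elit." > DFSTestFile.txt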

Then add our text file to our new directory:
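
Using the placeholder names from above:

        hdfs dfs -put DFSTestFile.txt /test/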


Now you can see that our file has been automatically replicated to all 3 nodes!
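
One way to see this from the command line (besides the web UI) is fsck, which lists the block locations:

        hdfs fsck /test/DFSTestFile.txt -files -blocks -locations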


Conclusion:

    Hadoop is not hard to install or configure, but you need to pay very close attention to many details; otherwise you will get obscure errors, and since this is a “new” technology, the amount of useful help you can get is not great. Also, since this is a complex technology, many problems can arise and they are hard to troubleshoot. Sadly, there are not many online resources that are easy to follow, and the ones that are well written are mostly outdated already. Personally, I had to read many tutorials and do a lot of troubleshooting in order to set this up correctly, because many of the tutorials I read omitted bits of important information. There are, however, other projects that can help with the administration of Hadoop and its components; two good examples are IBM BigInsights and Ambari. Even without those tools, adding and removing Hadoop nodes is very easy once the basic setup is correctly configured. I am nowhere near being an expert on Big Data yet, but I tried my best to write this post as clearly as possible and I truly hope it is useful. Big Data is undoubtedly here to stay, and publications like this are a valuable resource for newcomers.


Alan Verdugo / 2016/01/05 / Uncategorized / 0 Comments

Introduction to Apache Hadoop

    Hadoop is an open source project from the Apache Software Foundation that was created on the premise that hardware components in a network are prone to failure, and that there should be a component handling those failures automatically in a way that does not create downtime for the system or affect the users in any way. Hadoop is mainly written in Java and it can run on the major operating systems. It was originally created by Doug Cutting while he was working at Yahoo, and it was named after a toy elephant that belongs to Doug’s son.

    The ever-growing need for data storage is also a big part of Hadoop’s inception. In this day and age, a company storing user data cannot afford to lose any of it due to hardware failures. Just imagine getting an email saying “We just lost all your photos/messages/work/art/portfolio/reports/money because one of our servers failed. Sorry!”. That is simply unacceptable, and Hadoop was created to prevent that problem.

    Just the task of storing that data is an enormous challenge. Being responsible for handling sensitive data in large amounts can be very intimidating. Imagine being the architect of a successful startup that is suddenly required to store and handle not gigabytes, but terabytes or even petabytes of data. More importantly, you are required to do it in a manner that is secure, cheap, fast, easy, scalable, and failure-proof. How would you accomplish that? By requesting larger and larger budgets for servers and hard drives, all of them with a life span of a few months? Or would you prefer to use commodity hardware with a framework that handles all of that for you? If you chose the latter, Hadoop may be the answer. Hadoop is the software component that allows the IT infrastructure to grow organically and easily when it needs to. It prevents individual hardware failures from causing any real impact on the IT services and, ultimately, on the user experience.

    Not convinced yet? What if I told you that you can use Hadoop to do parallel processing and leverage all that combined horsepower from your commodity hardware? We will talk about how Hadoop accomplishes all these wonderful things later, but for now, let’s talk about the main Hadoop components.

Hadoop components

    Hadoop has two main components, not counting the Hadoop core functionality and YARN (Yet Another Resource Negotiator, a resource-management platform responsible for managing computing resources in clusters). These two main components are MapReduce and HDFS.

    MapReduce is a framework for performing calculations on the data in the distributed file system. With MapReduce, applications can process vast amounts (multiple terabytes) of data in parallel on large clusters in a reliable, fault-tolerant manner. MapReduce uses one “JobTracker” node, to which applications submit MapReduce jobs. The JobTracker “maps” or assigns work and pushes it out to available “TaskTracker” nodes in the cluster, striving to keep the work as close to the data as possible. Each TaskTracker performs the required processing and then the results are retrieved by the JobTracker node (this is the “reduce” part).

    The second component is the distributed file system, the main example of which is HDFS (Hadoop Distributed File System), although other file systems, such as IBM GPFS-FPO, are supported. In HDFS, a “NameNode” keeps track of the locations of all the data across the cluster. The other nodes are called “DataNodes” and store the data in “blocks” of information. The same blocks are copied to several DataNodes (generally to another DataNode in the same rack and also to a DataNode in a different rack). Obviously, the higher the data redundancy in the cluster, the safer the data is in case one or several DataNodes go down or suffer irreparable damage. A Hadoop user can easily specify the amount of replication that is needed and customize the way the blocks are handled.
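
    For example, the replication factor of a file already stored in HDFS can be changed from the command line (the path here is hypothetical; the -w flag waits until the new factor is actually reached):

        hdfs dfs -setrep -w 3 /data/important_file.csv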

High level architecture

Fig. 1: High-level architecture diagram showing the MapReduce and HDFS layers.

Related projects

    Due to its potential and usefulness, Hadoop is one of the most famous projects related to big data and it has inspired many related projects. Many implementations of Hadoop use some (or many) of these projects in order to build a robust ad-hoc infrastructure. Some of them are:

    Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and the ability to view MapReduce, Pig, and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.

    Avro: A data serialization system.

    Cassandra: A scalable multi-master database with no single points of failure.

    Chukwa: A data collection system for managing large distributed systems.

    Flume: A distributed, reliable, and highly available service for efficiently moving large amounts of data around a cluster.

    HBase: A non-relational, scalable, distributed database that supports structured data storage for large tables.

    Hive: A data warehouse infrastructure that provides data summarization and ad-hoc querying.

    Jaql: A query language designed for JavaScript Object Notation (JSON), primarily used to analyze large-scale semi-structured data. It started as an open source project at Google; IBM took it over as the primary data processing language for its Hadoop software package, BigInsights (see the “Hadoop and IBM” section below).

    Mahout: A scalable machine learning and data mining library.

    Oozie: A workflow coordination manager.

    Pig: A high-level data-flow language and execution framework for parallel computation.

   Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation. I will also be talking about Spark very soon.

    Sqoop: A command-line interface application for transferring data between relational databases and Hadoop.

    Tez: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive, Pig and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop MapReduce as the underlying execution engine.

    ZooKeeper: A high-performance coordination service for distributed applications.

EcoSys_yarn

Fig. 2: An example of a Hadoop environment using several related projects.

Who is using Hadoop?

    Take a look at the “Powered by” page on the Hadoop site and the Hadoop page on Wikipedia. You will notice a very large list of important names and some very impressive numbers. For example, in June of 2010, Facebook announced that it stored 100 petabytes of its data on Hadoop clusters, and on November 8 of 2012, it announced that the data gathered in its warehouse grows by roughly half a petabyte per day.

    Yahoo is by far the largest contributor to Hadoop, and there is a good reason for that. Yahoo Mail uses Hadoop to find spam. The Yahoo front page as well as the links and ads displayed to every user are both optimized using Hadoop.[1] Yahoo contributes all the work it does on Hadoop to the open-source community.

    Beyond the contributors to the Hadoop ecosystem, usage is widespread among companies that are simply users; proof of that is that more than half of the Fortune 50 use Hadoop.[2]

Hadoop and IBM

    IBM is aware of the potential in Hadoop and is leveraging it in some of its projects. For example, Watson uses IBM’s DeepQA software and the Apache UIMA (Unstructured Information Management Architecture) framework. The system was written in various languages, including Java, C++, and Prolog, and runs on the SUSE Linux Enterprise Server 11 operating system using the Hadoop framework to provide distributed computing.[3] Hadoop enables Watson to access, sort, and process data in a massively parallel system (90+ server cluster/2,880 processor cores/16 terabytes of RAM/4 terabytes of disk storage).[4]

    IBM BigInsights is a platform for the analysis and visualization of Internet-scale data volumes. It is powered by Hadoop and makes it easier to install and administer a Hadoop cluster using a web GUI. BigInsights makes it trivial to start and stop Hadoop services and to add or remove nodes from the cluster.

    In 2009, IBM discussed running Hadoop over the IBM General Parallel File System.[5] The source code was published in October 2009.[6]

Conclusion

    Supporting Hadoop will be critical in the near future. The advantages it offers are too many to simply ignore. As the use of big data grows (along with customers’ expectations), most companies will inevitably gravitate toward Big Data, Hadoop, and/or Hadoop-related technologies. Hadoop is probably not an end-user-facing technology like a mobile application or a website, but it is already ingrained in the back end of many popular services like Facebook, Twitter, Yahoo, LinkedIn, Spotify, and many others.

    In other words, Hadoop is not a prototype technology anymore; it is already an integral part of the infrastructure of live production applications with millions of concurrent users. Being able to understand and support Hadoop and all the related technologies will be essential for any IT team very soon, because either clients will request it or it will be a necessity for the internal operations of the organization itself.

References:

  1. https://developer.yahoo.com/blogs/hadoop/ll-hadoop-400-alex-3901.html
  2. http://www.prnewswire.com/news-releases/altiors-altrastar—hadoop-storage-accelerator-and-optimizer-now-certified-on-cdh4-clouderas-distribution-including-apache-hadoop-version-4-183906141.html
  3. https://en.wikipedia.org/wiki/Watson_%28computer%29
  4. https://blogs.apache.org/foundation/entry/apache_innovation_bolsters_ibm_s
  5. https://www.usenix.org/legacy/events/hotcloud09/tech/full_papers/ananthanarayanan.pdf
  6. https://issues.apache.org/jira/browse/HADOOP-6330

Alan Verdugo / 2015/12/02 / Uncategorized / 1 Comment