Correlations between ruling political parties and journalist assassinations

Abstract

    This research uses a dataset provided by the Committee to Protect Journalists to analyze the number of journalists killed in Mexico and the US from 1992 to 2016, compares the two countries, and examines whether the ruling political party is an important factor behind those deaths. The analysis indicates that the ruling political party cannot be directly linked to the deaths, although the political strategies implemented by the government may be related indirectly.

Motivation

    Mexico is widely regarded as one of the most dangerous countries for journalists[1], even compared to countries at war. In addition, the Mexican government is notorious for being corrupt and untrustworthy, either bribing the media or threatening it. This has been happening since the 1910s[1], and Mexico's involvement in illegal drug trafficking has only added to the dangers journalists face.

    Since the Mexican government has been known to bribe or threaten journalists, I wanted to get actual data that could show a correlation between the ruling political parties of the past and the amount of violence towards media workers. This could help Mexican citizens make a more informed decision before the presidential election of 2018.

Dataset

    The dataset that will be used is a comma-separated file provided by the Committee to Protect Journalists (https://cpj.org). It contains data about journalist assassinations committed from 1992 to 2016: 1,782 records with 18 variables: Type, Date, Name, Sex, Country_killed, Organization, Nationality, Medium, Job, Coverage, Freelance, Local_Foreign, Source_fire, Type_death, Impunity_for_murder, Taken_captive, Threatened, Tortured.

    The dataset can be downloaded from Kaggle.com: https://www.kaggle.com/cpjournalists/journalists-killed-worldwide-since-1992

Data preparation and cleaning

    Some of the records did not have a date for the journalist's death (it was labeled either “Unknown” or “Date unknown”). This prevented me from assigning them to a year, and thus to a presidential administration, so, sadly, I had to ignore them. In addition, the date format was relatively hard to parse: dates appeared as “February 9, 1998” instead of the more standard and machine-friendly ISO form “1998-02-09”.
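    A minimal sketch of this cleaning step, assuming the Kaggle export is a file named cpj.csv (hypothetical name) with the column names listed above:

```python
import pandas as pd

# Hypothetical file name for the Kaggle export; column names as listed above.
df = pd.read_csv("cpj.csv")

# Drop records whose date of death was never recorded.
df = df[~df["Date"].isin(["Unknown", "Date unknown"])]

# Parse dates such as "February 9, 1998"; anything unparseable becomes NaT
# and is dropped afterwards.
df["Date"] = pd.to_datetime(df["Date"], format="%B %d, %Y", errors="coerce")
df = df.dropna(subset=["Date"])
df["Year"] = df["Date"].dt.year
```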

Research questions

    Is there a correlation between the ruling political party and the number of journalist assassinations?

    Can we identify a corrupt government by analyzing the acts of violence against journalists that occurred during its administration?

Methods

    Using the Pandas library in Python, I filtered the records belonging to the countries of interest, grouped them by year, and built charts with Matplotlib, colored according to the presidential administration in power during each period. These visualizations make it easy to see any increase or decrease in journalist assassinations under each ruling party. After that, I built a pie chart of the “Source_fire” variable, which gives a better idea of the reason behind each assassination and helps to understand whether a death was a deliberate, targeted attack on the journalist or could be regarded as a work-related accident.

    The entire script that processed the data and generated the visualizations can be found here: https://github.com/alanverdugo/journalists_deaths_analysis/blob/master/cpj.py
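    A condensed sketch of that approach (it reuses the cleaned dataframe from the previous snippet; the party-to-year mapping is simplified to calendar years and the colors are illustrative, not necessarily the ones used in the actual script):

```python
import matplotlib.pyplot as plt

# 'df' is the cleaned dataframe built in the previous snippet.

def ruling_party(year):
    # Simplified to calendar years: PAN held the presidency from the 2000
    # election until 2012; the PRI governed before and after that period.
    return "PAN" if 2001 <= year <= 2012 else "PRI"

mexico = df[df["Country_killed"] == "Mexico"]
deaths_per_year = mexico.groupby("Year").size()

# Bar chart of deaths per year, one color per ruling party.
colors = ["blue" if ruling_party(y) == "PAN" else "green" for y in deaths_per_year.index]
deaths_per_year.plot(kind="bar", color=colors)
plt.ylabel("Journalists killed")
plt.title("Journalists killed in Mexico per year")
plt.show()

# Pie chart of the source of fire behind each death.
mexico["Source_fire"].value_counts().plot(kind="pie", autopct="%1.1f%%")
plt.ylabel("")
plt.title("Source of fire, Mexico")
plt.show()
```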

Findings

Figure 1: Number of journalists killed in Mexico.

    In this chart we see the two parties that have been in power in Mexico (in different colors). Journalists' deaths increased during the mid-2000s, then appeared to decrease, but now seem to be rising again.


Figure 2: Number of journalists killed in the USA.

    For comparison purposes, this is the same chart using US data. It can be seen that the US is much safer for journalists and that there is no clear correlation between the political party in power and journalists' deaths.


Figure 3: Source of fire for journalists' deaths in Mexico.

    This pie chart shows the source of the fire that caused journalists' deaths in Mexico. We can clearly see that most of the deaths were caused by criminal groups (most likely drug cartels).

Limitations

    The dataset is fairly small. This is one of the rare cases where not having a lot of data is a good thing (after all, even a single assassination is a tragic event). However, the relatively low number of deaths makes it hard to confidently find patterns or correlations. Due to the mystery and inherent danger behind some of these deaths, it is likely that many of them are never reported to the authorities, and even when they are, corruption could hinder the reach or veracity of the data. In other words, we are probably working with incomplete data.

Conclusions

    Practicing journalism in Mexico has been, and still is, a dangerous activity. Compared with other countries, we can see the relatively dangerous situation that journalists in Mexico face every day. A change in the Mexican political status quo did not solve the problem; in fact, it appears to have made it worse. This could mean that simply changing the political party in power is not enough, and that serious strategic changes in security, transparency, and drug-related policy are needed to ensure the safety of journalists and of Mexican citizens in general.

    The war on drugs, the military campaign that started in 2006, was one of the main triggers for the increase in violence during the late 2000s[2]. Drug cartels fought the military and each other for control of territory. However, that does not mean that journalists did not suffer attacks before the war on drugs or that they will not suffer them in the future. It is unknown how many bribes or threats journalists receive from corrupt officials or criminal organizations, so these findings should not be regarded as definitive.

    The geopolitical and socioeconomic situation of each country is also a complex subject that cannot be fully grasped with such a small set of data. For these reasons, a more complete analysis should be conducted to reliably identify any correlation between a country's rulers and acts of violence towards the media.

References

  1. List of journalists and media workers killed in Mexico. (2017, November 28). In Wikipedia, The Free Encyclopedia. Retrieved 17:38, December 9, 2017, from https://en.wikipedia.org/w/index.php?title=List_of_journalists_and_media_workers_killed_in_Mexico&oldid=812565094
  2. Timeline of the Mexican Drug War. (2017, December 4). In Wikipedia, The Free Encyclopedia. Retrieved 04:52, December 9, 2017, from https://en.wikipedia.org/w/index.php?title=Timeline_of_the_Mexican_Drug_War&oldid=813681343
  3. Mexican Drug War. (2017, December 4). In Wikipedia, The Free Encyclopedia. Retrieved 04:52, December 9, 2017, from https://en.wikipedia.org/w/index.php?title=Mexican_Drug_War&oldid=813724408
  4. List of Presidents of the United States. (2017, December 7). In Wikipedia, The Free Encyclopedia. Retrieved 05:09, December 8, 2017, from https://en.wikipedia.org/w/index.php?title=List_of_Presidents_of_the_United_States&oldid=814280717

Portable Stream and Batch Processing with Apache Beam

    I had the opportunity to attend another of Wizeline's Academy sessions. This time it was about Apache Beam, an open source batch and stream processing project. Wizeline brought three instructors from Google itself to explain what Beam is, how to use it, and its main advantages over its competitors. All three instructors had impressive backgrounds and were very nice and open to comments and questions from a group of students who only had basic knowledge of the subject.

    Davor Bonaci's explanations were particularly useful and interesting. He has a lot of experience speaking at conferences about these topics and it shows. He was able to clearly explain such a complex technology in a way anyone could understand, while still making sure we grasped the huge potential in Beam.

 

    There were three concepts that I found extremely interesting:  

    Windowing: In streaming, we will eventually receive data that arrives out of order relative to when it was generated and to the other data we have already received. Windowing defines how lenient we will be with this “late” data, and gives us an easy way to categorize it and group it according to our business rules.

    An example of this would be receiving records at 12:30 that were created at 12:10. At that point in time we should only be processing records created in the last few minutes (or even seconds, depending on your needs). However, that late record could be crucial for our processing, and we need a way to decide whether to keep it or ignore it completely. With windowing, we can do exactly that.
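    A minimal sketch of what this looks like in the Beam Python SDK, assuming an unbounded Pub/Sub source (the topic name is made up) and ten-minute fixed windows that tolerate up to five minutes of lateness; pipeline and runner options are omitted:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import AfterWatermark, AfterCount, AccumulationMode
from apache_beam.utils.timestamp import Duration

with beam.Pipeline() as pipeline:
    counts = (
        pipeline
        # Hypothetical unbounded source; any streaming source works here.
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "Window" >> beam.WindowInto(
            window.FixedWindows(10 * 60),                 # 10-minute fixed windows
            trigger=AfterWatermark(late=AfterCount(1)),   # re-emit results when late data arrives
            accumulation_mode=AccumulationMode.ACCUMULATING,
            allowed_lateness=Duration(seconds=5 * 60))    # keep data up to 5 minutes late
        | "PairWithOne" >> beam.Map(lambda record: (record, 1))
        | "CountPerKey" >> beam.CombinePerKey(sum))
```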

    Autoscaling: This is probably the holy grail of IT infrastructure management. The ideal scenario would be a “lean infrastructure”: one that, at any given moment, has exactly the amount of processing power it needs, no more and no less. However, thanks to the global nature of the Internet and the variations in usage across time zones and seasons, this is practically impossible to achieve. Resources end up either over- or under-allocated: the first means wasting at least part of the infrastructure (and hence money); the second means being unable to handle usage spikes when they occur (and they will occur eventually if you are doing things right).

    As the name implies, autoscaling lets the infrastructure grow organically when and how it needs to, and shrink again once the extra capacity is no longer needed. This has huge benefits: the peace of mind of knowing that the infrastructure can take care of itself, and the knowledge that servers will not be carelessly over-provisioned. The cloud only needs to be properly orchestrated according to the data processing needs, and we can finally deliver this. I can only imagine what will be possible once this is combined with the power of containers or even unikernels and their transient microservices.

    You can read more about Autoscaling here: https://cloud.google.com/blog/big-data/2016/03/comparing-cloud-dataflow-autoscaling-to-spark-and-hadoop

    Dynamic workload rebalancing: When a processing job is created, it is (hopefully) distributed equally across all the nodes of a cluster. However, due to subtle differences in bandwidth, latency, and other factors, some nodes always end up finishing their assignments later, and at the end of the day the job is not finished until the last of these stragglers is done. In the meantime, many nodes sit idle, waiting for the stragglers. Dynamic workload rebalancing means that these idle nodes take over as much of the stragglers' remaining work as possible, which in theory reduces the overall completion time of the job. Coupled with autoscaling, this could keep the waste of resources to a minimum.

    You can read more about dynamic workload rebalancing here: https://cloud.google.com/blog/big-data/2016/05/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow

 

    One student asked whether it would be worth studying Beam and forgetting about Spark and the other platforms. It may sound like a simple question, but it is something we were all thinking. Davor's response was great. He said that whatever we study, our main focus should be on writing code and building infrastructure that can scale regardless of the platform we wish to use. Beam is not a Spark killer; they have different approaches, different methodologies, and the people working on these projects have different ambitions, goals, and beliefs. Besides, the community keeps evolving, the projects will continue to change, some will be forgotten, and new ones will be created. There is huge interest in data-processing tools due to the increased speed and volume of our data needs, which will only keep growing. Because of that, this part of IT is experiencing violent and abrupt growing pains. I just don't think we can settle on learning a single technology right now, since it may disappear (or change dramatically) in the very near future.

    There are some things that can still be improved in Beam. One example is the interaction with Python and Spark; another would be making it more user-friendly. But there is a group of smart people quickly tackling these issues and adding great new features, so it would be a good idea to keep learning about Beam and to consider it for our batch and stream processing needs.

    Overall, I really enjoyed the workshop. As I mentioned before, all three instructors were very capable and had a deep understanding of the technology, its use cases, and its potential. Besides, it was really enjoyable to talk with them about the current state and the future of data processing. I will certainly keep paying attention to the Beam project.

    I would like to thank Davor, Pablo and Gris from Google, and all the team behind the Wizeline academy initiative.

Analyzing movie rating data from an IMDB.com dataset using Python, Pandas and Matplotlib

    Since the dawn of cinema, the quality and enjoyment produced by motion pictures have been a complicated and controversial subject. An entire sub-industry has been created to review, criticize, recommend, analyze, categorize, and rate movies. This, added to the subjective nature of each individual's likes and dislikes, has resulted in mixed experiences and expectations among the public. Movies regarded as timeless classics by some people are seen as boring or even bad by others. The passage of time and the recent heavy use of special effects and CGI also affect how movies will be regarded in a few years, once those special effects look outdated.

    However, we may find a general trend of increased satisfaction or dissatisfaction if we analyze a large number of movie ratings.

    The code I wrote for this analysis is available in my GitHub repository.

Research question

    Since many critics refer to their favorite period as the best era cinema has to offer (or, alternatively, claim that movie quality is in decline), we will attempt to answer the following question:

Has the perceived quality of movies increased or decreased over time?

    Whatever answer we find, we will support it with data and its visualizations.

Findings

    Using an IMDB.com dataset, I analyzed 45,844 movies and 26,024,290 ratings for said movies. The oldest movie in the dataset was launched in 1874 and the newest in 2017.

    I grouped the movies by launch year and calculated the average rating of the movies launched each year. By doing this, I wanted to get an idea of the overall quality of movies through time.
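    A rough sketch of that grouping, assuming the dump ships separate movie and rating tables named movies.csv and ratings.csv and that the launch year appears in the title, as in "Toy Story (1995)" (all of these are assumptions; the actual file layout may differ):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file names for the movie and rating tables.
movies = pd.read_csv("movies.csv")    # movieId, title, ...
ratings = pd.read_csv("ratings.csv")  # userId, movieId, rating, timestamp

# Pull the launch year out of titles such as "Toy Story (1995)".
movies["year"] = movies["title"].str.extract(r"\((\d{4})\)", expand=False).astype(float)

# Average rating per launch year.
merged = ratings.merge(movies[["movieId", "year"]], on="movieId").dropna(subset=["year"])
avg_per_year = merged.groupby("year")["rating"].mean()

avg_per_year.plot()
plt.xlabel("Launch year")
plt.ylabel("Average rating")
plt.show()
```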

    While the technical aspects of motion pictures have obviously advanced thanks to new technology, it was not clear whether this also improved the overall quality of the movies. The next image presents a chart of the relation between launch year and the average rating of the movies launched in each year.

Fig. 1 – Movie average rating per year.

    Some interesting facts I got from this analysis:

  • The initial period (from 1874 to around 1915) is chaotic and experimental. A film could be under a minute long, there was little to no cinematic technique, and films were usually black and white and silent.
  • In the 1920s, a process of normalization begins. Movies become more popular and accessible, and by this time the public seems to have learned what to expect from directors and actors. The first steps in the commercialization of sound cinema were taken in the mid-to-late 1920s, which could have contributed to this normalization.
  • 2014 was a relatively disastrous year for cinema. The average rating for that year is 2.95, the lowest since 1917, which is still part of the “experimental period”. The causes are beyond this analysis, but I will remind the reader that 2014 gave us movies like Transformers: Age of Extinction and Left Behind, which currently has a 1% score on rottentomatoes.com.
  • On a scale from 0 to 5, the average tends to be slightly above 3. There is no noticeable increase or decrease from this average over the last century. So, to answer the research question, we cannot say that the perceived quality of cinema has increased or decreased substantially.

References

  1. History of film. (2017, November 26). In Wikipedia, The Free Encyclopedia. Retrieved 04:03, November 29, 2017, from https://en.wikipedia.org/w/index.php?title=History_of_film&oldid=812220038
  2. Sound film. (2017, November 28). In Wikipedia, The Free Encyclopedia. Retrieved 04:04, November 29, 2017, from https://en.wikipedia.org/w/index.php?title=Sound_film&oldid=812587250

Trying out Microsoft SQL Server on Linux

    Years ago I would never have thought I would write a title like that, but things change, and Microsoft is now paying attention to environments other than its own. This has given us the chance to try some of its tools without being forced to install Windows, which is much appreciated.

    I am going to install SQL Server on Linux (specifically on Linux Mint). Here I will detail the installation process as a tutorial and also run some simple queries to demonstrate the use of SQL Server.

    During the installation we will need administrator permissions, so we switch to our root account. We then create a directory for SQL Server tools and notes and move into it (this step is entirely optional). Next, we download and install the keys for Microsoft's repositories, add the Microsoft repository, refresh our package lists and, finally, install SQL Server.
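    The commands for those steps look roughly like the following. They assume a Linux Mint release based on Ubuntu 16.04 and the SQL Server 2017 repository; the URLs come from Microsoft's documentation and may need to be adjusted for other releases:

```bash
sudo su                                    # work as root for the whole installation
mkdir -p ~/sql_server && cd ~/sql_server   # optional working directory

# Import Microsoft's repository key and register the SQL Server repository
# (Ubuntu 16.04 / SQL Server 2017 URLs assumed; adjust for your release).
curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
add-apt-repository "$(curl https://packages.microsoft.com/config/ubuntu/16.04/mssql-server-2017.list)"

# Refresh the package lists and install SQL Server.
apt-get update
apt-get install -y mssql-server
```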

    We then run the configuration tool, verify that the SQL Server service is running correctly, install the command-line tools, and connect to our local instance.
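    Again as a rough sketch, under the same assumptions about the release (the sa password is whatever you choose during setup):

```bash
# Accept the license, pick an edition, and set the 'sa' password.
/opt/mssql/bin/mssql-conf setup

# Check that the service is up.
systemctl status mssql-server

# Install the command-line tools (sqlcmd, bcp) from the 'prod' repository.
curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list > /etc/apt/sources.list.d/msprod.list
apt-get update
apt-get install -y mssql-tools unixodbc-dev

# Connect to the local instance (use /opt/mssql-tools/bin/sqlcmd if it is not on your PATH).
sqlcmd -S localhost -U sa
```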

    From sqlcmd we create a test database and list the existing databases; then, inside the new database, we create a test table, insert a record, and run a test query.
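    Something along these lines (the database, table, and sample values are just illustrative examples):

```sql
-- Create a test database and list the existing databases.
CREATE DATABASE testdb;
SELECT name FROM sys.databases;
GO

-- Create a test table in the new database and insert a record.
USE testdb;
CREATE TABLE inventory (id INT, name NVARCHAR(50), quantity INT);
INSERT INTO inventory VALUES (1, 'banana', 150);
GO

-- A test query.
SELECT * FROM inventory WHERE quantity >= 100;
GO
```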

    As you can see, SQL Server was unexpectedly easy to install, configure, and use, especially considering that we used an operating system that is not Microsoft's own. In this exercise we learned to create tables and run queries with the SQL Server tools, which are very similar to the tools of other RDBMSs, such as MySQL.

Getting out of the maze with A star

    A local IT company (who shall remain unnamed in this post and shall be thankful for that) was offering free tickets to this year’s Campus Party event in Mexico. To get the tickets, you needed to complete a programming challenge. Since I’ve never attended any Campus Party* and I enjoyed solving the programming challenge for the Wizeline event, I took some time to solve this one.

    Basically, the challenge was to find the optimal way out of a square, two-dimensional maze. Using an API, you registered for the challenge and requested a maze of size n by n; you were then supposed to find the optimal path out of it (using a program, of course) and submit that path as a list of (x,y) positions, starting at (0,0) and finishing at the position where a goal (designated by “x”) was placed. An example of a small 8×8 maze would be something like this:

    The zeroes are obstacles or walls, the ones are clear paths. So the solution in the previous example is obvious: (0,0), (0,1), (0,2), (0,3), (1,3), (2,3), (2,2), (3,2), (4,2), (4,3), (4,4), (5,4), (6,4), (7,4), (7,5), (7,6), (7,7).

    Here is another example:

    Which is still obvious. However, when I began to request bigger mazes I noticed the full complexity of the problem: there were many bifurcations and dead ends. Of course, mazes are supposed to be confusing; that is the whole point of their existence. On top of that, I had to submit the shortest path from start to finish, which meant I could not use a brute-force method that simply walks every possible path in the maze until it finds the goal. The best part was that I had to solve a 1000×1000 maze to claim the prize.

    A 1000×1000 maze might not sound very big, but once you think about all the possible configurations in that space, you realize it is not an easy task. Thankfully, getting out of mazes is a very old problem, pioneered by Cretan kings who wanted to hide away their funny-looking stepsons. For that reason, a lot of smart people have spent a lot of time looking for the best solution to such a problem, better known as the “shortest path problem”. Among those people was Edsger W. Dijkstra, a Dutch mathematician and computer scientist who rarely used a computer. Dijkstra is one of the elder gods of computer science and now spends his afterlife looking disapprovingly at students who use GOTO statements.

    In 1959, Dijkstra published an eponymous algorithm to find the shortest path between two points in any structure where there may be obstacles, varying distances, bifurcations, and dead ends. This algorithm (or a similar one, at least) is what mapping software uses to recommend routes (I believe Google Maps uses contraction hierarchies, since it needs to, and can, pre-compute routes in order to improve execution times).

    One of these variations of Dijkstra's algorithm is the A* algorithm (pronounced “A star”). It was created in 1968 by Peter Hart, Nils Nilsson, and Bertram Raphael, all of them researchers at the Stanford Research Institute. A*, in turn, has many variations.
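    A minimal sketch of grid A* (not the exact code in my repository): it uses the Manhattan distance as the heuristic and the same convention as the mazes above, where 1 is a walkable cell and 0 is a wall; coordinates here are (row, column).

```python
import heapq

def astar(grid, start, goal):
    """Shortest path on a grid where 1 is walkable and 0 is a wall.

    Returns the path as a list of (row, col) tuples, or None if no path exists.
    """
    def heuristic(a, b):
        # Manhattan distance: admissible for 4-directional movement.
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    rows, cols = len(grid), len(grid[0])
    open_heap = [(heuristic(start, goal), 0, start)]   # entries are (f, g, node)
    came_from = {}
    best_g = {start: 0}

    while open_heap:
        _, g, current = heapq.heappop(open_heap)
        if current == goal:
            # Rebuild the path by walking the parent links backwards.
            path = [current]
            while current in came_from:
                current = came_from[current]
                path.append(current)
            return path[::-1]
        if g > best_g.get(current, float("inf")):
            continue  # stale heap entry, a cheaper route was already found
        r, c = current
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 1:
                new_g = g + 1
                if new_g < best_g.get((nr, nc), float("inf")):
                    best_g[(nr, nc)] = new_g
                    came_from[(nr, nc)] = current
                    f = new_g + heuristic((nr, nc), goal)
                    heapq.heappush(open_heap, (f, new_g, (nr, nc)))
    return None

# Tiny example: 1 = open cell, 0 = wall.
maze = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
]
print(astar(maze, (0, 0), (2, 2)))  # [(0, 0), (1, 0), (1, 1), (1, 2), (2, 2)]
```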

    So, I used an implementation of the A* algorithm to successfully find the shortest path in the 1000×1000 maze. I sent my solution and, even though the API itself confirmed it was the optimal path to exit the maze, I never got the prize, not even a reply saying “somebody solved it before you” or “we ran out of prizes”. I found that very unprofessional and irritating, since the rules specifically said to email a certain person about the solution.

    Since the Campus Party is now over and I am still a little salty about being ignored, I uploaded my solution to my Github repository. It is a very quick and dirty solution, but it works, so don’t laugh too much (or better yet, improve it and create a pull request). Thankfully, I learned many interesting things and had fun doing this programming exercise so it was not a complete waste of time.

 

* The total number of actual parties I’ve attended tends to zero.