Data Science in the Social Sciences and the Humanities: twitter


Tuesday, April 15, 2014

Social Media Mining with R

Danneman, N. and Heimann, R. (2014). Social Media Mining with R. Kindle Edition. File size: 1414 KB.

One line of thought in the study of human intelligence holds that cultural products, language among them, not only mediate our thoughts but also mould our representation of reality (Bruner, 1991). On this view, according to Bruner, an individual's working intelligence can only be understood by taking into account the

"reference books, notes, computer programs and data bases, or most important of all, the network of friends, colleagues, or mentors on whom [this working intelligence] leans for help and advice" (p. 3).

If we understand intelligence as more than a narrow academic skill based mostly on book learning and test taking (Wikipedia), and define it as

"a broader and deeper capability for comprehending our surroundings -- 'catching on,' 'making sense' of things, or 'figuring out' what to do" (ibid.),

then one way to understand how people construct their reality, and to predict it, is to investigate those sources of information and, when possible, the network of friends.

The WWW, with its 1.7 billion pages (worldwidewebsize.com), has become that source of information and a "place" to investigate the networks of friends on whom people lean for help and advice today. The WWW provides not only physical and tangible products but also ideas and perspectives. Baby Boomers and Millennials (the latter born between 1977 and 1995) are equally likely to search the Web before making decisions (Bazaar Voice, 2012). While 71% of Baby Boomers use the information they collect to buy in stores, Millennials, who are constantly connected and highly dependent on social media, prefer to buy online (52%) (ibid.). In fact, Millennials have more confidence in people's opinions about brands than in the information provided by companies (ibid.). The Web, in short, is mentoring people in making sense of things, figuring out what to do, and catching on. Consequently, to understand, serve, and predict individuals' working intelligence today, we need to mine the WWW.

This is not an easy task. Although there is a lot of information about text mining in general, and about mining the Web in particular, it tends to focus on algorithms and model development, which makes it highly specialized and out of reach for the general reader. This is where Social Media Mining with R, written by Nathan Danneman and Richard Heimann, comes into play. The book offers some insight into the theory behind mining the Web, specifically social media; suggests free, open-source software for doing it, R; and presents three case studies from which we can grasp the procedure, all in only 120 pages.

In three of the six chapters, the authors give a short background on their approach to social data mining. Among the topics discussed is why the Web is an extraordinary source of opinionated social data, meaning data generated by people, or by their interactions, in which they express their sentiments, evaluations and opinions. This data is produced in real time and on a large scale. Following the tradition of social science, the authors intend to use social media data to ask and answer questions about individual and group-level behavior. Danneman and Heimann are aware of the pitfalls of social media data; consequently, they devote an entire chapter to discussing them. They also illustrate the differences between the traditional social data commonly used in social science and social media data. They advocate for the latter despite its limitations, provided it is used with creativity, curiosity and a dose of healthy skepticism, because it is available and can help answer the vast majority of emerging questions related to business, politics, and social life (p. 55), questions for which no traditional social data actually exists.

In chapters 2 and 3, the authors introduce us to R and teach us how to collect tweets using the package twitteR. Regrettably, the procedure they provide for obtaining OAuth credentials for Twitter does not work, and it generates the same error message that has lately been posted on various R blogs and e-mail lists. This does not mean that readers won't eventually figure out how to harvest tweets, but it is not going to be easy. In any case, it is worth reading chapter 3 to get an idea of the possible outcomes of this kind of analysis. In chapters 5 and 6, the authors give us a framework for dealing with social media data. Chapter 5 covers the fundamentals of sentiment extraction as well as the theoretical foundations of the techniques applied in chapter 6. Three methods are suggested: two unsupervised ones (processes that do not need previously labeled data to generate an outcome), a lexicon-based sentiment approach and Item Response Theory for text scaling (ITS); and a supervised one, a Naïve Bayes classifier. The first method consists of counting the opinion words in a subset of data from a particular source (p. 60). The ITS approach takes people's previous opinions on a given topic and, according to the sentiments they have used, places them, or their documents, on a continuous scale that represents the author's sentiment toward the topic under study (p. 63). As for the Naïve Bayes classifier, it is used to classify new observations, in this case opinionated data, based on existing data.
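The lexicon-based idea of counting opinion words can be sketched in a few lines. What follows is a minimal Python illustration, not the book's R code; the two tiny word lists, the sample sentences and the function name `lexicon_score` are all assumptions made up for the example.

```python
import re

# Hypothetical mini-lexicons; real opinion lexicons contain thousands of words.
POSITIVE = {"good", "great", "strong", "growth", "improved"}
NEGATIVE = {"bad", "weak", "decline", "poor", "slow"}

def lexicon_score(text):
    """Return (number of positive words) - (number of negative words)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(1 for tok in tokens if tok in POSITIVE)
    negatives = sum(1 for tok in tokens if tok in NEGATIVE)
    return positives - negatives

# Made-up sentences, only to show the call pattern.
documents = [
    "Growth was strong but hiring was slow",
    "Sales were weak and demand was poor",
]
scores = [lexicon_score(doc) for doc in documents]
for doc, score in zip(documents, scores):
    print(score, doc)
```

A score above zero suggests a mostly positive document and a score below zero a mostly negative one; the result is only as good as the lexicon behind it.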

Finally, chapter 6 puts into practice everything discussed in chapter 5. The authors use two different data sources as case studies: the Beige Book (Summary of Commentary on Current Economic Conditions), published by the Federal Reserve Board (FRB), and 4000 tweets hashtagged #prolife and #prochoice. The authors apply the lexicon-based sentiment approach to the Beige Book, and ITS and the Naïve Bayes classifier to the tweets. Danneman and Heimann take us by the hand and guide us step by step through each of these methods, to the point that we can even calculate how much RAM we may need depending on our data. Each script, which can be downloaded from the publisher's web page, is fully explained, so that we can "see" what we are doing. I downloaded the code, but R complains with a message indicating that a function or command in the provided R sources is deprecated.
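To make the supervised step more concrete, here is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing, written in plain Python rather than the book's R. The training sentences, the labels and the function names are hypothetical toy material, not the tweets or the code from the book.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z#']+", text.lower())

def train_naive_bayes(labeled_docs):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab

def classify(text, model):
    """Return the label with the highest posterior log-probability."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label, n_docs in class_counts.items():
        logp = math.log(n_docs / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # skip words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            logp += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Hypothetical labeled examples standing in for already-classified documents.
training = [
    ("great service and friendly staff", "positive"),
    ("loved the great food", "positive"),
    ("terrible service and rude staff", "negative"),
    ("the food was awful", "negative"),
]
model = train_naive_bayes(training)
print(classify("great food", model))
print(classify("awful rude staff", model))
```

The classifier learns word frequencies per class from existing labeled data and uses them to label new text, which is the sense in which the method is "supervised".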

I recommend this book to anybody who wants to take up the fascinating task of mining social data and capturing the reality people create in almost real time. You can read it in 4-6 hours and, depending on how quickly you pick up R, you will need two or three days to replicate the case studies. Complete beginners will have some problems, though.

References

Bazaar Voice. (2012). Talking to Strangers: Millennials Trust People Over Brands. In: http://resources.bazaarvoice.com/rs/bazaarvoice/images/201202_Millennials_whitepaper.pdf

Bruner, J. (1991). The Narrative Construction of Reality. In: http://www.semiootika.ee/sygiskool/tekstid/bruner.pdf

Wikipedia. Intelligence. In: http://en.wikipedia.org/wiki/Intelligence

The size of the World Wide Web (The Internet). (2014). In: http://worldwidewebsize.com/

Sunday, January 20, 2013

Venezuelan deputies. Twitter use. Henry Ramos Allup

Although more active than deputy Machado's account, deputy Ramos Allup's account follows a similar pattern: periods of intense activity, most likely in response to some current event, followed by periods of inactivity. Likewise, the account becomes especially active in the first week of January.

Monday, July 2, 2012

Twitter @VTVcanal8.

Saturday, June 30, 2012

Presidential campaign. The start, in 100 tweets from the candidates. July 1, 00:00 am.

Saturday, June 16, 2012

Elections. Presidential candidates. Twitter. Comparison of frequencies

Analyzing the topics covered in the presidential candidates' tweets on 16/06/2012 at 10 pm, we observe that both share a set of topics that recur with the same frequency in each Twitter account. However, each also raised topics that the other did not address. For example, subtracting the term frequencies, Chávez minus Capriles, reveals the topics treated less by Chávez and more by Capriles. Topics closest to zero indicate only slightly less coverage by Chávez; they are shown in light blue. Topics used heavily by Capriles and not used, or barely used, by Chávez lie far from zero and are shown in dark blue.
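The subtraction of term frequencies described here can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code behind the original plots: the two short "tweet" lists are invented placeholders, and the function names are hypothetical.

```python
import re
from collections import Counter

def term_frequencies(tweets):
    """Count how often each word appears across a list of tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(re.findall(r"\w+", tweet.lower()))
    return counts

def frequency_difference(tweets_a, tweets_b):
    """Return {term: frequency in A minus frequency in B} over all terms."""
    freq_a = term_frequencies(tweets_a)
    freq_b = term_frequencies(tweets_b)
    terms = set(freq_a) | set(freq_b)
    # A Counter returns 0 for terms it has never seen.
    return {term: freq_a[term] - freq_b[term] for term in terms}

# Invented placeholder texts, not real tweets from either account.
account_a = ["homeland homeland future"]
account_b = ["future progress progress progress"]

diff = frequency_difference(account_a, account_b)
for term, value in sorted(diff.items(), key=lambda item: item[1]):
    print(term, value)
```

Terms with a difference of zero are used equally by both accounts, and strongly negative values mark terms used mainly by the second account, which matches the reading of the light and dark blue topics.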

Presidential elections. Candidates' tweets, 16/12/2006, 10 pm

Both candidates mentioned some topics with the same frequency:

despite the differences...

@chavezcandanga. Followers' self-definition. Correspondence analysis

Among the 3000 followers of @candanga analyzed, we can find at least 4 groups. The first we have called intellectuals (students, professionals and internationals): they use terms related to academic or professional activities to describe themselves. The sentimentals, for their part, define themselves mainly with words related to emotions and feelings. These two groups are closely related because both consist basically of students and professionals.

The tasters define themselves by the things they like or prefer to do. Finally, there are two groups that we have generically called internationals, although the content of their messages differs.

Breaking the groups apart to make the terms they used to describe themselves easier to read, we find:

@chavezcandanga. Followers' self-definition

@chavezcandanga. Grouping of 3000 followers by number of tweets, accounts they follow and followers they have

@chavezcandanga. Frequencies of characters, words and character length used

@chavezcandanga. Most frequent terms

@chavezcandanga. Most frequent hashtags, retweets and @mentions

Elections. What does @chavezcandanga talk about?

We used a correspondence analysis to obtain nine groups from the content of @chavezcandanga's messages. A first group of messages, in yellow, gathers the replies to requests for help or information made by some of the account's followers. Tarek is mentioned several times in this group; quite possibly the intent is to channel those requests through this person.
The groups colored brown, green and blue relate to projects and missions. The brown group discusses projects that are under way but not yet finished, some that are about to start, and some that are proceeding slowly; patience is asked for. The blue group reports on the funds approved for various projects. Finally, the green group discusses the various projects in terms of the social goals they meet or will meet.

There is a strong relationship among the red, orange and violet groups. All three contain messages about Venezuela as homeland and the projects being carried out to build the desired Venezuela (hence this group lies very close to the blue group and is related to the green group), and the message is addressed above all to Venezuelans. The orange group has terms related to Latin America as the "patria grande" (greater homeland), while the violet group relates to Venezuelan and Latin American identity.

The fuchsia group relates to messages about what to do on October 7. Finally, in gray we find messages related to execution. Since the group is isolated, it may refer to execution in public administration in general, or to the President's socialist project. Breaking out some messages and changing the color of the yellow group to improve readability, we obtain:

@chavezcandanga. People/institutions followed

Venezuela. Elections. Comparison cloud of 1500 tweets from @hcpriles and @chavezcandanga

Making some modifications to improve the readability of the terms:

The size of each word indicates how often the term appears across the tweets.
Terms in blue, or predominantly blue, were tweeted only by Capriles or more often by him; terms that are red, or predominantly red, were used more by Chávez, or only by him.
