I present a range of data science projects in the context of the social sciences (computational social sciences) and the humanities (digital humanities).
Friday, October 3, 2014
Fetching and cleaning text from the Web using 'nltk' (Python)
Notebook
Let's look at an example of how to use Python's 'nltk' package to download documents from the web and clean them so that their content can be analyzed afterwards. Once the text has been cleaned, we give a very brief example of a concordance analysis.
Populating the interactive namespace from numpy and matplotlib
In [6]:
# import the packages needed to work with HTML
# (Python 2; in Python 3, urlopen lives in urllib.request)
from urllib import urlopen
import nltk
In [14]:
# set the URL from which we will fetch the corpus
pop = "http://www.foreignaffairs.com/articles/141191/cynthia-j-arnson-and-carlos-de-la-torre/viva-el-populismo"
In [15]:
# download the page
populismo = html = urlopen(pop).read()
In [16]:
# check the data type of populismo
type(populismo)
Out[16]:
str
In [17]:
# inspect the length of the downloaded file
len(populismo)
Out[17]:
68394
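The cells above are Python 2, where `urlopen(...).read()` returns a `str` of raw bytes (hence escapes such as `\xe2\x80\xba` in the output further down). As a minimal sketch of the same download step in Python 3, assuming the page declares its charset in the response headers and falling back to UTF-8 otherwise, one could write something like the following. The helper name `fetch_text` is hypothetical, and a `data:` URL stands in for the real page so the example runs offline.

```python
from urllib.request import urlopen


def fetch_text(url, default_encoding="utf-8"):
    """Download `url` and return its body decoded to text (hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()  # bytes in Python 3
        # Use the charset declared in the response headers, if there is one.
        encoding = response.headers.get_content_charset() or default_encoding
    return raw.decode(encoding, errors="replace")


# A data: URL replaces the real article URL so no network access is needed.
sample = fetch_text("data:text/plain;charset=utf-8,Hugo%20Ch%C3%A1vez")
print(sample)
```

Decoding early means that later steps work with real characters (`›`, `á`) instead of multi-byte escape sequences.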
In [29]:
# pick an arbitrary slice of the text string
populismo[120:2380]
Out[29]:
'rs \n \n \n \n \n \n \n \n\n \n \n \n \n\n \n \n \n \n \n \n \n\n\n\n\n \t \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n\n \n\n \n \n \n \n \n Skip to Navigation \n\n \n \n \n \n\n \n \n \n \n\n \n \n \n\n \n \n \n \n\n \n \n \n \n \n\n \n\n \n \n \n \n \n \n \n \n Foreign Affairs \n \n \n \n \n \n \n \n \n\n \n \n \n Home \n International Editions \n Digital Newsstand \n Job Board \n Account Management \n RSS \n Newsletters \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n\n \n \n\n \n \n \n\n \n \n \n Login \n Register \n My Cart \n \n\n \n \n \n\n \n\n \n\n \n\n \n \n \n \n \n \n\n \n \n New Issue \n Archive \n Regions Africa \n Americas \n Asia \n Europe \n Middle East \n Russia & FSU \n Global Commons \n \n Topics Economics \n Environment \n Security \n Law & Institutions \n Politics & Society \n U.S. Policy \n \n Features Snapshots \n Letters From \n P.S. \n Reading Lists \n Comments \n Essays \n Responses \n \n Discussions Interviews \n Roundtables \n Letters to the Editor \n News & Events \n \n Video \n Books & Reviews Review Essays \n Capsule Reviews \n FA Books \n \n Classroom \n About Us Submissions \n Staff \n Employment \n Advertising \n Sponsored Sections \n Contact Us \n History \n \n Subscribe \n \n\n \n \n \n\n \n\n \n \n \n Home \xe2\x80\xba Features \xe2\x80\xba Snapshots \n \n \n Viva el Populismo? \n The Tense Future of Latin American Politics \n \n \n \n By Cynthia J. Arnson and Carlos de la Torre \n \n CYNTHIA J. ARNSON is director of the Latin American Program at the Woodrow Wilson International Center for Scholars. CARLOS DE LA TORRE is director of international studies and professor of sociology at the University of Kentucky, Lexington. They are the editors of Latin American Populism in the Twenty-First Century (Woodrow Wilson Center Press and The Johns Hopkins University Press, 2013), upon which this essay draws. \n See more by Cynthia J. 
Arnson See more by Carlos de la Torre \n \n \n April 16, 2014 \n \n \n \n Venezuelan President Nicolas Maduro waves to supporters during a campaign rally on April 6, 2013 (Courtesy Reuters) \n \n\n \n \n \n \n \n \n\n \n \n \n'
What we have is the raw page. We need to clean it in order to extract the information we need. The 'nltk' package has a function that lets us do this cleanup quickly:
In [21]:
populismo=nltk.clean_html(populismo)
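A note for readers on newer versions: `nltk.clean_html` belongs to the NLTK 2.x series; in NLTK 3 the function is no longer implemented and points users to BeautifulSoup's `get_text()` instead. As a rough stand-in that needs only the standard library, the sketch below strips tags with `html.parser`. It is not what `clean_html` did internally; the class `TextExtractor`, the function `strip_html` and the sample markup are all made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(markup):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


# Invented sample markup, loosely modeled on the article header.
page = (
    "<html><head><style>p{color:red}</style></head><body>\n"
    "<h1>Viva el Populismo?</h1>\n"
    "<p>By Cynthia J. Arnson &amp; Carlos de la Torre</p>\n"
    "</body></html>"
)
print(strip_html(page))
```

Real pages are messier than this sample, so a dedicated HTML library is usually the safer choice for anything beyond a quick look.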
In [28]:
populismo[1:2380]
Out[28]:
'ynthia J. Arnson and Carlos de la Torre | The Tense Future of Latin American Politics | Foreign Affairs | Foreign Affairs \n \n \n \n \n \n \n \n\n \n \n \n \n\n \n \n \n \n \n \n \n\n\n\n\n \t \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n\n \n\n \n \n \n \n \n Skip to Navigation \n\n \n \n \n \n\n \n \n \n \n\n \n \n \n\n \n \n \n \n\n \n \n \n \n \n\n \n\n \n \n \n \n \n \n \n \n Foreign Affairs \n \n \n \n \n \n \n \n \n\n \n \n \n Home \n International Editions \n Digital Newsstand \n Job Board \n Account Management \n RSS \n Newsletters \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n\n \n \n\n \n \n \n\n \n \n \n Login \n Register \n My Cart \n \n\n \n \n \n\n \n\n \n\n \n\n \n \n \n \n \n \n\n \n \n New Issue \n Archive \n Regions Africa \n Americas \n Asia \n Europe \n Middle East \n Russia & FSU \n Global Commons \n \n Topics Economics \n Environment \n Security \n Law & Institutions \n Politics & Society \n U.S. Policy \n \n Features Snapshots \n Letters From \n P.S. \n Reading Lists \n Comments \n Essays \n Responses \n \n Discussions Interviews \n Roundtables \n Letters to the Editor \n News & Events \n \n Video \n Books & Reviews Review Essays \n Capsule Reviews \n FA Books \n \n Classroom \n About Us Submissions \n Staff \n Employment \n Advertising \n Sponsored Sections \n Contact Us \n History \n \n Subscribe \n \n\n \n \n \n\n \n\n \n \n \n Home \xe2\x80\xba Features \xe2\x80\xba Snapshots \n \n \n Viva el Populismo? \n The Tense Future of Latin American Politics \n \n \n \n By Cynthia J. Arnson and Carlos de la Torre \n \n CYNTHIA J. ARNSON is director of the Latin American Program at the Woodrow Wilson International Center for Scholars. CARLOS DE LA TORRE is director of international studies and professor of sociology at the University of Kentucky, Lexington. They are the editors of Latin American Populism in the Twenty-First Century (Woodrow Wilson Center Press and The Johns Hopkins University Press, 2013), upon which this essay draws. 
\n See more by Cynthia J. Arnson See more by Carlos de la Torre \n \n \n April 16, 2014 \n \n \n \n Venezuelan President Nicolas Maduro waves to supporters during a campaign rally on April 6, 2013 (Courtesy Reuters) \n \n\n \n \n \n \n \n \n\n \n \n \n'
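The cleaned string still carries long runs of newlines, tabs and spaces left over from the page layout. This does not affect tokenization, but it makes the text hard to inspect. A small sketch of one way to normalize it, using a hypothetical helper `collapse_whitespace` and a sample fragment typed by hand in the style of the output above:

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace with a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


# Hand-written fragment imitating the navigation residue in the output.
sample = " \n \n \n Skip to Navigation \n\n \n Foreign Affairs \n \t \n Home \n"
print(collapse_whitespace(sample))  # -> Skip to Navigation Foreign Affairs Home
```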
In [43]:
popTokens=nltk.word_tokenize(populismo)
In [45]:
len(popTokens)
Out[45]:
1274
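To give an idea of what tokenization produces, here is a deliberately simple regular-expression tokenizer. It is only an approximation: `nltk.word_tokenize` applies more elaborate rules (for contractions, abbreviations and so on), so token counts will generally differ from the 1274 reported above. The names `TOKEN_PATTERN` and `simple_tokenize` are made up for this sketch.

```python
import re

# A word (optionally with internal hyphens or apostrophes),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:[-'’]\w+)*|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens (rough approximation)."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("In 1998, Chávez made a successful bid for the presidency.")
print(tokens)
print(len(tokens))  # -> 12
```

Punctuation marks become tokens of their own, which is why commas and periods appear separated by spaces in the concordance lines.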
Through trial and error we can find where the document begins:
Building index...
Displaying 1 of 1 matches:
are the editors of Latin American Populism in the Twenty-First Century ( Wood
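The cell that builds `popTokenText` is not shown in this export. A plausible reading, and it is only an assumption, is that it wraps a slice of `popTokens` in an `nltk.Text` object. As an alternative to trial and error, the start of the article can be located by searching for a known sequence of tokens, such as the title. The helper `find_sequence` and the token list below are hypothetical.

```python
def find_sequence(tokens, marker):
    """Return the index where the token sequence `marker` first occurs, or -1."""
    n = len(marker)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == marker:
            return i
    return -1


# Hand-made token list: navigation residue followed by the article title.
tokens = ["Skip", "to", "Navigation", "Foreign", "Affairs", "Home",
          "Viva", "el", "Populismo", "?", "The", "Tense", "Future"]

start = find_sequence(tokens, ["Viva", "el", "Populismo"])
body = tokens[start:]  # drop everything before the title
print(start, body[:4])
```

With NLTK installed, passing a slice like `body` to `nltk.Text(...)` yields an object that supports `.concordance(...)`.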
In [72]:
popTokenText.concordance('venezuela')
Displaying 1 of 1 matches:
hich killed hundreds of civilians. Venezuela is still suffering the consequence
In [73]:
popTokenText.concordance('maduro')
Displaying 2 of 2 matches:
, 2014 Venezuelan President Nicolas Maduro waves to supporters during a campai
g Chávez’s death in March 2013 , Maduro won a special election by a mere 1.
In [74]:
popTokenText.concordance('chávez')
Displaying 4 of 4 matches:
launch the political career of Hugo Chávez , one of the officers involved. In
of the officers involved. In 1998 , Chávez made a successful bid for the presi
e vote. He remains president today. Chávez redistributed wealth and created ne
er rate has more than doubled since Chávez first took office in 1998. Today ,
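Note that the searches match regardless of case: 'venezuela' finds "Venezuela" and 'chávez' finds "Chávez". To make the idea of a concordance concrete, the sketch below shows each occurrence of a word with some context on either side. It is a simplified illustration, not NLTK's implementation; the function name `concordance`, the `width` parameter and the sample sentence are invented for the example.

```python
def concordance(tokens, word, width=40):
    """Return one context line per case-insensitive match of `word`."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            # Keep up to `width` characters of context on each side.
            left = " ".join(tokens[:i])[-width:]
            right = " ".join(tokens[i + 1:])[:width]
            lines.append(f"{left} {token} {right}".strip())
    return lines


# Invented sample text, already split into tokens.
tokens = ("In 1998 , Chávez made a successful bid . "
          "Today , Maduro governs after Chávez .").split()

matches = concordance(tokens, "chávez")
print(f"Displaying {len(matches)} of {len(matches)} matches:")
for line in matches:
    print(line)
```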