Creating word clouds with R

In this post I want to show how to create word clouds in R.

Word clouds visualize the word frequencies of a single corpus, or they contrast the vocabulary of several corpora. Although word clouds are not really used in academic linguistics, they are a neat way to display the themes – which may be thought of as the semantic content – of corpora.
To exemplify how to use word clouds, we are going to have a look at the election programs (Wahlkampfprogramme) of the German political parties for the 2013 Bundestag elections.
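
Before we walk through the full script, here is a minimal sketch of the basic wordcloud() call, just to show the shape of the function (this assumes the wordcloud package is installed, which also provides brewer.pal() via RColorBrewer; the words and frequencies are made up purely for illustration):

# minimal example: a word cloud from a hand-made frequency table
library(wordcloud)
words <- c("freiheit", "arbeit", "zukunft", "bildung")  # illustrative words
freqs <- c(12, 30, 21, 17)                              # made-up frequencies
wordcloud(words, freqs, min.freq = 1, random.order = FALSE, colors = brewer.pal(6, "Dark2"))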

###############################################################
### WORD CLOUD
###############################################################
### --- Prepare data
# Clear all objects from the current workspace
rm(list=ls(all=T))
 
# Install packages we need or which may be useful
# (to activate just delete the #)
#install.packages("tm")
#install.packages("wordcloud")
#install.packages("Rstem")
#install.packages("stringr")
#install.packages("SnowballC")
 
# Load the packages
library(tm)
library(wordcloud)
library(Rstem)
library(stringr)
library(SnowballC)
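# Note: Rstem is an older stemming package that is no longer on CRAN;
# stemming is switched off below anyway and tm's stemDocument() uses
# SnowballC, so the Rstem lines can be skipped if the package won't install.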
 
# Read in data
corp  <- Corpus(DirSource("C:\\Corpora\\original versions\\Wahlkampfprogramme Bundestagswahl 2013\\corpus"), readerControl = list(language = "german")) # point tm to the exact folder that holds the text file(s) for analysis
###############################################################
 
corp <- Corpus(VectorSource(corp)) # Create a corpus from the vectors
#corp <- tm_map(corp, stemDocument, language = "german") # stem words (inactive because I want intact words)
corp <- tm_map(corp, removePunctuation) # remove punctuation
corp <- tm_map(corp, content_transformer(tolower)) # convert all words to lower case (content_transformer() is required for plain string functions in tm >= 0.6)
corp <- tm_map(corp, removeNumbers) # remove all numerals
corp <- tm_map(corp, removeWords, stopwords("german")) # remove grammatical words such as "ein", "ist", "war", etc.
 
# clean corpus content: strip quote debris and mis-encoded characters,
# fix OCR errors, and collapse frequent endings so variants count as one word
corp <- sapply(corp, function(x) {
  x <- gsub("„", "", x)        # leftover German opening quotes
  x <- gsub("ãÿ", "", x)       # mis-encoded character debris (likely a mangled "ß")
  x <- gsub("fiir", "für", x)  # frequent OCR error
  x <- gsub("ens", "en", x)    # collapse inflectional endings
  x <- gsub("ungen", "ung", x)
  x <- gsub(" ver ", " ", x)   # drop stray word fragments (replace with a space so neighboring words don't merge)
  x <- gsub(" wei ", " ", x)
  x                            # return the cleaned text
   } )
 
# restore umlauts from transliterated forms and undo the over-corrections
# this creates (e.g. "auen" first becomes "aün" and is then changed back)
corp <- sapply(corp, function(x) {
  x <- gsub("ue", "ü", x)
  x <- gsub("aün", "auen", x)
  x <- gsub("eün", "euen", x)
  x <- gsub("eü", "eue", x)
  x <- gsub("oe", "ö", x)
  x <- gsub("ae", "ä", x)
  x                            # return the cleaned text
  } )
 
corp <- Corpus(VectorSource(corp))  # convert vectors back into a corpus
 
# Create a term document matrix
term.matrix <- TermDocumentMatrix(corp)  # create a term document matrix
term.matrix <- removeSparseTerms(term.matrix, 0.5) # drop terms that are missing from more than half of the documents
term.matrix <- as.matrix(term.matrix)
colnames(term.matrix) <- c("CDU/CSU", "FDP", "Grüne", "Die Linke", "SPD") # add column labels to the tdm (the documents are read in alphabetical file order, so the labels must match that order)
 
# normalize absolute frequencies: convert absolute frequencies
# to relative frequencies (per 1,000 words)
#colSums(term.matrix)
term.matrix <- round(sweep(term.matrix, 2, colSums(term.matrix), "/") * 1000, 0) # divide each column by its total, then scale
#colSums(term.matrix)
 
# Create word clouds
#wordcloud(corp, max.words = 100, colors = brewer.pal(6, "Dark2"), random.order = FALSE)
comparison.cloud(term.matrix, max.words = 100, random.order = FALSE, colors = brewer.pal(8, "Dark2"))
#commonality.cloud(term.matrix, max.words = 100, random.order = FALSE)
Figure: Comparative word cloud showing distinctive words in the election programs of German political parties for the 2013 Bundestag election.
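
If you want to double-check what the comparison cloud highlights, you can inspect the numbers directly: comparison.cloud() scales each word by how far its rate in one document deviates from the average rate across all documents. A small helper along those lines (my own sketch, not part of the original script; the name top.distinctive is made up) lists the most over-represented terms per party:

# list the n terms whose per-mille rate most exceeds the cross-party average
top.distinctive <- function(tm, n = 10) {
  dev <- tm - rowMeans(tm)  # deviation of each term's rate from its mean rate
  apply(dev, 2, function(col) names(sort(col, decreasing = TRUE))[1:n])
}
top.distinctive(term.matrix)  # one column of top terms per party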

At first I thought that word clouds were simply a fancy but not very helpful way to inspect language data, but I have to admit that they really surprised me: they do appear to possess the potential to offer an idea of what groups of people are talking about.

The comparative word cloud shows that the FDP stresses concepts like „wettbewerb“, „freiheit“, „chancen“, and „liberal“, thereby underscoring their liberal outlook (they didn’t make it into parliament and didn’t deserve to, by the way – just my opinion).
Die Grünen support every nonsense, as they are „für“ everything, and relied more on emphasizing „frauen“, „zukunft“, and „teilhabe“, which is in line with their feel-good philosophy.
Die Linke went on about what has to be done („müssen“) and used words like „sozial“, „beschäftigten“, and „öffentlich“ a lot, showing their emphasis on economic issues.
The Social Democrats (SPD) addressed topics like „kommunen“, „arbeit“, „gesellschaft“, „bildung“, and „gerechtigkeit“ – so they essentially used their typical buzzwords (just sayin’).
Finally, the CDU/CSU mentioned „ländlich“, „wohlstand“, „unser“, and „weiterhin“ to suggest that they will just continue with whatever it is they have been doing over the past years.

In conclusion, I honestly didn’t think that I would get meaningful results, but the comparative word cloud does a rather good job at it. So that was it on word clouds in R.

References
http://cran.r-project.org/web/packages/wordcloud/wordcloud.pdf
