(Syntactic) Parsing in R

This post shows how to syntactically parse a corpus with R (here is the code with the paRsing function). Syntactic parsing is a form of text annotation in which part-of-speech (POS) tags are assigned to lexical items and the lexical items are then grouped into phrasal constituents. Syntactic parsing thus builds on POS tagging: a text has to be POS-tagged before its constituents can be identified. This post does not go into the theoretical background or the various approaches to syntactic parsing, which is quite complex in both theory and practical implementation. It simply shows how you can use R to parse some text with the Apache OpenNLP Maxent Parser.

In R, we can syntactically parse large amounts of text using the openNLP package, which requires the NLP package as well as the language models on which openNLP works (you can find more information on the openNLP package and how it works here). The openNLP package uses the Apache OpenNLP Maxent Parser, a trained parser that works in two steps. In the first step, the included POS tagger assigns each lexical item the POS tag with the highest probability of being correct. In the next step, the lexical items are grouped into phrasal and finally clausal constituents.
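Before turning to whole corpus files, the two steps can be tried on a single string. The following is only a minimal sketch: the helper name parse_text is made up for illustration, and it assumes that the packages NLP, openNLP and openNLPmodels.en as well as a working Java installation are available.

```r
# Minimal sketch (hypothetical helper, not part of openNLP):
# parse a single string and return one bracketed parse per sentence.
parse_text <- function(text) {
  # Assumption: these packages and a Java runtime are installed.
  stopifnot(requireNamespace("NLP", quietly = TRUE),
            requireNamespace("openNLP", quietly = TRUE),
            requireNamespace("openNLPmodels.en", quietly = TRUE))
  s <- NLP::as.String(text)
  # The parser needs sentence and word token annotations first.
  tokens <- NLP::annotate(s, list(openNLP::Maxent_Sent_Token_Annotator(),
                                  openNLP::Maxent_Word_Token_Annotator()))
  # The parser then assigns POS tags and groups the words into constituents.
  parsed <- openNLP::Parse_Annotator()(s, tokens)
  # Extract the bracketed parse strings from the annotation features.
  sapply(parsed$features, `[[`, "parse")
}
```

Calling parse_text("This is a short test sentence.") should then return a character vector with one bracketed parse per sentence, in the same format as the output shown further down.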

Unfortunately, there is a real issue when R interfaces with Java (which is what we do when we use the openNLP package): R may report the following error:
java.lang.OutOfMemoryError: Java heap space
This error indicates that the task needs more memory than the Java heap provides. There is a workaround, although it is not a very nice one: close R, open it again, then call the function and apply it to some text. The command gc() is also meant to keep memory use in check, but it does not seem to work properly here.
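If restarting R is not enough, a commonly used alternative is to give Java a larger heap. The sketch below assumes that openNLP starts Java via the rJava package, which reads the java.parameters option when the Java virtual machine is started. The option therefore has to be set in a fresh R session before openNLP is loaded. The value of 4 GB is only an illustration and should be adapted to your machine.

```r
# Run this first in a fresh R session, before any package that starts Java.
# "-Xmx4g" asks for a maximum Java heap of 4 GB (illustrative value).
options(java.parameters = "-Xmx4g")

# Only now load the package that starts Java.
if (requireNamespace("openNLP", quietly = TRUE)) {
  library(openNLP)
}
```

Setting the option after Java has already been started in the session has no effect, which is why a restart of R is still needed once the error has occurred.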

Below is an example of how you can implement syntactic parsing in R.

###############################################################
### --- a function which syntactically parses text in corpus files
###############################################################
paRsing <- function(path) {
  # load the required packages (library() stops if one is missing)
  library(NLP)
  library(openNLP)
  library(openNLPmodels.en)
  library(stringr)
  # list the corpus files below path (files only: scan() cannot read directories)
  corpus.files <- list.files(path = path, full.names = TRUE, recursive = TRUE,
    include.dirs = FALSE)
  # read each file and collapse it into a single UTF-8 string
  corpus.tmp <- vapply(corpus.files, function(x) {
    x <- scan(x, what = "char", sep = "\t", quiet = TRUE)
    enc2utf8(paste(x, collapse = " "))
  }, character(1), USE.NAMES = FALSE)
  # squeeze repeated spaces and trim leading and trailing whitespace
  corpus.tmp <- gsub(" {2,}", " ", corpus.tmp)
  corpus.tmp <- str_trim(corpus.tmp, side = "both")
  # set up the annotators
  sent_token_annotator <- Maxent_Sent_Token_Annotator()
  word_token_annotator <- Maxent_Word_Token_Annotator()
  parse_annotator <- Parse_Annotator()
  # parse each text; the result holds one list(parsedtexts, parsetrees) per file
  lapply(corpus.tmp, function(x) {
    x <- as.String(x)
    annotated <- annotate(x, list(sent_token_annotator, word_token_annotator))
    # compute the parse annotations only
    parsed <- parse_annotator(x, annotated)
    # extract the formatted parse trees
    parsedtexts <- sapply(parsed$features, "[[", "parse")
    # read into NLP Tree objects
    parsetrees <- lapply(parsedtexts, Tree_parse)
    gc()
    list(parsedtexts, parsetrees)
  })
}
 
##########################################################
##########################################################
##########################################################
# test the function
parsetest <- paRsing(path = "C:\\03-MyProjects\\PosTagging\\TestCorpus")
parsetest
 
##########################################################
##########################################################
##########################################################
### --- The END
##########################################################
##########################################################
##########################################################
 
#[[1]]
#[[1]][[1]]
#[1] "(TOP (S (NP (DT This)) (VP (VBZ is) (NP (NP (DT the) (JJ first) (NN sentence)) (PP (IN in) (NP (NP (DT the) (JJ first) (NN file)) (PP (IN of) (NP (DT the) (NN test) (NN corpus)))))))(. .)))"                                                                                                                                                                 
#[2] "(TOP (S (S (NP (DT This)) (VP (VBZ is) (NP (NP (DT a) (JJ second) (NN sentence)) (PP (IN in) (NP (DT the) (NN test) (NN corpus)))))) (CC but) (S (NP (PRP I)) (VP (VBP am) (ADJP (RB too) (JJ lazy) (S (VP (TO to) (VP (VB write) (ADVP (RB much) (RBR more)) (SBAR (IN so) (S (NP (DT this)) (VP (VBZ has) (S (VP (TO to) (VP (VB suffice)))))))))))))(. .)))"
#[3] "(TOP (S (ADVP (RB well))(, ,) (NP (CD one) (JJR more) (NN sentence)) (VP (MD should) (VP (VB do)))(. .)))"                                                                                                                                                                                                                                                     
#
#[[1]][[2]]
#[[1]][[2]][[1]]
#(TOP
#  (S
#    (NP (DT This))
#    (VP
#      (VBZ is)
#      (NP
#        (NP (DT the) (JJ first) (NN sentence))
#        (PP
#          (IN in)
#          (NP
#            (NP (DT the) (JJ first) (NN file))
#            (PP (IN of) (NP (DT the) (NN test) (NN corpus)))))))
#    (. .)))
#
#[[1]][[2]][[2]]
#(TOP
#  (S
#    (S
#      (NP (DT This))
#      (VP
#        (VBZ is)
#        (NP
#          (NP (DT a) (JJ second) (NN sentence))
#          (PP (IN in) (NP (DT the) (NN test) (NN corpus))))))
#    (CC but)
#    (S
#      (NP (PRP I))
#      (VP
#        (VBP am)
#        (ADJP
#          (RB too)
#          (JJ lazy)
#          (S
#            (VP
#              (TO to)
#              (VP
#                (VB write)
#                (ADVP (RB much) (RBR more))
#                (SBAR
#                  (IN so)
#                  (S
#                    (NP (DT this))
#                    (VP (VBZ has) (S (VP (TO to) (VP (VB suffice)))))))))))))
#    (. .)))
#
#[[1]][[2]][[3]]
#(TOP
#  (S
#    (ADVP (RB well))
#    (, ,)
#    (NP (CD one) (JJR more) (NN sentence))
#    (VP (MD should) (VP (VB do)))
#    (. .)))
#
#
#
#[[2]]
#[[2]][[1]]
#[1] "(TOP (S (NP (DT This)) (VP (VBZ is) (NP (NP (DT a) (JJ second) (NN file)) (PP (IN with) (NP (DT some) (NN sample) (NN content)))))(. .)))"                                                                                                                                                                                                                                                           
#[2] "(TOP (S (NP (PRP It)) (VP (MD will) (VP (VB be) (VP (VBN used) (S (VP (TO to) (VP (VB test) (NP (NP (DT a) (NN part-of-speech) (NN tagger)) (PP (IN in) (NP (NNP R.))) (SBAR (S (NP (PRP I)) (VP (VBP dont) (ADVP (RB really)) (VP (VB know) (SBAR (IN if) (S (S (NP (PRP it)) (VP (VBZ works))) (CC but) (S (NP (PRP I)) (ADVP (RB definitely)) (VP (VBP hope) (ADVP (RB so)))))))))))))))))(. .)))"
#
#[[2]][[2]]
#[[2]][[2]][[1]]
#(TOP
#  (S
#    (NP (DT This))
#    (VP
#      (VBZ is)
#      (NP
#        (NP (DT a) (JJ second) (NN file))
#        (PP (IN with) (NP (DT some) (NN sample) (NN content)))))
#    (. .)))
#
#[[2]][[2]][[2]]
#(TOP
#  (S
#    (NP (PRP It))
#    (VP
#      (MD will)
#      (VP
#        (VB be)
#        (VP
#          (VBN used)
#          (S
#            (VP
#              (TO to)
#              (VP
#                (VB test)
#                (NP
#                  (NP (DT a) (NN part-of-speech) (NN tagger))
#                  (PP (IN in) (NP (NNP R.)))
#                  (SBAR
#                    (S
#                      (NP (PRP I))
#                      (VP
#                        (VBP dont)
#                        (ADVP (RB really))
#                        (VP
#                          (VB know)
#                          (SBAR
#                            (IN if)
#                            (S
#                              (S (NP (PRP it)) (VP (VBZ works)))
#                              (CC but)
#                              (S
#                                (NP (PRP I))
#                                (ADVP (RB definitely))
#                                (VP (VBP hope) (ADVP (RB so)))))))))))))))))
#    (. .)))
#
#
#
#[[3]]
#[[3]][[1]]
#[1] "(TOP (S (S (ADVP (RB Finally))(, ,) (NP (DT this)) (VP (VBZ is) (NP (NP (DT the) (JJ last) (NN file)) (PP (IN of) (NP (DT the) (NN test) (NN corpus)))))) (CC and) (S (NP (PRP I)) (ADVP (RB really)) (VP (VBP dont) (VP (VB want) (S (VP (TO to) (VP (VB write) (NP (DT a) (NN lot) (RBR more))))))))(. .)))"
#[2] "(TOP (S (SBAR (IN Since) (S (NP (PRP I)) (VP (VBP am) (ADJP (RB quite) (JJ lazy)))))(, ,) (NP (DT this)) (VP (VBZ is)\n(NP (NP (DT the) (JJ last) (NN sentence)) (PP (IN in) (NP (PRP$ my) (JJ tiny) (NN test) (NN corpus)))))(. .)))"                                                                        
#
#[[3]][[2]]
#[[3]][[2]][[1]]
#(TOP
#  (S
#    (S
#      (ADVP (RB Finally))
#      (, ,)
#      (NP (DT this))
#      (VP
#        (VBZ is)
#        (NP
#          (NP (DT the) (JJ last) (NN file))
#          (PP (IN of) (NP (DT the) (NN test) (NN corpus))))))
#    (CC and)
#    (S
#      (NP (PRP I))
#      (ADVP (RB really))
#      (VP
#        (VBP dont)
#        (VP
#          (VB want)
#          (S
#            (VP
#              (TO to)
#              (VP (VB write) (NP (DT a) (NN lot) (RBR more))))))))
#    (. .)))
#
#[[3]][[2]][[2]]
#(TOP
#  (S
#    (SBAR
#      (IN Since)
#      (S (NP (PRP I)) (VP (VBP am) (ADJP (RB quite) (JJ lazy)))))
#    (, ,)
#    (NP (DT this))
#    (VP
#      (VBZ is)
#      (NP
#        (NP (DT the) (JJ last) (NN sentence))
#        (PP (IN in) (NP (PRP$ my) (JJ tiny) (NN test) (NN corpus)))))
#    (. .)))
#
# inspect the first sentence
parsetest[[1]][[1]][[1]]
 
#[1] "(TOP (S (NP (DT This)) (VP (VBZ is) (NP (NP (DT the) (JJ first) (NN sentence)) (PP (IN in) (NP (NP (DT the) (JJ first) (NN file)) (PP (IN of) (NP (DT the) (NN test) (NN corpus)))))))(. .)))"
 
 
 
##########################################################

You can use the output to find, e.g., all noun phrases in a text or to create fancy tree diagrams like the one here: [image: sample tree]

I hope this helps. I will also be posting some updates that show what parsing can be used for.
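As a first taste of what the parser output can be used for, the noun phrases can be pulled out of a parse string. The following is only a sketch: the helper functions tree_words and find_phrases are made up for illustration, and the code assumes that Tree_parse() from the NLP package returns nested objects with a value (the label) and children, where the words themselves are plain character strings. It works on the first parsed sentence from the output above and does not need Java.

```r
library(NLP)

# Hypothetical helper: collect the words (leaves) below a node.
tree_words <- function(node) {
  if (is.character(node)) return(node)
  unlist(lapply(node$children, tree_words))
}

# Hypothetical helper: collect the text of every constituent with a given label.
find_phrases <- function(node, label = "NP") {
  if (is.character(node)) return(character(0))
  here <- if (identical(node$value, label)) {
    paste(tree_words(node), collapse = " ")
  } else {
    character(0)
  }
  # Search the children as well, so that embedded phrases are found too.
  c(here, unlist(lapply(node$children, find_phrases, label = label)))
}

# First parsed sentence of the test corpus (copied from the output above).
parse_string <- "(TOP (S (NP (DT This)) (VP (VBZ is) (NP (NP (DT the) (JJ first) (NN sentence)) (PP (IN in) (NP (NP (DT the) (JJ first) (NN file)) (PP (IN of) (NP (DT the) (NN test) (NN corpus)))))))(. .)))"

tree <- Tree_parse(parse_string)
noun_phrases <- find_phrases(tree, "NP")
noun_phrases
```

Because the search is recursive, embedded noun phrases such as "the test corpus" are returned in addition to the larger phrases that contain them. Passing a different label, e.g. "PP" or "VP", collects other constituent types in the same way.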

One thought on "(Syntactic) Parsing in R"

  1. Bryan Murphy

    I am also trying to use the OpenNLP package in R, and had the same problem when trying to use the Maxent_POS_Tag Function in that the heap space error came up. I tried closing R and reopening it and then rerunning my code, but that didn’t solve it. Could you elaborate on what you did/do you know of any other way to make it work? Thanks.
