Classification-based chunking involves classifying text as groups of words rather than as individual words. A simple scenario is tagging the text in sentences. We will use a corpus to demonstrate the classification: conll2000, which contains data from the Wall Street Journal (WSJ) corpus and is used for noun phrase-based chunking.
First, we add the corpus to our environment using the following commands.
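To make "a group of words" concrete, here is a minimal sketch that collects tokens into chunks from IOB labels (B-NP begins a noun phrase, I-NP continues it, O is outside any chunk), which is the labelling scheme the CoNLL-2000 data uses. The helper name group_chunks is hypothetical, and the sample labels are written by hand for illustration rather than taken from the corpus.

```python
def group_chunks(labelled_tokens):
    """Group (word, iob_label) pairs into (chunk_type, words) chunks.

    A chunk starts at a B-<type> label and continues over the following
    I-<type> labels of the same type; tokens labelled O belong to no chunk.
    """
    chunks = []
    current_type, current_words = None, []
    for word, label in labelled_tokens:
        prefix, _, chunk_type = label.partition("-")
        if prefix == "I" and chunk_type == current_type:
            # Continuation of the chunk that is currently open.
            current_words.append(word)
            continue
        # Any other label closes the chunk in progress, if there is one.
        if current_words:
            chunks.append((current_type, current_words))
        if prefix in ("B", "I"):
            # B opens a new chunk; a stray I is treated as an opening too.
            current_type, current_words = chunk_type, [word]
        else:
            current_type, current_words = None, []
    if current_words:
        chunks.append((current_type, current_words))
    return chunks


# Hand-labelled illustration (noun phrases only), not corpus output.
sample = [
    ("the", "B-NP"), ("pound", "I-NP"),
    ("is", "O"), ("widely", "O"), ("expected", "O"),
    ("another", "B-NP"), ("sharp", "I-NP"), ("dive", "I-NP"),
]
chunks = group_chunks(sample)
print(chunks)
```

Each resulting chunk is a labelled group of adjacent words, which is the unit a chunker predicts instead of a tag per isolated word.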
import nltk
nltk.download('conll2000')
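Let's have a look at the first few sentences in this corpus.
from nltk.corpus import conll2000

# Load the corpus as a sequence of tokenized sentences
x = conll2000.sents()

# Print the first three sentences, each followed by blank lines
for i in range(3):
    print(x[i])
    print('\n')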
When we run the above program, we get the following output −
['Confidence', 'in', 'the', 'pound', 'is', 'widely', 'expected', 'to', 'take', 'another', 'sharp', 'dive', 'if', 'trade', 'figures', 'for', 'September', ',', 'due', 'for', 'release', 'tomorrow', ',', 'fail', 'to', 'show', 'a', 'substantial', 'improvement', 'from', 'July', 'and', 'August', "'s", 'near-record', 'deficits', '.']
['Chancellor', 'of', 'the', 'Exchequer', 'Nigel', 'Lawson', "'s", 'restated', 'commitment', 'to', 'a', 'firm', 'monetary', 'policy', 'has', 'helped', 'to', 'prevent', 'a', 'freefall', 'in', 'sterling', 'over', 'the', 'past', 'week', '.']
['But', 'analysts', 'reckon', 'underlying', 'support', 'for', 'sterling', 'has', 'been', 'eroded', 'by', 'the', 'chancellor', "'s", 'failure', 'to', 'announce', 'any', 'new', 'policy', 'measures', 'in', 'his', 'Mansion', 'House', 'speech', 'last', 'Thursday', '.']
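Next, we use the function tagged_sents() to get the same sentences with every word paired with its part-of-speech tag.
from nltk.corpus import conll2000

# Load the corpus as sentences of (word, part-of-speech tag) pairs
x = conll2000.tagged_sents()

# Print the first three tagged sentences, each followed by blank lines
for i in range(3):
    print(x[i])
    print('\n')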
When we run the above program, we get the following output −
[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ('pound', 'NN'), ('is', 'VBZ'), ('widely', 'RB'), ('expected', 'VBN'), ('to', 'TO'), ('take', 'VB'), ('another', 'DT'), ('sharp', 'JJ'), ('dive', 'NN'), ('if', 'IN'), ('trade', 'NN'), ('figures', 'NNS'), ('for', 'IN'), ('September', 'NNP'), (',', ','), ('due', 'JJ'), ('for', 'IN'), ('release', 'NN'), ('tomorrow', 'NN'), (',', ','), ('fail', 'VB'), ('to', 'TO'), ('show', 'VB'), ('a', 'DT'), ('substantial', 'JJ'), ('improvement', 'NN'), ('from', 'IN'), ('July', 'NNP'), ('and', 'CC'), ('August', 'NNP'), ("'s", 'POS'), ('near-record', 'JJ'), ('deficits', 'NNS'), ('.', '.')]
[('Chancellor', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('Exchequer', 'NNP'), ('Nigel', 'NNP'), ('Lawson', 'NNP'), ("'s", 'POS'), ('restated', 'VBN'), ('commitment', 'NN'), ('to', 'TO'), ('a', 'DT'), ('firm', 'NN'), ('monetary', 'JJ'), ('policy', 'NN'), ('has', 'VBZ'), ('helped', 'VBN'), ('to', 'TO'), ('prevent', 'VB'), ('a', 'DT'), ('freefall', 'NN'), ('in', 'IN'), ('sterling', 'NN'), ('over', 'IN'), ('the', 'DT'), ('past', 'JJ'), ('week', 'NN'), ('.', '.')]
[('But', 'CC'), ('analysts', 'NNS'), ('reckon', 'VBP'), ('underlying', 'VBG'), ('support', 'NN'), ('for', 'IN'), ('sterling', 'NN'), ('has', 'VBZ'), ('been', 'VBN'), ('eroded', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('chancellor', 'NN'), ("'s", 'POS'), ('failure', 'NN'), ('to', 'TO'), ('announce', 'VB'), ('any', 'DT'), ('new', 'JJ'), ('policy', 'NN'), ('measures', 'NNS'), ('in', 'IN'), ('his', 'PRP$'), ('Mansion', 'NNP'), ('House', 'NNP'), ('speech', 'NN'), ('last', 'JJ'), ('Thursday', 'NNP'), ('.', '.')]
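A classifier-based chunker typically decides each word's chunk label from features of that word and its neighbours. The following is a minimal sketch of such a feature extractor working on (word, tag) pairs like the ones printed above. The function name chunk_features, the chosen features, and the "<START>" and "<END>" placeholders are illustrative assumptions, not part of NLTK or the corpus; the sample tokens are modeled on the opening of the first sentence, with the tags written in uppercase.

```python
def chunk_features(tagged_sent, i):
    """Build a feature dict for the token at position i of a tagged sentence.

    tagged_sent is a list of (word, pos_tag) pairs. The features are the
    lowercased word, its own tag, and the tags of its two neighbours.
    """
    word, pos = tagged_sent[i]
    # Placeholders mark positions that fall outside the sentence.
    prev_pos = tagged_sent[i - 1][1] if i > 0 else "<START>"
    next_pos = tagged_sent[i + 1][1] if i < len(tagged_sent) - 1 else "<END>"
    return {
        "word": word.lower(),
        "pos": pos,
        "prev_pos": prev_pos,
        "next_pos": next_pos,
    }


# Hypothetical sample: a few (word, tag) pairs typed in by hand.
tagged_sent = [
    ("Confidence", "NN"), ("in", "IN"), ("the", "DT"),
    ("pound", "NN"), ("is", "VBZ"),
]
features = [chunk_features(tagged_sent, i) for i in range(len(tagged_sent))]
for feature_set in features:
    print(feature_set)
```

Feature dicts of this shape, paired with a known chunk label for each token, are the kind of training input a classifier could learn from; which features work best is an empirical question and the set above is only a starting point.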