Python Data Science Tutorial on Python Word Tokenization

Word tokenization is the process of splitting a large sample of text into words. It is a requirement in natural language processing tasks where each word needs to be captured and analyzed further, for example by classifying words or counting them for a particular sentiment. The Natural Language Toolkit (NLTK) is a library used to achieve this. Install NLTK before proceeding with the Python program for word tokenization.

conda install -c anaconda nltk

Next, we use the word_tokenize function to split the paragraph into individual words.

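import nltk

# word_tokenize relies on NLTK's Punkt tokenizer data. If it is missing, run
# nltk.download('punkt') once (newer NLTK releases may ask for 'punkt_tab').
word_data = "it originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
# Split the text into a list of word tokens
nltk_tokens = nltk.word_tokenize(word_data)
print(nltk_tokens)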

When we execute the above code, it produces the following result.

['it', 'originated', 'from', 'the', 'idea', 'that', 'there', 'are', 'readers', 
'who', 'prefer', 'learning', 'new', 'skills', 'from', 'the',
'comforts', 'of', 'their', 'drawing', 'rooms']
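Once the text is a list of tokens, the counting mentioned earlier becomes straightforward. The following is a minimal sketch that uses only the standard library: the token list is copied by hand from the result above so the snippet runs without NLTK, and the name token_counts is an illustrative choice, not part of NLTK.

```python
from collections import Counter

# Token list copied from the word_tokenize result shown above.
nltk_tokens = ['it', 'originated', 'from', 'the', 'idea', 'that', 'there',
               'are', 'readers', 'who', 'prefer', 'learning', 'new', 'skills',
               'from', 'the', 'comforts', 'of', 'their', 'drawing', 'rooms']

# Count how often each token occurs (token_counts is a hypothetical name).
token_counts = Counter(nltk_tokens)

print(len(nltk_tokens))              # total number of tokens
print(len(token_counts))             # number of distinct tokens
print(token_counts.most_common(2))   # the two most frequent tokens
```

In this sample, only 'from' and 'the' occur more than once, so they are the tokens reported by most_common(2).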

Tokenizing Sentences

We can also split a paragraph into sentences, just as we split it into words. We use the sent_tokenize function to achieve this. Below is an example.

import nltk

# Split the paragraph into a list of sentences
sentence_data = "sun rises in the east. sun sets in the west."
nltk_tokens = nltk.sent_tokenize(sentence_data)
print(nltk_tokens)

When we execute the above code, it produces the following result.

['sun rises in the east.', 'sun sets in the west.']
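To give a rough feel for what sentence tokenization involves, here is a simplified sketch that splits on sentence-ending punctuation with a regular expression. It is not the algorithm NLTK uses; naive_sent_split is a hypothetical helper written only for illustration, and it assumes every '.', '!' or '?' followed by whitespace ends a sentence.

```python
import re

def naive_sent_split(text):
    """Split text after '.', '!' or '?' when whitespace follows.

    A simplified stand-in for illustration, not NLTK's tokenizer.
    """
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    # Drop any empty strings left over from the split.
    return [part for part in parts if part]

sentence_data = "sun rises in the east. sun sets in the west."
print(naive_sent_split(sentence_data))
```

For this input, the sketch yields the same two sentences as sent_tokenize. It would, however, also split after an abbreviation such as "Dr.", which is the kind of case a trained sentence tokenizer is designed to handle better.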
