Question

3. Explain the following terms in detail :

i) Tokens

ii) Pattern

iii) Lexemes

iv) Sentinels

v) Sentential form

Answer #1

i)
In Natural Language Processing (NLP), text must be broken down into its smallest meaningful units before it can be processed. These units, called tokens, can be words, numbers, punctuation marks, and so on. Dividing text into tokens is called tokenization, and the resulting tokens are the input to later stages of text processing. Tokens are identified by finding word boundaries, that is, where a meaningful unit starts and ends. Depending on the document, this can be done by splitting on spaces or on other delimiters.
For example, take the sentence This is life. Dividing it into tokens gives 3 meaningful tokens: This | is | life. (A tokenizer that keeps punctuation would also emit the final period as a token.)
As another example involving digits, take There are 7 sisters. The tokens are There | are | 7 | sisters. Here the digit 7 is also a token.

In Python, tokenization is commonly done with libraries such as NLTK and spaCy.
In a compiler, tokens are categories such as identifiers, keywords, constants, operators, and punctuation symbols.
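The word-boundary idea above can be sketched with Python's standard re module. This is only a minimal illustration, not how NLTK or spaCy tokenize; the function name tokenize and the rule "a run of word characters, or one punctuation character" are assumptions made for this example. Note that in this sketch the final period comes out as its own token.

```python
import re

def tokenize(text):
    """Split text into word/number tokens and single punctuation tokens.

    Assumed rule for this sketch only:
      \\w+     a run of letters, digits, or underscores
      [^\\w\\s]  any single character that is neither a word character nor whitespace
    """
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("This is life"))
print(tokenize("There are 7 sisters."))
```

Whitespace is never returned because neither alternative in the pattern can match it, so spaces act purely as delimiters.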

ii)
A pattern is a rule that describes the set of strings we want to find in a text document, and it is usually written as a regular expression. In a compiler, the pattern is the rule that describes which lexemes belong to a given token. Regular expressions, also known as regex, are used for a variety of purposes such as string replacement, string search, and string extraction. A text document is not organized like a database, where useful values can be extracted from a column by applying the right filters. In text documents, finding a particular pattern by hand is tedious, and it becomes impractical when the text is large. In these cases, we use string matching operations based on pattern matching.
To do this, we first need to identify the appropriate pattern. Suppose we need to find money values in a document. We first need to decide which currency we are searching for. If it is a dollar value, the pattern starts with a $ sign. Next, we have to specify what follows the $ sign: is the amount written in letters or in numbers? If it is written in numbers, we use the expression [0-9] to match a digit. Because $ has a special meaning in regular expressions (it anchors the end of the input), it must be escaped as \$, and a + is needed to match more than one digit. So the pattern for dollar values in the text document is \$[0-9]+.
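The dollar-value example can be tried out with Python's re module. This is a small sketch under a few assumptions: the names DOLLAR_PATTERN and find_dollar_values and the sample sentence are made up for illustration. The $ is escaped as \$ because an unescaped $ is an end-of-input anchor in regular expressions, and + is added so that amounts with more than one digit match in full.

```python
import re

# A literal dollar sign followed by one or more digits.
# Decimal amounts such as $4.50 are deliberately not handled in this sketch.
DOLLAR_PATTERN = re.compile(r"\$[0-9]+")

def find_dollar_values(text):
    """Return every substring of text that matches the dollar pattern."""
    return DOLLAR_PATTERN.findall(text)

sample = "The book costs $25 and the pen costs $3."
print(find_dollar_values(sample))
```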

iii)

A lexeme is the most basic unit in language processing. It is a sequence of characters in the source text that matches the pattern for a token, and so is identified as an instance of that token.

Lexemes ----> Pattern Matching -----> Tokens

Lexemes can contain letters, digits, and symbols. Once the character sequences are identified, pattern matching assigns each one to a token type.

For example, take the line int num1=5;

First, we identify the lexemes in this line. Splitting on spaces alone is not enough, because num1=5; contains no spaces; the operator and punctuation characters also act as boundaries. The lexemes are int | num1 | = | 5 | ;

Here int matches the pattern for keywords and is classified as a keyword.
num1 does not match any keyword pattern but does match the pattern for identifiers, so it is classified as an identifier (a variable name).
= is classified as an operator, ; as a punctuation symbol (statement terminator), and 5 as a constant.
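The lexeme-to-token flow for int num1=5; can be sketched as a tiny lexer in Python. Everything here is an assumption for illustration: the keyword set, the token names, and the function lex are hypothetical and cover only what this one line needs. This sketch treats ; as a punctuator (statement terminator) rather than an operator, which is the more common convention.

```python
import re

KEYWORDS = {"int", "float", "char", "return"}  # hypothetical, deliberately small

# (token name, pattern) pairs; earlier entries win when several could match.
TOKEN_SPEC = [
    ("CONSTANT",   r"[0-9]+"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OPERATOR",   r"="),
    ("PUNCTUATOR", r";"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex(source):
    """Return (token_type, lexeme) pairs for the given source line.

    Characters that match no pattern are silently skipped by finditer;
    a real lexer would report them as errors.
    """
    tokens = []
    for match in MASTER.finditer(source):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "SKIP":
            continue  # whitespace separates lexemes but is not a token
        if kind == "IDENTIFIER" and lexeme in KEYWORDS:
            kind = "KEYWORD"  # keywords look like identifiers, so check the set
        tokens.append((kind, lexeme))
    return tokens

for kind, lexeme in lex("int num1=5;"):
    print(lexeme, "->", kind)
```

Matching keywords as identifiers first and then checking a keyword set is a common design choice, since it avoids writing one pattern per keyword.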

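iv) A sentinel is a special character that cannot appear in the source program (typically eof) and is placed at the end of each input buffer. While scanning, the lexical analyzer would otherwise make two tests for every character, one for the end of the buffer and one to decide what the character is. With a sentinel at the end of the buffer, a single comparison per character is enough, and the end-of-buffer check happens only when the sentinel is read.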

v)  A sentential form is any string that can be derived from the start symbol of a grammar. It can contain both terminals and non-terminals. A sentence is a sentential form that contains no non-terminals.
In other words, a sentence consists of only terminal symbols. A sentential form is called a left or right sentential form depending on whether it is produced by a leftmost or a rightmost derivation.

For example, consider the grammar

S→aSa ∣ bSb ∣ ϵ

Here sentential forms are obtained by the derivation process: S ⇒ aSa ⇒ abSba ⇒ abbSbba ⇒ abbbba. The string abbSbba is a sentential form but not a sentence, because it still contains the non-terminal S. The string abbbba is a sentence, because it doesn't contain any non-terminals.
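The derivation for this grammar can be replayed step by step in Python. This is a minimal sketch; the names PRODUCTIONS, derive, and is_sentence are made up for illustration, and the empty string "" stands in for ϵ. Every string produced along the way is a sentential form, and only a string with no non-terminal left is a sentence.

```python
# Grammar S -> aSa | bSb | epsilon, with "" standing for epsilon.
PRODUCTIONS = {"S": ["aSa", "bSb", ""]}

def derive(start, choices):
    """Apply the chosen production to the leftmost non-terminal at each step.

    choices is a list of indexes into that non-terminal's production list.
    Returns every sentential form produced, including the start symbol.
    """
    form = start
    forms = [form]
    for choice in choices:
        # Position of the leftmost non-terminal in the current form.
        index = next(i for i, symbol in enumerate(form) if symbol in PRODUCTIONS)
        replacement = PRODUCTIONS[form[index]][choice]
        form = form[:index] + replacement + form[index + 1:]
        forms.append(form)
    return forms

def is_sentence(form):
    """A sentence is a sentential form with no non-terminals left."""
    return not any(symbol in PRODUCTIONS for symbol in form)

# S => aSa => abSba => abbSbba => abbbba
for form in derive("S", [0, 1, 1, 2]):
    print(form, "(sentence)" if is_sentence(form) else "(sentential form only)")
```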
