
    What is Text Analysis?

    Text analysis, also known as text analytics, refers to the representation, processing, and modeling of textual data to derive useful insights. An important component of text analysis is text mining, the process of discovering relationships and interesting patterns in large text collections.

    Steps of Text Analysis

    A text analysis problem usually includes three important steps: parsing, search and retrieval, and text mining.

    Parsing:

    Parsing is the process that takes the unstructured text and imposes a structure for further analysis. The unstructured text can be a plain text file, a weblog, an Extensible Markup Language (XML) file, a HyperText Markup Language (HTML) file, or a Word document. Parsing deconstructs the provided text and renders it in a more structured way for the subsequent steps.
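
    As a minimal sketch of this step (the XML snippet and its field names are invented for illustration), Python's built-in xml.etree.ElementTree can impose a tree structure on a small XML fragment:

    import xml.etree.ElementTree as ET

    # A hypothetical XML fragment representing one blog post.
    raw = """
    <post id="42">
        <author>alice</author>
        <body>The bPhone came out today.</body>
    </post>
    """

    # Parsing turns the raw string into a structured tree of elements.
    root = ET.fromstring(raw)
    print(root.get('id'))             # -> 42
    print(root.find('author').text)   # -> alice
    print(root.find('body').text)     # -> The bPhone came out today.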

    Search and retrieval:

    Search and retrieval is the identification of the documents in a corpus that contain search items such as specific words, phrases, topics, or entities like people or organizations. These search items are generally known as key terms. Search and retrieval originated in the field of library science and is now used extensively by web search engines.
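
    As a rough sketch of search and retrieval (the corpus and the key term are invented for illustration), a simple inverted index maps each term to the documents that contain it:

    from collections import defaultdict

    # A tiny invented corpus: document id -> text.
    corpus = {
        1: "the bPhone came out today",
        2: "library science and web search engines",
        3: "the bPhone battery lasts all day",
    }

    # Build an inverted index: term -> set of ids of documents containing it.
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in text.lower().split():
            index[term].add(doc_id)

    # Retrieval: which documents contain the key term "bphone"?
    print(sorted(index['bphone']))   # -> [1, 3]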

    Text mining:

    Text mining uses the terms and indexes produced by the previous two steps to discover meaningful insights pertaining to domains or problems of interest.

    Representing Text

    Tokenization is the task of separating (tokenizing) words from the body of the text. After tokenization, raw text becomes a list of tokens, where each token is generally a word. A common approach is tokenizing on spaces. For example, consider the tweet shown previously:

    I once had a gf back in the day. Then the bPhone came out lol

    Tokenization based on spaces would output the following list of tokens:

    {I, once, had, a, gf, back, in, the, day., Then, the, bPhone, came, out, lol}
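
    A minimal sketch of space-based tokenization, using Python's built-in str.split on the tweet above:

    tweet = "I once had a gf back in the day. Then the bPhone came out lol"

    # Tokenize on whitespace; note that punctuation stays attached ("day.").
    tokens = tweet.split()
    print(tokens)
    # -> ['I', 'once', 'had', 'a', 'gf', 'back', 'in', 'the', 'day.',
    #     'Then', 'the', 'bPhone', 'came', 'out', 'lol']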

    Tokenization is a much more difficult task than one might expect. For example, should words like state-of-the-art, Wi-Fi, and San Francisco be considered one token or more? Should words like Résumé, résumé, and resume all map to the same token? Tokenization is even more difficult beyond English. In German, for example, there are many unsegmented compound nouns. In Chinese, there are no spaces between words. Japanese intermingles several writing systems. The list goes on.

    Another text normalization technique is called case folding, which reduces all letters to lowercase (or the opposite if applicable). For the previous tweet, after case folding the text would become this:

    i once had a gf back in the day. then the bphone came out lol
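
    Case folding itself is a one-line operation; a sketch using Python's str.lower on the same tweet:

    tweet = "I once had a gf back in the day. Then the bPhone came out lol"

    # Case folding: reduce every letter to lowercase.
    print(tweet.lower())
    # -> i once had a gf back in the day. then the bphone came out lol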

    After normalizing the text by tokenization and case folding, it needs to be represented in a more structured way. A simple yet widely used approach to represent text is called bag-of-words.

    Given a document, bag-of-words represents the document as a set of terms, ignoring information such as order, context, inferences, and discourse. Each word is considered a term or token (which is often the smallest unit for the analysis). In many cases, bag-of-words additionally assumes every term in the document is independent. The document then becomes a vector with one dimension for every distinct term in the space, and the terms are unordered.

    The permutation D* of a document D contains the same words exactly the same number of times but in a different order. Therefore, using the bag-of-words representation, document D and its permutation D* would share the same representation.
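
    The sketch below (with two invented one-line documents) builds bag-of-words representations with Python's collections.Counter and shows that a document and a permutation of it are indistinguishable:

    from collections import Counter

    D      = "the bPhone beats the aPhone"
    D_star = "the aPhone beats the bPhone"   # same words, different order

    def bag_of_words(text):
        # Lowercase, tokenize on spaces, and count each distinct term.
        return Counter(text.lower().split())

    print(bag_of_words(D))
    # -> Counter({'the': 2, 'bphone': 1, 'beats': 1, 'aphone': 1})
    print(bag_of_words(D) == bag_of_words(D_star))
    # -> True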

    Determining Sentiments

    Sentiment analysis refers to a group of tasks that use statistics and natural language processing to mine opinions and to identify and extract subjective information from text.

    Example

    # Sentiment analysis example: train a Naive Bayes classifier on the NLTK
    # movie_reviews corpus and report its accuracy and a confusion matrix.
    import nltk.classify.util
    from nltk.classify import NaiveBayesClassifier
    from nltk.corpus import movie_reviews
    from collections import defaultdict
    import numpy as np

    # Define an 80/20 split for train/test.
    SPLIT = 0.8

    def word_feats(words):
        # Bag-of-words features: every word that appears in the review maps to True.
        feats = defaultdict(lambda: False)
        for word in words:
            feats[word] = True
        return feats

    # File ids of the positive and negative reviews in the corpus.
    posids = movie_reviews.fileids('pos')
    negids = movie_reviews.fileids('neg')

    # Build (feature dict, label) pairs for each review.
    posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos')
                for f in posids]
    negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg')
                for f in negids]

    cutoff = int(len(posfeats) * SPLIT)
    trainfeats = negfeats[:cutoff] + posfeats[:cutoff]
    testfeats = negfeats[cutoff:] + posfeats[cutoff:]
    print('Train on %d instances\nTest on %d instances' % (len(trainfeats),
                                                           len(testfeats)))

    # Train the classifier and evaluate it on the held-out reviews.
    classifier = NaiveBayesClassifier.train(trainfeats)
    print('Accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
    classifier.show_most_informative_features()

    # Prepare the confusion matrix from predictions on the positive and
    # negative portions of the test set.
    pos = [classifier.classify(fs) for (fs, l) in posfeats[cutoff:]]
    pos = np.array(pos)
    neg = [classifier.classify(fs) for (fs, l) in negfeats[cutoff:]]
    neg = np.array(neg)

    print('Confusion matrix:')
    print('\t' * 2, 'Predicted class')
    print('-' * 40)
    print('|\t %d (TP) \t|\t %d (FN) \t| Actual class' % (
        (pos == 'pos').sum(), (pos == 'neg').sum()))
    print('-' * 40)
    print('|\t %d (FP) \t|\t %d (TN) \t|' % (
        (neg == 'pos').sum(), (neg == 'neg').sum()))
    print('-' * 40)
