    String Manipulation

    Python has long been a popular language for raw data manipulation, in part because of its ease of use for string and text processing. Most text operations are made simple by the string object’s built-in methods. For more complex pattern matching and text manipulation, regular expressions may be needed. pandas adds to the mix by letting you apply string methods and regular expressions concisely to entire arrays of data, while also handling the annoyance of missing data.

    String Object Methods

    In many string munging and scripting applications, the built-in string methods are sufficient. For example, a comma-separated string can be broken into pieces with split:

    Example

    val = 'a,b, guido'

    val.split(',')

    Output

    ['a', 'b', ' guido']

    split is often combined with strip to trim whitespace (including line breaks) from each piece.
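A minimal sketch of combining split with strip, reusing the val string from the example above. The variable name pieces is only illustrative.

```python
val = 'a,b, guido'

# Split on commas, then strip surrounding whitespace from each piece
pieces = [x.strip() for x in val.split(',')]
print(pieces)  # ['a', 'b', 'guido']
```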

    Python built-in string methods
    Method                 Description
    count                  Returns the number of non-overlapping occurrences of a substring in the string.
    endswith               Returns True if the string ends with the suffix.
    startswith             Returns True if the string starts with the prefix.
    join                   Uses the string as a delimiter for concatenating a sequence of other strings.
    index                  Returns the position of the first character of the first occurrence of a substring in the string; raises ValueError if not found.
    find                   Returns the position of the first character of the first occurrence of a substring in the string; like index, but returns -1 if not found.
    rfind                  Returns the position of the first character of the last occurrence of a substring in the string; returns -1 if not found.
    replace                Replaces occurrences of a substring with another string.
    strip, rstrip, lstrip  Trim whitespace, including newlines, from both sides, the right side, or the left side, respectively.
    split                  Breaks the string into a list of substrings using the passed delimiter.
    lower                  Converts alphabetic characters to lowercase.
    upper                  Converts alphabetic characters to uppercase.
    casefold               Converts characters to lowercase and converts any region-specific variable character combinations to a common comparable form.
    ljust, rjust           Left justify or right justify, respectively; pad the opposite side of the string with spaces (or another fill character) to return a string with a minimum width.
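A few of the methods in the table can be tried on the same sample string used earlier. This is a small sketch; the '::' delimiter and the variable names are arbitrary choices for illustration.

```python
val = 'a,b, guido'
pieces = [x.strip() for x in val.split(',')]

# join: use '::' as the delimiter between the pieces
joined = '::'.join(pieces)
print(joined)                  # a::b::guido

# index raises ValueError when the substring is absent; find returns -1
print(val.index(','))          # 1
print(val.find(':'))           # -1
try:
    val.index(':')
except ValueError as exc:
    print('index failed:', exc)

# count occurrences, and replace (or delete) a substring
print(val.count(','))          # 2
print(val.replace(',', '::'))  # a::b:: guido
print(val.replace(',', ''))    # ab guido
```

The difference between index and find matters mostly for control flow: find lets you test the return value, while index forces you to handle an exception.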
    Regular Expressions

    Regular expressions provide a flexible way to search for or match (often more complex) string patterns in text. A single expression, commonly called a regex, is a string formed according to the regular expression language. Python’s built-in re module is responsible for applying regular expressions to strings.

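    The re module functions fall into three categories: pattern matching, substitution, and splitting. Naturally, these are all related; a regex describes a pattern to locate in the text, which can then be used for many purposes. For example, to split a string on a variable number of whitespace characters (tabs, spaces, and newlines), use the regex \s+, which matches one or more whitespace characters: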

    Example

    import re

    text = "foo bar\t baz \tqux"

    re.split(r'\s+', text)

    Output

    ['foo', 'bar', 'baz', 'qux']

    When you call re.split(r'\s+', text), the regular expression is first compiled, and then its split method is called on the passed text. Writing the pattern as a raw string (the r prefix) keeps Python from treating the backslash as a string escape. You can compile the regex yourself with re.compile, forming a reusable regex object:

    Example

    regex = re.compile(r'\s+')

    regex.split(text)

    Output

    ['foo', 'bar', 'baz', 'qux']
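As a sketch of why a compiled regex object is convenient, the same object can be used for other operations and on other strings. The lines list below is hypothetical sample data, not part of the original example.

```python
import re

text = "foo bar\t baz \tqux"
regex = re.compile(r'\s+')

# findall returns the whitespace runs that split() breaks on
separators = regex.findall(text)
print(separators)   # [' ', '\t ', ' \t']

# The same compiled object can be applied to other strings
lines = ["a  b", "c\td\ne"]   # hypothetical sample data
split_lines = [regex.split(line) for line in lines]
print(split_lines)  # [['a', 'b'], ['c', 'd', 'e']]
```

Compiling once is mainly worthwhile when the same pattern is applied to many strings.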

    Regular expression methods
    Method      Description
    findall     Returns all non-overlapping matching patterns in a string as a list.
    finditer    Like findall, but returns an iterator.
    match       Matches the pattern at the start of the string and optionally segments pattern components into groups; if the pattern matches, returns a match object, and otherwise None.
    search      Scans the string for a match to the pattern, returning a match object if one is found; unlike match, the match can be anywhere in the string as opposed to only at the beginning.
    split       Breaks the string into pieces at each occurrence of the pattern.
    sub, subn   Replace all occurrences of the pattern in the string with a replacement expression (pass count to limit the number replaced); subn additionally returns the number of substitutions made. Use the symbols \1, \2, ... to refer to match group elements in the replacement string.
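The methods in the table can be compared side by side on one string. The sample text and the email pattern below are assumptions made for illustration; the pattern is deliberately simple and is not a complete email validator.

```python
import re

# Hypothetical sample text and a deliberately simple email pattern
text = "Dave dave@google.com, Steve steve@gmail.com"
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

# findall: with groups in the pattern, each result is a tuple of groups
found = regex.findall(text)
print(found)   # [('dave', 'google', 'com'), ('steve', 'gmail', 'com')]

# search: first match anywhere in the string
m = regex.search(text)
print(m.group(0))   # dave@google.com
print(m.groups())   # ('dave', 'google', 'com')

# match: only succeeds at the start of the string, so this is None
print(regex.match(text))   # None

# sub: \1 and \2 refer to the first and second groups
replaced = regex.sub(r'\1 at \2', text)
print(replaced)   # Dave dave at google, Steve steve at gmail

# subn: same replacement, but also reports how many were made
redacted, n = regex.subn('REDACTED', text)
print(redacted, n)   # Dave REDACTED, Steve REDACTED 2
```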
    Partial listing of vectorized string methods
    Method      Description
    cat         Concatenates strings element-wise with an optional delimiter.
    contains    Returns a boolean array indicating whether each string contains the pattern/regex.
    count       Counts occurrences of the pattern.
    extract     Uses a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group.
    endswith    Equivalent to x.endswith(pattern) for each element.
    startswith  Equivalent to x.startswith(pattern) for each element.
    findall     Computes a list of all occurrences of the pattern/regex for each string.
    get         Indexes into each element (retrieves the i-th element).
    isalnum     Equivalent to built-in str.isalnum.
    isalpha     Equivalent to built-in str.isalpha.
    isdecimal   Equivalent to built-in str.isdecimal.
    isdigit     Equivalent to built-in str.isdigit.
    islower     Equivalent to built-in str.islower.
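In pandas, the vectorized methods listed above are reached through the str attribute of a Series. The following is a minimal sketch assuming pandas and NumPy are installed; the sample Series and the simple email pattern are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value
data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# Missing values are skipped rather than raising an error
upper = data.str.upper()
print(upper)

# na=False treats missing entries as "does not contain"
has_gmail = data.str.contains('gmail', na=False)
print(has_gmail)

# extract returns a DataFrame with one column per regex group
parts = data.str.extract(r'([a-z0-9._%+-]+)@([a-z0-9.-]+)\.([a-z]{2,4})')
print(parts)
```

Calling the plain string methods on each element (for example with map) would fail on the missing entry, which is the annoyance the str accessor avoids.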

    Copyright 1999- Ducat Creative, All rights reserved.