Data Science Tutorial
Python has long been a popular language for raw data manipulation, in part because of its ease of use for string and text processing. Most text operations are made simple by the string object’s built-in methods. For more complex pattern matching and text manipulation, regular expressions may be needed. pandas adds to the mix by letting us apply string methods and regular expressions concisely to entire arrays of data, while also handling the annoyance of missing data.
String Object Methods
In many string munging and scripting applications, the built-in string methods are sufficient. For example, a comma-separated string can be broken into pieces with split:
val = 'a,b, guido'
val.split(',')
['a', 'b', ' guido']
split is often combined with strip to trim whitespace (including line breaks), for example with a list comprehension such as [x.strip() for x in val.split(',')].
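A minimal sketch of the split-plus-strip combination, reusing the val string from above. The variable names pieces and joined are chosen here for illustration only:

```python
val = 'a,b, guido'

# Split on commas, then strip surrounding whitespace from each piece.
pieces = [x.strip() for x in val.split(',')]
print(pieces)   # ['a', 'b', 'guido']

# join glues the cleaned pieces back together with a chosen delimiter.
joined = '::'.join(pieces)
print(joined)   # a::b::guido
```

Note that strip only removes whitespace at the two ends of each piece; whitespace inside a piece is left alone.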
Python built-in string methods
|Method|Description|
|---|---|
|count|Returns the number of non-overlapping occurrences of a substring in the string.|
|endswith|Returns True if the string ends with the suffix.|
|startswith|Returns True if the string starts with the prefix.|
|join|Uses the string as a delimiter for concatenating a sequence of other strings.|
|index|Returns the position of the first character of the substring if it is found in the string; raises ValueError if not found.|
|find|Returns the position of the first character of the first occurrence of a substring in the string; like index, but returns –1 if not found.|
|rfind|Returns the position of the first character of the last occurrence of a substring in the string; returns –1 if not found.|
|replace|Replaces occurrences of a string with another string.|
|strip, rstrip, lstrip|Trim whitespace, including newlines, from both ends, the right end, or the left end of the string, respectively.|
|split|Breaks the string into a list of substrings using the passed delimiter.|
|lower|Converts alphabetic characters to lowercase.|
|upper|Converts alphabetic characters to uppercase.|
|casefold|Converts characters to lowercase, and converts any region-specific variable character combinations to a common comparable form.|
|ljust, rjust|Left justify or right justify, respectively; pad the opposite side of the string with spaces (or another fill character) to return a string with a minimum width.|
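The following sketch exercises several of the methods in the table, in particular the difference between index and find when the substring is absent. The sample strings and variable names are illustrative and not part of the original text:

```python
val = 'a,b, guido'

# index and find both report where a substring starts...
first_comma = val.index(',')      # 1
# ...but they disagree when the substring is absent:
missing = val.find(':')           # -1, no exception
try:
    val.index(':')
    index_raised = False
except ValueError:
    index_raised = True           # index raises instead of returning -1

comma_count = val.count(',')      # 2
swapped = val.replace(',', '::')  # 'a::b:: guido'
deleted = val.replace(',', '')    # 'ab guido' (empty replacement deletes)

# casefold is a more aggressive lower(), intended for caseless comparison.
same = 'Straße'.casefold() == 'STRASSE'.casefold()   # True

# ljust pads on the right to reach a minimum width.
padded = 'ab'.ljust(5, '.')       # 'ab...'

print(first_comma, missing, index_raised, comma_count)
print(swapped, deleted, same, padded)
```

Passing an empty string to replace is a common way to delete a pattern outright.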
Regular Expressions
Regular expressions provide a flexible way to search or match (often more complex) string patterns in text. A single expression, commonly called a regex, is a string formed according to the regular expression language. Python’s built-in re module is responsible for applying regular expressions to strings.
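The re module functions fall into three categories: pattern matching, substitution, and splitting. Naturally, these are all related; a regex describes a pattern to locate in the text, which can then be used for many purposes. For example, to split a string on a variable number of whitespace characters (tabs, spaces, and newlines), use the regex \s+, which matches one or more whitespace characters: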
import re
text = "foo bar\t baz \tqux"
re.split(r'\s+', text)
['foo', 'bar', 'baz', 'qux']
When you call re.split(r'\s+', text), the regular expression is first compiled, and then its split method is called on the passed text. The r prefix makes the pattern a raw string, so the backslash reaches the regex engine unchanged. You can compile the regex yourself with re.compile, forming a reusable regex object:
regex = re.compile(r'\s+')
regex.split(text)
['foo', 'bar', 'baz', 'qux']
Regular expression methods
|Method|Description|
|---|---|
|findall|Returns all non-overlapping matches of the pattern in a string as a list.|
|finditer|Like findall, but returns an iterator.|
|match|Matches the pattern at the start of the string and optionally segments pattern components into groups. Returns a match object if the pattern matches, and None otherwise.|
|search|Scans the string for a match to the pattern, returning a match object if one is found; unlike match, the match can be anywhere in the string rather than only at the beginning.|
|split|Breaks the string into pieces at each occurrence of the pattern.|
|sub, subn|Replace occurrences of the pattern in a string with a replacement expression: all of them by default, or at most count of them if the count argument is given. sub returns the new string; subn returns a tuple of the new string and the number of substitutions made. Use \1, \2, … to refer to match group elements in the replacement string.|
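The methods in the table can be tried on the same whitespace pattern used earlier. This is a small sketch rather than a complete tour; the variable names and the 'foo bar' group example are illustrative additions:

```python
import re

text = "foo bar\t baz \tqux"
regex = re.compile(r'\s+')  # one or more whitespace characters

# findall: every non-overlapping match, as a list of strings.
gaps = regex.findall(text)           # [' ', '\t ', ' \t']

# search finds the first match anywhere; match only tries position 0.
first = regex.search(text)
span = (first.start(), first.end())  # (3, 4)
at_start = regex.match(text)         # None: text starts with 'f'

# sub returns the new string; subn also reports how many replacements ran.
collapsed = regex.sub(' ', text)            # 'foo bar baz qux'
collapsed_n = regex.subn(' ', text)         # ('foo bar baz qux', 3)
only_first = regex.sub('_', text, count=1)  # 'foo_bar\t baz \tqux'

# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'foo bar')   # 'bar foo'

print(gaps, span, at_start)
print(collapsed, collapsed_n, repr(only_first), swapped)
```

Compiling the pattern once with re.compile is worthwhile when the same expression is applied to many strings.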
Partial listing of vectorized string methods
|Method|Description|
|---|---|
|cat|Concatenates strings element-wise with an optional delimiter.|
|contains|Returns a boolean array indicating whether each string contains the pattern/regex.|
|count|Counts occurrences of the pattern in each string.|
|extract|Uses a regular expression with groups to extract one or more strings from each element of a Series of strings; the result is a DataFrame with one column per group.|
|endswith|Equivalent to x.endswith(pattern) for each element.|
|startswith|Equivalent to x.startswith(pattern) for each element.|
|findall|Computes a list of all occurrences of the pattern/regex for each string.|
|get|Indexes into each element (retrieves the i-th element).|
|isalnum|Equivalent to built-in str.isalnum.|
|isalpha|Equivalent to built-in str.isalpha.|
|isdecimal|Equivalent to built-in str.isdecimal.|
|isdigit|Equivalent to built-in str.isdigit.|
|islower|Equivalent to built-in str.islower.|
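In pandas these methods are reached through a Series's str attribute. The sketch below assumes pandas and NumPy are installed; the e-mail addresses are made-up sample data and the pattern is a deliberately simple illustration, not a robust e-mail validator:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value.
data = pd.Series({
    'Dave': 'dave@google.com',
    'Steve': 'steve@gmail.com',
    'Rob': 'rob@gmail.com',
    'Wes': np.nan,
})

# contains: na=False treats the missing entry as "no match".
has_gmail = data.str.contains('gmail', na=False)

# Element-wise equivalents of built-in methods skip missing values.
upper = data.str.upper()

# extract: one DataFrame column per regex group (user, domain, suffix).
pattern = r'([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+)\.([A-Za-z]{2,4})'
parts = data.str.extract(pattern)

print(has_gmail)
print(upper)
print(parts)
```

The missing entry does not raise an error in any of these calls, which is the main convenience over applying the built-in string methods in a Python loop.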