
Data Science Tutorial
What is Plotting?
Plotting data is an essential part of any exploratory or forecasting data analysis and probably the most essential part of report writing. There are three principal approaches to programmatic plotting: incremental, monolithic, and layered. In incremental plotting, we begin with a blank plot canvas and then add graphs, axes, labels, legends, and so on, one at a time, using specialized functions. Finally, we display the plot image and optionally store it in a file. Examples of incremental plotting tools include the R language function plot(), the Python module pyplot, and the gnuplot command-line plotting program.
Monolithic plotting systems pass all important parameters, describing the graphs, charts, axes, labels, legends, etc. to the plotting function. We plot, decorate, and store the final plot at once. An example of a monolithic plotting tool is the R language function xyplot().
Finally, layered tools define what to plot, how to plot it, and any additional features as virtual “layers”; we add more layers to the “plot” object as required. An example of a layered plotting tool is the R language function ggplot().
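The incremental approach described above can be sketched in a few lines of pyplot. This is only an illustration: the numbers are made up, and the output file name incremental-sketch.png is an arbitrary choice, not something taken from the tutorial's data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Made-up sample data, for illustration only
years = [2005, 2006, 2007, 2008, 2009]
beer = [1.20, 1.22, 1.21, 1.19, 1.17]

plt.figure()                                 # 1. start with a blank canvas
plt.plot(years, beer, "-o", label="Beer")    # 2. insert a graph
plt.xlabel("Year")                           # 3. add axis labels, a title ...
plt.ylabel("Consumption")
plt.title("Incremental plotting")
plt.legend()                                 # ... and a legend
plt.savefig("incremental-sketch.png")        # 4. store the plot in a file
```

Each call changes the current figure a little; nothing is final until the figure is shown or saved.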
Example
import matplotlib, matplotlib.pyplot as plt
import pickle, pandas as pd

# The NIAAA frame has been pickled before
with open("alco.pickle", "rb") as infile:
    alco = pickle.load(infile)
del alco["Total"]
columns, years = alco.unstack().columns.levels

# The state abbreviations come straight from the file
states = pd.read_csv(
    "states.csv",
    names=("State", "Standard", "Postal", "Capital"))
states.set_index("State", inplace=True)

# Alcohol consumption will be sorted by year 2009
frames = [pd.merge(alco[column].unstack(), states,
                   left_index=True, right_index=True).sort_values(2009)
          for column in columns]

# How many years are covered?
span = max(years) - min(years) + 1
The first code fragment simply imports all necessary modules and frames. It then combines NIAAA data and the state abbreviations into one frame and splits it into three separate frames by beverage type. The next code fragment is in charge of plotting.
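Before moving on to plotting, it may help to see what the unstack, merge, and sort steps produce. The sketch below repeats them on a tiny made-up frame; the two states, two years, and all numbers are hypothetical stand-ins for the pickled NIAAA data and for states.csv.

```python
import pandas as pd

# Hypothetical miniature stand-in for the NIAAA frame:
# rows are indexed by (State, Year), one column per beverage
index = pd.MultiIndex.from_product(
    [["Utah", "Colorado"], [2008, 2009]], names=["State", "Year"])
alco = pd.DataFrame({"Beer": [0.9, 0.8, 1.3, 1.2],
                     "Wine": [0.2, 0.3, 0.4, 0.5]}, index=index)

# Hypothetical stand-in for the state abbreviations table
states = pd.DataFrame({"Postal": ["UT", "CO"]},
                      index=pd.Index(["Utah", "Colorado"], name="State"))

# unstack() moves the years into the columns
columns, years = alco.unstack().columns.levels

# One frame per beverage: states as rows, years plus "Postal" as columns,
# with rows ordered by the 2009 values
frames = [pd.merge(alco[column].unstack(), states,
                   left_index=True, right_index=True).sort_values(2009)
          for column in columns]
span = max(years) - min(years) + 1

print(frames[0])
```

Each resulting frame has one row per state, one column per year, and the merged abbreviation columns at the end, which is why the plotting code selects only the first span columns.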
# Select a good-looking style
matplotlib.style.use("ggplot")
STEP = 5

# Plot each frame in a subplot
for pos, (draw, style, column, frame) in enumerate(zip(
        (plt.contourf, plt.contour, plt.imshow),
        (plt.cm.autumn, plt.cm.cool, plt.cm.spring),
        columns, frames)):
    # Select the subplot in a grid of 2 rows and 2 columns
    plt.subplot(2, 2, pos + 1)
    # Plot the frame
    draw(frame[frame.columns[:span]], cmap=style, aspect="auto")
    # Add embellishments
    plt.colorbar()
    plt.title(column)
    plt.xlabel("Year")
    plt.xticks(range(0, span, STEP), frame.columns[:span:STEP])
    plt.yticks(range(0, frame.shape[0], STEP), frame.Postal[::STEP])
    plt.xticks(rotation=-17)
The functions imshow(), contour(), and contourf() display the matrix as an image, a contour plot, and a filled contour plot, respectively. Don’t use these three functions (or any other plotting functions) in the same subplot unless you intend to overlay the plots, because each call superimposes its plot on whatever was drawn before. The optional parameter cmap specifies a prebuilt palette (color map) for the plot.
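As a minimal sketch of the difference between the three functions, the following draws the same made-up matrix three times, each in its own subplot so that nothing is superimposed. The matrix, the figure size, and the output file name are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Made-up 10x10 matrix standing in for a states-by-years frame
matrix = np.arange(100).reshape(10, 10) % 17

# One subplot per function, so the plots do not overlap
fig, axes = plt.subplots(1, 3, figsize=(9, 3))
axes[0].imshow(matrix, cmap=plt.cm.spring, aspect="auto")
axes[0].set_title("imshow")
axes[1].contour(matrix, cmap=plt.cm.cool)
axes[1].set_title("contour")
axes[2].contourf(matrix, cmap=plt.cm.autumn)
axes[2].set_title("contourf")
fig.savefig("matrix-plots-sketch.png")
```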
You can also add notes with annotate(), arrows with arrow(), and a legend block with legend(). In general, refer to the pyplot documentation for the complete list of embellishment functions and their arguments, but let’s at least add some arrows, notes, and a legend to an already familiar NIAAA graph:
Example
import matplotlib, matplotlib.pyplot as plt
import pickle, pandas as pd

# The NIAAA frame has been pickled before
with open("alco.pickle", "rb") as infile:
    alco = pickle.load(infile)

# Select the right data
BEVERAGE = "Beer"
years = alco.index.levels[1]
states = ("New Hampshire", "Colorado", "Utah")

# Select a good-looking style
plt.xkcd()
matplotlib.style.use("ggplot")

# Plot the charts
for state in states:
    # .loc selects all rows for one state; the result is indexed by year
    ydata = alco.loc[state][BEVERAGE]
    plt.plot(years, ydata, "-o")
    # Add annotations with arrows; idxmax() returns the year of the peak
    plt.annotate("Peak", xy=(ydata.idxmax(), ydata.max()),
                 xytext=(ydata.idxmax() + 0.5, ydata.max() + 0.1),
                 arrowprops={"facecolor": "black", "shrink": 0.2})

# Add labels and legends
plt.ylabel(BEVERAGE + " consumption")
plt.title("And now in xkcd...")
plt.legend(states)
plt.savefig("../images/pyplot-legend-xkcd.pdf")
Plotting with Pandas
Both pandas frames and series support plotting through pyplot. When the plot() function is called without any parameters, it line-plots either the series or all frame columns with labels. If you specify the optional parameters x and y, the function plots column y (on the vertical axis) against column x (on the horizontal axis).
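A minimal sketch of both calling styles, using a small made-up frame (the column names mirror the NIAAA beverages, but the numbers are invented for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up numbers, for illustration only
df = pd.DataFrame({"Year": [2007, 2008, 2009],
                   "Beer": [1.21, 1.19, 1.17],
                   "Wine": [0.31, 0.33, 0.36]})

# No parameters: one labeled line per column, plotted against the index
ax_all = df.set_index("Year").plot()

# x and y given: a single line, "Wine" values over "Beer" values
ax_xy = df.plot(x="Beer", y="Wine")
```

Both calls return a matplotlib Axes object, so any pyplot embellishment can be applied to the result afterward.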
pandas also supports other types of plots via the optional parameter kind. The admissible values of the parameter are “bar” and “barh” for bar plots, “hist” for histograms, “box” for boxplots, “kde” for density plots, “area” for area plots, “scatter” for scatter plots, “hexbin” for hexagonal bin plots, and “pie” for pie charts. All plots allow a variety of embellishments, such as legends, color bars, controllable dot sizes (option s), and colors (option c).
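As a short sketch of the kind parameter, the following draws a bar plot and a histogram from a made-up frame. Passing explicit axes through the ax parameter and setting bins=3 are choices made for this illustration so that the two plots land in separate subplots.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up numbers, for illustration only
df = pd.DataFrame({"Beer": [1.21, 1.19, 1.17],
                   "Wine": [0.31, 0.33, 0.36]},
                  index=[2007, 2008, 2009])

# Two side-by-side subplots, one per plot kind
fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Bar plot: one group of bars per row, one bar per column
df.plot(kind="bar", ax=ax_bar)

# Histogram of a single column
df["Beer"].plot(kind="hist", bins=3, ax=ax_hist)
```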
Example
import matplotlib, matplotlib.pyplot as plt
import pickle, pandas as pd

# The NIAAA frame has been pickled before
with open("alco.pickle", "rb") as infile:
    alco = pickle.load(infile)

# Select a good-looking style
matplotlib.style.use("ggplot")

# Do the scatter plot
STATE = "New Hampshire"
# .loc selects the rows for one state; reset_index() turns "Year" into a column
statedata = alco.loc[STATE].reset_index()
statedata.plot.scatter("Beer", "Wine", c="Year", s=100, cmap=plt.cm.autumn)
plt.title("%s: From Beer to Wine in 32 Years" % STATE)
plt.savefig("../images/scatter-plot.pdf")