Data Science Tutorial
- What is Data Science?
- Life Cycle of Data Analytics
- What is Machine Learning?
- Python Tools in Data Science
- Working with DataBase
- Data Science using R
- Hierarchical Indexing
- Data Science Using Scikit
- Clustering in Data Science
- Working with Network Data
- What is Plotting
- String Manipulation
- What is Text Analysis?
Working with Network Data
Network analysis is a new area of data analysis. Network science borrowed network theoretical methods partly from mathematics (graph theory), partly from social sciences, and a lot from sociology. A network is a collection of connected objects. We can treat all kinds of numeric and non-numeric (text) objects as networks, as long as there is a way to interconnect them. Depending on our background and the field of study, we may call the network objects “nodes,” “vertexes,” or “actors,” and call the connections between them “arcs,” “edges,” “links,” or “ties.” We may represent networks graphically and mathematically as graphs.
Graph Elements, Types, and Density
If at least one graph edge is directed (say, it connects node A to node B but not the other way around), the graph itself is directed and is called a digraph. If there are parallel edges in a graph (that is, node A may be connected to node B by more than one edge), the graph is called a multigraph. The edge from node A to itself is called a loop. A graph without loops and parallel edges is called a simple graph.
Weights can be assigned to graph edges. Weight is usually (but not necessarily) a number between 0 and 1, inclusive. The larger the weight, the stronger the connection between the nodes. A graph with weighted edges is called a weighted graph.
Edge weight is an example of an edge attribute. Edges may have other attributes, including numeric, Boolean, and string. Graph nodes can have attributes, too.
The degree of a graph node is defined as the number of edges connected (incident) to the node. For a directed graph, indegree (the number of edges coming into the node) and outdegree (the number of edges going out of the node) are distinguished.Graph density d (0≤d≤1) tells how close the graph is to a complete graph—a cobweb of all possible edges. For example,for a directed graph with e edges and n nodes:
For an undirected graph:
Graphs and the networks that they induce are exciting, diverse, and multifaceted objects. A walk in any sequence of edges such that the end of one edge is the beginning of another edge. When we take a bus from node “home” to node “subway station A,” then take a train from node “subway station A” to node “subway station B,” and then walk from node “subway station B” to node “our office,” we complete a graph walk (even though only part of the walk involves actual walking). We call a walk that doesn’t intersect itself (in other words, it doesn’t include the same node twice, except perhaps for the first-last node) a path. We call a closed path a loop. It’s a loop that takes us back home at the end of a hard day.
A connected component, or merely a component, is a set of all nodes in a graph such that there is a path from each node in the set to each other node in the set. For a directed graph, strongly connected components (connected by actual paths) and weakly connected components (connected by the paths where the edges were converted to undirected) are distinguished.
Sometimes two parts of a graph are connected, but the connection is so subtle that the removal of a single edge breaks the graph apart. Such an edge is called a bridge.
A clique is a set of nodes such that each node is directly connected node in the set. D’Artagnan and Three Musketeers formed a clique, and so does any other band that operates under the “all for one and one for all” principle. We call the largest clique in a graph the maximum clique. If a clique cannot be enlarged by adding another node to it, it is called a maximal clique. A maximum clique is always a maximal clique, but the converse is not always true. A complete graph is a maximal clique.
A star is a set of nodes such that one node is connected to all other nodes in the set, but the other nodes are not connected to each other. Stars are generally found in hierarchical multilevel systems (for example, corporations, military institutions, and the Internet).
A set of nodes directly connected to a node A is called the neighbourhood (G(A) of A). Neighbourhoods are a key part in snowballing, which is a data acquisition technique whereby the edges of a randomly selected seed node are followed to its neighbours, then to the neighbours of the neighbours (a second-degree neighbourhood), and so on.
The local clustering coefficient of a node A (also known as merely the clustering coefficient) is the number CC (A) of edges in A’s neighbourhood (excluding the edges directly connected to A), divided by the maximum possible number of edges. In other words, it’s the density of the neighbourhood of A without node A itself. The clustering coefficient of any node in a star is 0. The clustering coefficient of any node in a complete graph is 1. CC(A) can be used as a measure of star likeness or completeness of G(A).
Centrality is a measure of the importance of a node in a network. Several types of centrality measures measure different aspects of importance. For convenience, centralities are often scaled to the range between 0 (an unimportant, peripheral node) and 1 (an important, central node).
Degree centrality of A is the number of neighbours of A, which is the same as the degree of A or the size of G (A). You can scale it by dividing by the maximum possible number of neighbours of A, n-1.
Closeness centrality of A is the reciprocal of the average shortest path length LBA from all other nodes to A:
Betweenness centrality of A is the fraction of all shortest paths between all pairs of nodes in the network, excluding A, that contain A.
Eigenvector centrality of A is defined recursively as the scaled sum of the eigenvector centralities of all neighbours of A:
The last two centrality measures are computationally expensive, and it may not be practical to calculate them for large networks.
Network Analysis Sequence
A typical network analysis sequence consists of the following steps:
- It starts with identifying discrete entities and the relations between them. The entities become the network nodes, and the relations become edges. If the relations are binary (present vs. absent), they directly define the network edges. Suppose the relations are continuous or discrete, but not binary. In that case, you can either treat them as weighted edges or convert them into unweighted edges, but only if the value of the relation is at or above a threshold. The latter transformation is called sampling. The sampling threshold is chosen based on empirical and pragmatic considerations. If the threshold is too high, the network is too sparse and falls apart into many small components; if the threshold is too low, the network loses any community structure and becomes a tangle.
- Various network measures are calculated: density, number of components, GCC size, diameter, centralities, clustering coefficients, and so on.
- Network communities are identified. If the network ends up being modular, you can assign labels to the communities, replace the communities with “supernodes,” and study the new induced network.
- Finally, just like in any other data science experiment, results are interpreted, and a report with a lot of appealing pictures is produced.
The network module contains essential tools for creating, modifying, exploring, plotting, exporting, and importing networks. It supports simple and directed graphs and multigraphs. You will learn how to construct and modify a network by adding and removing nodes, edges, and attributes; how to calculate various network measures (such as centralities); and how to explore network community structure.
import networkx as nx
borders = nx.Graph()
not_borders1 = nx.DiGraph() # Just for our reference
not_borders2 = nx.MultiGraph() # Just for our reference
You can modify an existing network graph by adding or removing individual nodes or edges, or groups of nodes or edges. When you remove a node, all incident edges are removed, too. When you add an edge, its end nodes are added, too, unless they already existed in the graph. You can label nodes with either numbers or strings:
borders.add_nodes_from(["Lugandon", "Zambia", "Portugal", "Kuwait",
borders.add_edges_from([("Uganda", "Rwanda"), ("Uganda", "Kenya"),
("Uganda", "South Sudan"), ("Uganda", "Tanzania"),
("Uganda", "Democratic Republic of the Congo")])