Francisco A. Rodrigues, University of São Paulo. http://conteudo.icmc.usp.br/pessoas/francisco

Characterization of Network Structure

Before start programing, we have to set the working directory.

In [1]:
import os
os.chdir('/Users/franciscoaparecidorodrigues/Dropbox/0Programs/python/Book-chapter-examples');

We also need some library to process the data. Let us import the Numpy library because we will work with vectors.

In [2]:
from numpy  import *
import numpy as np

We have to import the matplotlib library to plot the results.

In [3]:
import matplotlib.pyplot as plt

To process the networks, we consider the Networkx library. This library can be installed following the steps in this link: https://networkx.github.io/documentation/latest/install.html Since import Nextwork as nx, all commands from this library is prefixed by "nx.".

In [4]:
import networkx as nx

To read the network from a file, we use the command read_edgelist. Notice that the network format is the edge list and we consider the weight version, since the file can have three columns (two for the connection and the third one for the weigth of the connection). In our case, we ignore the third connection, because we are considering only unweighted networks.

In [5]:
G= nx.read_edgelist("data/lesmis.txt", nodetype=int, data=(('weight',float),)) # Read the network
# or use G= G=nx.read_edgelist("data/zachary.txt", nodetype=int) if the data file has only two columns.

We transfor the network into the undirected version. In our case, this step is not be necessary because the default output from read_edgelist is an undirected graph. To obtain a directed graph, we need to include the field "create_using= nx.DiGraph()" as input. Anyway, let us perform this transformation here to exemplify this operation.

In [6]:
G = G.to_undirected()

We consider only the largest component here. The selection of the largest component is important mainly in the calculation of the measures related to distance, because the distance between pairs of nodes is defined only for nodes in the same component. Remember that a component is a graph in which all of its nodes can access each other.

In [7]:
Gcc=sorted(nx.connected_component_subgraphs(G), key = len, reverse=True)
G=Gcc[0]

Sometimes the node labels are not in the sequential order or labels are used. To facilitate our implementation, let us convert the labels to integers starting with the index zero, because Python uses 0-based indexing.

In [8]:
G = nx.convert_node_labels_to_integers(G, first_label=0)

Let us verify the number of nodes and edges of the network.

In [9]:
N = len(G)
M = G.number_of_edges()
print('Number of nodes:', N)
print('Number of edges:', M)
Number of nodes: 77
Number of edges: 254

Now, let us calculate measures to characterize the network structure.

Since the degree() function from Networkx provides an iterator for (node, degree) , we will use only the values of the degree and ignore the label of the nodes. We also convert the list to a numpy array, since this structure is easier to be manipuled than other structures.

In [10]:
vk = dict(G.degree())
vk = list(vk.values())
print('Degree', vk)
Degree [1, 10, 3, 3, 1, 1, 1, 1, 1, 1, 36, 1, 2, 1, 1, 1, 15, 11, 16, 11, 17, 4, 8, 4, 1, 2, 6, 6, 6, 6, 6, 3, 2, 22, 7, 7, 19, 15, 13, 10, 10, 10, 9, 3, 7, 9, 7, 7, 7, 7, 7, 2, 11, 3, 2, 3, 1, 7, 4, 1, 2, 11, 13, 2, 1, 11, 9, 11, 12, 12, 10, 2, 2, 7, 2, 1, 1]

The mean degree of the network:

In [11]:
md = mean(vk)
print('Mean degree: ', md)
Mean degree:  6.5974025974

From the node degrees, we can calculate several statistical measures. Let us develop a function to calculate the degree distribution.

In [12]:
def degree_distribution(G):
    vk = dict(G.degree())
    vk = list(vk.values()) # we get only the degree values
    maxk = np.max(vk)
    mink = np.min(min)
    kvalues= arange(0,maxk+1) # possible values of k
    Pk = np.zeros(maxk+1) # P(k)
    for k in vk:
        Pk[k] = Pk[k] + 1
    Pk = Pk/sum(Pk) # the sum of the elements of P(k) must to be equal to one
    return kvalues,Pk

Thus, to obtain the degree distribution, we have:

In [13]:
ks, Pk = degree_distribution(G)

Thus, we can plot the degree distribution in log-log scale.

In [14]:
plt.figure()
plt.loglog(ks,Pk,'bo',basex=10,basey=10)
plt.xlabel("k", fontsize=20)
plt.ylabel("P(k)", fontsize=20)
plt.title("Degree distribution", fontsize=20)
plt.grid(True)
plt.savefig('degree_dist.eps') #save the figure into a file
plt.show(True)

We can also calculate the statistical moments of the degree distribution. A function to calculate the m-th moment of the degree distribution is defined as:

In [15]:
def momment_of_degree_distribution(G,m):
    k,Pk = degree_distribution(G)
    M = sum((k**m)*Pk)
    return M

Or, we can also calculate through:

In [16]:
def momment_of_degree_distribution2(G,m):
    M = 0
    for i in G.nodes:
        M = M + G.degree(i)**m
    M = M/N
    return M

The first statistical moment is equal to the mean degree:

In [17]:
k1 = momment_of_degree_distribution(G,1)
print("Mean degree = ", mean(vk))
print("First moment of the degree distribution = ", k1)
Mean degree =  6.5974025974
First moment of the degree distribution =  6.5974025974

And the second momment of the degree distribution:

In [18]:
k2 = momment_of_degree_distribution(G,2)
print("Second moment of the degree distribution = ", k2)
Second moment of the degree distribution =  79.5324675325

The variance is calculated as: $V(k) = \langle k^2 \rangle - k^2$

In [19]:
variance = momment_of_degree_distribution(G,2) - momment_of_degree_distribution(G,1)**2
print("Variance of the degree = ", variance)
Variance of the degree =  36.0067465003

The level of network heterogeneity with respect to the number of connections can be quantified by the Shannon entropy. A function to calculate the Shannon entropy of the degree distribution can be defined as:

In [20]:
def shannon_entropy(G):
    k,Pk = degree_distribution(G)
    H = 0
    for p in Pk:
        if(p > 0):
            H = H - p*math.log(p, 2)
    return H

The Shannon entropy of P(k)

In [21]:
H = shannon_entropy(G)
print("Shannon Entropy = ", "%3.4f"%H)
Shannon Entropy =  3.5957

We can also calculate the normalize version of the entropy ($H_n = H/H_{max}$, where $H_{max} = \log N$)

In [22]:
def normalized_shannon_entropy(G):
    k,Pk = degree_distribution(G)
    H = 0
    for p in Pk:
        if(p > 0):
            H = H - p*math.log(p, 2)
    return H/math.log(len(G),2)
In [23]:
H = normalized_shannon_entropy(G)
print("Normalized Shannon Entropy = ", "%3.4f"%H)
Normalized Shannon Entropy =  0.5738

Transitivity and clustering

In addition to the degree, another important property of networks is related to the number of triangles, which is related to the concept of transitivity. The transitivity of the network G.

In [24]:
CC = (nx.transitivity(G)) 
print("Transitivity = ","%3.4f"%CC)
Transitivity =  0.4989

The level of triangles in a network can also be quantified by the average clustering coefficient, calculated from the local clustering coefficient, i.e.,

In [25]:
avc = nx.average_clustering(G)
print("Average clustering:", "%3.4f"%avc)
Average clustering: 0.5731

The clustering of each node is defined by the fraction of edges among the neighbors of each node.

In [26]:
vcc = []
for i in G.nodes():
    vcc.append(nx.clustering(G, i))
vcc= np.array(vcc)
print('Clustering of all nodes:', vcc)
Clustering of all nodes: [ 0.          0.06666667  1.          1.          0.          0.          0.
  0.          0.          0.          0.12063492  0.          1.          0.
  0.          0.          0.31428571  0.49090909  0.40833333  0.38181818
  0.32352941  0.33333333  0.64285714  0.66666667  0.          1.          1.
  1.          1.          1.          1.          1.          1.
  0.35497835  0.47619048  0.42857143  0.33333333  0.60952381  0.76923077
  0.8         0.8         0.71111111  0.83333333  1.          1.
  0.61111111  1.          1.          1.          1.          1.          1.
  0.45454545  1.          0.          0.33333333  0.          0.9047619   1.
  0.          0.          0.69090909  0.75641026  0.          0.
  0.92727273  1.          0.92727273  0.86363636  0.86363636  0.93333333
  1.          1.          1.          1.          0.          0.        ]

The statistical distribution of the clustering coefficient:

In [27]:
plt.figure()
plt.hist(vcc, bins  = 10, normed=True)
plt.title("Distribution of the clustering coefficient")
plt.ylabel("P(cc)")
plt.xlabel("Clustering coefficient (cc)")
plt.grid(True)
plt.savefig('clustering.eps') #save the figure into a file
plt.show()

The clustering in function of the degree can reveal a hierarchical organization if $C(k) \approx k^{-\beta}$.

In [28]:
#Average clustering for each degree k
ck = list()
ks = list()
for k in arange(np.min(vk), np.max(vk)):
    aux = vk == k
    if(len(vcc[aux]) > 0):
        cm = mean(vcc[aux]) #average clustering among all the nodes with degree k
        ck.append(cm)
        ks.append(k)
plt.loglog(ks,ck,'bo',basex=10,basey=10)
plt.title("Clustering coefficient according to degree")
plt.ylabel("cc(k)")
plt.xlabel("k")
plt.grid(True)
plt.savefig('cck.eps')
plt.show(True)

Distance

The distance between pairs of nodes is given by the number of edges between them, where edges and nodes are not repeated. The distance with the minimal length is called the shortest path. The average shortest path length can be calculated by:

In [29]:
if nx.is_connected(G) == True:
    l = nx.average_shortest_path_length(G)
    print("Average shortest path length:", "%3.4f"%l)
Average shortest path length: 2.6411

The shortest path with the longest length is the diameter.

In [30]:
d = nx.diameter(G)
print('Network diameter:', d)
Network diameter: 5

Another measure related to distances is the network efficiency, which is a measure of how efficiently are network exchanges information. The average global efficiency of a graph is the average efficiency of all pairs of nodes. The higher this measure, the better the information transmission in a network.

In [31]:
E = nx.global_efficiency(G)
print('Network efficiency', E)
Network efficiency 0.43528708133971533

And the local efficiency of a node $i$ characterizes how well information is exchanged by the neighbors of a node $i$ when it is removed.

In [32]:
leff = nx.local_efficiency(G)
print('The average local efficiency of the network:', leff)
The average local efficiency of the network: 0.9258044278687318

The matrix of distances stores the distances between all pairs of nodes. From this matrix, we can calculate the distribution of the shortest paths between all pairs of nodes. Notice that $D$ is symmetric, because are considering undirected networks.

In [33]:
if nx.is_connected(G) == True:
    D = zeros(shape=(N,N)) # D is the matrix of distances
    vl = []
    for i in arange(0,N):
        for j in arange(i+1, N):
            if(i != j):
                aux = nx.shortest_path(G,i,j)
                dij = len(aux)
                D[i][j] = dij
                D[j][i] = dij
                vl.append(dij)
    x = range(0,d+1)
    plt.hist(vl, bins = x, normed=True)
    plt.title("Distribution of the geodesic distances")
    plt.ylabel("P(l)")
    plt.xlabel("Shortest path length (l)")
    plt.grid(True)
    plt.savefig('av_short_path.eps')
    plt.show(True)
else:
    print("The graph has more than one connected component")