4. Performance Analysis

In this chapter, we analyze the behavior of the various node coloring algorithms in the gcol library. Where appropriate, we also make comparisons to similar algorithms from the NetworkX library.

Algorithms are evaluated here according to solution quality and run times. Details on asymptotic algorithm complexity (in terms of big O notation) can be found in gcol’s documentation. All tests are conducted on randomly generated Erdős–Rényi graphs, commonly denoted by \(G(n,p)\). These graphs are constructed by taking \(n\) nodes and adding an edge between each node pair independently at random with probability \(p\). The expected number of edges in a \(G(n,p)\) graph is therefore \(\binom{n}{2}p\), while the expected node degree is \((n-1)p\).
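
These expectations are easy to verify empirically. The short check below is illustrative only: the value \(n = 100\) and the sample of 20 graphs are arbitrary choices. It compares the formula \(\binom{n}{2}p\) with the average edge count of a sample of \(G(n,0.5)\) graphs:

```python
import math

import networkx as nx

n, p = 100, 0.5

# Theoretical values: C(n,2)*p expected edges, (n-1)*p expected degree
expected_edges = math.comb(n, 2) * p
expected_degree = (n - 1) * p

# Empirical average over a small sample of G(n,p) graphs
num_graphs = 20
edge_counts = [nx.gnp_random_graph(n, p, seed).number_of_edges()
               for seed in range(num_graphs)]
mean_edges = sum(edge_counts) / num_graphs

print(expected_edges)   # 2475.0
print(expected_degree)  # 49.5
print(mean_edges)       # close to 2475
```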

For these tests, we use differing values of \(n\) but keep \(p\) fixed at \(0.5\). This is due to a result of Nick Wormald, who established that for \(n \gtrapprox 30\), a set of randomly constructed \(G(n, 0.5)\) graphs can be considered equivalent to a random sample from the population of all unlabeled \(n\)-node graphs. Note, however, that although this allows us to make general statistical statements about performance, different observations may well arise when executing these algorithms on specifically chosen topologies, such as scale-free graphs and planar graphs. Examples of these differences are discussed in this book, where a wider set of trials is conducted.

In the code below, each trial involves generating a set of \(G(n,0.5)\) graphs using a range of values of \(n\). The results of the algorithms are then written to a pandas DataFrame df, aggregated in a pivot table, and summarized in charts. Lines in the charts give mean values, while the shaded areas indicate one standard deviation on either side of these means. All results below were found by executing the code on a 3.0 GHz Windows 11 PC with 16 GB of RAM.

4.1. Differing Node Coloring Strategies

In our first experiment, we compare the different constructive strategies available for node coloring in the gcol library (namely 'random', 'welsh_powell', 'dsatur', and 'rlf') using \(G(n,0.5)\) graphs with values of \(n\) between \(50\) and \(500\). The results are shown in the charts below. Further details on these algorithms can be found in gcol’s documentation.

import pandas as pd
import networkx as nx
import gcol
import matplotlib.pyplot as plt
import time

# Carry out the trials and put the results into a list
results = []
nVals = range(50,501,50)
for n in nVals:
    for seed in range(50):
        G = nx.gnp_random_graph(n, 0.5, seed)
        for strategy in ["random", "welsh_powell", "dsatur", "rlf"]:
            start = time.time()
            c = gcol.node_coloring(G, strategy)
            results.append([n, seed, strategy, max(c.values()) + 1, time.time()-start])

# Create a pandas dataframe from this list and make a pivot table
df = pd.DataFrame(results, columns=["n", "seed", "strategy", "cols", "time"])
pivot = df.pivot_table(columns='strategy', aggfunc=['mean','std'], values=['cols','time'], index='n')

# Now use the pivot table to make a chart that compares mean solution quality
mean1, SD1 = pivot[("mean","cols","random")], pivot[("std","cols","random")]
mean2, SD2 = pivot[("mean","cols","welsh_powell")], pivot[("std","cols","welsh_powell")]
mean3, SD3 = pivot[("mean","cols","dsatur")], pivot[("std","cols","dsatur")]
mean4, SD4 = pivot[("mean","cols","rlf")], pivot[("std","cols","rlf")]
plt.plot(nVals, mean1, linestyle='-', linewidth=1.5, color="b", label='Random')
plt.fill_between(nVals, mean1-SD1, mean1+SD1, color='b', alpha=0.2)
plt.plot(nVals, mean2, linestyle='--', linewidth=1.5, color="r", label='Welsh-Powell')
plt.fill_between(nVals, mean2-SD2, mean2+SD2, color='r', alpha=0.2)
plt.plot(nVals, mean3, linestyle='-.', linewidth=1.5, color="g", label='DSatur')
plt.fill_between(nVals, mean3-SD3, mean3+SD3, color='g', alpha=0.2)
plt.plot(nVals, mean4, linestyle=':', linewidth=1.5, color="black", label='RLF')
plt.fill_between(nVals, mean4-SD4, mean4+SD4, color='k', alpha=0.2)
plt.xlabel("Number of nodes $n$")
plt.ylabel("Colors")
plt.legend()
plt.show()

# and do the same for mean run times
mean1, SD1 = pivot[("mean","time","random")], pivot[("std","time","random")]
mean2, SD2 = pivot[("mean","time","welsh_powell")], pivot[("std","time","welsh_powell")]
mean3, SD3 = pivot[("mean","time","dsatur")], pivot[("std","time","dsatur")]
mean4, SD4 = pivot[("mean","time","rlf")], pivot[("std","time","rlf")]
plt.plot(nVals, mean1, linestyle='-', linewidth=1.5, color="b", label='Random')
plt.fill_between(nVals, mean1-SD1, mean1+SD1, color='b', alpha=0.2)
plt.plot(nVals, mean2, linestyle='--', linewidth=1.5, color="r", label='Welsh-Powell')
plt.fill_between(nVals, mean2-SD2, mean2+SD2, color='r', alpha=0.2)
plt.plot(nVals, mean3, linestyle='-.', linewidth=1.5, color="g", label='DSatur')
plt.fill_between(nVals, mean3-SD3, mean3+SD3, color='g', alpha=0.2)
plt.plot(nVals, mean4, linestyle=':', linewidth=1.5, color="black", label='RLF')
plt.fill_between(nVals, mean4-SD4, mean4+SD4, color='k', alpha=0.2)
plt.xlabel("Number of nodes $n$")
plt.ylabel("Run time (s)")
plt.legend()
plt.show()
[Charts: mean number of colors and mean run times versus \(n\) for each strategy]

The results above show that the 'random' and 'welsh_powell' strategies produce the poorest solutions overall (in terms of the number of colors they use), while the 'rlf' strategy produces the best. This gap also seems to widen for larger values of \(n\). On the other hand, the RLF algorithm has less favorable run times, as shown in the second chart. This is to be expected because the RLF algorithm has a higher computational complexity than the other options. A good compromise is struck by the 'dsatur' strategy, which features comparatively good solution quality and run times.

4.2. Comparison to NetworkX

The next set of experiments compares the performance of gcol’s local search routines with NetworkX’s interchange coloring routine. As a benchmark, we also include gcol’s dsatur option from earlier, which is also used to produce the initial solutions for the local search algorithms. For comparative purposes, both of gcol’s local search algorithms (opt_alg=2 and opt_alg=3) are used here, and we impose a fixed iteration limit of \(n\). The results are collected and displayed in the same manner as in the previous example.

# Carry out the trials and put the results into a list
results = []
nVals = range(50,601,50)
for n in nVals:
    for seed in range(50):
        G = nx.gnp_random_graph(n, 0.5, seed)
        start = time.time()
        c = nx.greedy_color(G, "largest_first", interchange=True)
        results.append([n, seed, "networkx", max(c.values()) + 1, time.time()-start])
        start = time.time()
        c = gcol.node_coloring(G)
        results.append([n, seed, "dsatur", max(c.values()) + 1, time.time()-start])
        start = time.time()
        c = gcol.node_coloring(G, opt_alg=2, it_limit=len(G))
        results.append([n, seed, "opt_alg=2", max(c.values()) + 1, time.time()-start])
        start = time.time()
        c = gcol.node_coloring(G, opt_alg=3, it_limit=len(G))
        results.append([n, seed, "opt_alg=3", max(c.values()) + 1, time.time()-start])

# Create a pandas dataframe from this list and make a pivot table
df = pd.DataFrame(results, columns=["n", "seed", "alg", "cols", "time"])
pivot = df.pivot_table(columns='alg', aggfunc=['mean','std'], values=['cols','time'], index='n')

# Use the pivot table to make charts as before
mean1, SD1 = pivot[("mean","cols","networkx")], pivot[("std","cols","networkx")]
mean2, SD2 = pivot[("mean","cols","dsatur")], pivot[("std","cols","dsatur")]
mean3, SD3 = pivot[("mean","cols","opt_alg=2")], pivot[("std","cols","opt_alg=2")]
mean4, SD4 = pivot[("mean","cols","opt_alg=3")], pivot[("std","cols","opt_alg=3")]
plt.plot(nVals, mean1, linestyle='-', linewidth=1.5, color="b", label='NetworkX')
plt.fill_between(nVals, mean1-SD1, mean1+SD1, color='b', alpha=0.2)
plt.plot(nVals, mean2, linestyle='--', linewidth=1.5, color="r", label='DSatur')
plt.fill_between(nVals, mean2-SD2, mean2+SD2, color='r', alpha=0.2)
plt.plot(nVals, mean3, linestyle='-.', linewidth=1.5, color="g", label='opt_alg=2')
plt.fill_between(nVals, mean3-SD3, mean3+SD3, color='g', alpha=0.2)
plt.plot(nVals, mean4, linestyle=':', linewidth=1.5, color="black", label='opt_alg=3')
plt.fill_between(nVals, mean4-SD4, mean4+SD4, color='k', alpha=0.2)
plt.xlabel("Number of nodes $n$")
plt.ylabel("Colors")
plt.legend()
plt.show()

mean1, SD1 = pivot[("mean","time","networkx")], pivot[("std","time","networkx")]
mean2, SD2 = pivot[("mean","time","dsatur")], pivot[("std","time","dsatur")]
mean3, SD3 = pivot[("mean","time","opt_alg=2")], pivot[("std","time","opt_alg=2")]
mean4, SD4 = pivot[("mean","time","opt_alg=3")], pivot[("std","time","opt_alg=3")]
plt.plot(nVals, mean1, linestyle='-', linewidth=1.5, color="b", label='NetworkX')
plt.fill_between(nVals, mean1-SD1, mean1+SD1, color='b', alpha=0.2)
plt.plot(nVals, mean2, linestyle='--', linewidth=1.5, color="r", label='DSatur')
plt.fill_between(nVals, mean2-SD2, mean2+SD2, color='r', alpha=0.2)
plt.plot(nVals, mean3, linestyle='-.', linewidth=1.5, color="g", label='opt_alg=2')
plt.fill_between(nVals, mean3-SD3, mean3+SD3, color='g', alpha=0.2)
plt.plot(nVals, mean4, linestyle=':', linewidth=1.5, color="black", label='opt_alg=3')
plt.fill_between(nVals, mean4-SD4, mean4+SD4, color='k', alpha=0.2)
plt.xlabel("Number of nodes $n$")
plt.ylabel("Run time (s)")
plt.legend()
plt.show()
[Charts: mean number of colors and mean run times versus \(n\) for NetworkX, DSatur, opt_alg=2, and opt_alg=3]

It is clear from the above results that the local search algorithms make significant improvements to the solutions provided by the dsatur strategy, albeit with additional time requirements. The solutions and run times of these local search algorithms are also superior to those of NetworkX’s node coloring routine. Note that further improvements in solution quality might be found by increasing the iteration limit of the local search algorithms.

4.3. Exact Algorithm Performance

In addition to the two local search heuristics, the gcol library also features an exact, exponential-time algorithm for node coloring, based on backtracking. This algorithm is invoked by setting opt_alg=1. At the start of this algorithm’s execution, a large clique \(C\) is identified in \(G\) using the NetworkX function nx.approximation.max_clique(). The nodes of \(C\) are then permanently assigned to different colors. The main backtracking algorithm is then executed, halting only when a solution using \(|C|\) colors has been identified, or when the algorithm has backtracked to the root of the search tree. In both cases the returned solution is optimal (that is, it uses the minimum number of colors).
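
The reasoning behind this stopping condition can be illustrated without gcol: every node of a clique must receive a different color, so the size of any clique in \(G\) is a lower bound on the chromatic number, and any valid coloring that meets this bound is necessarily optimal. A small sketch of this idea (using exact maximal-clique enumeration, which is practical only for small graphs):

```python
import networkx as nx

# A small graph containing a triangle on nodes 0, 1, 2
G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)])

# Every node of a clique needs its own color, so the size of the
# largest clique is a lower bound on the chromatic number
C = max(nx.find_cliques(G), key=len)
lower_bound = len(C)

# Any valid coloring gives an upper bound; here a greedy coloring
coloring = nx.greedy_color(G, strategy="largest_first")
upper_bound = max(coloring.values()) + 1

# When the two bounds meet, the coloring is provably optimal
print(lower_bound, upper_bound)  # 3 3
```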

The following code evaluates the performance of this algorithm on \(G(n,0.5)\) graphs for a range of \(n\)-values.

results = []
nVals = range(2,55,2)
for n in nVals:
    for seed in range(25):
        G = nx.gnp_random_graph(n, 0.5, seed)
        start = time.time()
        c = gcol.node_coloring(G, opt_alg=1)
        results.append([n, seed, "opt_alg=1", max(c.values()) + 1, time.time()-start])

# Create a pandas dataframe from this list and make a pivot
df = pd.DataFrame(results, columns=["n", "seed", "alg", "cols", "time"])
pivot = df.pivot_table(columns='alg', aggfunc=['mean','std'], values=['cols','time'], index='n')

# Use the pivot table above to make the charts as before
mean1, SD1 = pivot[("mean","cols","opt_alg=1")], pivot[("std","cols","opt_alg=1")]
plt.plot(nVals, mean1, linestyle='-', linewidth=1.5, color="b", label='opt_alg=1')
plt.fill_between(nVals, mean1-SD1, mean1+SD1, color='b', alpha=0.2)
plt.xlabel("Number of nodes $n$")
plt.ylabel("Colors")
plt.legend()
plt.show()

mean1, SD1 = pivot[("mean","time","opt_alg=1")], pivot[("std","time","opt_alg=1")]
plt.plot(nVals, mean1, linestyle='-', linewidth=1.5, color="b", label='opt_alg=1')
plt.fill_between(nVals, mean1-SD1, mean1+SD1, color='b', alpha=0.2)
plt.xlabel("Number of nodes $n$")
plt.ylim((0, 600))
plt.ylabel("Run time (s)")
plt.legend()
plt.show()
[Charts: mean chromatic number and mean run time of opt_alg=1 versus \(n\)]

The first chart above shows the chromatic numbers from a sample of \(G(n,0.5)\) graphs for an increasing number of nodes \(n\). It can be seen that the chromatic number rises in a close-to-linear fashion in relation to \(n\). The second figure also demonstrates the disadvantages of using an exponential-time algorithm: once \(n\) is increased beyond a moderately small value (approximately 50 here), run times become high and unpredictable. Note, however, that the specific \(n\)-values that give these long run times can vary considerably depending on the topology of the graph. For example, planar graphs and scale-free graphs with several hundred nodes can often be solved very quickly. These sorts of results usually need to be confirmed empirically.

4.4. Equitable Coloring

In the equitable node-coloring problem, we are interested in coloring the nodes with a user-defined number of colors \(k\) so that (a) adjacent nodes have different colors, and (b) the numbers of nodes in the color classes are as equal as possible. The following trials run the gcol.equitable_node_k_coloring() method on a sample of random \(G(500,0.5)\) graphs over a range of suitable \(k\)-values. The reported cost is simply the difference in size between the largest and smallest color classes in a solution. Hence, if \(k\) is a divisor of \(n\), a cost of zero indicates an equitable \(k\)-coloring; otherwise, a cost of one indicates an equitable coloring.
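
This cost can be computed directly from a coloring dictionary. A minimal sketch (the helper coloring_cost() is illustrative and not part of gcol):

```python
from collections import Counter

def coloring_cost(c):
    """Size difference between the largest and smallest color
    classes of a coloring dict {node: color}."""
    sizes = Counter(c.values()).values()
    return max(sizes) - min(sizes)

# A 3-coloring of 7 nodes with class sizes 3, 2 and 2
c = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}
print(coloring_cost(c))  # 1: equitable, since 3 does not divide 7
```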

results = []
n = 500
kVals = range(70, 300, 1)
for seed in range(50):
    G = nx.gnp_random_graph(n, 0.5, seed)
    for k in kVals:
        start = time.time()
        c = gcol.equitable_node_k_coloring(G, k, opt_alg=2, it_limit=len(G))
        P = gcol.partition(c)
        cost = max(len(j) for j in P) - min(len(j) for j in P)
        results.append([k, seed, "opt_alg=2", cost, time.time()-start])

# Create a pandas dataframe from this list and make a pivot table
df = pd.DataFrame(results, columns=["k", "seed", "alg", "cost", "time"])
pivot = df.pivot_table(columns='alg', aggfunc=['mean','std'], values=['cost','time'], index='k')

# Use the pivot table above to make charts as before
mean1, SD1 = pivot[("mean","cost","opt_alg=2")], pivot[("std","cost","opt_alg=2")]
plt.plot(kVals, mean1, linestyle='-', linewidth=1.5, color="b", label='opt_alg=2')
plt.fill_between(kVals, mean1-SD1, mean1+SD1, color='b', alpha=0.2)
plt.xlabel("Number of colors $k$")
plt.ylabel("Cost")
plt.legend()
plt.show()

mean1, SD1 = pivot[("mean","time","opt_alg=2")], pivot[("std","time","opt_alg=2")]
plt.plot(kVals, mean1, linestyle='-', linewidth=1.5, color="b", label='opt_alg=2')
plt.fill_between(kVals, mean1-SD1, mean1+SD1, color='b', alpha=0.2)
plt.xlabel("Number of colors $k$")
plt.ylabel("Run time (s)")
plt.legend()
plt.show()
[Charts: mean cost and mean run time of gcol.equitable_node_k_coloring() versus \(k\)]

The first chart above demonstrates that the gcol.equitable_node_k_coloring() method consistently achieves equitable node \(k\)-colorings. The exceptions occur for low values of \(k\) (close to the chromatic number) and when \(k\) is a divisor of \(n\). In the former case, the low number of available colors restricts the choice of appropriate colors for each node, often leading to inequitable colorings. On the other hand, when \(k\) is a divisor of \(n\), the algorithm seeks a solution with a cost of zero, meaning that each color class must contain exactly the same number of nodes. If this cannot be achieved, a cost of at least two must be incurred.
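
This last point can be checked by brute force. For example, with \(n = 9\) and \(k = 3\), enumerating every way of splitting the nodes into three nonempty classes confirms that a cost of exactly one never arises:

```python
n, k = 9, 3

# Enumerate all ways of splitting 9 nodes into 3 nonempty classes
costs = set()
for a in range(1, n - 1):
    for b in range(1, n - a):
        rest = n - a - b
        if rest >= 1:
            costs.add(max(a, b, rest) - min(a, b, rest))

# A cost of 1 is impossible when k divides n: the classes are either
# all equal (cost 0) or straddle the average n/k (cost >= 2)
print(sorted(costs))  # [0, 2, 3, 4, 5, 6]
```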

The second chart above also indicates that run times of this routine increase slightly when \(k\) is a divisor of \(n\). Run times also lengthen as \(k\) increases, due to the larger number of candidate solutions that need to be evaluated in each iteration of the local search algorithm used by this routine. More details on this algorithm can be found in gcol’s documentation.

Finally, note that NetworkX features an exact equitable node \(k\)-coloring routine, but this can only be used for values of \(k\geq \Delta(G)\), where \(\Delta(G)\) is the highest node degree in the graph. In the \(G(500,0.5)\) graphs considered here, the minimum valid value for \(k\) is approximately 280.
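
For reference, this NetworkX routine can be invoked as follows on a small graph where the degree condition is easy to satisfy (the cycle graph and the choice of \(k\) one above the maximum degree are purely illustrative):

```python
from collections import Counter

import networkx as nx

G = nx.cycle_graph(10)                 # maximum degree 2
k = max(d for _, d in G.degree()) + 1  # 3 colors meets the degree condition

c = nx.equitable_color(G, k)

# Adjacent nodes get different colors...
assert all(c[u] != c[v] for u, v in G.edges())

# ...and the color class sizes differ by at most one
sizes = Counter(c.values()).values()
print(max(sizes) - min(sizes))  # at most 1
```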

4.5. Independent Set Comparison

Our final set of trials looks at the performance of the gcol.max_independent_set() routine and compares it to the approximation algorithm included in NetworkX for the same problem. As before, we use an iteration limit of \(n\) for the former.

# Carry out the trials and put the results into a list
results = []
nVals = range(50,501,50)
for n in nVals:
    for seed in range(50):
        G = nx.gnp_random_graph(n, 0.5, seed)
        start = time.time()
        S = gcol.max_independent_set(G, it_limit=len(G))
        results.append([n, seed, "gcol", len(S), time.time()-start])
        start = time.time()
        S = nx.approximation.maximum_independent_set(G)
        results.append([n, seed, "networkx", len(S), time.time()-start])

# Create a pandas dataframe from this list and make a pivot table
df = pd.DataFrame(results, columns=["n", "seed", "alg", "size", "time"])
pivot = df.pivot_table(columns='alg', aggfunc=['mean','std'], values=['size','time'], index='n')

# Create the charts as before
mean1, SD1 = pivot[("mean","size","networkx")], pivot[("std","size","networkx")]
mean2, SD2 = pivot[("mean","size","gcol")], pivot[("std","size","gcol")]
plt.plot(nVals, mean1, linestyle='-', linewidth=1.5, color="b", label='NetworkX')
plt.fill_between(nVals, mean1-SD1, mean1+SD1, color='b', alpha=0.2)
plt.plot(nVals, mean2, linestyle='--', linewidth=1.5, color="r", label='GCol')
plt.fill_between(nVals, mean2-SD2, mean2+SD2, color='r', alpha=0.2)
plt.xlabel("Number of nodes $n$")
plt.ylabel("Independent Set Size")
plt.legend()
plt.show()

mean1, SD1 = pivot[("mean","time","networkx")], pivot[("std","time","networkx")]
mean2, SD2 = pivot[("mean","time","gcol")], pivot[("std","time","gcol")]
plt.plot(nVals, mean1, linestyle='-', linewidth=1.5, color="b", label='NetworkX')
plt.fill_between(nVals, mean1-SD1, mean1+SD1, color='b', alpha=0.2)
plt.plot(nVals, mean2, linestyle='--', linewidth=1.5, color="r", label='GCol')
plt.fill_between(nVals, mean2-SD2, mean2+SD2, color='r', alpha=0.2)
plt.xlabel("Number of nodes $n$")
plt.ylabel("Run time (s)")
plt.legend()
plt.show()
[Charts: mean independent set size and mean run time versus \(n\) for GCol and NetworkX]

The results above show quite clearly that the gcol.max_independent_set() routine produces larger independent sets (and therefore better-quality solutions) in less run time. As before, further improvements in solution quality (at the expense of longer run times) may be found by increasing the it_limit parameter.