GRAPHS ====== A graph G is a set V of vertices (sometimes called nodes), and a set E of edges (sometimes called arcs) that each connect two vertices together. To confuse you, mathematicians often use the notation G = (V, E). Here, "(V, E)" is an _ordered_pair_ of sets. This isn't as deep and meaningful as it sounds; some people just love formalism. The notation is convenient when you want to discuss several graphs with the same vertices; e.g. G = (V, E) and T = (V, F). Graphs come in two types: _directed_ and _undirected_. In a directed graph (or _digraph_ for short), every edge e is directed from some vertex v to some vertex w. We write "e = (v, w)" (also an ordered pair), and draw an arrow pointing from v to w. The vertex v is called the _origin_ of e, and w is the _destination_ of e. In an undirected graph, edges have no favored direction, so we draw a curve connecting v and w. We still write e = (v, w), but now it's an unordered pair, which means that (v, w) = (w, v). One application of a graph is to model a street map. For each intersection, define a vertex that represents it. If two intersections are connected by a length of street with no intervening intersection, define an edge connecting them. We might use an undirected graph, but if there are one-way streets, a directed graph is more appropriate. We can model a two-way street with two edges, one pointing in each direction. On the other hand, if we want a graph that tells us which cities adjoin each other, an undirected graph makes sense. --- Bancroft --- --- -------- ------------ |1|<------------|2|<------------|3| |Albany|------|Kensington| --- --- --- -------- ------------ | ^ | ^ \ / Dana | Telegraph | Bowditch | | ------------ ---------- v | v | |Emeryville|-----|Berkeley| --- --- --- ------------ ---------- |4|------------>|5|------------>|6| \ / --- Durant --- --- --------- ---------- |Oakland|-----|Piedmont| Multiple copies of an edge are forbidden, --------- ---------- but a directed graph may contain both (v, w) and (w, v). Both types of graph can have _self-edges_ of the form (v, v), which connect a vertex to itself. (Many applications, like the two illustrated above, don't use these.) A _path_ is a sequence of vertices such that each adjacent pair of vertices is connected by an edge. If the graph is directed, the edges that form the path must all be aligned with the direction of the path. The _length_ of a path is the number of edges it traverses. Above, <4, 5, 6, 3> is a path of length 3. It is perfectly respectable to talk about a path of length zero, such as <2>. The _distance_ from one vertex to another is the length of the shortest path from one to the other. A graph is _strongly_connected_ if there is a path from every vertex to every other vertex. (This is just called _connected_ in undirected graphs.) Both graphs above are strongly connected. The _degree_ of a vertex is the number of edges incident on that vertex. (Self-edges count just once in 61B.) Hence, Berkeley has degree 4, and Piedmont has degree 1. A vertex in a directed graph has an _indegree_ (the number of edges directed toward it) and an _outdegree_ (the number of edges directed away). Intersection 6 above has indegree 2 and outdegree 1. Graph Representations --------------------- There are two popular ways to represent a graph. The first is an _adjacency_ _matrix_, a |V|-by-|V| array of boolean values (where |V| is the number of vertices in the graph). Each row and column represents a vertex of the graph. Set the value at row i, column j to true if (i, j) is an edge of the graph. If the graph is undirected (below right), the adjacency matrix is _symmetric_: row i, column j has the same value as row j, column i. 1 2 3 4 5 6 Alb Ken Eme Ber Oak Pie 1 - - - T - - Albany - T - T - - 2 T - - - - - Kensington T - - T - - 3 - T - - - T Emeryville - - - T T - 4 - - - - T - Berkeley T T T - T - 5 - T - - - T Oakland - - T T - T 6 - - T - - - Piedmont - - - - T - It should be clear that the maximum possible number of edges is |V|^2 for a directed graph, and slightly more than half that for an undirected graph. In many applications, however, the number of edges is much less than Theta(|V|^2). For instance, our maps above are _planar_graphs_ (graphs that can be drawn without edges crossing), and all planar graphs have O(|V|) edges. A graph is called _sparse_ if it has far fewer edges than the maximum possible number. ("Sparse" has no precise definition, but it usually implies that the number of edges is asymptotically smaller than |V|^2.) For a sparse graph, an adjacency matrix representation is very wasteful. A more memory-efficient data structure for sparse graphs is the _adjacency_ _list_. An adjacency list is actually a collection of lists. Each vertex v maintains a list of the edges directed out from v. The standard list representations all work--arrays (below left), linked lists (below right), even search trees (because you can traverse one in linear time). --- ----- --- ------ ------ 1 |.+-> | 4 | Albany |.+-> |Ken.+-> |Ber*| --- ===== === ====== ====== 2 |.+-> | 1 | Kensington |.+-> |Alb.+-> |Ber*| --- =====---- === ====== ====== 3 |.+-> | 2 | 6 | Emeryville |.+-> |Ber.+-> |Oak*| --- =====---- === ====== ====== ------ ------ 4 |.+-> | 5 | Berkeley |.+-> |Alb.+-> |Ken.+-> |Eme.+-> |Oak*| --- =====---- === ====== ====== ====== ------ 5 |.+-> | 2 | 6 | Oakland |.+-> |Eme.+-> |Ber.+-> |Pie*| --- =====---- === ====== ------ ------ 6 |.+-> | 3 | Piedmont |.+-> |Oak*| --- ----- --- ------ The total memory used by all the lists is Theta(|V| + |E|). If your vertices have consecutive integer names, you can declare an array of lists, and find any vertex's list in O(1) time. If your vertices have names like "Albany," you can use a hash table to map names to lists. Each entry in the hash table uses a vertex name as a key, and a List object as the associated value. You can find the list for any label in O(1) average time. An adjacency list is more space- and time-efficient than an adjacency matrix for a sparse graph, but less efficient for a _complete_graph_. A complete graph is a graph having every possible edge; that is, for every vertex u and every vertex v, (u, v) is an edge of the graph. Graph Traversals ---------------- We'll look at two types of graph traversals, which can be used on either directed or undirected graphs to visit each vertex once. Depth-first search (DFS) starts at an arbitrary vertex and searches a graph as "deeply" as possible as early as possible. For example, if your graph is an undirected tree, DFS performs a preorder (or if you prefer, postorder) tree traversal. Breadth-first search (BFS) starts by visiting an arbitrary vertex, then visits all vertices whose distance from the starting vertex is one, then all vertices whose distance from the starting vertex is two, and so on. If your graph is an undirected tree, BFS performs a level-order tree traversal. In a graph, unlike a tree, there may be several ways to get from one vertex to another. Therefore, each vertex has a boolean field called "visited" that tells us if we have visited the vertex before, so we don't visit it twice. When we say we are "marking a vertex visited", we are setting its "visited" field to true. Assume that we are traversing a strongly connected graph, thus there is a path from the starting vertex to every other vertex. When DFS visits a vertex u, it checks every vertex v that can be reached by some edge (u, v). If v has not yet been visited, DFS visits it recursively. public class Graph { // Before calling dfs(), set every "visited" flag to false; takes O(|V|) time public void dfs(Vertex u) { u.visit(); // Do some unspecified thing to u u.visited = true; // Mark the vertex u visited for (each vertex v such that (u, v) is an edge in E) { if (!v.visited) { dfs(v); } } } } In this DFS pseudocode, a "visit()" method is defined that performs some action on a specified vertex. For instance, if we want to count the total population of the city graph above, "visit()" might add the population of the visited city to the grand total. The order in which cities are visited depends partly on their order in the adjacency lists. The sequence of figures below shows the behavior of DFS on our street map, starting at vertex 1. A "V" is currently visited; an "x" is marked visited; a "*" is a vertex which we try to visit but discover has already been visited. V<-2<-3 x<-2<-3 x<-2<-3 x<-V<-3 *<-V<-3 x<-x<-3 x<-x<-V x<-*<-V x<-x<-V | ^ ^ | ^ ^ | ^ ^ | ^ ^ | ^ ^ | ^ ^ | ^ ^ | ^ ^ | ^ ^ v | v v | v v | v v | v v | v v | v v | v v | v v | v 4->5->6 V->5->6 x->V->6 x->x->6 x->x->6 x->x->V x->x->x x->x->x x->x->* DFS runs in O(|V| + |E|) time if you use an adjacency list; O(|V|^2) time if you use an adjacency matrix. Hence, an adjacency list is asymptotically faster if the graph is sparse. What's an application of DFS? Suppose you want to determine whether there is a path from a vertex u to another vertex v. Just do DFS from u, and see if v gets visited. (If not, you can't there from here.) More on the Running Time of DFS ------------------------------- Why does DFS on an adjacency list run in O(|V| + |E|) time? The O(|V|) component comes up solely because we have to initialize all the "visited" flags to false (or at least construct an array of flags) before we start. The O(|E|) component is trickier. Take a look at the "for" loop of the dfs() pseudocode above. How many times does it iterate? If the vertex u has outdegree d(u), then the loop iterates d(u) times. Different vertices have different degrees. Let the total degree D be the sum of the outdegrees of all the vertices in V. D = sum d(v). v in V A call to dfs(u) takes O(d(u)) time, NOT counting the time for the recursive calls it makes to dfs(). A depth-first search never calls dfs() more than once on the same vertex, so the total running time of all the calls to dfs() is in O(D). How large is D? - If G is a directed graph, then D = |E|, the number of edges. - If G is an undirected graph with no self-edges, then D = 2|E|, because each edge offers a path out of two vertices. - If G is an undirected graph with one or more self-edges, then D < 2|E|. In all three cases, the running time of depth-first search is in O(|E|), NOT counting the time required to initialize the "visited" flags. Running Time of BFS ------------------- Breadth-first search (BFS) is a little more complicated than depth-first search, because it's not naturally recursive. We use a queue so that vertices are visited in order according to their distance from the starting vertex. public void bfs(Vertex u) { for (each vertex v in V) { // O(|V|) time v.visited = false; } u.visit(null); // Do some unspecified thing to u u.visited = true; // Mark the vertex u visited q = new Queue(); // New queue... q.enqueue(u); // ...initially containing u while (q is not empty) { // With adjacency list, O(|E|) time v = q.dequeue(); for (each vertex w such that (v, w) is an edge in E) { if (!w.visited) { w.visit(v); // Do some unspecified thing to w w.visited = true; // Mark the vertex w visited q.enqueue(w); } } } public class Vertex { } protected Vertex parent; protected int depth; Notice that when we visit a vertex, protected boolean visited; we pass the edge's origin vertex as a parameter. This allows us to public void visit(Vertex origin) { do a computation such as finding this.parent = origin; the distance of the vertex from if (origin == null) { the starting vertex, or finding this.depth = 0; the shortest path between them. } else { The visit() method at right this.depth = origin.depth + 1; accomplishes both these tasks. } } } When an edge (v, w) is traversed to visit a Vertex w, the depth of w is set to the depth of v plus one, and v is set to become the _parent_ of w. The sequence of figures below shows BFS running on the city adjacency graph (Albany, Kensington, Emeryville, Berkeley, Oakland, Piedmont) from last lecture, starting from Albany. A "V" is currently visited; a digit shows the depth of a vertex that is marked visited; a "*" is a vertex which we try to visit but discover has already been visited. Underneath each figure of the graph, I depict the queue and the current value of the variable "v" in bfs(). V-K 0-V 0-1 *-1 0-1 *-1 0-* 0-1 0-1 0-1 0-1 0-1 0-1 0-1 0-1 0-1 \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| E-B E-B E-V E-1 E-* E-1 E-1 V-1 2-1 2-* 2-1 *-1 2-* 2-1 2-1 2-1 |/ |/ |/ |/ |/ |/ |/ |/ |/ |/ |/ |/ |/ |/ |/ |/ O-P O-P O-P O-P O-P O-P O-P O-P V-P 2-P *-P 2-P 2-P 2-V *-3 2-3 === === === === === === === === === === === === === === === === A K KB B B E EO O O P === === === === === === === === === === === === === === === === v=A v=A v=K v=K v=B v=B v=B v=B v=E v=E v=O v=O v=O v=P After we finish, we can find the shortest path from any vertex to the 0<--1 starting vertex by following the parent pointers (right). These ^ pointers form a tree rooted at the starting vertex. Note that they \ point in the direction _opposite_ the search direction that got us there. \ 2-->1 Why does this work? The starting vertex is enqueued first, then all the ^ vertices at a distance of 1 from the start, then all the vertices at a / distance of 2, and so on. Why? When the starting vertex is dequeued, / all the vertices at a distance of 1 are enqueued, but no other vertex 2<--3 is. When the depth-1 vertices are dequeued and processed, all the vertices at a distance of 2 are enqueued, because every vertex at a distance of 2 must be reachable by a single edge from some vertex at a distance of 1. No other vertex is enqueued, because every vertex at a distance less than 2 has been marked, and every vertex at a distance greater than 2 is not reachable by a single edge from some vertex at a distance of 1. Recommendation: pull out a piece of paper, draw a graph and a program stack, and simulate BFS, with you acting as the computer and executing bfs() line by line. You will understand it much better after taking the time to do this. BFS, like DFS, runs in O(|V| + |E|) time if you use an adjacency list; O(|V|^2) time if you use an adjacency matrix. Weighted Graphs --------------- A weighted graph is a graph in which each edge is labeled with a numerical weight. A weight might express the distance between two nodes, the cost of moving from one to the other, the resistance between two points in an electrical circuit, or many other things. In an adjacency matrix, each weight is stored in the matrix. Whereas an unweighted graph uses an array of booleans, a weighted graph uses an array of ints, doubles, or some other numerical type. Edges missing from the graph can be represented by a special number like Integer.MIN_VALUE, at the cost of declaring that number invalid as an edge weight. (If you want to permit every int to be a valid edge weight, you might use an additional array of booleans as well.) In an adjacency list, recall that each edge is represented by a listnode. Each listnode must be enlarged to include a weight, in addition to the reference to the destination vertex. (If you're using an array implementation of lists, you'll need two separate arrays: one for weights, and one for destinations.) There are two particularly common problems involving weighted graphs. One is the _shortest_path_problem_. Suppose a graph represents a highway map, and each road is labeled with the amount of time it takes to drive from one interchange to the next. What's the fastest way to drive from Berkeley to Los Angeles? A shortest path algorithm will tell us. The second problem is constructing a _minimum_spanning_tree_. Suppose that you're wiring a house for electricity. Each node of the graph represents an outlet, or the source of electricity. Every outlet needs to be connected to the source, but not necessarily directly--possibly routed via another outlet. The edges of the graph are labeled with the length of wire you'll need to connect one node to another. How do you connect all the nodes together with the shortest length of wire? You'll learn several of these algorithms if you take CS 170.