Analyze Bank Transaction Data using Graph (Part 3/3)

Ryota Yamanaka · Published in Oracle Developers · Jan 17, 2022 · 11 min read


The content of this article is based on earlier product versions. If you are using newer releases (Oracle Database 23c and Graph Server 23.2) and the latest SQL syntax, please refer to this article.

Photo by Eduardo Soares on Unsplash (edited)

In Part 1 and Part 2, we explained that we can use Oracle Database itself as a graph database to store property graph datasets and execute PGQL queries. Also, for seamless integration, the CREATE PROPERTY GRAPH statement provides a declarative mapping method to convert datasets from tables to graphs.

In this part of the series, I’ll show you how to utilize the graph defined in Part 2, using Graph Server in addition to Oracle Database. Graph Server consists of a high-performance in-memory graph analysis engine and a database integration component. Graph Server is available in all editions of Oracle Database without an additional license. We call this architecture that combines Graph Server with Oracle Database the “3-tier deployment” of Oracle Graph.

Load graphs into Graph Server

The first step is to log into the OS console where Graph Server and Client are installed. For example, if you are using the Docker container described in this article, log in to the Graph Server container as shown below. If you are in a cloud environment, use SSH to connect instead.

$ docker exec -it graph-server /bin/bash

Next, launch the Python client already installed (opg4py) to connect to Graph Server running on the same host.

$ opg4py -b https://localhost:7007 -u graphuser
password: (Welcome1)

>>> setloglevel("ROOT", "WARN")
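
The interactive shell is convenient for exploration, but the same kind of session can also be created from a standalone Python program. Here is a minimal sketch based on my reading of the opg4py API (the module path and signatures may differ slightly between releases), reusing the endpoint and credentials above.

# Connect to Graph Server programmatically instead of using the opg4py shell.
# The endpoint and credentials are the same ones used above.
import opg4py.graph_server as graph_server

instance = graph_server.get_instance("https://localhost:7007", "graphuser", "Welcome1")
session = instance.create_session("bank-txn-analysis")
analyst = session.create_analyst()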

In Part 2, we created a graph called GRAPH2 in the database (as a PG view over the table dataset), so we will load this graph. The variable name in Python does not need to be the same, but we'll use graph2 to minimize confusion.

>>> graph2 = session.read_graph_by_name("GRAPH2", "pg_view")

This operation reads the table dataset in the database, converts it into a graph according to the mapping defined, and expands it in the memory of the Graph Server.

>>> graph2
PgxGraph(name: GRAPH2, v: 180, e: 3100, directed: True, memory(Mb): 0)

Let’s try executing a PGQL query against this graph.

graph2.query_pgql("""
SELECT c.first_name, LABEL(e), a.acc_id
FROM MATCH (c:customer)-[e:owns]->(a:account)
WHERE c.cst_id = 10
""").print()

+--------------------------------+
| first_name | LABEL(e) | acc_id |
+--------------------------------+
| Laura      | OWNS     | 10     |
| Laura      | OWNS     | 90     |
+--------------------------------+

Now, you can execute PGQL queries on the Graph Server just as you can on the database.
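
Because the client is Python, the results do not have to be printed; they can also be pulled into the program for further processing. Below is a minimal sketch, assuming the PgqlResultSet returned by query_pgql() offers a to_pandas() conversion in this release.

# Fetch the query result into a pandas DataFrame instead of printing it.
# Assumes PgqlResultSet.to_pandas() is available in this pypgx release.
result = graph2.query_pgql("""
SELECT c.first_name, a.acc_id
FROM MATCH (c:customer)-[:owns]->(a:account)
WHERE c.cst_id = 10
""")
df = result.to_pandas()
print(df.head())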

Run path-finding queries

Although PGQL queries can still be executed in the database (by being rewritten into SQL queries), we can expect some queries to become faster when the graph is deployed in the memory of Graph Server. Path-finding is one example.

The following query uses the TOP K SHORTEST clause to retrieve the K shortest paths (in this case, 5) between two accounts (one with acc_id = 10 and the other with acc_id = 50). The condition COUNT(a) = COUNT(DISTINCT a) eliminates paths that go through the same account more than once.

graph2.query_pgql("""
SELECT ARRAY_AGG(a.acc_id) AS acc_ids
, ARRAY_AGG(t.txn_id) AS txn_ids
, ARRAY_AGG(t.amount) AS amounts
FROM MATCH TOP 5 SHORTEST ((a1) (-[t:transferred_to]->(a))+ (a2))
WHERE a1.acc_id = 10 AND a2.acc_id = 50
AND COUNT(a) = COUNT(DISTINCT a)
""").print()

+------------------------------------------+
| acc_ids  | txn_ids      | amounts        |
+------------------------------------------+
| [62, 50] | [1025, 6212] | [100.0, 900.0] |
| [62, 50] | [1025, 6227] | [100.0, 500.0] |
| [62, 50] | [1044, 6212] | [200.0, 900.0] |
| [62, 50] | [1044, 6227] | [200.0, 500.0] |
| [62, 50] | [1041, 6212] | [300.0, 900.0] |
+------------------------------------------+

You can also include a condition that the amount of all transfers must be greater than 500.

graph2.query_pgql("""
SELECT ARRAY_AGG(a.acc_id) AS acc_ids
, ARRAY_AGG(t.txn_id) AS txn_ids
, ARRAY_AGG(t.amount) AS amounts
FROM MATCH TOP 5 SHORTEST
((a1) (-[t:transferred_to]->(a) WHERE t.amount > 500)+ (a2))
WHERE a1.acc_id = 10 AND a2.acc_id = 50
AND COUNT(a) = COUNT(DISTINCT a)
""").print()

+-----------------------------------------------------------+
| acc_ids      | txn_ids            | amounts               |
+-----------------------------------------------------------+
| [49, 58, 50] | [1022, 4900, 5805] | [800.0, 700.0, 900.0] |
| [49, 58, 50] | [1018, 4900, 5805] | [800.0, 700.0, 900.0] |
| [49, 58, 50] | [1021, 4900, 5805] | [700.0, 700.0, 900.0] |
| [61, 62, 50] | [1007, 6111, 6212] | [700.0, 800.0, 900.0] |
| [61, 62, 50] | [1047, 6111, 6212] | [900.0, 800.0, 900.0] |
+-----------------------------------------------------------+

If the start and end nodes of the path are the same node (a1), we can find circular paths, that is, money that eventually flows back to the same account.

graph2.query_pgql("""
SELECT ARRAY_AGG(a.acc_id) AS acc_ids
, ARRAY_AGG(t.txn_id) AS txn_ids
, ARRAY_AGG(t.amount) AS amounts
FROM MATCH TOP 5 SHORTEST
((a1) (-[t:transferred_to]->(a) WHERE t.amount > 500)+ (a1))
WHERE a1.acc_id = 10
AND COUNT(a) = COUNT(DISTINCT a)
""").print()

+------------------------------------------+
| acc_ids  | txn_ids      | amounts        |
+------------------------------------------+
| <null>   | <null>       | <null>         |
| [25, 10] | [1014, 2531] | [900.0, 700.0] |
| [25, 10] | [1028, 2531] | [900.0, 700.0] |
| [49, 10] | [1022, 4902] | [800.0, 600.0] |
| [49, 10] | [1018, 4902] | [800.0, 600.0] |
+------------------------------------------+

Visualize paths

The built-in Graph Visualization tool can visualize graphs on the Graph Server. Let’s use this to visualize the paths found. Since graphs in Graph Server memory are associated with the user sessions, we will first check the session ID of this Python client session.

>>> session
PgxSession(id: bee563a9-7378-4ec2-a8a0-00dcdd69672e, name: OPGShell)

Use this session ID to log in to Graph Visualization.

Select GRAPH2 as the target graph and execute the previous path-finding query. The returned results are displayed in a table, since they cannot be rendered as a graph.

SELECT ARRAY_AGG(a.acc_id) AS acc_ids
, ARRAY_AGG(t.txn_id) AS txn_ids
, ARRAY_AGG(t.amount) AS amounts
FROM MATCH TOP 5 SHORTEST
((a1) (-[t:transferred_to]->(a) WHERE t.amount > 500)+ (a2))
WHERE a1.acc_id = 10 AND a2.acc_id = 50
AND COUNT(a) = COUNT(DISTINCT a)

Selecting one of the resulting paths, we can visualize it by using the list of transaction IDs. This series of queries and the logic between them would be implemented in the application layer in a real system.

SELECT *
FROM MATCH (a1)-[t]->(a2)
WHERE t.txn_id IN (1022, 4900, 5805)
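
In a Python application, the glue between these two queries might look like the following sketch. It reuses the to_pandas() conversion mentioned earlier and assumes the ARRAY_AGG column comes back as a Python list per row, keeping its lower-case alias.

# Sketch of the application-layer logic: take one path from the
# path-finding query and feed its transaction IDs into the second query.
paths = graph2.query_pgql("""
SELECT ARRAY_AGG(t.txn_id) AS txn_ids
FROM MATCH TOP 5 SHORTEST
  ((a1) (-[t:transferred_to]->(a) WHERE t.amount > 500)+ (a2))
WHERE a1.acc_id = 10 AND a2.acc_id = 50
AND COUNT(a) = COUNT(DISTINCT a)
""").to_pandas()

selected = paths["txn_ids"][0]                      # e.g. [1022, 4900, 5805]
in_list = ", ".join(str(txn_id) for txn_id in selected)

graph2.query_pgql(f"""
SELECT *
FROM MATCH (a1)-[t]->(a2)
WHERE t.txn_id IN ({in_list})
""").print()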

This visualization can also be done in a single query, shown below. The ONE ROW PER STEP clause retrieves the steps of the paths and maps them to the new variables v1, e, and v2.

SELECT v1, e, v2
FROM MATCH SHORTEST
((a1) (-[t:transferred_to]->(a) WHERE t.amount > 500)+ (a2))
ONE ROW PER STEP (v1, e, v2)
WHERE a1.acc_id = 10 AND a2.acc_id = 50
AND COUNT(a) = COUNT(DISTINCT a)

Run graph algorithms

Graph Server has many built-in graph algorithms, so let's first try the well-known PageRank algorithm. PageRank calculates the importance of each node in the graph based on both the number of incoming edges and the importance of their source nodes.

>>> analyst.pagerank(graph2)
VertexProperty(name: pagerank, type: double, graph: GRAPH2)

This algorithm stores the calculation result as the value of a new vertex property, pagerank, whose type is double.
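
The handle returned by the algorithm can also be looked up again later by name, mirroring the get_edge_property() call used for edge properties further below (assuming the analogous vertex-side method exists in this pypgx release).

# Retrieve the handle of the vertex property written by the algorithm.
# (Analogous to graph5.get_edge_property("cnt") used later in this article.)
pagerank = graph2.get_vertex_property("pagerank")
print(pagerank)  # VertexProperty(name: pagerank, type: double, graph: GRAPH2)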

Let's also look at degree centrality. This score is simply the total number of edges connected to each vertex. We will also calculate the numbers of incoming and outgoing edges separately.

analyst.degree_centrality(graph2)
analyst.in_degree_centrality(graph2)
analyst.out_degree_centrality(graph2)

The results can be easily retrieved with a PGQL query. The ability to run queries and algorithms from the same interface is an advantage of Graph Server.

graph2.query_pgql("""
SELECT
a.acc_id
, a.pagerank
, a.degree
, a.in_degree
, a.out_degree
FROM MATCH (a)
ORDER BY a.pagerank DESC
LIMIT 10
""").print()

+-----------------------------------------------------------------+
| acc_id | pagerank             | degree | in_degree | out_degree |
+-----------------------------------------------------------------+
| 41     | 0.019603664018674367 | 102    | 62        | 40         |
| 33     | 0.015104950170843338 | 108    | 58        | 50         |
| 5      | 0.014983680499273774 | 92     | 42        | 50         |
| 24     | 0.014966074159304933 | 89     | 49        | 40         |
| 36     | 0.014897989873184218 | 71     | 41        | 30         |
| 40     | 0.014860174576443065 | 72     | 32        | 40         |
| 23     | 0.014700205912993914 | 84     | 54        | 30         |
| 73     | 0.0141410094293568   | 77     | 47        | 30         |
| 30     | 0.013816617289907915 | 102    | 52        | 50         |
| 32     | 0.013339752300665857 | 42     | 32        | 10         |
+-----------------------------------------------------------------+

The new properties added by the algorithms can also be seen in Graph Visualization. For example, you can change the size of the vertices depending on the PageRank score.

If you look closely, you can see that some vertices are larger than others.

Transform graphs

In the previous section, we ran algorithms to find the importance of vertices, but we actually need to consider a few things about the target graph. This graph has customer vertices, account vertices, ownership edges, and money-transfer edges. However, if we want to measure the importance of accounts based on the transfer relationships alone, we should not include the customer vertices and ownership edges in the calculation. Also, there are often multiple transfer edges between two accounts, but some algorithms are more efficient, or produce more meaningful results, without such multi-edges. This section shows how to transform a graph using PGQL modification queries.

First, since the graph currently in memory (graph2) has new properties added by the algorithms, we load the graph from the database again to restore its initial state.

graph2 = session.read_graph_by_name("GRAPH2", "pg_view")

The following query deletes customer vertices. When one or both endpoint vertices are deleted, the “dangling” edges are also removed.

graph3 = graph2.clone_and_execute_pgql("""
DELETE c FROM MATCH (c:customer)
""")

From the graph size shown below, we can see that 80 vertices and 100 edges have been deleted.

>>> graph3
PgxGraph(name: sub-graph_xx, v: 100, e: 3000, directed: True, memory(Mb): 0)

Next, when there are multiple transfer edges between the same pair of accounts, we will combine them into one. Here, we want to keep track of how many transfers have been made between the accounts, so we add a new property, cnt. The values to be stored are integers, but we will set the type of this property to double, since the weighted PageRank algorithm we will run later takes a double edge property as its weight.

>>> graph3.create_edge_property("double", "cnt")
EdgeProperty(name: cnt, type: double, graph: sub-graph_xx)

At this point, the values of the new property cnt are all set to 0.

graph3.query_pgql("""
SELECT t.cnt, COUNT(t)
FROM MATCH ()-[t:transferred_to]->()
GROUP BY t.cnt
ORDER BY t.cnt ASC
""").print()

+----------------+
| cnt | COUNT(t) |
+----------------+
| 0.0 | 3000     |
+----------------+

The next query creates a new edge for each pair of accounts that has one or more transfer edges between them (taking direction into account). For each group, the number of original transfer edges is stored in the cnt property of the new edge.

graph4 = graph3.clone_and_execute_pgql("""
INSERT EDGE e BETWEEN a1 AND a2
LABELS ( transferred_to ) PROPERTIES ( e.cnt = COUNT(t) )
FROM MATCH (a1)-[t:transferred_to]->(a2)
GROUP BY a1, a2
""")

At this point, the edges with cnt = 0 are the original 3,000 transfer edges, and the others are the newly added edges.

graph4.query_pgql("""
SELECT t.cnt, COUNT(t)
FROM MATCH ()-[t:transferred_to]->()
GROUP BY t.cnt
ORDER BY t.cnt ASC
""").print()

+----------------+
| cnt | COUNT(t) |
+----------------+
| 0.0 | 3000     |
| 1.0 | 682      |
| 2.0 | 485      |
| 3.0 | 328      |
| 4.0 | 55       |
| 5.0 | 17       |
| 6.0 | 6        |
| 7.0 | 2        |
| 9.0 | 1        |
+----------------+

Delete the original edges with the condition cnt = 0.

graph5 = graph4.clone_and_execute_pgql("""
DELETE t
FROM MATCH ()-[t:transferred_to]->()
WHERE t.cnt = 0
""")

Only the newly added transfer edges remain.

graph5.query_pgql("""
SELECT t.cnt, COUNT(t)
FROM MATCH ()-[t:transferred_to]->()
GROUP BY t.cnt
ORDER BY t.cnt ASC
""").print()

+----------------+
| cnt | COUNT(t) |
+----------------+
| 1.0 | 682      |
| 2.0 | 485      |
| 3.0 | 328      |
| 4.0 | 55       |
| 5.0 | 17       |
| 6.0 | 6        |
| 7.0 | 2        |
| 9.0 | 1        |
+----------------+

Run graph algorithms, again

Now we will calculate the centrality scores with the transformed graph in the same way as above.

analyst.pagerank(graph5)
analyst.degree_centrality(graph5)
analyst.in_degree_centrality(graph5)
analyst.out_degree_centrality(graph5)

Get the result in PGQL. You should see different results from before.

graph5.query_pgql("""
SELECT
a.acc_id
, a.pagerank
, a.degree
, a.in_degree
, a.out_degree
FROM MATCH (a)
ORDER BY a.pagerank DESC
LIMIT 10
""").print()

+-----------------------------------------------------------------+
| acc_id | pagerank             | degree | in_degree | out_degree |
+-----------------------------------------------------------------+
| 41     | 0.015927856082418812 | 52     | 27        | 25         |
| 33     | 0.01572808354417436  | 55     | 28        | 27         |
| 73     | 0.01467580042697759  | 44     | 22        | 22         |
| 53     | 0.014544138405858852 | 45     | 21        | 24         |
| 59     | 0.014241045755115847 | 41     | 24        | 17         |
| 49     | 0.014199578954729926 | 29     | 21        | 8          |
| 47     | 0.014141568475804285 | 43     | 22        | 21         |
| 5      | 0.013433911533465633 | 46     | 21        | 25         |
| 68     | 0.012841252743896187 | 43     | 20        | 23         |
| 90     | 0.01278925469055179  | 38     | 22        | 16         |
+-----------------------------------------------------------------+

Finally, we will also try the weighted PageRank algorithm. Since each edge has a cnt property, we will use it as a weight, so that the higher the cnt, the more strongly the PageRank score propagates along that edge. We run the algorithm passing the edge property as the second argument.

cnt = graph5.get_edge_property("cnt")
analyst.weighted_pagerank(graph5, cnt)

We can see a difference in the rankings between the results from the two calculations.

graph5.query_pgql("""
SELECT
a.acc_id
, a.pagerank
, a.weighted_pagerank
FROM MATCH (a)
ORDER BY a.pagerank DESC
LIMIT 10
""").print()

+------------------------------------------------------+
| acc_id | pagerank             | weighted_pagerank    |
+------------------------------------------------------+
| 41     | 0.015927856082418812 | 0.02090853868414549  |
| 33     | 0.01572808354417436  | 0.016067973689548414 |
| 73     | 0.01467580042697759  | 0.015001994965949424 |
| 53     | 0.014544138405858852 | 0.011917160588132424 |
| 59     | 0.014241045755115847 | 0.01275290770633708  |
| 49     | 0.014199578954729926 | 0.013315017840049879 |
| 47     | 0.014141568475804285 | 0.01222283230220626  |
| 5      | 0.013433911533465633 | 0.016221144544670288 |
| 68     | 0.012841252743896187 | 0.011829108962126801 |
| 90     | 0.01278925469055179  | 0.011897751825643397 |
+------------------------------------------------------+

We can access the transformed graphs from Graph Visualization, so let’s visualize the vertex with the highest weighted PageRank score and the connected vertices. The thickness of the edges now represents the weight.

SELECT *
FROM MATCH (a)-[t]-()
WHERE a.weighted_pagerank > 0.020

This article showed that we can load graphs defined in the database into Graph Server and run path-finding queries and graph algorithms on them in memory. The unique design of the 3-tier deployment enables seamless integration of graphs defined in the database with high-performance in-memory graph analytics.

Finally, I showed how to use PGQL modification queries to transform graphs for analytics. For simple cases like this one, SQL aggregations in the database can also achieve the same transformation. Still, this graph transformation technique is sometimes very powerful, for example, when converting multi-hop paths into single edges. I will introduce such complex use cases in the future.
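
As a small illustration of that idea, here is a hedged sketch that collapses every two-hop transfer path into a single edge, following the same create-property-then-INSERT pattern used above. The label indirect_transfer and the property hops are hypothetical and not part of the original dataset.

# Illustrative only: collapse each two-hop transfer path into a single edge
# with the hypothetical label "indirect_transfer". The "hops" property is
# created first, following the same pattern as the "cnt" property above.
graph5.create_edge_property("double", "hops")
graph6 = graph5.clone_and_execute_pgql("""
INSERT EDGE e BETWEEN a1 AND a2
LABELS ( indirect_transfer ) PROPERTIES ( e.hops = 2 )
FROM MATCH (a1)-[:transferred_to]->(ax)-[:transferred_to]->(a2)
GROUP BY a1, a2
""")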

Please learn more about Oracle Graph from:
