Executing distributed joins more efficiently with key-value storage

I’ve been thinking about how to execute a distributed join when not all the data is on the same server. For a join to work, some data movement has to happen, and that movement can be planned for and scheduled.

Essentially, the join keys should be shared across the network before the join happens. It’s the join keys that need to be synced up. The materialized data can come later.

For example:

select products.price, people.people_name,
    items.search from items
    inner join people on people.id = items.people
    inner join products on items.search = products.name

The only keys that need to be synced are people.id, items.people, items.search and products.name. These are the join keys. If they fit on one machine, then the join can take place in memory. To find them efficiently, we distribute the scan work across the servers and download all the join keys in parallel.

If you have 100 million products, 1 million people and 10 million items, you’ll have to copy 100 million + 1 million + 10 million join keys before the join can begin. We join [people.id = items.people] then [items.search = products.name] in pairs, so we might be able to copy the keys for one join simultaneously with the transfer for the next join. In fact, while looping through the bigger set of items to do the hash join, we no longer need the entries we have looped past, because each is just a join key.
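The key-only hash join described above can be sketched as follows. This is a minimal illustration, not the author's implementation; the table names and row identifiers are made up. We hold the smaller side's join keys in a hash table, then stream the bigger side past it, so each probed entry can be dropped once we have looped past it:

```python
# Minimal sketch: a streaming hash join over join keys only.
# Row payloads would be fetched later; here we only match key pairs.

def hash_join_keys(small_side, large_side):
    """Both sides are iterables of (join_key, row_id) pairs."""
    # Build phase: hold the smaller side's keys in memory.
    table = {}
    for key, row_id in small_side:
        table.setdefault(key, []).append(row_id)
    # Probe phase: stream the bigger side; each probed pair can be
    # discarded as soon as we have looped past it.
    for key, row_id in large_side:
        for match in table.get(key, []):
            yield (match, row_id)

# Hypothetical data: people keyed by people.id, items keyed by items.people.
people = [(1, "people.1"), (2, "people.2")]
items = [(2, "items.7"), (2, "items.8"), (3, "items.9")]
print(list(hash_join_keys(people, items)))
# [('people.2', 'items.7'), ('people.2', 'items.8')]
```

The build side should be the smaller collection, which is why row count information (discussed later) matters.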

Is there a way to distribute the join? What if the number of keys that need to be transferred is larger than system memory?

Indexing the join

Can we spend some disk space to index a join, so that non-ad-hoc queries are fast because they can use a join index?

We have rows on server A and server B. If they were together on the same machine, we could spot a join condition between them. So on insert, we query the other servers to see if they have matching data; if they do, we upload something to represent the join.

Ideally, at join query time we just want to query each server once to get a list of matched joins. The problem occurs when the joined data is split across two servers. We need each pair to be on the same server, though it doesn’t matter which one. We can arrange for matching join key data to always be available on any given server.

First we tell the server about the join, so it knows it has to do something special when these particular keys match.

create join on items
    inner join people on people.id = items.people
    inner join products on items.search = products.name

At insert time into items, people or products, we look up the matching join values on every server (a select of the join value where the join key matches) and insert the corresponding join key on those servers. For the above example, two join keys need to be inserted on every server per insert into items. This way the join can be run independently on each server and will always return the full set of join results. We pick a random server to do the initial join; they should all return the same items. The missing data is then fetched in parallel. Writes become more complicated, because we have to update all the join keys.
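As a toy illustration of this insert-time replication (all names hypothetical, with an in-memory set standing in for each server's store), one insert into items pushes its two join keys to every server in the cluster:

```python
# Toy simulation: every server keeps a set of join keys, and an insert
# into `items` replicates its two join keys (items.people and
# items.search) to every server so each can join independently.

class Server:
    def __init__(self):
        self.join_keys = set()

cluster = [Server() for _ in range(3)]

def insert_item(people_id, search):
    # Two join keys must be inserted on every server per items insert.
    for server in cluster:
        server.join_keys.add(("items.people", people_id))
        server.join_keys.add(("items.search", search))

insert_item(2, "Spanner")
# Every server can now match this item in a join.
assert all(("items.people", 2) in s.join_keys for s in cluster)
assert all(("items.search", "Spanner") in s.join_keys for s in cluster)
```

This is where the write amplification comes from: one logical insert fans out to every server in the cluster.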

Imagine we have the following Join:

select products.price, people.people_name,
    items.search from items
    inner join people on people.id = items.people
    inner join products on items.search = products.name

If we want this join to work on multiple decentralised servers, we have to synchronize the join keys with the other servers.

Assuming we have already inserted people and products, when we insert into items (‘Spanner’, 2), we need:

  • for each server, we need to insert an items row placeholder so joins can happen on that server
  • for each server, we need to insert a people place holder
  • for each server, we need to insert a products placeholder

The outcome is that any server can execute a full join because every server has every join key. But some data in the join statement is missing; specifically, the SELECT parameters products.price and people.people_name are missing. So we fetch the missing data by doing a second join with the results of the first join. This time we join on people_name and price, and each server contributes its values for these fields.

Writing a horizontally scalable decentralized SQL database

I have written a simple SQL database and a distributed hash table. Now is the time to combine them into one product. This post is me walking through how it works.

We create a table row by storing one key-value per column in the format R.table_name.row_number.field_name (R.people.3.age). For efficiency, we also store column-based indexing in the format C.table_name.field_name.row_number (C.people.age.3); this keeps a column’s data close together in memory, so when it goes to the CPU we get nearby entries on the cache line too. A key belongs to a server based on the consistent hash of its partition key. All columns for a row are stored on the same machine.
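A minimal sketch of this key layout, using an ordinary dict as a stand-in for the distributed store (the example row is made up):

```python
# One key-value per column (row-oriented), plus a column-oriented
# index key for the same datum, as described above.

def row_key(table, row, field):
    return f"R.{table}.{row}.{field}"

def column_key(table, field, row):
    return f"C.{table}.{field}.{row}"

store = {}  # stand-in for the distributed hash table

def insert(table, row, values):
    for field, value in values.items():
        store[row_key(table, row, field)] = value
        store[column_key(table, field, row)] = value

insert("people", 3, {"age": 31, "people_name": "Sam"})
print(store["R.people.3.age"])   # 31
print(store["C.people.age.3"])   # 31
```

An update would delete both keys for the old value and re-insert them with the new one, as described below.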

To do a query, we aggregate these key-values into in-memory objects which become rows, so that we can run a hash join against them. We also have to do this efficiently: we don’t want to materialize the entire database into memory before we can run queries against it. We use a streaming approach whereby we only construct one item in memory at a time; everything works in lockstep with this in-memory item.

We run the majority of the SQL statement where the data resides, which should be in memory, and aggregate the results from each server, since each server stores a different set of rows depending on the consistent hash.

To do an update, we have to delete all the keyvalues associated with the previous value and then repopulate the keys with the updated value.

Joins across machines are fairly ugly. Some special-case logic has to happen, and materialized views are transferred to one server for the join to take place. This works. I would also like the option of doing a join on a single partition, which is logically consistent when all the data is on one server.


As long as data is distributed across a cluster of nodes, the SQL should execute efficiently because searches are done in parallel. We use tries to make prefix searches efficient. For example, to run a select * from people WHERE age = 31 query, we do a prefix query of S.people.age.31 to find keys of the form S.tablename.column.columnvalue.rownumber. This works because we have already indexed the column’s data as key-values; we rely on this denormalisation to speed up searches. To join the WHERE-clause data back to the database, we hash join the collection of WHERE matches, which look like {“id”: 4}, against a prefix query of R.people, the full data row. As a result, we get all the items that match the WHERE clause.
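The prefix query itself can be sketched like this, assuming the keys are kept in sorted order (a trie would serve the same purpose): bisect finds the start of the prefix range, and the row number is peeled off the end of each matching key. The keys here are illustrative.

```python
import bisect

def prefix_scan(sorted_keys, prefix):
    """Return all keys beginning with `prefix` from a sorted list."""
    start = bisect.bisect_left(sorted_keys, prefix)
    out = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order: no further key can match
        out.append(key)
    return out

keys = sorted([
    "S.people.age.31.3",
    "S.people.age.31.9",
    "S.people.age.40.4",
])
matches = prefix_scan(keys, "S.people.age.31.")
row_numbers = [k.rsplit(".", 1)[1] for k in matches]
print(row_numbers)  # ['3', '9']
```

Those row numbers are then joined back against a prefix query of R.people to materialize the matching rows.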

Speeding up

Row count information. We currently don’t have row counts, so we can’t choose to loop over the smaller collection when doing the hash join. This is because the data is currently streamed: we don’t know how large the collections are until we loop over them.

Fabric unlimited programs – distance from CPU management

Storage essentially caches what is to be executed by the CPU, and there are orders-of-magnitude speed differences between storage types: hard drive, SSD, NVM, RAM, L2 and L1 cache, etc.

We use B-trees to efficiently fetch data from disk, and hash tables in RAM for in-memory databases. Why does the access mechanism have to be different for each? We still write SQL for B-trees and key-value access for RAM.

  • Why do I have to worry about tape length? (Disk space, memory space)
  • Why do I have to write code to explicitly stream data?
  • Why not the same retrieval mechanism for both RAM and disk?
  • I should be able to process big data on a single underpowered machine and have the shifting around of data handled behind the scenes automatically. Organise the data into batches.
  • While I’m processing data, I should be loading/moving data for the next batch of operations.
  • Why can’t the computer maintain what is kept in memory for me so processing is always efficient?
  • Having to write fsync manually is a recipe for complexity.
  • Why can’t computers arrange data layout so data always resides in the caches?
  • When serving a web request, it’s mostly the explosion in relations that I have to solve. Postgres is limited by RAM to materialize tuples. There’s a lot of data in the system. I get too many results back. I need to paginate.
  • Can I handle every request from RAM without hitting disk? How do I ensure that RAM is kept up-to-date? I need an advanced cache-update system on writes.
  • Why do I have to write custom code to serialize a hash table?
  • Can I write naive programs that automatically scale to extreme levels? There must be a canonical representation of “what I want to happen”. And then doing it in a naive way or an efficient way. Loading a large JSON file should use a streaming JSON parser for example.
  • Mark data as being persistent and never have to worry about changes not being persisted.

Dream desktop

My computer should try to benefit my life as much as possible. Modern desktops don’t even try. They don’t learn about me or give me advice. I want to give my computer my demographics, my location, my salary, and have it calculate things about me. We have all this untapped computing resource, but most people’s computers mostly sit idle.

Here are some crazy ideas that desktop computers could do.

  • Crazy remote indexes – We’ve got spare resources on an average desktop. Synchronize my files with a remote server, then build some crazy indexes on the files for all sorts of combinations. Index my files’ contents for content-based searches. It doesn’t matter if the indexing is expensive, because it’s on a different server.
  • Arbitrary Correlations It’s easy to arbitrarily correlate behaviours of humans on a computer. We can track what times the user opens certain programs and then open them automatically at the correlated time.
  • Information age worker Download various data sets to the desktop and allow users to run queries against the data. Would be nice if you could import Wikidata data this way.
  • Digital Shop Provide tools to view or start an online shop, including a product search engine.
  • Overlay network My desktop should form an overlay network with all my phones and servers so that I can share services privately. It should connect as part of login to the computer. Should be like Zerotier, but with automatic login and connection.
  • Social P2P network My desktop should connect to a federated social network, a bit like twitter where you can share things. I can create mockups for an app in my desktop environment and share it.
  • Program cross-referencer We often run multiple applications in parallel and even arrange them so we can see both at the same time. What if you could write queries against open applications to combine data across two of them? Or write simple loops over spreadsheet data and interact with the browser through an API? I should be able to write joins across programs visually.
  • Display Cone It would be nice if one could control the window manager and compositing with a text API, so one could convert a window into a custom geometry and display that, or create dashboards of existing programs. This would be useful for information radiators without having to run Xmonad or another tiling window manager.
  • Payments integration Support payments and the listing of digital services in an online marketplace. We need to democratize the act of taking money and exchanging it. Taking payments shouldn’t be the purview of large corporations only; anybody should be able to take a payment.
  • Website management Run a web site with your desktop computer. Desktop acts like a VPS, with dynamic IP hosting, web server and database.
  • Object maker Describe the properties an object should have and have them stored in a collection. Write queries to retrieve them, sort them. Drag and drop text boxes, sliders etc to create a GUI.
  • Live tree and rich GUI editor Every window, widget, dialogue and GUI on the screen forms part of a global tree data structure which can be used to interrogate what is on the screen and automate behaviour. Think of it as a global document object model (DOM) for the desktop, though it is actually a scenegraph, so every widget is accounted for. Someone can copy and paste a tree branch and create an identical but separate GUI from the root, then modify it with drag and drop.
  • Life strategy quiz Ask where you’re sleeping tonight and what you’re doing for food, with the ability to order food via the system or find accommodation with a search. When the user answers that they have no food or nowhere to sleep, dispense information to counteract homelessness and other social problems.
  • CPU manuals The desktop should come with documentation about CPU instructions, a compiler, and documentation about the desktop environment. It should be possible to create apps from an installation with minimal expertise.
  • API suite The desktop should have an installable set of pluggable APIs that are really easy to use from different places: from programming languages, over the network and from an executor GUI. I should be able to create an API invocation and chain them together with a GUI.
  • Event maker The ability to view events that get fired by the desktop environment, from the network stack, from arbitrary applications and add behaviour to events. A way to hook into events over the network, so you can ‘subscribe’ to events on another machine.
  • Log aggregation Errors and error dialogs are automatically synced to the cloud, where they turn into community issues that other users can comment on.
  • Errors become things Errors become a thing in my desktop. They appear in the system tray, don’t go away so easily and can be re-tried easily. They appear in an error viewer. You can then try solutions against the error to see if one works; if it does, the error goes away. You can see comments about the error and try known solutions to it. Known solutions are gathered online.
  • Social desktop There should be a chat room associated with every view. If you’re copying files, you can join the copy files chat room.
  • Shared desktop sessions It should be possible for multiple users to join a desktop session. The session is not single-player, reserved for one person; multiple people can join the same session.
  • Cloud designer Advertise cloud features to me and let me use cloud features in my home network such as S3.

Advertising features to me

My desktop computer has installed various services and APIs. My desktop should advertise what features are available to me.

Horizontally scaling a query

I have a simple distributed hash database at hash-db/server.py at master · samsquire/hash-db (github.com)

The data re-balances itself across nodes as they join. When a fresh node joins, the server sends that node’s data to the new server.

One thing I’m not sure how to do is how to scale the queries. I have queries such as sort key begins with, sort key between and partition key and sort key both between. There’s probably a way to send these queries to the cluster to compute in parallel. Currently they’re only executed by the server node but I’ve taken steps to move begins_with query execution to client nodes.

One of the data structures I’m using for between queries is a binary search tree and another is a prefix tree (trie). It would be nice if we could cut the tree in half and send it to two computers to walk in parallel. Perhaps it would be faster to index the list of sorted values with a tree, then each branch refers to a sublist of 32 or so items. Then in parallel, we loop over those sublists and find values that are between the query values. It’s really wasteful to do a full scan when you have lots of records.

We can do a full table scan in parallel. We maintain a sorted list of data, then partition it, remembering the start value and end value of each partition. To parallelise, we synchronize the index with multiple machines and assign the partitions to them; each machine does a scan of >= start_value and <= end_value. With some checks on partition start and end values, we can skip partitions that obviously don’t need traversal.
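The partition-skipping scan might look like this (the partition size and data are illustrative, and a single process stands in for the cluster); in a real deployment each surviving partition would be scanned by a different machine:

```python
# Sketch: partition a sorted list, record each partition's start/end
# value, and skip partitions that cannot contain values in the range.

def partitions(sorted_values, size):
    for i in range(0, len(sorted_values), size):
        chunk = sorted_values[i:i + size]
        yield chunk[0], chunk[-1], chunk

def range_scan(sorted_values, low, high, size=32):
    out = []
    for start, end, chunk in partitions(sorted_values, size):
        if end < low or start > high:
            continue  # this partition obviously needs no traversal
        out.extend(v for v in chunk if low <= v <= high)
    return out

print(range_scan(list(range(100)), 10, 20, size=8))
# [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
```

Only the partitions whose [start, end] range overlaps the query range are touched, which is the saving over a full scan.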

House prices unaffordable pandemic

House prices in the developed world are overpriced. Do not buy.

Do you really want to be working 25 years to afford a house? Houses used to be really cheap. But now you’re working more and more as a proportion of your life just to pay for a mortgage.

Why do it to yourself?

Scientific taxation

When society has a goal, it sets up various instruments to meet that goal, such as tax law, exemptions and funding. But it doesn’t scientifically collect evidence that the goal is being met. The spend on combatting poverty is massive, but the research verifying the results is pretty much non-existent. We’re terrible at verifying that our spending and tax exemptions are doing what they claim to do. As a result, companies pay little tax due to exemptions that were made to encourage certain behaviours, and nobody puts two and two together to work out whether we’re getting the behavioural outcome we planned for.

An occasional PhD study on poverty a few years after the taxation laws change is not good enough. We need to close the feedback cycle between taxation and outcomes. There should be a department that studies the outcomes of tax exemptions, and the exemption amounts themselves should be tracked, as they represent decentralised spending on policy goals.

Government business incubator

People doing things is a good thing. I want there to be lots of businesses: restaurants, coffee shops, consulting firms, retail shops and internet businesses.

But starting a business is difficult. I propose a type of incubator that comes after age 18 that simply implements all the skeletal parts of a business. One can work for the skeletal business. People are staffed in roles of the nascent business and the business is separated from the incubator at some later date when it can survive on its own two feet.

Trickle up philosophy

We need to distribute money to the poorest in society and have it circulate around the low end economy. We need universal basic income. Society’s ills would dissipate as everybody has what they need.

Shelter should be free. Shelter is a human right. I don’t see why people should have to pay landlords for rent when landlords do nothing. Landlords are rent seekers and do not offer any useful service or function in an economy. They suck the life force out of economies.

If my shelter was free, it would free up £650 to be spent in the economy, that could give people jobs. It could mean that coffee shops could stay open longer. It could mean more restaurants. More people exchanging money, doing transactions, raising corporate and small business profits.

90% of the cache line is wasted or Instruction selection is only worth 10% of total performance

If data layout is the main contributor to application performance, why aren’t there any tools to help program data layout?

Code shouldn’t have such drastically different cache performance; code we have written that works should have maximum (reliable) performance regardless.

I’m thinking of automatic batching and automatic layout algorithms to place data in a way that is efficient for the code in question.

So you could write some code and have the layout transformed by the data layout compiler.

The layout and the code that accesses the data must be rewritten to accommodate the different styles of accessing batched data. The actual layout and code executed should be transparent to the programmer. It would make debugging harder but would improve performance.

Arrays of Structures and Structures of Arrays should be hidden from the programmer, with the code written to be as ergonomic as possible. The transformation should happen behind the scenes.
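A tiny illustration of the two layouts such a compiler would move between; Python doesn’t expose the cache effects, but the shape of the transformation is the same. Summing one field in the SoA layout touches one contiguous array instead of every record:

```python
# Array of Structures: each element carries every field.
aos = [{"x": i, "y": i * 2} for i in range(5)]
total_aos = sum(p["x"] for p in aos)

# Structure of Arrays: one contiguous array per field. This is the
# layout the data layout compiler would generate behind the scenes.
soa = {"x": [i for i in range(5)],
       "y": [i * 2 for i in range(5)]}
total_soa = sum(soa["x"])

# Both layouts answer the query identically; only the memory access
# pattern differs.
assert total_aos == total_soa == 10
```

The programmer would write the AoS-style code; the compiler would rewrite both the layout and the accessing code into the SoA form.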