Image for post
Image for post

SQL Crash Course Ep 3: Merging Tables Using JOIN

Learn how to combine and pull data from multiple sources using SQL JOINs

In this episode of SQL Crash Course, we’re going to learn how to combine multiple tables in SQL, and maximize the efficacy of our SELECT statements using a JOIN clause. We’re going to learn about the 4 different types of JOIN methods and when we might want to use each one. We’re finally going to wrap up by creating a database and some tables, adding some data, and trying to implement some JOINs ourselves.

This is episode 3 of this SQL Crash Course series. If you’re unfamiliar with SQL concepts like pulling data, creating tables, adding and deleting rows, etc. I recommend briefly reviewing the previous two articles in the series.

SQL Crash Course Ep 1: What Is SQL? How Can I Start Selecting Data?

SQL Crash Course Ep 2: Creating Tables, Adding and Deleting Data

SQL Crash Course Ep 3: Merging Tables Using JOIN

  • Overview
  • What are JOINs? Why are they useful?
  • JOIN Clause Syntax
  • The 4 JOIN Methods
  • Create Some Dummy Data
  • Practice JOINs
  • Conclusion

What are JOINs? Why are they useful?

JOIN clauses are used in SELECT statements to merge rows from two (or more) tables using a shared column. Of course, the values in this shared column represent a portion of a full row in each table, and therefore, we use matching values in this related column as a “link” to pass relevant information between tables. This allows us to pull data from multiple sources, filter it, and return an extremely informative and specific set of data.

It is extremely rare to find a database containing only one table. Most databases are relational databases (RDBs), meaning tables are designed to recognize related information in other tables. Tables in relational databases are typically smaller and contain more specific information (fewer columns). This super-specific table configuration allows us to store data a lot more efficiently, cleanly, and with less risk of error.

For example

Let’s say we’re in the business of selling medical supplies, and we want some method of tracking our orders so that we can process and fill them. We’ve decided to store this information in a SQL database.

We could have one massive table — ORDERS — that stores customer_name, email, shipping_address, phone_number, product, quantity, price, etc.

OR we could have two tables: ORDERS — that stores customer_id, product, quantity and price, and CUSTOMERS — that stores customer_id, customer_name, email, shipping_address, and phone_number.

Think about trying to add a new order for a returning customer. In the first configuration, we would still have to know and add information about our customer each time they ordered (maybe they have to fill all this out on their order form every time they make a new purchase).

In the second configuration, we only need to record information on each customer once. If we have a returning customer, the only new information we need is related to their order. Everything else is accessable to us in the CUSTOMERS table via a JOIN clause.

We can see now that if we have relevant data stored in multiple locations, we’d need some way to combine information from these sources when needed. JOINs are useful because they allow us to perform this exact action.

JOIN Clause Syntax

Here is the basic formula for writing a JOIN clause:

SELECT ..., ..., ...   FROM table1    [JOIN] table2    ON table1.shared_col = table2.shared_col;
  • …, …, … — all the columns you wish to select (belonging to both tables)
  • [JOIN] — will be one of the 4 JOIN methods
  • table1, table2 — the two tables to be connected
  • ON xyz.123 = zyx.123 — this is how we declare our shared column relationship
  • shared_col — common column between the two tables, our “link” column

Here are the steps for including a JOIN in your query:

  • Figure out what type of JOIN you want to use (more on this in the next section).
  • Write a SELECT statement for all the columns (from both tables) you wish to access. Initially, we only select FROM the first table.
SELECT *
FROM clients # "Left table"
  • Add our JOIN clause followed by the second table we wish to access
   INNER JOIN countries    "Right table"
  • Specify “how” to connect rows from these two tables by using an ON clause.
   ON clients.Country_ID = countries.Country_ID;
  • If you want to do any filtering or aggregating (WHERE, GROUP BY, COALESCE, etc.) that would come last.

And that’s that, JOIN complete. Remember to always include your ON clause so SQL knows what columns to associate with each other. Now let’s take a look at our 4 options for JOIN methods.

The 4 JOIN Methods

When talking about relational tables, especially in JOIN statements, it’s best practice to refer to the first table (the one being added onto) as the “left table”, and the second table (the one being added) as the “right table”.

As mentioned earlier, there are actually 4 different JOIN methods. We know that all JOIN clauses are used to combine rows based on related (matching) information between tables, but what happens when an entry in the left table has no match in the right table?

Each JOIN method is designed to have different behavior in this situation, meaning we have ultimate control over what happens during this mismatch.

An INNER JOIN will only return connected rows if there is a matching value in both tables.

When a row from one table has a match in the other table (all of its values are present after JOIN), we call it a “connected row”.

# Example return from INNER JOINt1.order_ID |t1.customer_ID |t2.customer_name |t2.customer_ID...          1               '...'             1...          2               '...'             2...          3               '...'             3

We can see that we have only complete rows in our example return. customer_ID allows us to verify that values are matched correctly. Any rows containing customer_IDs with no match in the opposite table have been removed.

Venn diagrams make it easier for us to understand what data will be pulled in each scenario.

Image for post
Image for post

A LEFT JOIN will return every row from the left table (even if there is no matching value in the right table), but only rows from the right table if they have a corresponding value in the left table (will return NULLs in place).

# Example return from LEFT JOINt1.order_ID |t1.customer_ID |t2.customer_name |t2.customer_ID...          4               '...'             4...          5               NULL              NULL...          6               '...'             6

We can see that rows with matching customer_IDs are returned completely. However, rows from the left table without a matching customer_ID in the right table have NULL instead of values in the columns belonging to t2.

Image for post
Image for post

A RIGHT JOIN will return every row from the right table (even if there is no matching row in the left table), and will only return rows from the left table that have corresponding values in the right table (will return NULLs in place).

# Example return from RIGHT JOINt1.order_ID |t1.customer_ID |t2.customer_name |t2.customer_ID...          7               '...'             7NULL         NULL            '...'             8...          9               '...'             9

Again, we can see that rows with matching customer_IDs are returned completely. However, this time, rows from the right table without a matching customer_ID in the left table have NULL instead of values in the columns belonging to t1.

Image for post
Image for post

An OUTER JOIN (also called a FULL OUTER JOIN) is a combination of the LEFT JOIN and RIGHT JOIN. It returns every row from the left and right tables. When corresponding columns have matching values, rows are connected. When corresponding rows do not have a match in the other table, NULLs are used in their place

# Example return from OUTER JOINt1.order_ID |t1.customer_ID |t2.customer_name |t2.customer_ID...          10              '...'             10NULL         NULL            '...'             11...          12              NULL              NULL

Here we have one fully connected row (customer_ID 10) while customer_ID 11 and 12 contain NULLs in some of the columns indicating we did not get a successful match. We still see these rows in the return because remember OUTER JOIN doesn't care about matches, however, it would like to connect rows if possible.

Image for post
Image for post

Currently, SQLite doesn’t support OUTER JOIN or RIGHT JOIN but there are some simple workarounds you can use by creatively implementing LEFT JOINs. Here’s a great workaround for emulating an OUTER JOIN using 2 LEFT JOINs: https://www.sqlitetutorial.net/sqlite-full-outer-join/

Create Some Dummy Data

Now that we’re familiar with what a JOIN is in SQL, let’s try writing some of our own SELECT statements using some of the JOIN methods.

Before we can start merging tables and pulling data, though, we need some tables and data to merge and pull from.

Setting the stage: We are analysts at an international medical supplies manufacturer. Recently (we can’t imagine why), business has started to pick up and we are no longer able to track our clients, orders, etc. on paper. Our solution was to move all of our operations tracking over to a SQL database. This database will contain all the information we need about our products, clients, locations, and orders, but for storage purposes, information will located in multiple different tables.

For this tutorial, we’ll be using 4 tables: clients, countries, products and orders. Let’s create those now.

clients table:

CREATE TABLE clients(
generated_id INT PRIMARY KEY,
Client_Name TEXT NOT NULL,
Country_ID INT NOT NULL
);

countries table:

CREATE TABLE countries(
generated_ID INT PRIMARY KEY,
Country_Name TEXT NOT NULL,
Country_ID INT NOT NULL
);

products table:

CREATE TABLE products(
generated_ID INT PRIMARY KEY,
Product_Name TEXT NOT NULL,
Product_Code INT NOT NULL
);

orders table:

CREATE TABLE orders(
generated_ID INT PRIMARY KEY,
Client_Name TEXT NOT NULL,
Product_Code INT NOT NULL,
Quantity INT NOT NULL,
Order_Date TEXT NOT NULL
);

Now, let’s verify all these tables exist using the .tables command.

sqlite> .tablesclients    countries  orders     products   # EXPECTED OUTPUT

Now that our tables are set up, we need to create some data for them to hold.

First, let’s add information about the countries we will be dealing with.

INSERT INTO countries(Country_Name, Country_ID)
VALUES
("United States", 8001),
("Canada", 8002),
("United Kingdom", 8003),
("Japan", 8004);

Now that we have information about the countries we serve, we need some international clients.

INSERT INTO clients (Client_Name, Country_ID)
VALUES
("Paddy's Pub", 8001),
("Universal Exports", 8003),
("Bubbles' Shed n' Breakfast", 8002),
("Stay Puft", 8001),
("Han's Garage", 8004);

We manufacture and supply a random assortment of goods, we need to fill out our products table with information on these supplies.

INSERT INTO products(Product_Name, Product_Code)
VALUES
("N65 Mask (20ct)", 176123),
("Powder Free Latex Gloves (200ct)", 143185),
("Hand Sanitizer 8oz.", 735162),
("Facial Tissue (120ct per box)", 196109),
("Small Plastic Drum (5 gal)", 935671),
("Medium Plastic Drum (10 gal)", 935682),
("Large Plastic Drum (20 gal)", 935693);

Last, but certainly not least, we already have new orders coming in. These need to be added to the orders table.

INSERT INTO orders(Client_Name, Product_Code, Quantity, Order_Date)
VALUES
("Paddy's Pub", 735162, 20, DATE()),
("Stay Puft", 176123, 5, DATE()),
("Bubbles' Shed n' Breakfast", 935671, 5, DATE()),
("Bubbles' Shed n' Breakfast", 935693, 2, DATE()),
("Universal Exports", 143185, 100, DATE()),
("Han's Garage", 176123, 20, DATE()),
("Paddy's Pub", 176123, 1, DATE());

Practice JOINs

As an analyst, a huge part of our job is to serve requests from other departments for specific sets of data. This could be a set containing only a single row of data for a specific order, or it could be a set of all clients with two or more active orders. Whatever the request, it’s our job to deliver.

We have just been handed this list of requests:

  1. Sales would like a list of current orders that need filled. Return a dataset containing Client_Name, Product_Name, and Quantity for all current orders (any order still in our orders table is considered outstanding).
  2. Marketing would like to send holiday cards to our current customers. Return a dataset containing the names (Client_Name) of all of our current customers and their home countries (Country_Name)
  3. Due to a manufacturing error, all shipments of “Large Plastic Drums” will be delayed. We need to notify our customers of this delay, so our Sales Department has asked for a list of every client (Client_Name) that has a current outstanding order for large plastic drums.
  4. *Challenge*: Our Shipping Department has been optimizing how they make deliveries. They would now like to ship orders out in groups by country. Return a list of countries that currently have unfilled orders (Country_Name) and the number of orders that need to be delivered to that country. (note: you’ll need to use more than one JOIN, and a couple other built-in functions)

Let’s work through these requests together. (HINT: they all require a JOIN )

The columns we need are Client_Name, Product_Name and Quantity. Remember, these columns exist in multiple locations, it’s up to us to figure out how to best combine tables to serve this request.

Client_Name and Quantity exist in orders, but we need access to Product_Code which is in products. To do this, we can combine products and orders via our shared column — Product_Code — so that we have access to this information. Since we don’t have any other conditions to meet, we don’t need a WHERE clause.

SELECT Client_Name, Product_Name, Quantity
FROM orders o
INNER JOIN products p
ON o.Product_Code = p.Product_Code;

Output (we should see Country_Name, Product_Name and Quantity for all 7 orders)

Paddy's Pub|Hand Sanitizer 8oz.|20
Stay Puft|N65 Mask (20ct)|5
Bubbles' Shed n' Breakfast|Small Plastic Drum (5 gal)|5
Bubbles' Shed n' Breakfast|Large Plastic Drum (20 gal)|2
Universal Exports|Powder Free Latex Gloves (200ct)|100
Han's Garage|N65 Mask (20ct)|20
Paddy's Pub|N65 Mask (20ct)|1

The only two columns we need are Client_Name and Country_Name. Of course, these are in two different tables, so we have to use a JOIN. The shared column here is Country_ID, so that’s what we use in our ON clause. Again, there is no additional filtering required so we end our statement after JOIN…ON.

SELECT Client_Name, Country_Name
FROM clients cl
INNER JOIN countries co
ON cl.Country_ID = co.Country_ID;

Output (we should see all 5 Client_Names and their corresponding Country_Name)

Paddy's Pub|United States
Universal Exports|United Kingdom
Bubbles' Shed n' Breakfast|Canada
Stay Puft|United States
Han's Garage|Japan

The only column we need is Client_Name, however, we need to filter our return based on values in our orders and products tables. In order to filter for “Large Plastic Drums”, we’ll need to know Product_Name (found in products). This means we’ll need a connection between products and orders again. Also, this time we will need a WHERE clause.

SELECT Client_Name
FROM orders o
INNER JOIN products p
ON o.Product_Code = p.Product_Code
WHERE Product_Name = "Large Plastic Drum (20 gal)";

Output (should only return 1 client: “Bubbles’ Shed n’ Breakfast”)

Bubbles’ Shed n’ Breakfast

This one is tricky. The only column we need to return is Country_Name, but we also need to return a number representing the count of orders for each country. This value isn’t given to us, we have to generate it using the COUNT() method. We also have to filter based on values in 3 different tables (countries, clients and orders), meaning, we’ll need 2 JOIN clauses. Finally, we’ll need some way to tell SQL that we need a count value for each Country_Name. This can be accomplished using a GROUP BY clause.

SELECT Country_Name, COUNT(o.Client_Name)
FROM countries co
INNER JOIN clients cl
ON co.Country_ID = cl.Country_ID
INNER JOIN orders o
ON cl.Client_Name = o.Client_Name
WHERE co.Country_ID = cl.Country_ID
GROUP BY Country_Name;

Output (we should have only 4 countries, with a count representing the actual number of orders belonging to clients of each country)

Canada|2
Japan|1
United Kingdom|1
United States|3

Conclusion

Thanks for tuning in for this episode on SQL Crash Course! To summarize what we just covered, today we learned what a JOIN is in SQL. We learned that there are 4 different kinds of JOINs, how they differ, that mismatched values are handled differently depending on the type of JOIN we use. We then got some hands on experience serving data requests using JOINs, and even learned about some more complex querying methods like how to perform a multi-table JOIN, COUNT(), and the GROUP BY clause.

Join us (no pun intended, leave me alone) next week for SQL Crash Course Ep 4 where we dive deep into SQL’s massive library of built-in functions and methods. We’ll be covering topics like aggregate, window and control flow functions, and more!

Written by

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store