Shipping multi-tenant SaaS using Postgres Row-Level Security

2022-07-26

●

15 min read

●

Miki Pokryvailo

Shipping multi-tenant SaaS using Postgres Row-Level Security

At Nile, we’re making it easier for companies to build world-class control planes for their infra SaaS products. Multi-tenancy is core to all SaaS products and especially those with control-plane architectures. At Nile, we’ve built multi-tenancy into our product from day one. If you are working on an infra SaaS product and need a multi-tenant control plane, you should talk to us.

From previous experience, we’re familiar with multiple multi-tenant SaaS architecture options. We decided to store everything in a single Postgres schema since it provides a balance of scalability, cost optimization, and flexibility. However, this requires serious investment in database authorization to guarantee that we never leak customer data.

Authorization in a multi-tenant db is something many companies have to deal with, and in previous companies, I saw authorization implemented in probably the most common way: appending WHERE user_id = $USER_ID to queries. This is also the way things started out at Nile, but as we added more features we noticed that we were forced to add many branching and repetitive WHEREs to our code. We needed a solution that would allow us to add features quickly and confidently, and using custom filters in every single query was error-prone and hard to evolve if our data model changed.

RLS code excerpt

One solution that I knew about was Postgres Row-Level Security (RLS), a db-level mechanism that filters rows based on a per-user basis. I expected it would allow us to iterate faster and dramatically reduce security risks. You can learn the basics with these two blogs that show how to build multi-tenant apps using Postgres RLS. As with most solutions, the blog version was easy to implement, but there was an especially long tail to ship to production.

In this blog post, I’ll talk about the alternatives we considered - both for multi-tenant architecture and for securing data access - why we chose RLS, and the various challenges we encountered and overcame while shipping it to production.

Existing multi-tenancy solutions

Schema-per-tenant and database-per-tenant

We considered both of these approaches but went with the single-schema approach for its minimal operational complexity, low cost, and ability to scale later on. I won’t go into detail about these approaches, as there are countless resources on the topic. Here are two resources I’ve found to be helpful:

Multi-tenant SaaS patterns - Azure SQL Database | Microsoft Docs
A great paper from Microsoft -Multi-Tenant Data Architecture

Single schema with dynamic WHERE queries

Pros

Easiest and most straightforward zero-to-one solution.
Transparent and easy to reason about.

Cons

Possibility of forgetting to add a filter to a query. Since queries are permissive by default, this is easy to miss and hard to detect without extensive testing. There are some solutions to this ( i.e: @Filter in Hibernate) but I find that ORMs make simple querying easier and complicated querying harder. At Nile, our authorization model is complicated enough that we didn’t want to rely on Hibernate for this.
Repetitive, ugly, and annoying to implement. Imagine you have 20 API endpoints that require authorization and 2 different types of roles, USER and ADMIN. The access controls for these two roles are different, so you might have to define 40 WHEREs across your codebase. This doesn't scale well when adding new roles or modifying existing ones across many API endpoints.

External authorization systems

Pros

Highly flexible
(Claim to be) scalable

Cons

$$$ cost, if managed. Operational cost, if self-hosted.
Unnecessary if the permissioning model isn’t particularly complicated. At Nile, so far it’s not.
External dependencies often make testing more difficult and reduce engineering velocity. The benefits have to outweigh these costs.
As a control plane, multi-tenancy is core to our product. We believe in building foundational capabilities in-house so that we can push the envelope rather than be constrained by external solutions.

What might a better solution look like?

After we chose to use a single multi-tenant schema, we were looking for a solution that would be cleaner and less error-prone than dynamic queries and lighter than an external authorization system.

In the rest of this blog post, I’ll lay out what I discovered about RLS in the few weeks I spent researching and implementing it at Nile, and how it solved our problem (at least for now) of building authorization with speed, confidence, and maintainable architecture.

A quick overview of RLS

The high-level process to set up RLS is:

Define your data model as usual, but include a tenant identifier in every table
Define RLS policies on your tables (i.e: “only return rows for the current tenant”)
Define a db user (i.e: app_user) with all the privileges your application will need to interact with the db, but without any superuser roles. In Postgres, this is necessary since superuser roles bypass all permission checks , including RLS (more on that later).

A simple org access control example

Imagine your API has an /orgs endpoint that should only return organizations the calling user is a member of. To achieve this via RLS, you’d define your tables, policies, and db user as such:

CREATE
  TABLE
    users(
      id SERIAL PRIMARY KEY
    );

CREATE
  TABLE
    orgs(
      id SERIAL PRIMARY KEY
    );

CREATE
  TABLE
    org_members(
      user INTEGER REFERENCES users NOT NULL,
      org INTEGER REFERENCES orgs NOT NULL
    );

-- ** RLS setup **
ALTER TABLE
  orgs ENABLE ROW LEVEL SECURITY;

-- Create a function, current_app_user(),
-- that returns the user to authorize against.
CREATE
  FUNCTION current_app_user() RETURNS INTEGER AS $$ SELECT
    NULLIF(
      current_setting(
        'app.current_app_user',
        TRUE
      ),
      ''
    )::INTEGER $$ LANGUAGE SQL SECURITY DEFINER;

CREATE
  POLICY org_member_policy ON
  orgs
    USING(
    EXISTS(
      SELECT
        1
      FROM
        org_members
      WHERE
        user = current_app_user()
        AND org = id
    )
  );

-- Create the db user that'll be used in your application.
CREATE
  USER app_user;

GRANT ALL PRIVILEGES ON
ALL TABLES IN SCHEMA public TO app_user;

GRANT ALL PRIVILEGES ON
ALL SEQUENCES IN SCHEMA public TO app_user;

The above RLS policy will only return true for organizations that the current user is a member of. Simple enough. Later on, we’ll see how things can get more complicated.

Note the current_app_user() function. In the traditional use case of direct db access, RLS works by defining policies on tables that filter rows based on the current db user. For a SaaS application, however, defining a new db user for each app user is clunky. For an application use case you can dynamically set and retrieve users using Postgres’ current_settings() function ( i.e: SET app.current_app_user = ‘usr_123’ and SELECT current_settings(‘app.current_app_user)).

What it looks like from request to response

Request to response diagram

Why we chose RLS

It fails by default - and therefore secure by default

The biggest benefit of RLS is that if you define a policy that’s too restrictive, or forget to define a policy, things just fail. Compared to dynamic queries where forgetting to add a WHERE will leak data, this is a big win for security. I didn’t appreciate this until I wrote some integration tests for access patterns (i.e: testing if a user can access orgs they’re a part of). Initially, all the tests failed, and for cases where users should have access tests only passed when I added the appropriate RLS policies.

RLS is, of course, not a silver bullet. Accidentally defining an overly permissive policy is hard to catch without extensive tests so it’s important to still be careful.

Defined once, applied everywhere

One of the main challenges with dynamic queries in single-schema multi-tenancy is that changes to tables often require touching many different queries. RLS solves this problem since policies are tied to tables and not queries. After modifying a table, all you need to do is to change its access policies, which will be applied to all queries.

Composability

With RLS, it’s easy to add more access rules as your multi-tenant data model evolves. According to the Postgres docs:

“When multiple policies apply to a given query, they are combined using either OR (for permissive policies, which are the default) or using AND (for restrictive policies).”

Since by default policies are combined with OR, this makes it super easy to define more policies as your access rules get more complex. This isn’t so straightforward with dynamic queries, where you might have to define your own logic for combining access rules. Or, as probably many of us have seen before, just create monster WHERE statements.

Separation of Concerns

Instead of mixing filters that are related to our application logic with filters that are related to the multi-tenant database design in the same WHERE clauses, we now have a clean separation:

Our application applies all the filters that are requested by users through APIs and other application logic.
RLS is responsible for filters that are required due to the multi-tenant database design.

Cases where RLS isn’t a great fit

Every technology has its tradeoffs and cases where you shouldn’t use it. Here are two cases where we think RLS isn’t a great fit:

If you need stronger isolation between tenants

RLS in a multi-tenant db isolates access to database rows, but all other database resources are still shared between tenants. It doesn’t help with limiting the disk space, CPU, or db cache used per tenant. If you need stronger isolation at the db level, you will need to look elsewhere.

If you have sophisticated access policies

As you will see in the next section, our current access policy is fairly simple - tenants are isolated from each other, and within a tenant, you have administrators with additional access. More mature access control policies such as RBAC/ABAC require their own schema design and can be more challenging to integrate with RLS and even more challenging to make performant.

We’ve recently started the design for the RBAC/ABAC feature in Nile (talk to us if you are interested in joining the conversation), and we will have a follow-up blog with recommendations on best practices for adding RBAC/ABAC to multi-tenant SaaS.

Implementation challenges

A few gotchas

One gotcha we encountered was that RLS doesn’t apply to superusers and table owners. According to the Postgres docs:

“Superusers and roles with the BYPASSRLS attribute always bypass the row security system when accessing a table. Table owners normally bypass row security as well, though a table owner can choose to be subject to row security with ALTER TABLE ... FORCE ROW LEVEL SECURITY.”

Both of the blogs I shared earlier create a user called app_user that’s used in the application. We did this as well, locally, but didn’t change the database user when deploying to our testing environment. Thankfully, we caught and fixed this issue quickly.

Another issue we caught during testing was that some requests were being authorized with a previous request’s user id. We discovered that since the user id for RLS was being stored in thread-local storage and threads were being reused for requests, it was necessary to set up a post-response handler to reset thread-local storage.

Overall, so far the gotchas haven’t been too tricky to diagnose and resolve, but as one might expect with anything security-related, they do have serious consequences if not addressed.

Initial widespread code changes

Although RLS addresses the problem of continuous widespread changes well (see “Defined once, applied everywhere”), initially switching from dynamic queries to RLS requires more code changes than you might think. Here’s an example of how RLS might affect an API endpoint to update an organization that’s only callable by users in that org:

/*
** ---- Without RLS ---- **

1. Check if user is a member of the org
  a. If so, execute the update query
  b. Else, return a 404
*/

Org update(userId, orgId, updatePayload) {
  if (dao.isOrgMember(userId, orgId)) {
    return dao.updateOrg(updatePayload);
  } else {
    throw new NotFoundException();
  }
}

/* -- DAO layer -- */

boolean isOrgMember(userId, orgId) {
  return query("EXISTS(SELECT 1 ...)");
}

Org updateOrg(updatePayload) {
  return query("UPDATE orgs SET ... RETURNING *");
}

/*
** ----- With RLS ---- **

1. Execute the update query
  a. If the org was returned from the db, return the org in the response
  b. Else, return a 404
*/

Org update(userId, orgId, updatePayload) {
    Optional<Org> maybeOrg = dao.updateOrg(updatePayload);
    if (maybeOrg.isPresent()) {
      return maybeOrg.get();
    } else {
       throw new NotFoundException();
    }
}

/* -- DAO layer -- */

Optional<Org> updateOrg(updatePayload) {
  return query("UPDATE orgs SET ... RETURNING *");
}

In this example, authorization without RLS is done before writing to the db. With RLS, since authorization is determined at query time, write queries might fail so error handling has to be pushed down to the db level. This isn’t a mind-boggling change but is one you should keep in mind when planning to add RLS in any project that involves a multi-tenant db.

The gaps between blog-ready and production-ready RLS

Recursive permission policies

Let’s say you want to add an admin user type and implement the following access rules:

Users can read, update, and delete their own user profiles.
Users can read the profiles of other users who belong to the same tenant.
Users with admin access can read, update, and delete other users who belong to the same tenant.

The first two use cases are possible with straightforward RLS policies, but the third isn’t. This is because we must query the users table to see if the user in question is an admin ( i.e: SELECT 1 FROM users WHERE id = current_app_user() AND is_admin = TRUE). Since querying a table triggers its RLS policy checks, executing this query within a users RLS policy will trigger users RLS policy checks, which will call this query, which will trigger RLS policy checks, resulting in an infinite loop. Postgres will catch this error instead of timing out, but you should make sure to test your policies so this doesn’t happen at runtime. You can avoid this problem by defining a function with SECURITY DEFINER permissions that’s to be used in the RLS policy. According to the Postgres docs:

"SECURITY DEFINER specifies that the function is to be executed with the privileges of the user that owns it."

In our case, this user is the superuser that you probably used to set up your database. So they bypass RLS.

:::note By using SECURITY DEFINER you are allowing users to bypass the security policy and use superuser privileges regardless of who they really are, so you must be careful. I recommend reviewing the “Writing SECURITY DEFINER Functions Safely ” section of the Postgres documentation before using this capability. :::

Here’s an example of how to implement RLS policies that satisfy the three use cases above:

CREATE
  TABLE
    users(
      id SERIAL PRIMARY KEY,
      is_admin BOOLEAN
    );

ALTER TABLE
  users ENABLE ROW LEVEL SECURITY;

-- Users can do anything to themselves.
CREATE
  POLICY self_policy ON
  users
    USING(
    id = current_app_user()
  );

CREATE
  FUNCTION is_user_admin(
    _user_id INTEGER
  ) RETURNS bool AS $$ SELECT
    EXISTS(
      SELECT
        1
      FROM
        users
      WHERE
        id = _user_id
        AND is_admin = TRUE
    ) $$ LANGUAGE SQL SECURITY DEFINER;

CREATE
  FUNCTION do_users_share_org(
    _user_id_1 INTEGER,
    _user_id_2 INTEGER
  ) RETURNS bool AS $$ SELECT
    EXISTS(
      SELECT
        1
      FROM
        org_members om1,
        org_members om2
      WHERE
        om1.user != om2.user
        AND om1.org = om2.org
        AND om1.user = _user_id_1
        AND om2.user = _user_id_2
    ) $$ LANGUAGE SQL SECURITY INVOKER;

-- Non-admins can only read users in their orgs.
CREATE
  POLICY read_in_shared_orgs_policy ON
  users FOR SELECT
      USING(
      do_users_share_org(
        current_app_user(),
        id
      )
    );

CREATE
  POLICY admin_policy ON
  users
    USING(
    do_users_share_org(
      current_app_user(),
      id
    )
    AND is_user_admin(
      current_app_user()
    )
  );

Note the use of the do_users_share_org() SECURITY INVOKER function. According to the Postgres docs:

“SECURITY INVOKER indicates that the function is to be executed with the privileges of the user that calls it.”

In our case, this is app_user (who doesn’t bypass RLS), so we just define these functions for reusability purposes.

Logging

It’s important to set up logging before shipping any feature to production. This is especially true with RLS where logging the execution of the actual policies isn’t directly possible . For each request, it’s helpful to log the user and tenant IDs to be used for RLS when:

Parsing them from auth headers
Setting and getting them from thread-local storage
Setting them in the db connection
This makes it easier to identify bugs related to thread-local storage When resetting them in thread-local storage

It’s also a good idea to enable more detailed logging in the db, at least initially, to see the values actually being inserted/retrieved. If policies return too few/many results, or inserts fail unexpectedly, it’s easier to figure out what went wrong.

Testing

In multi-tenant SaaS, guaranteeing the security of each tenant is critical. We have an extensive suite of integration tests that test every access pattern to make sure that nothing ever leaks. The tests spin up a Postgres Testcontainer and call the relevant API endpoints, checking that proper access is always enforced.

In order to minimize the execution time of a large suite of integration tests, we avoid setup and teardown of the database between tests and annotate the order in which tests run to make sure the results are deterministic even without a full cleanup in between tests. As we scale, we’ll look into other options like property-based testing and parallelizing our tests.

The switch from dynamic queries to RLS has been seamless in our integration tests. All we had to do was to make sure our tests were using the newly-created app_user that doesn’t bypass RLS.

Conclusion

Every modern SaaS product is multi-tenant, but the good ones are also scalable, cost-effective, and maintainable. Scalability and cost-effectiveness are the results of careful system design. Maintainability includes design considerations such as the DRY principle (don’t repeat yourself) and a separation of concerns, which make mistakes less likely and testing and troubleshooting easier.

As we’ve shown, a single-schema multi-tenant database with RLS ticks all the checkboxes for scalable, cost-effective, and maintainable architecture. This blog includes everything you need to get started with your own multi-tenant SaaS architecture. But if this seems like too much and you’d rather have someone else handle this for you - talk to us :)

Postgres built for modern SaaS