NeetoDeploy is a Heroku alternative for deploying web applications. It is a platform as a service (PaaS) solution that makes it easy to deploy and manage applications, with features like database backups, auto deployment from GitHub, live logs, SSL certificate management, auto scaling, and more. Currently, Neeto uses NeetoDeploy for all of its pull request review applications and staging applications.
NeetoDeploy's PostgreSQL addon has a continuous backup feature. This feature allows you to roll your database back to a specific time in the past. We can split the continuous backup process into data backup and recovery. Let's discuss these steps in detail.
Initial setup
Let us create a new PostgreSQL cluster and populate it with some data.
Creating parent database
```shell
initdb /usr/local/var/postgresql-parent
```
Creating some sample data
Start the PostgreSQL server and log in using psql. Provide the role and password if needed.
```shell
pg_ctl -D /usr/local/var/postgresql-parent start
psql -d postgres
```
Create a new table
```sql
postgres=#
CREATE TABLE customers (
  user_id SERIAL PRIMARY KEY,
  username VARCHAR(50),
  email VARCHAR(100),
  registration_date DATE
);
```
Seed sample data
```sql
postgres=#
INSERT INTO customers (username, email, registration_date)
VALUES
('johndoe', '[email protected]', '2021-01-01'),
('janedoe', '[email protected]', '2021-02-15'),
('bobsmith', '[email protected]', '2021-03-10'),
('sarahlee', '[email protected]', '2021-04-05'),
('maxwell', '[email protected]', '2021-05-20');
```
Data backup
To perform recovery, we need to have access to the existing data. Hence we must store all the database changes in a separate location, allowing us to utilize this data for recovery.
This is where the Write Ahead Log (WAL) in PostgreSQL comes into play. PostgreSQL follows a mechanism that writes all transactions to the WAL before committing them to the table data files. By preserving the WAL, we can reconstruct the data by traversing the logged transactions. Utilizing the information stored in the WAL, we can systematically reapply the recorded transactions to recreate the data and achieve the desired recovery. Your WAL files are in your PostgreSQL data directory's pg_wal folder.
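If you want to see this mechanism for yourself, PostgreSQL exposes the current WAL position through SQL. The following is a minimal sketch, assuming a locally running server created from the parent data directory above; the helper names are our own.

```shell
# Sketch: inspect the WAL position of a running cluster.
# Assumes a local server started from /usr/local/var/postgresql-parent.

# Print the current write-ahead log insert location.
current_wal_lsn() {
  psql -d postgres -At -c "SELECT pg_current_wal_lsn();"
}

# Translate that location into the WAL file name you will find in pg_wal/.
current_wal_file() {
  psql -d postgres -At -c "SELECT pg_walfile_name(pg_current_wal_lsn());"
}

# List the WAL segment files currently on disk.
list_wal_files() {
  ls /usr/local/var/postgresql-parent/pg_wal
}
```

The file name printed by `current_wal_file` should match one of the segments listed by `list_wal_files`.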
WAL archiving
As data modifications occur in the databases, new WAL files are generated while older WAL files are eventually discarded. Therefore, it is crucial to store the WAL files before they are deleted. Here we will use an AWS S3 bucket to store our WAL files. Also, we will use wal-e and envdir to simplify the process.
Install helper modules
```shell
python3 -m pip install wal-e envdir
```
If you face any dependency issues, please refer to the wal-e documentation to fix them.
Store AWS S3 credentials
To use "envdir" with your AWS credentials, store them in a folder. The following code snippet demonstrates how to accomplish this:
```shell
mkdir ~/wal-e.env
echo "YOUR_AWS_REGION" > ~/wal-e.env/AWS_REGION
echo "YOUR_AWS_SECRET_ACCESS_KEY" > ~/wal-e.env/AWS_SECRET_ACCESS_KEY
echo "YOUR_AWS_ACCESS_KEY_ID" > ~/wal-e.env/AWS_ACCESS_KEY_ID
echo "YOUR_AWS_STORAGE_PATH" > ~/wal-e.env/WALE_S3_PREFIX
```
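The same setup can be wrapped in a reusable helper. This is a sketch with placeholder credential values (substitute your own); it writes one file per variable, which is exactly the layout envdir expects:

```shell
# Sketch: build an envdir-compatible directory, one file per variable.
# The credential values here are placeholders; substitute your own.
create_wale_env() {
  env_dir="$1"
  mkdir -p "$env_dir"
  # envdir exports each file name as a variable whose value is the file body.
  printf '%s\n' "YOUR_AWS_REGION"            > "$env_dir/AWS_REGION"
  printf '%s\n' "YOUR_AWS_SECRET_ACCESS_KEY" > "$env_dir/AWS_SECRET_ACCESS_KEY"
  printf '%s\n' "YOUR_AWS_ACCESS_KEY_ID"     > "$env_dir/AWS_ACCESS_KEY_ID"
  printf '%s\n' "YOUR_AWS_STORAGE_PATH"      > "$env_dir/WALE_S3_PREFIX"
}
```

Calling `create_wale_env ~/wal-e.env` reproduces the directory created by the snippet above.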
Configure WAL archiving
Now, let's configure PostgreSQL to store the WAL files in your S3 bucket before they are deleted. Open your PostgreSQL configuration file "postgresql.conf", which is located in your data directory i.e., "/usr/local/var/postgresql-parent", and make the following changes.
```
# To enable WAL archiving.
archive_mode = on

# Determines how much information is written to the WAL.
wal_level = replica

# Force PostgreSQL to switch the WAL file every 60 seconds.
archive_timeout = 60

# Command for pushing the WAL files to the S3 bucket.
archive_command = 'envdir ~/wal-e.env wal-e wal-push %p'
```
You can find more information about the above configurations by referring to this documentation.
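If you prefer to script this configuration change, here is a minimal sketch. It assumes none of these settings are already present in the file (it appends rather than edits in place), and the helper name is our own:

```shell
# Sketch: append the WAL archiving settings to a postgresql.conf.
# Assumes these settings are not already set elsewhere in the file.
enable_wal_archiving() {
  conf="$1"
  cat >> "$conf" <<'EOF'
archive_mode = on
wal_level = replica
archive_timeout = 60
archive_command = 'envdir ~/wal-e.env wal-e wal-push %p'
EOF
}
```

For example, `enable_wal_archiving /usr/local/var/postgresql-parent/postgresql.conf` applies the settings shown above.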
Now let's restart the server.
```shell
data_directory="/usr/local/var/postgresql-parent"
pg_ctl -D $data_directory restart -l $data_directory/postgresql.log
```
You can watch the PostgreSQL logs like this.
```shell
tail -f /usr/local/var/postgresql-parent/postgresql.log
```
After starting your server, monitor the PostgreSQL logs. If you don't observe any WAL archive logs even after 60 seconds, it is because a WAL file switch will only occur if there are any modifications. To generate the desired outcome, make updates in the database. Upon doing so, you will be able to see lines similar to the following in your PostgreSQL logs:
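Instead of making dummy updates, you can also force a switch explicitly: `pg_switch_wal()` closes the current WAL segment, so the `archive_command` fires right away. A sketch assuming the parent server is running locally:

```shell
# Sketch: force PostgreSQL to close the current WAL segment so the
# archive_command runs immediately. Assumes a running local server.
force_wal_switch() {
  psql -d postgres -At -c "SELECT pg_switch_wal();"
}
```

After running `force_wal_switch`, an archiving entry should appear in the PostgreSQL logs within a few seconds.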
```
Message: 'MSG: completed archiving to a file
DETAIL: Archiving to "s3://neeto-deploy-backups/local/wal_005/000000020000000000001.lzo" complete at 158.209KiB/s.
STRUCTURED: time=2023-05-18T06:22:27.613242-00 pid=263 action=push-wal key=s3://neeto-deploy-backups/local/wal_005/000000020000000000001.lzo prefix=local/ rate=158.209 seg=000000020000000000001 state=complete'
```
Base backup
Since we have enabled WAL archiving, all new changes are stored in the bucket as WAL files. However, what about the data we created before configuring WAL archiving? To ensure a complete recovery, we also need that data. Therefore, we need to take a base backup of our database. The following code will initiate the base backup process and push it to the S3 bucket.
```shell
envdir ~/wal-e.env wal-e backup-push /usr/local/var/postgresql-parent
```
On running the above command, you will get output similar to the following.
```
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "backup-push".
STRUCTURED: time=2023-05-18T06:24:27.613242-00 pid=41178
wal_e.worker.upload INFO MSG: begin uploading a base backup volume
DETAIL: Uploading to "s3://neeto-deploy-backups/local/basebackups_005/base_000000010000000000005_00000040/tar_partitions/part_00000000.tar.lzo".
STRUCTURED: time=2023-05-18T06:24:29.527622-00 pid=41178
wal_e.worker.upload INFO MSG: finish uploading a base backup volume
DETAIL: Uploading to "s3://neeto-deploy-backups/local/basebackups_005/base_000000010000000000005_00000040/tar_partitions/part_00000000.tar.lzo" complete at 1459.68KiB/s.
STRUCTURED: time=2023-05-18T06:24:33.121716-00 pid=41178
NOTICE: all required WAL segments have been archived
```
After completing the base backup, you can verify the presence of the base backup folder in your S3 bucket. Now we have the backup and new WAL files are being pushed to our bucket.
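You can also check from the command line: wal-e's `backup-list` subcommand enumerates the base backups it has pushed to the bucket. A small sketch using the same credentials directory (the helper name is ours):

```shell
# Sketch: list the base backups wal-e has pushed to the bucket,
# using the credentials stored in the envdir created earlier.
list_base_backups() {
  envdir ~/wal-e.env wal-e backup-list
}
```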
Data recovery
Let's consider a scenario where you need to perform a rollback. Suppose you accidentally deleted a table from your database that contained multiple records, making it impractical to recall and reinsert them manually. As a first step, make a note of the time at which the table was deleted. For instance, let's say it occurred on 2023-05-18 11:57:11 UTC. The objective is to roll back to a point before that time. Let's roll back to 2023-05-18 at 11:57:00 UTC.
Let's make a mistake
We are going to drop a table.
```sql
postgres=#
DROP TABLE customers;
```
Note the time the table got deleted. Here it is 2023-05-18 11:57:11 UTC. Now let's discuss how we can restore the deleted table.
Creating a new data directory
Let's create a new data directory where we can recover the data.
```shell
initdb /usr/local/var/postgresql-child
```
Fetch the base backup
The base backup serves as the starting point, while the remaining data will be regenerated by PostgreSQL using the WAL files. To fetch the base backup, you can use the following code snippet.
```shell
envdir ~/wal-e.env wal-e backup-fetch /usr/local/var/postgresql-child LATEST
```
The "LATEST" tag pulls the most recent base backup if multiple base backups exist in the bucket. On running the above command, you will receive output similar to the following.
```
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "backup-fetch".
STRUCTURED: time=2023-05-18T06:29:22.826910-00 pid=41789
wal_e.worker.s3.s3_worker INFO MSG: beginning partition download
DETAIL: The partition being downloaded is part_00000000.tar.lzo.
HINT: The absolute S3 key is local/basebackups_005/base_000000010000000000000005_00000040/tar_partitions/part_00000000.tar.lzo.
STRUCTURED: time=2023-05-18T06:29:24.674572-00 pid=41789
```
Configuring recovery
Now, we need to regenerate the remaining data from the WAL files. To do that, make the following changes in the PostgreSQL configuration file.
```
# postgresql.conf

# Command to fetch each WAL file from the bucket.
restore_command = 'envdir ~/wal-e.env wal-e wal-fetch %f %p'

# Time at which we want to stop recovery.
recovery_target_time = '2023-05-18 11:57:00'
```
We can stop the parent server since it is no longer needed.
```shell
pg_ctl -D /usr/local/var/postgresql-parent stop
```
To initiate the server in recovery mode and to allow PostgreSQL to regenerate the data using the WAL files, create a file named "recovery.signal" in the new data directory.
```shell
touch /usr/local/var/postgresql-child/recovery.signal
```
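The two steps above, editing the configuration and creating the signal file, can be combined into a single helper. A sketch assuming the child data directory layout used in this post (the function name is our own):

```shell
# Sketch: configure a freshly fetched cluster for point-in-time recovery.
# Appends the restore settings and creates the recovery.signal marker.
prepare_recovery() {
  data_dir="$1"
  target_time="$2"   # e.g. '2023-05-18 11:57:00'
  cat >> "$data_dir/postgresql.conf" <<EOF
restore_command = 'envdir ~/wal-e.env wal-e wal-fetch %f %p'
recovery_target_time = '$target_time'
EOF
  # PostgreSQL enters recovery mode on startup when this file exists.
  touch "$data_dir/recovery.signal"
}
```

For example, `prepare_recovery /usr/local/var/postgresql-child '2023-05-18 11:57:00'` prepares the child cluster used in this walkthrough.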
Start the child server pointing to the new data directory by running the following command.
```shell
data_directory="/usr/local/var/postgresql-child"
pg_ctl -D $data_directory start -l $data_directory/postgresql.log
```
If you examine the PostgreSQL logs, you will notice that the recovery process is paused at the specified time.
```
2023-05-18 12:07:41.688 IST [43159] LOG: restored log file "000000010000000000006" from archive
2023-05-18 12:07:41.692 IST [43159] LOG: recovery stopping before commit of transaction 743, time 2023-05-18 11:57:11.135927+05:30
2023-05-18 12:07:41.692 IST [43159] LOG: pausing at the end of recovery
2023-05-18 12:07:41.692 IST [43159] HINT: Execute pg_wal_replay_resume() to promote.
```
It's important to note that you won't be able to connect to the server using psql at this point because it is not ready to accept connections. By default, PostgreSQL pauses the recovery process after reaching the recovery target time. To change this behavior and to allow the server to accept connections after reaching the target time, add the following configuration to your PostgreSQL configuration file.
```
recovery_target_action = 'promote'
```
After making the above change, restart the server once again by running the following command.
```shell
data_directory="/usr/local/var/postgresql-child"
pg_ctl -D $data_directory restart -l $data_directory/postgresql.log
```
Now, you will observe that the server is ready to accept connections after completing the recovery process. Once the recovery is successfully completed, PostgreSQL automatically deletes the "recovery.signal" file. You can verify the successful deletion of the recovery file from your data directory.
Once the server is up and ready, connect to the server and check if the deleted table is present. You will find that the deleted table is now restored and available in the database.
```sql
postgres=# select * from customers;

 user_id | username |        email         | registration_date
---------+----------+----------------------+-------------------
       1 | johndoe  | johndoe@example.com  | 2021-01-01
       2 | janedoe  | janedoe@example.com  | 2021-02-15
       3 | bobsmith | bobsmith@example.com | 2021-03-10
       4 | sarahlee | sarahlee@example.com | 2021-04-05
       5 | maxwell  | maxwell@example.com  | 2021-05-20
(5 rows)
```
To learn more about PostgreSQL continuous backup, you can refer to the official PostgreSQL documentation.
Challenges faced
Restoration of PostgreSQL roles
When we restored the PostgreSQL database, the roles and passwords were restored along with it. However, the continuous rollback feature creates a new database with the desired data, so for security we should use a new role and password. We resolved this by renaming the restored role and setting a new password.
Early recovery completion
Another issue we encountered was the early recovery of the database before reaching the specified target time. Suppose no activity occurs in the database, and we attempt to roll back to the current time. PostgreSQL initiates the recovery process and recovers all the available WAL files. However, it fails to locate the target time since no corresponding WAL file is available for the given timestamp. PostgreSQL raises the following error in this scenario and shuts down the server.
```
2023-05-18 02:12:15.846 UTC [76] LOG: last completed transaction was at log time 2023-05-18 02:04:02.755919+00
2023-05-18 02:12:15.846 UTC [76] FATAL: recovery ended before configured recovery target was reached
2023-05-18 02:12:15.850 UTC [75] LOG: startup process (PID 76) exited with exit code 1
2023-05-18 02:12:15.850 UTC [75] LOG: terminating any other active server processes
2023-05-18 02:12:15.853 UTC [75] LOG: shutting down due to startup process failure
2023-05-18 02:12:15.859 UTC [75] LOG: database system is shut down
```
In this case, we can remove the specified target time and restart the server. If we do not provide any target time, PostgreSQL will recover until the last available WAL file, which will contain the latest transaction of our database.
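If you script your rollbacks, removing the target can be automated as well. A hedged sketch that comments out the setting, assuming `recovery_target_time` sits on its own line as configured earlier:

```shell
# Sketch: comment out recovery_target_time so PostgreSQL recovers up to
# the last archived WAL file instead of stopping at a fixed timestamp.
drop_recovery_target() {
  conf="$1"
  # -i.bak edits in place and keeps a backup; works on both GNU and BSD sed.
  sed -i.bak 's/^recovery_target_time/#recovery_target_time/' "$conf"
}
```

After calling `drop_recovery_target` on the child cluster's postgresql.conf, restart the server to let recovery run to the end of the archived WAL.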
Recovery is an asynchronous process
As mentioned above, we must update the restored role and password to new ones.
```shell
update-role-and-password() {
  # Create a new temporary user as we cannot update the current session user.
  psql -d <DB_NAME> \
    -c "CREATE USER tempuser WITH SUPERUSER LOGIN PASSWORD 'tempuser-pwd';"

  # Log in as the temporary user and update the parent role and password.
  PGUSER=tempuser PGPASSWORD=tempuser-pwd psql -d <DB_NAME> \
    -c "ALTER ROLE <parent-role> RENAME TO <NEW_ROLE>;
        ALTER ROLE <NEW_ROLE> WITH PASSWORD '<NEW_PASSWORD>';"

  # Log in as the new user and remove the temporary user.
  PGUSER=<NEW_ROLE> PGPASSWORD=<NEW_PASSWORD> psql -d <DB_NAME> \
    -c "DROP ROLE tempuser;"
}

# Start the server in recovery mode.
pg_ctl -D /usr/local/var/postgresql-child start

# Call the function to update the role and password.
update-role-and-password
```
If you try the above code, you will encounter the following error.
```
ERROR: cannot execute CREATE ROLE in a read-only transaction
```
Since the recovery process is asynchronous, the system may attempt to update the role before completing the recovery. It is important to note that the server will be in read-only mode during the recovery, which can lead to the error mentioned.
To solve this issue, it is necessary to determine when the recovery process has completed. This can be achieved by checking for the existence of the "recovery.signal" file in a loop.
```shell
wait_until_recovery_is_completed() {
  recovery_signal_path="/usr/local/var/postgresql-child/recovery.signal"
  while [ -f "$recovery_signal_path" ]; do
    sleep 1 # Wait for 1 second before checking again.
  done
}
```
```shell
pg_ctl -D /usr/local/var/postgresql-child start
wait_until_recovery_is_completed
update-role-and-password
```
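As an alternative to watching for the recovery.signal file, you can ask the server itself: `pg_is_in_recovery()` returns `t` while WAL replay is still running and `f` once the server has been promoted. A sketch assuming the server accepts read-only connections during recovery, as it did when it raised the error above:

```shell
# Sketch: poll the server over SQL until recovery finishes,
# instead of watching for the recovery.signal file to disappear.
wait_until_writable() {
  until [ "$(psql -d postgres -At -c 'SELECT pg_is_in_recovery();')" = "f" ]; do
    sleep 1 # Still replaying WAL; check again in a second.
  done
}
```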