A standard data management procedure is the import and export of data from data repositories. Almost everyone must extract data from database management systems, from developers to administrators. AWS S3 serves as the data storage layer in the AWS Data Lake idea, while Redshift serves as the computation layer that can connect, process, and aggregate amounts of data. It is frequently necessary to export the data from Redshift and host it in other repositories that are appropriate for the type of consumption to provide the data housed there. One of AWS’s core storage repositories, S3, is linked with practically all of the company’s analytics and data offerings. Due to this, Redshift experts’ first requirement is to export data from Redshift to AWS S3. In this post, we’ll go through the many methods for exporting data from Amazon Redshift to Amazon S3 and how to do it step-by-step.
Setup for AWS Redshift
This article assumes that an operational AWS Redshift cluster is already in place. To set up a new AWS Redshift cluster, newbies to Redshift might refer to this article, Getting started with AWS Redshift. The Redshift Clusters page’s appearance once the cluster has been installed will be shown below.
We require sample data to extract the data from the AWS Redshift cluster. It is expected that you already have at least some sample data. If not, I already described how to load data into Redshift in one of my posts, which you may refer to produce some example data. Connect to the cluster as soon as it is ready with some test data. In the Redshift cluster, I have a users table that appears as follows. Use your preferred IDE to connect to the Redshift cluster.
Choose your Redshift cluster data
Consider that we want to export this data into an S3 bucket on AWS. The “Unload” command is the primary way AWS Redshift natively enables exporting data. The Unload command has the syntax listed below. This command offers a variety of formatting choices and allows you to specify the schema of the data that is being exported. This post will look at some of the more popular choices.
Command syntax for unloading
Before we can run the unload command, a few conditions must be met. It is necessary to have an AWS S3 bucket where the exported data will be loaded. AWS Redshift cluster attachment and write access to Amazon S3 need the creation of an IAM role. Execute the command as indicated below, supposing these configurations are in place. Let’s attempt to comprehend each line of this instruction.
- The command’s first line defines the query that will be used to extract the requested dataset. In this instance, we want all the table’s rows and fields.
- The location of the Amazon S3 bucket from which we want to extract the data is specified in the second line of the command.
- The IAM role that the Redshift cluster will use to write the data to the Amazon S3 bucket is specified in the third line.
- The data format that we desire to export the data in is specified in the last line. Here, we have supplied the phrase CSV Unload command since we want to export the data in CSV format.
When you return to the AWS S3 bucket, the output will appear, as seen below. Each node in the design of AWS Redshift has a predetermined number of node slides. Depending on the number of node slices in the cluster, the unload command, by default, exports data in parallel to several files. In this instance, the data was exported in two different files from a single node cluster.
discharge output in S3
Any of these files, when opened, would appear as below. Using the unload command, all of the fields and data segments would be exported to the file.
In general, one would want to compress the data when exporting vast amounts of data because it minimizes the storage footprint and lowers expenses. It requires additional time and effort to export the data in an uncompressed format before compressing it. The Unload command provides options for exporting data in a compressed format. There are also occasions when the data must be in a single file rather than being joined in a single file and then read by the consumption tools. Add the phrases GZIP and PARALLEL OFF to the preceding command, as shown below. This will compress the exported data in gzip format and stop AWS Redshift from exporting data in parallel mode, resulting in a single file output.
command to unload with parallelism and data compression disabled
Return to the AWS S3 bucket after the data has been exported, and the result will appear as seen below. The result would be one gzip-encoded file.
Put output in the S3 bucket.
The instructions above are appropriate for specific export instances where all the data has to be exported in one location. Imagine a situation where the data is pretty extensive, and gathering it all in one or more files would not be beneficial since there would be too much data in a single file, or the relevant material would be dispersed across too many files. Data organization in AWS S3 following the export procedure doesn’t require more work because the objective is to group the data into separate buckets according to a specific criterion. As part of the export procedure, we must split the data. “State” is one of the fields in the dataset we are using. Let’s imagine we require the data to be divided into buckets based on states, with each bucket including all the files containing the rows that pertain to that state. The unload command makes the partition keyword available, enabling us to accomplish this specific goal. Execute the command as seen below, referencing the partitioned keyword and the attribute “state.”
order to unload
Returning to the AWS S3 bucket will allow you to view the output displayed below. Many folders will be created in the destination bucket where the data is exported. Each individual state is given its own folder, with the name of the state serving as both the folder’s name and its value.
Put output in an S3 bucket on AWS.
You may retrieve the exported data in multiple files by opening any folder. The explanation is that the unload command mentioned above produced numerous files since we neglected to include the PARALLEL OFF option. If the command needs to be a single file, add this parameter. You should be able to discover all the entries with the same status if you open any one of the files in any given folder.
There are other options for the unload command. Take a closer look at and give these AWS Redshift documentation choices a try.
This post taught us how to export the data to AWS S3 using the AWS Redshift Unload command. Additionally, we learned how to use the same command with multiple arguments to export data, compress data, export data with or without parallelism, and arrange the produced data.