Best Practices for Data Import
Importing Data from OBS in Parallel
- Splitting a data file into multiple files
Importing a huge amount of data takes a long time and consumes many computing resources.
To improve the performance of importing data from OBS, split a data file into multiple files as evenly as possible before uploading them to OBS. The preferred number of split files is an integer multiple of the DN quantity, so that the files can be spread evenly across the DNs.
- Verifying data files before and after an import
When importing data from OBS, first upload your files to your OBS bucket, and then verify that the bucket contains all the correct files, and only those files.
After the import is complete, run a SELECT statement to verify that the data from the required files has been imported.
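The splitting and bucket check described above can be sketched in Python as follows. This is a minimal illustration, not part of any DWS or OBS tooling: the function names (split_count, split_evenly, verify_listing) and the output file naming are hypothetical, and the sketch assumes one record per line, so it does not handle quoted fields that contain line breaks. How the bucket listing is obtained is left out.

```python
from pathlib import Path


def split_count(dn_count: int, multiple: int = 1) -> int:
    """Number of split files: an integer multiple of the DN quantity."""
    if dn_count < 1 or multiple < 1:
        raise ValueError("dn_count and multiple must be positive")
    return dn_count * multiple


def split_evenly(src: Path, out_dir: Path, parts: int) -> list[Path]:
    """Split src into `parts` files whose line counts differ by at most one.

    Assumption: each line is one complete record.
    """
    # First pass: count the lines without loading the file into memory.
    with src.open("rb") as f:
        total = sum(1 for _ in f)
    base, extra = divmod(total, parts)

    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    # Second pass: stream the lines into the output files in order.
    with src.open("rb") as f:
        for i in range(parts):
            size = base + (1 if i < extra else 0)
            path = out_dir / f"{src.stem}.part{i:03d}{src.suffix}"
            with path.open("wb") as out:
                for _ in range(size):
                    out.write(f.readline())
            paths.append(path)
    return paths


def verify_listing(expected: set[str], actual: set[str]) -> tuple[set[str], set[str]]:
    """Return (missing, unexpected) object names for a bucket listing."""
    return expected - actual, actual - expected
```

For example, with 3 DNs, split_count(3, 2) gives 6 files. After the upload, both sets returned by verify_listing should be empty, which corresponds to the bucket containing all the correct files and only those files.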
Using GDS to Import Data
- Data skew degrades query performance. Before importing all the data from a table containing over 10 million records, you are advised to import some of the data first and check whether there is data skew and whether the distribution keys need to be changed. Resolve any data skew before importing the rest, because it is costly to address data skew and change the distribution keys after a large amount of data has been imported. For details, see Checking for Data Skew.
- To speed up the import, you are advised to split files and use multiple GDSs to import data in parallel. An import task can be split into multiple concurrent import tasks. If multiple import tasks use the same GDS, you can specify the -t parameter to enable GDS multi-thread concurrent import. To prevent physical I/O and network bottlenecks, you are advised to mount GDSs to different physical disks and NICs.
- If the GDS I/O and NICs do not reach their physical bottlenecks, you can enable SMP on DWS for acceleration. SMP will multiply the pressure on GDSs. Note that SMP adaptation is implemented based on the DWS CPU pressure rather than the GDS pressure. For details about SMP, see Recommended Suggestions for SMP.
- For proper communication between GDSs and DWS, you are advised to use 10GE networks because 1GE networks cannot sustain the high-speed data transmission. To maximize the import rate of a single file, ensure that a 10GE network is used and the data disk group I/O rate is greater than the upper limit of the GDS single-core processing capability (about 400 MB/s).
- As with a single-table import, ensure in a concurrent import that the I/O rate is greater than the maximum network throughput.
- It is recommended that the ratio of GDS quantity to DN quantity be in the range of 1:3 to 1:6.
- To improve the efficiency of importing data in batches to column-store partitioned tables, the data is buffered before being written to disk. You can specify the number of buffers and the buffer size by setting partition_mem_batch and partition_max_cache_size, respectively. The smaller the values, the slower the batch import to column-store partitioned tables. The larger the values, the higher the memory consumption.
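The sizing guidance in the list above (a GDS-to-DN ratio of 1:3 to 1:6, and a network that can keep up with the roughly 400 MB/s single-core GDS rate) reduces to simple arithmetic. The following Python sketch works through it. The function names are hypothetical, the link rate is the nominal line rate with protocol overhead ignored, and rounding to whole GDS instances is an assumption made here.

```python
import math

# Approximate single-core GDS processing limit quoted in the text above.
GDS_SINGLE_CORE_MB_PER_S = 400


def gds_count_range(dn_count: int) -> tuple[int, int]:
    """Return (min, max) GDS counts for a GDS:DN ratio between 1:6 and 1:3.

    Small clusters are clamped to at least one GDS.
    """
    if dn_count < 1:
        raise ValueError("dn_count must be positive")
    low = max(1, math.ceil(dn_count / 6))   # ratio 1:6, rounded up
    high = max(low, dn_count // 3)          # ratio 1:3, rounded down
    return low, high


def link_mb_per_s(gbit_per_s: float) -> float:
    """Nominal link rate in MB/s (1 Gbit/s = 125 MB/s), ignoring overhead."""
    return gbit_per_s * 1000 / 8


def link_can_feed_gds(gbit_per_s: float) -> bool:
    """True if the nominal link rate reaches the single-core GDS limit."""
    return link_mb_per_s(gbit_per_s) >= GDS_SINGLE_CORE_MB_PER_S
```

For a 12-DN cluster, gds_count_range(12) gives a range of 2 to 4 GDSs. A 1GE link tops out at a nominal 125 MB/s, well below the 400 MB/s figure, while a 10GE link offers a nominal 1250 MB/s, which is why 10GE is recommended.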
Using INSERT to Insert Multiple Rows
If the COPY statement cannot be used and you require SQL inserts, use a multi-row insert whenever possible. Data compression is inefficient when you add data of only one row or a few rows at a time.
Multi-row inserts improve performance by batching up a series of inserts. The following example inserts three rows into a three-column table using a single INSERT statement. This is still a small insert, shown simply to illustrate the syntax of a multi-row insert. For details about how to create a table, see Creating a Table.
To insert multiple rows of data to the table customer_t1, run the following statement:
INSERT INTO customer_t1 VALUES (6885, 'maps', 'Joes'), (4321, 'tpcds', 'Lily'), (9527, 'world', 'James');
For more details and examples, see INSERT.
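When rows are generated by an application, the same batching idea can be applied by building one multi-row INSERT per batch. The sketch below is a minimal illustration that uses Python's built-in sqlite3 module and an in-memory SQLite database purely as a stand-in so that it runs anywhere; it does not connect to DWS. The helper name insert_batches and the column names of customer_t1 are hypothetical, and the "?" placeholder style is specific to sqlite3 (other database drivers may use a different placeholder).

```python
import sqlite3


def insert_batches(conn, table, rows, batch_size=100):
    """Insert rows with one multi-row INSERT per batch.

    Returns the number of INSERT statements executed.
    `table` is interpolated into the SQL text, so it must be a trusted name.
    """
    statements = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # One "(?, ?, ?)" group per row, joined into a single VALUES list.
        row_placeholders = "(" + ", ".join("?" * len(batch[0])) + ")"
        sql = f"INSERT INTO {table} VALUES " + ", ".join([row_placeholders] * len(batch))
        params = [value for row in batch for value in row]
        conn.execute(sql, params)
        statements += 1
    conn.commit()
    return statements


# Stand-in database and hypothetical column names, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_t1 (c_sk INTEGER, c_id TEXT, c_name TEXT)")

rows = [(6885, "maps", "Joes"), (4321, "tpcds", "Lily"), (9527, "world", "James")]
statements = insert_batches(conn, "customer_t1", rows)
row_count = conn.execute("SELECT COUNT(*) FROM customer_t1").fetchone()[0]
```

Here the three rows go out in a single statement rather than three. Binding values as parameters, instead of formatting them into the SQL text, avoids quoting problems in the data.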
Using COPY to Import Data
The COPY statement imports data from local and remote databases in parallel. COPY imports large amounts of data more efficiently than INSERT statements.
For details about how to use the COPY statement, see Data Import Using COPY FROM STDIN.
Using a gsql Meta-Command to Import Data
The \copy command can be used to import data after you log in to a database through any gsql client. Unlike the COPY statement, the \copy command reads from or writes into a file on the client side.
Data read or written using the \copy command is transferred through the connection between the server and the client and may not be efficient. The COPY statement is recommended when the amount of data is large.
For details about how to use the \copy command, see Using a gsql Meta-Command to Import Data.