Updated on 2022-06-01 GMT+08:00

Suggestions

HQL Compilation – Implicit Type Conversion

If the query statements use the field value for filtering, do not use the implicit type conversion of Hive to compile HiveQL. The reason is that the implicit type conversion is not conducive to code reading and migration.

Suggested:

select * from default.tbl_src where id = 10001;
select * from default.tbl_src where name = 'TestName';

Not suggested:

select * from default.tbl_src where id = '10001';
select * from default.tbl_src where name = TestName; 

In the tbl_src table, the id field is of the Int type and the name field is of the String type.

HiveQL Compilation – Object Name Length

HiveQL object names include table names, field names, view names, and index names. It is recommended that an object name not exceed 30 bytes.

If the length of any object name in Oracle exceeds 30 bytes, an error will be reported. To be compatible with Oracle, limit the length of an object name to a maximum of 30 bytes.

Excessive long object names are not conductive to code reading, migration, and maintenance.

HiveQL Compilation – Number of Records

You are advised to use select count(1) from table_name to count the number of all records in a table.

You are advised to use select count(column_name) from table_name to count the number of valid records of a field in a table.

JDBC Timeout Limit

The JDBC implementation provided by Hive has a timeout limit. The default value is 5 minutes. You can set the value using java.sql.DriverManager.setLoginTimeout (int seconds). The unit of seconds is second.

Concurrently Compiling SQL Statements in Hive

Set hive.driver.parallel.compilation to true for HiveServer and restart HiveServer to enable HiveServer to support concurrent SQL compilation between multiple sessions.

By default, the HiveServer disables concurrent SQL compilation, which means that the HiveServer can only compile SQL statements in serial mode. When a large number of SQL statements with short execution time exist, the overall performance is affected. For example, if a large number of Analyse operations are performed, enabling concurrent compilation can greatly improve the performance.

Policy to Create a Hive Partitioned Table

If there is a large amount of data in the directory of a table, Hive searches for all files in the table, which is time-consuming. If you know some features of the data, you can split the data and store them in different directories of Hive in advance. During query, you can filter the features in the WHERE clause. Then, the data operation is performed only in the subdirectory that meets the conditions, other directories that do not meet the requirements will not be read. This method of dispersing data in a table to sub-directories is called partitioned table.

Therefore, the prerequisite for creating a partitioned table is that the table contains a large amount of data or will store a large amount of data in the future. Otherwise, a large number of small files will be generated in the file system. You are advised to create a partitioned table when the following conditions are met:

  • Data in the table is imported by time, and the data size in each time period is greater than 32 MB. In this case, the time field can be used as a partition field to create a partitioned table.
  • If the total size of a table is expected to be greater than 10 GB, you can create a partitioned table.

Increasing the Query Success Rate of the Standby GaussDB Node

The following error information is displayed when you perform a query on the standby database:

When the standby database synchronizes data from the primary database, it detects that the standby database executes the SQL statement that takes a long time and automatically cancels the SQL statement. The maximum SQL statement execution time is 30s by default. Run the following statement on the standby GaussDB node:

su ommdba
gs_guc reload -c "hot_standby_feedback=on"

Setting this parameter has both advantages and disadvantages. The advantage is that conflicts are reduced. The disadvantage is that the primary database needs to wait for the transaction of the standby database to end, which may cause data expansion of the primary database in frequent update scenarios. Therefore, you are advised to execute only the SQL statements that are not frequently executed and take a long time on the standby database.