
Using Spark SQL from Scratch

Updated at: Nov 06, 2019 GMT+08:00

To process structured data, Spark provides Spark SQL, which lets you query data with SQL-like statements.

You can create a table named src_data in the mrs_20160907 cluster, write one data record into each row of the table, query the data in the src_data table with SQL statements, and finally delete the src_data table.

Prerequisites

You have administrator rights on MRS Manager.

You have obtained the AK/SK for writing data from the OBS data source to the Spark SQL table. To obtain the AK/SK:
  1. Log in to the management console.
  2. Click the username and choose My Credentials from the drop-down list.
  3. Click Access Keys.
  4. Click Create Access Key to switch to the Create Access Key page.
  5. Enter the login password and the SMS verification code, then click OK to download the access key. Keep the access key secure.

Procedure

  1. Prepare data sources for Spark SQL analysis.

    The following is an example of a text file:

    abcd3ghji
    efgh658ko
    1234jjyu9
    7h8kodfg1
    kk99icxz3

  2. Upload data to OBS.

    1. Log in to the OBS management console.
    2. Click Create Bucket to create a bucket and name it. The bucket name must be globally unique; otherwise, the bucket cannot be created. This example uses sparksql as the bucket name.
    3. In the sparksql bucket, click Create Folder to create the input folder.
    4. Go to the input folder, select the local text file, and click Upload.

  3. Import the text file in OBS to HDFS.

    1. Log in to the MRS management console. In the navigation tree on the left, choose Clusters > Active Clusters and click the cluster named mrs_20160907. The mrs_20160907 cluster was created in section Creating a Cluster.
    2. Click the Files tab.
    3. Click Create Folder and create a folder named userinput.
    4. Go to the userinput folder and click Import Data.
    5. Select the OBS and HDFS paths and click OK.

      OBS path: s3a://sparksql/input/sparksql-test.txt

      HDFS path: /user/userinput

  4. Submit the Spark SQL statement.

    1. On the Jobs tab page, click Create to go to the Create Job page. For details, see Running a SparkSql Job.

      Jobs can be submitted only when the mrs_20160907 cluster is in the Running state.

    2. Enter the Spark SQL statement to create a table.

      When entering Spark SQL statements, ensure that each statement contains fewer than 10,000 characters.

      The syntax is as follows:

      CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
        [(col_name data_type [COMMENT col_comment], ...)]
        [COMMENT table_comment]
        [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
        [CLUSTERED BY (col_name, col_name, ...)
          [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
        [ROW FORMAT row_format]
        [STORED AS file_format]
        [LOCATION hdfs_path];

      You can use either of the following two methods to create a table:

      • Method 1: Create table src_data and write data in every row.

        If the data source is stored in the /user/userinput folder of HDFS:

        create external table src_data(line string) row format delimited fields terminated by '\n' stored as textfile location '/user/userinput';

        If the data source is stored in the /sparksql/input folder of OBS:

        create external table src_data(line string) row format delimited fields terminated by '\n' stored as textfile location 's3a://AK:SK@sparksql/input';

        For the method of obtaining the AK/SK, see the description in Prerequisites.

      • Method 2: Create table src_data1 and load data to the src_data1 table in batches.

        create table src_data1 (line string) row format delimited fields terminated by ',';

        load data inpath '/user/userinput/sparksql-test.txt' into table src_data1;

      When method 2 is used, data cannot be loaded directly from OBS into the table. Import the file to HDFS first (see step 3) and load it from there.
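
      After creating the table with either method, you can run a quick row count to confirm that the data was loaded (a minimal check; the expected result of 5 assumes the five-line sample file prepared in step 1):

      select count(*) from src_data;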

    3. Enter the Spark SQL statement to query a table.

      The syntax is as follows:

      SELECT col_name FROM table_name;

      For example, to query all data in the src_data table, enter the following statement:

      select * from src_data;
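
      Because the src_data table has a single column named line, the query can also name the column and filter rows (an illustrative example against the sample data):

      select line from src_data where line like '1234%';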

    4. Enter the Spark SQL statement to delete a table.

      The syntax is as follows:

      DROP TABLE [IF EXISTS] table_name;

      For example:

      drop table src_data;
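
      If you also created the src_data1 table with method 2, delete it in the same way. The IF EXISTS clause prevents an error if the table does not exist:

      drop table if exists src_data1;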

    5. Click Check to verify that the statements are correct.
    6. Click Submit.

      After submitting Spark SQL statements, you can check whether the execution is successful in Last Execution Result and view detailed execution results in Last Query Result Set.
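
      For reference, the statements used in this step form the following sequence (a consolidated sketch using the HDFS path from step 3; depending on the console, semicolon-separated statements may need to be submitted one at a time):

      create external table src_data(line string) row format delimited fields terminated by '\n' stored as textfile location '/user/userinput';
      select * from src_data;
      drop table if exists src_data;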

  5. Terminate a cluster.

    For details, see Terminating a Cluster in the User Guide.
