What is Hive
- Definition: Hive is a data-warehouse tool built on Hadoop, used for data extraction, transformation, and loading (ETL). It maps structured data files to database tables and provides SQL query capability, so users can read, write, and manage data files through SQL-like statements.
- Features
- Low learning curve: Hive lets you run MapReduce statistics quickly through SQL-like statements, greatly simplifying the use of MapReduce.
- SQL queries: Hive translates SQL statements into MapReduce jobs, so users unfamiliar with MapReduce can still query, summarize, and analyze data conveniently.
- Scalability and extensibility: Hive supports UDFs (user-defined functions), UDAFs (user-defined aggregate functions), and UDTFs (user-defined table-generating functions), giving data operations good scalability and extensibility.
- Suitable scenarios: Hive is best suited to batch jobs over large data sets, such as web-log analysis, because it can incur noticeable latency when processing large-scale data.
Installing and deploying Hive
Environment
- hadoop-2.10.2
- hive-3.1.3
- centos7
- jdk8
The Hadoop HDFS cluster is already set up; this article builds on that earlier configuration.
Download and install Hive
wget https://dlcdn.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz --no-check-certificate
tar -zxf apache-hive-3.1.3-bin.tar.gz -C /usr/local/
ln -s /usr/local/apache-hive-3.1.3-bin/ /usr/local/hive
Configure environment variables
# quote 'EOF' so that $PATH and $HIVE_HOME are written literally instead of being expanded now
cat >>~/.bashrc<<'EOF'
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export HADOOP_HOME=/usr/local/hadoop
EOF
source ~/.bashrc
Test with the built-in Derby database
cd /usr/local/hive/
./bin/schematool -initSchema -dbType derby
Initialization reports an SLF4J multiple-bindings error. Hadoop ships its own SLF4J binding, so the two copies conflict; removing (or renaming) Hive's copy fixes it.
[root@dnode3 hive]# ./bin/schematool -initSchema -dbType derby
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/apache-hive-3.1.3-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Initialization script completed
schemaTool completed
Delete the generated metastore files and rename the conflicting jar
[root@dnode3 hive]# rm -fr metastore_db/
[root@dnode3 hive]# mv lib/log4j-slf4j-impl-2.17.1.jar lib/log4j-slf4j-impl-2.17.1.jar.bak
Re-initialize the schema
[root@dnode3 hive]# ./bin/schematool -initSchema -dbType derby
Metastore connection URL: jdbc:derby:;databaseName=metastore_db;create=true
Metastore Connection Driver : org.apache.derby.jdbc.EmbeddedDriver
Metastore connection User: APP
Starting metastore schema initialization to 3.1.0
Initialization script hive-schema-3.1.0.derby.sql
Initialization script completed
schemaTool completed
./bin/hive
hive> -- run a few statements
hive>show databases;
hive>show tables;
hive>create table stu(id int, name string);
hive> insert into table stu(id,name) values(1,"ss");
hive> select * from stu;
OK
1 ss
Time taken: 0.308 seconds, Fetched: 1 row(s)
Inspect the data on the HDFS cluster
Before creating the table:
[root@nnode1 hadoop]# ./bin/hadoop fs -ls /
Found 3 items
drwxr-xr-x - root supergroup 0 2024-06-23 09:28 /input
drwxr-xr-x - root supergroup 0 2024-06-23 09:30 /ooresult
drwx------ - root supergroup 0 2024-06-23 09:29 /tmp
After creating the table:
[root@nnode1 hadoop]# ./bin/hadoop fs -ls /
Found 4 items
drwxr-xr-x - root supergroup 0 2024-06-23 09:28 /input
drwxr-xr-x - root supergroup 0 2024-06-23 09:30 /ooresult
drwx------ - root supergroup 0 2024-06-23 20:44 /tmp
drwxr-xr-x - root supergroup 0 2024-06-23 20:45 /user
[root@nnode1 hadoop]# ./bin/hadoop fs -ls /user
Found 1 items
drwxr-xr-x - root supergroup 0 2024-06-23 20:45 /user/hive
[root@nnode1 hadoop]# ./bin/hadoop fs -ls /user/hive
Found 1 items
drwxr-xr-x - root supergroup 0 2024-06-23 20:45 /user/hive/warehouse
[root@nnode1 hadoop]# ./bin/hadoop fs -ls /user/hive/warehouse
Found 1 items
drwxr-xr-x - root supergroup 0 2024-06-23 20:56 /user/hive/warehouse/stu
[root@nnode1 hadoop]# ./bin/hadoop fs -ls /user/hive/warehouse/stu
Found 1 items
-rw-r--r-- 2 root supergroup 5 2024-06-23 20:55 /user/hive/warehouse/stu/000000_0
Delete the test data
[root@nnode1 hadoop]# ./bin/hadoop fs -rm -r /user
[root@nnode1 hadoop]# rm -fr metastore_db/
Install MySQL
# import the updated GPG key
rpm --import https://repo.mysql.com/RPM-GPG-KEY-mysql-2022
# install the MySQL yum repository
rpm -Uvh http://repo.mysql.com/mysql57-community-release-el7-7.noarch.rpm
# remove mariadb
yum remove mariadb*
rm -f /etc/my.cnf
rm -rf /var/lib/mysql
# install MySQL
yum -y install mysql-community-server
# enable and start MySQL
systemctl enable --now mysqld
# check MySQL status
systemctl status mysqld
# get the initial root password
grep 'password' /var/log/mysqld.log
If no password can be found, configure the following in my.cnf (note: my.cnf entries take no trailing semicolons):
[mysqld]
skip-grant-tables
validate_password_policy=LOW    # lowest password-policy level
validate_password_length=6      # minimum password length
systemctl restart mysqld
# change the root password
mysql> ALTER USER 'root'@'localhost' IDENTIFIED BY 'Root!@#$11111';
Query OK, 0 rows affected (0.00 sec)
mysql> set global validate_password_policy=LOW;
Query OK, 0 rows affected (0.00 sec)
mysql>
mysql> set global validate_password_length=6;
Query OK, 0 rows affected (0.00 sec)
mysql> FLUSH PRIVILEGES;
mysql>
mysql> ALTER USER 'root'@'localhost' IDENTIFIED BY '123456';
Query OK, 0 rows affected (0.00 sec)
mysql> grant all privileges on *.* to root@"dnode3" identified by '123456' with grant option;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> FLUSH PRIVILEGES;
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE DATABASE hive CHARSET UTF8;
Query OK, 1 row affected (0.00 sec)
Configure and start Hive
cd /usr/local/hive/conf/
cp hive-default.xml.template hive-default.xml
cat > hive-site.xml <<EOF
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://node1:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
<property>
<name>hive.server2.thrift.bind.host</name>
<value>dnode3</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://dnode3:9083</value>
</property>
</configuration>
EOF
Download the MySQL JDBC driver
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.34/mysql-connector-java-5.1.34.jar
mv mysql-connector-java-5.1.34.jar /usr/local/hive/lib/
Initialize the schema
[root@dnode3 hive]# bin/schematool -initSchema -dbType mysql -verbose
The metastore service must be started before using Hive
[root@dnode3 hive]# mkdir logs
Start the service in the background
[root@dnode3 hive]# nohup bin/hive --service metastore >> logs/metastore.log 2>&1 &
Start Hive and test
[root@dnode3 hive]# ./bin/hive
hive> show databases;
OK
default
Time taken: 0.816 seconds, Fetched: 1 row(s)
hive>create table stu(id int, name string);
hive> insert into table stu(id,name) values(1,"ss");
hive> select * from stu;
OK
1 ss
Time taken: 0.426 seconds, Fetched: 1 row(s)
[root@nnode1 hadoop]# ./bin/hadoop fs -ls /user/hive/warehouse/stu
Found 1 items
-rw-r--r-- 2 root supergroup 5 2024-06-23 23:21 /user/hive/warehouse/stu/000000_0
Deploying HiveServer2
Besides the bin/hive CLI, you can run HiveServer2 so that other clients can connect over JDBC.
<!-- add the following configuration to hive-site.xml -->
<!-- port that hiveserver2 listens on -->
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
</property>
<!-- allow multiple users to access and modify Hive tables concurrently -->
<property>
<name>hive.support.concurrency</name>
<value>true</value>
</property>
<!-- enforce bucketing on insert, which can improve some query performance -->
<property>
<name>hive.enforce.bucketing</name>
<value>true</value>
</property>
<!-- dynamic-partition mode; nonstrict allows all partition columns to be determined dynamically -->
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
</property>
<!-- transaction manager that supports ACID (Atomicity, Consistency, Isolation, Durability) -->
<property>
<name>hive.txn.manager</name>
<value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<!-- run the compaction initiator on this instance -->
<property>
<name>hive.compactor.initiator.on</name>
<value>true</value>
</property>
<!-- number of worker threads used by the compactor -->
<property>
<name>hive.compactor.worker.threads</name>
<value>1</value>
</property>
nohup bin/hive --service metastore >> logs/metastore.log 2>&1 &
nohup bin/hive --service hiveserver2 >> logs/hiveserver2.log 2>&1 &
netstat -anp|grep 10000
First test with beeline; an error appears
[root@dnode3 hive]# ./bin/beeline
Beeline version 3.1.3 by Apache Hive
beeline> !connect jdbc:hive2://dnode3:10000
Connecting to jdbc:hive2://dnode3:10000
Enter username for jdbc:hive2://dnode3:10000: root
Enter password for jdbc:hive2://dnode3:10000: ******
24/06/24 09:49:08 [main]: WARN jdbc.HiveConnection: Failed to connect to dnode3:10000
Error: Could not open client transport with JDBC Uri: jdbc:hive2://dnode3:10000: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: root is not allowed to impersonate root (state=08S01,code=0)
beeline>
Add proxy-user configuration
[root@nnode1 hadoop]# vim etc/hadoop/core-site.xml
<!-- allow the root proxy user from any host -->
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<!-- allow root to proxy members of any group -->
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
<!-- allow root to proxy any user -->
<property>
<name>hadoop.proxyuser.root.users</name>
<value>*</value>
</property>
# stop the cluster
[root@nnode1 hadoop]# ./sbin/stop-all.sh
# distribute the configuration
[root@nnode1 hadoop]# for i in nnode2 dnode{1..3};do rsync -aXSH --delete /usr/local/hadoop ${i}:/usr/local/ & done
[1] 19462
[2] 19463
[3] 19464
[4] 19465
# start the cluster
[root@nnode1 hadoop]# ./sbin/start-all.sh
[root@dnode3 hive]# ./bin/beeline
Beeline version 3.1.3 by Apache Hive
beeline> !connect jdbc:hive2://dnode3:10000
Connecting to jdbc:hive2://dnode3:10000
Enter username for jdbc:hive2://dnode3:10000: root
Enter password for jdbc:hive2://dnode3:10000: (any value works; it can be empty or arbitrary)
Connected to: Apache Hive (version 3.1.3)
Driver: Hive JDBC (version 3.1.3)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://dnode3:10000>
0: jdbc:hive2://dnode3:10000> show databases;
INFO : Compiling command(queryId=root_20240624102823_80ba88cd-5a25-42f2-9870-d04aace4f352): show databases
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:database_name, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=root_20240624102823_80ba88cd-5a25-42f2-9870-d04aace4f352); Time taken: 1.12 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=root_20240624102823_80ba88cd-5a25-42f2-9870-d04aace4f352): show databases
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=root_20240624102823_80ba88cd-5a25-42f2-9870-d04aace4f352); Time taken: 0.039 seconds
INFO : OK
INFO : Concurrency mode is disabled, not creating a lock manager
+----------------+
| database_name |
+----------------+
| default |
+----------------+
1 row selected (1.773 seconds)
Third-party Hive clients
Third-party clients such as DataGrip, DBeaver, and SQuirreL SQL Client can be used.
When creating a new connection for the first time, the client prompts you to download the driver; just click download.
[root@dnode3 hive]# ./bin/hive
hive> select * from stu;
OK
1 ss
2 DBeaver
Time taken: 2.799 seconds, Fetched: 2 row(s)
Hive SQL operations
Hive tables cannot be modified through a third-party client
This is because tables are not transactional by default. In Hive, non-transactional tables do not support row-level update or delete operations.
Hive 0.13 and later introduced ACID (atomicity, consistency, isolation, durability) transaction support, allowing row-level update, delete, and merge on Hive tables. This requires the table to be created in a specific way: with a transactional storage format (such as ORC) and table properties that enable transactions.
In the client's SQL editor, enter the following to create a transactional table, then run it:
CREATE TABLE default.sstu (
id INT,
name STRING
)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
Now the data can be modified.
Because the data itself is stored on HDFS, changes take effect with some delay.
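As a sketch (the row values are hypothetical), row-level changes on the transactional table then look like ordinary SQL; they rely on the DbTxnManager and concurrency settings configured earlier:

```sql
-- Row-level DML only works on transactional (ACID) tables such as default.sstu
INSERT INTO default.sstu VALUES (1, 'ss');

-- Update a single row in place
UPDATE default.sstu SET name = 'tt' WHERE id = 1;

-- Delete rows matching a predicate
DELETE FROM default.sstu WHERE id = 1;
```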
Hive data types
- Primitive types:
TINYINT: 1-byte signed integer
SMALLINT: 2-byte signed integer
INT: 4-byte signed integer
BIGINT: 8-byte signed integer
BOOLEAN: boolean, TRUE/FALSE
FLOAT: single-precision floating point
DOUBLE: double-precision floating point
STRING: character string
BINARY: byte array
TIMESTAMP: timestamp
DATE: date
DECIMAL: arbitrary-precision decimal; for example, DECIMAL(10,0) is a decimal with no fractional part and at most 10 digits
- Complex types:
ARRAY<T>: array of elements of type T
MAP<K,V>: map where K and V are the key and value types
STRUCT<S1:T1, S2:T2, ...>: a struct where each S is a field name and T its type; for example, STRUCT<name:STRING, age:INT> is a struct with name and age fields
UNIONTYPE<T1, T2, ...>: a union type whose value can be any one of several types; it is rarely used in practice because Hive's support for it is incomplete (many operators and SerDes cannot handle it)
- User-defined functions (UDF, UDAF, UDTF):
These are not data types in the strict sense, but Hive lets users define their own functions: user-defined functions (UDF), user-defined aggregate functions (UDAF), and user-defined table-generating functions (UDTF). They extend Hive to handle more complex data or logic.
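As a sketch (the table and column names below are hypothetical), complex-type fields are declared in DDL and accessed with array indexing, map keys, and dot notation:

```sql
-- Hypothetical table exercising the complex types above
CREATE TABLE type_demo (
  tags    ARRAY<STRING>,
  scores  MAP<STRING, INT>,
  addr    STRUCT<city:STRING, zip:INT>
);

-- Element access: array index, map key, struct field
SELECT tags[0],            -- first array element
       scores['math'],     -- value for key 'math'
       addr.city           -- struct field
FROM type_demo;
```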
Database operations
1. Create a database
create database if not exists hive;
2. List databases
show databases;
3. Filter databases whose names start with h
show databases like 'h.*';
4. Switch to a database
use hive;
5. Drop a database that contains no tables
drop database if exists hive;
6. Drop a database together with its tables
drop database if exists hive cascade;
Table-level operations
Hive tables fall into the following types:
- Managed table (internal table)
- When a table is created as a managed table, Hive fully manages both its data and its metadata.
- The data is normally stored under Hive's default warehouse directory, set by the hive.metastore.warehouse.dir property.
- Dropping a managed table deletes both its data and its metadata.
- Managed tables are created with a plain CREATE TABLE statement, much like in MySQL.
- External table
- An external table lets Hive query data stored on HDFS (or another Hadoop-supported filesystem) without managing that data.
- Dropping an external table removes only the metadata; the data is left untouched.
- External tables are created with the EXTERNAL keyword plus the data's storage location, and suit scenarios where the data is managed by another system or process but needs to be queried from Hive.
- Partitioned table
- A partitioned table splits its data into multiple directories based on the values of one or more columns.
- Each partition corresponds to one HDFS directory and can be managed and queried separately.
- Partitioning can improve query efficiency, because Hive can skip scanning irrelevant partitions.
- Partition columns are declared with the PARTITIONED BY clause.
- Temporary table
- A temporary table exists only in the current session and is dropped automatically when the session ends.
- It is created with the TEMPORARY keyword and is typically used for testing or for holding intermediate results.
- Bucketed table
- A bucketed table divides data at a finer granularity, assigning rows to a fixed number of buckets.
- Bucketing can speed up certain queries, such as sampling and map-side joins.
- Bucket columns are declared with the CLUSTERED BY clause, and INTO specifies the number of buckets.
- View
- A view is a virtual table based on the result set of a SQL query.
- A view stores no data itself; it only saves the query definition.
- Views can simplify complex queries and restrict access to the underlying data.
Each table type has its own uses and characteristics; choose whichever fits your needs.
1. View a table's structure
describe usr;   -- or the abbreviation: desc usr;
2. Create a managed (internal) table
create table if not exists hive.usr(
name string comment 'username',
pwd string comment 'password',
address struct<street:string,city:string,state:string,zip:int> comment 'home address',
identify map<int,tinyint> comment 'number,sex'
) comment 'description of the table'
tblproperties('creator'='me','time'='2016.1.1');
3. Create an external table
create external table if not exists usr2(
name string,
pwd string,
address struct<street:string,city:string,state:string,zip:int>,
identify map<int,tinyint>)
row format delimited fields terminated by ','
location '/tmp/hivedata/usr';
row format delimited fields terminated by '\t' means the fields are separated by \t.
By default, Hive uses ^A (the character with ASCII code 1, also called Ctrl+A or SOH, octal '\001') as the column delimiter. This is because Hive's default column serializer/deserializer (SerDe), LazySimpleSerDe, uses this special character to separate fields.
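To make that default explicit (a sketch; the table name is hypothetical), the '\001' delimiter can be declared directly in the DDL:

```sql
-- Equivalent to Hive's default: fields separated by ^A (octal \001)
CREATE TABLE delim_demo (
  name STRING,
  pwd  STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001';
```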
mkdir /tmp/hivedata/ -p
cat >/tmp/hivedata/usr<<EOF
John Doe,password123,{"street":"123 Main St","city":"Anytown","state":"CA","zip":12345},{1:1, 2:0, 3:1}
Jane Smith,janepass456,{"street":"456 Elm St","city":"Othertown","state":"NY","zip":67890},{4:0, 5:1, 6:0}
Bob Johnson,bobpass789,{"street":"789 Oak St","city":"Sometown","state":"TX","zip":98765},{7:1, 8:0, 9:1}
EOF
/usr/local/hadoop/bin/hadoop fs -put /tmp/hivedata/usr /tmp/hivedata/usr
4. Managed and external tables can be converted into each other
alter table usr2 set tblproperties('EXTERNAL'='TRUE');
alter table usr2 set tblproperties('EXTERNAL'='FALSE');
5. Create a partitioned table
A partitioned table stores its data by partition yet still behaves as a single table; it can be partitioned at multiple levels by multiple columns.
CREATE TABLE IF NOT EXISTS hive.usr3 (
name STRING,
pwd STRING,
address STRUCT<street:STRING,city:STRING,state:STRING,zip:INT>,
identify MAP<INT,TINYINT>
)
PARTITIONED BY (city STRING, state STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ;
cat >/tmp/hivedata/usr<<EOF
John Doe0,password123,{"street":"123 Main St","city":"Anytown","state0":"CA","zip":12345},{1:1, 2:0, 3:1}
John Doe,password123,{"street":"123 Main St","city":"Anytown","state":"CA","zip":12345},{1:1, 2:0, 3:1}
Jane Smith,janepass456,{"street":"456 Elm St","city":"Othertown","state":"NY","zip":67890},{4:0, 5:1, 6:0}
Bob Johnson,bobpass789,{"street":"789 Oak St","city":"Sometown","state":"TX","zip":98765},{7:1, 8:0, 9:1}
Bob Johnson2,bobpass789,{"street":"789 Oak St","city":"Sometown","state":"TX","zip":98765},{7:1, 8:0, 9:1}
EOF
load data local inpath '/tmp/hivedata/usr' into table hive.usr3;
load data with LOCAL loads from the local filesystem into the table; without LOCAL, the path is read from HDFS.
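For contrast (a sketch; the HDFS path assumes the file uploaded earlier with `hadoop fs -put`), the non-LOCAL form reads from HDFS, and the source file is moved into the table directory rather than copied:

```sql
-- Without LOCAL the path is an HDFS path; the source file is moved, not copied
LOAD DATA INPATH '/tmp/hivedata/usr' INTO TABLE hive.usr3;
```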
[root@dnode3 hive]# /usr/local/hadoop/bin/hadoop fs -ls /user/hive/warehouse/hive.db/usr3
Found 4 items
drwxr-xr-x - root supergroup 0 2024-06-24 19:29 /user/hive/warehouse/hive.db/usr3/city=%22state%22%3A%22CA%22
drwxr-xr-x - root supergroup 0 2024-06-24 19:29 /user/hive/warehouse/hive.db/usr3/city=%22state%22%3A%22NY%22
drwxr-xr-x - root supergroup 0 2024-06-24 19:29 /user/hive/warehouse/hive.db/usr3/city=%22state%22%3A%22TX%22
drwxr-xr-x - root supergroup 0 2024-06-24 19:35 /user/hive/warehouse/hive.db/usr3/city=%22state0%22%3A%22CA%22
[root@dnode3 hive]# /usr/local/hadoop/bin/hadoop fs -ls /user/hive/warehouse/hive.db/usr3/city=%22state%22%3A%22CA%22/state=%22zip%22%3A12345}
Found 2 items
-rw-r--r-- 2 root supergroup 60 2024-06-24 19:35 /user/hive/warehouse/hive.db/usr3/city=%22state%22%3A%22CA%22/state=%22zip%22%3A12345}/000000_0
-rw-r--r-- 2 root supergroup 60 2024-06-24 19:29 /user/hive/warehouse/hive.db/usr3/city=%22state%22%3A%22CA%22/state=%22zip%22%3A12345}/000000_2
# add a partition to the table
ALTER TABLE usr3 ADD PARTITION (city='beijing', state='China')
LOCATION '/usr/local/hive/warehouse/usr3/China/beijing';
# drop a partition
ALTER TABLE usr3 DROP PARTITION (city='beijing', state='China');
6. Create a temporary table
CREATE TEMPORARY TABLE hive.usrtmp (
name STRING,
pwd STRING,
address STRUCT<street:STRING,city:STRING,state:STRING,zip:INT>,
identify MAP<INT,TINYINT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
hive> show tables;
OK
usr2
usr3
usrtmp
Time taken: 0.051 seconds, Fetched: 3 row(s)
hive> (exit the session; the temporary table disappears once the session ends)
[root@dnode3 hive]# ./bin/hive
hive> use hive;
OK
Time taken: 0.616 seconds
hive> show tables;
OK
usr2
usr3
Time taken: 0.162 seconds, Fetched: 2 row(s)
7. Bucketed tables
How bucketing works: rows are distributed across the bucket files by hashing the value of a chosen column (the bucketing column) and taking it modulo the number of buckets, which spreads the data evenly. The bucketing column must be an existing column of the table.
A bucketed table is created with the CLUSTERED BY clause; INTO num BUCKETS sets the number of buckets, often chosen as an even number or a power of two.
Inserts into bucketed tables, and some queries against them, can be comparatively slow.
cat >/tmp/hivedata/usr<<EOF
1,"zs"
2,"zs"
3,"zs"
4,"zs"
5,"zs"
6,"zs"
EOF
CREATE TABLE bucketed_table (
id INT,
name STRING
)
CLUSTERED BY (id)
INTO 4 BUCKETS -- number of buckets; often an even number or a power of two
ROW FORMAT DELIMITED
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ;
load data local inpath '/tmp/hivedata/usr' into table hive.bucketed_table;
[root@dnode3 hive]# /usr/local/hadoop/bin/hadoop fs -ls /user/hive/warehouse/hive.db/bucketed_table/
Found 4 items
-rw-r--r-- 2 root supergroup 7 2024-06-24 20:26 /user/hive/warehouse/hive.db/bucketed_table/000000_0
-rw-r--r-- 2 root supergroup 14 2024-06-24 20:26 /user/hive/warehouse/hive.db/bucketed_table/000001_0
-rw-r--r-- 2 root supergroup 14 2024-06-24 20:26 /user/hive/warehouse/hive.db/bucketed_table/000002_0
-rw-r--r-- 2 root supergroup 7 2024-06-24 20:26 /user/hive/warehouse/hive.db/bucketed_table/000003_0
[root@dnode3 hive]# /usr/local/hadoop/bin/hadoop fs -cat /user/hive/warehouse/hive.db/bucketed_table/000000_0
4,"zs"
[root@dnode3 hive]# /usr/local/hadoop/bin/hadoop fs -cat /user/hive/warehouse/hive.db/bucketed_table/000001_0
5,"zs"
1,"zs"
[root@dnode3 hive]# /usr/local/hadoop/bin/hadoop fs -cat /user/hive/warehouse/hive.db/bucketed_table/000002_0
6,"zs"
2,"zs"
[root@dnode3 hive]# /usr/local/hadoop/bin/hadoop fs -cat /user/hive/warehouse/hive.db/bucketed_table/000003_0
3,"zs"
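The distribution above follows from the hash-mod rule: for an INT column, Hive's hash() is effectively the value itself, so each row lands in bucket id % 4 (4 goes to bucket 0, 5 and 1 to bucket 1, 6 and 2 to bucket 2, 3 to bucket 3), matching the file listing. As a sketch, this can be checked from SQL:

```sql
-- pmod(hash(id), 4) reproduces the bucket index used when the table was written
SELECT id, pmod(hash(id), 4) AS bucket FROM bucketed_table;
```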
8. Create a view
CREATE VIEW [IF NOT EXISTS] [db_name.]view_name
[(column_name [COMMENT column_comment], ...)]
[COMMENT view_comment]
[TBLPROPERTIES (property_name=property_value, ...)]
AS SELECT ...;
CREATE VIEW IF NOT EXISTS hive.viewtest
AS
SELECT usr2.name, usr3.city
FROM usr2
JOIN usr3 ON usr2.name = usr3.name;
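A view is queried like any table, and dropping it does not touch the underlying data (a sketch):

```sql
-- The join in the view definition runs each time the view is queried
SELECT * FROM hive.viewtest;

-- Dropping a view removes only its definition
DROP VIEW IF EXISTS hive.viewtest;
```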
9. Modify table properties
ALTER TABLE usr2
SET TBLPROPERTIES (
'owner'='shadow',
'description'='This is a sample table',
'created_date'='2023-10-23'
);
hive> DESCRIBE FORMATTED usr2;
OK
# col_name data_type comment
name string
pwd string
address struct<street:string,city:string,state:string,zip:int>
identify map<int,tinyint>
# Detailed Table Information
Database: hive
OwnerType: USER
Owner: root
CreateTime: Mon Jun 24 19:03:50 CST 2024
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://nngroup/tmp/hivedata/usr
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
bucketing_version 2
created_date 2023-10-23
description This is a sample table
last_modified_by root
last_modified_time 1719233095
owner shadow
transient_lastDdlTime 1719233095
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim ,
serialization.format ,
Time taken: 0.172 seconds, Fetched: 36 row(s)
10. Rename a table
alter table usr2 rename to usr;
11. Create a new table by copying an existing table's schema
create table if not exists hive.usr1 like hive.usr;
12. Drop a table
drop table if exists usr1;
Operations within a table
1. Add a column
alter table usr add columns(hobby string);
2. Modify a column
Note: when the column already contains data, the new type must be compatible with the old one; for example, INT can be changed to STRING.
ALTER TABLE usr CHANGE COLUMN hobby hobby0 STRING;
ALTER TABLE bucketed_table CHANGE COLUMN id ids STRING;
3. Load data into a table
LOCAL means the path is a local file rather than HDFS; INTO appends, while OVERWRITE replaces all existing data.
load data local inpath '/tmp/hivedata/usr' into table hive.usr3;
load data local inpath '/tmp/hivedata/usr' overwrite into table hive.usr3;
Create a table from a query and insert the data:
create table usr4 as select name,pwd from usr3;
OVERWRITE replaces all data in the target table:
insert overwrite table usr4 select name,pwd from usr;
4. Export data
Copy the table files directly:
/usr/local/hadoop/bin/hadoop fs -get /user/hive/warehouse/hive.db/usr4/000000_0 /tmp/hivedata/
LOCAL means the destination is the local filesystem:
insert overwrite local directory '/tmp/hivedata/usr4' select name,pwd from usr4;
Hive-specific query operations
CASE WHEN ... ELSE processes a single column:
SELECT
CASE
WHEN name = "John Doe" THEN 'first'
WHEN name = "Jane Smith" THEN 'second'
ELSE 'third'
END AS category
FROM usr4;
OK
first
second
third
Time taken: 0.234 seconds, Fetched: 3 row(s)
References
- Big Data Technology Principles and Applications, Chapter 14: Hive, a Hadoop-based data warehouse — study guide, XMU Database Lab blog (xmu.edu.cn)
- Hadoop: YARN, MapReduce, and Hive operations — CSDN blog