Introduction to Hadoop
Hadoop is a distributed-systems infrastructure developed by the Apache Software Foundation, used primarily to store and analyze massive data sets. A detailed introduction follows:
I. Origin and Background
Hadoop originated from the Apache Nutch project, which began in 2002 as a subproject of Apache Lucene. Inspired by Google's MapReduce paper, Doug Cutting and others set out to implement a MapReduce computation framework and combined it with NDFS (Nutch Distributed File System) to support the core algorithms of the Nutch engine. Because NDFS and MapReduce performed so well within Nutch, they were split out in February 2006 into a complete, independent piece of software named Hadoop. By early 2008, Hadoop had become a top-level Apache project and was widely used by many Internet companies, including Yahoo.
II. Core Components
Hadoop's core design consists of two main components:
- HDFS (Hadoop Distributed File System): a distributed file system with high fault tolerance, designed to be deployed on inexpensive hardware. HDFS provides high-throughput access to application data and suits applications with very large data sets. By relaxing certain POSIX requirements, it allows data in the file system to be accessed as a stream.
- MapReduce: a distributed programming framework for parallel computation across a cluster of servers. MapReduce breaks a large data-processing job into many small tasks and processes them in parallel across the cluster, speeding up processing (see the shell analogy below).
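As a rough intuition for the map/shuffle/reduce flow, the classic word count can be sketched with ordinary shell tools (an illustrative analogy only; this is not how Hadoop executes jobs):

```bash
# map: split each line into one word per line
# shuffle/sort: group identical words together
# reduce: count the occurrences of each word
cat input.txt | tr -s '[:space:]' '\n' | sort | uniq -c
```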
III. Features and Advantages
Hadoop has several notable features and advantages:
- High reliability: Hadoop keeps multiple replicas of data at the storage layer, so the failure of a single compute or storage element does not cause data loss.
- High scalability: Hadoop distributes tasks and data across the cluster and scales easily to thousands of nodes.
- Efficiency: following the MapReduce model, Hadoop works in parallel to speed up task processing.
- High fault tolerance: Hadoop automatically reassigns failed tasks.
- Low cost: Hadoop runs on inexpensive hardware, keeping hardware costs down.
Commonly used Hadoop components:

Component | Description
---|---
HDFS (Hadoop Distributed File System) | Hadoop's distributed file system; stores large data sets and provides high-throughput data access.
MapReduce | Distributed computing framework; lets developers process large data sets as key-value pairs and run distributed computations on large clusters.
YARN (Yet Another Resource Negotiator) | Hadoop's cluster resource management system; manages and schedules cluster resources such as memory and CPU for applications.
Zookeeper | Distributed coordination service; maintains configuration information and naming, and provides distributed synchronization and group services.
HBase | Distributed column-oriented database built on Hadoop; provides a reliable, high-performance, scalable database with real-time reads and writes.
Hive | Data warehouse tool built on Hadoop; allows data querying and analysis with a SQL-like query language.
Sqoop | Data transfer tool; efficiently moves bulk data between Hadoop and structured data stores such as relational databases.
Pig | Dataflow system on Hadoop; processes large data sets with Pig Latin, a procedural query language.
Mahout | Library of data mining algorithms; provides scalable machine learning and data mining for Hadoop.
Flume | Log collection tool; collects, aggregates, and moves large volumes of log data from sources such as web servers and log files into Hadoop.
The hosts used in this deployment:

Host | IP | Role
---|---|---
nnode1 | 192.168.126.21 | NameNode, SecondaryNameNode
dnode1 | 192.168.126.22 | DataNode
dnode2 | 192.168.126.23 | DataNode
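So the nodes can resolve one another by hostname, an /etc/hosts mapping like the following can be added on every node (an assumed step; it is not shown in the original walkthrough):

```bash
# Append hostname mappings on each node (assumed step)
cat >> /etc/hosts <<'EOF'
192.168.126.21 nnode1
192.168.126.22 dnode1
192.168.126.23 dnode2
EOF
```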
Hadoop Environment Preparation
1. OS: CentOS 7
2. Disable the firewall
3. Disable SELinux (steps 2 and 3 are sketched below)
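A minimal sketch of steps 2 and 3 on CentOS 7, assuming a default installation; run these on every node:

```bash
# Stop the firewall and prevent it from starting at boot
systemctl stop firewalld
systemctl disable firewalld

# Switch SELinux to permissive immediately, and disable it permanently
setenforce 0
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
```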
Hadoop Download and Installation
Hadoop depends on a Java environment; installing the OpenJDK development package is sufficient.
yum install -y java-1.8.0-openjdk-devel
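To verify the JDK is in place before proceeding (a verification step assumed here, not in the original):

```bash
java -version
```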
wget https://dlcdn.apache.org/hadoop/common/hadoop-2.10.2/hadoop-2.10.2.tar.gz
[root@nnode1 software]# tar -zxf hadoop-2.10.2.tar.gz -C /usr/local/
[root@nnode1 software]# ls /usr/local/hadoop-2.10.2/
bin etc include lib libexec LICENSE.txt NOTICE.txt README.txt sbin share
[root@nnode1 local]# cd /usr/local/
[root@nnode1 local]# ln -s hadoop-2.10.2 hadoop
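Optionally, the bin and sbin directories can be added to PATH through the symlink, so commands work without full paths (an assumed convenience step, not part of the original walkthrough):

```bash
# Make hadoop and the sbin scripts callable from anywhere (assumed step)
echo 'export PATH=$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin' >> /etc/profile
source /etc/profile
```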
1. bin: executable files
2. etc: configuration files
3. sbin: scripts for starting and stopping services
Hadoop Standalone Deployment
First, locate the installed JRE path (it will be set as JAVA_HOME below):
[root@nnode1 hadoop-2.10.2]# rpm -ql java-1.8.0-openjdk
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.412.b08-1.el7_9.x86_64/jre/bin/policytool
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.412.b08-1.el7_9.x86_64/jre/lib/amd64/libawt_xawt.so
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.412.b08-1.el7_9.x86_64/jre/lib/amd64/libjawt.so
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.412.b08-1.el7_9.x86_64/jre/lib/amd64/libjsoundalsa.so
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.412.b08-1.el7_9.x86_64/jre/lib/amd64/libsplashscreen.so
/usr/share/applications/java-1.8.0-openjdk-1.8.0.412.b08-1.el7_9.x86_64-policytool.desktop
/usr/share/icons/hicolor/16x16/apps/java-1.8.0-openjdk.png
/usr/share/icons/hicolor/24x24/apps/java-1.8.0-openjdk.png
/usr/share/icons/hicolor/32x32/apps/java-1.8.0-openjdk.png
/usr/share/icons/hicolor/48x48/apps/java-1.8.0-openjdk.png
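Alternatively, the same JRE path can be derived in one line (an assumed shortcut, not shown in the original):

```bash
# Resolve the real java binary and strip the /bin/java suffix
readlink -f /usr/bin/java | sed 's|/bin/java$||'
```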
Edit the Hadoop environment script, setting the Java path and the Hadoop configuration directory. The JRE path found above is:
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.412.b08-1.el7_9.x86_64/jre
[root@nnode1 hadoop-2.10.2]# vim etc/hadoop/hadoop-env.sh
export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.412.b08-1.el7_9.x86_64/jre"
export HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop/"
Check the Hadoop version:
[root@nnode1 hadoop]# ./bin/hadoop version
Hadoop 2.10.2
Subversion Unknown -r 965fd380006fa78b2315668fbc7eb432e1d8200f
Compiled by ubuntu on 2022-05-24T22:35Z
Compiled with protoc 2.5.0
From source with checksum d3ab737f7788f05d467784f0a86573fe
This command was run using /usr/local/hadoop-2.10.2/share/hadoop/common/hadoop-common-2.10.2.jar
Counting Frequent Words
[root@nnode1 hadoop]# mkdir testdata
[root@nnode1 hadoop]# cp *.txt testdata/
[root@nnode1 hadoop]# ./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.2.jar wordcount testdata/ testresult
This counts word frequencies across the files in the testdata directory and writes the results to the testresult directory; the output directory is created automatically and must not already exist.
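In this standalone (local) mode the results are written to the local filesystem; a quick way to inspect the most frequent words, assuming the conventional part-r-00000 output file name:

```bash
# List the output files and show the ten most frequent words
ls testresult/
sort -k2 -nr testresult/part-r-00000 | head
```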
List the programs available in the examples jar:
[root@nnode1 hadoop]# ./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.2.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.