Hadoop Setup

2018/06/03 · Big Data · hadoop

Single-node and cluster Hadoop setup

Environment:

OS: CentOS 7  
Java: jdk-8u181-linux-x64  
Packages are stored under /data/  
Archives are extracted to /opt/  
Hadoop: hadoop-2.6.5.tar.gz  
Hive: apache-hive-2.3.4-bin.tar.gz  
The server firewall must be stopped and disabled at boot.

System Java environment variables:

  1. Extract the JDK archive

     tar -zxvf jdk-8u181-linux-x64.tar.gz -C /opt/
    
  2. Configure the environment variables

     vi /etc/profile
     export JAVA_HOME=/opt/jdk1.8.0_181
     export JRE_HOME=${JAVA_HOME}/jre
     export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
     export PATH=${JAVA_HOME}/bin:$PATH
     source /etc/profile
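
     To confirm the shell now picks up this JDK (the version string below is what I would expect for 8u181, not captured output):

     java -version
     # java version "1.8.0_181"
     echo $JAVA_HOME
     # /opt/jdk1.8.0_181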
    

    Set up Hadoop

  3. Extract the Hadoop archive

     tar -zxvf hadoop-2.6.5.tar.gz -C /opt/
    
  4. Configure Hadoop environment variables

     vi /etc/profile
     export HADOOP_HOME=/opt/hadoop-2.6.5
     export PATH=$PATH:$HADOOP_HOME/bin
     source /etc/profile
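
     A quick sanity check that the hadoop command is now on the PATH (the version line is what I would expect, not captured output):

     hadoop version
     # Hadoop 2.6.5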
    
  5. Configure Hadoop
    5.1 Configure hadoop-env.sh

     vi /opt/hadoop-2.6.5/etc/hadoop/hadoop-env.sh
     # The java implementation to use.
     export JAVA_HOME=${JAVA_HOME}
    

    After the change:

     export JAVA_HOME=/opt/jdk1.8.0_181
        
    

    Configure yarn-env.sh the same way:

     export JAVA_HOME=/opt/jdk1.8.0_181
    

    5.2 Configure core-site.xml; hadoop.tmp.dir points into the Hadoop install directory, and the hdfs:// address in fs.defaultFS is the VM's IP

     vi /opt/hadoop-2.6.5/etc/hadoop/core-site.xml
     
     <configuration>
        <property>
             <name>hadoop.tmp.dir</name>
             <value>/opt/hadoop-2.6.5/tmp</value>
             <description>A base for other temporary directories.</description>
         </property>
         <property>
             <name>fs.defaultFS</name>
             <value>hdfs://172.16.161.150:8888</value>
         </property>     
          <property>
             <name>hadoop.proxyuser.root.hosts</name>
             <value>*</value>
          </property>
          <property>
             <name>hadoop.proxyuser.root.groups</name>
             <value>*</value>
          </property> 
               
     </configuration>
    
    

    5.3 Configure hdfs-site.xml

     vi /opt/hadoop-2.6.5/etc/hadoop/hdfs-site.xml
      <configuration>
       <property>
             <name>dfs.replication</name>
             <value>1</value>
         </property>
         <property>
             <name>dfs.namenode.name.dir</name>
             <value>/opt/hadoop-2.6.5/tmp/dfs/name</value>
         </property>
         <property>
             <name>dfs.datanode.data.dir</name>
             <value>/opt/hadoop-2.6.5/tmp/dfs/data</value>
         </property>
    
     </configuration>
    
    
  6. Configure passwordless SSH login  

     ssh-keygen -t rsa   # run this if the machine has no .ssh directory yet
     # copy the local id_rsa.pub to the VM
     ssh-copy-id -i ~/.ssh/id_rsa.pub root@172.16.161.150
     cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
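
     To verify, a remote command should now run without a password prompt:

     ssh root@172.16.161.150 hostname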
    
  7. Start and stop Hadoop

    7.1 Format the NameNode (first run only)

     cd /opt/hadoop-2.6.5
     ./bin/hdfs namenode -format 
    

    7.2 Start

     ./sbin/start-dfs.sh
    

    Open http://172.16.161.150:50070 in a browser.

    On the command line, run: hadoop fs -mkdir /test

    If the newly created directory shows up in the browser, HDFS is working.
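
    For a slightly fuller smoke test, you can put a small local file (any file works; /etc/hosts is just an example) into the new directory and read it back:

     hadoop fs -put /etc/hosts /test/
     hadoop fs -ls /test
     hadoop fs -cat /test/hosts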

    7.3 Stop

     ./sbin/stop-dfs.sh
    
  8. Configure YARN

    8.1 Configure mapred-site.xml

     cd /opt/hadoop-2.6.5/etc/hadoop/
     cp mapred-site.xml.template mapred-site.xml
     vi mapred-site.xml
    
     <configuration>
     <!-- Tell MapReduce to run on YARN -->
         <property>
             <name>mapreduce.framework.name</name>
             <value>yarn</value>
         </property>
     </configuration>
    
    

    8.2 Configure yarn-site.xml

     vi yarn-site.xml
        
     <configuration>
     <!-- Site specific YARN configuration properties -->
        
      <!-- Reducers fetch map output via mapreduce_shuffle -->
         <property>
             <name>yarn.nodemanager.aux-services</name>
             <value>mapreduce_shuffle</value>
         </property>
     </configuration>
    
    

    8.3 Start and stop YARN

     cd /opt/hadoop-2.6.5
     ./sbin/start-yarn.sh  
    

    Open http://172.16.161.150:8088 in a browser.

    Stop:

     ./sbin/stop-yarn.sh 
    
  9. Check the processes

        
     [root@localhost hadoop-2.6.5]# jps
     2032 SecondaryNameNode
     1896 DataNode
     1787 NameNode
     2365 ResourceManager
     2749 Jps
     2638 NodeManager
     [root@localhost hadoop-2.6.5]# 
        
    

Single-node Hadoop deployment is complete.
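
As an optional end-to-end check of HDFS plus YARN together, you can run one of the example jobs that ship with the distribution (the jar path below assumes the standard 2.6.5 layout under the install directory):

     cd /opt/hadoop-2.6.5
     ./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.5.jar pi 2 10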

Install Hive

  1. Extract the Hive archive
    tar -zxvf apache-hive-2.3.4-bin.tar.gz -C /opt/
    

    Configure the Hive environment variables (append to /etc/profile and source it)

    export HIVE_HOME=/opt/apache-hive-2.3.4-bin
    export PATH=$PATH:$HIVE_HOME/bin
    
  2. Configure Hive

    2.1 Create the HDFS directories

      Before creating tables in Hive, create the following HDFS directories and grant them the appropriate permissions:
        
     hdfs dfs -mkdir -p /user/hive/warehouse  
     hdfs dfs -mkdir -p /tmp/hive/  
     hdfs dfs -chmod 777 /user/hive/warehouse  
     hdfs dfs -chmod 777 /tmp/hive 
     # verify the directories exist
     $HADOOP_HOME/bin/hadoop   fs   -ls   /user/hive/ 
        
     mkdir /opt/apache-hive-2.3.4-bin/tmp
     chmod 777 /opt/apache-hive-2.3.4-bin/tmp
    

    2.2 Configure hive-site.xml

     cd /opt/apache-hive-2.3.4-bin/conf
        
     cp hive-default.xml.template hive-site.xml
        
     vi hive-site.xml
        
     <?xml version="1.0" encoding="UTF-8" standalone="no"?>
     <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
     <configuration>
          <property>
             <name>javax.jdo.option.ConnectionURL</name>
             <value>jdbc:mysql://172.16.161.1:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
          </property>
         <property>
             <name>javax.jdo.option.ConnectionUserName</name>
             <value>hive</value>
         </property>
         <property>
             <name>javax.jdo.option.ConnectionPassword</name>
             <value>yuan121423</value>
         </property>
         <property>
             <name>javax.jdo.option.ConnectionDriverName</name>
             <value>com.mysql.jdbc.Driver</value>
         </property>
         <property>
             <name>datanucleus.schema.autoCreateAll</name>
             <value>true</value>
         </property>
         <property>
             <name>hive.metastore.schema.verification</name>
             <value>false</value>
          </property>
         <property>
             <name>hive.server2.authentication</name>
             <value>NOSASL</value>
             <description>
               Expects one of [nosasl, none, ldap, kerberos, pam, custom].
               Client authentication types.
                 NONE: no authentication check
                 LDAP: LDAP/AD based authentication
                 KERBEROS: Kerberos/GSSAPI authentication
                 CUSTOM: Custom authentication provider
                         (Use with property hive.server2.custom.authentication.class)
                 PAM: Pluggable authentication module
                 NOSASL:  Raw transport
             </description>
           </property>
        </configuration>
        #  Replace every ${system:java.io.tmpdir} in hive-site.xml with Hive's local tmp directory created in 2.1 (/opt/apache-hive-2.3.4-bin/tmp); see the sed sketch below.
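
        One way to do this replacement in bulk is sed. This is only a sketch: it assumes the tmp directory created in 2.1, and that the template also contains ${system:user.name} placeholders (replace those with the user Hive runs as, root here):

         cd /opt/apache-hive-2.3.4-bin/conf
         sed -i 's#${system:java.io.tmpdir}#/opt/apache-hive-2.3.4-bin/tmp#g' hive-site.xml
         sed -i 's#${system:user.name}#root#g' hive-site.xml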
           
    

    2.3 Configure hive-env.sh

     cp hive-env.sh.template hive-env.sh
     vi hive-env.sh
        
     # Set HADOOP_HOME to point to a specific hadoop install directory
     # HADOOP_HOME=${bin}/../../hadoop
        
     # Hive Configuration Directory can be controlled by:
     # export HIVE_CONF_DIR=
     
    

    Change these to:

        
         HADOOP_HOME=/opt/hadoop-2.6.5
         export HIVE_CONF_DIR=/opt/apache-hive-2.3.4-bin/conf
    
    

    2.4 Initialize the metastore schema

    schematool -initSchema -dbType mysql
        
     [root@localhost conf]# schematool -initSchema -dbType mysql 
     SLF4J: Class path contains multiple SLF4J bindings.
     SLF4J: Found binding in [jar:file:/opt/apache-hive-2.3.4-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
     SLF4J: Found binding in [jar:file:/opt/hadoop-2.6.5/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
     SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
     SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
     Metastore connection URL:	 jdbc:mysql://172.16.161.1:3306/hive?createDatabaseIfNotExist=true
     Metastore Connection Driver :	 com.mysql.jdbc.Driver
     Metastore connection User:	 hive
     Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
     Starting metastore schema initialization to 2.3.0
     Initialization script hive-schema-2.3.0.mysql.sql
     Initialization script completed
     schemaTool completed
    
      
    
  3. Start the services  

     nohup hive --service metastore & 
     nohup  hive --service hiveserver2 &
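
    Once HiveServer2 is up (it can take a minute), one way to check it is with beeline, which ships with Hive. Because hive.server2.authentication is set to NOSASL above, the JDBC URL needs auth=noSasl:

     beeline -u "jdbc:hive2://172.16.161.150:10000/default;auth=noSasl" -n root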
    

Installation issues

  1. Error:
     Error: Could not open client transport with JDBC Uri: jdbc:hive2://172.16.161.150:10000: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: root is not allowed to impersonate root (state=08S01,code=0)
    

    Fix: add the following proxy-user settings to core-site.xml (they are already included in the core-site.xml shown earlier) and restart HDFS:

       <property>
          <name>hadoop.proxyuser.root.hosts</name>
         <value>*</value>
        </property>
        <property>
         <name>hadoop.proxyuser.root.groups</name>
          <value>*</value>
        </property>
        
    

ZooKeeper setup

ZooKeeper download page

1 Extract

```shell script
tar zxvf apache-zookeeper-3.5.6.tar.gz
```
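
A minimal standalone setup to follow the extraction, assuming the extracted directory is apache-zookeeper-3.5.6 and that it is a runnable (binary) distribution (from 3.5.5 on, the runnable release is the -bin tarball; the plain tar.gz contains only sources):

```shell script
cd apache-zookeeper-3.5.6
# ZooKeeper reads conf/zoo.cfg; the shipped sample is a working standalone config
cp conf/zoo_sample.cfg conf/zoo.cfg
# start the server and check that it is serving
bin/zkServer.sh start
bin/zkServer.sh status
```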



#### Install HBase    

1  Download HBase 

**[HBase download page](https://www-us.apache.org/dist/hbase/)**

> My Hadoop is 2.6.5, so I downloaded HBase 1.3.6    

2  Extract the downloaded file, from the directory where the HBase archive is stored  
```shell script
tar zxvf ./hbase-1.3.6-bin.tar.gz
cd hbase-1.3.6/conf
```

3 Edit hbase-env.sh

vi hbase-env.sh 
# true: use the ZooKeeper bundled with HBase; false: use a standalone ZooKeeper
HBASE_MANAGES_ZK=true

# set JAVA_HOME to the actual JDK path
JAVA_HOME=/opt/jdk1.8.0_171/

4 Edit hbase-site.xml

vi hbase-site.xml

<configuration>

    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://master:9000/hbase</value>
        <description>hbase.rootdir is the directory shared by the RegionServers, where HBase persists its data. It defaults to a path under /tmp, so data is lost on restart if this is not changed. It is normally an HDFS path: for example, if the NameNode runs on namenode.example.org port 9000, set it to hdfs://namenode.example.org:9000/hbase.
        </description>
    </property>

    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
        <description>Deployment mode of HBase: false means standalone or pseudo-distributed, true means fully distributed.
        </description>
    </property>

    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>master,slave01,slave02</value>
        <description>The hosts of the ZooKeeper ensemble; list the machines that run the ZooKeeper/data nodes (here master, slave01 and slave02).
        </description>
    </property>
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/var/zookeeper</value>
        <description>Where ZooKeeper stores its data; if unset it defaults to /tmp and the data is lost on restart.
        </description>
    </property>

    <property>
        <name>hbase.master</name>
        <value>master:60000</value>
        <description>Host and port of the HBase master</description>
    </property>


</configuration>

5 Configure regionservers

```shell script
vi regionservers
# default content
localhost
# replace with
master
slave01
slave02
```


6 Distribute the hbase directory to every node   
```shell script
scp -r hbase-1.3.6 root@slave01:/opt/
scp -r hbase-1.3.6 root@slave02:/opt/
```

7 Start HBase; make sure HDFS is already running first
```shell script
cd /opt/hbase-1.3.6/bin
./start-hbase.sh
```


8 What it looks like after startup   

![master](/images/hadoop/8.png)     
![salve01](/images/hadoop/9.png)    
![salve02](/images/hadoop/10.png)   
![web](/images/hadoop/11.png)   

> Since HBase 0.98.x, the HBase web UI ports have changed from 60010 on the master and 60030 on the RegionServers to 16010 and 16030

9 Configure the HBase environment variables to make it easier to work with    
```shell script
vi /etc/profile

export HBASE_HOME=/opt/hbase-1.3.6
export PATH=$PATH:$HBASE_HOME/bin

source /etc/profile
```
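
A quick smoke test from the HBase shell (the table and column family names here are just examples):

```shell script
hbase shell
# then, inside the shell:
#   status
#   create 'smoke_test', 'cf'
#   put 'smoke_test', 'row1', 'cf:a', 'value1'
#   scan 'smoke_test'
#   disable 'smoke_test'
#   drop 'smoke_test'
#   exit
```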

10 Connect to HBase from Python 3 (start the Thrift server first)
```shell script
./hbase-daemon.sh start thrift
```

![](/images/hadoop/13.png)  
```python
import happybase
# open a connection to the Thrift server
conn = happybase.Connection(host="172.16.161.152", port=9090, timeout=None, autoconnect=True,
                            table_prefix=None, table_prefix_separator=b'_', compat='0.98',
                            transport='buffered', protocol='binary')
# list all tables
tables = conn.tables()
print(tables)
# column families for the new table
families = {
    'user_info': dict(),
    'history': dict()
}
# create the table
conn.create_table('table_name', families)
tables = conn.tables()
print(tables)

# delete the table (it must be disabled first; disable=True does that)
conn.delete_table('table_name', disable=True)

# conn.disable_table('table_name')
# conn.delete_table('table_name')
tables = conn.tables()
print(tables)
```

Distributed cluster setup

Hadoop 2.6.5 on CentOS 7

  1. Prepare the environment. When changing hostnames, do not use _ / or other special characters.
    Hostname    IP                Role
    master      172.16.161.152    NameNode
    slave01     172.16.161.154    DataNode
    slave02     172.16.161.155    DataNode
    
  2. Name resolution and firewall shutdown (all machines)
vi /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
172.16.161.152 master
172.16.161.154 slave01
172.16.161.155 slave02

Disable SELinux
sed -i 's#SELINUX=enforcing#SELINUX=disabled#g' /etc/sysconfig/selinux
setenforce 0

Stop iptables
service iptables stop
chkconfig iptables off 

Check the firewall on every machine and make sure it is off; leaving it on causes many problems later.

Ubuntu (ubuntu-12.04-desktop-amd64)

Check firewall status: ufw status

Disable the firewall: ufw disable

---------------------------------------------------------------


CentOS 6.0

Check firewall status: service iptables status

Disable the firewall: chkconfig iptables off    # do not start the firewall service at boot

--------------------------------------------------------------

CentOS 7.0 (uses firewalld by default; unless you switched to iptables, use the following commands to check and stop it)

Check firewall status: firewall-cmd --state

Stop the firewall: systemctl stop firewalld.service

Disable firewalld at boot: systemctl disable firewalld.service 
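
Once /etc/hosts is in place everywhere, a quick check that the names resolve (run it from each node):

ping -c 1 master
ping -c 1 slave01
ping -c 1 slave02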
  3. Configure the JDK environment variables (on all machines): download the JDK, extract it, and put it under /opt/
vi /etc/profile
export JAVA_HOME=/opt/jdk1.8.0_181
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
source /etc/profile
  4. Configure passwordless SSH login (on all machines); if ssh is not installed, install it first
# Using 172.16.161.152 as the example; do the same on every machine. Note that the first login still asks for the password.
1) ssh-keygen
2) copy the public key to the other two VMs
   ssh-copy-id -i .ssh/id_rsa.pub root@172.16.161.154
   ssh-copy-id -i .ssh/id_rsa.pub root@172.16.161.155
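
To confirm the keys took effect, each of these should print the remote hostname without asking for a password:

   ssh root@172.16.161.154 hostname
   ssh root@172.16.161.155 hostname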
  5. Download hadoop-2.6.5 and extract it to /opt/
/opt/hadoop-2.6.5
  6. Edit hadoop-env.sh
cd /opt/hadoop-2.6.5/etc/hadoop/
vi hadoop-env.sh 
export JAVA_HOME=${JAVA_HOME}
Change it to: export JAVA_HOME=/opt/jdk1.8.0_181   # set to your actual JDK directory
  7. Edit the slaves file and add the DataNode hostnames
vi slaves
slave01
slave02
  8. Edit the masters file and add the NameNode hostname
vi masters
master            
  9. Edit core-site.xml
vim core-site.xml    # contents as follows
 
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/hadoop-2.6.5/tmp</value>
    </property>
     <property>  
        <name>io.file.buffer.size</name>  
        <value>131072</value>  
    </property>  
    <property>
        <name>fs.default.name</name>
        <value>hdfs://172.16.161.152:9000</value>
    </property>
</configuration>
  10. Edit hdfs-site.xml
vi hdfs-site.xml
<configuration>
    <property>
        <name>dfs.name.dir</name>
        <value>file:/opt/hadoop-2.6.5/dfs/name</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>file:/opt/hadoop-2.6.5/dfs/data</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
</configuration>

  11. Edit mapred-site.xml (copy it from mapred-site.xml.template first if the file does not exist)
vi mapred-site.xml
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>

  12. Edit yarn-site.xml
vi yarn-site.xml

<property>
 <name>yarn.resourcemanager.address</name>
   <value>master:18040</value>
 </property>
 <property>
   <name>yarn.resourcemanager.scheduler.address</name>
   <value>master:18030</value>
 </property>
 <property>
   <name>yarn.resourcemanager.webapp.address</name>
   <value>master:18088</value>
 </property>
 <property>
   <name>yarn.resourcemanager.resource-tracker.address</name>
   <value>master:18025</value>
 </property>
 <property>
   <name>yarn.resourcemanager.admin.address</name>
   <value>master:18141</value>
 </property>
 <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
 </property>

  13. Copy the configured Hadoop to the other nodes
scp -rp /opt/ root@slave01:/
scp -rp /opt/ root@slave02:/
  14. Initialize the Hadoop cluster (formatting HDFS is only needed on the master/NameNode)
cd /opt/hadoop-2.6.5/bin
./hadoop namenode -format
  15. Start the Hadoop cluster
cd /opt/hadoop-2.6.5/sbin 
./start-all.sh
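
After start-all.sh finishes, jps should show roughly the following daemons (this is the expected result for the configuration above, not captured output):

# on master
jps    # NameNode, SecondaryNameNode, ResourceManager
# on slave01 / slave02
jps    # DataNode, NodeManager
# confirm that both DataNodes have registered
cd /opt/hadoop-2.6.5
./bin/hdfs dfsadmin -report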

16. On a hand-built cluster, if you plan to run a lot of MR jobs you need to configure resources, otherwise you will get errors like: is running beyond virtual memory limits. Current usage: 142.0 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container.

yarn-site.xml

<!-- total memory YARN may allocate on this node: 6 GB -->
<property>
	<name>yarn.nodemanager.resource.memory-mb</name> 
		<value>6144</value>
</property>

<!-- 1 GB per container (minimum allocation) -->
<property>
	<name>yarn.scheduler.minimum-allocation-mb</name>
		<value>1024</value>
</property>

<!-- virtual memory cap = 4 * 0.7 = 2.8 GB -->
<property>
	<name>yarn.nodemanager.vmem-pmem-ratio</name>
		<value>0.7</value>
</property>




mapred-site.xml

<!-- physical RAM allocated per map task = 4 GB -->
<property>
	<name> mapreduce.map.memory.mb</name>
		<value>4096</value>
</property>
<!-- physical RAM allocated per reduce task = 4 GB -->
<property>
	<name>mapreduce.reduce.memory.mb</name>
		<value>4096</value>
</property>

<!-- the JVM heap size should be set lower than the map/reduce memory defined above -->
<property>
	<name>mapreduce.map.java.opts</name>
		<value>-Xmx3072m</value>
</property>

<property>
	<name>mapreduce.reduce.java.opts</name>
		<value>-Xmx3072m</value>
</property>
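
These settings only take effect after the changed files are copied to every node and YARN is restarted; a sketch using the paths from this setup:

scp /opt/hadoop-2.6.5/etc/hadoop/{yarn-site.xml,mapred-site.xml} root@slave01:/opt/hadoop-2.6.5/etc/hadoop/
scp /opt/hadoop-2.6.5/etc/hadoop/{yarn-site.xml,mapred-site.xml} root@slave02:/opt/hadoop-2.6.5/etc/hadoop/
cd /opt/hadoop-2.6.5/sbin
./stop-yarn.sh && ./start-yarn.sh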

ERROR

  1. put: Call From master/172.16.161.152 to master:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
vi core-site.xml
<property>
        <name>fs.defaultFS</name>
        <value>hdfs://172.16.161.152:9000</value>
    </property>
  2. java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
Make sure every machine can reach the others by hostname.
Check 1: /etc/hosts on every machine
Check 2: the slaves and masters files

To run Hadoop under a dedicated account instead of root, create a hadoop group and user and give it sudo rights:

groupadd hadoop
useradd -g hadoop hadoop

vi /etc/sudoers
hadoop  ALL=(ALL:ALL)  ALL
