Netdisk download link: https://pan.baidu.com/s/1YhiGBudtYMp_CdGm_x7ORQ (extraction code: 4p6r)
Link: https://pan.baidu.com/s/19qWnP6LQ-cHVrvT0o1jTMg (password: 44hs)
https://pan.baidu.com/s/1Oti-_WVGLmKiRWNO0n-BsA (extraction code: 8iaa)
Big data mind map: https://naotu.baidu.com/file/afa7f9a64e22a23dfc237395cf1eea53
Installation guide: http://dblab.xmu.edu.cn/blog/install-hadoop/
Commands used in each step:
2. Install the relational database MySQL
sudo apt-get update
sudo apt-get install mysql-server
sudo netstat -tap | grep mysql
service mysql stop
service mysql start
mysql -u root -p
show databases;
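The Hive configuration later in these notes connects to MySQL as user hive with password hive; a minimal preparation sketch (the database and user names are assumptions chosen to match that hive-site.xml):
mysql -u root -p
# then, inside the mysql shell:
# create database hive;
# create user 'hive'@'localhost' identified by 'hive';
# grant all privileges on hive.* to 'hive'@'localhost';
# flush privileges;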
3. Install the big data processing framework Hadoop
Create a Hadoop user
sudo useradd -m hadoop -s /bin/bash
sudo passwd hadoop
sudo adduser hadoop sudo   # give the hadoop user sudo privileges
su hadoop   # switch to the hadoop user
Configure SSH login
sudo apt-get install openssh-server
ssh localhost
exit
cd ~/.ssh/
ssh-keygen -t rsa
cat ./id_rsa.pub >> ./authorized_keys
ssh localhost   # should now log in without a password
ps -e | grep ssh
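If ssh localhost still prompts for a password after the key has been appended, permissions on ~/.ssh are a common cause; a hedged fix:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys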
Install the Java environment
sudo apt-get install default-jre default-jdk
gedit ~/.bashrc   # add the following line to ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/default-java
source ~/.bashrc
java -version
Standalone installation and configuration
Download, extract, rename the folder, and fix its ownership; Hadoop is then ready to use
sudo tar -zxf ~/hadoop-2.7.1.tar.gz -C /usr/local
cd /usr/local
sudo mv ./hadoop-2.7.1 ./hadoop
sudo chown -R hadoop:hadoop ./hadoop
Check the Hadoop version
cd /usr/local/hadoop
./bin/hadoop version
cd /usr/local/hadoop
mkdir ./input
cp ./etc/hadoop/*.xml ./input
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep ./input/ ./output 'dfs[a-z.]+'
cat ./output/*
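Note that Hadoop refuses to run if the output directory already exists, so remove ./output before re-running the example:
rm -r ./output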
Pseudo-distributed installation and configuration
Configuration files
cd /usr/local/hadoop
gedit ./etc/hadoop/core-site.xml
gedit ./etc/hadoop/hdfs-site.xml
./bin/hdfs namenode -format
./sbin/start-dfs.sh
jps
./sbin/stop-dfs.sh
jps
I. Hadoop pseudo-distributed configuration
Hadoop can run on a single node in pseudo-distributed mode: the Hadoop daemons run as separate Java processes, the node acts as both NameNode and DataNode, and files are read from HDFS.
Hadoop's configuration files live in /usr/local/hadoop/etc/hadoop/. Pseudo-distributed mode requires modifying two of them, core-site.xml and hdfs-site.xml. Hadoop configuration files are in XML format.
Edit the configuration file core-site.xml:
Editing with gedit is convenient: gedit /usr/local/hadoop/etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
Edit the configuration file hdfs-site.xml:
gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
</configuration>
After the configuration is complete, format the NameNode:
./bin/hdfs namenode -format
If it succeeds, you will see "successfully formatted" and "Exitting with status 0" in the output.
How Hadoop runs is determined by its configuration files (they are read when Hadoop starts), so to switch from pseudo-distributed mode back to standalone mode, remove the added properties from core-site.xml.
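If the DataNode fails to come up after a second format (a clusterID mismatch in the DataNode log is the usual symptom), one hedged recovery, assuming the tmp directory configured above and accepting that all HDFS data is lost, is:
./sbin/stop-dfs.sh
rm -r /usr/local/hadoop/tmp   # wipes all HDFS data
./bin/hdfs namenode -format
./sbin/start-dfs.sh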
Run a MapReduce job in pseudo-distributed mode:
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input output 'dfs[a-z.]+'
Configure the Hadoop environment variables
gedit ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/default-java
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
source ~/.bashrc
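A quick check that the new variables have taken effect:
hadoop version
which hdfs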
HDFS shell operation example:
Upload a file, count the words, download the result
Start HDFS
jps
cd /usr/local/hadoop
./sbin/start-dfs.sh
jps
Create and list directories
cd bin
hdfs dfs -ls /
hdfs dfs -ls
hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -ls
ls ~
hdfs dfs -help
hdfs dfs -help put
hdfs dfs -mkdir input
hdfs dfs -ls
Upload files
hdfs dfs -put /usr/local/hadoop/etc/hadoop/*.xml input
hdfs dfs -ls input
Run the example job:
hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input/ output 'dfs[a-z.]+'
hdfs dfs -ls output
View the output:
hdfs dfs -cat output/part-r-00000
hdfs dfs -cat output/*
Download files:
hdfs dfs -get output ~/output
View the downloaded local files:
ls ~
ls ~/output
cat ~/output/part-r-00000
Stop HDFS:
cd ..
./sbin/stop-dfs.sh
jps
HDFS Java API and application examples
WriteFile:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

public class WriteFile {
    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000");
            conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
            FileSystem fs = FileSystem.get(conf);
            byte[] buff = "file1".getBytes();   // content to write
            String filename = "file1.txt";      // name of the file to write
            FSDataOutputStream os = fs.create(new Path(filename));
            os.write(buff, 0, buff.length);
            System.out.println("Create:" + filename);
            os.close();
            fs.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
ReadFile:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataInputStream;

public class ReadFile {
    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000");
            conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("file1.txt");
            FSDataInputStream getIt = fs.open(file);
            BufferedReader d = new BufferedReader(new InputStreamReader(getIt));
            String content = d.readLine(); // read one line from the file
            System.out.println(content);
            d.close();  // close the reader
            fs.close(); // close the HDFS handle
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
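A minimal sketch for compiling and running the two classes against the Hadoop classpath (it assumes the .java files are in the current directory and HDFS is running):
javac -cp $(hadoop classpath) WriteFile.java ReadFile.java
java -cp .:$(hadoop classpath) WriteFile
java -cp .:$(hadoop classpath) ReadFile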
----------------------------------------------------------------------------------------------------------------------------------
II. HBase pseudo-distributed configuration
All HBase installation, start, and stop commands:
sudo tar -zxvf ~/hbase-1.1.5.tar.gz -C /usr/local
cd /usr/local
sudo mv ./hbase-1.1.5 ./hbase
sudo chown -R hadoop ./hbase
gedit ~/.bashrc
source ~/.bashrc
gedit /usr/local/hbase/conf/hbase-env.sh
gedit /usr/local/hbase/conf/hbase-site.xml
start-dfs.sh
jps
start-hbase.sh
jps
hbase shell
exit
stop-hbase.sh
stop-dfs.sh
jps
1. Configure /usr/local/hbase/conf/hbase-env.sh. The command is as follows:
gedit /usr/local/hbase/conf/hbase-env.sh
Set JAVA_HOME, HBASE_CLASSPATH, and HBASE_MANAGES_ZK. HBASE_CLASSPATH is set to the conf directory under the local Hadoop installation (i.e. /usr/local/hadoop/conf):
export JAVA_HOME=/usr/lib/jvm/default-java
export HBASE_CLASSPATH=/usr/local/hadoop/conf
export HBASE_MANAGES_ZK=true
2. Configure /usr/local/hbase/conf/hbase-site.xml. The command is as follows:
gedit /usr/local/hbase/conf/hbase-site.xml
If errors are reported, use the following configuration for hbase-site.xml:
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://localhost:9000/hbase</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.unsafe.stream.capability.enforce</name>
        <value>false</value>
    </property>
</configuration>
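Once HBase is up, a quick smoke test in the HBase shell (the table and column family names below are made up for illustration):
start-hbase.sh
hbase shell
# inside the shell:
# create 'test', 'cf'
# put 'test', 'row1', 'cf:a', 'value1'
# scan 'test'
# disable 'test'
# drop 'test'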
------------------------------------------------------
III. Python - MapReduce - WordCount
1 Map phase: mapper.py
#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print("%s\t%s" % (word, 1))
2 Reduce phase: reducer.py
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word

if word == current_word:
    print("%s\t%s" % (current_word, current_count))
3 Local test (cat data | map | sort | reduce)
$ echo "foo foo quux labs foo bar quux" | ./mapper.py
$ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py
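Because the pipelines invoke the scripts directly, they must be executable; if the commands above fail with "Permission denied":
chmod a+x ./mapper.py ./reducer.py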
4 Run the Python code on Hadoop
Download an e-book from www.gutenberg.org
wget http://www.gutenberg.org/files/1342/1342-0.txt
Configure the Hadoop Streaming path
Add the following line to ~/.bashrc:
export STREAM=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar
The hadoop-streaming command
run.sh (in the user's home directory)
hadoop jar $STREAM \
-file /home/hadoop/wc/mapper.py \
-mapper /home/hadoop/wc/mapper.py \
-file /home/hadoop/wc/reducer.py \
-reducer /home/hadoop/wc/reducer.py \
-input /user/hadoop/input/*.txt \
-output /user/hadoop/wcoutput
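A hedged preparation sketch before running run.sh (the paths match the -input/-output settings above; the e-book filename is the one downloaded earlier, and part-00000 is the usual name of the first streaming output file):
hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put ~/1342-0.txt /user/hadoop/input
hdfs dfs -rm -r /user/hadoop/wcoutput   # streaming also refuses to overwrite old output; ignore the error if it does not exist yet
bash ~/run.sh
hdfs dfs -cat /user/hadoop/wcoutput/part-00000 | head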
run.sh (in the Hadoop installation directory)
hadoop jar $STREAM \
-file /usr/local/hadoop/wc/mapper.py \
-mapper /usr/local/hadoop/wc/mapper.py \
-file /usr/local/hadoop/wc/reducer.py \
-reducer /usr/local/hadoop/wc/reducer.py \
-input /user/hadoop/input/*.txt \
-output /user/hadoop/wcoutput
IV. Processing weather data:
The weather dataset can be downloaded from: ftp://ftp.ncdc.noaa.gov/pub/data/noaa
Download: wget -D --accept-regex=REGEX -P data -r -c ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2020/5*
Decompress: zcat data/ftp.ncdc.noaa.gov/pub/data/noaa/2020/5*.gz > qxdata.txt
A sample line:
0230592870999992020010100004+23220+113480FM-12+007299999V0200451N0018199999999011800199+01471+01011102791ADDAA106999999AA224000091AJ199999999999999AY101121AY201121GA1999+009001999GE19MSL +99999+99999GF107991999999009001999999IA2999+01309KA1240M+02331KA2240N+01351MA1999999101931MD1210131+0161MW1001OC100671OD141200671059REMSYN004BUFR
Sample data link: https://pan.baidu.com/s/1WNNki76ok0isQCho-TNN5Q (extraction code: m5tn)
fo = open('qxdata.txt', 'r')
line = fo.readline()
fo.close()
print(line)
print(line[15:23])
print(line[87:92])
fo = open('qxdata.txt', 'r')
lines = fo.readlines()
fo.close()
for line in lines[:10]:
    print(len(line))
    print(line[15:27], line[87:92])
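The same fields can be inspected with shell tools; a hedged sketch (cut is 1-indexed, so -c16-23 and -c88-92 mirror the Python slices line[15:23] and line[87:92]; in the NOAA ISD format these hold the observation date and the air temperature multiplied by 10, with 9999 marking a missing reading):
head -5 qxdata.txt | cut -c16-23
head -5 qxdata.txt | cut -c88-92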
Reference:
Using Hadoop to Analyze Weather Data (complete edition)
https://blog.****.net/qq_39410381/article/details/106367411
--------------------------------------------
Virtual machine image for the big data lab environment
Link: https://pan.baidu.com/s/1fGgk9TuYGVYKp9aR9x1sXQ (extraction code: e6kj)
--------------------------------------------------------------
Install JDK 1.8
Download jdk-8u162-linux-x64.tar.gz
https://pan.baidu.com/s/1Oti-_WVGLmKiRWNO0n-BsA 提取码: 8iaa
https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html
Extract
tar -zxvf jdk-8u162-linux-x64.tar.gz
Move it into place (the tarball extracts to a directory named jdk1.8.0_162)
sudo mkdir -p /usr/lib/jvm
sudo mv jdk1.8.0_162 /usr/lib/jvm/default-java
Check that the installation succeeded
java -version
------------------------------
Hive configuration
3. Edit hive-site.xml under /usr/local/hive/conf:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
        <description>username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hive</value>
        <description>password to use against metastore database</description>
    </property>
</configuration>
-------------------------------------------
Hive WordCount
create table word_counts as
select word, count(1) as count
from (select explode(split(line, ' ')) as word from docs) word
group by word
order by word;
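The query assumes a table docs with a single string column holding one line of text per row; a hedged setup sketch (the local input path is an assumption, e.g. the e-book downloaded at the end of these notes):
hive -e "create table docs (line string);"
hive -e "load data local inpath '/home/hadoop/5000-8.txt' overwrite into table docs;"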
---------------------------------------------
Hive + MySQL + Sqoop
mysql> create table if not exists `wc3` (`word` varchar(100), `count` int) engine=InnoDB DEFAULT CHARSET=utf8;
hive> create table if not exists wc1 row format delimited fields terminated by '\t' as select word, count(1) as count from (select explode(split(line, ' ')) as word from wctext) word group by word order by word;
$ sqoop export --connect jdbc:mysql://127.0.0.1:3306/dblab?useSSL=false --username root --password hadoop --table wc3 --export-dir /user/hive/warehouse/hive.db/wc3 --input-fields-terminated-by '\t'
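Note that --export-dir must point at the warehouse directory backing the Hive table being exported. To verify that the rows reached MySQL (database and table names taken from the commands above):
mysql -u root -p -e 'select * from dblab.wc3 limit 10;'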
-------------------------------------------------
Hive user
pre_deal.sh
#!/bin/bash
infile=$1
outfile=$2
awk -F "," 'BEGIN{
    srand();
    id=0;
    Province[0]="山东";Province[1]="山西";Province[2]="河南";Province[3]="河北";Province[4]="陕西";Province[5]="内蒙古";Province[6]="上海市";
    Province[7]="北京市";Province[8]="重庆市";Province[9]="天津市";Province[10]="福建";Province[11]="广东";Province[12]="广西";Province[13]="云南";
    Province[14]="浙江";Province[15]="贵州";Province[16]="*";Province[17]="*";Province[18]="江西";Province[19]="湖南";Province[20]="湖北";
    Province[21]="黑龙江";Province[22]="吉林";Province[23]="辽宁";Province[24]="江苏";Province[25]="甘肃";Province[26]="青海";Province[27]="四川";
    Province[28]="安徽";Province[29]="宁夏";Province[30]="海南";Province[31]="香港";Province[32]="澳门";Province[33]="*";
}
{
    id=id+1;
    value=int(rand()*34);
    print id"\t"$1"\t"$2"\t"$3"\t"$5"\t"substr($6,1,10)"\t"Province[value]
}' $infile > $outfile
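A usage sketch (the input and output file names are assumptions): the script numbers each record, keeps fields 1, 2, 3, and 5, truncates the timestamp in field 6 to a date, and appends a randomly chosen province.
bash ./pre_deal.sh small_user.csv user_table.txt
head -5 user_table.txt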
Hive user analysis
CREATE EXTERNAL TABLE dblab.bigdata_user(id INT, uid STRING, item_id STRING, behavior_type INT, item_category STRING, visit_date DATE, province STRING) COMMENT 'Welcome to dblab!' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/bigdatacase/dataset';
Query how many non-duplicated records there are
select count(*) from (select uid,item_id,behavior_type,item_category,visit_date,province from bigdata_user group by uid,item_id,behavior_type,item_category,visit_date,province having count(*)=1) a;
https://www.cnblogs.com/kaituorensheng/p/3826114.html
https://blog.****.net/qq_39662852/article/details/84318619
https://www.liaoxuefeng.com/article/1280231425966113
https://blog.****.net/helloxiaozhe/article/details/88964067
https://www.jianshu.com/p/21c880ee93a9
wget http://www.gutenberg.org/files/5000/5000-8.txt
wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt