Computing mean with MapReduce

Time: 2021-07-26 04:18:30

In this post we'll see how to compute, for every month, the mean of the daily maximum temperatures for the city of Milan.
The temperature data is taken from http://archivio-meteo.distile.it/tabelle-dati-archivio-meteo/. Since the site shows the data only in tabular form, we had to inspect the HTTP traffic to find out that the data come from this URL and are in JSON format.
Using Jackson, we transformed this JSON into a format that is simpler to use with Hadoop: CSV. Each line holds the date (ddMMyyyy), the minimum temperature and the maximum temperature. The result of the conversion looks like this:

01012000,-4.0,5.0
02012000,-5.0,5.1
03012000,-5.0,7.7
04012000,-3.0,9.7
...

If you're curious to see how we transformed it, take a look at the source code.
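
Before moving to Hadoop, it may help to see what the job has to compute on these lines. The following is only a minimal plain-Java sketch, not part of the original job: the class name `MonthlyMeanSketch` and the method `meanOfMaxByMonth` are hypothetical, and it assumes the date field is always in the ddMMyyyy form shown above. It keeps a running sum and count per month and divides them at the end.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not taken from the original source code.
public class MonthlyMeanSketch {

    // Maps each month key (MMyyyy, e.g. "012000") to the mean of its daily max temperatures.
    public static Map<String, Double> meanOfMaxByMonth(List<String> csvLines) {
        // For each month, keep a running pair: [0] = sum of max temperatures, [1] = number of days
        Map<String, double[]> sumCount = new LinkedHashMap<>();

        for (String line : csvLines) {
            String[] fields = line.split(",");
            if (fields.length != 3) {
                continue; // skip malformed lines
            }
            // Assumption: the date is ddMMyyyy, so dropping the day leaves MMyyyy
            String month = fields[0].substring(2);
            double max = Double.parseDouble(fields[2]);

            double[] acc = sumCount.computeIfAbsent(month, k -> new double[2]);
            acc[0] += max;
            acc[1] += 1;
        }

        // Turn every (sum, count) pair into a mean
        Map<String, Double> means = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : sumCount.entrySet()) {
            means.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return means;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "01012000,-4.0,5.0",
                "02012000,-5.0,5.1",
                "03012000,-5.0,7.7",
                "04012000,-3.0,9.7");
        // Prints one entry for the month key "012000"
        System.out.println(meanOfMaxByMonth(sample));
    }
}
```

Carrying a sum and a count instead of a ready-made mean matters once the work is distributed: partial sums and counts from different chunks of the input can be added together, while partial means cannot be averaged directly unless every chunk has the same number of days.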

Let's look at the mapper class for this job:

public static class MeanMapper extends Mapper<Object, Text, Text, SumCount> {

    private final int DATE = 0;
    private final int MIN = 1;
    private final int MAX = 2;

    private Map<Text, List<Double>> maxMap = new HashMap<>();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

        // gets the fields of the CSV line
        String[] values = value.toString().split(",");

        // defensive check
        if (values.length != 3) {
            return;
        }

        // gets date and max temperature
        String date = values[DATE];
        Text month = new Text(date.substring(2));
        Double max = Double.parseDouble(values[MAX]);

        // if not present, put this month into the map
        if (!maxMap.containsKey(month)) {
            maxMap.put(month, new ArrayList<Double>());
        }

        // adds the max temperature for this day to the list of temperatures
        maxMap.get(month).add(max);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {

        // loops over the months collected in the map() method
        for (Text month