Setting the time interval and adding limits to a histogram in matplotlib

Time: 2021-05-16 14:55:12

I am new to Python as well as matplotlib. I am trying to plot trip data for each city using a matplotlib histogram. Here is the sample data I am trying to plot.

Data:

     duration  month  hour day_of_week   user_type
0   15.433333      3    23    Thursday  Subscriber
1    3.300000      3    22    Thursday  Subscriber
2    2.066667      3    22    Thursday  Subscriber
3   19.683333      3    22    Thursday  Subscriber
4   10.933333      3    22    Thursday  Subscriber
5   19.000000      3    21    Thursday  Subscriber
6    6.966667      3    21    Thursday  Subscriber
7   17.033333      3    20    Thursday  Subscriber
8    6.116667      3    20    Thursday  Subscriber
9    6.316667      3    20    Thursday  Subscriber
10  11.300000      3    20    Thursday  Subscriber
11   8.300000      3    20    Thursday  Subscriber
12   8.283333      3    19    Thursday  Subscriber
13  36.033333      3    19    Thursday  Subscriber
14   5.833333      3    19    Thursday  Subscriber
15   5.350000      3    19    Thursday  Subscriber

Code:

import csv
import matplotlib.pyplot as plt

def get_durations_as_list(filename):
    # Collect trip durations under 75 minutes, split by user type.
    subscribers = []
    customers = []
    with open(filename, 'r') as f_in:
        reader = csv.reader(f_in)
        next(reader, None)  # skip the header row
        for row in reader:
            if row[4] in ['Subscriber', 'Registered'] and float(row[0]) < 75:
                subscribers.append(float(row[0]))
            elif row[4] in ['Casual', 'Customer'] and float(row[0]) < 75:
                customers.append(float(row[0]))
    return subscribers, customers

data_files = ['./data/Washington-2016-Summary.csv', './data/Chicago-2016-Summary.csv', './data/NYC-2016-Summary.csv']
for file in data_files:
    # The city name is the part of the file name before the first '-'.
    city = file.split('-')[0].split('/')[-1]
    subscribers, customers = get_durations_as_list(file)

    plt.hist(subscribers, range=[min(subscribers), max(subscribers)], bins=5)
    plt.title('Distribution of Subscriber Trip Durations for city {}'.format(city))
    plt.xlabel('Duration (m)')
    plt.show()

    # Use the customers' own minimum and maximum for the range.
    plt.hist(customers, range=[min(customers), max(customers)], bins=5)
    plt.title('Distribution of Customers Trip Durations for city {}'.format(city))
    plt.xlabel('Duration (m)')
    plt.show()

Now the question is how to set the time interval to 5 minutes wide and how to plot only the trips that are shorter than 75 minutes.

I have gone through the documentation, but it looks complicated. After reading a few * questions, I found that bins are used to set the time interval. Is my assumption correct?

3 Solutions

#1


1  

I cannot try it out but here are my thoughts:

The bins argument can also be a sequence of bin edges. Therefore you can take the minimum and maximum of durations and create a sequence with a step size of 5 (here using the numpy library):

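import numpy as np
# dat is the pandas DataFrame read further down. np.arange excludes its stop
# value, so one extra step is added to make the last edge reach the maximum.
sequence = np.arange(np.floor(dat['duration'].min()), np.ceil(dat['duration'].max()) + 5, 5)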

(Maybe you want to floor/ceil the minimum and maximum values to integers.) Here the code relies on the fact that I read the data using the pandas library. It can easily be filtered using pandas as well:

import pandas as pd
dat = pd.read_csv('YOURFILE.csv')
dat_filtered = dat[dat['duration'] < 75]
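
As a rough sketch of how the two snippets above fit together, the following filters first and then builds the bin edges from the filtered durations. The in-memory DataFrame is a stand-in for the CSV file (its values are a few rows from the question's sample data), and np.histogram is used only to inspect the counts without opening a plot window; passing the same edges to plt.hist as bins should group the data the same way.

```python
import numpy as np
import pandas as pd

# Stand-in for pd.read_csv('YOURFILE.csv'): a few durations from the
# sample data in the question.
dat = pd.DataFrame({"duration": [15.433333, 3.3, 2.066667, 19.683333,
                                 10.933333, 19.0, 6.966667, 36.033333]})

# Keep only trips shorter than 75 minutes.
dat_filtered = dat[dat["duration"] < 75]

# Floor the minimum and ceil the maximum, then add one extra step so the
# last edge is not below the largest duration (np.arange excludes stop).
low = np.floor(dat_filtered["duration"].min())
high = np.ceil(dat_filtered["duration"].max())
edges = np.arange(low, high + 5, 5)

# Count the values that fall into each 5-minute bin.
counts, _ = np.histogram(dat_filtered["duration"], bins=edges)
print(edges)
print(counts)
```

Starting the edges at the floored minimum means they are not aligned to multiples of 5; starting at 0 instead is an equally reasonable choice if round edges are preferred.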

Happy Holidays.

#2


0  

Yes, your assumption is correct: you can pass the bins parameter as a sequence. In your case, it would look like this:

 b = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75]

You can use numpy to generate the above list (the stop value is exclusive, so 80 is needed to include the edge at 75):

 bins = numpy.arange(0, 80, 5)
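
One detail worth checking with any edge sequence is where it actually ends, because numpy.arange excludes its stop value. The small sketch below compares edges that end at 70 with edges that end at 75; the durations are made-up values for illustration, and np.histogram is used here only to count what each set of edges would capture.

```python
import numpy as np

edges_to_70 = np.arange(0, 75, 5)  # stop is exclusive: the last edge is 70
edges_to_75 = np.arange(0, 80, 5)  # the last edge is 75

# Made-up durations in minutes; 72.5 lies between 70 and 75.
durations = [3.3, 19.0, 36.0, 72.5]

counts_70, _ = np.histogram(durations, bins=edges_to_70)
counts_75, _ = np.histogram(durations, bins=edges_to_75)

# Number of bins and number of values that landed in some bin.
print(len(edges_to_70) - 1, counts_70.sum())
print(len(edges_to_75) - 1, counts_75.sum())
```

Values outside the first and last edge are silently left out, so a trip of 72.5 minutes only shows up when the edges extend to 75.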

Also, you can plot the Subscriber and Customer data sets in one go. Below is the function:

import csv
import matplotlib.pyplot as plt

def plot_duration_type(filename):
    # The city name is the part of the file name before the first '-'.
    city = filename.split('-')[0].split('/')[-1]
    with open(filename, 'r') as f_in:
        reader = csv.DictReader(f_in)
        subscriber_duration = []
        customer_duration = []
        for row in reader:
            if float(row['duration']) < 75 and row['user_type'] == 'Subscriber':
                subscriber_duration.append(float(row['duration']))
            elif float(row['duration']) < 75 and row['user_type'] == 'Customer':
                customer_duration.append(float(row['duration']))
    # Bin edges every 5 minutes from 0 to 75.
    b = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
    plt.hist([subscriber_duration, customer_duration], bins=b, color=['orange', 'green'],
             label=['Subscriber', 'Customer'])
    plt.legend()  # show the labels passed to plt.hist
    title = "{} Distribution of Trip Durations".format(city)
    plt.title(title)
    plt.xlabel('Duration (m)')
    plt.show()

data_file = ['./data/Washington-2016-Summary.csv', './data/Chicago-2016-Summary.csv', './data/NYC-2016-Summary.csv']
for datafile in data_file:
    plot_duration_type(datafile)  # returns None, so there is nothing to print

#3


0  

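To set an interval of 5 minutes with a maximum duration of 75 minutes, you need 15 intervals, so the number of bins is 75/5. You can write it as bins=int(75/5) together with range=(0, 75), or, following @om tripathi's suggestion, pass bin edges built with numpy.arange; note that numpy.arange(0, 75, 5) stops at 70, so numpy.arange(0, 80, 5) is needed to include the edge at 75. You also need not filter out durations greater than 75 minutes in the data filtering stage: setting range=(0, 75) in the histogram discards values outside that range. When bins is a sequence of edges, range has no effect, so use one approach or the other.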

e.g. pyplot.hist(data, bins=15, range=(0, 75))
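
To see what an integer bin count combined with a range does, here is a small sketch. It uses np.histogram as a stand-in for pyplot.hist, which accepts the same bins and range arguments; the data values are invented for illustration.

```python
import numpy as np

# Made-up durations in minutes; two of them exceed 75.
data = [2.0, 12.5, 74.9, 75.0, 80.0, 120.0]

# 15 equal-width bins over [0, 75], so each bin is 5 minutes wide.
counts, edges = np.histogram(data, bins=15, range=(0, 75))

print(edges)         # the 16 bin edges
print(counts.sum())  # how many values fell inside the range
```

The values above 75 are ignored rather than raising an error, and a value of exactly 75 is counted because the last bin includes its right edge.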
