Notes on Python for Data Analysis: Time Series (Time Zones, Periods, Frequencies)

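Time Zone Handling

A time zone can be understood as an offset from UTC. For example, New York is four hours behind UTC during daylight saving time and five hours behind during the rest of the year.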
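The two offsets mentioned above can be checked directly. This is a minimal sketch, assuming pandas is installed with time zone data available; the variable names and the two dates (one in summer, one in winter of 2012) are arbitrary choices for illustration.

```python
from datetime import timedelta

import pandas as pd

# Noon in New York on a summer day (DST in effect) and on a winter day.
summer = pd.Timestamp('2012-07-01 12:00', tz='America/New_York')
winter = pd.Timestamp('2012-01-15 12:00', tz='America/New_York')

# utcoffset() returns the offset from UTC as a timedelta.
summer_offset = summer.utcoffset()
winter_offset = winter.utcoffset()

print(summer_offset == timedelta(hours=-4))  # four hours behind UTC
print(winter_offset == timedelta(hours=-5))  # five hours behind UTC
```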

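In Python, time zone information comes from the third-party pytz library, which exposes the Olson database, a compilation of world time zone information. This matters for historical data because daylight saving time (DST) transition dates depend on the whims of local governments. In the United States, the DST transition dates have been changed many times since 1900.

For more about pytz, see its documentation. The pandas used in the book wraps part of pytz's functionality, so apart from the time zone names you do not need to look up the rest of its API. Time zone names can be obtained as follows: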

import pytz

pytz.common_timezones[-5:]

['US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific', 'UTC']

# Get a time zone object from pytz with pytz.timezone

tz = pytz.timezone('America/New_York')

1. Time zone localization and conversion

By default, time series in pandas are time zone naive.

import pandas as pd

import numpy as np

rng = pd.date_range('3/9/2012 9:30', periods=6, freq='D')

ts = pd.Series(np.random.randn(len(rng)), index=rng)

2012-03-09 09:30:00 0.070052

2012-03-10 09:30:00 0.721449

2012-03-11 09:30:00 -0.266241

2012-03-12 09:30:00 -1.022387

2012-03-13 09:30:00 -1.476888

2012-03-14 09:30:00 0.770954

Freq: D, dtype: float64

# tz_localize converts a naive series into a localized one

# Localize to a time zone

ts_utc = ts.tz_localize('UTC')

ts_utc

2012-03-09 09:30:00+00:00 0.070052

2012-03-10 09:30:00+00:00 0.721449

2012-03-11 09:30:00+00:00 -0.266241

2012-03-12 09:30:00+00:00 -1.022387

2012-03-13 09:30:00+00:00 -1.476888

2012-03-14 09:30:00+00:00 0.770954

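Freq: D, dtype: float64

Once a time series has been localized to a particular time zone, it can be converted to any other time zone with tz_convert:

# Convert to another time zone

ts_utc.tz_convert('America/New_York')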

2012-03-09 04:30:00-05:00 0.070052

2012-03-10 04:30:00-05:00 0.721449

2012-03-11 05:30:00-04:00 -0.266241

2012-03-12 05:30:00-04:00 -1.022387

2012-03-13 05:30:00-04:00 -1.476888

2012-03-14 05:30:00-04:00 0.770954

Freq: D, dtype: float64

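When working with the series above, we can first localize it to New York time and then convert it to UTC or to Berlin time:

# Localize to New York, then convert to UTC

ts_eastern = ts.tz_localize('America/New_York')

ts_eastern.tz_convert('UTC')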

2012-03-09 14:30:00+00:00 0.070052

2012-03-10 14:30:00+00:00 0.721449

2012-03-11 13:30:00+00:00 -0.266241

2012-03-12 13:30:00+00:00 -1.022387

2012-03-13 13:30:00+00:00 -1.476888

2012-03-14 13:30:00+00:00 0.770954

Freq: D, dtype: float64

# Convert to Berlin time

ts_eastern.tz_convert('Europe/Berlin')

2012-03-09 15:30:00+01:00 0.070052

2012-03-10 15:30:00+01:00 0.721449

2012-03-11 14:30:00+01:00 -0.266241

2012-03-12 14:30:00+01:00 -1.022387

2012-03-13 14:30:00+01:00 -1.476888

2012-03-14 14:30:00+01:00 0.770954

Freq: D, dtype: float64

tz_localize and tz_convert are also instance methods on DatetimeIndex:

ts.index.tz_localize('Asia/Shanghai')

DatetimeIndex(['2012-03-09 09:30:00+08:00', '2012-03-10 09:30:00+08:00',

'2012-03-11 09:30:00+08:00', '2012-03-12 09:30:00+08:00',

'2012-03-13 09:30:00+08:00', '2012-03-14 09:30:00+08:00'],

dtype='datetime64[ns, Asia/Shanghai]', freq='D')
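The localize-then-convert workflow can be summarized in one short script. This is a sketch under the assumption that the series holds fixed integers instead of random values, so that the result is reproducible; the variable names are illustrative.

```python
import pandas as pd

rng = pd.date_range('2012-03-09 09:30', periods=6, freq='D')
ts = pd.Series(range(6), index=rng)

# A freshly built index carries no time zone (it is "naive").
is_naive = ts.index.tz is None

# Localizing attaches a zone to the wall-clock times; converting keeps the
# instants and only changes how they are displayed.
ts_eastern = ts.tz_localize('America/New_York')
ts_berlin = ts_eastern.tz_convert('Europe/Berlin')

# Both indexes describe the same instants in time.
same_instants = bool((ts_eastern.index == ts_berlin.index).all())

# The Berlin wall-clock hour shifts once New York enters DST on March 11.
berlin_hours = list(ts_berlin.index.hour)
print(is_naive, same_instants, berlin_hours)
```

The change in the Berlin hour mirrors the 15:30 and 14:30 rows in the converted output shown earlier.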

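2. Operations with time zone-aware Timestamp objects

Like time series and date ranges, individual Timestamp objects can be localized from naive (no time zone) to time zone-aware and then converted to other time zones:

stamp = pd.Timestamp('2011-03-12 04:00')

stamp_utc = stamp.tz_localize('utc') # localize to UTC

stamp_utc.tz_convert('America/New_York') # convert to New York time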

Timestamp('2011-03-11 23:00:00-0500', tz='America/New_York')

# A time zone can also be passed when creating the Timestamp

stamp_moscow = pd.Timestamp('2011-03-12 04:00', tz='Europe/Moscow')

stamp_moscow

Timestamp('2011-03-12 04:00:00+0300', tz='Europe/Moscow')

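Time zone-aware Timestamp objects internally store a UTC timestamp value, expressed as nanoseconds since the Unix epoch (January 1, 1970). This UTC value does not change when the timestamp is converted to another time zone:

stamp_utc.value

1299902400000000000

stamp_utc.tz_convert('America/New_York').value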

1299902400000000000
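The invariance of the stored value can be verified across several zones at once. The sketch below assumes a reasonably recent pandas; the Tokyo conversion and the `epoch` variable are additions for illustration only.

```python
import pandas as pd

stamp_utc = pd.Timestamp('2011-03-12 04:00').tz_localize('UTC')
stamp_ny = stamp_utc.tz_convert('America/New_York')
stamp_tokyo = stamp_utc.tz_convert('Asia/Tokyo')

# The wall-clock representation differs, but the stored instant does not.
same_value = stamp_utc.value == stamp_ny.value == stamp_tokyo.value

# An integer input of 0 is the Unix epoch itself: midnight UTC, January 1, 1970.
epoch = pd.Timestamp(0, tz='UTC')
print(same_value, epoch)
```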

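When performing arithmetic with pandas DateOffset objects, pandas takes daylight saving time into account where it applies. Here we construct timestamps close to DST transitions. First, a timestamp early on the morning of March 12, 2012. The spring transition in US/Eastern took place the day before, on March 11, so this timestamp already carries the -0400 offset and adding an hour simply yields 02:30: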

from pandas.tseries.offsets import Hour

stamp = pd.Timestamp('2012-03-12 01:30', tz='US/Eastern')

stamp

Timestamp('2012-03-12 01:30:00-0400', tz='US/Eastern')

stamp + Hour()

Timestamp('2012-03-12 02:30:00-0400', tz='US/Eastern')

Then, 90 minutes before the transition out of DST (November 4, 2012):

stamp = pd.Timestamp('2012-11-04 00:30', tz='US/Eastern')

stamp

Timestamp('2012-11-04 00:30:00-0400', tz='US/Eastern')

stamp + 2 * Hour()

Timestamp('2012-11-04 01:30:00-0500', tz='US/Eastern')
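To see the spring transition itself, the timestamp has to fall on March 11, 2012, when US Eastern clocks jumped from 02:00 to 03:00. The following sketch uses 'America/New_York' (the same zone as 'US/Eastern') and illustrative variable names; it is an addition to the examples above, not part of the original notes.

```python
from datetime import timedelta

import pandas as pd
from pandas.tseries.offsets import Hour

# 30 minutes before clocks spring forward; the standard-time offset still applies.
before = pd.Timestamp('2012-03-11 01:30', tz='America/New_York')

# Hour() adds one hour of elapsed time, so the wall clock skips the
# nonexistent 02:30 and lands on 03:30 with the DST offset.
after = before + Hour()

print(before)
print(after)
```

One real hour has passed between the two timestamps even though the wall clock advanced by two.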

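3. Operations between different time zones

If two time series with different time zones are combined, the result is in UTC. Because timestamps are stored in UTC under the hood, this is straightforward and requires no manual conversion:

rng = pd.date_range('3/7/2012 9:30', periods=10, freq='B') # freq='B' means business-day frequency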

ts = pd.Series(np.random.randn(len(rng)), index=rng)

2012-03-07 09:30:00 1.128677

2012-03-08 09:30:00 0.865172

2012-03-09 09:30:00 1.003891

2012-03-12 09:30:00 0.594445

2012-03-13 09:30:00 -0.779890

2012-03-14 09:30:00 0.561338

2012-03-15 09:30:00 0.101160

2012-03-16 09:30:00 -0.314883

2012-03-19 09:30:00 -0.385164

2012-03-20 09:30:00 0.708143

Freq: B, dtype: float64

ts1 = ts[:7].tz_localize('Europe/London')

ts2 = ts1[2:].tz_convert('Europe/Moscow')

result = ts1 + ts2

result.index

DatetimeIndex(['2012-03-07 09:30:00+00:00', '2012-03-08 09:30:00+00:00',

'2012-03-09 09:30:00+00:00', '2012-03-12 09:30:00+00:00',

'2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',

'2012-03-15 09:30:00+00:00'],

dtype='datetime64[ns, UTC]', freq='B')
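The same behavior can be reproduced with deterministic values, which also makes the alignment visible. This is a sketch assuming the values 0.0 through 9.0 in place of random numbers; rows present in only one operand come out as NaN.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2012-03-07 09:30', periods=10, freq='B')
ts = pd.Series(np.arange(10.0), index=rng)

ts1 = ts.iloc[:7].tz_localize('Europe/London')
ts2 = ts1.iloc[2:].tz_convert('Europe/Moscow')

# The operands are aligned on the underlying instants; the result index is UTC.
result = ts1 + ts2

result_tz = str(result.index.tz)
missing = int(result.isna().sum())  # the two rows that ts2 does not contain
print(result_tz, missing)
```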

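Periods and Period Arithmetic

Periods represent time spans, such as days, months, quarters, or years. The Period class represents this data type and is constructed from a string or an integer together with a frequency: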

p = pd.Period(2007, freq='A-DEC')

Period('2007', 'A-DEC')

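This Period object represents the full time span of 2007, from January 1 to December 31. Adding or subtracting an integer from a Period shifts it by that many units of its frequency: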

p + 5

Period('2012', 'A-DEC')

p - 2

Period('2005', 'A-DEC')

If two periods have the same frequency, their difference is the number of units between them:

pd.Period('2014', freq='A-DEC') - p

7
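The same arithmetic works at any frequency. The sketch below uses monthly periods, partly because the spelling of the annual alias has, to my knowledge, changed in newer pandas releases. It also assumes a recent pandas in which subtracting two periods returns an offset object whose `n` attribute holds the count, instead of the plain integer shown above.

```python
import pandas as pd

p = pd.Period('2007-01', freq='M')

# Adding an integer shifts the period by that many months.
later = p + 5      # June 2007
earlier = p - 2    # November 2006

# The difference of two monthly periods counts the months between them.
diff = pd.Period('2014-01', freq='M') - p
months_between = diff.n  # assumption: offset result, as in recent pandas

print(later, earlier, months_between)
```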

Regular ranges of periods can be created with the period_range function:

rng = pd.period_range('2000-01-01', '2000-06-03', freq='M') # freq='M' means monthly

rng

PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='period[M]', freq='M')

The PeriodIndex class stores a sequence of periods and can serve as an axis index in any pandas data structure:

pd.Series(np.random.randn(6), index=rng)

2000-01 0.180966

2000-02 -0.801255

2000-03 -0.269305

2000-04 -1.614798

2000-05 -0.577700

2000-06 1.717878

Freq: M, dtype: float64

If we have an array of strings, we can also use the PeriodIndex class:

values = ['2001Q3', '2002Q2', '2003Q1']

index = pd.PeriodIndex(values, freq='Q-DEC')

index

PeriodIndex(['2001Q3', '2002Q2', '2003Q1'], dtype='period[Q-DEC]', freq='Q-DEC')

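1. Period frequency conversion

Periods and PeriodIndex objects can be converted to another frequency with the asfreq method. For example, suppose we have an annual period and want to convert it to a monthly period at either the start or the end of the year. This is straightforward:

p = pd.Period('2007', freq='A-DEC') # freq='A-DEC': annual periods ending in December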

Period('2007', 'A-DEC')

p.asfreq('M', how='start')

Period('2007-01', 'M')

p.asfreq('M', how='end')

Period('2007-12', 'M')

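We can think of Period('2007', freq='A-DEC') as a kind of cursor pointing to a span of time that is subdivided into monthly periods. The original post illustrates this with a figure (not reproduced here).

If a fiscal year ends in a month other than December, for example June, the corresponding monthly subperiods are different:

p = pd.Period('2007', freq='A-JUN') # freq='A-JUN': annual periods ending in June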

Period('2007', 'A-JUN')

p.asfreq('M', 'start')

Period('2006-07', 'M')

p.asfreq('M', 'end')

Period('2007-06', 'M')

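When converting from a higher frequency to a lower one, pandas determines the superperiod by where the subperiod belongs. For example, in A-JUN frequency the month Aug-2007 is actually part of the 2008 period: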

p = pd.Period('Aug-2007', 'M')

p.asfreq('A-JUN')

Period('2008', 'A-JUN')
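The rule behind this result can be written out in plain Python. The helper below, `fiscal_year_label`, is a hypothetical function that only illustrates the arithmetic; it is not how pandas implements asfreq.

```python
def fiscal_year_label(year: int, month: int, end_month: int) -> int:
    """Return the label of the fiscal year containing (year, month).

    A fiscal year is labeled by the calendar year in which it ends, so any
    month after the closing month belongs to the following fiscal year.
    """
    if not (1 <= month <= 12 and 1 <= end_month <= 12):
        raise ValueError("month and end_month must be in 1..12")
    return year + 1 if month > end_month else year


# August 2007 under a fiscal year ending in June falls in fiscal 2008.
print(fiscal_year_label(2007, 8, end_month=6))
# June 2007 is the closing month of fiscal 2007.
print(fiscal_year_label(2007, 6, end_month=6))
```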

Whole PeriodIndex objects or time series can be converted in the same way, with the same semantics:

rng = pd.period_range('2006', '2009', freq='A-NOV') # freq='A-NOV': annual periods ending in November

ts = pd.Series(np.random.randn(len(rng)), index=rng)

2006 0.518204

2007 -1.310516

2008 0.879978

2009 0.452713

Freq: A-NOV, dtype: float64

ts.asfreq('M', how='start')

2005-12 0.518204

2006-12 -1.310516

2007-12 0.879978

2008-12 0.452713

Freq: M, dtype: float64

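Here, each annual period is replaced with the monthly period corresponding to the first month that falls within it. If we instead want the last business day of each year, we can use the 'B' frequency and indicate that we want the end of the period: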

ts.asfreq('B', how='end')

2006-11-30 0.518204

2007-11-30 -1.310516

2008-11-28 0.879978

2009-11-30 0.452713

Freq: B, dtype: float64

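2. Quarterly period frequencies

Quarterly data is common in accounting, economics, and related fields. Much quarterly data is reported relative to a fiscal year end, such as the last business day of December. The period 2012Q4 therefore means different things depending on when the fiscal year ends. pandas supports all 12 possible quarterly frequencies, from Q-JAN to Q-DEC.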
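How much the meaning of 2012Q4 depends on the fiscal year end can be seen by converting the same label to daily periods under two quarterly frequencies. This is a small sketch; the variable names are illustrative.

```python
import pandas as pd

# Fiscal year ending in January: Q4 of fiscal 2012 covers November 2011
# through January 2012.
q4_jan = pd.Period('2012Q4', freq='Q-JAN')
jan_start = q4_jan.asfreq('D', how='start')
jan_end = q4_jan.asfreq('D', how='end')

# Calendar-year quarters: Q4 covers October through December 2012.
q4_dec = pd.Period('2012Q4', freq='Q-DEC')
dec_start = q4_dec.asfreq('D', how='start')
dec_end = q4_dec.asfreq('D', how='end')

print(jan_start, jan_end)
print(dec_start, dec_end)
```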

Published: 2024-09-22 07:27:40

Article link: https://www.17tex.com/tex/2/343836.html
