Timeseries datarames

Timeseries dataframes

Timeseries data is a cornerstone of our data manipulation and most processing is on them

`set_time_index_zone`

Processing may depend on proper timezone awareness, this utility to set the timezone on a datetime index

set_time_index_zone


def set_time_index_zone(
    df:DataFrame, # Dataframe to set or convert the timeindex on
    timezone, # Timezone to set
):

Sets the time zone of the index of a pandas DataFrame.

Args: df (pd.DataFrame): The DataFrame whose index time zone is to be set. timezone (str): The desired time zone.

Returns: pd.DataFrame: The modified DataFrame with its index time zone set to the specified time zone.

Raises: None

Examples: >>> df = pd.DataFrame({‘A’: [1, 2, 3]}, index=pd.DatetimeIndex([‘2022-01-01’, ‘2022-01-02’, ‘2022-01-03’])) >>> set_time_index_zone(df, ‘Europe/Berlin’) A 2022-01-01 1 2022-01-02 2 2022-01-03 3 DatetimeIndex: 3 entries, 2022-01-01 01:00:00+01:00 to 2022-01-03 01:00:00+01:00

Example

df = pd.DataFrame({'A': [1, 2, 3]}, index=pd.DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03']))
set_time_index_zone(df, 'Europe/Berlin')
df.index

DatetimeIndex(['2022-01-01 01:00:00+01:00', '2022-01-02 01:00:00+01:00',
               '2022-01-03 01:00:00+01:00'],
              dtype='datetime64[ns, Europe/Berlin]', name='time', freq=None)

timeseries_dataframe

Converts Pandas dataframes and series, Numpy array’s and recarrays or a dictionary of individual timeseries into a Pandas dataframe with one datetime index. With all arrays dataframes and series it is assumed that the first column contains the timestamps.

timeseries_dataframe


def timeseries_dataframe(
    data:Union, timezone:str='UTC', columnnames:NoneType=None
):

Convert various tabular data formats to timeseries DataFrame

Args: data (Union[pd.DataFrame, pd.Series, dict, np.ndarray, np.recarray]): The input data to be converted. timezone (str, optional): The timezone to set for the index of the DataFrame. Defaults to ‘UTC’. columnnames (Optional[List[str]]): The column names to use for the DataFrame. Defaults to None.

Returns: pd.DataFrame: The converted timeseries DataFrame with the index set to the specified timezone.

`timeseries_dataframe_from_datadict`

timeseries_dataframe_from_datadict


def timeseries_dataframe_from_datadict(
    data:dict, timecolumns:NoneType=None, recordformat:str='records', nested:bool=False
):

Converts a data dict into a pandas DataFrame based on the specified record format. Parameters: - data: A dictionary containing the data to convert. - timecolumns: A list of column names to be treated as time columns. - recordformat: A string specifying the format of the data records (‘records’, ‘table’, ‘split’, ‘index’, ‘tight’). Returns: - df: A pandas DataFrame with a DatetimeIndex representing the converted data.

Exported source

def timeseries_dataframe_from_datadict(
        data:dict, 
        timecolumns=None,
        recordformat='records',
        nested=False
):
        
    """
    Converts a data dict into a pandas DataFrame based on the specified record format. 
    Parameters:
        - data: A dictionary containing the data to convert.
        - timecolumns: A list of column names to be treated as time columns.
        - recordformat: A string specifying the format of the data records ('records', 'table', 'split', 'index', 'tight').
    Returns:
        - df: A pandas DataFrame with a DatetimeIndex representing the converted data.
    """

    orient = recordformat.lower()
    assert orient in ['records', 'table', 'split', 'index', 'tight']
    assert timecolumns, 'No time columns specified'

    #print(f"Converting {'nested' if nested else 'flat'} data dict to DataFrame with orient={orient} and timecolumns={timecolumns}")
    
    if orient == 'records':
        if nested:
            # data is a nested structure, we need to normalize it
            df = pd.json_normalize(data, sep='.', errors='ignore')  # type: ignore

        else:
            # data is a structured ndarray, sequence of tuples or dicts, or DataFrame
            df = pd.DataFrame.from_records(data, coerce_float=True)  # type: ignore
            
        time_columns_in_df = [C for C in df.columns if C in timecolumns]
        if not  time_columns_in_df:
            time_column = df.columns[0]
        else:
            time_column = time_columns_in_df[0]

    elif orient == 'table':
        # data is in pandas table format
        time_column = data['schema']['primaryKey'][0]
        df = pd.DataFrame.from_dict(data['data'], coerce_float=True).set_index(data['schema']['primaryKey'])
        df.index.name = 'time'
    else:
        # data  is formatted according to 'orient' parameter (pandas)
        df = pd.DataFrame.from_dict(data, orient=orient, coerce_float=True) # type: ignore
        time_column = df.index.name


    df.columns = list(df.columns)
    df[time_column] = pd.to_datetime(df[time_column],utc=True,format='ISO8601')
    df.set_index(time_column, inplace=True)
    df.index = pd.DatetimeIndex(df.index).round('ms')
    
    df.index.name = 'time'

    return df

df = timeseries_dataframe_from_datadict(test_data_dict_3_samples, timecolumns=['time'])
df

	value
time
2023-05-04 10:04:49+00:00	16.72
2023-05-04 10:24:51+00:00	16.65
2023-05-04 10:44:53+00:00	16.55

df.index

DatetimeIndex(['2023-05-04 10:04:49+00:00', '2023-05-04 10:24:51+00:00',
               '2023-05-04 10:44:53+00:00'],
              dtype='datetime64[ns, UTC]', name='time', freq=None)

df = set_time_index_zone( 
    timeseries_dataframe_from_datadict(
        test_data_dict_3_samples, 
        timecolumns=['time']
    ), 
    timezone='Europe/Amsterdam'
)
df

	value
time
2023-05-04 12:04:49+02:00	16.72
2023-05-04 12:24:51+02:00	16.65
2023-05-04 12:44:53+02:00	16.55

df.index

DatetimeIndex(['2023-05-04 12:04:49+02:00', '2023-05-04 12:24:51+02:00',
               '2023-05-04 12:44:53+02:00'],
              dtype='datetime64[ns, Europe/Amsterdam]', name='time', freq=None)

Timezones

rng = pd.date_range(pd.Timestamp("2018-04-10T09:01:01.123+02:00"), periods=3, freq='s').tz_convert('Europe/Amsterdam')
rng

DatetimeIndex(['2018-04-10 09:01:01.123000+02:00',
               '2018-04-10 09:01:02.123000+02:00',
               '2018-04-10 09:01:03.123000+02:00'],
              dtype='datetime64[ns, Europe/Amsterdam]', freq='s')

rng.strftime("%FT%R:%S%z")

Index(['2018-04-10T09:01:01+0200', '2018-04-10T09:01:02+0200',
       '2018-04-10T09:01:03+0200'],
      dtype='object')

pd.DatetimeIndex(rng.strftime("%FT%R:%S%z")).round('ms')

DatetimeIndex(['2018-04-10 09:01:01+02:00', '2018-04-10 09:01:02+02:00',
               '2018-04-10 09:01:03+02:00'],
              dtype='datetime64[ns, UTC+02:00]', freq=None)

rng.tz_convert('UTC').strftime("%FT%R:%SZ")

Index(['2018-04-10T07:01:01Z', '2018-04-10T07:01:02Z', '2018-04-10T07:01:03Z'], dtype='object')

# .map(lambda x: x.isoformat())

rng = pd.date_range(pd.Timestamp("2018-04-10T09:01:01.123+02:00"), periods=30000, freq='s').tz_convert('Europe/Amsterdam')

Which is faster, `strftime()` or `isoformat()`?

ft = rng.strftime("%FT%R:%S%z")

340 ms ± 18.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ft = rng.map(lambda x: x.isoformat(timespec='milliseconds'))

136 ms ± 9.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

`timeseries_dataframe_to_datadict`

timeseries_dataframe_to_datadict


def timeseries_dataframe_to_datadict(
    data:Union, recordformat:str='records', timezone:str='UTC'
):

Convert a timeseries DataFrame or Series into a dictionary representation.

Args: data (Union[pd.DataFrame, pd.Series, dict]): The input data to be converted. It can be a pandas DataFrame, Series, or a dictionary. recordformat (str, optional): The format of the output records. Defaults to ‘records’. timezone (str, optional): The timezone to use for the DataFrame index. Defaults to ‘UTC’.

Returns: Union[dict, list]: The converted dictionary representation of the input data, a dictionary or a list of dictionaries depending on the recordformat parameter.

df = timeseries_dataframe_from_datadict([
      {
         "time":"2023-05-04T10:04:49.050+01:00",
         "value":16.72
      },
      {
         "time":"2023-05-04T10:24:51.010Z",
         "value":16.65
      },
      {
         "time":"2023-05-04T10:44:53.000Z",
         "value":16.55
      },
      {
         "time":"2023-05-04T10:44:53.000Z",
         "value":np.nan
      }
   ], timecolumns=['time'])

df

	value
time
2023-05-04 09:04:49.050000+00:00	16.72
2023-05-04 10:24:51.010000+00:00	16.65
2023-05-04 10:44:53+00:00	16.55
2023-05-04 10:44:53+00:00	NaN

df.index

DatetimeIndex(['2023-05-04 09:04:49.050000+00:00',
               '2023-05-04 10:24:51.010000+00:00',
                      '2023-05-04 10:44:53+00:00',
                      '2023-05-04 10:44:53+00:00'],
              dtype='datetime64[ns, UTC]', name='time', freq=None)

df = timeseries_dataframe_from_datadict([
      {
         "time":"2023-05-04T10:04:49.050+01:00",
         "value":16.72
      },
      {
         "time":"2023-05-04T10:24:51.010Z",
         "value":16.65
      },
      {
         "time":"2023-05-04T10:44:53.000Z",
         "value":16.55
      },
      {
         "time":"2023-05-04T10:44:53.000Z",
         "value":np.nan
      }
   ], timecolumns=['time'])

df

	value
time
2023-05-04 09:04:49.050000+00:00	16.72
2023-05-04 10:24:51.010000+00:00	16.65
2023-05-04 10:44:53+00:00	16.55
2023-05-04 10:44:53+00:00	NaN

timeseries_dataframe(df, timezone='UTC').index

DatetimeIndex(['2023-05-04 09:04:49.050000+00:00',
               '2023-05-04 10:24:51.010000+00:00',
                      '2023-05-04 10:44:53+00:00',
                      '2023-05-04 10:44:53+00:00'],
              dtype='datetime64[ns, UTC]', name='time', freq=None)

df = timeseries_dataframe_from_datadict([
      {
         "time":"2023-05-04T10:04:49.050+01:00",
         "value":16.72
      },
      {
         "time":"2023-05-04T10:24:51.010Z",
         "value":16.65
      },
      {
         "time":"2023-05-04T10:44:53.000Z",
         "value":16.55
      },
      {
         "time":"2023-05-04T10:44:53.000Z",
         "value":np.nan
      }
   ], timecolumns=['time'])

df

	value
time
2023-05-04 09:04:49.050000+00:00	16.72
2023-05-04 10:24:51.010000+00:00	16.65
2023-05-04 10:44:53+00:00	16.55
2023-05-04 10:44:53+00:00	NaN

timeseries_dataframe_to_datadict(df, recordformat='records')

[{'time': '2023-05-04T09:04:49Z', 'value': 16.72},
 {'time': '2023-05-04T10:24:51Z', 'value': 16.65},
 {'time': '2023-05-04T10:44:53Z', 'value': 16.55},
 {'time': '2023-05-04T10:44:53Z', 'value': nan}]

df = timeseries_dataframe_from_datadict([
      {
         "time":"2023-05-04T10:04:49.050+01:00",
         "value":16.72
      },
      {
         "time":"2023-05-04T10:24:51.010Z",
         "value":16.65
      },
      {
         "time":"2023-05-04T10:44:53.000Z",
         "value":16.55
      },
      {
         "time":"2023-05-04T10:44:53.000Z",
         "value":np.nan
      }
   ], timecolumns=['time'])

df

	value
time
2023-05-04 09:04:49.050000+00:00	16.72
2023-05-04 10:24:51.010000+00:00	16.65
2023-05-04 10:44:53+00:00	16.55
2023-05-04 10:44:53+00:00	NaN

timeseries_dataframe_to_datadict(df, recordformat='records', timezone='Europe/Berlin')

[{'time': '2023-05-04T11:04:49.050+02:00', 'value': 16.72},
 {'time': '2023-05-04T12:24:51.010+02:00', 'value': 16.65},
 {'time': '2023-05-04T12:44:53.000+02:00', 'value': 16.55},
 {'time': '2023-05-04T12:44:53.000+02:00', 'value': nan}]

df.index

DatetimeIndex(['2023-05-04 09:04:49.050000+00:00',
               '2023-05-04 10:24:51.010000+00:00',
                      '2023-05-04 10:44:53+00:00',
                      '2023-05-04 10:44:53+00:00'],
              dtype='datetime64[ns, UTC]', name='time', freq=None)

df = timeseries_dataframe_from_datadict([
      {
         "time":"2023-05-04T10:04:49.050+01:00",
         "value":16.72
      },
      {
         "time":"2023-05-04T10:24:51.010Z",
         "value":16.65
      },
      {
         "time":"2023-05-04T10:44:53.000Z",
         "value":16.55
      },
      {
         "time":"2023-05-04T10:44:53.000Z",
         "value":np.nan
      }
   ], timecolumns=['time'])

df

	value
time
2023-05-04 09:04:49.050000+00:00	16.72
2023-05-04 10:24:51.010000+00:00	16.65
2023-05-04 10:44:53+00:00	16.55
2023-05-04 10:44:53+00:00	NaN

timeseries_dataframe_to_datadict(df, recordformat='tight')

{'index': ['2023-05-04T09:04:49Z',
  '2023-05-04T10:24:51Z',
  '2023-05-04T10:44:53Z',
  '2023-05-04T10:44:53Z'],
 'columns': ['value'],
 'data': [[16.72], [16.65], [16.55], [nan]],
 'index_names': ['time'],
 'column_names': [None]}

test_data = timeseries_dataframe_to_datadict(df, recordformat='records')

test_data

[{'time': '2023-05-04T09:04:49Z', 'value': 16.72},
 {'time': '2023-05-04T10:24:51Z', 'value': 16.65},
 {'time': '2023-05-04T10:44:53Z', 'value': 16.55},
 {'time': '2023-05-04T10:44:53Z', 'value': nan}]

timeseries_dataframe_resample


def timeseries_dataframe_resample(
    df:DataFrame, period:str, method:str
):

Resamples a time-series DataFrame on the specified period and method.

Parameters: df (pd.DataFrame): The input time-series DataFrame. period (str): The resampling period. method (str): The resampling method. Can be a string of multiple methods separated by ‘;’. method_args (dict, optional): Additional arguments for the resampling method.

Returns: pd.DataFrame: The resampled DataFrame.

df = timeseries_dataframe_from_datadict([
      {
         "time":"2023-05-04T10:04:49.000Z",
         "value":16.72
      },
      {
         "time":"2023-05-04T10:24:51.000Z",
         "value":16.65
      },
      {
         "time":"2023-05-04T10:44:53.000Z",
         "value":16.55
      },
      {
         "time":"2023-05-04T11:04:49.000Z",
         "value":16.47
      },
      {
         "time":"2023-05-04T11:24:51.000Z",
         "value":16.44
      },
      {
         "time":"2023-05-04T11:44:53.000Z",
         "value":16.38
      },
   ], timecolumns=['time'])

timeseries_dataframe_resample(df, "80min", 'mean;count')

	value	value_mean	value_count
time
2023-05-04 09:20:00+00:00	NaN	16.685	2.0
2023-05-04 10:04:49+00:00	16.72	NaN	NaN
2023-05-04 10:24:51+00:00	16.65	NaN	NaN
2023-05-04 10:40:00+00:00	NaN	16.460	4.0
2023-05-04 10:44:53+00:00	16.55	NaN	NaN
2023-05-04 11:04:49+00:00	16.47	NaN	NaN
2023-05-04 11:24:51+00:00	16.44	NaN	NaN
2023-05-04 11:44:53+00:00	16.38	NaN	NaN

Timeseries dataframes

set_time_index_zone

set_time_index_zone

Example

timeseries_dataframe

timeseries_dataframe

timeseries_dataframe_from_datadict

timeseries_dataframe_from_datadict

Timezones

Which is faster, strftime() or isoformat()?

timeseries_dataframe_to_datadict

timeseries_dataframe_to_datadict

timeseries_dataframe_resample

`set_time_index_zone`

`timeseries_dataframe_from_datadict`

Which is faster, `strftime()` or `isoformat()`?

`timeseries_dataframe_to_datadict`