Timeseries datarames

Fill in a module description here

Timeseries dataframes

Timeseries data is a cornerstone of our data manipulation and most processing is on them

set_time_index_zone

Processing may depend on proper timezone awareness, this utility to set the timezone on a datetime index


source

set_time_index_zone

 set_time_index_zone (df:pandas.core.frame.DataFrame, timezone)

*Sets the time zone of the index of a pandas DataFrame.

Args: df (pd.DataFrame): The DataFrame whose index time zone is to be set. timezone (str): The desired time zone.

Returns: pd.DataFrame: The modified DataFrame with its index time zone set to the specified time zone.

Raises: None

Examples: >>> df = pd.DataFrame({‘A’: [1, 2, 3]}, index=pd.DatetimeIndex([‘2022-01-01’, ‘2022-01-02’, ‘2022-01-03’])) >>> set_time_index_zone(df, ‘Europe/Berlin’) A 2022-01-01 1 2022-01-02 2 2022-01-03 3 DatetimeIndex: 3 entries, 2022-01-01 01:00:00+01:00 to 2022-01-03 01:00:00+01:00*

Type Details
df DataFrame Dataframe to set or convert the timeindex on
timezone Timezone to set

Example

df = pd.DataFrame({'A': [1, 2, 3]}, index=pd.DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03']))
set_time_index_zone(df, 'Europe/Berlin')
df.index
DatetimeIndex(['2022-01-01 01:00:00+01:00', '2022-01-02 01:00:00+01:00',
               '2022-01-03 01:00:00+01:00'],
              dtype='datetime64[ns, Europe/Berlin]', name='time', freq=None)

timeseries_dataframe

Converts Pandas dataframes and series, Numpy array’s and recarrays or a dictionary of individual timeseries into a Pandas dataframe with one datetime index. With all arrays dataframes and series it is assumed that the first column contains the timestamps.


source

timeseries_dataframe

 timeseries_dataframe (data:Union[pandas.core.frame.DataFrame,pandas.core.
                       series.Series,dict,numpy.ndarray,numpy.rec.recarray
                       ], timezone='UTC', columnnames=None)

*Convert various tabular data formats to timeseries DataFrame

Args: data (Union[pd.DataFrame, pd.Series, dict, np.ndarray, np.recarray]): The input data to be converted. timezone (str, optional): The timezone to set for the index of the DataFrame. Defaults to ‘UTC’. columnnames (Optional[List[str]]): The column names to use for the DataFrame. Defaults to None.

Returns: pd.DataFrame: The converted timeseries DataFrame with the index set to the specified timezone.*

timeseries_dataframe_from_datadict


source

timeseries_dataframe_from_datadict

 timeseries_dataframe_from_datadict (data:dict, timecolumns=None,
                                     recordformat='records', nested=False)

Converts a data dict into a pandas DataFrame based on the specified record format. Parameters: - data: A dictionary containing the data to convert. - timecolumns: A list of column names to be treated as time columns. - recordformat: A string specifying the format of the data records (‘records’, ‘table’, ‘split’, ‘index’, ‘tight’). Returns: - df: A pandas DataFrame with a DatetimeIndex representing the converted data.

Exported source
def timeseries_dataframe_from_datadict(
        data:dict, 
        timecolumns=None,
        recordformat='records',
        nested=False
):
        
    """
    Converts a data dict into a pandas DataFrame based on the specified record format. 
    Parameters:
        - data: A dictionary containing the data to convert.
        - timecolumns: A list of column names to be treated as time columns.
        - recordformat: A string specifying the format of the data records ('records', 'table', 'split', 'index', 'tight').
    Returns:
        - df: A pandas DataFrame with a DatetimeIndex representing the converted data.
    """

    orient = recordformat.lower()
    assert orient in ['records', 'table', 'split', 'index', 'tight']
    assert timecolumns, 'No time columns specified'

    #print(f"Converting {'nested' if nested else 'flat'} data dict to DataFrame with orient={orient} and timecolumns={timecolumns}")
    
    if orient == 'records':
        if nested:
            # data is a nested structure, we need to normalize it
            df = pd.json_normalize(data, sep='.', errors='ignore')  # type: ignore

        else:
            # data is a structured ndarray, sequence of tuples or dicts, or DataFrame
            df = pd.DataFrame.from_records(data, coerce_float=True)  # type: ignore
            
        time_columns_in_df = [C for C in df.columns if C in timecolumns]
        if not  time_columns_in_df:
            time_column = df.columns[0]
        else:
            time_column = time_columns_in_df[0]

    elif orient == 'table':
        # data is in pandas table format
        time_column = data['schema']['primaryKey'][0]
        df = pd.DataFrame.from_dict(data['data'], coerce_float=True).set_index(data['schema']['primaryKey'])
        df.index.name = 'time'
    else:
        # data  is formatted according to 'orient' parameter (pandas)
        df = pd.DataFrame.from_dict(data, orient=orient, coerce_float=True) # type: ignore
        time_column = df.index.name


    df.columns = list(df.columns)
    df[time_column] = pd.to_datetime(df[time_column],utc=True,format='ISO8601')
    df.set_index(time_column, inplace=True)
    df.index = pd.DatetimeIndex(df.index).round('ms')
    
    df.index.name = 'time'

    return df
df = timeseries_dataframe_from_datadict(test_data_dict_3_samples, timecolumns=['time'])
df
value
time
2023-05-04 10:04:49+00:00 16.72
2023-05-04 10:24:51+00:00 16.65
2023-05-04 10:44:53+00:00 16.55
df.index
DatetimeIndex(['2023-05-04 10:04:49+00:00', '2023-05-04 10:24:51+00:00',
               '2023-05-04 10:44:53+00:00'],
              dtype='datetime64[ns, UTC]', name='time', freq=None)
df = set_time_index_zone( 
    timeseries_dataframe_from_datadict(
        test_data_dict_3_samples, 
        timecolumns=['time']
    ), 
    timezone='Europe/Amsterdam'
)
df
value
time
2023-05-04 12:04:49+02:00 16.72
2023-05-04 12:24:51+02:00 16.65
2023-05-04 12:44:53+02:00 16.55
df.index
DatetimeIndex(['2023-05-04 12:04:49+02:00', '2023-05-04 12:24:51+02:00',
               '2023-05-04 12:44:53+02:00'],
              dtype='datetime64[ns, Europe/Amsterdam]', name='time', freq=None)

Timezones

rng = pd.date_range(pd.Timestamp("2018-04-10T09:01:01.123+02:00"), periods=3, freq='s').tz_convert('Europe/Amsterdam')
rng
DatetimeIndex(['2018-04-10 09:01:01.123000+02:00',
               '2018-04-10 09:01:02.123000+02:00',
               '2018-04-10 09:01:03.123000+02:00'],
              dtype='datetime64[ns, Europe/Amsterdam]', freq='s')
rng.strftime("%FT%R:%S%z")
Index(['2018-04-10T09:01:01+0200', '2018-04-10T09:01:02+0200',
       '2018-04-10T09:01:03+0200'],
      dtype='object')
pd.DatetimeIndex(rng.strftime("%FT%R:%S%z")).round('ms')
DatetimeIndex(['2018-04-10 09:01:01+02:00', '2018-04-10 09:01:02+02:00',
               '2018-04-10 09:01:03+02:00'],
              dtype='datetime64[ns, UTC+02:00]', freq=None)
rng.tz_convert('UTC').strftime("%FT%R:%SZ")
Index(['2018-04-10T07:01:01Z', '2018-04-10T07:01:02Z', '2018-04-10T07:01:03Z'], dtype='object')
# .map(lambda x: x.isoformat())
rng = pd.date_range(pd.Timestamp("2018-04-10T09:01:01.123+02:00"), periods=30000, freq='s').tz_convert('Europe/Amsterdam')

Which is faster, strftime() or isoformat()?

ft = rng.strftime("%FT%R:%S%z")
340 ms ± 18.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
ft = rng.map(lambda x: x.isoformat(timespec='milliseconds'))
136 ms ± 9.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

timeseries_dataframe_to_datadict


source

timeseries_dataframe_to_datadict

 timeseries_dataframe_to_datadict
                                   (data:Union[pandas.core.frame.DataFrame
                                   ,pandas.core.series.Series,dict],
                                   recordformat:str='records',
                                   timezone:str='UTC')

*Convert a timeseries DataFrame or Series into a dictionary representation.

Args: data (Union[pd.DataFrame, pd.Series, dict]): The input data to be converted. It can be a pandas DataFrame, Series, or a dictionary. recordformat (str, optional): The format of the output records. Defaults to ‘records’. timezone (str, optional): The timezone to use for the DataFrame index. Defaults to ‘UTC’.

Returns: Union[dict, list]: The converted dictionary representation of the input data, a dictionary or a list of dictionaries depending on the recordformat parameter.*

df = timeseries_dataframe_from_datadict([
      {
         "time":"2023-05-04T10:04:49.050+01:00",
         "value":16.72
      },
      {
         "time":"2023-05-04T10:24:51.010Z",
         "value":16.65
      },
      {
         "time":"2023-05-04T10:44:53.000Z",
         "value":16.55
      },
      {
         "time":"2023-05-04T10:44:53.000Z",
         "value":np.nan
      }
   ], timecolumns=['time'])

df
value
time
2023-05-04 09:04:49.050000+00:00 16.72
2023-05-04 10:24:51.010000+00:00 16.65
2023-05-04 10:44:53+00:00 16.55
2023-05-04 10:44:53+00:00 NaN
df.index
DatetimeIndex(['2023-05-04 09:04:49.050000+00:00',
               '2023-05-04 10:24:51.010000+00:00',
                      '2023-05-04 10:44:53+00:00',
                      '2023-05-04 10:44:53+00:00'],
              dtype='datetime64[ns, UTC]', name='time', freq=None)
df = timeseries_dataframe_from_datadict([
      {
         "time":"2023-05-04T10:04:49.050+01:00",
         "value":16.72
      },
      {
         "time":"2023-05-04T10:24:51.010Z",
         "value":16.65
      },
      {
         "time":"2023-05-04T10:44:53.000Z",
         "value":16.55
      },
      {
         "time":"2023-05-04T10:44:53.000Z",
         "value":np.nan
      }
   ], timecolumns=['time'])

df
value
time
2023-05-04 09:04:49.050000+00:00 16.72
2023-05-04 10:24:51.010000+00:00 16.65
2023-05-04 10:44:53+00:00 16.55
2023-05-04 10:44:53+00:00 NaN
timeseries_dataframe(df, timezone='UTC').index
DatetimeIndex(['2023-05-04 09:04:49.050000+00:00',
               '2023-05-04 10:24:51.010000+00:00',
                      '2023-05-04 10:44:53+00:00',
                      '2023-05-04 10:44:53+00:00'],
              dtype='datetime64[ns, UTC]', name='time', freq=None)
df = timeseries_dataframe_from_datadict([
      {
         "time":"2023-05-04T10:04:49.050+01:00",
         "value":16.72
      },
      {
         "time":"2023-05-04T10:24:51.010Z",
         "value":16.65
      },
      {
         "time":"2023-05-04T10:44:53.000Z",
         "value":16.55
      },
      {
         "time":"2023-05-04T10:44:53.000Z",
         "value":np.nan
      }
   ], timecolumns=['time'])

df
value
time
2023-05-04 09:04:49.050000+00:00 16.72
2023-05-04 10:24:51.010000+00:00 16.65
2023-05-04 10:44:53+00:00 16.55
2023-05-04 10:44:53+00:00 NaN
timeseries_dataframe_to_datadict(df, recordformat='records')
[{'time': '2023-05-04T09:04:49Z', 'value': 16.72},
 {'time': '2023-05-04T10:24:51Z', 'value': 16.65},
 {'time': '2023-05-04T10:44:53Z', 'value': 16.55},
 {'time': '2023-05-04T10:44:53Z', 'value': nan}]
df = timeseries_dataframe_from_datadict([
      {
         "time":"2023-05-04T10:04:49.050+01:00",
         "value":16.72
      },
      {
         "time":"2023-05-04T10:24:51.010Z",
         "value":16.65
      },
      {
         "time":"2023-05-04T10:44:53.000Z",
         "value":16.55
      },
      {
         "time":"2023-05-04T10:44:53.000Z",
         "value":np.nan
      }
   ], timecolumns=['time'])

df
value
time
2023-05-04 09:04:49.050000+00:00 16.72
2023-05-04 10:24:51.010000+00:00 16.65
2023-05-04 10:44:53+00:00 16.55
2023-05-04 10:44:53+00:00 NaN
timeseries_dataframe_to_datadict(df, recordformat='records', timezone='Europe/Berlin')
[{'time': '2023-05-04T11:04:49.050+02:00', 'value': 16.72},
 {'time': '2023-05-04T12:24:51.010+02:00', 'value': 16.65},
 {'time': '2023-05-04T12:44:53.000+02:00', 'value': 16.55},
 {'time': '2023-05-04T12:44:53.000+02:00', 'value': nan}]
df.index
DatetimeIndex(['2023-05-04 09:04:49.050000+00:00',
               '2023-05-04 10:24:51.010000+00:00',
                      '2023-05-04 10:44:53+00:00',
                      '2023-05-04 10:44:53+00:00'],
              dtype='datetime64[ns, UTC]', name='time', freq=None)
df = timeseries_dataframe_from_datadict([
      {
         "time":"2023-05-04T10:04:49.050+01:00",
         "value":16.72
      },
      {
         "time":"2023-05-04T10:24:51.010Z",
         "value":16.65
      },
      {
         "time":"2023-05-04T10:44:53.000Z",
         "value":16.55
      },
      {
         "time":"2023-05-04T10:44:53.000Z",
         "value":np.nan
      }
   ], timecolumns=['time'])

df
value
time
2023-05-04 09:04:49.050000+00:00 16.72
2023-05-04 10:24:51.010000+00:00 16.65
2023-05-04 10:44:53+00:00 16.55
2023-05-04 10:44:53+00:00 NaN
timeseries_dataframe_to_datadict(df, recordformat='tight')
{'index': ['2023-05-04T09:04:49Z',
  '2023-05-04T10:24:51Z',
  '2023-05-04T10:44:53Z',
  '2023-05-04T10:44:53Z'],
 'columns': ['value'],
 'data': [[16.72], [16.65], [16.55], [nan]],
 'index_names': ['time'],
 'column_names': [None]}
test_data = timeseries_dataframe_to_datadict(df, recordformat='records')
test_data
[{'time': '2023-05-04T09:04:49Z', 'value': 16.72},
 {'time': '2023-05-04T10:24:51Z', 'value': 16.65},
 {'time': '2023-05-04T10:44:53Z', 'value': 16.55},
 {'time': '2023-05-04T10:44:53Z', 'value': nan}]

source

timeseries_dataframe_resample

 timeseries_dataframe_resample (df:pandas.core.frame.DataFrame,
                                period:str, method:str)

*Resamples a time-series DataFrame on the specified period and method.

Parameters: df (pd.DataFrame): The input time-series DataFrame. period (str): The resampling period. method (str): The resampling method. Can be a string of multiple methods separated by ‘;’. method_args (dict, optional): Additional arguments for the resampling method.

Returns: pd.DataFrame: The resampled DataFrame.*

df = timeseries_dataframe_from_datadict([
      {
         "time":"2023-05-04T10:04:49.000Z",
         "value":16.72
      },
      {
         "time":"2023-05-04T10:24:51.000Z",
         "value":16.65
      },
      {
         "time":"2023-05-04T10:44:53.000Z",
         "value":16.55
      },
      {
         "time":"2023-05-04T11:04:49.000Z",
         "value":16.47
      },
      {
         "time":"2023-05-04T11:24:51.000Z",
         "value":16.44
      },
      {
         "time":"2023-05-04T11:44:53.000Z",
         "value":16.38
      },
   ], timecolumns=['time'])
timeseries_dataframe_resample(df, "80min", 'mean;count')
value value_mean value_count
time
2023-05-04 09:20:00+00:00 NaN 16.685 2.0
2023-05-04 10:04:49+00:00 16.72 NaN NaN
2023-05-04 10:24:51+00:00 16.65 NaN NaN
2023-05-04 10:40:00+00:00 NaN 16.460 4.0
2023-05-04 10:44:53+00:00 16.55 NaN NaN
2023-05-04 11:04:49+00:00 16.47 NaN NaN
2023-05-04 11:24:51+00:00 16.44 NaN NaN
2023-05-04 11:44:53+00:00 16.38 NaN NaN