Python for Hardware Engineers: Analyzing Temperature and Humidity Data From an OM-24 Sensor

Data analysis of sensor data is a typical hardware engineering activity that often involves an Excel-based workflow. While Excel (and other spreadsheet tools) are great for quick and dirty, one-off analyses, a Python + pandas workflow can really takes things to the next level.

In this post, we’ll be analyzing sensor data from an Omega OM-24 Temperature and Humidity Data Logger using a modern data analysis workflow. Hardware engineers should be very familiar with Omega, as they are a quick and easy online vendor of a wide range of sensors. The OM-24 is an example of a simple, low-cost, no frills sensor: there’s no fancy wiring to do, no difficult calibrations, just turn it on and start recording.

OM-24 Temperature and Humidity Sensor

Fetching the data is even easier: plug in the sensor’s USB connection and copy the “Excel” data. However, this is where we come to our first bit of weirdness. According to the sensor’s user manual:

Sensor’s Data Formatting

What the user manual is really trying to say is that the logger formats the data as Tab-separated values (TSV). The extension should have ideally been .tsv. This won’t cause us any problems in our workflow, but will be something to keep in mind when importing and extracting the data.

In the following sections, we’ll be following best practices and using an extract, transform, load (ETL) workflow and some sample data from an actual OM-24 sensor. This allows us to segment our code (i.e., separation of concerns) into modular, reusable, and easily understandable pieces. While the following example is quite trivial (i.e., the three phases won’t consume a lot of time or computing resources), it never hurts to follow best practices! For example, by following ETL principles, we can easily expand our code to handle the processing thousands of data files instead of simply being a script that handles just one at a time.

From an “architectural” standpoint, we’ll keep it simple and decide that the intermediary data format will be a pandas.DataFrame object that will be passed between ETL phases.

Overall, the process will look like this:

  1. Extract data from the XLS file to a DataFrame object
  2. Transform (i.e., clean, pivot) data
  3. Load data into output formats (e.g., plots, new CSVs)

Note: I will be overzealous with code comments in the following sections to try and explain every little step and reasoning.

Extract: Loading the Data

The first step to an ETL workflow is to load the data from its source into the processing pipeline. In the current example, this phase is trivial. In the real world, this phase can become quite complex if we’re loading data from many different sources, formats, etc.

# pandas is the best data analysis and manipulation tool
import pandas as pd

# load data
pd.read_csv("data.xls", sep="\t")

Unnamed: 0Unnamed: 1Multiple Function Data loggerUnnamed: 3Unnamed: 4Unnamed: 5Unnamed: 6
0Company NameACME IncSN123456789OKNaNNaN
1NaNNaNDevice SpecificationNaNNaNNaNNaN
2Production date26 Jun. 2019NaNStart15:18,21 Apr. 2020NaNNaN
3Production lot987654321NaNFinish16:34,04 May 2020NaNNaN
4Firmware versionV1.6NaNDuration Time13Days 1Hrs. 30Min. 0Sec.NaNNaN
........................
64962304/05/2002:23:00 PM12.9625.5NaN6.3
65062404/05/2002:53:00 PM12.9825.3NaN5.3
65162504/05/2003:23:00 PM1324.6NaN2.2
65262604/05/2003:53:00 PM13.0224.1NaN2.2
65362704/05/2004:23:00 PM13.0423.7NaN3

654 rows × 7 columns

Hmmm… the data is weirdly formatted. If we take a look at the data in a spreadsheet program, we can see that it’s not a well-formed table, but more of an “Excel report”:

Raw Data Format

The actual raw data only starts at row 29, with row 28 being the column headers. Let’s fix our above code.

df = pd.read_csv(
    "data.xls",
    # tab-separated values
    sep="\t",
    # skip 27 non-data rows and start at row 28
    skiprows=27,
    # directly use the following columns for datetime data
    parse_dates=[["Date", "Time"]],
    # automatically try to optimize datetime parsing
    infer_datetime_format=True,
    # the raw datetime data uses a date-first (DD/MM/YYYY) format
    dayfirst=True,
)

df

Date_TimeIndexTime Elapsed (days)°CUnnamed: 5%RH
02020-04-21 15:23:0010.0027.1NaN10.1
12020-04-21 15:53:0020.0227.1NaN3.1
22020-04-21 16:23:0030.0427.1NaN2.5
32020-04-21 16:53:0040.0627.2NaN2.7
42020-04-21 17:23:0050.0827.0NaN3.1
.....................
6222020-05-04 14:23:0062312.9625.5NaN6.3
6232020-05-04 14:53:0062412.9825.3NaN5.3
6242020-05-04 15:23:0062513.0024.6NaN2.2
6252020-05-04 15:53:0062613.0224.1NaN2.2
6262020-05-04 16:23:0062713.0423.7NaN3.0

627 rows × 6 columns

Much better. At this point, the extract phase is technically done.

One shortcut we already took was parsing our Date and Time columns into a datetime-type column using the extra parse_dates argument. This could have been done during the Transform phase, but was more efficient and simple to be done here.

Let’s take a bit of a closer look at our data:

# check time difference between recordings
df["Date_Time"].diff()
0                 NaT
1     0 days 00:30:00
2     0 days 00:30:00
3     0 days 00:30:00
4     0 days 00:30:00
            ...      
622   0 days 00:30:00
623   0 days 00:30:00
624   0 days 00:30:00
625   0 days 00:30:00
626   0 days 00:30:00
Name: Date_Time, Length: 627, dtype: timedelta64[ns]
# find out the min and max dates
[df["Date_Time"].min(), df["Date_Time"].max()]
[Timestamp('2020-04-21 15:23:00'), Timestamp('2020-05-04 16:23:00')]
# print a concise summary of a DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 627 entries, 0 to 626
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Date_Time            627 non-null    datetime64[ns]
 1   Index                627 non-null    int64         
 2   Time Elapsed (days)  627 non-null    float64       
 3   °C                   627 non-null    float64       
 4   Unnamed: 5           0 non-null      float64       
 5   %RH                  627 non-null    float64       
dtypes: datetime64[ns](1), float64(4), int64(1)
memory usage: 29.5 KB
# generate descriptive statistics of the DataFrame
df.describe()

IndexTime Elapsed (days)°CUnnamed: 5%RH
count627.000000627.000000627.0000000.0627.000000
mean314.0000006.52095726.554386NaN1.985965
std181.1435893.7737881.668183NaN2.234080
min1.0000000.00000023.100000NaN0.100000
25%157.5000003.26000024.950000NaN0.200000
50%314.0000006.52000027.100000NaN1.400000
75%470.5000009.78000027.800000NaN2.800000
max627.00000013.04000029.400000NaN20.100000

From the above summaries, we can already start planning our transformation and cleaning steps. Using the diff() function on the time-series column, we can see that the data seems to have been sampled every 30min. Taking the min() and max() of the date_time column gives us the date range of the data: 2020-04-21 to 2020-05-04. The describe() and info() functions give us high-level insight into the data itself, the data types, and the presence of missing (i.e., NaN, NA, null) values.

For example, using this information, our transformation steps might focus on:

  • Dropping the empty column (we can also drop all useless columns)
  • Renaming all our column headers to lowercase (personal preference)
  • Renaming some columns to be easier to parse

Transform: Clean the Data

Next, given the loaded data in a DataFrame object, we clean our data so that it’s well-formed. In complex data science scenarios, this step often accounts for the majority of the (tedious) work.

Throughout this section, I’ll sometimes use the DataFrame.pipe() function and lambda functions to chain my transformation steps. I personally find this much more efficient and tidy than reassigning back to the df variable. I.e., the following approaches are equivalent:

# i tend to use pipes and lambdas
df = (
    df
    .function1()
    .pipe(lambda x: x.function2())
    .function3()
    .pipe(lambda x: x.function4())
)

# instead of reassigning back to `df` at each step

df = df.function1()
df = df.function2()
df = df.function3()
df = function4(df...)

This also has the benefit of not having to reference the df variable directly in any of our transformation functions, which can cause future issues when modifying the code for expanded/future use. Instead, the “piped” functions will reference an intermediate representation of our DataFrame through a scoped local variable (i.e., x in the case above).

Let’s get started cleaning our data:

df = (
    df
    # drop empty columns
    .dropna(how="all", axis="columns")
    # rename columns to more simple-to-type names
    .rename(columns={"°C": "temperature", "%RH": "humidity"})
    # lowercase column names
    # (notice how we reference local variable `x` here and not `df`)
    .pipe(lambda x: x.rename(columns={c: c.lower() for c in x.columns}))
    # keep only the columns we care about
    .pipe(lambda x: x[["date_time", "temperature", "humidity"]])
)

df

date_timetemperaturehumidity
02020-04-21 15:23:0027.110.1
12020-04-21 15:53:0027.13.1
22020-04-21 16:23:0027.12.5
32020-04-21 16:53:0027.22.7
42020-04-21 17:23:0027.03.1
............
6222020-05-04 14:23:0025.56.3
6232020-05-04 14:53:0025.35.3
6242020-05-04 15:23:0024.62.2
6252020-05-04 15:53:0024.12.2
6262020-05-04 16:23:0023.73.0

627 rows × 3 columns

Now that’s some clean, simple, and pretty data!

Transform: Data Processing and Pivoting

Now that we have clean, well-formed data, we can begin the second part of the Transform phase: processing our data into useful tables. While the current data table (i.e., a list of temperatures and humidities vs time) is very useful, we may also be interested in answering other questions.

And that’s the key to data science: always have a core question (or hypothesis) that you’re trying to answer. Fishing for interesting graphs is often a waste of time. Come into your data process with a goal, and come out with an answer. The joy of using an programmatic ETL-based workflow is that if we ever want to ask a new question about our data, we can simply drop in an extra transformation instead of having to recreate a whole new Excel workbook. The overall pipeline will grow and evolve with us and our data.

Given this, I have three additional (example) questions:

  1. What is the daily min, max, mean of the temperature and humidity?
  2. How much and what percentage of time (hours) was spent above the mean temperature?
  3. What is the max humidity with respect to time of day (4h periods) and day of the week?

For each of these questions, we will create a new DataFrame object that will be transformed from the original df. We want to keep our original data as pure as possible (very much a pipeline workflow), so that we can always come back to the source of truth.

Descriptive Statistics Per Day

Our first question is relatively simple and may come up if we’re looking for longitudinal trends in the data. We will take advantage of pandas' fantastic DataFrame.resample() function to change the frequency of our time-series data from every 30min to every day using DateOffset datetime increments. To properly do this, we have to make our date_time column our DataFrame index by creating a DatetimeIndex.

Next, we will apply our aggregation functions to the resampled DataFrame to generate the new values. We can take two approaches to this step:

  1. Create a new DataFrame per aggregation function (i.e., one for the mean, one for the max, one for the min)
  2. Create a multindex DataFrame with all three aggregation functions (preferred)

The first method would look like this:

df_daily_avg = (
    # create DatetimeIndex
    df.set_index("date_time")
    # DateOffset `D` for daily
    .resample("D")
    # retrieve the mean values from the resampled data
    .mean()
)

df_daily_avg

temperaturehumidity
date_time
2020-04-2127.1611112.672222
2020-04-2225.1833331.231250
2020-04-2326.8875000.239583
2020-04-2426.2250000.291667
2020-04-2524.1916670.489583
2020-04-2623.9062501.081250
2020-04-2725.2270831.214583
2020-04-2826.6312500.802083
2020-04-2927.5395831.068750
2020-04-3027.8187502.629167
2020-05-0127.9812505.312500
2020-05-0228.0208333.256250
2020-05-0328.2895834.537500
2020-05-0427.3151524.051515

The second (better) method that combines aggregation functions:

df_daily = (
    # create DatetimeIndex
    df.set_index("date_time")
    # DateOffset `D` for daily
    .resample("D")
    # aggregation functions
    .agg(["min", "mean", "max"])
)

df_daily

temperaturehumidity
minmeanmaxminmeanmax
date_time
2020-04-2126.227.16111127.90.52.67222210.1
2020-04-2223.225.18333327.70.11.2312503.6
2020-04-2325.826.88750028.20.10.2395831.5
2020-04-2424.126.22500028.40.10.2916671.3
2020-04-2523.724.19166724.90.10.4895831.8
2020-04-2623.323.90625024.80.11.0812502.6
2020-04-2723.125.22708327.70.11.2145832.8
2020-04-2826.126.63125027.60.10.8020832.8
2020-04-2926.827.53958328.50.11.0687503.2
2020-04-3027.127.81875028.70.22.6291676.4
2020-05-0127.227.98125029.41.05.31250020.1
2020-05-0227.428.02083328.91.43.2562505.8
2020-05-0327.728.28958329.32.34.5375007.8
2020-05-0423.727.31515229.12.04.0515156.8

Voilà! The first question is answered and we have the daily statistics of temperature and humidity. Plotting and other transformations for sharing will be performed in the Load phase.

Time Above a Threshold Temperature

Compared the the first question, the second question is a little more tricky to calculate. Essentially, we’re looking for a type of integral over a threshold value. This kind of question would arise if we’re trying to figure out things like service time, mean time between failure (MTBF), or other metrics/KPIs that depend on a certain limits.

Let’s get ahead of ourselves a bit and plot the data to have a visual understanding of what we want:

ax = df.set_index("date_time")["temperature"].plot.line(figsize=(10, 4))
ax.axhline(df["temperature"].mean(), c="C1", ls="--");

png

As seen in the above plot, we want the sum of the time of the data above the dashed line (mean temperature). An easy approach would be to segment our data around a condition (i.e., “is above mean temperature”, True or False) and sum the time deltas of the segmented data points (think Riemann sum). The accuracy of this approach would be limited by the temporal resolution of our data.

df_time_temp_threshold = (
    df
    # get the time delta each recording represents
    # and replace (`fillna`) not-a-time (NaT) values with 0
    .pipe(
        lambda x: x.assign(
            time_delta=x["date_time"].diff().fillna(pd.Timedelta(seconds=0))
        )
    )
    # convert time to numeric (i.e., hours)
    .pipe(lambda x: x.assign(time_delta=x["time_delta"].dt.total_seconds() / (60 * 60)))
    # flag the values that are above the threshold
    .pipe(lambda x: x.assign(is_above=x["temperature"] > x["temperature"].mean()))
    # drop unneeded columns
    .drop(["temperature", "humidity"], axis="columns")
    # pivot around threshold flag and sum time
    .groupby("is_above").sum()
    # create percentage column
    .pipe(lambda x: x.assign(percent=x["time_delta"] / x["time_delta"].sum() * 100))
)

df_time_temp_threshold

time_deltapercent
is_above
False114.036.421725
True199.063.578275

Ta-da! It appears that 199h or 63.6% of the time was spent above the mean temperature.

Humidity in Two Dimensions

For our final question, we will be expanding our data into two dimensions: periods of the day and day of week. This kind of question often arises when we’re trying to spot environmental and external trends. For example, do business hours, day/night cycles, or weekday vs weekends affect the data?

As an alternative example, Google Analytics asks a similar type of question: “When do your users visit?”. The resulting Users by time of day visualization is a great approach to understanding external influences on your dataset.

Example of users by time of day.

For the transformation of humidity data, this will require us to both resample our data with respect to time (i.e., create time periods) and group the data with respect to day of the week. Fortunately, pandas makes this pretty straight forward:

# use the built-in package `calendar` to get an array
# that represents the ordered days of the week
from calendar import day_name

df_2d_humidity = (
    df
    # create time series index
    .set_index("date_time")
    # resample the data into 4h periods
    .resample("4H")
    # take the max of the resampled data
    .max()
    # extract the day of week as an ordered Categorical data type
    .pipe(
        lambda x: x.assign(
            day=pd.Categorical(
                # the given day_name values
                values=x.index.day_name(),
                # the correct calendar order
                categories=list(day_name),
                ordered=True,
            )
        )
    )
    # sort by day of the week
    .pipe(lambda x: x.sort_values("day"))
    # extract the hour of day
    .pipe(lambda x: x.assign(hour=x.index.hour))
    # drop unneeded columns
    .drop("temperature", axis="columns")
    # group by our days and hours
    .groupby(["day", "hour"])
    # take the max value of the groupings
    .max()
    # pivot data from grouped rows to 2D matrix
    .unstack(level=0)
    # drop redundant `humidity` column index
    .droplevel(0, axis="columns")
)

df_2d_humidity

dayMondayTuesdayWednesdayThursdayFridaySaturdaySunday
hour
06.82.83.63.25.55.86.3
46.42.63.34.65.85.36.9
86.62.72.95.07.04.97.0
126.310.13.26.420.14.67.2
163.04.22.95.89.34.77.4
202.33.23.05.36.75.37.8

And there we have it. Each row of this transformation represents a single period of time, divided over days of the week (columns). From the raw data, we can already see that noon tends to be an outlier and Wednesdays have relatively low humidity.

Load: Generate Pretty and Useful Outputs

The final phase of an ETL workflow, Load, usually refers to the insertion of data into the final target database. In a more general context, we are preparing the data for a desired output format, whether than be other databases, plots for a report, or CSV files for other spreadsheet users.

Save Your Raw Data!

I personally always begin by saving my (new, clean) raw data back to CSV for safe keeping. This offers other non-technical people Excel-based access to the same data that was used to generate pretty plots and reports.

# structure all the DataFrames into a simple dict to make for efficient processing
dfs = {
    "data-cleaned": df,
    "data-humidity-day": df_2d_humidity,
    "data-daily": df_daily,
    "data-temp-threshold": df_time_temp_threshold,
}

# iterate and save the data to CSV
for name, data in dfs.items():
    data.to_csv(f"{name}.csv", index=False)

Pretty Plots

The target output for most hardware engineering projects is a pretty plot for a report. A well-crafted visualization can make or break a design decision.

Our first step is to set up our global plotting settings:

# import matplotlib as the core plotting framework
from matplotlib import pyplot as plt
import matplotlib as mpl

# we can also import seaborn for some fancy plots
import seaborn as sns

# remove some spines for aesthetics
# (spines are the lines connecting the axis tick marks)
mpl.rcParams["axes.spines.right"] = False
mpl.rcParams["axes.spines.top"] = False

# best fig size for my blog
FIG_SIZE = (10, 4)

Next, we will go through our transformed data and generate useful plots to visualize their meaning.

Raw Data vs Time

# use matplotlib for more fine-grained control
fig, axes = plt.subplots(nrows=2, figsize=FIG_SIZE)

# plot data on two subplots
df.set_index("date_time")["temperature"].plot.line(ax=axes[0])
df.set_index("date_time")["humidity"].plot.line(ax=axes[1])

# fig title and axis labels
fig.suptitle("Raw Measurements vs Time")
[ax.set_xlabel(None) for ax in axes]
axes[0].set_ylabel("Temperature [degC]")
axes[1].set_ylabel("Relative Humidity [%]")

# prepare output
fig.tight_layout()
fig.savefig("my-output-path.pdf")

png

Descriptive Statistics vs Time

# use matplotlib for more fine-grained control
fig, axes = plt.subplots(nrows=2, figsize=FIG_SIZE)

# plot data on two subplots
df_daily["temperature"].plot.line(ax=axes[0])
df_daily["humidity"].plot.line(ax=axes[1])

# fig title and axis labels
fig.suptitle("Daily Measurements")
[ax.set_xlabel(None) for ax in axes]
axes[0].set_ylabel("Temperature [degC]")
axes[1].set_ylabel("Relative Humidity [%]")

# prepare output
fig.tight_layout()
fig.savefig("my-output-path.pdf")

png

Humidity Heatmap by Time of Day

# use matplotlib for more fine-grained control
fig, ax = plt.subplots(figsize=FIG_SIZE)

# plot heatmap
sns.heatmap(
    # data we want to use
    data=df_2d_humidity,
    # make colourmap based on quantiles
    robust=True,
    # mark values in each cell
    annot=True,
    # add lines around each cell
    linewidths=1,
    # don't need colourbar with cell values
    cbar=False,
    # use matplotlib axis
    ax=ax,
)

# fig title and axis labels
fig.suptitle("Relative Humidity by Time of Day")
ax.set_xlabel("Weekday")
ax.set_ylabel("Hour of Day")

# prepare output
fig.savefig("my-output-path.pdf")

png

Discussion

A Python-based workflow for data processing can seem daunting at first, but can really unlock the next level of manipulation and analysis once you get the hang of it. While this blog post sectioned the code into chunks and snippets, it was actually written as a Jupyter Notebook, mixing both the code and text into a single document. Think about how powerful this is: your report and data analysis can live in a single document and single source of truth. If you’re extra curious, you can find the source code (i.e., the raw Jupyter Notebook) in my website’s public repository.

For me personally, the real win of this workflow is keeping the original logger data untouched and pure. No possibility of accidentally changing the data and hitting Ctrl+S in Excel. The workflow is repeatable, deterministic, and the analysis can be audited alongside the results for extra transparency.

Nicholas Nadeau, Ph.D., P.Eng.
Nicholas Nadeau, Ph.D., P.Eng.
Project Director

Nicholas Nadeau is the project director at Halodi Robotics, leading their mission of bringing safe and capable humanoid robots to everyone.

Related