
Analyzing data: What pandas and SQL taught me about taking an average

For Python developers who work primarily with data, it's hard not to find yourself constantly knee-deep in SQL and Python's open source data library, pandas. Despite how easy these tools have made it to manipulate and transform data (sometimes as concisely as one line of code), analysts still must always understand their data and what their code means. Even calculating something as simple as summary statistics can be prone to serious errors.

In this article, we take a look at the arithmetic mean. Although it is conventionally taught with one-dimensional data in mind, calculating it for multidimensional data requires a fundamentally different process. In fact, calculating the arithmetic mean as if your data is one-dimensional will produce grossly incorrect figures, sometimes orders of magnitude off from what was intended. For me, this was a humbling experience: even the arithmetic mean is no less deserving of double- and triple-checking than any other calculation.

Few statistical calculations rival the simplicity and explanatory power of the most basic: percentage, sum, and average, above all. As a result, they crop up everywhere, from exploratory data analysis to data dashboards and management reports. But one of these, the arithmetic mean, is unusually problematic.

Back to basics

The arithmetic mean is defined as:

mean = (x₁ + x₂ + … + xₙ) / n

Or:

SUM(all observations) / COUNT(number of observations)

We can see this through a simple apple-picking example:

Apples

name num_apples
Katie 4
Alan 8
John 10
Tess 8
Jessica 5

What constitutes an observation here? One person, defined by a single column (name), also known as a dimension or attribute.

Using the formula above, we can calculate the arithmetic mean:

SUM(4 + 8 + 10 + 8 + 5) / 5 = 7

In SQL, we’d write:

SELECT AVG(num_apples) FROM apples

What did we just calculate? “The average number of apples picked by person” (where each person represents one observation).
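
The query above is the article's own SQL; for the pandas users in the title, here is a minimal sketch of the same calculation (the DataFrame construction is mine, mirroring the table above):

import pandas as pd

# One-dimensional data: each row is one person, the unit of observation
apples = pd.DataFrame({
    "name": ["Katie", "Alan", "John", "Tess", "Jessica"],
    "num_apples": [4, 8, 10, 8, 5],
})

# Equivalent of SELECT AVG(num_apples) FROM apples
print(apples["num_apples"].mean())  # 7.0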

Adding complexity: Two-dimensional data

Apples

date name num_apples
2017-09-24 Katie 4
2017-09-24 Alan 8
2017-09-24 John 10
2017-09-24 Tess 8
2017-09-26 Katie 5

In this example, we replaced Jessica with Katie, but on a different date.

Now each observation in the table is no longer just (name). Katie appears twice, in separate observations, because Katie picked apples on two different days. Rather, each observation consists of two dimensions: (date, name).

We can ask the same question as before: “What is the average number of apples picked by person?”

We should expect a single number, just like before. Should we expect the average to equal 7, like we got before?

Going back to our formula:

SUM(4 + 8 + 10 + 8 + 5) / 4 = 8.75

So although the numerator (number of apples picked) has stayed the same, the denominator (number of people) went from 5 to 4. Katie picked apples twice, on separate days, so we don't double count her.

What's going on here? The unit of observation defined at the table level differs from the unit of observation for our analysis.

For our analytical question, we aren't asking about how many days people picked apples. We're simply asking for the average number of apples picked by everybody, and we should end up with an answer like “7 apples picked on average” or “10 apples picked on average.” If Katie happens to pick apples on more days than everyone else, that should genuinely increase the average. In any random sample of apple pickers, we may get people like Katie who pick apples far more often than anyone else, which pushes up the average number of apples picked per person.

So how would we write this in SQL? This wouldn't work:

SELECT AVG(num_apples) FROM apples

This would give us the same answer as before: 7.

What we have to do is collapse the data to the level of analysis we care about. We aren't asking for the average number of apples picked by date-person, which is what the query above would give us. We are asking about the number of apples the average person picks. The level of observation for our analysis is a person (name), not a date-person (date, name).

So our question would appear to be this:

SELECT AVG(num_apples) FROM (
    SELECT name, SUM(num_apples) AS num_apples
    FROM apples
    GROUP BY name
) AS t

Scary.

The interior question offers us this end result set:

Apples

name num_apples
Katie 9
Alan 8
John 10
Tess 8

Now, this is what we want to take the average of! The outer query then does this:

SUM(9 + 8 + 10 + 8) / 4 = 8.75

So what did we learn here? Our analytical question required that we reduce the dimensionality of the data below what was defined by the table. The table defined an observation of two dimensions (date, name), but our analytical question demanded an observation of one dimension (name).

This change in dimensionality through collapsing resulted in a change in the number of observations in the denominator, which changed our average.

And to restate the obvious: if we didn't perform this collapsing operation on our original data, the first average we calculated would be wrong.
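
For anyone following along in pandas, the same collapse looks like this; a hedged sketch, assuming a DataFrame that mirrors the two-dimensional table above:

import pandas as pd

# Two-dimensional data: each row is a (date, name) observation
apples = pd.DataFrame({
    "date": ["2017-09-24", "2017-09-24", "2017-09-24", "2017-09-24", "2017-09-26"],
    "name": ["Katie", "Alan", "John", "Tess", "Katie"],
    "num_apples": [4, 8, 10, 8, 5],
})

# Naive average over (date, name) rows: the wrong denominator (5)
print(apples["num_apples"].mean())  # 7.0

# Collapse to one row per person, then average: the right denominator (4)
per_person = apples.groupby("name")["num_apples"].sum()
print(per_person.mean())  # 8.75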

Why does this happen, more generally?

When data is stored in a database, a level of granularity must be specified. In other words, “what constitutes an individual observation?”

You might imagine a table storing data like this:

Sales

date products_sold
2017-09-21 21
2017-09-22 28
2017-09-24 19
2017-09-25 21
2017-09-26 19
2017-09-27 18

But you could also imagine a table that stores the same data, just with more granularity, like this:

Sales

date product_category products_sold
2017-09-21 T-Shirts 16
2017-09-21 Jackets 2
2017-09-21 Hats 3
2017-09-22 T-Shirts 23
2017-09-22 Hats 5
2017-09-24 T-Shirts 10
2017-09-24 Jackets 3
2017-09-24 Hats 6
2017-09-25 T-Shirts 21
2017-09-26 T-Shirts 14
2017-09-26 Hats 5
2017-09-27 T-Shirts 14
2017-09-27 Jackets 4

The unit of observation defined at the table level is called the primary key. A primary key is required in all database tables and applies a constraint that every observation must be unique. After all, if an observation appears twice but isn't unique, it should just be one observation.

It typically follows a syntax like this:

CREATE TABLE sales (
    date DATE NOT NULL default '0000-00-00',
    product_category VARCHAR(40) NOT NULL default '',
    products_sold INT,
    PRIMARY KEY (date, product_category)  -- <------ the primary key
)

Notice that the level of granularity we choose to record our data at is literally part of the definition of our table. The primary key defines “a single observation” in our data. And it's required before we start storing any data at all.
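
Pandas, for what it's worth, never forces you to declare a primary key: a DataFrame will happily hold duplicate observations. If you want a roughly analogous constraint, one option (my own suggestion, not something the article prescribes) is to make the would-be key the index and have pandas verify its uniqueness:

import pandas as pd

sales = pd.DataFrame({
    "date": ["2017-09-21", "2017-09-21", "2017-09-22"],
    "product_category": ["T-Shirts", "Jackets", "T-Shirts"],
    "products_sold": [16, 2, 23],
})

# Roughly analogous to PRIMARY KEY (date, product_category):
# raises ValueError if any (date, product_category) pair appears twice
sales = sales.set_index(["date", "product_category"], verify_integrity=True)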

Now, just because we record data at that level of granularity doesn't mean we need to analyze it at that level of granularity. The level of granularity at which we need to analyze our data will always be a function of what kind of questions we are trying to answer.

The key takeaway here is that the primary key defines an observation at the table level, and this may comprise one or two or 20 dimensions. But our analysis will probably not define an observation so granularly (e.g., we may only care about sales per day), so we must collapse the data and redefine the observation for our analysis.

Formalizing the pattern

So we know that, for any analytical question we ask, we need to redefine what constitutes a single observation, independent of whatever the primary key happens to be. If we just take the average without collapsing our data, we will end up with too many observations (i.e., the number defined by the primary key) in our denominator, and therefore too low an average.

To review, using the same data as above:

Sales

date product_category products_sold
2017-09-21 T-Shirts 16
2017-09-21 Jackets 2
2017-09-21 Hats 3
2017-09-22 T-Shirts 23
2017-09-22 Hats 5
2017-09-24 T-Shirts 10
2017-09-24 Jackets 3
2017-09-24 Hats 6
2017-09-25 T-Shirts 21
2017-09-26 T-Shirts 14
2017-09-26 Hats 5
2017-09-27 T-Shirts 14
2017-09-27 Jackets 4

“What's the average number of products sold per day?”

Well, there are six days in this data set and a total of 126 products sold. That's 21 products sold per day on average.

It is not 9.7, which is what you get from this query:

SELECT AVG(products_sold) FROM sales

We need to collapse the data like so:

SELECT AVG(quantity) FROM (
    SELECT date, SUM(products_sold) AS quantity
    FROM sales
    GROUP BY date
) AS t

Giving us 21. We can get a sense of the magnitude here: 9.7 is nowhere near 21.

Annotating that query above:

SELECT AVG(quantity) FROM (
    SELECT date, SUM(products_sold) AS quantity
    FROM sales
    GROUP BY date -- [COLLAPSING KEY]
) AS t

Here, I define the collapsing key as the “unit of observation relevant to our analysis.” It has nothing to do with the primary key: it ignores any columns we don't care about, such as (product_category). The collapsing key says: “We want to work with only this level of granularity, so roll up any granularity below it by adding it all up.”

In this case, we are explicitly defining the unit of observation for our analysis as (date), which will determine the number of rows in the denominator. If we don't do this, who knows how many observations (rows) will slip into the denominator? (Answer: however many we see at the primary key level.)
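
In pandas, the collapsing key is simply whatever you pass to groupby before taking the mean. A sketch under the same assumption as before (the DataFrame mirrors the granular sales table above):

import pandas as pd

# One row per (date, product_category): the table's primary key level
sales = pd.DataFrame(
    [("2017-09-21", "T-Shirts", 16), ("2017-09-21", "Jackets", 2),
     ("2017-09-21", "Hats", 3), ("2017-09-22", "T-Shirts", 23),
     ("2017-09-22", "Hats", 5), ("2017-09-24", "T-Shirts", 10),
     ("2017-09-24", "Jackets", 3), ("2017-09-24", "Hats", 6),
     ("2017-09-25", "T-Shirts", 21), ("2017-09-26", "T-Shirts", 14),
     ("2017-09-26", "Hats", 5), ("2017-09-27", "T-Shirts", 14),
     ("2017-09-27", "Jackets", 4)],
    columns=["date", "product_category", "products_sold"],
)

# Naive average over primary-key-level rows: 126 / 13, roughly 9.7 (not our question)
print(sales["products_sold"].mean())

# Collapse to the (date) level first, then average: 126 / 6 = 21.0
daily = sales.groupby("date")["products_sold"].sum()  # [COLLAPSING KEY]
print(daily.mean())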

Unfortunately, the collapsing key is not the end of the story.

What if we want averages for groups? Like, “What's the average number of products sold by category?”

Working with groups

“What’s the average number of products sold by category?”

Seems like a harmless question. What could go wrong?

SELECT product_category, AVG(products_sold)
FROM sales
GROUP BY product_category

Nothing. That actually works. That's the right answer. We get:

Sales

product_category AVG(products_sold)
T-Shirts 16.33
Jackets 3
Hats 4.75

Sanity checking for jackets: There are three days on which we sell jackets, and we sell a total of 4 + 3 + 2 = 9, so the average is 3.

Immediately I'm thinking: “Three what?” Answer: “Three jackets sold on average.” Question: “Average what?” Answer: “On an average day, we sell three jackets.”

OK, so now we see our original question wasn't precise enough. It said nothing about days!

Here's the question we really answered: “For each product category, what's the average number of products sold per day?”
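
In pandas, the same (accidentally well-posed) grouped average is a one-liner; this sketch assumes the sales DataFrame constructed in the earlier sketch is still in scope:

# Works only because each row already is one day's sales for its category,
# so the denominator in each group is a count of days
print(sales.groupby("product_category")["products_sold"].mean())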

Anatomy of an averaging question, in English

Since the goal of any SQL query is ultimately to be a direct, declarative translation of a question posed in plain English, we first need to understand the parts of the question in English.

Let's decompose this: “For each product category, what's the average number of products sold per day?”

There are three parts:

  • Groups: We want an average for each product category (product_category)
  • Observation: Our denominator should be the number of days (date)
  • Measurement: The numerator is the measurement variable we are summing up (products_sold)

For each group, we want an average, which will be the total number of products sold per day divided by the number of days in that group.

Our goal is to translate these English components directly into SQL.

From English to SQL

Here is some transaction data:

Transactions

date product state purchaser quantity
2016-12-23 vacuum NY Brian King 1
2016-12-23 stapler NY Brian King 3
2016-12-23 printer ink NY Brian King 2
2016-12-23 stapler NY Trevor Campbell 1
2016-12-23 vacuum MA Lauren Mills 1
2016-12-23 printer ink MA John Smith 5
2016-12-24 vacuum MA Lauren Mills 1
2016-12-24 keyboard NY Brian King 2
2016-12-25 keyboard MA Tom Lewis 4
2016-12-26 stapler NY John Doe 1

“For each state and product, what’s the average number of products sold per day?”

SELECT state, product, AVG(quantity)
FROM transactions
GROUP BY state, product

This gives us:

Transactions

state product AVG(quantity)
NY vacuum 1
NY stapler 1.66
NY printer ink 2
NY keyboard 2
MA vacuum 1
MA printer ink 5
MA keyboard 4

Sanity checking on (NY, stapler), we should get a total of 3 + 1 + 1 = 5 over 2 days (2016-12-23 and 2016-12-26), giving us 2.5…

Alas, the SQL result gives us 1.66. The query must be wrong.

Here's the correct query:

SELECT state, product, AVG(quantity) FROM (
    SELECT state, product, date, SUM(quantity) AS quantity
    FROM transactions
    GROUP BY state, product, date
) AS t
GROUP BY state, product

Giving us:

Transactions

state product AVG(quantity)
NY vacuum 1
NY stapler 2.5
NY printer ink 2
NY keyboard 2
MA vacuum 1
MA printer ink 5
MA keyboard 4
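
Here is the same two-step logic in pandas, as a hedged sketch (the DataFrame mirrors the transactions table; the variable names are mine):

import pandas as pd

transactions = pd.DataFrame(
    [("2016-12-23", "vacuum", "NY", "Brian King", 1),
     ("2016-12-23", "stapler", "NY", "Brian King", 3),
     ("2016-12-23", "printer ink", "NY", "Brian King", 2),
     ("2016-12-23", "stapler", "NY", "Trevor Campbell", 1),
     ("2016-12-23", "vacuum", "MA", "Lauren Mills", 1),
     ("2016-12-23", "printer ink", "MA", "John Smith", 5),
     ("2016-12-24", "vacuum", "MA", "Lauren Mills", 1),
     ("2016-12-24", "keyboard", "NY", "Brian King", 2),
     ("2016-12-25", "keyboard", "MA", "Tom Lewis", 4),
     ("2016-12-26", "stapler", "NY", "John Doe", 1)],
    columns=["date", "product", "state", "purchaser", "quantity"],
)

# Wrong: averages over primary-key-level rows, so (NY, stapler) is 5 / 3 (about 1.66)
wrong = transactions.groupby(["state", "product"])["quantity"].mean()

# Right: collapse to (state, product, date) first, then average per (state, product)
daily = (transactions
         .groupby(["state", "product", "date"], as_index=False)["quantity"].sum())  # collapsing key
right = daily.groupby(["state", "product"])["quantity"].mean()  # grouping key
print(right.loc[("NY", "stapler")])  # 2.5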

Anatomy of an averaging question, in SQL

We determined that there are three parts to an averaging question in English, and if we don't respect them, we will miscalculate the average. We also know that the components in English should translate into components in SQL.

Here they are:

SELECT state, product,
AVG(quantity) -- [MEASUREMENT VARIABLE]
FROM (
    SELECT state, product, date, SUM(quantity) AS quantity
    FROM transactions
    GROUP BY state, product, date -- [COLLAPSING KEY]
) AS t
GROUP BY state, product -- [GROUPING KEY]

    -- [OBSERVATION KEY] = [COLLAPSING KEY] - [GROUPING KEY]
    -- (date) = (state, product, date) - (state, product)

This is the same query as above, just with comments.

Notice that the collapsing key is not in our English question. It is like faking a primary key, but for our analysis, instead of using the one defined in the table.

Also notice that in the SQL translation, the observation key is implicit, not explicit. The observation key is equal to the collapsing key (i.e., just the dimensions we need for our analysis, and nothing more) minus the grouping key (the dimensions on which we are grouping). Whatever is left: that is the observation key, or what defines an observation for our analysis.

I am the first to admit just how confusing it is that the most important part of our averaging question (that is, what defines an observation) is not even explicit in SQL. It is implicit. I call this the pitfall of taking multidimensional averages.

The takeaway is as follows:

  • The collapsing key defines which dimensions we will use in our analysis. Everything else from the primary key of the table is to be “rolled up.” We define the collapsing key in the GROUP BY of the inner query.
  • The grouping key is the dimension on which we want to group our data (i.e., “for each group”). It is defined in the GROUP BY of the outer query.
  • The collapsing key − the grouping key = the observation key.
  • If you don't define a collapsing key, you are implicitly using the primary key of the table as your collapsing key.
  • If you aren't doing any grouping, the collapsing key equals the observation key.

By way of example, if the primary key of your table is (date, product, state, purchaser) and you want to take an average by purchaser (observation: purchaser) for each state (group: state), you have to solve for the collapsing key (i.e., what goes in the inner SQL query).

We don't want to implicitly use the primary key, so we will use a collapsing key. Which collapsing key? The collapsing key will be (observation key: purchaser) + (grouping key: state) = (purchaser, state). That goes in the GROUP BY of our inner query, (state) alone goes in the GROUP BY of the outer query, and implicitly the observation key is (purchaser).

Finally, notice what happens if we don't use a collapsing key. The primary key is (date, product, state, purchaser) and our grouping key is (state). If we don't use any subqueries at all, we will get an answer that defines an observation as (date, product, state, purchaser) − (state) = (date, product, purchaser). And that will determine how many observations we see in each group, which affects the denominator of our average. Which is wrong.
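
To make the recipe reusable outside of SQL, here is a small pandas helper, a sketch under my own naming (grouped_average is not from the article), that takes the observation and grouping keys explicitly and derives the collapsing key from them:

import pandas as pd

def grouped_average(df, measurement, observation_key, grouping_key=()):
    """Average `measurement` per observation, optionally within groups.

    The collapsing key is grouping key + observation key; any finer
    granularity is rolled up with a sum before the mean is taken.
    """
    grouping_key = list(grouping_key)
    collapsing_key = grouping_key + list(observation_key)
    collapsed = df.groupby(collapsing_key, as_index=False)[measurement].sum()
    if grouping_key:
        return collapsed.groupby(grouping_key)[measurement].mean()
    return collapsed[measurement].mean()

# e.g., the average quantity per purchaser for each state, using the transactions sketch above:
# grouped_average(transactions, "quantity", observation_key=["purchaser"], grouping_key=["state"])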

Wrapping up

One thing I've learned from all of this is, from an analytical point of view, never trust the primary key. It defines the granularity (i.e., what constitutes an observation) for recording data, but this may not be what you need for analysis. And if you aren't expressly aware of how this difference will affect your calculations, your calculations may very well be incorrect. Because the primary key will affect your denominator whether you are aware of it or not.

So if you can't trust the primary key, the safest thing to do is to always collapse the data. If you aren't doing any grouping, then your collapsing key explicitly equals your observation key. If you are doing grouping, then your collapsing key is the sum of your observation and grouping keys. But one thing is for sure: if you are not collapsing your data, you are implicitly trusting the primary key.

The second thing I learned is that, totally unrelated to SQL, asking a question about an average is not always intuitive. “What is the average share price by security by day?” is an ambiguous question, even in plain English! Is that an average share price by day for each security, or an average share price by security for each day?

Business questions don't come in the form of database logic or program code. Rather, they are formulated using natural language and must be translated into data language. As a data analyst, you must clarify: “What exactly are we taking the average of?” Here it is helpful to think in terms of collapsing, grouping, and observation keys, especially when working out how many observations are going into your denominator.

This problem isn't confined to SQL, but rather applies to any store of relational data, such as pandas.DataFrames or R data.tables. And if you're anything like me, you will go poring over your old code, grepping for averages, and wonder, “just what exactly am I averaging here?”

This was originally published on alexpetralia.com and is reprinted with permission.

To learn more, attend Alex Petralia's talk, Analyzing data: What pandas and SQL taught me about taking an average, at PyCon Cleveland 2018.
