Principles of good data analysis

March 23, 2014 // data science, frameworks, methodology, thoughts

Data analysis is difficult.

What makes it hard is the intuitive aspect of it - knowing the direction you want to take based on the limited data you have at the moment. Additionally, it's communicating the results and showing why your analysis is correct that makes this all the more hard - doing it deeply, at scale, and in a consistent fashion.

Having been a part of many of these deep-dive analyses, I've noticed some "principles" that I've found useful to follow throughout.

Know your approach

Before you begin the analysis, know the questions you're trying to answer and what you're trying to achieve - don't fall into an analytical rabbit hole. Additionally, you should know some basic things about your potential data - what data sources are available to answer the questions? How is that data structured? Is it in a database? CSVs? Third-party APIs? What tools will you be able to use for the analysis?

Your approach will probably change throughout, but it's helpful to start with a plan and adjust.

Know how the data was generated

Once you've settled on your approach and data sources, you need to make sure you understand how the data was generated or captured, particularly if you are using your own company's data.

For example, let's say you're a data scientist at Amazon and you're doing some analysis on orders. Let's assume there's a table somewhere in the Amazon world called "orders" that stores data about an order. Does this table store incomplete orders? What is the interaction on Amazon.com that creates a new record in this table? If I start an order and do not fully complete the payment flow, will a record have been written to this table? What exactly does each field in the table mean?

You need to know this level of detail in order to have confidence in your analysis - your audience will ask these questions.
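
As a rough sketch of what that digging can look like in practice (the database file, table, and field names below are hypothetical), a couple of quick checks against the raw table answer several of those questions directly:

    import sqlite3
    import pandas as pd

    # Hypothetical setup: the database file, table, and column names are made up
    # for illustration - the point is to interrogate the raw table directly.
    conn = sqlite3.connect("warehouse.db")
    orders = pd.read_sql("SELECT * FROM orders", conn)

    # Does the table store incomplete orders? See how a status field breaks down.
    print(orders["order_status"].value_counts(dropna=False))

    # If abandoned checkouts show up here, the analysis needs to handle them explicitly.
    print((orders["order_status"] == "payment_incomplete").sum())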

Profile your data

Once you're confident you're looking at the right data, you need to develop some familiarity with it. Not only will this allow you to gain a basic understanding of what you're looking at, but it also allows you to gain a certain level of comfort that things are still "correct" later in the analysis.

For example, I was once helping a friend analyze a fairly large time series dataset (~10GB). The results of the analysis didn't intuitively jibe with me - something felt off. When digging deeper into the analysis, I decided to plot the events by date and noticed we had two days without any data - that shouldn't have been the case.

Profiling your data early on helps to ensure your work throughout the analysis - you'll discover sooner when something is "off."
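
A minimal sketch of that kind of profiling pass, assuming a hypothetical events file with a timestamp column, might look like this:

    import pandas as pd

    # A minimal profiling pass over a hypothetical events file with a timestamp column.
    events = pd.read_csv("events.csv", parse_dates=["timestamp"])

    # Count events per calendar day; resampling fills empty days with zero.
    daily_counts = events.set_index("timestamp").resample("D").size()

    # Days with zero events are exactly the kind of gap that should jump out early.
    print(daily_counts[daily_counts == 0])

    # A quick plot makes a two-day hole in the series obvious at a glance.
    daily_counts.plot()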

Facet all the things

I'm increasingly convinced that Simpson's paradox is one of the most important things for anyone working with data to understand. In cases of Simpson's paradox, a trend appearing in different groups of data disappears when the groups are combined and looked at in aggregate. It illustrates the importance of looking at your data by multiple dimensions.

As an example, take a look at the table below.

[Table: Simpson's paradox (combined)]

The above table shows admission rates for men and women into the University of California, Berkeley's graduate programs for the fall of 1973. Based on those numbers, the University was sued for an alleged bias against women. However, when faceting the data by sex AND department, we see women were actually admitted into many departments' graduate programs at a higher rate than men.

[Table: Simpson's paradox (splits)]

This is probably the most infamous example of Simpson's paradox. The folks over at Berkeley's VUDLab have put together a fantastic visualization allowing you to explore the data further.

When going through your data, do so with Simpson's paradox in mind. It's extremely important to understand how aggregate statistics can be misleading and why looking at your data from multiple facets is necessary.
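
To make the mechanics concrete, here's a small sketch with made-up numbers (not the actual Berkeley figures), constructed so the aggregate rate and the faceted rates point in opposite directions:

    import pandas as pd

    # Illustrative numbers only (not the actual Berkeley figures), built so the
    # aggregate and per-department trends disagree.
    apps = pd.DataFrame({
        "dept":     ["A", "A", "B", "B"],
        "sex":      ["men", "women", "men", "women"],
        "applied":  [100, 20, 20, 100],
        "admitted": [60, 14, 2, 15],
    })

    # Aggregate admission rate by sex: men appear far more likely to be admitted.
    agg = apps.groupby("sex")[["applied", "admitted"]].sum()
    print(agg["admitted"] / agg["applied"])          # men ~0.52, women ~0.24

    # Faceted by department, women have the higher admission rate in both.
    print(apps.assign(rate=apps["admitted"] / apps["applied"]))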

Be skeptical

In addition to profiling and faceting your data, you need to be skeptical throughout your analysis. If something doesn't look or feel right, it probably isn't. Pore through your data to make sure nothing unexpected is going on, and if there is something unexpected, make sure you understand why it's occurring and are comfortable with it before you proceed.

I'd argue that no data is better than incorrect data in most cases. Make sure the base layer of your analysis is right.
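
A handful of cheap sanity checks, sketched here against a hypothetical orders extract, will surface most of those surprises early:

    import pandas as pd

    # A few quick "does anything look wrong?" checks on a hypothetical orders extract.
    orders = pd.read_csv("orders.csv", parse_dates=["created_at"])

    print(orders.isna().sum())                          # unexpected nulls?
    print(orders.duplicated(subset="order_id").sum())   # duplicate order ids?
    print(orders["amount"].describe())                  # negative or absurd amounts?
    print(orders["created_at"].agg(["min", "max"]))     # dates inside the expected window?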

Think like a trial lawyer

A good trial lawyer will prepare their case while also considering how the opposition might respond. When the opposition does present new evidence or testimony, our lawyer will (hopefully) have prepared for it, easily allowing him or her to counter in a meaningful way.

Much like a good trial lawyer, you need to think ahead and consider the audience of your analysis and the questions they might ask. Preparing accordingly for those will lend credibility to your work. No one likes to hear "I'm not sure, I didn't look at that" and you don't want to be caught flat-footed.

Clarify your assumptions

It's unlikely that your data is perfect and it probably doesn't capture everything you need to complete a thorough and exhaustive analysis - you'll need to make some assumptions throughout your work. These need to be explicitly stated when you're sharing results.

Additionally, your stakeholders are crucial in helping you determine your assumptions. You should be working with them and other domain experts to ensure your assumptions are logical and unbiased.

Check your work

It seems obvious, but people simply don't check their work sometimes. Understandably, there are deadlines, quick turnarounds, and last-minute requests; however, I can assure you that your audience would rather your results be correct than quick.

I find it useful to regularly check the basic statistics of the data (sums, counts, etc.) throughout an analysis in order to make sure nothing is lost along the way - essentially creating a trail of breadcrumbs I can follow backwards in case something doesn't seem right later on.
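
One way to lay those breadcrumbs, sketched here with hypothetical files and columns, is to capture the basic statistics before a step and assert them immediately after it:

    import pandas as pd

    # Breadcrumb-style checks (hypothetical files and columns): record basic
    # statistics before a step and assert them again afterwards.
    orders = pd.read_csv("orders.csv")
    customers = pd.read_csv("customers.csv")   # assumed to have one row per customer_id

    raw_rows, raw_revenue = len(orders), orders["amount"].sum()

    enriched = orders.merge(customers, on="customer_id", how="left")

    # With one row per customer, a left join should change neither the count nor the total.
    assert len(enriched) == raw_rows
    assert abs(enriched["amount"].sum() - raw_revenue) < 1e-6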

Communicate

Lastly, the whole process should be a conversation with stakeholders - don't work in a silo. It's possible your audience isn't necessarily concerned with decimal point accuracy - maybe they just want to understand directional impact.

In the end, remember that data analysis is most often about solving a problem and that problem has stakeholders - you should be working with them to answer the questions that are most important, not necessarily those that are most interesting. Interesting doesn't always mean "valuable."


Source: http://www.gregreda.com/2014/03/23/principles-of-good-data-analysis/
