Data skills for the future: there’s a science for that!

Duncan Watts, principal researcher at Microsoft Research, visited McGill yesterday as part of the official launch of the Centre for Social and Cultural Data Science.  Before his formal talk, he spent an hour with graduate students discussing data science, the skills required, the usefulness of the data, and working for industry versus academia.  The insight then spilled over to the Faculty Club, where he talked about projects using online experiments and social media data.  The talk will hopefully be the first of many events discussing the issues raised by the rise of data science.

“Data science” is extraordinarily broad, though we can roughly define some necessary skills.  Dr Watts described himself more specifically as a computational social scientist.  Some equate ‘data science’ with the ability to scrape social media data though, at least in my experience, this is actually one of the easiest parts of the start-to-finish process of handling data, and is only one of many ways to acquire a dataset that could lend itself to the skills of a data scientist.  Scraping data can take as few as two lines of code in R.  Even once you have overcome the logistics of storing such data, the challenges have just begun.  As Dr Watts mentioned, getting your data to the point where it can be analysed is the most time-consuming part.  These data wrangling skills are a key part of finding the story in the data: in a large dataset, manually inspecting the data is barely possible.
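To make that concrete, here is a minimal sketch of what ‘two lines of scraping’ might look like in R using the rvest package; the URL and CSS selector are hypothetical placeholders rather than anything from the talk.

library(rvest)                                           # a common R scraping package
page   <- read_html("https://example.com/posts")         # placeholder URL
titles <- html_text(html_elements(page, ".post-title"))  # placeholder CSS selector

The hard part, as the rest of this post argues, starts after these lines: storing, cleaning and restructuring whatever comes back.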

Regardless of the definition, a blend of statistical and coding skills is the best foundation for becoming a data scientist.  This transferable, broad skill set tends to be more useful than substantive expertise, though this depends on what specifically you’re ‘doing data science’ for.  Short of a focused degree in data science, getting these skills usually involves cobbling together your own curriculum by self-teaching R and/or Python and/or statistics on the side.  Dr Watts highlighted the wide availability of summer courses and workshops, including those run by McGill’s CSCDS.  On top of learning these skills, I’d emphasize the importance of ‘proving’ them, a problem I’ve struggled with lately.  It’s one thing to say you know how to do something, but another for your CV to say that.  Think creatively…

So you’ve learned how to do some scraping, some analysis and how to prove it, but what next? The people who use Twitter are different from those who don’t.  The people who post every minute on Facebook are different from those who don’t.  The people who have never used a computer are different again.  Dr. Watts described how generalizability is a tough problem in research relying on the web and, in particular, social media.  Depending on the research question, the problem may be insurmountable: a study of an older population, for example, is probably best not done using Twitter feeds.  But generalizability is only one of many problems inherent in ‘big data’, as Dr. Watts acknowledged, and it may or may not be your biggest one.  Just be clear on your target population and, subsequently, whether social media data are useful.

Given his dual academic-industry background, students were also curious about the relative freedom researchers get.  Dr Watts suggested freedom was really company-dependent and thus an important question to ask during interviews.  This can become an issue when research results would hurt a company’s bottom line, reputation or create unwanted media attention.  These problems might not matter to you if you are not interested in publishing, but can be bothersome if you consistently feel censored.  Understand what situations you would not like to be in, and find out if any prospective employer would regularly put you in those situations.

After the informal discussion session with students, Dr. Watts gave a formal talk at the official launch of the CSCDS in the Faculty Club.  His team has demonstrated that ‘going viral’ is rarely a real thing, that the popularity of a song is difficult to predict because it is perpetuated by peer pressure, and that some people always choose to cooperate with others to achieve an aim even when they get stung.  Beyond experiments and interest in social phenomena, Dr Watts also described a disaster response system that relied on real-time collection of information reported on social media, as well as similar work on aid distribution systems.

For more information on the science behind these data skills, see the Centre for Social and Cultural Data Science website, as well as the books Duncan Watts has published.

Do I know my data well enough? Suggestions on relationship-building.

Advice I received when I first started quantitative research was to ‘know your data REALLY well.’  The ‘why’ quickly became evident.  The ‘how’ remained an endless abyss.  The larger the number of observations and variables, the more you end up discovering new things about your data every day.  Despite how exciting this sounds, it is NOT a good thing. While not quite the topic of the course, Data Wrangling and Visualization at the Summer Institute at the University of Washington helped bring this post together.  Far from a static, one-day task, building a strong relationship with your data is an iterative and drawn-out part of the data preparation process.

Getting to know your data can be reduced to two main, iterative aspects: coding and visualization. Good coding and documentation practice is central to reproducibly cleaning raw data, merging datasets, avoiding errors and re-generating data if something changes. Visualization is central to creating the bigger picture against which to map your coding.  It involves imagining the end product(s) and working backwards to ensure your code supports the progression towards that.  Effectively going back and forth between coding and visualization should result in a clean dataset that you know well.

Before taking your dataset(s) in any direction, look at your data.  Scroll up and down, scan-read, and make a note of inconsistencies (check missing values and date formats, for example).  Cleaning up variable and value names at this stage can save a lot of hassle.  This involves renaming any awkwardly named variables or values, such as those with capitals, quotation marks and spaces. gsub is a great function in R that enables pattern replacement, and here is a relevant presentation for Stata.  While some might argue that recoding variables at the same time is useful, I often ended up having to re-re-code, so I prefer to wait until later.
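As a small illustration, here is one way gsub could be used to tidy awkward variable names in R; the example data frame and its names are invented for the sketch.

# toy data frame with awkward column names (capitals, spaces, brackets)
df <- data.frame(`Height (cm)` = c(170, 165), `Body Weight` = c(70, 60), check.names = FALSE)
names(df) <- tolower(names(df))                        # drop capitals
names(df) <- gsub("[[:space:]()]+", "_", names(df))    # replace spaces/brackets with underscores
names(df) <- gsub("_+$", "", names(df))                # trim trailing underscores
names(df)                                              # "height_cm" "body_weight"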

With workable variable names in place, we can draw out exactly how we want our final dataset to look, ideally using the names we just created to avoid confusion.  I’m talking about pen and paper here, but I’m sure there are digital solutions. If possible, write out the equation for your final model, for example, E(y) = a + bx, where x is height and y is weight.  Working backwards, the data required for input into this model consist of two columns – height and weight – with one line per (height, weight) pair.  A simple example, but stick with me. So, back to the code.

If we are lucky, our dataset looks just like what we want, minus some basic recoding.  We all dream of such clean, ready-to-analyse data, but ironically these data require a more deliberate effort to ‘get to know’.  Errors (whether in the data or the code) are SO easily missed when working with clean datasets because there is limited opportunity to catch them.  In my own experience, repeatedly reading your code and scrolling through your dataset as if you will spot all the weird things hasn’t quite worked out. But, good news, there is a practical way to make coding and understanding data easier, and thus less error-prone: plot, plot and plot.  To get a sense of a clean dataset (whether you’ve had to clean it or not), graphical visualization of different dimensions is useful to explore, especially if there are millions of observations to deal with. There are resources and an entire book advocating exactly this approach, i.e. using visualization to explore data.
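For instance, a couple of quick plots along these lines can already reveal implausible values or odd pairings; this sketch assumes a data frame df with numeric columns height and weight, and uses ggplot2.

library(ggplot2)
ggplot(df, aes(x = height)) + geom_histogram(bins = 30)            # spot implausible heights
ggplot(df, aes(x = height, y = weight)) + geom_point(alpha = 0.2)  # spot odd height-weight pairs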

The dream, clean dataset is rare, but the process of cleaning should be given due importance because it simultaneously helps you get to know your data better than anyone.  In the example above, height may be measured in different units, weight may be in wide form with repeated measures, and several other unnecessary variables hang around. Work with each variable individually to keep your sanity. For example, first standardize height measurements to inches.  Then, transpose the weights into long form. Finally, reduce the data to only the variables you are using, with select in R or keep in Stata.  Now, decide how to aggregate the weights to create one line per subject, by plotting individuals’ weight distributions to assess skewness, determining the number of measurements per person, and deriving other characteristics.  Once a decision is made, group_by in R and over() in Stata can help with the aggregation to create your final dataset.  Voilà! Now we have a dataset with two columns and one line per subject, each with a single (height, weight) pair.  And good news: in this process, we should have identified true and erroneous outliers, missing values, and other weird goings-on. Hence, these data required less deliberate effort to ‘get to know’ than the clean data.
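A hedged sketch of those steps using dplyr and tidyr is below; the column names (id, height, height_units, weight_1 to weight_3) are assumptions for illustration, not the actual example data.

library(dplyr)
library(tidyr)

clean <- raw %>%
  mutate(height_in = ifelse(height_units == "cm", height / 2.54, height)) %>%        # standardize to inches
  select(id, height_in, starts_with("weight")) %>%                                   # keep only needed variables
  pivot_longer(starts_with("weight"), names_to = "visit", values_to = "weight") %>%  # wide to long
  group_by(id, height_in) %>%
  summarise(weight = median(weight, na.rm = TRUE), .groups = "drop")                 # one line per subject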

While getting to a clean, analysable dataset involves continually coding and visualizing the final product, ensuring reproducibility (a part of coding) can also help you get to know your data.  Documenting every step using comments and keeping commands general rather than value-specific (for example, not inserting exact values into a recoding function, and instead using data-driven cut-offs like min(x) or max(x)) can prevent errors and facilitate re-running analyses in the future.  I have also found that writing generalizable code requires an additional level of understanding of your data because it forces you to think beyond the specific values in front of you.
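For example, a data-driven recode in R might look like the sketch below; the variable x and the choice of quartile cut-offs are made up, and the sketch assumes the quantiles are distinct.

# fragile alternative: cut(df$x, breaks = c(0, 10, 50, 100)) hard-codes values that may change
df$x_cat <- cut(df$x,
                breaks = quantile(df$x, probs = c(0, .25, .5, .75, 1), na.rm = TRUE),
                include.lowest = TRUE)   # cut-offs derived from the data themselves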

As data become more complex and the possibilities for dataset merging and ‘big data’ explode, investing in relationship-building with your data will have bigger pay-offs.  ‘Getting to know your data’ is easier said than done.  But it can be built into the process of manipulating data for analyses, and it can make use of tools for graphical visualization.  Generating reproducible code forces you to think even more deeply about the structure of your data and greatly facilitates future work.  Commenting every command reminds you of the data discoveries you’ve made along the way.  When thinking of data relationship-building and reproducibility, remember above all Karl Broman’s line: “Your closest collaborator is you six months ago, but you don’t reply to emails.”

 

Moving beyond Table 1: Clustering methods (Discussion Group March 31)

Note that the images are taken from Sophie’s presentation and should not be reproduced.

Table 1 is the cornerstone of many epidemiology papers, but Sophie provided us with an alternative descriptive analysis option that is often overlooked. Clustering methods allow us to visualize where in space or time observations tend to, well, cluster, without any need to stratify. She introduced the k-means and hierarchical agglomerative clustering approaches, before demonstrating an application from her master’s thesis and sparking a discussion on these methods’ uses.


The k-means approach first requires pre-specifying the number of clusters you are after. The centre (or mean) of each group is then determined, and the observations assigned to a given group are those ‘closest’ to this mean.  ‘Closest’ can be measured in several ways, for example with Euclidean distance (ordinary straight-line distance) or Fréchet distance (a distance between curves). The procedure then alternates between assigning points to groups and updating the group means, in an expectation-maximization-style loop that repeats until the clusters are stable.
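In R, the built-in kmeans() function implements this; the sketch below simply uses two columns of the iris data as stand-in observations, not anything from Sophie’s example.

set.seed(1)                                                           # k-means starting points are random
km <- kmeans(iris[, c("Sepal.Length", "Petal.Length")], centers = 3,  # pre-specified number of clusters
             nstart = 25)                                             # try several random starts
table(km$cluster, iris$Species)                                       # compare clusters to a known grouping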

The hierarchical agglomerative clustering approach does not require pre-specifying the number of clusters, as each observation is initially treated as its own cluster.  A tree is then created, joining the ‘most similar’ clusters at each branch.  We keep doing so until there is only one cluster. We then choose to ‘cut’ the tree at the most sensible point; if we cut it near the bottom, we end up with more clusters, and vice versa.  As with k-means, ‘similarity’ can be defined in many ways; for example, at each branch clusters may be combined so that the new cluster has the smallest average distance between its members.
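Again as a generic R sketch (stand-in data, not the presentation’s), hierarchical clustering can be run with hclust() and the tree cut with cutree().

d  <- dist(iris[, 1:4])               # pairwise (Euclidean) distances
hc <- hclust(d, method = "average")   # 'similarity' defined by average linkage
plot(hc)                              # the tree (dendrogram)
groups <- cutree(hc, k = 3)           # 'cut' the tree into 3 clusters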

Sophie provided an example based on 10 years of smoking rates in Glasgow districts.  She clustered by minimizing the distance between the shapes of the curves, and could thus visualize groups of districts with different time trends in smoking rates.  She found one group with a steady decline, another with a decline followed by a spike (that coincided with the economic crisis), and finally one that was stable over time.  Interestingly, these groups overlapped with socio-economic status.

Since these methods contrast with the causal, parametric focus of most of our classes, the discussion centred on their use.  Key to these clustering approaches is that there are no distributional assumptions, no error estimates and no ‘significance’ testing.  They are exploratory and hypothesis-generating (but can still make a paper on their own!). Perhaps they fit into the surveillance side of epidemiology, which we know little about; if we repeated the clustering over time we could monitor changes in disease rates, for example. Regardless, this presentation opened our minds to a method few had seen, and to new ways of describing data that move beyond the typical Table 1.

The full presentation is available here.

Data Cleaning (Methods Discussion Group Feb 24, 2017)

Michelle Dimitris talked about classic data cleaning problems and solutions at today’s Methods Group meeting.  From dates to missing data to free-text fields, the discussion carried on with students and professors highlighting their own experiences and offering solutions to others’ problems.  Below is a summary.

Dates are often problematic (hence the proliferation of dating sites!… jk).  Month-day-year, day-month-year: we’ve all faced this confusion. The solution? Avoid the confusion in the first place. In other words, ensure that data collection forms always specify the desired date format.  Short of altering the survey, Michelle suggests keeping a running list of unclear dates for fieldworkers to clarify if possible. This involves checking date ranges to identify implausible and interchangeable values: otherwise 04-01-2015, for example, could be April 1 or January 4. Once we have dates in a specific format, we often wish to alter that format or ‘split’ the date into its year-month-day components.  Using split in Stata or date objects in R or other software is one way to do this. Parsing allows you to specify the character on which to split; for example, in ’04-01-2015′, the character is ‘-’.  The date is then split into separate variables.
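A small sketch of this in R, assuming (for illustration) that the collection form guaranteed day-month-year order:

dates_raw <- c("04-01-2015", "17-03-2015")
dates <- as.Date(dates_raw, format = "%d-%m-%Y")     # declare the format explicitly
parts <- data.frame(year  = as.integer(format(dates, "%Y")),
                    month = as.integer(format(dates, "%m")),
                    day   = as.integer(format(dates, "%d")))
parts                                                # the date split into separate variables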

While cognitive dissonance is fascinating to psychologists, contradictory responses are another pain for data cleaning.  For example, income = $40K for one answer, but income < $20K when the same question is asked another way.  Which do we believe?  As with dates, it is best to prevent the problem by ensuring there are no duplicate questions on the survey without very good reason.  summarize, tab and if statements in Stata allow us to quantify the extent of illogical answers by, for example, determining the number of cases that reported income > $20K on one question and the opposite on another. Hopefully the problem is minimal, but if the fieldworker cannot clarify, we need a priori rules on which answer to consider valid.  This solution will inevitably involve some subjective decision-making.
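The R equivalent of that check might look like this sketch, where the data frame svy and the variables income_exact and income_band are hypothetical stand-ins for the two versions of the question.

with(svy, table(income_band, income_exact >= 20000, useNA = "ifany"))
# rows in the "<20K" band with exact income >= $20K flag contradictory responses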

Missing data and free-text responses are two other classes of classic data cleaning problems.  Missing data are problematic for data cleaning when they are ambiguous.  Are the data missing because a particular response option was unavailable, or because a person refused to respond? Such ambiguity can be prevented through proper questionnaire design; but, while tempting, avoid including a free-text ‘Other’ option as a catch-all without providing an exhaustive list of specific options first. Giving respondents free rein to write in answers = coding nightmare.  While commands such as parse and regular expression searching can help classify them, free-text fields are generally only useful if qualitative analysis is planned. Without a plan to analyse the ‘other’ box, writing out as many relevant options as possible and lumping the rest under a simple ‘Other’ check box leads to much cleaner data.  Several people highlighted this problem with extracted medical charts in particular, and solutions varied from manual coding to query-based classification schemes using the commands mentioned above.
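As a rough illustration of regular-expression classification in R, the free-text answers and drug categories below are invented for the sketch.

other_txt <- c("asprin 81mg", "Tylenol", "advil prn", "something else")
drug_class <- ifelse(grepl("asp[ie]?rin", other_txt, ignore.case = TRUE), "aspirin",
               ifelse(grepl("tylenol|acetaminophen", other_txt, ignore.case = TRUE), "acetaminophen",
                ifelse(grepl("advil|ibuprofen", other_txt, ignore.case = TRUE), "ibuprofen",
                       "unclassified")))
table(drug_class)   # anything left unclassified still needs manual review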

Finally, some rules cut across the whole discussion.  First, never alter your actual dataset; rather, make all ‘adjustments’ through code. Second, most of these issues can be solved at the data-collection phase, so try to be involved in the instrument design. Finally, remember that your data management software does not have to be the same as your analysis software.  Some tools (potentially SAS) are better for data management than others, and it may be worth learning some basic commands in another package to make your life easier.

For Michelle’s slides, see: https://drive.google.com/file/d/0BwmCu4M3g_jveUFUSlhIaFJpQzQ/view

Missing Data (Methods Discussion Group 11/26/2016)

Handling missing data could be an entire course in itself, but Gabrielle Simoneau boiled the key tenets down to one hour on Friday.  In the context of mice DNA data, she first reminded us of missing data assumptions. We then discussed single and multiple imputation and inverse probability of censoring weighting, and finally touched on a complex case study that made all the methods seem inadequate (and she still finished on time!).

The form of the missing data dictates the methods available to address it.  The best case scenario is ‘missing completely at random’: the missingness is not associated with the exposure or outcome, and cases with such missing data can be ignored, although larger quantities may still call for imputation because dropping cases reduces power.  The next best scenario is ‘missing at random’: these data are required to identify the effect, but can be predicted from other observed variables.  The doomsday scenario is ‘missing not at random’: here, the data are associated with the exposure and outcome, but unavailable and unpredictable from the observed dataset. Resorting to population- or literature-based values could be an option, but the methods below cannot be used as described.

Other than ignoring the problem, single imputation is the easiest way to handle data that are missing (completely) at random.  We choose or predict a value, substitute it in, then estimate our effect of interest. For example, we can impute the mean value of variable X for all the missing X values. However, like all things in life, there is no free lunch.  There are two major problems here: the mean might not be a good guess at each case’s X value, and we don’t account for the added uncertainty created by our ‘invented’ values. Hence, multiple imputation.
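A one-line sketch of mean imputation in R, assuming a data frame df with a partly missing numeric variable x:

df$x_imputed <- ifelse(is.na(df$x), mean(df$x, na.rm = TRUE), df$x)   # fills every gap with the same mean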

Multiple imputation is a more complex, but more valid, way of handling data that are missing (completely) at random.  Instead of filling in a single value for each missing value, we generate a series of datasets, each of which imputes slightly different values of X based on the observed values in the original dataset.  We then estimate the effect of interest in each of these.  The final estimate is the average of the effects across the imputed datasets, and its variance accounts for the uncertainty of the imputed values.  Making the method even more useful, several variables with missing values can be imputed at the same time. Multiple Imputation by Chained Equations (MICE) packages in statistical software can be used for this.
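A minimal sketch with the mice package in R, assuming a data frame dat with a binary outcome y, exposure x and covariate z (all hypothetical names):

library(mice)
imp  <- mice(dat, m = 5, printFlag = FALSE)            # 5 imputed datasets
fits <- with(imp, glm(y ~ x + z, family = binomial))   # fit the model in each
pool(fits)                                             # combine the estimates across datasets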

Despite the promise of multiple imputation, it is a little sketchy to start imputing things like your actual outcome or exposure variable, since these define the effect of interest.  Enter inverse probability of censoring weighting, which avoids imputation altogether and re-constructs a complete dataset using weights. The observed variables are used to predict each subject’s probability of having complete (uncensored) data, and the complete cases are then weighted by the inverse of that probability in the final analysis. Unlike multiple imputation, this method is not good when several variables have missing values, because we might not have enough information to generate sensible predicted probabilities. So, it is best used when the missing values are concentrated in one variable and imputation is undesirable.
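A hedged sketch of the idea in R, assuming a data frame dat in which only the outcome y has missing values and x and z are fully observed; in practice, robust (sandwich) standard errors would usually accompany the weights.

dat$observed <- !is.na(dat$y)
pmod  <- glm(observed ~ x + z, data = dat, family = binomial)   # model the probability of being observed
dat$w <- 1 / predict(pmod, type = "response")                   # inverse-probability weights
fit   <- glm(y ~ x, family = binomial, data = dat, weights = w, subset = observed)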

Ready for the complex case study where all the methods above are inadequate?  In a poorly designed trial, patients were randomized to one of 3+ starting treatments, then re-randomized up to three times to a choice of 3+ drugs depending on the success of each treatment.  The final dataset had missing follow-up values in each randomization cycle and treatment trajectory.  This is complex because the missing values depend on other variables in the randomization round, but also on individual patients’ previous values. There are also very few patients in each treatment trajectory, since there were so many possible courses, limiting the information available to predict anything.  In the end, some combination of all the above methods was used. But the lesson: all the methods in the world cannot save you from data that are just bad to begin with.

Resources from Gabrielle:

MICE in R
MICE in STATA
Tutorial on MICE
Tutorial on IPCW

Resource for more complex situations:

Application of multiple imputation methods to sequential multiple assignment randomized trials

Inverse probability of censoring weighting for missing data

Methods Group Oct 28: Power Calculations

By Daniala Weir and Deepa Jahagirdar

We all learn about basic power calculations in Stats 101.  But when it comes time to actually do one, it’s as if we know nothing at all.  In our methods group discussion on October 28th, we talked about this challenging aspect of every study. In the simplest case, for example when the effect measure is an unadjusted difference in proportions, software (Stata, SAS, R) and online power calculators are your best friends. Gillian Ainsworth provided us with a few great examples, including an overview by Dr. Hanley & Dr. Moodie.
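For that unadjusted difference in proportions, for instance, base R’s power.prop.test() does the job; the proportions below are just illustrative numbers.

power.prop.test(p1 = 0.20, p2 = 0.30, power = 0.80, sig.level = 0.05)
# prints the n per group needed to detect a 20% vs 30% difference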

But let’s face it: the simplest case rarely applies.  Simultaneously accounting for confounding, correlation, and ranges of prevalence for outcomes and exposures requires methods beyond those online calculators. In this case, it is best to simulate data.  While the word ‘simulation’ can scare a lot of people off, Brooke Levis provided some very useful examples from her own research, as well as R code written with Andrea Benedetti to conduct a simulation for her power calculations.  The code is pasted below.

Finally, Dr. Hanley gave some important advice: “Every study is valuable for answering an overarching research question… Just because you don’t have enough power to conduct the study you want, doesn’t mean it shouldn’t be conducted at all. Think about it as contributing to an overall meta-analysis on a particular research question.”

R Code for Power Calculations Simulations (credit to Brooke Levis and Dr Andrea Benedetti):
#this function generates a dataset of size n with a binary outcome, binary exposure, and binary confounder
#the exposure prevalence is given by prevx
#the confounder prevalence is given by prevconf
#the outcome prevalence is given by prevy
#the OR between the confounder and x is ORconfx
#the OR between the confounder and y is ORconfy
#the OR between the exposure and y is ORxy
#nreps is the number of times the data is generated and analyzed
#for each data set, a crude model and an adjusted model are fit and the significance of the exposure beta is assessed
getpower <- function(n, prevx, prevconf, prevy, ORconfx, ORconfy, ORxy, nreps) {
  #make a matrix to hold the results
  res <- matrix(NA, ncol = 11, nrow = nreps)
  res <- as.data.frame(res)
  colnames(res) <- c("n", "prevx", "prevconf", "prevy", "ORconfx", "ORconfy", "ORxy",
                     "pvaladjmodel", "sigadjmodel", "pvalcrudemodel", "sigcrudemodel")

  for (i in 1:nreps) {
    #generate the binary exposure - input prevalence of exposure
    x <- rbinom(n, 1, prevx)

    #generate the binary confounder - prevalence of confounder and OR between exposure and confounder
    b0confx <- log(prevconf / (1 - prevconf))
    b1confx <- log(ORconfx)
    regeqxconf <- b0confx + b1confx * x
    conf <- rbinom(n, 1, exp(regeqxconf) / (1 + exp(regeqxconf)))

    #generate the binary outcome - prevalence of outcome, OR between exposure and outcome and OR between confounder and outcome
    b0 <- log(prevy / (1 - prevy))
    b1confy <- log(ORconfy)
    b1xy <- log(ORxy)
    regeq <- b0 + b1confy * conf + b1xy * x
    y <- rbinom(n, 1, exp(regeq) / (1 + exp(regeq)))

    #adjusted model
    m1 <- glm(y ~ x + conf, family = binomial)
    #get p value for exposure beta
    res[i, ]$pvaladjmodel <- summary(m1)$coef[2, 4]
    #is it significant?
    res[i, ]$sigadjmodel <- ifelse(summary(m1)$coef[2, 4] < 0.05, 1, 0)

    #crude model
    m0 <- glm(y ~ x, family = binomial)
    #get p value for exposure beta
    res[i, ]$pvalcrudemodel <- summary(m0)$coef[2, 4]
    #is it significant?
    res[i, ]$sigcrudemodel <- ifelse(summary(m0)$coef[2, 4] < 0.05, 1, 0)

    #hold onto data generation params
    res[i, ]$n <- n
    res[i, ]$prevx <- prevx
    res[i, ]$prevconf <- prevconf
    res[i, ]$prevy <- prevy
    res[i, ]$ORconfx <- ORconfx
    res[i, ]$ORconfy <- ORconfy
    res[i, ]$ORxy <- ORxy
  }
  #return the results
  res
}

#call the function
p1 <- getpower(n = 400, prevx = .5, prevconf = .1, prevy = .2,
               ORconfx = 2, ORconfy = 2, ORxy = 2, nreps = 500)
colMeans(p1)

The biggest deterrent to equality: Humanity

Regardless of topic, all epidemiologists will spend time understanding equality. Trump’s (and others’) elections have demonstrated that the biggest battle to realizing a more equal world is not about resources, outreach or policy effort.  It’s about what humans (do not) want to do.

Our studies further the notion of equal social status because they give inherent value to the fight. Whether it’s deciding to adjust for race/sex/ethnicity because we know group X is inherently worse off than others, studying disease outcomes in ‘vulnerable’ groups, or examining the effect of health insurance, the examples are infinite.  Time and time again, we show the relative health disadvantage of marginalized groups. The implicit message: more resources, more outreach and more policy effort targeting those who need it the most. While admitting it means crossing the much-dreaded line into advocacy, it is clear we are secretly imagining a world where disadvantage due to characteristics inherited at birth is gone.  If everyone did well, half of epidemiology would disappear.

Unfortunately, such a world will be stalled by those who stand to lose something. Trump’s win, Viktor Orbán, Brexit, Pegida, Marine Le Pen… all demonstrate that the desire for social power trumps (no pun intended) any broad desire to give everyone a chance.  Yes, poverty, disaffection, and loss of political voice (the problems of the type of voter that effectively led to Trump’s and Brexit’s victories) are sad.  No doubt, large groups of people have seriously lost out in our ‘new economy’.  But these factors alone are not what caused the recent voting and ideological trends.  The missing piece?  It’s poverty, disaffection, and loss of political voice among those who were once all but guaranteed them.

Maybe we over-estimate humanity.  The elections of Trump and others are as much referendums on whether historically excluded groups deserve a better chance as they are driven by a belief in the urgency of restoring a social identity that was once the pride of certain demographics.  People like Mr. Trump here or Mr. Farage in the UK did not create this belief: they merely gave it permission to fly.  We hope that it is a small minority, but it is not. Which leaves us with a dilemma: how can we fight for the best programs, policies and interventions to improve equality when humanity’s desire for inequality is so strong?

Methods Discussion Group 1: Manuscript Writing

The Applied Research Methods Discussion Group met last Friday to discuss this month’s topic of choice – Manuscript Writing. The discussion carried on beyond the time limit with topics including organizing literature into a Background section, journal targeting, the importance of titles and cover letters, and finally, abstracts.

The first part of the paper, the Background section, is the product of hours spent reading dozens of papers.  The purpose of understanding the literature is to fairly summarize its ‘weight’ – generally, are articles saying x or y? But keeping track of 30+ papers with new ones constantly coming in is a challenge. The group shared their best tips for organizing literature. For instance, create an ongoing Evernote, Excel or Word document to make notes about papers as you read them.  At the end, the little blurbs about each paper can jog your memory and provide little write-ups to include in the paper. Regardless of the number of papers reviewed, it is natural to feel like you might have missed papers on the topic.  Subscribing to RSS feeds or journal alerts can help you keep up to date on developments in your field.  Ideally, you have not missed the most seminal paper ever on the topic, but remember we all have to stop reading at a certain point.

We also discussed challenges related to working with interdisciplinary teams and the necessity of tailoring writing to specific journals.  Ultimately, not all disciplines’ journals are like ours.  Within typical epidemiology/health sciences journals, it may be better to write generically rather than targeting specific journals: adjusting the length, a few sentences in the Background/Discussion, and the formatting should be enough to submit to multiple journals.  However, there are differences to bear in mind if targeting a journal outside of epidemiology (or working with colleagues in fields such as economics). For example, the background is often more than twice the length, the theoretical foundations for the research are described in more detail, and the paper is structured differently overall. In these cases, some minor readjustments will not be enough, and targeting the journal while writing is more helpful.

After the paper is carefully completed and the journal is finally chosen, some editors will have made up their minds by the end of your title or cover letter. The title should be succinct yet detailed enough to keep their interest. A general template is ‘General: Specific.’ For example: ‘Cat food: the role of tuna in a nutritious diet’ or ‘Obesity prevalence: differences across socio-economic status.’  Humorous titles may or may not be okay; our group was split on this issue. It may take a certain status (or a certain talent) to get away with it. If the editor has not stopped by the end of the title, s/he will at least read your cover letter. This letter’s importance is often under-appreciated.  In addition to summarizing the main findings, personalize the letter to indicate why you have chosen the specific journal.  For example, citing previously published articles from the same journal that suggest the need for your work can help your case.

At last, you have succeeded in drawing the editor to your abstract.  The abstract is likely the last thing the editor will read before deciding to send the paper for review. We had a debate about writing the abstract before or after the rest of the paper.  Beginner writers often write the abstract last, but people with more experience in the group suggested writing it first: articulating your research in ~250 words forces the purpose, findings and importance to be clear, and from there you fill in the rest of the paper.  However, abstract writing may also be more iterative.  I am personally convinced that the clarity of the research increases right up until your paper is complete (‘NOW I understand what my research was about’).  This clarity is essential for abstract writing.

While we covered practical aspects of writing papers and real-time challenges that go beyond the typical structure of Introduction-Methods-Results-Discussion, more resources are available here:

Stanford Online Writing Course

Clinical Epidemiology Writing Tips

BMJ Writing E-Book

We hope to see you next time when the discussion will centre on power calculations! October 28, 12:30pm, Purvis Hall Room 25.

We, too, can be health economists

Yesterday Dr Jason Guertin presented on the overlap between pharmacoepidemiology and pharmacoeconomics, challenges to translating research into decision making and the potential transition between epidemiology and health economics.

The speaker introduced the incremental cost-effectiveness ratio (ICER), and went on to describe the confounding challenges in determining it.  This ratio is the increase or decrease in cost per unit change in effectiveness (e.g. per quality-of-life unit or year of life gained) for a new drug/technology compared to its predecessor. The ICER is the key outcome in pharmacoeconomics and in cost-effectiveness research for health technologies in general; it is analogous to the usual health outcomes we study in epidemiology. Similarly, confounding is a problem in cost-effectiveness research based on observational studies. However, the ICER is actually composed of two things rather than just one health outcome – the cost component and the effectiveness component. Confounding takes on new life because of these two outcomes and the positive or negative correlation between them.
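As a toy numerical illustration of the ratio itself (all numbers invented), the ICER is simply the incremental cost divided by the incremental effectiveness:

cost_new <- 52000; cost_old <- 40000    # cost per patient
qaly_new <- 6.2;   qaly_old <- 5.9      # quality-adjusted life years per patient
icer <- (cost_new - cost_old) / (qaly_new - qaly_old)
icer                                    # 40000 dollars per QALY gained

Whether $40,000 per QALY is ‘worth it’ is exactly the kind of threshold judgment that confounding can push in either direction, as described next.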

In epidemiology, our effect estimates can swing above and below the null when confounders are excluded or included.  In cost-effectiveness research, the cost per quality of life years gained can swing above and below the acceptable threshold to approve new drugs/technologies for reimbursement. In an extreme example, Dr Guertin found a difference of up to $80,000 per quality-adjusted life year gained between unadjusted and adjusted models.  Evidently such a price tag has practical implications for decision-making – in this case whether to approve a new technology to treat aortic aneurysm.

Beyond the actual study, translating findings into policy faces further complications. Public reaction has a bigger influence on which technologies and drugs are approved than even the best quality cost-effectiveness studies.  For example, a very expensive drug to treat rare genetic disorders in infants may be approved because of the value society places on young lives.  At the same time, treatments for hair loss are not approved for reimbursement even if they are extremely cost-effective.  In epidemiology, we face similar challenges. For example, maternity leave allowances of six weeks may lead to better breastfeeding outcomes.  Say the research on this issue was perfect.  Would the policy be implemented everywhere? No.

In sum, Dr Guertin effectively translated his health economics research into a language epidemiologists could understand.  The overlap in confounding and study-design-related challenges demonstrated that the skills overlap too.  So pharmacoeconomics may be a new field for you to pursue!


Am I passionate enough about my PhD?

We are often told passion is one of the most important aspects of a PhD.  That if you don’t like your topic or field of study, you are doomed from the start.   It is idyllic, actually: being so passionate about your topic that you will never procrastinate, you will put in 110% every day, and, most of all, have a lifelong devotion.

Realistically, choosing a topic is one of the biggest challenges for graduate students even if you are floating in a cloud of passion. Regardless of whether the topic is from a blank slate or a continuation from previous work, for many students, passion goes something like this:

  1. An initial idea driven by passion and excitement (and practicality)
  2. Excitement builds and you feel confident
  3. Excitement dwindles and you question everything
  4. Repeat 2 and 3 until you end up in a static state of one or the other

The scary part is ending up permanently at step 3. What does this mean? Should you stick to your plan of becoming a tenured expert in fruit fly migration? Regardless of your PhD stage, divorcing yourself from a career path you had perfectly planned and a topic that used to be your passion is not impossible. Practically, one can always apply to non-traditional jobs post-PhD, and build contacts to transition into preferable topic areas and career paths.

At the same time, pursuing alternate plans is more difficult than it seems. Think about the achievements that are rewarded in our department, where reward = verbal praise, postings on the news websites, congrats from professors, wow factors at thesis/protocol defenses.  These ‘wow factor’ achievements include awards at conferences, speaking invitations, novel methods, publications in NEJM, CIHR funding… Someone who has all these things is a ‘very good’, ‘very bright’ student. We all like praise, so adhering to the above model is highly tempting despite dwindling interest in the topic and career path that’s receiving the praise.

Unfortunately, similar external validation is not available for alternate plans, which makes two things necessary to move on from your set-in-stone path: admitting the mismatch between your previous thinking and your current state of mind, and learning to rely on internal validation. Both are mind games that must be overcome.  So what if no one notices that you just published a very creative idea in a very mediocre journal? You should be proud, and imagine yourself explaining this idea to someone who will notice, at a time when it actually matters for you. Learning to define your own achievements is a pre-requisite to defining your own path beyond the PhD, and ultimately finding a career that is truly driven by passion.

Further reading for those interested:

Dr. Levine could no longer focus on astronomy with developing political events 

Dr Borniger started a PhD in a different field despite success in anthropology

Top 10-alternative careers for STEM PhDs & the importance of understanding your options

I hate my PhD