#3: Thoughts and observations on data journalism in India

Today, I complete five years in journalism: one year at The Hindu (my first job, straight out of university), two years at the Hindustan Times—which overlapped with a five-month fellowship at the Wall Street Journal—and two years as a freelancer.

"Data journalist" was my official job description in full-time roles. But I find that term redundant and I don't like to use it anymore: all journalism should be data-informed and evidence-based. I prefer calling myself just a "reporter". In part, this identity shift—primarily in my head, I doubt if anyone else cares—is a reflection of the changing nature of my work: I have moved from writing newsy analytical stories to a mix of investigative and narrative reportage. Data and empirical thinking continue to be at the centre of my work, but it's no longer defined by it.

This post is about data journalism in the way most people understand it: journalists who specialise in producing data-driven stories, a task that includes everything from collecting data to analysing and presenting it visually.

Here are some thoughts and observations on data work in Indian newsrooms.

1. Data journalism is an umbrella term which means different things to different people: adding data-backed context in daily news coverage; analysis of issues dominating the news cycle; making infographics with interesting data factoids; narrating stories and explaining complex concepts using visual graphics; using hard data for investigations, highlighting previously unreported trends and throwing light on specific incidents; empirical coverage of big-ticket events like elections and budgets.

I have done all of this. It's useful to think of them as different tasks as the skillsets vary widely. A lot of stuff can be done without writing a single line of code or making pivot tables in spreadsheets. But the fanciest stuff we do gets the most attention, making regular data work appear more complicated than it is.

2. Data journalism is not just about data visualisation or interactive stories: I mention this explicitly as a separate point because I used to conflate the two when I started. Back in 2015, data journalism for me was the visual stuff the New York Times published. I would browse their interactive stories, think about how we could take inspiration from them to tell similar stories in India, rant about the technical restrictions our archaic online publishing systems posed, and ultimately make peace with the baby steps we took in creating an ecosystem that valued interactive storytelling.

I learnt my lesson soon: I was getting too attached to the form. Yes, charts are nice. Interactives, when thoughtfully designed, can help tell brilliant stories. But that should not divert our focus from the fundamentals, which are the data itself and the meaning it adds to the story. Once you get past that stage, you can figure out how to stitch words and visuals together to communicate your findings to the world. Sometimes, one small static chart can tell a story that no words can capture, and in other cases, data leads you to nuanced explanations where clear prose becomes crucial.

3. Visual journalism in India can wait: If a newsroom has infinite resources, sure, they can—should—hire a team of a dozen or two coder-journalists with excellent programming and design skills. Let them work on exciting special projects, fully leveraging the interactive nature of the internet to tell relevant and compelling stories. From mid-2016 to mid-2018, our small data and visuals team at the Hindustan Times—synonymous in journalism circles with bylines of Gurman Bhatia and Harry Stevens—did just that. I loved the work we did, and I learnt a lot. Bhatia has since moved to work for Reuters in Singapore and Stevens is at the Washington Post.

But over these years, as I have been thinking more broadly about Indian journalism, I have become increasingly convinced that we are not yet ready to institutionalise this kind of work. Cash-constrained Indian newsrooms have neither the resources to hire that talent nor the imagination to nurture it. Yes, things have to start somewhere, but I don't think that time is now.

Instead, we should be prioritising work that produces original reporting and boosts collective news-gathering efforts. Newsrooms should hire journalists who can code and ask them to think creatively: build original datasets, find innovative ways to use statistics to identify ignored trends, and invest in long investigative projects. While the focus remains on sharp data reporting, newsroom nerds can occasionally produce visually appealing journalism.

Journalists should still make simple, effective and aesthetically pleasing charts using easy tools like Datawrapper, which have become significantly more capable over the last few years. But for now, a dedicated focus on building a culture of interactive and visual journalism can wait.

4. Data as a tool for independent assessment: It is empowering to work with raw data rather than rely solely on the findings of government committees, think tanks and academics, especially when the data is easily available, the analysis is easy and the issue is hot in the news cycle. It also allows journalists to take a fresh look at things and have more informed discussions with domain experts about their work, understand the analytical frameworks that form the backbone of their arguments, and get to the core of disputes when experts disagree on the interpretation of the same data. Doing this work made me appreciate the limitations of data work and made me a more mindful consumer of numbers in a world where a high degree of bullshit comes packaged with charts and figures.

5. Context is king: Every data analyst knows this. Having data is not enough. You need to ask interesting and meaningful questions to interrogate the data and place it in context. Without that, you get meaningless number-heavy stories produced for the sake of filling newspaper pages. That's also the reason I enjoy collaborating with reporters who have strong domain knowledge of their beats.

6. Data journalism is not objective journalism: Data is objective? Haha. I can narrate multiple stories from the same underlying dataset. It's easy. Context introduces subjectivity in interpretation, from questions you ask to what you consider significant. Data journalism should not occupy any special place in the larger philosophical discussions about objectivity in journalism. The same arguments are applicable.
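To make that concrete, here is a minimal sketch in Python. The figures are invented purely for illustration (a hypothetical city of one million people and two yearly case counts); they come from no real dataset. The point is only that the same two numbers can carry three different framings.

```python
# Hypothetical figures, invented purely for illustration.
population = 1_000_000
cases = {2019: 20, 2020: 30}

# Framing 1: the relative change sounds dramatic.
absolute_rise = cases[2020] - cases[2019]
percent_rise = absolute_rise / cases[2019] * 100

# Framing 2 and 3: the absolute change and the per-capita rate sound modest.
rate_2019 = cases[2019] * 100_000 / population
rate_2020 = cases[2020] * 100_000 / population

print(f"Story 1: cases jumped {percent_rise:.0f}% in a year")
print(f"Story 2: {absolute_rise} more cases in a city of {population:,}")
print(f"Story 3: the rate moved from {rate_2019:.0f} to {rate_2020:.0f} per 100,000")
```

None of the three lines is false. Which one leads the story is an editorial choice, not a property of the data.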

7. Data is not the solution to misinformation: Many solutions aiming to solve the internet-driven deluge of misinformation ignore how communication technology has evolved over centuries, how mediums impact how we learn about the world, the cognitive processes that drive how humans process and consume information, and the complexities of our chaotic and polluted information ecosystem. Ignoring that leads to an incomplete diagnosis of the problem and diverts attention to misguided solutions. "Data journalism as an antidote to fake news" is one of those.

It is not. We are confronting a much larger epistemic problem and data-based articles are not going to solve it.

8. It's important to know what you know and don't know, what you can know and can't know: Numbers, many think, offer divine certainty. Whenever controversial issues dominate the news cycle, data journalists are asked to wave their magic wands and offer authoritative data-backed answers.

There is some merit in that expectation. It is, in fact, the fundamental premise of evidence-based journalism. But it's not always possible and we must acknowledge the limitations of data. Reasons can vary from lack of data to important caveats which can't simply be left as an asterisk.

Maybe the best answer an analysis can produce is: "we don't know for sure what the hell is happening. But we do know that A and B are likely false, C and D may be true, and not enough is known about E and F".

This is a perfectly reasonable outcome of a meaningful data analysis. But this doesn't satisfy our deep desire for instant explanations and catchy headlines. In some cases, I see that a version of the story I skipped and filed in my trash—because it was not meeting rigorous standards—appeared in another publication. They let it through. This is especially true for reporting on non-scientific public opinion surveys. There is not much I can do about it, other than adding it to my long list of rants on how journalists do their work.

9. Supporting data work with non-data work, and vice versa: A solid authoritative story is one where your key point is backed by quantitative data, qualitative evidence and theoretical arguments. If you find contradictions, the easy way is to leave them there and write a version based on the "he said, she said" reporting template. The other option, which requires more work and time, is to dig in further, reconcile what is happening and explain those contradictions. If the data is good enough, it must reflect something real that is happening.

10. Data as an entry point into a story: A key journalistic lesson I have learnt in the last two years is that getting access to a novel dataset (or a set of documents) is not the end. Great stories emerge when you look at data as a lead-in to a story, and not the story itself. Think of data as a human source who came and told you something interesting, which makes you curious and raises questions, and then go on with a reporter's mindset to find answers. This is the data journalism I would like to see more of.

11. Newsroom collaborations: Many journalists treat data reporters as members of a data desk, and say things like "I am working on this article, can you give me some data?" or "Can you make a chart for my story?".

I haven't yet figured out polite ways to tell them this is not our job, and this is not the way to collaborate.

Some journalists are just lazy, not even making the effort to find relevant data reports or call their sources to ask for directions, thinking they can just dump it on the data team. They are the ones who think of numbers and charts as an afterthought, a thing to strike off their to-do list.

Some make the effort and reach out for help when they get stuck. A few see our skills as complementary and want to collaborate in telling a good story, right from ideation to production. These are folks I like to work with.

12. Correlation is not causation: If you have read this far about data journalism, you have probably heard this phrase a zillion times. Yet, analysis pieces continue to attribute cause-effect relationships based on correlations alone. Correlations are important as they help in generating a possible set of hypotheses. But we must be careful in drawing inferences based on correlations.
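A minimal sketch of how a misleading correlation can arise, in Python with simulated data. Everything here is an assumption made for illustration: the variable names (heat, ice_cream, drownings), the noise levels and the seed are invented, and a hidden third factor is only one of several ways correlations mislead.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Hypothetical hidden factor that drives both series.
heat = [rng.gauss(0, 1) for _ in range(2000)]

# Neither series affects the other; each is just heat plus its own noise.
ice_cream = [h + rng.gauss(0, 0.5) for h in heat]
drownings = [h + rng.gauss(0, 0.5) for h in heat]

r_raw = pearson(ice_cream, drownings)

# Subtract the hidden factor and correlate what is left over.
r_adjusted = pearson(
    [i - h for i, h in zip(ice_cream, heat)],
    [d - h for d, h in zip(drownings, heat)],
)

print(f"raw correlation: {r_raw:.2f}")
print(f"after removing the common factor: {r_adjusted:.2f}")
```

The two series correlate strongly even though, by construction, neither causes the other. Once the shared driver is removed, the relationship largely disappears. In real reporting the hidden factor is rarely handed to you, which is exactly why the correlation should be treated as a hypothesis to investigate.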

13. So what is causation? I also see a reverse trend: People say "correlation is not causation" and feel smart. Yet, many of the same folks will casually make causal claims in other places without hard evidence.

Lack of causal explanations hinders our ability to answer important "why" questions. But thinking about causation, if done methodically and rigorously, is not trivial. I don't have any meaningful insight to offer here, other than to highlight that this is a problem I confront too. I am currently thinking and reading more about causality.

14. Lack of transparency: One of my major concerns that goes in tandem with my excitement looking at the rise of data journalism is the lack of transparency and openness—in sharing raw data, conveying analytical techniques, assumptions and calculation methods used in the analysis process.

Editors don't have the time or ability to question the methods that end up generating a summary table. And readers don't have the relevant details or access to raw data to replicate the findings or test the conclusion for themselves. This is problematic.

That's my general observation. There are exceptions: for instance, Mint does a good job of explaining their calculations in chart descriptions. But I think as a community, we can do a better job of being more open and sharing more data. At the Hindustan Times, we started a GitHub repository for making all our data public, but the practice didn't last beyond a few stories. There is no incentive apart from individual motivation. But it's not that hard if we make it a priority. We should.

15. Editorial filters: A common problem across newsrooms is the lack of senior editors who understand data. Every reporter, irrespective of how good or experienced they are, needs an editor who looks critically at their work. An editor's job is an important part of the machinery that makes journalism different from other forms of communication. But if editors are not equipped to edit and question data stories, the filter becomes weak and garbage is guaranteed. I am not immune to this either: I am guilty of having let some bullshit through in my writing (sorry about that), and in retrospect I wish an editor had spotted it at the time of publishing.

16. Data literacy in the newsroom: While I always wish that more journalists appreciated the value of learning spreadsheets and put in consistent effort to get used to them, there is a deeper problem: the lack of data literacy. When we talk about data in journalism, we don't emphasise empirical thinking enough as one of the foundations for understanding the world. You don't need any number-crunching skills to find relevant data for a story. What comes first is asking the right empirical questions and having a frame to evaluate claims.

Data literacy is different from data analysis. It is a skill in itself. You don't need a math degree to assess the quality of data, think about possible errors and biases in the sampling or measurement process, or gauge the limitations of proxy stats, among other things. For instance, if someone sends you a survey report or a press release, you need to run basic sanity checks before referring to it in your story.
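One such sanity check can be sketched in a few lines of Python: the textbook margin of error for a percentage reported from a survey. This is only a rough guide. The function name is made up for this example, and the formula assumes a simple random sample, which many opinion polls do not achieve, so the real uncertainty is usually larger than what it reports.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate 95% margin of error for a survey proportion.

    Assumes a simple random sample. proportion=0.5 is the worst case,
    and z=1.96 corresponds to a 95% confidence level.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# How much wiggle room does a headline percentage carry at different sample sizes?
for n in (100, 400, 1000, 2500):
    print(f"n={n}: +/- {margin_of_error(n) * 100:.1f} percentage points")
```

If a press release reports a two-point lead from a few hundred respondents, a check like this tells you the lead sits well within the noise, before you even ask how the respondents were chosen.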

Data often misleads. To guard themselves, journalists need to become data literate and regularly fine-tune their bullshit detector.


I repeat: all journalism should be data-informed. If you are a journalist interested in gaining a better understanding of data, I strongly recommend the following books:

It's never too late to start learning new things!

Questions or comments? I am at samarthbansal@protonmail.com. For occasional updates about my work, sign up here: bansalsamarth.substack.com

