Data Science and the Ushahidi Platform

You might be wondering why the Director of Data Projects is writing code for the Ushahidi Platform. I'd think that strange too, but as a good old-fashioned data scientist, I care very much about data sources, formats and access, and couldn't pass up the chance to influence the ways that the Ushahidi Platform imports and exports datasets and data summaries.

Let's talk about data science.

[caption id="attachment_16040" align="alignnone" width="538"]Courtesy of http://alexbraunstein.com/[/caption]

Data science is about understanding and working with datasets. What I call "a sympathy for the data" is essentially the set of skills and tools needed to talk about the statements and insights that the data can and cannot support. This is not the same as data visualization, which to me is essentially the psychology of data representation: knowing things like:

How different presentations of the same dataset can influence decision-makers differently
Where the pooh-traps are in those representations
How best to create visual stories that inspire insights in the people who see them.

This split is also apparent in the Ushahidi Platform code. The data science support is part of the back-end API code, the data visualization support is part of the front-end, and these two parts are being written by different people negotiating with each other over features. We'll cover the data visualizations being built into the Ushahidi Platform in a separate post.

So, what do we have for the budding data scientists in our community? Essentially four things:

Raw data available through the Ushahidi Platform reports (and other entities) API
Data summaries and cross-tables available through the Ushahidi Platform stats API
Raw data downloads to CSV
External data imports

Each of these is useful for something different. The raw data through the API is useful to other systems (like CrisisNet) that pull data into their own real-time analysis software. The data summaries are useful inputs into other systems that just want the cross-tables (or want to see platform statistics in near-real-time), like the Ushahidi Platform data visualization modules. The raw data CSVs are useful for people who want to use Ushahidi data in other data science and data visualization tools like Tableau. The external data imports work is about how to include and use data from outside (e.g. GIS data and sensor feeds) via the Ushahidi Platform's custom forms.

We're still working on the Ushahidi Platform's stats API. Right now, you can filter data by time and get report counts by form type, category and time (and cross-tables for each combination of those) across different time steps (e.g. day, month, year). The first CSV download code is in, and we're now working on filters and cross-tables for custom form fields and GIS files (yes, you will have choropleths!).
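To make the idea of a cross-table concrete, here is a minimal Python sketch of counting reports by category and time step from a CSV download. It is an illustration only, not the Platform's implementation: the column names (`id`, `category`, `created`), the sample rows and the `cross_table` helper are all assumptions made up for this example.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical CSV export; column names and rows are invented for illustration.
SAMPLE_CSV = """id,category,created
1,water,2014-03-02
2,water,2014-03-15
3,shelter,2014-03-20
4,water,2014-04-01
5,shelter,2014-04-09
6,shelter,2014-04-30
"""

# strftime patterns that bucket a timestamp into a time step.
TIME_STEPS = {"day": "%Y-%m-%d", "month": "%Y-%m", "year": "%Y"}


def cross_table(rows, step="month", start=None, end=None):
    """Count reports per (category, time bucket), optionally filtered by date."""
    counts = Counter()
    for row in rows:
        created = datetime.strptime(row["created"], "%Y-%m-%d")
        # Time filter: skip reports outside the requested window.
        if start and created < start:
            continue
        if end and created > end:
            continue
        bucket = created.strftime(TIME_STEPS[step])
        counts[(row["category"], bucket)] += 1
    return counts


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
by_month = cross_table(rows, step="month")
for (category, bucket), n in sorted(by_month.items()):
    print(category, bucket, n)
```

Changing `step` to `"day"` or `"year"` regroups the same reports at a different time step, and `start`/`end` play the role of the time filter. A real export will have different columns, so treat the field names here as placeholders.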

How can you help?

Talk to us about the types of statistics data and visualizations that would be most useful to you by chiming in on this Phabricator task. And if you have a nice big Ushahidi V2 dataset (preferably with lots of custom form fields) that we could test this on, we'd love to hear from you too.