Uchaguzi Data Clean Up

Ushahidi
Jul 14, 2013

Data cleansing, data cleaning or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, irrelevant, etc. parts of the data and then replacing, modifying, or deleting this dirty data. After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores. (Wikipedia)

So I had a Crowdmap full of data sitting there. It had to be cleaned so that it was ready for further analysis by researchers. This post describes how I cleared out the irrelevant and corrupt data to produce an accurate database.

After downloading the large CSV file of data, it was obvious from the start that there was an issue with trusted reports. There were 2084 reports with no geolocation or categories, a total mixture of everything, and they came from every channel for entering information on the platform. The reports marked as trusted could not be cleaned because the information had not been reviewed, just approved and verified. As a result, many categories were missing and the majority of these reports had not been geolocated.

The main lesson for me is that there has to be a quality control team, as sadly so much of the data was not usable. Having to go through these reports one by one to check them was extremely time consuming. It is really worth remembering to set out criteria before people class a report as trusted.

Following the guidelines that Ushahidi had already published on the Wiki, I:

- Removed all personal information from every report.
- Removed all data that could not be geolocated.
- Removed all duplicates.

The simplest, least time-consuming way to do this was to use filters in Excel.
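For anyone who would rather script these three steps than filter by hand, here is a minimal Python sketch. It is not what I did (I used Excel), and it rests on assumptions: the column names are hypothetical and need to match the header row of your own export, the phone and email patterns are only rough, and a report with coordinates of 0,0 is treated as not geolocated.

```python
import csv
import io
import re

# Hypothetical column names; adjust to match the header row of your export.
TITLE, DESCRIPTION, LAT, LON = "INCIDENT TITLE", "DESCRIPTION", "LATITUDE", "LONGITUDE"

# Rough patterns for phone numbers and email addresses. They will miss names
# and other personal details, which still need a manual read-through.
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def scrub(text):
    """Replace email addresses and phone numbers with a placeholder."""
    return PHONE.sub("[removed]", EMAIL.sub("[removed]", text))


def has_location(row):
    """True when both coordinates parse as numbers and are not 0,0 (assumption)."""
    try:
        lat, lon = float(row[LAT]), float(row[LON])
    except (KeyError, TypeError, ValueError):
        return False
    return (lat, lon) != (0.0, 0.0)


def clean(rows):
    """Drop rows without a location, scrub personal data, drop duplicates."""
    seen = set()
    kept = []
    for row in rows:
        if not has_location(row):
            continue
        row = {**row, TITLE: scrub(row[TITLE]), DESCRIPTION: scrub(row[DESCRIPTION])}
        # Two reports count as duplicates when title, text and coordinates match.
        key = (row[TITLE].strip().lower(), row[DESCRIPTION].strip().lower(), row[LAT], row[LON])
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept


# Made-up sample data standing in for the downloaded CSV file.
sample = io.StringIO(
    "INCIDENT TITLE,DESCRIPTION,LATITUDE,LONGITUDE\n"
    "Long queue,Call me on +254 712 345678,-1.2921,36.8219\n"
    "Long queue,Call me on +254 712 345678,-1.2921,36.8219\n"
    "No location,Something happened,,\n"
)
cleaned = clean(csv.DictReader(sample))
print(len(cleaned), cleaned[0][DESCRIPTION])
```

On the made-up sample, one report survives: the duplicate and the report without coordinates are dropped, and the phone number is replaced. A script like this only does the first pass. It does not replace reading through the reports yourself.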
How many SMS came into the platform: 4372 (taken from information on the dashboard)
How many SMS were turned into reports: 17 (taken from clean data)
How many SMS were approved: 17 (taken from clean data)
Verified: 8 (taken from clean data)
Ultimately mapped: 17 (taken from clean data)

The SMS reports ended up as just 17. The reasons for this are:

- Not relevant to the deployment.
- Not enough information to be useful.
- Many had no geolocation possible.
- Some SMS were just put on the platform as "trusted source" with no information.
- Not stated as an SMS when the report was created. This made it impossible to tell which reports were SMS and which were not, so the figure of 17 does not show how many of the over 1600 clean data reports were actually SMS.

A point worth remembering: if a report comes from an SMS, make sure SMS is recorded on the report when it is created.

Having 70 categories was also a challenge. This is before the data was cleaned:

Category | Count
Geolocated | 2339
Trusted Reports | 1907
No Need To Translate | 1470
Everything Fine | 741
Translated | 684
Polling Station Logistical Issues | 418
Impossible to Geolocate | 323
Other | 252
Voting Irregularities | 201
Threat of violence | 178
Unresolved | 170
Unverified | 149
BVR Issues | 142
Voter Integrity Irregularities | 139
Civilian Peace Efforts | 127
Provisional Citizen Results | 110
Voter Register Irregularities | 106
Violent Attacks | 104
GEOLOCATION | 98
IEBC Officials not Acting In Accordance to Set Rules | 81
Voter Identification Issues | 60
Absence/Insufficient IEBC Officials At Polling Station | 55
Irregularities with voter assistance | 54
Mobilisation towards violence | 51
Missing/Inadequate Voting Materials | 49
Counting Irregularities | 47
Fear and Tension | 45
Rumours | 41
TRANSLATION | 38
Demonstrations | 33
Ballot Box Irregularities | 32
Absence/Insufficient law enforcement officials at Polling Station | 32
Eviction/population displacement | 29
Property Loss/Theft | 25
Police Peace Efforts | 25
Polling station logisitcal issues | 23
Polling Station Closed Before Voting Concluded | 22
Campaign material in polling station | 21
Ambush | 20
Protest over declared results | 20
Purchasing of Voters Cards | 20
Observers/Media Blocked From Entering Polling Station | 19
Hate Speech | 18
Resolved | 18
POSITIVE EVENTS | 17
Irregularities with transportation of ballot boxes | 15
Party Agent Irregularities | 14
Verified | 14
Failure to Announce Results By IEBC official | 13
Presence of weapons | 13
To Be Geolocated | 11
Riots | 11
Sexual and Gender Based Violence | 8
To Be Translated | 8
Armed Clashes | 6
Voters Issued Invalid Ballot Papers | 5
URGENT | 4
Abductions/kidnapping | 4
SMS-V | 4
Bombings | 4
Ballot Boxes Destroyed After Announcing Final Results | 3
Purchase of weapons | 3
certificate Issues | 2
Polling Station Administration | 1
Voting Issues | 1
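Part of the challenge with so many categories is that some are near-duplicates of each other. In the counts above, "Polling Station Logistical Issues" (418) and "Polling station logisitcal issues" (23) are clearly the same category typed two ways. As a rough sketch, and not something I used at the time, Python's standard difflib module can flag names that nearly match so a person can decide whether to merge them. The 0.9 threshold is an arbitrary assumption and would need tuning on your own category list.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalise(name):
    """Lower-case a category name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def similar_categories(names, threshold=0.9):
    """Return pairs of category names whose normalised forms nearly match."""
    pairs = []
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
        if score >= threshold:
            pairs.append((a, b))
    return pairs


# A few category names copied from the counts above.
categories = [
    "Polling Station Logistical Issues",
    "Polling station logisitcal issues",
    "Rumours",
    "Riots",
]

for a, b in similar_categories(categories):
    print(f"Possible duplicate: {a!r} / {b!r}")
```

This only suggests candidates. Whether two categories really mean the same thing is still a judgement for the quality control team.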

After using the filter functions in Excel, I had to go manually through each line of the CSV file to make sure I had not missed anything.

If anything can be learnt from this post-deployment data cleaning, it is that quality control HAS to be in place during deployment. This is key to gaining accurate information.

I gained so much personally from completing this task. I hope it helps you when you run a deployment and want to use the results after the event. If anyone has any questions, I am always available to answer them.

Happy Mapping Folks,
Jus