Sunday 11 January 2015

Extracting data generated in a Facebook Group

Introduction

I have been practising Parkour since August 2007 and, together with other people in town, I also created a Facebook Group.
This Facebook group is used by practitioners and beginners to find other people to train with, ask questions, post parkour photos and links, and also to schedule training sessions.
I was curious to know which spot/place practitioners trained at the most in 2014.



Facebook Data: Graph API Explorer
To dump all the posts of my parkour Facebook group, I started with the documentation at developer.facebook.com.
The API is called the Graph API, and the documentation at facebook.com is good enough: https://developers.facebook.com/docs/graph-api/using-graph-api/v2.2

One of the first tools to use to test the Facebook API is the Graph API Explorer: https://developers.facebook.com/tools/explorer/145634995501895/

The Graph API Explorer (GAE) is useful to test the Facebook API and to generate a valid token for your tests/scripts: just press the "Get Access Token" button on the user interface.

The first step with GAE is to generate a valid token for the logged-in user with the permissions required by your queries: in my case user_groups is mandatory.
I then tested some queries:
  • "me" show data of the current logged user
  • "me/friends" show data about the friends of the logged user
  • "me/groups" show data about the groups of the logged user
Graph API Explorer: the (truncated) response for a user's data
The data is returned in JSON format and every returned object (person, comment, status, group, etc.) has its own id.
The "me/groups" query is useful to find the group-id of the group for later analysis.
So, simply entering the group-id and then pressing submit will show the information about the group; if you need to fetch all the feed of the group, query "group-id/feed".
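
The same queries can also be reproduced outside the Explorer with plain HTTPS requests. Here is a minimal sketch in Python 2 (like the rest of the scripts in this post); the token and the group id are placeholders, and it simply lists the groups of the logged-in user and then fetches the first page of a group's feed:

 import json  
 import urllib2  

 token = "PASTE_A_TOKEN_FROM_GRAPH_API_EXPLORER"  # placeholder  
 base = "https://graph.facebook.com/v2.2/"  

 # list the groups of the logged-in user, to find the group-id  
 groups = json.load(urllib2.urlopen(base + "me/groups?access_token=" + token))  
 for g in groups["data"]:  
      print g["id"], g["name"]  

 # first page of a group's feed ("GROUP_ID" is a placeholder)  
 feed = json.load(urllib2.urlopen(base + "GROUP_ID/feed?access_token=" + token))  
 for post in feed["data"]:  
      print post["id"], post.get("created_time")  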

Pagination and period of interest
It is not possible to fetch all the posts of a group with one query: the results are divided into pages.
Every successful JSON response contains a "paging" section:

  "paging": {  
   "previous": "https://graph.facebook.com/v2.2/34343/feed?since=1420799539&limit=25&__paging_token=enc_AexSv5Apv-nbkvNZ",  
   "next": "https://graph.facebook.com/v2.2/23ddsds/feed?limit=25&until=1417117943&__paging_token=enc_AezUcra6A2_dsSv5A"  
  }  

This section contains 2 links: one to query the next page of results and the other to query the previous page.
Another important parameter to consider is the "until" parameter (or "since"): it is used to return all the posts up to a certain moment, expressed as an epoch timestamp.
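
As a minimal sketch of how these two pieces fit together (TOKEN and GROUP_ID are placeholders, and the exact epoch value depends on the local timezone), you can compute the "until" value for 1 January 2015 and then keep requesting the "next" URL returned in the "paging" section:

 import json  
 import time  
 import urllib2  

 # "until" value for 1 January 2015 00:00:00, local time  
 epoch = int(time.mktime(time.strptime('01.01.2015 00:00:00', '%d.%m.%Y %H:%M:%S')))  

 url = "https://graph.facebook.com/v2.2/GROUP_ID/feed?access_token=TOKEN&limit=25&until=" + str(epoch)  
 while url:  
      page = json.load(urllib2.urlopen(url))  
      for post in page.get("data", []):  
           print post.get("created_time")  
      url = page.get("paging", {}).get("next")   # missing on the last page  
      time.sleep(5)                              # stay clear of the rate limit  

The full script below does the same thing with the facebook-sdk library and adds a stop condition on created_time, so it does not walk through the whole history of the group.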
In order to get all the posts of a particular year:
  • I fetch the results recorded until 1/1/2015 00:00:00
  • I follow the "next" link until the first record of a page has a created_time in 2013
  • all the data is saved in a file for later analysis
  • between one fetch and the next, the script waits 5 seconds, to stay safely below the API rate limit

The script
To use the following script you need a Python environment with this library installed: https://github.com/pythonforfacebook/facebook-sdk (on a Mac, download it, then run sudo python setup.py install inside the directory).
Here is the script that dumps all the 2014 posts of a particular group; you need to edit:
  • the value of the token variable
  • the value of the group_id variable
  • the start_date_time variable
  • the stopyear variable

 import time  
 import datetime  
 import facebook  
 import cPickle as pickle  

 token = "put your Graph API token here; you can generate one with the Graph API Explorer"  
 group_id = 'put your group id here'  
 start_date_time = '01.01.2015 00:00:00'  
 pattern = '%d.%m.%Y %H:%M:%S'  
 stopyear = '2013'  

 # convert the start date into an epoch timestamp for the "until" parameter  
 epoch = int(time.mktime(time.strptime(start_date_time, pattern)))  
 print epoch  
 # check the conversion back from the epoch value:  
 # print (datetime.datetime.fromtimestamp(epoch).strftime('%Y-%m-%d %H:%M:%S'))  

 action = group_id + "/feed?limit=25&until=" + str(epoch)  
 print action  
 graph = facebook.GraphAPI(token)  

 # length of the "https://graph.facebook.com/v2.2" prefix, stripped from every "next" link  
 urlLen = len("https://graph.facebook.com/vX.Y")  
 goOn = True  
 allRecords = []  

 while goOn:  
      profile = graph.get_object(action)  
      if 'data' in profile:  
           print 'page fetched correctly'  
      else:  
           print 'problems found... rate limit?'  
           print profile  
           break  
      if not profile["data"]:  
           # no more posts returned  
           break  
      allRecords.extend(profile["data"])  
      if stopyear in profile["data"][0]['created_time']:  
           # the newest post of this page is already in the stop year  
           print "stop"  
           goOn = False  
      else:  
           print "not ready.. go on: " + str(len(allRecords))  
           # strip the host prefix so the path can be passed again to get_object  
           action = profile["paging"]["next"][urlLen:]  
      time.sleep(5)  

 with open('genovapkdata.fb', 'wb') as fp:  
      pickle.dump(allRecords, fp)  

Conclusions
At the end of the execution I will have a file (genovapkdata.fb) with all the records saved in a format that can be read back from Python by another script:

 import cPickle as pickle  

 # load the records dumped by the previous script  
 with open('genovapkdata.fb', 'rb') as fp:  
      data = pickle.load(fp)  
 print "Records: " + str(len(data))  

 countGovi = 0  
 ....  
 for onerecord in data:  
      if 'message' in onerecord:  
           msg = onerecord['message'].lower()  
           if 'created_time' in onerecord and '2013' in onerecord['created_time']:  
                print 'record not important'  
           elif 'govi' in msg:  
                countGovi = countGovi + 1  
 ...  

I discard the posts with a created_time in 2013 again, because this time I'm analysing one post at a time, and I'm also counting the records that contain a particular word, "govi"... of course I did other analyses too.

It is now possible to compute some statistics, such as:
- who is the main post writer in the group (see the sketch below)
- a top 10 of the places where people train the most
- how many posts were generated
- how many comments were generated
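
For example, here is a minimal sketch of the first statistic, counting the posts written by each member; it assumes that every record keeps the "from" field returned by the Graph API, which contains the author's name:

 import cPickle as pickle  
 from collections import Counter  

 with open('genovapkdata.fb', 'rb') as fp:  
      data = pickle.load(fp)  

 # count the posts per author, skipping the 2013 records as before  
 authors = Counter()  
 for onerecord in data:  
      if '2013' in onerecord.get('created_time', ''):  
           continue  
      if 'from' in onerecord:  
           authors[onerecord['from']['name']] += 1  

 # top 10 post writers of 2014  
 for name, count in authors.most_common(10):  
      print name, count  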
