How much text versus metadata is in a tweet?
Brendan O'Connor (brenocon.com), 2011-06-13
http://twitter.com/brendan642/status/80473880111742976

What does it mean to compare the amount of text versus metadata?
Let's start with the raw size of the data that comes over the wire from Twitter.
## Get tweets out of a sample stream archive.
## (e.g. curl http://stream.twitter.com/1/statuses/sample.json)
% cat tweets.2011-05-19 | grep -P '"text":' | head -100000 > 100k_tweets

## Full tweet size.
% cat 100k_tweets | wc
 100000 3869324 211077132
                ^^^^^^^^^
                size: ~2kb/tweet

## Example tweet.
% head -1 100k_tweets
{"in_reply_to_status_id_str":null,"text":"Se Deus, que \u00e9 Deus n\u00e3o te julga, porque se importar com que os outros pensam?","contributors":null,"retweeted":false,"in_reply_to_user_id_str":null,"geo":null,"source":"web","coordinates":null,"in_reply_to_user_id":null,"truncated":false,"entities":{"hashtags":[],"urls":[],"user_mentions":[]},"place":null,"favorited":false,"created_at":"Thu May 19 00:15:11 +0000 2011","user":{"contributors_enabled":false,"profile_sidebar_fill_color":"","profile_image_url":"http:\/\/a3.twimg.com\/profile_images\/1240716730\/feiaaaa_normal.jpg","follow_request_sent":null,"profile_background_tile":true,"url":"http:\/\/www.orkut.com.br\/Main#Profile?uid=13310717337747944298","screen_name":"Natalyyia","profile_link_color":"fa5573","description":"Falsidade \u00e9 caracter\u00edstica de fracos. Os fortes falam na cara, e n\u00e3o se importam de escutar o que a outra pessoa pensa ao seu respeito....\r\n","show_all_inline_media":false,"verified":false,"geo_enabled":false,"favourites_count":6,"profile_sidebar_border_color":"181A1E","listed_count":0,"time_zone":"Santiago","followers_count":52,"location":"Bras\u00edlia DF","is_translator":false,"notifications":null,"profile_use_background_image":true,"lang":"en","statuses_count":1947,"friends_count":195,"profile_background_color":"1A1B1F","protected":false,"profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/253006506\/tumblr_ll3kpikZm91qe4nyno1_500_large.jpg","created_at":"Thu Oct 07 23:39:29 +0000 2010","name":"Nat\u00e1lya","default_profile_image":false,"default_profile":false,"id":199887606,"id_str":"199887606","following":null,"utc_offset":-14400,"profile_text_color":"0a060a"},"retweet_count":0,"in_reply_to_status_id":null,"id":71005947086110720,"id_str":"71005947086110720","in_reply_to_screen_name":null}
## Extract text. (Quick and dirty; a JSON parser would be cleaner, but slower.)
% cat 100k_tweets | grep -Po '"text":.*?[^\\]",' | perl -pe 's/"text"://; s/^"//; s/",$//' > 100k_texts

## Example texts.
% head -5 100k_texts
Se Deus, que \u00e9 Deus n\u00e3o te julga, porque se importar com que os outros pensam?
@gunslikeana kd clara ana DD:
\u041a\u0440\u0430\u0441\u0438\u0432\u043e!!!RT@avvlas \u0413\u0434\u0435-\u0442\u043e \u0432 \u0417\u0430\u043f\u0430\u0434\u043d\u043e\u0439 \u0410\u0444\u0440\u0438\u043a\u0435 http:\/\/nblo.gs\/i2tug
@bamin27 i already tried
I just took \"[1-4] Justin Bieber hid camera's in 'Never Say Never' Poster's of himself ...\" and got: Part 1 <3! Try it: http:\/\/bit.ly\/jm6wHX
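The JSON-parser route mentioned above can be sketched like this. It is slower than the grep one-liner, but it handles field order and escaped quotes correctly. Shown here on an inline sample line; to run it on the real archive, replace the `echo` with `cat 100k_tweets`.

```shell
# Parse each line as JSON and print the "text" field.
# (Inline sample here; pipe 100k_tweets through it for the real data.)
echo '{"text":"hello world","user":{"screen_name":"example"}}' \
  | python3 -c '
import sys, json
for line in sys.stdin:
    print(json.loads(line)["text"])'
# prints: hello world
```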
## Text size.
% cat 100k_texts | wc
 129499 1184935 10553809
                ^^^^^^^^
                size: ~100 bytes/tweet

## Percent text.
> 10553809/(10553809+211077132)
[1] 0.04761884
Note that we're using the JSON-escaped version of the text. Is this defensible?
If we parsed the JSON and then measured the UTF-8 size, it would be smaller.
But it's not like UTF-8 is the "true" representation either: if we used UCS-2
or UCS-4 it would most likely be bigger. So let's just use something close to
the format we get over the wire from Twitter.
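To make the escaped-versus-UTF-8 size difference concrete, here is a quick check on the opening phrase of the example tweet: each `\u00e9` escape costs 6 ASCII bytes where UTF-8 needs only 2 for the same character.

```shell
python3 -c '
import json
s = "Se Deus, que \u00e9 Deus"   # parsed: the \u00e9 becomes one character
print(len(json.dumps(s)))         # JSON-escaped ASCII form, incl. quotes -> 26
print(len(s.encode("utf-8")))     # raw UTF-8 bytes                      -> 20
'
```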
This suggests the next experiment: sizes under compression. This is a fairer
comparison because the algorithm should normalize away the different levels of
redundancy in the different representations. A perfect compression algorithm
would tell us their true information content.

Life is short, so we use gzip.
% cat 100k_tweets | gzip -9 | wc -c
36191951
% cat 100k_texts | gzip -9 | wc -c
4625716

So under compression:
 * 362 bytes / tweet
 * 46 bytes of text / tweet
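These per-tweet figures, and the 11% below, follow directly from the byte counts above (100000 tweets; the text fraction uses the same text/(text+tweet) form as the uncompressed calculation earlier). As a sanity check:

```shell
python3 -c '
print(36191951 / 100000)               # compressed bytes per tweet      (~362)
print(4625716 / 100000)                # compressed text bytes per tweet  (~46)
print(4625716 / (4625716 + 36191951))  # text fraction under compression (~0.113)
'
```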
This implies text is 11% of the information content, if you believe gzip
(DEFLATE, i.e. LZ77 plus Huffman coding) does a reasonable job at language
modeling. That isn't totally true, but I might hazard it's good enough for the
general comparison against metadata. I once did a class project where PPM with
a 3-gram Markov model outperformed gzip on multilingual text by a relative
7ish%, so let's say the text is <=10% of the information content of the
entire tweet.
The other consideration is whether DEFLATE is good at compressing metadata. I
bet it is very good, especially at all those repetitive key/value pairs.
Though the metadata contains lots of language, space, and time data too, for
which you could imagine building good probabilistic models and therefore
better compression. This might be interesting to investigate further. As a
start: the full-tweet compression ratio is 17%, compared to a text compression
ratio of 44%. The metadata compresses much better.
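A synthetic illustration of why the metadata side compresses so well (made-up payloads, not the Twitter sample): gzip squeezes a stream of repeated key/value pairs down to a tiny fraction of its size, while a high-entropy "text" payload of the same length barely compresses.

```shell
python3 -c '
import gzip, random
random.seed(0)
# repetitive metadata-style payload
meta = b"\"retweeted\":false,\"geo\":null,\"favorited\":false," * 1000
# same-length payload of random lowercase letters (high entropy)
text = bytes(random.randrange(97, 123) for _ in range(len(meta)))
print(round(len(gzip.compress(meta, 9)) / len(meta), 3))  # tiny ratio
print(round(len(gzip.compress(text, 9)) / len(text), 3))  # much closer to 1
'
```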
Data caveat: the sample is from a tiny slice of time. I wonder whether time
zone effects cause different language speakers to be more active (one half of
the globe is asleep), therefore causing the sizes of texts to differ (since
Asian languages have larger texts, and there may be other geographic,
community-specific variation in text size too).