How much text versus metadata is in a tweet?
Brendan O'Connor (brenocon.com), 2011-06-13
http://twitter.com/brendan642/status/80473880111742976

What does it mean to compare the amount of text versus metadata?
Let's start with the raw size of the data that comes over the wire from Twitter.
## Get tweets out of a sample stream archive.
## (e.g. curl http://stream.twitter.com/1/statuses/sample.json)
% cat tweets.2011-05-19 | grep -P '"text":' | head -100000 > 100k_tweets

## Full tweet size.
% cat 100k_tweets | wc
 100000 3869324 211077132
                ^^^^^^^^^
                size: ~2kb/tweet

## Example tweet.
% head -1 100k_tweets
{"in_reply_to_status_id_str":null,"text":"Se Deus, que \u00e9 Deus n\u00e3o te julga, porque se importar com que os outros pensam?","contributors":null,"retweeted":false,"in_reply_to_user_id_str":null,"geo":null,"source":"web","coordinates":null,"in_reply_to_user_id":null,"truncated":false,"entities":{"hashtags":[],"urls":[],"user_mentions":[]},"place":null,"favorited":false,"created_at":"Thu May 19 00:15:11 +0000 2011","user":{"contributors_enabled":false,"profile_sidebar_fill_color":"","profile_image_url":"http:\/\/a3.twimg.com\/profile_images\/1240716730\/feiaaaa_normal.jpg","follow_request_sent":null,"profile_background_tile":true,"url":"http:\/\/www.orkut.com.br\/Main#Profile?uid=13310717337747944298","screen_name":"Natalyyia","profile_link_color":"fa5573","description":"Falsidade \u00e9 caracter\u00edstica de fracos. Os fortes falam na cara, e n\u00e3o se importam de escutar o que a outra pessoa pensa ao seu respeito....\r\n","show_all_inline_media":false,"verified":false,"geo_enabled":false,"favourites_count":6,"profile_sidebar_border_color":"181A1E","listed_count":0,"time_zone":"Santiago","followers_count":52,"location":"Bras\u00edlia DF","is_translator":false,"notifications":null,"profile_use_background_image":true,"lang":"en","statuses_count":1947,"friends_count":195,"profile_background_color":"1A1B1F","protected":false,"profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/253006506\/tumblr_ll3kpikZm91qe4nyno1_500_large.jpg","created_at":"Thu Oct 07 23:39:29 +0000 2010","name":"Nat\u00e1lya","default_profile_image":false,"default_profile":false,"id":199887606,"id_str":"199887606","following":null,"utc_offset":-14400,"profile_text_color":"0a060a"},"retweet_count":0,"in_reply_to_status_id":null,"id":71005947086110720,"id_str":"71005947086110720","in_reply_to_screen_name":null}
## Extract text. (Quick and dirty; a JSON parser would be cleaner, but slower.)
% cat 100k_tweets | grep -Po '"text":.*?[^\\]",' | perl -pe 's/"text"://; s/^"//; s/",$//' > 100k_texts

## Example texts.
% head -5 100k_texts
Se Deus, que \u00e9 Deus n\u00e3o te julga, porque se importar com que os outros pensam?
@gunslikeana kd clara ana DD:
\u041a\u0440\u0430\u0441\u0438\u0432\u043e!!!RT@avvlas \u0413\u0434\u0435-\u0442\u043e \u0432 \u0417\u0430\u043f\u0430\u0434\u043d\u043e\u0439 \u0410\u0444\u0440\u0438\u043a\u0435 http:\/\/nblo.gs\/i2tug
@bamin27 i already tried
I just took \"[1-4] Justin Bieber hid camera's in 'Never Say Never' Poster's of himself ...\" and got: Part 1 <3! Try it: http:\/\/bit.ly\/jm6wHX
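The JSON-parser route mentioned above can be sketched like this. It is slower than the grep one-liner, but it handles field order and escaped quotes correctly. Shown here on an inline sample line; to run it on the real archive, replace the `echo` with `cat 100k_tweets`.

```shell
# Parse each line as JSON and print the "text" field.
# (Inline sample here; pipe 100k_tweets through it for the real data.)
echo '{"text":"hello world","user":{"screen_name":"example"}}' \
  | python3 -c '
import sys, json
for line in sys.stdin:
    print(json.loads(line)["text"])'
# prints: hello world
```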
## Text size.
% cat 100k_texts | wc
 129499 1184935 10553809
                ^^^^^^^^
                size: ~100 bytes/tweet

## Percent text.
> 10553809/(10553809+211077132)
[1] 0.04761884
Note that we're using the JSON-escaped version of the text. Is this defensible?
If we parsed the JSON and then measured the UTF-8 size, it would be smaller.
But it's not like UTF-8 is the "true" representation either: if we used UCS-2
or UCS-4 it would most likely be bigger. So let's just use something close to
the format we get over the wire from Twitter.
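To make the escaped-versus-UTF-8 size difference concrete, here is a quick check on the opening phrase of the example tweet: each `\u00e9` escape costs 6 ASCII bytes where UTF-8 needs only 2 for the same character.

```shell
python3 -c '
import json
s = "Se Deus, que \u00e9 Deus"   # parsed: the \u00e9 becomes one character
print(len(json.dumps(s)))         # JSON-escaped ASCII form, incl. quotes -> 26
print(len(s.encode("utf-8")))     # raw UTF-8 bytes                      -> 20
'
```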
This suggests the next experiment: sizes under compression. This is a fairer
comparison because the algorithm should normalize away the different levels of
redundancy in the different representations. A perfect compression algorithm
would tell us their true information content.

Life is short, so we use gzip.
% cat 100k_tweets | gzip -9 | wc -c
36191951
% cat 100k_texts | gzip -9 | wc -c
4625716

So under compression:
 * 362 bytes / tweet
 * 46 bytes of text / tweet
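These per-tweet figures, and the 11% below, follow directly from the byte counts above (100000 tweets; the text fraction uses the same text/(text+tweet) form as the uncompressed calculation earlier). As a sanity check:

```shell
python3 -c '
print(36191951 / 100000)               # compressed bytes per tweet      (~362)
print(4625716 / 100000)                # compressed text bytes per tweet  (~46)
print(4625716 / (4625716 + 36191951))  # text fraction under compression (~0.113)
'
```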
This implies text is 11% of the information content, if you believe gzip
(DEFLATE, i.e. LZ77 plus Huffman coding) does a reasonable job at language
modeling. That isn't totally true, but I might hazard it's good enough for the
general comparison against metadata. I once did a class project where PPM with
a 3-gram Markov model outperformed gzip on multilingual text by a relative
7ish%, so let's say the text is <=10% of the information content of the
entire tweet.
The other consideration is whether DEFLATE is good at compressing metadata. I
bet it is very good, especially at all those repetitive key/value pairs.
Though the metadata contains lots of language, space, and time data too, for
which you could imagine building good probabilistic models and therefore
better compression. This might be interesting to investigate further. As a
start: the full-tweet compression ratio is 17%, compared to a text compression
ratio of 44%. The metadata compresses much better.
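A synthetic illustration of why the metadata side compresses so well (made-up payloads, not the Twitter sample): gzip squeezes a stream of repeated key/value pairs down to a tiny fraction of its size, while a high-entropy "text" payload of the same length barely compresses.

```shell
python3 -c '
import gzip, random
random.seed(0)
# repetitive metadata-style payload
meta = b"\"retweeted\":false,\"geo\":null,\"favorited\":false," * 1000
# same-length payload of random lowercase letters (high entropy)
text = bytes(random.randrange(97, 123) for _ in range(len(meta)))
print(round(len(gzip.compress(meta, 9)) / len(meta), 3))  # tiny ratio
print(round(len(gzip.compress(text, 9)) / len(text), 3))  # much closer to 1
'
```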
Data caveat: the sample is from a tiny slice of time. I wonder whether time
zone effects cause different language speakers to be more active (one half of
the globe is asleep), therefore causing the sizes of texts to differ (since
Asian languages have larger texts, and there may be other geographic,
community-specific variation in text size too).