Analyzing the XSEDE Campus Champions List Archive

Prepare the data

There are mbox files for each month named in %Y-%m format (e.g. 2017-12). Most are uncompressed and have no extension but a few are gzipped and have a .gz extension.

Remove the gzip files that are duplicates of uncompressed files.

rm 2014-0[1-9].gz

Uncompress all the gzip files.

gunzip *.gz

Make a table

Create a table of the message header data for easy searching. This is a pipeline of three commands (sed, sed, awk) with the output redirected to a text file.

First command

Here we extract the message headers and the following blank lines separating them from the message bodies (see section 2.1 of RFC2822) from all the mbox files.

sed -n '/^From /,/^$/p' 20??-??

The output from this step consists of header lines for each message followed by a blank line.

We use the pattern 20??-?? to match all the mbox files. The question marks here are a bash glob, each matching any single character.

The mbox format begins each message with "From " at the beginning of a line; the next lines until the first blank line are the headers.

Second command

Here we "unfold" the "folded" header lines (see section 2.2.3 of RFC2822) and also strip out the double-quote characters.

sed -E -n 'H; ${ x; s/\n +/ /g; s/\n\t+/ /g; s/\"//g; p }'

The output from this step consists of unfolded header lines for each message followed by a blank line.

The sed command unfolds the headers by deleting line breaks followed by one or more spaces or tabs and replacing the spaces or tabs with a single space.

The double-quotes are removed because they cause problems with the import into SQLite. The quotes are mostly found in the Sender: header which generally looks like "Name" <email> and SQLite wants to interpret this as a separate quoted field missing a delimiter.

Third command

Here we pull out all the header values that we're interested in and print them in a tabular format instead of one value per line. This command is the majority of the pipeline, so I haven't copied it here in isolation -- just look at the full command pipeline below.

The output from this step is one line per message with each header field we're interested in and a few extra derived fields. The fields in each line of output are delimited with the ~ (tilde) character. (The ~ character was chosen because it wasn't found in any of the header data.)

The awk command here depends on the blank lines after each message header as a trigger to print its output. We can't depend on the headers to be in any particular order and not all of them may be present, so each is assigned back to its default value ("") after the blank line triggers printing and is only assigned another value when we find the line the contains it.

Extra fields:

From address
Year
Month
Day of month
Datestamp (targeted for SQLite)
Length of the references header
First message ID from the references header

The full pipeline

sed -n '/^From /,/^$/p' 20??-?? | sed -E -n 'H; ${ x; s/\n +/ /g; s/\n\t+/ /g; s/\"//g; p }' | awk '
    /^From: / { 
        # Delete the header name and following space, leaving only the value.
        gsub(/^[A-Za-z-]*: /,""); 
        from=$0;
        from_address=tolower(gensub(/^.*<(.+)>.*$/,"\\1","g",$0));
    }
    /^Date: / { 
        gsub(/^[A-Za-z-]*: /,""); 
        date=$0;
        $0=gensub(/^(.*, )?([0-9]+ [A-Za-z]+ 20[0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] .[0-9][0-9][0-9][0-9]).*$/,"\\2","g",$0)
        split($0,a," ");
        day=a[1]; gsub(/^0/,"",day); 
        mon=a[2]; 
        sub("Jan","1",mon); sub("Feb","2",mon); sub("Mar","3",mon); 
        sub("Apr","4",mon); sub("May","5",mon); sub("Jun","6",mon); 
        sub("Jul","7",mon); sub("Aug","8",mon); sub("Sep","9",mon); 
        sub("Oct","10",mon); sub("Nov","11",mon); sub("Dec","12",mon);
        year=a[3];
        timestamp=a[4];
        # Add colon between offset hours and minutes so SQLite can use it correctly
        offset=(substr(a[5],1,3) ":" substr(a[5],4,2));
        datestamp=sprintf("%d-%02d-%02d %s %s",year,mon,day,timestamp,offset);
    }
    /^Subject: / { gsub(/^[A-Za-z-]*: /,""); subject=$0 }
    /^Message-I[Dd]: / { gsub(/^[A-Za-z-]*: /,""); messageid=$0 }
    /^In-Reply-To: / { gsub(/^[A-Za-z-]*: /,""); inreplyto=$0 }
    /^References: / { gsub(/^[A-Za-z-]*: /,""); references=$0; ref_length=split($0,r," "); ref0=r[1]; } 
    /^$/ && NR>1 { 
        if(ref_length=="") ref_length=0;
        printf("%s~%s~%s~%s~%s~%s~%s~%s~%s~%s~%s~%s~%s\n", from, date, subject, messageid, inreplyto, references, from_address, year, mon, day, datestamp, ref_length, ref0); 
        from=date=subject=messageid=inreplyto=references=year=mon=day=datestamp=ref_length=ref0="";
    }' > message_index6.txt

Import to sqlite

Create an SQLite database and import the table.

sqlite3 messages3.db << EOF
create table messages(sender TEXT, date TEXT, subject TEXT, message_id TEXT, in_reply_to TEXT, reference_ids TEXT, from_address TEXT, year TEXT, month TEXT, day TEXT, datestamp TEXT, ref_length INTEGER, ref0 TEXT);
.separator "~"
.import message_index6.txt messages
EOF

SQL Queries

Clean-up

Get rid of SPAM messages

delete from messages where lower(subject) like '% sex!%';

Message distribution in time

Number of messages per month

select 
    year||substr("0"||month,length(month)+2,-2),
    count(*) 
from 
    messages 
group by 
    year||month 
order by 
    1 asc;

Total message count by month (time adjusted to UTC)

select 
    substr(datetime(datestamp),1,7) as mon, 
    count(*) 
from 
    messages 
group by 
    mon 
order by 
    mon asc;

Average number of messages per month

select 
    substr(mon,6,2) as m, 
    avg(c) 
from 
    (select substr(datetime(datestamp),1,7) as mon, count(*) as c from messages group by mon) 
group by 
    m 
order by 
    m asc;

Total message count by day of week

select 
    strftime('%w',datetime(datestamp)) as dow,
    count(*) 
from 
    messages 
group by 
    dow 
order by 
    dow;

Total message count by hour

--UTC
select 
    strftime('%H',datetime(datestamp))||':00' as hour,
    count(*) 
from 
    messages 
group by 
    hour 
order by 
    hour;

--local time (U.S. Central)
select 
    strftime('%H',datetime(datestamp,'localtime'))||':00' as hour,
    count(*) 
from 
    messages 
group by 
    hour 
order by 
    hour;

Message count by day of week and hour

--UTC
select 
    strftime('%w',datetime(datestamp)) as dow, 
    replace(replace(replace(replace(replace(replace(replace(strftime('%w',datetime(datestamp)), 0, 'Sun'), 1, 'Mon'), 2, 'Tue'), 3, 'Wed'), 4, 'Thu'), 5, 'Fri'), 6, 'Sat') as dow_name,
    strftime('%H',datetime(datestamp))||':00' as hour,
    count(*) 
from 
    messages 
group by 
    dow,hour 
order by 
    dow,hour;

--UTC with explicit zeros
select 
    timetable.dow,
    timetable.hour||':00',
    timetable.dow_name,
    count(messages.sender) 
from 
    (select 
        dows.dow as dow, 
        hours.hour as hour,
        replace(replace(replace(replace(replace(replace(replace(dow, 0, 'Sun'), 1, 'Mon'), 2, 'Tue'), 3, 'Wed'), 4, 'Thu'), 5, 'Fri'), 6, 'Sat') as dow_name 
    from 
        (select distinct strftime('%w',datetime(datestamp)) as dow from messages) dows, 
        (select distinct strftime('%H',datetime(datestamp)) as hour from messages) hours
    ) timetable 
left join 
    messages 
on 
    strftime('%w',datetime(messages.datestamp)) = timetable.dow 
    and strftime('%H',datetime(messages.datestamp)) = timetable.hour 
group by 
    1,2 
order by 
    1,2;

--local time (U.S. Central Time) with explicit zeros
select 
    timetable.dow,
    timetable.hour||':00',
    timetable.dow_name,
    count(messages.sender) 
from 
    (select 
        dows.dow as dow, 
        hours.hour as hour,
        replace(replace(replace(replace(replace(replace(replace(dow, 0, 'Sun'), 1, 'Mon'), 2, 'Tue'), 3, 'Wed'), 4, 'Thu'), 5, 'Fri'), 6, 'Sat') as dow_name 
    from 
        (select distinct strftime('%w',datetime(datestamp)) as dow from messages) dows, 
        (select distinct strftime('%H',datetime(datestamp)) as hour from messages) hours
    ) timetable 
left join 
    messages 
on 
    strftime('%w',datetime(messages.datestamp,'localtime')) = timetable.dow 
    and strftime('%H',datetime(messages.datestamp,'localtime')) = timetable.hour 
group by 
    1,2 
order by 
    1,2;

Threads

Total number of threads

--all time
select count(distinct ifnull(nullif(ref0,''),message_id)) from messages;

--active in 2017
select count(distinct ifnull(nullif(ref0,''),message_id)) from messages where year=2017;

Average response time

--all time
select 
    avg(julianday(r.datestamp)-julianday(o.datestamp))*86400 as seconds 
from 
    messages o, 
    messages r 
where 
    r.in_reply_to != '' 
    and o.message_id != '' 
    and r.in_reply_to = o.message_id;

--responses sent in 2017
select 
    avg(julianday(r.datestamp)-julianday(o.datestamp))*86400 as seconds 
from 
    messages o, 
    messages r 
where 
    r.in_reply_to != '' 
    and o.message_id != '' 
    and r.in_reply_to = o.message_id 
    and r.year=2017;

Top 10 posts with most references (deepest thread depth)

select year,month,day,sender,subject,ref_length from messages order by ref_length desc limit 10;

Top 10 posts with the most children (largest thread)

--all time
select 
    o.year,
    o.month,
    o.day,
    o.subject,
    count(*) 
from 
    messages o, 
    messages r 
where 
    r.ref_length>1 
    and r.ref0=o.message_id 
group by 
    r.ref0 
order by 
    5 desc 
limit 
    10;

--2017
select 
    o.year,
    o.month,
    o.day,
    o.subject,
    count(*) 
from 
    messages o, 
    messages r 
where 
    r.ref_length>1 
    and r.ref0=o.message_id 
    and o.year=2017 
group by 
    r.ref0 
order by 
    5 desc 
limit 
    10;

Top 10 longest-running (time) threads

--all time
select 
    max(julianday(t2.datestamp)-julianday(t1.datestamp)) as days, 
    t1.subject, 
    t1.datestamp, 
    t2.datestamp 
from 
    messages t1, 
    messages t2 
where 
    t2.ref0 != '' 
    and t1.message_id != '' 
    and ifnull(nullif(t1.ref0,''),t1.message_id) = t2.ref0 
group by 
    t2.ref0 
order by 
    days desc 
limit 
    10;

--threads ending in 2017
select 
    max(julianday(t2.datestamp)-julianday(t1.datestamp)) as days, 
    t1.subject, 
    t1.datestamp, 
    t2.datestamp 
from 
    messages t1, 
    messages t2 
where 
    t2.ref0 != '' 
    and t1.message_id != '' 
    and ifnull(nullif(t1.ref0,''),t1.message_id) = t2.ref0 
    and t2.year=2017 
group by 
    t2.ref0 
order by 
    days desc 
limit 
    10;

Top 10 posts with the most direct replies (all time)

select 
    o.year,
    o.month,
    o.day,
    o.subject,
    r.in_reply_to,
    count(*) 
from 
    messages r, 
    messages o 
where 
    o.message_id=r.in_reply_to 
group by 
    r.in_reply_to 
order by 
    6 desc 
limit 
    10;

Number of initial messages with no response (threads of 1 message) (excluding meeting reminders)

select 
    count(*) 
from 
    messages 
where 
    message_id != '' 
    and ref0 = '' 
    and message_id not in (select ref0 from messages) 
    and lower(subject) not like '%cc call%' 
    and lower(subject) not like '%champions call%' 
    and lower(subject) not like '%champion call%' 
    and lower(subject) not like '%monthly call%' 
    and lower(subject) not like '%monthly meeting%' 
    and lower(subject) not like '%champions meeting%' 
    and lower(subject) not like '%champion meeting%' 
    and lower(subject) not like '%community chat%' 
    and lower(subject) not like '%conference call%';

People

Number of people who first emailed the list per year

select 
    year,
    count(*) 
from 
    (select 
        min(year) as year,
        from_address 
    from 
        messages 
    group by 
        from_address 
    order by 
        year asc) 
group by 
    year 
order by 
    year asc;

Number of people who posted to the list each year

select year,count(distinct from_address) from messages group by year;

Number of people who posted to the list each month

select year,month,count(distinct from_address) from messages group by year,month;

Top 10 longest "serving" list members (days between oldest and newest list posts)

select 
    maxday-minday as timespan, 
    from_address 
from 
    (select 
        max(julianday(datestamp)) as maxday, 
        min(julianday(datestamp)) as minday,
        from_address 
    from 
        messages 
    group by 
        from_address)
order by 
    timespan desc 
limit 10;

Misc

Top 10 fastest responses

--2017
select distinct 
    r.datestamp,
    (julianday(r.datestamp)-julianday(o.datestamp))*86400 as seconds,
    r.from_address,
    o.subject 
from 
    messages o, 
    messages r 
where 
    r.in_reply_to != '' 
    and o.message_id != '' 
    and r.in_reply_to = o.message_id 
    and r.year=2017 
order by 
    seconds asc 
limit 
    10;

--all time
select distinct 
    r.datestamp,
    (julianday(r.datestamp)-julianday(o.datestamp))*86400 as seconds,
    r.from_address,
    o.subject 
from 
    messages o, 
    messages r 
where 
    r.in_reply_to != '' 
    and o.message_id != '' 
    and r.in_reply_to = o.message_id 
order by 
    seconds asc 
limit 
    10;

--among senders with 20 or more responses
select 
    avg(seconds) as average_seconds,
    from_address,
    count(*) 
from 
    (select distinct 
        (julianday(r.datestamp)-julianday(o.datestamp))*86400 as seconds,
        r.from_address 
    from 
        messages o, 
        messages r 
    where 
        r.in_reply_to != '' 
        and o.message_id != '' 
        and r.in_reply_to = o.message_id 
    order by 
        seconds asc) 
group by 
    from_address 
having 
    count(*) > 20 
order by 
    average_seconds asc 
limit 
    10;

Number of initial messages in 2017 with no response (threads of 1 message) (excluding meeting reminders)

select 
    count(*) 
from 
    messages 
where 
    message_id != '' 
    and ref0 = '' 
    and message_id not in (select ref0 from messages) 
    and lower(subject) not like '%cc call%' 
    and lower(subject) not like '%champions call%' 
    and lower(subject) not like '%champion call%' 
    and lower(subject) not like '%monthly call%' 
    and lower(subject) not like '%monthly meeting%' 
    and lower(subject) not like '%champions meeting%' 
    and lower(subject) not like '%champion meeting%' 
    and lower(subject) not like '%community chat%' 
    and lower(subject) not like '%conference call%' 
    and year=2017;

Message subjects (excluding meeting reminders) -- make word clouds with this

--2017
select 
    replace(replace(replace(replace(lower(subject),'[campuschampions] ',''),'re: ',''),'fw: ',''),'fwd: ','') 
from 
    messages 
where 
    lower(subject) not like '%cc call%' 
    and lower(subject) not like '%champions call%' 
    and lower(subject) not like '%champion call%' 
    and lower(subject) not like '%monthly call%' 
    and lower(subject) not like '%monthly meeting%' 
    and lower(subject) not like '%champions meeting%' 
    and lower(subject) not like '%champion meeting%' 
    and lower(subject) not like '%community chat%' 
    and lower(subject) not like '%conference call%' 
    and year=2017;

--all time
select 
    replace(replace(replace(replace(lower(subject),'[campuschampions] ',''),'re: ',''),'fw: ',''),'fwd: ','') 
from 
    messages 
where 
    lower(subject) not like '%cc call%' 
    and lower(subject) not like '%champions call%' 
    and lower(subject) not like '%champion call%' 
    and lower(subject) not like '%monthly call%' 
    and lower(subject) not like '%monthly meeting%' 
    and lower(subject) not like '%champions meeting%' 
    and lower(subject) not like '%champion meeting%' 
    and lower(subject) not like '%community chat%' 
    and lower(subject) not like '%conference call%';

evanlinde/xsede_campus_champions_email.md