There are mbox files for each month named in %Y-%m
format (e.g. 2017-12). Most are uncompressed and have no extension but a few are gzipped and have a .gz
extension.
Remove the gzip files that are duplicates of uncompressed files.
rm 2014-0[1-9].gz
Uncompress all the gzip files.
gunzip *.gz
Create a table of the message header data for easy searching. This is a pipeline of three commands (sed
, sed
, awk
) with the output redirected to a text file.
Here we extract the message headers and the following blank lines separating them from the message bodies (see section 2.1 of RFC2822) from all the mbox files.
sed -n '/^From /,/^$/p' 20??-??
The output from this step consists of header lines for each message followed by a blank line.
We use the pattern 20??-??
to match all the mbox files. The question marks here are a bash
glob, each matching any single character.
The mbox format begins each message with "From " at the beginning of a line; the next lines until the first blank line are the headers.
Here we "unfold" the "folded" header lines (see section 2.2.3 of RFC2822) and also strip out the double-quote characters.
sed -E -n 'H; ${ x; s/\n +/ /g; s/\n\t+/ /g; s/\"//g; p }'
The output from this step consists of unfolded header lines for each message followed by a blank line.
The sed
command unfolds the headers by deleting line breaks followed by one or more spaces or tabs and replacing the spaces or tabs with a single space.
The double-quotes are removed because they cause problems with the import into SQLite. The quotes are mostly found in the Sender:
header which generally looks like "Name" <email>
and SQLite wants to interpret this as a separate quoted field missing a delimiter.
Here we pull out all the header values that we're interested in and print them in a tabular format instead of one value per line. This command is the majority of the pipeline, so I haven't copied it here in isolation -- just look at the full command pipeline below.
The output from this step is one line per message with each header field we're interested in and a few extra derived fields. The fields in each line of output are delimited with the ~
(tilde) character. (The ~
character was chosen because it wasn't found in any of the header data.)
The awk
command here depends on the blank lines after each message header as a trigger to print its output. We can't depend on the headers to be in any particular order and not all of them may be present, so each is assigned back to its default value ("") after the blank line triggers printing and is only assigned another value when we find the line the contains it.
Extra fields:
- From address
- Year
- Month
- Day of month
- Datestamp (targeted for SQLite)
- Length of the references header
- First message ID from the references header
sed -n '/^From /,/^$/p' 20??-?? | sed -E -n 'H; ${ x; s/\n +/ /g; s/\n\t+/ /g; s/\"//g; p }' | awk '
/^From: / {
# Delete the header name and following space, leaving only the value.
gsub(/^[A-Za-z-]*: /,"");
from=$0;
from_address=tolower(gensub(/^.*<(.+)>.*$/,"\\1","g",$0));
}
/^Date: / {
gsub(/^[A-Za-z-]*: /,"");
date=$0;
$0=gensub(/^(.*, )?([0-9]+ [A-Za-z]+ 20[0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] .[0-9][0-9][0-9][0-9]).*$/,"\\2","g",$0)
split($0,a," ");
day=a[1]; gsub(/^0/,"",day);
mon=a[2];
sub("Jan","1",mon); sub("Feb","2",mon); sub("Mar","3",mon);
sub("Apr","4",mon); sub("May","5",mon); sub("Jun","6",mon);
sub("Jul","7",mon); sub("Aug","8",mon); sub("Sep","9",mon);
sub("Oct","10",mon); sub("Nov","11",mon); sub("Dec","12",mon);
year=a[3];
timestamp=a[4];
# Add colon between offset hours and minutes so SQLite can use it correctly
offset=(substr(a[5],1,3) ":" substr(a[5],4,2));
datestamp=sprintf("%d-%02d-%02d %s %s",year,mon,day,timestamp,offset);
}
/^Subject: / { gsub(/^[A-Za-z-]*: /,""); subject=$0 }
/^Message-I[Dd]: / { gsub(/^[A-Za-z-]*: /,""); messageid=$0 }
/^In-Reply-To: / { gsub(/^[A-Za-z-]*: /,""); inreplyto=$0 }
/^References: / { gsub(/^[A-Za-z-]*: /,""); references=$0; ref_length=split($0,r," "); ref0=r[1]; }
/^$/ && NR>1 {
if(ref_length=="") ref_length=0;
printf("%s~%s~%s~%s~%s~%s~%s~%s~%s~%s~%s~%s~%s\n", from, date, subject, messageid, inreplyto, references, from_address, year, mon, day, datestamp, ref_length, ref0);
from=date=subject=messageid=inreplyto=references=year=mon=day=datestamp=ref_length=ref0="";
}' > message_index6.txt
Create an SQLite database and import the table.
sqlite3 messages3.db << EOF
create table messages(sender TEXT, date TEXT, subject TEXT, message_id TEXT, in_reply_to TEXT, reference_ids TEXT, from_address TEXT, year TEXT, month TEXT, day TEXT, datestamp TEXT, ref_length INTEGER, ref0 TEXT);
.separator "~"
.import message_index6.txt messages
EOF
Get rid of SPAM messages
delete from messages where lower(subject) like '% sex!%';
Number of messages per month
select
year||substr("0"||month,length(month)+2,-2),
count(*)
from
messages
group by
year||month
order by
1 asc;
Total message count by month (time adjusted to UTC)
select
substr(datetime(datestamp),1,7) as mon,
count(*)
from
messages
group by
mon
order by
mon asc;
Average number of messages per month
select
substr(mon,6,2) as m,
avg(c)
from
(select substr(datetime(datestamp),1,7) as mon, count(*) as c from messages group by mon)
group by
m
order by
m asc;
Total message count by day of week
select
strftime('%w',datetime(datestamp)) as dow,
count(*)
from
messages
group by
dow
order by
dow;
Total message count by hour
--UTC
select
strftime('%H',datetime(datestamp))||':00' as hour,
count(*)
from
messages
group by
hour
order by
hour;
--local time (U.S. Central)
select
strftime('%H',datetime(datestamp,'localtime'))||':00' as hour,
count(*)
from
messages
group by
hour
order by
hour;
Message count by day of week and hour
--UTC
select
strftime('%w',datetime(datestamp)) as dow,
replace(replace(replace(replace(replace(replace(replace(strftime('%w',datetime(datestamp)), 0, 'Sun'), 1, 'Mon'), 2, 'Tue'), 3, 'Wed'), 4, 'Thu'), 5, 'Fri'), 6, 'Sat') as dow_name,
strftime('%H',datetime(datestamp))||':00' as hour,
count(*)
from
messages
group by
dow,hour
order by
dow,hour;
--UTC with explicit zeros
select
timetable.dow,
timetable.hour||':00',
timetable.dow_name,
count(messages.sender)
from
(select
dows.dow as dow,
hours.hour as hour,
replace(replace(replace(replace(replace(replace(replace(dow, 0, 'Sun'), 1, 'Mon'), 2, 'Tue'), 3, 'Wed'), 4, 'Thu'), 5, 'Fri'), 6, 'Sat') as dow_name
from
(select distinct strftime('%w',datetime(datestamp)) as dow from messages) dows,
(select distinct strftime('%H',datetime(datestamp)) as hour from messages) hours
) timetable
left join
messages
on
strftime('%w',datetime(messages.datestamp)) = timetable.dow
and strftime('%H',datetime(messages.datestamp)) = timetable.hour
group by
1,2
order by
1,2;
--local time (U.S. Central Time) with explicit zeros
select
timetable.dow,
timetable.hour||':00',
timetable.dow_name,
count(messages.sender)
from
(select
dows.dow as dow,
hours.hour as hour,
replace(replace(replace(replace(replace(replace(replace(dow, 0, 'Sun'), 1, 'Mon'), 2, 'Tue'), 3, 'Wed'), 4, 'Thu'), 5, 'Fri'), 6, 'Sat') as dow_name
from
(select distinct strftime('%w',datetime(datestamp)) as dow from messages) dows,
(select distinct strftime('%H',datetime(datestamp)) as hour from messages) hours
) timetable
left join
messages
on
strftime('%w',datetime(messages.datestamp,'localtime')) = timetable.dow
and strftime('%H',datetime(messages.datestamp,'localtime')) = timetable.hour
group by
1,2
order by
1,2;
Total number of threads
--all time
select count(distinct ifnull(nullif(ref0,''),message_id)) from messages;
--active in 2017
select count(distinct ifnull(nullif(ref0,''),message_id)) from messages where year=2017;
Average response time
--all time
select
avg(julianday(r.datestamp)-julianday(o.datestamp))*86400 as seconds
from
messages o,
messages r
where
r.in_reply_to != ''
and o.message_id != ''
and r.in_reply_to = o.message_id;
--responses sent in 2017
select
avg(julianday(r.datestamp)-julianday(o.datestamp))*86400 as seconds
from
messages o,
messages r
where
r.in_reply_to != ''
and o.message_id != ''
and r.in_reply_to = o.message_id
and r.year=2017;
Top 10 posts with most references (deepest thread depth)
select year,month,day,sender,subject,ref_length from messages order by ref_length desc limit 10;
Top 10 posts with the most children (largest thread)
--all time
select
o.year,
o.month,
o.day,
o.subject,
count(*)
from
messages o,
messages r
where
r.ref_length>1
and r.ref0=o.message_id
group by
r.ref0
order by
5 desc
limit
10;
--2017
select
o.year,
o.month,
o.day,
o.subject,
count(*)
from
messages o,
messages r
where
r.ref_length>1
and r.ref0=o.message_id
and o.year=2017
group by
r.ref0
order by
5 desc
limit
10;
Top 10 longest-running (time) threads
--all time
select
max(julianday(t2.datestamp)-julianday(t1.datestamp)) as days,
t1.subject,
t1.datestamp,
t2.datestamp
from
messages t1,
messages t2
where
t2.ref0 != ''
and t1.message_id != ''
and ifnull(nullif(t1.ref0,''),t1.message_id) = t2.ref0
group by
t2.ref0
order by
days desc
limit
10;
--threads ending in 2017
select
max(julianday(t2.datestamp)-julianday(t1.datestamp)) as days,
t1.subject,
t1.datestamp,
t2.datestamp
from
messages t1,
messages t2
where
t2.ref0 != ''
and t1.message_id != ''
and ifnull(nullif(t1.ref0,''),t1.message_id) = t2.ref0
and t2.year=2017
group by
t2.ref0
order by
days desc
limit
10;
Top 10 posts with the most direct replies (all time)
select
o.year,
o.month,
o.day,
o.subject,
r.in_reply_to,
count(*)
from
messages r,
messages o
where
o.message_id=r.in_reply_to
group by
r.in_reply_to
order by
6 desc
limit
10;
Number of initial messages with no response (threads of 1 message) (excluding meeting reminders)
select
count(*)
from
messages
where
message_id != ''
and ref0 = ''
and message_id not in (select ref0 from messages)
and lower(subject) not like '%cc call%'
and lower(subject) not like '%champions call%'
and lower(subject) not like '%champion call%'
and lower(subject) not like '%monthly call%'
and lower(subject) not like '%monthly meeting%'
and lower(subject) not like '%champions meeting%'
and lower(subject) not like '%champion meeting%'
and lower(subject) not like '%community chat%'
and lower(subject) not like '%conference call%';
Number of people who first emailed the list per year
select
year,
count(*)
from
(select
min(year) as year,
from_address
from
messages
group by
from_address
order by
year asc)
group by
year
order by
year asc;
Number of people who posted to the list each year
select year,count(distinct from_address) from messages group by year;
Number of people who posted to the list each month
select year,month,count(distinct from_address) from messages group by year,month;
Top 10 longest "serving" list members (days between oldest and newest list posts)
select
maxday-minday as timespan,
from_address
from
(select
max(julianday(datestamp)) as maxday,
min(julianday(datestamp)) as minday,
from_address
from
messages
group by
from_address)
order by
timespan desc
limit 10;
Top 10 highest posters
--this year
select from_address,count(*) from messages where year=2017 group by from_address order by 2 desc limit 10;
--all time
select from_address,count(*) from messages group by from_address order by 2 desc limit 10;
Top 10 thread initiators (excluding meeting reminders)
--all time
select
count(*),
from_address
from
messages
where
reference_ids=''
and lower(subject) not like '%cc call%'
and lower(subject) not like '%champions call%'
and lower(subject) not like '%champion call%'
and lower(subject) not like '%monthly call%'
and lower(subject) not like '%monthly meeting%'
and lower(subject) not like '%champions meeting%'
and lower(subject) not like '%champion meeting%'
and lower(subject) not like '%community chat%'
and lower(subject) not like '%conference call%'
group by
from_address
order by
1 desc
limit
10;
--2017
select
count(*),
from_address
from
messages
where
reference_ids=''
and year=2017
and lower(subject) not like '%cc call%'
and lower(subject) not like '%champions call%'
and lower(subject) not like '%champion call%'
and lower(subject) not like '%monthly call%'
and lower(subject) not like '%monthly meeting%'
and lower(subject) not like '%champions meeting%'
and lower(subject) not like '%champion meeting%'
and lower(subject) not like '%community chat%'
and lower(subject) not like '%conference call%'
group by
from_address
order by
1 desc
limit
10;
Top 10 responders
--all time
select
count(*),
from_address
from
messages
where
ref0 != ''
group by
from_address
order by
1 desc
limit
10;
--2017
select
count(*),
from_address
from
messages
where
ref0 != ''
and year=2017
group by
from_address
order by
1 desc
limit
10;
Top 10 first responders
--all time
select
count(*),
from_address
from
messages
where
ref0 != ''
and ref0 = reference_ids
group by
from_address
order by
1 desc
limit
10;
--2017
select
count(*),
from_address
from
messages
where
ref0 != ''
and ref0 = reference_ids
and year=2017
group by
from_address
order by
1 desc
limit
10;
Top 10 fastest responses
--2017
select distinct
r.datestamp,
(julianday(r.datestamp)-julianday(o.datestamp))*86400 as seconds,
r.from_address,
o.subject
from
messages o,
messages r
where
r.in_reply_to != ''
and o.message_id != ''
and r.in_reply_to = o.message_id
and r.year=2017
order by
seconds asc
limit
10;
--all time
select distinct
r.datestamp,
(julianday(r.datestamp)-julianday(o.datestamp))*86400 as seconds,
r.from_address,
o.subject
from
messages o,
messages r
where
r.in_reply_to != ''
and o.message_id != ''
and r.in_reply_to = o.message_id
order by
seconds asc
limit
10;
--among senders with 20 or more responses
select
avg(seconds) as average_seconds,
from_address,
count(*)
from
(select distinct
(julianday(r.datestamp)-julianday(o.datestamp))*86400 as seconds,
r.from_address
from
messages o,
messages r
where
r.in_reply_to != ''
and o.message_id != ''
and r.in_reply_to = o.message_id
order by
seconds asc)
group by
from_address
having
count(*) > 20
order by
average_seconds asc
limit
10;
Number of initial messages in 2017 with no response (threads of 1 message) (excluding meeting reminders)
select
count(*)
from
messages
where
message_id != ''
and ref0 = ''
and message_id not in (select ref0 from messages)
and lower(subject) not like '%cc call%'
and lower(subject) not like '%champions call%'
and lower(subject) not like '%champion call%'
and lower(subject) not like '%monthly call%'
and lower(subject) not like '%monthly meeting%'
and lower(subject) not like '%champions meeting%'
and lower(subject) not like '%champion meeting%'
and lower(subject) not like '%community chat%'
and lower(subject) not like '%conference call%'
and year=2017;
Message subjects (excluding meeting reminders) -- make word clouds with this
--2017
select
replace(replace(replace(replace(lower(subject),'[campuschampions] ',''),'re: ',''),'fw: ',''),'fwd: ','')
from
messages
where
lower(subject) not like '%cc call%'
and lower(subject) not like '%champions call%'
and lower(subject) not like '%champion call%'
and lower(subject) not like '%monthly call%'
and lower(subject) not like '%monthly meeting%'
and lower(subject) not like '%champions meeting%'
and lower(subject) not like '%champion meeting%'
and lower(subject) not like '%community chat%'
and lower(subject) not like '%conference call%'
and year=2017;
--all time
select
replace(replace(replace(replace(lower(subject),'[campuschampions] ',''),'re: ',''),'fw: ',''),'fwd: ','')
from
messages
where
lower(subject) not like '%cc call%'
and lower(subject) not like '%champions call%'
and lower(subject) not like '%champion call%'
and lower(subject) not like '%monthly call%'
and lower(subject) not like '%monthly meeting%'
and lower(subject) not like '%champions meeting%'
and lower(subject) not like '%champion meeting%'
and lower(subject) not like '%community chat%'
and lower(subject) not like '%conference call%';