1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
<?php
/*
Question2Answer by Gideon Greenspan and contributors
http://www.question2answer.org/
Description: Database-level access to userfavorites table
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
More about this license: http://www.question2answer.org/license.php
*/
if (!defined('QA_VERSION')) { // don't allow this page to be requested directly from browser
header('Location: ../../');
exit;
}
/*
Why do we have two types of event streams, shared (in qa_sharedevents) and user-specific (in qa_userevents)?
An event stream is defined as the set of events which are thrown off ("published") by a particular entity. For
example, it could include the activity on a particular question, or the activity by a particular user.
We have an arbitrary many-to-many mapping between event streams and users subscribed to those streams. Over time, a
particularly popular event stream could accumulate thousands of subscribers. Similarly, over time, a particularly
hyperactive user could end up subscribing to thousands of event streams.
If we stored a single copy of each event stream in the database, publishing an event would be very fast. However
retrieving a hyperactive user's update page would be extremely slow, because it would require retrieving all the
streams they are subscribed to, and finding the globally most recent (e.g.) 50 events across all those streams.
So instead we could store a list of news updates for each user. In this case, retrieving a user's update page would
be very fast. However, recording an event for a popular stream could become extremely slow, since it would have to
be copied for every user subscribed to the stream.
The standard solution to these "publish and subscribe" situations is a message-passing architecture. That's what
Twitter et al use. However that's not a viable option here, because it requires a process to be running in the
background to manage the queuing and transport of these messages from publishers (event streams) to subscribers
(users' lists of news updates). While we could have a cron-style process to manage this, I'm avoiding it for as long
as possible since it complicates setup. It also means there can be delays in updating users' news feeds.
So instead we adopt a hybrid approach. For each event created in an entity's stream, we record a single copy of that
event in the entity's stream in the qa_sharedevents table. In addition, by default, we place a copy of that event into
the list of news updates for each user subscribed to the stream, via the qa_userevents table.
However, if there are more than a certain number of subscribers to the stream, we skip this second step, i.e. we
only record one copy in the qa_sharedevents table. This limits the cost of publishing an event.
When we generate a user's list of recent updates, we of course retrieve the list of news updates for that user from
qa_userevents. However we also check to see whether that user is subscribed to any event streams for which updates
are no longer posted into the user's own list, because the stream has too many subscribers. For each of these
popular streams, we also retrieve the stream's events from qa_sharedevents. Since users are only likely to be
subscribed to a small number of popular streams, this limits the cost of retrieving the news updates.
(Having a shared event stream helps us another way. When a user subscribes to a stream, they can immediately have
recent events from that stream copied into their list of news updates.)
Note that this approach isn't aimed at reducing the total cost of keeping all users up-to-date on all events, but
rather ensuring that no individual operation (posting an event or retrieving a user's list of updates) takes too
long, since that would turn into a very slow response time for the corresponding HTTP request.
What should we use for the threshold T, so that if a stream has more than T subscribers, its events are only
recorded in the shared stream? One approach is as follows:
[this ignores stream length and truncation, which are constant factors]
T = our threshold
M = the maximum number of streams subscribed to by any user
P(x) = the probability that a particular stream has more than x subscribers
C1 = maximum cost of adding an event = maximum number of streams to which event must be added = O(T)
C2 = maximum cost of retrieving news updates = maximum number of shared streams to be combined = O(M * P(T))
[we assume that the chance a particular user is subscribed to a particular stream is independent of the user]
Now if we assume the power law, aka 80/20 rule, we can estimate that P(T) is proportional to 1/T, so that:
C2 = O(M / T)
To minimize the maximum of these two complexity maxima, we want to equate them, so that:
T = M/T => T=sqrt(M)
So we could keep track of the maximum number of event streams any user is subscribed to, and use its square root.
Instead of that, we adopt an on-the-fly approach. We start by setting T=10 (see 'max_copy_user_updates' in
/qa-include/app/options.php) since it's no big deal to write 10 rows to a table. Recall that whenever an event stream
gets more than T subscribers, we switch those subscribers over to the shared stream. At that point, we check the
maximum number of (total) shared streams that any of those users are subscribed to. If this is above T, that means that
our maximum cost of retrieving a list of news updates is starting to go past our maximum cost of recording an event. So
we rebalance things out by increasing T as appropriate, for use in future cases.
Note that once an event stream has made this switch, to be accessed only via its shared stream, we don't go back.
*/
/**
* Add the entity $entitytype with $entityid to the favorites list of $userid. Handles switching streams across from
* per-user to per-entity based on how many other users have favorited the entity (see long explanation above). If
* appropriate, it also adds recent events from that entity to the user's event stream.
* @param $userid
* @param $entitytype
* @param $entityid
*/
function qa_db_favorite_create($userid, $entitytype, $entityid)
{
$threshold = qa_opt('max_copy_user_updates'); // if this many users subscribe to it, create a shared stream
// Add in the favorite for this user, unshared events at first (will be switched later if appropriate)
qa_db_query_sub(
'INSERT IGNORE INTO ^userfavorites (userid, entitytype, entityid, nouserevents) VALUES ($, $, #, 0)',
$userid, $entitytype, $entityid
);
// See whether this entity already has another favoriter who uses its shared event stream
$useshared = qa_db_read_one_value(qa_db_query_sub(
'SELECT COUNT(*) FROM ^userfavorites WHERE entitytype=$ AND entityid=# AND nouserevents>0 LIMIT 1',
$entitytype, $entityid
));
// If not, check whether it's time to switch it over to a shared stream
if (!$useshared) {
$favoriters = qa_db_read_one_value(qa_db_query_sub(
'SELECT COUNT(*) FROM ^userfavorites WHERE entitytype=$ AND entityid=# LIMIT #',
$entitytype, $entityid, $threshold
));
$useshared = ($favoriters >= $threshold);
}
// If we're going to use the shared stream...
if ($useshared) {
// ... for all the people for whom we're switching this to a shared stream, find the highest number of other shared streams they have
$maxshared = qa_db_read_one_value(qa_db_query_sub(
'SELECT MAX(c) FROM (SELECT COUNT(*) AS c FROM ^userfavorites AS shared JOIN ^userfavorites AS unshared ' .
'WHERE shared.userid=unshared.userid AND shared.nouserevents>0 AND unshared.entitytype=$ AND unshared.entityid=# AND unshared.nouserevents=0 GROUP BY shared.userid) y',
$entitytype, $entityid
));
// ... if this number is greater than our current 'max_copy_user_updates' threshold, increase that threshold (see long comment above)
if (($maxshared + 1) > $threshold)
qa_opt('max_copy_user_updates', $maxshared + 1);
// ... now switch all unshared favoriters (including this new one) over to be shared
qa_db_query_sub(
'UPDATE ^userfavorites SET nouserevents=1 WHERE entitytype=$ AND entityid=# AND nouserevents=0',
$entitytype, $entityid
);
} else {
// Otherwise if we're going to record this in user-specific streams ...
require_once QA_INCLUDE_DIR . 'db/events.php';
// ... copy across recent events from the shared stream
qa_db_query_sub(
'INSERT INTO ^userevents (userid, entitytype, entityid, questionid, lastpostid, updatetype, lastuserid, updated) ' .
'SELECT #, entitytype, entityid, questionid, lastpostid, updatetype, lastuserid, updated FROM ' .
'^sharedevents WHERE entitytype=$ AND entityid=#',
$userid, $entitytype, $entityid
);
// ... and truncate the user's stream as appropriate
qa_db_user_events_truncate($userid);
}
}
/**
* Delete the entity $entitytype with $entityid from the favorites list of $userid, removing any corresponding events
* from the user's stream.
* @param $userid
* @param $entitytype
* @param $entityid
*/
function qa_db_favorite_delete($userid, $entitytype, $entityid)
{
qa_db_query_sub(
'DELETE FROM ^userfavorites WHERE userid=$ AND entitytype=$ AND entityid=#',
$userid, $entitytype, $entityid
);
qa_db_query_sub(
'DELETE FROM ^userevents WHERE userid=$ AND entitytype=$ AND entityid=#',
$userid, $entitytype, $entityid
);
}