g*********n 发帖数: 43 | 1 Two tables, the first one "Users" saves the user name and some properties of
the user, age, gender, . . and the last column is "hobbies", to save the
hobbies for this particular user. Since hobbies can be quite diverse, in
this table we put an id of the hobbies.
name age gender hobby_id
A 20 M 1
B 45 F 1
C 52 M 2
. . .
And we define another table to map this id into real hobbies:
hobby_id hobby
1 swimming
1 pingpong
1 soccer
. . . other hobbies associated with id "1"
2 badminton
2 travel
2 pingpong (a hobby can be associated with multiple id)
. . . other hobbies associated with id "2"
3 TV
3 movie
. . .
Requirement:
1. for all combinations of (age, gender), return how many users have the
corresponding age and gender.
This can be achieved by
"select age, gender, count(*)
from Users
group by 1, 2;"
2. for all combinations of (age, hobby), return how many users have the
corresponding age and hobby.
Theoretically, this can be achieved by
select age, hobbies.hobby, count(*)
from Users join hobbies on Users.hobby_id = hobbies.hobby_id
group by 1,2;
But the problem is, the join is just too expensive:
(1) the Users table has 200k rows
(2) each hobby_id may map to 200 hobbies
thus the join will result in 200k * 200 = 40 million rows and causes
connection timeout.
It looks like we cannot use one single query to return all the results, and
have to break down the query into smaller ones? | d***e 发帖数: 793 | 2 Can you do this?
select a.age, hobbies.hobby, a.ct
(
select age,hobby_id, count(*) ct
from Users
group by 1,2;
)a
join hobbies
on a.hobby_id = hobbies.hobby_id | g*********n 发帖数: 43 | 3 Good point. The subquery
select age,hobby_id, count(*) ct
from Users
group by 1,2;
can compress table Users into a, then we join "a" and "hobbies".
This indeed helps. But the thing is, even after the compression, table a
still has 25k rows:), and the join is still 25k * 200 = 5 million rows.
【在 d***e 的大作中提到】 : Can you do this? : select a.age, hobbies.hobby, a.ct : ( : select age,hobby_id, count(*) ct : from Users : group by 1,2; : )a : join hobbies : on a.hobby_id = hobbies.hobby_id
| i*****w 发帖数: 75 | 4 I am a bit confused by the question:
If multiple hobbies can share the same hobby_id, then, for name A, it's
hobby_id =1 in the user table, does that mean person A has all the hobbies
defined in the hobby table with hobby_id =1? If so, then person A and B have
exactly the same hobbies?
Based on your design, then when you join your user table and the hobby table
by hobby_id, if one hobby_id has n hobbies, then for a given user, you will
have n rows of data, which caused the CROSS JOIN effect.
Since you mentioned that one hobby can belong to multiple hobby_id, so it
makes me feel like that you actually wanted to find Hobby Groups. For
example, Person A and B share the same hobbies of swimming, pingpong and
soccer, so you may try to store the data like below:
User table:
User Age Gender HobbyGroupID
A 20 M 100
B 45 F 100
C 52 M 101
Then have a HobbyGroup table:
HobbyGroupID HobbyGroupTypeDes
100 Hobby Group 100
101 Hobby Group 101
and a Hobby table:
HobbyID HobbyDescription
1 Swimming
2 Pingpong
3 Soccer
4 Travel
Then you can have a bridge table:
HobbyGroupID HobbyID
100 1
100 2
100 3
101 1
101 3
Note that HobbyID 1 and 3 are in multiple HobbyGroups in the previous bridge
table. Doing this way, you can group by in the user table, then make the
calculations in the bridge table, and finally put these two together.
Just my 2 cents.
of
【在 g*********n 的大作中提到】 : Two tables, the first one "Users" saves the user name and some properties of : the user, age, gender, . . and the last column is "hobbies", to save the : hobbies for this particular user. Since hobbies can be quite diverse, in : this table we put an id of the hobbies. : name age gender hobby_id : A 20 M 1 : B 45 F 1 : C 52 M 2 : . . . : And we define another table to map this id into real hobbies:
| p********n 发帖数: 11 | 5 HobbyGroup is kind of no much use. Why not simply create a user table and a
hobby table, and create a (mapping) table to link a user with hobbies?
User
Hobby
UserHobby | P********8 发帖数: 12 | 6 discretizing age might help, such as 21-25,26-30.....
anyway it still depends on what's the business requirements.
【在 g*********n 的大作中提到】 : Good point. The subquery : select age,hobby_id, count(*) ct : from Users : group by 1,2; : can compress table Users into a, then we join "a" and "hobbies". : This indeed helps. But the thing is, even after the compression, table a : still has 25k rows:), and the join is still 25k * 200 = 5 million rows.
|
|