Commit 8886095 by Pedro Geadas: "Added downsides of Keyset pagination" (1 parent: 1b9991c)
1 file changed: IMPLEMENTATION_DETAILS.md (82 additions, 21 deletions)
# Implementation details

The solution presented makes use of two different techniques for query optimization, which I will be covering below.

## Keyset pagination

Since we have a unique identifier for each row of the database (the id), we can improve our query time (especially for
large datasets) by filtering our results directly by it. This operation is usually faster than relying on OFFSET,
since the latter needs to skip N rows until it reaches the desired one, which still involves some processing.
This can be a much bigger issue for very large datasets, so it was avoided.

Additionally, since the id is the primary key, which is indexed by default, using keyset pagination will be very
efficient (it works better on indexed columns). Another advantage of using the id (which is auto-incremented) is
that we will not suffer from the potential race conditions we could hit with OFFSET (since a row could be
inserted between the time we issue the query and the time the database seeks the desired position).
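To make the comparison concrete, here is a minimal sketch of the two query shapes. Python's sqlite3 stands in for the actual Groovy Sql code, and the `events` table and its columns are hypothetical, since the real schema is not shown here:

```python
import sqlite3

# In-memory database with a hypothetical `events` table (illustrative schema only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")
conn.executemany("INSERT INTO events (name) VALUES (?)",
                 [(f"event-{i}",) for i in range(1, 101)])

page_size = 10

# Offset pagination: the engine must produce and discard the first 50 rows.
offset_page = conn.execute(
    "SELECT id, name FROM events ORDER BY id LIMIT ? OFFSET ?", (page_size, 50)
).fetchall()

# Keyset pagination: seek directly past the last id seen on the previous page,
# letting the primary-key index do the work.
last_seen_id = 50
keyset_page = conn.execute(
    "SELECT id, name FROM events WHERE id > ? ORDER BY id LIMIT ?",
    (last_seen_id, page_size)
).fetchall()

# Both return the same page (ids 51..60); only the access pattern differs.
```

The client keeps `last_seen_id` from the previous page instead of a page number, which is what makes the seek possible.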

### Possible downsides

In the end, the decision was taken after weighing all the advantages and disadvantages of both approaches considered,
keyset and offset pagination (taking into account the size of the dataset and the probability of each issue actually
occurring).
Even though I decided to move forward with keyset pagination, there are some points where it might struggle:

1. **Table schema changes**: If we need to change the schema of the database and decide to remove the "id", change its
   type, or use another field as primary key, we will need to adapt our solution, which involves changing the code
   (offset pagination does not suffer from this, as it does not care about columns) and possibly adding indexes to the
   columns used as keys, in order to keep our solution performant. Depending on the new "id" type, it might not even be
   possible to use keyset pagination, which works on the premise of having one or more columns that provide a unique
   and ordered identifier;

2. **Missing records**: We can still see some pages with fewer items than intended if a delete happens in the middle of
   a query. If we are unlucky, the deleted row is one that was already added to the ResultSet while the query was still
   fetching the remaining rows, so either we are returning a value that is no longer in the database, or we will have
   one less item in the final result.
   This depends on the strategy used for concurrency and on whether we use locks, but considering the current
   implementation this could be an issue, since we are not using any kind of concurrency control. (Both approaches can
   suffer from this, although for large datasets it is more likely to happen with OFFSET, since we need to skip a
   huge number of rows to reach our target);

3. **Sorting**: We only allow sorting AFTER we fetch the respective number of items. Another option would be to
   sort BEFORE we fetch the LIMIT amount of items, but to do that efficiently we would need indexes on those
   columns (both approaches suffer from the lack of indexes in this case). However, it is worth noting that if
   we had more indexed columns, adapting the solution would be trivial: we could ask the user for an extra parameter
   indicating whether the sort should be done before or after the filtering;

4. **ID size**: For very large datasets we would need to use a BigInt for the id, which consumes more space but
   avoids the overflow that would break our solution (offset pagination does not suffer from this);

5. **Multiple-table queries**: If we had multiple tables and complex join operations, maintaining the same order of
   results could be tricky (nevertheless, both approaches suffer from this).
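The sorting point deserves a concrete illustration: sorting after fetching a keyset page and sorting inside the query can return different rows, because the latter considers the whole table. A sketch, again using sqlite3 and a hypothetical `events` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")
conn.executemany("INSERT INTO events (name) VALUES (?)",
                 [("zulu",), ("alpha",), ("mike",), ("bravo",)])

# Current approach: fetch the page by keyset first, then sort the fetched items in memory.
page = conn.execute(
    "SELECT id, name FROM events WHERE id > ? ORDER BY id LIMIT ?", (0, 3)
).fetchall()
sorted_after = sorted(page, key=lambda row: row[1])

# Alternative: sort in the query itself; only efficient if `name` is indexed.
sorted_before = conn.execute(
    "SELECT id, name FROM events ORDER BY name LIMIT ?", (3,)
).fetchall()

# sorted_after  == [(2, 'alpha'), (3, 'mike'), (1, 'zulu')]   (first 3 ids, then sorted)
# sorted_before == [(2, 'alpha'), (4, 'bravo'), (3, 'mike')]  (top 3 names overall)
```

Neither result is wrong; they answer different questions, which is why exposing the choice as an extra request parameter is a reasonable extension.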

## Prepared statements

This technique involves preparing the statements before we even execute them, sending them to the database to be
compiled beforehand with placeholders for the values, which are only set later, when we receive the requests. This
way, since the database already has the query stored, we avoid resending it every time, which also speeds up the
query time.
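The idea can be sketched as follows, with Python's sqlite3 standing in for the Groovy Sql library (the `?` placeholders play the same role; table and data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

# One statement text with placeholders; only the bound values change per request,
# so the engine can reuse the already-compiled statement instead of re-parsing SQL.
stmt = "SELECT id, name FROM events WHERE id > ? ORDER BY id LIMIT ?"

page1 = conn.execute(stmt, (0, 2)).fetchall()  # bind (0, 2): first two rows
page2 = conn.execute(stmt, (2, 2)).fetchall()  # bind (2, 2): rows after id 2
```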

The only caveat (at least using the default Groovy Sql lib) is that this technique can only be used when only the
values vary. Operations like orderBy or sort direction cannot be set dynamically this way, so for those queries this
improvement could not be used.
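A sketch of why this limitation exists, and a common workaround (validating the column name against a whitelist in code, since an identifier cannot be a placeholder). Shown with sqlite3; the helper and table names are illustrative, not part of the actual codebase:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO events VALUES (1, 'a')")

# Placeholders bind *values*, not identifiers: binding 'name' here sorts every row
# by the same constant string, so no real ordering by the column happens.
rows = conn.execute("SELECT id FROM events ORDER BY ?", ("name",)).fetchall()

# Workaround: validate the requested column against a whitelist, then interpolate it.
ALLOWED_SORT_COLUMNS = {"id", "name"}

def fetch_sorted(column: str):
    if column not in ALLOWED_SORT_COLUMNS:
        raise ValueError(f"unsupported sort column: {column}")
    # Safe to interpolate: `column` is known to be one of our own column names.
    return conn.execute(f"SELECT id FROM events ORDER BY {column}").fetchall()
```

The whitelist keeps the dynamic orderBy safe without giving up prepared statements for the value parameters.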

This technique is also used to sanitize queries and avoid SQL injection, but in this case it was not needed, since we
already do all validations beforehand.
## Tests

There are three types of tests:

### Unit tests
These are applied to the domain objects, and their goal is to test the business logic. They intend to test specific
classes in isolation, making sure that the domain objects can be created correctly and are validated.
### Integration tests

These aim to test the interactions between our controllers and use cases, from the moment we receive a request until
we have to call our infrastructure for results. Usually we can mock the infrastructure part, since infra is totally
independent from both application and domain and is tested separately. Our main focus here is to verify that our
controllers receive and process the requests correctly, forwarding them to the use cases and receiving the
expected responses.
### Infrastructure

These tests focus on the different adapters that implement our domain interfaces (ports). Since we are
testing the contract between adapters and ports, it should not matter what the actual underlying technology is, so
the tests should be exactly the same and won't need to change at all. The only thing needed when adding a new
adapter is to populate it with the same test data.
In our case we have two adapters:
- SqlEventRepository
- InMemoryEventRepository (only used in tests)

## TODOs
There is a list of improvement ideas that are not directly related to the given requirements, but rather some ideas for
future work:

- Logger instead of print
- Async processing of requests (queue)
