ATG Repository Performance Considerations

for ATG

Document Information

Title
Repository Performance Considerations
Subject
for ATG
Author(s)
Güray Sen
Comments
 

Document Version

v233@08/12/2005

Copyright

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.

No Warranty

This document is provided 'as is' without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability, fitness for a particular purpose, or non-infringement.

The contents of this publication could contain technical inaccuracies or typographical errors. Changes are periodically added to the information herein; these changes will be incorporated in the new editions of the publication. Spindrift may make improvements and/or changes in the publication and/or product(s) described in the publication at any time without notice.

Limitation of Liability

In no event will Spindrift be liable for direct, indirect, special, incidental, economic, cover, or consequential damages arising out of the use of or inability to use this documentation even if advised of the possibility of such damages.

Contact

Web
www.spindriftgroup.com

Contents

1 Introduction
2 Background Information
    2.1 Item Caches and Query Caches
2.2 Property Fetching
3 Examples
    3.1 Example A
    3.2 Example B
    3.3 Example C
4 Extending the Standard User Profile Repository Definition
    4.1 Adding Properties to a Database Table
5 General Recommendations
    5.1 Use "distributed" mode only when necessary
    5.2 "Locked" mode does not support query cache
    5.3 Use repository items as properties wisely
    5.4 Query caching is off by default
    5.5 Try to control property fetching
6 Conclusion
7 Must Read


1 Introduction

This document includes a number of simple examples of SQL repository definition files and the corresponding SQL statements to create the tables described by the definition files. These examples demonstrate a variety of data relationship mappings:

  • Example A maps a repository item to a single table row. It includes just a primary table, with no joins with other tables.

  • Example B maps a repository item to a primary table and an auxiliary table (a one-to-one relationship). Each user has a job title and function.

  • Example C maps a repository item to a primary table and a multi-value table using an array property. This demonstrates a one-to-many relationship. The multi table, named subjects_tbl, contains a list of a user's favorite subjects (simple strings). When using an array or list type property, the multi table requires a multi-column-name attribute (in this example, seq_num) to ensure that the ordering of the multi-values are maintained.

For each example, I will try to outline the impact of the choice of data relationship mapping on performance and the role of caching.


2 Background Information

2.1 Item Caches and Query Caches

Each item descriptor in a SQL repository has its own separate item cache and query cache. This lets you size and flush caches separately for separate item types. The item cache holds property values for repository items. It is indexed by the repository item IDs. The query cache holds the repository IDs of items that match particular queries in the cache.

For example, suppose you have a query like this:

color = red AND size = XXL

If query caching is enabled, the first time this query is issued, the result set is retrieved from the database and stored in the query cache. Then, the next time this same query is issued, the SQL repository can retrieve the result set from the cache, rather than needing to access the database.

The item caches hold the values of repository items. Repository queries are performed in two passes, using two separate SELECT statements. The first statement gathers the IDs of the repository items that match that query. The SQL repository then examines the result set from the first SELECT statement and finds any items that already exist in the item cache. A second SELECT statement retrieves from the database any items that are not in the item cache.

Query caching is turned off by default. If items in your repository are updated frequently, or if repeated queries are not common, you may not receive enough benefit from a query cache to justify the minor amount of overhead in maintaining the cache.

2.2 Property Fetching

Normally, when a repository item is loaded from the database, properties in each table are loaded at the same time. By default, all of the primary table properties of a repository item are loaded when getItem is called for the first time on the item.

You may need to modify this default property fetching behavior. For some applications, the database activity required in loading all of the primary table properties may result in performance suffering unnecessarily. For example, an application may want the SQL repository to load a large GIF property only if it is specifically asked for. This is referred to as lazy evaluation. On the other end of the spectrum, you may need to load properties from different tables immediately. For example, an application may want to always load a user's last name whenever a profile is read from the database. Finally, some applications want to group properties so that when one value is requested, all values in this group are loaded. An example of caching groups would be loading a zip code and state code whenever a street address is loaded.


3 Examples

3.1 Example A

This example maps a repository item to a single table row. It includes just a primary table, with no joins with other tables.

<gsa-template>
<header>
<name>Example A</name>
<author></author>
<description>
This template maps a repository item to a single table row. Just a primary table... no joins with other tables.
</description>
</header>
<item-descriptor name="user" default="true">
<table name="usr_tbl" type="primary" id-column-names="id">
<property name="id" data-type="string"/>
<property name="name" column-names="nam_col" data-type="string"/>
<property name="age" column-names="age_col" data-type="int"/>
</table>
</item-descriptor>
</gsa-template>

SQL Statements for Example A

CREATE TABLE usr_tbl (
id VARCHAR(32) not null,
nam_col VARCHAR(32) null,
age_col INTEGER null,
primary key(id)
);

The first time a query is executed, the repository queries for the repository ID's of the items that will be returned. For example, returning all user items from example A:

1 SELECT t1.id FROM usr_tbl t1 WHERE 1=1

If query caching is enabled, the result of this query (a list of repository ID's) is stored in the query cache.

Then, for each item returned the appropriate property values from the primary table are fetched.

2 SELECT t1.id,t1.age_col,t1.nam_col FROM usr_tbl t1 WHERE t1.id IN (?,?,?,?,?)

If item cache is enabled, the result of this query is stored in the item cache.

What happens when we execute the same query again? Well, it depends if item cache and/or query cache are enabled or not. If query cache is enabled, the first query will not be executed again. If it is disabled - the default - the same query will retrieve the repository ID's from the database again. If item cache is enabled - the default - the second query will only be executed for those items returned by the first query that are not in the item cache. If item cache is disabled, the second query will be executed again for each item returned by the first query.

3.2 Example B

This example maps a repository item to a primary table and an auxiliary table (a one-to-one relationship).

<gsa-template>
<header>
<name>Example B</name>
<author></author>
<description>
This template maps a repository item to a primary table and an auxiliary table (a one-to-one relationship). Each user has a job title and function.
</description>
</header>
<item-descriptor name="user" default="true">
<table name="usr_tbl" type="primary" id-column-names="id">
<property name="id" data-type="string"/>
<property name="name" column-names="nam_col" data-type="string"/>
<property name="age" column-names="age_col" data-type="int"/>
</table>
<table name="job_tbl" type="auxiliary" id-column-names="id">
<property name="function"/>
<property name="title"/>
</table>
</item-descriptor>
</gsa-template>

SQL Statements for Example B

CREATE TABLE usr_tbl (
id VARCHAR(32) not null,
nam_col VARCHAR(32) null,
age_col INTEGER null,
primary key(id)
);

CREATE TABLE job_tbl (
id VARCHAR(32) not null references usr_tbl(id),
function VARCHAR(32) null,
title VARCHAR(32) null,
primary key(id)

The first time a query is executed, the repository queries for the repository ID's of the items that will be returned. For example, returning all user items from example B:

1 SELECT t1.id FROM usr_tbl t1 WHERE 1=1

If query caching is enabled, the result of this query (a list of repository IDss) is stored in the query cache.

Secondly, for each item returned the appropriate property values from the primary table are fetched.

2 SELECT t1.id,t1.age_col,t1.nam_col FROM usr_tbl t1 WHERE t1.id IN (?,?,?,?,?)

If item cache is enabled, the result of this query is stored in the item cache.

Finally, for each item returned the appropriate property values from the auxiliary table are fetched. By default this query will not be executed as only properties from the primary table are fetched when the item is requested. As soon as one of the properties from any auxiliary table is requested, all properties stored in that table are fetched.

3 SELECT function,title FROM job_tbl WHERE id=? (*N)

If item cache is enabled, the result of this query is stored in the item cache.

What happens when we execute the same query again? Well, it depends if item cache and/or query cache are enabled or not. If query cache is enabled, the first query will not be executed again. If it is disabled - the default - the same query will retrieve the repository ID's from the database again. If item cache is enabled - the default - the second and third queries will only be executed for those items returned by the first query that are not in the item cache. If item cache is disabled, the second and third queries will be executed again for each item returned by the first query.

Note that because we are using an auxiliary table to store additional properties, to fill the item caches - or return data from the database when item caching is disabled - the repository needs N more database queries than Example A (where N is the number of items returned).

3.3 Example C

This example maps a repository item to a primary table and a multi-value table using an array property. This demonstrates a one-to-many relationship.

<gsa-template>
<header>
<name>Example C</name>
<author></author>
<description>
This template maps a repository item to a primary table and a multi-value table using an array property. A one-to-many relationship. The "multi" table contains a list of a user's favorite subjects (simple strings). When using an "array" property, the "multi" table requires a "multi-column-name" (e.g., seq_num) to ensure that the ordering of the multi-values are maintained.
</description>
</header>
<item-descriptor name="user" default="true">
<table name="usr_tbl" type="primary" id-column-names="id">
<property name="id" data-type="string"/>
<property name="name" column-names="nam_col" data-type="string"/>
<property name="age" column-names="age_col" data-type="int"/>
</table>
<table name="subjects_tbl" type="multi" id-column-names="id" multi-column-name="seq_num">
<property name="favoriteSubjects" column-names="subject" data-type="array" component-data-type="string"/>
</table>
</item-descriptor>
</gsa-template>

SQL Statements for Example C

CREATE TABLE usr_tbl (
id VARCHAR(32) not null,
nam_col VARCHAR(32) null,
age_col INTEGER null,
primary key(id)
);

CREATE TABLE subjects_tbl (
id VARCHAR(32) not null references usr_tbl(id),
seq_num INTEGER not null,
subject VARCHAR(32) null,
primary key(id, seq_num)
);

The first time a query is executed, the repository queries for the repository ID's of the items that will be returned. For example, returning all user items from example C:

1 SELECT t1.id FROM usr_tbl t1 WHERE 1=1

If query caching is enabled, the results of this query (a list of repository ID's) is stored in the query cache.

Secondly, for each item returned the appropriate property values from the primary table are fetched.

2 SELECT t1.id,t1.age_col,t1.nam_col FROM usr_tbl t1 WHERE t1.id IN (?,?,?,?,?)

If item cache is enabled, the result of this query is stored in the item cache.

Finally, for each item returned the appropriate property values from the multi table are fetched. By default this query will not be executed as only properties from the primary table are fetched when the item is requested.

3 SELECT seq_num,subject FROM subjects_tbl WHERE id=? ORDER BY 1 ASC (*N)

If item cache is enabled, the result of this query is stored in the item cache.

What happens when we execute the same query again? Well, it depends if item cache and/or query cache are enabled or not. If query cache is enabled, the first query will not be executed again. If it is disabled - the default - the same query will retrieve the repository ID's from the database again. If item cache is enabled - the default - the second and third queries will only be executed for those items returned by the first query that are not in the item cache. If item cache is disabled, the second and third queries will be executed again for each item returned by the first query.

Note that because we are using a multi table to store an array property, to fill the item caches - or return data from the database when item caching is disabled - the repository needs N more database queries than Example A (where N is the number of items returned).


4 Extending the Standard User Profile Repository Definition

4.1 Adding Properties to a Database Table

If you want to add properties to the profile definition file, you need to add a column in the appropriate database table for the item you want to add, and you also need to add a reference to that column in the profile repository definition file. For example, if you wanted to add the property "region" to the dps_contact_info table, add a column called region to the table, then add the following lines to the Personalization module's userProfile.xml file:

<table name="dps_contact_info" type="primary">
<property name="region" data-type="string" column-name="region" required="false"/>

Even if the most simple way to add a property to the profile is by modifying an existing ATG defined table, this is not recommended. In future version, there may be changes to ATG's out-of-the-box implementation that may conflict with your customizations. Usually, you would create an additional table and define your property there:

<table name="my_contact_info" type="auxiliary" id-column-names="id">
<property name="region" data-type="string" column-name="region"

According to our findings earlier, this may impact the number of database queries that are used to retrieve all needed data. You can use property cache groups to control data loading:

<table name="my_contact_info" type="auxiliary" id-column-names="id">
<property name="region" data-type="string" column-name="region" required="false" group="my_group"/>


5 General Recommendations

5.1 Use "distributed" mode only when necessary

The "distributed" cache mode is useful for synchronizing cached data that is accessed frequently by many servers at the same time but modified relatively infrequently. Beware that in a distributed caching model, each server must be connected to every other server in the cluster for sending cache events. This results in N*(N-1) connections, and for each update event, requires sending N-1 events (where N is the number of servers in the cluster). For data that is modified frequently in large clusters, the number of events that need to be sent to invalidate data can be quite large.

5.2 "Locked" mode does not support query cache

If you use locked cache mode, you should also disable the query cache, since query cache invalidation messages are not distributed between server instances. If query caching is important for the performance of your application, use distributed cache mode.

5.3 Use repository items as properties wisely

The value of a property of a repository item can be another repository item. Both multi-valued properties and single-valued properties can have repository items as property values. This is a powerful feature, and allows you much greater flexibility in defining a database schema that your application will access as a repository. However, in some cases it is sufficient to have an id reference to another item instead. This will prevent those items to be loaded into cache.

Alternatively, you can use the <attribute> tag to set the attribute cacheReferencesById=true. For example:

<property name="childProducts" ...>
...
<attribute name="cacheReferencesById" value="true"/>
</property>

The SQL Repository then will not store the repository items referred to, but will instead store only the repository ids of the items referred to in the item cache.

5.4 Query caching is off by default

Query caching is turned off by default. If items in your repository are updated frequently, or if repeated queries are not common, you may not receive enough benefit from a query cache to justify the minor amount of overhead in maintaining the cache.

5.5 Try to control property fetching

You can achieve a finer level of control over property loading using cache groups in your repository definition. By default, the cache group of a property is the same name as the table that the property is defined in. You can set the cache group of a property with the group attribute in the property's definition in the repository definition file. All properties with the same group attribute are fetched whenever any member of the group is fetched. Only those properties that are in the same cache group as the repository ID (or, if there is no ID property, then all the properties in the primary table) are loaded when getItem is called for the first time on an item.

In other words, create groups of properties that are accessed at the same time by either creating cache groups or choosing an appropriate database layout. In addition, if you want to assure that only the repository ID is returned and none of the repository item's other properties, you can isolate the repository ID in its own cache group.


6 Conclusion

Monitoring cache statistics during testing and after deployment can help you improve performance by setting cache sizes to appropriate levels. If you have a high quantity of misses and no hits, you are gaining no benefit from caching, and you can probably just turn it off, by setting the cache size to 0. If you have a mix of hits and misses, you might want to increase the cache size. If you have all hits and no misses, your cache size is big enough and perhaps too big. There is no harm in setting a cache to be too big unless it will fill up eventually and consume more memory than is necessary.


7 Must Read

  • Understanding ATG Data Anywhere Architecture
    ATG
    http://www.atg.com
  • ATG Dynamo 5.6.1 Application Server Programming Guide, Part II: Repositories
    ATG