Information Technology and Computer Science: Chapter 12. Squid

Squid is a feature-rich and extremely flexible Web-caching proxy daemon. Most configuration is performed by editing a simple configuration file called squid.conf which is usually located in /usr/local/squid/etc/squid.conf or, on systems derived from Red Hat Linux, /etc/squid/squid.conf. Each behavior is set by a directive followed by one or more options. The Webmin interface provides access to most of the directives available for configuring Squid. Because Squid is a quite complex package, the Webmin interface opens with a series of icons to represent the different types of configuration options. Figure 12.1, “Squid Proxy Main Page” shows the Squid main page.

Figure 12.1. Squid Proxy Main Page

These options are largely self-explanatory, though a couple of them are worth discussing. The Cache Manager Statistics icon, when clicked, will open the Squid cachemgr.cgi program to provide direct access to all of Squid's various runtime values and statistics. The program provides real-time information about hit ratios, request rates, storage capacity, number of users, system load, and more. The Calamaris Log Analysis icon is only present if the calamaris access.log analyzer is present on your system. Calamaris is a Perl script that parses your access log files and provides a useful overview of the type of usage your cache is seeing. Note that by default the Calamaris Webmin tool will only parse the last 50,000 lines of your access log. This number can be raised in the Squid module configuration, but doing so is not recommended on heavily loaded caches. Parsing the access logs is a very system-intensive task that could interfere with your system's ability to continue answering requests.

Ports and Networking

The Ports and Networking page provides you with the ability to configure most of the network level options of Squid. Squid has a number of options to define what ports Squid operates on, what IP addresses it uses for client traffic and intercache traffic, and multicast options. Usually, on dedicated caching systems these options will not be useful. But in some cases you may need to adjust these to prevent the Squid daemon from interfering with other services on the system, or on your network.

Proxy port: Sets the network port on which Squid operates. This option is usually 3128 by default, and can almost always be left on this port, except when multiple Squids are running on the same system, which is usually ill-advised. This option corresponds to the http_port option in squid.conf.
ICP port: This is the port on which Squid listens for Internet Cache Protocol, or ICP, messages. ICP is a protocol used by Web caches to communicate and share data. Using ICP it is possible for multiple Web caches to share cached entries so that if any one local cache has an object, the distant origin server will not have to be queried for the object. Further, cache hierarchies can be constructed of multiple caches at multiple privately interconnected sites to provide improved hit rates and higher quality Web response for all sites. More on this in later sections. This option correlates to the icp_port directive.
Incoming TCP address: The address on which Squid opens an HTTP socket that listens for client connections and connections from other caches. By default Squid does not bind to any particular address, and will answer on any address that is active on the system. This option is not usually used, but can provide some additional level of security, if you wish to disallow any outside network users from proxying through your Web cache. This option correlates to the tcp_incoming_address directive.
Outgoing TCP address: Defines the address on which Squid sends out packets via HTTP to clients and other caches. Again, this option is rarely used. It refers to the tcp_outgoing_address directive.
Incoming UDP address: Sets the address on which Squid will listen for ICP packets from other Web caches. On a multi-homed Squid host (one attached to multiple subnets), this option allows you to restrict which subnets will be allowed to connect to your cache. This option correlates to the udp_incoming_address directive.
Outgoing UDP address: The address on which Squid will send out ICP packets to other Web caches. This option correlates to the udp_outgoing_address directive.
Multicast groups: The multicast groups which Squid will join to receive multicast ICP requests. This option should be used with great care, as it is used to configure your Squid to listen for multicast ICP queries. Clearly if your server is not on the MBone, this option is useless. And even if it is, this may not be an ideal choice. Refer to the Squid FAQ on this subject for a more complete discussion. This option refers to the mcast_groups directive.
TCP receive buffer: The size of the buffer used for TCP packets being received. By default Squid uses whatever the default buffer size for your operating system is. This should probably not be changed unless you know what you're doing, and there is little to be gained by changing it in most cases. This correlates to the tcp_recv_bufsize directive.

Other Caches

The Other Caches page provides an interface to one of Squid's most interesting, but also widely misunderstood, features. Squid is the reference implementation of ICP, or Internet Cache Protocol, a simple but effective means for multiple caches to communicate with each other regarding the content that is available on each. This opens the door for many interesting possibilities when one is designing a caching infrastructure.

Internet Cache Protocol

It is probably useful to discuss how ICP works and some common usages for ICP within Squid, in order to quickly make it clear what it is good for, and perhaps even more importantly, what it is not good for. The most popular uses for ICP are discussed, and more good ideas will probably arise in the future as the Internet becomes even more global in scope, and the Web caching infrastructure must grow with it.

Parent and Sibling Relationships

The ICP protocol specifies that a Web cache can act as either a parent or a sibling. A parent cache is simply an ICP capable cache that will answer both hits and misses for child caches, while a sibling will only answer hits for other siblings. This subtle distinction means simply that a parent cache can proxy for caches that have no direct route to the Internet. A sibling cache, on the other hand, cannot be relied upon to answer all requests, and your cache must have another method to retrieve requests that cannot come from the sibling. This usually means that in sibling relationships, your cache will also have a direct connection to the Internet or a parent proxy that can retrieve misses from the origin servers. ICP is a somewhat chatty protocol in that an ICP request will be sent to every neighbor cache each time a cache miss occurs. By default, whichever cache replies with an ICP hit first will be the cache used to request the object.
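
The selection behavior described above can be modeled in a few lines. The following Python sketch is a simplified illustration, not Squid's actual peer selection code: the IcpReply class, the choose_source function, and the peer names are all hypothetical, and the order used when every neighbor reports a miss (a parent first, then a direct fetch) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IcpReply:
    """One ICP reply as seen by the querying cache (hypothetical model)."""
    peer: str      # name of the neighbor that replied
    kind: str      # "parent" or "sibling"
    hit: bool      # True for an ICP hit, False for an ICP miss
    rtt_ms: float  # how long the reply took to arrive


def choose_source(replies: List[IcpReply], direct_allowed: bool = True) -> Optional[str]:
    """Pick where to fetch an object from after a local cache miss."""
    # Any neighbor, parent or sibling, may serve a hit; the fastest reply wins.
    hits = [r for r in replies if r.hit]
    if hits:
        return min(hits, key=lambda r: r.rtt_ms).peer

    # Only parents will fetch a miss on our behalf (assumed to be tried first).
    parents = [r for r in replies if r.kind == "parent"]
    if parents:
        return min(parents, key=lambda r: r.rtt_ms).peer

    # Siblings cannot help with a miss, so fall back to the origin server.
    return "DIRECT" if direct_allowed else None
```

With one sibling miss and one parent miss, the sketch selects the parent; with only a sibling miss and a direct route available, it returns DIRECT, which mirrors the point that a sibling alone cannot satisfy every request.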

When to Use ICP

ICP is often used in situations wherein one has multiple Internet connections, or several types of path to Internet content. Other possibilities include having a cache mesh such as the IRCache Hierarchy in the US or The National Janet Web Caching Service in the UK, which can utilize lower cost non-backbone links to connect several remote caches in order to lower costs and raise performance. Finally, it is possible, though usually not recommended, to implement a rudimentary form of load balancing through the use of multiple parents and multiple child Web caches. All of these options are discussed in some detail, but this document should not be considered the complete reference to ICP. Other good sources of information include the two RFCs on the subject, RFC 2186 which discusses the protocol itself, and RFC 2187 which describes the application of ICP.

One common ICP-based solution in use today is satellite cache pre-population services. In this case, there are at least two caches at a site, one of which is connected to a satellite Internet uplink. The satellite connected cache is provided by the service provider, and it is automatically filled with popular content via the satellite link. The other cache uses the satellite connected cache as a sibling, which it queries for every cache miss that it has. If the satellite connected sibling has the content, it will be served from the sibling cache; if not, the primary cache will fetch the content from the origin server or a parent cache. ICP is a pretty effective, if somewhat bandwidth and processor intensive, means of accomplishing this task. A refinement of this process would be to use Cache Digests for the satellite connected sibling in order to reduce traffic between the sibling caches. Nonetheless, ICP is a quite good method of implementing this idea.

Another common use is cache meshes. A cache mesh is, in short, a number of Web caches at remote sites interconnected using ICP. The Web caches could be in different cities, or they could be in different buildings of the same university or different floors in the same office building. This type of hierarchy allows a large number of caches to benefit from a larger client population than is directly available to any one of them. All other things being equal, a cache that is not overloaded will perform better (with regard to hit ratio) with a larger number of clients. Simply put, a larger client population leads to a higher quality of cache content, which in turn leads to higher hit ratios and improved bandwidth savings. So, whenever it is possible to increase the client population without overloading the cache, such as in the case of a cache mesh, it may be worth considering. Again, this type of hierarchy can be improved upon by the use of Cache Digests, but ICP is usually simpler to implement and is a widely supported standard, even on non-Squid caches.

Finally, ICP is also sometimes used for load balancing multiple caches at the same site. ICP, or even Cache Digests for that matter, is almost never the best way to implement load balancing. However, for completeness, I'll discuss it briefly. Using ICP for load balancing can be achieved in a few ways. One common method is to have several local siblings, which can each provide hits to the others' clients, while the client load is evenly divided across the number of caches. Another option is to have a very fast but low capacity Web cache in front of two or more lower cost, but higher capacity, parent Web caches. The parents will then serve the requests in roughly equal shares. As mentioned, there are much better options for balancing Web caches, the most popular being WCCP (version 1 is fully supported by Squid), and L4 or L7 switches.

Other Proxy Cache Servers

This section of the Other Caches page provides a list of currently configured sibling and parent caches, and also allows one to add more neighbor caches. Clicking on the name of a neighbor cache will allow you to edit it. This section also provides the vital information about the neighbor caches, such as the type (parent, sibling, multicast), the proxy or HTTP port, and the ICP or UDP port of the caches. Note that Proxy port is the port where the neighbor cache normally listens for client traffic, which defaults to 3128.

Edit Cache Host

Clicking on a cache peer name or clicking Add another cache on the primary Other Caches page brings you to this page, which allows you to edit most of the relevant details about neighbor caches (Figure 12.2, “Edit Cache Host Page”).

Figure 12.2. Edit Cache Host Page

Hostname: The name or IP address of the neighbor cache you want your cache to communicate with. Note that this will be one way traffic. Access Control Lists, or ACLs, are used to allow ICP requests from other caches. ACLs are covered later. This option plus most of the rest of the options on this page correspond to cache_peer lines in squid.conf.
Type: The type of relationship you want your cache to have with the neighbor cache. If the cache is upstream, and you have no control over it, you will need to consult with the administrator to find out what kind of relationship you should set up. If it is configured wrong, cache misses will likely result in errors for your users. The options here are sibling, parent, and multicast.
Proxy port: The port on which the neighbor cache is listening for standard HTTP requests. Even though the caches transmit availability data via ICP, actual Web objects are still transmitted via HTTP on the port usually used for standard client traffic. If your neighbor cache is a Squid based cache, then it is likely to be listening on the default port of 3128. Other common ports used by cache servers include 8000, 8888, 8080, and even 80 in some circumstances.
ICP port: The port on which the neighbor cache is configured to listen for ICP traffic. If your neighbor cache is a Squid based proxy, this value can be found by checking the icp_port directive in the squid.conf file on the neighbor cache. Generally, however, the neighbor cache will listen on the default port 3130.
Proxy only?: A simple yes or no question to tell whether objects fetched from the neighbor cache should be cached locally. This can be used when all caches are operating well below their client capacity, but disk space is at a premium or hit ratio is of prime importance.
Send ICP queries?: Tells your cache whether or not to send ICP queries to a neighbor. The default is Yes, and it should probably stay that way. ICP queries are the method by which Squid knows which caches are responding, and which caches are closest or best able to quickly answer a request.
Default cache: This should be switched to Yes if this neighbor cache is to be the last-resort parent cache to be used in the event that no other neighbor cache is present as determined by ICP queries. Note that this does not prevent it from being used normally while other caches are responding as expected. Also, if this neighbor is the sole parent proxy, and no other route to the Internet exists, this should be enabled.
Round-robin cache?: Chooses whether to use round robin scheduling between multiple parent caches in the absence of ICP queries. This should be set on all parents that you would like to schedule in this way.
ICP time-to-live: Defines the multicast TTL for ICP packets. When using multicast ICP, it is usually wise for security and bandwidth reasons to use the minimum TTL suitable for your network.
Cache weighting: Sets the weight for a parent cache. When using this option it is possible to set higher numbers for preferred caches. The default value is 1, and if left unset for all parent caches, whichever cache responds positively first to an ICP query will be sent a request to fetch that object.
Closest only: Allows you to specify that your cache wants only CLOSEST_PARENT_MISS replies from parent caches. This allows your cache to then request the object from the parent cache closest to the origin server.
No digest?: Chooses whether your cache should refrain from requesting cache digests from this neighbor cache.
No NetDB exchange: When using ICP, it is possible for Squid to keep a database of network information about the neighbor caches, including availability and RTT, or Round Trip Time, information. This usually allows Squid to choose more wisely which caches to make requests to when multiple caches have the requested object. Selecting Yes here disables the exchange of this information with this neighbor.
No delay?: Prevents accesses to this neighbor cache from affecting delay pools. Delay pools, discussed in more detail later, are a means by which Squid can regulate bandwidth usage. If a neighbor cache is on the local network, and bandwidth usage between the caches does not need to be restricted, then this option can be used.
Login to proxy: Select this if you need to send authentication information when challenged by the neighbor cache. On local networks, this type of security is unlikely to be necessary.
Multicast responder: Allows Squid to know where to accept multicast ICP replies. Because multicast is fed on a single IP to many caches, Squid must have some way of determining which caches to listen to and what options apply to that particular cache. Selecting Yes here configures Squid to listen for multicast replies from the IP of this neighbor cache.
Query host for domains, Don't query for domains: These two options are the only options on this page to configure a directive other than cache_peer in Squid. In this case they set the cache_peer_domain option. This allows you to configure which domains may be queried via ICP and which should not. It is often used to configure caches not to query other caches for content within the local domain. Another common usage, such as in the national Web hierarchies discussed above, is to define which Web cache is used for requests destined for different TLDs. So, for example, if one has a low cost satellite link to the US backbone from another country that is preferred for Web traffic over the much more expensive land line, one can configure the satellite connected cache as the cache to query for all .com, .edu, .org, .net, .us, and .gov addresses.
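
To see how the fields on this page map onto squid.conf, the following Python sketch assembles cache_peer and cache_peer_domain lines from the same pieces of information. The helper functions and the host name sat.example.com are hypothetical, and the sketch assumes the usual field order of host name, type, proxy port, ICP port, then options. The option spellings proxy-only and weight=2 are assumed to be the squid.conf counterparts of the Proxy only? and Cache weighting fields; check the comments in your own squid.conf before relying on them.

```python
VALID_TYPES = ("parent", "sibling", "multicast")


def cache_peer_line(host, peer_type, proxy_port=3128, icp_port=3130, options=()):
    """Build one cache_peer line from the Edit Cache Host fields."""
    if peer_type not in VALID_TYPES:
        raise ValueError(f"unknown neighbor type: {peer_type!r}")
    fields = ["cache_peer", host, peer_type, str(proxy_port), str(icp_port)]
    fields.extend(options)  # e.g. "proxy-only", "weight=2" (assumed spellings)
    return " ".join(fields)


def cache_peer_domain_line(host, query=(), dont_query=()):
    """Build a cache_peer_domain line; excluded domains get a leading '!'."""
    domains = list(query) + ["!" + domain for domain in dont_query]
    if not domains:
        raise ValueError("at least one domain is required")
    return " ".join(["cache_peer_domain", host] + domains)


if __name__ == "__main__":
    print(cache_peer_line("sat.example.com", "parent",
                          options=["proxy-only", "weight=2"]))
    print(cache_peer_domain_line(
        "sat.example.com",
        query=[".com", ".edu", ".org", ".net", ".us", ".gov"]))
```

Run as a script, this prints a cache_peer line for the satellite connected parent followed by a cache_peer_domain line restricting it to the six TLDs from the example above.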

Cache Selection Options

This section provides configuration options for general ICP configuration (Figure 12.3, “Some global ICP options”). These options affect all of the other neighbor caches that you define.

Figure 12.3. Some global ICP options

Directly fetch URLs containing: Allows you to configure a match list of items to always fetch directly rather than query a neighbor cache. The default here is cgi-bin ? and should continue to be included unless you know what you're doing. This helps prevent wasting inter-cache bandwidth on lots of requests that are usually never considered cacheable, and so will never return hits from your neighbor caches. This option sets the hierarchy_stoplist directive.
ICP query timeout: The time in milliseconds that Squid will wait before timing out ICP requests. The default allows Squid to calculate an optimum value based on average RTT of the neighbor caches. Usually, it is wise to leave this unchanged. However, for reference, the default value in the distant past was 2000, or 2 seconds. This option edits the icp_query_timeout directive.
Multicast ICP timeout: Timeout in milliseconds for multicast probes, which are sent out to discover the number of active multicast peers listening on a given multicast address. This configures the mcast_icp_query_timeout directive and defaults to 2000 ms, or 2 seconds.
Dead peer timeout: Controls how long Squid waits to declare a peer cache dead. If there are no ICP replies received in this amount of time, Squid will declare the peer dead and will not expect to receive any further ICP replies. However, it continues to send ICP queries for the peer and will mark it active again on receipt of a reply. This timeout also affects when Squid expects to receive ICP replies from peers. If more than this number of seconds have passed since the last ICP reply was received, Squid will not expect to receive an ICP reply on the next query. Thus, if your time between requests is greater than this timeout, your cache will send more requests DIRECT rather than through the neighbor caches.
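
The dead peer behavior is easier to follow as a small state model. This Python sketch is only an illustration of the rule described above, not Squid's implementation; the PeerStatus class and its method names are hypothetical, and times are plain numbers of seconds.

```python
class PeerStatus:
    """Tracks whether ICP replies are still expected from one neighbor."""

    def __init__(self, dead_peer_timeout, started_at):
        self.dead_peer_timeout = dead_peer_timeout  # seconds
        self.last_reply = started_at                # time of the last ICP reply

    def record_reply(self, now):
        # Any reply marks the peer active again, even after it was declared dead.
        self.last_reply = now

    def is_alive(self, now):
        # The peer counts as dead once no reply has arrived within the timeout.
        return (now - self.last_reply) <= self.dead_peer_timeout


if __name__ == "__main__":
    peer = PeerStatus(dead_peer_timeout=10, started_at=0.0)
    print(peer.is_alive(5.0))    # reply still expected
    print(peer.is_alive(11.0))   # considered dead, requests go elsewhere
    peer.record_reply(11.0)
    print(peer.is_alive(15.0))   # active again after a reply
```

The model also shows why a lightly used cache sends more requests DIRECT: if requests arrive less often than the timeout, every query finds the peer already past its deadline.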

Memory Usage

This page provides access to most of the options available for configuring the way Squid uses memory and disks (Figure 12.4, “Memory and Disk Usage”). Most values on this page can remain unchanged, except in very high load or low resource environments, where tuning can make a measurable difference in how well Squid performs.

Figure 12.4. Memory and Disk Usage

Memory usage limit

The limit on how much memory Squid will use for some parts of its in-core data. Note that this does not restrict or limit Squid's total process size. What it does do is set aside a portion of RAM for use in storing in-transit and hot objects, as well as negative cached objects. Generally, the default value of 8 MB is suitable for most situations, though it is safe to lower it to 4 or 2 MB in extremely low load situations. It can also be raised significantly on high memory systems to increase performance by a small margin. Keep in mind that large cache directories increase the memory usage of Squid by a large amount, and even a machine with a lot of memory can run out of memory and go into swap if cache memory and disk size are not appropriately balanced. This option edits the cache_mem directive. See the section on cache directories for more complete discussion of balancing memory and storage.

	Caution
	If Squid is using what you consider to be too much memory, do not look here for a solution. It defaults to a modest 8 MB, and only when you have configured a very small amount of cache storage will this 8 MB be a significant portion of the memory Squid allocates. If you do find yourself running out of memory, you can lower the size of your configured cache directories for a more noticeable decrease in memory used.

FQDN cache size

Size of the in-memory cache of fully qualified domain names. This configures the fqdncache_size parameter and defaults to 1024, which is usually a safe value. In environments where DNS queries are slow, raising this may help.

Memory high-water mark, Memory low water mark

Sets the points at which Squid begins to remove objects from memory. As memory usage climbs past the low water mark, Squid more aggressively tries to free memory. Note this applies to the memory usage limit defined above, not the total process size of Squid. If you have a system that is doing double or triple duty and providing more than cache services it may be wise to set the low water mark at a low number, like 50%, and the high mark at a high number like 95%. In such a case, Squid will mostly keep its usage at 50%, but if it begins to get overloaded, or a particularly large object comes through the cache, it can briefly go over that point. This option configures the cache_mem_low and cache_mem_high options, which default to 90% and 95%, respectively.

Disk high-water mark, Disk low-water mark

Provide a mechanism for disk usage similar to the memory water marks above. To maximize hit ratio, and provide most efficient use of disk space, leave this at the default values of 90% and 95%. Or to maximize performance and minimize fragmentation on disk, set them to a higher spread, such as 85% and 100%. Note that these settings are not where the amount of disk space to use is configured; they only define the percent of the allotted cache space at which Squid should begin to prune out old data to make room for incoming new objects. These options correlate to the cache_swap_high and cache_swap_low directives.
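
Because the water marks are percentages of a configured limit rather than absolute sizes, it can help to work out what they mean in megabytes. The short Python sketch below does that arithmetic for both the memory and disk settings; the water_marks function is a hypothetical helper written for this example, and the 6000 MB cache size is only an illustrative figure.

```python
def water_marks(limit_mb, low_pct, high_pct):
    """Return the (low, high) water marks in MB for a configured limit."""
    if not 0 <= low_pct <= high_pct <= 100:
        raise ValueError("expected 0 <= low <= high <= 100")
    return limit_mb * low_pct / 100, limit_mb * high_pct / 100


if __name__ == "__main__":
    # Memory: the default 8 MB limit with the default 90% and 95% marks.
    print(water_marks(8, 90, 95))
    # Disk: an illustrative 6000 MB cache directory with the same marks.
    print(water_marks(6000, 90, 95))
    # The wider disk spread suggested above for performance.
    print(water_marks(6000, 85, 100))
```

For the 6000 MB example, pruning starts at 5400 MB and becomes aggressive at 5700 MB with the defaults, while the 85% and 100% spread moves those points to 5100 MB and 6000 MB.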

Maximum cached object size

The size of the largest object that Squid will attempt to cache. Objects larger than this will never be written to disk for later use. Refers to the maximum_object_size directive.

IP address cache size, IP cache high-water mark, IP address low-water mark

The size of the cache used for IP addresses and the high and low water marks for the cache, respectively. This option configures the ipcache_size, ipcache_high, and ipcache_low directives, which default to 1024 entries, 95% and 90%.

Logging

Squid provides a number of logs that can be used when debugging problems, when measuring the effectiveness of the cache, and when identifying users and the sites they visit (Figure 12.5, “Logging Configuration”). Because Squid can be used to "snoop" on users' browsing habits, you should carefully consider the privacy laws in your region and, more importantly, be considerate to your users. That being said, logs can be very valuable tools in ensuring that your users get the best service possible from your cache.

Figure 12.5. Logging Configuration

Access log file

The location of the cache access.log. The Squid access.log is the file in which Squid writes a small one line entry for every request served by the cache. This option correlates to the cache_access_log directive and usually defaults to /usr/local/squid/log/access.log or on some RPM based systems /var/log/squid/access.log. The format of the standard log file looks like this:

            973421337.543  11801 192.168.1.1 TCP_MISS/200 1999 GET http://www.google.com/ - DIRECT/64.208.34.100 text/html

In the preceding line, each field represents some piece of information that may be of interest to an administrator:

System time in standard UNIX format. The time in seconds since 1970. There are many tools to convert this to human readable time, including this simple Perl script.
```
#!/usr/bin/perl -p
# Replace the leading UNIX timestamp on each line with human readable local time.
s/^\d+\.\d+/localtime $&/e;
```
Duration or the elapsed time in milliseconds the transaction required.
Client address or the IP address of the requesting browser. Some configurations may lead to a masked entry here, so that this field is not specific to one IP, but instead reports a whole network IP.
Result codes provides two entries separated by a slash. The first position is one of several result codes, which provide information about how the request was resolved or wasn't resolved if there was a problem. The second field contains the status code, which comes from a subset of the standard HTTP status codes.
Bytes is the size of the data delivered to the client in bytes. Headers and object data are counted towards this total. Failed requests will deliver an error page, the size of which will also be counted.
Request method is the HTTP request method used to obtain an object. The most common method is, of course, GET, which is the standard method Web browsers use to fetch objects.
URL is the complete Uniform Resource Locator requested by the client.
RFC931 is the ident lookup information for the requesting client, if ident lookups are enabled in your Squid. Because of the performance impact, ident lookups are not used by default, in which case this field will always contain "-".
Hierarchy code consists of three items. The first is simply a prefix of TIMEOUT_ if all ICP requests time out. The second (first if no TIMEOUT_ prefix is present) is the code that explains how the request was handled. This portion will be one of several hierarchy codes. This result is informative regardless of whether your cache is part of a cache hierarchy, and will explain how the request was served. The final portion of this field contains the name or IP of the host from which the object was retrieved. This could be the origin server, a parent, or any other peer.
Type is simply the type of object that was requested. This will usually be a recognizable MIME type, but some objects have no type or are listed as ":".

There are two other optional fields for cases when MIME header logging has been turned on for debugging purposes. The full HTTP request and reply headers will be included enclosed in [ and ] square brackets.
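
The field layout described above can be checked against the sample entry with a small parser. The following Python sketch splits a standard format line into named fields; the AccessLogEntry type and parse_access_line function are hypothetical names for this example, the timestamp is converted to UTC for predictability, and the optional MIME header fields are ignored.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class AccessLogEntry(NamedTuple):
    time: datetime        # system time, converted from seconds since 1970
    elapsed_ms: int       # duration of the transaction
    client: str           # client address
    result_code: str      # e.g. TCP_MISS
    status: int           # HTTP status code
    size: int             # bytes delivered to the client
    method: str           # request method
    url: str              # requested URL
    ident: str            # RFC931 ident result, "-" if disabled
    hierarchy_code: str   # how the request was handled, e.g. DIRECT
    peer_host: str        # host the object was retrieved from
    content_type: str     # MIME type of the object


def parse_access_line(line: str) -> AccessLogEntry:
    """Parse one line of Squid's standard access.log format."""
    fields = line.split()
    if len(fields) < 10:
        raise ValueError("not a standard access.log line")
    result_code, status = fields[3].split("/", 1)
    hierarchy_code, peer_host = fields[8].split("/", 1)
    return AccessLogEntry(
        time=datetime.fromtimestamp(float(fields[0]), tz=timezone.utc),
        elapsed_ms=int(fields[1]),
        client=fields[2],
        result_code=result_code,
        status=int(status),
        size=int(fields[4]),
        method=fields[5],
        url=fields[6],
        ident=fields[7],
        hierarchy_code=hierarchy_code,
        peer_host=peer_host,
        content_type=fields[9],
    )


if __name__ == "__main__":
    sample = ("973421337.543  11801 192.168.1.1 TCP_MISS/200 1999 GET "
              "http://www.google.com/ - DIRECT/64.208.34.100 text/html")
    print(parse_access_line(sample))
```

Splitting on whitespace is sufficient here because none of the ten standard fields contains a space; a log written with MIME header logging enabled would need extra handling for the bracketed headers.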

Debug log file

The location for Squid's cache.log file. This file contains startup configuration information, as well as assorted error information during Squid's operation. This file is a good place to look when a Web site is found to have problems running through the Web cache. Entries here may point towards a potential solution. This option correlates to the cache_log directive and usually defaults to either /usr/local/squid/log/cache.log or /var/log/squid/cache.log on RPM based systems.

Storage log file

Location of the cache's store log file. This file contains a transaction log of all objects that are stored in the object store, as well as the time when they get deleted. This file really doesn't have very much use on a production cache, and it is primarily recommended for use in debugging. Therefore, it can be turned off by entering none in the entry field. The default location is either /usr/local/squid/log/store.log or /var/log/squid/store.log.

Cache metadata file

Filename used in each store directory to store the Web cache metadata, which is a sort of index for the Web cache object store. This is not a human readable log, and it is strongly recommended that you leave it in its default location on each store directory, unless you really know what you're doing. This option correlates to the cache_swap_log directive.

Use HTTPD log format

Allows you to specify that Squid should write its access.log in HTTPD common log file format, such as that used by Apache and many other Web servers. This allows you to parse the log and generate reports using a wider array of tools. However, this format does not provide several types of information specific to caches, and is generally less useful when tracking cache usage and solving problems. Because there are several effective tools for parsing and generating reports from the Squid standard access logs, it is usually preferable to leave this at its default of being off. This option configures the emulate_httpd_log directive. The Calamaris cache access log analyzer does not work if this option is enabled.

Log MIME headers

Provides a means to log extra information about your requests in the access log. This causes Squid to also write the request and response MIME headers for every request. These will appear in brackets at the end of each access.log entry. This option correlates to the log_mime_hdrs directive.

Perform RFC931 ident lookups for ACLs

Indicates which of the Access Control Lists that are defined should have ident lookups performed for each request in the access log. Because of the performance impact of using this option, it is not on by default. This option configures the ident_lookup_access directive.

RFC931 ident timeout

The timeout, usually in seconds, for ident lookups. If this is set too high, you may be susceptible to denial of service from having too many outstanding ident requests. The default for this is 10 seconds, and it applies to the ident_timeout directive.

Log full hostnames

Configures whether Squid will attempt to resolve the host name, so that the fully qualified domain name can be logged. This can, in some cases, increase latency of requests. This option correlates to the log_fqdn directive.

Logging netmask

Defines what portion of the requesting client IP is logged in the access.log. For privacy reasons it is often preferred to only log the network or subnet IP of the client. For example, a netmask of 255.255.255.0 will log the first three octets of the IP, and fill the last octet with a zero. This option configures the client_netmask directive.
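
The effect of the logging netmask is a bitwise AND between the client address and the mask. The Python sketch below demonstrates this with the standard ipaddress module; the mask_client function and the sample address are hypothetical and chosen only for illustration.

```python
import ipaddress


def mask_client(ip: str, netmask: str) -> str:
    """Return the address that would be logged after applying the netmask."""
    address = int(ipaddress.IPv4Address(ip))
    mask = int(ipaddress.IPv4Address(netmask))
    # Bits cleared in the mask are zeroed in the logged address.
    return str(ipaddress.IPv4Address(address & mask))


if __name__ == "__main__":
    print(mask_client("192.168.1.77", "255.255.255.0"))    # last octet zeroed
    print(mask_client("192.168.1.77", "255.255.255.255"))  # full address kept
```

With a mask of 255.255.255.0, every client on the 192.168.1 network is logged as 192.168.1.0, so individual hosts can no longer be told apart in the access.log.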

Debug options

Provides a means to configure all of Squid's various debug sections. Squid's debugging code has been divided into a number of sections. If there is a problem in one part of Squid, debug logging can be made more verbose for just that section. For example, to increase debugging for just the Storage Manager in Squid to its highest level of 9 while leaving the rest at the default of 1, the entry would look like Figure 12.6, “Setting Squid Debug Levels”.

Figure 12.6. Setting Squid Debug Levels

There is a complete list of debug sections at the Swell Technology Web site and in the Squid source distribution in the doc directory. More information can be found in the Squid FAQ.
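
As a rough illustration of how such an entry is composed, the Python sketch below builds a debug options value from a default level and a set of per-section overrides. It assumes the entry is a space separated list of section,level pairs with ALL standing for every section, and that the Storage Manager is section 20; both assumptions should be verified against the debug section list for your Squid version. The debug_options_value function is a hypothetical helper.

```python
def debug_options_value(default_level, overrides=None):
    """Compose a debug options value such as 'ALL,1 20,9'."""
    levels = dict(overrides or {})
    for level in [default_level, *levels.values()]:
        if not 0 <= level <= 9:
            raise ValueError("debug levels range from 0 to 9")
    parts = [f"ALL,{default_level}"]
    # List the overridden sections in numeric order after the default.
    parts += [f"{section},{level}" for section, level in sorted(levels.items())]
    return " ".join(parts)


if __name__ == "__main__":
    # Section 20 is assumed here to be the Storage Manager.
    print(debug_options_value(1, {20: 9}))
```

Keeping the default at 1 and raising only the section under investigation keeps cache.log readable, since level 9 output for every section is extremely verbose.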

MIME headers table

The pathname to Squid's MIME table. This usually should remain at the default value. This option configures the mime_table directive, and defaults to /usr/local/squid/etc/mime.conf or /etc/squid/mime.conf.

Cache Options

The Cache Options page provides access to some important parts of the Squid configuration file. This is where the cache directories are configured as well as several timeouts and object size options.

Cache directories

The Cache directories option sets the cache directories and the amount of space Squid is allowed to use on the device. The following example shows a cache_dir line from the squid.conf file, while Figure 12.7, “Configuring Squid's Cache Directories” shows the Cache Options screen in Webmin where the same options can be configured.

        cache_dir ufs /cache0 1500 8 256

Figure 12.7. Configuring Squid's Cache Directories

The directive is cache_dir while the options are the type of filesystem, the path to the cache directory, the size allotted to Squid, the number of top level directories, and finally the number of second level directories. In the example, I've chosen the filesystem type ufs, which is a name for all standard UNIX filesystems. This type includes the standard Linux ext2 filesystem as well. Other possibilities for this option include aufs and diskd.

After the path, the next field is the amount of disk space, in megabytes, that you want to allow Squid to use. Finally, the directory fields define the number of first level (L1) and second level (L2) directories for Squid to use. Calculating the L1 number precisely is tricky, but not difficult if you use this formula:

        x = Size of cache dir in KB (e.g., 6GB = ~6,000,000KB)
        y = Average object size (just use 13KB)
        z = Number of directories per first level directory (256 in these examples)

        (((x / y) / 256) / 256) * 2 = # of first level directories

As an example, if your cache used 6GB of each of two 13GB drives:

        6,000,000 / 13  = 461,538.5
        461,538.5 / 256 = 1,802.9
        1,802.9 / 256   = 7.04, rounded to 7
        7 * 2           = 14

Your cache_dir line would look like this:

        cache_dir ufs /cache0 6000 14 256
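
Because the example spreads the cache over two drives, squid.conf would carry one cache_dir line per drive. In this sketch, /cache1 is an assumed mount point for the second drive; the original example only names /cache0.

        # 6GB (6000 MB) on each drive, 14 first level and 256 second level directories
        cache_dir ufs /cache0 6000 14 256
        # /cache1 is a hypothetical path for the second drive
        cache_dir ufs /cache1 6000 14 256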

Other Cache Options

The rest of the Cache Options page is a hodgepodge of other general options (Figure 12.8, “Other Cache Options”). Most configuration will take place on this page, as it addresses many of the tunable items in Squid that can make a difference in performance.

Figure 12.8. Other Cache Options

Average object size

The average size of objects expected in your cache. This item is generally safe at its default value of 13KB. In older versions of Squid (prior to 2.3) this option affected the number of filemap bits set aside by Squid at runtime. In newer versions, filemap bits are configured dynamically, as needed, so configuring this option is unnecessary and arguably pointless. This option will go away in some future version of Squid. Until then, it correlates to the store_avg_object_size directive.

Objects per bucket

The target number of objects per bucket in the store hash table. Again, it is not worth changing: the default is a safe value, and changing it provides little or nothing in the way of performance or efficiency improvements. This option corresponds to the store_objects_per_bucket directive.

Don't cache URLs for ACLs

Allows you to easily pick which ACL matches will not be cached. Requests that match the selected ACLs will always be answered from the origin server. This option correlates to the always_direct directive.

Maximum cache time

The maximum time an object will be allowed to remain in the cache. This time limit is rarely reached, because objects have so many occasions to be purged via the normal replacement of the cache object store. However, if object freshness is of prime importance, then it may be worthwhile to lower this from its default of 1 year to something much shorter, such as 1 week, though it is probably counter-productive to do so for most users. This option configures the reference_age parameter.

Maximum request size

The maximum size of request that Squid will accept from the client. The default is 100 KB. However, in environments where the POST method may be used to send larger files to Web servers, or Web mail is used for sending attachments, it will probably be necessary to raise this limit. A value of 8 or 16 MB will permit most uploads without any problems. Note that this bears no relation to the size of the object being retrieved; it only affects the size of the HTTP request being sent from the client. This option corresponds to the request_size directive.

	Note
	Squid versions from 2.5 onward have removed the default size limit, and requests can be unlimited in size. Thus if you rely on the default limit being in place, you will need to modify your configuration when upgrading.

Failed request cache time

The amount of time that a request error condition is cached. For example, some types of connection refused and 404 Not Found errors are negatively cached. This option corresponds to the negative_ttl directive.

DNS lookup cache time

The length of time that DNS lookups are cached. Squid provides functionality as a basic caching name server, in order to further accelerate Web service through the proxy. This value defaults to 6 hours and correlates to the positive_dns_ttl directive. If rapid updates of cached DNS data are required, such as to keep up with dynamic DNS systems or to avoid load balance problems with local network sites in a Web site acceleration environment, it may be wise to reduce this value significantly. If you do reduce this value by a large amount, however, ensuring that you have reliable and suitably fast DNS service on the local network is mandatory.

Failed DNS cache time

The time period for which failed DNS requests are cached. This option corresponds to the negative_dns_ttl directive, and defaults to 5 minutes. This option rarely needs tuning.
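
Taken together, the three cache time options above might appear in squid.conf as in the following sketch. The DNS values restate the defaults quoted in the text; the negative_ttl line assumes that Failed request cache time maps to that directive with a 5 minute value, so check the squid.conf shipped with your version.

        # Failed request cache time (assumed directive and value)
        negative_ttl 5 minutes
        # DNS lookup cache time
        positive_dns_ttl 6 hours
        # Failed DNS cache time
        negative_dns_ttl 5 minutes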

Connect timeout

An option to force Squid to close connections after a specified time. Some systems (notably older Linux versions) cannot be relied upon to time out connect requests. For this reason, this option specifies how long Squid should wait for the connection to complete. This value defaults to 120 seconds (2 minutes) and correlates to the connect_timeout directive.

Read timeout

The timeout for server-side connections. After each successful read() request, the timeout is reset to this amount. If no data is read within this period of time, the request is aborted and logged with ERR_READ_TIMEOUT. This option corresponds to the read_timeout directive and defaults to 15 minutes.

Site selection timeout

The timeout for URN to multiple URLs selection. URN is a protocol designed for location-independent name resolution, specified in RFC 2169. This option configures the siteselect_timeout directive and defaults to 4 seconds. There is probably no need to change this.

Client request timeout

The timeout for HTTP requests from clients. By default the value is 5 minutes, and correlates to the request_timeout directive.

Max client connect time

The time limit Squid sets for a client to remain connected to the cache process. This is merely a safeguard against clients that disappear without properly shutting down. It is designed to prevent a large number of sockets from being tied up in a CLOSE_WAIT state. The default for this option is 1440 minutes, or 1 day. This correlates to the client_lifetime directive.

Max shutdown time

The time Squid allows for existing connections to continue after it has received a shutdown signal. It will stop accepting new connections immediately, but connections already in progress will continue to be served for this amount of time. This option corresponds to the shutdown_lifetime configuration directive and defaults to 30 seconds, which is a good safe value. If rapid down->up time is more important than being polite to current clients, it can be lowered.

Half-closed clients

Defines Squid's behavior towards some types of clients that close the sending side of a connection while leaving the receiving side open. Turning this option off will cause Squid to immediately close connections when a read(2) returns "no more data to read". It's usually safe to leave this at the default value of yes. It corresponds to the half_closed_clients directive.

Persistent timeout

The timeout value for persistent connections. Squid will close persistent connections if they are idle for this amount of time. Persistent connections will be disabled entirely if this option is set to a value less than 10 seconds. The default is 120 seconds, and likely doesn't need to be changed. This option configures the pconn_timeout directive.
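
For reference, the timeout options described above could be written in squid.conf roughly as follows. This is only a sketch that restates the defaults quoted in this section; defaults differ between Squid versions, so treat the values as illustrative rather than authoritative.

        connect_timeout 2 minutes
        read_timeout 15 minutes
        siteselect_timeout 4 seconds
        request_timeout 5 minutes
        client_lifetime 1 day
        shutdown_lifetime 30 seconds
        # Webmin's "yes" corresponds to "on"; the directive is spelled half_closed_clients
        half_closed_clients on
        pconn_timeout 120 seconds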

WAIS relay host, WAIS relay port

The WAIS host and port to relay WAIS requests to. WAIS, or Wide Area Information Servers, is a system to catalog and search large amounts of data via a WAIS or WWW browser. WAIS is a mostly deprecated protocol, but some sites probably do still exist, though this author has been unable to locate any to satiate his curiosity about the subject. These options correspond to the wais_relay_host and wais_relay_port directives, and default to localhost and 8000.

Helper Programs

Squid uses helper programs to provide extra functionality, or to provide greater performance. Squid provides a standard API for several types of programs that provide extra services that do not fit well into the Squid core. Helper programs could be viewed as a simple means of modular design, allowing third parties to write modules to improve the features of Squid (Figure 12.9, “Cache Helper Program”). That being said, some of Squid's standard functionality is also provided by helper programs. The standard helper programs include dnsserver, pinger, and several authentication modules. Third-party modules include redirectors, ad blockers, and additional authentication modules.

	Note
	Squid versions from 2.3 onward do not use the dnsserver helper program by default, replacing it with an internal non-blocking DNS resolver. This new internal DNS resolver is more memory and processor efficient, so is preferred. But in some circumstances, the older helper program is the better choice. If your Squid must be able to resolve based on any source other than a DNS server, such as via a hosts file or NIS, then you may need to use the external dnsserver helper.

Figure 12.9. Cache Helper Program

FTP column width

The column width for auto-generated Web pages of FTP sites queried through Squid when Squid is in forward proxy mode. Squid provides limited FTP proxy features to allow browsers (even older, non-FTP aware browsers) to communicate with FTP servers. This option gives some control over how Squid formats the resulting file lists. This option correlates to the ftp_list_width directive and defaults to 32.

	Note
	Squid only provides FTP proxy and caching services when acting as a traditional proxy, not when acting transparently. Squid does not currently provide FTP caching or proxying for standard FTP clients. The clients must be HTTP clients, for which Squid can provide gateway services.

Anon FTP login

The email address Squid uses to log in to remote FTP servers anonymously. This can simply be a user name followed by an @ symbol, to which your domain name can be automatically appended, or it can be a full email address. This should be something reasonable for your domain, such as wwwuser@mydomain.com, or in the domainless case first mentioned, Squid@, which happens to be the default for this option. This corresponds to the ftp_user directive.
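
As a minimal sketch, the two FTP options above could be set in squid.conf like this. The address is the illustrative one used in the text, not a recommendation.

        # Column width for generated FTP listings (the default quoted above)
        ftp_list_width 32
        # Anonymous FTP login; wwwuser@mydomain.com is the example address from the text
        ftp_user wwwuser@mydomain.com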

Squid DNS program

The helper program to use for DNS resolution. Because Squid requires a non-blocking resolver for its queries, an external program called dnsserver is included in the standard distribution. In Squid versions prior to 2.3, this program is the only standard choice for resolution, and the path to the file can be entered here. In Squid 2.3 and later, there is a new default option, which is an internal non-blocking resolver that is more memory and CPU efficient. This option rarely needs to be changed from its default value. This option configures the cache_dns_program directive.

Number of DNS programs

The number of external DNS resolver processes that will be started in order to serve requests. The default value of five is enough for many networks, but if your Squid serves a large number of users, this value may need to be increased to avoid errors. Increasing the number of processes also increases the load on system resources, however, and may actually hinder performance if set too high. More than 10 is probably overkill. This option correlates to the dns_children directive.

Append domain to requests

When enabled, causes the dnsserver to add the local domain name to single component host names. It is normally disabled to prevent caches in a hierarchy from interpreting single component host names locally. This option configures the dns_defnames directive.

DNS server addresses

Normally defaults to From resolv.conf, which simply means that Squid's parent DNS servers will be drawn from the /etc/resolv.conf file found on the system Squid runs on. It is possible to select other DNS servers if needed, for example to choose a more local caching DNS server, or a remote Internet connected server. This option corresponds to the dns_nameservers directive.
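
The DNS options above might be combined in squid.conf as in this sketch. The helper path and the name server addresses are assumptions made up for illustration; the cache_dns_program, dns_children, and dns_defnames lines only matter when the external dnsserver helper is in use.

        # Hypothetical install path for the external helper
        cache_dns_program /usr/local/squid/libexec/dnsserver
        dns_children 5
        # Do not append the local domain to single component host names
        dns_defnames off
        # Hypothetical local caching name servers, overriding /etc/resolv.conf
        dns_nameservers 192.168.1.1 192.168.1.2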

Cache clearing program

The name of the helper program that deletes, or unlinks, old files in the cache to make room for newer objects. In all current versions of Squid, this helper is known as unlinkd and should probably not be changed from this unless you know what you're doing. This option configures the unlinkd_program directive.

Squid ping program

An external program that provides Squid with ICMP RTT information so that it can more effectively choose between multiple remote parent caches for request fulfillment. This option is only required in special cases, and your Squid must have been compiled with the --enable-icmp configure option in order for it to work. This option should only be used on caches that have multiple parent caches on different networks that they must choose between. The default program to use for this task is called pinger. This option configures the pinger_program directive.

Custom redirect program, Number of redirect programs

Provides access to the redirector interface in Squid, so a redirector can be selected and the number of redirector processes configured. A redirector is, in short, just what it sounds like: a program that, when given a URL that matches some circumstances, redirects Squid to another URL. To be a little less brief and perhaps more complete, a redirector provides a method to export a request to an external program, and then to import that program's response and act as though the client sent the resulting request. This allows for interesting functionality with Squid and an external redirector. To configure a redirector, enter the path to the redirector and the redirector filename as shown in Figure 12.10, “Configuring a Redirector”. You should also enter any options to be passed to the redirector in the same field, as in the example shown.

Figure 12.10. Configuring a Redirector

One common usage is to block objectionable content using a tool like SquidGuard. Another popular use is to block advertising banners using the simple but effective Ad Zapper. The Ad Zapper not only allows one to block ads, it can also remove those pesky flashing "New" images and moving line images used in place of standard horizontal rules. Several other general purpose redirectors exist that provide URL remapping for many different purposes. Two popular and well supported general redirectors are Squirm and JesRed. Finally, it is possible to write a custom redirector to provide any kind of functionality needed from your Squid. While it is not possible to use the redirector interface to alter a Web page's content, it is possible to perform in-line editing of some or all URLs to force many different types of results. The two redirect options configure the redirect_program and redirect_children directives.
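
A redirector setup of the kind described above might look like the following in squid.conf. This is a sketch only: both paths are assumptions for a hypothetical SquidGuard installation, and the number of children is an arbitrary starting point.

        # Hypothetical paths; -c names the SquidGuard configuration file
        redirect_program /usr/local/bin/squidGuard -c /usr/local/squidGuard/squidGuard.conf
        # Number of redirector processes to start
        redirect_children 5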

Custom Authentication program, Number of authentication programs

Provides an interface to the external authentication interface within Squid. There are a large number of authentication modules for use with Squid, allowing users to be authenticated in a number of ways. The simplest authentication type is known as ncsa_auth, which uses a standard htpasswd style password file to check for login name and password. More advanced options include a new NTLM module that allows authentication against a Windows NT domain controller, and LDAP authentication that allows use of Lightweight Directory Access Protocol servers. Most authentication modules work the same way, and quite similarly to a redirector as discussed above. In Figure 12.11, “Authentication Configuration”, you'll see the standard ncsa_auth authenticator and the location of the passwd file it should use for authenticating users. You'll notice that the number of authenticator child processes has been increased from the default of 5 to 10, in order to handle quite heavy loads. These options edit the authenticate_program and authenticate_children directives, respectively.

Figure 12.11. Authentication Configuration
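
In squid.conf terms, the configuration described above would be roughly equivalent to the following sketch for Squid versions before 2.5. The helper and password file locations are assumed paths; substitute the ones from your own installation.

        # Hypothetical paths to the ncsa_auth helper and its htpasswd style file
        authenticate_program /usr/local/squid/bin/ncsa_auth /usr/local/squid/etc/passwd
        # Raised from the default of 5 to cope with heavier loads
        authenticate_children 10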

	Note
	Authentication has been enhanced significantly in Squid 2.5 and above, adding new types of authentication (NTLM and Digest), as well as more flexible configuration options. If you are using one of these Squid versions, read the section called “Authentication Programs” for more complete information.

Authentication Programs

In version 2.5.STABLE1 and above, Squid has a wealth of new authentication features and options. Webmin has been expanded to account for the changes, and in the process authentication configuration received its own section (Figure 12.12, “Authentication Programs”).

Figure 12.12. Authentication Programs

Squid 2.5 and above supports client authentication of three distinct types: Basic, Digest, and NTLM. Older Squids supported only Basic authentication, which is a simple unencrypted authentication method documented originally. The two new methods, Digest and NTLM, support both plain-text and encrypted authentication mechanisms. Which method to use depends on your client population. Basic authentication works for all browser clients that fully support proxies. Digest authentication is a standard method of authenticating web clients securely, supported by all browsers that are fully HTTP/1.1 compliant (including modern versions of IE, Netscape, and Mozilla). NTLM is a proprietary mechanism developed by Microsoft and currently only supported by Microsoft client software.

Basic and Digest authentication options

Basic authentication is the simplest to use, because it is the most widely accessible, and the simplest in implementation. Digest authentication is a secure authentication method documented in RFC 2617, providing encrypted authentication of proxy and web server users. The only configuration detail really required is the location of the authentication program and any command line arguments to pass to the program, though a couple of additional parameters are available.

Basic authentication program, Digest authentication program: This is the only required option if you wish to use Basic or Digest authentication with Squid. You may specify one of several authentication programs provided with Squid, like ncsa_auth to authenticate against a standalone htpasswd style password file, pam_auth to authenticate against the local PAM system, or smb_auth to authenticate against a remote or local SMB server. Another, even simpler, choice is to use the built-in Webmin authentication module. It is a simple NCSA style authenticator that ties into the Webmin Squid user and password management tools without any additional configuration. This option correlates to the auth_param basic program or auth_param digest program directive, and is disabled by default.
Number of authentication programs: This option correlates to the auth_param basic children or auth_param digest children directive and defaults to 5. If your Squid supports a very large number of users, you may need to raise this to 10 or 20.
Authentication cache time: This is the amount of time that Squid will cache authentication credentials. Squid normally queries the authentication program periodically, and stores the result of the test in its own memory so it doesn't need to query the external authenticator frequently. This improves performance, but if rapidly changing credentials are required, you may wish to lower this value. This option correlates to the auth_param basic credentialsttl directive.
Authentication realm: When a client browser receives a basic authentication request, the request includes a short string identifying the requester of the data. This information will usually be displayed to the user in the pop-up box where credentials are entered. This option correlates to the auth_param basic realm or auth_param digest realm directive and has no default.
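
Put together, the Basic authentication options above might look like this in a Squid 2.5 squid.conf. It is a sketch under assumptions: the helper and password file paths are hypothetical, and the realm string and cache time are example values rather than settings taken from this chapter.

        # Hypothetical paths to the ncsa_auth helper and its password file
        auth_param basic program /usr/local/squid/libexec/ncsa_auth /usr/local/squid/etc/passwd
        auth_param basic children 5
        # Example realm text shown in the browser's login box
        auth_param basic realm Squid proxy-caching web server
        # Example cache time for validated credentials
        auth_param basic credentialsttl 2 hours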

NTLM authentication options

Like Digest authentication, NTLM authentication provides an encrypted connection to a network server. Beware that it is not an HTTP authentication scheme, however. It is a connection authentication scheme which cannot be proxied (though a proxy can use it to authenticate its own clients). NTLM is also less secure than Digest authentication, and has a history of vulnerabilities. That said, it is quite popular, because Windows supports a sort of network "single sign-on", in which users only need to log on once to the local network, and they can be automatically authenticated to the proxy server using the same credentials.

NTLM authentication program: As in Basic and Digest authentication above, this will contain a path to the authentication program and any command line arguments. Often the arguments will specify the domain or workgroup and the server to authenticate against. This option correlates to the auth_param ntlm program directive and is disabled by default.
Number of times an NTLM challenge can be re-used: This option configures the number of times that a particular NTLM challenge can be re-used. Increasing this number may reduce latency and load on the server slightly, but can also increase the risk of replay attacks (where a challenge response is recorded and played back to imitate the connection of a legitimate user). This option correlates to the auth_param ntlm max_challenge_reuses directive and defaults to 0.
Lifetime of NTLM challenges: This option corresponds to the auth_param ntlm max_challenge_lifetime directive and defaults to 2 minutes. It specifies the length of time for which a challenge can be reused. A challenge will be re-used only if it is newer than this value and has been re-used fewer times than the limit set by the previous option.
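
The NTLM options above could be written in squid.conf along these lines. The helper path and the EXAMPLE/dc1 domain and server argument are placeholders invented for this sketch; consult the documentation of the NTLM helper you actually use for its argument format.

        # Hypothetical helper path and domain/controller argument
        auth_param ntlm program /usr/local/squid/libexec/ntlm_auth EXAMPLE/dc1
        auth_param ntlm children 5
        # Defaults quoted above: no challenge reuse, 2 minute lifetime
        auth_param ntlm max_challenge_reuses 0
        auth_param ntlm max_challenge_lifetime 2 minutes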

Access Control

The Access Control functionality of Squid is perhaps its most complex set of features, but also among its most powerful. In fact, many use Squid primarily for these features. Because of its complexity, you will learn about it in steps, examining the process of creating and implementing an access control list. The term access control list has two meanings within the configuration file and within the Webmin interface. First, it signifies the whole concept of access control lists and all of the logic that can be applied to those lists. Second, it applies to the lists themselves, which are simply lists of some type of data to be matched against when some type of access rule is in place. For example, forcing a particular site or set of sites to not be cached requires a list of sites to not cache, and then a separate rule to define what to do with that list (in this case, don't cache them). There is also a third type of option for configuring ICP access control. These three types of definition are separated in the Webmin panel into three sections. The first is labeled Access control lists, which lists existing ACLs and provides a simple interface for generating and editing lists of match criteria (Figure 12.13, “Access Control Lists”). The second is labeled Proxy restrictions and lists the current restrictions in place and the ACLs they affect. Finally, the ICP restrictions section lists the existing access rules regarding ICP messages from other Web caches.

Figure 12.13. Access Control Lists

Access Control Lists

This section provides a list of existing ACLs and provides a means to create new ones (Figure 12.14, “ACL section”). The first field in the table represents the name of the ACL, which is simply an assigned name that can be just about anything the user chooses. The second field is the type of the ACL, which can be one of a number of choices and indicates to Squid what part of a request should be matched against for this ACL. The possible types include the requesting client's address, the Web server address or host name, a regular expression matching the URL, and many more. The final field is the actual string to match. Depending on what the ACL type is, this may be an IP address, a series of IP addresses, a URL, a host name, etc.

Figure 12.14. ACL section

To edit an existing ACL, simply click on the highlighted name. You will then be presented with a screen containing all relevant information about the ACL. Depending on the type of the ACL, you will be shown different data entry fields. The operation of each type is very similar, so for this example, you'll step through editing of the localhost ACL. Clicking the localhost button presents the page that's shown in Figure 12.15, “Edit an ACL”.

Figure 12.15. Edit an ACL

The title of the table is Client Address ACL, which means the ACL is of the Client Address type, and tells Squid to compare the incoming IP address with the IP address in the ACL. It is possible to select an IP based on the originating IP or the destination IP. The netmask can also be used to indicate whether the ACL matches a whole network of addresses, or only a single IP. It is possible to include a number of addresses, or ranges of addresses in these fields. Finally, the Failure URL is the address to send clients to if they have been denied access due to matching this particular ACL. Note that the ACL by itself does nothing; there must also be a proxy restriction or ICP restriction rule that uses the ACL before Squid will act on it.

Creating a new ACL is equally simple (Figure 12.16, “Creating an ACL”). From the ACL page, in the Access control lists section, select the type of ACL you'd like to create. Then click Create new ACL. From there, as shown, you can enter any number of entries for the list. In my case, I've created a list called SitesThatSuck, which contains the Web sites of the Recording Industry Association of America and the Motion Picture Association of America. From there, I can add a proxy restriction to deny all accesses through my proxy to those two Web sites.

Figure 12.16. Creating an ACL
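
In squid.conf, an ACL like the one just described and its matching proxy restriction could be sketched as follows. The text does not say which ACL type or exact entries were used, so the dstdomain type (Web Server Hostname) and the two domain names here are assumptions.

        # Assumed type and entries for the SitesThatSuck list
        acl SitesThatSuck dstdomain .riaa.com .mpaa.org
        # Proxy restriction that puts the ACL to use
        http_access deny SitesThatSuck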

Available ACL Types

Browser Regexp: A regular expression that matches the client's browser type based on the user agent header. This allows ACLs to operate based on the browser type in use. For example, using this ACL type, one could create an ACL for Netscape users and another for Internet Explorer users. This could then be used to redirect Netscape users to a Navigator enhanced page, and IE users to an Explorer enhanced page. This is probably not the wisest use of an administrator's time, but it does indicate the unmatched flexibility of Squid. This ACL type correlates to the browser ACL type.
Client IP Address: The IP address of the requesting client. This option refers to the src ACL in the Squid configuration file. An IP address and netmask are expected. Address ranges are also accepted.
Client Hostname: Matches against the client domain name. This option correlates to the srcdomain ACL, and can be either a single domain name, or a list of domain names, or the path to a file that contains a list of domain names. If it is a path to a file, it must be surrounded by double quotes. This ACL type can increase the latency, and decrease throughput significantly on a loaded cache, as it must perform an address-to-name lookup for each request, so it is usually preferable to use the Client IP Address type.
Client Hostname Regexp: Matches against the client domain name using regular expressions. This option correlates to the srcdom_regex ACL, and can be either a single regular expression, or a list of regular expressions, or a path to a file that contains a list of them. If it is a path to a file, it must be surrounded by double quotes.
Date and Time: This type is just what it sounds like, providing a means to create ACLs that are active during certain times of the day or certain days of the week. This feature is often used to block some types of content or some sections of the Internet during business or class hours. Many companies block pornography, entertainment, sports, and other clearly non-work related sites during business hours, but then unblock them after hours. This might improve workplace efficiency in some situations (or it might just offend the employees). This ACL type allows you to enter days of the week and a time range, or select all hours of the selected days. This ACL type is the same as the time ACL type directive.
Dest AS Number: The Destination Autonomous System Number is the AS number of the server being queried. The autonomous system number ACL types are generally only used in Cache Peer, or ICP, access restrictions. Autonomous system numbers are used in organizations that have multiple Internet links and routers operating under a single administrative authority using the same gateway protocol. Routing decisions are then based on knowledge of the AS in addition to other possible data. If you are unfamiliar with the term autonomous system, it is usually safe to say you don't need to use ACLs based on AS. Even if you are familiar with the term, and have a local AS, you still probably have little use for the AS Number ACL types, unless you have cache peers in other autonomous systems and need to regulate access based on that information. This type correlates to the dest_as ACL type.
Source AS Number: The Source Autonomous System Number is another AS related ACL type, and matches on the AS number of the source of the request. Equates to the src_as ACL type directive.
Ethernet Address: The ethernet or MAC address of the requesting client. This option only works for clients on the same local subnet, and only for certain platforms. Linux, Solaris, and some BSD variants are the supported operating systems for this type of ACL. This ACL can provide a somewhat secure method of access control, because MAC addresses are usually harder to spoof than IP addresses, and you can guarantee that your clients are on the local network (otherwise no ARP resolution can take place).
External Auth: This ACL type calls an external authenticator process to decide whether the request will be allowed. Many authenticator helper programs are available for Squid, including PAM, NCSA, UNIX passwd, SMB, NTLM (only in Squid 2.4), etc. Note that authentication cannot work on a transparent proxy or HTTP accelerator. The HTTP protocol does not provide for two authentication stages (one local and one on remote Web sites). So in order to use an authenticator, your proxy must operate as a traditional proxy, where a client will respond appropriately to a proxy authentication request as well as external Web server authentication requests. This correlates to the proxy_auth directive.
External Auth Regex: As above, this ACL calls an external authenticator process, but allows regex pattern or case insensitive matches. This option correlates to the proxy_auth_regex directive.
Proxy IP Address: The local IP address on which the client connection exists. This allows ACLs to be constructed that only match one physical network, if multiple interfaces are present on the proxy, among other things. This option configures the myip directive.
RFC931 User: The user name as given by an ident daemon running on the client machine. This requires that ident be running on any client machines to be authenticated in this way. Ident should not be considered secure except on private networks where security doesn't matter much. You can find free ident servers for the following operating systems: Win NT, Win95/Win98, and UNIX. Most UNIX systems, including Linux and BSD distributions, include an ident server.
Request Method: This ACL type matches on the HTTP method in the request headers. This includes the methods GET, PUT, etc. This corresponds to the method ACL type directive.
URL Path Regex: This ACL matches on the URL path minus any protocol, port, and host name information. It does not include, for example, the "http://www.swelltech.com" portion of a request, leaving only the actual path to the object. This option correlates to the urlpath_regex directive.
URL Port: This ACL matches on the destination port for the request, and configures the port ACL directive.
URL Protocol: This ACL matches on the protocol of the request, such as FTP, HTTP, ICP, etc. This option corresponds to the proto ACL type.
URL Regexp: Matches using a regular expression on the complete URL. This ACL can be used to provide access control based on parts of the URL or a case insensitive match of the URL, and much more. The regular expressions used in Squid are provided by the GNU Regex library, which is documented in sections 7 and 3 of the regex man pages. Regular expressions are also discussed briefly in a nice article by Guido Socher at LinuxFocus. This option is equivalent to the url_regex ACL type directive.
Web Server Address: This ACL matches based on the destination Web server's IP address. Squid accepts a single IP, a network IP with netmask, or a range of addresses in the form "192.168.1.1-192.168.1.25". This option correlates to the dst ACL type directive.
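Web Server Hostname: This ACL matches on the host name of the destination Web server. This option corresponds to the dstdomain ACL type.
Web Server Regexp: Matches using a regular expression on the host name of the destination Web server. This option corresponds to the dstdom_regex ACL type.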
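
As a rough illustration of how several of the ACL types above look in squid.conf, the following sketch uses the directive names given in this section. The ACL names, addresses, and patterns are hypothetical, and the proxy_auth line assumes an authenticator helper has already been configured elsewhere.

```
# Hypothetical ACL names and values, for illustration only
acl authusers proxy_auth REQUIRED
acl uploads method PUT POST
acl mp3files urlpath_regex -i \.mp3$
acl webports port 80 443
acl servers dst 192.168.1.1-192.168.1.25
acl adsites url_regex -i ^http://ads\.
```

An ACL on its own only names a set of requests. It has no effect until it is referenced by an access rule such as http_access.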

More information on Access Control Lists in Squid can be found in [Section 10] of the Squid FAQ. Authentication information can be found in [Section 23] of the Squid FAQ.

Administrative Options

Administrative Options provides access to several of the behind the scenes options of Squid. This page allows you to configure a diverse set of options, including the user ID and group ID of the Squid process, cache hierarchy announce settings, and the authentication realm (Figure 12.17, “Administrative Options”).

Figure 12.17. Administrative Options

Run as Unix user and group: The user name and group name Squid will operate as. Squid is designed to start as root but very soon after drop to the user/group specified here. This allows you to restrict, for security reasons, the permissions that Squid will have when operating. Although Squid has proven itself to be quite secure through several years of use on thousands of sites, it is never a bad thing to take extra precautions to avoid problems. By default, Squid will operate as either the nobody user and the nogroup group or, in the case of some Squids installed from RPM, as the squid user and group. These options correlate to the cache_effective_user and cache_effective_group directives.
Proxy authentication realm: The realm that will be reported to clients when performing authentication. This option usually defaults to Squid proxy-caching web server, and correlates to the proxy_auth_realm directive. This name will likely appear in the browser pop-up window when the client is asked for authentication information.
Cache manager email address: The email address of the administrator of this cache. This option corresponds to the cache_mgr directive and defaults to either webmaster or root on RPM based systems. This address will be added to any error pages that are displayed to clients.
Visible hostname: The host name that Squid will advertise itself as. This affects the host name that Squid uses when serving error messages. This option may need to be configured in cache clusters if you receive IP-Forwarding errors. This option configures the visible_hostname directive.
Unique hostname: Configures the unique_hostname directive, and sets a unique host name for Squid to report in cache clusters in order to allow detection of forwarding loops. Use this if you have multiple machines in a cluster with the same Visible Hostname.
Cache announce host, port and file: The host address and port that Squid will use to announce its availability to participate in a cache hierarchy. The cache announce file is simply a file containing a message to be sent with announcements. These options correspond to the announce_host, announce_port, and announce_file directives.
Announcement period: Configures the announce_period directive, and refers to the frequency at which Squid will send announcement messages to the announce host.
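
For reference, a minimal sketch of how some of these administrative settings might appear in squid.conf is shown here. Every value (user, realm, email address, and host names) is a hypothetical placeholder, and the announce options are left out because most sites do not need them.

```
# Hypothetical values; adjust to your own site
cache_effective_user squid
cache_effective_group squid
proxy_auth_realm Example Corp proxy
cache_mgr proxyadmin@example.com
visible_hostname proxy.example.com
unique_hostname proxy1.example.com
```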

Miscellaneous Options

Miscellaneous Options is just what it sounds like: a hodgepodge of options that don't seem to fit anywhere else. Here you'll find several memory related options, options regarding headers and user agent settings, and the powerful HTTP accelerator options (Figure 12.18, “Miscellaneous Options”).

Figure 12.18. Miscellaneous Options

Startup DNS test addresses

This should point to a number of hosts that Squid can use to test if DNS service is working properly on your network. If DNS isn't working properly, Squid will not be able to service requests, so it will refuse to start, with a brief message regarding why in the cache.log. It is recommended that you select two or more host names on the Internet and one or two host names on your intranet, assuming you have one and Squid is expected to service it. By default, the dns_testnames directive checks a few well known and popular sites: netscape.com, internic.net, nlanr.net, and microsoft.com.
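
A sketch of the recommendation above, with two Internet names and one intranet name, might look like this. The host names are hypothetical placeholders.

```
# Hypothetical host names: two on the Internet, one on the intranet
dns_testnames example.com example.org intranet.example.lan
```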

SIGUSR1 logfile rotations

The number of old rotated log files Squid will keep. On Red Hat systems, this option defaults to zero, as logs are rotated via the system standard logrotate program. On other systems, this defaults to 10, which means Squid will keep 10 old log files before overwriting the oldest. This option corresponds to the logfile_rotate directive.
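
In squid.conf, the non-Red Hat default described above would be written roughly as follows.

```
# Keep ten rotated copies of each log file
logfile_rotate 10
```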

Default domain

The domain that Squid will append to requested host names that are not fully qualified domain names (more precisely, those that have no dots in them). This option correlates to the append_domain directive.
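
As a small example, assuming a hypothetical local domain of example.com, the directive could be set as shown here. Note that the value begins with a period.

```
# Hypothetical domain; the value begins with a period
append_domain .example.com
```

With this setting, a request for a bare host name such as www would be treated as a request for www.example.com.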

Error messages text

Provides a means to automatically add some extra information to Squid's error pages. You can add HTML or plain text comments or links here, which will be added to the error messages displayed to clients. This option correlates to the err_html_text directive.

Per-client statistics?

Allows you to choose whether Squid will keep statistics regarding each individual client. This option configures the client_db directive and defaults to on.

X-Forwarded-For header?

This option allows you to choose whether Squid will report the address of the client that originally made the request to the origin server, using the X-Forwarded-For header. For example, if this option is disabled, every request through your cache will be reported as originating from the cache. Usually, this should remain enabled. This correlates to the forwarded_for directive and defaults to on.

Log ICP queries?

Dictates whether Squid will log ICP requests. Disabling this can be a good idea if ICP loads are very high. This option correlates to the log_icp_queries directive and defaults to on.

Minimum direct hops

When using the ICMP pinging features of Squid to determine distance to peers and origin servers, this configures when Squid should prefer going direct over a peer. This option requires your Squid to have been compiled with the --enable-icmp option, and you must be in a peering relationship with other Squid caches that were also built with that option. This option correlates to the minimum_direct_hops directive.

Keep memory for future use?

This option turns on memory_pools and allows Squid to keep memory that it has allocated (but no longer needs), so that it will not need to reallocate memory in the future. This can improve performance by a small margin, but may need to be turned off if memory is at a premium on your system. This option defaults to on and should generally be left on, unless you know what you're doing.

Amount of memory to keep

The amount of memory Squid will keep allocated, assuming the Keep memory for future use option is turned on. This option configures the memory_pools_limit directive, and defaults to unlimited. Any non-zero value will instruct Squid not to keep more than that amount allocated, and if Squid requires more memory than that to fulfill a request, it will use your system's malloc library. Squid does not pre-allocate memory, so it is safe to set this reasonably high. If your Squid runs on a dedicated host, it is probably wisest to leave it to its default of unlimited. If it must share the system with other server processes (like Apache or Sendmail) then it might be appropriate to limit it somewhat.
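
For a cache that shares its host with other services, the two memory options might be combined along these lines. The 64 MB figure is only an assumed example, not a recommendation.

```
memory_pools on
# Hypothetical limit; choose a value that suits your host
memory_pools_limit 64 MB
```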

Headers to pass through

Configures the anonymizing features of Squid. This option allows you to dictate what kinds of request headers are allowed to pass through Squid. For example, to prevent origin servers from being able to detect the type of browser your clients are using you would choose to allow all except User-Agent. This option has mostly obscure uses and usually doesn't need to be changed from its default of allowing all headers to pass through. There is a relevant Squid FAQ section that describes in more detail what can be accomplished with this option. This option corresponds to the anonymize_headers directive and defaults to allow All headers.

	Caution
	Indiscriminate use of the anonymizing features of Squid can cause Web sites to behave incorrectly. Because modern Web sites often rely on the contents of cookies or other headers to decide which JavaScript and HTML code to serve, many sites could be confused into serving the wrong content, or refusing to serve any content to the user.

Fake User-Agent

Acts as an addition to the above option, in that it allows you to configure Squid to report a faked User-Agent header. For example, using this option you could have your Squid report that every client being served is named Mozilla/42.2 (Atari 2600; 8-bit). That would be lying, but perhaps the person looking over the logs at origin servers will find it amusing. If you are using the anonymize headers features to hide your clients' User-Agent headers, it is probably wise to include a fake User-Agent header because some servers will not be happy with requests without one. Be aware, though, that hiding or faking the header will cause problems with some Web pages for your users, as the User-Agent header is sometimes used to decide which of a number of pages to send based on the features available within a particular browser. The server will usually end up choosing the least interesting page for your clients (i.e., text only, or no JavaScript/Java/etc.). This option corresponds to the fake_user_agent directive.
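
Taken together, the two options above might be written in squid.conf roughly as follows. This is a sketch that assumes the Squid 2.4-era syntax for anonymize_headers, and the agent string is the joke value from the text.

```
# Strip the real User-Agent header and substitute a fixed one
anonymize_headers deny User-Agent
fake_user_agent Mozilla/42.2 (Atari 2600; 8-bit)
```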

HTTP Accel Host and Port

The options you will use to configure Squid to act as an accelerator, or as a transparent proxy. When using your Squid as an accelerator, you must configure these two options to point to the IP and port of the Web server you are accelerating. If you are using Squid to accelerate a number of virtual hosts, you must choose virtual as the Accel Host. Note that this opens potential security problems, in that your Squid will then be open to users outside of your network as a proxy. This can be avoided via proper firewall rules on your router or on the Squid system itself. Finally, if you are operating your Squid transparently, you would also configure the Accel Host to be virtual and the Accel Port to be 80. Outgoing port 80 traffic will then need to be redirected to your Squid process in order for it to work. This is discussed in much greater detail in the tutorial on transparent proxying. These options configure the httpd_accel_host and httpd_accel_port directives.
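
For the simplest accelerator case, a single Web server, the two directives might look like this. The server address is a hypothetical placeholder.

```
# Hypothetical origin server address
httpd_accel_host 192.168.1.10
httpd_accel_port 80
```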

HTTP Accel with Proxy

Allows you to operate your cache as both an accelerator and a caching proxy. This option tells Squid to accept both traditional proxy connections as well as requests intended for an origin Web server. This option correlates to the httpd_accel_with_proxy directive.

HTTP Accel Uses Host Header

Configures Squid to use the Host header information as described in the HTTP 1.1 specification. This option must be turned on for transparent operation, in order for virtual servers to be cached properly. This option correlates to the httpd_accel_uses_host_header directive.

WCCP Router Address, WCCP Incoming Address, WCCP Outgoing Address

The Web Cache Communication Protocol (WCCP) is a standard method of implementing an interception proxy. Routers that support WCCP can be configured to direct traffic to one or more web caches using an efficient load balancing mechanism. WCCP also provides for automatic bypassing of an unavailable cache in the event of a failure. Usually, configuring Squid to use WCCP is as simple as configuring it for interception proxying, using the steps discussed later in the Interception Proxying tutorial, and then entering the address of the router in the WCCP Router Address field. The other two options are very rarely needed, but can be used in some complex network environments where incoming and outgoing data must travel via different routes or from different addresses. These options correspond to the wccp_router, wccp_incoming_address, and wccp_outgoing_address directives, and are disabled by default.
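
In the common case described above, only the router address needs to be set. A minimal sketch, with a hypothetical router address:

```
# Hypothetical router address
wccp_router 192.168.1.1
```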

Tutorial: A Basic Squid Proxy Configuration

Squid is almost entirely preconfigured for traditional proxying as soon as it is installed from source distribution or from a binary package. It can be up and running in just a few minutes, if your needs are simple. This tutorial covers the first changes you'll need to make to get your caching proxy up and running quickly.

	Note
	This tutorial assumes you have already installed Squid, and have configured Webmin to know where to find all of the appropriate Squid files. If you've installed from a vendor supplied package, Webmin will probably already know where to find everything.

Opening access to local clients

The only change that must be made before using your Squid installation is to open access for your local users. By default Squid denies access to all users from any source. This is to prevent your proxy from being used for illicit purposes by users outside of your local network (and you'd be amazed at how many nasty things someone can do with an open proxy).

Click on the Access Control icon to edit the access control lists and access rules for your proxy. First, create a new ACL by selecting Client Address from the drop-down list, and then clicking Create new ACL. This will open a new page where you can define your ACL. First, enter a name, like localnet, in the Name field. Next, specify your network either in terms of a network range, or by specifying a network and netmask. If you have only a small range of addresses that you would like to be permitted to use your proxy, you could enter, for example, a From IP of 192.168.1.20 and a To IP of 192.168.1.30. Or, if you have a whole network to which you would like to allow proxy access, you could enter a From IP of 192.168.1.0 and a Netmask of 255.255.255.0. Click Save.

Next, you need to add a proxy restriction to permit the clients matched by the localnet ACL to use the proxy. So click the Add proxy restriction link. On the proxy selection page, choose the Allow option for the Action, and select localnet in the Match ACLs selection box. Click Save.

Then use the arrow icons to the right of the list of proxy restrictions to move the rule you've just created above the Deny all rule.
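
For readers who prefer to check the result in squid.conf, the steps above should produce lines roughly like the following sketch. The network shown is the example from the text, and the all ACL is assumed to be already defined, as it is in the stock configuration file.

```
# Example network from the text; substitute your own
acl localnet src 192.168.1.0/255.255.255.0
http_access allow localnet
http_access deny all
```

Order matters here. Squid applies the first http_access rule that matches a request, which is why the allow rule must sit above the deny all rule.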

Initializing the Cache Directory

You may have noticed, on the front page of the Webmin Squid module, there is a warning that the configured cache directory has not been initialized. Before starting Squid, you'll want to make sure it gets initialized. Webmin, of course, will do this for you. Just click the Initialize Cache button. If you plan to alter your cache directories to something other than the default, you'll likely want to do so in the Cache Options page before initializing the directories. Details are covered earlier in this chapter.

Starting Squid and Testing

To start Squid, click on the Start Squid link in the upper right corner of the main module page. It is worthwhile to then check the information provided by Squid during its startup in the cache.log. You can use the Webmin file manager, or you can add this log to the System Logs module for viewing there (read the section covering that module for information on adding non-syslog log files to make them viewable). Squid is usually quite forthcoming about problems that might prevent it from starting or operating correctly.

To test your new Squid, configure a browser on your local network to use the Squid server as its proxy. Doing this is browser dependent. In Netscape and Mozilla, the proxy options are located under the Advanced:Proxy Settings preferences category, while in Internet Explorer, they are located under Internet Options:Connections. Squid can act as a proxy for HTTP, HTTPS, FTP, Gopher, and WAIS protocols. Socks is not supported by Squid, though there are a few good Open Source Socks proxies available.

Now, just browse for a bit to be sure your caching proxy is working. Take a look in the access.log for information about whether a request was served with a cache hit or a cache miss. If Calamaris is installed on your system, Webmin will generate an access report on demand whenever you click the Calamaris icon on the Squid module main page.

Tutorial: Interception Proxying

Ordinarily, when using Squid on a network to cache Web traffic, browsers must be configured to use the Squid system as a proxy. This type of configuration is known as traditional proxying. In many environments, this is simply not an acceptable method of implementation. Therefore Squid provides a method to operate as an interception proxy, or transparently, which means users do not even need to be aware that a proxy is in place. Web traffic is redirected from port 80 to the port where Squid resides, and Squid acts like a standard Web server for the browser.

Using Squid transparently is a two part process, requiring first that Squid be configured properly to accept non-proxy requests, and second that Web traffic gets redirected to the Squid port. The first part of configuration is performed in the Squid module, while the second part can be performed in the Linux Firewall module. This assumes you are using Linux. Otherwise, you should consult the Squid FAQ Transparent Caching/Proxying entry.

Configuring Squid for Transparency

In order for Squid to operate as a transparent proxy, it must be configured to accept normal Web requests rather than (or in addition to) proxy requests. Here, you'll learn about this part of the process, examining both the console configuration and the Webmin configuration. Console configuration is explained, and Webmin configuration is shown in the figure below.

As root, open the squid.conf file in your favorite text editor. This file will be located in one of a few different locations depending on your operating system and the method of installation. Usually it is found in either /usr/local/squid/etc, when installed from source, or /etc/squid, on Red Hat style systems. First you'll notice the http_port option. This tells you what port Squid will listen on. By default, this is port 3128, but you may change it if you need to for some reason. Next you should configure the following options, as shown in Figure 12.19, “Transparent Configuration of Squid”.

Figure 12.19. Transparent Configuration of Squid

        httpd_accel_host virtual
        httpd_accel_port 80
        httpd_accel_with_proxy on
        httpd_accel_uses_host_header on

These options, as described in the Miscellaneous Options section of this document, configure Squid as follows. httpd_accel_host virtual causes Squid to act as an accelerator for any number of Web servers, meaning that Squid will use the request header information to figure out what server the user wants to access, and that Squid will behave as a Web server when dealing with the client. httpd_accel_port 80 configures Squid to send out requests to origin servers on port 80, even though it may be receiving requests on another port, 3128 for example. httpd_accel_with_proxy on allows you to continue using Squid as a traditional proxy as well as a transparent proxy. This isn't always necessary, but it does make testing a lot easier when you are trying to get transparency working, which is discussed a bit more later in the troubleshooting section. Finally, httpd_accel_uses_host_header on tells Squid that it should figure out what server to fetch content from based on the host name found in the Host header. This option must be configured this way for transparency.

Linux Firewall Configuration For Transparent Proxying

The iptables portion of your transparent configuration is equally simple. The goal is to hijack all outgoing network traffic that is on the HTTP port (that's port 80, to be numerical about it). iptables, in its incredible power and flexibility, allows you to do this with a single command line or a single rule. Again, the configuration is shown and discussed for both the Webmin interface and the console configuration.

	Note
	The Linux Firewall module is new in Webmin version 1.00. All previous revisions lack this module, so to follow these steps, you'll need to have a recent Webmin revision.

When first entering the Linux Firewall module, the Packet filtering rules will be displayed. For your purposes you need to edit the Network address translation rules. So, select it from the drop-down list beside the Showing IPtable button, and click the button to display the NAT rules.

Now, add a new rule to the PREROUTING chain by clicking the Add rule button to the right of the PREROUTING section of the page.

Fill in the following fields. The Action to take should be Redirect, and the Target ports for redirect set to 3128. Next you'll need to specify what clients should be redirected to the Squid port. If you know all port 80 traffic on a single interface should be redirected, it is simplest to specify an Incoming interface, but you could instead specify a Source address or network. Next, set the Network protocol to Equals TCP. Finally, set the Destination TCP or UDP port to 80. Click Create to add the new rule to the configuration. Once on the main page again, click the Apply Configuration button to make the new rule take effect. Finally, set the firewall to be activated at boot so that redirection will continue to be in effect on reboots.

From the console, the same rule can be added with a single command:

        # iptables -t nat -I PREROUTING 1 -i eth0 -p tcp --dport 80 -j REDIRECT --to-port 3128

While a detailed description of the iptables tool is beyond the scope of this section, it should briefly be explained what is happening in this configuration. First, you are inserting a rule at the first position of the PREROUTING chain in the NAT table, with the -t nat -I PREROUTING 1 portion of the command. Next you're defining whose requests will be acted upon; in this case iptables will work on all packets arriving on the interface eth0. This is defined by the -i eth0 portion of the rule. Then comes the choice of protocol to act upon; here you've chosen TCP with the -p tcp section. Then, the last match rule specifies the destination port you would like your redirect to act upon, with the --dport 80 section. Finally, iptables is told what to do with packets that match the prior defined criteria: specifically, it will REDIRECT the packets --to-port 3128.

Monday, 27 October 2008