プルリクエストがマージされたときの Jira Automation ルール

Posted by: AJ Welch

Statistical Approximations
Increasing the DISTINCT Approximation Threshold
Using the EXACT_COUNT_DISTINCT Function

Typically in SQL database engines, the use of COUNT(DISTINCT [field]) within a query is used to count the exact number of DISTINCT items within the specified field. In Google BigQuery, however, COUNT(DISTINCT [field]) functions slightly differently due to the massive quantities of data that are often involved when performing queries.

In this tutorial we’ll briefly explore how BigQuery handles COUNT(DISTINCT [field]) to support better performance and scalability, and how you can choose to circumvent those limitations, if necessary.

アジャイルを超える

As stated directly in the official documentation, BigQuery’s implementation of DISTINCT returns a value that is a “statistical approximation and is not guaranteed to be exact.” Obviously this is for performance reasons and to reduce the cost to the end-user. In most cases, this difference is largely irrelevant, since when you perform a Google search, it doesn’t matter to you whether it says “About 10,400,000 results” or it says “10,415,027 results” – you’re still clicking on the first handful of links and going about your business.

BigQuery uses approximation for all DISTINCT quantities greater than the default threshold value of 1000.

As an example, if we execute the following query, which aggregates the total number of DISTINCT authors, publishers, and titles from all books in the gdelt-bq:hathitrustbooks dataset between 1920 and 1929, we will not get exact results:

SELECT
  COUNT(DISTINCT BookMeta_Author) AS authors,
  COUNT(DISTINCT BookMeta_Publisher) AS publishers,
  COUNT(DISTINCT BookMeta_Title) AS titles
FROM
  TABLE_QUERY([gdelt-bq:hathitrustbooks],
              'REGEXP_MATCH(table_id , r"^192[\d]")')

The DISTINCT quantities it returned can be seen below:

[
  {
    "authors": "48911",
    "publishers": "34517",
    "titles": "77386"
  }
]

アジャイルを超える

As mentioned above, by default, the approximation threshold for DISTINCT queries is set to 1000, but by including the second numeric parameter (n) in the COUNT(DISTINCT [field], n) function call, we can increase this threshold, forcing BigQuery to return an exact count for any number at or below that threshold.

For example, let’s change our query from above to use a threshold of 50,000:

SELECT
  COUNT(DISTINCT BookMeta_Author, 50000) AS authors,
  COUNT(DISTINCT BookMeta_Publisher, 50000) AS publishers,
  COUNT(DISTINCT BookMeta_Title, 50000) AS titles
FROM
  TABLE_QUERY([gdelt-bq:hathitrustbooks],
              'REGEXP_MATCH(table_id , r"^192[\d]")')

Our expectation now is that the first two quantities, authors and publishers, will be exact counts since those quantities are below our threshold:

[
  {
    "authors": "48642",
    "publishers": "35140",
    "titles": "76938"
  }
]

The results are all different from our default threshold example above, but we cannot yet determine if the threshold setting worked as intended. To verify that the threshold is working, we can perform one final test, by increasing the threshold yet again to exceed all three quantities, this time to 80,000:

SELECT
  COUNT(DISTINCT BookMeta_Author, 80000) AS authors,
  COUNT(DISTINCT BookMeta_Publisher, 80000) AS publishers,
  COUNT(DISTINCT BookMeta_Title, 80000) AS titles
FROM
  TABLE_QUERY([gdelt-bq:hathitrustbooks],
              'REGEXP_MATCH(table_id , r"^192[\d]")')

Our expectation is that the first two values for authors and publishers should remain identical to the returned values from our 50,000 threshold query, and sure enough they are the same.

We can therefore conclude that all three numbers are now exact counts of the DISTINCT quantities for each field across all 1920s tables in the dataset.

[
  {
    "authors": "48642",
    "publishers": "35140",
    "titles": "77155"
  }
]

アジャイルを超える

Obviously, if there are situations where you don’t care about the processing time or performance of the query, and instead absolutely must ensure your DISTINCT counter returns the exact quantities, you can make use of the EXACT_COUNT_DISTINCT function.

Here we’re performing the same query as the above example, but using EXACT_COUNT_DISTINCT so we don’t need to manually specify any approximation threshold value:

SELECT
  EXACT_COUNT_DISTINCT(BookMeta_Author) AS authors,
  EXACT_COUNT_DISTINCT(BookMeta_Publisher) AS publishers,
  EXACT_COUNT_DISTINCT(BookMeta_Title) AS titles
FROM
  TABLE_QUERY([gdelt-bq:hathitrustbooks],
              'REGEXP_MATCH(table_id , r"^192[\d]")')

We would expect our results to match the query above where we specicied a threshold of 80,000, giving us the exact values, and sure enough the data is identical:

[
  {
    "authors": "48642",
    "publishers": "35140",
    "titles": "77155"
  }
]

次のトピック

DevOps のメリット

注目

アトラシアン コレクション 新

Jira

Confluence

Loom

エージェント

位置合わせ

重点

人材

エージェント

Atlassian Cloud プラットフォーム

おすすめのアプリ

Jira

Confluence

Jira Service Management

Rovo 新

デベロッパー

Jira

Bitbucket

Compass 新

プロダクトマネージャー

Jira

Confluence

Jira Product Discovery 新

IT プロフェッショナル

Jira Service Management

Guard 新

ビジネスチーム

Jira

Confluence

Trello

Loom 新

リーダーシップチーム

Jira

Confluence

Loom 新

Jira Align

重点 新

ソリューション

ソリューション

Jira

Confluence

Loom

エージェント

位置合わせ

重点

人材

エージェント

Atlassian Cloud プラットフォーム

アトラシアンが選ばれる理由

System of Work 新規

統合

お客様事例

FedRAMP

対応性

プラットフォーム

Trust Center

リソース

カスタマー サポート

パートナーを探す

移行プログラム

University

サポート

学ぶ

Jira

Jira Service Management

Confluence

プル リクエストがマージされたときの Jira Automation ルール

Posted by: AJ Welch

アジャイルを超える

アジャイルを超える

アジャイルを超える

次のトピック

製品

リソース

学ぶ

アトラシアンコレクション新

重点新

カスタマーサポート

プルリクエストがマージされたときの Jira Automation ルール